Data lakes have become one of the most popular technologies in big data in recent years. Traditional data warehouses can neither apply incremental data in real time nor support flexible metadata formats, and data lake technology emerged to fill that gap. Incremental database changes are the main source of incremental data in a data lake. However, TiDB's current path into the lake is fragmented: Dumpling handles full exports and TiCDC handles incremental changes. The two run as separate pipelines, and TiDB on its own cannot clean and process data in real time on its way into the lake through real-time materialized views.

At TiDB Hackathon 2021, the TiLaker team's project tackled the problem of getting TiDB data into the lake. TiLaker simplifies the process of moving TiDB data into a data lake and helps users make full use of the data stored in TiDB. The project won the team three awards: the Second Prize, the Best Market Potential Award (specially sponsored by Huachuang Capital), and the Best Popularity Award.

"Quite a few projects in last year's Hackathon integrated with Flink. This year I saw only one in the finals, which was a little disappointing. But TiLaker did quite well. With a Flink committer on the team, they implemented a CDC Connector for Flink, which allows Flink to read TiDB's incremental data directly and synchronize it downstream. With the help of Flink's capabilities, TiDB can connect better with the downstream ecosystem, and I hope more application cases will follow."

——Judge Tang Liu

Over the past year, TiDB has placed great importance on building its ecosystem. The most important part of that effort is integration and interoperability between TiDB, as a distributed database, and the big data ecosystem. Through Flink CDC, TiLaker established a fast, efficient, and simple channel that solves the problem of getting data into the lake efficiently and brings the two ecosystems closer together. In this article, through a conversation between the TiLaker team and Huachuang Capital partner Xie Jia, we share the stories behind TiLaker's Hackathon entry, and hope to give developers and users some ideas on how to bring data into the lake.

The origin of the TiLaker team: rooted in the community
The TiLaker team consists of four senior contributors and open source enthusiasts from the TiDB and Flink communities. Let's get to know them first:

Wu Xuelian: People in the TiDB community may already know me. I have been working on the TiDB kernel, and I am a TiDB Committer and TiKV Committer. One of my hobbies is teasing the Flink folks. My wife works on the data lake in the Flink team, and that connection is how I ended up teaming with friends from Flink this time.

Xu Bangjiang (Xuejin): I used to work on a real-time monitoring system in Alibaba's network team. Later I felt that real-time computing was a good direction, so I moved to the Flink community and have worked there for about two or three years. For the first two years I focused on Flink SQL, and then I gradually saw a lot of room in data integration. Flink is often used as a data pipeline, but a data pipeline is a very thin layer, unlike databases and data lakes, which have strong data dependencies. That is why we started the Flink CDC project: it adds an integration capability to the Flink pipeline so that it can support more data sources and more upstream and downstream systems. I have been working on Flink CDC for the past six months, and I am now a Flink CDC Maintainer and Apache Flink Committer.

Jiang Xiaofeng (Ziyi): My name is Jiang Xiaofeng, and my nickname at Alibaba is Ziyi. Like Xuejin, I am from the Flink team. I am an Apache ShardingSphere and Apache RocketMQ Committer, and I am mainly responsible for Apache Flink AI related work. Counting this competition, I have participated in three TiDB Hackathons.

Jiang Yongbo: I currently do research and development in the TiDB scheduling team, in the same group as Xuelian. I only joined TiDB recently. Before that I worked on data development in Alibaba's network department, in the same large network team as Xuejin. By the time I joined Alibaba, though, Xuejin had already left that team, and some of the SQL I maintained had been left behind by him.

As the introductions show, the four members of TiLaker are closely connected, and the story of TiLaker begins with those connections:

Back when the Hackathon opened for registration:

One day, when Wu Xuelian got home, Xuelian's spouse said: "A colleague of mine wants to form a team with you."

Jiang Yongbo: That person was me. I had spotted Xuejin in the competition group chat. Although I had never talked with him directly before, I messaged him privately and asked whether he would like to form a team. After that, Xuejin contacted Ziyi, and I contacted Xuelian.

In this way, the TiLaker team was officially formed.

The intersection of two communities
During the competition, the TiLaker team left a deep impression on the investor judge, Huachuang Capital partner Xie Jia, who has always been very interested in such Infra projects.

Xie Jia: Among the contestants and projects in this competition, TiLaker is very representative. Two top-level communities that are very popular with developers amplify each other's ecosystems and connect with each other. Getting data into the lake is a practical need that is very widespread in the industry. For the Flink folks, this is something they would probably have done sooner or later; the Hackathon simply accelerated the iteration between the Flink and TiDB communities. From a team perspective, every member is a strong player in the open source community who can work independently. Taken together, these aspects are very attractive to me.

Xuejin: When people from two communities work on an integration project, it is like an interdisciplinary project. Our team has members from both the Flink and TiDB communities, which is very helpful for this kind of integration or ecosystem project. For example, when I have questions about the details of TiDB, I can ask Xuelian and Yongbo and get an answer right away, so I don't have to dig through the code and documentation on GitHub.

Ziyi and I don't actually have much experience with TP (transactional) databases. I have some understanding of MySQL, PG, and Oracle, but we did not know much about TiKV's underlying update mechanism and clustering. We learned all of that through this Hackathon. At the beginning we were not very clear about some of TiDB's logic, and it took us two nights to work it out. Later, after Xuelian and Yongbo joined, whenever we could not understand some code we would ask directly in the group, which saved a lot of time and energy.

Inspiration for the project
Wu Xuelian: The project name came from the talented Ziyi, as did our team slogan, the defense slides, and the team introduction video. The implementation approach was mainly proposed by Xuejin. He has been working on Flink CDC all along and has connected many upstream databases, and many TiDB users had asked whether it could support TiDB. Right at that time we saw that TiDB Hackathon 2021 had opened for registration, so we decided to bring this project to the competition.

Xuejin: In terms of data sources, Flink CDC supports essentially all the mainstream databases. Where support is missing, the community has usually already opened a PR, but I don't have the energy to review them all. The PRs for MySQL, Oracle, PG, MongoDB, and TiDB are all being sorted out. The folks from Alibaba's PolarDB and OceanBase saw that we opened PRs for the TiDB Hackathon, and recently they have been pressing for reviews too.

In fact, the Flink pipeline has strong integration capabilities and can connect to many downstream systems. Once data reaches a certain scale, database storage is always more expensive than data lake storage. A lot of historical data does not need such expensive storage and does not need to be analyzed; importing it directly into the data lake for historical storage is enough. Moreover, data lakes also support updates, and Flink combined with a data lake can even achieve minute-level updates. That is why the database is connected to Flink and then to the data lake: the lake is cheap and can still be updated.

What was achieved at this Hackathon?

TiLaker developed the Flink CDC Connector for TiDB based on the TiCDC component. The TiDB CDC Connector can read historical data in full and incremental data in real time. When switching from the full snapshot to the incremental stream, it guarantees exactly-once semantics: no record is lost and no record is duplicated. TiLaker also provides a SQL API, so users need only a few lines of Flink SQL to capture both the full historical data and the incremental data in TiDB. Thanks to the changelog mechanism of Flink SQL, Flink SQL connects seamlessly with the database's change data. The tidb-cdc table defined in Flink SQL is a real-time materialized view of the corresponding table in TiDB: every change in the database automatically updates the tidb-cdc table.
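
To make the idea of "full snapshot, then incremental changes, with no record lost or duplicated" more concrete, here is a minimal Python sketch. It is a toy model, not the TiDB CDC Connector's actual implementation: the ChangeEvent type, the build_view function, and the use of a single snapshot timestamp as the switch-over point are all assumptions made for illustration. The operation labels +I, +U, and -D borrow Flink's changelog row-kind notation; the "before" image of an update (-U) is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change (hypothetical type for this sketch)."""
    op: str               # "+I" insert, "+U" update (after image), "-D" delete
    key: int              # primary key of the changed row
    row: Optional[dict]   # new row contents; None for deletes
    commit_ts: int        # commit timestamp of the change


def build_view(snapshot: Dict[int, dict],
               snapshot_ts: int,
               events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Start from a full snapshot, then replay incremental changes.

    Events committed at or before snapshot_ts are already reflected in the
    snapshot, so they are skipped; this is what keeps the switch from the
    full phase to the incremental phase free of duplicates in this toy model.
    """
    view = {key: dict(row) for key, row in snapshot.items()}
    for event in events:
        if event.commit_ts <= snapshot_ts:
            continue  # already contained in the snapshot
        if event.op in ("+I", "+U"):
            if event.row is None:
                raise ValueError("insert/update event needs a row")
            view[event.key] = dict(event.row)
        elif event.op == "-D":
            view.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return view


# Usage: a snapshot taken at ts=100 plus a change stream that overlaps it.
snapshot = {1: {"name": "a"}, 2: {"name": "b"}}
events = [
    ChangeEvent("+U", 1, {"name": "stale"}, commit_ts=90),   # skipped
    ChangeEvent("+U", 2, {"name": "b2"}, commit_ts=110),
    ChangeEvent("+I", 3, {"name": "c"}, commit_ts=120),
    ChangeEvent("-D", 1, None, commit_ts=130),
]
view = build_view(snapshot, 100, events)
```

After the replay, the view holds only rows 2 and 3, mirroring what the upstream table would contain. A real connector additionally has to deal with checkpointing, parallel readers, and failure recovery, which this sketch leaves out.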

The Flink CDC project also supports databases such as MySQL, MariaDB, Postgres, Oracle, and Mongo. With TiDB supported as well, users can integrate heterogeneous data sources: for example, some tables in MySQL and some tables in TiDB can be combined in real time with Join, Union, and other streaming operations. In addition, as a computing engine, Flink offers strong computing power and pipeline capabilities, supports several data lake products in the industry (Hudi, Iceberg), and exposes them through its SQL API. This allows TiDB users to write data into the data lake in real time using only SQL, and to build a data lake with little effort.
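
As a rough illustration of joining tables that live in different databases, the Python sketch below keeps two in-memory views, one standing in for an orders table in MySQL and one for a users table in TiDB, and joins them after each change. The table names, column names, and the apply_change and join_orders_with_users helpers are hypothetical. The sketch also recomputes the whole join each time, whereas a streaming engine such as Flink maintains the join result incrementally from state.

```python
from typing import Dict, List, Optional

View = Dict[int, dict]  # primary key -> row


def apply_change(view: View, op: str, key: int,
                 row: Optional[dict] = None) -> None:
    """Apply one change to a view; op is "upsert" or "delete" (names assumed)."""
    if op == "upsert":
        if row is None:
            raise ValueError("upsert needs a row")
        view[key] = dict(row)
    elif op == "delete":
        view.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def join_orders_with_users(orders: View, users: View) -> List[dict]:
    """Inner join of orders (keyed by order id) with users on user_id."""
    joined = []
    for order_id in sorted(orders):
        order = orders[order_id]
        user = users.get(order["user_id"])
        if user is not None:  # orders without a matching user are dropped
            joined.append({
                "order_id": order_id,
                "user": user["name"],
                "amount": order["amount"],
            })
    return joined


# Usage: changes arrive independently from the two sources.
mysql_orders: View = {}
tidb_users: View = {}
apply_change(tidb_users, "upsert", 7, {"name": "alice"})
apply_change(mysql_orders, "upsert", 1, {"user_id": 7, "amount": 30})
apply_change(mysql_orders, "upsert", 2, {"user_id": 8, "amount": 5})
first_result = join_orders_with_users(mysql_orders, tidb_users)
```

At this point only order 1 appears in first_result, because user 8 does not exist yet on the TiDB side; once a change for user 8 arrives and the join is evaluated again, order 2 shows up as well.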

As judge Tang Liu commented, the TiLaker team's Hackathon project was very complete, which already put them more than halfway to success. The next challenge was how to make the judges feel the project's highlights during the competition.

Wu Xuelian: The main code and the Flink CDC work were mostly done by Xuejin and Ziyi. Yongbo and I played supporting roles and focused on the demo. I kept thinking about how to bring out the highlights of the project, and we came up with a real-time intelligent dispatch system for online car-hailing, which basically simulates how rides are dispatched. Transactional (TP) data is written to TiDB in real time. But take recommendations: if a passenger wants a ride, the simplest approach is to recommend vehicles nearby. At big-data scale, TiDB alone may not be enough for this, so we use Flink CDC to bring the data into Flink for computation and deliver real-time recommendations. In addition, once the data is in the lake, we built a report on where the cars have been driving. That data is read from the lake, which amounts to offline analysis.

Jiang Yongbo: I wanted to do this demo mainly because online car-hailing is a typical Internet architecture. It has core systems for order transactions, and the data for those systems has to live in a TP database. It also has real-time computing requirements, but streaming computation is hard for a typical TP system. At the same time, the operations staff need real-time dashboards so that they can do business analysis conveniently.

What are your expectations for the future of TiLaker?
Xuejin: From my experience, if a project is really going into production, there is still a lot to do. Testing in particular takes a lot of energy. Based on what I have recently learned about various databases, there are many compatibility pitfalls between versions. If you want to serve customers like banks, there is still a long way to go.

Jiang Xiaofeng: In real-time machine learning scenarios, our TiLaker solution is very user-friendly. As machine learning becomes more real-time, it brings a large amount of real-time feature storage. Most companies now use a KV store such as Redis for real-time features, storing the features at each point in time. With the TiDB + TiLaker solution, customers can reduce the cost of feature storage. I am usually responsible for real-time risk control and real-time recommendation solutions, and I work with many customers, such as Himalaya, who all have this need, and the demand is large. If this project can be put into production, it could become a commercial solution.

Judge Xie Jia also gave his view from an investor's perspective: "Frankly speaking, this type of product faces some challenges in commercialization. It tends to solve the problem of one link in the middle of the chain. The middle tier faces a business paradox: what you solve is indeed a common problem that everyone encounters, but it is not a complete business problem. It brings some efficiency improvement, but the value it adds to the business is not very direct. If you look at the whole business from that intermediate link, its value chain can be extended further upstream and downstream. Then what the customer gets is a complete product, and they can use that complete product to solve more problems. From a commercialization perspective, that makes it easier to create more customer value. In addition, many companies have specific needs and non-standard scenarios, and it is hard for one product to meet all of them. But in the future, if everyone moves to the cloud and things become more standardized, many obstacles to commercialization will be removed."

Takeaways and expectations from participating in the TiDB Hackathon
Wu Xuelian: I have been a mentor and an onlooking family member before, but this year was the first time I truly participated in the TiDB Hackathon. I had always wanted to learn more about how the Flink and TiDB ecosystems connect, so I took this opportunity. Through the TiLaker project, I gained a better understanding of Flink and some of the ecosystems downstream of Flink, which is very helpful for my work on TiDB. In addition, through the Hackathon I got to talk with many people working on open source in Hangzhou, which made me very happy.

Xuejin: This was also my first TiDB Hackathon. The Flink community has held hackathons before, but they were probably similar to the early TiDB Hackathons: two days of extreme programming. This time the organizing committee gave us plenty of time, which is one of the reasons we were able to bring the project to such a complete state.

This year's Hackathon was full of experts. I found the ideas and actual results of many contestants very impressive. We did not expect to win the second prize at first; we thought we would get the third prize at most. I remember one team working on performance improvements whose benchmark showed a minimum improvement of 18% and a maximum of 40%. That kind of project is very technical indeed. There was also a Best Creative Award winner in Hangzhou that combined the project with a game, which left a deep impression on me. Getting to know people across the community through the TiDB Hackathon is, I think, the more rewarding part.

Jiang Xiaofeng: I have already thought of an idea for the next Hackathon. Flink Forward Asia 2021 introduced the concept of Streamhouse, which introduced Dynamic Table Storage. I want to use TiFlash as the underlying implementation of Dynamic Table Storage, which would strengthen the integration between the Flink and TiDB ecosystems. The goal is not only to cooperate on data integration but also to innovate in other modules. As for the Hackathon itself, I hope future editions can have contestants write code and compete live. This time it was clear that the workload of many projects could not have been completed in just two days, which is not entirely satisfying for a Hackathon competition.

Xie Jia: This is my second time participating in the TiDB Hackathon. The first time, everyone kept saying that PingCAP is a Hackathon-driven company. I did not feel that strongly then, but this time I feel it quite strongly. Among domestic companies that build infrastructure software, PingCAP is probably the first to embrace open source while also doing well in commercialization. Many people occasionally come up with ideas that their day-to-day work leaves no room to pursue; a Hackathon gives those ideas an outlet and becomes another kind of driving force.

