End-to-end real-time computing: TiDB + Flink best practices
About the Author
Sun Xiaoguang, head of the Community Development team at PingCAP and formerly an architect on Zhihu's infrastructure R&D team, has long worked on distributed systems R&D with a focus on cloud-native technology.
This article is based on a talk at the Apache Flink x TiDB Meetup (Beijing). It shares some of Zhihu's work on integrating TiDB and Flink for unified batch and stream processing, and uses a real business scenario to show how to make full use of the strengths of both to deliver a closed loop of end-to-end real-time computing.
Background
The picture above shows the components and data flow of a very typical real-time data warehouse pipeline. You can see that TiDB and Flink can be combined in many places to solve our business problems. For example, TiDB's home turf is online transaction processing, so the ODS layer can use TiDB, and dimension tables and application data storage can also use TiDB.
Real-time business scenarios
Scenario analysis
Let's look at an actual business scenario. The Creator Center on the Zhihu website provides creators with analytics for their content interaction data. Here, creators can see upvotes, comments, likes, and favorites, as well as how these numbers have changed over a recent period.
These data help creators optimize their work. For example, if a creator adjusts a piece of content and then sees the interaction data change significantly, they can use that signal to roll back a bad change or push further on a good strategy, so this data is very valuable to creators. Moreover, the more immediate the data, the more immediate the creator's adjustments can be. For example, a creator who has just updated an answer would like to see the resulting data changes right away; if the changes are positive, they can make more adjustments of the same kind next time, or distill what made past adjustments work, so that each new piece of work is better aligned with what readers prefer.
Unfortunately, data this valuable to creators is still not real-time. You can see the note about data update time in the upper right corner; it is evidence that our real-time coverage is not yet good enough. This feature is still delivered as a T+1 product.
Flink is the natural choice for real-time application scenarios like the Creator Center, but unlike companies that rely heavily on MySQL, Zhihu has already migrated about 40% of its data to TiDB, so we need to deeply integrate TiDB with Flink's real-time computing capabilities. In the future, when TiDB becomes our dominant database, we will be able to obtain even greater overall benefits.
Next, we discuss how to make the content interaction statistics real-time, using TiDB and Flink to compute likes, comments, and upvotes in real time for two content types: answers and articles.
Business data model analysis
The picture shows the businesses we need to pay attention to in order to compute these data in real time: question-and-answer (QA, on the left), column articles (on the right), as well as comments, user reactions, and video answers. By integrating data scattered across these different businesses, we hope to obtain the user interaction statistics shown in the Creator Center, and we hope they are real-time.
First, let's zoom in on the QA service. On the left are the basic tables in the QA business. In fact, we don't need to know every detail of every table to compute the interaction data; we only need to pay attention to the tables and fields on the right. From these tables, we only need the id of an answer, the member_id of the answer's creator, and the id of the answer that was upvoted; with those we can fully compute how many upvotes a given person's answers have received.
Column articles are similar; the figure also lists their basic tables. To compute article likes in real time, we pay attention to the article and article_vote tables. Using the member_id, id, and vote fields, it is easy to calculate the number of likes for each article.
Besides the like data inside each content business, other types of interaction data are scattered across multiple business systems, for example the comment system, the vote table of video answers, and the reaction table for other interactions. From these user behavior data, combined with the content data, we can compute the complete interaction data for everything a user has created.
From the business model's point of view, the essence of the interaction data calculation is to take the tables of each content type and each interaction behavior as source tables, group these data by content ID, and run aggregations. A like, for example, is a count, because each row in the table is one like; a rating would be a sum. After obtaining the aggregated results for all content and all interactions, we left join them back to the content table to get the final result.
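As a minimal sketch of this pattern (the table and column names follow the simplified model above and are illustrative, not the production schema), the per-content aggregation and the final left join could look roughly like this in Flink SQL:

```sql
-- Count upvotes per answer; each row in answer_vote is one upvote.
CREATE TEMPORARY VIEW answer_vote_agg AS
SELECT answer_id, COUNT(*) AS vote_cnt
FROM answer_vote
GROUP BY answer_id;

-- Join the aggregates back to the content table so every answer appears,
-- even answers that have no upvotes yet.
SELECT a.id,
       a.member_id,
       COALESCE(v.vote_cnt, 0) AS vote_cnt
FROM answer AS a
LEFT JOIN answer_vote_agg AS v ON v.answer_id = a.id;
```

Other interactions (comments, reactions, ratings as sums) follow the same shape, with one aggregation view per behavior table joined onto the content table.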
Traditional solution
Before getting to Flink, let's look at how the same real-time application would be developed without it. Zhihu has a set of frameworks accumulated over the years for this kind of event-driven computation. Using that technology, the development model looks like the picture above.
Business engineers use their familiar language and framework to develop the worker programs shown in red in the middle. Each worker consumes real-time change events from the message system, enriches and aggregates them, and sends the results back to the message system in a pre-agreed format. Finally, one last worker joins the content source table with the real-time results of the multiple upstream workers to produce the final result and save it downstream. In this way, we can build real-time applications on more traditional technology. With this model, business engineers have to care about the implementation of multiple workers and the data formats passed between systems, while the database and message system are maintained by the platform team, so there is no extra learning cost for engineers. Low learning cost and easy comprehension are the advantages of this traditional approach.
This approach also has problems. For example, there are five workers in the figure above. Each worker is first of all a consumer of the message system: it must aggregate the real-time data it receives according to business requirements, fill in the necessary dimension data, guarantee the correctness of the calculation logic, and then correctly send the results to the downstream topics. It is no exaggeration to say that each such program takes at least a thousand lines of code, so five of them are very expensive to develop, manage, and maintain. In addition, each business team's workers must solve scalability on their own and independently reserve capacity for traffic spikes, which causes global resource waste; with so little elasticity it is hard to balance system scale against cost.
Flink solution
In contrast, if we use Flink, the structure of the whole application becomes very simple. Developing the application in SQL gives us higher maintainability and readability, so we can put all of this logic into a single job and maintain it in one place without losing maintainability. Whether measured by the business team's development cost or maintenance cost, it is a better choice.
The figure above shows the real-time calculation logic for answer interaction data developed with Flink SQL, and the final SQL we arrived at. Expressing the business logic declaratively in SQL makes it easy to understand and to verify its correctness.
Next, let's look at the advantages of this approach:
First, a single SQL job is highly maintainable: there are few components and the maintenance cost is low.
Second, Flink handles system-level concerns uniformly, so the business layer does not need to care about scalability, high availability, performance optimization, or correctness, which greatly reduces the burden of dealing with these issues.
Finally, SQL development has almost no additional learning cost. Why "almost"? This business is typical online engineering work, and online engineers are already very familiar with SQL, but the SQL they use daily differs slightly in scope from the SQL used by big data engineers. So I cannot say Flink SQL has zero learning cost, but the cost is very low and the learning curve is very gentle.
Everything has two sides, though. Real-time applications based on Flink still need to solve the following problems:
First, the expressive power of SQL is not unlimited; some business logic and scenarios are hard to cover fully with today's Flink SQL. Looking at this through the 80/20 rule, SQL plus a few UDFs can solve 80% of the problems that standard Flink SQL alone cannot cover, and the remaining 20% can be handled with the DataStream API, so all problems can still be solved within a single Flink technology stack.
In addition, although Flink SQL is easy to write, the Flink system itself is not simple. This complexity is a heavy burden for many business engineers: they do not want to learn how Flink works or how to operate it. They would rather write SQL on a self-service platform to solve problems in their own domain and avoid something as complicated as Flink operations and maintenance. We therefore need to reduce the cost of onboarding business teams through a platform-based approach, and use engineering and economies of scale to bring the cost of each individual business down to a reasonable level.
So although these problems exist, there are suitable ways to solve them.
POC Demo
The Creator Center application mentioned above is still in the POC stage. The POC uses the actual table structure at Zhihu, and from the demo you can get a feel for what business engineers can achieve with Flink, the effect they get, and whether correctness is guaranteed.
What we have seen so far stays within the online business technology stack: the source data is in TiDB, and the results computed by Flink are also stored in TiDB, solving the real-time calculation problem end to end. What if data produced offline needs to join the calculation? For example, to include each content's real-time page views (PV) in the result, we can union the historical PV table in the big data system with the real-time PV stream, then sum by content ID to get real-time PV per content. The traditional approach might require one or two additional workers; now it only takes a few more lines of SQL in the Flink job.
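A minimal sketch of this idea, assuming an illustrative offline history table pv_history and a real-time stream pv_events, both keyed by content_id (names are placeholders, not the production tables):

```sql
-- Union the offline history with the real-time stream, then sum per content.
SELECT content_id, SUM(pv) AS pv_total
FROM (
    SELECT content_id, pv FROM pv_history        -- batch/offline part
    UNION ALL
    SELECT content_id, 1 AS pv FROM pv_events    -- one row per real-time view event
) t
GROUP BY content_id;
```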
Possible questions
If you are not familiar with Flink or with big data in general, you may have some questions at this point. Let's go through them one by one.
The first is how the computation is triggered. In a TP system, client requests trigger the computation; how is it triggered in Flink?
The answer is that computation is event-driven: every incoming event triggers a calculation. A change to any row in the database triggers a calculation, but that granularity can be too fine and too costly, so Flink offers a mini-batch optimization that accumulates a batch of change events and drives the calculation in batches. For calculations over a time period, you can also use the window mechanism and let watermarks and triggers fire the computation and emit results. If state must be maintained during the computation, the Flink runtime is responsible for managing that state.
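For reference, a sketch of enabling Flink SQL's mini-batch aggregation through table configuration, for example in the SQL client (the threshold values are illustrative):

```sql
-- Accumulate changes for up to 5 s or 5000 rows before firing the aggregation.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '5 s';
SET 'table.exec.mini-batch.size' = '5000';
```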
The second question is when windows are needed.
Not every business needs windows. When the calculation and its trigger logic have nothing to do with time periods, windows are unnecessary. In the demo scenario, for example, the calculation is triggered by data changes and the state stays valid forever, so the whole logic needs no window.
If windows are needed, how are late events handled? There are discard and retract strategies: when a late event arrives, developers can choose to drop the late data or use the retract mechanism to correct the result. We can also handle late events with custom logic. In short, the role of a window is to gather the data that falls within a certain period according to the window strategy the user has configured, trigger the calculation, and, when late data arrives after the window has closed, handle it in the way the application expects.
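A minimal Flink SQL sketch of a windowed count with a watermark that tolerates events up to 30 seconds late (the pv_events table, its columns, and the datagen placeholder source are illustrative):

```sql
CREATE TEMPORARY TABLE pv_events (
    content_id BIGINT,
    event_time TIMESTAMP(3),
    -- Events arriving more than 30 s behind the watermark count as late.
    WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'datagen');  -- placeholder source for the sketch

-- Count views per content in 1-minute tumbling windows.
SELECT content_id,
       TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
       COUNT(*) AS pv
FROM pv_events
GROUP BY content_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```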
The third question is: how hard is it to get started?
Streaming SQL is built on top of standard SQL, so the learning curve is gradual and gentle. Combined with easily extensible UDFs, it covers most of the problems that plain Flink SQL alone cannot solve, and the few problems better suited to code can still be solved with Flink's DataStream API.
Finally, how do TiDB and Flink together guarantee the correctness of the results?
TiDB is a database with snapshot isolation by default, so we can easily obtain a consistent global snapshot at a given point in time. Under snapshot isolation it is straightforward to ensure the correctness of the whole data flow: we take a timestamp, read a static snapshot of all the data as of that timestamp, and, after processing the snapshot, splice in all CDC events that happened after the timestamp. From Flink's point of view this is a unified batch-and-stream source, and Flink's own mechanisms guarantee the correctness of the results computed over the events flowing into the system.
TiDB x Flink batch stream integration
Now let's look at the integration work we did between TiDB and Flink during the POC, and the capabilities that this work enables.
TiDB as MySQL
Because TiDB is a MySQL-compatible distributed database, even without native TiDB integration for Flink we can still treat TiDB as one big MySQL and use it with Flink in the way shown in the figure. Under this architecture, all batch task traffic first goes through the load balancer, then through tidb-server, and finally reaches the appropriate TiKV nodes according to the data range being read. Streaming traffic uses TiCDC to capture change events from TiKV and delivers them to Flink through the message system.
Although this non-native integration works, it cannot fully exploit the characteristics of TiDB's architecture, so in many scenarios it misses opportunities for deeper cost optimization and added value. For example, in scenarios with large traffic fluctuations, all traffic takes the full path from the load balancer to tidb-server to TiKV, so every layer feels the full impact. To keep business performance under peak load, we have to provision every layer for peak traffic, which wastes a great deal of resources.
There is also the data skew problem that big data workloads frequently hit. Without business knowledge, facing each business's own table design and data distribution, it is hard to solve all skew problems automatically in a uniform way. In fact, with the current Flink JDBC connector, if a table's primary key is not an integer type and the table is not partitioned, the Flink source can only read all of the data with a parallelism of 1, which is very painful at the scale of the business data Zhihu stores in TiDB.
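For context, a sketch of the Flink JDBC source options involved: parallel scans need a numeric scan.partition.column to range-partition, which is exactly what non-integer primary keys lack (the URL, table, and bounds are placeholders):

```sql
CREATE TEMPORARY TABLE answer_src (
    id BIGINT,
    member_id BIGINT
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://tidb-host:4000/qa',
    'table-name' = 'answer',
    -- Parallel reads only work when an integer column can be split into ranges.
    'scan.partition.column' = 'id',
    'scan.partition.num' = '16',
    'scan.partition.lower-bound' = '1',
    'scan.partition.upper-bound' = '1000000'
);
```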
Finally, we cannot directly use the flink-cdc-connectors project, which is designed for MySQL, to give TiDB a unified batch-and-stream connector. In the many scenarios that need this capability, the business side has to take care of unifying batch and stream processing itself.
TiDB adaptation
To address these shortcomings of using Flink with non-native TiDB support, we made full use of the characteristics of TiDB's architecture and developed a native Flink connector for TiDB to better serve Flink's wide range of computing scenarios.
The first issue is the high-traffic impact scenario. TiDB has a system table that exposes the addresses and ports of all tidb-server instances in the cluster. On top of the native MySQL JDBC driver, we use this cluster topology information to implement client-side load balancing. By connecting directly to tidb-server, the data traffic bypasses the load balancer: only the small requests that fetch and periodically refresh the cluster topology still go through the load balancer, while the real high-volume reads and writes are carried by direct connections to tidb-server.
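As an illustration, the topology can be read with a query along these lines; which table the connector actually uses is my assumption, but TiDB does expose instance addresses through system tables such as INFORMATION_SCHEMA.CLUSTER_INFO:

```sql
-- List the address of every tidb-server instance in the cluster.
SELECT INSTANCE, STATUS_ADDRESS
FROM INFORMATION_SCHEMA.CLUSTER_INFO
WHERE TYPE = 'tidb';
```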
Next is avoiding the traffic impact on tidb-server itself. When reading from TiDB, the client can fetch from PD all the region information for the data range it needs to read. By connecting directly to the TiKV nodes that host those regions, all read traffic bypasses tidb-server, greatly reducing the load on the TiDB layer and saving hardware costs. While implementing this TiDB-bypass path, we also implemented predicate pushdown and projection pushdown consistent with TiDB's own behavior, so the pressure the connector puts on TiKV is very close to that of real tidb-server traffic and adds no extra burden to TiKV.
Next, we use placement rules to let a physically isolated group of TiKV nodes carry only follower replicas of the data; combined with follower read, the heavy read traffic of real-time computation is physically separated from the online business load without paying for extra servers. This way everyone can comfortably run online business and big data workloads on the same TiDB cluster.
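The setup described here was most likely configured through PD-level placement rules; as a sketch only, newer TiDB versions can express a similar intent with Placement Rules in SQL, assuming the dedicated TiKV nodes are labeled purpose=analytics (policy name and label are hypothetical):

```sql
-- Pin follower replicas to the analytics-labeled TiKV nodes; follower read
-- then lets big data traffic land only on those isolated nodes.
CREATE PLACEMENT POLICY analytics_followers
    FOLLOWER_CONSTRAINTS = '[+purpose=analytics]'
    FOLLOWERS = 2;

ALTER TABLE qa.answer PLACEMENT POLICY = analytics_followers;
```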
Next is business-agnostic data balancing. As mentioned earlier, without domain knowledge and data distribution information, the JDBC approach can only split data roughly evenly on an integer primary key, while non-partitioned tables with other primary key types can only be read sequentially by a single task. With data at TiDB scale, both single-threaded reads and unbalanced splits lead to poor task efficiency. As you saw when we introduced the TiDB bypass, the native connector splits tasks at the region level, and region sizes are automatically kept near an optimal size by TiKV, so for any table structure we can balance the work units and completely avoid data skew without any specialized knowledge.
Next is the unified batch-and-stream capability of the TiDB connector. The principle is to use TiDB's snapshot isolation to obtain a global snapshot of the data and, after processing the snapshot, consume all CDC events whose commit version is greater than the snapshot version. With this built-in batch-and-stream integration, data processing is greatly simplified while the correctness of the whole real-time pipeline is guaranteed.
Finally, to further reduce the CDC traffic generated by TiDB's high write throughput, we designed a binary encoding format for TiCDC data. The canal-json and open protocol formats commonly used with TiCDC are both JSON-based; JSON as a wire format tends to be larger and costs too much CPU to encode and decode. The new binary protocol exploits characteristics of CDC data: in typical scenarios it compresses the data to 42% of the open protocol's size, while making encoding about 6 times faster and decoding nearly 10 times faster.
That is the work we have done on native integration between TiDB and Flink; it solves some of the problems we encountered when using TiDB and Flink for end-to-end real-time computing.
With the TiDB connector, the way TiDB and Flink cooperate becomes what is shown in the picture. Read traffic bypasses both the load balancer and tidb-server and goes straight to TiKV follower nodes. Write traffic is currently still done through JDBC, but with client-side load balancing we can at least bypass the load balancer and save its cost.
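For reference, a sketch of writing results back to TiDB through Flink's JDBC sink; the URL, table, credentials, and the upstream interaction_stats result are placeholders for this example:

```sql
CREATE TEMPORARY TABLE interaction_stats_sink (
    content_id BIGINT,
    vote_cnt BIGINT,
    comment_cnt BIGINT,
    PRIMARY KEY (content_id) NOT ENFORCED  -- enables upsert writes
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://tidb-host:4000/creator',
    'table-name' = 'interaction_stats',
    'username' = 'flink',
    'password' = '***'
);

-- Upsert the aggregated results into TiDB for online serving.
INSERT INTO interaction_stats_sink
SELECT content_id, vote_cnt, comment_cnt FROM interaction_stats;
```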
Flink already has many application scenarios at Zhihu. We built a data integration platform on Flink and used the TiDB connector to provide TiDB-to-Hive and Hive-to-TiDB pipelines, solving ODS-layer data synchronization and the syncing of offline computation results back to online serving. Beyond the data integration platform there are many other real-time applications, such as click data processing for business teams, timeliness analysis in search, real-time data warehouses for key metrics, and businesses that use Flink to write real-time behavior data into TiDB for online queries.
Outlook
Beyond the progress above, there are still many areas we can improve to create more value for TiDB and Flink users. Let's look at the directions where we can continue to extract value.
Enhancing TiDB x Flink core capabilities
The first is global transaction support. The current Flink sink is implemented on JDBC and shares the JDBC connector's limitations: it cannot perform distributed global transactions, and connecting to TiDB through JDBC also imposes a maximum transaction size, so very large transactional writes are not supported. When we face requirements for global visibility, or bank-style batch jobs, the current TiDB connector still cannot deliver the ideal capability. Our next step is to implement a native write path that performs two-phase commit directly against TiKV in a distributed way, enabling large global transactional writes. Global transactions not only bring transaction isolation and large-transaction benefits; by routing all high-volume requests around tidb-server we can also completely relieve its pressure and eliminate unnecessary resource waste.
Another improvement direction is native lookup (dimension) table support, which is currently implemented on the JDBC connector. Although dimension table lookups usually do not have very high throughput, bypassing tidb-server can still reduce latency, and that improvement can play a key role in raising the throughput of the streaming job and avoiding event backlog.
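For context, a sketch of how a dimension table lookup looks in Flink SQL today with a JDBC-backed table; vote_events is assumed to declare a processing-time column via proc_time AS PROCTIME(), and answer_dim is an illustrative JDBC dimension table:

```sql
-- Enrich each vote event with the answer's creator via a lookup join.
SELECT v.answer_id,
       a.member_id,
       v.event_time
FROM vote_events AS v
JOIN answer_dim FOR SYSTEM_TIME AS OF v.proc_time AS a
    ON v.answer_id = a.id;
```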
Finally, a direction whose benefits are not yet clear is a TiKV-based state backend, which might solve the problem of slow checkpoints in some scenarios.
More application scenarios
With native TiDB support and these new capabilities, we can imagine TiDB x Flink supporting more application scenarios in the future.
For example, the current data integration platform only supports batch-mode data extraction tasks. With TiDB's unified batch-and-stream capability, we can work with Hudi or Iceberg to make the entire ODS layer real-time at very low cost. Once all ODS-layer data is real-time, data warehouse engineers will have few prerequisites to worry about when planning the path to a real-time data warehouse; combined with the usual real-time event tracking data and the real-time ODS data, they can schedule the real-time construction of the warehouse according to business value.
Beyond real-time data warehouses, more real-time scenarios will emerge as the technology matures. For example, we could build a real-time content pool from the existing content on the site at very low cost, or do real-time index updates for the search engine, and of course the real-time content interaction statistics from the demo, and so on. I believe that after the Flink SQL platform at Zhihu is completed, more and more applications will be built on the TiDB x Flink end-to-end stack.
Finally, if you are interested in the TiDB x Flink ecosystem integration, or in TiDB's capabilities across the big data ecosystem, you can follow the TiBigData project on GitHub. Everyone is welcome to try the project in real scenarios; if you run into problems or have comments or suggestions, feel free to open an issue at any time. We also hope more developers will join the project's development, so that together we can make it a mature, complete, one-stop big data solution for TiDB.