
Abstract: This article is based on a talk given by Wu Jiang, tech lead of data development and the OLAP platform in NIO's big data department, at the Flink Forward Asia 2021 industry practice session. The main contents include:

  1. The development history of real-time computing in NIO
  2. Real-time computing platform
  3. Real-time dashboards
  4. CDP
  5. Real-time data warehouse
  6. Other application scenarios

Click to view live replay & speech PDF

1. The development history of real-time computing in NIO

img

  • Around May 2018, we first encountered real-time computing and initially used Spark Streaming for some simple stream-processing jobs;
  • In September 2019, we introduced Flink, submitting jobs through the command line and managing the whole job lifecycle that way;
  • In January 2021, we launched real-time computing platform 1.0, and we are currently developing version 2.0.

2. Real-time computing platform

On real-time computing platform 1.0, we compiled the code, uploaded the jar package to a server, and submitted it from the command line. This process had many problems:

  • First, every step was manual, which was cumbersome and error-prone;
  • Second, monitoring was lacking. Flink has many built-in metrics, but there was no automatic way to wire them up; everything still had to be configured by hand;
  • In addition, maintaining jobs was troublesome: developers unfamiliar with the process easily made mistakes, and problems were hard to troubleshoot afterwards.

img

The life cycle of real-time computing platform 1.0 is shown in the figure above. After a job is written, it is packaged into a jar, uploaded, and submitted; the platform then handles starting, stopping, resuming, and monitoring the job.

Job management covers creating, running, stopping, resuming, and updating jobs. The log module mainly records logs produced when a Flink job is submitted; runtime logs still have to be viewed through the Yarn cluster, which is somewhat cumbersome. For monitoring and alerting, metrics monitoring mainly relies on Flink's built-in metrics reported to Prometheus, on top of which various monitoring dashboards are configured; alerting rules are likewise defined on Prometheus metrics. Yarn is responsible for overall cluster resource management.

img

The picture above shows the interface of real-time computing platform 1.0; its functionality is fairly basic.

img

The picture above shows real-time computing platform 2.0. Compared with 1.0, the biggest difference is the blue part. There may be no unified standard for what a real-time computing platform should look like; it depends closely on each company's situation, such as the company's size and scale and how much it invests in the platform. The company's own circumstances are the best yardstick.

In version 2.0, we added features covering the stages from development to testing. Briefly, their specific functions are:

  • FlinkSQL: a capability many companies' real-time computing platforms support. Its advantages are lower usage cost and relative ease of use.
  • Space management: different departments and groups can create and manage jobs within their own spaces. Once the concept of a space exists, it can be used for permission control, for example allowing operations only in the spaces one has permissions for.
  • UDF management: with FlinkSQL, UDFs can extend functionality within the semantics of SQL. UDFs can also be used in Java and Scala jobs, and common functions can be packaged as UDFs to reduce development cost. The platform also has a very important debugging capability that simplifies the original debugging process and is transparent to users.

The biggest impact of real-time computing platform 2.0 has been reducing the burden on the data team. In our original development process, the data team often had to get involved, but a large part of that work, such as data synchronization or simple data processing, was relatively simple and did not really require them.

We only need to make the real-time computing platform more complete, easier to use, and simpler; other teams can then use FlinkSQL for the simple tasks above. Ideally, they can develop Flink jobs without even knowing Flink's concepts. For example, when back-end engineers build business-side features, relatively simple scenarios no longer depend on the data team, which greatly reduces communication cost and speeds up delivery. Being able to close the loop within one department is even better. In this way, everyone actually ends up happier: the product manager's job also becomes easier, fewer teams need to be pulled into the requirements phase, and the workload drops.

So this is a good example of using technical means to optimize organizational processes.

3. Real-time dashboards

img

Real-time dashboards are a fairly common feature. In our implementation, we ran into the following main difficulties:

  • First, delayed data reporting. For example, if the business database has a problem, CDC ingestion has to be interrupted, including the downstream writes to Kafka. If the Kafka cluster is under heavy load or has a problem, ingestion will also pause for a while, causing data delay. Delays like these can be avoided in theory but are hard to avoid completely in practice. There are also delays that cannot be avoided even in theory; for example, when a user's network or signal is poor, operation logs cannot be uploaded in real time.
  • Second, stream-batch unification, which mainly comes down to whether historical data and real-time data can be unified.
  • Third, real-time dimension selection. A real-time dashboard may need to select multiple dimension values flexibly. For example, you may want to look at the number of active users in Beijing, then in Shanghai, and finally in Beijing plus Shanghai; the dimensions should be freely selectable on demand.
  • Fourth, metric verification. Verifying metrics is relatively simple offline: you can look at data distributions to get a general picture, or compare computations on the ODS-layer data against intermediate tables and cross-verify. It is much harder in real time, because processing never stops, some situations are hard to reproduce, and the range or distribution of a metric is difficult to verify.

Real-time dashboards generally have two requirements:

  • The first is latency. Different scenarios have different latency requirements. In some, data arriving with a delay of one or two minutes is acceptable; in others, only a few seconds of delay is allowed. The complexity of the technical solution differs accordingly.
  • The second is combining real-time and historical views. In some scenarios, besides watching real-time data change, it is also necessary to compare against historical data.

Real-time and historical data should be stored in a unified way; otherwise many problems arise. First, the table structure becomes complicated: a query may need to determine which time range is historical and which is real-time and then stitch them together, making queries expensive to implement. Second, the switchover to historical data is error-prone. For example, if historical data is refreshed every day in the early morning and the historical job is delayed or wrong, the data on the dashboard will easily be wrong.

Our internal latency requirement for real-time dashboards is relatively strict, generally within seconds, because we want the numbers on the big screen to keep ticking and changing. The traditional pull-based approach, such as querying the database every second, is hard to make work: a page contains many metrics, so many concurrent query requests would have to be issued, and getting them all to return within one second is unlikely. Moreover, if many users query at the same time, the load becomes very high and timeliness is even harder to guarantee.

img

Therefore, we adopted a push-based approach. The figure above shows the architecture, which is divided into three layers. The first is the data layer, the real-time data warehouse in Kafka. Flink processes this data and pushes it to the backend in real time, and the backend in turn pushes it to the frontend. The backend and frontend communicate over web sockets, so all data can be pushed in real time.

In this scenario, some features become more complicated.

img

A simple example is computing deduplicated UV counts in real time, where one of the dimensions is city and a user may correspond to multiple cities. Selecting the UV count for Shanghai plus Beijing means users from both cities must be deduplicated together before computing the real-time UV figure, which is troublesome. Offline, selecting multiple dimensions is trivial: pick the dimensions and aggregate the data directly. In a real-time scenario, however, the dimensions to aggregate must be specified in advance.

img

The first solution is to store all user IDs and all dimensions that have appeared in Flink state, directly compute the UV for every possible dimension combination, and push the updated UVs to the frontend.

However, this approach adds a lot of computation and causes a combinatorial explosion of dimensions, which sharply increases storage cost.
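To make the explosion concrete, here is a toy sketch (data and names hypothetical, not NIO's implementation) of what precomputing a UV for every dimension combination entails:

```python
from itertools import combinations

# Scheme 1: keep per-user dimension values in state and precompute the UV
# for every possible combination of dimension values (hypothetical data).
state = {  # user_id -> set of cities the user has appeared in
    "u1": {"Beijing"},
    "u2": {"Shanghai"},
    "u3": {"Beijing", "Shanghai"},
}

def all_combination_uvs(state):
    cities = sorted({c for cs in state.values() for c in cs})
    uvs = {}
    # Every non-empty subset of cities gets its own deduplicated UV count:
    # 2^n - 1 combinations, which explodes as the number of values grows.
    for r in range(1, len(cities) + 1):
        for combo in combinations(cities, r):
            uvs[combo] = sum(1 for cs in state.values() if cs & set(combo))
    return uvs

uvs = all_combination_uvs(state)
print(uvs[("Beijing",)])             # users seen in Beijing -> 2
print(uvs[("Beijing", "Shanghai")])  # deduplicated across both -> 3
```

With just two city values there are already three combinations to maintain; with tens of values per dimension, and several dimensions, the state and compute grow exponentially.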

img

The architecture of the second scheme is shown above. We take streaming as the core and treat the end-to-end pipeline as a single streaming application: data ingestion, processing and computation in Flink, the backend, and the web-socket push to the frontend are all considered parts of one application.

We store every dimension value of each user in Flink state, and the backend stores the user details pushed by Flink as a list of user IDs under each city. The deduplication logic in Flink is critical: if a user-city combination has already appeared, the change is not pushed downstream; if the user is new, or the user has appeared but that city has not, the new user-city combination is pushed to the backend. This guarantees that the backend holds the user-ID list for each city and can deduplicate.
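The dedup-and-push step can be sketched as follows (a simplification in plain Python, not the actual Flink operator; names hypothetical):

```python
# Keyed state: per user, the set of cities already pushed downstream.
seen = {}  # user_id -> set of cities

def process(user_id, city):
    """Return the (user, city) pair to push, or None if already pushed."""
    cities = seen.setdefault(user_id, set())
    if city in cities:
        return None          # combination already known downstream: suppress
    cities.add(city)
    return (user_id, city)   # new user, or known user in a new city: push

events = [
    ("u1", "Beijing"),
    ("u1", "Beijing"),   # duplicate: suppressed
    ("u1", "Shanghai"),  # known user, new city: pushed
    ("u2", "Beijing"),
]
pushed = [p for p in (process(u, c) for u, c in events) if p is not None]
print(pushed)  # [('u1', 'Beijing'), ('u1', 'Shanghai'), ('u2', 'Beijing')]
```

Only new user-city combinations travel downstream, so the backend's per-city ID lists stay consistent without reprocessing duplicates.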

After the frontend selects a dimension, it can incrementally subscribe to the user IDs of that dimension from the backend. Two points are worth noting here:

  • First, when the frontend has just been opened and a dimension is selected, there is an initialization step: it reads the full set of user IDs for the selected dimension from the backend, builds a collection, and computes the UV count.
  • Second, when a new user ID arrives, Flink pushes it to the backend, and the backend pushes only the incremental ID to the frontend. Since the frontend keeps the previous collection, it can fold the incremental ID in and compute the new collection and its UV count in O(1) time.
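The two steps above amount to this (an illustrative sketch; function names hypothetical):

```python
# Frontend-side UV counting: a full pull on first load, then O(1) updates.

def init(full_ids):
    """Initialization: build the set from the full user-ID list."""
    return set(full_ids)

def on_increment(id_set, new_id):
    """Fold one incremental ID into the set and return the new UV count."""
    id_set.add(new_id)       # O(1) amortized; duplicates are absorbed
    return len(id_set)

ids = init(["u1", "u2", "u3"])   # first load: full pull from the backend
uv = on_increment(ids, "u2")     # already present: UV unchanged (3)
uv = on_increment(ids, "u4")     # new ID: UV grows
print(uv)  # 4
```

Because the set absorbs duplicates, the frontend never has to re-pull or re-deduplicate the full list after initialization.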

img

Some people may ask: what if there are too many users under this scheme? Will the frontend consume too many resources?

First, for our current actual usage scenarios, this solution is sufficient. If the number of IDs grows in the future, using a bitmap is an option, but a bitmap alone is not enough to solve the problem. User-ID generation rules differ across companies: some IDs are auto-incrementing, some are not, and some are not even numeric, so a mapping is required; discrete numeric values also need extra processing.

The first step is to re-encode IDs starting from 1 so they are smaller and contiguous. Most scenarios today use RoaringBitmap, whose characteristic is that very sparse IDs are stored internally as a list rather than a bitmap, which defeats the purpose of reducing memory usage. So the ID space should be made as small as possible and the ID values as contiguous as possible.
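The re-encoding itself is a simple dictionary mapping (sketch; single-process only, names hypothetical):

```python
# Re-encode arbitrary user IDs to small, contiguous integers starting from 1,
# so a Roaring-style bitmap stays dense and actually compresses.
mapping = {}  # raw ID -> dense ID

def encode(raw_id):
    if raw_id not in mapping:
        mapping[raw_id] = len(mapping) + 1  # next contiguous integer
    return mapping[raw_id]

dense = [encode(x) for x in ["a9f3", "07bc", "a9f3", "ffe1"]]
print(dense)  # [1, 2, 1, 3] - repeated raw IDs map to the same dense ID
```

The dense IDs can then back a bitmap whose populated range is exactly the number of distinct users, instead of the original sparse key space.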

But this is not enough. If an ID has not appeared before, it needs to be assigned a new mapped ID, yet the Flink job's parallelism may be greater than 1. If multiple nodes consume data at the same time, they may encounter the same new ID simultaneously. How do we assign that ID a single new, small mapped ID?

For example, when a node looks up an ID and finds no mapping, it needs to generate a new one, while guaranteeing that other nodes do not generate a different mapping for the same ID. This can be ensured with a unique index on the original ID: a node generates the new mapping only if its insert succeeds; a node whose insert fails simply retries the lookup, and since another node has already created the mapping, the retry will succeed.
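A single-process sketch of that insert-or-retry protocol, with `dict.setdefault` standing in for the unique-index insert (hypothetical names; real code would race across processes against a database):

```python
# Two workers racing to map the same new raw ID. Exactly one "insert" wins;
# the loser's retry simply reads the winner's mapping.
table = {}      # raw_id -> dense ID, "unique index" on raw_id
next_id = [1]   # next candidate dense ID

def get_or_create(raw_id):
    candidate = next_id[0]
    winner = table.setdefault(raw_id, candidate)  # insert if absent
    if winner == candidate:
        next_id[0] += 1   # our insert succeeded: the candidate ID is ours
    # otherwise our insert "failed"; returning the existing value is the
    # retry that fetches the mapping another worker already created
    return winner

a = get_or_create("user-42")  # worker 1 creates the mapping
b = get_or_create("user-42")  # worker 2 retries and gets the same mapping
print(a, b)  # 1 1
```

The key property is that both callers converge on one dense ID, mirroring the unique-index-plus-retry guarantee described above.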

Another scenario also needs consideration. For example, a user may generate behavior immediately after registering, while the registration table and the behavior tables of various business modules may be developed by different business departments and live in different databases, different tables, or even different types of databases. Because the ingestion paths then differ, the registration data stream may arrive slightly later than the behavior stream even though registration happened first. Could this cause problems?

At present it does not appear to. It is enough for the behavior data stream and the new-user registration stream to share one ID mapping.

To sum up, with a good architecture, even a surge in data volume requires no major architectural changes, only adjustments in the details.

The second question is: will the frontend have a large computational load?

The answer is no. Although the frontend performs the deduplication, it only needs to pull the full user set on first load; subsequent incremental user IDs are added to the current set in O(1) time. The frontend's computational burden is therefore very low, and the whole process is fully streaming.

The third question is: what if many users access the real-time report at the same time?

With the current architecture, there is basically no impact on Flink or the backend. The only effect is that many simultaneous users all need to hold a web-socket connection to the backend. Since the real-time report is mainly for internal rather than external use, there will not be too many simultaneous visits.

Moreover, part of the ID-deduplication work is placed on the frontend, so even with multiple simultaneous users, the computation is distributed across their browsers, and the actual load is not high.

4. CDP

img

CDP is an operations platform that sits mostly on the back-office side. Our CDP needs to store several kinds of data: attribute data in ES, detailed behavior data and statistics in Doris, and task execution records in TiDB. There are also some real-time scenarios.

First, attributes need to be updated in real time, otherwise campaign results may suffer. Second, aggregated behavior data sometimes also needs to be updated in real time.

5. Real-time data warehouse

The key considerations for real-time data warehouses are as follows:

img

  • Meta-information management, including Catalog management.
  • Layering: how to layer the warehouse reasonably.
  • Modeling: how should a real-time data warehouse be modeled, and how does it differ from an offline warehouse?
  • Timeliness: the lower the latency and the shorter the pipeline, the better.

img

The picture above is our current real-time data warehouse architecture. Overall it closely resembles an offline warehouse, with a raw layer, a DWD layer, a DWS layer, and an application layer.

The difference is the dimension layer (DIM layer), which spans several storage media. Dimension data can be placed in TiDB and accessed via async I/O dimension-table lookups; it can be placed in Hive and joined with Temporal Join; and data that keeps changing, or that needs time-based association, can be placed in Kafka and joined via Broadcast or Temporal Join.
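Conceptually, all of these options enrich a fact stream with dimension attributes, as in this in-memory sketch (analogous to a lookup join, not Flink API code; table and field names hypothetical):

```python
# Enriching a fact stream with a dimension table (lookup-join style).
dim_city = {  # dimension table, e.g. loaded from TiDB or broadcast via Kafka
    "110000": {"city": "Beijing", "region": "North"},
    "310000": {"city": "Shanghai", "region": "East"},
}

def enrich(event):
    """Attach dimension attributes to a fact row by key lookup."""
    dim = dim_city.get(event["city_code"], {})
    return {**event, **dim}

row = enrich({"user_id": "u1", "city_code": "310000"})
print(row["city"], row["region"])  # Shanghai East
```

The storage choice (TiDB, Hive, Kafka) mainly changes how `dim_city` is kept fresh and how point-in-time correctness is handled, not the join's basic shape.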

On the left are the capabilities we are planning.

  • The first is data lineage, which helps trace the source of problems and assess the impact of changes;
  • The second is meta-information management: we want all data to be described as tables so that SQL can be used directly when processing it;
  • The third is permission management, needed across different data sources and tables;
  • The fourth is data quality, and how to guarantee it.

The following is a detailed description of these future plans.

img

First, Catalog management; this feature has not been developed yet. We want to create a table for every data source, whether it holds a dimension table or any other table, and whether it lives in MySQL or Kafka. Once the table is created, these details are hidden and the data can be used conveniently through SQL.

img

Second, reasonable layering. Layering affects a real-time data warehouse in several ways.

  • First, the more layers, the greater the latency; whether a real-time warehouse needs so many layers is worth questioning.
  • Second, quality monitoring of real-time data is more complicated than for offline data, because the data is processed continuously; the more layers, the harder it is to discover, locate, backtrack, or reproduce problems, including monitoring of data distributions.
  • Finally, how to layer reasonably: keep the number of layers as small as possible, and partition the business vertically in a sensible way. If different businesses barely intersect, let each build its own layers within its own business area.

img

Third, modeling. This is a very important part of an offline data warehouse, because a large share of its users are analysts whose daily work is querying and analyzing data with SQL, so ease of use comes first, for example large wide tables that put all relevant fields into one table. When modeling an offline warehouse and designing table structures, dimensions that might be used should therefore be added as generously as possible.

A real-time data warehouse is used more by developers, so it emphasizes practicality. In a real-time warehouse, every extra field in a wide table adds latency, and extra dimensions especially so. Therefore the dimension tables and modeling of a real-time warehouse should fit actual needs more tightly.

img

Fourth, timeliness. A real-time data warehouse still needs a raw layer, but for highly time-sensitive scenarios, such as synchronizing online data whose results feed back into online services, the pipeline should be as short as possible to reduce latency. For example, Flink CDC can remove intermediate layers, which shortens the overall pipeline and lowers latency while also reducing the chance of failures. For internal analysis scenarios with looser latency requirements, the real-time warehouse should be used as much as possible to reduce data redundancy.

6. Other application scenarios

Other usage scenarios include CQRS-like applications. The business side mostly does create-read-update-delete work against traditional databases, but analytical needs will still arise later. Using the business database for analysis is not appropriate, because it was not designed with analysis in mind; an analytical OLAP engine is better suited to that work. In this way, what the business department is responsible for is separated from what the data department is responsible for, and each side plays its own role.

There is also metric monitoring and anomaly detection. For example, Flink can monitor various metrics in real time by loading a machine learning model and continuously checking whether metric changes match expectations and how far they deviate; thresholds can also be set to flag abnormal metrics.
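As a minimal stand-in for model-based detection, a trailing-statistics rule captures the idea of "how far from expected is this point" (a z-score sketch, not NIO's actual model; names hypothetical):

```python
# Flag points that deviate from the trailing mean by more than k standard
# deviations: a simple proxy for "change does not match expectations".
from statistics import mean, stdev

def detect(history, value, k=3.0):
    """Return True if `value` is anomalous relative to `history`."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) > k * sigma

history = [100, 102, 98, 101, 99, 100, 103, 97]
print(detect(history, 101))  # False: within the expected band
print(detect(history, 180))  # True: anomalous spike
```

A learned model would replace the mean/stdev estimate with a forecast, but the surrounding logic (score each new point against expectation, alert past a threshold) stays the same.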

Real-time data scenarios keep multiplying, and demand for real-time data keeps growing, so we will continue exploring this area. We have already made some progress on unifying real-time and offline storage for stream-batch integration, and we will invest more effort there, including evaluating whether Flink CDC can truly shorten pipelines and improve development efficiency.



For more Flink-related technical questions, you can scan the QR code to join the community DingTalk group to get the latest technical articles and community news as soon as possible. Please also follow the official WeChat account.



ApacheFlink