Header image

Abstract: This article is based on a presentation by Han Fei, senior technical expert at JD.com, in the special session of Flink Forward Asia 2021. The main contents include:

  1. Overall thinking
  2. Technical solutions and optimization
  3. Landing cases
  4. Future planning

Click to view live replay & speech PDF

1. Overall thinking

img

Any discussion of stream-batch integration has to start with the traditional big data architecture, Lambda. It effectively supports both offline and real-time data development requirements, but the separation of the streaming and batch data links brings high development and maintenance costs and inconsistent data calibers, defects that cannot be ignored.

The ideal situation is to meet both stream and batch data processing requirements with a single data link, that is, stream-batch integration. In addition, we believe there are meaningful intermediate stages: unifying only the computation, or only the storage, is also of great value.

Taking computation-only unification as an example, some data applications have high real-time requirements, such as an end-to-end processing delay of no more than one second. This is a big challenge for the open source storage systems currently suitable for stream-batch unification. Taking the data lake as an example, its data visibility is tied to the commit interval, which in turn is tied to Flink's checkpoint interval; combined with the length of the data processing link, this makes end-to-end second-level processing hard to achieve. For such requirements, unifying only the computation is therefore a viable option: unified computation reduces the user's development and maintenance costs and solves the problem of inconsistent data calibers.

img

In the process of implementing stream-batch integration, the challenges we faced can be summarized in the following four aspects:

  • The first is real-time data. Reducing end-to-end data latency to the second level is a big challenge because it involves both the computing engine and the storage technology. It is essentially a performance issue and a long-term goal.
  • The second challenge is compatibility with the offline batch processing capabilities that are already widely used in data processing, which involves two levels: development and scheduling. At the development level, the main issue is reuse, such as how to reuse existing offline table data models and how to reuse the custom Hive UDFs users already depend on. At the scheduling level, the issue is how to integrate reasonably with the scheduling system.
  • The third challenge is resources and deployment. For example, mixed deployment of different types of stream and batch applications can improve resource utilization, and metrics-based elastic scaling capabilities can improve it further.
  • The last challenge is also the most difficult one: user perception. For relatively new technical ideas, most users limit themselves to technical exchanges or verification; even when verification shows that a technology can solve practical problems, they still wait for a suitable business to test the water. This prompted some reflection: the platform side must look at problems from the user's perspective and reasonably evaluate the cost of changing the user's existing technical architecture, the user's benefits, and the potential risks of business migration.

img

The above picture is a panorama of JD.com's real-time computing platform and also the carrier of our stream-batch integration capability. The Flink in the middle is deeply customized based on the open source community version. Clusters built on this version have three external dependencies: JDOS, HDFS/CFS, and Zookeeper.

  • JDOS is JD.com's Kubernetes platform. Currently, all our Flink computing tasks are containerized and run on this platform;
  • Flink's state backend has two options: HDFS and CFS, of which CFS is an object storage developed by JD.com;
  • The high availability of Flink clusters is built on Zookeeper.

In terms of application development, the platform provides two methods: SQL and Jar packages. The Jar method lets users either upload the Flink application Jar directly or provide a Git address and have the platform do the packaging. In addition, the platform's functions are relatively complete, including basic metadata services, SQL debugging, full parameter configuration on the product side, metrics-based monitoring, task log query, and so on.

In terms of data sources, the platform supports a variety of types through connectors. JDQ is customized based on open source Kafka and is mainly used as the message queue in big data scenarios; JMQ is self-developed by JD.com and is mainly used as the message queue for online systems; JimDB is JD.com's self-developed distributed KV store.

img

In the current Lambda architecture, suppose the data of the real-time link is stored in JDQ and the data of the offline link is stored in a Hive table. Even when the same business model is calculated, the metadata definitions on the two sides often differ, so we introduce a unified logical model to be compatible with the metadata on both the real-time and offline sides.

In the calculation process, Flink SQL combined with UDFs implements the stream-batch unified calculation of business logic. In addition, the platform provides a large number of public UDFs and also supports user-uploaded custom UDFs. For the output of the calculation results, we likewise introduce a unified logical model to shield the differences between the stream and batch sides. For scenarios where only the computation is unified, the results can be written to the respective stream and batch storages, keeping the data consistent with the existing links.

For scenarios where both computation and storage are unified, we can write the calculation results directly to the stream-batch unified storage. We chose Iceberg as this unified storage because of its good architectural design, such as not being bound to a specific engine.

img

In terms of compatibility with batch processing capabilities, we mainly carried out work in the following three areas:

First, reuse Hive tables in offline data warehouses.

Taking the data source side as an example, in order to shield the metadata differences between stream and batch shown on the left of the figure above, we define a logical model, the gdm_order_m table, and the user explicitly specifies the mapping between the fields of the Hive table and the topic and the fields of this logical table. Defining this mapping is very important, because computation based on Flink SQL only needs to face the logical table and does not need to care about the actual field information in the Hive table and the topic. When the stream table and batch table are created through connectors at runtime, the fields in the logical table are replaced with the actual fields according to the mapping.
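
For concreteness, the sketch below (with hypothetical field names) shows what such a logical table can look like: the business SQL only references the logical fields, and the platform substitutes the actual Hive or topic fields at runtime according to the configured mapping.

```sql
-- Minimal sketch, hypothetical field names: the logical table that both the JDQ
-- topic and the Hive table are bound to on the platform. The connector is left
-- unspecified because it is filled in when the stream table or batch table is
-- created at runtime from the field mapping.
CREATE TABLE gdm_order_m (
  order_id   BIGINT,
  user_id    BIGINT,
  order_amt  DECIMAL(18, 2),
  order_time TIMESTAMP(3)
) WITH (
  'connector' = '...'  -- resolved to the stream or batch connector at runtime
);
```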

On the product side, we can bind the stream table and the batch table to the logical table respectively and specify the field mappings by dragging and dropping. This changes the development mode: previously, the user would create a task, specify whether it was a stream task or a batch task, develop the SQL, set the task-related configuration, and finally publish the task. In the stream-batch integration mode, development starts with completing the SQL, including the logical and physical DDL definitions, the field mappings between them, and the DML; then the stream and batch task configurations are specified separately; finally, two tasks, one stream and one batch, are published.
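
As a rough community-Flink analogue of this flow (not the platform mechanism itself, using the hypothetical names from the sketch above and recent SQL Client syntax), the DML is written once against the logical table, and the runtime mode decides whether it runs as a stream or a batch task:

```sql
-- Written once, published twice: the only difference between the two published
-- tasks is the runtime mode and the physical table bound to gdm_order_m.
SET 'execution.runtime-mode' = 'streaming';   -- or 'batch' for the batch task

INSERT INTO adm_order_result
SELECT user_id, SUM(order_amt) AS total_amt
FROM gdm_order_m
GROUP BY user_id;
```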

Second, integrate with the scheduling system.

The data processing of offline data warehouses is basically based on Hive/Spark combined with a scheduling system. Taking the middle picture above as an example, data processing is divided into four stages, corresponding to the BDM, FDM, GDM, and ADM layers of the data warehouse. As Flink's capabilities grow, users want to replace the GDM-layer data processing with a Flink SQL batch task, which requires embedding that batch task into the current data processing flow as an intermediate link.

To solve this problem, besides letting the task itself support the configuration of scheduling rules, we also integrated with the scheduling system: we inherit the dependencies of the parent tasks from it and synchronize the task's own information back into the scheduling system so that it can serve as the parent of downstream tasks. In this way, the Flink SQL batch task becomes one of the links in the original data processing flow.

Third, reuse user-defined Hive UDFs, UDAFs, and UDTFs.

For existing Hive-based offline processing tasks, if users have already developed UDFs, the ideal approach when migrating to Flink is to reuse those UDFs directly rather than re-implementing them against Flink's UDF interfaces.

Regarding UDF compatibility, the community provides the load hive modules solution for scenarios that use Hive's built-in functions. If users want to use their own Hive UDFs, they can do so via create catalog, use catalog, create function, and finally calling the function in DML. However, this process registers the function information in the Hive Metastore. From the perspective of platform management, we want users' UDFs to have a certain degree of isolation, limited to the granularity of the user's job, to reduce the risk of interacting with the Hive Metastore and producing dirty function metadata.
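
A sketch of the community approach just described (class name, path, and table are hypothetical); note that the CREATE FUNCTION statement persists the function into the Hive Metastore, which is exactly what we want to avoid at per-job granularity:

```sql
CREATE CATALOG my_hive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/path/to/hive/conf'
);
USE CATALOG my_hive;

-- Persists the function metadata into the Hive Metastore
CREATE FUNCTION my_parse_udf AS 'com.example.hive.MyParseUDF';

SELECT my_parse_udf(order_source) FROM gdm_order_m;
```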

In addition, once the meta information has been registered, we want it to be usable on the Flink platform the next time as well. Without an if not exists syntax, the user usually has to drop the function first and then create it again, which is not elegant and also constrains how users work. Another option is to register a temporary Hive UDF. In Flink 1.12, temporary UDFs are registered via create temporary function, but the function must implement the UserDefinedFunction interface to pass the subsequent validation, otherwise registration fails.

So instead of using create temporary function, we made some adjustments to create function: we extended ExtFunctionModule, registered the parsed FunctionDefinition into ExtFunctionModule, and performed a temporary, job-level registration. The advantage is that this does not pollute the Hive Metastore, provides good isolation, and does not restrict users' habits, giving a good experience.

However, the community has comprehensively solved this problem in version 1.13. By introducing extensions such as the Hive parser, custom Hive functions that implement the UDF or GenericUDF interfaces can be registered and used via the create temporary function syntax.
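
With that 1.13 capability (again using a hypothetical class name), the per-job registration looks like this and leaves the Hive Metastore untouched:

```sql
-- Job-level registration of a Hive UDF / GenericUDF, no Metastore interaction
CREATE TEMPORARY FUNCTION my_parse_udf AS 'com.example.hive.MyParseUDF';

SELECT my_parse_udf(order_source) FROM gdm_order_m;
```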

img

In terms of resource consumption, stream processing and batch processing are naturally staggered. For batch processing, the offline data warehouse starts computing the previous day's data at 00:00 every day, and the processing of all offline reports is finished before the start of the next working day, so 00:00 to 08:00 is the period when batch computing tasks occupy a large amount of resources, while online traffic during this period is usually low. The load of stream processing is positively correlated with online traffic, so its resource requirements are correspondingly low during this period. From 08:00 to midnight, online traffic is relatively high, and most batch tasks are not triggered during this period.

Based on this natural staggering of peaks, we can improve resource utilization by co-locating different types of stream and batch applications in dedicated JDOS Zones, and if the Flink engine is used uniformly to process both stream and batch applications, resource utilization is even higher.

At the same time, to enable applications to adjust dynamically based on traffic, we developed an auto-scaling service. It works as follows: Flink tasks running on the platform report metrics to the metrics system; the auto-scaling service decides, based on key indicators such as TaskManager CPU usage and task backpressure, whether computing resources need to be increased or decreased, and feeds the result back to the JRC platform; the JRC platform synchronizes the adjustment to the JDOS platform through the embedded fabric client, which completes the adjustment of the number of TaskManager pods. Users can decide whether to enable this feature for a task through configuration on the JRC platform.

The chart on the right of the figure above shows the CPU usage during our pilot of stream-batch co-location in a JDOS Zone combined with the elastic scaling service. The stream tasks are scaled down at 00:00, releasing resources to batch tasks. The batch tasks we configured start executing at 02:00, so from 02:00 until they finish in the morning the CPU usage is relatively high, up to more than 80%. After the batch tasks finish and online traffic starts to increase, the stream tasks are scaled up and CPU usage rises again.

2. Technical solutions and optimization

Stream-batch integration takes Flink SQL as its core carrier, so we also optimized the underlying capabilities of Flink SQL, including dimension table optimization, join optimization, window optimization, and Iceberg connector optimization.

img

The first is several dimension-table-related optimizations. The current community version of Flink SQL only supports modifying the parallelism of sink operators for some data sources, and does not support modifying the parallelism of source or intermediate operators.

Suppose a Flink SQL task consumes a topic with 5 partitions; the actual parallelism of the downstream operators is then 5, and the relationship between operators is forward. For dimension table join scenarios with a relatively large amount of data, we want a higher degree of parallelism to improve efficiency, and we want to set that parallelism flexibly rather than being bound to the number of upstream partitions.

Based on this, we developed a topology preview function. Whether the task is a Jar package or SQL, it can be parsed into a StreamGraph for preview, and we support modifying grouping, the operator chaining strategy, parallelism, uid settings, and so on.

With this function, we can adjust the parallelism of the dimension table join operator, change the partition strategy from forward to rebalance, and update the adjusted information into the StreamGraph. In addition, we implemented a dynamic rebalance strategy that judges the load of downstream partitions based on backlog, so as to choose the optimal partition for data distribution.

img

To improve dimension table join performance, we implemented async I/O for all dimension table data sources supported by the platform and added in-memory caching. Whether the native forward method or the rebalance method is used, there are problems with cache invalidation and replacement. So how can we improve the hit rate of the dimension table cache and reduce cache evictions?

Taking the native forward method as an example, forward means that each subtask caches random dimension table data, determined by whatever join key values it happens to see. By hashing on the dimension table join key, we can ensure that each downstream operator caches a different slice of dimension data associated with its join keys, which effectively improves the cache hit rate.

At the implementation level, we added an optimization rule, StreamExecLookupHashJoinRule, to the physical rewrite phase. It inserts a StreamExecChange node between the bottom scan (StreamExecTableSourceScan) and the dimension table join (StreamExecLookupJoin), which performs the hash distribution by the dimension table join key. This can be enabled by specifying lookup.hash.enable=true in the dimension table DDL.
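
A sketch of how this is switched on (connector properties elided, table and field names hypothetical; lookup.hash.enable is the JD extension described above, not a community connector option):

```sql
CREATE TABLE dim_item (
  item_id   BIGINT,
  item_name STRING
) WITH (
  'connector' = '...',             -- any dimension-table connector supported by the platform
  'lookup.hash.enable' = 'true'    -- hash-distribute lookups by join key before the lookup join
);

-- Standard lookup join; with the option above, records are hashed by item_id so
-- that each subtask caches a disjoint slice of the dimension data.
SELECT o.order_id, d.item_name
FROM orders AS o
JOIN dim_item FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.item_id = d.item_id;
```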

We enabled the cache for forward, rebalance, and hash, and ran performance tests on the same scenario: 100 million records in the main table joined against 10,000 records in the dimension table. Under different computing resources, rebalance delivers several times the performance of the original forward method, and hash brings a further several-fold improvement over rebalance; the overall effect is quite impressive.

img

For the relatively low efficiency of single-row dimension table lookups, the solution is also very simple: batch the lookups. The batch size can be specified in the DDL via lookup.async.batch.size. In addition, we introduced a Linger mechanism as a time-dimension limit, to prevent extreme scenarios where a batch cannot be accumulated and latency becomes high; the waiting time can be specified in the DDL via lookup.async.batch.linger.
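
A sketch of the same dimension table DDL with the two mini-batch options described above (values are illustrative; both options are JD extensions):

```sql
CREATE TABLE dim_item (
  item_id   BIGINT,
  item_name STRING
) WITH (
  'connector' = '...',
  'lookup.async.batch.size'   = '100',   -- flush a lookup batch once 100 keys are buffered
  'lookup.async.batch.linger' = '10ms'   -- or after 10 ms, whichever comes first
);
```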

In tests, this mini-batch approach brought a 15% to 50% performance improvement.

img

Interval join is also a frequently used scenario in production. This type of business is characterized by very large traffic, for example 100 GB every 10 minutes. In an interval join, the data of both streams is cached in internal state; when data arrives on either side, the data in the corresponding time range of the opposite stream is fetched to execute the join, so such high-traffic tasks carry very large state.

For this, we chose RocksDB as the state backend, but even after parameter tuning the effect was still not satisfactory: after the task runs for a period of time, backpressure appears, RocksDB performance drops, and CPU usage becomes relatively high.

Through analysis, we found that the root cause is the prefix-based scan that Flink uses when scanning RocksDB. The solution is therefore simple: based on the query condition, precisely construct the upper and lower bounds and turn the prefix query into a range query. Concretely, the key bounds of the query condition become keyGroup+joinKey+namespace+timestamp[lower,upper], which precisely queries only the data between the relevant timestamps. The backpressure problem was solved, and the larger the amount of data, the more obvious the improvement brought by this optimization.

img

A regular join keeps all historical data in state, so with large traffic the state also becomes very large, and its retention depends on the table.exec.state.ttl parameter: a larger value of this parameter likewise leads to larger state.

For this scenario, we instead use external storage, JimDB, to hold the state data. At present only inner join is implemented, and the mechanism is as follows: while the two streams emit the joined data, all data is written to JimDB in mini-batch mode, and during the join both the in-memory data and the corresponding data in JimDB are scanned. In addition, the table.exec.state.ttl behavior is implemented via JimDB's TTL mechanism, which cleans up expired data.

The advantages and disadvantages of this implementation are both obvious: the advantage is that it supports very large state; the disadvantage is that the state cannot currently be covered by Flink checkpoints.

img

For window optimization, the first item is the window offset. The earliest demand came from an online scenario: for example, we wanted to count an indicator from 0:00 on December 4, 2021 to 0:00 on December 5, 2021, but since the online cluster uses East Eight (UTC+8) time, the actual statistical result covered 8:00 a.m. on December 4, 2021 to 8:00 a.m. on December 5, 2021, which was obviously not as expected. Therefore, this function was first used to fix cross-day window statistics errors in non-UTC time zones.

After adding the window offset parameter, the start time of the window can be set very flexibly, supporting a wider range of requirements.

Secondly, there is another scenario: although the user sets the window size, they want to see the window's current calculation result earlier so they can make decisions earlier. Therefore, we added an incremental window function that triggers output of the window's current result at a configured incremental interval.
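
A sketch of how the two window extensions are meant to be used (the daily aggregation itself is standard Flink SQL over a hypothetical event-time attribute; the offset and incremental-interval settings are JD extensions, and the option names in the comments are illustrative only):

```sql
-- Daily aggregation over the event-time attribute order_time (hypothetical schema).
SELECT
  user_id,
  TUMBLE_START(order_time, INTERVAL '1' DAY) AS win_start,
  SUM(order_amt) AS total_amt
FROM gdm_order_m
GROUP BY user_id, TUMBLE(order_time, INTERVAL '1' DAY);
-- On the platform the window would additionally be configured with, for example,
--   a window offset of -8 h, so the window covers a UTC+8 calendar day, and
--   an incremental interval of 1 h, so the current result is emitted hourly.
```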

img

For applications without high end-to-end real-time requirements, Iceberg can be chosen as the downstream unified storage. However, due to the nature of the computation, the user's checkpoint interval configuration, and so on, a large number of small files may be generated. Since we use HDFS as Iceberg's underlying storage, a large number of small files puts great pressure on the NameNode, so small files need to be merged.

The Flink community provides a tool for merging small files based on a Flink batch job, which can solve this problem, but it is somewhat heavyweight, so we developed an operator-level small-file merging implementation. The idea is that after the native global commit we added three operators, compactCoordinator, compactOperator, and compactCommitter: compactCoordinator is responsible for acquiring and dispatching the snapshots to be merged, compactOperator executes the merge concurrently, and compactCommitter is responsible for committing the merged DataFiles.

We added two parameters to the DDL: auto-compact specifies whether to enable file merging, and compact.delta.commits specifies after how many commits a compaction is triggered.
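
A sketch of an Iceberg sink DDL with the two options (field names and values are illustrative; catalog configuration is elided):

```sql
CREATE TABLE iceberg_sink (
  order_id  BIGINT,
  order_amt DECIMAL(18, 2)
) WITH (
  'connector'             = 'iceberg',
  -- catalog-name / catalog-type / uri / warehouse properties elided
  'auto-compact'          = 'true',   -- enable operator-level small-file merging
  'compact.delta.commits' = '5'       -- trigger a compaction every 5 commits
);
```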

img

In actual business requirements, users may read nested data from Iceberg. Although the SQL can specify reading only the data inside a nested field, when the data is actually read all fields of the nested structure are read and the fields the user needs are extracted afterwards, which directly increases the CPU and network bandwidth load. Hence the requirement: how to read only the fields the user really needs?

To solve this problem, two conditions must be met. The first is that the schema used to read Iceberg contains only the fields the user needs; the second is that Iceberg supports reading data by column name, which is already satisfied, so we only needed to implement the first condition.

As shown on the right of the figure above, we combine the previous tableSchema and projectFields information to reconstruct a new data structure, PruningTableSchema, that contains only the fields the user needs, and use it as the input schema for Iceberg. This achieves column pruning of the nested structures the user actually uses. The example in the lower left of the figure compares reading a nested field before and after the optimization; useless fields are effectively pruned based on PruningTableSchema.
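
To illustrate with a hypothetical schema: before the optimization, the query below reads the whole buyer struct and then extracts user_id; with PruningTableSchema, only buyer.user_id is requested from the Iceberg reader.

```sql
CREATE TABLE iceberg_orders (
  order_id BIGINT,
  buyer    ROW<user_id BIGINT, user_name STRING, address STRING>
) WITH (
  'connector' = 'iceberg'
  -- catalog properties elided
);

-- Only the nested sub-field actually referenced should be read after pruning.
SELECT order_id, buyer.user_id FROM iceberg_orders;
```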

After this optimization, CPU usage dropped by 20% to 30%, and under the same amount of data the execution time of batch tasks was shortened by 20% to 30%.

img

In addition, we implemented some other optimizations, such as fixing the data loss problem in interval outer joins where data is emitted later than the watermark and there are time-based operators downstream, UDF reuse, a Flink SQL keyBy syntax extension, dimension table data preloading, and Iceberg connector reads from a specified snapshot.

3. Landing cases

img

JD.com currently has 700+ Flink SQL tasks online, accounting for about 15% of all Flink tasks, and the cumulative peak processing throughput of Flink SQL tasks exceeds 110 million records per second. The current customization and optimization is mainly based on community version 1.12.

3.1 Case 1

img

Stream-batch integrated construction of the real-time general data layer RDDM. RDDM stands for real-time detail data model. It involves orders, traffic, commodities, users, and so on, and is an important part of JD.com's real-time data warehouse, serving many core businesses such as Golden Eye/business intelligence, JDV, advertising algorithms, and search algorithms.

The real-time business model of the RDDM layer is consistent with the business processing logic of the ADM and GDM layers in the offline data warehouse. Based on this, we hope to unify the stream and batch computation of the business models through Flink SQL. At the same time, these services have very distinct characteristics: for example, order-related business models involve large-state processing, and traffic-related business models have relatively high end-to-end real-time requirements. In addition, some special scenarios require customized development.

img

The implementation of RDDM has two core demands: first, its computation requires associating a lot of data, and a large amount of dimension data is stored in HBase; second, some dimension data queries go through a secondary index, where the index table must be queried first, the qualifying keys extracted from it, and only then is the dimension table queried for the real data.

In response to these demands, we improve join efficiency by combining dimension table data preloading with the dimension table keyBy function. For the secondary index query requirement, we customized a connector.

Dimension table data preloading means loading the dimension table data into memory during the initialization phase. Combined with keyBy, this effectively reduces the number of cached copies and improves the hit rate.

Some business models are associated with a lot of historical data, resulting in relatively large state. At present we customize and optimize according to the scenario; we believe the fundamental solution is to implement an efficient KV-based state backend, and this work is being planned.

3.2 Case 2

img

Public opinion analysis of the black market for buying and selling traffic. The main flow is as follows: the source obtains relevant information through crawlers and writes it to JMQ; after the data is synchronized to JDQ, it is processed by Flink and written to the downstream JDQ; at the same time, the DTS data transmission service synchronizes the upstream JDQ data to HDFS, where offline processing is done through Hive tables.

This business has two characteristics: first, the end-to-end real-time requirement is not high, and minute-level delays are acceptable; second, the offline and real-time processing logic is the same. Therefore, the intermediate storage can be replaced from JDQ to Iceberg and read incrementally with Flink, with the business logic implemented in Flink SQL, thereby fully unifying the two stream and batch links. The data in the Iceberg table can also be used for OLAP queries or further offline processing.
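
A sketch of the incremental read from the Iceberg table that replaced the intermediate JDQ hop (table and field names are hypothetical; the streaming and monitor-interval read options follow the Iceberg Flink connector):

```sql
-- Incrementally consume newly committed snapshots instead of reading a JDQ topic.
SELECT opinion_id, content, crawl_time
FROM opinion_detail_iceberg
/*+ OPTIONS('streaming'='true', 'monitor-interval'='60s') */;
```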

The end-to-end delay of this link is about one minute. The operator-level small-file merging function works effectively, storage and computing costs are significantly reduced, and development and maintenance costs are reduced by more than 30%.

4. Future planning

img

Future plans are mainly divided into the following two aspects:

First, business development. We will strengthen the promotion of Flink SQL tasks, explore more stream-batch integration business scenarios, and polish the product form to accelerate users' transition to SQL. At the same time, we will integrate platform metadata and offline metadata more deeply to provide better metadata services.

Second, platform capabilities. We will continue to dig into join scenarios and large-state scenarios, explore efficient KV-type state backend implementations, and continue to optimize under the framework of unified computation and unified storage to reduce end-to-end latency.

Click to view live replay & speech PDF

