Introduction: This article is compiled from a talk given by Alibaba Cloud technical expert Li Shaofeng (Fengze) at the Apache Hudi and Apache Pulsar joint Meetup, Hangzhou station. It introduces a typical scenario of ingesting CDC data into the lake and how to build a data lake with Pulsar and Hudi, and also shares Hudi's kernel design, the community's new vision, and the latest community developments.
Slides for this talk:
"CDC Data Ingestion into the Lake Based on Apache Hudi".pdf 1613f0c110c644
Related material:
"Alibaba Cloud's Lakehouse Practice Based on Hudi".pdf 1613f0c110c68d
1. CDC background introduction
First, what is CDC? CDC stands for Change Data Capture, a very common technique in the database field. It captures changes made in a database and delivers the change data downstream. It has a wide range of applications: data synchronization, data distribution, data collection, and ETL. Today's focus is using CDC to ETL data into a data lake.
There are two main categories of CDC in the industry. The first is query-based: a client queries the source table for changed rows via SQL and sends them out. The second is log-based, which is the more widely used approach: changes are recorded in the database's binlog, and the binlog is parsed and written to a message system, or processed directly with Flink CDC.
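To make the query-based approach concrete, here is a minimal sketch in Scala/Spark that periodically polls the source table over JDBC for rows changed since the last watermark. The table, column, and connection details are hypothetical, and this style of polling misses deletes and intermediate states, which is part of why log-based CDC is usually preferred.

```scala
// A minimal sketch of query-based CDC: poll the source for rows changed since
// the last watermark. "orders" / "update_time" / the JDBC URL are placeholders.
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("query-based-cdc").getOrCreate()

def pullChangesSince(lastWatermark: Timestamp) =
  spark.read.format("jdbc").
    option("url", "jdbc:mysql://source-host:3306/shop").
    option("user", "reader").option("password", "***").
    // Only rows modified after the last successful pull.
    option("query", s"SELECT * FROM orders WHERE update_time > '$lastWatermark'").
    load()
```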
The two differ in important ways: the query-based approach is simpler but intrusive, while the log-based approach is non-intrusive and has no impact on the data source, although parsing the binlog is more complex.
Across query-based and log-based CDC there are four implementation techniques: timestamp-based, trigger-based, snapshot-based, and log-based. The table below compares these approaches.
The comparison shows that the log-based approach is the best overall, although parsing the log is more complex. Fortunately, there are many open-source binlog parsers in the industry, the most common and popular being Debezium, Canal, and Maxwell. ETL pipelines can be built on top of these binlog parsers.
Let's now look at a CDC warehousing architecture that is popular in the industry.
The ingestion pipeline is split into a real-time stream and an offline stream. The real-time stream parses the binlog with Canal and writes it to Kafka, and the Kafka data is then synchronized to Hive every hour. The offline stream pulls a full snapshot of the source table into the Hive ODS layer, because the real-time stream alone would leave the data incomplete; this full load is imported once via an offline SQL Select. For each ODS table, the stock data and the incremental data are then merged. As you can see, the ODS layer here is not very fresh, with hour-level and even day-level delays. By introducing Apache Hudi, the latency of the ODS layer can be reduced to minutes.
2. How CDC data is ingested into the lake
The architecture for ingesting CDC data into the lake is very simple: various upstream sources, such as database change data, event streams, and other external data sources, are written into the table through a change stream, and the table is then queried and analyzed externally. The whole architecture is very simple.
Although the architecture is simple, it still faces many challenges. Take an Apache Hudi data lake as an example: the lake stores a variety of data as files. To process CDC data, certain files in the lake must be changed reliably and transactionally, so that downstream queries never see partial results. CDC data also needs to be updated and deleted efficiently, which requires quickly locating the files that changed. In addition, since data arrives in many small batches, small files should be handled automatically, sparing users complicated small-file management. Finally, there is query-oriented layout optimization: techniques such as Clustering can reorganize the file layout to provide better query performance externally.
How does Apache Hudi address these challenges? First, it supports transactional writes, including an MVCC mechanism between readers and writers so that writes do not affect reads, along with transaction and concurrency control; concurrent writes use an OCC (optimistic concurrency control) locking mechanism. For updates and deletes, built-in and customizable indexes make locating the affected data efficient. For query optimization, Hudi automatically manages small files internally, growing files toward a user-specified size such as 128 MB, which is a core Hudi feature. In addition, Hudi provides Clustering to optimize the file layout.
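As a rough illustration of the small-file handling and Clustering mentioned above, the sketch below shows how these behaviors are driven by write options on the Spark datasource path. The table name, path, and threshold values are illustrative assumptions, not the talk's production settings.

```scala
// A minimal sketch (illustrative values) of small-file handling and inline
// clustering on a Hudi Spark datasource write. Paths/names are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-small-file-demo").getOrCreate()
val batch = spark.read.parquet("/tmp/incoming_batch")   // hypothetical input batch

batch.write.format("hudi").
  option("hoodie.table.name", "orders").
  option("hoodie.datasource.write.recordkey.field", "order_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  // Grow files toward ~128 MB; files below 100 MB are considered "small",
  // so new records are routed into them instead of creating more files.
  option("hoodie.parquet.max.file.size", 128 * 1024 * 1024L).
  option("hoodie.parquet.small.file.limit", 100 * 1024 * 1024L).
  // Optionally trigger inline clustering every few commits to rewrite the
  // layout for better query performance.
  option("hoodie.clustering.inline", "true").
  option("hoodie.clustering.inline.max.commits", "4").
  mode(SaveMode.Append).
  save("/data/lake/orders")
```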
The figure below shows two typical CDC ingestion links. The first, adopted by most companies, uses a CDC tool to import the change data into Kafka or Pulsar, and then consumes it with Flink or Spark streaming and writes it to Hudi. The second architecture connects Flink CDC directly to the upstream MySQL data source and writes straight to the downstream Hudi table; a sketch of this second link follows.
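Below is a minimal sketch of the second link, assuming the Flink MySQL CDC connector and the Hudi Flink bundle are on the classpath. The table names, columns, and connection settings are hypothetical placeholders.

```scala
// Sketch of Flink CDC -> Hudi: a MySQL CDC source table and a Hudi sink table,
// connected with a continuous INSERT. All identifiers are placeholders.
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

val env  = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

tEnv.executeSql(
  """CREATE TABLE orders_src (
    |  order_id BIGINT, user_id BIGINT, amount DECIMAL(10, 2), ts TIMESTAMP(3),
    |  PRIMARY KEY (order_id) NOT ENFORCED
    |) WITH (
    |  'connector' = 'mysql-cdc', 'hostname' = 'mysql-host', 'port' = '3306',
    |  'username' = 'cdc_user', 'password' = '***',
    |  'database-name' = 'shop', 'table-name' = 'orders'
    |)""".stripMargin)

tEnv.executeSql(
  """CREATE TABLE orders_hudi (
    |  order_id BIGINT, user_id BIGINT, amount DECIMAL(10, 2), ts TIMESTAMP(3),
    |  PRIMARY KEY (order_id) NOT ENFORCED
    |) WITH (
    |  'connector' = 'hudi', 'path' = 'hdfs:///data/lake/orders',
    |  'table.type' = 'MERGE_ON_READ'
    |)""".stripMargin)

// Continuously apply insert/update/delete events from MySQL to the Hudi table.
tEnv.executeSql("INSERT INTO orders_hudi SELECT * FROM orders_src")
```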
Both links have their pros and cons. The first introduces a unified data bus and therefore has good scalability and fault tolerance. The second has slightly weaker scalability and fault tolerance, but since it involves fewer components, its maintenance cost is correspondingly lower.
This is the CDC ingestion link of the Alibaba Cloud Database OLAP team. Since we are a Spark team, we use a Spark Streaming link to write into the lake. The link consists of two parts. First, a full-synchronization job pulls a full snapshot through Spark; if a replica database is available, it connects to the replica for the full synchronization to avoid impacting the primary database, and writes the snapshot to Hudi. Then an incremental job is started, which uses Spark to consume binlog data from Alibaba Cloud DTS and synchronizes it to the Hudi table in near real time. Scheduling of the full and incremental jobs is coordinated by Lakehouse's automatic job scheduling, and Hudi's Upsert semantics guarantee eventual consistency when stitching the full and incremental data together, so no data is duplicated or lost at the handover point.
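The sketch below illustrates this full-plus-incremental pattern under assumed table and column names: bulk-load a snapshot first, then keep applying change batches with Hudi's upsert semantics so the handover stays consistent. It is a simplified outline, not the team's actual job code.

```scala
// Simplified sketch: full snapshot via bulk_insert, then incremental upserts.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("full-plus-incremental").getOrCreate()
val hudiPath = "/data/lake/orders"

def writeHudi(df: DataFrame, operation: String): Unit =
  df.write.format("hudi").
    option("hoodie.table.name", "orders").
    option("hoodie.datasource.write.recordkey.field", "order_id").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.partitionpath.field", "dt").
    option("hoodie.datasource.write.operation", operation).
    mode(SaveMode.Append).
    save(hudiPath)

// 1) Full synchronization: read a snapshot from the replica and bulk-load it.
val fullDf = spark.read.format("jdbc").
  option("url", "jdbc:mysql://replica-host:3306/shop").
  option("dbtable", "orders").
  option("user", "reader").option("password", "***").
  load()
writeHudi(fullDf, "bulk_insert")

// 2) Incremental synchronization: each micro-batch of parsed binlog events
//    (e.g. consumed from DTS) is upserted; records with the same key converge
//    to the latest version via the precombine field.
def onBatch(changeDf: DataFrame): Unit = writeHudi(changeDf, "upsert")
```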
Our team has also made several optimizations in Lakehouse's CDC ingestion link.
The first is handling schema changes in the source database: our customers' scenarios include adding, deleting, and modifying columns. Before Spark writes to Hudi, it validates the schema; if the schema is valid, the write proceeds normally, otherwise the write fails. Deleting a field makes schema validation fail and causes the job to fail, which hurts stability. We therefore catch the schema validation exception; if we find that fields have been removed, we automatically fill in the missing fields and retry, keeping the link stable.
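A simplified sketch of this "fill missing fields and retry" idea is shown below. It reuses the hypothetical writeHudi helper from the earlier sketch, and the exception check is intentionally naive; real schema reconciliation also has to handle nested fields and type changes.

```scala
// Sketch: add back columns dropped upstream as NULLs, then retry the write.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructType

def fillMissingColumns(batch: DataFrame, targetSchema: StructType): DataFrame =
  targetSchema.fields.foldLeft(batch) { (df, field) =>
    if (df.columns.contains(field.name)) df
    // Column was dropped upstream: add it back as NULL so validation passes.
    else df.withColumn(field.name, lit(null).cast(field.dataType))
  }

def writeWithRetry(batch: DataFrame, targetSchema: StructType): Unit =
  try {
    writeHudi(batch, "upsert")                          // helper from earlier sketch
  } catch {
    case e: Exception if e.getMessage != null && e.getMessage.contains("schema") =>
      writeHudi(fillMissingColumns(batch, targetSchema), "upsert")
  }
```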
Second, some customer tables have no primary key or an unreasonable one, for example an update-time field used as the primary key, or a partition field that changes over time. In those cases the data written to Hudi would not match the source table. We therefore made product-level optimizations that let users set a reasonable primary key and partition-field mapping, so the data synchronized to Hudi stays fully aligned with the source database.
Another common requirement is that users add new tables to the upstream database. With table-level synchronization, the link is not aware of the new table and cannot synchronize it to Hudi. Lakehouse can synchronize an entire database, so when a new table is added it is detected automatically and its data is automatically synchronized to Hudi, giving the link the ability to pick up newly added tables.
Another optimization concerns CDC write performance. When a batch contains Insert, Update, and Delete events, should we always write with Hudi's Upsert? Upsert is simple to control and deduplicates data, but the index lookup it requires is expensive, whereas the Insert path needs no index lookup and is much faster. Therefore, for each batch we check whether it contains only Insert events; if so, it is written in Insert mode, avoiding the cost of looking up which files need updates. Our measurements show a 30%–50% performance improvement. Of course, we also have to consider DTS failures: while re-consuming data during recovery, the Insert path cannot be used directly, otherwise data may be duplicated. For this we introduced a table-level watermark to guarantee no duplicate data even when DTS misbehaves.
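The sketch below shows the shape of this insert fast path under assumed column and flag names (the "op" event-type column and the recovery flag are hypothetical); it again reuses the earlier writeHudi helper.

```scala
// Sketch: use "insert" only when the batch has no Update/Delete events and we
// are not re-consuming after a DTS failure; otherwise fall back to "upsert".
import org.apache.spark.sql.DataFrame

def chooseOperation(batch: DataFrame, recovering: Boolean): String = {
  // "op" is a hypothetical column carrying the CDC event type (I/U/D).
  val onlyInserts = batch.filter(batch("op") =!= "I").isEmpty
  if (onlyInserts && !recovering) "insert" else "upsert"
}

def writeBatch(batch: DataFrame, recovering: Boolean): Unit =
  writeHudi(batch.drop("op"), chooseOperation(batch, recovering))
```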
3. Hudi core design
Next, let's look at Hudi's positioning. According to the community's latest vision, Hudi is a streaming data lake platform: it supports massive data updates, has a built-in table format and transactional storage, and provides a series of table services such as Clean, Archive, Compaction, and Clustering, as well as out-of-the-box data services, its own operation and maintenance tools, and metrics monitoring, giving it good operability.
This diagram is from Hudi's official website. Hudi provides the lake storage in this ecosystem. At the bottom it can sit on HDFS or on the object storage of various cloud vendors, as long as the storage is compatible with the Hadoop protocol. Upstream are the change-event streams flowing into the lake. It supports a variety of query engines, such as Presto, Spark, and cloud products; in addition, Hudi's incremental pull capability can be used with Spark, Hive, and Flink to build derived tables.
The Hudi architecture is very complete, and it positions itself as an incremental processing stack. Typical stream processing is row-oriented, handling data row by row with very low latency. However, row-oriented processing is not well suited to large-scale analytical scans, while batch processing may have to reprocess the full data set once a day, which is inefficient. Hudi introduces incremental processing: it only processes the data changed after a point in time, which is similar to stream processing and far more efficient than batch processing, while the data itself is stored in columnar formats in the lake, so scans are highly optimized.
Looking back at Hudi's history: in 2015 the community chairman published an article on incremental processing; in 2016 Hudi went into production at Uber, supporting all key database businesses; in 2017 it backed a 100 PB data lake at Uber; in 2018, with the popularity of cloud computing, it attracted users at home and abroad; in 2019 Uber donated it to the Apache Incubator; in 2020 it became a top-level Apache project in about a year, with adoption growing more than 10x; and Uber's latest information in 2021 shows Hudi supporting a 500 PB data lake, with many enhancements such as Spark SQL DML and Flink integration. Recently, ByteDance's recommendation team shared a Hudi-based data lake practice in which a single table exceeds 400 PB, total storage exceeds 1 EB, and daily growth is at the PB level.
After several years of development, many companies at home and abroad have adopted Hudi. Public clouds such as Huawei Cloud, Alibaba Cloud, Tencent Cloud, and AWS have all integrated Hudi, and Alibaba Cloud builds its Lakehouse on Hudi. ByteDance's migration of its entire data warehouse to the lake is also based on Hudi, and they have shared articles on their Flink + Hudi data lake practice with PB-level daily growth. Major internet companies such as Baidu and Kuaishou also use it. In banking and finance, users include ICBC, Agricultural Bank of China, Baidu Finance, and Baixin Bank; in gaming, 37 Interactive Entertainment, miHoYo, and 4399. Hudi is clearly used widely across many industries.
Hudi is positioned as a complete data lake platform. At the top, users write a variety of SQL, and Hudi, as the platform, provides the capabilities underneath. The next layer down is SQL and programming APIs. Below that is Hudi's kernel, including indexing, concurrency control, and table services; the community will also build caching on top of Lake Cache. The file formats are the open Parquet, ORC, and HFile storage formats, and the whole data lake can be built on various clouds.
Next come Hudi's key designs, which are very helpful for understanding Hudi. The first is the file format. The bottom layer is built on the FileSlice design: a file slice contains a base file and incremental log files. The base file is a Parquet or ORC file, and the incremental files are log files. When writing log files, Hudi encodes records into blocks: a batch of updates can be encoded into a data block and written to the file. The base file format is pluggable; besides Parquet, the latest 0.9.0 release already supports ORC, and HFile can also be used, for example for the metadata table.
A log file stores a sequence of data blocks, somewhat like a database redo log: every version of the data can be reconstructed from it. The base file and the log files are compacted into a new base file. Hudi offers both synchronous and asynchronous compaction, giving users very flexible choices: users who are not latency-sensitive can choose synchronous compaction and avoid running an extra asynchronous job, while users who want to protect the latency of the write path can run compaction asynchronously without affecting the main link.
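As a rough illustration, the sketch below contrasts the two compaction modes on a MERGE_ON_READ table via write options. The values and names are illustrative assumptions, not recommended settings.

```scala
// Illustrative sketch of synchronous (inline) vs asynchronous compaction.
val commonOpts = Map(
  "hoodie.table.name" -> "orders",
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.datasource.write.recordkey.field" -> "order_id",
  "hoodie.datasource.write.precombine.field" -> "ts"
)

// Synchronous (inline) compaction: the writer itself compacts after every
// N delta commits, so no extra job is needed but the write takes longer.
val inlineOpts = commonOpts ++ Map(
  "hoodie.compact.inline" -> "true",
  "hoodie.compact.inline.max.delta.commits" -> "5"
)

// Asynchronous compaction: keep the write path fast by disabling inline
// compaction and running compaction from a separate job (e.g. Hudi's
// standalone compactor or hudi-cli).
val asyncOpts = commonOpts ++ Map("hoodie.compact.inline" -> "false")

// batch.write.format("hudi").options(inlineOpts).mode("append").save("/data/lake/orders")
```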
On top of file slices, Hudi has the concept of a File Group: a file group contains multiple file slices, and the file slices form its different versions. Hudi provides a retention mechanism for how many versions of this metadata are kept, so the metadata size stays under control.
For updates, Hudi tries to append: if a log file already exists, it keeps appending to it during updates. This works very well on storage that supports append semantics, such as HDFS, but many cloud object stores do not support append — once written, data cannot be modified — so a new log file has to be written instead. File groups are isolated from each other, different logic can be applied to different file groups, and users can plug in their own algorithms, which is very flexible.
The FileGroup design brings real benefits. Suppose there are four file groups, each with a 100 MB base file that later receives 50 MB of updates. With the FileGroup design, each compaction only merges a base file with its own 50 MB of updates, so the cost is 4 × (100 MB + 50 MB) = 600 MB. Without this grouping, the same four 100 MB files are also updated, but each merge — say 25 MB of updates — has to be merged against all 400 MB, and the total cost is about 1200 MB. The FileGroup design therefore cuts the merge cost roughly in half.
Next is the table format, which describes how the table's files are organized in Hudi. First a root path is defined for the table, then data is written into partitions, using the same partition layout as Hive. There are also the table schema definition and schema changes: one approach records this metadata in files, another stores it in an external KV store; both have their pros and cons.
Hudi expresses the schema in Avro format, so its schema evolution capability is exactly that of Avro schema evolution: fields can be added, and upward-compatible changes are allowed — for example, int to long is compatible, but long to int is not.
The community also has a proposal for Full Schema Evolution, which will support adding a field, deleting a field, and renaming — that is, changing — a field.
Another key design is Hudi's index. Whenever a record is written to Hudi, a mapping from the record's primary key to a file group ID is maintained, so that the changed files can be located faster on update or delete.
In the figure on the right there is an orders table, which can be written into different partitions by date. Below it is a users table, which does not need partitioning, because its data volume is small and it changes infrequently, so a non-partitioned table works fine.
For partitioned tables and frequently changing tables, the global index built from Flink State is more efficient when writing with Flink. The index is pluggable, including a Bloom-filter index and a high-performance HBase index. In ByteDance's scenario, Bloom-filter lookups cannot keep up with a lake growing at the PB level, so they use the high-performance HBase index. Users can flexibly choose different index implementations according to their workload, supporting late-arriving updates and random updates at lower cost.
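To illustrate the pluggability, here is a small sketch of how the index implementation is chosen via configuration on the Spark write path. The ZooKeeper endpoints and table names are placeholders, not recommendations.

```scala
// Illustrative index choices on the Spark write path.
val bloomIndex  = Map("hoodie.index.type" -> "BLOOM")         // bloom-filter-based index
val globalBloom = Map("hoodie.index.type" -> "GLOBAL_BLOOM")  // global lookup across partitions
val hbaseIndex = Map(
  "hoodie.index.type" -> "HBASE",                             // high-throughput external index
  "hoodie.index.hbase.zkquorum" -> "zk1,zk2,zk3",
  "hoodie.index.hbase.zkport" -> "2181",
  "hoodie.index.hbase.table" -> "hudi_orders_index"
)
// batch.write.format("hudi").options(commonOpts ++ hbaseIndex)...
```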
Another design is concurrency control, which was introduced in 0.8. Hudi uses an optimistic locking mechanism for concurrent writes: conflicts between two writers are checked at commit time, and if they conflict the write fails. Table services such as Compaction and Clustering need no internal locks; Hudi has a coordination mechanism that avoids lock contention. For example, a compaction can first be scheduled as a point on the timeline, after which it is fully decoupled from the write path and can run asynchronously.
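The sketch below shows, under placeholder ZooKeeper settings, the kind of configuration that turns on optimistic concurrency control for multi-writer scenarios.

```scala
// Minimal sketch of enabling optimistic concurrency control (OCC) for
// multiple concurrent writers; lock provider and ZK settings are placeholders.
val occOpts = Map(
  "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes" -> "LAZY",
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
  "hoodie.write.lock.zookeeper.url" -> "zk1,zk2,zk3",
  "hoodie.write.lock.zookeeper.port" -> "2181",
  "hoodie.write.lock.zookeeper.lock_key" -> "orders",
  "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"
)
// Each concurrent writer adds these options; commits that touch the same file
// groups are detected as conflicting at commit time, and the later one fails.
```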
For example, on the left is the data ingestion link, ingesting every half hour, and on the right is an asynchronous delete job that also modifies the table; it is very likely to conflict with the writers, causing that job to keep failing and the platform to burn CPU for nothing. The community now has an improvement plan for this situation: detect conflicts between concurrent writers as early as possible and terminate early, reducing wasted resources.
Another design is the metadata table. Hudi was originally built and designed for HDFS and did not consider cloud storage scenarios much, so file listing on the cloud is very slow. In version 0.8 the community therefore introduced the Metadata Table, which is itself a Hudi table built inside the main table and can reuse Hudi's various table services. It stores the file names and file sizes under each partition, as well as per-column statistics for query optimization. The community is now also building a global index on top of the metadata table, recording which file ID each record belongs to, which reduces the cost of finding the files to update during an Upsert — something that is essential on the cloud.
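A tiny illustrative sketch of turning this on, applied alongside the other write options used above:

```scala
// Enable the metadata table so file listings come from it instead of slow
// listings against object storage (illustrative, not a tuning recommendation).
val metadataOpts = Map("hoodie.metadata.enable" -> "true")
// batch.write.format("hudi").options(commonOpts ++ metadataOpts)...
```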
4. Hudi's future plans
As for future plans, one example is building a Lakehouse on Pulsar and Hudi. This is a proposal from the CEO of StreamNative, who wants to build Pulsar's tiered storage on Hudi. In the Hudi community we have also done some work: we want to integrate a Pulsar Source into Hudi's built-in DeltaStreamer tool, and there is already a PR for it. We hope the two communities can be more closely connected. On the StreamNative side, engineers are working on the core of Pulsar's tiered storage.
The important optimizations and improvements of 0.9.0 were released a few days ago. First, Spark SQL is integrated, which greatly lowers the barrier for data analysts to use Hudi.
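Here is an illustrative sketch of what this Spark SQL support looks like, run through a Spark session (e.g. spark-shell or spark-sql) with the Hudi Spark bundle and SQL extensions enabled. The table name, columns, location, and the "updates" staging view are hypothetical.

```scala
// Sketch of Hudi 0.9.0 Spark SQL usage: DDL plus declarative upserts.
spark.sql(
  """CREATE TABLE IF NOT EXISTS orders (
    |  order_id BIGINT, user_id BIGINT, amount DECIMAL(10, 2), ts BIGINT, dt STRING
    |) USING hudi
    |PARTITIONED BY (dt)
    |TBLPROPERTIES (primaryKey = 'order_id', preCombineField = 'ts')
    |LOCATION '/data/lake/orders'""".stripMargin)

spark.sql("INSERT INTO orders VALUES (1, 100, 9.9, 1000, '2021-09-01')")

// Upserts expressed declaratively instead of through DataFrame write options;
// "updates" is assumed to be a staging table/view of changed rows.
spark.sql(
  """MERGE INTO orders AS t
    |USING updates AS s
    |ON t.order_id = s.order_id
    |WHEN MATCHED THEN UPDATE SET *
    |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
```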
Flink's integration with Hudi has been available since Hudi 0.7.0; after several iterations it has become very mature and is already in production at large companies such as ByteDance. The Blink team contributed a CDC-format integration that saves Update and Delete events directly into Hudi. There is also one-shot migration of stock data, which adds a batch import capability and reduces serialization and deserialization overhead.
In addition, some users feel that the metadata fields Hudi stores, such as \_hoodie\_commit\_time, which are derived from the data and add storage overhead, are unnecessary. Virtual keys are now supported, so the metadata fields are no longer stored with the data. The limitation this brings is that incremental ETL is no longer available: you cannot fetch the data changed after a point in time from Hudi.
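A small illustrative sketch of this option (applied with the other write options as before):

```scala
// Virtual keys: do not materialize the _hoodie_* metadata columns in storage,
// at the cost of losing incremental pulls on this table.
val virtualKeyOpts = Map("hoodie.populate.meta.fields" -> "false")
// batch.write.format("hudi").options(commonOpts ++ virtualKeyOpts)...
```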
Many users also hoped Hudi would support the ORC format, and the latest release does. This format layer is pluggable, so more formats can be plugged in flexibly in the future. Writes to and queries of the Metadata Table were also optimized: Spark SQL queries can skip file listing and obtain the full file list directly from the Metadata Table.
Looking further ahead, the community plans to upgrade the Spark integration from Data Source V1 to Data Source V2 — Hudi is currently based on V1 and cannot benefit from V2's performance optimizations. There is also Catalog integration, so that tables can be created, deleted, and updated through a catalog, with table metadata managed through the Spark Catalog.
The Flink module has full-time engineers from the Blink team in charge; a follow-up item is pushing the watermark of the streaming data down into the Hudi table.
Another item is integration with Kafka Connect Sink, so that Kafka data can be written to Hudi directly through Java clients.
On the kernel side, optimizations include a global record-level index based on the Metadata Table. Engineers from ByteDance are also contributing bucketed writes: when data is updated, the corresponding bucket can be found through the primary key, so only the Bloom filters of the Parquet files in that bucket need to be read, reducing the cost of locating updates.
There is also a smarter Clustering strategy — we have done part of this work internally — which can turn Clustering on dynamically based on the previous workload. Other items include secondary indexes built on the Metadata Table, Full Schema Evolution, and cross-table transactions.
The Hudi community is developing very fast now, and a large amount of code has been refactored, all for the sake of better community development. From 0.7.0 to 0.9.0, the Flink integration module was almost completely rewritten. If you are interested, you are welcome to join the community and build a better data lake platform together.