
This article describes how Flink Hudi uses stream computing to continuously optimize and evolve the original mini-batch-based incremental computing model. Users can write CDC data into Hudi storage in real time through Flink SQL, and the upcoming Hudi 0.9 release natively supports the CDC format. The main contents are:

  1. Background
  2. Incremental ETL
  3. Demo

1. Background

Near real time

Since 2016, the Apache Hudi community has been exploring near-real-time use cases built on Hudi's UPSERT capability [1]. With the batch processing model of MR/Spark, users can ingest data into HDFS/OSS at hourly granularity. In purely real-time scenarios, users can achieve end-to-end analysis with latency on the order of seconds (up to 5 minutes) by combining the stream computing engine Flink with KV/OLAP storage. However, a large number of use cases fall between seconds (5 minutes) and hours, which we call NEAR-REAL-TIME.


In practice, many workloads fall into the near-real-time category:

  1. Minute-level dashboards (large screens);
  2. Various BI analysis (OLAP);
  3. Minute-level feature extraction for machine learning.

Incremental computation

There is currently no settled solution for near-real-time scenarios.

  • Stream processing has low latency, but the SQL patterns it supports are relatively fixed, and query-side capabilities (indexes, ad hoc queries) are lacking;
  • Batch processing has rich capabilities, but its data latency is high.

So the Hudi community proposed an incremental computing model based on mini-batch:

incremental data set => incremental result, merged with the saved result => external storage

This model pulls the incremental data set (the data written between two commits) from the snapshots stored in the lake, computes an incremental result (such as a simple count) with a batch processing framework such as Spark or Hive, and then merges it into the stored result.

Core issues

The core issues that the incremental model needs to solve are:

  1. UPSERT capability: similar to KUDU and Hive ACID, Hudi provides minute-level update capability;
  2. Incremental consumption: Hudi provides incremental pull through the multiple snapshots stored in the lake.
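
The two capabilities above can be illustrated with a toy in-memory model. This is only a sketch: `ToyTable`, `upsert`, `snapshot` and `incremental_pull` are hypothetical names chosen for illustration and are not Hudi APIs, and commits are assumed to arrive in increasing time order.

```python
class ToyTable:
    """Toy stand-in for a lake table; NOT the Hudi API, names are hypothetical."""

    def __init__(self):
        # Each commit is (commit_time, {record_key: row}); assumed appended in time order.
        self.commits = []

    def upsert(self, commit_time, rows):
        """Record a commit that inserts new keys or overwrites existing ones."""
        self.commits.append((commit_time, dict(rows)))

    def snapshot(self):
        """Latest version of every key, i.e. all commits applied in order."""
        state = {}
        for _, rows in self.commits:
            state.update(rows)
        return state

    def incremental_pull(self, begin, end):
        """Rows changed by commits with begin < commit_time <= end (latest version per key)."""
        changed = {}
        for commit_time, rows in self.commits:
            if begin < commit_time <= end:
                changed.update(rows)
        return changed


table = ToyTable()
table.upsert(1, {"a": 1, "b": 2})
table.upsert(2, {"b": 20})        # update of an existing key
table.upsert(3, {"c": 3})         # insert of a new key

full = table.snapshot()
delta = table.incremental_pull(1, 3)
print(full)
print(delta)
```

A downstream job that has already processed commit 1 only needs `delta`, not the full snapshot.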

The mini-batch-based incremental computing model can reduce latency in some scenarios and save computing cost, but it has one big limitation: it places requirements on the SQL pattern. Because the computation runs in batches and a batch job does not maintain state, the computed metrics must be easy to merge. Simple count and sum work, but avg and count distinct still require pulling the full data set and recomputing.
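
The difference between mergeable and non-mergeable metrics can be seen in a small sketch. The numbers and names below are made up for illustration; the point is only that count and sum of an increment can be added to a saved result, while merging two avg or two count distinct values directly gives a wrong answer unless extra state is kept (which a stateless batch job does not do).

```python
# Saved result from earlier commits and the increment between two commits (made-up data).
saved_rows = [10, 20, 30]
increment_rows = [40, 50]

# count and sum merge by simple addition.
merged_count = len(saved_rows) + len(increment_rows)
merged_sum = sum(saved_rows) + sum(increment_rows)

# avg does not: averaging two averages ignores how many rows each side had.
saved_avg = sum(saved_rows) / len(saved_rows)
increment_avg = sum(increment_rows) / len(increment_rows)
naive_avg = (saved_avg + increment_avg) / 2
true_avg = merged_sum / merged_count

# count distinct does not either: keys seen on both sides are counted twice.
saved_users = ["u1", "u2"]
increment_users = ["u2", "u3"]
naive_distinct = len(set(saved_users)) + len(set(increment_users))
true_distinct = len(set(saved_users) | set(increment_users))

print(merged_count, merged_sum)        # 5 150
print(naive_avg, true_avg)             # 32.5 30.0
print(naive_distinct, true_distinct)   # 4 3
```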

With the spread of stream computing and real-time data warehouses, the Hudi community is actively embracing the change. Through stream computing, the original mini-batch-based incremental computing model is being continuously optimized and evolved: version 0.7 introduced streaming ingestion into the lake, and version 0.9 adds native support for the CDC format.

2. Incremental ETL

Ingesting DB data into the lake

As CDC technology has matured, CDC tools such as Debezium have become increasingly popular, and the Hudi community has integrated both streaming write and streaming read capabilities. Users can write CDC data into Hudi storage in real time through Flink SQL:


  • Users can import DB data directly into Hudi through the Flink CDC connector;
  • Alternatively, they can first write CDC data to Kafka and then import it into Hudi through the Kafka connector.

The second approach offers better fault tolerance and scalability.

Data Lake CDC

In the upcoming version 0.9, Hudi natively supports the CDC format, so every change to a record can be retained. This makes the integration between Hudi and stream computing systems more complete, because CDC data can now be read as a stream [2].


All change messages of the source CDC stream are retained after entering the lake and are available for streaming consumption. Flink's stateful computation accumulates results (state) in real time and writes the resulting changes to Hudi lake storage through streaming writes. A downstream Flink job then stream-reads the changelog stored in that Hudi table to perform the next level of stateful computation. The result is a near-real-time end-to-end ETL pipeline.


This architecture reduces end-to-end ETL latency to the minute level, and the storage of each layer can be compacted into a columnar format (Parquet, ORC) to provide OLAP analysis capabilities. Because the data lake is open, the compacted format can be read by a variety of query engines: Flink, Spark, Presto, Hive, and so on.

A Hudi data lake table has two forms:

  • Table form: queries return the latest snapshot, served from an efficient columnar storage format;
  • Stream form: changes are consumed as a stream; you can specify any point and stream-read the changelog from there onward.
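
The relationship between the two forms can be sketched outside of Flink and Hudi. In the sketch below the changelog is a plain Python list of `(op, id, ts)` tuples using the same `+I`/`-U`/`+U`/`-D` markers that appear later in the demo; `table_form` and `stream_form` are hypothetical helper names, and the list index stands in for the "point" from which streaming starts.

```python
# A few changelog entries shaped like the demo data: (op, id, ts).
changelog = [
    ("+I", 110, 12000),
    ("+I", 111, 13000),
    ("-U", 110, 12000),
    ("+U", 110, 14000),
    ("-U", 111, 13000),
    ("+U", 111, 15000),
    ("-D", 111, 16000),
]


def table_form(log):
    """Fold the whole changelog into the latest snapshot, keyed by id."""
    snapshot = {}
    for op, key, ts in log:
        if op in ("+I", "+U"):
            snapshot[key] = ts          # insert or new image of an update
        elif op == "-D":
            snapshot.pop(key, None)     # delete removes the row
        # "-U" carries the old image; the "+U" that follows overwrites it
    return snapshot


def stream_form(log, start):
    """Return every change from position `start` onward, without merging."""
    return log[start:]


latest = table_form(changelog)
tail = stream_form(changelog, 4)
print(latest)
print(tail)
```

The table form answers "what does the data look like now", while the stream form preserves every intermediate change for downstream stateful computation.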

3. Demo

We use a demo to demonstrate the two forms of the Hudi table.

Environment preparation

  • Flink SQL Client
  • hudi-flink-bundle jar built from the Hudi master branch
  • Flink 1.13.1

Prepare a CDC data file in debezium-json format in advance:

{"before":null,"after":{"id":101,"ts":1000,"name":"scooter","description":"Small 2-wheel scooter","weight":3.140000104904175},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606100,"transaction":null}
{"before":null,"after":{"id":102,"ts":2000,"name":"car battery","description":"12V car battery","weight":8.100000381469727},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":103,"ts":3000,"name":"12-pack drill bits","description":"12-pack of drill bits with sizes ranging from #40 to #3","weight":0.800000011920929},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":104,"ts":4000,"name":"hammer","description":"12oz carpenter's hammer","weight":0.75},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":105,"ts":5000,"name":"hammer","description":"14oz carpenter's hammer","weight":0.875},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":106,"ts":6000,"name":"hammer","description":"16oz carpenter's hammer","weight":1},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":107,"ts":7000,"name":"rocks","description":"box of assorted rocks","weight":5.300000190734863},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":108,"ts":8000,"name":"jacket","description":"water resistent black wind breaker","weight":0.10000000149011612},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":null,"after":{"id":109,"ts":9000,"name":"spare tire","description":"24 inch spare tire","weight":22.200000762939453},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":0,"snapshot":"true","db":"inventory","table":"products","server_id":0,"gtid":null,"file":"mysql-bin.000003","pos":154,"row":0,"thread":null,"query":null},"op":"c","ts_ms":1589355606101,"transaction":null}
{"before":{"id":106,"ts":6000,"name":"hammer","description":"16oz carpenter's hammer","weight":1},"after":{"id":106,"ts":10000,"name":"hammer","description":"18oz carpenter hammer","weight":1},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589361987000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":362,"row":0,"thread":2,"query":null},"op":"u","ts_ms":1589361987936,"transaction":null}
{"before":{"id":107,"ts":7000,"name":"rocks","description":"box of assorted rocks","weight":5.300000190734863},"after":{"id":107,"ts":11000,"name":"rocks","description":"box of assorted rocks","weight":5.099999904632568},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589362099000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":717,"row":0,"thread":2,"query":null},"op":"u","ts_ms":1589362099505,"transaction":null}
{"before":null,"after":{"id":110,"ts":12000,"name":"jacket","description":"water resistent white wind breaker","weight":0.20000000298023224},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589362210000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":1068,"row":0,"thread":2,"query":null},"op":"c","ts_ms":1589362210230,"transaction":null}
{"before":null,"after":{"id":111,"ts":13000,"name":"scooter","description":"Big 2-wheel scooter ","weight":5.179999828338623},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589362243000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":1394,"row":0,"thread":2,"query":null},"op":"c","ts_ms":1589362243428,"transaction":null}
{"before":{"id":110,"ts":12000,"name":"jacket","description":"water resistent white wind breaker","weight":0.20000000298023224},"after":{"id":110,"ts":14000,"name":"jacket","description":"new water resistent white wind breaker","weight":0.5},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589362293000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":1707,"row":0,"thread":2,"query":null},"op":"u","ts_ms":1589362293539,"transaction":null}
{"before":{"id":111,"ts":13000,"name":"scooter","description":"Big 2-wheel scooter ","weight":5.179999828338623},"after":{"id":111,"ts":15000,"name":"scooter","description":"Big 2-wheel scooter ","weight":5.170000076293945},"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589362330000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":2090,"row":0,"thread":2,"query":null},"op":"u","ts_ms":1589362330904,"transaction":null}
{"before":{"id":111,"ts":16000,"name":"scooter","description":"Big 2-wheel scooter ","weight":5.170000076293945},"after":null,"source":{"version":"1.1.1.Final","connector":"mysql","name":"dbserver1","ts_ms":1589362344000,"snapshot":"false","db":"inventory","table":"products","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":2443,"row":0,"thread":2,"query":null},"op":"d","ts_ms":1589362344455,"transaction":null}

Create a table in the Flink SQL Client to read the CDC data file:

Flink SQL> CREATE TABLE debezium_source(
>   id INT NOT NULL,
>   ts BIGINT,
>   name STRING,
>   description STRING,
>   weight DOUBLE
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = '/Users/chenyuzhao/workspace/hudi-demo/source.data',
>   'format' = 'debezium-json'
> );
[INFO] Execute statement succeed.

Run a SELECT to observe the results. There are 20 records in total, with several UPDATEs in the middle, and the last message is a DELETE:

Flink SQL> select * from debezium_source;
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
| op |          id |                   ts |                           name |                    description |                         weight |
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
| +I |         101 |                 1000 |                        scooter |          Small 2-wheel scooter |              3.140000104904175 |
| +I |         102 |                 2000 |                    car battery |                12V car battery |              8.100000381469727 |
| +I |         103 |                 3000 |             12-pack drill bits | 12-pack of drill bits with ... |              0.800000011920929 |
| +I |         104 |                 4000 |                         hammer |        12oz carpenter's hammer |                           0.75 |
| +I |         105 |                 5000 |                         hammer |        14oz carpenter's hammer |                          0.875 |
| +I |         106 |                 6000 |                         hammer |        16oz carpenter's hammer |                            1.0 |
| +I |         107 |                 7000 |                          rocks |          box of assorted rocks |              5.300000190734863 |
| +I |         108 |                 8000 |                         jacket | water resistent black wind ... |            0.10000000149011612 |
| +I |         109 |                 9000 |                     spare tire |             24 inch spare tire |             22.200000762939453 |
| -U |         106 |                 6000 |                         hammer |        16oz carpenter's hammer |                            1.0 |
| +U |         106 |                10000 |                         hammer |          18oz carpenter hammer |                            1.0 |
| -U |         107 |                 7000 |                          rocks |          box of assorted rocks |              5.300000190734863 |
| +U |         107 |                11000 |                          rocks |          box of assorted rocks |              5.099999904632568 |
| +I |         110 |                12000 |                         jacket | water resistent white wind ... |            0.20000000298023224 |
| +I |         111 |                13000 |                        scooter |           Big 2-wheel scooter  |              5.179999828338623 |
| -U |         110 |                12000 |                         jacket | water resistent white wind ... |            0.20000000298023224 |
| +U |         110 |                14000 |                         jacket | new water resistent white w... |                            0.5 |
| -U |         111 |                13000 |                        scooter |           Big 2-wheel scooter  |              5.179999828338623 |
| +U |         111 |                15000 |                        scooter |           Big 2-wheel scooter  |              5.170000076293945 |
| -D |         111 |                16000 |                        scooter |           Big 2-wheel scooter  |              5.170000076293945 |
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
Received a total of 20 rows

Create the Hudi table. Here the table type is set to MERGE_ON_READ and changelog mode is enabled through the changelog.enabled option:

Flink SQL> CREATE TABLE hoodie_table(
>   id INT NOT NULL PRIMARY KEY NOT ENFORCED,
>   ts BIGINT,
>   name STRING,
>   description STRING,
>   weight DOUBLE
> ) WITH (
>   'connector' = 'hudi',
>   'path' = '/Users/chenyuzhao/workspace/hudi-demo/t1',
>   'table.type' = 'MERGE_ON_READ',
>   'changelog.enabled' = 'true',
>   'compaction.async.enabled' = 'false'
> );
[INFO] Execute statement succeed.

Query

Import the data into Hudi with an INSERT statement (for example, INSERT INTO hoodie_table SELECT * FROM debezium_source), then enable streaming read mode and run a query to observe the results:

Flink SQL> select * from hoodie_table/*+ OPTIONS('read.streaming.enabled'='true')*/;
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
| op |          id |                   ts |                           name |                    description |                         weight |
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
| +I |         101 |                 1000 |                        scooter |          Small 2-wheel scooter |              3.140000104904175 |
| +I |         102 |                 2000 |                    car battery |                12V car battery |              8.100000381469727 |
| +I |         103 |                 3000 |             12-pack drill bits | 12-pack of drill bits with ... |              0.800000011920929 |
| +I |         104 |                 4000 |                         hammer |        12oz carpenter's hammer |                           0.75 |
| +I |         105 |                 5000 |                         hammer |        14oz carpenter's hammer |                          0.875 |
| +I |         106 |                 6000 |                         hammer |        16oz carpenter's hammer |                            1.0 |
| +I |         107 |                 7000 |                          rocks |          box of assorted rocks |              5.300000190734863 |
| +I |         108 |                 8000 |                         jacket | water resistent black wind ... |            0.10000000149011612 |
| +I |         109 |                 9000 |                     spare tire |             24 inch spare tire |             22.200000762939453 |
| -U |         106 |                 6000 |                         hammer |        16oz carpenter's hammer |                            1.0 |
| +U |         106 |                10000 |                         hammer |          18oz carpenter hammer |                            1.0 |
| -U |         107 |                 7000 |                          rocks |          box of assorted rocks |              5.300000190734863 |
| +U |         107 |                11000 |                          rocks |          box of assorted rocks |              5.099999904632568 |
| +I |         110 |                12000 |                         jacket | water resistent white wind ... |            0.20000000298023224 |
| +I |         111 |                13000 |                        scooter |           Big 2-wheel scooter  |              5.179999828338623 |
| -U |         110 |                12000 |                         jacket | water resistent white wind ... |            0.20000000298023224 |
| +U |         110 |                14000 |                         jacket | new water resistent white w... |                            0.5 |
| -U |         111 |                13000 |                        scooter |           Big 2-wheel scooter  |              5.179999828338623 |
| +U |         111 |                15000 |                        scooter |           Big 2-wheel scooter  |              5.170000076293945 |
| -D |         111 |                16000 |                        scooter |           Big 2-wheel scooter  |              5.170000076293945 |

You can see that Hudi keeps the change record of each row, including the operation type of each change. Here we enabled Flink's table hints feature (the table.dynamic-table-options.enabled option) so that table parameters can be set dynamically per query.

Next, use batch read mode: run the query and observe the output. The intermediate changes have been merged:

Flink SQL> select * from hoodie_table;
2021-08-20 20:51:25,052 INFO  org.apache.hadoop.conf.Configuration.deprecation             [] - mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
| op |          id |                   ts |                           name |                    description |                         weight |
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
| +U |         110 |                14000 |                         jacket | new water resistent white w... |                            0.5 |
| +I |         101 |                 1000 |                        scooter |          Small 2-wheel scooter |              3.140000104904175 |
| +I |         102 |                 2000 |                    car battery |                12V car battery |              8.100000381469727 |
| +I |         103 |                 3000 |             12-pack drill bits | 12-pack of drill bits with ... |              0.800000011920929 |
| +I |         104 |                 4000 |                         hammer |        12oz carpenter's hammer |                           0.75 |
| +I |         105 |                 5000 |                         hammer |        14oz carpenter's hammer |                          0.875 |
| +U |         106 |                10000 |                         hammer |          18oz carpenter hammer |                            1.0 |
| +U |         107 |                11000 |                          rocks |          box of assorted rocks |              5.099999904632568 |
| +I |         108 |                 8000 |                         jacket | water resistent black wind ... |            0.10000000149011612 |
| +I |         109 |                 9000 |                     spare tire |             24 inch spare tire |             22.200000762939453 |
+----+-------------+----------------------+--------------------------------+--------------------------------+--------------------------------+
Received a total of 10 rows

Aggregation

Compute count(*) in batch read mode:

Flink SQL> select count (*) from hoodie_table;
+----+----------------------+
| op |               EXPR$0 |
+----+----------------------+
| +I |                    1 |
| -U |                    1 |
| +U |                    2 |
| -U |                    2 |
| +U |                    3 |
| -U |                    3 |
| +U |                    4 |
| -U |                    4 |
| +U |                    5 |
| -U |                    5 |
| +U |                    6 |
| -U |                    6 |
| +U |                    7 |
| -U |                    7 |
| +U |                    8 |
| -U |                    8 |
| +U |                    9 |
| -U |                    9 |
| +U |                   10 |
+----+----------------------+
Received a total of 19 rows

Compute count(*) in streaming read mode:

Flink SQL> select count (*) from hoodie_table/*+OPTIONS('read.streaming.enabled'='true')*/;
+----+----------------------+
| op |               EXPR$0 |
+----+----------------------+
| +I |                    1 |
| -U |                    1 |
| +U |                    2 |
| -U |                    2 |
| +U |                    3 |
| -U |                    3 |
| +U |                    4 |
| -U |                    4 |
| +U |                    5 |
| -U |                    5 |
| +U |                    6 |
| -U |                    6 |
| +U |                    7 |
| -U |                    7 |
| +U |                    8 |
| -U |                    8 |
| +U |                    9 |
| -U |                    9 |
| +U |                    8 |
| -U |                    8 |
| +U |                    9 |
| -U |                    9 |
| +U |                    8 |
| -U |                    8 |
| +U |                    9 |
| -U |                    9 |
| +U |                   10 |
| -U |                   10 |
| +U |                   11 |
| -U |                   11 |
| +U |                   10 |
| -U |                   10 |
| +U |                   11 |
| -U |                   11 |
| +U |                   10 |
| -U |                   10 |
| +U |                   11 |
| -U |                   11 |
| +U |                   10 |

The results in batch and streaming modes are consistent: in both cases the last value emitted is 10.
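
As a rough cross-check outside of Flink, the sketch below replays the 20 changelog rows from the demo (operation and id only) and compares a running count over the changelog with a count over the merged snapshot. It is plain Python, not Flink's aggregation operator; `streaming_counts` and `batch_count` are hypothetical helper names.

```python
# (op, id) pairs in the order shown by the streaming read of hoodie_table.
changelog = (
    [("+I", i) for i in range(101, 110)]
    + [("-U", 106), ("+U", 106), ("-U", 107), ("+U", 107)]
    + [("+I", 110), ("+I", 111)]
    + [("-U", 110), ("+U", 110), ("-U", 111), ("+U", 111), ("-D", 111)]
)


def streaming_counts(log):
    """Running count(*): additions (+I, +U) add one, retractions (-U, -D) subtract one."""
    count = 0
    emitted = []
    for op, _ in log:
        count += 1 if op in ("+I", "+U") else -1
        emitted.append(count)
    return emitted


def batch_count(log):
    """count(*) over the merged snapshot: number of ids still alive at the end."""
    live = set()
    for op, key in log:
        if op in ("+I", "+U"):
            live.add(key)
        else:
            live.discard(key)
    return len(live)


running = streaming_counts(changelog)
final_batch = batch_count(changelog)
print(running)
print(running[-1], final_batch)  # 10 10
```

The running values match the `+I`/`+U` values in the streaming output above (1 through 9, then 8, 9, 8, 9, 10, 11, 10, 11, 10, 11, 10), and the final value equals the row count of the batch snapshot.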

References

[1] https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/

[2] https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform

