Author | Cai Fangfang
Interview Guest | Wang Feng (Mo Wen)
Under the "Apache Flink" entry on Wikipedia, there is a description: "Flink does not provide its own data storage system, but provides data sources and data sources for systems such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elasticsearch. receiver", and soon, the first half of this sentence may no longer apply.
Full video: https://developer.aliyun.com/special/ffa2021/live
At the beginning of 2021, in the annual technology trend outlook planned by the InfoQ editorial department, we noted that the big data field would accelerate its embrace of "convergence" (or "integration") as a new direction of evolution. The essence is to reduce the technical complexity and cost of big data analysis while meeting higher requirements for performance and ease of use. Today, we see the popular stream processing engine Apache Flink (hereinafter referred to as Flink) taking another step along this trend.
On the morning of January 8, Flink Forward Asia 2021 kicked off as an online conference. This is the fourth year that Flink Forward Asia (hereinafter referred to as FFA) has been held in China, and the seventh year since Flink became a top-level project of the Apache Software Foundation. As the real-time wave has developed and deepened, Flink has gradually evolved into a leading player and the de facto standard for stream processing. Looking back on its evolution, Flink has continuously optimized its core stream computing capabilities and raised the bar for stream processing across the industry. But beyond this, Flink's long-term development needs a new breakthrough.
In his Flink Forward Asia 2021 keynote, Wang Feng (known in the community as Mo Wen), initiator of the Apache Flink Chinese community and head of Alibaba's open source big data platform, focused on the latest progress in the evolution and adoption of Flink's stream-batch integrated architecture, and proposed Flink's next development direction: Streaming Warehouse (Streamhouse for short). As the keynote title "Flink Next, Beyond Stream Processing" suggests, Flink will move from stream processing toward the streaming warehouse, covering larger scenarios and helping developers solve more problems. Achieving the streaming data warehouse means the Flink community needs to expand into data storage suited to stream-batch integration. This is Flink's technical innovation for the year; the related community work started in October and will be a key direction for the Flink community in the coming year.
So how should we understand the streaming data warehouse? What problems of existing data architectures does it aim to solve? Why did Flink choose this direction? What will its implementation path look like? With these questions, InfoQ conducted an exclusive interview with Mo Wen to learn more about the thinking behind the streaming data warehouse.
In recent years, Flink has repeatedly emphasized stream-batch integration: using one set of APIs and one development paradigm to implement both stream computing and batch computing over big data, thereby ensuring consistency of processing logic and results. Mo Wen said that stream-batch integration is more of a technical concept and capability; by itself it does not solve any user problem. Only when it lands in real business scenarios can it demonstrate its value in development and operational efficiency. The streaming data warehouse can be understood as the concrete landing solution under the general direction of stream-batch integration.
Two application scenarios of stream-batch integration
At last year's FFA, we saw Flink's stream-batch integration applied in Tmall's Double 11: the first time Alibaba truly deployed stream-batch integration at large scale on core business data. A year has passed, and Flink's stream-batch integration has made new progress in both technical architecture evolution and production applications.
At the technical evolution level, the transformation of Flink's stream-batch integrated API and architecture has been completed. On top of the existing stream-batch unified SQL, the DataStream and DataSet APIs have been further integrated to provide complete stream-batch semantics at the Java API level, so a single codebase can handle both stream processing and batch processing.
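As a minimal sketch of what this unification looks like in practice (class name and sample data invented for illustration), the same DataStream pipeline below runs as either a streaming or a batch job depending on a single execution-mode setting:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // One switch decides whether this pipeline runs as a batch or streaming job;
        // AUTOMATIC picks batch execution when all sources are bounded.
        env.setRuntimeExecutionMode(RuntimeExecutionMode.AUTOMATIC);

        env.fromElements("flink", "batch", "flink", "stream")
           .map(w -> Tuple2.of(w, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("unified word count");
    }
}
```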
Flink 1.14, released in October, already supports mixing bounded and unbounded streams in the same application: Flink can now checkpoint applications that are only partially running, that is, where some operators have already reached the end of their bounded input streams. In addition, when a bounded stream reaches its end, Flink ensures that all computed results are successfully committed to the sink.
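For reference, a minimal sketch of switching this behavior on in Flink 1.14: the feature (FLIP-147) sits behind a configuration flag and is off by default, and the checkpoint interval below is arbitrary.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointsAfterFinish {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Flink 1.14 (FLIP-147): keep taking checkpoints even after some operators
        // have reached the end of their bounded inputs; disabled by default in 1.14.
        conf.setString("execution.checkpointing.checkpoints-after-tasks-finish.enabled", "true");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000); // checkpoint every 60 s
        // ... build a job that mixes bounded and unbounded sources ...
    }
}
```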
Batch execution mode also now supports mixing the DataStream API and the SQL/Table API in the same application (previously only one of the two could be used).
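A minimal sketch of such mixing under batch execution mode (sample data and names invented):

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class MixedBatchJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeExecutionMode(RuntimeExecutionMode.BATCH);
        // Since 1.14 the StreamTableEnvironment also works under batch execution mode.
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        DataStream<String> words = env.fromElements("flink", "sql", "flink"); // DataStream API
        Table counts = tEnv.fromDataStream(words)                             // switch to Table API
                .groupBy($("f0"))
                .select($("f0").as("word"), $("f0").count().as("cnt"));
        DataStream<Row> result = tEnv.toDataStream(counts);                   // and back again
        result.print();
        env.execute("mixed batch job");
    }
}
```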
In addition, Flink introduced unified Source and Sink APIs and began building the connector ecosystem around them. The new Hybrid Source can transition across multiple storage systems, enabling jobs that, for example, read historical data from Amazon S3 and then switch seamlessly to Apache Kafka.
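A sketch of the Hybrid Source builder API for that scenario (bucket path, broker address, and topic are hypothetical):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded part: historical data in files (e.g. on S3).
        FileSource<String> history = FileSource
                .forRecordStreamFormat(new TextLineFormat(), new Path("s3://my-bucket/history/"))
                .build();
        // Unbounded part: live events from Kafka.
        KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Read the files first, then switch over to Kafka seamlessly.
        HybridSource<String> hybrid = HybridSource.builder(history).addSource(live).build();
        env.fromSource(hybrid, WatermarkStrategy.noWatermarks(), "hybrid-source").print();
        env.execute("hybrid source job");
    }
}
```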
At the application level, there are also two particularly important scenarios.
The first is full-plus-incremental data integration based on Flink CDC.
Data integration and synchronization between different data sources are hard requirements for many teams, but traditional solutions are often too complex and insufficiently timely. Traditional data integration usually relies on two separate technology stacks, one for offline integration and one for real-time integration, involving many synchronization tools such as Sqoop and DataX. These tools handle either full or incremental data only, so developers must manage the full-to-incremental switchover themselves, which makes coordination complicated.
Based on Flink's stream-batch integration capability and Flink CDC, a single SQL statement is enough: the full historical data is synchronized first, then Flink automatically resumes transferring incremental data from the breakpoint, achieving one-stop data integration. The whole process needs no user judgment or intervention; Flink switches between batch and stream automatically while ensuring data consistency.
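A sketch of what that single-statement synchronization looks like with the MySQL CDC connector (schema, host, and credentials invented; the print connector stands in for a real downstream system):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcOneStopSync {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // CDC source: the connector snapshots the full table first, then resumes
        // from the binlog for incremental changes, with no manual switchover.
        tEnv.executeSql(
            "CREATE TABLE orders_src (" +
            "  order_id BIGINT, customer STRING, amount DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'mysql-host', 'port' = '3306'," +
            "  'username' = 'flink', 'password' = 'secret'," +
            "  'database-name' = 'shop', 'table-name' = 'orders')");

        // Sink table; a real deployment would point at a warehouse or lake table.
        tEnv.executeSql(
            "CREATE TABLE orders_sink (" +
            "  order_id BIGINT, customer STRING, amount DECIMAL(10, 2)" +
            ") WITH ('connector' = 'print')");

        // The one statement that covers both the full and the incremental phase.
        tEnv.executeSql("INSERT INTO orders_sink SELECT * FROM orders_src");
    }
}
```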
As an independent open source project, Flink CDC Connectors has developed rapidly since it was open sourced in July last year, averaging one release every two months. Flink CDC is now at version 2.1 and has been adapted to many mainstream databases, such as MySQL, PostgreSQL, MongoDB, and Oracle; integration with more databases, such as TiDB and DB2, is in progress. More and more companies are using Flink CDC in their own business scenarios; XTransfer, which InfoQ interviewed not long ago, is one of them.
The second application scenario is the core data warehouse scenario in the big data field.
The current mainstream architecture for a combined real-time and offline data warehouse usually looks like the figure below.
In most scenarios, Flink + Kafka processes the real-time data stream, i.e. the real-time data warehouse part, and the final analysis results are written to an online serving layer for display or further analysis. Meanwhile, an offline data warehouse runs asynchronously in the background to supplement the real-time data, regularly executing large-scale batch or even full-scale analysis every day, and performing periodic corrections of historical data.
However, this classic architecture has some obvious problems. First, the real-time link and the offline link use different technology stacks and require two sets of APIs, so two development processes are needed, which increases development cost. Second, with different real-time and offline stacks, consistency of data semantics cannot be guaranteed. Third, the intermediate queue data on the real-time link is not amenable to analysis: if users want to analyze a detail layer in the real-time link, it is actually very inconvenient. Many users today first export the data from that detail layer, for example to Hive for offline analysis, but this greatly reduces timeliness; or, to speed up queries, they import the data into another OLAP engine, which adds system complexity and makes data consistency hard to guarantee.
Flink's stream-batch integration concept can be fully applied in this scenario. In Mo Wen's view, Flink can take the industry's mainstream data warehouse architecture to the next level and achieve true end-to-end, full-link real-time analysis: when data changes at the source, the change is captured and analyzed layer by layer, so that all data flows in real time and all flowing data can be queried in real time. With Flink's complete stream-batch integration capability, the same set of APIs can also support flexible offline analysis. In this way, real-time analysis, offline analysis, and interactive and short-query analysis can be unified into one complete solution, an ideal "Streaming Warehouse".
Understanding Streaming Data Warehouses
A more accurate way to put Streaming Warehouse is "make the data warehouse streaming": let the data of the entire warehouse flow in real time, in a pure streaming way rather than in mini-batches. The goal is an end-to-end real-time pure streaming service, with one set of APIs to analyze all the data in flow. When source data changes, for example when the log of an online service or the binlog of a database is captured, the data is analyzed according to the predefined query or processing logic; the results land in one layer of the warehouse, flow from that layer to the next, and so on through all the warehouse layers, eventually flowing into an online system where users see the fully real-time flowing effect of the whole warehouse. In this process the data is active while the queries are passive: analysis is driven by changes in the data. At the same time, in the vertical direction, users can execute queries against each detail layer and obtain results in real time. Offline analysis scenarios remain compatible, with the same API, achieving true unification.
At present, the industry has no mature solution for such an end-to-end, fully streaming link. Although there are pure streaming solutions and pure interactive query solutions, users must combine the two themselves, which inevitably adds system complexity; adding an offline data warehouse solution on top makes the complexity problem even greater. What the streaming data warehouse must do is deliver high timeliness without further increasing system complexity, so that the whole architecture stays simple for developers and operators.
Of course, the streaming data warehouse is the end state; to get there, Flink needs matching stream-batch unified storage. Flink itself has built-in distributed RocksDB as state storage, but that only solves the problem of storing the state of flowing data inside a single job. The streaming data warehouse needs table storage shared between computing tasks: one task writes data into it, a second task reads from it in real time, and a third task runs user queries against it for analysis. So Flink needs to extend a storage that matches its own concept, moving outward from state storage. To this end, the Flink community proposed a new Dynamic Table Storage, a storage solution with stream-table duality.
Stream-batch integrated storage: Flink Dynamic Table
Flink Dynamic Table (see FLIP-188 for the community discussion) can be understood as stream-batch unified storage that connects seamlessly to Flink SQL. Originally Flink could only read and write external tables such as Kafka and HBase; now, using the same Flink SQL syntax as for source and target tables, you can create a Dynamic Table. All layers of the streaming data warehouse can be placed in Dynamic Tables, and the entire warehouse can be connected in real time through Flink SQL: you can query and analyze the data in different detail layers of Dynamic Tables in real time, and also run batch ETL between layers.
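The DDL is still under discussion in FLIP-188, so the following is only a sketch of the idea as currently proposed: a table created without a 'connector' option is backed by Flink's own dynamic table storage, and each warehouse layer becomes an INSERT between such tables (table names and the upstream ods_orders table are invented).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Sketch only: per the FLIP-188 proposal, omitting the 'connector' option
        // stores the table in Flink's built-in dynamic table storage.
        tEnv.executeSql(
            "CREATE TABLE dwd_orders (" +
            "  order_id BIGINT, customer STRING, amount DECIMAL(10, 2)," +
            "  PRIMARY KEY (order_id) NOT ENFORCED)");

        // One warehouse layer is then a continuously running INSERT that reads a
        // hypothetical ODS-layer dynamic table and feeds the next layer.
        tEnv.executeSql(
            "INSERT INTO dwd_orders SELECT order_id, customer, amount FROM ods_orders");
    }
}
```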
In terms of data structure, a Dynamic Table has two core storage components: the File Store and the Log Store. As the names suggest, the File Store stores the table in file form, adopting the classic LSM architecture to support streaming updates, deletes, and inserts; it uses an open columnar storage structure with optimizations such as compression, corresponds to Flink SQL's batch mode, and supports full batch reads. The Log Store stores the table's operation records as an immutable sequence; it corresponds to Flink SQL's stream mode, so through Flink SQL you can subscribe to a Dynamic Table's incremental changes for real-time analysis. Pluggable implementations are currently supported.
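The practical consequence of this duality is that the same query over the same table can run in either mode; a sketch, assuming the dwd_orders dynamic table above is registered in a catalog that both environments share:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.TableResult;

public class DualModeReads {
    public static void main(String[] args) {
        // Stream mode subscribes to the Log Store: the query keeps running and
        // continuously emits updated aggregates as changes arrive.
        TableEnvironment streaming = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        TableResult live = streaming.executeSql(
            "SELECT customer, SUM(amount) AS total FROM dwd_orders GROUP BY customer");
        // iterate live.collect() to consume the changelog

        // Batch mode scans the File Store: the very same query runs once over a
        // full snapshot and terminates.
        TableEnvironment batch = TableEnvironment.create(EnvironmentSettings.inBatchMode());
        batch.executeSql(
            "SELECT customer, SUM(amount) AS total FROM dwd_orders GROUP BY customer").print();
    }
}
```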
Writes to the File Store are encapsulated in a built-in sink, shielding users from the complexity of writing, while Flink's checkpoint mechanism and exactly-once semantics ensure data consistency.
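The consistency guarantee mentioned here rests on Flink's regular checkpointing mechanism; for reference, this is how exactly-once checkpointing is enabled on the Java side (the interval is arbitrary):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints with exactly-once semantics: on failure, sinks roll
        // back to the last completed checkpoint, so no duplicates become visible.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
    }
}
```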
The first phase of the Dynamic Table implementation is complete, and the community is discussing the direction further. According to the community's plan, the future end state is to turn Dynamic Table into a service, forming a full Dynamic Table Service and realizing fully real-time stream-batch unified storage. The community is also discussing operating Dynamic Table as an independent sub-project of Flink, and it is not ruled out that it will later develop into a general-purpose stream-batch unified storage project. Finally, using Flink CDC, Flink SQL, and Flink Dynamic Table together, a complete streaming data warehouse can be built, delivering a unified real-time and offline experience. For the whole process and effect, see the demo video below.
https://www.bilibili.com/video/BV13P4y1J7PD/
Although the whole process now runs end to end in an initial form, the community still needs to gradually improve the quality of the implementation to achieve fully real-time links with sufficient stability, including Flink SQL optimization for interactive OLAP scenarios, dynamic table storage performance and consistency optimization, and building out dynamic table service capabilities. The streaming data warehouse direction has only just launched, with a preliminary attempt in place. In Mo Wen's view, the design is sound, but a series of engineering problems remain to be solved. It is like designing an advanced-process chip or an ARM architecture: many people can produce a design, but actually manufacturing the chip with acceptable yield is difficult. The streaming data warehouse will be Flink's most important direction in big data analysis scenarios, and the community will invest heavily in it.
Flink goes beyond computation
Under the general trend of big data going real-time, Flink is not limited to doing just one thing; it can do more.
The industry originally positioned Flink more as a stream processor or stream computing engine, but that is not the whole picture. Mo Wen said that Flink is not just computing: narrowly speaking Flink is computing, but broadly speaking Flink already has storage. "Flink broke out of the encirclement with stream computing, relying on its stateful storage, which was its bigger advantage over Storm."
Now Flink hopes to go a step further and cover a wider range of real-time problems, and its original storage is no longer enough. But external storage systems and other engines are not fully aligned with Flink's goals and characteristics, and cannot integrate well with Flink. For example, Flink integrates with data lakes including Hudi and Iceberg, supporting real-time lake ingestion and real-time incremental analysis of ingested data, but these scenarios still cannot fully exploit Flink's end-to-end real-time advantage, because data lake storage formats are essentially mini-batch, and Flink degenerates to mini-batch mode on top of them. This is not the architecture Flink most wants or is best suited to, so it is natural for Flink to develop a storage system that matches its stream-batch integration concept.
In Mo Wen's view, without the support of a storage system that matches its concept, a big data computing and analysis engine cannot provide a data analysis solution with the ultimate experience. This is similar to how any good algorithm needs a matching data structure in order to solve the problem with the best efficiency.
Why is Flink more suitable for the streaming data warehouse? This is determined by Flink's concept: streaming first, solving data processing problems with streaming at the core. Streaming is essential to making the whole warehouse's data flow in real time. Once the data flows, the stream-table duality of the accumulated data and Flink's stream-batch unified analysis capability make it possible to analyze the data at any point in the flow, whether second-level short queries or offline ETL analysis; Flink has the corresponding capability for each. Mo Wen said that the biggest limitation of Flink's stream-batch integration had been the lack of a matching storage data structure in the middle, which made the scenarios hard to land; once the storage and data structure are filled in, many chemical reactions of stream-batch integration will naturally occur.
Will Flink building its own data storage system affect existing storage projects in the big data ecosystem? Mo Wen explained that the community's new stream-batch unified storage is meant to better serve Flink's own stream-batch unified computing; it will keep the storage and data protocols, APIs, and SDKs open, and there are plans to develop the project independently in the future. In addition, Flink will continue to actively connect with mainstream storage projects in the industry and remain compatible and open toward the external ecosystem.
The boundaries between components of the big data ecosystem are becoming more and more blurred. Mo Wen believes the current trend is moving from single-component capabilities toward integrated solutions. "Everyone is actually following this trend. For example, you can see many database projects that were originally OLTP, then added OLAP, and finally called themselves HTAP. That is in fact a combination of row storage and column storage, supporting both serving and analysis, with the purpose of giving users a complete data analysis experience." He further added: "Many systems are now beginning to expand their boundaries, from real-time to offline or from offline to real-time, penetrating each other. Otherwise users would have to manually combine various technical components and face all kinds of complexity, with an ever-higher threshold. So the trend toward convergence and integration is very clear. In the end there is no right or wrong; the key is whether you can use a good integration approach to give users the best experience. Whoever does that wins the users. For a community to stay vital and develop sustainably, it is not enough to do only what you are best at; you must keep innovating and breaking boundaries based on user needs and scenarios, and most users' needs are not necessarily in the gap between 95 and 100 points of a single capability."
By Mo Wen's estimate, it will take about a year to form a relatively mature streaming data warehouse solution. For users already running Flink as a real-time computing engine, trying the new streaming data warehouse is a natural step, and the user interface is fully compatible with Flink SQL. A first preview is expected in the upcoming Flink 1.15 release, which Flink users can try out first. Mo Wen said the Flink-based streaming data warehouse has only just started; the technical solution needs further iteration and will take time to polish before it matures. He hopes more companies and developers will join the effort with their own needs, which is the value of an open source community.
Epilogue
Big data has been criticized for years for its sprawling open source ecosystem and highly complex architectures. Now the industry seems to have reached a degree of consensus: promote the evolution of data architectures toward simplicity through convergence and integration, although different enterprises describe this differently and take different implementation paths.
In Mo Wen's view, a flourishing open source ecosystem is normal, and each technical community has its own areas of expertise. But to truly solve business-scenario problems, a one-stop solution is still needed to give users a simple and easy-to-use experience. He therefore agrees that the general trend is toward convergence and integration, but the path is not unique: there may eventually be a dedicated system responsible for integrating all components, or each system may gradually evolve into an integrated one. Which possibility is the end state, perhaps only time can tell.
To get FFA 2021 video replays and presentation PDFs, follow the "Apache Flink" official account and reply with "FFA2021".