Introduction: The Apache Flink Table Store project is under development, and you are welcome to try it and discuss it.
Author: Jingsong Lee jingsonglee0@gmail.com
1. Computing in the data warehouse
In computing, a data warehouse (DW or DWH), a system for reporting and data analysis, is considered a core component of business intelligence. It stores current and historical data in one place and creates analytical reports for workers across the enterprise. [1]
A typical extract, transform, load (ETL) based data warehouse uses an ODS layer, a DWD layer, and a DWS layer to house its key functions. Data analysts can flexibly query each layer in the data warehouse to obtain valuable business information.
There are three key indicators in the data warehouse [2]:
- Data freshness: the time from when data enters the warehouse until, after a series of processing steps, it is available for user queries. ETL is usually the set of processes that prepares the data, mostly implemented as scheduled jobs running stream or batch computations.
- Query latency: once the data is ready, the time from when a user issues a query against a table until the result comes back. Query latency directly determines the end user's perceived experience.
- Cost: the amount of resources required to complete a given amount of data analysis, including all kinds of computation such as ETL and queries. Cost is also a key consideration in a data warehouse.
What is the relationship between these three indicators?
- Enterprises want better query latency and freshness while keeping costs under control; different data may come with different cost requirements.
- Freshness and query latency also trade off against each other in some cases. For example, spending more time preparing, cleaning, and preprocessing the data makes subsequent queries faster.
So the three form a triangular trade-off [2] in the data warehouse:
(Note: In a triangle, closer to the vertex means better, farther from the vertex means worse)
Given this triangular trade-off, what compromises do the industry's current mainstream architectures make?
2. Mainstream architectures in the industry
A typical offline warehouse:
The offline data warehouse uses batch ETL to overwrite data at partition granularity (INSERT OVERWRITE), which handles very large data volumes with good cost control.
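As a minimal sketch of this pattern in Hive-style SQL (the table names, fields, and the `dt` partition column are hypothetical):

```sql
-- Daily batch job: recompute one day's partition from the detail layer and
-- atomically replace it. Only the named partition is rewritten, so the cost of
-- a re-run is bounded to a single day's data.
INSERT OVERWRITE TABLE dws_user_orders PARTITION (dt = '2022-05-01')
SELECT
    user_id,
    COUNT(*)    AS order_cnt,
    SUM(amount) AS order_amount
FROM dwd_orders
WHERE dt = '2022-05-01'
GROUP BY user_id;
```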
But it has two serious problems:
- Poor freshness: the data delay is generally T+1, that is, data produced today can only be queried the next day.
- Poor handling of changelogs: offline data warehouses store append-only data. If they need to consume update streams such as database changelogs, full and incremental data must be merged over and over again, and the cost surges.
To solve these problems, real-time data warehouses have gradually emerged. A typical implementation uses Flink + Kafka to build the middle layers and finally writes the results to an online database or analytical system, achieving second-level end-to-end latency and therefore very good data freshness.
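A minimal sketch of such a Kafka-backed middle-layer table in Flink SQL (table names, fields, and the topic are illustrative; the options are those of the standard Flink Kafka connector):

```sql
-- A DWD-layer table backed by Kafka: upstream jobs write into it and downstream
-- jobs subscribe to it, but it can only be consumed as a stream, not queried ad hoc.
CREATE TABLE dwd_orders (
    order_id   STRING,
    user_id    STRING,
    amount     DECIMAL(10, 2),
    order_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_orders',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'earliest-offset'
);

-- Continuous ETL from the ODS layer into the DWD layer; further jobs would read
-- dwd_orders and write the DWS layer or the serving database.
INSERT INTO dwd_orders
SELECT order_id, user_id, amount, order_time
FROM ods_orders
WHERE amount > 0;
```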
However, it also gradually exposed some problems.
Problem 1: the middle layers cannot be queried
Querying data in Kafka is very limited: flexible OLAP queries are impossible, and long-term historical data is usually not retained. This is very different from the widely used data warehouse model: in a mature warehouse, every data set should be a queryable table abstraction. Kafka cannot satisfy all of users' needs for such a table abstraction. For example:
- Limited query capability. The real-time warehouse architecture requires every queryable data set to be pre-computed and written into an analytical system, but in real business not every computation can be defined in advance, and a large share of analysts' needs are ad hoc queries. If the intermediate queues cannot be queried, the business's data analysis capability is severely limited.
- Difficult troubleshooting. In a real-time warehouse, when the data looks wrong, users need to inspect the data pipeline, but because the queues holding the intermediate results cannot be queried, troubleshooting is very hard.
In summary, we want a unified architecture that yields a real-time data warehouse that is queryable everywhere, rather than one whose intermediate results are hidden inside a pipeline.
Problem 2: the real-time link is expensive
There is no free lunch in the world, and it is expensive to build a real-time link.
- Storage cost: both Kafka and the downstream ADS-layer systems are online services; they offer low latency but their storage cost is high.
- Migration and maintenance cost: the real-time link is a new system, independent of the offline stack and incompatible with the offline tool chain, so migration and maintenance are expensive.
Therefore, we hope to have a low-cost real-time data warehouse, which provides low operating costs and is compatible with offline toolchains, while accelerating the original offline data warehouse.
To summarize:

| | Offline data warehouse | Real-time data warehouse |
| --- | --- | --- |
| Cost | Low | High |
| Freshness | Poor | Good |
| Query latency of intermediate tables | High | Not queryable |
| Query latency of result tables | Low | Low |
Because the two architectures face different trade-offs and scenarios, businesses usually have to maintain both, often with separate technical teams, which brings not only high resource costs but also expensive development and operating costs.
So is it possible for us to provide a data warehouse that is relatively balanced in terms of freshness, query latency, query capability and cost? To answer this question, we need to analyze the technical principles behind freshness and query latency, the different architectures caused by different tradeoffs, and the technical differences behind them.
3. ETL freshness
The first thing to think about is data freshness: data freshness measures the length of time it takes for the data to be queried after a series of processing in the warehouse. The data is ingested into the data warehouse, and after a series of ETL processing, the data is in a usable state.
Traditional batch computing runs ETL on a scheduling interval, so its freshness is: scheduling interval + ETL latency. The interval is usually a day, so the freshness of a traditional offline warehouse is at least one day. When computation follows the interval, its input and output are full data. When the required freshness is smaller than the interval, the input and output become partial, that is, incremental. The typical form of incremental computing is stream computing, such as Flink Streaming.
Incremental computing is not exactly equivalent to stream computing; for example, incremental computing can also be done in small batches. Full computing is not exactly equivalent to batch computing; for example, stream computing can also produce full results through windows (in other words, the latency of stream computing can also be large, which reduces cost). A sketch of the latter follows.
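A minimal Flink SQL sketch of windowed, day-grained output from a streaming job; the `orders` table and its `order_time` event-time column are assumptions and would need a watermark defined:

```sql
-- The job runs continuously and computes incrementally inside Flink, but it only
-- emits one result per user per day, so the output granularity (and freshness)
-- matches a daily batch job while the cost stays low.
SELECT
    window_start,
    window_end,
    user_id,
    SUM(amount) AS daily_amount
FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' DAY))
GROUP BY window_start, window_end, user_id;
```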
4. Query Latency
Query latency directly affects the efficiency and experience of data analysis. A query's result is returned to a human being, not a robot, and what that person looks at is filtered or aggregated data. In a traditional offline data warehouse, querying a large table can take more than ten minutes.
The most intuitive way to make queries return faster is pre-computation. In essence, the ETL of a data warehouse is pre-computation: when a data analyst's query takes too long, they ask the warehouse team to build the corresponding ETL pipeline, and once the data is prepared the analyst queries the final result table directly. From one point of view, this trades freshness for lower query latency.
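For example, rather than every analyst aggregating the detail table on the fly, an ETL job can materialize the aggregate once (all table names here are hypothetical):

```sql
-- Pre-computation: the expensive aggregation over the detail table runs once in
-- the ETL pipeline and is written to a result table.
INSERT INTO ads_user_daily_stats
SELECT dt, user_id, COUNT(*) AS order_cnt, SUM(amount) AS order_amount
FROM dwd_orders
GROUP BY dt, user_id;

-- The analyst's query then only touches the small, pre-aggregated result table
-- and returns quickly.
SELECT user_id, order_amount
FROM ads_user_daily_stats
WHERE dt = '2022-05-01'
ORDER BY order_amount DESC
LIMIT 100;
```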
However, a traditional offline data warehouse also serves a large number of ad hoc queries, where users flexibly choose query conditions according to their needs and queries over large tables can easily take more than ten minutes. To return results as quickly as possible, the major storage systems apply all kinds of optimizations.
For example, moving the storage closer to the computation; the closer the data, the faster the read:
- Some message queues and OLAP systems only provide local disk storage, which guarantees read performance but sacrifices flexibility and makes scaling and migration expensive.
- The other direction is a compute-storage separated architecture: all data is remote, and the high cost of remote access to DFS / object stores is reduced through local caches.
Another example is data skipping: using query conditions and the selected fields to skip irrelevant data and speed up data retrieval (see the sketch after this list):
- Hive: reads only the relevant partitions through partition pruning, and skips irrelevant fields through columnar storage.
- Lake storage: on top of columnar storage, file-level statistics are introduced, and files that cannot match the query are skipped according to those statistics.
- OLAP systems: on top of columnar storage, an LSM structure, for example, keeps the data ordered by primary key as much as possible; ordered data is one of the most query-friendly layouts. ClickHouse is an example.
- KV systems: the LSM organization of the data is used to speed up lookups.
- Message queues: a queue locates data quickly only through a special read interface; it provides offset- / timestamp-based positioning to read data incrementally.
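As a small illustration of data skipping with Hive-style partition pruning and column pruning (table and column names are hypothetical):

```sql
-- Only files under dt = '2022-05-01' are scanned (partition pruning), and with a
-- columnar format only the user_id and amount columns are read (column pruning);
-- all other partitions and columns are skipped entirely.
SELECT user_id, SUM(amount) AS total_amount
FROM dwd_orders
WHERE dt = '2022-05-01'
GROUP BY user_id;
```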
There are many more optimization techniques that will not be enumerated here. Storage cooperates with computing in various ways to speed up queries, so that the data can be located and read quickly.
Through the above analysis, we can see that the underlying technologies of different systems are basically the same:
- Stream computing and batch computing are different modes of computing, and they can both complete full computing or incremental computing.
- The means by which storage accelerates queries all revolve around locating data fast and reading it fast, and the underlying principles are the same.
In theory, it should be possible for us to build a certain architecture through a certain selection and combination of underlying technologies to achieve the Tradeoff we want. This unified architecture may need to address the following scenarios according to different Tradeoffs:
- Real-time data warehouse: The freshness is very good.
- Near real-time data warehouse: As an acceleration of offline data warehouse, it can improve freshness without bringing too high cost.
- Offline data warehouse: has better cost control.
- Offline OLAP: Accelerates the query performance of a certain part of the data warehouse, such as ADS tables.
Streaming Warehouse aims to be a unified architecture:
(Note: In a triangle, closer to the vertex means better, farther from the vertex means worse)
An ideal data warehouse should be one where users can freely adjust the tradeoff between cost, freshness, and query delay, which requires the data warehouse to fully cover the full capabilities of offline data warehouses, real-time data warehouses, and OLAP. Streaming Data Warehouse takes a step forward based on real-time data warehouses and greatly reduces the cost of real-time data warehouses.
While providing real-time computing capabilities, Streaming DW allows users to cover offline data warehouse capabilities under the same architecture. Users can make corresponding tradeoffs according to business needs to solve problems in different scenarios.
5. Streaming Data Warehouse
Before looking at how the storage of the Streaming Data Warehouse is designed, let's revisit the two problems of mainstream real-time data warehouses mentioned earlier. Once these two problems are solved, the architecture of the Streaming Data Warehouse falls into place.
5.1 Intermediate data cannot be queried
Since the intermediate Kafka storage cannot be queried, one approach to real-time/offline unification is: run the real-time and offline pipelines side by side, encapsulate as much as possible in the business layer, and try to present users with a single table abstraction.
Many users use Flink and Kafka for real-time stream processing and write the analysis results to the online service layer for display or further analysis, while large-scale batch or full jobs run regularly (for example daily) to backfill the real-time data or revise historical data. [3]
But there are several problems with this architecture:
- The table abstractions differ: with two technology stacks, the real-time link and the offline link each have their own table abstraction, which not only increases development cost but also reduces development efficiency; the business-layer encapsulation is leaky, and there are many pitfalls where the two sides are misaligned.
- It is hard to keep the data definitions of the real-time warehouse and the offline warehouse naturally consistent.
In the Streaming Data Warehouse, we hope that the data warehouse has a unified Table abstraction for querying, and all data in flow can be analyzed without data blind spots. This requires this unified Table abstraction to support two capabilities at the same time:
- Message Queue
- OLAP query
That is to say, on the same Table, users can subscribe to the Change Log on the Table in the form of a message queue, or directly perform OLAP queries on the Table.
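In Flink SQL this could look roughly as follows. The runtime-mode switch is the standard Flink SQL option; `dwd_orders` is an assumed table in the Streaming Data Warehouse, and the streaming read returning the table's changelog is the behavior this unified abstraction calls for:

```sql
-- 1) Subscribe to the table as a message queue: continuously consume its changelog.
SET 'execution.runtime-mode' = 'streaming';
SELECT * FROM dwd_orders;

-- 2) Run an OLAP query over a snapshot of the very same table.
SET 'execution.runtime-mode' = 'batch';
SELECT user_id, SUM(amount) AS total_amount
FROM dwd_orders
GROUP BY user_id;
```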
Let's look at the second problem of the classic real-time data warehouse.
5.2 High cost of real-time links
Although the unified Table abstraction of the Streaming Data Warehouse solves the freshness and query latency problems well, its cost is higher than that of an offline data warehouse. In many business scenarios the requirements on freshness and query latency are not high, so low-cost table storage capability is still necessary.
Here lake storage is a good option:
- Lower storage cost: lake storage is built on DFS / object stores, has no always-on service, and has lower resource and operations costs.
- Flexible correction of historical data: what if a historical partition has a problem and needs to be corrected? The computing cost of lake storage is lower, and correcting a historical partition with lake storage plus offline ETL (INSERT OVERWRITE) is much cheaper than a real-time update (see the sketch after this list).
- Openness: lake storage can be opened up to all kinds of batch computing engines.
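A sketch of such a correction through the batch path, reusing the hypothetical table names from the earlier sketches:

```sql
-- Recompute one problematic historical partition and overwrite it in the lake
-- storage; the real-time streaming link keeps running untouched.
SET 'execution.runtime-mode' = 'batch';

INSERT OVERWRITE dws_user_orders PARTITION (dt = '2022-04-01')
SELECT
    user_id,
    COUNT(*)    AS order_cnt,
    SUM(amount) AS order_amount
FROM dwd_orders
WHERE dt = '2022-04-01'
GROUP BY user_id;
```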
Therefore, the Streaming Data Warehouse needs to provide low-cost offline storage while keeping the full link of data flowing in real time, and the architecture must not affect the real-time link. Since the SLA requirements of real-time links are generally higher than those of offline links, the design and implementation of the Streaming Data Warehouse's storage should treat queue writing and consumption as high priority, and storing historical data must never affect its ability to act as a queue.
6. Flink Table Store
Flink Table Store [4] is a stream-batch integrated storage specially built for Streaming Warehouse.
Over the past few years, with the help of our many contributors and users, Apache Flink has become one of the best distributed computing engines, especially when it comes to large-scale stateful stream processing. Still, there are some challenges when trying to gain real-time insights from data. Among these challenges, one prominent problem is the lack of storage that can satisfy all computing modes.
Until now, it was common for people to deploy some storage systems that work with Flink for different purposes. A typical approach is to deploy a message queue for stream processing, a scannable filesystem/object store for batch processing and Ad-Hoc queries, and a KV store for polling. Due to its complexity and heterogeneity, such an architecture presents challenges in both data quality and system maintenance. This has become a major issue that undermines the unified end-to-end user experience for streaming and batching that Apache Flink brings.
The goal of Flink Table Store is to solve the above problems. This is an important step for Flink, which expands Flink's capabilities from computing to storage. Because of this, we can provide users with a better end-to-end experience.
6.1 Architecture
6.1.1 Service
Coordinator is the control node of the cluster. It is mainly responsible for managing Executors. The main capabilities are:
- Coordinator manages the life cycle of Executors, and the client finds the address of Executors through Coordinator.
Data Manager:
- Manage the version of the Table, be responsible for dealing with the metastore, and regularly checkpoint the version to the metastore.
- Manage caches and indexes according to the written data and the query patterns.
Resource Manager:
- Manage the distribution of Table's Buckets among Executors.
- Dynamically allocate Buckets to Executors as needed.
Metastore is an abstract node. It can connect to Hive Metastore, minimize dependencies by relying only on a filesystem, or connect to your own metastore; it stores only the most basic table information. There is no need to worry about performance: the more detailed and complex table information lives in the lake storage.
Executor is a separate computing node, serving as a Cache for storage and an acceleration unit for local computing:
- It receives data updates, writes them to the local cache and local disk, and then flushes them to the underlying DFS.
- It also serves real-time OLAP queries and queue consumption, performing some local computation for acceleration.
Each Executor is responsible for one or more Buckets, and each Bucket has a corresponding Changelog. These Changelogs will be stored in the Message Queue and are mainly used for:
- Write-ahead log: after an Executor failover, the log is read back to recover data.
- Queue abstraction: provides the table's changelog stream for consumption by downstream stream computing.
6.1.2 Lake Storage
The Executors' data lands in the lake storage after a checkpoint. The lake storage is built on a columnar file format and shared DFS storage. It provides a complete Table Format abstraction whose main purpose is to support updates and reads at lower cost:
- LSM structure: used for large data updates and high-performance queries.
- Columnar File Format: Uses Apache ORC to support efficient queries.
- Lake Storage: Metadata and data on DFS and Object Store.
6.1.3 Separation of hot and cold
The storage read and write paths are divided into two:
- Streaming Pipeline & Online OLAP Query: Get metadata through Coordinator, write and get data from Executor.
- Batch Pipeline & Offline Query: Get metadata through metastore, write and get data from lake storage.
The data in the Service is up to date and is synchronized to the lake storage by minute-level checkpoints, so users reading the lake storage only see slightly stale data; in essence, the data on both sides is consistent.
There are these differences between the use of Services and Lake Storage:
- The Service is suitable for the latest hot data; it provides fast record-level updates and low query latency.
- The Service is not suitable for offline queries: they would affect the stability of the online service, and their cost would be higher.
- Service does not support INSERT OVERWRITE of Batch Pipeline.
Therefore, the storage needs to expose the lake storage to take on these capabilities. How do we decide which data is operated on in the Service and which in the lake storage?
Only partitions that have been archived (ARCHIVE) can be batch INSERT OVERWRITE'd in the lake storage.
- Users can specify a partition automatic ARCHIVE time when creating a table.
- A partition can also be archived explicitly through a DDL statement.
6.2 Short-term goals
6.2.1 Short-term architecture
The overall move to the Streaming Data Warehouse is a huge undertaking. OLAP, queues, lake storage, stream computing, and batch computing each have leading systems in their own fields, and no complete solution can be produced in a short time.
However, we are moving forward: in Apache Flink Table Store, we first developed the LSM-based lake storage and natively integrated Kafka as the Log System.
Compared with the full architecture in the previous section, the short-term architecture has no Coordinators or Executors, which means it:
- Cannot provide real-time OLAP; file-based OLAP can only reach quasi-real-time latency.
- Has no service-based data management and control capabilities.
We hope to start from the bottom and consolidate the foundation, first to propose a complete unified abstraction, then to accelerate the storage, and then to provide the real OLAP.
The current architecture provides two core values:
6.2.2 Value 1: The real-time middle layers become queryable
Table Store brings query capability to the Kafka-based layered storage of the original real-time data warehouse, so intermediate data can be queried.
Table Store still supports the real-time streaming pipeline: it integrates a native log, supports Kafka integration, and hides the stream/batch distinction so that users only see the table abstraction.
However, it is worth noting that writing data to the storage must not affect the stability of the original writes to Kafka; this needs to be reinforced and guaranteed.
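As a rough sketch, the Kafka-only middle-layer table from the earlier example could instead be declared as a Table Store table, keeping Kafka as its log system for streaming consumption while also gaining a queryable lake copy. The catalog and table options below follow the configuration style of the Table Store 0.x documentation and should be checked against the release you actually use; paths and names are illustrative:

```sql
-- Table Store catalog over a shared DFS warehouse path.
CREATE CATALOG table_store_catalog WITH (
    'type' = 'table-store',
    'warehouse' = 'hdfs://namenode:8020/warehouse'
);
USE CATALOG table_store_catalog;

-- The middle-layer table: LSM-based lake files for batch / quasi-real-time OLAP,
-- Kafka as the log system for low-latency changelog subscription.
CREATE TABLE dwd_orders (
    order_id   STRING,
    user_id    STRING,
    amount     DECIMAL(10, 2),
    order_time TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'bucket' = '4',
    'log.system' = 'kafka',
    'kafka.bootstrap.servers' = 'kafka:9092',
    'kafka.topic' = 'dwd_orders_log'
);
```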
6.2.3 Value 2: Offline warehouse acceleration
Table Store accelerates offline data warehouses, and provides incremental update capabilities while being compatible with Hive offline data warehouses.
Table Store provides a complete lake storage Table Format and quasi-real-time OLAP queries. The LSM structure is not only good for update performance but also enables better data skipping, which speeds up OLAP queries.
6.3 Follow-up plans
The community is currently working on enhancing the core functionality, stabilizing the storage format, and completing the rest to make Flink Table Store production ready.
In the upcoming 0.2.0 version, we hope to provide a Table Format that integrates streams and batches, and a gradually improved stream-batch integrated lake storage. You can expect (at least) the following additional features:
- Flink Table Store Reader supporting the Apache Hive engine.
- Supports adjusting the number of buckets.
- Support for append-only data, so Table Store is not limited to update scenarios.
- Full Schema Evolution, better metadata management.
- Improvements based on feedback from the 0.1.0 preview.
In the medium term, you can also expect:
- Flink Table Store Reader supporting Presto, Trino and Apache Spark.
- Flink Table Store Service, to accelerate updates and improve query performance, providing millisecond-level streaming pipelines and strong OLAP capability.
Please try out the 0.1.0 preview, share your feedback on the Flink mailing list, and contribute to the project.
6.4 Project Information
The Apache Flink Table Store project [4] is under development, and the first version has been released. You are welcome to try it out and give feedback.
[1] Data warehouse Wiki: https://en.wikipedia.org/wiki/Data_warehouse
[2] Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google: http://vldb.org/pvldb/vol14/p2986-sankaranarayanan.pdf
[3] Flink Next: Beyond Stream Processing: https://mp.weixin.qq.com/s/CxHzYGf2dg8amivPJzLTPQ
[4] https://github.com/apache/flink-table-store