Looking at recent major events in the industry, the open sourcing of Delta 2.0 has been the most talked about, especially since Databricks accompanied the official announcement with the following performance comparison, which carries a distinct whiff of gunpowder:
Databricks engineers have repeatedly stressed that the performance test came from the third party Databeans and that they did not commission it. Still, anyone who watched the Delta 2.0 launch from start to finish will notice that the list of key features being opened up calls out an Iceberg-to-Delta conversion tool, and that the keynote dwelt on Adobe's practice of migrating from Iceberg to Delta 2.0. It is hard not to read something into that.
Over the past two years our team has invested heavily in researching, exploring and practicing the new generation of data lake technologies. Although our main bet has been on Iceberg, the open sourcing of Delta 2.0, and the attention Databricks itself pays to Iceberg, have only strengthened our confidence in the lakehouse direction. Competition between open source projects is essentially competition over standards; competition will accelerate the confirmation and adoption of those standards, and every big data practitioner will benefit from it.
Since most of our work uses Iceberg as a low-level dependency that is architecturally decoupled, we could just as well embrace Delta. So here I would like to talk about the lakehouse direction, and our understanding of the mainstream open source products, from a third-party standpoint, and briefly introduce our own work. I also hope you will think through the following question together with me: what kind of data lake does an enterprise actually need?
1 The table format battle of the big three
Table format was first proposed by Iceberg and has since become an industry-wide concept. What is a table format? In a nutshell:
- A table format defines which files constitute a table, so that any engine can query and retrieve data according to the format;
- A table format regulates how data and files are laid out. Any engine that writes data must follow this standard, and higher-level features such as ACID transactions and schema evolution are supported through the semantics the format defines.
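To make the "any engine" point concrete, here is a minimal PySpark sketch of reading an Iceberg table purely through the format's metadata. The catalog settings, bucket and table names are illustrative, not from the original article.

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog; the catalog name, warehouse path and table are made up.
spark = (
    SparkSession.builder
    .appName("table-format-read")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# The engine discovers the table's files through the format's metadata
# (snapshots and manifests), not by listing directories, so every read
# sees one consistent snapshot of the table.
df = spark.table("lake.db.events")
df.show(10)
```

Any other engine with an Iceberg connector (Trino, Flink, and so on) resolves the same metadata, which is what makes the format, rather than the engine, the contract.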
At present, peers at home and abroad treat Delta, Iceberg and Hudi as the benchmark data lake table formats, so let's start with the background of these three open source projects.
1.1 Delta
Delta is the data lake product Databricks started building in 2017, announced in 2018 and open sourced in 2019. By the time Delta was conceived, Databricks had been around for four years, had gone through several rounds of financing, and was laying out its business in an orderly fashion, at a time when the Hadoop distribution business was not yet a difficult one. The birth of Delta reads more like Databricks building core competitiveness on the genes of its Spark founding team, which makes it easy to understand why Delta 1.0 was barely open to other engines.
Delta was launched to address the shortcomings of the traditional data lake in transaction processing, stream computing and BI analysis. Databricks, with its strong storytelling ability, coined the lakehouse concept for Delta. Today the lakehouse (lake-warehouse integration) concept is deeply rooted; even the old rival Snowflake has adopted it, giving it a definition on its own website that better suits its own products. In Gartner's 2021 database leadership quadrant, Databricks and Snowflake were promoted to the leaders quadrant together, and lakehouse entered the hype cycle for data management for the first time, positioned in the early rising phase of the curve. By Gartner's estimate, lakehouse technology may still need 3-5 years to reach full maturity.
In Databricks' vision, Delta 1.0 as a lakehouse solution lets the data lake move faster in real-time and AI scenarios. Databricks proposed the delta architecture to free users from the lambda architecture: the core idea is that the data lake can run both batch and streaming, and the pipelines and code for stream and batch computing can be reused, so users no longer carry the burden of maintaining two systems. The computing engine, of course, has to be Spark. Unfortunately, Spark Streaming and Structured Streaming have relatively few users in China, so for most users Delta 1.0 was a drink that failed to quench the thirst. The deep binding to Spark also limited the development of the Delta community to some degree and laid the groundwork for the rise of Iceberg. A comparison of community activity as of 2022 Q1 is shown below.
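As an aside, the "one copy of data serves both batch and streaming" idea behind the delta architecture is easy to see in code. A minimal sketch with the open source Delta connector, where the table path and sink are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled Spark session

path = "s3://my-bucket/lake/orders"  # hypothetical Delta table location

# Batch view: read the latest committed snapshot of the table.
batch_df = spark.read.format("delta").load(path)

# Streaming view: the same table is also a streaming source, so stream and
# batch pipelines can share one storage layer, one model and much of the code.
stream_query = (
    spark.readStream.format("delta").load(path)
    .writeStream
    .format("console")                              # illustration-only sink
    .option("checkpointLocation", "/tmp/chk/orders")
    .start()
)
```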
On the other hand, we should not forget that Databricks, as a commercial company that has been operating for many years, has a considerable number of paying users; combined with its dominance of the Spark community and its mature marketing and channel capabilities, it may well be able to rebuild its advantage in open source.
Delta is Databricks' lakehouse solution and is often used as the poster child of the lakehouse, but the definition of the Delta project itself has shifted. I noticed that before a certain point last year, Delta was defined as an open format that could directly replace parquet in an engine.
That definition of format was very close to Iceberg's definition of a table format, but on the current official website, and in the various talks and blogs around it, such wording no longer appears. Delta is now officially defined as a lakehouse storage framework. Format or framework, the soup is still the same soup; only the recipe has been fleshed out.
1.2 Iceberg
Iceberg is a data lake table format developed and open sourced by a team at Netflix. The founder, Ryan Blue, is a PMC member of Spark, Parquet and Avro, with rich experience and connections in the data analysis field, and the co-founders include a senior engineer from Cloudera. On the timeline, Iceberg entered Apache incubation in 2018 and graduated in 2020. Given each project's internal development cycle, it is hard to say which of Iceberg and Delta came first, and since the founder is himself an active Spark contributor, the two projects were highly similar from the start.
Functionally, to borrow a line from Zhihu: you can't say they are very similar, you can only say they are exactly the same. In terms of development, though, Iceberg has more of the temperament of an open source project. In the early days the project mainly served Netflix's needs for large-scale data analysis, emphasizing the following characteristics:
- ACID and MVCC semantics: readers never see the inconsistent state of an in-flight write
- Data skipping: by pruning files at the table format layer, query performance improves dramatically in some scenarios
- Unlike Hive, it does not rely heavily on the NameNode, so planning performance improves greatly on very large clusters
- The design gives a lot of thought to building the table format on S3, making Iceberg a good choice for data lakes in the cloud
- Schema evolution and hidden partitioning make table changes and maintenance easier (a short sketch follows this list)
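Hidden partitioning and schema evolution, the last point above, look roughly like this in Iceberg's Spark SQL; the catalog, table and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named `lake`

# Hidden partitioning: the partition is derived from ts, so queries filter on ts
# directly without knowing (or mis-writing) a separate partition column.
spark.sql("""
    CREATE TABLE lake.db.logs (
        id    BIGINT,
        ts    TIMESTAMP,
        level STRING,
        msg   STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution is metadata-only: no data files are rewritten.
spark.sql("ALTER TABLE lake.db.logs ADD COLUMN host STRING")
spark.sql("ALTER TABLE lake.db.logs RENAME COLUMN msg TO message")
```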
The founder's emphasis on Iceberg's advantages over Hive hit developers' real pain points, especially for users moving to the cloud. Many insiders suggested Iceberg would become the next-generation Hive, and Iceberg's engine equality further won over the surrounding vendors. From public information we know that Cloudera now mainly promotes Iceberg, Snowflake supports Iceberg external tables, StarRocks supports Iceberg external tables, and Amazon Athena can query Iceberg tables. By contrast, although Delta 2.0 has also begun to attract engine developers beyond Spark, it will take time to catch up with Iceberg's current ecosystem.
I first came into contact with Iceberg in 2020, while looking for a data lake solution better suited to Flink than Hive, to solve upsert and the separation of development and operations between batch and streaming scenarios. At the time Iceberg and Hudi were both still incubating, Delta was still Spark's Delta, and Hudi was also just a Spark library; only Iceberg stood out, and it was also the first to provide a Flink connector.
The Iceberg community has always been very restrained about its roadmap: any modification to the underlying table format is made cautiously to stay friendly to every engine, while the extensibility of its operations and row-level APIs leaves developers plenty of room for imagination. In terms of engine equality, Iceberg is in a league of its own; how this plays out remains to be seen.
1.3 Hudi
Hudi's open source and incubation timeline is similar to Iceberg's. Going back to the beginning, hudi stood for Hadoop Upserts Deletes and Incrementals, and its core function was to support upsert and incremental processing on Hadoop. Today Hudi is no longer limited to Hadoop or to those two functions. Hudi does not emphasize its own data format, and after several large iterations its definition has become somewhat complicated. Open the official website and you will see this description:
Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database layer while being optimized for lake engines and regular batch processing.
Clearly Hudi wants to do a great deal and has set itself a database-like goal, with a long way still to go. Among the three projects, Hudi was the first to provide stream upsert capability: without any secondary development, Hudi is an out-of-the-box data lake upsert solution (see the sketch below), and the Hudi community is very open to developers, the opposite extreme of Iceberg's focused, cautious tone. On the other hand, Hudi changes a lot between major versions; that topic deserves its own article, so I will set it aside here.
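A minimal sketch of that out-of-the-box upsert path with Spark; the table location, key fields and the `updates_df` DataFrame are hypothetical, and option defaults vary across Hudi versions:

```python
# `updates_df` is assumed to be an existing DataFrame of new and changed rows.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",      # primary key
    "hoodie.datasource.write.precombine.field": "update_time",  # newest version wins
    "hoodie.datasource.write.operation": "upsert",
}

(
    updates_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")   # append mode + upsert operation = merge rows by record key
    .save("s3://my-bucket/lake/orders_hudi")
)
```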
In the earliest days, Hudi was only implemented on Spark, and to support Flink the community went through a major refactor (similar to Delta). That was the most important reason we did not choose Hudi in 2020. After Hudi's core team founded Onehouse, Hudi's positioning visibly diverged from the other two. For Databricks as a commercial company, Delta is a traffic magnet, and monetization happens through the data development, governance and AI platform built on top; likewise, from public information, Tabular, founded by Ryan Blue, is also building a platform on top of Iceberg, which is something quite distinct from the table format itself. Hudi, by contrast, has pulled itself up to the height of a platform; its features are still far from that goal, but it is foreseeable that the long-term roadmaps will diverge significantly.
For competitive reasons, whatever Delta and Iceberg have, Hudi is likely to follow, so Hudi can also be used purely as a table format. When doing technology selection for an enterprise, we need to decide whether to pick a pure table format to integrate into our own platform, or to pick a new platform or integrate platforms.
2 Iceberg's backstab and Delta 2.0's counterattack
It is too early to judge now.
If we must compare, I prefer to compare Delta with Iceberg, because Hudi's vision differs considerably from the other two; in other words, on the table format itself, Delta and Iceberg probably know better what they want to do. Taking just these two, Iceberg looks better to me. Consider the content of the recent Delta 2.0 release; interested readers can watch Databricks' official Data + AI Summit 2022.
The key features mentioned in the talk can be summarized as follows:
- Data skipping via column stats: file skipping based on format-level column statistics
- OPTIMIZE ZORDER: a capability Delta has long had, now officially open sourced in 2.0
- Change data feed: CDC output for UPDATE/DELETE/MERGE INTO (a sketch of this and Z-Order follows the list)
- Column mapping: Delta can now evolve schemas much like Iceberg, with little functional difference
- Full ACID guarantees on S3: DynamoDB is introduced in the commit phase so that ACID is guaranteed on S3 as well
- Flink, Presto, Trino connectors: Flink and Trino are emphasized; the connectors are managed in a project separate from Delta itself
- Delta Standalone: as I understand it, a format-level API layer that, like Iceberg's, lets you manipulate data without going through an engine
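Two of these, change data feed and Z-Order, look roughly like this on an open source Delta 2.0 table; the table name, columns and starting version are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled Spark session

# Enable the change data feed on an existing table.
spark.sql("""
    ALTER TABLE lake_db.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the row-level changes (inserts/updates/deletes) recorded since version 5.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("lake_db.orders")
)

# Re-cluster data files on frequently filtered columns to improve data skipping.
spark.sql("OPTIMIZE lake_db.orders ZORDER BY (customer_id, order_date)")
```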
For readers who do not know Iceberg well, open the Iceberg official website and, to quote the line above: you can't say it is very similar, you can only say it is exactly the same. Most of these functions were already quite mature in Iceberg two years ago.
In the latter part of the conference, Databricks engineers highlighted:
- Adobe's migration from Iceberg to Delta; the attention paid to Iceberg is written all over it
- Delta is no longer contributed to only by Databricks: developers from the Flink and Trino communities joined in 2.0. However, the parts contributed by engine developers live in a separate connector project, apart from the Delta core; whether Delta can match or surpass Iceberg on engine equality remains to be seen
- Quoting the third-party Databeans test: Delta 2.0 is 1.7x faster than Iceberg and 4.3x faster than Hudi
Our team also used a benchmark tool to compare Delta 2.0 and Iceberg. The plan was to run the TPC-H queries over a 100-warehouse dataset under Trino (the tool is actually a CHbenchmark tailored for testing streaming lakehouses, mentioned again below). With the default parameters of the open source Delta and Iceberg releases, Delta's results were indeed impressive: its average response time was about 1.4x faster than Iceberg's. But we noticed two important differences in the default parameters:
- Delta and Iceberg use different default compression algorithms under Trino: Trino writes Iceberg with ZSTD by default but writes Delta with SNAPPY. ZSTD has a higher compression ratio than SNAPPY; in our observation, ZSTD files are only about 60% the size of SNAPPY files, but SNAPPY is more CPU-friendly at query time and queries run faster;
- Delta and Iceberg have different default read target sizes: Delta defaults to 32 MB, Iceberg to 128 MB. Planning over smaller files allows more parallelism in the execution plan, at the cost of more resource consumption; in practice, 32 MB may be the better choice for response-time-sensitive data analysis (aligning these two settings is sketched right after this list).
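For reference, both knobs exist as standard Iceberg table properties, so aligning the Iceberg side with Delta's defaults can be sketched as below. The table name is a placeholder, and how exactly the original test applied the settings is not specified; this is just one way, assuming Spark SQL is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named `lake`

# Pin parquet compression to SNAPPY and the read split target to 32 MB (in bytes),
# matching the Delta defaults observed in the comparison above.
spark.sql("""
    ALTER TABLE lake.db.lineitem SET TBLPROPERTIES (
        'write.parquet.compression-codec' = 'snappy',
        'read.split.target-size'          = '33554432'
    )
""")
```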
With the compression algorithm set to be the same for both and the read target size set to 32 MB, the measured average TPC-H response times no longer differed. Under identical configuration, the benchmark mainly exercises the IO performance of the file format (parquet), so no difference is exactly what one would expect. Onehouse's subsequent response to the performance test supports the same point:
As a practitioner in this area, the full open sourcing of Delta 2.0 is exciting. It is almost certain that the overlapping features of Delta 2.0 and Iceberg will become the de facto standard of the data lake table format, and products and developers that have invested in this direction stand to reap the rewards sooner.
As for which is better: Iceberg's openness, focus and execution are impressive, and Delta's influence, commercial resources and maturity cannot be ignored. In terms of features and the surrounding ecosystem, Iceberg still has a first-mover advantage of at least one or two years, but Tabular, which grows out of Iceberg, has yet to show its hand. I believe open source contributors and influence will now flow toward Delta as well, and I expect the Delta community to catch up in activity.
3 The dilemma of promoting new technologies
As a basic software engineer, it is very hard to force demand from the bottom up; getting a business team to switch its underlying software usually requires the right timing, the right place and the right people all at once. Those working on data lakes have, I believe, run into situations over the past two years where promoting the technology to the business felt powerless. Here I will share some of my own understanding.
Let us first summarize the standard capabilities that today's data lake formats have converged on:
- Structural freedom: users can freely change the table structure, including adding, changing and deleting columns, without rewriting data
- Read/write freedom: ACID is guaranteed through commit primitives, reads and writes can run concurrently, and readers never see an inconsistent state
- Stream-batch homology: beyond batch reads and writes, stream computing is supported through incremental reads and streaming ingestion (a sketch follows this list)
- Engine equality: the mainstream computing engines most users rely on, including Flink, Spark and Trino, are all supported
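Stream-batch homology in particular is worth a concrete look: the same Iceberg table serves plain batch reads and incremental reads between two snapshots. A sketch with made-up snapshot ids and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named `lake`

# Batch read: the latest snapshot of the table.
batch_df = spark.table("lake.db.events")

# Incremental read: only the data appended between two snapshots
# (start is exclusive, end is inclusive).
incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5589015616806958000")
    .option("end-snapshot-id", "5589017554681441000")
    .load("lake.db.events")
)
```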
Today users mostly consume the data lake through a mature data productivity platform, such as Alibaba DataWorks or the NetEase Shufan Shushu platform. Borrowing from NetEase's big data practice over the past ten-plus years, the evolution can be divided roughly into three stages:
- Big data platform: workflows can be developed on the Hadoop platform, with basic data development and operations support and some data governance capabilities.
- Data middle platform: common requirements of the data business are abstracted into a middle layer; an indicator system is built around business subject domains; data modeling, development and operations are connected; and an asset platform with permission and quality evaluation gives the business higher-order data governance capabilities.
- 3D platform: an upgrade from the data middle platform to a DataOps, DataFusion and DataProduct system. 3D emphasizes systemization and process standardization, emphasizes CI/CD, and emphasizes the integration of multiple data sources.
Outside Alibaba, the data productivity platforms on the market are basically built around the Hadoop/Hive data lake, or around object storage in the cloud. Compared with formats like Delta and Iceberg, Hive and object storage lack structural freedom and read/write freedom, but those problems have largely been worked around through process specifications and upper-layer workarounds. In the emerging stage of the new data lake technology everyone talked about schema evolution and ACID, yet a mature data productivity platform, and the platform operators, data consumers and analysts it serves, are largely indifferent to them; as for engine equality, Hive already does that about as well as it can be done.
As for stream-batch homology, practice so far can be summed up in two points:
- Replacing message queues with data lake CDC can in theory bring cost benefits, but it also introduces a small-file problem;
- Data lake + merge-on-read forms, to some extent, an alternative to real-time warehouse solutions such as Kudu, ClickHouse and Doris.
These two points are the data lake practices most discussed in the industry, but neither is mature enough in practice: the adaptation cost is real, getting the business to accept the capability downgrade usually takes stronger arguments than cost optimization alone, and data lake CDC, as noted, introduces small files. As for merge-on-read, we tested it: after two hours of streaming ingestion into an Iceberg table, AP performance dropped by at least half. Of course, the new features Delta/Iceberg bring go beyond these two things. Schema evolution is very useful in feature-engineering scenarios, MERGE INTO is very useful for backfill, and UPDATE/DELETE SQL is a hard requirement for GDPR/CCPA compliance abroad. But these features are relatively fine-grained and tend to appeal only to specific scenarios.
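For concreteness, the MERGE INTO and row-level DELETE mentioned above look like this in Spark SQL (supported by both Delta and Iceberg); all table and column names here are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a lake catalog is configured

# Backfill / correction scenario: merge a batch of corrected rows by key.
spark.sql("""
    MERGE INTO lake.db.user_profile AS t
    USING lake.db.profile_updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET t.age = s.age, t.city = s.city
    WHEN NOT MATCHED THEN INSERT (user_id, age, city) VALUES (s.user_id, s.age, s.city)
""")

# GDPR/CCPA-style erasure: row-level delete by key.
spark.sql("DELETE FROM lake.db.user_profile WHERE user_id = 'u-123'")
```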
In practical exchanges with many peers over the past two years, we have generally run into the same problems: the business finds it not attractive enough, since there seems to be no qualitative improvement over the alternatives; product teams are reluctant to adapt, since all three projects are fine but appear to bring no tangible benefit to the product, and nobody wants to bet on the wrong one; and with the economy down, business risk appetite has shrunk and interest in new technology along with it.
So before applying the new data lake technology to products and practice, it is worth asking the question top-down: what kind of data lake does the enterprise actually need?
4 What kind of data lake do enterprises need?
In fact, Databricks has already given us an answer. Delta uses a single data lake storage layer to unify batch and stream computing, combining the strengths of the traditional data warehouse in data analysis with the strengths of the data lake in AI and data science, and on top of that lakehouse storage base achieves full-scenario coverage of data services. In short, the value Delta brings to Databricks is full-scenario coverage on one piece of foundational data lake software.
Does this methodology apply to other companies? I think so, with some adjustments. Databricks started this project early and ran it as a strategic one, so its products and superstructure followed in lockstep, letting the entire platform benefit from the simplicity of the lakehouse. Most enterprise users carry a far heavier legacy, and any product adaptation ripples through everything.
On the other hand, real-time computing in China basically means Flink. I will not judge whether Flink or Spark is better; the reality is that most companies will not bind themselves to a single compute engine, which is why engine equality is so important for data lakes, important enough that even Delta has had to compromise on it. Using different engines lets a company absorb the strengths of each, but it also fragments the product: most mainstream big data platform vendors treat real-time computing as a separate product entry. The reason behind this is not only the engine; the bigger problem is the inconsistency of the storage layer. Product fragmentation has been amplified as big data methodology iterated. In the data middle platform, for example, the indicator system, data models, data quality and data assets modules are basically built around offline scenarios; and in DataOps, which emphasizes CI/CD, the requirements and scenarios of stream computing are even harder to accommodate because storage and compute are inconsistent.
The direct result is that the real-time warehouse, and the scenarios and requirements of stream computing, have been marginalized in the methodological iteration of the big data platform, and users cannot enjoy the benefits of data security, data quality and data governance in real-time scenarios. Worse, in the many scenarios that need both real-time and offline, users must maintain two sets of models and two sets of code, a stream table and a batch table, and stay constantly alert to ambiguity between the two sets of semantics and models.
Having understood what the lakehouse means, and grounding it in reality, take NetEase Shufan as an example: the new data lake technology should help DataOps extend its boundary, bringing data development, operations and the whole data governance system to real-time and AI scenarios. The stream-batch unified data lake takes on the job of freeing the business from the lambda architecture, and the product interaction and experience should become simpler and more efficient, so that algorithm engineers, data scientists, and latency-sensitive teams such as risk control can also quickly get on board with DataOps standards and specifications, and use the data governance methodology to optimize cost.
All this may sound increasingly abstract, so let's take a lambda architecture for data analysis as an example:
In this scenario, Hive serves as the batch table and Kafka as the stream table, and the offline side follows the data middle platform and DataOps methodology. On the real-time side, users must build one stream computing job to synchronize data into HBase, implement another stream computing job that joins HBase dimension tables and writes the result to Kudu, which supports real-time updates, and finally the business chooses between querying the Kudu table or the Hive table depending on its freshness requirement, handling the differences between the two systems on its own. Under this architecture, users suffer from a fragmented experience and a great deal of work has to be done in the upper layers.
The data lake an enterprise needs should help the business solve this fragmentation, using the data lake to carry the whole pipeline of ETL, data pipelines and OLAP, and achieve the following effect:
Because real-time and offline share one set of models, many capabilities of the middle platform and DataOps can in theory be applied to real-time scenarios, data quality for example, though plenty of detailed innovation is still needed along the way. The core point is that the adoption of data lake technology should not hinge on one or a few specific features; it should be viewed as a whole, together with the data platform's methodology, so that data analysis, AI and stream computing benefit across all scenarios and links.
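As an illustration of the de-fragmented alternative to the lambda pipeline above, here is a sketch in which one lake table (Iceberg in this case) is both the streaming sink and the batch/OLAP source. The Kafka topic, broker, schema and table names are all made up, and the required connector packages are assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Kafka + Iceberg packages are available

# Streaming ingestion: continuously land events from Kafka into one lake table.
ingest = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(key AS STRING) AS order_id", "CAST(value AS STRING) AS payload")
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/chk/orders_lake")
    .toTable("lake.db.orders")
)

# Minutes later, the same table serves ad hoc analysis and offline models;
# there are no separate Kafka/HBase/Kudu copies to keep consistent.
spark.sql("SELECT count(*) AS cnt FROM lake.db.orders").show()
```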
5 Our work
Driven by this goal, our team has spent the past two years building the Arctic project, which we quietly open sourced at the end of July.
First, our job is not to create yet another product to compete with Delta/Iceberg; that would not serve enterprise needs. Arctic is a service built on top of an open source data lake format, and as mentioned above, we currently build on Iceberg.
Second, our goal is to extend the boundary of DataOps to stream computing, so Arctic gives users better-optimized streaming capabilities, including stream upsert, CDC and production-ready merge-on-read, and provides data analysis with minute-level freshness. In one sentence: Arctic is a streaming lakehouse service adapted to multiple engines:
Several of Arctic's core features show how we focus on expanding the boundary of the big data platform:
- Arctic is continuously self-optimizing (self-optimized)
- It provides two compatibility modes, Hive and Iceberg: an Arctic table can be used as a plain open source Hive table or Iceberg table, so users never have to worry about new Iceberg features being unavailable, or about legacy Hive tables being unable to use Arctic's functions
- It supports concurrent writes from multiple engines and guarantees data consistency in primary key scenarios; streams and batches write separately, and Arctic resolves conflicts between data written to the same primary key
- It provides standardized metrics and management tools for the real-time data lake, and exposes a thrift API to the platform
As a service, Arctic can adapt to different data lake formats, so products do not need to worry about which data lake technology to choose; continuous self-optimization lets data analysis be plug-and-play as a replacement for the real-time warehouse; the compatibility modes free products from worrying about migration choices, allowing targeted upgrades and grayscale rollouts in practice; and concurrent conflict resolution and consistency make data pipeline management simpler.
Performance is also a major focus for Arctic. We have done a lot of work, especially on merge-on-read, and we have built a dedicated performance testing tool for the streaming lakehouse, which we will also open later this year. In short, we borrow the CHbenchmark idea from HTAP: data continuously written by TPC-C is streamed into Arctic and Hudi through Flink CDC, and the benchmark is measured with TPC-H queries, so the object under test is merge-on-read performance in the OLAP scenario. The data freshness of both Arctic and Hudi is set to one minute. The current test results of the Arctic open source version are as follows (smaller is better):
The test plan, environment and configuration will be published on Arctic's official website, and we will share more benchmark details in the August 11 session; anyone interested, or with questions about the results, is welcome to join and learn more.
Although we have done a lot of optimization above the table format and beneath the engines, Arctic does not magically change the internal implementation of the format. Arctic relies on the release packages published by the community, will keep doing so, and delivers the best possible solution to users through its format compatibility.
We will hold a short online launch on August 11, 2022 (click to watch the Arctic open source launch), where I will spend about 30 minutes on Arctic's goals, features, plans, and the value it can bring to open source users. In terms of tonality, Arctic as basic software will be a completely open source project; any related commercialization (if any) will be driven by a separate team. In the future, conditions permitting, we will also actively push the project toward incubation in a foundation.
If you are interested in Arctic's positioning, its features, or anything related to the project, you are welcome to watch our live stream or the recording.

6 Summary
The release of Delta 2.0 shows that the data lake table format standard is becoming clear. As the competition among Delta, Iceberg and Hudi heats up, enterprises and related vendors should start thinking seriously about how to introduce data lake table format technology into their platforms and bring users lakehouse best practices.
The value the lakehouse brings to an enterprise should be to use a single data lake base to extend the boundary of the data platform, and to reduce the inefficiency and cost caused by fragmented products, data silos and disconnected process specifications. The first step is to extend the data middle platform built on the traditional offline warehouse, or the DataOps methodology derived from it, to real-time scenarios; driven by the lakehouse and related technologies, the future data product methodology will, I believe, make great strides toward stream-batch unification.
However, enterprises and developers need to understand that an open source data lake format is not the same thing as a lakehouse; even Databricks, which coined the lakehouse concept, has never equated Delta with the lakehouse. How to help enterprises build a lakehouse is precisely the point of our open sourcing the Arctic project. Arctic is currently positioned as a streaming lakehouse service: streaming emphasizes the extension of real-time capabilities, while service emphasizes management, standardized metrics, and the other lakehouse capabilities that can be abstracted into basic software. Take Arctic's continuous self-optimization as an example:
Arctic provides administrators and developers with continuous-optimization measurement and management tools that help users measure, calibrate and plan for timeliness, storage and compute cost. In an offline scenario built on a data lake, cost and performance are fairly linear: when performance or capacity falls short, SRE only has to decide how many machines to add. When we extend the data lake to real-time scenarios, the relationship among cost, performance and data freshness becomes more complex and subtle, and Arctic's service and management functions clarify this triangular relationship for users and the upper-layer platform.

About the author
Ma Jin, big data real-time computing technology expert at NetEase Shufan and lead of the lakehouse project, is responsible for NetEase Group's distributed databases, data transmission platform, real-time computing platform and real-time data lake, and has long worked on middleware and big data infrastructure. He currently leads the team focusing on platform solutions and technology evolution for stream-batch unification and lakehouse integration, and on the open source Arctic streaming lakehouse service.