This article is based on the author's talk at the Arctic open source launch event (lightly abridged), and systematically explains the motivation, ecological positioning, core features, performance, and future roadmap of the Arctic project.

First of all, thank you all for joining our Arctic open source launch. My name is Ma Jin, and I lead the real-time computing and lakehouse team at NetEase Shufan. In 2020 we started looking at data lake technology as a way to build an architecture that unifies streaming with batch and the lake with the warehouse. We began with Flink + Iceberg, but in practice we found a significant gap between that architecture and real production scenarios, which is why we built the Arctic project ( http://github.com/NetEase/arctic ).

[Figure]

Data Lake Table Format Battle

Let's first look at the current mainstream open source Table formats: Apache Hudi, Apache Iceberg, and Delta.

The concept of Table format was first proposed by Iceberg, and the industry now understands it mainly in two ways. First, a Table format defines which files constitute a table, so that any engine — Apache Flink, Apache Spark, Trino, Apache Impala — can query and retrieve data according to it. Second, a Table format regulates how data and files are laid out, and any engine that writes data must follow this standard. The standard supports ACID and schema evolution, which Hive did not. Iceberg, Delta, and Hudi are roughly at parity on these capabilities, and although their implementations differ significantly, I think abstracting their commonalities is very meaningful.

[Figure]

Comparing the current mainstream data lake Table formats with Hive: Hive simply defines a mapping between tables and static directories on HDFS, with no ACID guarantees, so reads and writes have to be carefully coordinated in real production. Today an upper-layer data platform or DataOps workflow is what ensures Hive is used correctly — and of course only in offline scenarios.

[Figure]

The new Table formats led by Iceberg, Delta, and Hudi add the concept of a snapshot. The table's metadata is no longer a simple mapping from table to files, but from table to snapshots and from snapshots to files. Each write generates a new snapshot, and the snapshot maps dynamically to files, which provides ACID guarantees for every write and, through snapshot isolation, supports concurrent readers and writers. Snapshots also enable some interesting upper-layer functions: incremental writes can be read incrementally, which is the basis of CDC, and snapshots support backtracking such as time travel and data rollback.
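To make the table → snapshot → file relationship concrete, here is a minimal Java sketch of snapshot-based metadata. It is purely illustrative — these are not Iceberg's or Delta's actual classes — but it shows why snapshots give you time travel and incremental (CDC-style) reads almost for free:

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of snapshot-based table metadata (not a real library's classes).
// Every commit produces a new immutable snapshot pointing at the full set of data files,
// so readers always get a consistent view, can time-travel, and can read incrementally.
public class SnapshotMetadataSketch {

    record DataFile(String path, long recordCount) {}

    record Snapshot(long snapshotId, long parentId, long timestampMillis, List<DataFile> files) {}

    record TableMetadata(String tableName, long currentSnapshotId, Map<Long, Snapshot> snapshots) {

        // Time travel: read the file list as of an older snapshot.
        List<DataFile> filesAsOf(long snapshotId) {
            return snapshots().get(snapshotId).files();
        }

        // Incremental read (the basis of lake-level CDC): files added between two snapshots.
        List<DataFile> addedBetween(long fromId, long toId) {
            List<DataFile> before = snapshots().get(fromId).files();
            return snapshots().get(toId).files().stream()
                    .filter(f -> !before.contains(f))
                    .toList();
        }
    }

    public static void main(String[] args) {
        Snapshot s1 = new Snapshot(1, -1, 1_000L, List.of(new DataFile("f1.parquet", 100)));
        Snapshot s2 = new Snapshot(2, 1, 2_000L,
                List.of(new DataFile("f1.parquet", 100), new DataFile("f2.parquet", 50)));
        TableMetadata table = new TableMetadata("orders", 2, Map.of(1L, s1, 2L, s2));

        System.out.println(table.filesAsOf(1));       // time travel to snapshot 1
        System.out.println(table.addedBetween(1, 2)); // incremental read: only f2.parquet
    }
}
```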

To sum up, a Table format has four core features.

First, schema freedom. Hive could only support simple operations such as adding columns, while Table formats such as Delta and Iceberg let users freely change the table structure — adding, dropping, and renaming columns — without imposing requirements on data migration or rewrites (see the sketch just after this list).
Second, read and write freedom. Because snapshots guarantee ACID, real-time, offline, and AI workloads can all freely read from and write to the same table.
Third, streams and batches share the same source. Because one of the core capabilities of a Table format is good support for streaming scenarios, both batch and streaming jobs can read from and write to the new Table formats.
Fourth, engine equality. This is very important: a Table format cannot be bound to a single engine. Delta, for example, was essentially a Spark ecosystem component in its 1.0 era; the release of Delta 2.0 a month ago proved once again the importance of supporting multiple engines.
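As a concrete example of schema freedom, Iceberg exposes schema evolution as a metadata-only commit through its Java API. The warehouse path and column names below are placeholders and will differ per deployment:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class SchemaEvolutionSketch {
    public static void main(String[] args) {
        // Assumes an existing Iceberg table at this placeholder warehouse path.
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://warehouse/db/orders");

        // Add, rename and drop columns in a single metadata-only commit;
        // existing data files are not rewritten.
        table.updateSchema()
                .addColumn("discount", Types.DoubleType.get())
                .renameColumn("addr", "address")
                .deleteColumn("obsolete_flag")
                .commit();
    }
}
```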

On their official websites, the Table format projects mainly promote features such as CDC ingestion, SQL extensions, data rollback and time travel, schema evolution, and what we commonly call upsert with merge-on-read.

CDC can, to a certain extent, flatten out the message queue. In production, real-time computing today mostly uses Kafka or Pulsar as the stream table. With a Table format, we can implement message-queue-like functionality on top of the data lake, although data latency degrades from milliseconds or seconds to minutes. As for upsert and merge-on-read, the main scenario the industry and many companies promote for data lakes is to use real-time updates plus merge-on-read to replace real-time-updatable warehouse systems such as Apache Kudu, Doris, and Greenplum.
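To illustrate the upsert pattern these formats promote, here is a sketch of a primary-key upsert into an Iceberg v2 table using Flink SQL from Java. The catalog name, warehouse path, and table are placeholders; the options follow Iceberg's Flink documentation:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UpsertSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Placeholder catalog/warehouse settings; 'format-version' = '2' and
        // 'write.upsert.enabled' = 'true' are Iceberg's Flink options for
        // primary-key upserts that are resolved by merge-on-read at query time.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id BIGINT," +
            "  status   STRING," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'iceberg'," +
            "  'catalog-name' = 'hadoop_prod'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://warehouse/path'," +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true'" +
            ")");

        // A second row with the same key becomes an update; readers merge data files
        // and delete files on read, trading millisecond latency for minute-level freshness.
        tEnv.executeSql("INSERT INTO orders VALUES (1, 'CREATED'), (1, 'PAID')");
    }
}
```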

What kind of data lake do enterprises need

The first point is that if we only look at individual features of the data lake Table formats, it is hard to generalize them. For example, the CDC capability of the data lake can indeed replace a message queue to some extent, but it introduces other problems: first, latency; second, using the data lake as a message queue can generate many small files — who manages them? Third, a less visible problem: message queue costs used to be attributed to the business team, but if we now use a shared data lake base, how should that cost be allocated?

Over the past two years we have talked with many companies in the industry, and generally we all struggle with the same contradiction: if we try to replace existing solutions with new data lake technology, the appeal to the business is not strong enough. So what value can data lake or Lakehouse technology actually bring to enterprises?

In our own production environment, the main problem our data platform faced in 2020 was the fragmentation of the streaming and batch platforms. We had developed a very rich methodology around the Hive offline data warehouse: from data models and data assets to data quality, we had built a solid set of specifications, standards, and governance on top of the data lake's open architecture.

But when we turn to real-time scenarios, Flink is the main engine for real-time computing and Kafka is the usual choice for stream tables. When we need to join stream tables, we may have to pull a separate real-time synchronization task from the database. For analytics that needs high data freshness, we also have to introduce warehouse systems that support near-real-time or real-time updates, such as Kudu or Doris.

This tooling is completely separate from our offline technology stack and tooling, with no good shared specification — development is mostly point-to-point.

For example, if we want both real-time and offline pipelines, we have to build two separate links. With our methodology and the offline workflow, defining an offline pipeline in a standardized way is relatively easy. In real-time scenarios, however, more development work is needed: users have to understand how to work with Flink, how to read Kafka, how to serialize and deserialize along the way, and how to create tables in Kudu. This places a very heavy burden on users.

[Figure]

This is especially true for AI businesses that want to produce data: what they care about are AI-related processes such as training and samples. They know nothing about HBase or KV stores, so they hand the requirement to another team, which can only respond point-to-point.

To summarize the disadvantages the traditional Lambda architecture brings us:

The first is data silos. Using Kudu or other warehouse solutions that sit outside the data lake brings independent procurement and deployment costs, and wastes money on duplicated storage. Because the data is hard to reuse and interconnect, building a real-time warehouse for the same business scenario often means pulling a new copy of the data from the source, wasting both cost and engineering effort.

The second is low R&D efficiency: the development system is fragmented and the standards are not shared. This is typical in AI feature and recommendation scenarios, where users have to figure out when to call the real-time path and when to call the offline path, which makes the entire business layer very complicated.

Finally, there is the problem of ambiguous metrics and semantics. For example, in the past few years we mainly used Kudu as the real-time warehouse solution. Users had to create a warehouse table in Kudu with its own schema, alongside the table created by the data model on Hive, and maintain both sets themselves. When the business logic changed, a user might update Hive but not Kudu, which in the long run leads to ambiguity in metrics and semantics — and the maintenance cost keeps growing over time.

So what does the business expect? A single set of specifications and processes that unifies real-time, offline, and more scenarios such as AI — at the platform layer, the data middle-platform layer, or the data methodology layer. Looking back at the Lakehouse concept, its meaning is precisely to expand the product boundary so that the data lake can serve more streaming and AI scenarios.

In our production environment, what Lakehouse ultimately brings to the business should be a systemic benefit rather than a single feature — say, CDC or one analytics scenario. If a user simply compares Kudu against Hudi or Iceberg, it is hard to articulate what benefit this brings; but if we can tell users that the whole platform unifies offline and real-time out of the box, that is a substantial benefit. With that goal, we built Arctic, a streaming lakehouse service.

Understanding the Arctic streaming lakehouse service

What is Arctic? Simply put, Arctic is the streaming LakeHouse service open sourced by NetEase Shufan. It adds more real-time capabilities on top of Iceberg and Hive — so Arctic emphasizes real-time scenarios first — and provides out-of-the-box metadata services for DataOps, making the data lake more usable and practical. Summarized in one sentence it sounds abstract; the functional examples and practical details shared below should make it clearer what Arctic is.

Ecological niche differences

First, this picture emphasizes the difference in ecological niche. Arctic sits above the Table format layer, so strictly speaking, Arctic should not be regarded as yet another Iceberg or yet another Table format.

[Figure]

On the Table format side, we mainly aim for compatibility with the open source formats. A core goal of Arctic is to help enterprises make good use of the data lake Table formats, and to close the gap between the Table format and what users and products actually need.

Arctic itself contains two core components. The first is AMS, the metadata service, which we position as the next-generation HMS in our system. The second is continuous self-optimization: a complete set of optimizer components and mechanisms that perform data optimization in the background.

Tablestore Design and Benefits

We talked with many users about Arctic before open sourcing it, and the first question from most of them was what exactly our relationship with open source Iceberg is. This picture illustrates it. Arctic has the concept of a Tablestore, a storage unit somewhat similar to a clustered index in a traditional database. For streaming writes, we use a change Tablestore to store the CDC data being written; this data is similar to a database's binlog or redo log. The change table can later be used for CDC replay, or accessed as a standalone table.

[Figure]

Hudi and Iceberg also have upsert capability, but Iceberg did not have it when we started in 2020. Because of the rigorous design of its manifest layer, the community made some compromises in the implementation, so we ultimately decided to build this in the layer above, which also gives us some advantages.

The change table mainly stores CDC change data, while a separate Basestore stores the existing (base) data; the two Tablestores are in fact two independent Iceberg tables. In addition, we can optionally integrate a Kafka-based logstore, meaning data can be double-written — first into Kafka and then into the data lake — which unifies the stream table and the batch table.

What are the advantages of this design? First, CDC data in the change table can be replayed in order, which addresses the fact that Iceberg's native V2 CDC is not well suited for replay.

Second, the change table is open for access. In many e-commerce and logistics scenarios, change data is not just internal bookkeeping for a table; the change streams of order and logistics tables are often used as independent warehouse tables in their own right. This design allows the change table to be used separately, with some write protection added, and it also leaves room for future customization, such as adding extra fields to the change table or applying a business's own UDF logic.

The third point is the conversion between change and base in our design, a process we call optimize. Delta, Iceberg, and Hudi all have a similar concept, whose core is merging small files. We also include converting change data into base data in the scope of optimize, and these processes are transparent to users. If users work with Iceberg or Delta directly, every optimize operation surfaces as a snapshot at the bottom layer, which is not user-friendly; we encapsulate all of this at the top layer. When a user reads high-freshness data for analysis, the engine performs a merge-on-read of change and base.
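The following is a minimal sketch of the merge-on-read idea described above: keyed change records are applied over base rows at query time. It illustrates the concept only and is not Arctic's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative merge-on-read: apply ordered change records over base rows by primary key.
public class MergeOnReadSketch {

    enum ChangeKind { INSERT, UPDATE, DELETE }

    record Row(long key, String value) {}

    record Change(ChangeKind kind, Row row) {}

    static Map<Long, Row> mergeOnRead(List<Row> baseRows, List<Change> changes) {
        Map<Long, Row> view = new LinkedHashMap<>();
        baseRows.forEach(r -> view.put(r.key(), r));
        // Changes are applied in write order, like replaying a binlog over the base data.
        for (Change c : changes) {
            switch (c.kind()) {
                case INSERT, UPDATE -> view.put(c.row().key(), c.row());
                case DELETE -> view.remove(c.row().key());
            }
        }
        return view;
    }

    public static void main(String[] args) {
        List<Row> base = List.of(new Row(1, "a"), new Row(2, "b"));
        List<Change> changes = List.of(
                new Change(ChangeKind.UPDATE, new Row(1, "a2")),
                new Change(ChangeKind.DELETE, new Row(2, "b")),
                new Change(ChangeKind.INSERT, new Row(3, "c")));
        // Expected view: key 1 updated, key 2 deleted, key 3 inserted.
        System.out.println(mergeOnRead(base, changes));
    }
}
```

The background optimize process periodically folds these change records into the base data, so the amount of data that has to be merged at read time stays small.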

Arctic Architecture and Components

With the concept of a Tablestore understood, Arctic's architecture and components are easier to follow. At the data lake layer we have change files and base files, corresponding to the changestore and basestore respectively. The Tablestore concept is not limited to CDC scenarios; in the future, specific needs such as a Z-order layout could be handled by adding an independent Tablestore at the upper layer.

[Figure]

At the upper layer we have AMS (Arctic Meta Service), introduced earlier. AMS is the component emphasized in the "service" part of the Arctic streaming lakehouse service, and it is a metadata center built around triples.

What is a triple? It is catalog.db.table. As we know, since Spark 3.0 and Flink 1.12 the main direction has been multi-catalog support, which adapts to different data sources. Mainstream big data practice today uses HMS as the metadata center, but HMS is a two-tuple structure; extending it to more data sources requires a lot of custom work inside HMS. NetEase Shufan's Youshu platform, for example, actually built a metadata center outside HMS to manage the relationship between triples and data sources. AMS itself is a metadata service designed for triples.

Second, AMS can synchronize with HMS: the schema can be stored in HMS, while the additional component information and extra properties that Hive cannot store live in AMS. AMS can thus serve alongside HMS, so a business does not have to do a hard replacement when adopting Arctic — the migration can be a very gradual, grayscale process.

The third is that AMS provides APIs for transactions and conflict resolution.

For the optimizer, we have a complete extension and management mechanism. First there is the concept of an optimizer container, which is essentially a component for scheduling platform tasks. The entire background optimization process is transparent to the business, so a scheduling service is needed in the background to dispatch the optimize process onto a platform such as YARN or Kubernetes; each such mode is an optimizer container. In the future, users will also be able to extend the scheduling framework through the container interface.

The optimizer group is used for resource isolation inside a container. For example, if a user decides that certain tables need high-priority optimization, they can be given an independent optimizer group to run their optimization tasks.
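To make the container/group idea more tangible, here is a hypothetical sketch of the abstraction. The names and interfaces are invented for illustration and are not Arctic's actual API:

```java
import java.util.List;

// Illustrative only: models the idea of optimizer containers (where optimize tasks run,
// e.g. YARN or Kubernetes) and optimizer groups (resource isolation inside a container).
public class OptimizerSchedulingSketch {

    record OptimizeTask(String table, long pendingBytes) {}

    interface OptimizerContainer {          // pluggable scheduling backend
        void launch(String group, OptimizeTask task);
    }

    record OptimizerGroup(String name, int parallelism, List<String> tables) {}

    static class LoggingContainer implements OptimizerContainer {
        @Override
        public void launch(String group, OptimizeTask task) {
            System.out.printf("group=%s table=%s pendingBytes=%d%n",
                    group, task.table(), task.pendingBytes());
        }
    }

    public static void main(String[] args) {
        // A high-priority group gets its own isolated resources for its tables.
        OptimizerGroup highPriority = new OptimizerGroup("high", 8, List.of("db.orders"));
        OptimizerContainer container = new LoggingContainer();  // stands in for YARN/K8s
        container.launch(highPriority.name(), new OptimizeTask("db.orders", 512L << 20));
    }
}
```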

The third point is that our architecture includes a standalone Dashboard, which is also one of our management interfaces; we pay great attention to the management experience of the lakehouse itself.

The last point is also very important: as just mentioned, Arctic is fully compatible at the Table format level. Two formats are currently available. One is Iceberg — since we are built on Iceberg, the basestore and changestore are independent Iceberg tables, our compatibility tracks Iceberg's iterations, and we are currently compatible with Iceberg V2.

The other is a Hive-compatible mode, which lets businesses use Arctic's main features directly without changing code. When the Hive-compatible format is used, the change data still lives in Iceberg.

Management features

As mentioned earlier, Arctic pays great attention to the management experience, especially for the continuous background optimization, where a full set of features and corresponding measurement capabilities is available. As shown in the figure below, which tables are being optimized, the resources used, the duration, and how to schedule resources more reasonably in the future — all of this is exposed through the management features.

[Figure]

The table service exposes a large amount of table metadata, including each table's dynamic changes, DDL history, and transaction information.

[Figure]

Concurrency conflict resolution

When we use a Table format for stream-batch unification on the same source — for example, in the upper part of the figure below — a Flink task continuously synchronizes CDC data. If we then need a data rollback or correction, say adding a column with a default value that must be backfilled in batch, a Spark job will run concurrently with the Flink job. If the Spark job and the Flink job touch the same rows — the same primary keys — we run into a data conflict.

[Figure]

Table formats today generally provide optimistic concurrency control: when a conflict occurs, one of the commits is made to fail. In other words, the core of optimistic concurrency control is that true concurrency is not allowed. In our scenario the Spark job might never commit successfully, because it is expected to rewrite all the data and therefore will always conflict with the real-time stream. But the business certainly wants that data to land, so we provide a conflict-resolution mechanism that lets the commit succeed while still preserving transactional consistency.

The second half of the figure is similar: two ad-hoc updates, c1 and c2, run concurrently against a lakehouse table. c1 commits after c2, but c1 starts before c2. When they conflict, should c1 override c2, or c2 override c1? Current data lake solutions generally let whoever commits later prevail, but in many production scenarios we need the outcome to follow who started first. We won't expand on this timing relationship here; if you have questions, you are welcome to discuss them in depth in the user group.
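The sketch below is illustrative only: it shows how the visible value of a conflicting primary key flips depending on whether concurrent transactions are ordered by commit time (the usual optimistic-concurrency outcome) or by start time (the time relationship discussed above). It is not Arctic's actual resolution algorithm:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the final value of a conflicting key depends on how concurrent
// transactions are ordered — by commit time or by start time.
public class ConflictOrderingSketch {

    record Txn(String name, long startTime, long commitTime, Map<Long, String> writes) {}

    static Map<Long, String> apply(List<Txn> txns, Comparator<Txn> order) {
        Map<Long, String> table = new HashMap<>();
        txns.stream().sorted(order).forEach(t -> table.putAll(t.writes()));
        return table;
    }

    public static void main(String[] args) {
        // c1 starts before c2 but commits after it; both write primary key 1.
        Txn c1 = new Txn("c1", 100, 400, Map.of(1L, "from-c1"));
        Txn c2 = new Txn("c2", 200, 300, Map.of(1L, "from-c2"));
        List<Txn> txns = List.of(c1, c2);

        // Ordered by commit time, c1 is applied last and its write is visible.
        System.out.println(apply(txns, Comparator.comparingLong(Txn::commitTime)));
        // Ordered by start time, c2 is applied last and its write is visible.
        System.out.println(apply(txns, Comparator.comparingLong(Txn::startTime)));
    }
}
```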

Arctic auto bucketing

Arctic has also done a lot of work on performance. We are currently built on Iceberg, which is a very flexible and open Table format, but it does not consider how data and its corresponding updates within a partition should be mapped to each other to improve performance.

On top of Iceberg we built auto bucketing, which is similar to the file_group concept in Hudi, except that we do not expose concepts like file_group or file_index to the user. We provide grouping above the file level in an extensible way: by splitting a binary tree, the data volume of each node is kept as close as possible to the user-configured target size — for example, Iceberg's default of 128 MB — through the background optimization mechanism.

[Figure]

When a node's data exceeds that threshold, we try to split it. As mentioned earlier, the changestore and basestore are managed the same way, so each node maps change data to base data at a finer granularity, which greatly improves analysis performance.
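Here is a simplified sketch of the splitting idea: records are routed by primary-key hash through a binary tree, and a node splits into two children once it grows past the target size. It illustrates the mechanism only and is not Arctic's code:

```java
// Simplified illustration of hash-based auto bucketing with binary-tree splitting:
// records are routed by primary-key hash; when a node exceeds the target size it
// splits into two children that partition its hash space by the next bit.
public class AutoBucketingSketch {

    static final long TARGET_BYTES = 128L << 20;  // ~128 MB target per node

    static class Node {
        final int depth;
        final int prefix;        // the low `depth` bits of the hash that route here
        long bytes;
        Node left, right;        // children after a split

        Node(int depth, int prefix) { this.depth = depth; this.prefix = prefix; }

        void write(long primaryKey, long recordBytes) {
            if (left != null) {                       // already split: route by the next hash bit
                int bit = (Long.hashCode(primaryKey) >>> depth) & 1;
                (bit == 0 ? left : right).write(primaryKey, recordBytes);
                return;
            }
            bytes += recordBytes;
            if (bytes > TARGET_BYTES) {               // too big: split into two children
                left = new Node(depth + 1, prefix);
                right = new Node(depth + 1, prefix | (1 << depth));
                bytes = 0;                            // existing data is rewritten by background optimize
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node(0, 0);
        for (long key = 0; key < 4_000; key++) {
            root.write(key, 100_000);                 // ~100 KB per record, ~400 MB total
        }
        System.out.println("root split: " + (root.left != null));
    }
}
```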

The same mechanism is also used in the merge-on-read process. Around 2000 there was a Berkeley paper describing this scheme; interested readers can look it up.

[Figure]

Arctic performance test

For a streaming lakehouse — a real-time streaming warehouse practice on the data lake — there is no good benchmark tool to define performance. We thought about and explored this quite a bit. Our current approach follows the one used for HTAP benchmarks: based on TiDB's introduction, we found CHbenchmark, a concept that has been around in the industry for a long time, and adapted it.

CHbenchmark runs both TPC-C and TPC-H against the same database. As shown on the left of the figure below, 6 tables overlap and are used by both TPC-C and TPC-H, 3 tables are referenced only by TPC-C, and 3 only by TPC-H.

[Figure]

Based on this, we made an adaptation. First, TPC-C runs against the database. Then a Flink CDC job synchronizes the database into the Arctic data lake in real time, building a streaming lakehouse with minute-level freshness. On top of that we run the TPC-H part of CHbenchmark, giving us the data analysis performance of a standard streaming lakehouse.
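For reference, the database-to-lake synchronization step can be expressed with a Flink CDC source table. The connector options follow the flink-cdc-connectors project; the host, credentials, schema, and target table below are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcSyncSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // MySQL CDC source (flink-cdc-connectors); host, credentials and table
        // names are placeholders for the TPC-C database being synchronized.
        tEnv.executeSql(
            "CREATE TABLE orders_source (" +
            "  o_id BIGINT," +
            "  o_carrier_id INT," +
            "  PRIMARY KEY (o_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'tpcc-db'," +
            "  'port' = '3306'," +
            "  'username' = 'user'," +
            "  'password' = 'pass'," +
            "  'database-name' = 'tpcc'," +
            "  'table-name' = 'orders'" +
            ")");

        // Continuously upsert the change stream into a lake table (assumed to be an
        // already-registered primary-key table on the lake, e.g. an Arctic/Iceberg v2 table).
        tEnv.executeSql("INSERT INTO lake_orders SELECT o_id, o_carrier_id FROM orders_source");
    }
}
```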

We compared Arctic, Iceberg, and Hudi before and after optimization (tested under Trino), broken down by stage into four intervals: 0-30, 30-60, 60-90, and 90-120 minutes. The blue part of the figure below is the analysis performance without optimize: from the 0-30 minute interval to the 90-120 minute interval, latency grows from about 20 seconds to more than 40 seconds, more than doubling. With Arctic's continuous merging enabled, performance stays stable at around 20 seconds.

[Figure]

The gray line is the native Iceberg upsert scheme: around 30 seconds in the 0-30 minute interval, then performance degrades sharply from 30-60 minutes. Why does Iceberg slump so badly? Because native Iceberg has no fine-grained mapping between insert data and delete data, so as streaming files keep being written, each insert file becomes associated with many delete files, and merge-on-read performance drops dramatically. In the 60-90 and 90-120 minute intervals it simply runs out of memory and cannot finish.

The yellow part is Hudi. Like Hudi, Arctic keeps analysis performance stable through background optimization; at present, thanks to the optimizations in our upper layer, it performs better than Hudi. We will publish the full test process and related configurations on the official website.

Arctic currently has an advantage over Hudi in merge-on-read performance, though we don't want to overstate how well Arctic has done. We have also studied Hudi: it has RO and RT modes, where the former reads only merged data and RT is the merge-on-read mode. The performance gap between RO and RT is very large, so there is probably a lot of room for optimization there in the future.

Arctic roadmap and summary

Finally, a brief summary of the Arctic roadmap and the system as a whole. Arctic is a streaming lakehouse service, and it provides core features corresponding to streaming, lakehouse, and service respectively.

At the streaming level, we provide efficient streaming upserts on primary keys, auto bucketing, schema freedom, and merge-on-read for Spark and Trino, delivering data analysis with minute-level freshness.

At the lakehouse level, we provide format compatibility: 100% compatibility with the table format syntax of Iceberg and Hive. If Arctic lacks a feature that Iceberg has, users only need to switch to the Iceberg catalog and use the Arctic table as an Iceberg table; we also expose both the base and change tables for access.

For engines, Spark and Flink are supported for reading and writing data, and Trino and Impala for queries. Impala support currently relies on the Hive compatibility features, so Impala queries Arctic tables as Hive tables.

[Figure]

In the service part, the emphasis is on management capabilities:

The first is encapsulating the data lake and the message queue into a single unified table, unifying the stream table and the batch table, so that users of Arctic tables do not have to accept a drop from second- or millisecond-level latency to minutes — the table can still provide millisecond- or second-level data latency.

The second is standardized metrics for the streaming lakehouse, plus dashboards and related management tools.

The third is resolving concurrent write conflicts while preserving transactional consistency semantics.

At the management level, we focus on answering the following questions, where there is still a long way to go.

The first is how to quantify a table's real-time performance: once a streaming lakehouse table is built, is its freshness one minute, two minutes, or more, and does it meet user expectations?

The second is how to give users a way to trade off freshness, cost, and performance.

The third is how much room there is to optimize query performance, and how many resources need to be invested to do so.

The fourth is how to quantify the resources spent on data optimization and how to use them most effectively. Users who dig into Arctic will find that our optimization differs a lot from Hudi's: our optimization is scheduled at the platform level rather than inside each write job, so these optimization resources can be managed centrally and iterated on quickly. For example, when we find that a certain optimization greatly improves merge efficiency, we can roll it out quickly.

The last is how to allocate resources elastically and schedule them by priority.

For future work on core features, we will first implement Flink lookup joins that do not depend on an external KV store. In an architecture we saw earlier, a dimension-table join in a real-time scenario may require synchronizing to an external KV store. We are working on a scheme that needs no external synchronization and does the dimension-table join directly against the Arctic table.
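As an illustration of the intended usage, a dimension-table (lookup) join in Flink SQL uses the FOR SYSTEM_TIME AS OF clause. Today the dimension table usually has to be an external KV store; the planned feature would let it be an Arctic table directly. Table and column names below are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LookupJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assumes `orders` (a stream with a processing-time attribute `proc_time`)
        // and `user_dim` (the dimension table) are already registered. Today
        // `user_dim` is typically backed by an external KV store; the planned
        // feature would allow it to be an Arctic table with no external sync.
        tEnv.executeSql(
            "SELECT o.order_id, d.user_name " +
            "FROM orders AS o " +
            "JOIN user_dim FOR SYSTEM_TIME AS OF o.proc_time AS d " +
            "ON o.user_id = d.user_id");
    }
}
```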

The second is streaming updates of partial columns. Today we mainly do streaming upserts through CDC, but in many scenarios, such as features and wide tables, we need to be able to update only some columns.

After that comes support for more optimizer containers, such as K8s, and more SQL syntax, such as merge into. At present Arctic does not have this syntax itself; users can use an Arctic table as an Iceberg table to get merge into. If merge into is supported at the Arctic level in the future, it will differ from Iceberg's, because our change data first enters the change space.
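For reference, this is what merge into looks like today via the Iceberg route, using Spark SQL with Iceberg's session extensions. The catalog name, warehouse path, and table names are placeholders:

```java
import org.apache.spark.sql.SparkSession;

public class MergeIntoSketch {
    public static void main(String[] args) {
        // Iceberg's Spark session extensions enable MERGE INTO; the catalog name
        // and warehouse path here are placeholders.
        SparkSession spark = SparkSession.builder()
                .appName("merge-into-sketch")
                .config("spark.sql.extensions",
                        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.local.type", "hadoop")
                .config("spark.sql.catalog.local.warehouse", "hdfs://warehouse/path")
                .getOrCreate();

        // Upsert a batch of corrections into the target table through the Iceberg catalog;
        // with an Arctic table used as an Iceberg table, this is the current way to run merge into.
        spark.sql(
            "MERGE INTO local.db.orders t " +
            "USING local.db.order_updates s " +
            "ON t.order_id = s.order_id " +
            "WHEN MATCHED THEN UPDATE SET t.status = s.status " +
            "WHEN NOT MATCHED THEN INSERT *");
    }
}
```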

The last point is that, because our niche sits on top of the data lake Table format, we will decouple the architecture in the future to extend to more Table formats, such as Delta and Hudi.

Finally, a word about why we open sourced Arctic. In the past our open source efforts were not well coordinated; last year our leadership decided to open source in a more focused way. Taking Arctic as an example, we will not hold anything back for commercial reasons, and organizationally our team's open source work is an independent effort — if commercialization becomes possible, it will be driven by other teams.

We are committed to building an open and free data lake technology community for developers, users, and members. For now we mainly target domestic (Chinese) users, and the official website is mainly in Chinese. We hope more developers will join the project.

That's all I have to share today, thank you all!

Q&A

Moderator: Have you been following Flink Table Store, and how does it differ from Arctic?

Ma Jin: Yes. What we are doing is quite similar. We saw that proposal in the Flink community last year, and I understood that Flink would certainly do something like this — they also want to build a complete ecosystem of their own, much like Delta does for Spark. But although the work is similar, the goals are not the same. Flink does it more natively for streaming scenarios, and certainly does not think as much as we do about how to serve Spark and other engines, or how to provide more management capabilities at the upper layer. So beyond some functional overlap, I think there are still differences in intent and in the problems each ultimately solves.

Moderator: So although they look similar, the Flink Table Store approach is closer to native Flink scenarios, whereas beyond Flink compatibility you also have more Spark-oriented scenarios to support.

Ma Jin: Not only Spark — we also provide Hive compatibility. If you are a Hive user, how do you smoothly upgrade existing Hive tables to the new lake-warehouse-unified architecture? That is something our system considers, along with what management features and metrics to provide during that process. These concerns may differ from what Flink Table Store considers.

Moderator: You just mentioned that Arctic's bottom layer is based on Iceberg. Is there a strong binding in the code? Will you consider building on other Table formats in the future?

Ma Jin: We have been through some changes here. The standard we now set for ourselves is not to intrude into the format's internal implementation, and not to fork or magically modify the open source code. Early on we did not have such a clear goal, so there were some changes made on top of Iceberg. Our code and Iceberg can now be decoupled fairly cleanly, although we have not fully done so yet — for example, the Schema definitions still come from the Iceberg package — but that is very easy to decouple. There is a design intent here: a product has to think about how to use the data lake, including questions like whether Iceberg or Delta is more likely to become mainstream. We hope users can avoid that dilemma: at the upper layer we provide the unified Lakehouse capabilities they need, and we make the selection at the lower layer.

Moderator: In other words, you don't make the final decision for users, but offer more possibilities — whether the future belongs to Iceberg or Delta, you can stay compatible in an open way.

Ma Jin: That is the long-term view; for now we are more tightly integrated with Iceberg.

Guest introduction: Ma Jin, big data real-time computing technology expert at NetEase Shufan and head of the lakehouse unification project. He is responsible for NetEase Group's distributed databases, data transfer platform, real-time computing platform, real-time data lake, and other projects. He has long worked on middleware and big data infrastructure, and currently leads the team focusing on platform solutions and technology evolution for stream-batch unification and lake-warehouse unification, as well as the open sourcing of the Arctic streaming lakehouse service.

Arctic Documentation: https://arctic.netease.com/ch/
GitHub address: https://github.com/NetEase/arctic
Video watch: https://www.bilibili.com/video/BV1Nd4y1o7yk/
Communication group: Add "kllnn999" as a friend on WeChat, and indicate "Arctic Communication"
Learn more: Let's talk about what kind of data lakes we need, starting with Delta 2.0

