The Exploration and Practice of NetEase Shufan's Real-Time Data Lake Arctic
Author | Cai Fangfang
Interviewee | Ma Jin, NetEase Shufan platform development expert
The data middle platform must also move from offline to real-time, and the lakehouse is the first step.
Moving data from offline to real-time is a major trend today, but two difficulties remain in building and applying real-time data. The first is the inconsistency between the real-time and offline technology stacks, which leads to duplicated investment in systems and R&D and prevents the data models and code built on top of them from being unified. The second is the lack of data governance: real-time data is usually not brought under data middle platform management, so it lacks modeling standards and suffers from poor data quality. In response to these two problems, NetEase Shufan recently launched Arctic, a real-time data lake engine. According to the team, Arctic supports real-time data updates and ingestion, can connect seamlessly to the data middle platform and bring data governance into the real-time domain, and also supports batch queries and incremental consumption, unifying stream tables and batch tables.
This is the first time that NetEase Shufan, as NetEase's basic software team, has publicly shared its progress in the lakehouse direction; at the same time, it announced NetEase Shufan's real-time data middle platform strategy. To gain an in-depth understanding of NetEase Shufan's exploration and thinking on the lakehouse, as well as the design ideas and product positioning of the real-time data lake engine Arctic, InfoQ interviewed Ma Jin, a NetEase Shufan platform development expert, to discuss these issues one by one.
What kind of lakehouse will NetEase Shufan build?
Lakehouse originally referred to an emerging data architecture that combines the data lake and the data warehouse and has the advantages of both. Today it is no longer a purely technical concept, but has also taken on meaning at the vendor and product level. As the lakehouse grows more popular, different vendors have offered their own interpretations of it. Before discussing NetEase's lakehouse practice further, we must first understand how NetEase Shufan understands "lakehouse".
The NetEase Shufan team's lakehouse work stems mainly from a pain point in real application scenarios: in big data scenarios, the processing links for real-time and offline data are separated, and real-time and offline data are kept in two different storage schemes. On the one hand, the cost of duplicated construction and maintenance is high; on the other hand, the results achieved on either side are not well reused by the other. So the team's initial goal was actually stream-batch unification, that is, unifying the processing and storage of real-time and offline data.
Why, then, did this evolve into a lakehouse? Ma Jin divides stream-batch unification into three levels, namely unified stream-batch storage, unified stream-batch development, and unified stream-batch tooling, and he offers this equation:
"Integration of storage, flow and batch = Integration of lake and warehouse = Realize all data warehouse functions based on the data lake"
In essence, offline data warehouse storage corresponds to data lake technology, such as Hive in the Hadoop ecosystem. Correspondingly, the real-time data warehouse corresponds to the capabilities of traditional data warehouses, such as commercial databases like Greenplum, Teradata, and Oracle, which all support streaming updates and ACID and can handle some real-time reporting work. The NetEase Shufan team hopes that an offline data warehouse built on data lake concepts can gain real-time computing capability and ACID guarantees, that is, the capabilities of a traditional data warehouse. This combination of the data lake with traditional data warehouse capabilities is the lakehouse the NetEase Shufan team set out to build, and based on this goal it built the real-time data lake engine Arctic.
In terms of implementation path, Arctic's original requirement was to solve the "warehouse" problem on top of the data lake. The team's plan is to deliver the "warehouse" functionality first, and then extend the implementation to the "lake" functionality once the warehouse-related work is done.
Logical data lake and lakehouse: two solutions for the same scenario
Besides the lakehouse, InfoQ noted that NetEase Shufan has also publicly mentioned another concept on many occasions: the logical data lake. Yu Lihua, director of the NetEase Data Science Center and general manager of NetEase Shufan's Youshu products, once said in an InfoQ interview that the logical data lake is a more cost-effective approach. This raises some questions: why did the concept of the logical data lake appear, and how should we understand its relationship to the lakehouse and the data middle platform?
Ma Jin said that the logical data lake and the lakehouse are two solutions for the same scenario, and in essence both serve the data middle platform. The logical data lake is "physically decentralized, logically unified", while the lakehouse is "physically unified". They are two branches of the same problem.
The data middle platform provides a set of data governance and data development methodologies, which are mainly business-oriented. Data modeling, data development, and data operations all fall under a single governance system. But looking down from the middle-platform module products, the solutions diverge.
The logical data lake respects the historical baggage of the business. For example, a customer previously used data warehouses such as Greenplum and Oracle, and hopes to build the data middle platform directly on those warehouses without migrating data. From the business perspective, data modeling, data development, and the middle platform's management system are one set, but the underlying data storage can differ. The logical data lake tries to use technology to bridge data silos, for example federated joins across different data warehouses, and can be seen as a solution to this inconsistency: "physically decentralized", meaning the underlying storage can be separate, but "logically unified", meaning the logic of the upper middle platform is unified.
The logical data lake solution mainly serves the needs of some of NetEase Shufan's enterprise customers. Within NetEase Group itself there is little such baggage, almost none in fact, because NetEase's internal systems were built on a Hadoop data lake from the start. But many enterprise customers purchased various databases in the past and later wanted to build their own data lake and data middle platform systems. NetEase Shufan provides them with a logical data lake solution, so customers can keep using their original systems while NetEase Shufan gives them a complete middle-platform entry point that manages the different data silos in a unified way. This is the main scenario for the logical data lake.
By comparison, the lakehouse solution is more thorough. Business scenarios or enterprise customers with no historical baggage can build all their new business on the lakehouse. With the lakehouse, the underlying storage is physically unified, all based on the data lake, and the upper layer is unified as well.
Both schemes serve the middle platform in building a unified data governance logic. Ma Jin explained that their benefits differ: the logical data lake lets users get started quickly and better accommodates an enterprise's historical baggage, while the lakehouse solves business pain points at lower cost. Looking ahead, as cloud computing becomes more widespread, building a data lake on cloud object storage can cut costs by a factor of dozens or even hundreds compared with traditional commercial databases or data warehouses.
The design ideas and positioning of the real-time data lake Arctic
The biggest difference between the core technology of NetEase Shufan's lakehouse solution and Hive's offline data storage is that data is managed at a finer granularity: Hive manages data at the partition level, while NetEase Shufan's lakehouse solution manages it down to the file level. Since the lakehouse underpins the data middle platform above it, it needs to provide the upper layer with a systematic file management scheme covering functions such as file management and file merging. Fine-grained file management is therefore the primary requirement.
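The contrast between the two management granularities can be illustrated with a minimal sketch. This is not Hive's or Arctic's actual code; the structures, the 32 MB threshold, and the `plan_compaction` helper are all hypothetical, chosen only to show why file-level metadata makes operations like small-file merging possible.

```python
# Partition-level view (Hive-style): the metastore knows only partitions;
# the files inside each partition directory are opaque to it.
partition_index = {
    "dt=2021-10-01": "warehouse/events/dt=2021-10-01",
}

# File-level view (lakehouse-style): metadata lists every data file with
# stats, so small files can be found without listing directories.
file_index = [
    {"partition": "dt=2021-10-01", "path": "part-00000.parquet", "size_bytes": 128 * 2**20},
    {"partition": "dt=2021-10-01", "path": "part-00001.parquet", "size_bytes": 2 * 2**20},
    {"partition": "dt=2021-10-01", "path": "part-00002.parquet", "size_bytes": 3 * 2**20},
]

SMALL_FILE_THRESHOLD = 32 * 2**20  # 32 MB, an assumed threshold

def plan_compaction(files):
    """Pick small files that a background merge job could rewrite into one."""
    return [f["path"] for f in files if f["size_bytes"] < SMALL_FILE_THRESHOLD]

print(plan_compaction(file_index))
# ['part-00001.parquet', 'part-00002.parquet']
```

With only the partition index, a compaction planner would have to list and stat the directory itself; with per-file metadata the plan falls out of a simple filter.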
After investigation, the team finally chose Apache Iceberg. The main consideration was that Iceberg's metadata management is itself file-oriented: it has a complete manifest mechanism that can track every file in a table. Iceberg also provides ACID transaction guarantees and MVCC as its foundation, ensuring data consistency while remaining scalable.
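The snapshot-based MVCC model mentioned here can be sketched in a few lines. This is a deliberately simplified simulation, not the real Iceberg API: each commit produces a new immutable snapshot listing the table's data files, and a reader that pins a snapshot is unaffected by concurrent commits.

```python
class Table:
    """Toy model of snapshot-based table metadata (not Iceberg itself)."""

    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty table

    def commit_append(self, new_files):
        # A commit never mutates an old snapshot; it adds a new one.
        current = self.snapshots[-1]
        self.snapshots.append(current + list(new_files))
        return len(self.snapshots) - 1  # new snapshot id

    def scan(self, snapshot_id=None):
        # Readers resolve a snapshot once and see a consistent file list.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])

t = Table()
s1 = t.commit_append(["a.parquet"])
reader_view = t.scan(s1)             # a long-running query pins snapshot s1
s2 = t.commit_append(["b.parquet"])  # a concurrent write commits snapshot s2
print(reader_view)   # ['a.parquet'] - unchanged by the concurrent commit
print(t.scan())      # ['a.parquet', 'b.parquet'] - the latest snapshot
```

In real Iceberg the snapshot points to manifest files rather than holding paths inline, and commits go through optimistic concurrency control, but the isolation property shown here is the same.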
On top of Iceberg, the team built its own real-time ingestion, file indexing, data merging, and a complete set of metadata management services.
Technical Selection
According to Ma Jin, during the earliest technology selection the team also investigated Apache Hudi and Delta Lake, open source projects of the same type as Iceberg, but eventually passed on them for several reasons. At the time of the research, Hudi was still relatively closed: it positioned itself as a library for Spark, and only began treating Flink support as higher-priority work from the end of last year into this year, while NetEase Shufan needed an open solution to meet highly customized requirements.
There were also technical details. For example, in terms of data format, Hudi's file index uses Bloom filter and HBase mechanisms, neither of which is ideal: HBase introduces a third-party KV database, which is unfavorable for commercial delivery, while the Bloom filter approach is fairly heavyweight and significantly reduces real-time performance. Neither suited NetEase Shufan's selection. NetEase Shufan's ideas and designs for Arctic's core functions also differ from Hudi's.
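To make the trade-off concrete, here is a minimal sketch of a Bloom-filter file index of the general kind the text attributes to Hudi (heavily simplified, not Hudi's implementation; the sizes and hashing scheme are assumptions). Each data file carries a small filter over its record keys; a key lookup can skip files whose filter definitely does not contain the key. The write-path cost is visible too: every written key must be hashed into the filter.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter; no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

# One filter per data file, built at write time (the extra work on ingestion).
file_filters = {}
for path, keys in {"f1.parquet": ["u1", "u2"], "f2.parquet": ["u3"]}.items():
    bf = BloomFilter()
    for k in keys:
        bf.add(k)
    file_filters[path] = bf

def candidate_files(key):
    """Files that may contain the key; others are safely skipped."""
    return [p for p, bf in file_filters.items() if bf.might_contain(key)]

print(candidate_files("u3"))  # includes 'f2.parquet'; others only on a rare false positive
```

The lookup side is cheap, but maintaining one filter per file on a high-throughput real-time write path is exactly the overhead the interview points at.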
Delta Lake was not selected because it does not prioritize real-time performance. By studying the related papers, Ma Jin's team found that Delta Lake still treats the Spark ecosystem as its first priority, which diverges somewhat from the team's lakehouse goals.
By contrast, Iceberg is relatively more open. It integrates well with compute engines, upper-layer metadata, and different systems, which satisfies the team's highly customized needs. The team therefore finally chose Iceberg, the better to implement its ideas and build features unique to NetEase Shufan.
Based on Iceberg, but not limited to Iceberg
Although Arctic is built on Iceberg, Ma Jin believes that in terms of community positioning, Hudi is the project most similar to Arctic. A very important function of a data lakehouse is row-level updates based on a primary key. Hudi overlaps with Arctic functionally, though the core designs differ, and Hudi is also the most representative project for real-time ingestion into the lake. So when Arctic runs performance comparison tests, it compares against Hudi rather than Iceberg.
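The row-level update semantics described here can be shown with a small sketch. This is not Arctic's or Hudi's actual implementation; it is a copy-on-write style illustration under an assumed record shape, where incoming records replace existing rows sharing the same primary key instead of being appended as duplicates.

```python
def upsert(table, records, primary_key="id"):
    """Copy-on-write style upsert: return the table rewritten with new
    rows merged in by primary key (replace on match, insert otherwise)."""
    by_key = {row[primary_key]: row for row in table}
    for rec in records:
        by_key[rec[primary_key]] = rec
    return sorted(by_key.values(), key=lambda r: r[primary_key])

base = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
updated = upsert(base, [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}])
print(updated)
# [{'id': 1, 'amount': 10}, {'id': 2, 'amount': 25}, {'id': 3, 'amount': 30}]
```

A plain append-only lake table would instead end up with two rows for `id` 2, pushing the deduplication burden onto every query; primary-key upsert is what lets the lake behave like a warehouse table.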
In fact, when the NetEase Shufan team first built the Arctic product, it did not intend to bind to any open source data lake solution, Iceberg included.
Initially, the team hoped to build a stream-batch unified lakehouse on top of the data lake. By formulating a scheme for managing Base data (i.e., stock data) and Change data (i.e., incremental or real-time data), the two kinds of data could be decoupled from the underlying lake: whatever data lake technology sits at the bottom, whether Iceberg or Delta Lake, the same lakehouse solution is exposed. That was Arctic's initial positioning: not tightly bound to any data lake base. But achieving this requires extremely high R&D investment and is hard to do in one step. So in the early stage the team positioned Arctic first to meet NetEase's internal lakehouse business goals, managing, on a single data lake base, the read-time merging, asynchronous merging, metadata services, small file management, and other capabilities involved in real-time lake ingestion at the upper layer. With that base in place, the upper-layer services can be built on it, and support for building the lakehouse on different data lakes can be considered later.
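The Base/Change split and the read-time merging mentioned above can be sketched as follows. The structures and function are hypothetical, not Arctic's real format: Base files hold stock data, Change data is an ordered log of inserts, updates, and deletes, and a read merges the two on the fly so queries see the latest state without rewriting Base.

```python
def merge_on_read(base_rows, change_log, primary_key="id"):
    """Merge an ordered change log onto base rows at query time."""
    state = {row[primary_key]: row for row in base_rows}
    for op, row in change_log:  # the change log is replayed in commit order
        if op in ("insert", "update"):
            state[row[primary_key]] = row
        elif op == "delete":
            state.pop(row[primary_key], None)
    return sorted(state.values(), key=lambda r: r[primary_key])

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [
    ("update", {"id": 2, "v": "b2"}),
    ("insert", {"id": 3, "v": "c"}),
    ("delete", {"id": 1}),
]
print(merge_on_read(base, changes))
# [{'id': 2, 'v': 'b2'}, {'id': 3, 'v': 'c'}]
```

Because the merge logic touches only these two abstract inputs, the underlying lake format that stores the Base and Change files is, in principle, swappable, which is exactly the decoupling the initial positioning aimed for; the asynchronous merging the team mentions would periodically fold the change log into new Base files so reads stay cheap.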
This may seem to add another service layer to an already complex data system, but Ma Jin said that is not the case. First, the data middle platform already adds a layer on top of Hive. Second, the added functions are really engine-side adaptations plus a separate governance service, and that governance service is a module oriented toward the middle platform, considered part of the whole data middle platform system. It manages the lakehouse's metadata, similar to HMS in Hive, plans data merging, and can interface with different compute engines, such as Presto, Impala, Spark SQL, and Flink.
According to Ma Jin, the team expects to open source Arctic in Q2 next year. The team has long considered how to contribute its homegrown work back to the open source community. Since last year, the NetEase Shufan team has tried to co-build the Iceberg community with some leading Internet companies, hoping to steer the community toward the lakehouse. But the community has its own development plan, and its founder recently left Netflix to start a commercial company built around Iceberg. Pushing the community in a particular direction is costly, and progress is relatively slow.
Therefore, the NetEase Shufan team currently wants to implement all its ideas in Arctic first, so that the entire lakehouse solution can go into operation, then open up the results and discuss further with the community what can be contributed back. For Ma Jin, what matters most is that Arctic can at least run within NetEase over the long term.
Adoption status and challenges
At present, some NetEase Shufan customers are using Arctic, and many businesses within the group have adopted it. Ma Jin revealed that according to statistics from a recent interim report, about 600TB of data within NetEase Group is already on Arctic, and new businesses have begun to try it.
Based on the data source, Ma Jin divides Arctic's user scenarios into two broad categories. Different scenarios use different data architectures, and the migration schemes used when introducing Arctic also differ.
In the first category, the data comes mainly from logs, as with NetEase Cloud Music, NetEase Media, and part of the e-commerce data warehouse system. For log data, the business lines built very sound T+1 data processing solutions several years ago, and now they want to transform the original T+1 offline business into real-time business. After moving to a real-time link, however, they worry about data accuracy, because log data is more prone to disorder and duplication. The Lambda architecture is used more for such log scenarios. Arctic provides an in-place upgrade scheme for Hive: a Hive offline data warehouse table is upgraded to an Arctic table in a specific way, after which the real-time compute engine can write to it while the offline data warehouse retains batch-write capability. The Arctic table automatically switches between real-time and offline to serve different business scenarios. NetEase Group mainly promotes the Lambda architecture because log-type data scenarios predominate within the group.
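The disorder and duplication concern above can be made concrete with a small sketch. The event shape (`event_id`, `event_time`) is hypothetical, not NetEase's schema; the point is that on a real-time link, events can arrive duplicated or out of order, and a correct result must keep exactly one record per event, preferring the latest event time regardless of arrival order.

```python
def deduplicate(events):
    """Keep one record per event_id, choosing the latest event_time,
    independent of the order in which events arrived."""
    latest = {}
    for e in events:  # iterate in arrival order
        cur = latest.get(e["event_id"])
        if cur is None or e["event_time"] > cur["event_time"]:
            latest[e["event_id"]] = e
    return sorted(latest.values(), key=lambda e: e["event_id"])

stream = [
    {"event_id": "e1", "event_time": 100, "action": "click"},
    {"event_id": "e2", "event_time": 105, "action": "view"},
    {"event_id": "e1", "event_time": 100, "action": "click"},  # exact duplicate
    {"event_id": "e2", "event_time": 103, "action": "view"},   # late, older version
]
result = deduplicate(stream)
print([(e["event_id"], e["event_time"]) for e in result])
# [('e1', 100), ('e2', 105)]
```

In the T+1 offline world a nightly batch job gets to see the whole day's data at once, so this cleanup is easy; doing it incrementally on a stream is what makes teams nervous about accuracy, and why the offline link is kept as a safety net in the Lambda setup.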
The Kappa architecture is geared more toward enterprise customers, in traditional industries such as finance, manufacturing, and logistics. Whether real-time or full data, their data mainly comes from databases. Data stored in a database rarely suffers from disorder or duplication, is relatively accurate, and has complete mechanisms to guarantee consistency. In this case there is usually no need for an offline link as a safety net; a real-time link can serve daily needs. But sometimes the database tables change, for example a column is added or removed, the table structure changes, or erroneous data needs large-scale correction. Then batch computation is needed to backfill the original data, and an offline link comes into play.
In summary, NetEase Group mainly carries out Lambda-architecture transformations, while for enterprise customers the main practice is the Kappa architecture. The Internet businesses within NetEase Group and the businesses of traditional enterprise customers have different data processing scenarios and methods, but there is no absolute boundary between the two. There are also potential Kappa scenarios within NetEase Group, such as the Yanxuan e-commerce business, where much real-time data comes from databases.
For implementing and landing the lakehouse solution, the Kappa architecture is the most ideal, because it is naturally real-time, carries no historical baggage, and the cost of building the lakehouse on it is low. A Lambda architecture may already have an offline link, but if that offline part is not standardized enough it needs some transformation itself; in that case the cost of upgrading is higher and the technical work requires more running-in. When promoting the lakehouse solution today, the team therefore tends to pick Kappa-based scenarios, while the Lambda work is mainly co-built with the group's large internal businesses, and that process is relatively slow.
Besides the historical baggage mentioned above, enterprises trying to adopt lakehouse technology face another challenge: organization. In Ma Jin's view, the "offline" genes of the current data middle platform are very strong. Real-time is a relatively independent branch, and real-time computing is used not only in big data scenarios but often in online scenarios as well. Making the entire data middle platform real-time means intruding into the middle platform's architecture, which involves different teams running in together and aligning on goals, and that is difficult to push forward. It is similar to the challenges enterprises faced in implementing the data middle platform strategy over the past two years.
Ma Jin said frankly that when he prepared to work on the lakehouse last year, he faced considerable resistance, because the data middle platform team also had its own plan, such as the logical data lake mentioned above, which solves the problem from another angle. This requires the company's decision-makers to judge the matter very precisely and set corresponding strategic goals. This year, at the NetEase Digital+ Conference, the company officially announced that it would promote the real-time data middle platform as a strategy, which gives NetEase Shufan an advantage in pushing the lakehouse forward.
How far away is the ultimate goal of stream-batch unification?
For NetEase Shufan, the lakehouse (that is, unified stream-batch storage) is a necessary step toward full stream-batch unification. The final vision is to cover both offline and real-time scenarios with one logic and one set of code. If real-time and offline use two sets of storage and two tables, one set of code cannot solve the problem. So unified stream-batch storage must come first, and unified stream-batch development is built on top of it. Once the tools and the teams are unified, the middle platform's modules, such as data models, data assets, and data quality, can also be stream-batch unified, evolving from offline-only functions to real-time ones. This is called unified stream-batch tooling, or more precisely the stream-batch unification of the middle platform's modules; what is finally presented to the front-line business is a real-time data middle platform.
Stream-batch unification has long been the NetEase Shufan team's direction: to make the big data platform real-time, rather than build independent real-time computing. The three levels of stream-batch unification described above are the key improvement directions for NetEase's big data platform going forward.
For unified stream-batch storage, there is now the real-time data lake engine Arctic. The team's follow-up work focuses mainly on performance optimization and self-developed features, such as real-time data ingestion, data merging, and metadata management services, with long-term overall R&D planning behind it. Arctic will also adapt to more compute engines: beyond the already-adapted Flink and Spark, adaptation for Impala is in progress, and Presto will be adapted by the time Arctic is open-sourced next year.
Meanwhile, work on unified stream-batch development and unified stream-batch tooling is also in full swing.
Unified stream-batch development is mainly followed up by the small team responsible for Flink, and is currently in the practice stage. Ma Jin said the community maturity of unified stream-batch computing is much better than that of unified stream-batch storage; NetEase focuses more on business-side practice, striving to deliver tools and platforms for unified stream-batch development next year. Unified stream-batch tooling is being advanced by the whole data middle platform team; overall progress is about 20% to 30%, but it has not yet been released publicly.
In Ma Jin's view, real-time and offline technologies will inevitably converge in the future. From a technical perspective he is relatively optimistic: NetEase Shufan already has corresponding solutions, but large-scale business implementation will take more time. It will take at least two years for more businesses to treat stream-batch unified storage as a relatively standard solution. The pace of the process depends on how urgent each business's own demand for storage-compute separation is.
Objectively speaking, lakehouse technology is not yet very mature in open source at this stage. Ma Jin said companies need to watch the general direction, but whether to adopt it depends on each company's development. A company lacking self-research capability can wait and see until more mature solutions appear. In his view, most current solutions are still at the early-adopter stage, far from widespread application. Companies with some technical strength can first drive usage in internal scenarios, which is what many leading companies, such as Alibaba, Tencent, and ByteDance, do; NetEase likewise first incubates solutions in the group's internal scenarios.
However, NetEase Shufan's work focuses on privatized solutions for enterprises, whereas Alibaba's and Tencent's work focuses on public clouds, where they hope to keep customer solutions within their own ecosystems. NetEase Shufan is more inclined to rely on open source and then go beyond open source, acting as a technology disruptor.
Over the past year, InfoQ has interviewed several big data platform experts on the topics of the lakehouse and stream-batch unification. Although each company has its own interpretation and implementation path, they basically agree on the long-term development trend of the lakehouse and stream-batch unification.
In the long run, whether for Alibaba Cloud, Tencent Cloud, or Databricks, the future development trend of the lakehouse will be the same: building data warehouse capabilities on cheap storage facilities. In the short term, research directions may differ somewhat due to differences in company development strategy and self-positioning.
Even so, the process of technological change is unavoidable. Many companies have already done real-time computing and built a relatively independent architecture for it, and lack strong motivation to upgrade that architecture. This is a bit like the "revolutionizing oneself" often mentioned in the database field, and the data middle platform faces the same problem. But from a development perspective, such a breakthrough is very necessary. If the revolution is carried through and real-time and offline are finally unified, the big data platform will become simpler and more refined; the professional threshold of the tools will keep rising while the cost of using them keeps falling, and users' investment in using the tools will converge.
Making big data platforms and data services fully real-time will inevitably demand adjustments to current production relations and organizational structures. This is a drive toward self-reform and requires a certain courage from the enterprise.
About the interviewee:
Ma Jin is a NetEase Shufan platform development expert and head of the online data and real-time computing team at the NetEase Data Science Center. He is responsible for NetEase Group's distributed database, data transmission platform, real-time computing platform, real-time data lake, and other projects, and has long been engaged in research and practice on middleware and big data infrastructure. He currently leads the team's focus on platform solutions and technological evolution for stream-batch unification and the lakehouse.