[Click to register now]
On August 11, NetEase Shufan will hold the "Enterprise-level Streaming Lakehouse Service Arctic Open Source Conference", inviting leaders of NetEase Shufan's big data product line and its partners to share their thinking on the evolution of data technology and the open-sourcing of Arctic, introduce the Arctic project's progress, roadmap, and community plans, and present practical results and experience from enterprise lakehouse adoption.
The development of data infrastructure has never stopped, and the lakehouse is currently in the limelight.
The lakehouse, as the name suggests, combines the advantages of the data lake and the data warehouse. As enterprises advance their digital transformation, the lakehouse has become not only a hot technology in the open source community, but also a focus of a16z, the top Silicon Valley venture firm, and an important member of many commercial big data product families.
So, will the lakehouse really become the standard for enterprise big data infrastructure? Should we pay attention to this technology? What is its future?
Why do we need the lakehouse?
Borrowing Databricks' definition, a lakehouse platform simultaneously provides the reliability, strong governance, and performance of a data warehouse, and the openness, flexibility, and machine learning support of a data lake. Ma Jin, head of NetEase Shufan's lakehouse project, believes the lakehouse is a new track that carries forward the vigorous ecosystem of Apache Hadoop. Its core feature is building a transaction layer on top of the data lake, grafting advanced data processing and management capabilities onto a low-cost storage architecture. This is an architectural evolution driven by business requirements: the types and scale of business data keep expanding, and the demands on real-time computing keep rising.
Take NetEase as an example. Data production evolved from T+1 offline toward increasingly real-time processing: Apache Kudu was introduced to make up for the Hive offline warehouse's weakness in real-time data updates, forming a Lambda architecture that splits stream and batch (itself a microcosm of the industry's big data evolution). Problems such as data silos, fragmented R&D systems, and ambiguous metric semantics were then gradually exposed, calling for a more elegant, unified data infrastructure, namely the lakehouse. Implementations based on the three open source data lake "musketeers" (Delta Lake, Apache Iceberg, and Apache Hudi) have become a popular choice.
The innovations of NetEase Shufan's streaming lakehouse
Although "lakehouse" is literally a portmanteau of "data lake" and "data warehouse", turning it into a production-grade technology is not as simple as 1+1=2. In Ma Jin's view, current lakehouse solutions have two major shortcomings. First, "what you write is what you read": streaming ingestion writes are exposed to readers as-is, producing massive numbers of small files. Second, real-time capability is lacking; stream computing built on the lakehouse still has minute-level latency.
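To make the small-file problem concrete, here is a toy sketch (not Arctic's actual code; all names are illustrative) of why frequent streaming commits pile up tiny files, and what a background compaction pass does about it:

```python
# Illustrative sketch: streaming ingestion creates many small files,
# and background compaction merges them toward a target size.
from dataclasses import dataclass


@dataclass
class DataFile:
    rows: list  # records contained in this file


TARGET_FILE_ROWS = 1000  # hypothetical target file size after compaction


def streaming_ingest(stream, files):
    """Each micro-batch commit writes its own (small) data file."""
    for batch in stream:
        files.append(DataFile(rows=list(batch)))


def compact(files, target=TARGET_FILE_ROWS):
    """Merge many small files into fewer files near the target size."""
    merged, current = [], []
    for f in files:
        current.extend(f.rows)
        if len(current) >= target:
            merged.append(DataFile(rows=current))
            current = []
    if current:
        merged.append(DataFile(rows=current))
    return merged


# 500 micro-batches of 10 rows each -> 500 tiny files before compaction
files = []
streaming_ingest(([i] * 10 for i in range(500)), files)
compacted = compact(files)
```

Without compaction, every read must open all 500 files; after the merge pass the same 5,000 rows live in just 5 files, which is the kind of self-optimizing service a streaming lakehouse must provide behind the scenes.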
Based on this, Ma Jin led the team to develop a streaming lakehouse service named Arctic, with five design goals: provide a reliable lakehouse service, fix the shortcomings of mainstream lakehouse solutions, cover more unified stream-batch scenarios, avoid reinventing the wheel, and pursue a generational leap rather than an incremental fix.
Technically, Arctic is built on the Iceberg table format, reusing Iceberg's capabilities while remaining fully compatible with Hive. For streaming scenarios, Arctic provides optimized CDC (change data capture) and streaming update capabilities. It can also openly integrate middleware such as message queues and KV stores, and expose unified stream-batch table services to mainstream compute engines such as Flink, Spark, and Trino. By unifying the data lake and the data warehouse and folding in real-time capability, stream computing latency can reach the millisecond level.
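The streaming update capability described above can be pictured as merging a CDC changelog into a primary-keyed table. The following is a minimal sketch of that semantics only (hypothetical helper, not Arctic's API), using Flink-style changelog event kinds (`+I` insert, `+U` update-after, `-D` delete):

```python
# Illustrative sketch: applying a CDC changelog to a primary-keyed table
# to produce the latest view -- the essence of streaming updates.

def apply_changelog(base, changelog):
    """Apply CDC events to a table modeled as {primary_key: row}."""
    table = dict(base)
    for op, key, value in changelog:
        if op in ("+I", "+U"):  # insert, or the after-image of an update
            table[key] = value
        elif op == "-D":        # delete by primary key
            table.pop(key, None)
    return table


base = {1: "alice", 2: "bob"}
events = [
    ("+U", 2, "bobby"),  # update row 2
    ("+I", 3, "carol"),  # insert row 3
    ("-D", 1, None),     # delete row 1
]
latest = apply_changelog(base, events)
```

A real lakehouse service must do this continuously, transactionally, and at file granularity, which is where the engineering difficulty lies; the sketch only shows the logical outcome a reader of the table should observe.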
Therefore, Arctic can be regarded as an out-of-the-box real-time data warehouse service: users need not care about the data's storage structure, size, or distribution, nor whether to introduce additional middleware.
The future of the streaming lakehouse
Thirty years ago, Western scholars proclaimed "the end of history" in the face of social change, and history has since slapped that assertion in the face. So, will the streaming lakehouse be the end point of modern big data infrastructure? Looking back at data analytics, the methodologies that appeared in succession, such as the data warehouse, OLAP, BI, big data, and the data middle platform, have all been absorbed into the enterprise data life cycle, and the underlying Hadoop systems remain widely used. We have reason to believe that the design and implementation of streaming lakehouse services, born of business requirements, may be upgraded over time, but the idea itself will persist in data infrastructure.
From a16z's panorama of data infrastructure, we can also see that stability in enterprise-grade infrastructure usually comes from long-term accumulation; Arctic's open architecture and compatibility with the Hadoop ecosystem point to its vitality.
Adhering to NetEase Shufan's philosophy of "open architecture, open source kernel", Arctic is about to be open-sourced!
Join us to discuss the key elements of the lakehouse and jointly promote the maturation of an open-architecture data infrastructure ecosystem. Click the link to register: i.163yun.com/n1pvp4086