As digital transformation advances, the pursuit of data productivity across industries has entered deep water. The practical problem is that multiple technologies, data warehouses and data lakes among them, will coexist in the enterprise for a long time. How can enterprises escape the friction of making these technologies work together and turn big data directly into productivity?

On the afternoon of August 11, NetEase Shufan and Huatai Securities jointly held an online launch event for Arctic, an open source, enterprise-grade streaming lakehouse service. They announced that open-sourcing Arctic is intended to strengthen existing data lake foundations, expand the boundaries of the data platform, and reduce the inefficiency and wasted cost caused by fragmented products, data silos, and process specifications. The goal is to advance lakehouse unification and stream-batch unification so that data productivity is realized and drives business value.

Open-sourcing Arctic: no invasive modifications, no lock-in, data productivity put into practice

With NetEase's diverse businesses and diverse technologies, NetEase Shufan ran into the problems described at the start of this article when it set out to improve data productivity. Even so, its complete big data technology stack has now been adopted by more than 300 customers in finance, retail, distribution, manufacturing, and other industries.

Yu Lihua, general manager of NetEase Shufan's big data product line, attributed this result to the two technical principles NetEase Shufan follows when building a big data system: open architecture and open source. The open architecture relies on modular design and a large number of open source components, which gives the system broad functionality, long-term vitality, and a low construction cost. It also makes the system more complex to use and maintain. NetEase Shufan addresses that complexity by working with the open source community, for example by building a unified SQL gateway on the open source Apache Kyuubi to provide a single entry point to the data lake.

Yu Lihua, general manager of NetEase Shufan's big data product line

While taking part in the digital transformation of the financial industry, NetEase Shufan found a new challenge: financial enterprises want to unify real-time data lakes and data warehouses into a real-time data middle platform that supports their digital business innovation. This is essentially the lakehouse idea. However, current mainstream data lake technologies solve only problems such as updates, access performance on large tables, and streaming consumption. They leave open the performance loss caused by small files and by updates, as well as compatibility and usability issues, and the open source community had not yet produced a corresponding solution. This is the direct reason NetEase Shufan developed and open-sourced Arctic, a streaming lakehouse service.

Arctic is a streaming lakehouse service built on top of Apache Iceberg. With Arctic, users can implement better-optimized CDC, streaming updates, OLAP, and other functions on Flink, Spark, Trino, and other engines. Combined with the efficient offline processing capabilities of the data lake, Arctic can serve more scenarios in which streams and batches are mixed. In addition, Arctic's structural self-optimization, concurrent conflict resolution, and standardized lakehouse management functions can effectively reduce the burden of managing and tuning a data lake.
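
As one illustration of what structural self-optimization can involve, the sketch below plans a merge of small files into larger ones. It is a toy model under stated assumptions, not Arctic's actual optimizer: the function name `plan_compaction`, the 128 MB target size, and the first-fit-decreasing strategy are illustrative choices that do not come from the Arctic project.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 128) -> List[List[int]]:
    """Group small files into merge tasks of at most target_mb each.

    Hypothetical helper for illustration only; uses first-fit decreasing.
    """
    groups: List[List[int]] = []
    # Files already at or above the target size are left alone.
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    for size in small:
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    # Rewriting a group that holds a single file gains nothing.
    return [g for g in groups if len(g) > 1]
```

For example, `plan_compaction([4, 8, 100, 200, 16, 2])` returns a single merge task, `[[100, 16, 8, 4]]`: the 200 MB file is already large enough, and the leftover 2 MB file has nothing to merge with.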

Yu Lihua said that, in keeping with the open architecture principle, Arctic builds on an open source data lake, refuses invasive modifications, does not bind users to a computing engine, and pays attention to compatibility with the traditional data warehouse Hive. After the unification of the SQL entry point, this unifies the NetEase Shufan big data system once more, this time at the storage layer. The data middle platform can therefore extend seamlessly to real-time scenarios, and enterprise data productivity is no longer held back by data silos. Application practice in the financial industry has also confirmed the value of this approach.

Arctic Design: Reshaping the Balance of Cost, Performance and Data Freshness

Ma Jin, a real-time computing expert at NetEase Shufan big data and head of the lakehouse project, went on to introduce the Arctic project's goals, features, and roadmap, and the value it brings to open source users.

Ma Jin said that Arctic is positioned as a streaming lakehouse service. "Streaming" emphasizes the extension of real-time capabilities, while "service" emphasizes management, standardized metrics, and other lakehouse capabilities that can be abstracted into foundational software.

Ma Jin, head of NetEase Shufan's lakehouse project

There are many data lake technologies, but what they provide is a variety of data lake table formats rather than a truly unified lakehouse platform. These formats already exist in enterprise environments. As a service, Arctic can adapt to different data lake formats, so enterprises do not need to agonize over which data lake technology to choose and can instead keep improving their data analysis capabilities and simplifying data flow management.

In terms of capabilities, Arctic provides features such as efficient streaming updates based on primary keys, automatic data bucketing, and structural self-optimization. It also supports wrapping a data lake and a message queue into a single unified table, which achieves stream-batch unification with lower latency than traditional solutions and addresses the performance problem at its root. In addition, Arctic provides standardized metrics, dashboards, and related management tools for streaming lakehouses, and provides transactional guarantees for concurrent streaming and batch writes.
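
The idea of primary-key-based updates combined with bucketing can be sketched in a few lines. What follows is a toy model rather than Arctic's implementation: the names `bucket_of`, `merge_on_read`, and `read_table`, the integer keys, and the modulo bucketing are assumptions made for illustration. It shows a change log being merged over a base snapshot at read time, one bucket at a time.

```python
from typing import Dict, List, Optional, Tuple

# A change record: (operation, primary_key, row). The row is None for deletes.
Change = Tuple[str, int, Optional[dict]]


def bucket_of(primary_key: int, num_buckets: int) -> int:
    """Assign a key to a bucket; plain modulo stands in for a real hash function."""
    return primary_key % num_buckets


def merge_on_read(base: Dict[int, dict], changes: List[Change]) -> Dict[int, dict]:
    """Apply an ordered change log to a base snapshot keyed by primary key."""
    merged = dict(base)  # copy so the base snapshot stays untouched
    for op, key, row in changes:
        if op == "delete":
            merged.pop(key, None)
        elif op in ("insert", "update"):
            merged[key] = row
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


def read_table(
    base: Dict[int, dict], changes: List[Change], num_buckets: int = 4
) -> Dict[int, dict]:
    """Merge each bucket independently, then combine the per-bucket results."""
    result: Dict[int, dict] = {}
    for b in range(num_buckets):
        base_b = {k: v for k, v in base.items() if bucket_of(k, num_buckets) == b}
        changes_b = [c for c in changes if bucket_of(c[1], num_buckets) == b]
        result.update(merge_on_read(base_b, changes_b))
    return result
```

Because every change to a given key lands in the same bucket, each bucket can be merged without looking at the others, which is what makes this kind of layout attractive for parallel reads and for incremental optimization.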

In terms of architecture, Arctic has a simple design with only three components: AMS, the optimizer, and the dashboard. Sitting between the data lake and the computing engines, it provides the capabilities a lakehouse requires, and it supports reads and writes from Spark and Flink as well as queries from Trino. It is compatible with the Iceberg and Hive table formats and syntax, which keeps the cost of adoption low.

Ma Jin also stressed the deeper meaning of Arctic's positioning: "When we extend the capabilities of the data lake to real-time scenarios, the relationship between cost, performance, and data freshness becomes more complex and subtle. Arctic's service and management functions will make this triangular relationship clear to users and to the platforms built on top of it."

Huatai Securities: Arctic helps the financial Digital Intelligence Center build out its real-time lakehouse

Chen Feng, a big data stream computing expert at Huatai Securities, described the role Arctic plays in building the real-time lakehouse at Huatai's Digital Intelligence Center. A real-time lakehouse is highly valuable to Huatai Securities for moving offline processing to intraday data, joining real-time data with large volumes of historical data, handling frequent corrections to financial data, and unifying the processing pipelines for event-tracking data. Building one, however, ran into five major problems, among them complex real-time business logic, inconsistent data storage, complex data updates, and difficult evolution.

Chen Feng, big data stream computing expert at Huatai Securities

"The industry has given solutions such as Iceberg and Hudi, but our business and platform need more than a single open source data lake component," Chen Feng said. Huatai Securities set multiple goals for its real-time data lake: stream-batch unification, high performance, low latency, and compatibility with its existing Hive/Impala deployment.

Huatai Securities worked with NetEase Shufan to introduce Arctic as the basis of its real-time lakehouse, and reported good results and strong performance in scenarios such as margin financing and securities lending, and logging. The margin financing and securities lending scenario, for example, involves joint computation over large volumes of historical data, and implementing it with stream computing alone is logically complex. After the upgrade from an offline architecture to a real-time architecture, and then to the real-time lakehouse architecture, the overall implementation logic became clear, and end-to-end latency dropped from T+1 day to T+20 minutes.

Community Planning: All members are welcome to contribute, share, and collaborate

Ma Jin also introduced the plans for the Arctic open source community. The project will establish an open and free global community for exchanging data lake technology, aimed at developers, users, and other members, all of whom can take part by contributing, sharing, and collaborating.

A participation program for co-building enterprises was launched at the same time. Huatai Securities, the first co-building member of the Arctic open source community, joined at the very start of the project's open source effort. It takes part both as a user, providing real feedback grounded in its business scenarios, and as a development contributor, continuing to explore new functionality in streaming lakehouse technology.

Going forward, Huatai Securities will continue helping the Arctic community ecosystem grow. Together with its Arctic community partners, it aims to create a world-leading streaming lakehouse service and to build a thriving lakehouse ecosystem.

Here, NetEase Shufan also welcomes more individuals and enterprises to participate in the Arctic community.
• Arctic documentation address:
https://arctic.netease.com/ch/
• Git address:
https://github.com/NetEase/arctic



NetEase Shufan

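NetEase Shufan is a full-pipeline big data productivity platform under NetEase Shuzhi, focused on end-to-end data development, governance, and analysis. It builds stable, controllable, and innovative data productivity platforms tailored to enterprises, serving business scenarios such as viewing data, managing data, and using data, in order to activate data assets and unlock the value of data.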