Abstract: Guo Mian, Product Director of Huawei Cloud Security Gateway, delivered a keynote speech titled "Huawei Cloud FusionInsight MRS, One Architecture to Realize Three Data Lakes" at the "Huawei Cloud TechWave Cloud Native 2.0 Special Day". He shared data lake development trends in the era of intelligent data, the technology innovations that let the MRS cloud-native data lake build offline, real-time, and logical data lakes on a single architecture, and successful cases from business practice.
This article is shared from the Huawei Cloud Community article "Huawei Cloud FusionInsight MRS Cloud Native Data Lake, One Architecture, Three Lakes: Decrypting the New Features of Huawei Cloud FusionInsight MRS Components". Original author: IT old mill.
On May 20th, Guo Mian, Product Director of Huawei Cloud Security Gateway, delivered a keynote speech titled “Huawei Cloud FusionInsight MRS, One Architecture to Realize Three Data Lakes” at the “Huawei Cloud TechWave Cloud Native 2.0 Special Day”. The talk covered data lake development trends in the era of intelligent data, the MRS cloud-native data lake innovations that build three data lakes (offline, real-time, and logical) on a single architecture, and successful cases from business practice.
Entering the era of intelligent data: the industry's top ten points of consensus on building a data lake
After decades of rapid development, big data processing technology has become increasingly mature, and the technologies derived from data warehouses and data lakes are as numerous as the stars. After years of exploration, the industry has reached ten important points of consensus on the shape of the future data lake, and lake-warehouse integration has become the preferred architecture for intelligent data lakes. In response to the new challenges that the era of intelligent data poses for big data technology, the Huawei Cloud FusionInsight MRS cloud-native data lake has been fully upgraded: it introduces the popular Hudi and ClickHouse components, strengthens the self-developed HetuEngine data virtualization engine, and adds IoTDB time series processing, expanding the boundaries of data-enabled applications.
Huawei Cloud FusionInsight MRS Cloud Native Data Lake
The Huawei Cloud FusionInsight MRS cloud-native data lake provides government and enterprise customers with a cloud-native data lake solution that integrates lake and warehouse. It builds offline, real-time, and logical data lakes on one sustainable architecture and supports the full range of big data application scenarios for these customers, including real-time analysis, offline analysis, interactive query, real-time retrieval, multi-mode analysis, data warehousing, and data access and governance. This lets customers use data more efficiently and more simply, helps them realize "one enterprise, one lake" and "one city, one lake", and delivers more accurate business insight and faster value realization.
- Offline data lake: Provides interactive, BI, AI, and other computing engines, and uses OBS to separate storage from compute, making the cloud-native data lake architecture more flexible. A single cluster scales to 20,000+ nodes, and cluster federation extends this to 100,000+ nodes. Rolling upgrades keep key services running without interruption during upgrades.
- Real-time data lake: Hudi provides ACID data ingestion into the lake and ClickHouse provides millisecond-level OLAP analysis. Together they build real-time update and processing capabilities, improving data supply timeliness from T+1 to T+0.
- Logical data lake: HetuEngine provides cross-lake, cross-warehouse, and cross-cloud collaborative analysis, integrating lake and warehouse. It reduces data relocation by 80% and improves the efficiency of collaborative analysis by 50 times.
New features of one architecture, three lakes: covering the whole data analysis process
Hudi: Incremental, real-time ingestion into the lake, with fast data ingestion, easy development, high performance, and higher resource utilization
Traditional data lakes do not support data updates, so data has to be processed in T+1 offline mode, which cannot meet flexible and changing business demands. To address this data timeliness problem, the Huawei Cloud FusionInsight MRS cloud-native data lake introduces Hudi.
Hudi supports data updates and deletes with ACID guarantees, so that data entering the lake in real time can be updated in place. It offers several views (a read-optimized view, an incremental view, and a real-time view), so different analysis applications can each use the view that suits them. On top of these capabilities, data storage models such as incremental tables, zipper tables, and mirror tables are easy to implement. Introducing Hudi brings four significant benefits:
- Better data timeliness: Data from business systems enters the lake with minute-level latency through a CDC system, improving data timeliness from T+1 to T+0.
- Higher processing performance: With the traditional Hive update method, deleting or updating even a single row may require rewriting the entire table, or at least the entire partition. Introducing Hudi increases processing efficiency by more than 10 times.
- Simpler development: Traditional ingestion into the lake does not support updates or deletes, so developers have to create a temporary table, process the data, and then overwrite the original, which can take a lot of code for a single task. With Hudi, a data update is as simple as in a database and can be done with a single statement.
- Higher resource utilization: In the traditional T+1 model, tasks do not run around the clock. Batches run at night and reports are produced in the morning, so compute demand peaks only during the nightly batch window. Because resources are provisioned for that peak, utilization during the day is low. With Hudi, data enters the lake in real time and processing is spread across the whole day, which smooths out the peaks and troughs in resource consumption.
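The difference between a partition-overwrite update and a record-level upsert, as described in the list above, can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the Hudi or Hive API: the functions `overwrite_partition_update` and `upsert`, the table layout, and the sample account records are all hypothetical and exist only for illustration.

```python
from typing import Dict

# Toy model (hypothetical): a table maps partition -> {record_key: row}.
Table = Dict[str, Dict[str, dict]]


def overwrite_partition_update(table: Table, partition: str, key: str, changes: dict) -> int:
    """Overwrite-style update: rebuild and rewrite the whole partition.

    Returns the number of rows written.
    """
    new_rows = {}
    for k, row in table[partition].items():
        # Every row is copied, even though only one of them changes.
        new_rows[k] = {**row, **changes} if k == key else dict(row)
    table[partition] = new_rows
    return len(new_rows)


def upsert(table: Table, partition: str, key: str, row: dict) -> int:
    """Record-level upsert: touch only the row identified by its key.

    Returns the number of rows written.
    """
    part = table.setdefault(partition, {})
    part[key] = {**part.get(key, {}), **row}
    return 1


def make_table() -> Table:
    # 1,000 hypothetical account rows in a single date partition.
    return {"2021-05-20": {f"acct-{i}": {"balance": 100} for i in range(1000)}}


overwrite_table = make_table()
upsert_table = make_table()

rows_rewritten = overwrite_partition_update(
    overwrite_table, "2021-05-20", "acct-7", {"balance": 250}
)
rows_upserted = upsert(upsert_table, "2021-05-20", "acct-7", {"balance": 250})

print(rows_rewritten, rows_upserted)
```

Both paths end with the same table contents, but the overwrite path writes every row in the partition while the upsert path writes one. Real engines operate on files rather than individual rows, so the actual saving depends on file layout and table type; the sketch only shows why keyed updates avoid full-partition rewrites.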
A financial customer built a data lake based on Hudi. The latency of data entering the lake dropped to the minute level, daytime resource utilization increased by more than 2 times, and data processing efficiency increased by 50%. Developers can complete an update with a single statement, which reduces development difficulty.
ClickHouse: Real-time OLAP engine for fully self-service, cost-effective real-time report analysis
Traditional OLAP engines have limited processing capacity, so data is generally organized by subject or theme into data marts before being connected to BI tools. This creates a disconnect between BI users and the data engineers who supply the data. For example, when a BI user has a new requirement and the data it needs is not in the thematic data mart, the requirement has to be handed to a data engineer, who develops the corresponding ETL task. This process often requires coordination across departments, which takes a long time and makes collaboration inefficient.
Now, the Huawei Cloud FusionInsight MRS cloud-native data lake can load all detailed data into ClickHouse in the form of a large wide table, and BI users can perform self-service analysis directly on that table. This reduces the number of data supply requests made to data engineers, and for most new requirements no new data supply is needed at all, so development efficiency and the rate at which BI reports go live improve greatly. At the same time, ClickHouse can return analysis results on a single table with millisecond-level latency.
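As a rough illustration of self-service analysis on a wide table, the following Python sketch uses the standard-library `sqlite3` module purely as a stand-in for ClickHouse. The table `orders_wide`, its columns, the sample rows, and the helper `revenue_by` are hypothetical; nothing here reflects the actual MRS or ClickHouse configuration.

```python
import sqlite3

# sqlite3 is only a stand-in engine; the schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders_wide (
        order_id   INTEGER,
        order_date TEXT,
        region     TEXT,
        product    TEXT,
        channel    TEXT,
        amount     REAL
    )
    """
)
conn.executemany(
    "INSERT INTO orders_wide VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2021-05-01", "East", "Phone", "Online", 300.0),
        (2, "2021-05-01", "West", "Laptop", "Retail", 900.0),
        (3, "2021-05-02", "East", "Laptop", "Online", 950.0),
        (4, "2021-05-02", "East", "Phone", "Retail", 320.0),
    ],
)

ALLOWED_DIMENSIONS = {"order_date", "region", "product", "channel"}


def revenue_by(dimension: str):
    """Aggregate revenue by any dimension already present in the wide table."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    # The dimension name is checked against a whitelist before interpolation.
    query = (
        f"SELECT {dimension}, SUM(amount) FROM orders_wide "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(query).fetchall()


by_region = revenue_by("region")
by_channel = revenue_by("channel")
print(by_region)
print(by_channel)
```

Because every dimension already sits in the same detailed table, a new question such as "revenue by channel" is just another `GROUP BY`, with no new ETL job or data mart required.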
Self-service BI based on ClickHouse has also achieved good results in Huawei's internal practice. The HIS data lake of Huawei Group was originally modeled on a traditional OLAP engine and, limited by development efficiency, launched only dozens of reports over several years. After ClickHouse was introduced, 400+ reports were developed and launched in three months, a 50-fold increase in the efficiency of bringing business reports online. At present, Huawei's internal ClickHouse deployment has reached 2000+ nodes and 10+ PB of data, growing by 100 TB per day.
HetuEngine: Data virtualization engine that breaks geographical restrictions and the data "wall"
As enterprises develop and pursue digital transformation, their business becomes more complex and the demand for innovation keeps rising. Systems working in isolation can hardly meet changing business needs, and an enterprise may run multiple lakes, multiple warehouses, and multiple systems at the same time. However, in the traditional siloed ("chimney") construction approach, lakes, warehouses, and engines are not directly interconnected. Interoperation requires moving data back and forth through ETL, which leads to long data transfer links, multiple redundant copies, and data islands. With multiple redundant copies in the system, the consistency and reliability of the data are also hard to guarantee.
To make data easier to use, simplify cross-lake collaboration, and solve the fragmentation of lake and warehouse data, Huawei launched the data virtualization engine HetuEngine. It provides cross-lake, cross-warehouse, on-cloud, off-cloud, and multi-cloud collaborative analysis, breaking geographical restrictions and the data "wall". The efficiency of cross-lake collaborative analysis increases by 50 times, cross-warehouse collaborative analysis reduces data migration and synchronization between systems by 80%, and analysis performance improves from minutes to seconds.
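The idea of analyzing data across a lake and a warehouse without first copying it through ETL can be illustrated with a small Python sketch. It uses two independent in-memory SQLite databases, joined through SQLite's `ATTACH DATABASE`, as a stand-in for two separate systems. This is not HetuEngine's interface: the database roles, table names, and sample rows are hypothetical.

```python
import sqlite3

# The main database plays the role of the "lake" (hypothetical data).
conn = sqlite3.connect(":memory:")
# A second, independent in-memory database plays the role of the "warehouse".
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE main.transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.transactions VALUES (?, ?)",
    [(1, 120.0), (2, 80.0), (1, 40.0)],
)

conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, segment TEXT)")
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "VIP"), (2, "Standard")],
)

# One query spans both databases; no rows are copied from one into the other.
spend_by_segment = conn.execute(
    """
    SELECT c.segment, SUM(t.amount)
    FROM main.transactions AS t
    JOIN warehouse.customers AS c ON c.customer_id = t.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()
print(spend_by_segment)
```

The point of the sketch is the shape of the query: each dataset stays where it lives, and the join happens at query time, which is what removes the intermediate ETL copy.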
A bank introduced the HetuEngine data virtualization engine. For data lake query and analysis, concurrency improved: with only 1/5 of the resources it supports 45 concurrent queries, peak throughput reaches 200 QPS, and average latency is optimized to 8 seconds. For collaborative lake-warehouse analysis, HetuEngine removed the barrier between the data lake and the data warehouse, improving performance from minutes to seconds while reducing data migration and synchronization between systems by 80%, which greatly improves the efficiency of data governance.
IoTDB: Time series database for easily building time series data marts with cloud-edge-device collaboration
Time series data has two major characteristics: it is processed on the device, at the edge, and in the cloud, and it does not need to be updated after collection. Traditional time series processing schemes use different technology stacks on the device, at the edge, and in the cloud, and heterogeneous stacks inevitably make data processing more complex. The time series database IoTDB (also known as the time series engine), developed by Tsinghua University, uses a unified time series file format, TsFile, so that one copy of the data is compatible with all scenarios, one engine connects cloud, edge, and device, and one framework integrates all three. Huawei maintains close cooperation with Tsinghua University, and the latest IoTDB cluster version was jointly led by Huawei and Tsinghua University.
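The property that time series data is not updated after collection is what allows a simple append-only layout. The Python sketch below is a toy in-memory model of that idea, not IoTDB's storage engine or the TsFile format; the class `AppendOnlySeries`, its methods, and the sample readings are hypothetical. For simplicity the toy rejects out-of-order points.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AppendOnlySeries:
    """Toy append-only series: points are only ever added at the end."""

    timestamps: List[int] = field(default_factory=list)  # milliseconds
    values: List[float] = field(default_factory=list)

    def append(self, ts: int, value: float) -> None:
        # Simplifying assumption of this toy: timestamps strictly increase.
        if self.timestamps and ts <= self.timestamps[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self.timestamps.append(ts)
        self.values.append(value)

    def range(self, start: int, end: int) -> List[Tuple[int, float]]:
        """Return points with start <= timestamp <= end via binary search."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))


series = AppendOnlySeries()
for i in range(5):
    # One hypothetical sensor reading every 200 ms.
    series.append(i * 200, 20.0 + i)

window = series.range(200, 600)
print(window)
```

Because points are never rewritten, the timestamps stay sorted and a time-range query reduces to two binary searches and a slice.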
In Shanghai, Chengdu, Chongqing, and other cities, IoTDB is used to manage subway monitoring data. Originally, 144 trains required 9 servers; now a single IoTDB instance meets the requirement. The sampling delay of measurement points has been reduced from 500 ms to 200 ms, and 414 billion new data points are managed each day, greatly improving resource utilization.
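To put the daily figure in perspective, the short calculation below converts 414 billion points per day into an average per-second rate. It assumes, purely for illustration, that the load is spread evenly over 24 hours; the source does not say how the load is distributed.

```python
# Daily figure quoted in the text above.
NEW_POINTS_PER_DAY = 414_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption for illustration: points arrive evenly throughout the day.
points_per_second = NEW_POINTS_PER_DAY / SECONDS_PER_DAY
print(f"{points_per_second:,.0f} points per second on average")
```

Under that assumption, the average works out to roughly 4.8 million data points per second.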
Concluding remarks
At present, the Huawei Cloud FusionInsight MRS cloud-native data lake works with 800+ ecosystem partners, has served 3000+ government and enterprise customers, and is widely used in utilities, finance, telecom operators, energy, healthcare, manufacturing, transportation, and other industries.