Abstract: Built on a data lake architecture and after several years of development, the cluster has grown to more than 1,000 nodes, holds tens of petabytes of data, and processes roughly 100,000 jobs per day. It empowers more than 180 head-office applications and 41 domestic and overseas branches and subsidiaries.
This article is shared from the Huawei Cloud Community post "How does FusionInsight help the 'Universe Bank' (ICBC) build a good 'cloud data platform'?", by author Xu Lifeng.
The "Evolution Trend" of Big Data Technology
Since big data took off around 2010, new technologies have emerged one after another, and recently much of the innovation has centered on cloud infrastructure.
Many companies' early data analysis systems were built mainly on data warehouses, and some did not even have a warehouse: they connected transactional (TP-type) relational databases directly to BI systems. Such systems were generally deployed as all-in-one appliances on physical machines and had weak distributed capabilities. As data volumes grew dramatically, especially with the rise of the mobile Internet, traditional databases could no longer support large-scale analytical workloads.
Against this background, Hadoop emerged and developed rapidly, effectively meeting the need to analyze and process very large data volumes. Early Hadoop applications mostly handled non-relational data such as logs, based mainly on the MapReduce programming model. With the development of SQL on Hadoop, its ability to process relational data has grown steadily stronger.
Early Hadoop was mainly deployed on physical machines, and that remains the most mature deployment mode. Although Hadoop decouples compute from storage, most practices still co-locate them: the Hadoop scheduler moves computation to where the data resides, which greatly improves processing throughput. Spark, Flink, and later engines inherited the advantages of this architecture.
Since Amazon launched its cloud IT infrastructure, more and more companies have migrated their IT business to the cloud, so it is only natural to run big data workloads on the cloud as well, and virtually every cloud vendor now offers big data solutions on the cloud.
So, what is the difference between deploying big data on the cloud and the traditional on-premises deployment on physical machines?
First, cloud deployments use cloud computing resources wherever possible, including virtual machines and containers for rapid resource provisioning, as well as the bare metal service (BMS) that delivers physical-machine-like performance.
Second, cloud vendors have introduced big data architectures that separate storage from compute. Amazon was the first to replace HDFS with object storage, because object storage is far cheaper than HDFS's three replicas. Storage-compute separation brings many benefits but also many challenges, and the approach has been continuously refined; today, big data storage-compute separation on the cloud is relatively mature.
Lakehouse is a big data concept that has become very popular recently. It was first mentioned in a blog post published by Databricks in January 2020, followed by a paper in January of this year that systematically explains how to build a Lakehouse.
Many data analysis systems are built on data warehouses and data lakes, and some combine the two into the two-tier structure shown in the figure, which is especially common in large enterprises. What problems does this two-tier data lake plus data warehouse architecture have?
As can be seen, much of the data in the data lake and the data warehouse is identical, which leads to three problems:
First, the data must be stored twice, doubling the storage cost.
Second, with copies in both the lake and the warehouse, data consistency must be maintained, mostly through ETL. The maintenance cost is high, consistency is hard to guarantee, and reliability suffers.
Third, data timeliness. It is hard to merge large volumes of changed data into a data lake. Because most data lakes are managed with Hive, and the underlying HDFS storage does not support in-place modification, data can only be ingested in append mode. Yet the changes coming from business production systems include not only inserts but also many updates. To update data in the lake, the affected partitions must be merged and rewritten, which makes merge processing difficult; in most cases merging can only happen once a day in T+1 mode. T+1 means downstream applications see the data one day late: what they see is actually yesterday's data, so the data in the warehouse is never fresh.
Lakehouse aims to solve the problem of unified analysis across the data lake and the data warehouse. It addresses data timeliness by providing a storage engine with ACID semantics in an open format. Another advantage of the open format is that data in the lake can be consumed by multiple analysis engines, such as Hive, Spark, and Flink, and AI engines can also access Lakehouse data for advanced analytics.
Incremental data management frameworks such as Hudi, Iceberg, and Delta Lake provide ACID capabilities, so data can be updated and read and written concurrently. This raises the requirements on the underlying storage, such as support for time travel and zero copy, so that data can be traced back to any point in time.
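To make these capabilities concrete, here is a minimal PySpark sketch of two things an incremental table format such as Hudi adds on top of plain files: reading a table as of a past instant (time travel) and reading only the changes committed after a given instant (incremental query). The path, timestamps, and option names follow the open-source Hudi Spark datasource and are illustrative rather than taken from any particular deployment.

```python
# Minimal sketch of ACID table-format capabilities (Hudi shown here).
# Paths and timestamps are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-time-travel").getOrCreate()

table_path = "/data/lake/dwd/orders"

# 1) Time travel: read the table as it looked at a given instant.
snapshot = (spark.read.format("hudi")
    .option("as.of.instant", "2021-06-30 23:59:59")
    .load(table_path))

# 2) Incremental query: read only rows committed after a given instant.
increments = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20210630000000")
    .load(table_path))

snapshot.show(5)
increments.show(5)
```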
In addition to traditional BI and reporting applications, Lakehouse supports advanced AI analysis: data analysts and data scientists can run data science workloads and machine learning directly on the Lakehouse.
Lakehouse best practice is also built on a storage-compute-separated architecture, whose biggest challenge is the network. Cloud and big data vendors have explored many ways to work around the performance limits of cloud storage itself, such as local caching to improve data processing efficiency.
The Lakehouse architecture can unify offline and real-time processing, with data entering the lake under ACID guarantees.
The figure shows the classic big data Lambda architecture: the blue flow is batch processing and the red flow is stream processing, merged into a real-time view at the application layer. The problem with this architecture is that batch and stream are separate, coordinating data management between them is cumbersome, and the different development tools have different requirements, which poses a real challenge for system maintainers and data developers.
The Lakehouse architecture, by contrast, combines batch and stream processing into a single Lakehouse view: business production data is extracted to the data lake in real time through CDC, processed in real time, and then pushed to a back-end OLAP system for serving. The whole pipeline can achieve end-to-end latency at the minute level.
The Lakehouse concept itself is still fairly new, and the industry is still refining it through practice.
Three stages of FusionInsight's practice at ICBC
In the early days, ICBC built its data systems mainly on Oracle and Teradata: the data warehouse was Teradata and the data marts were Oracle Exadata.
Starting in 2013, we launched the banking industry's first big data platform at ICBC. At that time the platform served mostly standalone applications, such as log analysis, offloading new workloads from Teradata, and detailed-record queries.
After 2015, the data systems were consolidated: the Teradata warehouse was replaced with GaussDB to form a converged data warehouse, known inside ICBC as "one lake and two databases". The data lake foundation was built on FusionInsight to support full-volume data processing, and services such as real-time analysis and interactive analysis were also brought online.
At the beginning of 2020, ICBC started building a cloud data platform, moving the entire data lake to the cloud to make big data cloud-based and service-oriented, while also adopting a storage-compute-separated architecture and introducing AI technology.
ICBC's technology evolution has been a journey from single-purpose to diversified, from centralized to distributed, from isolated systems to integration, and from traditional IT to cloud native.
The first-generation big data platform was built on demand, application by application; at that point there was no deep understanding of which problems Hadoop could solve.
The first instinct was to use it for business innovation and for workloads the data warehouse could not handle, such as offloading large-scale data merge jobs to the Hadoop system.
The lack of system-level planning in this period led to small individual clusters, a growing number of clusters, high maintenance costs, and low resource utilization. In addition, a lot of data had to be shared across businesses and therefore copied and migrated between clusters, and the redundant data increased resource consumption. Data also had to be stored in and processed by different technical components depending on the scenario, which made ETL pipelines long, development inefficient, and maintenance expensive. The architecture of the entire big data platform therefore needed to be optimized.
In the second stage, the multiple big data clusters were merged into a single data lake, characterized by unified planning of the data processing layer, centralized ingestion into the lake, and centralized management. This greatly improved overall management, maintenance, and development efficiency.
After raw data enters the lake, it is processed into aggregated and thematic data and managed centrally in the lake; processed data is then delivered to the data warehouse, data marts, and other downstream systems.
Based on this architecture, the data lake can be used far more effectively. After several years of development, the cluster has grown to more than 1,000 nodes, holds tens of petabytes of data, and processes roughly 100,000 jobs per day, empowering more than 180 head-office applications and 41 domestic and overseas branches and subsidiaries.
However, storing all data in a centralized data lake also brings many management problems.
The businesses and users supported by the data lake have different SLA requirements, so managing the jobs of different departments, business lines, and users is critical; the core is multi-tenancy capability. The community YARN scheduler was not very strong in the early days and its ability to manage thousands of nodes was weak, although newer versions have improved.
Once the early clusters reached several hundred nodes, the scheduling system struggled to keep up, so we developed the Superior scheduler to improve it. ICBC's 1,000-node cluster is relatively large for the banking industry; inside Huawei we have built clusters from 500 to several thousand and up to 10,000 nodes, which paved the way for managing large clusters and made the rollout at ICBC relatively smooth.
As shown in the figure, resources are organized into multi-level resource pools by department, and the Superior scheduler applies different policies to support different SLAs. Overall, resource utilization has been doubled.
Some other components have their own constraints. HBase RegionServers, for example, manage memory inside the Java JVM, so the memory a single instance can use is limited; a physical machine's resources are far from exhausted and cannot be fully utilized.
To solve this, we deploy multiple instances on one physical machine to make fuller use of its resources; Elasticsearch (ES) is handled the same way.
As the cluster grows, availability and reliability also become major concerns: if a large cluster goes down completely, the impact on the business is severe. The banking industry therefore requires full "two sites, three centers" reliability.
The first requirement is cross-AZ deployment. AZ (availability zone) is a cloud concept; in traditional ICT machine rooms the equivalent notion is the cross-DC data center. Cross-AZ deployment means a single cluster can span two or even three AZs.
Hadoop itself has a multi-replica mechanism, and the replicas can be placed in different machine rooms. However, open-source capabilities alone do not support this: additional replica placement and scheduling strategies are needed. The scheduler must know which AZ the data resides in and schedule tasks to that AZ, so that data is processed locally and the network I/O of cross-AZ transfers is avoided as much as possible.
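Open-source Hadoop already offers one building block for this kind of placement awareness: a topology script (configured via net.topology.script.file.name) that maps each node to a location path, which can encode the AZ. The sketch below uses a hypothetical subnet-to-AZ mapping and only covers replica placement; the AZ-aware task scheduling described above goes beyond what this script alone provides.

```python
#!/usr/bin/env python3
# Sketch of a Hadoop rack-topology script that encodes the AZ into the
# location path, so HDFS spreads replicas across AZs.
# The subnet-to-AZ mapping below is hypothetical.
import sys

SUBNET_TO_LOCATION = {
    "10.10.1.": "/az1/rack1",
    "10.10.2.": "/az2/rack1",
    "10.10.3.": "/az3/rack1",
}

def locate(host):
    for prefix, location in SUBNET_TO_LOCATION.items():
        if host.startswith(prefix):
            return location
    return "/default/rack0"  # fallback for unknown hosts

# Hadoop invokes the script with one or more IPs/hostnames and expects
# one location per argument, whitespace-separated.
for arg in sys.argv[1:]:
    print(locate(arg))
```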
Disaster tolerance can also be achieved with a remote primary/standby setup. Cross-AZ deployment requires inter-site network latency at the millisecond level; if latency is too high, many services cannot be guaranteed. Remote disaster recovery uses a primary cluster and a standby cluster: normally only the primary carries business, and the standby takes over if the primary fails, but the cost is correspondingly higher. A 1,000-node primary cluster, for example, would need a standby of comparable size. In most cases, therefore, only key data and key businesses are protected with primary/standby capacity rather than everything.
Big data clusters need to keep expanding, and over time the hardware gets upgraded. Two situations then inevitably arise. The first is that newly purchased machines have more CPU, memory, and disk capacity than the old ones, so data must be balanced across generations of hardware. The second is that replacing disks leaves individual disks unevenly filled. How to keep data balanced is therefore an important topic.
We developed the ability to place data according to available space, so that data is distributed based on the free capacity of each disk and node. At the same time, nodes of different generations are divided into resource pools by specification: older, lower-performance machines can form one logical resource pool to run Hive jobs, while newer machines with more memory form another pool to run Spark. Resource pools are distinguished and isolated by resource labels and scheduled separately.
Once the cluster is large, any change that interrupts business has a huge impact, so both upgrade and patch operations must be designed not to interrupt the business.
For example, upgrading a 1,000-node cluster with a full shutdown would take at least 12 hours.
A rolling upgrade strategy upgrades cluster nodes one by one, in time slices, until all nodes run the latest version. However, the open-source community does not guarantee interface compatibility across major versions, which can make rolling between old and new versions impossible. We therefore developed substantial capability to guarantee rolling upgrades between all versions: from the earliest Hadoop versions up to Hadoop 3, we can roll-upgrade every component. This is a must-have capability for large clusters.
Building the data lake solved ICBC's data management problems, but it also brought many new challenges.
Generally speaking, hardware in many large enterprises is purchased centrally, without considering the different resource demands of different big data scenarios; the ratio of compute to storage is not well balanced, and a lot of capacity is wasted.
Second, there are differences between hardware batches, and some batches may even run different operating system versions, so a single cluster ends up with heterogeneous hardware and OS versions. Technical means can deal with this heterogeneity, but the ongoing maintenance cost is considerable.
Third, manual deployment of big data is inefficient: when a new business is launched, the delivery cycle from hardware procurement through network configuration to OS installation takes at least a month.
Finally, resource elasticity is insufficient. If resources run short when a new service launches, capacity must be expanded, and the process of requesting and purchasing machines makes the go-live cycle long; when we deliver a new business to customers, we often spend most of the time waiting for resources to arrive. In addition, resources cannot be shared between resource pools, which causes further waste.
This is why ICBC had to introduce a cloud architecture.
FusionInsight has been available on HUAWEI CLOUD for a long time, as the MRS service.
At present, ICBC and many other banks are deploying cloud platforms and putting big data on them. Deploying very large big data clusters on the cloud still poses some challenges, but running big data services on a cloud-native, storage-compute-separated architecture has many advantages.
First, hardware resources are pooled. Once pooled, the upper layer sees relatively standard computing resources; compute and storage can be scaled flexibly and utilization is relatively high.
Second, building a big data environment on the cloud platform is fully automated, from preparing hardware resources to installing software, and takes only about an hour.
Third, when a cluster needs to scale out, no advance preparation is required: capacity can be allocated quickly from a large shared resource pool. As long as resources are reserved on the cloud, they can be added to the big data resource pool rapidly, making new business launches very agile.
Beyond storage-compute separation, storage is based mainly on object storage, replacing HDFS's three replicas with low-cost object storage. Object storage generally provides an HDFS-compatible interface, and on this basis it can serve as unified storage for big data, AI, and other workloads, reducing storage costs and improving operation and maintenance efficiency.
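As a rough illustration of what "HDFS-compatible interface" means in practice, the sketch below reads Parquet data from an object storage bucket in a Spark job. It assumes the Hadoop OBS connector (obs:// scheme) is on the classpath; with S3-compatible storage the same pattern applies to s3a:// paths. The endpoint, bucket, and credential settings are placeholders.

```python
# Minimal sketch: an analytics job reads data through the object store's
# HDFS-compatible interface. Endpoint, bucket, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-from-object-storage")
    # Object store endpoint and credentials (illustrative property names).
    .config("spark.hadoop.fs.obs.endpoint", "obs.example-region.myhuaweicloud.com")
    .config("spark.hadoop.fs.obs.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.obs.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

# Application code is identical to reading from HDFS; only the URI scheme differs.
df = spark.read.parquet("obs://analytics-bucket/dwd/transactions/")
df.groupBy("branch_id").count().show()
```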
However, object storage performance is not strong, and the following problems must be solved around the characteristics of big data workloads.
The first is metadata. Big data workloads are compute-heavy and read-intensive, with very high read I/O during processing, and the metadata performance of object storage is a major bottleneck, so metadata read/write capability must be improved.
The second is network bandwidth: the I/O between storage and compute places high demands on network bandwidth.
The third is network latency. Big data engines compute close to the data: computation goes to where the data is, reading local disks first and the network second, so they are somewhat sensitive to latency.
Our optimizations focus mainly on caching and pushing part of the computation down to storage. Overall, the storage-compute-separated architecture compares well with the co-located one: apart from a gap in a few individual use cases, overall performance is higher, especially in write scenarios, because writing to object storage uses erasure coding, roughly 1.2 copies of physical data instead of three replicas, which is about 60% less write traffic.
In the end, overall TCO can drop by 30% to 60%, and compared with surrounding products the overall performance still has clear advantages.
When big data is deployed on the cloud, virtual machines offer little advantage for large clusters: the data pool is already large, VMs introduce a performance loss relative to physical machines, and for SLA isolation reasons big data resource pools deployed in a private cloud often still have to be dedicated, so their resources cannot be shared with other businesses anyway.
The bare metal service solves these problems: its performance is close to a physical machine, and a bare metal server can be provisioned in minutes, including network configuration and OS installation.
On the network side, a dedicated DynaSky network acceleration card manages the bare metal network, and network performance is even higher than with the original physical NIC. Deploying large-scale big data services on bare metal servers is the best deployment option on the cloud.
Future prospects: lake-warehouse integration
Together with ICBC, we are also exploring an integrated lake-warehouse solution.
On top of storage-compute separation, Huawei Cloud's lake-warehouse integration adds a separate data management layer with several parts. The first is data integration: data enters the lake from a variety of external systems. The second is metadata integration: since metadata on the Hadoop data lake is managed by Hive, we provide an independent metadata service compatible with Hive Metastore. The third is security policy, covering data authorization and masking, for which we close the loop uniformly at the Lake Formation layer.
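A small sketch of what a Hive-Metastore-compatible metadata service implies for an engine: instead of each cluster keeping its own catalog, engines point at the shared service and see the same table definitions. The service URI and database name below are hypothetical.

```python
# Minimal sketch: point Spark at an independent, Hive-Metastore-compatible
# metadata service so that all engines share one set of table definitions.
# The metastore URI and database name are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-metadata-demo")
    .config("hive.metastore.uris", "thrift://metadata-service.example:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered by other engines (Hive, Flink, HetuEngine) are visible here.
spark.sql("SHOW TABLES IN lake_db").show()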
Once this data foundation is built, analysis services such as big data, data warehouse, graph computing, and AI all compute over the same data view. The DWS data warehouse service is essentially local storage, but the warehouse can also access Lakehouse data through one of its engines; in this way the warehouse keeps an accelerated local data layer while still being able to reach the Lakehouse.
On this basis, one architecture implements the three kinds of lake described below, the offline data lake, the real-time data lake, and the logical data lake, and continues to evolve.
The blue data flow is the offline flow, which provides the offline data lake capability: data is integrated in batches, stored in Hudi, and then processed with Spark.
The red data flow is the real-time flow: data is captured in real time through CDC and written to Hudi in real time by Flink; variables are cached in Redis for real-time processing, and results are then delivered to specialized marts such as ClickHouse, Redis, and HBase to serve external queries.
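A minimal PyFlink sketch of this red flow, capturing database changes via CDC and upserting them into a Hudi table on the lake. Connector names and options are illustrative and version-dependent, and the tables, fields, and credentials are hypothetical.

```python
# Sketch of the real-time flow: CDC source -> Hudi sink via Flink SQL.
# Connector options are illustrative and depend on connector versions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: change stream from the business production database (CDC).
t_env.execute_sql("""
    CREATE TABLE ods_orders (
        order_id BIGINT,
        amount   DECIMAL(18, 2),
        updated  TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname'  = 'prod-db.example',
        'port'      = '3306',
        'username'  = 'reader',
        'password'  = '***',
        'database-name' = 'core',
        'table-name'    = 'orders'
    )
""")

# Sink: Hudi table on the lake, updated continuously instead of T+1.
t_env.execute_sql("""
    CREATE TABLE dwd_orders (
        order_id BIGINT,
        amount   DECIMAL(18, 2),
        updated  TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'hudi',
        'path'       = 'obs://lake-bucket/dwd/orders',
        'table.type' = 'MERGE_ON_READ'
    )
""")

t_env.execute_sql("INSERT INTO dwd_orders SELECT * FROM ods_orders")
```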
At the same time, we developed the HetuEngine data virtualization engine to run federated analysis across data in the data lake and data in other thematic marts, forming a logical data lake.
Although HetuEngine can connect to different data sources, this does not mean all applications must go through it: applications can still use each data source's native interface. HetuEngine is only needed when joint analysis across different thematic data sets is required.
The following is the specific implementation plan:
The first step is to introduce Hudi for data management in the data lake, which can save millions of dollars every year.
The second is to introduce HetuEngine so that data in the lake can be queried and analyzed in place, without relocation, avoiding unnecessary ETL.
The third is to introduce ClickHouse, whose processing performance and other strengths in the OLAP field are excellent, so it is being adopted at ICBC.
The data lake used Hive as storage, with a once-a-day batch integration and batch merge solution, i.e., the T+1 data processing model. This model has several major business pain points:
First, data latency is high: what back-end services see is not the latest data.
Second, batch jobs run at night, so resource utilization is low during the day, yet the cluster is sized for peak demand, wasting a great deal of resources.
Third, Hive does not support updates, so data merging requires a lot of hand-written code: for example, joining a temporary table of new data with the original Hive table and then overwriting the whole table or selected partitions. Development cost is high and the business logic is complicated.
Introducing Hudi solves these problems to a large extent. Data enters the lake through CDC and is written into Hudi by Spark or Flink; real-time updates are supported, and end-to-end latency of minutes can be achieved. Data is merged into the lake in small micro-batches spread across the day and night, so resources are used evenly, and the data lake cluster's TCO is expected to drop by 20%. In addition, data integration scripts can use Hudi's update capability: what used to take hundreds of lines of Hive code can be done with a single line of script, greatly improving development efficiency.
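As a rough idea of what that single-statement upsert looks like with the open-source Hudi Spark datasource (the table, key fields, and paths below are hypothetical):

```python
# Minimal sketch of the Hudi upsert that replaces hand-written Hive
# merge-and-overwrite logic. Table, key, and path names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# Incremental changes captured from the production system (inserts + updates).
changes = spark.read.parquet("obs://lake-bucket/staging/orders_changes/")

(changes.write.format("hudi")
    .option("hoodie.table.name", "dwd_orders")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated")
    .mode("append")                     # upsert into the existing table
    .save("obs://lake-bucket/dwd/orders"))
```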
ICBC's data lake uses Hive to serve ad-hoc query workloads; for example, SAS accesses the data lake through Hive SQL. Access efficiency is low, response times are long, and concurrency is insufficient.
In addition, the two-tier lake-plus-warehouse architecture produces a large amount of duplicate data, and there is growing demand for correlated analysis, which inevitably means heavy ETL between lake and warehouse. For example, to support BI tools, data must be processed in both the lake and the warehouse and then imported into an OLAP engine. The overall data pipeline is long, and both analysis and development efficiency are very low.
HetuEngine's data virtualization enables collaborative lake-warehouse analysis. On the one hand, it replaces Hive SQL for accessing Hive data, supporting about 5 times the original concurrency with only 1/5 of the resources, while cutting access latency to the second level. On the other hand, it can access Hive and DWS at the same time for second-level federated queries, reducing data movement between systems by 80% and greatly shrinking the ETL process.
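To illustrate the kind of federated query this enables, the sketch below joins detail data in the lake with aggregates in the warehouse in a single statement. It assumes a Presto/Trino-compatible SQL interface and Python client can reach the engine, and uses hypothetical catalog, schema, and table names; the actual HetuEngine client, catalogs, and connection details depend on the deployment.

```python
# Sketch of a federated lake-warehouse query. The client, host, and
# catalog/schema/table names are assumptions for illustration only.
from trino.dbapi import connect

conn = connect(host="hetu-coordinator.example", port=8080, user="analyst")
cur = conn.cursor()

# Join detail data in the data lake with aggregates in the DWS warehouse
# in one SQL statement, with no ETL in between.
cur.execute("""
    SELECT w.branch_id,
           w.month_target,
           SUM(l.amount) AS month_actual
    FROM   dws.finance.branch_targets AS w   -- warehouse catalog
    JOIN   hive.dwd.orders            AS l   -- data lake catalog
           ON w.branch_id = l.branch_id
    GROUP  BY w.branch_id, w.month_target
""")
for row in cur.fetchall():
    print(row)
```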
Traditional OLAP solutions generally use MySQL, Oracle, or other OLAP engines with limited processing capability. Data is organized by subject or theme into marts and then connected to BI tools, which disconnects BI users from the data engineers who supply the data. For example, when a BI user has a new requirement and the needed data is not in the thematic mart, the request has to go to a data engineer to develop the corresponding ETL task; this usually requires cross-department coordination and takes a long time.
Now all detail data can be loaded into ClickHouse as one large wide table, and BI users can do self-service analysis directly on it. Far fewer requests go back to the data engineers; in most cases new requirements need no new data supply at all, so development efficiency and the rate at which BI reports go online improve greatly.
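A small sketch of this self-service pattern: once detail data sits in one wide ClickHouse table, a new BI question is just another aggregation over it rather than a new ETL request. The host, table, and column names below are hypothetical.

```python
# Sketch of self-service analysis over a wide ClickHouse table.
# Host, table, and column names are hypothetical.
from clickhouse_driver import Client

client = Client(host="clickhouse.example")

# A new analysis question needs no new ETL: it is simply another
# aggregation over the existing wide table.
rows = client.execute("""
    SELECT branch_id,
           product_line,
           count()     AS txn_count,
           sum(amount) AS txn_amount
    FROM   dm.wide_transactions
    WHERE  txn_date >= today() - 30
    GROUP  BY branch_id, product_line
    ORDER  BY txn_amount DESC
    LIMIT  20
""")
print(rows[:5])
```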
This approach has proven very effective in our own practice. With the traditional OLAP-engine modeling approach, limited development efficiency meant only a few dozen reports went online over several years; after switching to ClickHouse, hundreds of reports went online within a few months, greatly improving report development efficiency.
Note: This post is adapted from a speech given at the "Banking Industry AI Ecological Cloud Summit" hosted by Leifeng.com.