Abstract: On April 21, 2021, China Pacific Insurance Group (CPIC) and Huawei Cloud completed the world's first rolling upgrade of a big data cluster across multiple versions.

This article is shared from the HUAWEI CLOUD community post "HUAWEI CLOUD FusionInsight helps CPIC complete a multi-version upgrade with zero business outage", author: Hourglass.

On April 21, 2021, China Pacific Insurance Group (CPIC) and Huawei Cloud completed the world's first rolling upgrade of a big data cluster across multiple versions. The traditional approach requires several offline upgrades; here the core production cluster was upgraded in a single pass, from FusionInsight HD C70 to FusionInsight MRS 8.0.2, spanning the intermediate C80 and 6.5.1 versions. At the same time, the big data cluster moved from physical machines to cloud services, a first for the financial industry. Over a two-week implementation, the rolling upgrade proceeded smoothly, with no interruption to CPIC's upper-layer business and no performance impact. The success of this cross-version rolling upgrade is significant for financial technology: it sets a new benchmark in the financial industry for upgrading big data services across multiple versions while preserving business continuity and the ability to keep evolving.

1. Project background

China Pacific Insurance Group selected Huawei Cloud FusionInsight in 2017 to build its insurance big data platform. As cooperation between CPIC and HUAWEI CLOUD deepened, CPIC's main internal business systems came to run on the HUAWEI CLOUD big data platform. In the early days, however, a separate big data cluster was built for each business system. Data could not be shared across clusters, data was stored redundantly, and the number of clusters made maintenance difficult. By the time of the upgrade, 18 big data clusters had been built, most of them running FusionInsight HD C70.

With the rapid growth of CPIC's business, new requirements emerged for unified management, data sharing, and upgrading of the big data platform. CPIC wanted the 18 production clusters on its live network upgraded and merged into one, and wanted the resulting big data cluster to be able to keep evolving in the future.

To this end, CPIC and Huawei Cloud decided to upgrade the existing 18 big data clusters from FusionInsight HD C70 to MRS 8.0. The main goals of the upgrade are:

  • Upgrade and merge the original clusters into one large cluster, and improve resource utilization by pooling resources;
  • Unify on the MRS platform version, which provides more complete resource monitoring and more accurate fault locating;
  • Move to the cloud platform, which allows resources to be allocated flexibly on demand, supports an evolvable integrated lakehouse architecture, and opens the way to other advanced services.

2. Project content

2.1 Technical challenges

The CPIC big data cluster deploys components such as HBase, Hive, HDFS, ZooKeeper, YARN, Oozie, Hue, and Spark as needed.

In addition, tens of thousands of jobs run in the cluster every day, which makes a rolling upgrade that the business does not notice even harder. The main challenges are as follows:

  1. For the major-version upgrade of the Hadoop kernel from X to 3.X, the community provides rolling upgrade capability only for HDFS. The community's native target version of YARN cannot be upgraded in a rolling fashion because its protocol differs from that of the original version;
  2. During an upgrade of the community's native HDFS, deleted files are not physically removed but are moved to the trash directory. In a rolling upgrade of a large-capacity cluster this puts pressure on storage and works against residual information protection; if the trash is not cleaned up in time, the disks fill up;
  3. For the major-version upgrade of the Hive kernel from X to 3.X, the community's native version cannot support rolling upgrade because the metadata format is incompatible between the two versions, the APIs have changed, and some syntax is incompatible;
  4. For the major-version upgrade of the HBase kernel from X to 2.X, major API changes between the two versions mean the community's native version cannot support rolling upgrade;
  5. With tens of thousands of tasks every day, jobs must keep running smoothly during the rolling upgrade, especially in core scenarios such as profit and loss analysis and impairment calculation;
  6. In a big data cluster with 600+ nodes, unexpected events must be anticipated during the upgrade, and hardware failures (disk, memory, and so on) must be handled quickly without affecting the upgrade;
  7. More than 70 business systems and hundreds of services run on this cluster, and none of them may be harmed during the rolling upgrade.

2.2 Technical guarantee

A rolling upgrade uses FusionInsight MRS mechanisms such as high availability, active/standby mode, multiple replicas, and rack policies to upgrade and restart only some of the nodes at a time, without affecting the overall business of the cluster. The process repeats batch by batch until every node in the cluster runs the new version.

The following figure shows an example of a rolling upgrade of HDFS components:
[Figure: rolling upgrade of HDFS components]
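As a rough illustration of the batch-by-batch idea described above, the following Python sketch groups nodes into batches that never span more than one rack and then upgrades one batch at a time, checking cluster health in between. The function names (`plan_batches`, `rolling_upgrade`), the one-rack-per-batch rule, and the `upgrade_node` and `cluster_healthy` callbacks are assumptions made for illustration; they are not FusionInsight MRS APIs, and the real scheduling logic is not described in this article.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def plan_batches(nodes: Iterable[Tuple[str, str]], max_batch_size: int) -> List[List[str]]:
    """Group (node, rack) pairs into batches that stay within a single rack.

    Assumption: with rack-aware replica placement, restarting nodes from one
    rack at a time leaves replicas on other racks available to serve requests.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    by_rack: Dict[str, List[str]] = defaultdict(list)
    for node, rack in nodes:
        by_rack[rack].append(node)
    batches: List[List[str]] = []
    for rack in sorted(by_rack):
        members = sorted(by_rack[rack])
        for start in range(0, len(members), max_batch_size):
            batches.append(members[start:start + max_batch_size])
    return batches


def rolling_upgrade(
    batches: List[List[str]],
    upgrade_node: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
) -> List[str]:
    """Upgrade batch by batch and stop as soon as the cluster reports unhealthy."""
    upgraded: List[str] = []
    for batch in batches:
        for node in batch:
            upgrade_node(node)  # hypothetical hook: upgrade and restart one node
            upgraded.append(node)
        if not cluster_healthy():  # hypothetical hook: service-level health check
            raise RuntimeError(f"cluster unhealthy after batch {batch}")
    return upgraded
```

With three nodes on `rackA` and two on `rackB` and a batch size of 2, `plan_batches` produces three batches: two for `rackA` and one for `rackB`. Keeping each batch inside one rack is only one possible policy; a production scheduler would also have to account for active/standby roles and per-component ordering.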

To address the technical challenges above, a rolling upgrade team was formed, made up of community PMC members, community committers, and version developers. The main technical guarantees are as follows:

  • Protocol synchronization, metadata mapping and conversion, API wrapping and conversion, and similar methods resolve the compatibility problems caused by differing community protocols, differing metadata formats, and API changes, so that clients of the lower-version components keep working normally during the rolling upgrade;
  • To address files not being physically deleted during the upgrade of community HDFS, the new version additionally implements automatic cleanup of the trash directory, converting logical deletion into physical deletion, and a tool was added to clean up the old version's trash directory periodically. This keeps infrastructure resources effectively used and reduces storage costs;
  • Based on component performance before and after the upgrade, the upgrade duration, and the bottlenecks that might appear during and after the upgrade, the architecture was adjusted and optimized accordingly, helping make the rolling upgrade controllable end to end, imperceptible to the business, and complete;
  • For operation and maintenance, the project team developed a dedicated upgrade management service interface. It drives the rolling upgrade end to end, step by step, makes the rolling upgrade status easy to view, and provides component-level control. To reduce the impact on the continuity of mission-critical services, the project implemented the ability to pause between upgrade batches, so that the upgrade can be suspended during critical jobs or peak hours and the business is not affected. In addition, to keep emergencies from interrupting the upgrade, the project implemented faulty-node isolation: when a fault occurs, the upgrade of the affected node can be skipped, so that fault handling and the upgrade proceed in parallel.
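The batch pause and faulty-node isolation described in the last point can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the project's upgrade management service: the names `UpgradeReport`, `in_pause_window`, and `run_batches`, the time-window representation, and the injected `clock` and `upgrade_node` callbacks are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class UpgradeReport:
    upgraded: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)  # isolated faulty nodes
    paused_at_batch: Optional[int] = None             # index to resume from, if paused


def in_pause_window(now: time, windows: List[Tuple[time, time]]) -> bool:
    """Return True if `now` is inside any [start, end) window.

    A window whose start is later than its end is treated as crossing midnight.
    """
    for start, end in windows:
        if start <= end:
            if start <= now < end:
                return True
        elif now >= start or now < end:
            return True
    return False


def run_batches(
    batches: List[List[str]],
    faulty: Set[str],
    pause_windows: List[Tuple[time, time]],
    clock: Callable[[], time],
    upgrade_node: Callable[[str], None],
) -> UpgradeReport:
    """Upgrade batches in order, pausing before a batch that would start in a
    pause window and skipping nodes that are marked faulty."""
    report = UpgradeReport()
    for index, batch in enumerate(batches):
        if in_pause_window(clock(), pause_windows):
            report.paused_at_batch = index  # caller resumes later from this batch
            break
        for node in batch:
            if node in faulty:
                report.skipped.append(node)  # handle the fault separately
                continue
            upgrade_node(node)  # hypothetical hook: upgrade and restart one node
            report.upgraded.append(node)
    return report
```

In this sketch the pause check runs only between batches, so a batch that has already started always finishes; skipped nodes are recorded so that they can be repaired and upgraded later. Passing the clock in as a callback keeps the logic easy to exercise with fixed times.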

2.3 Organizational guarantee

After the project started, a joint project team was established, with the relevant CPIC leaders as project managers and with Huawei delivery and R&D staff and CPIC R&D and operation and maintenance staff as members. The upgrade involved more than 20 application departments, and the platform carries a large number of complex services. To ensure a successful rolling upgrade with zero business interruption throughout, Huawei led the work over the six months spanning the periods before, during, and after the upgrade, the customer's business departments cooperated closely, and the project team put a thorough organizational guarantee system in place.
[Figure: CPIC upgrade project organizational guarantee]

  1. Pre-upgrade preparation: Under the overall coordination of the project team and with Huawei R&D support, code changes and verification for 70+ applications were completed and test reports were produced. To identify risks thoroughly, Huawei proactively provided hardware resources for the test environment, and the project team worked together with the application teams. To meet the upgrade prerequisites, Huawei experts investigated and gave guidance, and the pre-upgrade preparation work was carried out, including cluster small-file consolidation, client rectification, multiple cluster inspections, and repeated review and refinement of the upgrade plan;
  2. Upgrade process assurance: During the two weeks of the upgrade, Huawei arranged for R&D and solution experts to provide on-site support. Together with the CPIC joint project team, Huawei put in place a 24-hour duty roster, a mechanism for information feedback and communication between the joint project team and the application departments (the rolling upgrade requires business verification and confirmation after each component is upgraded), authorization of upgrade operations by the joint project team, and recording and monitoring of upgrade operations;
  3. Post-upgrade observation: After the rolling upgrade was completed, the joint project team coordinated business verification with each application department, and reports confirming normal operation were produced for all services. The Huawei project team then continued to observe for two weeks and committed the upgrade after confirming that the platform and applications were operating normally.

3. Summary and Outlook

Pacific Insurance and Huawei completed the financial industry's first rolling upgrade of a big data cluster across multiple versions. The upgrade was imperceptible to upper-layer business, never interrupted cluster operation, and had no impact on performance, effectively protecting customers' core interests and setting a new benchmark in the financial industry.

As digital technology continues to iterate, the traditional insurance operating model will change. Three major shifts are expected:

  1. From large numbers to small numbers: risk will be characterized more precisely with data, and insurers will perceive the shift from the large-number probabilities of the past to small numbers more keenly, which will fundamentally change the traditional operating model;
  2. From physical to virtual: data has become an important means of production, and identifying and assessing the risks of new types of assets through massive data will become a core capability of the insurance industry;
  3. From insurance to governance: digitalization will strengthen insurers' own risk management capabilities, and insurers will take a larger part in national and urban risk governance, gradually shifting from loss compensation to risk management and governance.

Looking to the future, Pacific Insurance will work with Huawei to continuously innovate, continuously improve the risk ecosystem, implement the "customer demand-oriented" strategy, and build a "first-class insurance financial service group that focuses on the main insurance business, continues to grow in value, and has international competitiveness."
