Introduction
In September 2021, a very large financial institution completed the final full migration of a core database of more than 20TB, laying a solid foundation for its subsequent evolution toward a cloud-native, multi-active architecture. The successful go-live of this full core-database migration has set a benchmark for the financial industry in putting technology capability into practice. Liu Weiguang, Vice President of Alibaba Group and General Manager of the Alibaba Cloud Smart New Finance & Internet Business Unit, has distilled the complete steps and technical strategies of the year-long migration into this article.
"Practice brings true knowledge", Alibaba Cloud and OceanBase have taken a solid step in assisting the comprehensive migration of domestic databases of super-large financial institutions, and accumulated precious experience. Therefore, this article is not an analysis and imagination of database replacement, but a technical guide to replace the technical platform of the actual large-scale and complex core application system. There are various problems that were unexpected in the "analysis" article in the process. Especially for various adaptations and compatibility of the existing operating environment, friendliness to applications, etc., how to solve these problems, this article gives detailed solutions one by one.
Against the national backdrop of building technological strength and achieving a high level of self-reliance in science and technology, a very large insurance (group) company pushed its digital transformation further, followed the trend of pioneering technologies, and launched a forward-looking program to move its IT architecture to a distributed model. In September 2021 it completed the final full migration of a core database of more than 20TB, laying a solid foundation for the subsequent evolution toward a cloud-native, multi-active architecture. The successful go-live of this database localization project has set a benchmark for the financial industry, supports the national strategy of technological self-reliance, and helps drive the rapid maturation of the entire domestic database industry chain, from management to applications.
In the insurance industry, although short-term concurrency pressure is lower than at Internet companies, business complexity and reliance on proprietary database features are far greater. Insurance processing is more complex: a single transaction must be completed across multiple systems, and the call chain is longer and more intricate than in banking or Internet businesses. Guaranteeing stability under such complex call chains and large transaction volumes is the key challenge in moving insurance business databases onto domestic products.
Because financial institutions have strict requirements for business continuity and data accuracy, no traditional leading financial institution had completed a full migration to domestic databases until this insurance company did so, achieving five breakthroughs.
Short migration time
From September 2020 to September 2021, the migration took only one year, a scale of full core-system migration that no traditional financial institution had yet achieved.
Record-breaking scale of migration
Within one year, the company completed the full migration of the online, traditionally centralized databases of nearly 100 business systems, including the traditional core, Internet core, individual insurance sales, group insurance sales, operations management, customer service management, and big data. The migrated data exceeded 400TB and more than 100 billion rows, the largest single database exceeded 20TB, and the project's overall server footprint exceeded 20,000 cores.
Ensure business continuity and data accuracy at the same time
No switchover had to be rolled back during the entire migration. In the nearly one year since going live, the system has run stably through the full cycle of the 2021 "business exams"; it has withstood the rigorous test of every business link, fully meets production needs, and marks the leap of domestic databases from merely usable to easy to use.
Realize 100% independent innovation of technology
The migration was built on a fully self-developed domestic native distributed database. During the migration, more than 50 version upgrades were released, and the longest turnaround for a requirement was two months (Pro*C + Tuxedo support). Through systematic training and knowledge transfer, more than 500 employees passed database certification exams, giving the company full command of the database.
Next-generation technology becomes key productivity
After the migration, storage costs dropped significantly and performance improved markedly. The database evolved from an active-standby model to a multi-active deployment across two regions and three data centers, and the handling time for production incidents was shortened from hours to minutes.
Looking back, the journey was difficult, but it produced valuable hands-on experience in migrating large financial institutions to domestic databases.
Domestic financial-grade database migration practice
Initial preparation work
1. Database selection
The database is the crown jewel of enterprise IT infrastructure: it stores the core data assets of the business, supports applications above it, and shields the underlying infrastructure below it. Under the financial industry's premise that "stability overrides everything", database selection is especially cautious. According to the "Database Development Research Report (2021)" of the China Academy of Information and Communications Technology, by the end of June 2021 there were as many as 81 domestic relational database vendors. Faced with such a crowded field, choosing the right database was the first question before the insurance company. Despite the abundance of products, after careful evaluation three products, including OceanBase and PolarDB, were selected for the initial pilot verification. The main selection criteria were as follows:
- Whether it supports smooth business migration and future architecture evolution;
- Whether it supports layered decoupling, in particular decoupling the database from the underlying hardware, operating system, and middleware;
- Whether there is sufficient talent reserve and capital investment to guarantee the product's long-term evolution and the business bottom line;
- Whether there are extensive industry practice cases;
- Whether it is fully independently developed;
- Whether it is compatible with the existing development and operations system, and whether the company's own technical staff can master it quickly.
2. Infrastructure preparation
The insurance company's core business systems originally ran on more than 60 IBM and HP high-end minicomputers and more than 70 high-end storage arrays, tightly coupled to the Oracle architecture and therefore hard to scale linearly in capacity and performance. For the domestic databases, rack-mounted servers with local storage fully replaced the imported minicomputers and traditional SAN storage, supporting the cloud-native, distributed transformation required by the full core-system migration. To avoid destabilizing the business systems with too many infrastructure changes at once, a hybrid deployment of Intel, Haiguang, and Kunpeng servers was adopted: Intel x86 dominated in the early phase, with a gradual transition to domestic servers based on Haiguang and Kunpeng chips. Different machine types can be adjusted online, easing dependence on infrastructure supply.
After the domestic database migration project officially launched in September 2020, it took less than two months to go from hardware model selection, target system selection, and capacity planning to completing, from scratch, the hardware and operating system adaptation of the domestic databases and the build-out of the entire server cluster.
3. Development of migration strategy
After years of development, the insurance company's business covers the whole country, with distinctive characteristics, many product lines, and intricate relationships between systems. Migrating the core databases required extensive research and rigorous validation: the database products had to match the original production databases in performance, security, and reliability, while enabling the smooth migration of many systems and providing elastic resources and horizontal scalability. Unified norms and standards for the migration were therefore established, a methodology of evaluation, implementation, control, and analysis and improvement was followed throughout, and the migration proceeded in an orderly fashion under three strategies:
- Migrate first, then transform and upgrade the business and architecture, so that multiple variables do not change at once and threaten business continuity; the original data model is not transformed, and the new database bears the main adaptation work;
- Organize migration batches by business system, moving from low load to high load and from peripheral systems to the core;
- Complete the full migration and transformation of all business-system databases within one year, with every migration action confined to 0:00-6:00 a.m. on Saturdays and Sundays, low-traffic verification over the weekend, and heightened support on Mondays, so that normal business is not affected.
Internet Core Migration
1. Business Background
Although the insurance company's core involves many systems, it is mainly divided into the Internet core and the traditional core, which are decoupled asynchronously through an ESB-like bus in between.
Since 2016, the company's Internet core and new traditional-core applications have been transformed from a monolithic architecture into a distributed microservice architecture. By 2020, the Internet core business systems had been split into more than 40 microservice modules and moved onto a service mesh. The Internet core is characterized by:
- Nationally centralized databases, both physically and logically, with many associated systems connecting to them;
- Despite the microservice split, the databases still contain a fair number of stored procedures and use advanced features such as triggers, custom types, functions, foreign keys, and partitioned tables;
- To serve more than one million agents well, the business places high demands on the elasticity and performance of database resources.
Therefore, the main technical challenges facing database migration at the core of the Internet are:
- A single point of failure in the nationally centralized deployment would affect the whole country;
- The customer master data system, the account-opening entry point of the entire insurance business chain, connects to 43 internal systems, holds more than 20TB of data with the largest single table exceeding 5 billion rows, and handles more than 20 million interface calls per day, the highest average daily database request volume of any system. With so many associated systems and its position at the heart of the business chain, the efficiency requirements on its SQL are very high, and the migration must not affect the original production system;
- After migrating to the new distributed database platform, data must still be synchronized to Kafka in real time, in a format compatible with the original, for consumption by downstream big data systems.
2. Technical solutions
(1) Overall selection
To address these challenges, PolarDB, whose architecture is closest to the original Oracle RAC, was chosen to replace the Internet core databases. As a new generation of cloud-native database, PolarDB has the following main characteristics:
- Compute and storage are separated, with shared distributed storage meeting elastic business growth and greatly reducing storage costs;
- Read/write splitting with one writer and multiple readers: the PolarDB engine uses a multi-node cluster with one primary node (read-write) and at least one read-only node (up to 15 read-only nodes are supported); writes go to the primary node while reads are distributed evenly across the read-only nodes, providing automatic read/write splitting;
- Kubernetes-based deployment provides minute-level specification changes and upgrades, second-level failure recovery, global data consistency, and complete backup and disaster-recovery services;
- Its centralized architecture requires no distributed-specific design, stays consistent with existing usage habits, and performs no worse than the original database;
- It is highly compatible with Oracle, so applications need almost no SQL changes.
(2) Migration method
To avoid impacting production and to guarantee strict consistency of the migrated data, a DTS full + incremental approach was used. For Oracle clusters with very large data volumes, such as the customer master data system, the migration link was started two weeks in advance. Before the full data migration begins, DTS starts its incremental capture module, which pulls the source instance's incremental updates and parses, packages, and stores them locally.
Once the full migration finishes, DTS starts the incremental replay module, which takes the captured changes, reverse-parses, filters, and repackages them, and applies them to the target instance, with the target's primary keys guaranteeing uniqueness. After the application switchover, interface response times improved by about 30% compared with Oracle. By the end of 2020, the two sides had jointly completed the migration of all Internet core modules, more than 40 business systems in total, including a billing app serving over one million agents, a life insurance app with over 100 million registered users, and the customer master data system.
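As a schematic illustration of the ordering this method relies on (incremental capture starts before the full copy, and replay afterwards is kept idempotent by the target's primary keys), the sketch below uses hypothetical stand-in functions; it is not DTS code, only the shape of the full + incremental pattern under those assumptions:

```c
/* Schematic sketch of the full + incremental migration pattern described
 * above.  All functions are hypothetical stand-ins, not a real DTS API.   */
#include <stdio.h>

/* 1. Remember a log position (e.g. an SCN) before anything else starts.   */
static long capture_start_position(void)        { return 1000L; }
/* 2. Continuously pull source changes from that position and stage them
 *    in local storage; this runs in parallel with the full copy.          */
static void stage_incremental_changes(long pos) { printf("staging changes since position %ld\n", pos); }
/* 3. Bulk-copy the existing rows; for a 20TB database this can take days. */
static void copy_full_snapshot(void)            { printf("copying the full snapshot\n"); }
/* 4. Replay the staged changes onto the target; primary keys make
 *    re-applied rows idempotent, so overlap with the snapshot is harmless.*/
static void replay_staged_changes(long pos)     { printf("replaying staged changes since %ld\n", pos); }
/* 5. Report how far the target still lags behind the source.              */
static long replication_lag_seconds(void)       { return 0L; }

int main(void)
{
    long pos = capture_start_position();   /* capture starts BEFORE the full copy */
    stage_incremental_changes(pos);
    copy_full_snapshot();
    replay_staged_changes(pos);

    if (replication_lag_seconds() == 0)     /* once caught up, pause writes briefly, */
        printf("lag ~0: suspend writes, verify data, switch the application\n");
    return 0;
}
```

The essential point the sketch encodes is that capture must begin before the snapshot copy; any change that lands during the copy is either already in the snapshot or replayed afterwards, and primary-key de-duplication makes both cases safe.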
To reduce the impact on downstream big data consumers during the migration, the synchronization link to big data was transformed in two steps.
The first step was to add real-time reverse synchronization from PolarDB back to Oracle, leaving the original Oracle-to-Kafka link unchanged so that the database switchover would not introduce too many changes at once;
The second step was to customize and extend DTS to produce output in the SharePlex format and, once verification was sufficient, directly replace the original SharePlex synchronization link.
(3) Main challenges
After the migration, PolarDB, as the Internet core database, had to stably support the business sprint in the first quarter of 2021. The front-end ordering system carried the brunt of the performance pressure; after the microservice transformation it had been split into more than 30 modules spread across multiple databases, any one of which could become overloaded. Before the move to PolarDB these modules were spread across multiple Oracle RAC clusters and monitored with internally developed database monitoring. After migrating to PolarDB, the overall architecture is better able to meet the demands of business elasticity:
- Unified management and control: clusters of machines are managed uniformly through PolarStack to provide DBaaS services;
- Resource elasticity: instances moved from physical-machine deployment to K8S pods, becoming far more flexible and elastic;
- Read/write splitting: the intelligent proxy provides automatic read/write splitting, minute-level scale-out, and automatic switchover on failure, with no application changes required.
On the day of the business sprint, traffic peaked three times, at 12:00, 17:00, and 21:00. Both the hourly and the full-day policy issuance counts ranked among the top three in the company's history, and peak issuance reached 9,000 orders per second.
(4) Migration process
- In September 2020, the first batch of Internet core application modules was migrated to PolarDB, with the entire adaptation taking less than a month; large-scale migration of the other Internet core modules then began;
- In November 2020, PolarDB completed the migration of the customer master data system, the largest single database;
- At the end of January 2021, PolarDB, as the database behind the Internet core ordering system, stably supported the company's business sprint in the first quarter of 2021.
Traditional core migration
1. Business Background
The traditional core systems of this large insurance company have a long history: some were built before 1998, others between 2004 and 2008. The time span is long and the data volumes are huge, with a single database exceeding 20TB. Even more challenging, many of the old cores are split by province and city, so a single old core system may have as many as 36 databases to migrate. Broadly, the traditional core falls into three categories of systems:
The first category: about 13 new core systems developed on the Java technology stack in 2016 and 2017;
The second category: about 6 old core systems built before 1998 or between 2004 and 2008;
The third category: systems likely to be decommissioned, which are outside the scope of this migration.
The main technical challenges these traditional core databases posed at the time were:
- The relationships between systems are complex, spanning both policy administration platforms and funds systems, and are hard to untangle;
- Some new cores are centralized both physically and logically, while the old cores are physically centralized but logically separated, deployed per province with one database set per province, making the migration workload enormous;
- They rely heavily on Oracle proprietary features, with extensive use of stored procedures, triggers, custom types, functions, and foreign keys. More challenging still, the old cores use Pro*C (C with embedded SQL) and Tuxedo (Oracle's middleware for distributed transaction processing) to handle policy workflows; one annuity system alone involves more than 1,500 Pro*C programs and about 1.4 million lines of code, which the business could not rewrite in a short time;
- Individual databases are very large: six exceed 10TB, the largest exceeds 20TB, and the downtime windows are short;
- Transaction volumes are large, with billions of database calls per day and many complex aggregate actuarial and settlement transactions.
2. Technical solutions
(1) Selection plan
To address these challenges, the distributed database OceanBase was chosen to replace the traditional core. As a native distributed database, OceanBase has the following main characteristics:
- A multi-replica, shared-nothing architecture with no single point of failure, ensuring continuous availability;
- An LSM-tree-based storage engine that, combined with modern hardware, provides high performance and scalability on a horizontally scalable distributed architecture;
- Very high application compatibility with the most widely used database ecosystems, Oracle and MySQL;
- Although the architecture is distributed, applications generally need no distributed-specific redesign, such as choosing distribution keys, and usage stays essentially consistent with existing Oracle habits;
- OceanBase is completely self-developed and relies on no external open-source code, making it genuinely independent.
(2) Migration method
Comprehensive verification of the traditional core's complex database landscape produced a 140-page migration operation manual and a detailed cutover calendar, accumulating valuable experience for subsequent migrations and large-scale rollout and forming a standard migration, cutover, and acceptance plan.
The overall migration process is broken down into four major stages: basic environment preparation, migration rehearsal, formal cutover, and monitoring and operations, with each task assigned to a specific person and scheduled to the minute.
For the migration of large-scale Oracle databases, we summarized the following four practices that help improve migration efficiency:
First, separate hot and cold data.
Business data has a life cycle, and access frequency shows clear hot and cold characteristics. For example, historical rows in transaction-flow tables and log tables are rarely accessed outside audit scenarios, yet this data is usually large, costly to migrate, and prolongs the migration. Such data can be archived and backed up in advance and then migrated separately, either statically or as a dedicated full migration with the OMS tool.
Second, tables with LOB columns.
Rows with LOB columns occupy much more space in Oracle, so each migration batch grows significantly beyond the base row size and the memory demand on the OMS side is several times that of non-LOB tables. The optimized strategy is to create dedicated links for LOB tables with lower concurrency to avoid JVM OOM, while running multiple links in parallel to keep the overall migration speed up.
Third, tables without LOB columns.
For tables without LOB columns, per-batch sizes are small and stable and memory requirements are controllable, so concurrency can be raised moderately to speed up migration; this data can be migrated over a single link or several links with higher concurrency.
Fourth, run multiple database migration links concurrently on different OMS nodes.
A single OMS instance can carry multiple migration tasks, but they share one network egress. Because very large databases pull data continuously, spreading their migrations across different OMS nodes reduces contention for network bandwidth.
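The four rules amount to a simple classification of tables into migration links. The sketch below is purely illustrative; the table names, flags, and grouping logic are hypothetical and not taken from OMS:

```c
/* Illustrative grouping of tables into migration links according to the
 * four rules above (hypothetical tables and flags, not OMS code).         */
#include <stdio.h>
#include <stdbool.h>

struct table_info {
    const char *name;
    bool        has_lob;   /* contains CLOB/BLOB columns                   */
    bool        is_cold;   /* historical flow/log data rarely accessed     */
};

int main(void)
{
    const struct table_info tables[] = {
        { "t_policy",         false, false },
        { "t_claim_document", true,  false },
        { "t_trans_history",  false, true  },
    };

    for (size_t i = 0; i < sizeof tables / sizeof tables[0]; i++) {
        const struct table_info *t = &tables[i];
        if (t->is_cold)
            /* Rule 1: archive in advance, migrate separately (static/offline link). */
            printf("%-17s -> archive first, migrate via a separate static link\n", t->name);
        else if (t->has_lob)
            /* Rule 2: dedicated link, small batches, low concurrency to avoid OOM.  */
            printf("%-17s -> dedicated LOB link, low concurrency\n", t->name);
        else
            /* Rule 3: shared link with higher concurrency.                          */
            printf("%-17s -> shared link, higher concurrency\n", t->name);
    }
    /* Rule 4: very large databases are additionally spread across different OMS
     * nodes so their continuous pulls do not contend for a single network egress.  */
    return 0;
}
```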
(3) Main challenges
The hardest part, however, was Pro*C compatibility. Pro*C is Oracle's application development tool that lets programs written in C embed SQL statements directly in the source code to operate on the database. Because it combines the traditional C development model with powerful database manipulation, it still has a large user base in insurance and other industries.
Tuxedo (Transactions for Unix, Extended for Distributed Operations), introduced by AT&T in the 1980s as one of the earliest distributed transaction products, is widely used in the traditional old core to invoke the Pro*C programs that process policy workflows and to guarantee cross-database transaction consistency.
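For readers who have not used it, the fragment below shows what a typical Pro*C source file looks like; the table, column, and credential names are hypothetical. The precompiler expands the EXEC SQL sections into calls to the database runtime library before the file is compiled as ordinary C, and it is exactly this style of embedded SQL that the compatible precompiler and runtime library had to reproduce:

```c
/* Minimal Pro*C example: ordinary C with embedded SQL.
 * Hypothetical schema; shown only to illustrate the programming model.   */
#include <stdio.h>
#include <string.h>
#include <sqlca.h>                       /* SQL communication area */

EXEC SQL BEGIN DECLARE SECTION;
    char   username[32] = "app_user";
    char   password[32] = "app_pass";
    char   policy_no[21];                /* host variables shared with SQL */
    double premium;
EXEC SQL END DECLARE SECTION;

int main(void)
{
    EXEC SQL WHENEVER SQLERROR GOTO sql_error;

    EXEC SQL CONNECT :username IDENTIFIED BY :password;

    strcpy(policy_no, "P2021000001");
    EXEC SQL SELECT premium INTO :premium
             FROM   t_policy             /* hypothetical table */
             WHERE  policy_no = :policy_no;

    printf("policy %s premium %.2f\n", policy_no, premium);

    EXEC SQL COMMIT WORK RELEASE;        /* commit and disconnect */
    return 0;

sql_error:
    fprintf(stderr, "SQL error: %.*s\n",
            sqlca.sqlerrm.sqlerrml, sqlca.sqlerrm.sqlerrmc);
    EXEC SQL WHENEVER SQLERROR CONTINUE;
    EXEC SQL ROLLBACK WORK RELEASE;
    return 1;
}
```

Because the host variables, the sqlca, and the WHENEVER error handling are all resolved at precompile time, code in this style can in principle be pointed at another database by swapping the precompiler and runtime library without touching the C source, which is what made the compatibility approach attractive.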
To solve the problem at its root and migrate the applications smoothly, Alibaba assembled a project team that built a Pro*C-compatible precompiler and runtime library from scratch within a month. Before the 2020 National Day holiday, the precompiler successfully compiled all of the annuity system's more than 1,000 Pro*C programs and correctly ran two typical batch jobs end to end, passing the company's acceptance well ahead of expectations. Winning this vendor bake-off earned the insurance company's confidence in OceanBase's product development capability.
That the old-core adaptation was completed so quickly is thanks to:
- A consistent commitment to self-development: the engineers have excellent individual ability, know the ins and outs of every line of the product's code, and can add or modify code quickly and with high quality, which is what genuinely independent development means;
- A research and development model that spans the full stack: Pro*C is only the external interface; underneath it depends on the database kernel, from the SQL mode and optimizer to the server. For example, during on-site joint debugging of batch jobs, when it was found that SQL did not support the 'J' format element of the to_date function, the issue was fed back to the SQL team, and the backend completed development, testing, and release in just one day;
- An agile model in which the task force's developers and testers sat together and adjusted the day's goals as the project progressed and changed. With the boundary between development and testing removed, testers wrote unit and integration test cases while features were still being developed, and every small increment on the development side was verified immediately, so development and testing finished nearly simultaneously.
(4) Migration process
In October 2020, the first traditional new-core system, claims settlement, went live successfully;
In March 2021, the smallest province of the traditional old core was migrated;
In April 2021, the migration of the 13 traditional new cores was completed;
In August 2021, the last major province of the traditional old core was migrated;
In September 2021, the last single database of the traditional old core was migrated and went live.
Comprehensive system migration
Another problem encountered during the core migration was how to migrate systematically and at scale. Although the smallest province was migrated in March 2021, many old cores were still deployed independently across 36 provinces and cities, with each province's Oracle containing more than 20 schemas. With the old approach, each province would need more than 20 migration links, consuming enormous resources and manpower and impossible to finish in a short time. Analysis showed that the biggest obstacle to engineering batch migration was the lack of end-to-end automation: too many steps were still manual. To solve this, the product, R&D, and on-site delivery teams did three things:
- At the technical level, the OMS data migration tool added support for multi-schema merging on the underlying link, so that the 20-plus links of one province could be merged into a single migration link;
- At the product level, the migration tool's underlying capabilities were broken apart, previously manual steps were automated, and everything was exposed through APIs so that front-line delivery engineers could assemble them like building blocks according to each customer's actual situation;
- Based on the exposed APIs and the 140-plus-page migration operation manual, the delivery engineers developed, within a month, a quick-migration tool that simplifies migration link configuration.
After four iterations, the quick-migration tool went into use, reducing the work requiring manual intervention by 80%. At the same time, unified norms and standards for migration implementation were established so that migrations could proceed in an orderly way. The standard go-live process comprises 8 major stages, 98 steps, and 5 rounds of peak load testing; the 8 stages of a systematic migration are as follows:
- Compatibility assessment: clarify the scope of changes, estimate the adaptation workload, and plan the work reasonably;
- Load assessment: capture SQL load from the original database and replay it in the new database's test environment to verify post-migration performance;
- Test migration and adaptation: carry out adaptation, full regression testing, and performance testing. Systems in good shape (well microserviced, refactored, etc.) can be transformed and migrated in batches; performance testing can use the pre-migration critical-business capacity baseline as the acceptance standard;
- Production full and incremental migration: systems with low business-continuity requirements generally use a one-off full-copy migration; systems with high requirements use full + incremental migration, switching over once the incremental data has caught up, which requires only a minute-level pause of the business at the switch point;
- Reverse synchronization: for key applications, data can be synchronized back to the original database to hedge against unknown risks;
- Data verification: after migration, verify data accuracy and validate the application;
- Continuous monitoring: monitor, and evaluate and analyze in detail, any problems that may arise;
- Online stress testing: after migration, run regular online stress tests, including full-link stress tests based on real production scenarios, to continuously ensure application capacity and performance.
In May 2021, a province in the west was migrated successfully within 2 hours, validating multi-schema merge migration on the Oracle side, a key technical difficulty. Efficiency improved severalfold compared with earlier migrations, clearing the way for migrating the remaining provinces in parallel. After optimization:
- Test environment: data migration and stress-test replay are performed independently, and the automatic SQL optimization and suggestion tool greatly improves migration verification efficiency, allowing more than 90% of issues to be resolved by the team itself;
- Production environment: the time-consuming, labor-intensive steps that previously required manual inspection are automated.
Next came the data migrations of the three northeastern provinces and Inner Mongolia; in the process, dirty data caused by invisible control characters at the Oracle source was resolved, keeping the data accurate.
In August 2021, after the preceding 11 migrations, the last and most important province, the one with the largest data volume, was finally migrated.
In September 2021, with all technical problems solved, all core databases migrated, and the "good start" sales season weathered, only one hurdle remained before a full business cycle of an insurance company could be completed: the actuarial run.
Actuarial work is a distinctive part of an insurance company's operations: it applies mathematics, statistics, finance, insurance, and demography to the items in commercial insurance and social security businesses that require precise calculation. It is usually performed at quarter-end and year-end to measure the company's operating condition and to design more competitive insurance products, and it is an indispensable part of the insurance business.
Actuarial analysis involves huge data volumes, complex models, and heavy data writes, and a run often lasts a week or even longer. In addition, the data as of the snapshot point must not change during the run; under the traditional IOE architecture this was usually achieved with storage-layer snapshots.
After the move to a distributed database, completing the actuarial run without stopping the applications was the last obstacle of the entire migration. After repeated evaluation, Alibaba Cloud devised a solution built on OceanBase's fast physical backup of underlying data blocks and its table-level restore capability. After nearly a month of stress testing and verification, cluster restore speed reached 800MB/s, fully meeting the time requirements for actuarial backup and restore. On September 30, 2021, the data was backed up within the specified window and imported into the actuarial database, supporting the fully migrated actuarial business and closing out the last loose end.
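As a rough sanity check using figures already quoted in this article (and assuming the throughput is sustained), restoring even the largest single database of about 20TB at 800MB/s takes roughly 20 × 1024 × 1024 MB ÷ 800 MB/s ≈ 26,000 seconds, a little over 7 hours, which fits comfortably within a weekend-scale window for preparing the actuarial environment.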
Summary of main issues
Of course, the migration was not entirely smooth. Although no major production accidents occurred, several failures did happen along the way. These failures reflect both how domestic databases improve their ability to handle complex scenarios and the fundamental changes brought by the distributed architecture.
1. A full connection pool repeatedly triggered high-availability switchovers
The biggest problem during the Internet core's migration to PolarDB occurred in January 2021, when two important systems serving C-end users completed data migration and application cutover in the early morning. As daytime traffic grew, a large number of slow queries caused application connections to pile up in both systems, blocking the database service and triggering automatic high-availability switchovers of the PolarDB instances several times during the day, each followed by node rebuild and recovery.
A database service node deployed as a cloud-native container is constrained not only by its own database memory parameters but also by the CPU and memory limits of its cgroup. Once the connection pool filled up, memory exceeded the limit, causing the instance to switch over and rebuild repeatedly. Container-based deployment of cloud-native databases therefore needs many enhancements in stability and self-protection. To address these issues, later versions added features such as global plan cache, resource manager, parallel log replay, and global indexes, and the database kernel parameters were tuned one by one for financial scenarios.
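A back-of-the-envelope illustration of the failure mode, with purely hypothetical numbers: a node's memory footprint is roughly the shared buffer pool plus the number of connections times the per-connection working memory. With, say, a 64GB cgroup limit, a 32GB buffer pool, and about 40MB of working memory per stalled backend, on the order of 800 piled-up connections is already enough to breach the limit, get the container killed, and trigger an HA switchover.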
To meet the extremely high stability demands of financial scenarios, this Internet core migration also added many management, control, and operations capabilities:
- An AWR function, with reports collected regularly to analyze performance and availability;
- A GAWR function to collect full metrics on hosts, containers, and RW/RO nodes;
- An online promote function to optimize online switchover and further shorten switch time;
- Automatic timeout and disconnection of idle sessions, reducing the number of background processes and promptly releasing and reclaiming idle-session memory;
- Optimized metadata caching, from a session-level cache to a global cache, to reduce background-process memory usage, plus overall memory resource management: once configured thresholds are reached, the database starts canceling queries, killing idle sessions, killing active sessions, and rejecting new connections, strengthening its ability to protect itself.
2. A SAN switch failure left the database temporarily leaderless
Because the original Oracle databases ran on SAN storage, and the local SSD hardware recommended for OceanBase had not yet been purchased when migration started in September 2020, the traditional new-core OceanBase cluster was initially deployed on SAN storage to get the migration moving quickly, which planted the seed of the first production problem.
After the first traditional new-core application, claims settlement, went live, the system ran fairly smoothly until one day at 14:07, when both application monitoring and database monitoring raised alerts. Monitoring showed the application had been blocked for 90 seconds, yet while both teams were still troubleshooting, the database recovered on its own.
In-depth analysis found that a port connecting the SAN storage switch to the core switch had failed. Although multipathing was configured, the mismatch between the operating-system kernel's I/O timeout and OceanBase's leader-switch timing triggered OceanBase's automatic leader election. During the election, the other physical machine, whose I/O also passed through the same faulty port, blocked as well, so OceanBase briefly entered a leaderless state; once the multipath software completed its switchover, OceanBase recovered automatically without any intervention. In essence, the fault came from a mismatch between software and hardware timeout parameters, a sign of insufficient run-in between the software and hardware stack; adjusting the relevant parameters can reduce the RTO.
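Stated roughly, as a reasoning aid rather than a vendor formula: if the storage path's failover time exceeds the database's failure-detection window, an election is triggered even though the node itself is healthy, and when every replica's I/O passes through the same faulty path the observed outage is bounded by roughly max(multipath failover time, leader election time). Tuning the parameters so that path failover completes inside the detection window, or ensuring replicas do not share a single storage path, keeps a transient link fault from escalating into an election at all.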
Before this failure, everyone's understanding of OceanBase's RPO = 0 and RTO < 30 seconds had stayed at the PowerPoint level; only after it did people truly appreciate how much fast switchover and automatic recovery matter when a fault occurs. There were also doubters inside the project team at the time, arguing that deploying OceanBase on SAN storage was inherently wrong and that OceanBase should not be used. Deeper analysis, however, showed the problem lay neither with OceanBase nor with the SAN storage, but with insufficient run-in and with whether the software and hardware parameters fit each other.
The IOE architecture became the best combination for centralized systems precisely because extensive practice and hardening across scenarios let both software and hardware run in their optimal state. In the end, everyone realized that tuning parameters alone would not solve the problem fundamentally: the OceanBase cluster originally deployed on SAN storage was moved onto servers with local disks, and later evolved into a multi-active deployment across two regions and three data centers.
3. An execution plan jump caused the business to freeze
Any database vendor claiming 100% Oracle compatibility and a guaranteed problem-free migration is overselling. Even with thorough stress testing beforehand and coverage of as many business scenarios as possible, stability and compatibility after cutover remain a question mark. The key is timely, effective monitoring and rapid emergency response when problems occur; for applications already in production, emergency handling always comes first.
On a weekend in November, slow SQL appeared in the claims settlement system, causing its bill summary job to hang and time out. Why, after more than half a month of stable operation with no business changes, did problems appear during a weekend business trough? On-site delivery experts quickly located the cause: OceanBase's execution plan had jumped to a wrong plan.
Like other databases, OceanBase uses a plan cache so that the same SQL text (with different parameters) can skip parsing and plan generation, improving SQL performance. In real scenarios, however, the parameters vary widely: just as Taobao's Double 11 has hot inventory items, the insurance industry has both very large and very small agency codes. The SQL looks identical, yet different parameter values call for different optimizations and execution paths.
To pick the optimal plan, Oracle periodically gathers object statistics (for example, in the nightly maintenance window), evicts old plans, and regenerates plans that reflect the latest data. OceanBase works similarly, but because it freezes and merges data daily and flushes incremental data to disk, the actual data characteristics of objects (row counts, column values, averages, and so on) change considerably. After each merge the plan cache is therefore cleared, and a plan is generated and cached from the first set of parameters that arrives, with only one plan retained by default. That weekend the first parameters to arrive happened to be unrepresentative, so the cached plan was wrong for subsequent executions and performance collapsed.
Plan instability is a fairly common database performance phenomenon; Oracle has introduced Cursor Sharing, Outlines, bind peeking, ACS, SPM, and other mechanisms over the years to mitigate it, yet wrong plans still cannot be entirely avoided in production. Closing the gap from 99% to 100% Oracle compatibility and optimization is the hardest part and cannot happen overnight. For such low-probability events, emergency handling is the last line of defense: binding an execution plan in the database, without touching the application, is an effective and easy remedy when an incident occurs. Over the course of the migration the project team became thoroughly familiar with occasional plan jumps, and they caused no further unexpected impact.
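To make the mechanism concrete, the fragment below uses ordinary Pro*C-style embedded SQL with a hypothetical table and agency codes (it is not OceanBase-specific syntax); it shows how the first bind value seen after a merge fixes the cached plan that every later value then reuses:

```c
/* Illustration of plan caching with skewed bind values (hypothetical schema). */
#include <stdio.h>
#include <string.h>
#include <sqlca.h>

EXEC SQL BEGIN DECLARE SECTION;
    char   username[32] = "app_user";
    char   password[32] = "app_pass";
    char   agency_no[11];
    double total_amount;
EXEC SQL END DECLARE SECTION;

int main(void)
{
    EXEC SQL WHENEVER SQLERROR CONTINUE;      /* error handling omitted for brevity */
    EXEC SQL CONNECT :username IDENTIFIED BY :password;

    /* First execution after the daily merge: a tiny agency.  The optimizer
     * generates and caches a plan (e.g. an index range scan on agency_no)
     * that suits this value.                                              */
    strcpy(agency_no, "A000000001");
    EXEC SQL SELECT NVL(SUM(bill_amount), 0) INTO :total_amount
             FROM   t_claim_bill
             WHERE  agency_no = :agency_no
               AND  bill_date >= TO_DATE('2021-11-01', 'YYYY-MM-DD');

    /* A later execution: a very large agency.  A plan driven by the date
     * predicate or a scan would now be cheaper, but by default the single
     * cached plan from the first value is reused, which is how one
     * unrepresentative first value degrades everything that follows.      */
    strcpy(agency_no, "A099999999");
    EXEC SQL SELECT NVL(SUM(bill_amount), 0) INTO :total_amount
             FROM   t_claim_bill
             WHERE  agency_no = :agency_no
               AND  bill_date >= TO_DATE('2021-11-01', 'YYYY-MM-DD');

    printf("summary amount for agency %s: %.2f\n", agency_no, total_amount);
    EXEC SQL COMMIT WORK RELEASE;
    return 0;
}
```

Binding a known-good plan in the database, as described above, changes which plan such statements pick up without requiring any change to application code of this kind.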
Overall effect
Over the past year, nearly 100 business systems have been fully upgraded to domestic databases, making the company the first large financial enterprise to complete a 100% domestic database upgrade of its core business systems, with OceanBase and PolarDB together accounting for more than 97% of the databases.
Through the full replacement of its databases, the company has gained comprehensive security assurance for its data assets and achieved the following:
- 100% secure and controllable database technology stack, get rid of the dependence on Oracle database;
- Get rid of reliance on minicomputers and high-end storage;
- Greater maturity of cloud-native and distributed database applications, moving from usable to easy to use while improving performance;
- Centralized management and control of database services, significantly reducing hardware and overall operation and maintenance costs;
- Genuine real-time elasticity and high availability, making peak events such as big promotions easy to handle.
Moving from a completely closed system architecture to a gradually, and then fully, open one gives the company genuine control over its core database technology. Thanks to the elasticity and resource pooling of the cloud-native and distributed architectures, "one database, multiple chip architectures" is now possible: a tenant can be switched to Haiguang server nodes with a single command, enabling smooth replacement with domestic hardware.
After the migration, the distributed database's efficient compression reduced storage to one third of the original size, and replacing high-end minicomputers with domestic rack-mounted servers saved nearly 200 million yuan in equipment investment.
Utilization of database servers and storage racks increased by 300%, and equipment power consumption fell to one third of the original level, an estimated saving of nearly 10 million kWh of electricity per year. This provides steady green momentum for the company's digital transformation, helps implement the national dual-carbon strategy, and reduces the incremental carbon emissions of the company's self-built data centers.
Conclusion
Today, the core business of most domestic financial institutions still runs on foreign databases, a reality we cannot avoid. Replacing a database is not merely swapping one product for another, and the goal is not just the label "domestic". What matters more is that the technology must advance: the new system must offer capabilities the old system and foreign products do not, not only in performance and stability, but in supporting the business with agility, handling massive volumes and unpredictable peaks, and raising financial-grade high availability to a higher level.
Over the years we have read many articles analyzing and imagining database replacement, but the actual replacement of the technology platform behind large, complex core application systems still throws up many problems those "analysis" articles never anticipate, especially around adaptation to and compatibility with the existing operating environment and friendliness to applications. On these questions, Alibaba has taken a solid step forward and accumulated valuable experience, setting a useful example for the localization journey ahead.
About the author
Liu Weiguang
Vice President of Alibaba Group and General Manager of the Alibaba Cloud Smart New Finance & Internet Business Unit. Before joining Alibaba Cloud, he was responsible at Ant Financial for the commercial promotion and ecosystem development of fintech and for the business development of Ant Blockchain. He has worked in the enterprise software market for many years, founded Pivotal Software's Greater China branch, and built its enterprise big data and enterprise-grade cloud computing PaaS platform business into the market leader.