This article was written by Alibaba technical expert Ji Yuan in February 2022.
Background of the project
For a long time, the IOE technical architecture (IBM, Oracle, EMC) has been the standard configuration, and the only choice, for the core systems of the banking industry; the systems built on it were once regarded as the "golden architecture" of large financial enterprises' back offices. The bank customer's Class A tier-1 core business system was built in the mature "commercial system + centralized architecture" pattern: IBM mainframe, Db2 database, and EMC storage.
However, with the spread of emerging technologies such as cloud computing and cloud native, the traditional centralized architecture faces unprecedented challenges. The specific business pain points are as follows:
- Slow business iteration: the commercial software stack is closed, its ecosystem is limited, and it iterates slowly, so it responds weakly to financial innovation and struggles to serve a fast-changing domestic market and ever-changing user needs.
- Low performance and poor scalability: the centralized architecture scales poorly and lacks elasticity; it cannot sustain the enormous system load of high-concurrency scenarios such as "Double 11", and transactions slow down.
- High stability risk: online financial business requires 7x24 uninterrupted service, yet the traditional architecture is slow at operational monitoring and problem diagnosis, and cannot quickly restore service after application-, resource-, or data-center-level failures.
- High maintenance cost: the total cost (acquisition/expansion plus maintenance services) of a centralized system based on IBM mainframes and Db2 far exceeds that of an x86-based distributed system, and the cost pressure keeps growing.
- Lack of self-controllability: finance is an industry where independent controllability matters; IOE vendors hold a de facto monopoly over state-owned banks, leaving the core-system field exposed to being "choked", which conflicts with the country's current top-level push for independent, controllable technology.
Therefore, the original system had reached the ceiling of its support capacity, and building a new system became imminent.
Unitized distributed architecture design
As the core system of a large state-owned bank moving off IOE, and as Alibaba Cloud's first mainframe-offloading pilot project, smoothly switching the core system from the traditional mainframe architecture to a distributed architecture, with no prior reference to draw on, was itself a huge challenge. In particular, with an existing user base of 100 million, the primary problem in building the new core system was how to fully exploit the high scalability of the distributed architecture while meeting the stringent security, stability, and reliability requirements of financial services.
After in-depth alignment with the customer, six major design goals for the new-generation distributed architecture were finalized: in financial business scenarios, the architecture must meet the requirements of high availability, high standards, and low risk; in Internet scenarios, it must meet the demands of high performance, high elasticity, and low cost. The details are as follows:
Figure 1 Distributed architecture design goals
The customer's core system adopts a distributed design. The overall deployment uses the two-site, three-center architecture of "intra-city active-active + remote disaster recovery", achieving disaster-recovery goals of intra-city RPO = 0 with RTO in minutes, and remote RPO in minutes. The underlying database uses the unitized architecture of the OceanBase database, as shown in the following figure:
Figure 2 Unitized architecture across three data centers in the same city
A unit (i.e., the deployment unit of the unitized application service layer) is a self-contained collection that can complete all business operations; it includes all services required by every business, plus the data assigned to that unit. The unitized architecture takes the unit as the basic unit of deployment: multiple units are deployed across all data centers of the whole site, the number of units per data center is not fixed, and each unit holds a portion of the data sharded along some dimension.
The essence of unitization is to split data traffic along a certain dimension, so that one unit can provide complete service capability as a closed loop. This capability, however, covers only the slice of the split data traffic assigned to the unit, and serves only that slice.
Problems solved by unitization
- Capacity: IDC resources are tight, and the centralized single-machine database hits connection bottlenecks.
- Multi-data-center disaster recovery: controlling the blast radius of failures.
- User experience: nearby access improves the speed of user requests.
Unit design principles
- The core business must be shardable (e.g., transactions, payments, accounting).
- Sharding of the core business must be balanced, for example by using the customer number or member number as the sharding dimension.
- The core business should be as self-contained as possible: calls should converge within a unit and stay closed inside it as much as possible.
- The entire system must be designed around logical partitions, not physical deployment.
Application unit design
The application software is deployed in data centers 1 and 2, and the two data centers act as mutual primary and backup for each other.
1 GZone unit: applications are deployed stateless and peer-to-peer across both data centers; there is only one copy of the data, which must be accessed across data centers.
10 RZone units, deployed in both data centers for a total of 20 RZones: application and data access is self-contained, keeping the blast radius of an application failure within 10%.
- GZone (Global Zone): hosts data and services that cannot be sharded and that RZones may depend on. There is only one GZone globally, with a single copy of the data. Intended for global services such as gateway, configuration, parameters, and routing.
- RZone (Region Zone): the unit in its most standard form; each RZone is self-contained, owns its own data shard, and can complete all business services. Intended for core shardable businesses such as transactions, payments, and accounting. A minimal routing sketch follows.
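To make the unit-routing idea concrete, here is a minimal Java sketch written for this article (the class, method, and service-naming conventions are hypothetical, not the customer's code): global services resolve to the single GZone, while shardable services resolve to one of the 10 RZone units by the last two digits of the user ID.

```java
// Hypothetical sketch of unit routing: global services go to the single GZone,
// shardable services go to one of 10 RZone units chosen by the user-ID suffix.
public final class UnitRouter {
    private static final int SHARD_COUNT = 100; // shards 00-99, 1% granularity each
    private static final int UNIT_COUNT  = 10;  // 10 RZone units, 10 shards per unit

    /** Shard = last two digits of the user ID (00-99). */
    public static int shardOf(String userId) {
        return Integer.parseInt(userId.substring(userId.length() - 2));
    }

    /** RZone unit = shard / 10, so each unit owns a contiguous 10% slice. */
    public static String route(String service, String userId) {
        if (isGlobalService(service)) {
            return "GZone";               // gateway/config/parameter/routing services
        }
        int unit = shardOf(userId) / (SHARD_COUNT / UNIT_COUNT);
        return "RZone-" + unit;           // e.g. userId ...37 -> shard 37 -> RZone-3
    }

    private static boolean isGlobalService(String service) {
        return service.startsWith("global."); // naming convention assumed for this sketch
    }
}
```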
Database unit design
- The customer's core system shards data by user ID: the full data set is split into 100 shards at 1% granularity each, using the last two digits of the user ID (00-99) as the shard identifier.
- With the user ID as the sharding subject, 5 unit clusters are created, each holding 20 tenants, for a total of 100 tenants / 100 databases / 100 tables, one tenant per set of sharded databases/tables; a separate global cluster stores non-unitized public information. The biggest advantage of five unit clusters is that even an extreme database failure that takes out an entire unit cluster affects only 20% of users, keeping the overall impact controllable (see the mapping sketch after this list).
- Each cluster has 5 zones (i.e., 5 replicas). Per the unitized access requirements, leader replicas are distributed over the 4 zones in the primary and standby data centers; the third data center carries no business traffic, and the network latency between data centers is within 2 ms.
- Because the distributed multi-replica consistency protocol must synchronously replicate every transaction to a majority of replicas, this deployment mode inevitably incurs frequent cross-data-center synchronization. To preserve database write performance, the network quality between data centers must be high: the latency between any two data centers is usually required to be no more than 2 milliseconds (ms).
- There are no cross-zone/cross-node distributed transactions inside the database; all distributed-transaction needs are resolved within the application unit, where a compensation mechanism based on a "local message table" achieves eventual consistency (a sketch of this pattern also follows the list).
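To make the shard/tenant/cluster arithmetic concrete, here is a minimal sketch; the names are hypothetical, and the contiguous 20-shards-per-cluster layout is an assumption consistent with the "20% of users" claim above.

```java
// Minimal sketch of the database-side mapping: 100 shards (user-ID suffix
// 00-99) -> 100 tenants -> 5 unit clusters of 20 tenants each. The contiguous
// block layout is an assumption consistent with the "20% of users" claim.
public final class ShardMapping {
    public static int shardOf(String userId) {                 // 00-99
        return Integer.parseInt(userId.substring(userId.length() - 2));
    }
    public static String tenantOf(int shard) {                 // 1 tenant per shard
        return String.format("tenant_%02d", shard);
    }
    public static String clusterOf(int shard) {                 // 20 shards per cluster
        return "unit_cluster_" + (shard / 20);                  // clusters 0..4
    }
    public static void main(String[] args) {
        int shard = shardOf("1234567890123437");                // -> 37
        System.out.println(tenantOf(shard) + " @ " + clusterOf(shard));
        // prints: tenant_37 @ unit_cluster_1  (cluster 1 holds shards 20-39)
    }
}
```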
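The article names the "local message table" pattern without showing code. The following hedged Java outline illustrates the general pattern: the business update and an outbox record commit in one local transaction, and an asynchronous relay then delivers the message to the downstream unit until acknowledged. Table, column, and method names are hypothetical.

```java
// Hedged sketch of the "local message table" (outbox) pattern for eventual
// consistency across units. Table/column names are hypothetical.
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

public final class LocalMessageTable {
    private final DataSource ds;

    public LocalMessageTable(DataSource ds) { this.ds = ds; }

    /** Step 1: business update + outbox insert commit atomically in ONE local tx. */
    public void debitAndRecord(String userId, long amountCents, String msgPayload) throws Exception {
        try (Connection c = ds.getConnection()) {
            c.setAutoCommit(false);
            try (PreparedStatement debit = c.prepareStatement(
                     "UPDATE account SET balance = balance - ? WHERE user_id = ?");
                 PreparedStatement outbox = c.prepareStatement(
                     "INSERT INTO local_message (user_id, payload, status) VALUES (?, ?, 'PENDING')")) {
                debit.setLong(1, amountCents);
                debit.setString(2, userId);          // user_id is also the shard key
                debit.executeUpdate();
                outbox.setString(1, userId);
                outbox.setString(2, msgPayload);
                outbox.executeUpdate();
                c.commit();                          // both rows commit, or neither does
            } catch (Exception e) {
                c.rollback();
                throw e;
            }
        }
    }

    // Step 2 (relay, runs on a timer): read PENDING rows, call the downstream
    // unit, and mark a row DONE only after acknowledged delivery. Retries make
    // delivery at-least-once, so the consumer must deduplicate by message id.
}
```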
You may wonder: the standard OceanBase deployment is three replicas across three data centers in the same city, so why does this project use five replicas across three data centers? The reason is as follows. Although data centers 1 and 2 are the ones planned to carry the business, the customer's overriding requirement is stability, and the core system is highly latency-sensitive, so cross-data-center access within the city is essentially forbidden. To avoid an outage of data center 1 forcing leaders onto a lone replica in data center 2, we deployed two replicas in each of data centers 1 and 2 (plus one in data center 3). In addition, a three-data-center, five-replica deployment can tolerate the simultaneous failure of two OBServer machines, providing higher security and reliability for the business.
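To make the replica arithmetic concrete (simple majority math, added here for illustration):

- 3 replicas (1 + 1 + 1): majority = 2. Losing one data center leaves 2 of 3 replicas, so the service survives, but each surviving data center holds only a single replica and no further failure is tolerated.
- 5 replicas (2 + 2 + 1): majority = 3. Losing data center 1 removes 2 replicas and leaves 2 + 1 = 3, exactly a majority, so service continues with leaders failing over to the two replicas in data center 2; more generally, any 2 simultaneous OBServer failures are tolerated.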
Figure 3 OceanBase database unit architecture
Unified access to unitized data
The customer's core system accesses the OceanBase database unit clusters through the unified entry provided by the SOFA ODP sharding middleware and the OBProxy proxy, which shield the complexity of sharded databases/tables, multiple clusters, and OBServer distribution, and route each SQL statement to the leader replica of the corresponding application unit, completely transparently to the user. Note, however, that after sharding by user ID, every data-operation SQL statement must carry the shard key; otherwise full-table scans will overwhelm the SLB and ODP. Each unit has its own dedicated SLB (Server Load Balancer) instance mounted in front of its ODP, and each ODP fronts all of the OBServers behind it. A hedged query sketch follows.
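As an illustration of the shard-key rule (the connection URL, schema, and column names below are assumptions, not the customer's actual setup; OceanBase's MySQL-compatible mode is assumed for the JDBC driver):

```java
// Hedged sketch: querying through the ODP/OBProxy entry point with the shard
// key in the WHERE clause, so the proxy can route to a single unit's leader.
import java.sql.*;

public final class ShardKeyQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical SLB address of one unit's ODP; the MySQL protocol is
        // assumed because OceanBase is MySQL-compatible in this deployment.
        String url = "jdbc:mysql://odp-slb.unit0.example.com:3306/core_db";
        try (Connection c = DriverManager.getConnection(url, "app_user", "***");
             PreparedStatement ps = c.prepareStatement(
                 // GOOD: user_id is the shard key, so ODP routes to one shard.
                 "SELECT balance FROM account WHERE user_id = ?")) {
            ps.setString(1, "1234567890123437");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) System.out.println(rs.getLong("balance"));
            }
        }
        // BAD (avoid): "SELECT balance FROM account WHERE name = ?" carries no
        // shard key, forcing a scatter across all 100 shards (full-table scan).
    }
}
```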
Figure 4 ODP unitized architecture design
Here we focus on the deployment architecture. The customer's core system has very high performance requirements, and the access layer went through several evolutions during the migration to the cloud. One evolution merged the separately deployed ODP and OBProxy into two processes co-deployed in a single container, mainly to remove one network hop and enable fast elastic scaling; it will further evolve into a single C program, OBSharding, to address performance. Another evolution split the single SLB instance that fronted all ODP instances into one SLB per unit's ODP. This solved three problems: 1) under heavy traffic, a single SLB is easily saturated; 2) application long connections caused uneven load, putting great pressure on individual ODPs and even triggering Java full GCs; 3) when one unit's ODP SLB fails, other units are unaffected, which matches the unitized design philosophy.
Distributed database OceanBase deployment and disaster recovery
Transformation from a single data center to three data centers in the same city
According to the overall design, we planned three data centers in the primary city, but their procurement and deployment timelines differed. So as not to delay the application launch plan, we began deploying the OceanBase cluster as soon as the machines in data center 1 were ready: first a three-replica cluster within that single data center, handed to the customer for testing and verification, while waiting for the machines in data centers 2 and 3 to arrive. Once they did, we used OceanBase's distinctive ability to add/remove replicas and convert replica types online to relocate data smoothly, completing the architectural transformation from three replicas in a single data center to five replicas across three data centers in the same city. The whole process was transparent to the application, with no perceptible impact, which fully demonstrates OceanBase's powerful elastic scalability.
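The article does not list the exact commands used. As a hedged sketch of the mechanism it describes (registering new zones, then widening the tenant locality online), the flow might look like the following; the DDL strings are indicative only and vary by OceanBase version, and the zone/tenant names and the omitted server-registration steps are assumptions.

```java
// Hedged sketch of the online 3-replica -> 5-replica expansion. The OceanBase
// DDL strings are indicative of the mechanism (add zones, then widen the
// tenant locality); exact syntax and stepwise rules vary by OB version, and
// adding OBServers to the new zones is omitted here for brevity.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public final class ClusterExpansion {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://ob-sys.example.com:2881/oceanbase"; // sys tenant, assumed
        try (Connection c = DriverManager.getConnection(url, "root", "***");
             Statement s = c.createStatement()) {
            // 1) Register the new zones in data centers 2 and 3 (names assumed).
            s.execute("ALTER SYSTEM ADD ZONE 'zone4' REGION 'city1_dc2'");
            s.execute("ALTER SYSTEM ADD ZONE 'zone5' REGION 'city1_dc3'");
            // 2) Widen the tenant locality one replica at a time so data is
            //    rebalanced online, with no interruption to the application.
            s.execute("ALTER TENANT core_tenant LOCALITY = 'F@zone1,F@zone2,F@zone3,F@zone4'");
            s.execute("ALTER TENANT core_tenant LOCALITY = 'F@zone1,F@zone2,F@zone3,F@zone4,F@zone5'");
        }
    }
}
```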
Figure 5 OceanBase three-data-center adjustment (1)
Figure 6 OceanBase three-data-center adjustment (2)
Figure 7 OceanBase three-data-center adjustment (3)
OceanBase active-standby cluster solution
High availability in traditional IT systems is mostly built on the active-standby model. Such solutions are very widely used and, having stood the test of time, enjoy broad acceptance in the industry; an active-standby pair can also serve as a disaster-recovery option, and many systems in production today build their disaster recovery this way. Although OceanBase already solves disaster recovery through its multi-replica mechanism, extremely important systems must still guard against an unpredictable failure of the entire cluster making the service unavailable. OceanBase therefore also provides a replication capability similar to the traditional architecture, using REDO logs to synchronize data between the primary and standby clusters. In extreme cases, such as a planned or unplanned (majority-replica failure) outage of the primary cluster, the standby cluster can take over the service. The standby database offers three protection modes, maximum performance, maximum availability, and maximum protection, which can be selected according to actual conditions to minimize service downtime.
Three protection modes of OceanBase active-standby clusters
- Maximum Performance. This is the default protection mode. It protects user data while maximizing the performance of the primary cluster. A transaction can commit as soon as its REDO log has been successfully persisted in the primary cluster; the log is then shipped to the standby cluster asynchronously without blocking commits on the primary, so primary-cluster performance is unaffected by standby synchronization lag.
- Maximum Protection. This mode provides the highest level of data protection and guarantees zero data loss if the primary cluster fails. A transaction must wait for its REDO log to be persisted on both the primary cluster and the strongly synchronized standby cluster before committing. Only one strongly synchronized standby cluster can be configured in this mode; any other standby clusters can only synchronize asynchronously. If the strongly synchronized standby becomes unavailable, the primary cluster stops its write service.
- Maximum Availability. This mode provides the highest level of data protection that does not sacrifice cluster availability. By default, a transaction waits for its REDO log to be persisted on both the primary and the strongly synchronized standby before committing; once the primary senses that the strongly synchronized standby has failed, it stops waiting for strong synchronization and serves with the same guarantees as Maximum Performance, preserving availability. When the strongly synchronized standby recovers, the primary automatically returns to strong synchronization and to the highest level of data protection. As in Maximum Protection mode, only one strongly synchronized standby can be configured; other standbys can only synchronize asynchronously. A conceptual sketch of the three commit rules follows.
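The following Java sketch condenses the three commit rules above into one function. It is a conceptual illustration written for this article, not OceanBase source code, and the boolean inputs stand in for REDO-log persistence acknowledgments.

```java
// Conceptual sketch (not OceanBase source) of the commit rule in each
// protection mode. "persistedOnPrimary/Standby" stand in for REDO-log acks.
public final class ProtectionModes {
    enum Mode { MAX_PERFORMANCE, MAX_AVAILABILITY, MAX_PROTECTION }

    /** Returns true when the transaction may commit. */
    static boolean mayCommit(Mode mode, boolean persistedOnPrimary,
                             boolean persistedOnStandby, boolean standbyAlive) {
        switch (mode) {
            case MAX_PERFORMANCE:            // primary persistence is enough;
                return persistedOnPrimary;   // standby ships logs asynchronously
            case MAX_PROTECTION:             // both must persist; if the standby
                return persistedOnPrimary && persistedOnStandby; // is down, writes stall
            case MAX_AVAILABILITY:           // like MAX_PROTECTION while the
                return standbyAlive          // standby is healthy, degrading to
                    ? persistedOnPrimary && persistedOnStandby
                    : persistedOnPrimary;    // MAX_PERFORMANCE when it is not
            default:
                throw new IllegalStateException();
        }
    }
}
```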
OceanBase remote deployment solution
The financial industry has very high requirements for business continuity and emergency response to IT risks, so building a disaster-recovery system is critical. A Class A core system must not only support intra-city active-active, but also have remote disaster-recovery capability. To meet the customer's remote disaster-recovery needs, we built a standby cluster in the remote disaster-recovery data center using the OceanBase active-standby cluster architecture, and used the first OCP release capable of managing active-standby clusters for rapid cross-city deployment and management, realizing remote disaster-recovery capability. Failover time improved by leaps and bounds, from hours to minutes, and switching moved from manual command-line ("black screen") operations to console ("white screen") operations, greatly improving the response time, safety, and convenience of emergency switchover.
Figure 8 OceanBase remote disaster recovery deployment
Disaster Recovery Test Drill
Because this was the customer's first use of the OceanBase distributed database in a core system, they imposed very strict stability and security requirements, asking us to run full-scenario disaster-recovery drills and to prove, through business-level drills, that the system meets the production requirements of a Class A core service. To this end, we not only designed the intra-city three-data-center, five-replica high-availability architecture in the primary city, but also wrote detailed test cases together with the application team to exercise the high-availability solution. By verifying RPO = 0 and RTO < 30 s even when two replicas in the primary city fail simultaneously, we won the customer's trust and strengthened their confidence in facing disasters.
Figure 9 OceanBase disaster recovery test case
Overall migration solution for core system cutover
The customer's core system adopted a stop-write migration for the data move. During the migration, data exported from the mainframe was first imported into temporary database tables and then loaded into the target tables through joined-table queries, involving multiple rounds of data import and export. Thanks to careful upfront design and multiple rounds of production-environment migration drills, the migration window was kept within hours. To keep the transition stable, the customer switched over in three stages using whitelists, performing data reconciliation and business-logic verification at each stage to ensure the correctness of data and business (a hedged reconciliation sketch follows Figure 10).
Figure 10 Overall migration plan for core system cutover
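The article describes reconciliation only at a high level. The following is a hedged sketch of what per-shard reconciliation could look like, comparing row counts and an order-independent checksum between the staging and target databases; all connection strings, table names, and columns are assumptions for illustration, and MySQL-compatible SQL functions are assumed.

```java
// Hedged sketch of per-shard reconciliation after migration: compare row
// counts and a column checksum between source extract and target tables.
// Table/column names are hypothetical, not the customer's schema.
import java.sql.*;

public final class MigrationReconciler {
    record Digest(long rows, long checksum) {}

    static Digest digest(Connection c, String table, int shard) throws SQLException {
        // CRC32(...) aggregated with BIT_XOR is order-independent, so the two
        // sides can be compared without sorting. (MySQL-compatible SQL assumed.)
        String sql = "SELECT COUNT(*), COALESCE(BIT_XOR(CRC32(CONCAT(user_id,'|',balance))),0) "
                   + "FROM " + table + " WHERE RIGHT(user_id, 2) = ?";
        try (PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setString(1, String.format("%02d", shard));
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return new Digest(rs.getLong(1), rs.getLong(2));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection("jdbc:mysql://staging.example.com/core", "u", "***");
             Connection dst = DriverManager.getConnection("jdbc:mysql://odp-slb.example.com/core", "u", "***")) {
            for (int shard = 0; shard < 100; shard++) {       // shards 00-99
                Digest a = digest(src, "account_stage", shard);
                Digest b = digest(dst, "account", shard);
                if (!a.equals(b)) {
                    System.out.printf("shard %02d MISMATCH: src=%s dst=%s%n", shard, a, b);
                }
            }
        }
    }
}
```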
Value
- Customer benefits: the core system was moved off the mainframe, saving substantial software and hardware costs while gaining horizontal scaling capability far beyond the mainframe's. Thanks to OceanBase's dynamic scale-out and scale-in, the system calmly handles big-promotion business peaks, and the active-standby cluster architecture provides high availability comparable to the mainframe.
- Technical advancement: the customer's core-system database was successfully transformed from a centralized to a distributed architecture, delivering 7x24 continuous service, reaching disaster-recovery level 5, and ensuring RPO = 0 for the core business.
- Capability accumulation: a distributed-database unit architecture and standard deployment scheme was designed for the core system of a large bank, meeting both intra-city and remote disaster-recovery needs and forming a best practice for core products, providing a strong technical guarantee for the future growth of the business.
- A new chapter: this is Alibaba Cloud's first core mainframe-offloading system for a major state-owned bank, built on the full Alibaba Cloud stack: proprietary cloud + distributed database + unitized distributed microservices. It proves that the Alibaba Cloud platform can carry a bank's core system, with clear benchmark and demonstration value.