On April 27, the first "Global Information System Stability Summit" was held in Beijing. At the meeting, the China Academy of Information and Communications Technology (hereinafter "CAICT") announced the first batch of information system stability assurance capability assessment results, the member units of the distributed system stability laboratory, and the excellent cases of stable information system operation. Ant Group was selected as a member unit of the laboratory, and its payment platform was awarded the "Excellent Case of System Smooth Operation" certificate for its technology and practice in system stability and security.
As a member unit of the laboratory, Ant Group has actively participated in preparing standards and research reports related to system stability. At the summit, Shi Shiqun, deputy general manager of technology of Ant Group's digital technology business group, delivered a keynote speech titled "Alipay System Double Eleven Stability Guarantee Experience Sharing", sharing Ant Group's exploration and practical experience with its financial-grade distributed architecture, SOFAStack, in the field of system stability assurance.
Below is the full text of the speech:
Hello everyone, I am Shi Shiqun from Ant Group Digital Technology. Today I will give an online introduction to how Alipay ensures stability during Double 11.
System stability assurance is a complex piece of systems engineering. From 2004 to 2021, Alipay went through a series of technical architecture upgrades and iterations, from a unitized architecture to elastic cloud migration, and then to cloud-native and green computing. Throughout this process, capacity and stability had to be weighed against cost and efficiency.
Briefly summarized, this evolution went through roughly three stages:
The first stage mainly solved the capacity problem. LDC, elastic capability, and OceanBase gave us theoretically unlimited scalability, and full-link stress testing let us verify capacity across the entire link;
In the second stage, once payment capacity met the target, the next question was how to improve the stability and efficiency of the overall architecture through technological innovation. There are two typical examples. One is cloud native. The core idea of a cloud-native architecture is to separate infrastructure from business, so as to unlock the benefits of the infrastructure and greatly improve the speed and efficiency of innovation. A typical case is the adoption of ServiceMesh at Ant. The other is our intelligent monitoring and operations system, which uses data intelligence to speed up emergency response and recovery.
The third stage is green emission reduction. For several years in a row, we have set the goal of zero cost increase for the big promotion while keeping peak volume growing steadily. On Double 11 in 2021, our main focus was green emission reduction. Through innovative technologies such as co-location of online and offline workloads, time-sharing scheduling, and AI-driven capacity management, we saved 640,000 kWh of electricity and reduced carbon emissions by 394 tons.
Next, I will introduce the key technologies behind Alipay's Double 11 promotion.
Unit deployment
The multi-site active-active logical unit architecture is known inside Ant as LDC, short for Logical Data Center. It is a logical partitioning of the IDC (Internet Data Center), and it is the scheme Alipay adopted to put "unitized deployment" into practice.
To keep an information system stable, two core problems must be solved:
The first is the single-point bottleneck. Any Internet system that grows to a certain scale will inevitably hit one. Going from a single server, a single application, a single database, and a single data center to multi-data-center and then multi-site deployment (active-active across locations) is a process of repeatedly breaking through single-point bottlenecks;
The second is remote disaster recovery capability, which is needed to meet financial-grade stability requirements.
Deploying across multiple data centers in multiple locations is an inevitable direction for Internet systems. It raises many key problems, including traffic allocation, data splitting, and latency. These problems can be solved with the right technologies and solutions, and what carries those solutions is the deployment architecture. More than one deployment scheme is possible, but both theoretical research and the practice of some advanced system architectures rank "unit deployment" as the best option.
A unit is a self-contained set that can complete all business operations. It contains all the services required by all businesses, plus the data assigned to that unit. A unit is a scaled-down but fully functional version of the whole site. It is complete in function, because all applications are deployed in it, but it does not hold the full data set, because it can operate on only part of the data.
Alipay solves the problems of traffic allocation, data splitting, and latency by dividing units into three categories: RZone, GZone, and CZone:
- RZone (Region Zone): The zone that best matches the theoretical definition of a unit. Each RZone is self-contained, has its own data, and can complete all business operations.
- GZone (Global Zone): A global unit that hosts data and services that cannot be split and that RZones may depend on. There is only one GZone globally, with a single copy of its data.
- CZone (City Zone): A unit deployed per city. It also hosts data and services that cannot be split and that RZones depend on. Unlike the GZone, the data and services in a CZone are accessed by RZones frequently, at least once for every business transaction, whereas the GZone is accessed far less often. The CZone is designed specifically to solve the problem of cross-city latency.
Based on the LDC architecture, Alipay has realized a true multi-site active-active architecture, achieved financial-grade 99.99% availability and theoretically unlimited capacity, successfully supported big-promotion capacity in the hundreds of thousands, and laid a good foundation for the subsequent elastic architecture.
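To make the three zone types more concrete, here is a minimal Python sketch of how a request might be routed to a zone. Everything in it is an assumption made for illustration: the zone names, the shard ranges, the city mapping, the service names, and the choice of the last two digits of the user ID as the shard key are hypothetical and are not taken from Alipay's actual configuration.

```python
# Hypothetical layout: four RZones, each owning a range of user shards (00-99).
RZONE_SHARDS = {
    "RZ00": range(0, 25),
    "RZ01": range(25, 50),
    "RZ02": range(50, 75),
    "RZ03": range(75, 100),
}

# Hypothetical mapping from each RZone to the city it is deployed in.
RZONE_CITY = {"RZ00": "city-a", "RZ01": "city-a", "RZ02": "city-b", "RZ03": "city-b"}

# Illustrative services and the kind of zone that serves them.
SERVICE_ZONE_TYPE = {
    "payment": "RZone",         # splittable by user, data lives inside the unit
    "member-profile": "CZone",  # not splittable, read on almost every transaction
    "config-center": "GZone",   # not splittable, accessed rarely
}


def shard_of(user_id: str) -> int:
    """Shard key: the last two digits of the user ID (an assumption)."""
    return int(user_id[-2:])


def rzone_for(user_id: str) -> str:
    """Find the RZone that owns this user's shard."""
    shard = shard_of(user_id)
    for zone, shards in RZONE_SHARDS.items():
        if shard in shards:
            return zone
    raise ValueError(f"no RZone owns shard {shard}")


def route(service: str, user_id: str) -> str:
    """Pick the zone that should handle a call to `service` for `user_id`."""
    zone_type = SERVICE_ZONE_TYPE[service]
    home = rzone_for(user_id)
    if zone_type == "RZone":
        return home                       # stay inside the user's own unit
    if zone_type == "CZone":
        return f"CZ-{RZONE_CITY[home]}"   # same-city copy avoids cross-city latency
    return "GZ00"                         # the single global unit


print(route("payment", "2088000000001234"))
print(route("member-profile", "2088000000001234"))
print(route("config-center", "2088000000001234"))
```

The point of the sketch is the asymmetry between the branches: user-splittable services never leave the user's RZone, frequently read shared data is served from a same-city CZone, and only rarely accessed shared data goes to the single GZone.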
Elastic architecture
Just now we talked about the LDC logical unit architecture, which in theory offers unlimited capacity. In reality this is often not feasible, for the following two reasons:
On the one hand, the resources under the company's control are limited, and as payment volume grows rapidly, self-held resources will hit a bottleneck; on the other hand, holding all of the resources ourselves is also uneconomical, because it does not take full advantage of cloud computing.
On top of the LDC architecture, Alipay further upgraded to an elastic architecture that provides elasticity at business granularity. Some units are turned into elastic units and burst out to the cloud during peak periods, which gives us rapid scale-out. When the big promotion is over, these units are moved back to the regular data centers, so resources are used more efficiently. All the elasticity logic is encapsulated at the infrastructure level, so elasticity is transparent to the business. In the 2016 Double 11 promotion, we supported a payment peak of more than 100,000 per second, and compared with holding all the resources ourselves, costs fell by more than 50%.
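The burst-out and move-back behavior of elastic units can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Unit` class, the `plan_locations` function, the `"idc"` and `"cloud"` labels, and the simple rule of comparing expected peak against data center capacity are all hypothetical, and the real scheduling logic is certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    elastic: bool  # only units marked elastic may be moved to the cloud


def plan_locations(units, expected_peak_tps, idc_capacity_tps):
    """Return {unit name: location} for the coming period.

    Assumed rule: if the expected peak exceeds what the self-held data
    centers ("idc") can carry, elastic units burst out to the cloud;
    otherwise every unit stays in, or returns to, the data centers.
    """
    burst = expected_peak_tps > idc_capacity_tps
    return {u.name: ("cloud" if burst and u.elastic else "idc") for u in units}


units = [Unit("RZ00", elastic=False), Unit("RZ01", elastic=True)]

# During the big promotion the expected peak exceeds self-held capacity.
print(plan_locations(units, expected_peak_tps=120_000, idc_capacity_tps=80_000))

# After the promotion, elastic units move back to the regular data centers.
print(plan_locations(units, expected_peak_tps=50_000, idc_capacity_tps=80_000))
```

Because the decision is made per unit at the infrastructure level, nothing in the business code needs to know where a unit is currently running.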
Service mesh
Next, let's look at the service mesh, ServiceMesh, which is also a key technology.
Why do we need ServiceMesh? We have to start with microservices. Many of the problems with microservices are related to service governance, including interdependence between components, difficulty in managing and controlling services, and platform operation and management. We use a lightweight network proxy to handle communication between microservices and deploy it as a sidecar, running as an independent process in the container. By decoupling infrastructure from business in this way, infrastructure upgrades become efficient. During the promotion period, the iteration efficiency of infrastructure improved by more than 10 times.
Secondly, ServiceMesh enables flexible traffic control. All rate limiting and circuit breaking are taken over by ServiceMesh with no changes to business code, which saves substantial R&D cost for contingency plans and SDK integration. At present, ServiceMesh covers 100% of Alipay's core payment links, with a container scale on the order of one million and a peak on the order of 10 million QPS.
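The idea of a sidecar taking over rate limiting and circuit breaking can be illustrated with a small Python sketch. It is not SOFAStack's or ServiceMesh's implementation: the `TokenBucket`, `CircuitBreaker`, and `Sidecar` classes, their parameters, and the returned messages are hypothetical, and a token bucket plus a failure-count breaker are just one common way to realize the two techniques.

```python
import time


class TokenBucket:
    """Rate limiter: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; probes after `reset_after` s."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures, self.reset_after, self.clock = max_failures, reset_after, clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a probe request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success):
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


class Sidecar:
    """Sits between the business process and the upstream service."""

    def __init__(self, limiter, breaker):
        self.limiter, self.breaker = limiter, breaker

    def call(self, upstream, request):
        if not self.breaker.allow():
            return "rejected: circuit open"
        if not self.limiter.allow():
            return "rejected: rate limited"
        try:
            response = upstream(request)
        except Exception:
            self.breaker.record(False)
            return "error: upstream failed"
        self.breaker.record(True)
        return response


proxy = Sidecar(TokenBucket(rate=100, capacity=100),
                CircuitBreaker(max_failures=5, reset_after=30))
print(proxy.call(lambda req: "ok:" + req, "pay"))
```

The business code only supplies `upstream` and `request`. The limits and the breaker thresholds live in the proxy, which is why they can be changed without touching or redeploying the business service.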
The evolution of online full-link stress testing technology
Stress testing is an extremely important means of verifying capacity. All the techniques mentioned so far keep improving our ability to scale out, but we also need a good way to verify that capacity meets expectations, which is where online full-link stress testing becomes critical.
Traditional stress testing has many problems. Traditional local, single-link stress tests, which are mostly run against a single service, are incomplete and do not reflect real business situations. In addition, traditional offline stress testing, simulated stress testing, and online single-machine traffic-diversion stress testing are not very accurate, so resource needs cannot be assessed precisely.
For online full-link stress testing, our approach has the following main points:
- Core link analysis to establish end-to-end user behavior models. Using big data technology, an end-to-end traffic model is built from the user behavior and back-end links of the big promotion, and it is used to verify that the full-link stress test is adequate.
- The stress test reuses the production environment. Through the data access proxy, stress test data is routed along the link without affecting normal business data, so the results are highly reliable.
- Stress test performance analysis and diagnosis. When a problem shows up during a stress test, it can be located quickly, diagnosed, and paired with optimization suggestions. Typical cases include network diagnostics (network quality, bandwidth), application diagnostics (memory, CPU hotspots, threads), database diagnostics (slow SQL, CPU, memory), infrastructure diagnostics (containers, processes), and full-link diagnostics (finding bottleneck points in the distributed link).
- Building on many years of accumulated experience, the fidelity of our full-link stress test simulation exceeds 99%. In recent years, the Double 11 promotion has had 0 major failures and 0 capital losses.
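One way to picture how a data access proxy keeps stress test data away from normal business data is the following Python sketch. The speech does not say how the isolation is implemented, so the mechanism here is an assumption: requests carry a hypothetical `stress_test` flag in their context, and the proxy redirects flagged reads and writes to a shadow table named with a hypothetical `_shadow` suffix. An in-memory dictionary stands in for the database.

```python
class DataAccessProxy:
    """Routes flagged stress-test traffic to shadow tables (illustrative only)."""

    SHADOW_SUFFIX = "_shadow"  # hypothetical naming convention

    def __init__(self):
        self.tables = {}  # table name -> list of rows, standing in for a database

    def _table_for(self, table, ctx):
        # The flag is assumed to travel with the request along the whole link.
        return table + self.SHADOW_SUFFIX if ctx.get("stress_test") else table

    def insert(self, table, row, ctx):
        self.tables.setdefault(self._table_for(table, ctx), []).append(row)

    def select_all(self, table, ctx):
        return list(self.tables.get(self._table_for(table, ctx), []))


db = DataAccessProxy()
real_ctx = {"stress_test": False}
test_ctx = {"stress_test": True}

db.insert("orders", {"id": 1, "amount": 20}, real_ctx)   # real user payment
db.insert("orders", {"id": 900001, "amount": 1}, test_ctx)  # synthetic load

print(db.select_all("orders", real_ctx))  # only the real order
print(db.select_all("orders", test_ctx))  # only the stress-test order
```

Because the business code calls the same `insert` and `select_all` in both cases, the stress test exercises the real production code path while its data stays separate.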
Intelligent monitoring technology
Although much has been done beforehand, an online system for a complex business will inevitably run into problems, so finding problems quickly, responding quickly, and recovering quickly becomes very important.
At the peak of a big promotion, monitoring also faces huge challenges. Log volume may reach hundreds of gigabytes per second, and the data cleansing traffic may reach tens of terabytes per minute. Processing these logs effectively is very important.
With Ant's self-developed time series database engine, CeresDB, and by optimizing the collection technology and the stream computing engine, we can basically achieve second-level monitoring. This enables 1-minute discovery, 5-minute localization, and 10-minute recovery, ensuring a rapid emergency response to online incidents.
1-minute discovery: The fault is discovered within 1 minute, and the stakeholders are brought into the fault handling process.
5-minute localization: The cause of the failure is located within 5 minutes, and a plan to stop the bleeding is formulated.
10-minute recovery: The plan is executed within 10 minutes, and the fault is recovered.
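The 1-5-10 targets lend themselves to a simple check against an incident timeline. The Python sketch below is illustrative only: the function name and the timeline fields are hypothetical, and it assumes that all three limits are measured from the moment the fault starts, which the speech does not state explicitly.

```python
from datetime import datetime, timedelta

# Assumption: every limit is measured from the start of the fault.
TARGETS = {
    "discovered": timedelta(minutes=1),
    "located": timedelta(minutes=5),
    "recovered": timedelta(minutes=10),
}


def check_1_5_10(started, discovered, located, recovered):
    """Return {stage: True/False} saying which 1-5-10 targets were met."""
    actual = {"discovered": discovered, "located": located, "recovered": recovered}
    return {stage: actual[stage] - started <= limit for stage, limit in TARGETS.items()}


start = datetime(2021, 11, 11, 0, 0, 0)
result = check_1_5_10(
    started=start,
    discovered=start + timedelta(seconds=40),
    located=start + timedelta(minutes=4, seconds=30),
    recovered=start + timedelta(minutes=11),
)
print(result)
```

In this made-up timeline the fault is discovered and located in time, but recovery takes 11 minutes, so only the last target is missed.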
2021 Double 11 Energy Saving and Emission Reduction
On Double 11 in 2021, we shifted from focusing on peak value and traffic to green computing, taking both cost and efficiency into account to ensure technical sustainability. Through a series of innovative technologies, including co-location of online and offline workloads, cloud-native time-sharing scheduling, and AI elastic capacity, we achieved flexible scheduling of overall resources, applied green computing on a large scale, saved 640,000 kWh of electricity, and reduced carbon emissions by 394 tons.
We have discussed many technical capabilities and methods for ensuring system stability, but for any organization, building these capabilities and systems from scratch takes a long time and a great deal of complex work. As industries pursue digital upgrading and transformation, Ant Group is also actively opening up these technical capabilities.
Native distributed database OceanBase
Next, let's look at an important product - OceanBase. After 9 years of verification on Double 11, OceanBase has extensive application experience and is mature and stable. As a native distributed database, OceanBase offers unlimited scalability and always-on availability, ensures that data is not lost, and achieves automatic disaster recovery within 30 seconds. OceanBase is suitable for large-scale scenarios and industries that require high business continuity, and it is also well suited to industries that require strong consistency, high availability, and high HTAP performance. It already has many successful deployments in industries such as finance, government, telecom operators, transportation, and energy.
Financial-grade distributed architecture SOFAStack
At the same time, we have also opened up the SOFAStack cloud-native technology products to the outside world. We have packaged the unitized architecture, the service mesh, full-link stress testing, and the monitoring and emergency response system together, turning the technical capabilities Ant has built up over more than ten years into mature commercial products and services. At present, it serves the core business systems of hundreds of customers. We hope that through these efforts, we can help all industries better achieve digital upgrading and transformation.
That is all I have to share today. Thank you all!