Author:

From preliminary research and plan evaluation through multi-active construction to the final transformation and launch of its core logistics business, Cainiao Village needed only a little over two months to reach its goal of same-city multi-active disaster recovery for core business. The result is uninterrupted 7×24 service that maximizes business stability and continuity.

As a new type of logistics business serving rural areas, Cainiao Village uses digital technology to build a three-level joint distribution system spanning county, township, and village, helping rural logistics cut costs and improve efficiency and improving the express delivery experience of rural consumers. It also integrates product classification, quality control, and logistics and transportation to bring rural agricultural products to market and help farmers increase their income. So far, Cainiao's rural logistics has served more than 1,000 districts and counties, with service sites covering more than 30,000 villages.

As its business scales rapidly, Cainiao Village keeps iterating on business capabilities while also strengthening its technical foundation and the high availability of its business applications. Its business runs entirely on the public cloud with a cloud-native architecture, so that it can iterate quickly on top of mature cloud products. At the start of its multi-active disaster recovery research, Cainiao engineers approached the MSHA engineers on Alibaba Cloud's cloud-native high-availability team, who are responsible for business multi-active disaster recovery, to discuss a solution that fits the current state and future plans of Cainiao's rural business. After more than two months, the two teams had built an intelligent, intensive, cloud-based same-city multi-active logistics system and joint distribution platform, giving the related applications a same-city multi-active disaster recovery architecture. Changes to the traffic ratio between availability zones take effect within seconds (<10s), and in a failure scenario all traffic in an availability zone (HTTP, RPC, MQ, and task scheduling) can be switched away with one click. One-click switch-to-zero takes effect in less than 20 seconds, which gives Cainiao Village's business a strong disaster recovery guarantee.

Cainiao Village's application multi-active transformation in practice

For a fast-growing company, IT construction and operations usually cannot keep pace with rapid business development and iteration. Ensuring business stability efficiently and at low cost becomes a major challenge and risk. The sections below walk through how Cainiao Village transformed its applications.

Stability challenges and high availability requirements in the early stage

  • The core business system is only deployed in a single availability zone of the public cloud, and there is a risk of failure at the availability zone level.
  • Disaster recovery has to be implemented efficiently and at low cost without slowing rapid business iteration.
  • A disaster recovery solution has to be selected. The goal is to minimize how long a failure affects users when a disaster occurs and to restore business quickly.

From the perspective of disaster recovery metrics, a distributed system can guarantee at most two of consistency, availability, and partition tolerance at the same time; it cannot have all three. In disaster recovery scenarios, most systems choose AP or CP mode.

From the perspective of implementation cost, the greater the stability and scalability benefits of a solution, the higher its implementation cost.

To quickly close the single-availability-zone gap in its cloud infrastructure, Cainiao Village chose the same-city active-active disaster recovery architecture, which addresses the shortcoming quickly and improves the high availability of the business.

Quickly formulate an active-active disaster recovery plan for applications in the same city based on the current business situation

Cainiao Village and Alibaba Cloud held in-depth discussions about the problems Cainiao Village faced and its future business plans. Based on the business disaster recovery requirements and the business technology stack, Alibaba Cloud designed a same-city application multi-active architecture. The main points of the solution are as follows:

1. Active-active applications at the availability zone level. Expand from one availability zone to two, and deploy applications of equal capacity in both. The multi-active access gateway product receives all business traffic and routes it to back-end applications in the different availability zones according to proportional or precise routing rules. Applications in both availability zones serve external traffic at the same time, which makes the applications multi-active.
2. Microservices prefer calls within the same availability zone. The Agent capability of the multi-active product can enable same-availability-zone priority calls for Dubbo/Spring Cloud, avoiding the increase in RT caused by cross-availability-zone calls. When the number of healthy Providers in the local availability zone falls below the configured threshold, the priority policy is automatically disabled, so that too few same-zone Providers are not left to carry the upstream traffic.
3. Rapid disaster recovery. When an availability zone fails, the one-click traffic switching capability of the multi-active product first moves HTTP traffic to the other availability zone through the multi-active access gateway. At the same time, the Agent capability of the multi-active product isolates the RPC (Dubbo/Spring Cloud), MQ (RocketMQ), and scheduled-task (SchedulerX/XXL-Job) clients in the faulty availability zone, achieving rapid disaster recovery switching for all traffic.
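
The three points above can be illustrated with a small simulation. This is only a sketch of the routing ideas, not MSHA's implementation or API: the names `Provider`, `pick_zone`, `switch_to_zero`, and `candidate_providers` are hypothetical, and the threshold is assumed here to be a ratio of healthy Providers in the caller's zone (the real product may define it differently).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical model of one service provider instance."""
    address: str
    zone: str
    healthy: bool = True


def pick_zone(weights, rng=random):
    """Proportional routing: pick a zone in proportion to its weight.

    Zones with weight 0 are drained and never receive traffic.
    """
    live = {zone: w for zone, w in weights.items() if w > 0}
    if not live:
        raise RuntimeError("no availability zone is accepting traffic")
    zones = list(live)
    return rng.choices(zones, weights=[live[z] for z in zones], k=1)[0]


def switch_to_zero(weights, failed_zone):
    """Return new weights with the failed zone set to 0; others are unchanged."""
    return {zone: (0 if zone == failed_zone else w) for zone, w in weights.items()}


def candidate_providers(providers, local_zone, min_local_ratio=0.5):
    """Same-zone priority with a threshold fallback.

    Prefer healthy providers in the caller's zone. If the share of healthy
    local providers is below min_local_ratio (an assumed ratio threshold),
    the priority policy is dropped and all healthy providers are candidates.
    """
    healthy = [p for p in providers if p.healthy]
    local_all = [p for p in providers if p.zone == local_zone]
    local_healthy = [p for p in local_all if p.healthy]
    if local_healthy and len(local_healthy) / len(local_all) >= min_local_ratio:
        return local_healthy
    return healthy
```

For example, `switch_to_zero({"zone-a": 50, "zone-b": 50}, "zone-a")` leaves only `zone-b` eligible in `pick_zone`, which mirrors the one-click switch at the gateway. The fallback in `candidate_providers` is what keeps a nearly empty local zone from absorbing all of its upstream traffic.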

Multiple disaster recovery drills and verifications before going online

Verification in the pre-release environment and disaster recovery drills in the production environment are the most critical steps for making sure the same-city active-active setup runs smoothly online.

The verification work mainly includes the following two parts:

Agent startup verification. Every type of middleware depends heavily on the Agent. The Agent must start normally in all business containers, the probe must report to the MSHA control service normally, and other associated services must be unaffected once the Agent has started.

Disaster recovery capability verification. The purpose of building same-city active-active is to be able to switch traffic promptly when a failure occurs. It is therefore essential to verify the switch-to-zero capability of the access layer, the service layer, the message layer, and the scheduled-task layer, and to verify the service layer's same-zone priority policy.
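
A switch-to-zero check of this kind could be sketched as follows. This is an illustrative helper, not part of MSHA: the function `layers_still_serving`, the layer names, and the `(layer, zone)` sample format are all assumptions. The idea is to collect samples of where work was handled after the switch and report any layer that still shows activity in the drained zone.

```python
from collections import Counter

# Assumed layer names for the four traffic types covered by the drill.
LAYERS = ("http", "rpc", "mq", "job")


def layers_still_serving(samples, drained_zone):
    """Return {layer: hit_count} for layers that still handled work in the drained zone.

    samples is an iterable of (layer, zone) pairs observed after the switch.
    An empty result means no sampled traffic reached the drained zone.
    """
    hits = Counter(layer for layer, zone in samples if zone == drained_zone)
    return {layer: hits[layer] for layer in LAYERS if hits[layer] > 0}
```

An empty result only shows that the sampled traffic avoided the drained zone, so the sample needs to cover every layer for the check to be meaningful.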

The disaster recovery drill verifies the actual disaster recovery effect in the production environment. Production drills are scheduled during off-peak business hours. Their overall scope is similar to that of the pre-release verification: business verification must pass, and the trends of the various middleware, messages, and tasks must match expectations.

The business value of application multi-active for Cainiao Village

With Alibaba Cloud's same-city application multi-active solution, Cainiao Village reached its same-city disaster recovery goal in a relatively short time and now delivers uninterrupted 7×24 business service. Even if a single data center fails, service can be restored within minutes, maximizing business continuity.

Cainiao Village's work on business high availability has not stopped there. How to introduce chaos engineering to inject realistic faults for disaster recovery drills, and how to make active-active switching routine so that the capability stays fresh, are tasks that require continuous effort. The road to disaster recovery and high availability has no end, and Cainiao Village is still on it.

Today, cloud native has become a key strategy for enterprise digital transformation. The need to develop and deliver applications quickly has pushed enterprises to adopt cloud-native application development to improve efficiency and flexibility. Enterprises and developers in the cloud-native era need cloud-native methods to keep up with high-speed business iteration, and they also need to pay attention to building high availability on cloud native. A broad vision and an open mind toward the cloud-native ecosystem are recommended as well.

On January 11, at the Cloud Native Practice Summit in Shanghai, Ding Yu, a researcher at Alibaba Cloud Intelligence, released the "Application Multi-Active Technology White Paper" and announced the open-source application multi-active middleware AppActive.

To learn more about MSHA, Alibaba Cloud's cloud-native multi-active disaster recovery product for applications, see:
https://www.aliyun.com/product/aliware/ahas/msha


Alibaba Cloud Native