
Author: Zhongxi (github @zhongxig), head of AppActive, from Alibaba Cloud's Cloud Native High Availability Architecture Team, working on R&D and open source for disaster recovery architecture and fast fault recovery.

AppActive, the third flagship high-availability product from the high-availability architecture team, is now officially open source. It completes a high-availability "troika" that helps enterprises build stable, reliable, enterprise-grade production systems and strengthens their ability to handle disaster tolerance, fault tolerance, capacity, and similar problems.

On January 11, at the Cloud Native Practice Summit in Shanghai, Ding Yu, a researcher at Alibaba Cloud Intelligence, released the "Application Multi-Active Technology White Paper" and open-sourced the application multi-active middleware AppActive.

What is AppActive

"What do we do if large-scale business expansion leaves the data center short of resources? What if the data center goes down? What if the business suddenly collapses? What if a typhoon or earthquake causes a power outage?"

In 2013, not long after Taobao had finished moving off Oracle databases ("de-O"), the scale of Double Eleven grew again compared with the previous year, and Alibaba's engineers faced exactly these questions. On the one hand, data center resources were very tight and capacity was insufficient. On the other hand, Hangzhou was experiencing unusually hot weather, and the data center faced the risk of a power failure. The remote multi-active architecture was born in this context, carried by Alibaba Group's internal UnitRouter & UnitBrain.

As Taobao's business grew, multi-active evolved from two data centers in the same city, to long-distance dual-active across cities, and then to four units in three locations and multi-active across many locations, accumulating rich experience in data-center-level application multi-active along the way.

In 2019, Alibaba's systems were fully migrated to the cloud. The remote multi-active architecture followed that migration and was incubated into the Alibaba Cloud product AHAS-MSHA, serving both Alibaba Group and cloud customers.

On January 11, 2022, the AHAS-MSHA code was officially open sourced and named AppActive.

AppActive is open source middleware for building cloud-native, highly available, multi-active disaster recovery architectures for business applications. Its main benefits are:

  • Minute-level RTO. Recovery is fast: the average recovery time in Alibaba's internal production environment is within 30 seconds, and the average for external customers' production systems is 1 minute.
  • Full resource utilization. No resources sit idle; multiple data centers and their resources are all put to use, avoiding waste.
  • High switching success rate. Relying on a mature multi-active architecture and a visual operation and maintenance platform, the switching success rate is higher than with existing disaster recovery architectures. Across thousands of traffic switches inside Alibaba, the success rate is 99.9% or higher.
  • Precise traffic control. Application multi-active handles traffic from top to bottom and relies on precise traffic steering to direct specific business traffic into the corresponding data center. Enterprises can build features such as global grayscale release and key-traffic assurance on this capability.

Why open source

With nearly 9 years of practical experience serving Alibaba Group and more than 2 years of commercial iteration serving customers on the cloud, AHAS-MSHA has been deployed in disaster recovery scenarios covering more than ten large enterprises in Alibaba, and its stability and features have been thoroughly tested.

In 2021, many well-known companies and cloud platforms at home and abroad experienced serious service interruptions and outages. This sounded the alarm for enterprises, and more and more of them put disaster recovery on the agenda. To solve disaster recovery while keeping costs under control and leaving room for a future multi-cloud architecture, many enterprises choose multi-active disaster recovery.

However, the industry has no unified understanding of multi-active, and different companies define the term differently. Many companies believe they have already achieved "multi-active", only to discover when a failure occurs that their system's failure-escape capability is very weak and that business recovery cannot be decoupled from fault location. This drags down production and leads to problems such as negative publicity and financial loss. In addition, some enterprises, once they learn about "multi-active", instinctively invest internal resources in technical trials, but for lack of experience they often waste manpower and material resources on repeated effort.

With the development of cloud native technology, more and more customers build their systems on it, and how to build a stable, highly available system on cloud native is a core challenge. A mistaken understanding of "multi-active" drives up an enterprise's spending on infrastructure, application transformation, and operation and maintenance, yet the result may be inefficient, misused, or not usable at all, so the enterprise never enjoys the stability dividend that "multi-active" brings. "Multi-active" therefore needs a relatively unified standard and shared understanding, so that users can understand and use it better and thereby improve the stability of their business systems.

Given the current state of cloud native development and how the market perceives it, Zhongxi, the project leader of AppActive, said that open-sourcing and explaining application multi-active can provide an initial definition of the "multi-active" standard and its implementation, and help developers form a unified understanding of "multi-active". Enterprises building a multi-active architecture can draw on the mature experience shared through application multi-active and avoid wasting resources on redundant work. At the same time, different enterprises have different business scenarios and strengths, which in turn push application multi-active to improve and its forms and capabilities to mature. The hope is that, with the strength of the community, "multi-active" becomes a practical and inclusive technology rather than one that only some people can use, and helps more enterprises and individuals build production-grade high-availability architectures.

Open source content

Introducing the AppActive Standard

The application multi-active standard defines LRA (multi-active in the same city), UDA (remote multi-active), HCA (hybrid cloud multi-active), and BFA (business traffic multi-active). AppActive v0.1 prioritizes the basic capabilities of BFA and UDA. Subsequent versions will improve BFA and UDA and add LRA and HCA capabilities. This article focuses on BFA and UDA.

1. Business Flow Active (BFA)

BFA means that application multi-active is ultimately delivered at the level of a business, and that the multi-active disaster recovery system can allocate production traffic in a fine-grained way according to business characteristics.

For the BFA criteria, AppActive supports automatic traffic correction, strong routing to a designated data center, and a self-contained closed loop within that data center, which together amount to fine-grained traffic allocation.

When misrouted traffic enters a data center, the plug-ins at each layer handle it according to unified scheduling rules:

  • The access layer identifies misrouted traffic and automatically redirects it to the correct data center.
  • The service layer identifies misrouted traffic and automatically redirects it to the correct data center.
  • The data layer identifies misrouted traffic, throws an exception, and fails the write to protect data quality.
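
The three behaviors above can be sketched as follows. This is a minimal illustration rather than AppActive's actual plug-in code: the modulo-100 bucket rule, the unit names `unit-a` and `unit-b`, and every function name here are assumptions made for the example.

```python
# Hypothetical scheduling rule shared by every layer: bucket -> unit.
RULES = {"unit-a": range(0, 50), "unit-b": range(50, 100)}


class WrongUnitWrite(Exception):
    """Data-layer rejection of a write that arrived in the wrong unit."""


def owning_unit(route_id: int) -> str:
    """Map a routing ID to the unit that owns it (assumed modulo-100 buckets)."""
    bucket = route_id % 100
    for unit, buckets in RULES.items():
        if bucket in buckets:
            return unit
    raise LookupError(f"no unit owns bucket {bucket}")


def route_at_upper_layer(route_id: int, local_unit: str) -> tuple[str, str]:
    """Access and service layers: correct the request instead of failing it."""
    target = owning_unit(route_id)
    action = "handle" if target == local_unit else "forward"
    return action, target


def write_at_data_layer(route_id: int, local_unit: str) -> str:
    """Data layer: never forward, fail the write to protect the data."""
    target = owning_unit(route_id)
    if target != local_unit:
        raise WrongUnitWrite(
            f"route {route_id} belongs to {target}, not {local_unit}"
        )
    return "written"


if __name__ == "__main__":
    print(route_at_upper_layer(7, "unit-a"))   # request is already in its unit
    print(route_at_upper_layer(73, "unit-a"))  # request must move to unit-b
    try:
        write_at_data_layer(73, "unit-a")
    except WrongUnitWrite as exc:
        print("rejected:", exc)
```

The asymmetry is the point of the sketch: the upper layers can afford to forward a request, while the last line of defense in front of the database refuses the write outright.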

2. Ultra Distance Active (UDA)

UDA means that the business system still has good access performance when the data centers are more than 300 kilometers apart, and that RTO and RPO are at the minute level when the system enters the disaster recovery state.

For the UDA criteria, AppActive preserves access performance well.

At the access layer it parses the request traffic and sends it to the application machines in the appropriate data center. The application-side Servlet, Dubbo, and MySQL plug-ins then keep each business request closed within a single data center, so that it ultimately reads and writes the database in that same data center.

In the ultra-long-distance scenario, the business system still has good access performance because traffic stays closed within a single data center.
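
As a rough sketch of this closed loop, the access layer can be pictured as reading a routing ID from the request, choosing a unit once, and every later hop resolving its endpoint inside that same unit. The header name `x-route-id`, the host names, and the even split of IDs between two units are all assumptions made for illustration, not AppActive's real configuration.

```python
# Hypothetical per-unit endpoints for the web, RPC, and database hops.
ENDPOINTS = {
    "unit-a": {
        "web": "web.unit-a.example",
        "rpc": "rpc.unit-a.example",
        "db": "db.unit-a.example",
    },
    "unit-b": {
        "web": "web.unit-b.example",
        "rpc": "rpc.unit-b.example",
        "db": "db.unit-b.example",
    },
}
ROUTE_HEADER = "x-route-id"  # assumed header carrying the routing ID


def unit_for(route_id: int) -> str:
    """Assumed rule: the lower half of the ID space goes to unit-a."""
    return "unit-a" if route_id % 100 < 50 else "unit-b"


def plan_call_chain(headers: dict[str, str]) -> list[str]:
    """Pick the unit once at the access layer, then stay inside it."""
    unit = unit_for(int(headers[ROUTE_HEADER]))
    return [ENDPOINTS[unit][hop] for hop in ("web", "rpc", "db")]


def crosses_units(chain: list[str]) -> bool:
    """True if any hop in the chain leaves the unit chosen for the request."""
    units = {host.split(".")[1] for host in chain}
    return len(units) > 1


if __name__ == "__main__":
    chain = plan_call_chain({ROUTE_HEADER: "73"})
    print(chain, "crosses units:", crosses_units(chain))
```

Because no hop leaves the chosen unit, request latency is bounded by in-unit distances rather than by the distance between data centers.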

RPO in the disaster recovery state is guaranteed by open source data synchronization components or commercial synchronization tools. For RTO, AppActive v0.1 provides only basic traffic switching capabilities; subsequent versions will evolve into production-grade RTO assurance tools.

AppActive Module Introduction

AppActive is both a definition and an implementation of application multi-active, covering the data plane and the control plane. The data plane is divided into four parts, all of which add capabilities as plug-ins without changing the technical components an enterprise already uses:

  • Access gateway. The first hop for business traffic entering a data center. It identifies and distributes application multi-active ingress traffic and has two core capabilities: data center routing and application routing.
  • Service layer. The synchronous invocation path for business traffic within and across data centers, typically with roles such as Consumer, Provider, and registry. It has three core capabilities: traffic routing, traffic protection, and fault isolation. These avoid dirty writes caused by misrouted calls and speed up business recovery during traffic switching.
  • Message layer. The asynchronous invocation path for business traffic within and across data centers, where messages are used for peak shaving and valley filling, typically with roles such as Producer, Consumer, and Broker. It has three core capabilities: traffic routing, traffic protection, and fault isolation. These avoid dirty writes caused by misdelivered messages and protect messages from being lost during traffic switching.
  • Data layer. Covers business application reads and writes, data storage, and data synchronization. It has three core capabilities: traffic routing, data consistency protection, and data synchronization.

The control plane mainly covers the day-to-day operation and maintenance of multi-active disaster recovery rules and traffic switching in disaster scenarios.
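
One way to picture traffic switching on the control plane is as publishing a new version of the routing rules in which a failed unit's share of traffic is reassigned to a healthy unit. The sketch below assumes a simple bucket-to-unit rule table; the `RuleSet` type, its fields, and the function names are hypothetical and do not reflect AppActive's actual rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    """Hypothetical versioned routing rules: bucket number -> owning unit."""
    version: int
    owners: dict[int, str]


def switch_traffic(current: RuleSet, failed_unit: str, healthy_unit: str) -> RuleSet:
    """Build the next rule version with the failed unit's buckets reassigned."""
    moved = {
        bucket: (healthy_unit if unit == failed_unit else unit)
        for bucket, unit in current.owners.items()
    }
    return RuleSet(version=current.version + 1, owners=moved)


def moved_buckets(old: RuleSet, new: RuleSet) -> list[int]:
    """List the buckets whose owner changed between two rule versions."""
    return sorted(b for b in old.owners if old.owners[b] != new.owners[b])


if __name__ == "__main__":
    before = RuleSet(
        version=1,
        owners={b: "unit-a" if b < 50 else "unit-b" for b in range(100)},
    )
    after = switch_traffic(before, failed_unit="unit-b", healthy_unit="unit-a")
    print("version:", after.version)
    print("buckets moved:", len(moved_buckets(before, after)))
```

Keeping each rule version immutable means the data-plane plug-ins can compare version numbers to tell which rules are current, and the previous version stays available if a switch has to be reverted.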

AppActive is currently at v0.1, which open-sources:

  • A basic implementation of the definitions for all layers of the data plane described above.
  • An Nginx plug-in implementation of the access layer gateway.
  • A Dubbo 2.x plug-in implementation of the service layer.
  • A MySQL plug-in implementation of the data layer.
  • Basic traffic switching capability on the control plane.

Based on the capabilities of v0.1, developers can run and verify the basic functions of application multi-active.

AppActive follow-up planning

  1. Enrich the access layer, service layer, and data layer plug-ins so that AppActive supports more technical components.
  2. Add a plug-in implementation of the message layer to support multi-active for messaging applications.
  3. Add the remaining layers of the application multi-active standard and their implementations.
  4. Provide web console ("white-screen") operation, follow the multi-active UDA standard, and improve RTO.
  5. Follow the application multi-active HCA standard to support hybrid cloud multi-active.
  6. Follow the application multi-active LRA standard to support multi-active in the same city.

Starting point

"Remote multi-active" and "unitization" originated at Alibaba and have been recognized by the industry. Alibaba has always hoped that the application multi-active product ecosystem can be standardized and open, and that it can contribute to the industry.

Built on standard application multi-active technology, business applications can interoperate across different cloud vendors, different infrastructures, and different chips. While making full use of resources, they can achieve RTO at the minute or even second level, so that they truly have nothing to fear from failures.

Today, the first open source version of AppActive is only a starting point for the application multi-active field, and everyone is welcome to join in and build the application multi-active ecosystem together. To learn more about AppActive, search for DingTalk group number 34222602 and join the AppActive open source discussion group.

Click here to download the "Application Multi-Active Technology White Paper".

