Ideas and best practices for building enterprise multi-active disaster recovery systems in the cloud-native era

When people explain cloud native, you often hear about microservices and containers. What do these technologies have to do with enterprise disaster recovery? In fact, every industry needs disaster recovery; the financial industry, for example, has a particularly strong demand for it. How to build disaster tolerance and multi-active capabilities is something every company needs to think about. This talk aims to offer some ideas.

Evolution of disaster recovery system functions

What I am talking about today is one part of the disaster recovery system. First, look at how the overall disaster recovery architecture has evolved:

Disaster Recovery 1.0: When application systems were first built, business systems were deployed in a data center on a traditional architecture. What emergency measures or troubleshooting methods were available? In this period there was only data backup, mainly in cold-standby mode. In addition to the data center providing services, a second data center might be considered for disaster scenarios: data is synchronized to it as a cold backup, and services switch over when a problem occurs. In practice, however, teams generally do not choose to switch data centers. Even in the financial industry, which runs disaster recovery drills every year, people are afraid to switch when a real production problem occurs.

Disaster Recovery 2.0: More consideration is given to the applications. For example, with cloud-native applications, or upper-layer applications in a traditional IOE system, switching is not just cutting over and loading the cold-standby data; the hope is to bring the application up quickly in the other data center during the cutover. To replicate at the data layer without too much delay, there is usually a requirement for active-active. Active-active generally comes with constraints, such as a distance limit that restricts it to the same city. It is more often applied in the AQ model, which means running the full business on the production side and other business in the other data center.

Disaster Recovery 3.0: The goal is remote multi-active. What does "multi" mean? It means no longer being limited to two data centers but having three or more. For example, Alibaba's business is distributed across multiple data centers, and serving external traffic from all of them at the same time requires corresponding technical support. "Remote" means not being limited by distance, such as 200 kilometers or the same city, because today's data centers are deployed all over the country.

Overview of business continuity and disaster recovery

For business continuity there is a systematic methodology, built from the specifications and guidance accumulated over years of building disaster recovery systems. It has several dimensions:

1. B (business): Multi-active business is not the same as traditional disaster recovery, which brings up an identical copy of the business in another data center. Instead, the valuable business is selected, because making every business multi-active is very difficult in terms of both cost and technology.

2. C (continuity): Real-time operation must be guaranteed, so that the core business does not stop serving because of events such as a power outage in a data center.

3. M (management): M stands for the guarantee system. Every industry may have its own methods and management practices. What Alibaba provides is to turn this part into technologies, tools, and products, so that everyone can build multi-active business capabilities quickly on top of this set of methods and products.

The BCM system and IT disaster recovery capabilities form a practical guiding framework. Viewed as a whole, business continuity at the top is the goal, and below it are the various means of achieving it. At the bottom are items such as the IT plan and the plans for handling failures when business continuity runs into special problems. These were already considered in traditional disaster recovery, but we have built them into the product system from the perspective of multi-active.

The disaster recovery modes mentioned here are fairly common: cold standby, same-city active-active, and same-city active-active plus remote cold standby (two locations, three centers). These are relatively standardized approaches in the industry. Remote multi-active is like having all three data centers in a two-locations-three-centers setup serve traffic at the same time, which differs from traditional disaster recovery in several ways. It also differs in construction cost: building remote multi-active capability requires more investment than traditional approaches such as same-city active-active or two locations, three centers.

When building multi-active capability, the actual situation of the business is also taken into account. Requirements differ between industries, and some businesses only need both sides to serve reads. Under different modes, the construction cost and the time to switch services differ. Looking along the time axis, remote multi-active can switch within minutes, whereas cold standby may take days.

Why does Alibaba do multi-active?

In Alibaba's business model, the reasons for doing multi-active are similar to those mentioned earlier. Without multi-active, you need to build another data center at very high cost, because that data center is used only for data synchronization and does not serve traffic. Meanwhile, the version of the disaster recovery system must be continuously kept in step with the version of the production system. In reality, when a failure hits a cold-standby or two-locations-three-centers setup, people are afraid to switch, because it is quite possible that they cannot switch back afterwards.

There are three main demands behind doing multi-active:

1. Resources. With today's rapid business growth, the resource capacity of a single site is limited. Cloud native and cloud computing provide high availability and disaster tolerance, but cloud resources are deployed in different data centers, and multi-active across regions requires support from the underlying infrastructure. We want the business to scale without being limited by a physical data center, with multiple data centers serving traffic at the same time;

2. Business requirements are diverse and require local or remote deployment;

3. Disaster events. For example, a fiber optic cable is cut, or weather causes power supply or cooling problems, and a single data center fails. Today's demand is not limited to one data center: multiple data centers are deployed in different forms across the country and can be adjusted flexibly according to the business model.

Because these demands make multi-active capability urgent, Alibaba built multi-active solutions and products based on its own business needs and technical capabilities.

Breaking down the multi-active architecture

1. Remote mutual backup: Today everyone talks about how good cloud native and cloud computing are, but without multi-active capability these technologies effectively sit idle. In the cold-standby state the backup site does no work, and when to cut over to it depends mostly on human decisions. Since escalating the decision layer by layer has a relatively large impact on the business, more mature customers prepare plans that specify which impacts and failures require a switch. In practice, though, people are generally afraid to switch in cold-standby mode.

2. Same-city active-active: There is a distance limit. In the common active-active mode, the upper application layer, such as the cloud-native PaaS layer, can be distributed across both data centers. At the data layer, because both sites are in the same city, storage can be shared between them: when the database in the primary data center has a problem, it is cut over to the standby data center. The advantage is that machines and resources in both data centers are active. In addition, because both data centers are active, there is no need to worry about version differences between production and standby, so people are not afraid to switch.

3. Two locations, three centers: On top of same-city active-active, a cold-standby data center is built in a different location, which gives a stronger ability to handle failures. This is similar to the first cold-standby solution: the cold-standby data center is normally not used, apart from perhaps some synchronization, and is switched to only when a failure occurs.

4. Remote multi-active: Multiple data centers provide external services at the same time. Because of distance, data replication is constrained by the network, and latency will certainly exist. There are many technical problems to solve, such as how to switch very quickly from the Beijing data center to Shanghai, and how to cut over when physical limits mean the underlying data is not fully synchronized. Our approach is not a simple switch as in traditional disaster recovery; it involves a lot of preparation beforehand and a data compensation process afterwards. We have built all of this into the product system, and where the physical limits cannot be broken, we use the architecture to work around them.

Progressive multi-active disaster recovery architecture

For key core business, any multi-active project starts with sorting out the business. Here I am talking about sorting it into units (unitization).

With dual-read or two locations, three centers, traffic is normally split at most half and half between two data centers, which is the simplest case. Following this model, a rule for partitioning the business can be found. For example, the business can be split in half by user number, just as a bank might split by card number or by the user's location. In multi-active, we want this to be flexibly configurable: depending on how much capacity a data center has and what kind of failure it encounters, the traffic can be adjusted to 50%, 60%, or another proportion. The same applies with multiple data centers, where traffic access can be allocated under unified control.
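The partitioning idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not how the product implements it: the unit names, the 100-bucket scheme, and the functions `bucket_of` and `route` are all hypothetical. Each user ID is hashed into a stable bucket, and each unit owns a share of the buckets equal to its configured percentage.

```python
import hashlib

# Hypothetical unit names and traffic shares in percent (assumptions for illustration).
UNIT_WEIGHTS = {"unit-a": 50, "unit-b": 50}


def bucket_of(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def route(user_id: str, weights: dict) -> str:
    """Return the unit whose cumulative weight range contains the user's bucket."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    bucket = bucket_of(user_id)
    upper = 0
    for unit, weight in weights.items():
        upper += weight
        if bucket < upper:
            return unit
    raise AssertionError("unreachable when weights sum to 100")
```

Changing the table to `{"unit-a": 60, "unit-b": 40}` shifts ten buckets of users to unit-a, and `{"unit-a": 0, "unit-b": 100}` drains unit-a entirely, which is the kind of adjustment made when a data center fails.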

On the technical side, remote backup uses one-way data replication, while remote active-active uses two-way replication. Two-way means that either of the two data centers may have a problem and each can switch to the other. The most important part is the technical implementation. At the data layer, circular replication must be avoided, where data synchronized to the other data center is treated there as new data and copied back. Traditionally, applications use sequence numbers generated by the database. In multi-active, sequence numbers must be generated by rules so that they are globally unique across the whole cluster rather than within a single data center. Sequence numbers generated in different data centers must never repeat, which requires the product to provide rules for this.
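Both requirements can be illustrated with a small sketch. It is an assumption-laden example rather than the product's actual rules: `UnitSequence`, `Change`, and `should_replicate` are hypothetical names. The sequence uses the offset-and-step idea (the same idea as MySQL's `auto_increment_offset` and `auto_increment_increment`), and loop avoidance works by tagging every change with the unit where it was first written.

```python
from dataclasses import dataclass
from itertools import count


class UnitSequence:
    """Yields start + k * step, with a distinct start per unit (illustrative only)."""

    def __init__(self, unit_index: int, unit_count: int):
        if not 0 <= unit_index < unit_count:
            raise ValueError("unit_index must be in [0, unit_count)")
        # Unit i yields i+1, i+1+n, i+1+2n, ... so two units never produce the same ID.
        self._counter = count(start=unit_index + 1, step=unit_count)

    def next_id(self) -> int:
        return next(self._counter)


@dataclass(frozen=True)
class Change:
    row_id: int
    origin_unit: str  # unit where the write was first accepted


def should_replicate(change: Change, local_unit: str) -> bool:
    """Forward only changes first written locally; replicated changes are not sent back."""
    return change.origin_unit == local_unit
```

With three units, unit 0 produces 1, 4, 7 and unit 1 produces 2, 5, 8. One trade-off of this scheme is that the number of units has to be fixed in advance; other schemes embed a unit identifier in the ID instead.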

Multi-active disaster recovery solution

Architecture diagram of the multi-active product solution

1. Access layer: The first thing multi-active has to solve is the traffic access layer, which is very important. The access layer can control access rules at a fine granularity. Following the business sharding rules, it must map each request precisely to a data center in the lower layer: once traffic comes in, it has to determine which data center should serve that user. How is this achieved in practice?

The traditional way is domain name switching. For example, the front-end domain name resolves to two data centers; by switching the domain name address, business originally connected to data center A can be moved to data center B. The problem is that this affects business in progress. For example, when a data center has a problem and business must be switched quickly to another one, switching the domain name affects the requests already running in the lower layers. In addition, this kind of low-level switching cannot be coordinated with the cloud-native PaaS layer. The upper layer is switched but the lower layers cannot perceive it: they do not know that traffic has moved to another data center, and intermediate calls may still land in the original data center unit. This has a relatively large impact on business continuity. In extreme cases the approach still solves some problems: if a data center cannot serve any business and a spare data center exists, switching the domain name is one way out.

Another way is to use cloud-native microservices and mark the traffic. Once a request is marked, the mark is passed down through the microservice call chain, so that the request is handled within one unit or one data center as far as possible and does not jump to another data center.
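A minimal sketch of this marking idea follows, using only the Python standard library. The header name `x-unit`, the function names, and the instance records are assumptions made for illustration and are not taken from any specific microservice framework or from the product's SDK.

```python
import contextvars

# Hypothetical header name that carries the unit mark between services.
UNIT_HEADER = "x-unit"

_current_unit = contextvars.ContextVar("current_unit", default=None)


def handle_incoming(headers: dict, default_unit: str) -> str:
    """At service entry: read the mark, or assign one if the request is unmarked."""
    unit = headers.get(UNIT_HEADER) or default_unit
    _current_unit.set(unit)
    return unit


def outgoing_headers(extra=None) -> dict:
    """Before calling downstream: copy the current mark into the outgoing headers."""
    headers = dict(extra or {})
    unit = _current_unit.get()
    if unit is not None:
        headers[UNIT_HEADER] = unit
    return headers


def pick_instance(instances: list, headers: dict) -> dict:
    """Prefer downstream instances that belong to the same unit as the request."""
    unit = headers.get(UNIT_HEADER)
    same_unit = [i for i in instances if i["unit"] == unit]
    candidates = same_unit or instances  # fall back only if the unit has no instance
    return candidates[0]
```

A request marked `unit-b` at the entry point keeps that mark on every downstream call, so the router picks a `unit-b` instance whenever one is registered.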

2. Application layer: The middle layer's access routing specification includes service routing components, which can be provided separately in our product system. For example, some customers do not want the full solution because they already use open-source components for the middle layer, but they still want multi-active capability. In that case the upper layer can use our multi-active control plane to define precisely how many logical units there are and to provide APIs for the middle layer to call. The globally unique sequence numbers, routing rules, and sharding rules are all provided to it by the upper layer. Marking and traffic identification may look simple, but consider the distributed messages used for decoupling in a multi-active scenario: if a switch happens before the messages in one data center have been fully consumed, how should they be synchronized to the other data center? This kind of problem needs to be solved with the help of cloud-native technology.

3. Data layer: This involves replication and write logic. The write-prohibition control in our solution applies logic at the database: once a front-end switch occurs, a write-prohibition code is generated automatically. For example, while the data in the target data center of the switch is being restored, a time-stamped code is generated, and writes are released again only once the data has been restored. We use write prohibition to protect the database and to account for its replication delay. If the underlying data synchronization is not strong enough, the switch can still be made and most services can continue, but many write services may be unavailable because the database is restricted by the write-prohibition rule. In addition, data synchronization rules and replication requirements across multiple data centers are governed by the overall rules.
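The write-prohibition behaviour can be pictured with a small sketch. This is one reading of the description and not the product's implementation: `WriteGuard`, `WriteForbidden`, and the `max_lag_seconds` threshold are hypothetical. Writes are refused from the moment of the switch and are released only once the reported replication lag falls within an allowed bound.

```python
import time


class WriteForbidden(Exception):
    """Raised when a write reaches a unit that is still under write prohibition."""


class WriteGuard:
    def __init__(self, max_lag_seconds: float = 0.0):
        self.max_lag_seconds = max_lag_seconds
        self.forbidden_since = None  # timestamp recorded when the prohibition starts

    def begin_switch(self, now=None) -> None:
        """On traffic switch: forbid writes and record when the prohibition began."""
        self.forbidden_since = time.time() if now is None else now

    def try_release(self, replication_lag_seconds: float) -> bool:
        """Lift the prohibition only once replication lag is within the allowed bound."""
        if self.forbidden_since is not None and replication_lag_seconds <= self.max_lag_seconds:
            self.forbidden_since = None
        return self.forbidden_since is None

    def check_write(self) -> None:
        """Call before every write; reads are not affected by the prohibition."""
        if self.forbidden_since is not None:
            raise WriteForbidden(f"writes forbidden since {self.forbidden_since}")
```

Because only `check_write` raises, read traffic keeps working during the switch, which matches the point that most services continue while write services wait for the data to catch up.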

Based on the whole solution, we put forward a concept (as shown in the figure above): the four-letter abbreviation MSHA (Multi-Site High Availability) stands for the capability these cloud-native products provide today. We hope it can play a part in four numbers: 0, 1, 5, and 10 minutes.

First, 0-minute prevention. As mentioned above, when cutting traffic, a blue-green release environment can be deployed across two data centers. That is one method. Even within a single data center, two units can be defined logically in the console, so that blue-green releases can be carried out quickly in the same data center. Blue-green release within one data center depends on product support: this component clearly delineates which resources belong to which unit, so a blue-green release of a unit can be carried out quickly;

Second, 5-minute location. With earlier approaches such as same-city or cold-standby disaster recovery, it is often very hard to decide whether to switch, or to settle who bears the consequences of switching. We hope that this platform lets people see at a glance what the current failure affects, what each stakeholder needs to do, and what operations are needed to restore the application. When a failure occurs, the system can quickly find the problem, for example locating it within 5 minutes, and then initiate the decision on whether to cut traffic;

Third, 10-minute recovery. Finally, we want the whole business to be back up and running within 10 minutes through end-to-end process control in this mode.

Best Practices for Multi-Active Disaster Recovery

Here are a few examples of Alibaba applying this for external enterprises. Multi-active disaster recovery capability is not only for the public cloud. Deploying an application on the cloud does not mean that the cloud naturally provides all high availability. When you use cloud resources, you find that the cloud has different regions and each region contains different availability zones, so public cloud usage has to fit the actual situation. For example, if most of a customer's users are in the south, the customer may open a node in a southern data center. If that Alibaba data center then has a problem, the customer's business is affected: although the business is deployed on the cloud and the cloud products provide high availability, a failure at the data center level still has an impact. Therefore, the multi-active capability we provide can be deployed on the cloud and can also be deployed in a customer's own data center like commercial software.

Case 1: Same-city active-active

A logistics customer used multi-active within the same city. Although traditional techniques would have worked, the benefit of multi-active showed in details such as the SDK, which identifies marked requests automatically and passes the mark along without many business changes. After the disaster recovery build-out was completed, the RTO was much shorter than before.

Case 2: Remote dual-read

The difficulty in this remote dual-read case is that the distance exceeds a thousand kilometers. At this distance both reading and writing are difficult, because data replication itself has delay. The point of using this product set was to unify the control and traffic layers so that it is clear which business is read business, which services are routed to the read data center, and what the replication status is. The minute-level RTO is a large improvement over the original, and the business can be switched dynamically and flexibly online.

Case 3: Remote active-active

This enterprise customer, which uses remote active-active, currently writes in two data centers and may expand further in the future. Implementing this plan involved a lot of product adaptation work, because supporting writes on both sides requires a lot of work in the middle layer on top of the original product's basic capabilities. The whole process started from the development of the multi-active product, then adapted it to the application scenarios, and then completed the transformation together with the business. The core point is business continuity. This does not mean that every business in the data center will be multi-active in the future, only the key ones. For example, every year on Double Eleven our core requirement is that orders must not be affected. Through decoupling or other means, logistics is given a lower priority than order transactions in terms of business continuity. The key is to ensure that the services and products on the core transaction path do not cause problems when switching in the multi-active dimension.

We recommend trying the multi-active management and control platform yourself. After two or more units are defined in the console, when one data center fails, we want to switch its applications quickly to another data center through multi-active. The prerequisite for switching is to define the sites in the management console. Whether a site is a logical one inside a single data center or one of multiple physical data centers, it must be mapped onto the multi-active management and control platform. In the console we configure rules, such as how a single service is accessed, by which dimension access traffic is split, or whether it is marked by ID. When traffic is cut, the console shows dynamically which dimensions of traffic go to which data center, which is fairly simple and allows fast reallocation when a fault occurs.

Today, when we help customers deploy these capabilities, we often run traffic cuts and drills through the console to see whether the data center is affected. The overall system also works with other solutions, such as fault drills, which are used together with it to inject failures and switch applications to another data center.

Summary

Multi-active disaster recovery has been practiced in Alibaba's internal business for many years, and it took a long time to evolve it into a product. The goal is for today's products and solutions to help companies build their own multi-active capability within 30 days. In particular, enterprises that already have many products deployed on the public cloud need even less time. We hope this set of products and solutions can help enterprises build multi-active capability and achieve failover within minutes.

Solution consultation and technical exchange group: search for DingTalk group number 31704055 and join the group to get detailed cloud-native solution information and answers from experts. (Note: when applying to join, please provide your company, position, and name.)

