Best Practices for Active-Active Disaster Recovery for Hybrid Cloud Applications

Author: distal metatarsal

Foreword

As part of digital transformation and cloud migration, more and more enterprises are choosing a hybrid cloud model (cloud + self-built IDC, or cloud + another vendor's cloud) for disaster recovery. Among other benefits, this model makes full use of existing offline IDC resources.

The MSHA cloud-native multi-active disaster recovery solution [1] has also released hybrid cloud multi-active disaster recovery capabilities. Through a business demo case, this article describes the difficulties of building hybrid cloud disaster recovery and shows how to quickly build an application active-active architecture on MSHA with minute-level business recovery.

Business Hybrid Cloud Disaster Recovery Practice

Business Background Information

Company A is an e-commerce trading platform in the retail industry. Its business system is deployed in a self-built IDC, and it faces the following pain points:

The business is deployed in a single IDC and lacks disaster recovery capabilities.
IDC capacity is insufficient, and the upgrade and replacement cycle for physical machines is long, so the IDC cannot keep up with the rapid growth of the business.

As the business grew rapidly, repeated capacity shortages and failures drew the attention of senior management, who decided to build disaster recovery capabilities. The self-built IDC is an existing asset that has run stably for many years, and the company does not want to rely too heavily on the cloud, so it wants a hybrid cloud disaster recovery architecture of IDC + cloud.

Current application deployment architecture

Applications included in the e-commerce trading platform:

frontend: Web application, responsible for interacting with users.
cartservice: shopping cart application, providing shopping cart adding, storage and query services.
productservice: commodity application, providing commodity and inventory services.

Technology stack:

Spring Boot.
RPC frameworks: Spring Cloud and Dubbo, with self-built Nacos and ZooKeeper as the registries.
Databases: Redis and MySQL.

Hybrid cloud disaster recovery goals

The business disaster recovery requirements are summarized as follows:

Mutual disaster recovery between the cloud and the IDC, with minute-level switching RTO. The business expects the cloud and the IDC to back each other up, so that the IDC continues to deliver value and the business does not rely 100% on the cloud. When the IDC or the cloud fails, the team must be willing and able to switch at the critical moment, and the switching RTO must be less than 10 minutes.
No data consistency risk. Data in the cloud and IDC data centers must be strongly consistent, and consistency risks such as dirty writes must be avoided both in daily operation and during disaster recovery switching.
One-stop control. Business disaster recovery requires unified management and control, unified operation and maintenance, and unified switching.
Short implementation period and low transformation cost. The business has multiple product lines, complex dependencies, and long call links, and it is in a period of rapid development and frequent iteration. The disaster recovery construction should not burden the business R&D team with transformation work.

Construction difficulties

Traffic management is difficult
If DNS weights are used to split traffic between the cloud and the IDC, changes to DNS resolution take a long time to take effect (usually ten minutes to hours, see FAQ [2]), which cannot meet the requirement of switching in less than 10 minutes.
Business applications rely on Redis and MySQL. The IDC uses self-built open source deployments, while the cloud uses cloud products directly, so it is difficult to implement disaster recovery switching across self-built open source software and cloud products.
Data quality during disaster recovery switching is difficult to guarantee
During switching, stale data may be read because of data synchronization delay. In addition, switching rules may reach distributed application nodes at different times, so the databases on the cloud and in the IDC may be read and written at the same time, which causes dirty writes. Guaranteeing data quality throughout the switching process is both a key point and a difficult point.
Avoiding business code intrusion is difficult
Implementing disaster recovery switching for Redis and MySQL usually requires changes to business applications, which is highly intrusive to business code.

Solution

Given the business disaster recovery requirements and the characteristics of the IDC + cloud hybrid form, an application active-active architecture meets these requirements well.

Application active-active architecture

Architecture diagram:

Architecture specification:

Select a cloud region whose physical distance from the IDC is <= 200 km, so that network latency is low (about 5~7 ms).
Applications and middleware are deployed redundantly and symmetrically on and off the cloud, and both sides serve external traffic at the same time (application active-active).
Databases are deployed as remote primary and standby with asynchronous replication. Applications on and off the cloud read and write the database in the same (primary) data center to avoid consistency issues.

Detailed plan

Application traffic active-active

Business applications are deployed symmetrically on and off the cloud. The MSHA access layer cluster (MSFE) receives the HTTP/HTTPS traffic of the access interfaces and distributes it between the cloud and the IDC by proportion or by precise routing rules. The multi-active console provides GUI-based routine operation and maintenance for the MSFE cluster, such as deployment, capacity expansion, and monitoring, as well as minute-level traffic switching in failure scenarios.
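
To make "by proportion or by precise routing rules" concrete, here is a minimal sketch of how such a decision could be made. It is not the MSFE implementation: the function name pick_unit, the unit labels, and the use of a hashed user ID are assumptions made for illustration only.

```python
import hashlib
from typing import Dict, Optional


def pick_unit(user_id: str,
              weights: Dict[str, int],
              pinned: Optional[Dict[str, str]] = None) -> str:
    """Return the unit that should serve this user (illustrative only).

    Precise rules (pinned users) win. Otherwise a stable hash of the user ID
    is mapped onto the weight ranges, so the same user keeps landing in the
    same unit until the weights change.
    """
    if pinned and user_id in pinned:
        return pinned[user_id]
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one unit must have a positive weight")
    # Stable bucket in [0, total). md5 is used for even spreading, not security.
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % total
    upper = 0
    for unit, weight in weights.items():
        upper += weight
        if bucket < upper:
            return unit
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    # Daily state: split traffic evenly between the two units.
    print(pick_unit("user-42", {"beijing": 50, "hangzhou": 50}))
    # Failure state: cut the Beijing unit to zero.
    print(pick_unit("user-42", {"beijing": 0, "hangzhou": 100}))
```

Under this sketch, cutting a unit to zero is only a weight change, which is why a switch at the access layer can take effect far faster than a DNS change.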

Service interworking and same-unit calls

Business applications need to be migrated to the cloud in batches by product line. During the migration, some downstream applications are deployed only in the IDC. The MSHA registry synchronization function enables services on and off the cloud to call each other, which helps the business migrate to the cloud. In addition, based on the aspect (interception) capability of MSHA-Agent, when a Dubbo/Spring Cloud service is called, the Consumer calls a Provider in the same unit first, which avoids the network latency of cross-data-center calls and reduces the RT of service requests.
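
The same-unit-first behavior can be pictured as a filter applied to the provider list before load balancing. The sketch below is a simplified illustration, not MSHA-Agent code. The Provider class, its unit tag, and the fallback to all providers are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provider:
    address: str
    unit: str  # hypothetical unit tag, e.g. "beijing" or "hangzhou"


def same_unit_first(providers: List[Provider], consumer_unit: str) -> List[Provider]:
    """Prefer providers in the consumer's own unit.

    If the local unit has no provider (for example, the service is still
    deployed only in the IDC), fall back to every known provider so the
    call can still succeed across units.
    """
    local = [p for p in providers if p.unit == consumer_unit]
    return local if local else list(providers)


if __name__ == "__main__":
    registry = [
        Provider("10.0.0.1:20880", "beijing"),
        Provider("10.1.0.1:20880", "hangzhou"),
    ]
    print(same_unit_first(registry, "hangzhou"))
```

The fallback branch is what makes registry synchronization matter: without the other unit's providers in the list, an IDC-only service would be unreachable from the cloud.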

Data synchronization & database connection switching

Databases are deployed in different locations. In daily operation, applications on and off the cloud all read and write the Redis and RDS databases on the cloud, so data consistency is not a concern. The MSHA console integrates the DTS synchronization component to support data synchronization (asynchronous replication) between the cloud and the IDC. In addition, the MSHA-Agent aspect capability can switch the database connections used by applications: if Redis or RDS on the cloud fails, read and write connections can be switched to Redis or MySQL in the IDC, and vice versa. During the switch, write-blocking protection prevents data quality problems such as reading stale data and dirty writes.
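
The ordering of a protected switch (block writes, wait for replication to catch up, change the active endpoint, allow writes again) can be sketched as follows. This is an illustrative model under stated assumptions, not how MSHA-Agent is implemented: the class DataSourceSwitch, the exception WritesBlocked, and the replication_lag_seconds callback are all hypothetical names.

```python
import threading
import time
from typing import Callable


class WritesBlocked(RuntimeError):
    """Raised when a write arrives while a switch is in progress."""


class DataSourceSwitch:
    def __init__(self, active: str) -> None:
        self._lock = threading.Lock()
        self.active = active
        self.writes_blocked = False

    def endpoint_for(self, is_write: bool) -> str:
        """Return the endpoint to use, rejecting writes during a switch."""
        with self._lock:
            if is_write and self.writes_blocked:
                raise WritesBlocked("writes are disabled during the switch")
            return self.active

    def switch_to(self, target: str,
                  replication_lag_seconds: Callable[[], float],
                  max_polls: int = 100,
                  poll_interval: float = 0.5) -> None:
        """Block writes, wait for the target to catch up, then switch."""
        with self._lock:
            self.writes_blocked = True  # step 1: no new writes anywhere
        try:
            for _ in range(max_polls):  # step 2: wait until lag reaches zero
                if replication_lag_seconds() <= 0:
                    break
                time.sleep(poll_interval)
            else:
                raise TimeoutError("target did not catch up; switch aborted")
            with self._lock:
                self.active = target  # step 3: change the active endpoint
        finally:
            with self._lock:
                self.writes_blocked = False  # step 4: allow writes again
```

Blocking writes before the endpoint changes is what rules out the dirty-write case, because no node can write to the old database while another node already writes to the new one. When the source database is completely unavailable, the lag cannot be measured this way, and a real system needs a policy for that case. This sketch simply aborts the switch.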

One-stop control & no business code intrusion

The MSHA console supports unified management, control, and switching of HTTP traffic and database access traffic. Operations converge on a one-stop management and control platform, which enables fast GUI-based operation and automated execution in failure scenarios. MSHA also provides an Agent access mode for business applications, so they obtain the disaster recovery switching capabilities without business code changes.

Retrofit content

Application to the cloud
Choose an Alibaba Cloud region that is close to the self-built IDC, and deploy a fully redundant set of applications, middleware, and databases on the cloud to build an active-active disaster recovery architecture across the cloud and the IDC. In this demo case, the Hangzhou region is selected as the disaster recovery unit.
Network connection:
Connect to CEN (Cloud Enterprise Network) to interconnect the networks on and off the cloud (for details on building an enterprise-level hybrid cloud with multiple access methods, see document [3]).
Access layer cluster deployment and configuration:
Deploy the MSHA access layer cluster (MSFE) on and off the cloud, and attach a public network SLB for public access and load balancing of the MSFE cluster (see usage document [4]).
Enter the domain name, URI, and back-end application address to obtain minute-level traffic switching between the cloud and the IDC (see usage document [5]).
Application:
Deploy business applications in batches on the cloud.
Install MSHA-Agent in the Java applications and use Nacos as the control command distribution channel, to obtain same-unit-first calls and database connection switching (see usage document [6]).
Middleware and database:
On the cloud, deploy MSE to host the ZK/Nacos registry, together with ApsaraDB for Redis and RDS. The cross-availability-zone high-availability editions are recommended because they provide same-city active-active disaster recovery.
If an application is deployed only in the IDC, configure service synchronization for the registry (see usage document [7]).
Configure data synchronization between Redis/RDS on the cloud and the self-built Redis/MySQL (see document [8]).

Revamped application deployment architecture

Daily scenario: the IDC and the cloud carry business traffic at the same time (application active-active).

Visit the e-commerce demo homepage to view the actual traffic call chain: requests are routed by probability to the Beijing or Hangzhou unit, and both units read and write the database in the Beijing unit.

Disaster recovery capability

RPO: <= 1 min (depends on DTS synchronization performance)
RTO: <= 1 min (MSHA components switch within seconds; the overall RTO depends on the DTS synchronization delay)

Disaster Recovery Capability Verification

After the application active-active architecture is built on MSHA, the disaster recovery capability of the service must be verified against expectations. The following sections inject real faults to verify it.

7.1 Exercise preparation

Enter the MSHA console and select the monitoring dashboard in the left menu bar. Use the drop-down list at the top of the page to switch to the actual namespace.
View the monitoring indicators on the page.

Note: before the exercise, use MSHA traffic monitoring or another monitoring product to determine the steady-state indicators of the business (for example, daily RT <= 200 ms and error rate < 1%). These indicators are used to judge the impact of the fault and the actual recovery of the service after the fault is handled.
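
A steady-state check of this kind can be written down as a small predicate and evaluated before the fault, during it, and after recovery. The sketch below assumes the RT threshold applies to the average latency of a sample window. The function name is_steady and its parameters are illustrative and not part of any MSHA API.

```python
from typing import Sequence


def is_steady(latencies_ms: Sequence[float],
              errors: int,
              max_avg_rt_ms: float = 200.0,
              max_error_rate: float = 0.01) -> bool:
    """Return True when a sample window meets the steady-state thresholds.

    latencies_ms holds one latency per request in the window, and errors is
    the number of those requests that failed. An empty window is treated as
    not steady, because no traffic usually means something is wrong.
    """
    total = len(latencies_ms)
    if total == 0:
        return False
    avg_rt = sum(latencies_ms) / total
    error_rate = errors / total
    return avg_rt <= max_avg_rt_ms and error_rate < max_error_rate


if __name__ == "__main__":
    window = [120.0] * 1000
    print(is_steady(window, errors=3))    # healthy window
    print(is_steady(window, errors=400))  # window during a fault
```

Comparing the result before injection and after switching gives an objective answer to whether the service has actually recovered.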

7.2 Application Fault Injection

Here we use the Alibaba Cloud fault drill product to inject a fault into the commodity application of the cloud Beijing unit.

Enter the Chaos fault drill product console [9], select the corresponding region at the top, and select My Space in the left navigation bar.
In My Space, select the configured exercise (network packet loss with 50% probability), and then click to execute the exercise.

After the fault is injected, opening the e-commerce homepage or placing an order fails with some probability, which is in line with expectations.

7.3 Traffic switching recovery

When the commodity application in the Beijing unit fails, the MSHA traffic switching function can cut the ingress traffic on the cloud to 0 and quickly restore service.

Expected

After 100% of the traffic is switched to the Hangzhou unit, the business is fully restored and is no longer affected by the failure of the Beijing unit.

Traffic switching operation

Enter the MSHA console, and select traffic switching > remote application active-active traffic switching in the left navigation bar.
On the traffic switching page, click the one-click cut-to-zero button for the Beijing unit.

Click Execute Pre-Check, and then click OK in the traffic switching check area to start the switch.
When the current status on the traffic switching task page shows that switching is completed, the switch has succeeded.

Refresh the e-commerce demo homepage. It displays normally across multiple visits, which is in line with expectations.

Check the actual traffic call chain: traffic always goes to the Hangzhou unit, which still reads and writes the database in the Beijing unit.

7.4 Database fault injection

As the call chain above shows, the applications in the Hangzhou unit still access the Redis and MySQL databases of the Beijing unit. We continue to use Chaos fault drill [10] to inject faults into the Redis and MySQL databases of the Beijing unit.

After the fault is injected, opening the e-commerce homepage or placing an order always fails, which is in line with expectations.

7.5 Database switching recovery

When the database in the Beijing unit fails, the MSHA database switching function can switch the Redis/MySQL connections used by the applications to the databases in the Hangzhou unit (during the switch, data synchronization is allowed to catch up and writes are temporarily disabled).

Expected

After the databases used by the applications are switched to Hangzhou, the business is fully restored and is no longer affected by the failure of the Beijing unit.

Switching operation

Enter the MSHA console, and select remote application active-active > data layer configuration in the left navigation bar.

In the list of data protection rules, find the product, order, and shopping cart databases, and click the master/standby switch for each one.

After you click the master/standby switch, the pre-check page opens. After confirming that every check item is normal, click Confirm to execute. The switch details page then opens and the switch process runs automatically.

On the master/standby switch details page, you can see the switch progress and result. When the task progress reaches 100%, the switch is complete.

After the product, order, and shopping cart databases have all been switched, visit the e-commerce demo homepage or place orders several times. Everything works normally and the business functions are fully restored after the master/standby switch, which is in line with expectations.

Summary

This article presented a practical case in which MSHA multi-active disaster recovery helps an enterprise build hybrid cloud application active-active disaster recovery, gave a practical method for building the disaster recovery architecture, and verified that the service disaster recovery capability meets expectations in fault scenarios.

Finally, you are welcome to search for the group number (31623894) to join the DingTalk group for consultation and discussion. Group name: Multi-Active Disaster Recovery (MSHA) Exchange DingTalk Group.
