Disaster recovery has become a basic requirement for enterprises migrating to and using the cloud

The "Global Cloud Computing IT Infrastructure Market Forecast Report" released by IDC in 2019 predicted that, in 2019, the share of global IT infrastructure running in the cloud would exceed that of traditional data centers. More and more enterprises choose to build their systems in the cloud because of its low cost and stability, and the cloud has become mainstream IT infrastructure. In recent years, open source and cloud technologies have kept developing rapidly, a wide variety of products and services has emerged, engineers have gained more decision-making power, and architectures change faster than before. During this rapid evolution, enterprises must guard against avoidable human-caused failures while also paying attention to the impact of natural disasters. An unexpected business interruption can cause serious brand, customer, and economic losses.

All cloud companies treat building disaster recovery capability as their most basic goal and commit investment to it. Only by ensuring that key data is not lost in a disaster, and that system services resume as soon as possible, can an enterprise sustain long-term, stable, and fast growth.

Common disaster failures

In production, failures large and small inevitably occur and affect system stability. Some faults are recovered quickly and external users never notice them. Others cannot be recovered for a long time; they cause problems such as negative publicity and financial loss, and may even drive the company into bankruptcy. Faults generally fall into the following categories:

  • Human errors, such as configuration errors and failed application releases, which are common;
  • Hardware failures, such as a network equipment failure that affects multiple servers in a data center or cluster;
  • Network attacks, such as DDoS;
  • Network or power outages, such as a fiber-optic cable being cut;
  • Natural disasters, such as a lightning strike that cuts power to the data center.

When such disasters strike, the public network, access gateways, data centers, and other facilities are often disrupted, leading to business problems such as traffic drops, inaccessible websites, and fault alarms. Enterprises then face two major problems: "business recovery" and "failure recovery". The best approach is to decouple the two: when a failure occurs, switch traffic quickly and give priority to business recovery; once service is restored, locate and repair the fault.

Growth of fault escape capability

Fault handling in the industry commonly covers four steps: problem discovery, problem location, problem repair, and service recovery. This clearly cannot meet the need to handle "business recovery" and "failure recovery" separately. A better approach is to replace these four steps with three: problem discovery, traffic switching, and service recovery. This shortens recovery time from "tens of minutes or even hours" to "minutes or even seconds" and improves the disaster recovery capability of the business.
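
As a minimal sketch of the three-step flow described above, in which traffic is moved away before the fault is located, the following Python snippet shifts a failed data center's traffic share to the healthy ones and defers diagnosis. The `TrafficRouter` class, the data center names, and the weight model are all hypothetical and only illustrate the idea; a real system would more likely change DNS, gateway, or routing-rule configuration.

```python
from dataclasses import dataclass, field


@dataclass
class TrafficRouter:
    """Hypothetical router: share of traffic sent to each data center."""
    weights: dict = field(default_factory=dict)

    def switch_away_from(self, failed: str) -> None:
        """Move the failed data center's share to the healthy ones, pro rata."""
        healthy = {dc: w for dc, w in self.weights.items()
                   if dc != failed and w > 0}
        if not healthy:
            raise RuntimeError("no healthy data center can take the traffic")
        released = self.weights.get(failed, 0.0)
        total = sum(healthy.values())
        for dc, w in healthy.items():
            self.weights[dc] = w + released * w / total
        self.weights[failed] = 0.0


def handle_fault(router: TrafficRouter, failed: str, repair_queue: list) -> None:
    """Restore service first; locate and repair the fault afterwards."""
    router.switch_away_from(failed)  # switch traffic as soon as the fault is found
    repair_queue.append(failed)      # diagnosis happens off the critical path


# Assumed example: three data centers, and "dc-a" has just failed.
router = TrafficRouter({"dc-a": 0.5, "dc-b": 0.3, "dc-c": 0.2})
pending_repairs = []
handle_fault(router, "dc-a", pending_repairs)
print(router.weights)
```

The point of the design is that `handle_fault` never waits for a root cause: the failed data center is queued for repair only after its traffic has been drained.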

To make traffic switching both fast and effective in real scenarios, we need to build a higher-level disaster recovery architecture and also strengthen "infrastructure", "business systems", "support tools", "production systems", and "emergency personnel". Disaster recovery and multi-active capability come from architecture and organization working together.

This capability cannot be achieved overnight. It requires continuous optimization of the architecture and of organizational coordination, so that the disaster tolerance and multi-active capabilities of the business improve step by step.

Break through geographical restrictions

Enterprises generally choose single-region deployment at first, but as the business grows, data centers in a single region can no longer meet its needs. At the same time, as the number of connections to clustered components in a single region grows explosively, a single cluster can no longer be scaled out and has to be split.

However, when clusters are split across regions, the principles of "routing consistency" and "data consistency" must be met. Only then can services break through regional restrictions, scale out horizontally across regions, and schedule traffic flexibly, thereby solving the capacity challenges of a single region, such as:

1. Machine capacity. Data centers in multiple regions are deployed as peers, so enterprises can flexibly deploy business applications across data centers in multiple locations.

2. Connection capacity. The clustered components in each data center are independent, and each data center connects only to its own components, which prevents the number of connections from growing without limit.
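
One way to picture the "routing consistency" and "data consistency" principles above is the following Python sketch. It assumes, purely for illustration, that each user is owned by exactly one unit (data center), that ownership is derived from a stable hash of the user ID, and that writes are accepted only in the owning unit. The unit names and both function names are hypothetical, not part of any product API.

```python
import hashlib

# Hypothetical unit (data center) names, used only for illustration.
UNITS = ["unit-east", "unit-north"]


def route_unit(user_id: str, units: list) -> str:
    """Map a user to one unit with a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % len(units)
    return units[slot]


def can_write(user_id: str, local_unit: str, units: list) -> bool:
    """Allow writes only in the unit that owns the user."""
    return route_unit(user_id, units) == local_unit


owner = route_unit("user-42", UNITS)
print(owner, can_write("user-42", owner, UNITS))
```

Because `hashlib.sha256` is deterministic across processes, every layer that calls `route_unit` reaches the same answer for the same user, and `can_write` rejects writes that arrive in a non-owning unit. A plain modulo mapping reshuffles users when the unit list changes, so a production system would more likely keep an explicit routing table.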

Disaster recovery and its limitations

Traditional disaster recovery is based on data-level disaster recovery. The common implementation is to build an identical set of application systems in a backup data center. When a disaster occurs, the backup resumes operation within the agreed recovery time objective (RTO), reducing the losses from the disaster as much as possible. In practice, this approach has the following problems:

1. The disaster recovery center does not serve traffic in normal times, so there is no certainty that a switchover to it will succeed at the critical moment.

2. The disaster recovery center does not serve traffic in normal times, so all of its resources sit idle, which wastes a great deal of money.

3. The disaster recovery center does not serve traffic in normal times, so the data centers that do serve traffic are still in a single region. Once business volume grows large enough, this model cannot solve the resource bottleneck of a single region.

The concept of application multi-active

"Application multi-active" is an advanced form of "application disaster recovery". A production system corresponding to part or all of the local production system is established in a data center in the same city or in a remote location, and the applications in all data centers serve external traffic at the same time. When a disaster occurs, a multi-active system can switch business traffic within minutes, and users may not even notice the failure.

Common application multi-active architectures are intra-city multi-active, remote multi-active, and hybrid cloud multi-active. Compared with traditional disaster recovery, application multi-active has the following four advantages:

  • Minute-level RTO. Recovery is fast. Alibaba's internal production systems recover within 30 seconds on average, and external customers' production systems recover in 1 minute on average.
  • Full resource utilization. No resources sit idle; multiple data centers and their resources are all in use, which avoids waste.
  • Higher switchover success rate. Backed by a mature multi-active architecture and a visual operations and maintenance platform, switchovers succeed more often than with existing disaster recovery architectures.
  • Traffic control. Application multi-active supports top-down traffic control and relies on precise traffic steering to direct specific business traffic to the corresponding data center. Enterprises can build features such as global grayscale release and key traffic protection on this capability.
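
The traffic control advantage above can be illustrated with a small Python sketch of rule-based steering. It assumes two kinds of rules, evaluated in order: key users are pinned to a protected data center, and a fixed percentage of the remaining users is sent to a grayscale data center. The data center names, the `pick_room` function, and the bucketing scheme are hypothetical and do not describe any specific product.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Place a user into a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_room(user_id: str, key_users: set, gray_percent: int) -> str:
    """Choose a data center for one request (names are illustrative)."""
    if user_id in key_users:
        return "room-protected"   # key traffic protection
    if bucket(user_id) < gray_percent:
        return "room-gray"        # global grayscale release
    return "room-default"


# Assumed example: one key customer, and 10% of other users in grayscale.
key_users = {"big-customer"}
for uid in ["big-customer", "user-1", "user-2"]:
    print(uid, pick_room(uid, key_users, gray_percent=10))
```

Since the bucket depends only on the user ID, a given user stays in the same group for the whole grayscale period, and raising `gray_percent` only adds users to the grayscale data center without moving earlier ones out.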

By 2025, more than 50% of enterprises will use distributed cloud. Public cloud service capabilities will extend to edge computing and IDCs, and a single distributed cloud will cover all scenarios. Cross-cloud, cross-platform, and cross-region application multi-active scenarios and technologies will begin to emerge. Disaster recovery is indispensable: an application system must be able to escape disaster failures at any time. A smooth migration to the cloud is a key decision point for every decision maker.

As the business keeps growing and the architecture keeps evolving, disaster recovery governance has to solve the problems that arise along the way. How to implement a disaster recovery architecture that combines application multi-active with organizational coordination is a concern for a growing number of enterprises.

The above is an excerpt from the "Application Multi-Active Technology White Paper".


Alibaba Cloud Native (阿里云云原生)