[ESSD Technical Interpretation] Asynchronous replication of Alibaba Cloud's block storage enterprise features

Introduction: In the era of big data, data is an enterprise's core asset and its lifeline. Disasters happen from time to time in the real world, and when one strikes, disaster recovery capability becomes the key to an enterprise's survival. Cloud disaster recovery services, usually called DRaaS (Disaster Recovery as a Service), save the cost of building a dedicated disaster recovery center as well as the subsequent operation and maintenance costs, and help customers quickly set up cross-region disaster recovery. Because resources are ready to use on purchase and can be released at any time, DRaaS also gives users great flexibility. This article introduces the forms that disaster recovery products take on the cloud so that enterprises can choose a solution suited to their own characteristics, analyzes the technical architectures of asynchronous replication in traditional and cloud disaster recovery, and describes how Alibaba Cloud Block Storage has implemented asynchronous replication for block storage based on its own architecture.

## Preface

#### Data is the lifeline of an enterprise

Remote disaster recovery of data is a universal requirement of enterprise customers, and for large customers such as government and financial institutions it is a core requirement. In the era of big data, data is an enterprise's core asset and its lifeline. Disasters happen from time to time in the real world, and when one strikes, disaster recovery capability becomes the key to an enterprise's survival.

In the September 11 attacks in the United States, the Twin Towers collapsed and the data centers of several banks were destroyed. Deutsche Bank, which had backed up its data tens of kilometers away, quickly resumed business and was well received by its customers, while the Bank of New York closed down a few months later because it had no disaster recovery plan.

In March 2021, a data center of OVHcloud, France's largest data center operator, caught fire, and more than 3.5 million websites were affected.

In the recent Zhengzhou flood of July 20, the Heyi campus of the First Affiliated Hospital of Zhengzhou University was hit by continuous rainstorms. The entire campus was flooded and lost power, but the core computer room in the east area kept the other two campuses operating normally.

#### Disaster recovery on the cloud has become a trend

Each of these real cases sounds an alarm, and together they have led enterprises to keep increasing their investment in data protection and disaster recovery. Traditional disaster recovery solutions often require an enterprise to build its own disaster recovery center, purchase dedicated lines, and assign staff to operation and maintenance, so the investment is relatively high. With the rapid development of cloud computing, more and more enterprise customers are considering disaster recovery on the cloud. Cloud disaster recovery services, usually called DRaaS (Disaster Recovery as a Service), save the cost of building a dedicated disaster recovery center as well as the subsequent operation and maintenance costs, and help customers quickly set up cross-region disaster recovery. Because resources are ready to use on purchase and can be released at any time, DRaaS also gives users great flexibility. The following table compares DRaaS with traditional disaster recovery solutions. Compared with traditional disaster recovery, DRaaS requires zero infrastructure, less operation and maintenance, and offers high flexibility, which is why it has become the trend in disaster recovery.

| | DR as a Service | Traditional offline DR |
|---|---|---|
| Elasticity | Created on demand, released on demand | Requires advance planning |
| Construction cost | Pay-as-you-go; only cloud disks and the required bandwidth are charged, with zero infrastructure cost | One-time purchase of arrays and networks; high initial investment |
| Disaster recovery data link | Private network | Self-built dedicated line or leased bandwidth |
| Best practices | Works with a variety of cloud components on the cloud to achieve an optimal pattern | Self-exploration plus vendor-recommended patterns |
| Feature evolution | Upgrades and evolves with the public cloud | Requires purchasing new equipment, software, and upgrades |

#### ESSD cloud disk asynchronous replication

Alibaba Cloud's block storage product ESSD is a world-leading flagship product that has gradually matured. To better serve enterprise customers and meet their cloud disaster recovery needs, Alibaba Cloud Block Storage has launched its own DRaaS product, cloud disk asynchronous replication, which replicates cloud disks asynchronously across regions. This article explains how users can choose a suitable cloud disaster recovery product, analyzes the similarities and differences of different disaster recovery architectures from a technical perspective, and then describes how we chose a disaster recovery architecture for ESSD and the technical principles behind cloud disk asynchronous replication.

## How do companies choose cloud disaster recovery solutions

#### Select the appropriate disaster recovery type according to RPO and RTO

When choosing a disaster recovery solution, an enterprise should first determine the disaster recovery level it needs based on its own business characteristics. In the disaster recovery field, RPO (Recovery Point Objective) measures the maximum amount of data a disaster recovery system may lose, and RTO (Recovery Time Objective) measures the maximum time from the occurrence of a disaster until the entire system is back to normal. China has issued national standards that divide disaster recovery into six levels, as shown in the figure below.

For enterprises, the higher the level from 1 to 6, the lower the risk of data loss, but the higher the cost of building disaster recovery. In the traditional storage industry, data backup and archiving products usually meet levels one to two, the backup functions of ordinary storage arrays meet levels three to five, the asynchronous replication function of high-end storage arrays meets levels four to five, and the synchronous replication, active-active, and application-based replication functions of high-end storage meet levels five to six.

On the cloud, the major vendors likewise provide a rich set of products for different disaster recovery levels. Cloud disaster recovery centers usually provide cross-region or cross-availability-zone disaster recovery services that can meet levels one to four, while asynchronous and synchronous replication products can meet levels five to six. Mainstream applications such as database services usually have their own disaster recovery products as well, which can achieve the finest, IO-level disaster recovery granularity. From these levels it can be seen that asynchronous replication can meet levels four to five, and it is widely required by banks, other financial customers, and government organizations.

#### Choose the appropriate disaster recovery service according to the characteristics of the system

From an implementation perspective, the disaster recovery solutions of existing cloud vendors fall roughly into three categories: application-based, instance-based, and block-storage-based.

- Based on application: This type of solution usually targets a specific application service, such as a cloud database, message queue, or object storage. Users of these cloud services can choose the disaster recovery service of the corresponding product as needed. Its advantage is that it can often achieve application-level consistency in combination with the business. Its disadvantage is that it is not universal: only workloads built on that specific application can use it.
- Based on cloud host: Users who have purchased only IaaS services, who run their own customized services, or whose needs are not met by application-level disaster recovery can choose a cloud-host-based solution, which protects the data consistency of the whole machine or across instances. Besides recovering the stored data, the disaster recovery site usually also restores the host network, which makes this type more convenient to use. Its advantages are simple operation and broad applicability. Its disadvantage is that host resources must also be purchased at the disaster recovery site, which is relatively expensive.
- Based on block storage (cloud disk): The core of disaster recovery is data disaster recovery, so some vendors have launched cross-region replication products for the cloud disk itself. This product form is more flexible and generally places no restrictions on applications. During replication, no host needs to be purchased at the disaster recovery site, which reduces user costs, and the product can be combined seamlessly with other cloud services to achieve an effect similar to application-level disaster recovery. In addition, a group of cloud disks can form a consistency group, and the replicated data of the group satisfies crash-consistency semantics.
## ESSD cloud disk asynchronous replication: business recovery in a few simple steps

The cloud disk asynchronous replication function replicates the data of ESSD cloud disks asynchronously across regions, with an RPO of 15 minutes. Users can create a disaster recovery pair in three simple steps: first, select the cloud disk to be replicated; second, select the disaster recovery site and create the slave disk; third, create the disaster recovery pair and activate it. Once the pair is activated, the cloud disk data is periodically copied to the corresponding slave disk at the disaster recovery site. To pause replication temporarily, users can use the stop function of the disaster recovery pair.

When a failure occurs, the user can use the failover function to switch between the primary and standby sites. Failover disconnects the replication link, restores the slave disk to the last replicated consistency point, and grants the user read and write access to it.

After disaster recovery, users who want to move the business back to the original production site can use the reverse recovery function to copy the incremental data generated at the secondary site back to the primary site.
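The operations described in this section (activate, stop, failover, reverse recovery) can be pictured as a small state machine. The following is only an illustrative sketch: the state names, the `ReplicationPair` class, and the transition rules are assumptions made for this example and do not reflect the actual Alibaba Cloud API or its internal states.

```python
from enum import Enum


class PairState(Enum):
    CREATED = "created"          # pair exists, replication not started yet
    REPLICATING = "replicating"  # data is periodically copied to the slave disk
    STOPPED = "stopped"          # replication temporarily paused by the user
    FAILED_OVER = "failed_over"  # link cut, slave disk readable and writable


# Hypothetical rules: operation -> (states it is allowed from, resulting state).
TRANSITIONS = {
    "activate": ({PairState.CREATED, PairState.STOPPED}, PairState.REPLICATING),
    "stop": ({PairState.REPLICATING}, PairState.STOPPED),
    "failover": ({PairState.REPLICATING, PairState.STOPPED}, PairState.FAILED_OVER),
    "reverse_recover": ({PairState.FAILED_OVER}, PairState.REPLICATING),
}


class ReplicationPair:
    """Toy model of a disaster recovery pair (names are illustrative only)."""

    def __init__(self, source_disk, target_disk):
        self.source_disk = source_disk  # disk that currently serves the business
        self.target_disk = target_disk  # disk that receives replicated data
        self.state = PairState.CREATED

    def apply(self, operation):
        allowed, result = TRANSITIONS[operation]
        if self.state not in allowed:
            raise ValueError(f"cannot {operation} while {self.state.value}")
        if operation == "reverse_recover":
            # Assumption: reverse recovery replicates in the opposite direction,
            # sending data written at the secondary site back to the primary.
            self.source_disk, self.target_disk = self.target_disk, self.source_disk
        self.state = result
        return self.state


pair = ReplicationPair("disk-primary", "disk-secondary")
pair.apply("activate")         # periodic replication starts
pair.apply("failover")         # disaster: switch to the secondary site
pair.apply("reverse_recover")  # copy increments back to the original site
```

Modeling the operations as an explicit transition table makes invalid sequences, such as stopping a pair that was never activated, fail loudly instead of silently.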

## The technology behind asynchronous replication on the cloud

This chapter discusses how the asynchronous replication technology widely used in disaster recovery products is implemented, and how it compares with traditional storage architectures. The core of disaster recovery is data disaster recovery, and block storage disaster recovery is the most common solution, so the discussion below focuses on asynchronous replication of cloud disks.

#### Traditional storage replication architecture

There are roughly three ways to implement asynchronous replication in traditional storage:

- Based on the storage gateway: The storage gateway sits between the server and the storage device and is a storage service technology built on a SAN network. It can provide flexible and diverse storage services for the incoming IO stream. Because the gateway is separate from both the host server and the array, it does not consume host or storage resources and can easily support replication between heterogeneous systems. However, because all IO has to pass through the gateway link, there is some performance loss, which makes it unsuitable for services with high performance requirements.
- Based on the host: In SAN storage this is usually implemented on the Initiator side, where data is split on the host according to the IO replication requirements. A typical implementation is DRBD. This architecture places no requirements on the back-end storage array, but the host must install the corresponding software. Most third-party vendors that provide disaster recovery services use this architecture.
- Based on storage arrays: Most storage array vendors implement an array-based replication architecture that takes advantage of their own array characteristics. Under this architecture, vendors track and double-write data inside their own array IO architecture on the Target side.

#### Two technical architectures of asynchronous replication on the cloud

Cloud vendors usually build on their own product features, and two technical architectures are common:

- Agent-based architecture: This approach usually requires a plug-in to be installed in the user's virtual host as an IO agent. The plug-in intercepts user IO requests and forwards them for replication. Its advantage is that it can provide application-level consistency semantics. It places no special requirements on the cloud disk vendor and makes disaster recovery across heterogeneous systems easy. Its disadvantage is that users must deploy the plug-in before they can use it, and there may be restrictions on the version of the user's operating system. Third-party service providers for cloud products usually adopt this architecture.
- Agentless architecture: This approach is usually built on the underlying storage system. It relies on consistency points provided by the storage system and on techniques such as obtaining data difference bitmaps to perform full or incremental replication, and it provides users with crash-consistency semantics. Its advantage is that it can work with the storage system to replicate differential data efficiently, without intruding on the user's host system, so it is simpler for the business to use. Its disadvantage is that it cannot achieve application-level data consistency. Mainstream cloud vendors usually have self-developed block storage services and tend to implement an agentless replication architecture that fits the architecture of their own block storage.
## Alibaba Cloud ESSD cloud disk asynchronous replication architecture

Alibaba Cloud Block Storage has also launched its own asynchronous replication product. This chapter describes how Alibaba Cloud built it on top of its own architecture. The asynchronous replication function of Alibaba Cloud block storage uses an agentless implementation. The system architecture is shown in the figure. The disaster recovery management software is deployed at both the production site and the disaster recovery site. The disaster recovery control system periodically issues replication tasks to the asynchronous replication IO component, which obtains the data differences from the back end of the cloud disk storage system and copies the differential data to the target region. The current RPO design target for cross-region replication is 15 minutes.
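To make the relationship between periodic replication and the RPO target concrete, here is a rough back-of-the-envelope sketch. It rests on a simplifying assumption that is not stated in the article: each cycle takes a consistency point when it starts and finishes copying before the next cycle begins. Under that assumption, just before a cycle completes, the newest complete point on the slave disk dates from the start of the previous cycle, so the worst-case staleness is roughly one replication interval plus one copy duration. The function names and the sample numbers are hypothetical.

```python
def worst_case_lag_seconds(interval_s, changed_bytes, bandwidth_bytes_per_s):
    """Rough upper bound on how stale the slave disk can be.

    Assumes a consistency point is taken at the start of every cycle and
    that copying one cycle's changes takes changed_bytes / bandwidth.
    """
    copy_s = changed_bytes / bandwidth_bytes_per_s
    return interval_s + copy_s


def meets_rpo(interval_s, changed_bytes, bandwidth_bytes_per_s, rpo_s=15 * 60):
    """Check the estimated worst-case lag against an RPO (default 15 minutes)."""
    lag = worst_case_lag_seconds(interval_s, changed_bytes, bandwidth_bytes_per_s)
    return lag <= rpo_s


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical workload: a cycle every 10 minutes over a 100 MiB/s link.
print(meets_rpo(600, 20 * GIB, 100 * MIB))  # about 205 s of copying
print(meets_rpo(600, 30 * GIB, 100 * MIB))  # about 307 s of copying
```

With these sample numbers the first workload fits within 15 minutes and the second does not, which illustrates why the amount of changed data per cycle and the available bandwidth both matter for meeting an RPO.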

#### High-availability architecture

The asynchronous replication implementation adopts a high-availability architecture. So that the system remains available in failure scenarios, the disaster recovery management component is deployed at both the production site and the disaster recovery site, rather than on one side only or at a third location. The disaster recovery management metadata is synchronized between the master and slave sides of the disaster recovery pair, which ensures that the management function at the secondary site is still available if a disaster strikes the primary site. In addition, both the disaster recovery management software and the replication link are highly available, and all control nodes are deployed with one master and two standbys to guarantee the continuity of the disaster recovery service itself.

#### Efficient replication

Asynchronous replication copies data incrementally, which minimizes the amount of data copied and transmitted and therefore improves replication efficiency. The underlying storage system can generate consistency points internally at low cost, which makes it possible to obtain a crash-consistent view of the cloud disk efficiently, and its internal index makes it possible to obtain the incremental differences efficiently. The following figure shows how the storage system obtains the difference bitmap of a consistency point. The difference bitmap is serialized into a data change log, the DCL (Data Change Log), and sent to the replication component, which reads the consistency-point data of the regions marked in the bitmap and writes it to the slave disk.
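The incremental flow described above (difference bitmap, serialized change log, copy of the changed regions) can be illustrated with a toy model. This is a minimal sketch under loud assumptions: the real storage system derives differences from its internal index, whereas this example simply compares two in-memory byte strings block by block, and the JSON layout of the change log is invented for illustration.

```python
import json

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example is readable


def diff_bitmap(prev_point, curr_point):
    """Mark every block that differs between two consistency points.

    Stand-in technique: a direct block comparison, not the index-based
    lookup a real storage system would use.
    """
    if len(prev_point) != len(curr_point):
        raise ValueError("consistency points must have the same size")
    blocks = (len(curr_point) + BLOCK_SIZE - 1) // BLOCK_SIZE
    return [
        prev_point[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        != curr_point[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        for i in range(blocks)
    ]


def serialize_dcl(bitmap):
    """Serialize the bitmap into a change log (hypothetical JSON format)."""
    changed = [i for i, dirty in enumerate(bitmap) if dirty]
    return json.dumps({"block_size": BLOCK_SIZE, "changed": changed})


def apply_dcl(dcl, curr_point, slave_disk):
    """Copy only the blocks listed in the change log to the slave disk."""
    log = json.loads(dcl)
    size = log["block_size"]
    for i in log["changed"]:
        slave_disk[i * size:(i + 1) * size] = curr_point[i * size:(i + 1) * size]
    return len(log["changed"])


previous = b"aaaabbbbccccdddd"      # consistency point already on the slave
current = b"aaaaBBBBccccDDDD"       # newer consistency point on the primary
slave = bytearray(previous)

dcl = serialize_dcl(diff_bitmap(previous, current))
copied = apply_dcl(dcl, current, slave)
print(copied, bytes(slave) == current)
```

Only two of the four blocks are transferred here, which is the point of incremental replication: the cost of a cycle scales with the data that changed, not with the size of the disk.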

The replication link also automatically splits the replication process into shards according to characteristics such as the size of the cloud disk and the available bandwidth, and replicates the shards concurrently, which improves replication efficiency and helps meet the RPO as far as possible. The following figure shows how the replication component obtains the difference bitmap and runs the replication task. A cloud disk is divided into multiple data slices according to its size, and the slices are stored on different data servers. The replication component obtains the DCL of the consistency view from the storage servers. The size of the cloud disk and the bandwidth determine how many subtasks a cloud disk is divided into for replication, so that the task adapts better to the replication bandwidth. The following figure shows the working principle of the replication IO component.
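The idea of splitting one disk into subtasks based on its size and the available bandwidth, then copying the pieces concurrently, can be sketched as follows. The sizing rule, the parameter names, and the in-memory "disks" are all assumptions made for this example, and the article does not specify the actual policy. For brevity the sketch copies every byte of each range rather than only the blocks listed in a change log.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def plan_subtasks(disk_size, bandwidth, per_task_throughput, min_chunk):
    """Split a disk into (offset, length) ranges.

    Hypothetical rule: use as many subtasks as the bandwidth can feed,
    but never make a subtask smaller than min_chunk.
    """
    by_bandwidth = max(1, bandwidth // per_task_throughput)
    by_size = max(1, math.ceil(disk_size / min_chunk))
    count = min(by_bandwidth, by_size)
    chunk = math.ceil(disk_size / count)
    return [
        (offset, min(chunk, disk_size - offset))
        for offset in range(0, disk_size, chunk)
    ]


def replicate(source, target, ranges, workers=4):
    """Copy non-overlapping ranges concurrently; return the bytes copied."""

    def copy_range(byte_range):
        offset, length = byte_range
        # Same-length slice assignment, so the target never changes size.
        target[offset:offset + length] = source[offset:offset + length]
        return length

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(copy_range, ranges))


source_disk = bytes(range(100))
target_disk = bytearray(len(source_disk))

ranges = plan_subtasks(
    disk_size=len(source_disk), bandwidth=40, per_task_throughput=10, min_chunk=10
)
copied = replicate(source_disk, target_disk, ranges)
print(ranges, copied, bytes(target_disk) == source_disk)
```

Capping the subtask count by both bandwidth and disk size keeps small disks from being split into pointlessly tiny pieces while letting large disks use the whole link.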

#### No loss of primary disk performance

Thanks to the storage system's efficient index and its high-performance consistency point generation, asynchronous replication has a negligible impact on the performance of the cloud disk at the primary site, and the primary disk still fully meets its published performance specifications.

#### Second-level RTO

Traditional backup services usually store data on an external system such as OSS, and when the data is needed they create disks from the snapshots stored on OSS and load the data. This lengthens the RTO, which usually reaches minutes or longer. Cloud disk asynchronous replication periodically writes to the slave cloud disk. The slave disk is neither readable nor writable during the replication phase, but it becomes available immediately after failover. The RTO can reach the level of seconds, thanks to a design in which the slave disk data is always online and to the storage system's ability to roll back quickly to a consistency point.

## Summary and Outlook

This article introduced the forms that disaster recovery products take on the cloud so that enterprises can choose a solution suited to their own characteristics, and analyzed the technical architectures of asynchronous replication in traditional and cloud disaster recovery. Alibaba Cloud Block Storage has also implemented an asynchronous replication product for block storage based on its own architecture. Compared with traditional remote disaster recovery solutions, the block storage disaster recovery solution has the following advantages:

- Low cost: It does not have to be bound to virtual machines. Users only need to purchase cloud disks at the disaster recovery site rather than standby virtual machines, which can be purchased on demand during disaster recovery, greatly reducing operating costs.
- Ease of use: No agent plug-in needs to be installed on the user's virtual machine, so the application is unaware of replication and there is no version requirement for the host operating system. The service is purchased on demand, simple to operate, and ready to use, and it supports one-click failover and one-click creation of disaster recovery drill disks.
- High availability: The disaster recovery components use a highly available design, which ensures that the disaster recovery system can still perform failover operations in a disaster scenario.
- Fast service recovery: Service recovery time is short, with an RTO at the level of seconds.
- Very low performance overhead: The performance of the primary disk is almost unaffected during replication.

The block storage asynchronous replication product is committed to providing users with a simple, efficient, easy-to-use, and low-cost remote disaster recovery solution. In the future we will continue to enrich the product's features and successively launch consistent replication groups, shared disk support, data link compression, and deduplication, in order to broaden usage scenarios, reduce user costs, and build a reliable, easy-to-use, low-cost DRaaS service. Stay tuned.

Original work: Li Weiwei, Alibaba Cloud Storage
