Introduction: Against the backdrop of cloud native, this article describes how the Alibaba Cloud block storage snapshot service builds on high-performance ESSD cloud disks to improve snapshot performance and deliver a lightweight, real-time user experience, and it explains the technical principles behind that experience. Following industry trends and cloud data protection scenarios, the service offers enterprise users and backup vendors data protection solutions based on advanced snapshot features, meeting the urgent data protection needs of cloud users and ensuring business continuity on the cloud.

In July 2021, the consulting firm Gartner released its Magic Quadrant for public cloud IaaS (Infrastructure as a Service) and PaaS (Platform as a Service) platforms. On the strength of its technical capabilities, Alibaba Cloud was included as a public cloud service provider: its cloud block storage received the top score in its individual category, and its compute, storage, network, and security scores ranked first in the world. The storage result is inseparable from the high-performance ESSD cloud disk product, which provides users with highly available, highly reliable, high-performance block-level random access and native snapshot data protection.

image

New requirements of cloud-native businesses

With the development of cloud-native technology, more and more enterprises are building large-scale, elastically scalable, and richly distributed business scenarios on the cloud, drawing on cloud virtualization, elastic scaling, and the fast-growing cloud-native stack: distributed frameworks, container technology, orchestration systems, continuous delivery, and rapid iteration. The deployment scale of enterprise applications and their demand for storage, compute, and other resources have grown exponentially as a result, and traditional data protection solutions can no longer keep up with these changes. Users face fiercer market competition and urgently need cloud data protection solutions that scale with the business. Yet their fundamental demands for data protection have not changed. The yardsticks are still the recovery time objective (RTO) and the recovery point objective (RPO).
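
To make the two metrics concrete, here is a minimal sketch using a made-up incident timeline. The timestamps and function names are illustrative assumptions, not measurements from the snapshot service: RPO is bounded by the gap between the last recoverable point and the failure, and RTO by the gap between the failure and the moment service resumes.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_snapshot: datetime, failure: datetime) -> timedelta:
    """Data written after the last recoverable point is lost."""
    return failure - last_snapshot


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """How long the business was interrupted."""
    return service_restored - failure


# Hypothetical timeline (assumed values, for illustration only).
last_snapshot = datetime(2021, 7, 1, 12, 0, 0)
failure = datetime(2021, 7, 1, 12, 0, 45)
service_restored = datetime(2021, 7, 1, 12, 2, 15)

rpo = achieved_rpo(last_snapshot, failure)
rto = achieved_rto(failure, service_restored)
print(f"RPO: {rpo.total_seconds():.0f} s, RTO: {rto.total_seconds():.0f} s")
```

Shorter snapshot intervals shrink the first number; faster rollback and cloning shrink the second.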

image

The primary goal users pursue is still business continuity: when the business is threatened with interruption, it can be restored quickly; when it faces growth pressure, it can be scaled out quickly. Based on their business scenarios, users have raised the following urgent requirements for data protection and snapshot services on the cloud:

    • Creation time: snapshots complete extremely quickly, so key business data is backed up immediately.
    • Instant availability: snapshots become usable extremely quickly, to handle emergencies and complete cloud disk rollback and recovery.
    • Business expansion: a sudden increase in business volume requires rapid scale-out.
    • Whole-machine protection: consistent data protection across the multiple associated disks of a single ECS instance or of multiple ECS instances.
    • Test verification: data can be tested, verified, and recovered outside the production environment.
    • Recovery speed: file system and application data are backed up in an application-consistent state, avoiding application downtime and lengthy recovery procedures.
    • Container backup: the rapid iteration and release of containerized workloads creates an urgent need to protect metadata and application business data.

The Storage Networking Industry Association (SNIA) defines a snapshot as a fully usable copy of a specified data set, where the copy includes an image of the data as it appeared at a certain point in time (the point at which the copy started). A cloud block storage snapshot provides such a copy of an ESSD cloud disk at a given moment. Following industry trends, the snapshot service keeps discovering new user needs and scenarios and keeps developing and iterating on new functions. It has upgraded and optimized the advanced enterprise features of ESSD cloud disk snapshots: instant snapshot availability, application-consistent snapshots, consistency group snapshots suited to distributed application architectures, and cross-region snapshot replication for remote disaster recovery. Delivered both independently and through integration, these features meet the needs of enterprise users on the cloud in big data, gaming, artificial intelligence, finance, and other fields. They have also drawn feedback from other Alibaba Cloud teams and their users, including the cloud database (RDS) team, the hybrid cloud backup team, Elastic Container Instance (ECI), and Container Service (ACK):

    • Users of the cloud database service RDS report that RDS's seconds-level backup is on par with the industry's database backup products, reduces the load that the original physical file backup placed on instance resources, and effectively lowers data protection risk.
    • Tucson, a customer of ECI's container acceleration, reports that the extremely fast cache acceleration speeds up container application releases, cuts the compute time of its simulation platform to under 5 minutes per task on average, and greatly shortens its product release cycle.
    • Hybrid cloud backup customers report that the application-consistent whole-machine backup capability is fully on par with the snapshot function of VMware.
    • The consistency group snapshot and application-consistent snapshot capabilities fully met the requirements of Gartner's 2021 evaluation of Alibaba Cloud storage services, and the Container Service (ACK) team passed Forrester's 2021 container backup capability evaluation.

Typical scenarios

With lightweight, real-time snapshots, and with the instant availability, consistency group snapshot, and application-consistent snapshot features, enterprise users and third-party backup vendors can quickly build Copy Data Management (CDM) scenarios: fast backup and recovery, disaster recovery testing, copy utilization, and disaster recovery switchover. In its July 2021 Hype Cycle analysis of storage and data protection technologies, Gartner listed container backup, cloud data backup, and copy data management (CDM) among the data protection trends of the past few years. Gartner's basic definition of copy data management is: generate an application-consistent "golden image" from primary storage snapshots onto secondary storage, and use it for backup, disaster recovery, and testing, with support for heterogeneous storage as a basic capability. The advanced snapshot features of Alibaba Cloud ESSD fully meet the conditions for building CDM and help users implement the typical scenarios of cloud-native copy data management:

image

Backup and recovery: fast backups and standard backups together provide near-term and long-term recovery points. Building on whole-machine protection for ECS instances and for container applications in K8S environments, the service periodically creates instantly available snapshots. With the consistency group snapshot and instant availability features enabled, local instant snapshots can be generated at intervals of seconds. These instant copies are retained locally and serve as fast backups that can be restored in seconds with no loss of IO performance. Application-consistent whole-machine snapshots are generated periodically according to the upper-layer enterprise applications. The local snapshot copy is also uploaded over the network to Object Storage Service (OSS) as a standard backup. Once uploaded, the standard backup is visible in all availability zones of the region, which suits historical data with long retention periods.
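
The two-tier strategy can be pictured as one list of recovery points spread across a fast local tier and a standard OSS tier. The sketch below is illustrative only: the `RestorePoint` type, the tier labels, and the selection rule are assumptions made for the example, not the service's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RestorePoint:
    taken_at: int  # seconds since an arbitrary epoch
    tier: str      # "fast" (local instant copy) or "standard" (uploaded to OSS)


def pick_restore_point(points: List[RestorePoint], target_time: int) -> Optional[RestorePoint]:
    """Return the newest point at or before target_time, preferring the fast tier on ties."""
    candidates = [p for p in points if p.taken_at <= target_time]
    if not candidates:
        return None
    return max(candidates, key=lambda p: (p.taken_at, p.tier == "fast"))


# Hypothetical retention: sparse standard backups plus dense, recent fast backups.
points = [
    RestorePoint(0, "standard"),
    RestorePoint(3600, "standard"),
    RestorePoint(7170, "fast"),
    RestorePoint(7180, "fast"),
    RestorePoint(7190, "fast"),
]

recent = pick_restore_point(points, 7185)      # served by a fast local copy
historical = pick_restore_point(points, 4000)  # served by a standard backup
print(recent, historical)
```

Recent targets resolve to a fast copy a few seconds old, while older targets fall back to the long-retention standard backups.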

Disaster recovery testing: based on fast backups. Copy data management requires regular testing of the disaster recovery environment. Regular testing improves the reliability of that environment and prevents configuration problems or environment drift from blocking the switchover when a real disaster occurs, which would leave the business unable to recover quickly. With fast cloning from local snapshot copies, disaster recovery instances and container applications can be brought up and the backup data mounted and verified on a regular schedule. Traditional replication-based solutions must wait for snapshots to be replicated to the disaster recovery site before tests and drills can run. With fast backups, the disaster recovery side can clone, mount, and start up for testing within seconds.

Copy utilization: data analysis based on fast backups. Without affecting the production environment, fast cloning in the disaster recovery environment is used to bring up container applications on a regular schedule and run big data computation and analysis on the copy to mine the value of the data. In practice, copy utilization also appears in MySQL database deployments, where fast backups provide instant read-only copies of the database for offline data analysis.

Disaster recovery switchover: the business is cut over from the production environment to the disaster recovery environment. When a major disaster strikes production, the business cannot be restored in a short time, and production cannot continue, the business is switched from the production center to the disaster recovery center. After the production center recovers, the business is switched back from the disaster recovery center.

Compared with traditional copy data management (CDM) solutions, cloud computing and cloud-native environments offer large-scale, elastic, homogeneous compute, so enterprise users do not need to invest in equipment and software. Fast backup and fast cloning greatly reduce the recovery time objective (RTO) for copy-based development, testing, and disaster recovery switchover. The unified backup data format of the cloud snapshot service reduces the number of copies needed across management processes and eliminates data format compatibility problems between backup software products.

Technical Principle

We have heavily optimized the distributed snapshot algorithm and its implementation so that users can set aside concerns about performance impact and protect data in a lightweight, real-time way at any time. "Light": snapshot creation does not affect IO read and write performance. "Fast": ESSD cloud disk snapshots can be created, rolled back, and cloned in seconds. The instant availability feature meets users' needs for real-time data protection and rapid DevOps orchestration.

image

Instant availability feature

A snapshot service with the instant availability feature still covers data backup, compliance, and long-term archiving, with cloud disk data backed up to Alibaba Cloud's Object Storage Service (OSS). In addition, retaining local snapshot copies at intervals of seconds yields a protection strategy that combines near-term and long-term snapshots, and enables the advanced features of lightweight snapshot creation, instantly available fast cloning, and lossless rollback in seconds.

Fast cloning: in a cross-availability-zone disaster recovery environment isolated from production, cloning new disks from snapshots provides writable snapshots for application test verification and business recovery preparation. It also relieves business pressure on the cloud by enabling horizontal scale-out. For example, scaling out a MySQL database, setting up a standby database, creating instances, and separating reads from writes all require new instances to come up in seconds. Fast cloning uses lazy loading so that data in a local snapshot copy is available within seconds anywhere in the region and across clusters, allowing a new disk to be cloned and the instance brought up in seconds.
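
The lazy-loading idea can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions (fixed-size blocks, whole-block writes, a Python list standing in for the snapshot copy); the `LazyClone` class is hypothetical and does not describe how ESSD implements cloning.

```python
from typing import Dict, List


class LazyClone:
    """A writable disk cloned from a snapshot; blocks are fetched on first access."""

    def __init__(self, snapshot_blocks: List[bytes]) -> None:
        self._snapshot = snapshot_blocks    # stands in for the snapshot copy
        self._local: Dict[int, bytes] = {}  # blocks already loaded or written
        self.fetches = 0                    # blocks pulled from the snapshot so far

    def read(self, index: int) -> bytes:
        if index not in self._local:
            self._local[index] = self._snapshot[index]  # load on demand
            self.fetches += 1
        return self._local[index]

    def write(self, index: int, data: bytes) -> None:
        # A whole-block write never needs the snapshot's old content.
        self._local[index] = data


snapshot = [b"a", b"b", b"c"]
clone = LazyClone(snapshot)       # usable immediately, nothing copied yet
fetches_at_creation = clone.fetches

clone.write(1, b"B")              # the clone is writable
first = clone.read(0)             # fetched from the snapshot on first read
second = clone.read(1)            # served from the local write
print(fetches_at_creation, first, second, clone.fetches)
```

The clone is usable before any data moves, which is what lets a new disk attach and an instance start without waiting for a full copy.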

image

Rollback in seconds: local snapshot copy data is stored alongside the cloud disk data, enabling rollback in seconds with no IO loss. Snapshot generation is based on improved redirect-on-write (ROW) technology and holographic indexing technology, and as cloud disk data blocks change, ESSD IO is optimized for the best cloud disk read performance. No data needs to be pulled from remote object storage, so rollback completes in seconds without degrading IO performance.
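
As a rough illustration of why redirect-on-write makes both snapshot creation and rollback cheap, consider the toy model below. It is a textbook-style sketch, not the improved ROW or indexing scheme the article refers to; the `RowDisk` class and its methods are hypothetical.

```python
from typing import Dict, List


class RowDisk:
    """Toy redirect-on-write disk: data is append-only, snapshots copy only the index."""

    def __init__(self) -> None:
        self._store: List[bytes] = []     # append-only block store
        self._index: Dict[int, int] = {}  # logical block -> position in _store
        self._snapshots: Dict[str, Dict[int, int]] = {}

    def write(self, block: int, data: bytes) -> None:
        self._store.append(data)          # new data goes to a new location
        self._index[block] = len(self._store) - 1

    def read(self, block: int) -> bytes:
        return self._store[self._index[block]]

    def snapshot(self, name: str) -> None:
        self._snapshots[name] = dict(self._index)  # no data blocks are copied

    def rollback(self, name: str) -> None:
        self._index = dict(self._snapshots[name])  # swap the index back


disk = RowDisk()
disk.write(0, b"v1")
disk.snapshot("s1")
disk.write(0, b"v2")           # redirected; the snapshot's block stays untouched
before_rollback = disk.read(0)
disk.rollback("s1")
after_rollback = disk.read(0)
print(before_rollback, after_rollback)
```

Because old blocks are never overwritten, a snapshot is just a saved index and a rollback is just an index swap; no data has to be copied back.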

image

In tests, after a cloud disk had created multiple instantly available snapshots and after a rollback was initiated, the cloud disk's read performance was essentially unchanged. By contrast, after a competitor's cloud disk retained multiple local snapshots, its IO read performance showed varying degrees of latency jitter.

image

Consistency group snapshot

Container environments and ECS instances need to protect stateful applications that span multiple disks. Single-disk snapshots fall short here in two ways. Stateful applications that use LVM volumes, Windows dynamic disks, or file systems spanning several cloud disks as persistent storage end up with incorrect backups if each cloud disk is snapshotted on its own. And database applications that balance performance against data safety place the write-ahead log (WAL) and the data files on different storage devices, so the system cannot be backed up and protected for disaster recovery on a regular schedule with single-disk snapshots.

image

Beyond stateful applications deployed in pods under K8S or on a single ECS instance, cloud environments also host distributed application architectures and high-availability clusters, such as Windows Failover Cluster, active/standby application servers, and Oracle RAC on shared storage. These distributed architectures likewise require consistent data protection across cloud disks and across nodes.

image

Cloud storage backends usually use a distributed storage architecture. The lack of a global logical clock in a distributed environment makes it difficult to implement multi-disk consistency group snapshots for a single ECS instance, across ECS instances, for a single pod in K8S, and across nodes. Minimizing the impact of snapshots on IO performance is even more challenging. The industry's approaches to multi-disk crash-consistent snapshots fall into two main categories:

  • Block write IO while the snapshot is taken, so that the disks are crash consistent as of a single point in time.
  • Use a logical-clock sequencing algorithm, which depends on the distributed storage implementation and is difficult to implement.

Consistency group snapshots take the second approach, aiming for snapshots that do not degrade IO performance and have minimal impact on applications.

The principle: an IO-based sequencing algorithm is used, so creating a snapshot does not block write IO. Many users worry that creating snapshots will affect IO performance and therefore only take snapshots during off-peak periods. Our optimized multi-disk consistency group snapshot algorithm overturns that impression. Relying on a write-order-preserving mechanism, it follows the order in which write IO reaches the underlying storage and applies an IO marking and sequencing process. The set of IO data that belongs in the snapshot is determined from the time the snapshot completes and the IO sequence. Compared with the traditional approach, the sequencing process does not block writes. Compared with traditional copy-on-write (COW), snapshot generation uses redirect-on-write (ROW), and the background generation of data references does not touch the IO path. This minimizes the impact of snapshots on IO performance and leaves IO performance intact in database read and write scenarios.
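
The general idea of cutting a multi-disk snapshot by IO order, without pausing writers, can be sketched as follows. This is a deliberately simplified model: a single in-process counter stands in for the ordering mechanism, which is an assumption of the example. The article does not describe how the real service orders IO across storage nodes, and the `ConsistencyGroup` class is hypothetical.

```python
from typing import Dict, List, Tuple


class ConsistencyGroup:
    """Toy model: every write is marked with a sequence number; a snapshot is a cut in that order."""

    def __init__(self, disk_names: List[str]) -> None:
        self._seq = 0
        # Per-disk log of (sequence number, block, data), in submission order.
        self._logs: Dict[str, List[Tuple[int, int, bytes]]] = {d: [] for d in disk_names}

    def write(self, disk: str, block: int, data: bytes) -> int:
        self._seq += 1  # mark the IO with its position in the global order
        self._logs[disk].append((self._seq, block, data))
        return self._seq

    def mark_snapshot(self) -> int:
        """Record the cut point; writers are never paused."""
        return self._seq

    def view(self, cut: int) -> Dict[str, Dict[int, bytes]]:
        """Materialize every disk as of the cut: only writes with seq <= cut are included."""
        result: Dict[str, Dict[int, bytes]] = {}
        for disk, log in self._logs.items():
            blocks: Dict[int, bytes] = {}
            for seq, block, data in log:
                if seq <= cut:
                    blocks[block] = data
            result[disk] = blocks
        return result


group = ConsistencyGroup(["log", "data"])
group.write("log", 0, b"begin tx1")
group.write("data", 0, b"row1")
cut = group.mark_snapshot()          # snapshot requested here
group.write("log", 1, b"commit tx1")  # later writes proceed without waiting
group.write("data", 0, b"row1-updated")

snapshot_view = group.view(cut)
print(snapshot_view)
```

Because the cut respects write order across both disks, the snapshot can never contain a later write without the earlier writes it followed, which is the property crash recovery of a WAL-based database relies on.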

image

For database applications, we tested the IO impact during snapshot creation in a high-IOPS database scenario: 2 disks, 2 clients, 4 TB capacity, random write, iodepth=16, jobs=1, and a 16 KB write block size. During snapshot creation, the impact on IO performance for Competitor 1 and Competitor 2 was roughly 1 to 3 times higher.
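
The workload parameters quoted above map naturally onto fio options. The sketch below assembles one client's command line; it is not the authors' test script. The device path, the `libaio` engine, direct IO, and the runtime are assumptions added for the example, and the command is only built and printed, never executed.

```python
from typing import List


def fio_randwrite_args(device: str, runtime_s: int = 60) -> List[str]:
    """Build a fio command line for the random-write workload described in the text."""
    return [
        "fio",
        "--name=snapshot-io-impact",
        f"--filename={device}",
        "--rw=randwrite",      # random write (from the article)
        "--bs=16k",            # 16 KB write block size (from the article)
        "--iodepth=16",        # from the article
        "--numjobs=1",         # from the article
        "--ioengine=libaio",   # assumption: Linux asynchronous IO engine
        "--direct=1",          # assumption: bypass the page cache
        "--time_based",        # assumption: run for a fixed duration
        f"--runtime={runtime_s}",
    ]


# Hypothetical data disk. Writing to a raw device destroys its contents.
args = fio_randwrite_args("/dev/vdb")
print(" ".join(args))
```

Running the same job on each client while a snapshot is being created, and comparing latency with a baseline run, is one way to reproduce this kind of measurement.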

image

Application consistent snapshot

ESSD cloud disk snapshots offer two main consistency types: crash consistency and application consistency. Crash consistency requires the file system and applications to have crash recovery capabilities; it offers a low recovery point objective (RPO) and low business impact. However, in the following scenarios it cannot deliver highly reliable backups or a recovery time objective (RTO) measured in seconds:

  • Atomicity defect risk: file systems and database applications find transaction atomicity hard to get right, and implementations may have defects. The paper "All File Systems Are Not Created Equal", published at a top USENIX systems conference, shows that the atomicity guarantees of applications and the kernel may have implementation flaws.
  • Data loss risk: mainstream file systems run in a performance-first mode by default, so crash-consistent backups risk losing data. The default data mode of the ext4 file system on Linux is ordered mode, and data can be lost during file system check and repair. Database applications configured for performance first likewise risk losing business data.
  • Long generation time and large impact: traditional file-level physical backups and backup agents rely on logical volume snapshots, which take a long time to generate and have a large system impact. The backup agent needs a kernel driver, which brings poor compatibility and high maintenance cost, and the file backup process has to read data, which consumes system CPU and IO resources. Application-consistent snapshots interact with applications only at the consistency point, with no incremental data generation and no backup read or write operations.

The principle: compared with traditional backup methods, cloud-native, agentless application-consistent snapshots spare customers the costs that traditional backup brings: resource consumption, release complexity, software compatibility, kernel development, and software maintenance. A combination of cross-platform plug-ins and proprietary consistency components, built on file system kernel mechanisms and the VSS mechanism on Windows, quiesces IO and application transactions during the snapshot and meets the data consistency requirements of enterprise applications for storage snapshots. The creation protocol automatically resumes IO based on how long IO has been affected. The resulting consistency type depends on the outcome of the protocol commit and on the application state. The path from the upper-layer application to the underlying storage and the performance of the consistency components have been optimized, reducing the duration of IO impact to seconds. Depending on business requirements, file-system-consistent snapshots can be created at intervals of seconds and application-consistent snapshots at intervals of minutes.
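
The shape of such a quiesce, snapshot, resume protocol can be sketched as below. Everything here is an assumption made for illustration: the callback names, the three-way result, and the rule that a freeze lasting past the deadline downgrades the result (on the premise that a guest-side watchdog would already have resumed IO). It is not the service's actual protocol.

```python
import time
from enum import Enum
from typing import Callable


class Consistency(Enum):
    APPLICATION = "application-consistent"
    FILE_SYSTEM = "file-system-consistent"
    CRASH = "crash-consistent"


def create_snapshot(
    quiesce_app: Callable[[], bool],    # hypothetical: flush application transactions
    freeze_fs: Callable[[], bool],      # hypothetical: freeze file system writes
    take_snapshot: Callable[[], None],  # hypothetical: storage-side snapshot call
    thaw: Callable[[], None],           # hypothetical: resume IO
    max_freeze_s: float,
) -> Consistency:
    app_ok = quiesce_app()
    fs_ok = freeze_fs()
    started = time.monotonic()
    try:
        take_snapshot()
    finally:
        thaw()  # IO is always resumed, even if the snapshot call fails
    within_deadline = (time.monotonic() - started) <= max_freeze_s
    # Assumption: past the deadline a watchdog would already have resumed IO,
    # so the snapshot can only be trusted as crash-consistent.
    if fs_ok and within_deadline:
        return Consistency.APPLICATION if app_ok else Consistency.FILE_SYSTEM
    return Consistency.CRASH


events = []
result = create_snapshot(
    quiesce_app=lambda: True,
    freeze_fs=lambda: True,
    take_snapshot=lambda: events.append("snapshot"),
    thaw=lambda: events.append("thaw"),
    max_freeze_s=5.0,
)
print(result.value, events)
```

The point of the design is that the consistency label is decided by what actually happened during the freeze window, and that resuming IO never depends on the snapshot succeeding.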

image

From crash consistency to application consistency, and from single-disk snapshots to multi-disk consistency group snapshots, the service covers all the snapshot types offered by public cloud block storage in the industry. In terms of security risk and extensibility of application support, native agentless snapshots have several advantages over competitors' implementations: no resident services, no exposure of public IP addresses or open ports, role-based security authorization, no additional kernel driver, and dynamic discovery of logical volumes and enterprise applications. Because they are based on ESSD cloud disk storage snapshots, there is no backup agent, no kernel driver to maintain, and no data read or transferred inside the virtual machine.

image

In tests of actual snapshot creation time and IO impact duration across major cloud vendors at home and abroad, SQL Server database applications on ESSD system disks and data disks achieved write IO blocking of only seconds and snapshot intervals of minutes, and application-consistent snapshot creation time was 2 to 3 times shorter than competitors'. Application-consistent whole-machine recovery also avoids the log replay needed when recovering from a crash-consistent snapshot, which speeds up database startup.

image

Industry feature comparison

Compared with the snapshot features of other public cloud vendors, Alibaba Cloud is currently the only vendor whose cloud disks (ESSD) fully support both instant snapshot availability and consistency group snapshots, meeting the snapshot RTO and RPO requirements of data protection for core enterprise applications on the cloud.

image

image

Future outlook

image

Data protection is a precaution, not a repair. With the rapid development of cloud-native technology, and container technology in particular, enterprise users have ever higher requirements for the recovery point objective (RPO) and the recovery time objective (RTO). In the future, we will continue to build out ESSD cloud disk capabilities, such as high-density snapshots, continuous data protection, and application consistency protection across multiple ECS instances, and we will keep delivering "light", "fast", and "elastic" snapshot qualities. The aim is to lower the RTO and RPO of enterprise data protection and to offer more advanced native snapshot features in support of enterprise data protection.

Original work: Alibaba Cloud Storage Fan Jun

Other articles in this series:

[ESSD Technical Interpretation-General Chapter] Enterprise-level storage on the cloud-opening a new dimension of storage and promoting user core business innovation https://developer.aliyun.com/article/793534?spm=a2c6h.13148508.0.0.73b34f0eS1PElF

