Introduction: The NVMe cloud disk combines the industry's most advanced software and hardware technologies. It is the first offering in the cloud storage market to deliver the NVMe protocol, shared access, and IO Fencing at the same time. Built on top of ESSD, it achieves high reliability, high availability, and high performance, and it implements rich enterprise features based on the NVMe protocol, such as multiple mounts, IO Fencing, encryption, offline expansion, native snapshots, and asynchronous replication. This article reviews the evolution of SAN and NVMe on the cloud and offers a vision for the future.
How is 7x24 high availability achieved?
In the real world, single points of failure are the norm, and ensuring business continuity in the face of failure is the core capability of a high-availability system. In critical applications such as finance, insurance, and government services, how do we keep the business running 24/7? Generally, a business system consists of compute, network, and storage. On the cloud, network multi-pathing and distributed storage already provide stability and high availability, but to achieve high availability across every link of the business, the single points of failure on the compute and application side must also be addressed. Take a typical database as an example: users cannot accept the business stopping because of a single point of failure. So when an instance becomes unavailable due to a power outage, a crash, or a hardware fault, how can the business be restored quickly?
Different scenarios call for different solutions. MySQL usually builds a primary/standby architecture to achieve high availability: when the primary database fails, the system switches to the standby database to keep serving requests. But after the switch, how is data consistency between the primary and the standby guaranteed? Depending on the business's tolerance for data loss, MySQL typically uses synchronous or asynchronous replication, which introduces additional problems: data loss in some scenarios, performance degradation caused by synchronous replication, the need for a new set of hardware and a full data copy when scaling out, and long switchover times that hurt business continuity. Clearly, building a highly available system makes the architecture complex, and it is hard to balance availability, reliability, scalability, cost, and performance. Is there a more advanced solution that lets us have our cake and eat it too? The answer is: yes!
Figure 1: High-availability architecture of the database
With shared storage, different database instances can share the same data, so high availability can be achieved by quickly switching compute instances (Figure 1). Oracle RAC, AWS Aurora, and Alibaba Cloud PolarDB are representative examples. The key here is shared storage. Traditional SAN is expensive, painful to scale up or down, and its controller head easily becomes a bottleneck; the high barrier to entry is not user-friendly. Is there a better, faster, and more economical form of shared storage that solves these pain points? The NVMe cloud disk and its sharing feature recently launched by Alibaba Cloud fully meet these demands, and we will focus on them next. Before that, here is a question for readers: after the instance switch, if the original primary is still writing data, how do we guarantee data correctness? We will keep you in suspense for now; readers can think about it first.
Figure 2: Data correctness issues during primary/standby switchover
The wheel of history: SAN and NVMe on the cloud
We have entered the digital economy era in which "data is the new oil." The rapid development of cloud computing, artificial intelligence, the Internet of Things, 5G, and other technologies has driven explosive data growth. According to IDC's 2020 report, the global datasphere keeps growing year over year and will reach 175 ZB by 2025, with data concentrated mainly in public clouds and enterprise data centers. This rapid growth provides new momentum and new requirements for storage, so let us recall how block storage has evolved step by step.
Figure 3: Evolution of block storage forms
DAS: Storage devices attach directly to the host (over SCSI, SAS, FC, and similar protocols). The system is simple, easy to configure and manage, and low cost, but because storage resources cannot be fully utilized or shared, centralized management and maintenance are difficult.
SAN: Storage arrays connect to business hosts over a dedicated network, solving unified management and data sharing and delivering high-performance, low-latency data access. However, SAN storage is expensive, complex to operate and maintain, and poor in scalability, which raises the barrier to entry for users.
All flash: The revolution in underlying storage media and falling costs marked the arrival of the all-flash era. Since then, the storage bottleneck has shifted to the software stack, forcing large-scale software changes and driving the rapid development of user-space protocols, software-hardware co-design, RDMA, and other technologies, which together brought a leap in storage performance.
Cloud disk: As cloud computing took off, storage moved to the cloud. Cloud disks have inherent advantages: elasticity, flexibility, ease of use, easy expansion, high reliability, large capacity, low cost, and zero operations burden. They have become a solid storage foundation for digital transformation.
Cloud SAN: Cloud-based SAN answers the call of the times. It inherits the many advantages of cloud disks while providing traditional SAN capabilities, including shared storage, data protection, synchronous/asynchronous replication, and ultra-fast snapshots, and it will surely continue to shine in the enterprise storage market.
Meanwhile, in the evolution of storage protocols, NVMe has become the darling of the new era.
Figure 4: The evolution of storage protocols
SCSI/SATA: In the early days of storage, hard disks were mostly slow devices, and data traveled through the SCSI layer and the SATA bus. Performance was limited by slow media such as mechanical hard disks, which masked the shortcomings of the single-channel SATA bus and the SCSI software layer.
VirtIO-BLK/VirtIO-SCSI: With the rapid development of virtualization and cloud computing, VirtIO-BLK/VirtIO-SCSI gradually became the mainstream storage protocols for the cloud, making the use of storage resources more flexible, agile, secure, and scalable.
NVMe/NVMe-oF: The development and popularization of flash memory drove a new round of storage technology revolution. Once storage media were no longer the performance barrier, the software stack became the biggest bottleneck, giving birth to high-performance, lightweight technologies such as NVMe/NVMe-oF, DPDK/SPDK, and user-space networking. The NVMe protocol family combines high performance, advanced features, and strong scalability, and will surely lead a new era of cloud computing.
In the foreseeable future, SAN and NVMe on the cloud are clearly the way forward; this is the general trend.
NVMe in the New Era of Cloud Disk
The rapid development and popularization of flash memory shifted the performance bottleneck to the software side, and growing demands on storage performance and features pushed NVMe onto the stage of history. NVMe is a data access protocol designed specifically for high-performance devices. Compared with the traditional SCSI protocol it is simpler and more lightweight, and with multi-queue technology it greatly improves storage performance. NVMe also provides rich storage features. Since the birth of the NVMe standard in 2011, continuous improvement of the protocol has standardized many advanced capabilities, including multiple namespaces, multi-path, full-link data protection (T10 DIF), the Persistent Reservation permission-control protocol, and atomic writes. The new storage features it defines will continue to help users create value.
Figure 5: Alibaba Cloud NVMe cloud disk
The high performance and rich features of NVMe provide a solid foundation for enterprise storage, and together with the protocol's own scalability and growth, they have become the core driving force behind the evolution of NVMe cloud disks. The NVMe cloud disk is built on ESSD: it inherits ESSD's high reliability, high availability, high performance, and atomic-write capability, as well as ESSD's native snapshots, cross-region disaster recovery, encryption, and second-level performance scaling. The combination of ESSD and NVMe features effectively meets the needs of enterprise applications and lets most NVMe- and SCSI-based services move to the cloud seamlessly. The shared storage technology described in this article is based on the NVMe Persistent Reservation standard. As one of the extended capabilities of NVMe cloud disks, its multiple mounts and IO Fencing features can significantly reduce storage costs while improving business flexibility and data reliability. It is widely applicable in distributed business scenarios, especially high-availability database systems such as Oracle RAC and SAP HANA.
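To make the Persistent Reservation discussion below concrete, here is a minimal sketch (not Alibaba Cloud code) that asks an NVMe device whether it advertises reservation support, using the standard Identify Controller admin command through the Linux NVMe passthrough ioctl. The device path /dev/nvme1n1 is a placeholder for whatever namespace the attached cloud disk exposes in the guest, and running it typically requires root.

```c
/* Sketch only: check the ONCS "Reservations" bit (bit 5) in the NVMe
 * Identify Controller data. Assumes the cloud disk shows up as /dev/nvme1n1. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    uint8_t id[4096];                       /* Identify Controller data is 4 KiB */
    struct nvme_admin_cmd cmd;

    int fd = open("/dev/nvme1n1", O_RDONLY);  /* placeholder device path */
    if (fd < 0) { perror("open"); return 1; }

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x06;                    /* Identify */
    cmd.cdw10    = 1;                       /* CNS=1: Identify Controller */
    cmd.addr     = (uint64_t)(uintptr_t)id;
    cmd.data_len = sizeof(id);

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("identify"); return 1; }

    /* ONCS (Optional NVM Command Support) lives at bytes 520-521. */
    uint16_t oncs = (uint16_t)id[520] | ((uint16_t)id[521] << 8);
    printf("Persistent Reservations supported: %s\n",
           (oncs & (1u << 5)) ? "yes" : "no");

    close(fd);
    return 0;
}
```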
Enterprise storage weapon: shared storage
As mentioned earlier, shared storage can effectively solve the database high-availability problem. The capabilities it mainly relies on are multiple mounts and IO Fencing. Taking databases as an example, we describe below how they work.
The key to business high availability: multiple mounts
Multi-mount allows a cloud disk to be attached to multiple ECS instances at the same time (currently up to 16), and every instance can read and write the cloud disk (Figure 6). With multiple mounts, multiple nodes share the same data, which effectively reduces storage costs; when a single node fails, the business can quickly switch to a healthy node without copying any data, providing the basic capability for fast failure recovery. Highly available databases such as Oracle RAC and SAP HANA all rely on this feature. Note that shared storage provides consistency and recovery at the data layer; to achieve final business-level consistency, the application may need additional processing, such as database log replay.
Figure 6: Multi-instance mount
In general, a single-machine file system is not suitable for multi-mount use. To speed up file access, file systems such as ext4 cache data and metadata, so file modifications cannot be propagated to other nodes in time, leaving the nodes with inconsistent data. Unsynchronized metadata can also cause nodes to conflict over disk space allocation and thus corrupt data. Therefore, multi-mount usually has to be paired with a cluster file system; common ones include OCFS2, GFS2, GPFS, Veritas CFS, and Oracle ACFS, and Alibaba Cloud DBFS and PolarFS also provide this capability.
With multiple mounts, can we sit back and relax? Not quite. Multi-mount is not a panacea; it has a blind spot it cannot solve on its own: permission management. Applications built on multiple mounts generally rely on a cluster manager, such as Linux Pacemaker, to manage permissions, but in some scenarios permission management fails and causes serious problems. Recall the question raised at the beginning of this article: in a high-availability architecture, the primary instance switches to the standby after a failure. If the primary is in a "zombie" state (for example, due to a network partition or hardware fault), it may wrongly believe it still holds write permission and write dirty data alongside the standby instance. How do we avoid this risk? Now it is IO Fencing's turn.
Data correctness guarantee - IO Fencing
One way to prevent dirty writes is to terminate the original instance's in-flight requests and reject any new ones, switching instances only after ensuring that no stale data can still be written. Following this idea, the traditional solution is STONITH (shoot the other node in the head), which remotely reboots the failed machine to stop old data from reaching the disk. This solution has two problems. First, the reboot takes too long and the switchover is slow, usually stopping the service for tens of seconds to minutes. More seriously, because the IO path on the cloud is long and involves many components, a component failure on the compute instance (hardware, network, and so on) may make the IO unrecoverable within a short time, so data correctness cannot be guaranteed 100%.
To solve this problem at the root, NVMe standardizes the Persistent Reservation (PR) capability, which defines permission-configuration rules for NVMe cloud disks and allows the permissions of the cloud disk and its mounting nodes to be changed flexibly. In this scenario, after the primary fails, the standby first sends a PR command that revokes the primary's write permission and rejects all of the primary's in-flight requests; the standby can then update data without risk (Figure 7). IO Fencing usually helps the application complete failover at the millisecond level, greatly shortening recovery time; the switchover is so smooth that upper-layer applications are barely aware of it, a qualitative leap compared with STONITH. Next, we look further into the permission management technology behind IO Fencing.
Figure 7: IO Fencing applied during failover
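As a concrete illustration of the fencing step just described, the following minimal sketch (an assumption-laden example, not Alibaba Cloud's implementation) issues an NVMe Reservation Acquire command with the "Preempt and Abort" action through the Linux passthrough ioctl. It assumes the shared disk appears in the guest as /dev/nvme1n1 with namespace ID 1, the standby has already registered reservation key 0xBBBB, and the failed primary registered 0xAAAA.

```c
/* Sketch: the standby fences the failed primary by preempting its
 * registration/reservation before taking over writes. Requires root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    /* Reservation Acquire data: CRKEY then PRKEY, little-endian 64-bit each. */
    uint64_t keys[2];
    struct nvme_passthru_cmd cmd;

    int fd = open("/dev/nvme1n1", O_RDWR);   /* placeholder device path */
    if (fd < 0) { perror("open"); return 1; }

    keys[0] = 0xBBBBULL;                     /* CRKEY: standby's own registered key */
    keys[1] = 0xAAAAULL;                     /* PRKEY: key of the primary to preempt */

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x11;                     /* Reservation Acquire */
    cmd.nsid     = 1;                        /* assumed namespace id */
    cmd.addr     = (uint64_t)(uintptr_t)keys;
    cmd.data_len = sizeof(keys);
    /* cdw10: RACQA=2 (Preempt and Abort) in bits 2:0,
     *        RTYPE=1 (Write Exclusive) in bits 15:8. */
    cmd.cdw10    = (1u << 8) | 2u;

    if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0) {
        perror("reservation acquire");       /* fencing failed: do not switch over */
        return 1;
    }
    printf("primary fenced: its write access is revoked, standby may take over\n");
    close(fd);
    return 0;
}
```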
The Swiss Army knife of permission management: Persistent Reservation
The NVMe Persistent Reservation (PR) protocol defines the permissions of the cloud disk and its clients; combined with multiple mounts, business switchover can be performed efficiently, safely, and smoothly. In the PR protocol a mounting node can have one of three identities: Holder (owner), Registrant, or Non-Registrant (visitor). As the names suggest, the Holder has full permissions on the cloud disk, a Registrant has partial permissions, and a Non-Registrant has read-only access. The cloud disk itself supports six reservation modes, enabling exclusive access, single-writer/multi-reader, and multi-writer configurations. By combining the reservation mode with the node identities, the permissions of each node can be managed flexibly (Table 1) to meet a rich variety of business scenarios. NVMe PR inherits all the capabilities of SCSI PR, so applications based on SCSI PR can run on NVMe shared cloud disks with only minor changes.
Table 1: NVMe Persistent Reservation permission table
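For illustration, the step that turns a node into a Registrant can be expressed as an NVMe Reservation Register command. The sketch below reuses the earlier assumptions (shared disk at /dev/nvme1n1, namespace ID 1) and a placeholder key 0xBBBB that each node would choose uniquely; it is not taken from any Alibaba Cloud SDK.

```c
/* Sketch: register a reservation key so this node becomes a Registrant
 * on the shared namespace. Requires root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    /* Reservation Register data: CRKEY (ignored for the Register action), NRKEY. */
    uint64_t keys[2] = { 0, 0xBBBBULL };
    struct nvme_passthru_cmd cmd;

    int fd = open("/dev/nvme1n1", O_RDWR);   /* placeholder device path */
    if (fd < 0) { perror("open"); return 1; }

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x0d;                     /* Reservation Register */
    cmd.nsid     = 1;                        /* assumed namespace id */
    cmd.addr     = (uint64_t)(uintptr_t)keys;
    cmd.data_len = sizeof(keys);
    cmd.cdw10    = 0;                        /* RREGA=0: register NRKEY as this host's key */

    if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0) { perror("reservation register"); return 1; }
    printf("registered: this node is now a Registrant on the shared disk\n");
    close(fd);
    return 0;
}
```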
Multiple mounts plus IO Fencing form a solid foundation for building highly available systems. In addition, NVMe shared disks provide single-writer/multi-reader capability and are widely used in read/write-splitting databases, machine-learning model training, streaming processing, and similar scenarios. Techniques such as image sharing, heartbeat detection, leader election by arbitration, and lock mechanisms can also be implemented easily on top of shared cloud disks.
Figure 8: Single-writer/multi-reader application scenarios for NVMe shared disks
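As a small illustration of how a cluster agent might observe reservation state for heartbeat or arbitration purposes, the sketch below (same placeholder device and namespace as above, not production code) reads the namespace's reservation status with an NVMe Reservation Report command. The generation counter it prints increases whenever registrations or reservations change, which is one simple signal a monitoring loop could watch.

```c
/* Sketch: dump the reservation report header (generation counter, reservation
 * type, number of registrants) from the shared namespace. Requires root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    uint8_t buf[4096];
    struct nvme_passthru_cmd cmd;

    int fd = open("/dev/nvme1n1", O_RDONLY); /* placeholder device path */
    if (fd < 0) { perror("open"); return 1; }

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x0e;                     /* Reservation Report */
    cmd.nsid     = 1;                        /* assumed namespace id */
    cmd.addr     = (uint64_t)(uintptr_t)buf;
    cmd.data_len = sizeof(buf);
    cmd.cdw10    = sizeof(buf) / 4 - 1;      /* NUMD: dwords to transfer, 0-based */

    if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0) { perror("reservation report"); return 1; }

    /* Report header: bytes 0-3 generation, byte 4 reservation type,
     * bytes 5-6 number of registered controllers. */
    uint32_t gen = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                   ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    uint16_t regctl = (uint16_t)buf[5] | ((uint16_t)buf[6] << 8);
    printf("generation=%u rtype=%u registrants=%u\n",
           gen, (unsigned)buf[4], (unsigned)regctl);
    close(fd);
    return 0;
}
```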
Secrets of NVMe Cloud Disk Technology
NVMe cloud disks are built on a compute-storage separation architecture: they rely on the Shenlong hardware platform for efficient NVMe virtualization and an ultra-fast IO path, use Pangu 2.0 storage as the base for high reliability, high availability, and high performance, and connect compute and storage through a user-space network protocol and RDMA. The NVMe cloud disk is the crystallization of full-stack high-performance and high-availability technology (Figure 9).
Figure 9: NVMe shared disk technology architecture
NVMe hardware virtualization: builds NVMe hardware virtualization on the Shenlong MOC platform, using the Send Queue (SQ) and Completion Queue (CQ) for efficient interaction of the data and control flows. The simple, efficient design of the NVMe protocol, combined with hardware offloading, reduces NVMe virtualization latency by 30%.
Ultra-fast IO channel: an ultra-fast IO channel is realized, effectively shortening the IO path and delivering ultimate performance.
User-space protocol: NVMe uses the new-generation Solar-RDMA user-space network protocol, combined with the self-developed Leap-CC congestion control, to achieve reliable data transmission and reduce long-tail network latency. Based on a 25G network, jumbo frames enable efficient transmission of large packets, and fully separating the data plane from the control plane simplifies the network software stack and improves performance. Network multi-path technology supports millisecond-level recovery from link failures.
Highly available management and control: Pangu 2.0 distributed, highly available storage implements the NVMe control center, so NVMe control commands no longer pass through a control node, giving control operations reliability and availability close to that of the IO path and helping users achieve millisecond-level business switchover. Based on the NVMe control center, precise flow control is achieved between multiple clients and multiple servers, with sub-second distributed IO flow control. IO Fencing consistency across multiple nodes is realized on the distributed system: permission updates are applied in phases so that the permission state stays consistent across cloud disk partitions, effectively solving the split-brain problem of partition permissions.
Large-IO atomicity: Built on the distributed system, atomic writes for large IOs are implemented end to end across compute, network, and storage. As long as an IO does not cross a 128 KiB boundary, the data is guaranteed never to be only partially written to disk. This matters greatly for applications that rely on atomic writes, such as databases: it can effectively optimize the database double-write process and thereby greatly improve database write performance.
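A quick way to see what the 128 KiB rule above means in practice: the hypothetical helper below (not part of any SDK) checks whether a single write stays inside one 128 KiB-aligned window and is therefore covered by the atomic-write guarantee described here.

```c
/* Sketch: per the description above, a write is atomic as long as it does not
 * cross a 128 KiB boundary. This hypothetical helper checks whether a given
 * offset/length pair stays inside one 128 KiB-aligned window. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ATOMIC_BOUNDARY (128ULL * 1024)   /* 128 KiB, from the text above */

static bool write_is_atomic(uint64_t offset, uint64_t len)
{
    if (len == 0 || len > ATOMIC_BOUNDARY)
        return false;
    /* Atomic iff the first and last byte fall in the same 128 KiB window. */
    return (offset / ATOMIC_BOUNDARY) == ((offset + len - 1) / ATOMIC_BOUNDARY);
}

int main(void)
{
    /* A 16 KiB database page written at offset 112 KiB stays inside one window... */
    printf("%d\n", write_is_atomic(112 * 1024, 16 * 1024));   /* 1: atomic */
    /* ...but at offset 120 KiB it straddles the 128 KiB boundary. */
    printf("%d\n", write_is_atomic(120 * 1024, 16 * 1024));   /* 0: not guaranteed */
    return 0;
}
```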
Current status and future prospects
As we have seen, today's NVMe cloud disk combines the industry's most advanced software and hardware technologies. It is the first in the cloud storage market to deliver the NVMe protocol, shared access, and IO Fencing at the same time. Built on top of ESSD, it achieves high reliability, high availability, and high performance, and it implements rich enterprise features based on the NVMe protocol, such as multiple mounts, IO Fencing, encryption, offline expansion, native snapshots, and asynchronous replication.
Figure 10: The world's first NVMe + shared access + IO Fencing technology integration
NVMe cloud disks and NVMe shared disks are currently in invitational testing and have received preliminary validation from Oracle RAC, SAP HANA, and our internal database teams. Next, we will expand the scope of public testing and commercialization. In the foreseeable future, we will keep evolving around NVMe cloud disks to better support advanced features such as online expansion, full-link data protection (T10 DIF), and multiple namespaces per cloud disk, building out complete cloud SAN capabilities. Stay tuned!
<span class = "Lake-fontSize-12 is"> Vendor. 1 </ span> | <span class = "Lake-fontSize-12 is"> Vendor 2 </ span> | <span class = "Lake-fontSize-12 is"> Vendor. 3 </ span> | <span class = "Lake-fontSize-12 is"> Ali cloud </ span> | |
<span> cloud disk protocol </ span> | the SCSI | the SCSI | the NVMe | the NVMe |
<span> multiple mount </ span> | ✓ | ✓ | ✓ | ✓ |
<span >IO Fencing</span> | ✓ | ✓ | × | ✓ |
data persistence | N / A | . 9 th. 9 | . 5 th. 9 | . 9 th. 9 |
the IO delay | > 300 US | 100 ~ 200 is US | 100 ~ 200 is US | <100 US |
cloud disc maximum IOPS | 160K | 128K | 256K | 1000K |
cloud disc maximum throughput | 2GB / S | 1GB/s | 4GB/s | 4GB/s |