Overview: The NVMe cloud disk combines the industry’s most advanced software and hardware technologies and is the first cloud storage product to bring together the NVMe protocol, shared access, and IO Fencing. Built on ESSD, it provides high reliability, high availability, and high performance, and it implements a range of enterprise features on top of NVMe, such as multiple mount, I/O Fencing, encryption, offline capacity expansion, native snapshots, and asynchronous replication. This article details the evolution of SAN on the cloud and NVMe, and offers a vision for the future.

How did 7×24 high availability come about?

In the real world, single points of failure are common, and ensuring business continuity is the core capability of a high availability system. How can 7×24 availability be guaranteed for critical applications such as finance, insurance, and government services? Generally speaking, a service system consists of computing, network, and storage. On the cloud, network multipathing and distributed storage already provide stable high availability (HA). However, to achieve full-link HA, single points of failure on the computing and service sides must also be resolved. Take a common database as an example: it is unacceptable for a single point of failure to stop the service. How can the service be quickly recovered when an instance becomes unavailable due to power failure, downtime, or hardware faults?

The solution varies by scenario. A MySQL database typically adopts a primary/standby architecture for high service availability: when the primary database fails, the service switches to the standby database and continues running. But how is data consistency between the primary and standby guaranteed after the switchover? Depending on how much data loss the business can tolerate, MySQL usually replicates data synchronously or asynchronously, which introduces new problems: asynchronous replication can lose data in some scenarios, synchronous replication hurts system performance, scaling out requires provisioning new machines and performing full data copies, and a long primary/standby switchover affects business continuity. As we can see, building a high availability system makes the architecture complex and forces difficult trade-offs among availability, reliability, scalability, cost, and performance. Is there a more advanced solution that offers the best of both worlds? The answer is: yes!

Figure 1: High availability architecture for the database

Through shared storage, different database instances can share the same data, so high availability is achieved by quickly switching compute instances (Figure 1). Oracle RAC, AWS Aurora, and Alibaba Cloud PolarDB all take this approach. The key is shared storage. Traditional SAN is expensive, scaling it up and down is cumbersome, the storage controller easily becomes a bottleneck, and its high barrier to entry is unfriendly to users. Is there a better, faster, and cheaper form of shared storage that solves these pain points? Alibaba Cloud’s recently launched NVMe cloud disk with shared access fully meets these demands, and we focus on it below. One more question to keep in mind: after the instance switchover, if the original primary is still writing data, how is data correctness guaranteed? We leave this in suspense for the reader to think about first.

Figure 2: Data correctness in the master-slave switchover scenario

Wheels of history: SAN on the cloud and NVMe

We have stepped into the digital economy era in which data is the new oil. The rapid development of cloud computing, artificial intelligence, the Internet of Things, 5G, and other technologies has driven explosive data growth. According to IDC’s 2020 report, the global data volume is increasing year by year and will reach 175 ZB by 2025, concentrated mainly in public clouds and enterprise data centers. This rapid growth provides new impetus and new requirements for storage. Let us recall how block storage has evolved, step by step.

Figure 3: Block storage evolution

DAS: A storage device is directly connected to a host using the SCSI, SAS, or FC protocol. DAS is simple, easy to configure and manage, and low cost. However, storage resources cannot be fully utilized or shared, and centralized management and maintenance are difficult.

SAN: Storage arrays are connected to service hosts through a dedicated network, which solves problems such as unified management and data sharing and provides high-performance, low-latency data access. However, the high cost, complex O&M, and poor scalability of SAN storage devices raise the barrier to entry for users.

All-flash: The revolution in underlying storage media and falling costs mark the arrival of the all-flash era. Since then, the performance bottleneck has shifted to the software stack, forcing large-scale software reform, promoting the rapid development of user-mode protocols, software-hardware integration, RDMA, and other technologies, and bringing a leap in storage performance.

Cloud disk: With the rapid development of cloud computing, storage has moved to the cloud. Cloud disks have inherent advantages: elasticity, agility, ease of use, easy expansion, high reliability, large capacity, low cost, and freedom from operations and maintenance. They have become a solid storage foundation for digital transformation.

Cloud SAN: To support all aspects of storage operations and replace traditional SAN storage, cloud SAN has arrived. It inherits the advantages of the cloud disk while providing the capabilities of traditional SAN storage, including shared storage, data protection, synchronous/asynchronous replication, and fast snapshots, and it will continue to expand cloud storage into the enterprise market.

Meanwhile, in the evolution of storage protocols, NVMe is emerging as the darling of the new era.

Figure 4: Evolution of the storage protocol

SCSI/SATA: In the early days of storage, hard disks were mostly low-speed devices, and data was transmitted through the SCSI layer and the SATA bus. Slow media such as mechanical hard disks limited overall performance and masked the disadvantages of the single-queue SATA bus and the SCSI software layer.

Virtio-blk/Virtio-SCSI: With the rapid development of virtualization technology and cloud computing, virtio-blk/virtio-scsi gradually became the mainstream storage protocols of cloud computing, making storage resources more flexible, agile, secure, and scalable.

NVMe/NVMe-oF: The development and popularization of flash memory drove a new generation of storage technology. When storage media are no longer the performance bottleneck, the software stack becomes the biggest one, which spawned high-performance lightweight technologies such as NVMe/NVMe-oF, DPDK/SPDK, and user-mode networking. With high performance, advanced features, and high scalability, NVMe will usher in a new era of cloud computing.

Cloud SAN and NVMe are clearly the wave of the future.

NVMe in the new era of cloud disk

The rapid development and popularity of flash memory shifted the performance bottleneck to the software side, and growing demands on storage performance and functionality pushed NVMe onto the stage of history. NVMe is a data access protocol for high-performance devices. Compared with the traditional SCSI protocol, NVMe uses multi-queue technology to greatly improve storage performance, and it also offers a wealth of storage features. Since the inception of the NVMe standard in 2011, it has standardized many advanced functions, such as multiple namespaces, multipath, full-link data protection (T10-DIF), the Persistent Reservation permission control protocol, and atomic writes. The new storage features it defines will continue to help users create value.

Figure 5: Aliyun NVMe cloud disk

The high performance and rich features of NVMe provide a solid foundation for enterprise storage, and the scalability and growth of the protocol itself are the core driving force behind the evolution of NVMe cloud disks. NVMe cloud disks are based on ESSD and inherit ESSD’s high reliability, high availability, high performance, and atomic write capabilities, as well as enterprise features such as native snapshot data protection, cross-region disaster recovery (DR), encryption, and performance configuration in seconds. The combination of ESSD and NVMe features effectively meets enterprise-level application requirements and enables most NVMe- and SCSI-based services to move seamlessly into the cloud. The shared storage technology described in this article is based on the NVMe Persistent Reservation standard. As additional capabilities of the NVMe cloud disk, multiple mount and IO Fencing can greatly reduce storage costs and effectively improve service flexibility and data reliability. They are widely used in distributed service scenarios, especially in high availability database systems such as Oracle RAC and SAP HANA.

Enterprise storage: Shared storage

As mentioned earlier, shared storage can effectively solve the problem of database high availability. The main capabilities it relies on are multiple mount and IO Fencing. Using databases as an example, we describe how they work.

The key to high service availability – Multiple mount

Multiple mount allows a cloud disk to be attached simultaneously to multiple ECS instances (currently up to 16), all of which can read and write the cloud disk (Figure 6). It enables multiple nodes to share the same data, effectively reducing storage costs. When a single node fails, the service can be quickly switched over to a healthy node without any data copying; this is the building block for fast fault recovery, and high availability databases such as Oracle RAC and SAP HANA rely on it. Note that shared storage provides consistency and recovery capabilities at the data layer; to achieve consistency, services may still need additional processing, such as database log replay.

Figure 6: Multi-instance mount
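
To make this concrete, here is a hedged sketch of provisioning a shared NVMe cloud disk and attaching it to several instances with the Alibaba Cloud CLI from Python. The ECS action names (CreateDisk, AttachDisk) and the MultiAttach parameter are assumptions based on the public ECS API, and the region, zone, disk, and instance IDs are placeholders; check the current API reference before use.

```python
"""Hedged sketch: create a shared (multi-attach) NVMe cloud disk and attach it
to several ECS instances via the Alibaba Cloud CLI. The action names
(CreateDisk, AttachDisk) and the MultiAttach parameter are assumptions based on
the public ECS API; all IDs are placeholders."""
import subprocess

def run(cmd):
    # Print and execute one aliyun CLI invocation, raising on failure.
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# 1. Create one ESSD cloud disk that allows multiple attachments (assumed flag).
run(["aliyun", "ecs", "CreateDisk",
     "--RegionId", "cn-hangzhou", "--ZoneId", "cn-hangzhou-i",
     "--DiskCategory", "cloud_essd", "--Size", "100",
     "--MultiAttach", "Enabled"])

# 2. Attach the same disk to multiple instances (up to 16 are supported).
disk_id = "d-xxxxxxxxxxxx"                      # placeholder disk ID
for instance_id in ("i-instance-a", "i-instance-b"):
    run(["aliyun", "ecs", "AttachDisk",
         "--DiskId", disk_id, "--InstanceId", instance_id])
```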

Single-node file systems such as ext4 cache data and metadata to speed up file access. When several nodes mount the same disk, file modifications made on one node cannot be propagated to the others in time, so the nodes see inconsistent data; inconsistent metadata also causes the nodes to conflict over disk space allocation, which corrupts data. Therefore, multiple mount is usually used together with a clustered file system, such as OCFS2, GFS2, GPFS, Veritas CFS, or Oracle ACFS. Alibaba Cloud DBFS and PolarFS also have this capability.

With multiple mount, can we rest easy? Multiple mount is not a panacea; it has a blind spot it cannot solve by itself: permission management. Applications built on multiple mount usually rely on a cluster management system such as Linux Pacemaker to manage permissions, but in some scenarios permission control can fail and cause serious problems. Returning to the question raised earlier: in a high availability architecture, the primary instance fails over to the standby after an exception. If the failed primary is in a zombie state (for example, due to a network partition or a hardware fault), it may mistakenly believe it still has write access and write dirty data alongside the new primary. How can this risk be avoided? Now it is IO Fencing’s turn.

Data Correctness – I/O Fencing

The idea is to terminate the in-flight requests of the original instance and reject its new requests, and only switch instances after stale data can no longer be written. Based on this idea, the traditional solution is STONITH (Shoot The Other Node In The Head), which prevents stale data from reaching the disk by remotely restarting the failed machine. However, this solution has two problems. First, the restart takes too long and the service switchover is slow, usually stopping the service for tens of seconds to minutes. Worse, because the I/O path on the cloud is long and involves many components, faults in a compute instance’s components (such as hardware or network faults) can leave requests in flight for a short period even after the restart, so data correctness cannot be 100% guaranteed.

To fundamentally solve this problem, NVMe standardized the Persistent Reservation (PR) capability, which defines permission configuration rules for NVMe cloud disks and allows the permissions of the cloud disk and its attached nodes to be modified flexibly. In this scenario, after the primary fails, the standby first issues a PR command that revokes the primary’s write permission and rejects all of its in-flight requests; the standby can then update data without risk (Figure 7). IO Fencing usually helps applications complete a failover at the millisecond level, greatly shortening fault recovery time. The smooth switchover is essentially transparent to upper-layer applications, a qualitative leap over STONITH. Next, we take a closer look at the IO Fencing permission management technology.

Figure 7: IO Fencing application in failover
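
Below is a minimal sketch of the fencing step as the standby node might drive it with the open-source nvme-cli tool. The device path and reservation keys are placeholders, and the resv-register/resv-acquire options should be verified against the installed nvme-cli version; this illustrates the PR flow, not Alibaba Cloud’s internal implementation.

```python
"""Hedged sketch of PR-based fencing during a failover, driven from the standby
node with the open-source nvme-cli tool. The device path and reservation keys
are placeholders; verify the resv-register / resv-acquire options against the
installed nvme-cli version."""
import subprocess

DEV = "/dev/nvme1n1"       # the shared NVMe cloud disk as seen on the standby
STANDBY_KEY = "0xB"        # reservation key registered by the standby node
OLD_PRIMARY_KEY = "0xA"    # key previously registered by the failed primary

def nvme(subcmd, *opts):
    subprocess.run(["nvme", subcmd, DEV, *opts], check=True)

# 1. Register the standby's own reservation key (RREGA=0: register).
nvme("resv-register", "--nrkey", STANDBY_KEY, "--rrega", "0")

# 2. Preempt and abort the old primary's registration (RACQA=2), taking a
#    Write Exclusive reservation (RTYPE=1). This revokes the primary's write
#    permission and causes its in-flight writes to be rejected.
nvme("resv-acquire", "--crkey", STANDBY_KEY, "--prkey", OLD_PRIMARY_KEY,
     "--rtype", "1", "--racqa", "2")

# 3. The standby now holds the reservation and can safely replay logs and
#    start serving writes.
```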

The Swiss Army Knife of Rights Management — Persistent Reservation

The NVMe Persistent Reservation (PR) protocol defines permissions for the cloud disk and its clients. Combined with multiple mount, it allows services to be switched over efficiently, safely, and smoothly. In the PR protocol, an attached node takes one of three roles: Holder (owner), Registrant, or Non-Registrant (visitor). As the names suggest, the holder has full rights to the cloud disk, a registrant has partial rights, and a visitor has only read rights. In addition, the cloud disk provides six sharing modes to implement exclusive, one-writer/multi-reader, and multi-writer access. By configuring the sharing mode and the role of each node, you can flexibly manage node permissions (Table 1) and meet the requirements of various service scenarios. NVMe PR inherits all SCSI PR capabilities, so any application based on SCSI PR can run on NVMe shared cloud disks with minimal modification.

Table 1: NVMe Persistent Reservation Permission table
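
As a rough illustration of the semantics behind Table 1, the sketch below encodes the read/write rights of the three roles under the six NVMe reservation types, following the NVMe specification’s definitions; Table 1 and the specification remain the authoritative reference.

```python
"""Rough sketch of NVMe Persistent Reservation access semantics (cf. Table 1).
Roles: "holder", "registrant", "non-registrant"; types follow the NVMe spec,
which remains the authoritative reference."""

RTYPES = {
    1: "Write Exclusive",
    2: "Exclusive Access",
    3: "Write Exclusive - Registrants Only",
    4: "Exclusive Access - Registrants Only",
    5: "Write Exclusive - All Registrants",
    6: "Exclusive Access - All Registrants",
}

def can_write(rtype: int, role: str) -> bool:
    if role == "holder":
        return True                              # the holder can always write
    if rtype in (3, 4, 5, 6):
        return role == "registrant"              # registrants share write access
    return False                                 # types 1 and 2: holder only

def can_read(rtype: int, role: str) -> bool:
    if rtype in (1, 3, 5):
        return True                              # Write Exclusive types: anyone may read
    if rtype == 2:
        return role == "holder"                  # Exclusive Access: holder only
    return role in ("holder", "registrant")      # types 4 and 6: holder and registrants

if __name__ == "__main__":
    for rtype, name in RTYPES.items():
        for role in ("holder", "registrant", "non-registrant"):
            print(f"{name:40s} {role:15s} "
                  f"read={can_read(rtype, role)} write={can_write(rtype, role)}")
```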

Together, multiple mount and I/O Fencing provide a complete high availability foundation. NVMe shared disks also support one-writer/multi-reader access and are widely used in read/write-splitting databases, machine learning model training, and stream processing scenarios. In addition, technologies such as mirror sharing, heartbeat detection, quorum arbitration, and locking can be easily implemented on top of shared cloud disks.

Figure 8: NVMe shared disk one-writer/multi-reader application scenarios
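
As a simple illustration of how heartbeat detection can be built on a shared cloud disk, the sketch below has one node periodically write a timestamp at a fixed offset while a peer checks its freshness. The device path, offset, and timeout are arbitrary placeholders; a production implementation would use O_DIRECT with sector-aligned buffers and would not rely on synchronized clocks.

```python
"""Illustrative sketch: heartbeat detection on a shared NVMe cloud disk. One
node writes a timestamp into a reserved slot; a peer reads it and declares the
writer dead after a timeout. Device path, offset, and timeout are placeholders;
a real implementation would use O_DIRECT with sector-aligned buffers and would
not rely on synchronized clocks."""
import os
import struct
import time

DEV = "/dev/nvme1n1"      # shared disk visible on all nodes
HB_OFFSET = 64 * 1024     # reserved 512-byte slot for this node's heartbeat
HB_TIMEOUT = 3.0          # seconds without a fresh heartbeat before declaring the peer dead

def write_heartbeat():
    # Monitored node: periodically overwrite the slot with the current time.
    fd = os.open(DEV, os.O_WRONLY)
    try:
        buf = struct.pack("<d", time.time()).ljust(512, b"\0")
        os.pwrite(fd, buf, HB_OFFSET)
        os.fsync(fd)
    finally:
        os.close(fd)

def peer_alive() -> bool:
    # Monitoring node: read the slot and check how stale the timestamp is.
    fd = os.open(DEV, os.O_RDONLY)
    try:
        (ts,) = struct.unpack("<d", os.pread(fd, 8, HB_OFFSET))
    finally:
        os.close(fd)
    return (time.time() - ts) < HB_TIMEOUT
```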

NVMe Cloud disk technology revealed

The NVMe cloud disk is based on a compute-storage separation architecture. It relies on the DPCA hardware platform for efficient NVMe virtualization and a fast I/O path, uses Pangu 2.0 storage as the base for high reliability, high availability, and high performance, and interconnects compute and storage through a user-mode network protocol and RDMA. The NVMe cloud disk is the culmination of full-stack high-performance and high-availability technologies (Figure 9).

Figure 9: NVMe shared disk technology architecture

NVMe hardware virtualization: NVMe hardware virtualization is built on the DPCA MOC platform, and data flow and control flow interact efficiently through Send Queues (SQ) and Completion Queues (CQ). The simplicity and efficiency of the NVMe protocol, combined with hardware offload technology, reduce NVMe virtualization latency by 30%.

Fast I/O path: Built on the DPCA MOC hardware-software co-design, the fast I/O path effectively shortens the I/O path and delivers extreme performance.

User-mode protocol: NVMe uses the new-generation solar-RDMA user-mode network communication protocol, combined with the self-developed leap-CC congestion control, to achieve reliable data transmission and reduce network long-tail latency, and uses jumbo frames on the 25G network for efficient large-packet transmission. Separating the data plane from the control plane simplifies the network software stack and improves performance, and network multipathing supports millisecond-level recovery from link faults.

Control-plane high availability: Pangu 2.0 distributed high availability storage implements an NVMe control center. NVMe control commands do not pass through management nodes, achieving reliability and availability close to that of the I/O path and helping users switch services at the millisecond level. The NVMe control center implements precise flow control between multiple clients and multiple servers, achieving sub-second distributed flow control for I/O. In the distributed system, I/O Fencing consistency is maintained across nodes, and a two-phase update keeps permission state consistent across the cloud disk’s partitions, effectively preventing split-brain permission states between partitions.

Large I/O atomicity: On top of the distributed system, end-to-end atomic writes for large I/O are implemented across compute, network, and storage. As long as an I/O does not cross a 128K boundary, the data is guaranteed never to be partially written to disk. This is important for applications that rely on atomic writes, such as databases: for example, it can eliminate the database double-write process and thus greatly improve write performance.
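
A minimal sketch of how an application might check whether a write falls under this guarantee, assuming the 128K-boundary rule described above:

```python
"""Minimal sketch: decide whether a write is covered by the large-I/O atomic
write guarantee, assuming the 128K-boundary rule described above."""

ATOMIC_BOUNDARY = 128 * 1024  # 128K, per the guarantee described above

def is_atomic(offset: int, length: int) -> bool:
    # Atomic iff the first and last byte fall within the same 128K-aligned window.
    return offset // ATOMIC_BOUNDARY == (offset + length - 1) // ATOMIC_BOUNDARY

# A 16K database page written at a 16K-aligned offset never crosses a 128K
# boundary, so the double-write buffer can be skipped for it.
assert is_atomic(offset=3 * 16 * 1024, length=16 * 1024)
# A 16K write straddling the 128K boundary is not covered by the guarantee.
assert not is_atomic(offset=120 * 1024, length=16 * 1024)
```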

Current status and future prospects

As we have seen, the current NVMe cloud disk combines the industry’s most advanced software and hardware technologies. It is the first in the cloud storage market to bring together the NVMe protocol, shared access, and IO Fencing. Built on ESSD, it provides high reliability, high availability, and high performance, and it implements a range of enterprise features on top of NVMe, such as multiple mount, I/O Fencing, encryption, offline capacity expansion, native snapshots, and asynchronous replication.

Figure 10: The world’s first fusion of NVMe + Shared access + IO Fencing

NVMe cloud disks and NVMe shared disks are currently in beta and have passed initial validation with Oracle RAC, SAP HANA, and internal database teams; they will be extended to public beta and commercial availability. In the foreseeable future, we will continue to evolve the NVMe cloud disk to better support advanced features such as online capacity expansion, full-link data protection (T10-DIF), and multiple namespaces per cloud disk, building comprehensive on-cloud SAN capabilities. Stay tuned!

Table 2: Block storage overview by major cloud vendors

Author: Alibaba Cloud Storage Lun

This article is original content of Alibaba Cloud and may not be reproduced without permission.