GaussDB(for Redis) revealed: the most complete analysis of the Redis storage-compute separation architecture

Preface:

This article is based on a speech delivered by Wen Long Yu, Architect of Huawei Cloud NoSQL Database, at this year’s China System Architects Conference (SACC). The contents are as follows.

This talk is divided into four parts:

What is GaussDB(for Redis)?
Why choose storage-compute separation?
Design and implementation
Competitiveness summary

1. What is GaussDB(for Redis)?

1.1 What are the disadvantages of open source Redis?

To explain what GaussDB(for Redis) is (hereafter Gauss Redis), we need to start with the background. Open source Redis is a very good KV cache, but as businesses have grown rapidly and data volume, throughput, and business complexity have kept rising, it has exposed a number of problems:

1. AOF bloat problem

Open source Redis is positioned as a cache, but the AOF log was added to provide some persistence so that data can be recovered quickly after an outage. Unfortunately, the Redis design has no dump mechanism that consumes the AOF, so the log can only be kept in check by AOF rewrite, which periodically compacts old log entries. The rewrite requires a fork call, which can double memory usage and block request processing, among other problems.

2. Snapshot backup problem

As businesses depend more and more on Redis, data backup becomes very important. Redis is not an MVCC architecture, so backing up data would normally mean copying the in-memory data under a pessimistic lock. To avoid user-mode locking, Redis instead relies on copy-on-write: it calls fork and lets a child process copy the data. However, fork itself still takes locks inside the kernel, so it still causes significant jitter in business performance.
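
The copy-on-write snapshot idea can be seen in a few lines. This is a minimal sketch, not Redis code: it assumes a POSIX system where os.fork is available, and the names dataset and snapshot_via_fork are made up for illustration. The child process sees the data as it was at the moment of the fork, even though the parent keeps writing.

```python
import json
import os


def snapshot_via_fork(dataset):
    """Return a point-in-time copy of `dataset` produced by a forked child.

    The parent mutates `dataset` right after fork(); the child still sees
    the pre-fork state because its memory is a copy-on-write view.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: serialize the frozen view of memory and exit immediately.
        os.close(read_fd)
        with os.fdopen(write_fd, "w") as out:
            json.dump(dataset, out)
        os._exit(0)

    # Parent: keep serving writes while the child dumps its snapshot.
    os.close(write_fd)
    dataset["counter"] += 1
    with os.fdopen(read_fd) as src:
        snapshot = json.load(src)
    os.waitpid(pid, 0)
    return snapshot


data = {"counter": 1}
snap = snapshot_via_fork(data)
print(snap["counter"], data["counter"])
```

The sketch only shows the semantics of the snapshot. The cost discussed in this article comes from the fork call itself, which has to duplicate the page table of a large process before the child can start.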

3. Master-slave disconnection problem

Open source Redis uses a master-slave high-availability architecture in which data is replicated asynchronously. After a primary outage, it is therefore easy to lose data or end up with inconsistent data. In addition, when write pressure on the primary is high, single-threaded master-slave replication may fail to keep up with the incremental data, so the replication buffer piles up and leads to write failures or even OOM. Redis can try to close a large gap between master and slave by generating a temporary snapshot and synchronizing it as a large file, but this brings back the fork issues mentioned earlier.
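
The effect of asynchronous replication can be illustrated with a toy model. This is only a sketch under stated assumptions: AsyncPrimary, backlog_limit, and the use of MemoryError to stand in for an OOM condition are all hypothetical and do not reflect how Redis is implemented.

```python
from collections import deque


class AsyncPrimary:
    """Toy primary that acknowledges writes before replicating them."""

    def __init__(self, backlog_limit):
        self.data = {}
        self.backlog = deque()  # acknowledged writes not yet on the replica
        self.backlog_limit = backlog_limit

    def write(self, key, value):
        if len(self.backlog) >= self.backlog_limit:
            # Stand-in for the write failure / OOM case under heavy load.
            raise MemoryError("replication backlog full")
        self.data[key] = value
        self.backlog.append((key, value))
        return "OK"  # acknowledged to the client immediately

    def replicate(self, replica, max_items):
        """Ship at most `max_items` pending writes to the replica."""
        shipped = 0
        while self.backlog and shipped < max_items:
            key, value = self.backlog.popleft()
            replica[key] = value
            shipped += 1
        return shipped


primary = AsyncPrimary(backlog_limit=4)
replica = {}
for i in range(4):
    primary.write(f"k{i}", i)
primary.replicate(replica, max_items=1)  # the replica is slower than the writes

# If the primary crashed now, every acknowledged write still in the
# backlog would be missing on the replica.
lost = sorted(set(primary.data) - set(replica))
print(lost)
```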

4. The fork problem

Fork is a very heavy system call. Although it is copy-on-write, operators usually have to reserve twice the memory for it. Fork also needs to take locks while it copies the process page table and other information, which has a great impact on the business. Fork is the common factor behind the three problems above. To avoid it, DBAs usually resort to complex operation and maintenance measures such as turning off AOF and backups on the primary node. These measures are very hard to maintain when master-slave switchovers are frequent or the number of nodes is large, and in the master-slave disconnection scenario there is in principle no way to avoid fork at all.

5. Capacity issues

Open source Redis is not suitable for large-scale use, and two important factors limit its scalability. First, fork limits the ability of Redis to scale up: the larger the data set, the slower the fork and the greater the impact on the business, so the amount of data a single Redis process can carry is very limited. Second, the inefficient Gossip-based cluster management limits its ability to scale out: the more nodes there are, the longer it takes to detect a fault and the faster internal communication traffic grows into a network storm, making large clusters almost unusable.
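
As a back-of-the-envelope model of how internal traffic grows with cluster size, the sketch below counts the node pairs in a cluster. It assumes every node exchanges heartbeats with every other node, which is a simplification (real gossip protocols sample peers), and full_mesh_links is a made-up helper name.

```python
def full_mesh_links(nodes):
    """Number of node pairs that would exchange heartbeats in a full mesh."""
    return nodes * (nodes - 1) // 2


for n in (10, 100, 1000):
    print(n, full_mesh_links(n))
```

Under this model, growing the cluster 100 times (from 10 to 1000 nodes) multiplies the number of pairs by roughly 11,000.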

1.2 What solutions does the industry have?

These are the classic problems that major enterprises have actually run into when using open source Redis in production, and they limit its wider adoption. As a result, a number of solutions have been proposed in the industry in recent years, as shown in the figure below.

Essentially, Redis is a KV store, and its usage can be divided into two camps based on the scenario: caching and persistence.

Cache scenario: generally used to store data for flash sales and trending events, such as Weibo hot searches. This kind of data is only valid for a limited time and can be lost.

Persistence scenario: because Redis has a simple interface and rich features, people who use it as a cache also want to persist more important data in it, such as historical orders, feature engineering data, location coordinates, and machine learning data. This kind of data is often very large, remains valid for a long time, and generally must not be lost.

The caching scenario is relatively simple, and open source Redis handles it well. For the persistence scenario there are many self-developed products in the industry, such as 360’s SSDB/PIKA, Alibaba’s Tair, and Tencent’s Tendis. Huawei Cloud’s Gauss Redis is also a self-developed persistent Redis.

Another reason to choose persistence is cost: 256 GB of memory is nearly 30 times more expensive than a 256 GB SSD, and there is also a huge difference in available capacity.

1.3 What is the solution of Huawei Cloud Database?

The Huawei Cloud database team learned from the experience of open source Redis and chose to build a self-developed, persistent Redis, which is the subject of today’s talk: Gauss Redis. Its one-sentence positioning is: a NoSQL database that supports the Redis protocol, not a cache. It has two features that set it apart from other products in the industry:

Storage-compute separation. Gauss Redis is built on DFV, Huawei’s internally developed distributed storage, which provides powerful data storage capabilities including strong consistency and elastic scaling. What is DFV? It is the cornerstone of Huawei’s full-stack data services: file, object (OBS), and block (EVS) storage, the database family, and the big data family all depend on it, so you can imagine its capability and stability.
Multi-model architecture. Gauss Redis is in fact a member of the multi-model database Gauss NoSQL. Gauss NoSQL provides a full-stack distributed KV engine, a user-mode file system, a storage pool, and other technologies, so a new NoSQL product can be built simply by wrapping the Redis protocol on top as the interface. In the same way, we also provide NoSQL engines for MongoDB, Cassandra, and InfluxDB.

2. Why choose storage-compute separation?

Today the concept of cloud native is everywhere, and databases are gradually becoming cloud native too. One of the defining characteristics of a cloud-native database is the separation of storage and compute, which also represents the latest trend for databases on the cloud.

The first generation of database services: as the figure below shows, in the era of self-built traditional IDCs, databases ran on bare metal. Because database services are so sensitive, DBAs or R&D staff had to take care of machine model selection, disk RAID arrays, networking, and even procurement, among many other matters.

The second generation of database services: with the spread of virtualization, a large number of applications moved to the cloud, and databases began to move with them. The simplest approach is to run the database service in a virtual machine or container. The advantages are obvious, but there are two disadvantages. First, general-purpose cloud disks already keep 3 copies, and the database keeps its own replicas on top of that, which seriously wastes resources. Second, standby nodes usually cannot serve traffic, which wastes resources again. Cloud disk IO performance can also be a problem.
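
The resource waste from stacking database replicas on top of replicated cloud disks is simple multiplication. The sketch below assumes a master plus one slave (2 database replicas), which the article does not specify; physical_copies is a made-up helper name.

```python
def physical_copies(database_replicas, disk_copies=3):
    """Physical copies of each byte = database replicas x cloud-disk copies."""
    return database_replicas * disk_copies


# Master + slave, each running on a 3-copy cloud disk.
stacked = physical_copies(database_replicas=2)
# A single logical copy whose replication is left to the storage layer.
storage_managed = physical_copies(database_replicas=1)
print(stacked, storage_managed)
```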

The third generation of database services: based on the storage-compute separation architecture, the database service is split into a CPU-intensive compute layer and an IO-intensive storage layer. Replica management is handed over entirely to the storage layer, and the compute layer performs stateless forwarding, which both exploits the elasticity of the cloud and lets every node share the full load. The disadvantage is also obvious: it is difficult to adapt an old architecture to this model.

With storage and compute separated, the database service follows a divide-and-conquer approach: the compute layer is responsible for all service and product-level processing and is stateless throughout, while the storage layer focuses on maintaining the data itself, including replication, disaster recovery, hardware awareness, capacity scaling, and so on.

3. Design and implementation

Next comes the overall design and implementation, starting with the software architecture. The modules of the Gauss Redis compute layer are as follows, mainly including CFGSVR, Proxy, and DataNode. Compute and storage resources are connected by RocksDB and Geminifs (a self-developed user-mode file system), which are responsible for converting KV data into SST files and for pushing SST files down to DFV’s object storage pool, respectively.

Next is the networking design. The database resources requested by a tenant are distributed with anti-affinity across containers on different physical machines, all within the same VPC of that tenant. Although database resources of different users may share a physical machine, VPC isolation guarantees that their data is isolated. In addition, compute-layer database resources use dedicated containers, while storage-layer resources share physical hardware.

Next is the disaster recovery architecture. Since Gauss Redis is positioned as a database rather than a cache, it takes data seriously: it not only provides 3-AZ disaster tolerance within a Region, but also provides disaster tolerance across Regions.

Disaster tolerance within a Region is a high-availability scheme that tolerates AZ-level failures. Even under such a failure, the data remains strongly consistent, which gives enterprise applications a very strong data security guarantee. The reliability of this architecture meets the standard of RPO = 0 and RTO < 10 s.

The implementation principle is as follows: the compute layer is also deployed with anti-affinity across 3 AZs and relies on the strongly consistent replication of DFV’s 3 replicas. When a piece of user data is written by the proxy to datanode1, datanode1 calls the DFV SDK through the Geminifs user-mode file system, which selects a DFV storage node in the local AZ and a DFV storage node in the nearest remote AZ to form a majority. Once the write succeeds on that majority, it is acknowledged to the user. In this architecture, an AZ failure, whether in compute or storage, has no impact on the safety of the data.
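
The majority rule can be sketched as follows. This is an illustration only: StorageNode and quorum_write are hypothetical names, not part of the DFV SDK, and the sketch writes to the replicas one after another, whereas a real system would issue the writes in parallel.

```python
class StorageNode:
    """Hypothetical stand-in for one DFV storage node in a given AZ."""

    def __init__(self, az, healthy=True):
        self.az = az
        self.healthy = healthy
        self.log = []

    def append(self, record):
        if not self.healthy:
            raise ConnectionError(f"storage node in {self.az} is unreachable")
        self.log.append(record)


def quorum_write(record, replicas):
    """Return True if a majority of replicas persisted `record`."""
    needed = len(replicas) // 2 + 1
    acks = 0
    for node in replicas:
        try:
            node.append(record)
            acks += 1
        except ConnectionError:
            continue  # a failed AZ must not block the write
    return acks >= needed


replicas = [StorageNode("az1"), StorageNode("az2"), StorageNode("az3", healthy=False)]
ok_one_az_down = quorum_write(b"SET k1 v1", replicas)

replicas[1].healthy = False
ok_two_az_down = quorum_write(b"SET k2 v2", replicas)
print(ok_one_az_down, ok_two_az_down)
```

With 3 replicas the majority is 2, so the write is still acknowledged when one AZ is down and is refused when two are down.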

Moving on to disaster recovery at the Region level. In addition to the strongly consistent 3-AZ scheme above, Gauss Redis also provides cross-Region disaster tolerance, that is, asynchronous disaster tolerance between two instances. In this scheme we add an rsync-server module, which subscribes to newly added logs on the primary instance, parses them into the corresponding format, and forwards them to the standby instance at the other end, where they are replayed. This scheme supports bidirectional synchronization, resumable transfer, conflict resolution, and so on. For conflict resolution, different algorithms are used for different Redis data structures to ensure eventual consistency.
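
The article does not say which conflict resolution algorithms are used. As a sketch of what per-data-structure resolution can look like, the example below assumes last-writer-wins for strings and a union merge for add-only sets. These choices and all names (merge_string, merge_set, resolve) are assumptions for illustration, not the Gauss Redis implementation.

```python
def merge_string(local, remote):
    """Last-writer-wins for string values given as (timestamp, value) pairs."""
    return max(local, remote)  # tuples compare by timestamp first


def merge_set(local, remote):
    """Union merge for add-only sets: no element added on either side is lost."""
    return local | remote


MERGERS = {"string": merge_string, "set": merge_set}


def resolve(kind, local, remote):
    """Pick the merge rule that matches the data structure."""
    return MERGERS[kind](local, remote)


# Both instances must reach the same result regardless of argument order.
a = resolve("string", (100, "blue"), (105, "green"))
b = resolve("string", (105, "green"), (100, "blue"))
s = resolve("set", {"x", "y"}, {"y", "z"})
print(a, b, sorted(s))
```

Both rules are commutative, which is what lets two instances that replay each other's logs converge on the same state.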

4. Competitiveness summary

The last section summarizes the advantages of Gauss Redis: strong consistency, high availability, hot and cold data separation, elastic scaling, and high performance.

The first feature is strong consistency.

This mainly benefits from DFV’s 3-replica mechanism: by the time the client receives a reply, the data written to Gauss Redis is already strongly consistent across 3 replicas. Strong consistency is friendly to business implementations, which no longer have to tolerate inconsistent data or verify it. Open source Redis, by contrast, replicates asynchronously, so there is always a buffer of differences between master and slave. If power is lost, the data in that buffer is lost, and under heavy write pressure the buffer piles up and in serious cases leads to OOM. Strong consistency is therefore a very important feature of Gauss Redis: it gives the business a consistent state without the data inconsistency and data loss concerns that follow a master-slave switchover in open source Redis.

The second feature is high availability.

High availability is a basic capability of any database. It is emphasized here because the availability of Gauss Redis differs from that of other databases: it can tolerate the failure of n-1 nodes. This is made possible by the shared storage DFV: when a compute node fails, the slot routing information it maintained is automatically taken over by the remaining nodes. Because no underlying data is migrated, the takeover is very fast. By the same mechanism, n-1 nodes can fail without affecting reads and writes of any data. Of course, losing compute nodes does reduce performance.
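
The takeover can be pictured as a pure metadata operation on a slot routing table. The sketch below is illustrative only: the 12-slot table, the node names dn1 to dn3, the round-robin reassignment, and the function take_over are all assumptions, since the article does not describe the actual slot count or assignment policy.

```python
def take_over(slot_owner, failed, survivors):
    """Reassign the failed node's slots to the survivors; no data is copied."""
    if not survivors:
        raise ValueError("at least one compute node must survive")
    new_owner = dict(slot_owner)
    orphaned = sorted(s for s, node in slot_owner.items() if node == failed)
    for i, slot in enumerate(orphaned):
        new_owner[slot] = survivors[i % len(survivors)]  # round-robin
    return new_owner


# 12 hypothetical slots spread over 3 compute nodes.
routing = {slot: f"dn{slot % 3 + 1}" for slot in range(12)}

after_one_failure = take_over(routing, "dn2", ["dn1", "dn3"])
after_two_failures = take_over(after_one_failure, "dn3", ["dn1"])
print(sorted(set(after_two_failures.values())), len(after_two_failures))
```

After n-1 = 2 failures every slot is still routable through the last surviving node, because the data itself lives in shared storage and only the routing table changes.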

The third feature is hot and cold data separation.

A classic use of open source Redis is to pair it with MySQL for hot and cold data separation, but this requires business code to exchange hot and cold data and keep them consistent, which is complicated logic to deliver. Gauss Redis implements hot and cold separation itself: data that is newly written or frequently accessed is kept in memory as hot data, while data that is rarely accessed is evicted to persistent storage. As a result, businesses using Gauss Redis no longer need to write business-layer code to maintain the hot and cold exchange logic, and they get better consistency.
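
A minimal sketch of built-in hot and cold separation is shown below. It assumes a write-through design with least-recently-used eviction, which the article does not confirm; HotColdStore and hot_capacity are hypothetical names, and a plain dict stands in for persistent storage.

```python
from collections import OrderedDict


class HotColdStore:
    """Toy store that keeps recent keys in memory and everything on 'disk'."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()  # in-memory tier, most recently used last
        self.cold = {}            # stands in for persistent storage
        self.hot_capacity = hot_capacity

    def set(self, key, value):
        self.cold[key] = value    # every write is persisted
        self._promote(key, value)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)  # refresh recency on a hit
            return self.hot[key]
        value = self.cold[key]         # cold read; raises KeyError if absent
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used key


store = HotColdStore(hot_capacity=2)
for key, value in (("a", 1), ("b", 2), ("c", 3)):
    store.set(key, value)
hot_after_writes = list(store.hot)

value_a = store.get("a")  # cold read promotes "a" back into memory
hot_after_read = list(store.hot)
print(hot_after_writes, value_a, hot_after_read)
```

The caller only ever calls set and get. The movement of data between the two tiers happens inside the store, which is the point the paragraph above makes about removing this logic from business code.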

The fourth feature is elastic scaling.

With storage and compute separated, Gauss Redis can scale each on demand: add compute when compute is insufficient, and add storage when storage is insufficient. Scaling compute resources is simple. As mentioned earlier, the process involves no data copying or relocation, only a metadata change: the corresponding slot routing information (no more than 1 MB) is migrated to the newly added node. It is therefore very fast and completes in seconds. Scaling storage is even simpler. Because the underlying storage is shared, expansion is logical in most cases: the user only needs to change the quota on the console, and no data is relocated or copied. Physical capacity expansion is sometimes needed as well. In that case our operations team generally notices the warning water level in advance and carries out a smooth migration and expansion before it is reached, which is transparent to users.
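
Scaling out compute can likewise be sketched as an edit to the slot routing table. The example assumes 12 slots, an even-share target, and a simple greedy choice of which slots to hand over; add_node and the node names are hypothetical and only illustrate that nothing but metadata moves.

```python
from collections import Counter


def add_node(slot_owner, new_node):
    """Hand an even share of slots to `new_node`; only routing entries change."""
    counts = Counter(slot_owner.values())
    target = len(slot_owner) // (len(counts) + 1)  # slots per node afterwards
    new_owner = dict(slot_owner)
    moved = []
    for slot in sorted(slot_owner):
        if len(moved) == target:
            break
        donor = slot_owner[slot]
        if counts[donor] > target:  # take only from nodes above the target
            new_owner[slot] = new_node
            counts[donor] -= 1
            moved.append(slot)
    return new_owner, moved


# 12 hypothetical slots spread evenly over 3 compute nodes.
routing = {slot: f"dn{slot % 3 + 1}" for slot in range(12)}
scaled, moved_slots = add_node(routing, "dn4")
print(moved_slots, dict(Counter(scaled.values())))
```

Only the entries for the moved slots change, which is consistent with the article's point that compute expansion involves a small amount of routing metadata rather than the data itself.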

The fifth feature is high performance.

The storage-compute separation architecture may seem heavyweight, with a complex request path, but it actually allows bolder choices in hardware selection and software optimization, such as RDMA networking, user-mode protocols, and persistent memory. Thanks to these dedicated storage devices, plus the full-load-sharing architecture of the compute layer (there are no slave nodes, so performance easily doubles), we perform well against competitors in storage scenarios where the data volume exceeds memory. Compared with open source Redis, we also have a large performance advantage in point lookups where the data fits in memory. Range queries, however, still need to be optimized.

5. Conclusion

That is all for this talk. For more information, please refer to the official blog and the official homepage of Gauss Redis.
