As an important online storage service at Meituan, KV storage handles trillions of requests from online services every day. At the 2019 QCon Global Software Development Conference (Shanghai), Meituan senior technical expert Qi Zebin gave a talk titled "Meituan-Dianping's Trillion-Level KV Storage Architecture and Practice". This article is a summary of that talk, divided into four parts: the first part describes the development history of KV storage at Meituan; the second part describes the architecture and practice of the in-memory KV store Squirrel; the third part introduces the architecture and practice of the persistent KV store Cellar; finally, it shares our future plans and new trends in the industry.

The development history of KV storage at Meituan

The first generation of distributed KV storage at Meituan is shown on the left of the architecture diagram below. I believe many companies have gone through this stage: the client does consistent hashing and many Memcached instances are deployed on the back end, which implements the most basic KV data distribution. However, such a design has obvious problems: data is lost when a node goes down and is removed, the cache capacity is hard to scale, and scaling with consistent hashing still loses part of the data, all of which cause a lot of trouble for business development.
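To make the first-generation design concrete, here is a minimal sketch of client-side consistent hashing over a set of Memcached instances. The node names and the virtual-node count are illustrative assumptions, not details from the talk.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal client-side consistent hash ring (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        # Map ring positions to physical nodes; virtual nodes smooth the distribution.
        self._ring = {}
        self._sorted_keys = []
        for node in nodes:
            for i in range(vnodes):
                pos = self._hash(f"{node}#{i}")
                self._ring[pos] = node
                bisect.insort(self._sorted_keys, pos)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first ring position >= hash(key).
        pos = self._hash(key)
        idx = bisect.bisect(self._sorted_keys, pos) % len(self._sorted_keys)
        return self._ring[self._sorted_keys[idx]]

ring = ConsistentHashRing(["memcached-1:11211", "memcached-2:11211", "memcached-3:11211"])
print(ring.get_node("user:42"))  # the instance this key would be routed to
```

When a node disappears from the ring, only the keys on its segments are remapped, but those cached entries are still gone, which is exactly the data-loss problem described above.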

As the Redis project matured, Meituan introduced Redis to solve the problems mentioned above, evolving into the architecture shown on the right of the image above. As you can see, the client side is the same, still using consistent hashing, while the server side becomes master-slave Redis. When any node goes down, we can complete failover through Redis Sentinel to achieve high availability. But one problem remains: when scaling capacity, consistent hashing still loses data, so how do we solve that?

At this point we found a relatively mature open source KV storage project: Alibaba's Tair. In 2014, we introduced Tair to meet the business demand for KV storage. The architecture of the open source version of Tair consists of three main parts: the storage nodes at the bottom of the diagram report heartbeats to the central node, which contains two configuration management nodes that monitor all storage nodes. When any storage node goes down or capacity is added, the central node rebuilds the cluster topology. When the client starts, it pulls a routing table directly from the central node; the routing table is simply a map of the cluster's data distribution, and the client reads and writes the storage nodes directly according to it. To address the earlier problem of data loss during capacity expansion, Tair also has a data migration mechanism to ensure data integrity.

However, we also ran into other issues while using it. For example, although the central node has a master-standby design for high availability, it has no distributed arbitration mechanism, so under a network partition a "split brain" is possible, which had a large impact on our business. We also ran into cases where data migration during failure recovery or capacity expansion affected business availability. In addition, we had used Redis before, and the business found Redis's particularly rich data structures that Tair did not yet support. Although Tair solved some problems for us, it could not fully meet the business needs. After all, with a business as large and complex as Meituan's, it is hard for any open source system to meet our needs well. In the end, we decided to build our own solutions on top of the existing open source systems.

Right at that time, in 2015, Redis officially released Redis Cluster. We followed the pace of the community, did a lot of development combined with internal requirements, and evolved Squirrel, a fully in-memory, high-throughput, low-latency KV store. Based on Tair, we also added many self-developed features and evolved Cellar, a persistent, large-capacity, highly reliable KV store. Since the open source version of Tair has not been updated for four or five years, Cellar's iteration has been entirely self-developed by Meituan, while the Redis community has remained active. In general, Squirrel's iteration balances self-developed features with the community, and our self-developed functions are designed to stay as compatible as possible with the official architecture. As you will see later, because of these differences, Cellar and Squirrel solve the same problems with different designs.

These two stores are in fact different solutions in the KV storage field. In practice, if a business has a small data volume and is sensitive to latency, we recommend Squirrel; if it has a large data volume and is not particularly sensitive to latency, we recommend the lower-cost Cellar. At present, the daily request volume of these two KV stores at Meituan has exceeded one trillion, and their peak request rate has exceeded one hundred million per second.

In-memory KV Squirrel architecture and practice

Before diving in, let's look at what the two storage systems have in common, starting with the classic problem in distributed storage: how is data distributed? In the KV storage domain this means: how are keys distributed across storage nodes? Squirrel and Cellar are the same here. When we get a key, we apply a fixed hash algorithm to obtain a hash value and take it modulo the number of slots to get a slot ID. Both of our KV stores are pre-sharded into 16384 slots. With the slot ID, the routing table tells us which storage node the slot lives on; the routing table is simply a mapping from slots to storage nodes.
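As a concrete illustration, here is a minimal sketch of that key-to-node lookup. The talk only says "a fixed hash algorithm"; CRC16 is used here because that is what Redis Cluster uses, and the example routing table values are made up.

```python
import binascii

NUM_SLOTS = 16384

def slot_of(key: str) -> int:
    # CRC16 (XMODEM) of the key, modulo the fixed number of slots.
    return binascii.crc_hqx(key.encode(), 0) % NUM_SLOTS

# Routing table: slot range -> storage node (illustrative values).
ROUTING_TABLE = [
    (range(0, 5461), "node-1:6379"),
    (range(5461, 10923), "node-2:6379"),
    (range(10923, 16384), "node-3:6379"),
]

def node_of(key: str) -> str:
    slot = slot_of(key)
    for slot_range, node in ROUTING_TABLE:
        if slot in slot_range:
            return node
    raise KeyError(f"no node owns slot {slot}")

print(node_of("order:20191107"))  # prints the node that owns this key's slot
```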

Next, let's talk about how we think about high availability. Personally, I think high availability can be viewed from both a macro and a micro perspective. From the macro point of view, high availability is about disaster recovery: if a node fails, what do you do? If a machine room, or a group of machines in a whole region, goes down, what do you do? From the micro point of view, high availability is about keeping the end-to-end success rate high: when we do operations and maintenance upgrades, or migrate data during scaling, can business requests still be served with high availability? This article shares some of the high-availability work Meituan has done from both perspectives.

Above is our Squirrel architecture. The middle part is consistent with the official Redis Cluster: a master-slave structure in which the Redis instances communicate with each other via the Gossip protocol. On the right we added a cluster scheduling platform, including a scheduling service, a scaling service and a high-availability (HA) service. It manages the whole cluster and writes the management results into ZooKeeper as metadata. Our client subscribes to the metadata changes in ZooKeeper, obtains the cluster topology in real time, and reads and writes the Redis cluster directly.

Squirrel node disaster recovery

Now let's look at Squirrel's node disaster recovery. Redis Cluster already handles node outages: in the official design, it usually takes 30 seconds for a failed node to be marked as FAIL and removed. Removing a master can affect data integrity, so that caution is necessary, but what about a slave? We think this waiting period is completely unnecessary for slaves. The other thing is that an in-memory KV store usually holds a relatively small amount of data per cluster, so a company with a large business tends to run many clusters. If a switch fails, many clusters are affected at once, and rebuilding the replicas after the outage is very cumbersome. To solve these two problems, we built the HA (high availability) service.

Its architecture is shown below. It monitors all nodes in the cluster in real time. In case of network jitter or an outage (for example on Redis 2), it updates ZooKeeper in real time, telling clients that Redis 2 has been removed; when the client receives the change, read traffic is routed directly to Redis 3. If Redis 2 only suffered a few dozen seconds of network jitter and the HA service then observes that it has recovered, it is added back.

If after some time the HA service determines that this is a permanent outage, it directly requests a new Redis 4 container instance from the Kubernetes cluster and adds it to the cluster. At this point, the topology becomes the standard one-master-two-slaves structure. After updating the cluster topology, the HA service writes to ZooKeeper to notify the client to update its routes, and the client can then read from the new Redis 4 slave.

With this solution, we reduced slave removal time from 30 seconds to 5 seconds. In addition, by having the HA service automatically request container instances to join the cluster, replacing a failed replica has become a minute-level automatic operation with no human intervention.

Squirrel cross-region disaster recovery

We have solved single-node outages, but what about cross-region failures? Let's first look at what is different across regions. First, compared with the network inside a single region and machine room, cross-region dedicated lines are very unstable; second, cross-region bandwidth is very limited and expensive. Replication within a single cluster was not designed for such extreme network environments: if we deploy the master in Beijing and two slaves in Shanghai, the same data has to be transmitted twice over the Beijing-Shanghai dedicated line, wasting a huge amount of its bandwidth. In addition, as the business grows, we are also moving towards unitized deployment and multi-site active-active architecture, and official master-slave synchronization cannot meet these needs. For these reasons, we built an inter-cluster replication scheme.

As shown in the figure above, a master cluster in Beijing and a slave cluster in Shanghai are drawn here. What we need to do is synchronize the data of the Beijing master cluster to the Shanghai slave cluster through a cluster synchronization service. Following the flow: first, we issue a task to the synchronization scheduling module to establish a synchronization link between the two clusters; based on the topology of the master and slave clusters, the scheduling module issues master-slave synchronization tasks to the synchronization cluster; on receiving a task, a synchronizer disguises itself as a Redis slave and, via the Redis replication protocol, pulls data from the corresponding master node of the master cluster, including the RDB and the subsequent incremental changes. When the synchronizer receives the data, it translates it into client write commands and writes them to the master node of the slave cluster. In this way, we synchronize the data from the Beijing master cluster to the Shanghai slave cluster. Similarly, to achieve multi-site active-active we only need to add a synchronization link in the opposite direction, which gives us two-way synchronization between the clusters.
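Conceptually, the synchronizer is a translation loop: it consumes a Redis replication stream and replays it as ordinary client writes on the target cluster. Below is a minimal sketch of that translation step, assuming a hypothetical replication_events() source that yields already-parsed commands (the real service speaks the Redis replication protocol to obtain the RDB plus incremental stream); the target write uses the standard redis-py client, and the host name is made up.

```python
import redis  # redis-py client for the target (slave) cluster

def replication_events():
    """Hypothetical source of already-parsed replication events.
    In the real service these come from the Redis replication protocol:
    an initial RDB snapshot followed by incremental commands."""
    yield ("SET", "user:1", "alice")
    yield ("DEL", "session:42")

def replay_to_target(target: redis.Redis) -> None:
    # Translate each replication event into an ordinary client write
    # against the master node of the destination cluster.
    for command, *args in replication_events():
        target.execute_command(command, *args)

if __name__ == "__main__":
    target = redis.Redis(host="shanghai-master", port=6379)
    replay_to_target(target)
```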

Now let’s talk about how to do high availability at the micro level, which is to maintain a high end-to-end success rate. For Squirrel, there are mainly three problems that affect the success rate:

  • Data migration causes timeout jitter.
  • Persistence causes timeout jitter.
  • Hot Key requests overload a single node.

Squirrel Intelligent Migration

For data migration, we encountered three main problems:

  • Although Redis Cluster provides the migration capability, it does not decide which slots to migrate, or where each slot should move from and to.
  • When migrating data, we want to move as fast as possible, but moving too fast may affect normal business requests.
  • The Redis MIGRATE command can block the worker thread for a long time, especially when migrating large values.

To solve these problems, we made a new migration service.

Let's look at how it works. First, migration tasks are generated. The core of this step is the "proximity principle": for example, migration between two nodes in the same machine room is definitely faster than between two nodes in different machine rooms. Once the migration tasks are generated, they are sent to a batch of migration machines. When a migration machine migrates data, it has the following characteristics:

  • First, migration between nodes in the cluster is concurrent; for example, migration commands are issued to Redis 1 and Redis 3 simultaneously.
  • Second, each MIGRATE command migrates a batch of keys.
  • Third, we use a monitoring service to collect the client's success rate, latency, load and QPS in real time, and feed this state back to the migration machine. The migration process is similar to TCP slow start: it ramps the speed up, and if the request success rate drops it backs off, so the migration speed eventually stabilizes at a dynamic equilibrium. This achieves the fastest possible migration while affecting normal business requests as little as possible (see the sketch after this list).
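A minimal sketch of that feedback loop is below. The additive/multiplicative adjustment style and the concrete thresholds are assumptions for illustration; the talk only says the speed ramps up and backs off on success-rate drops, like TCP slow start.

```python
class MigrationThrottle:
    """Adjust migration speed from the observed request success rate
    (an illustrative sketch; thresholds and step sizes are made up)."""

    def __init__(self, keys_per_batch=10, max_keys_per_batch=500):
        self.keys_per_batch = keys_per_batch
        self.max_keys_per_batch = max_keys_per_batch

    def adjust(self, success_rate: float) -> int:
        if success_rate >= 0.999:
            # Business traffic looks healthy: ramp up, TCP-slow-start style.
            self.keys_per_batch = min(self.keys_per_batch * 2, self.max_keys_per_batch)
        elif success_rate < 0.995:
            # Success rate is dropping: back off sharply to protect the business.
            self.keys_per_batch = max(self.keys_per_batch // 2, 1)
        return self.keys_per_batch

throttle = MigrationThrottle()
for rate in (1.0, 1.0, 0.993, 0.999):
    print(throttle.adjust(rate))  # 20, 40, 20, 40
```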

Next, let's look at the migration of large values. We implemented an asynchronous MIGRATE command: while it runs, the Redis main thread continues to process other normal requests. If a write request arrives for a key that is currently being migrated, Redis simply returns an error. This keeps business requests flowing as normally as possible without blocking the main thread.
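The key behaviour here is the conflict rule: requests for keys that are not being migrated proceed as usual, while writes to an in-flight key fail fast instead of blocking. A minimal sketch of that rule, with made-up names and error text, could look like this:

```python
MIGRATING_KEYS = set()  # keys currently being copied by the async MIGRATE

def handle_write(key: str, value: str, store: dict) -> str:
    # Writes to a key that is mid-migration fail fast instead of blocking
    # the main thread; everything else is served normally.
    if key in MIGRATING_KEYS:
        return "-ERR key is being migrated, retry later"
    store[key] = value
    return "+OK"

store = {}
MIGRATING_KEYS.add("big:list:1")
print(handle_write("big:list:1", "x", store))  # -ERR key is being migrated, retry later
print(handle_write("user:7", "bob", store))    # +OK
```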

Squirrel persistence refactoring

Redis generates an RDB during master-slave synchronization. Generating an RDB calls fork to create a child process that writes the data to disk. Although fork benefits from the operating system's copy-on-write mechanism, when memory usage reaches 10 GB or 20 GB it can still block the whole process for close to a second, which is almost unacceptable for an online business. For businesses with high data-reliability requirements we also turn on AOF, and with AOF enabled, IO jitter may block the process and hurt the request success rate. Our answer to these two problems with the official persistence mechanism was to refactor it.

The picture above shows the latest version of our Redis persistence mechanism. A write request is first applied to the DB and then written to the in-memory backlog, the same as the official mechanism. At the same time, the request is handed to an asynchronous thread, which is responsible for flushing the changes into an on-disk backlog. When the on-disk backlog grows too large, we proactively generate an RDB during off-peak hours and then delete the backlog entries produced before that RDB.

What happens when master-slave synchronization needs to find a synchronization point? The first step is the same as the official one: look for the required synchronization point in the in-memory backlog. If it is not there, we look in the on-disk backlog. Because disk space is large, the on-disk backlog can hold an extremely large amount of data, so it is rare for a synchronization point to be missing. If it is missing from the on-disk backlog as well, we trigger an operation similar to a full resynchronization, but this full resync does not need to generate an RDB on the spot: it can be served directly from the RDB already stored on disk plus the on-disk backlog that follows it. With this design we greatly reduce the number of full resyncs. In addition, by generating RDBs only during off-peak hours, we remove much of the jitter caused by RDB generation, and we also avoid the jitter caused by writing the AOF. However, since the AOF write is now fully asynchronous, this solution is slightly less reliable than the official one, but we think the gain in availability is well worth the price.
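A minimal sketch of that lookup order (in-memory backlog, then on-disk backlog, then stored RDB plus on-disk backlog) is below; the data structures and names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Backlog:
    start_offset: int   # first replication offset still retained
    end_offset: int     # next offset to be written

    def contains(self, offset: int) -> bool:
        return self.start_offset <= offset <= self.end_offset

def plan_sync(slave_offset: int, mem_backlog: Backlog,
              disk_backlog: Backlog, rdb_offset: int) -> str:
    """Decide how to resume replication for a slave at `slave_offset`."""
    if mem_backlog.contains(slave_offset):
        return "incremental sync from in-memory backlog"
    if disk_backlog.contains(slave_offset):
        return "incremental sync from on-disk backlog"
    # Fall back to the RDB already stored on disk plus the backlog after it,
    # instead of forking to generate a fresh RDB.
    return f"full resync from stored RDB at offset {rdb_offset} + on-disk backlog"

print(plan_sync(5_000,
                mem_backlog=Backlog(9_000, 10_000),
                disk_backlog=Backlog(2_000, 10_000),
                rdb_offset=2_000))
# -> incremental sync from on-disk backlog
```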

Squirrel hot Key

Let's take a look at Squirrel's hot key solution. As shown in the figure below, the ordinary master and slave are nodes in the normal cluster, while the hotspot master and slave are nodes outside it. Let's see how they relate to each other.

When a request to read or write an ordinary node arrives, the node keeps per-key request statistics. If a key's access count or bandwidth usage reaches a certain threshold, flow control is triggered automatically to limit access to the hot key and prevent the node from being saturated by hotspot requests. At the same time, a monitoring service periodically collects the hot-key statistics from all Redis instances. If a hotspot is found, the monitoring service reports the slot that the hot key belongs to to our migration service. The migration service then adds the hotspot master and slave to the cluster and migrates the hot slot to them. Because the hotspot nodes serve only hot-slot requests, the capacity for handling hot keys is greatly improved. Through this design, we achieve real-time hotspot detection and stop the loss quickly via flow control, and through hotspot migration we achieve automatic hotspot isolation and rapid capacity expansion.
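Here is a minimal sketch of the per-key statistics plus threshold-triggered flow control described above; the one-second window and the QPS threshold are illustrative assumptions.

```python
import time
from collections import defaultdict

class HotKeyLimiter:
    """Count per-key accesses in one-second windows and throttle keys
    that exceed a threshold (thresholds are made up, for illustration)."""

    def __init__(self, max_qps_per_key=5000):
        self.max_qps_per_key = max_qps_per_key
        self.window = int(time.time())
        self.counts = defaultdict(int)

    def allow(self, key: str) -> bool:
        now = int(time.time())
        if now != self.window:          # new one-second window: reset counters
            self.window = now
            self.counts.clear()
        self.counts[key] += 1
        return self.counts[key] <= self.max_qps_per_key

    def hot_keys(self):
        # What the monitoring service would periodically collect and report.
        return [k for k, c in self.counts.items() if c > self.max_qps_per_key]

limiter = HotKeyLimiter(max_qps_per_key=3)
for _ in range(5):
    allowed = limiter.allow("sku:1001")
print(allowed, limiter.hot_keys())  # False ['sku:1001']
```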

Persistent KV Cellar architecture and practice

Let's take a look at the architecture and practice of the persistent KV store Cellar. Below is our latest Cellar architecture diagram.

There are two major architectural differences from Alibaba's Tair. The first is OB, the second is ZooKeeper. OB plays a role similar to a ZooKeeper Observer: it provides a query service for the Cellar central node's metadata and synchronizes the latest routing table from the central node's master in real time, and clients fetch their routing tables from OB. This has two main advantages. First, the large number of business clients is naturally isolated from the cluster's master "brain", preventing routing-table requests from affecting cluster management. Second, because OB only serves routing-table queries and does not participate in cluster management, it can scale horizontally, which greatly increases our routing-table query capacity. In addition, we introduced ZooKeeper as a distributed arbiter to solve the master/standby "split brain" problem under network partition that I mentioned earlier, and by storing the cluster metadata in ZooKeeper we also ensure its high reliability.

Cellar node disaster recovery

After introducing the overall architecture, let's look at how Cellar handles node failures. The outage of a cluster node is usually temporary, and so is a node's network jitter; they recover quickly and rejoin the cluster. Completely removing a node because of a temporary absence and re-replicating its data would consume a lot of resources and in turn affect business requests. So we implemented a Handoff mechanism to deal with the impact of such short-term node failures.

As shown in the figure above, if node A goes down, the Handoff mechanism is triggered: the central node notifies the client that node A is down and lets the client also send shard 1's requests to node B. Node B handles the client's reads and writes normally, and in addition writes the shard 1 and shard 2 data that should have gone to node A into a local log.

If A recovers after 3 to 5 minutes of downtime, or after 30 to 50 seconds of network jitter, A reports a heartbeat to the central node, and the central node tells B: "A has recovered; send it the data it missed." Node B then writes the locally stored log back to node A. Once node A has the full data for the failure period, the central node tells the client that node A has fully recovered, and the client can route shard 1's requests back to node A.

With this mechanism, we achieve second-level node removal, and a recovered node only needs to catch up on a small amount of incremental data before rejoining. Furthermore, if node A needs to be upgraded, the central node first switches node A's traffic to node B through an actively triggered Handoff; after A is upgraded, the incremental log is written back and traffic is switched back, and A rejoins the cluster. So by actively triggering the Handoff mechanism, we also gain the ability to do silent upgrades.
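Below is a minimal sketch of the write path during a Handoff, using invented names for the log and the replay step; the real Cellar implementation is of course richer than this.

```python
class HandoffNode:
    """Stand-in node B: serves writes and logs those destined for a failed peer."""

    def __init__(self):
        self.store = {}          # node B's own data
        self.handoff_log = []    # writes that belong to the failed node A

    def write(self, shard: int, key: str, value: str, peer_down: bool) -> None:
        self.store[(shard, key)] = value
        if peer_down:
            # Keep an incremental log so the failed node only needs to
            # catch up on this delta when it comes back.
            self.handoff_log.append((shard, key, value))

    def replay_to(self, recovered_node: dict) -> None:
        # Triggered by the central node once A's heartbeat reappears.
        for shard, key, value in self.handoff_log:
            recovered_node[(shard, key)] = value
        self.handoff_log.clear()

node_b = HandoffNode()
node_b.write(1, "order:9", "paid", peer_down=True)   # shard 1 belongs to A
node_a_store = {}
node_b.replay_to(node_a_store)
print(node_a_store)  # {(1, 'order:9'): 'paid'}
```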

Cellar cross-region disaster recovery

Now let's look at how Cellar handles cross-region disaster recovery. Cellar faces the same cross-region problem as Squirrel, and its solution is also inter-cluster replication. The following figure takes a Beijing master cluster and a Shanghai slave cluster as an example. If the client writes to node A of the Beijing master cluster, node A replicates the write to nodes B and D, just like normal replication within the cluster. At the same time, node A also replicates a copy of the data to node H of the slave cluster. After node H processes this inter-cluster write, it performs intra-cluster replication, copying the write to nodes I and K of the slave cluster. By establishing such a replication link between nodes of the master and slave clusters, we replicate data between clusters while keeping the cross-region bandwidth consumption to a minimum. Similarly, by configuring two such replication links in opposite directions, the two clusters can achieve bidirectional synchronization and multi-site active-active.

Cellar strong consistency

After node disaster recovery and cross-region disaster recovery were in place, our businesses raised a higher requirement: strongly consistent storage. Our previous data replication was asynchronous, so when a failed node was removed, data could be lost because it had not yet been replicated. For scenarios such as financial payments, data loss is not acceptable. Faced with this problem, how should we solve it? The mainstream solution in the industry is strongly consistent replication based on the Paxos or Raft protocols. We chose Raft, mainly because the Raft paper is very detailed and highly engineering-oriented, and the industry has many mature open source Raft implementations that we could build on, shortening our development cycle.

The following diagram shows the architecture of a Cellar cluster in Raft replication mode. The central node schedules the Raft groups and decides on which nodes the three replicas of each slot are stored.

As you can see, slot 1 is on storage nodes 1, 2 and 4, and slot 2 is on storage nodes 2, 3 and 4. Each slot forms a Raft group, and the client reads and writes through the Raft leader. Since we pre-allocate 16384 slots, a small cluster may have hundreds or even thousands of slots per storage node. If each Raft group had its own replication threads, replication requests and log, the resource consumption would be very high and write performance would suffer. So we implemented Multi-Raft: Cellar writes the logs of all Raft groups on the same node into a single log and replicates them with the same set of threads, and replication packets from different Raft groups heading to the same target node are batched together, so that write performance does not degrade because of too many Raft groups. Raft has its own leader election and can elect a new leader when a node goes down, so does the central node no longer need to manage the Raft groups? It does. In a typical scenario, if parts of a cluster go through several rounds of outages, the Raft leaders become extremely unevenly distributed across the storage nodes. Because strong consistency requires client reads and writes to go through the Raft leader, the cluster's traffic would then be very unbalanced. So our central node also schedules the leaders of the Raft groups. For example, slot 1 is stored on nodes 1, 2 and 4 with node 1 as the leader. If node 1 fails, Raft elects node 2 as the leader; when node 1 recovers and rejoins the cluster, the central node asks node 2 to hand the leadership back to node 1. This way, even after a series of outages and recoveries, the number of leaders on each storage node stays balanced.
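The leader scheduling essentially boils down to counting leaders per node and transferring leadership until the distribution is even. Here is a minimal sketch of that balancing decision, with made-up data; it ignores real constraints such as transfer pacing and failure handling.

```python
from collections import Counter

def plan_leader_transfers(leaders, replicas):
    """Yield (slot, from_node, to_node) transfers that move leaders from the
    most-loaded node to a less-loaded replica of the same slot (sketch only)."""
    transfers = []
    load = Counter(leaders.values())
    for slot, leader in leaders.items():
        busiest = max(load, key=load.get)
        if leader != busiest:
            continue
        # Prefer a replica of this slot that currently holds fewer leaders.
        target = min(replicas[slot], key=lambda n: load[n])
        if load[leader] - load[target] >= 2:
            transfers.append((slot, leader, target))
            load[leader] -= 1
            load[target] += 1
    return transfers

leaders = {1: "node2", 2: "node2", 3: "node2", 4: "node4"}
replicas = {1: ["node1", "node2", "node4"], 2: ["node2", "node3", "node4"],
            3: ["node2", "node3", "node4"], 4: ["node2", "node3", "node4"]}
print(plan_leader_transfers(leaders, replicas))
# -> [(1, 'node2', 'node1'), (2, 'node2', 'node3')]
```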

Next, let's look at how Cellar keeps its end-to-end success rate high. Here Cellar also deals with three problems that affect the success rate. It shares the data migration and hot key problems with Squirrel, but its solutions are different, because Cellar took the fully self-developed path and did not need to stay compatible with the official version, which allowed bigger architectural changes. The third problem is that slow requests blocking the service queue cause large numbers of timeouts, which is specific to Cellar's network and worker thread model.

Cellar Intelligent Migration

Above is the Cellar intelligent migration architecture diagram. We divide the migration of a bucket into three states. The first is the normal state, with no migration going on. If slot 2 needs to be migrated from A to B, A takes a snapshot of slot 2 and sends the full snapshot to B. While the data is being migrated, node B's response packets carry back node B's state: engine pressure, network card traffic, queue length, and so on. Node A adjusts its migration speed according to node B's state. As with Squirrel, after a period of adjustment the migration speed reaches a dynamic balance, achieving the fastest migration while affecting normal business requests as little as possible.

When slot 2 has finished migrating, we enter the state of slot 3 in the figure. At this point the client may not have updated its routing table yet, so a request may still reach node A. Node A finds that the client has requested the wrong node, but instead of returning an error, it delegates the request to node B and sends B's response packet back to the client. At the same time, it tells the client to update its routing table, after which the client can access node B directly. This resolves the request errors caused by delayed client routing updates.
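A minimal sketch of that delegation step, with hypothetical names, might look like the following: the wrong node proxies the request and piggybacks a routing hint instead of failing it.

```python
def handle_request(node_id: str, slot: int, slot_owner: dict, forward):
    """Serve a request that may have arrived at the wrong node after migration.

    `slot_owner` maps slot -> node, and `forward(node, slot)` is a stand-in for
    sending the request on to the correct node (both are illustrative)."""
    owner = slot_owner[slot]
    if owner == node_id:
        return {"data": f"value from {node_id}", "update_routing": False}
    # Wrong node: proxy to the real owner and tell the client to refresh routes.
    response = forward(owner, slot)
    response["update_routing"] = True
    response["new_owner"] = owner
    return response

slot_owner = {2: "B"}  # slot 2 has already moved from A to B
print(handle_request("A", 2, slot_owner,
                     forward=lambda node, slot: {"data": f"value from {node}"}))
# {'data': 'value from B', 'update_routing': True, 'new_owner': 'B'}
```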

Cellar fast and slow queues

The upper part of the figure below shows a standard thread queue model. The network thread pool receives network traffic, parses the request packets, and puts the requests into a work queue. The worker thread pool takes requests from the work queue, processes them, and hands the response packets back to the network thread pool to send out.

When we analyzed the timeout cases online, we found that in a batch of timed-out requests, usually only one or two were actually caused by slow engine processing; most requests timed out simply because their overall response time was too long after waiting in the queue. From the online analysis, only about 1/20 of the timed-out requests were genuinely slow.

So what is our solution? Very simple: split the thread pools and split the queues. After receiving a packet, our network threads classify it into one of four queues according to the request's characteristics: read or write, fast or slow. Read and write requests are easy to distinguish, but how do we separate fast from slow? We judge the speed of a request by the number of keys, the value size, the number of data structure elements, and so on. Four corresponding worker thread pools then process the requests of their own queues, isolating fast and slow reads and writes. This way, a slow read request will not affect the other three kinds of requests. But doesn't going from one thread pool to four quadruple the number of threads? Not really: a thread pool helps the other pools handle requests when it is idle, so we end up with four thread pools but the same total number of threads. In our online verification, this design reduces the service's TP999 latency by 86% and significantly reduces the timeout rate.
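Here is a minimal sketch of the request classification and the four queues; the size thresholds and the request format are assumptions for illustration, and the work-stealing between pools is only hinted at in a comment.

```python
import queue

# Four queues: (read/write) x (fast/slow); real worker pools would drain these,
# stealing from the other queues when their own queue is idle.
QUEUES = {name: queue.Queue() for name in
          ("read_fast", "read_slow", "write_fast", "write_slow")}

def classify(request: dict) -> str:
    """Route a parsed request to one of the four queues.
    Thresholds (key count, value size, element count) are made up."""
    kind = "read" if request["op"] in ("GET", "MGET") else "write"
    slow = (request.get("key_count", 1) > 64
            or request.get("value_bytes", 0) > 64 * 1024
            or request.get("elements", 0) > 1024)
    return f"{kind}_{'slow' if slow else 'fast'}"

def enqueue(request: dict) -> None:
    QUEUES[classify(request)].put(request)

enqueue({"op": "GET", "key_count": 1})           # -> read_fast
enqueue({"op": "MGET", "key_count": 500})        # -> read_slow
enqueue({"op": "SET", "value_bytes": 1 << 20})   # -> write_slow
print({name: q.qsize() for name, q in QUEUES.items()})
```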

Cellar hot Key

The diagram above shows the architecture of Cellar's hot key solution. The central node has taken on a new responsibility: hotspot management. It is now not only responsible for the normal distribution of data replicas, but also manages the distribution of hotspot data; in the cluster shown in the diagram, it places the hotspot area on nodes C and D. Let's see how this works through the read and write flow. If the client writes to node A, node A determines, based on real-time hotspot statistics, whether the written key is a hotspot after processing the request. If it is, node A replicates the data to the hotspot nodes, i.e. nodes C and D in the figure, while doing the normal intra-cluster replication. At the same time, when the storage node returns the result to the client, it tells the client that this key is a hotspot, and the client caches that fact. When the client next reads this key, it goes directly to the hotspot nodes. In this way we can scale out only the hotspot data, unlike Squirrel, which has to migrate an entire slot for expansion. If necessary, the central node can also place the hotspot area on all nodes of the cluster, so that all hot read requests are spread evenly across every node. In addition, by replicating the hotspot data in real time, we avoid the consistency problems of client-side hot-KV caching solutions.
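Below is a minimal sketch of this flow from the client's point of view: the response flags hot keys, the client remembers them, and later reads for those keys go to the hotspot nodes. All names, the "hot:" prefix and the response format are assumptions made for illustration.

```python
class StubNode:
    """Stand-in storage node. On a hot write it replicates to the hotspot
    nodes and flags the key, mimicking the server-side behaviour described
    above (the 'hot:' prefix and response format are made up)."""
    def __init__(self, name, hotspot_peers=()):
        self.name, self.data, self.hotspot_peers = name, {}, hotspot_peers

    def write(self, key, value):
        self.data[key] = value
        is_hot = key.startswith("hot:")
        if is_hot:
            for peer in self.hotspot_peers:     # real-time hotspot replication
                peer.data[key] = value
        return {"is_hot": is_hot}

    def read(self, key):
        return f"{self.data[key]} (served by {self.name})"


class CellarClientSketch:
    """Client side: remember keys flagged as hot, read them from hotspot nodes."""
    def __init__(self, normal_node, hotspot_nodes):
        self.normal_node, self.hotspot_nodes = normal_node, hotspot_nodes
        self.hot_keys = set()

    def write(self, key, value):
        if self.normal_node.write(key, value)["is_hot"]:
            self.hot_keys.add(key)

    def read(self, key):
        if key in self.hot_keys:
            # Spread hot reads across the dedicated hotspot nodes.
            node = self.hotspot_nodes[hash(key) % len(self.hotspot_nodes)]
            return node.read(key)
        return self.normal_node.read(key)


hotspots = [StubNode("C"), StubNode("D")]
client = CellarClientSketch(StubNode("A", hotspot_peers=hotspots), hotspots)
client.write("hot:sku:1", "99.9")
print(client.read("hot:sku:1"))   # "99.9 (served by C)" or "... by D"
```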

Development plans and industry trends

Finally, let's take a look at our project plans and the technology trends in the industry. This part is organized into three levels: service, system and hardware. First, at the service level, there are three main points:

  • First, Redis Gossip protocol optimization. As everyone knows, as a cluster grows, the Gossip message volume increases dramatically and failover time gets longer and longer. When the cluster reaches the TB level, its availability is greatly affected, so we will focus on optimizing this later.
  • Second, we have already implemented Raft replication among the data replicas of Cellar storage nodes to ensure strong data consistency. Later we will also implement Raft replication inside Cellar's central node, so that we no longer need to rely on ZooKeeper for distributed arbitration and metadata storage. Our architecture will then become simpler and more reliable.
  • Third, although Squirrel and Cellar are both KV stores, they are developed on different open source projects, so their APIs and access protocols are different. Later we will consider unifying Squirrel and Cellar at the SDK layer: there will still be different storage clusters at the back end, but the business side will be able to access them with a single SDK.

At the system level, we are investigating and implementing some kernel-bypass technologies, such as DPDK and SPDK, which are user-space IO stacks for the network and the disk. They bypass the kernel and access devices through a polling mechanism, which can greatly improve a system's IO capability. Storage is a typical IO-intensive service, so this brings a significant performance boost.

At the hardware level, smart NICs with RDMA support can significantly reduce network latency and improve throughput. New storage media such as 3D XPoint, like Intel's recently released AEP memory, have access latency much closer to that of memory, and the line between storage and memory will become increasingly blurred. Finally, on the computing side, adding an FPGA card to flash storage can offload work that the CPU would otherwise do, such as data compression and decompression; such hardware frees up the CPU while reducing service response latency.

About the author

Qi Zebin, senior technical expert at Meituan-Dianping, joined Meituan in 2014.

Recruitment information

The Storage Technology Center of Meituan's Basic Technology Department is hiring senior engineers and technical experts in C/C++, Go and Java. You are welcome to join the Basic Technology Department family. Interested candidates can send their resumes to [email protected] (email subject: Basic Technology Department – Storage Technology Center).
