Redis cluster fault tolerance and failover

1. Redis service model

There are three service modes of Redis, namely single-machine mode, master-slave mode and cluster mode. When the machine fails in different modes, Redis responds differently. In single-machine mode, faults are solved through backup and restoration. In the master-slave mode, the sentry re-elects the master node from the slave node to replace the faulty node. In the cluster mode, the sentry re-elects the new master node in a similar way to the master-slave mode. The difference is that in cluster mode there is no need for sentries to preside over elections.

2. Backup and restoration in single-machine mode

Redis supports two backup modes, namely RDB backup and AOF backup.

2.1 RDB backup

RDB persistence can be performed manually or periodically by configuration. RDB backup is to save the Redis data at a point in time to an RDB file that holds the data in the Redis database at the current point in time. When the Redis service starts, you can specify to load the RDB file and restore Redis.

There are two commands to generate RDB files, SAVE and BGSAVE. The SAVE command will block the Redis server, that is, during the execution of the SAVE command, the Redis service will reject all command requests. It can be used normally only after the execution is complete.

BGSAVE is executed by child processes, so it does not block the Redis service. And we can manually set BGSAVE to be executed automatically, by setting the number of times BGSAVE is executed, or as many writes are performed.

2.2 AOF backup

The difference between AOF and RDB backup is that RDB files save the specific data in Redis server, while AOF backup stores the commands of Redis. When Redis’s AOF function is enabled, each time Redis executes a write command, the command is written to the end of the aof_buf buffer (similar to Java’s Writer method). Each time the Redis execution time event ends, the contents of the AOF_buf buffer are flushed to disk as determined by the configuration (similar to the Java flush method).

2.3 Data Recovery

The RDB file or AOF file can be specified to load when the Redis server starts. The Redis server will load AOF files first, and only load RDB files automatically if AOF is not enabled.

3. Failover in master/slave mode

Redis SLAVEOF allows one Redis server to copy another Redis server. This pattern is called master-slave, or master-slave replication. The replicated server is called the primary server. Servers that replicate the master are called slave servers. The master and slave servers will hold the same data. In the event that the primary server goes down, the secondary server can be changed to the primary server to continue executing commands without data loss.

3.1 Primary/Secondary Replication

There are two cases of master-slave replication, one is the first time to start replication, which is called sync, and the other is command propagate.

Synchronization means that when running the slaveof command, the slave server needs to synchronize the data of the master server. The slave server needs to use the sync command to synchronize the data of the master server. After receiving the slaveof command, the slave server sends the sync command to the master server. After receiving the sync command, the master server runs the BGSAVE command. Write commands received during file generation are saved to the buffer. The generated RDB file is then sent to the secondary server, and the secondary server loads the RDB file to update the database data. The master server then sends the command for the buffer to the slave server. Synchronize buffer data from the server.

Command propagation means that after the synchronization, the status of the primary server is consistent with that of the secondary server. However, if the primary server receives a write command again, the status of the primary server is inconsistent with that of the secondary server. In this case, the primary server sends the write command to the secondary server to ensure data consistency between the primary and secondary servers.

3.2 Partial resynchronization

You need to perform synchronization in two cases. One is when you receive the Slaveof command for the first time, and the other is when the primary and secondary servers are disconnected and reconnected. In older versions of Redis servers, synchronization is performed after disconnection, but full synchronization is performed. That is, regenerate a new RDB file. This will synchronize a large amount of data that does not need to be synchronized, or better yet, data during disconnection. So in the new version (V2.8 +) of Redis database support partial synchronization function.

New psync command support full amount heavy synchronization and part of the synchronization of two ways, which total amount heavy synchronization is to use under the condition of the first copy, and some heavy synchronization is to replicate in the wake of a break even situation, after the reconnect the master server from the server, the server will be disconnected during the execution of write command to send from the server.

The realization of partial resynchronization mainly depends on three parts:

The replication offsets of the master and slave servers
Replication backlog buffer for the primary server
Running ID of the server

With these three parts, Redis implements partial resynchronization.

3.2.1 Replication Offset

The master and slave servers that need to perform replication each maintain a replication offset, adding N to the replication offset each time the master propagates N bytes to the slave. Each N bytes received from the service will also add N to its own replication offset. If the primary and secondary servers are in the same state, the replication offset is always the same. If a wire break occurs, the copy offset will be unequal.

3.2.2 Copying backlogs

The replication backlog buffer is a first-in, first-out queue maintained by the primary server. When the master server performs command propagation, not only does it send data to the slave service, but it also sends data to the replication backlog buffer, which reserves the corresponding replication offset for each byte,

When the slave server reconnects, the slave server sends its replication offset to the master server. If the replication offset sent from the slave server is in the replication buffer, the master server performs partial resynchronization, and if the replication backlog buffer is not replicated, the master performs full resynchronization.

3.2.3 Server ID

Each from the server can record the primary server ID, after the break line reconnection, will save themselves from the server’s server ID sent to the primary server, if the primary server found that do not agree with the current server ID, will perform a full amount heavy synchronization directly, if the agreement may be using replication backlog buffer try to perform partial synchronization, If the attempt fails, full resynchronization is performed.

3.2 the sentry

Sentinel is a highly available solution to Redis: a Sentinel system consisting of one or more Sentinel instances can monitor any number of Redis instances. In addition, when the primary server is monitored to go down or go offline, a secondary server of the current primary server can be automatically upgraded to the primary server, and the new primary server continues to process requests.

3.3 leader election

When a primary server is monitored to go offline, the multiple sentinel instances monitoring the primary service elect a Leader sentinel, which is responsible for the subsequent failover process.

Each sentinel instance stores an attribute called epoch, which increments with each election, successful or not. The epoch parameter is really just a counter, not a special property. In a configuration era, each sentinel has a chance to set one sentinel as the local lead leader, which, once selected, cannot be changed in the current era.

Each sentinel that monitors an objective shutdown of the master server asks the other sentinels to set themselves as the local lead leader. If the other servers have not already set up a local lead leader, the request will be granted. Otherwise, they refuse.

Take the three Sentinel nodes as an example. The sentinels A, B, and C simultaneously monitor A primary server. When the primary server goes offline, B first sets its current_EPOCH +1, and then sets the local leader to itself. Then the SENTINEL IS-master-down-by-ADDR election command is sent to nodes A and C with current_epoch and its own RUNID (i.e. server ID), hoping that nodes A and C can elect themselves as the local lead leader.

After receiving the request, nodes A and C first determine whether the epoch carried by the request is greater than their own epoch. If the epoch is greater than their own epoch, it indicates that the current node has not selected the local leader. At this point, you update your epoch to the epoch in the request, set your leader_rUNId to B, and return leader_runid = B. Note Nodes A and C select node B as the local leader.

If the epoch in the request is not greater than its own epoch, for example, A D node also finds that the primary server is offline and sends the SENTINEL is-master-down-by-addr command to A and C. Since the current EPOCH of A and C is already 1, the current request will be rejected (returning leader_runid=B). It tells node D that node B has been elected as the local lead leader.

The SENTINEL is-master-down-by-addr command will receive a response carrying not only the Leader_RUNID but also the Leader_EPOCH. After receiving the response, node B will first judge whether the Leader_EPOCH is consistent with the current epoch. If so, it will judge whether the Leader_RUNId is consistent with the current RUNID. If so, it indicates that the target sentry has set the current node as the local leader. The complete process is shown below:

If a sentry is elected by more than half of the sentries as the local lead leader, the sentry B in the figure above becomes the lead leader. If a majority of sentries are not present for a period of time, a new election is called.

Subsequent transitions are performed by the lead leader.

3.4 Transfer Process

There are three main steps in the transfer process:

Select an appropriate slave server from the services and set it as the master.
Make all slave servers replicate the new master server
Set the offline primary server as a secondary server for the new primary server

3.4.1 Selecting a new server

Instead of randomly selecting a slave from a list of servers, sentinels go through a series of filters and then select the highest slave based on priority.

First, Sentry retrieves a list of all slave servers of the primary server that has gone offline, and then filters it in turn

Filter out all offline or disconnected slave servers to ensure that all servers in the list are online.
Filter out servers that have not responded to the lead Sentinel INFO command in the last 5 seconds to ensure that all servers in the list are communicating properly.
Filter out disconnections with offline primary servers over timedown-after-milliseconds*10Milliseconds server, ensure that the server in the list is not disconnected for a long time, data is relatively new.

If the secondary server has the same priority, the secondary server is sorted by the replication offset. If the secondary server has the same priority, the secondary server is sorted by the replication offset. If the secondary server has the same priority, the secondary server is sorted by the replication offset.

So the new server selected is the slave service with the highest priority, the highest replication offset, and the lowest RUNID.

3.4.2 Have the secondary server replicate the new primary server

The SLAVEOF no one command will be sent to the slave servers selected above to stop the replication, and the SLAVEOF IP port command will be sent to the other slave servers to make all slave servers copy the new master server.

3.4.3 Setting the offline primary server to the secondary Server

Since the old master server is offline, the Settings are saved in the sentry instance. When the old master server is online again, sentry sends the SLAVEOF command to make it the SLAVEOF the new master server.

4. Failover in cluster mode

A Redis cluster is composed of multiple Redis nodes that can communicate with each other. And each node can be a master-slave mode composed of multiple Redis servers. For example, we can combine nine nodes into a master-slave mode in threes (one master and two slaves), and then three master nodes into a cluster (three master and six slaves).

For details, see data synchronization within the Redis cluster. The following focuses on failover in cluster mode.

4.1 Fault Detection

Each node in the cluster keeps a state table of all nodes in the cluster. Something like this.

ID	role	state
1000	The master node	online
1001	From the node	online
2000	The master node	online
2001	From the node	online
3000	The master node	online
3001	From the node	online

Each node in the cluster periodically sends a PING message to another node to check whether the other node is offline. If no response is received from PONG within the specified time, the node will be marked as suspected offline (PFAIL). For example, if 1000 sends a PING message to 2000 and does not receive a response from PONG within a certain period of time, it will modify the status table it maintains and mark 2000 as suspected offline (PFAIL).

The nodes in the cluster exchange their maintained node status tables through messages. For example, whether a node is online, offline, or suspected offline. Note the swap, not the synchronization. If a node receives a suspected offline status from another node, a offline report is created and saved.

For example, 1000 received a message from 2000, 2000 flagged 3000 as suspected offline status, then. 1000 will create a referral report and save it to its own status table. The structure of the state table for 1000 is as follows:

ID	role	state	Offline report
1000	The master node	online
1001	From the node	online
2000	The master node	online
2001	From the node	online
3000	The master node	online	[2000 Report 3000 Offline]
3001	From the node	online

If more than half of the primary nodes in the cluster flag a primary node as suspected offline, the primary node will be marked as FAIL. The primary node that flags the primary node as offline will broadcast a message to the cluster, and all nodes that receive the message will mark the primary node as offline.

For example, if 1000 also marks 3000 as suspected offline and receives an offline report from 2000, 1000 marks 3000 as offline and broadcasts a message about 3000. After receiving the message, other nodes also mark 3000 as offline.

4.2 leader election

When the slave node finds that the master node copied by the slave node goes offline, it will trigger the election process, and the Redis cluster will select a new master node from the slave node of the offline master node.

The election process is very similar to the Sentry election process above. It’s all based on Raft algorithm.

First, the cluster has a counter that configures the epoch. The initial value is 0 and increments by +1 with each failover. Only the primary node in the cluster has the right to vote, and each primary node has only one chance to vote during a failover.

When a secondary node in the cluster finds that the primary node it replicates goes offline, it broadcasts a message asking all the nodes that received the message and have the right to vote to vote for it.

If the primary node has not yet voted, an ACK message is returned indicating support for the secondary node to become the primary node. If a slave node receives more than half the votes of the primary node, the slave node becomes the new primary node.

4.3 Partition problem

If the remaining nodes in the cluster except the offline nodes can just be divided by the slave nodes, in extreme cases, the votes of all slave nodes will not exceed half, resulting in the failure of the election, and then configure epoch +1 to conduct a new election. For example, if two slave nodes vote at the same time, it is possible that each slave node will receive one vote, resulting in the defeat of the election.

For example, the Redis cluster is optimized to avoid such a situation. After the primary node goes offline, the secondary node does not send messages immediately, but delays sending election messages for a certain period of time, which is random. The calculation formula is as follows:

DELAY = 500 milliseconds 
        + random delay between 0 and 500 milliseconds 
        + SLAVE_RANK * 1000 milliseconds
Copy the code

500ms plus a random time, SLAVE_RANK is the difference of the replication offset, that is, the slave node with a smaller data gap with the master node will have a smaller waiting time and is more likely to become the new master node.

4.2 Transfer Process

The newly elected master node will first execute SLAVEOF no one to stop copying the old master node. It then assigns itself the slot of the old master node. Finally, a broadcast message is sent to the cluster to let other nodes in the cluster know that the new primary node has taken over all slots and the old primary node has gone offline.

This completes the failover process, and the new master node will begin accepting data related to the slot it is responsible for.

5 end

The three working modes of Redis are constantly improving the availability of services. The most basic single-machine mode adopts THE BACKUP mode of RDB and AOF. In this mode, once the service is down, the Redis service is immediately unavailable, but can be manually restored using backup files. Manual operation is required.

In the master-slave mode, the sentry mechanism is added for monitoring. When the master node breaks down, the Redis service can still be used. The disadvantage is that there is still only one node to process commands, which cannot handle a large number of requests with high concurrency.

The cluster mode has the highest availability, and failover can be performed within the cluster without the need to introduce sentinels. Is a commonly used mode at present.