Redis Cluster is a distributed implementation of Redis. As the official cluster specification emphasizes, its design prioritizes high performance and linear scalability, while providing write safety only on a best-effort basis.

Write loss refers to the situation where a write that has already been acknowledged to the client turns out, on later reads, never to have been applied or to have been lost. It can be caused by master-slave failover, instance restarts, and split brain. The following sections analyze these causes one by one.

Master-slave switchover

Failover may cause route changes. The active and passive cases need to be discussed separately.

Passive failover

Assume the cluster is healthy, node C is the master serving slots 1-100, and its slave is C’.

If master C crashes, C is marked FAIL after at most roughly cluster_node_timeout, which triggers the failover logic on slave C’.

Before slave C’ has been promoted, slots 1-100 are still owned by the (now unreachable) node C, so requests for those slots fail. Once C’ becomes master, the route change is propagated via gossip. During this propagation, clients that reach C’ get normal responses, while clients that hit other nodes still holding the old route are redirected (MOVED) to the dead node C and see access errors.
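The redirection behavior can be observed directly from a client. Below is a minimal hiredis-based sketch (the library, node address, and key are illustrative assumptions, not part of the Redis source quoted later) that detects the -MOVED reply returned by a node still holding a stale route:

    #include <stdio.h>
    #include <string.h>
    #include <hiredis/hiredis.h>

    int main(void) {
        /* Hypothetical node that still holds a stale route. */
        redisContext *ctx = redisConnect("127.0.0.1", 7000);
        if (ctx == NULL || ctx->err) return 1;

        redisReply *reply = redisCommand(ctx, "SET somekey somevalue");
        if (reply != NULL && reply->type == REDIS_REPLY_ERROR &&
            strncmp(reply->str, "MOVED", 5) == 0) {
            /* e.g. "MOVED 866 127.0.0.1:7001": a cluster-aware client
             * re-parses the slot and address and retries against the
             * node that now owns the slot. */
            printf("redirected: %s\n", reply->str);
        }
        if (reply != NULL) freeReplyObject(reply);
        redisFree(ctx);
        return 0;
    }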

In this scenario, the only way a write can be lost is through the asynchronous master-slave replication mechanism.

If the master crashes after acknowledging a write but before replicating it to the slave, that write is lost (there is no merging of the old master's data when it later rejoins). Because the master replies to the client and propagates to the slave almost simultaneously, the window is small, but the risk is real.
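If an application cannot tolerate even this small window, it can trade latency for safety with the WAIT command, which blocks until the write has been acknowledged by the requested number of replicas; this does not make the write fully durable, but it narrows the window. A minimal sketch, assuming hiredis and that ctx is already connected to the master (both assumptions, not from the original text):

    #include <hiredis/hiredis.h>

    /* Write a key, then block up to 100 ms until at least one replica
     * has acknowledged the replicated write. Returns 0 on success. */
    int safer_set(redisContext *ctx, const char *key, const char *val) {
        redisReply *reply = redisCommand(ctx, "SET %s %s", key, val);
        if (reply == NULL) return -1;
        if (reply->type == REDIS_REPLY_ERROR) { freeReplyObject(reply); return -1; }
        freeReplyObject(reply);

        reply = redisCommand(ctx, "WAIT 1 100");  /* WAIT numreplicas timeout(ms) */
        long long acked = (reply && reply->type == REDIS_REPLY_INTEGER) ? reply->integer : 0;
        if (reply != NULL) freeReplyObject(reply);
        return acked >= 1 ? 0 : -1;
    }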

Active failover

An active failover is triggered by an administrator running the CLUSTER FAILOVER [FORCE | TAKEOVER] command on a slave node.
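For illustration, such a failover might be triggered from a small hiredis program connected to the replica that should take over (the address and library usage are illustrative assumptions):

    #include <stdio.h>
    #include <hiredis/hiredis.h>

    int main(void) {
        /* Hypothetical replica that should be promoted. */
        redisContext *ctx = redisConnect("127.0.0.1", 7101);
        if (ctx == NULL || ctx->err) return 1;

        /* Plain "CLUSTER FAILOVER" runs the coordinated procedure described
         * below; append FORCE or TAKEOVER for the variants analyzed later. */
        redisReply *reply = redisCommand(ctx, "CLUSTER FAILOVER");
        if (reply != NULL) {
            printf("%s\n", reply->str ? reply->str : "");
            freeReplyObject(reply);
        }
        redisFree(ctx);
        return 0;
    }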

The complete manual failover (MF) process, discussed in detail in a previous post, can be summarized in the following six steps:

  1. The slave initiates the request; the gossip message carries the CLUSTERMSG_TYPE_MFSTART flag.
  2. The master pauses its client service for 2 * CLUSTER_MF_TIMEOUT (10 s in the current version).
  3. The slave catches up with the master's replication offset.
  4. The slave starts an election and is eventually elected.
  5. The slave switches its role, takes over the slots, and broadcasts the new routing information.
  6. The other nodes update their routes until the cluster routing converges.

The three forms of the command (default, FORCE, TAKEOVER) behave differently and are analyzed below.

(1) Default form. The complete MF process is executed and the master pauses its client service before the handover, so there is no write loss.

(2) FORCE option. The process starts from step 4. While slave C’ is collecting votes, master C still serves client requests normally and replicates to the slave asynchronously, which may lose writes. The manual failover times out after CLUSTER_MF_TIMEOUT (5 s in the current version); the timeout is checked on each clusterCron run.

(3) TAKEOVER option. The process starts from step 5. The slave bumps its own configEpoch directly (without the agreement of the other nodes) and takes over the slots. Between the moment C’ becomes master and the moment the old master C updates its route, writes sent to C can be lost; this normally resolves within one ping interval, so the window is small. For nodes other than C and C’, a stale route merely costs the client one extra MOVED redirection and does not lose writes.

Master restart

Cluster state initialization

The clusterState structure has a state member that represents the global state of the cluster and controls whether the current node can serve requests. It has two possible values:

    #define CLUSTER_OK 0    /* Everything looks ok */
    #define CLUSTER_FAIL 1  /* The cluster can't work */

After a server restart, state is initialized to CLUSTER_FAIL; the logic is in the clusterInit function.
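The relevant assignment, trimmed down to the line that matters (not a verbatim excerpt of the whole function), is essentially:

    void clusterInit(void) {
        ...
        server.cluster->state = CLUSTER_FAIL;
        ...
    }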

While the cluster is in the CLUSTER_FAIL state, access is denied. The relevant code in processCommand is:

    int processCommand(client *c) {
        ...
        if (server.cluster_enabled &&
            !(c->flags & CLIENT_MASTER) &&
            !(c->flags & CLIENT_LUA &&
              server.lua_caller->flags & CLIENT_MASTER) &&
            !(c->cmd->getkeys_proc == NULL && c->cmd->firstkey == 0 &&
              c->cmd->proc != execCommand))
        {
            int hashslot;
            int error_code;
            clusterNode *n = getNodeByQuery(c, c->cmd, c->argv, c->argc,
                                            &hashslot, &error_code);
            ...
        }
        ...
    }

The focus is on the getNodeByQuery function, which, when cluster mode is enabled, determines which node should actually execute the command.

Note: Redis Cluster manages routing in a decentralized way. A client can send a command to any node; if that node does not own the key's slot, it replies with a -MOVED redirection error pointing to the node that does.

Let's look at part of the logic of the getNodeByQuery function:

    clusterNode *getNodeByQuery(client *c, struct redisCommand *cmd,
                                robj **argv, int argc,
                                int *hashslot, int *error_code) {
        ...
        if (server.cluster->state != CLUSTER_OK) {
            if (error_code) *error_code = CLUSTER_REDIR_DOWN_STATE;
            return NULL;
        }
        ...
    }

As you can see, the cluster must be in the CLUSTER_OK state.

This restriction is exactly what ensures write safety!

Imagine that master A fails and its slave A’ is elected as the new master. **Now A restarts, and some clients still hold the old route and keep writing to A; if A accepted these writes, they would be lost!** A’ is now the real master of this shard. Therefore, after A restarts it must refuse service until its routing has been brought up to date.

Cluster state change

So when does the state change from CLUSTER_FAIL to CLUSTER_OK? The answer lies in the clusterCron periodic task.

    void clusterCron(void) {
        ...
        if (update_state || server.cluster->state == CLUSTER_FAIL)
            clusterUpdateState();
    }

The key logic is in the clusterUpdateState function.

    #define CLUSTER_WRITABLE_DELAY 2000

    void clusterUpdateState(void) {
        static mstime_t first_call_time = 0;
        ...
        if (first_call_time == 0) first_call_time = mstime();
        if (nodeIsMaster(myself) &&
            server.cluster->state == CLUSTER_FAIL &&
            mstime() - first_call_time < CLUSTER_WRITABLE_DELAY) return;

        new_state = CLUSTER_OK;
        ...
        if (new_state != server.cluster->state) {
            ...
            server.cluster->state = new_state;
        }
    }

CLUSTER_WRITABLE_DELAY is 2000 ms: a master that starts in the CLUSTER_FAIL state delays becoming writable for 2 seconds.

The delay exists to wait for the route to be refreshed. When does that happen? When a node has just started, its links to the other nodes are NULL; clusterCron detects this, establishes the connections, and pings them. Since the restarted node carries an outdated route, the other nodes respond with UPDATE messages and it corrects its routing table.

The CLUSTER_WRITABLE_DELAY window is normally long enough for this route update to complete.
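During this window a client sees the node reject commands with a -CLUSTERDOWN error (the reply produced for CLUSTER_REDIR_DOWN_STATE). A minimal sketch of a client that simply waits the window out, assuming hiredis and an already connected context (not part of the article's Redis sources):

    #include <string.h>
    #include <unistd.h>
    #include <hiredis/hiredis.h>

    /* Retry a write while the restarted node still answers
     * "CLUSTERDOWN The cluster is down" (state == CLUSTER_FAIL). */
    int set_with_retry(redisContext *ctx, const char *key, const char *val) {
        for (int attempt = 0; attempt < 15; attempt++) {
            redisReply *reply = redisCommand(ctx, "SET %s %s", key, val);
            if (reply == NULL) return -1;
            int down = (reply->type == REDIS_REPLY_ERROR &&
                        strncmp(reply->str, "CLUSTERDOWN", 11) == 0);
            freeReplyObject(reply);
            if (!down) return 0;
            usleep(200 * 1000);   /* back off 200 ms; the delay is only ~2 s */
        }
        return -1;
    }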

Network partition

During the partition

Because networks are unreliable, partitions must be considered; this is the P in the CAP theorem.

After a partition, the cluster splits into a majority side and a minority side, counted by the number of master nodes on each side.

(1) On the minority side, slaves may start elections, but they cannot collect votes from a majority of masters, so a normal failover cannot complete. Meanwhile, in clusterCron the unreachable nodes (i.e. most of the cluster) get marked CLUSTER_NODE_PFAIL, which triggers the clusterUpdateState logic, roughly as follows:

    void clusterCron(void) {
        ...
        di = dictGetSafeIterator(server.cluster->nodes);
        while ((de = dictNext(di)) != NULL) {
            ...
            delay = now - node->ping_sent;
            if (delay > server.cluster_node_timeout) {
                if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
                    serverLog(LL_DEBUG, "*** NODE %.40s possibly failing",
                              node->name);
                    node->flags |= CLUSTER_NODE_PFAIL;
                    update_state = 1;
                }
            }
        }
        ...
        if (update_state || server.cluster->state == CLUSTER_FAIL)
            clusterUpdateState();
    }

The clusterUpdateState function then changes the cluster state:

    void clusterUpdateState(void) {
        static mstime_t among_minority_time;
        ...
        {
            dictIterator *di;
            dictEntry *de;

            server.cluster->size = 0;
            di = dictGetSafeIterator(server.cluster->nodes);
            while ((de = dictNext(di)) != NULL) {
                clusterNode *node = dictGetVal(de);
                if (nodeIsMaster(node) && node->numslots) {
                    server.cluster->size++;
                    if ((node->flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL)) == 0)
                        reachable_masters++;
                }
            }
            dictReleaseIterator(di);
        }
        {
            int needed_quorum = (server.cluster->size / 2) + 1;

            if (reachable_masters < needed_quorum) {
                new_state = CLUSTER_FAIL;
                among_minority_time = mstime();
            }
        }
        ...
    }

As the code shows, a node on the minority side switches its state to CLUSTER_FAIL only after some time has passed. A minority master B therefore keeps accepting writes until the state flips, and every write accepted in that window will be lost!

The size of this window can be estimated from the clusterCron logic.

A partitioned-away master is marked PFAIL once no pong has been received from it for cluster_node_timeout; since B pings every node at least once per cluster_node_timeout/2, it does not have to wait much longer than that before marking the cluster CLUSTER_FAIL. The window is therefore on the order of cluster_node_timeout (roughly 15 s with the default cluster-node-timeout of 15000 ms).

In addition, the moment the node falls into the minority and disables service is recorded in among_minority_time.

(2) On the majority side, slaves start elections. Taking B's slave B’ as an example, it fails over, becomes the new master, and starts serving.

If the partition lasts less than cluster_node_timeout, no PFAIL flag appears, no failover is triggered, and no writes are lost.

Partition recovery

When the partition heals, master B rejoins the cluster. Before it can serve again, it must change its cluster state from CLUSTER_FAIL back to CLUSTER_OK.

But B still carries the old route and is about to be demoted to a slave, so it must again wait for a while for its routing to be corrected; otherwise writes could be lost, as analyzed earlier. This wait is also implemented in the clusterUpdateState function.

    #define CLUSTER_MAX_REJOIN_DELAY 5000
    #define CLUSTER_MIN_REJOIN_DELAY 500

    void clusterUpdateState(void) {
        ...
        if (new_state != server.cluster->state) {
            mstime_t rejoin_delay = server.cluster_node_timeout;

            if (rejoin_delay > CLUSTER_MAX_REJOIN_DELAY)
                rejoin_delay = CLUSTER_MAX_REJOIN_DELAY;
            if (rejoin_delay < CLUSTER_MIN_REJOIN_DELAY)
                rejoin_delay = CLUSTER_MIN_REJOIN_DELAY;

            if (new_state == CLUSTER_OK &&
                nodeIsMaster(myself) &&
                mstime() - among_minority_time < rejoin_delay)
            {
                return;
            }
        }
    }

The delay equals cluster_node_timeout, clamped to at most 5 s (CLUSTER_MAX_REJOIN_DELAY) and at least 500 ms (CLUSTER_MIN_REJOIN_DELAY).

Summary

A failover may lose writes because of the election window and the replication lag inherent in asynchronous replication.

After a master restart, the CLUSTER_WRITABLE_DELAY delay is applied; only once the cluster state turns CLUSTER_OK does the node serve requests again, so no writes are lost.

On the minority side of a partition, writes accepted before the state switches to CLUSTER_FAIL may be lost.

After the partition heals, the returning master waits rejoin_delay; only once the cluster state turns CLUSTER_OK does it serve requests again, so no writes are lost.