Redis cluster is a distributed database solution provided by Redis. The cluster shares data through sharding and provides replication and failover functions

node

A Redis cluster is usually composed of multiple nodes. At the beginning, each node is independent of each other, and they are all in a cluster that only contains itself. To build a real working cluster, We can connect individual nodes via CLUSTER MEET < IP > .

Sending the CLUSTER MEET command to a node tells the node to shake hands with the node specified by IP and port. When the handshake succeeds, the node adds the node specified by IP and port to the CLUSTER where the node is currently located.

Start node

A node is a Redis server running in cluster mode. When the Redis server starts up, it determines whether to enable the cluster mode according to whether the cluster-Enabled configuration option is set to Yes.

Nodes (Redis servers running in cluster mode) continue to use all server components in standalone mode, for example

  • Continue to use file event handlers to process command requests and return command replies
  • The event event handler continues to be used to execute the serverCron function, which in turn calls the clusterCron function unique to cluster mode. The clusterCron function is responsible for performing general operations that need to be performed in cluster mode. For example, check whether the node is disconnected or whether automatic failover of the offline node is required.
  • If you continue to use databases to hold key-value pair data, key-value pairs will still be different types of objects
  • Continue to use RDB persistence and AOF persistence
  • Continue to use the replication module to replicate the nodes
  • Continue pub/sub using the publish subscription module
  • Continue to use the Lua scripting environment to execute the Lua script entered by the client

In addition, the node continues to use the redisServer structure to hold the state of the server and the redisClient structure to hold the state of the client. H /clusterNode, cluster.h/clusterLink, and cluster.h/clusterState are used only in cluster mode.

Cluster data structure

ClusterNode: Stores the current status of the node, such as the creation time, node name, node configuration, node IP address and port. Each node uses a clusterNode to record its status. A clusterNode structure is not created for all other nodes in the cluster to record the status of other nodes.

typedef struct clusterNode {
    mstime_t ctime; /* Node creation time */
    char name[CLUSTER_NAMELEN]; /* The name of the node, consisting of 40 hexadecimal characters */
    int flags;      /* Node id. Use various identifiers to record the role of the node and the state of the node */
    uint64_t configEpoch; /* The current configuration era of the node for failover */
    char ip[NET_IP_STR_LEN];  /* Node IP address */
    int port;                   /* Node port number */
    clusterLink *link;          /* Save the information needed to connect the nodes */

    // ...
} clusterNode;
Copy the code

The Link attribute in the clusterNode structure is a clusterLink structure, which stores the information needed to connect nodes.

typedef struct clusterLink {
    mstime_t ctime;             /* Connection creation time */
    connection *conn;           / * * / connection
    sds sndbuf;                 /* The output buffer holds information waiting to be sent to other nodes */
    char *rcvbuf;               /* Receive buffer, which holds messages received from other nodes */
    size_t rcvbuf_len;          /* Used size of rcvbuf */
    size_t rcvbuf_alloc;        /* Used size of rcvbuf */
    struct clusterNode *node;   /* Nodes associated with this connection are not NULL */
} clusterLink;
Copy the code

Each node keeps a clusterState structure, which records the current state of the cluster from the perspective of the current node.

typedef struct clusterState {
    clusterNode *myself;  /* points to the current node */
    uint64_t currentEpoch; /* The current configuration era of the cluster for failover */
    int state;            /* Current cluster status */
    int size;             /* Number of nodes in a cluster that processes at least one slot */
    dict *nodes;          /* List of cluster nodes (including myself) Key: node name value: clusterNode structure for the node */

    // ...
} clusterState;
Copy the code

Implementation of the CLUSTER MEET command

On receipt of CLUSTER MEET < IP > Node A and node B shake hands to confirm each other’s existence.

  1. Node A creates A clusterNode structure for node B and adds the structure to its ClusterState. nodes dictionary
  2. Node A will be based onCLUSTER MEETCommand the given IP and PORT to send a MEET message to node B
  3. Node B receives the MEET message sent by node A. Then, node B creates A clusterNode structure for node A and adds the structure to its ClusterState. nodes dictionary
  4. Node B will return A PONG message to node A
  5. Node A receives the PONG message sent by node B, through which it knows that node B has successfully received the MEET message sent by itself
  6. Node A will return A PING message to node B
  7. Node B receives the PING message sent by node A, through which it knows that node A has successfully received the PONG message sent by node A, and the handshake is complete
  8. Node A transmits the information about node B to other nodes in the cluster through the Gossip protocol, so that other nodes shake hands with node B. After A period of time, node B is recognized by all nodes in the cluster

Slots assigned

The Redis cluster stores key-value pairs in the database by sharding: the whole database of the cluster is divided into 16384 slots, and each key in the database belongs to one of the 16384 slots. Each node in the cluster can handle zero or up to 16384 slots.

When all 16384 slots in the database have nodes processing, the cluster is online. Conversely, if a slot in the database is not processed, the cluster goes offline.

CLUSTER ADDSLOTS

[slot…] Command to assign one or more slots to a node.

Records the slot assignment information of a node

The slots and numSlot properties in the clusterNode structure record which slots the node is responsible for processing

typedef struct clusterNode {
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    int numslots;   /* Number of slots handled by this node */

    // ...
} clusterNode;
Copy the code
  • Slots is a binary array whose length is 16384/8 = 2048 bytes.

    Redis numbers the 16384 bits in the slots array with 0 as the starting index and 16383 as the ending index, and determines whether the node is responsible for processing slot I based on the values of all the bits on I:

    • If the bits of the Slots array on index I have a value of 1, that means the node is responsible for processing slot I
    • If the bits of the Slots array have a value of 0 on index I, then the node is not responsible for processing slot I
  • Numslots records the number of slots handled by the slots node, which is the binary number of slots arrays with a value of 1

Propagate slot assignment information for a node

In addition to logging its slots in the Slots array and numslots attributes of the cluserNode structure, a node also sends its slots array to other nodes in the cluster via message.

Record the information about all slots assigned to a cluster

The Slots array in the clusterState structure records the assignment information of all 16384 slots in the cluster

  • ifslots[i] A pointer to NULL indicates that no node has been assigned
  • ifslots[i] The pointer points to a clusterNode structure that has been assigned to the node represented by clusterNode

The presence of the clusterNode. Slots array solves some of the problems that can’t be solved efficiently by storing slot assignment information only in the clusterNode.

  • If the node only uses the ClusterNode. slots array to record the slot assignment, the program needs to traverse all clusterNode structures in the ClusterState. nodes dictionary to know whether slot I has been assigned or to which node slot I has been assigned. Check the slots array of these structures until the node responsible for processing slot I is foundO(N).
  • Using the ClusterState. slots array, the node assigned to slot I can be retrieved directly from the index of slot I. Time complexity isO(1)

Replication and failover

Nodes in the Redis cluster are divided into master node and slave node. The master node is used for processing slots, and the slave node is used to replicate a master node. When the replicated master node goes offline, it continues to process command requests on behalf of the offline master node.

Setting the slave Node

Sending the command CLUSTER REPLICATE

to a node makes the node that receives the command become the primary node of the node specified by node_id and starts replication on the primary node.

  • Receives the command of the node is first in his own clusterState. Nodes in the dictionary to find node_id clusterNode structure of the corresponding node, and will own clusterState. Myself. Slaveof pointer to this structure, To record the master node that the node is replicating.
  • Nodes will change their in clusterState. Myself. Properties of flags, close the originalREDIS_NODE_MASTERFlag, openREDIS_NODE_SLAVEOFLogo.
  • The node calls the copy method to copy the master node. Using the same replication mode as a single machine

Fault detection

Each node in the cluster periodically sends PING messages to other nodes in the cluster. If the node that receives the PING message does not return the PONG message to the node that sends the PING message within the specified time, the node that sends the PING message will mark the node that receives the PING message as suspected offline.

If more than half of the primary nodes responsible for processing operations in a cluster report a primary node X as suspected offline, then primary node X will be marked offline and a message will be broadcast about primary node X being offline, and the receiving nodes will mark primary node X as offline.

failover

  • Of all slave nodes that replicate the offline master node, one node is selected.
  • The selected secondary node will run the SLAVEOF no one command to become the new primary node
  • The new master undoes all slots assigned to the offline master and assigns them all to itself
  • The new master node broadcasts a PONG message to the cluster, which makes it immediately known to the other nodes in the cluster.
  • The new master node starts receiving command requests related to the slot it is responsible for processing, and the failover is complete.

Elects a new master node

Similar to the Sentinel election method, the lead election method was implemented based on Raft algorithm

  • The configuration era of the cluster is an increment counter that starts with 0
  • When a node in the cluster starts a failover operation, the value of the cluster configuration era is increased by one
  • For each configuration era, each primary node responsible for processing slots in the cluster has one vote, and the first slave node that asks for a vote from the primary node gets the primary node’s vote
  • When the slave node detects that the master node it is replicating has gone offline, the slave node broadcasts a messageCLUSTERMSG_TYPE_FAILOVER_AUTH_REQUESTMessage that requires all primary nodes that receive the message and are eligible to vote to the secondary node
  • If a master node has a vote and the master node has not voted for another slave node, the master node will return one to the slave node that requested the voteCLUSTERMSG_TYPE_FAILOVER_AUTH_ACKMessage indicating support
  • Each participating elector from the node statistics itself receivedCLUSTERMSG_TYPE_FAILOVER_AUTH_ACKNumber of messages
  • If when a gets from a nodeN / 2 + 1The new master node is elected with a vote of support.
  • If there are not enough support tickets collected from nodes in a configuration era, the cluster enters a new configuration era. And elected again until elected.