1. Introduction

From the previous part (Redis core part (3) - High availability architecture (1): master-slave and Sentinel), we saw that Sentinel can fail over automatically when a master goes down, keeping the whole Redis service highly available. Its pain point, however, is that it does not solve the data capacity problem. As the business grows, Redis may need to cache tens or even hundreds of gigabytes of data. Scaling vertically by simply adding machine memory not only drives up the cost of expansion, it also makes data persistence painfully slow and hurts the whole service's response time. So, starting with Redis 3.0, the official horizontal-scaling solution was provided: by introducing Cluster, data is distributed across different nodes, reducing the data held by any single instance and achieving data sharding.

2. Cluster Basic running unit – Node Node

As usual, a picture first: a Redis cluster is normally composed of multiple nodes. At the start, each node is independent and knows only about the cluster containing itself; to form a real working cluster, we must connect the individual nodes together into one cluster that contains all of them.

If an ordinary Redis service wants to become a cluster node, cluster mode must be enabled before the service starts:

cluster-enabled yes

When starting, Redis checks whether the cluster-enabled option is set to yes. If it is, the server becomes a node running in cluster mode; otherwise it starts as an ordinary Redis server. In cluster mode, a node keeps all the features of a normal Redis server, such as data persistence and replication, and additionally maintains cluster-specific data structures such as clusterNode, clusterLink and clusterState. The roles of these data structures are briefly introduced in the following sections.

Careful readers may have noticed that the Redis Cluster architecture is also a decentralized high-availability model.

2.1 Node Association

You can use the CLUSTER MEET command to associate nodes that are not in the same CLUSTER. The command format is as follows:

CLUSTER MEET <ip> <port>

As shown in the figure below, by sending the CLUSTER MEET command to node A, a client can ask node A to add another node B to the cluster node A currently belongs to. Node A then shakes hands with node B, laying the groundwork for their further communication.

After the handshake completes, the cluster data of node A and node B each gain a copy of the other's node information. At the same time, node A spreads node B's information to the other nodes in the cluster via the Gossip protocol, so that they can also shake hands with node B. After a while, node B is recognized by every other node in the cluster, and node B's information is stored in the cluster data of all the other nodes as well.

3. Implementation principle of Cluster

Since Redis 3.0, Redis has officially provided a cluster solution to the data capacity problem. The solution maps data to instances through 16384 hash slots (hereafter just "slots"), sharding the data and spreading it evenly across the nodes in the cluster.

3.1 Hash Slots

3.1.1 Slot Assignment

You can use the CLUSTER MEET command to associate nodes, but the associated cluster still cannot provide services, because none of its 16384 slots has been assigned to a node yet. Running CLUSTER INFO at this point shows that the cluster is still offline.

You can assign one or more slots to a node by sending the CLUSTER ADDSLOTS command:

redis-cli -h 127.0.0.1 -p 7000 cluster addslots {0..5460}
redis-cli -h 127.0.0.1 -p 7001 cluster addslots {5461..10922}
redis-cli -h 127.0.0.1 -p 7002 cluster addslots {10923..16383}

The above instructions allocate hash slots for each instance: Instance 1 is responsible for hash slots 0 to 5460, instance 2 for hash slots 5461 to 10922, and instance 3 for hash slots 10923 to 16383

Once all 16384 slots have been assigned to their nodes, the cluster comes online; running CLUSTER INFO now reports the following:

3.1.2 Slot Information Storage

After the slot assignment is complete, each node in the cluster has its own associated slots. The associated slot information is stored in the clusterNode structure:

struct clusterNode {
    // 2048-byte bitmap covering the 16384 slots
    unsigned char slots[16384/8];
    // number of slots served by this node
    int numslots;
};
  • numslots: the number of slots managed by the current node instance.
  • slots: a 2048-byte (16384/8) binary array containing 16384 bits. Starting from index 0, each bit represents one slot. A bit value of 1 means the slot is handled by the current instance; 0 means it is not.

If the bits at indexes 1, 3, 5, 8, 9 and 10 are 1 while all other bits are 0, the node is responsible for handling slots 1, 3, 5, 8, 9 and 10.
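As a sketch (the helper names here are mine, not the actual Redis source; the real implementation lives in cluster.c, and its exact bit orientation is an implementation detail), the bitmap test looks like this in C:

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Illustrative stand-in for the clusterNode slots bitmap. */
typedef struct {
    unsigned char slots[CLUSTER_SLOTS / 8]; /* 2048-byte bitmap, 1 bit per slot */
    int numslots;                           /* how many bits are set */
} node_slots;

/* Mark `slot` as served by this node. */
static void slot_set(node_slots *n, int slot) {
    unsigned char mask = (unsigned char)(1 << (slot & 7));
    if (!(n->slots[slot >> 3] & mask)) {
        n->slots[slot >> 3] |= mask;
        n->numslots++;
    }
}

/* Does this node serve `slot`? */
static int slot_is_set(const node_slots *n, int slot) {
    return (n->slots[slot >> 3] >> (slot & 7)) & 1;
}
```

Because the whole assignment fits in 2048 bytes, a node can attach its complete slot ownership to a single cluster message cheaply.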

3.1.3 Slot Information Broadcast

In addition to recording its slots in the clusterNode slots and numslots fields, a node also sends its slots array to the other nodes in the cluster to tell them which slots it is currently handling.

Node 1 broadcasts its slot information to nodes 2 and 3; likewise, nodes 2 and 3 broadcast theirs:

Each node in the cluster sends its slot information to the other nodes, and each receiving node stores this information in the slots array of the clusterState structure:

typedef struct clusterState {
    // one pointer per slot: the node responsible for that slot
    clusterNode *slots[16384];
} clusterState;

The slots array contains 16384 entries, each of which is a pointer to a clusterNode structure:

  • If slots[i] is NULL, slot i has not been assigned to any node
  • If slots[i] points to a clusterNode, slot i has been assigned to the node that clusterNode represents.
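A minimal sketch of that lookup (types abbreviated, names mine): since the array holds one pointer per slot, finding the owner of any slot is a single O(1) index operation.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

typedef struct cluster_node { int port; } cluster_node; /* stand-in node */

/* clusterState-style routing table: one owner pointer per slot. */
typedef struct {
    cluster_node *slots[CLUSTER_SLOTS];
} cluster_state;

/* Returns the node owning `slot`, or NULL if the slot is unassigned. */
static cluster_node *slot_owner(cluster_state *cs, int slot) {
    if (slot < 0 || slot >= CLUSTER_SLOTS) return NULL;
    return cs->slots[slot];
}
```

The two representations complement each other: the per-node bitmap answers "which slots do I serve?" compactly, while this pointer array answers "who serves slot i?" instantly.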

3.1.4 Reassigning Hash Slots

Slot reassignment can move any number of slots already assigned to one node (the source node) to another node (the target node), together with the key-value pairs stored in those slots.

Reassignment does not interrupt the cluster's service: it can be carried out online, and throughout the process both the source and target nodes continue to handle client requests.

Reassignment is performed by redis-trib, Redis's cluster management tool, which provides all the commands required; redis-trib works by sending commands to the source and target nodes:

  • Step 1: redis-trib sends the CLUSTER SETSLOT <slot> IMPORTING <source_id> command to the target node, preparing it to import the data of slot X
  • Step 2: redis-trib sends the CLUSTER SETSLOT <slot> MIGRATING <target_id> command to the source node, preparing it to migrate the data of slot X
  • Step 3: redis-trib sends the CLUSTER GETKEYSINSLOT <slot> <count> command to the source node to fetch up to count keys belonging to slot X
  • Step 4: for each key obtained in Step 3, redis-trib has the source node move it to the target node atomically with the MIGRATE command
  • Steps 3 and 4 are repeated until all data in slot X has been migrated; finally, CLUSTER SETSLOT <slot> NODE <target_id> is sent so the new assignment is propagated to the whole cluster

If a reassignment involves multiple slots, redis-trib performs the steps above for each given slot in turn.
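The batch loop of steps 3 and 4 can be modelled in a few lines of C. This is a toy in-memory model with made-up names; the real thing issues CLUSTER GETKEYSINSLOT and MIGRATE over the network:

```c
#include <assert.h>
#include <string.h>

enum { MAX_KEYS = 64 };

/* Toy node: the keys of one slot, as plain ints. */
typedef struct { int nkeys; int keys[MAX_KEYS]; } node_keys;

/* Step 3: fetch up to `count` keys of the slot still on the source. */
static int getkeysinslot(const node_keys *src, int count, int *out) {
    int n = src->nkeys < count ? src->nkeys : count;
    memcpy(out, src->keys, (size_t)n * sizeof(int));
    return n;
}

/* Step 4: move one batch from source to target. */
static void migrate_batch(node_keys *src, node_keys *dst, int n) {
    for (int i = 0; i < n; i++)
        dst->keys[dst->nkeys++] = src->keys[i];
    memmove(src->keys, src->keys + n, (size_t)(src->nkeys - n) * sizeof(int));
    src->nkeys -= n;
}

/* Repeat steps 3 and 4 until the slot is empty on the source. */
static void reshard_slot(node_keys *src, node_keys *dst, int batch) {
    int buf[MAX_KEYS], n;
    while ((n = getkeysinslot(src, batch, buf)) > 0)
        migrate_batch(src, dst, n);
}
```

The batching is what makes online migration possible: between batches, both nodes are free to serve client requests, which is exactly the window where the ASK error described below comes into play.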

3.1.5 MOVED Error, ASK Error

When a client starts for the first time, it pulls the slot information from the cluster and caches it locally. The figure above shows the startup log of a local test project:

When making a request, the client computes the key's CRC16 checksum and takes it modulo 16384 to obtain the corresponding slot. It then uses its locally cached slot-to-instance mapping to locate the instance holding the data, and sends the request to that instance.
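This can be reproduced with a bitwise CRC16 in the CCITT/XModem variant, which is the checksum the Redis Cluster specification prescribes (real clients use a table-driven version and also honour {hash tags}; both are omitted in this sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* CRC16-CCITT (XModem): polynomial 0x1021, initial value 0. */
static uint16_t crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i]) << 8;
        for (int j = 0; j < 8; j++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Slot of a key: CRC16 of the key, modulo the 16384 slots. */
static int key_slot(const char *key) {
    return crc16(key, strlen(key)) % 16384;
}
```

The Redis Cluster specification gives CRC16("123456789") = 0x31C3, which makes a handy sanity check for any client implementation.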

If the cluster's slot information has changed and the client sends a request to an instance that no longer holds the corresponding data, the Redis instance returns a MOVED or ASK error telling the client to send the request to another instance.

3.1.5.1 ASK error

An ASK error occurs midway through a reassignment, when redis-trib has already migrated part of a slot's data to the target node while the rest still sits on the source node:

If the requested key is found on the current node, the command is executed; otherwise the node responds with an ASK error pointing at the target node. The client then sends node 2 an ASKING command, immediately followed by the actual operation command.

An ASK error does not update the hash slot mapping cached by the client, so subsequent requests for slot X still go to the node1 instance first.

3.1.5.2 MOVED error

A MOVED error means the data has permanently moved to another instance (for example after rebalancing): when a client sends a key-value operation to an instance that is not responsible for the key's slot, the instance returns a MOVED error directing the client to the node in charge of that slot.

At the same time, the client updates its local cache, correcting the mapping between the slot and the Redis instance.

3.2 Failover

3.2.1 Fault Detection

The previous high-availability article showed that Sentinel implements fault detection mainly through monitoring. How does a cluster detect faults automatically?

3.2.1.1 PFAIL Suspected offline

Each node in the cluster periodically sends PING messages to the other nodes to check whether they are online. If a node that receives a PING does not return a PONG within the specified time, the node that sent the PING marks it as probably failed (PFAIL, suspected offline).

3.2.1.2 FAIL Confirmed Offline

A node being suspected offline does not mean that all nodes consider it offline. Each node spreads its failure reports to the others via Gossip; when the number of nodes reporting a node as PFAIL reaches a majority of the cluster's masters, the node can be marked as FAIL (offline). The FAIL state is then broadcast to the whole cluster, forcing the other nodes to accept the fact that the node is down, and a primary/secondary switchover for the failed node begins immediately.

3.2.2 Automatic Primary/Secondary Switchover

When a slave finds that its master node is offline, it starts a failover of the offline master. The failover proceeds as follows:

  • Elect a master: one of the offline master's slaves is selected and converted into the new master
  • The new master revokes all slot assignments of the offline master and assigns those slots to itself
  • The new master broadcasts a PONG message to the cluster, letting the other nodes know immediately that this slave has become a master and has taken over the slots previously handled by the offline node
  • The new master starts accepting command requests for the slots it is now responsible for, and the failover is complete

The process of electing a new master:

  • The cluster's configuration epoch is incremented by 1. The epoch is a self-incrementing counter with an initial value of 0, increased by 1 each time a failover is performed.
  • A slave that detects its master offline broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to the cluster, asking every master with voting rights that receives it to vote for this slave.
  • A master that has not yet voted for another slave in this epoch returns a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message to the requesting slave, indicating it supports that slave becoming the new master.
  • Each slave taking part in the election counts the CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK messages it receives; if it collects at least (N/2) + 1 supporting votes, where N is the number of voting masters, it is elected the new master.
  • If no slave collects enough votes within a configuration epoch, the cluster enters a new epoch and the election is run again, until a new master is elected.
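The vote threshold above is plain majority arithmetic; assuming N voting masters, a candidate needs:

```c
#include <assert.h>

/* Minimum votes a slave needs to win: more than half of the N voting masters. */
static int failover_quorum(int voting_masters) {
    return voting_masters / 2 + 1;
}
```

With three masters, two votes suffice; with five, three. Note that a cluster of four masters also needs three votes, which is one reason odd numbers of voting masters are the usual recommendation.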

The selection of a new master is very similar to the selection of a sentinel leader because both are implemented using the Raft algorithm’s leader election method.

4. Cluster Scale

With Cluster, does a Redis cluster no longer need to worry about data capacity, able to scale out indefinitely? The answer is no, because intra-cluster message traffic grows with the number of nodes. After extensive verification, Redis officially recommends keeping a cluster to about 1000 instances.

Intra-cluster communication consists of the following types of messages:

  • MEET message: node association
  • PING message: by default, every second each node randomly selects five nodes from its known-node list and sends a PING to the one it has not pinged for the longest time, to check whether it is online
  • PONG message: when the receiver gets a MEET or PING from the sender, it returns a PONG to confirm the message arrived. A node can also broadcast its own PONG to the cluster, so that the other nodes immediately refresh their view of it
  • FAIL message: broadcasts the news that a suspected-offline master has been marked as failed

4.1 Protocol message body of Gossip

Cluster nodes use the Gossip protocol to exchange status information about different nodes. The message body of the Gossip protocol is composed of the clusterMsgDataGossip structure:

typedef struct {
    char nodename[CLUSTER_NAMELEN]; // 40 bytes
    uint32_t ping_sent;             // 4 bytes
    uint32_t pong_received;         // 4 bytes
    char ip[NET_IP_STR_LEN];        // 46 bytes
    uint16_t port;                  // 2 bytes
    uint16_t cport;                 // 2 bytes
    uint16_t flags;                 // 2 bytes
    uint32_t notused1;              // 4 bytes
} clusterMsgDataGossip;

Whether a message is a MEET, PING or PONG message is determined by the type attribute in the message header.

So each gossip entry sent in a message occupies 104 bytes. In a cluster of 1000 instances, a single PING message can take about 10KB to send, and with the corresponding PONG reply the exchange reaches roughly 20KB. As the cluster grows, more and more heartbeat traffic occupies the cluster's network bandwidth and reduces its throughput.
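The 104-byte figure falls straight out of the field sizes above, and the 10KB estimate assumes each PING carries gossip entries for about a tenth of the cluster (the 1/10 sampling is the commonly cited figure; exact counts vary by version):

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_NAMELEN 40
#define NET_IP_STR_LEN  46

/* Byte total of one clusterMsgDataGossip entry, summed field by field. */
static size_t gossip_entry_bytes(void) {
    return CLUSTER_NAMELEN /* nodename      */
         + 4               /* ping_sent     */
         + 4               /* pong_received */
         + NET_IP_STR_LEN  /* ip            */
         + 2               /* port          */
         + 2               /* cport         */
         + 2               /* flags         */
         + 4;              /* notused1      */
}

/* Approximate gossip payload of one PING in an n-instance cluster,
 * assuming entries for n/10 instances are attached. */
static size_t ping_gossip_bytes(int n_instances) {
    return gossip_entry_bytes() * (size_t)(n_instances / 10);
}
```

This is why the overhead scales with cluster size: every extra instance adds entries to every other instance's heartbeats.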

5. Summary

  • Using Redis Cluster, you can solve the data capacity problem that sentinels cannot solve
  • The 16,384 slots in the cluster can be assigned to each node in the cluster, and each node records which slots are assigned to it and which slots are assigned to other nodes
  • A MOVED error indicates that responsibility for a slot has permanently moved from one node to another, while an ASK error is a temporary measure used only while a slot is being migrated between two nodes and does not update the client's local cache
  • Nodes in a cluster communicate by sending and receiving messages. Common messages include MEET, PING, PONG, and FAIL
  • Clusters cannot grow indefinitely, and because clusters propagate cluster instance information through the Gossip protocol, communication overhead is the main reason for limiting cluster size