Redis Cluster Gossip protocol

Hello everyone, I am Li Xiaobing. Today I will talk about the Gossip protocol and cluster operations of Redis Cluster. The mind map of the article is as follows.

Cluster mode and Gossip introduction

In data storage, once the data volume or request traffic reaches a certain level, distribution becomes unavoidable. Redis, for example, has excellent single-instance performance, but it still had to introduce clustering for the following reasons.

A single machine cannot guarantee high availability; multiple instances are needed to provide it.
A single instance can serve roughly 80,000 QPS; higher QPS requires multiple instances.
The amount of data a single machine can hold is limited; multiple instances are needed to handle more data.
The network traffic of a single machine can exceed the upper limit of its network interface card (NIC); multiple instances are needed to spread the traffic.

Clusters usually need to maintain some metadata, such as instance IP addresses and the slot information of the cache shards, so a mechanism is needed to keep that metadata consistent. Such mechanisms generally come in two modes: decentralized and centralized.

Decentralized means that the metadata is stored on some or all nodes, and changes and consistency are maintained through continuous communication between the nodes. Redis Cluster, Consul, and others use this mode.

Centralized means that the cluster metadata is centrally stored on external nodes or middleware, such as ZooKeeper. Older versions of Kafka and Storm use this mode.

The two modes have their own advantages and disadvantages, as shown in the following table:

Mode	Advantages	Disadvantages
Centralized	Metadata is updated and read promptly. Once the metadata changes, it is written to the centralized external node immediately, and other nodes see the change the next time they read it.	Update pressure is concentrated on the external node, which as a single point can affect the whole system.
Decentralized	Update pressure is dispersed. Metadata updates are not concentrated on one node; update requests are handled by different nodes, which introduces some delay but reduces concurrent pressure.	Updates propagate with a delay, so the cluster's view of a change may lag.

In the decentralized mode, several algorithms are available for metadata synchronization, such as Paxos, Raft, and Gossip. Paxos and Raft need a majority of the nodes (more than half) to be running for the cluster to work, whereas Gossip has no such requirement.

The Gossip protocol, as its name implies, spreads information across the network in a random, contagious way and makes all nodes in the system consistent within a certain period of time. Understanding this protocol gives you a good grasp of one of the most commonly used algorithms for achieving eventual consistency, and a handy tool for achieving eventual consistency in your own data later on.

The Gossip protocol, also known as the epidemic protocol, is a protocol for exchanging information between nodes or processes that is modeled on the way epidemics spread. It is widely used in P2P networks and distributed systems, and its methodology is very simple:

In a cluster in a bounded network, if each node randomly exchanges certain information with other nodes, after a long enough time, the cognition of each node in the cluster will converge to the same information.

By “certain information” I mean the cluster status, the status of each node, and other metadata. The Gossip protocol fits the BASE principles and can be used in any domain that calls for eventual consistency, such as distributed storage and service registries. It also makes elastic clusters easy to build: nodes can join or leave at any time, and the protocol provides fast failure detection and dynamic load balancing.

In addition, the biggest benefit of the Gossip protocol is that the load on each node stays almost constant even as the number of nodes in the cluster grows. This is what allows systems such as Redis Cluster and Consul to scale horizontally to hundreds or even thousands of nodes.
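The convergence idea described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not how Redis implements gossip: the function name gossip_rounds, the fixed limits, and the small linear congruential generator are all made up for the example. One node starts with a piece of information, and in every round each informed node tells one randomly chosen peer.

```c
#include <stdint.h>
#include <string.h>

#define MAX_NODES 1024
#define MAX_ROUNDS 10000

/* Tiny linear congruential generator so the sketch is deterministic. */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Hypothetical helper: returns the number of rounds until every node
 * holds the rumor, or -1 if n is out of range or MAX_ROUNDS is reached. */
int gossip_rounds(int n, uint32_t seed) {
    unsigned char informed[MAX_NODES];
    unsigned char next[MAX_NODES];
    uint32_t rng = seed;
    int count = 1;

    if (n < 1 || n > MAX_NODES) return -1;
    memset(informed, 0, sizeof(informed));
    informed[0] = 1; /* node 0 starts with the rumor */

    for (int round = 0; round < MAX_ROUNDS; round++) {
        if (count == n) return round;
        memcpy(next, informed, sizeof(next));
        for (int i = 0; i < n; i++) {
            if (!informed[i]) continue;
            /* Each informed node tells one random peer other than itself. */
            int peer = (i + 1 + (int)(lcg_next(&rng) % (uint32_t)(n - 1))) % n;
            if (!next[peer]) { next[peer] = 1; count++; }
        }
        memcpy(informed, next, sizeof(informed));
    }
    return -1;
}
```

Because the number of informed nodes can at most double per round, a cluster of 64 nodes needs at least 6 rounds in this model, while each node only sends one message per round regardless of cluster size.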

Redis Cluster Gossip communication mechanism

Redis introduced clustering in version 3.0 as Redis Cluster. So that each instance in the cluster knows the status of all other instances, Redis Cluster has the instances communicate with one another using the Gossip protocol.

The figure above shows a diagram of a Redis Cluster in a master-slave architecture, where solid lines represent the master-slave replication relationship between nodes and dotted lines represent Gossip communication between nodes.

Each node in a Redis Cluster maintains the current status of the whole cluster from its own perspective, which mainly includes:

The current cluster status
The slot information and migration status of each node in the cluster
The master-slave status of each node in the cluster
The liveness status and suspected-failure status of each node in the cluster

In other words, this information is what the nodes gossip about. It is fairly comprehensive, covering both the sender and other nodes, so as every node keeps passing it along, each node's view eventually becomes complete and consistent.

Redis Cluster nodes send various messages to each other. The important messages are as follows:

MEET: Triggered by the “CLUSTER MEET ip port” command. A node in the existing cluster sends an invitation to a new node to join the cluster, after which the new node starts communicating with the other nodes.
PING: A node sends PING messages to other nodes in the cluster at the configured interval. The message carries the sender's own status, the cluster metadata it maintains, and the metadata of some other nodes.
PONG: Sent in response to PING and MEET messages. Its structure is the same as that of a PING message, and it also carries the sender's own status and other information. It can also be broadcast to push updated information.
FAIL: When a node cannot reach another node with PING and decides that it has failed, it broadcasts a FAIL message to tell all nodes that the node is down. Nodes that receive the message mark the failed node as offline.

The cluster.h file in the Redis source defines all message types. The code shown here is from Redis 4.0.

// Note that PING, PONG, and MEET are actually the same message.
// PONG is a reply to PING. The actual format of PONG is PING message.
// MEET is a special PING message that forces the receiver of the message to add the sender to the cluster (if the node is not already in the node list)
#define CLUSTERMSG_TYPE_PING 0          /* Ping message */
#define CLUSTERMSG_TYPE_PONG 1          /* Pong, the reply to Ping */
#define CLUSTERMSG_TYPE_MEET 2          /* Meet, a request to add a node to the cluster */
#define CLUSTERMSG_TYPE_FAIL 3          /* Fail, marks a node as FAIL */
#define CLUSTERMSG_TYPE_PUBLISH 4       /* Broadcast messages via publish and subscribe */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* Request a failover; asks the receiver to vote for the sender */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6     /* The receiver agrees to vote for the sender */
#define CLUSTERMSG_TYPE_UPDATE 7        /* Slots have changed; the sender asks the receiver to update its slot configuration */
#define CLUSTERMSG_TYPE_MFSTART 8       /* Pause clients for manual failover */
#define CLUSTERMSG_TYPE_COUNT 9         /* Total number of message types */
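When reading logs or packet dumps it can help to turn these numeric codes back into names. The helpers below are hypothetical and not part of Redis; they only reuse the constant values from the listing above and encode the note that PING, PONG, and MEET share one body layout.

```c
/* Values copied from the Redis 4.0 cluster.h listing above. */
#define CLUSTERMSG_TYPE_PING 0
#define CLUSTERMSG_TYPE_PONG 1
#define CLUSTERMSG_TYPE_MEET 2
#define CLUSTERMSG_TYPE_FAIL 3
#define CLUSTERMSG_TYPE_COUNT 9

/* Hypothetical helper: map a message type code to a short label. */
const char *cluster_msg_type_name(int type) {
    static const char *names[CLUSTERMSG_TYPE_COUNT] = {
        "ping", "pong", "meet", "fail", "publish",
        "auth-request", "auth-ack", "update", "mfstart"
    };
    if (type < 0 || type >= CLUSTERMSG_TYPE_COUNT) return "unknown";
    return names[type];
}

/* Hypothetical helper: PING, PONG and MEET all carry gossip entries. */
int cluster_msg_has_gossip(int type) {
    return type == CLUSTERMSG_TYPE_PING ||
           type == CLUSTERMSG_TYPE_PONG ||
           type == CLUSTERMSG_TYPE_MEET;
}
```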

From these messages, each instance in the cluster can obtain status information for all other instances. In this way, even when a new node is added, a node fails, or Slot changes occur, the cluster status can be synchronized on each instance through the transmission of PING and PONG messages. Let’s take a look at some common scenarios in turn.

Periodic PING/PONG messages

Nodes in the Redis Cluster periodically send PING messages to other nodes to exchange node status information and check node status, including online status, suspected offline status PFAIL, and offline status FAIL.

The working principle of timed PING/PONG in Redis clusters can be summarized as two points:

First, some instances are randomly selected from the cluster at a certain frequency. PING messages are sent to the selected instances to check whether they are online and exchange status information. PING messages encapsulate the status information of the instance that sends the message, some other instances, and Slot mapping tables.
Second, when an instance receives a PING message, it sends a PONG message to the instance that sent the PING message. A PONG message contains the same content as a PING message.

The following figure shows the PING and PONG message transfer between two instances, where instance 1 is the sending node and Instance 2 is the receiving node

New node online

To add a new node to a Redis Cluster, run the CLUSTER MEET command from a client, as shown in the following figure.

When node 1 executes the CLUSTER MEET command, it first creates a clusterNode structure for the new node and adds it to the nodes dictionary of the clusterState that node 1 maintains. The relationship between clusterState and clusterNode is explained in detail in the last section.

Node 1 then sends a MEET message to the new node, using the IP address and port number given in the CLUSTER MEET command. After receiving the MEET message, the new node likewise creates a clusterNode structure for node 1 and adds it to the nodes dictionary of its own clusterState.

The new node then returns a PONG message to node 1. When node 1 receives this PONG message, it knows that the new node has successfully received its MEET message.

Finally, node 1 sends a PING message to the new node. When the new node receives the PING message, it knows that node 1 has successfully received the PONG message it returned, which completes the handshake for the new node.

After the MEET operation succeeds, node 1 spreads the information about the new node to the other nodes in the cluster through the periodic PING mechanism described earlier, so that they also shake hands with the new node. After a period of time, the new node is known to all nodes in the cluster.
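The MEET, PONG, PING exchange described above can be modeled as a small state machine. This is a toy model under stated assumptions: the types hs_node and msg_type and the functions hs_receive and hs_run are invented for illustration, whereas real Redis tracks the handshake with node flags such as CLUSTER_NODE_MEET and CLUSTER_NODE_HANDSHAKE.

```c
#include <stdbool.h>

typedef enum { MSG_NONE, MSG_MEET, MSG_PING, MSG_PONG } msg_type;

typedef struct {
    bool knows_peer; /* peer is present in this node's nodes table */
    bool confirmed;  /* this side considers the handshake complete */
    bool invited;    /* this side executed CLUSTER MEET */
} hs_node;

/* Process one incoming message and return the reply to send back. */
msg_type hs_receive(hs_node *self, msg_type in) {
    switch (in) {
    case MSG_MEET: /* new node learns about the inviter */
        self->knows_peer = true;
        return MSG_PONG;
    case MSG_PONG: /* inviter learns that its MEET was delivered */
        if (self->invited && !self->confirmed) {
            self->confirmed = true;
            return MSG_PING;
        }
        return MSG_NONE;
    case MSG_PING: /* new node learns that its PONG was delivered */
        if (self->knows_peer) self->confirmed = true;
        return MSG_PONG;
    default:
        return MSG_NONE;
    }
}

/* Run the exchange; returns how many messages were delivered
 * before both sides considered the handshake complete. */
int hs_run(hs_node *inviter, hs_node *joiner) {
    hs_node *to = joiner, *from = inviter;
    msg_type msg = MSG_MEET;
    int sent = 0;

    inviter->invited = true;
    inviter->knows_peer = true; /* CLUSTER MEET created the entry */

    while (msg != MSG_NONE && sent < 10) {
        sent++;
        msg = hs_receive(to, msg);
        if (inviter->confirmed && joiner->confirmed) break;
        hs_node *tmp = to; to = from; from = tmp;
    }
    return sent;
}
```

Starting from two zero-initialized hs_node values, hs_run delivers exactly three messages (MEET, PONG, PING), matching the three steps in the text.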

Suspected and actual offline nodes

Each node in a Redis Cluster periodically checks whether the nodes it has sent PING messages to have returned a PONG within the configured time (cluster-node-timeout). If not, it marks the node as suspected offline, that is, the PFAIL state, as shown in the figure below.

Node 1 then passes the information that node 2 is suspected offline to other nodes, such as node 3, through PING messages. After receiving the PING message from node 1, node 3 finds the clusterNode structure for node 2 in the nodes dictionary of its clusterState and adds node 1's offline report to the fail_reports list of that structure.

Over time, suppose node 10 (for example) also considers node 2 suspected offline because its PONG timed out, and the fail_reports list of the clusterNode that node 10 maintains for node 2 shows that more than half of the master nodes have reported node 2 as PFAIL. Node 10 then marks node 2 as FAIL (offline) and immediately broadcasts a FAIL message for node 2 to the other nodes in the cluster. Every node that receives the FAIL message immediately marks node 2 as offline, as shown in the figure below.

An offline report is only valid for a limited time. Once a report is older than cluster-node-timeout * 2, it is ignored, so node 2 is no longer treated as failing on the basis of that report.
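The promotion from PFAIL to FAIL can be sketched as a quorum check over unexpired reports. The structure and function names below are hypothetical simplifications; the rules they encode (a majority of masters, the local master counting its own view, and reports expiring after twice the node timeout) follow the description above.

```c
#include <stdint.h>

#define FAIL_REPORT_VALIDITY_MULT 2 /* reports older than timeout*2 are ignored */

/* Hypothetical, simplified offline report. */
typedef struct {
    int reporter_id; /* master that reported the node as PFAIL */
    int64_t time_ms; /* when the report was last refreshed */
} fail_report;

/* Count the reports that are still valid at now_ms. */
int valid_fail_reports(const fail_report *reports, int n,
                       int64_t now_ms, int64_t node_timeout_ms) {
    int64_t max_age = node_timeout_ms * FAIL_REPORT_VALIDITY_MULT;
    int valid = 0;
    for (int i = 0; i < n; i++) {
        if (now_ms - reports[i].time_ms <= max_age) valid++;
    }
    return valid;
}

/* Returns 1 if a node this node already considers PFAIL should be
 * promoted to FAIL, given the number of masters serving slots. */
int should_mark_fail(const fail_report *reports, int n,
                     int64_t now_ms, int64_t node_timeout_ms,
                     int masters, int i_am_master) {
    int needed = masters / 2 + 1;
    int failures = valid_fail_reports(reports, n, now_ms, node_timeout_ms);
    if (i_am_master) failures++; /* count our own PFAIL view */
    return failures >= needed;
}
```

For example, with five masters the quorum is three: two valid reports from other masters plus the local master's own PFAIL view are enough, while the same two reports seen by a slave are not.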

Redis Cluster communication source code implementation

So far we have covered the principles and workflow of Redis Cluster for periodic PING/PONG, new nodes coming online, and suspected and actual offline nodes. Now let's look at how Redis implements these steps in its source code.

The data structures involved

First, let's go over the data structures involved, namely clusterNode and the related structures mentioned above.

Each node maintains a clusterState structure that represents the overall state of the current cluster, as defined below.

typedef struct clusterState {
    clusterNode *myself;  /* clusterNode information for the current node */
    dict *nodes;          /* Dictionary mapping node names to clusterNode structures */
    clusterNode *slots[CLUSTER_SLOTS]; /* Mapping between slots and nodes */
} clusterState;

It has three key fields, as shown below:

The myself field points to the clusterNode structure that records the node's own status.
The nodes dictionary maps node names to clusterNode structures and records the status of the other nodes.
The slots array records, for each slot, the clusterNode structure of the node responsible for it.

The clusterNode structure stores the current state of a node, such as its creation time, name, current configuration epoch, IP address, and port number. Its link attribute is a clusterLink structure that holds what is needed to talk to the node, such as the socket descriptor and the input and output buffers. clusterNode also has a fail_reports list that records suspected-offline reports. The definition is as follows.

typedef struct clusterNode {
    mstime_t ctime; /* Node creation time */
    char name[CLUSTER_NAMELEN]; /* Node name */
    int flags;      /* Node flags: role and status, such as master or slave, online or offline */
    uint64_t configEpoch; /* Last configEpoch observed for this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    int numslots;   /* Number of slots handled by this node */
    int numslaves;  /* Number of slave nodes, if this is a master */
    struct clusterNode* *slaves; /* pointers to slave nodes */
    struct clusterNode *slaveof; /* pointer to the master node. Note that it may be NULL even if the node is a slave if we don't have the master node in our tables. */
    mstime_t ping_sent;      /* The time when the current node last sent a PING message to the node */
    mstime_t pong_received;  /* The time when the current node last received a PONG message from the node */
    mstime_t fail_time;      /* FAIL the time when the flag bit is set */
    mstime_t voted_time;     /* Last time we voted for a slave of this master */
    mstime_t repl_offset_time;  /* Unix time we received offset for this node */
    mstime_t orphaned_time;     /* Starting time of orphaned master condition */
    long long repl_offset;      /* Last known replication offset for this node */
    char ip[NET_IP_STR_LEN];  /* Node IP address */
    int port;                   /* Client port */
    int cport;                  /* Cluster bus port, usually port + 10000 */
    clusterLink *link;          /* TCP connection to this node */
    list *fail_reports;         /* List of offline records */
} clusterNode;
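The slots field of clusterNode is a bitmap with one bit per slot, which is why it is declared as CLUSTER_SLOTS/8 bytes. A minimal sketch of how such a bitmap can be read and written follows; the helper names are made up for this example and are not the names used in the Redis source.

```c
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Clear every slot bit in a CLUSTER_SLOTS/8 byte bitmap. */
void slots_clear(unsigned char *bitmap) {
    memset(bitmap, 0, CLUSTER_SLOTS / 8);
}

/* Mark a slot as handled: byte index is slot/8, bit index is slot%8. */
void slot_set(unsigned char *bitmap, int slot) {
    bitmap[slot >> 3] |= (unsigned char)(1 << (slot & 7));
}

/* Returns 1 if the slot bit is set, 0 otherwise. */
int slot_test(const unsigned char *bitmap, int slot) {
    return (bitmap[slot >> 3] & (1 << (slot & 7))) != 0;
}

/* Count the set bits, which corresponds to the numslots field. */
int slot_count(const unsigned char *bitmap) {
    int n = 0;
    for (int s = 0; s < CLUSTER_SLOTS; s++) n += slot_test(bitmap, s);
    return n;
}
```

The whole bitmap occupies 2048 bytes, so a node can advertise every slot it serves in a fixed-size field of each message.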

clusterNodeFailReport is the structure that records an offline report for a node. The node field is the node that made the report, and time is when the report was made.

typedef struct clusterNodeFailReport {
    struct clusterNode *node;  /* The node that reported this node as offline */
    mstime_t time;             /* Report time */
} clusterNodeFailReport;

Message structure

Now that you know the data structures maintained by Redis nodes, let's look at the structure of the messages they exchange. The outermost structure of a message is clusterMsg. It holds general message information, including the RCmb signature, the total message length, the protocol version, and the message type. It also holds information about the sending node, such as its name, the slots it is responsible for, and its IP address and port. Finally, it contains a clusterMsgData that carries the type-specific payload.

typedef struct {
    char sig[4];        /* RCmb (Redis Cluster message bus). */
    uint32_t totlen;    /* Total message length */
    uint16_t ver;       /* Message protocol version */
    uint16_t port;      /* Client port of the sender */
    uint16_t type;      /* Message type */
    uint16_t count;     /* Number of gossip entries (only used for some message types) */
    uint64_t currentEpoch;  /* The epoch of the whole cluster as currently seen by the sender, used for elections and votes. Unlike configEpoch, which identifies a master's configuration, currentEpoch is cluster-wide. */
    uint64_t configEpoch;   /* Each master has a unique configEpoch. If it collides with another master's, it is incremented to make it unique within the cluster */
    uint64_t offset;    /* Replication offset. Its meaning differs between master and slave nodes */
    char sender[CLUSTER_NAMELEN]; /* The name of the sending node */
    unsigned char myslots[CLUSTER_SLOTS/8]; /* Slots this node is responsible for: 16384/8 bytes, one bit per slot */
    char slaveof[CLUSTER_NAMELEN]; /* Name of the master, if the sender is a slave */
    char myip[NET_IP_STR_LEN];    /* IP address of the sender */
    char notused1[34];  /* Reserved */
    uint16_t cport;      /* Cluster bus port of the sender */
    uint16_t flags;      /* Current flags of the sender, for example CLUSTER_NODE_HANDSHAKE or CLUSTER_NODE_MEET */
    unsigned char state; /* Cluster state from the POV of the sender */
    unsigned char mflags[3]; /* Message flags; currently CLUSTERMSG_FLAG0_PAUSED and CLUSTERMSG_FLAG0_FORCEACK */
    union clusterMsgData data;
} clusterMsg;

clusterMsgData is a union that can hold the body of a PING, MEET, PONG, or FAIL message. The ping member is used when the message type is PING, MEET, or PONG, and the fail member is used when the message type is FAIL.

// Notice that this is the union keyword
union clusterMsgData {
    /* When PING, MEET, or PONG is sent, the PING field is assigned */
    struct {
        /* Array of N clusterMsgDataGossip structures */
        clusterMsgDataGossip gossip[1];
    } ping;
    /* For a FAIL message, the fail member is used */
    struct {
        clusterMsgDataFail about;
    } fail;
    /* ... Fields for PUBLISH and UPDATE messages omitted */
};

clusterMsgDataGossip is the body of PING, PONG, and MEET messages. Each entry carries what the sending node knows about one other node, which comes from the nodes dictionary of clusterState described above. The code is as follows; you will notice that its fields are similar to those of clusterNode.

typedef struct {
    /* The node name. A node known only by address gets a random name until the handshake completes and its real name is learned */
    char nodename[CLUSTER_NAMELEN];
    uint32_t ping_sent; /* Timestamp of the last PING the sending node sent to this node */
    uint32_t pong_received; /* Timestamp of the last PONG the sending node received from this node */
    char ip[NET_IP_STR_LEN];  /* IP address last time it was seen */
    uint16_t port;       /* Client port */
    uint16_t cport;      /* Cluster bus port */
    uint16_t flags;      /* Node flags */
    uint32_t notused1;   /* Padding for alignment */
} clusterMsgDataGossip;

typedef struct {
    char nodename[CLUSTER_NAMELEN]; /* The name of the offline node */
} clusterMsgDataFail;
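Because the body is a union whose ping member is declared with a single array element, the length actually sent on the wire is computed as the header size plus the size of the payload in use. The sketch below shows that arithmetic on deliberately reduced structures; every mini_ name is an assumption for illustration, and the field set is far smaller than the real clusterMsg.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for clusterMsgDataGossip and clusterMsgDataFail. */
typedef struct {
    char nodename[40];
    uint32_t ping_sent;
    uint32_t pong_received;
    uint16_t port;
    uint16_t flags;
} mini_gossip;

typedef struct {
    char nodename[40];
} mini_fail;

union mini_data {
    struct { mini_gossip gossip[1]; } ping; /* really N entries on the wire */
    struct { mini_fail about; } fail;
};

typedef struct {
    char sig[4];
    uint32_t totlen;
    uint16_t type;
    uint16_t count;
    union mini_data data;
} mini_msg;

/* Size of everything that precedes the union. */
size_t mini_header_len(void) {
    return sizeof(mini_msg) - sizeof(union mini_data);
}

/* Length of a PING, PONG or MEET carrying count gossip entries. */
size_t mini_ping_len(int count) {
    return mini_header_len() + sizeof(mini_gossip) * (size_t)count;
}

/* Length of a FAIL message. */
size_t mini_fail_len(void) {
    return mini_header_len() + sizeof(mini_fail);
}
```

The design choice here is that the receiver never relies on sizeof of the whole union: it reads the type and count from the header and derives the expected length from them.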

Having looked at the data structures a node maintains and the messages it sends, let's look at the source code that drives Redis's behavior.

PING messages are sent randomly and periodically

Redis calls the clusterCron function periodically. Once every 10 invocations, it picks a node to PING.

It samples five random nodes and then calls clusterSendPing to send a CLUSTERMSG_TYPE_PING message to the one among them that has gone the longest without a PONG.

/* cluster.c */
/* clusterCron() pings a random node once every 10 iterations (about once per second) */
clusterNode *min_pong_node = NULL;
mstime_t min_pong = 0;
dictEntry *de;

if (!(iteration % 10)) {
    int j;

    /* Sample five random nodes and keep the one with the oldest PONG */
    for (j = 0; j < 5; j++) {
        de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* Do not PING a disconnected node or one with a PING already in flight */
        if (this->link == NULL || this->ping_sent != 0) continue;
        if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
            continue;
        /* Compare pong_received to find the node that has gone the longest without a PONG */
        if (min_pong_node == NULL || min_pong > this->pong_received) {
            min_pong_node = this;
            min_pong = this->pong_received;
        }
    }
    /* PING the node that has gone the longest without a PONG reply */
    if (min_pong_node) {
        serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
    }
}
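The selection rule can be isolated into a small pure function that is easy to test. This is a sketch under stated assumptions: peer_info and pick_ping_target are hypothetical names, and the random sample is passed in as an array of indexes so that the result is deterministic.

```c
#include <stdint.h>

/* Hypothetical, reduced view of a peer for the selection step. */
typedef struct {
    int has_link;          /* 0 if there is no connection to the peer */
    int64_t ping_sent;     /* nonzero while a PING is in flight */
    int64_t pong_received; /* time of the last PONG from the peer */
} peer_info;

/* Among the sampled peers, return the index of the eligible peer with
 * the oldest pong_received, or -1 if none of them is eligible. */
int pick_ping_target(const peer_info *peers, const int *sample,
                     int sample_len) {
    int best = -1;
    for (int j = 0; j < sample_len; j++) {
        const peer_info *p = &peers[sample[j]];
        /* Skip disconnected peers and peers already being pinged. */
        if (!p->has_link || p->ping_sent != 0) continue;
        if (best == -1 || p->pong_received < peers[best].pong_received)
            best = sample[j];
    }
    return best;
}
```

Sampling only a handful of peers keeps the per-tick cost constant, while preferring the oldest PONG makes it unlikely that any single peer goes unchecked for long.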

The clusterSendPing function is covered later, since it is used in several other places.

Adding a Node to a Cluster

After the CLUSTER MEET command is executed, the node keeps a clusterNode structure for the new node. The link field of that structure, which holds the TCP connection, is NULL, meaning that no connection to the new node has been established yet.

clusterCron also handles these unconnected nodes: it calls createClusterLink to create the connection and then calls clusterSendPing to send the MEET message.

/* cluster.c clusterCron part of the function to create connections for nodes that have not yet created connections */
if (node->link == NULL) {
    int fd;
    mstime_t old_ping_sent;
    clusterLink *link;
    /* Establish a connection with the node */
    fd = anetTcpNonBlockBindConnect(server.neterr, node->ip,
        node->cport, NET_FIRST_BIND_ADDR);
    /* ... Exception handling when fd is -1 */
    /* Create link */
    link = createClusterLink(node);
    link->fd = fd;
    node->link = link;
    aeCreateFileEvent(server.el,link->fd,AE_READABLE,
            clusterReadHandler,link);
    /* Send a PING or MEET right after connecting so the node is not wrongly treated as offline */
    /* If the node is flagged MEET, send MEET; otherwise send PING */
    old_ping_sent = node->ping_sent;
    clusterSendPing(link, node->flags & CLUSTER_NODE_MEET ?
            CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
    /* .... */
    /* The MEET flag can be cleared after the first packet is sent. If no reply ever arrives, no further packets are sent to the target node. */
    /* If a reply arrives, the node leaves the HANDSHAKE state and regular PINGs are sent from then on */
    node->flags &= ~CLUSTER_NODE_MEET;
}

Prevent node false timeout and status expiration

Guarding against false timeouts and marking nodes as suspected offline both happen in clusterCron as well, as shown below. It walks the list of all known nodes. If a PING to a node has been waiting for its PONG for more than half of the configured timeout, it releases the link so that the connection is re-established on the next run, in case the node is healthy and only the connection is broken. Separately, if no PING is in flight and the last PONG from a node is older than half of the timeout, it sends that node a PING so that its information does not go stale.

/* cluster.c clusterCron */
while((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    now = mstime(); /* Use an updated time at every iteration. */
    mstime_t delay;

    /* If we have been waiting for a PONG for more than half of the node timeout, */
    /* reconnect: the node may still be healthy while the connection is broken */
    if (node->link && /* is connected */
        now - node->link->ctime >
        server.cluster_node_timeout && /* was not already reconnected */
        node->ping_sent && /* a PING has been sent */
        node->pong_received < node->ping_sent && /* still waiting for the PONG */
        /* and we have been waiting for the PONG for more than timeout/2 */
        now - node->ping_sent > server.cluster_node_timeout/2)
    {
        /* Release the connection and the next clusterCron() will automatically reconnect */
        freeClusterLink(node->link);
    }

    /* If no PING to this node is currently in flight */
    /* and no PONG has been received from it for more than half of the node timeout, */
    /* send it a PING now so its information does not get too old; random selection may not have picked it */
    if (node->link &&
        node->ping_sent == 0 &&
        (now - node->pong_received) > server.cluster_node_timeout/2)
    {
        clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
        continue;
    }
    /* ... Handling failover and marking suspected offline */
}
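The two checks in this loop can be summarized as a decision function over a few timestamps. The following is a sketch with invented names (link_action, check_link) that mirrors the conditions in the excerpt above, assuming the node has a live link; it is not code from Redis.

```c
#include <stdint.h>

typedef enum { ACT_NONE, ACT_RECONNECT, ACT_PING } link_action;

/* Decide what to do for a connected peer. All times are milliseconds. */
link_action check_link(int64_t now, int64_t link_ctime,
                       int64_t ping_sent, int64_t pong_received,
                       int64_t timeout) {
    /* A PING has been waiting for its PONG for more than timeout/2 on a
     * link that is older than the timeout: drop the link and reconnect. */
    if (now - link_ctime > timeout &&
        ping_sent != 0 &&
        pong_received < ping_sent &&
        now - ping_sent > timeout / 2)
        return ACT_RECONNECT;

    /* No PING in flight and the last PONG is older than timeout/2:
     * send a PING so the information about this peer stays fresh. */
    if (ping_sent == 0 && now - pong_received > timeout / 2)
        return ACT_PING;

    return ACT_NONE;
}
```

Using half of the timeout for both checks leaves the other half as a margin: a reconnect or an extra PING still has time to succeed before the peer would be flagged PFAIL.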

Process failover and flag suspected offline

If, after the false-timeout handling above, a node still has not received a PONG from the target node and more than cluster_node_timeout has elapsed, it marks the target node as suspected offline.

/* If we are a master and one of our slaves requested a manual failover, keep pinging it */
if (server.cluster->mf_end &&
    nodeIsMaster(myself) &&
    server.cluster->mf_slave == node &&
    node->link)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}

/* The code below only runs if a PING to this node is in flight */
if (node->ping_sent == 0) continue;

/* Compute how long we have been waiting for the PONG */
delay = now - node->ping_sent;
/* The wait for the PONG exceeded the limit: mark the target node as PFAIL (suspected offline) */
if (delay > server.cluster_node_timeout) {
    /* Timed out: flag as suspected offline unless already PFAIL or FAIL */
    if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
        serverLog(LL_DEBUG,"*** NODE %.40s possibly failing",
            node->name);
        /* Set the suspected offline flag */
        node->flags |= CLUSTER_NODE_PFAIL;
        update_state = 1;
    }
}

Actually sends the Gossip message

The following is the source code of clusterSendPing(). Its main job is to convert the clusterState maintained by the node into the corresponding message structure.

/* Sends a MEET, PING, or PONG message to the specified node */
void clusterSendPing(clusterLink *link, int type) {
    unsigned char *buf;
    clusterMsg *hdr;
    int gossipcount = 0; /* Number of gossip sections added so far. */
    int wanted; /* Number of gossip sections we want to append if possible. */
    int totlen; /* Total packet length. */
    // freshnodes is a budget for the gossip entries that can be added
    // Each time an entry is added, freshnodes is decremented by one
    // When freshnodes reaches 0, no more gossip entries are added
    // freshnodes starts as the number of nodes in the nodes table minus 2
    // The "2" refers to two nodes: the first is myself (the node sending the message),
    // the other is the node receiving the gossip message
    int freshnodes = dictSize(server.cluster->nodes)-2;


    /* Work out how many gossip entries to include: one tenth of the known nodes, but at least 3 and at most freshnodes */
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;

    /* ... totlen calculation and buffer setup omitted */

    /* If this is a PING, update the timestamp of the last PING sent */
    if (link->node && type == CLUSTERMSG_TYPE_PING)
        link->node->ping_sent = mstime();
    /* Record the current node's information (name, address, port, slots) in the message header */
    clusterBuildMessageHdr(hdr,type);

    /* Populate the gossip fields */
    int maxiterations = wanted*3;
    /* freshnodes bounds how many distinct nodes can be added; maxiterations bounds the number of random picks */
    while(freshnodes > 0 && gossipcount < wanted && maxiterations--) {
        /* Selects a random node from the nodes dictionary */
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* The following nodes are not selected:
         * - myself: the node itself
         * - nodes in PFAIL state
         * - nodes in HANDSHAKE state
         * - nodes flagged NOADDR
         * - disconnected nodes that serve no slots */
        if (this == myself) continue;
        if (this->flags & CLUSTER_NODE_PFAIL) continue;
        if (this->flags & (CLUSTER_NODE_HANDSHAKE|CLUSTER_NODE_NOADDR) ||
            (this->link == NULL && this->numslots == 0))
        {
            freshnodes--; /* Technically not correct, but saves CPU. */
            continue;
        }

        // Check whether the selected node is already in the hdr->data.ping.gossip array
        // If so, skip it so it is not added twice
        if (clusterNodeIsInGossipSection(hdr,gossipcount,this)) continue;

        /* The selected node is valid: add it and decrement the budget */
        clusterSetGossipEntry(hdr,gossipcount,this);
        freshnodes--;
        gossipcount++;
    }

    /* ... If there are PFAIL nodes, add them last */


    /* Calculates the length of the message */
    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
    totlen += (sizeof(clusterMsgDataGossip)*gossipcount);
    /* Records the number of selected nodes in the count property */
    hdr->count = htons(gossipcount);
    /* Record the length of the message in the message */
    hdr->totlen = htonl(totlen);
    /* Send a network request */
    clusterSendMessage(link,buf,totlen);
    zfree(buf);
}


void clusterSetGossipEntry(clusterMsg *hdr, int i, clusterNode *n) {
    clusterMsgDataGossip *gossip;
    /* Points to the gossip information structure */
    gossip = &(hdr->data.ping.gossip[i]);
    /* Record the name of the selected node in the gossip entry */
    memcpy(gossip->nodename,n->name,CLUSTER_NAMELEN);
    /* Record when the selected node was last sent a PING, in seconds */
    gossip->ping_sent = htonl(n->ping_sent/1000);
    /* Record when a PONG was last received from the selected node, in seconds */
    gossip->pong_received = htonl(n->pong_received/1000);
    /* Records the IP address of the selected node into the gossip message */
    memcpy(gossip->ip,n->ip,sizeof(n->ip));
    /* Records the port number of the selected node to the gossip message */
    gossip->port = htons(n->port);
    gossip->cport = htons(n->cport);
    /* Records the identity value of the selected node to the gossip message */
    gossip->flags = htons(n->flags);
    gossip->notused1 = 0;
}
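The rule clusterSendPing uses to decide how many gossip entries to attach is easy to check in isolation. In this sketch, gossip_wanted is a hypothetical name, and the final clamp to zero is an addition for the example so that the function is safe to call with one or two known nodes.

```c
/* Number of gossip entries to attach, given the size of the nodes table
 * (which includes myself and the receiver of the message). */
int gossip_wanted(int known_nodes) {
    int freshnodes = known_nodes - 2; /* exclude myself and the receiver */
    int wanted = known_nodes / 10;    /* one tenth of the known nodes */

    if (wanted < 3) wanted = 3;                   /* at least 3 */
    if (wanted > freshnodes) wanted = freshnodes; /* never more than exist */
    if (wanted < 0) wanted = 0;                   /* clamp added for this sketch */
    return wanted;
}
```

With 6 known nodes the message carries 3 entries, and with 100 known nodes it carries 10, so the message size grows slowly as the cluster grows.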

Below is the clusterBuildMessageHdr function, which populates the basic information in the message structure and the state information of the current node.

/* Build the message's header */
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    int totlen = 0;
    uint64_t offset;
    clusterNode *master;

    /* If the current node is a slave, master is its master; if the current node is a master, master is the node itself */
    master = (nodeIsSlave(myself) && myself->slaveof) ?
              myself->slaveof : myself;

    memset(hdr,0,sizeof(*hdr));
    /* Initializes the protocol version, identifier, and type, */
    hdr->ver = htons(CLUSTER_PROTO_VER);
    hdr->sig[0] = 'R';
    hdr->sig[1] = 'C';
    hdr->sig[2] = 'm';
    hdr->sig[3] = 'b';
    hdr->type = htons(type);
    /* Message header sets the current node ID */
    memcpy(hdr->sender,myself->name,CLUSTER_NAMELEN);

    /* The message header sets the current node IP */
    memset(hdr->myip,0,NET_IP_STR_LEN);
    if (server.cluster_announce_ip) {
        strncpy(hdr->myip,server.cluster_announce_ip,NET_IP_STR_LEN);
        hdr->myip[NET_IP_STR_LEN-1] = '\0';
    }

    /* Base port and cluster node communication port */
    int announced_port = server.cluster_announce_port ?
                         server.cluster_announce_port : server.port;
    int announced_cport = server.cluster_announce_bus_port ?
                          server.cluster_announce_bus_port :
                          (server.port + CLUSTER_PORT_INCR);
    /* Set the slot information of the current node */
    memcpy(hdr->myslots,master->slots,sizeof(hdr->myslots));
    memset(hdr->slaveof,0,CLUSTER_NAMELEN);
    if (myself->slaveof != NULL)
        memcpy(hdr->slaveof,myself->slaveof->name, CLUSTER_NAMELEN);
    hdr->port = htons(announced_port);
    hdr->cport = htons(announced_cport);
    hdr->flags = htons(myself->flags);
    hdr->state = server.cluster->state;

    /* Set currentEpoch and configEpoch */
    hdr->currentEpoch = htonu64(server.cluster->currentEpoch);
    hdr->configEpoch = htonu64(master->configEpoch);

    /* Sets the replication offset */
    if (nodeIsSlave(myself))
        offset = replicationGetSlaveOffset();
    else
        offset = server.master_repl_offset;
    hdr->offset = htonu64(offset);

    /* Set the message flags. */
    if (nodeIsMaster(myself) && server.cluster->mf_end)
        hdr->mflags[0] |= CLUSTERMSG_FLAG0_PAUSED;

    /* Calculates and sets the total length of the message */
    if (type == CLUSTERMSG_TYPE_FAIL) {
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataFail);
    } else if (type == CLUSTERMSG_TYPE_UPDATE) {
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataUpdate);
    }
    hdr->totlen = htonl(totlen);
}
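Both clusterSetGossipEntry and clusterBuildMessageHdr convert multi-byte fields to network byte order with htons, htonl, and htonu64, and the gossip timestamps are reduced from milliseconds to seconds so that they fit in 32 bits. The sketch below reproduces that encoding with explicit byte shifts instead of htonl so that it does not depend on platform headers; all function names are made up for the example.

```c
#include <stdint.h>

/* Write v into out[0..3], most significant byte first (network order). */
void put_be32(unsigned char *out, uint32_t v) {
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read a 32-bit value stored most significant byte first. */
uint32_t get_be32(const unsigned char *in) {
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Encode a millisecond timestamp as whole seconds in 4 bytes. */
void encode_gossip_time(unsigned char *out, int64_t ms) {
    put_be32(out, (uint32_t)(ms / 1000));
}

/* Decode back to milliseconds; sub-second precision is lost. */
int64_t decode_gossip_time(const unsigned char *in) {
    return (int64_t)get_be32(in) * 1000;
}
```

A round trip of 1234567 ms comes back as 1234000 ms, which shows that gossip entries only carry second-level precision for ping_sent and pong_received.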

Afterword

I originally just wanted to write about the Gossip protocol of Redis Cluster, but the more I wrote, the more there was to cover, and the source code analysis at the end turned out a bit rushed. Please bear with it, and I hope you will keep following my future articles.
