In the last article, we learned that when the master goes down, there is a magical thing called the sentinel that automatically switches from master to slave. To improve accuracy, Redis introduced the sentinel cluster. But what happens if a sentinel itself dies?

Don’t panic: since it is a cluster, the sentinels will not all fail at once. If one sentinel instance fails, the others continue to serve.

So how do sentinels form a cluster?

Let’s dig into that question.

If you have deployed sentinels before, you know that we only need the following configuration item to set one up. It specifies the IP and port of the master, and does not configure the connection information of any other sentinel:

sentinel monitor <master-name> <ip> <redis-port> <quorum>

For example: sentinel monitor mymaster 127.0.0.1 6379 2

So how do the sentinels get together, given that they don’t know each other’s addresses?

This is where the pub/sub mechanism provided by Redis comes in.

1. Sentinel cluster composition based on the PUB/SUB mechanism

The PUB/SUB mechanism is also known as the publish-subscribe mechanism. As long as a sentinel establishes a connection with the master, it can learn the IP and port of the other sentinels. In effect, the master acts as an intermediary: the sentinels go through the master to share their IP and port with one another.

Here we run into the concept of a channel. What is a channel? You can understand it this way: only parties in the same channel can communicate with each other. The official docs describe a channel as a message category, which is quite apt: messages of the same category belong to the same channel. If one side is asking what to have for dinner while the other is talking about camping a corner in a game, they are clearly not on the same channel, and there is nothing to exchange at all.

Applications can only communicate with each other if they subscribe to the same channel.
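To make this concrete, here is a minimal sketch using Python and the redis-py client (the host, port, channel names, and messages are placeholders for illustration): a subscriber only ever sees messages published to the channel it subscribed to.

    import redis

    r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

    # Subscriber: listen only on the "news.food" channel
    p = r.pubsub()
    p.subscribe("news.food")

    # A message on the same channel is delivered to the subscriber...
    r.publish("news.food", "what are we having for dinner?")
    # ...but a message on a different channel never reaches it
    r.publish("news.games", "are you camping that corner or not?")

    for msg in p.listen():
        if msg["type"] == "message":
            print(msg["channel"], "->", msg["data"])  # only the news.food message
            break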

Redis has a channel named __sentinel__:hello on the master of the master/slave cluster, and it is through this channel that the sentinels communicate with each other.

Let’s walk through an example.

Suppose there are three sentinels. Sentinel 1 publishes its IP and port to the __sentinel__:hello channel. Sentinels 2 and 3 subscribe to that channel, so they obtain sentinel 1’s information directly and can connect to it. In the same way, sentinels 2 and 3 discover and connect to each other.
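You can watch this hello traffic yourself. A minimal sketch, assuming a master at 127.0.0.1:6379: subscribe to __sentinel__:hello on the master and print what the sentinels announce (per the Redis docs, each announcement begins with the announcing sentinel’s IP, port, and run ID).

    import redis

    # Connect to the MASTER (not a sentinel): the master is the meeting point
    master = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

    p = master.pubsub()
    p.subscribe("__sentinel__:hello")

    # Every sentinel watching this master periodically announces itself here
    for msg in p.listen():
        if msg["type"] == "message":
            print(msg["data"])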

Once connected to each other, they form a cluster, and from this point on they can perform their duties, which I covered in the previous article and will not repeat here.

I don’t know if you noticed, but so far we have only talked about sentinels connecting to the master. What about the slaves? The slaves want connections too; the little brothers have rights as well!!

The sentinels must also communicate with the slaves; otherwise, how could a master/slave switch ever be carried out?

So how does a sentinel learn the IP addresses and ports of the slaves?

The sentinel sends an INFO command to the master, and the master returns the list of its slaves, including each slave’s connection information. The sentinel then connects to every slave and monitors it continuously.
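A minimal sketch of this discovery step, assuming a master at 127.0.0.1:6379: ask the master for its replication section, which lists every connected slave.

    import redis

    master = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

    # INFO replication returns the master's view of its slaves
    info = master.info("replication")
    print("role:", info["role"])                        # "master"
    print("connected slaves:", info["connected_slaves"])

    # Each slave appears as slave0, slave1, ... with ip/port/state fields
    for i in range(info["connected_slaves"]):
        print(info[f"slave{i}"])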

Are we done with connections now? Don’t forget: after a master/slave switch, how does the client find out?

So a connection between the sentinel and the client is needed as well, so that the client can be notified of the new master.

Furthermore, in actual use, how can the client follow the sentinel’s master/slave switchover as it happens? For example, which step has the switch reached? Essentially, this requires that the client be able to receive the various events that the sentinel cluster emits while monitoring, selecting a new master, and switching.

Here, we can once again rely on the pub/sub mechanism to synchronize information between the sentinels and the client.

2. Client event notification based on the PUB/SUB mechanism

A sentinel is essentially a Redis instance running in a special mode; it does not serve read/write requests, it only does monitoring, master selection, and notification. Because of that, each sentinel instance also provides the pub/sub mechanism, and clients can subscribe to messages from it. A sentinel publishes to a variety of channels, each carrying a different kind of key event during a master/slave switch.

Here are the important channels, covering several key events during a master/slave switch (channel names as given in the Redis Sentinel documentation):

+sdown / -sdown: an instance has entered / left the “subjectively offline” state
+odown / -odown: the master has entered / left the “objectively offline” state
+slave-reconf-sent / +slave-reconf-inprog / +slave-reconf-done: a slave is being repointed at the new master (command sent / sync in progress / sync completed)
+switch-master: the master address has changed; the new master’s IP and port follow

With these channels, clients can subscribe to messages from sentinels.

After reading the sentinel’s configuration file, the client obtains the sentinel’s address and port and establishes a network connection to it. We can then execute subscription commands on the client side to receive the different event messages.

For example, you can subscribe to “events where an instance enters the objectively offline state” by executing the following command:

SUBSCRIBE +odown

Or you can subscribe to all events by executing:

PSUBSCRIBE *
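Here is a minimal sketch of such a subscriber in Python with redis-py, assuming a sentinel listening on 127.0.0.1:26379 (the default sentinel port):

    import redis

    # Connect to the SENTINEL instance, not to a data node
    sentinel = redis.Redis(host="127.0.0.1", port=26379, decode_responses=True)

    p = sentinel.pubsub()
    p.psubscribe("*")   # all events; use p.subscribe("+odown") for a single channel

    for msg in p.listen():
        if msg["type"] == "pmessage":
            print(msg["channel"], "->", msg["data"])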

When the sentinels have selected the new master, the client will see the switch-master event below. This event indicates that the master has been switched, and it carries the IP address and port of the new master. At this point, the client can direct its communication to the new master’s address and port.

switch-master <master-name> <old-ip> <old-port> <new-ip> <new-port>
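Building on the subscriber above, a client can react to this event and reconnect; the parsing below follows the event format just shown. Alternatively, redis-py ships a Sentinel helper that tracks the current master for you. A sketch of both, with the addresses and the master name “mymaster” as placeholders:

    import redis
    from redis.sentinel import Sentinel

    # Option 1: parse the switch-master payload ourselves
    def on_switch_master(data):
        name, old_ip, old_port, new_ip, new_port = data.split()
        print(f"{name}: master moved to {new_ip}:{new_port}")
        return redis.Redis(host=new_ip, port=int(new_port))

    # Option 2: let the Sentinel helper discover the current master
    sentinel = Sentinel([("127.0.0.1", 26379)], socket_timeout=0.5)
    master = sentinel.master_for("mymaster", socket_timeout=0.5)
    master.set("key", "value")   # always routed to the current master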

Well, with the pub/sub mechanism, connections can be established between sentinel and sentinel, between sentinel and slave, and between sentinel and client. Add to that the criteria for judging the master offline and for selecting a new one that we introduced earlier, and the sentinel cluster’s three tasks of monitoring, master selection, and notification can basically work normally.

You think it’s over here? Let me ask you one more question and see if you know the answer:

After the master fails, the sentinel cluster has multiple instances, so how is it determined which sentinel performs the actual master/slave switchover?

Don’t know? Then let’s read on.

3. Which sentinel performs the master/slave switchover?

The process of determining which sentinel performs the master/slave switchover is similar to the process of judging the master “objectively offline”: both are processes of voting and arbitration. Before we go through it in detail, let’s look again at the arbitration process for judging “objectively offline”.

For the sentinel cluster to determine that the master is “objectively offline”, a certain number of instances must first consider it “subjectively offline”. In the last article, we covered the principle behind “objectively offline”; next, I will introduce the specific judgment process.

When an instance determines that the master is “subjectively offline”, it sends the is-master-down-by-addr command to the other instances. Based on their own connection status with the master, the other instances reply with Y or N, where Y means agree (the master is down) and N means disagree.

Once a sentinel has obtained the required number of approval votes, it can mark the master as “objectively offline”. The required number is set through the quorum configuration item in the sentinel configuration file. For example, if there are five sentinels and quorum is configured as 3, a sentinel needs three votes to mark the master as “objectively offline”: one vote from itself plus Y replies from two other sentinels.
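The arithmetic here is simple. As a tiny illustrative sketch (not Sentinel’s actual code):

    def is_objectively_down(agree_votes: int, quorum: int) -> bool:
        # agree_votes = this sentinel's own vote plus the Y replies it
        # received from other sentinels via is-master-down-by-addr
        return agree_votes >= quorum

    # 5 sentinels, quorum = 3: own vote + two Y replies is enough
    print(is_objectively_down(1 + 2, quorum=3))   # True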

At this point, that sentinel can send another command to the other sentinels, announcing that it wants to perform the master/slave switch and asking them all to vote. This voting process is called Leader election: the sentinel that ultimately performs the master/slave switch is called the Leader, and the vote exists to determine that Leader.

In the voting process, any sentinel that wants to become the Leader must meet two conditions: first, it must receive more than half of all the votes; second, the number of votes it receives must also be greater than or equal to the quorum value in the sentinel configuration file. With three sentinels and quorum set to 2, any sentinel that wants to become the Leader needs only two votes.
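Again as a tiny illustrative sketch (not Sentinel’s actual code), the two conditions look like this:

    def can_become_leader(votes: int, total_sentinels: int, quorum: int) -> bool:
        majority = total_sentinels // 2 + 1   # more than half of all sentinels
        return votes >= majority and votes >= quorum

    # 3 sentinels, quorum = 2: two votes are enough
    print(can_become_leader(votes=2, total_sentinels=3, quorum=2))   # True
    print(can_become_leader(votes=1, total_sentinels=3, quorum=2))   # False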

Does that make sense to you? What? Don’t understand?

Let me walk you through the election process with three sentinels and a quorum of 2.

At time T1, S1 judges that the master is “objectively offline”. Wanting to become the Leader, it first votes for itself and then sends commands to S2 and S3, announcing that it wants to become the Leader.

At time T2, S3 judges that the master is “objectively offline” and also wants to be the Leader, so it too votes for itself first and then sends commands to S1 and S2, announcing that it wants to become the Leader.

At time T3, S1 receives S3’s Leader vote request. Since S1 has already cast a Y vote for itself, it cannot vote for any other sentinel, so it replies N to S3, indicating disagreement. Meanwhile, S2 receives the vote request that S3 sent at T2. Since S2 has not voted yet, it replies Y to the first sentinel that asks it for a vote and N to any sentinel that asks afterwards. So at T3, S2 replies to S3 and agrees that S3 becomes the Leader.

At time T4, S2 receives the vote request that S1 sent at T1. Because S2 already agreed to S3’s request at T3, it replies N to S1, indicating that it does not agree to S1 becoming the Leader.

You might have a question here: why didn’t S2 receive S1’s vote request first? We are just using an example, but such a scenario can genuinely occur: for instance, the network between S2 and S3 is healthy while the link between S2 and S1 is congested, so S3’s vote request arrives first.

Finally, at time T5, S1 has one Y vote (its own) and one N from S2. S3, on top of its own affirmative vote, has also received the Y vote from S2. At this point, S3 has not only won more than half of the votes but has also reached the preset quorum value (quorum = 2), so it finally becomes the Leader. S3 then starts the master selection and, once the new master is chosen, notifies the other slaves and the clients.

There is another possibility: if S3 does not get 2 Y votes, no Leader is produced in this round of voting. The sentinel cluster then waits a certain amount of time (twice the sentinel’s failover timeout) before holding another election.

Why?

This is because a successful vote depends heavily on the election messages propagating normally over the network. If the network is under pressure or temporarily congested, it may happen that no sentinel can gather a majority of votes. So waiting until the network congestion eases before retrying increases the chances of success.

Note that if the sentinel cluster has only 2 instances, a sentinel must get 2 votes, not 1, to become the Leader. So if one of the two sentinels fails, the cluster can no longer perform a master/slave switch. That is why we typically configure at least three sentinel instances. This is very important and should not be overlooked!!

If every sentinel votes for itself during the election, is it then impossible to elect a Leader, trapping the cluster in an endless cycle?

The answer is that it definitely will not get stuck in a loop. Why? For one thing, this situation is hard to hit in practice: different sentinels face different network conditions and system pressure, so the chance that they all judge the master “objectively offline” at exactly the same moment is quite small. In fact, a sentinel’s monitoring of the master’s and slaves’ online status is a timed event driven by a timer, and each sentinel’s timer period generally has a small random offset added to it, precisely to stagger the timing and avoid the situation described above.

And even if it does all happen at the same moment, don’t panic: the sentinels will pause for a while and then move on to the next round of voting.

So it is impossible to fall into an endless loop!!