Redis, an in-memory key-value store written in C, is probably the most widely used key-value database in web development today. We often use it in business to store user login state (session storage), to speed up queries on hot data (an order of magnitude faster than MySQL), to build simple message queues (LPUSH and BRPOP), to implement publish/subscribe (PUB/SUB) systems, and so on. Large Internet companies generally have a dedicated team that provides Redis storage as a basic service for the other business teams to call.
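
Each of these patterns boils down to a handful of commands; a quick redis-cli sketch (key names are made up for illustration):

```
# Session storage: keep a login token alive for one hour
SET session:user:1001 "token-abc" EX 3600

# Hot-data cache: serve a precomputed value instead of hitting MySQL
SET hot:article:42:views 10086

# Simple message queue: producers LPUSH, consumers block on BRPOP (FIFO)
LPUSH queue:jobs "job-1"
BRPOP queue:jobs 0

# Publish/subscribe: run SUBSCRIBE in one client, PUBLISH in another
SUBSCRIBE news
PUBLISH news "hello"
```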

But one question that any provider of an underlying service will be asked by its callers is: is your service highly available? I would rather my business not suffer because your service has frequent problems. Recently I built a small "highly available" Redis service for my own project, and here I would like to summarize my own thinking.

First of all, we need to define what high availability means for a Redis service: the service can still be provided under various abnormal conditions, or, more loosely, after an anomaly the service returns to normal within a very short time.

Exceptions should include at least the following possibilities:

[Exception 1] A process on a node server suddenly goes down (for example, a careless developer kills the redis-server process on a machine).

[Exception 2] A node server goes down, stopping all processes on that node (for example, a careless operations worker unplugs the power cable of a server, or an old machine suffers a hardware failure).

[Exception 3] The communication between any two node servers is interrupted (for example, a temporary worker cuts the optical cable that carries traffic between two server rooms).

In fact, each of the above exceptions is a low-probability event, and the basic guiding principle of high availability design is that the probability of multiple low-probability events happening at the same time is negligible. As long as our system can tolerate a single point of failure for a short time, we can achieve high availability.

For building a highly available Redis service, there are many solutions on the Internet, such as Keepalived, Codis, Twemproxy, and Redis Sentinel. Codis and Twemproxy are mainly used for large-scale Redis clusters; they are open-source solutions released by Wandoujia ("Pea Pod") and Twitter before Redis Sentinel officially appeared. My business does not hold much data, so a cluster would be a waste of machines. The choice came down to Keepalived versus Redis Sentinel, and I went with the official solution, Redis Sentinel.

Redis Sentinel can be understood as a process that monitors whether a Redis Server instance is healthy. Once an anomaly is detected, it automatically promotes a slave Redis Server, so that users of the Redis service never notice the anomaly inside it. Below we set up a minimal highly available Redis service step by step, from simple to complex.

Solution 1: Standalone Redis Server, no Sentinel

[Figure Q1.png: a standalone Redis Server accessed directly by the client]

Normally, when we build a personal website or do everyday development, we run a single instance of Redis Server. Callers connect directly to the Redis service; the Client may even live on the same server as Redis. This configuration is only suitable for personal learning and entertainment, because it inevitably has a single point of failure: once the redis-server process dies, or server 1 goes down, the service becomes unavailable. And if Redis persistence has not been configured, the data already stored inside Redis is lost as well.
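
If you do run a single instance, it is worth at least enabling persistence so that a restart does not wipe the data. A minimal redis.conf sketch (the values here are illustrative, not recommendations):

```
# redis.conf: minimal persistence settings
appendonly yes          # AOF: append every write to a log for durability
appendfsync everysec    # fsync the AOF once per second (durability/speed trade-off)
save 900 1              # also take an RDB snapshot if >=1 key changed in 900s
```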

Solution 2: Master/slave Redis Servers, a single Sentinel instance

[Figure Q2.png: master/slave Redis Servers monitored by one Sentinel]

To achieve high availability, we must add a backup for the single point of failure in Solution 1: we start one Redis Server process on each of two servers, where the master serves requests and the slave is only responsible for synchronization and backup. We also start an additional Sentinel process to monitor the availability of the two Redis Server instances, so that when the master fails, the slave can be promoted to master in time and continue providing services; this realizes the high availability of Redis Server. This follows the design rationale stated above: a single point of failure is itself a low-probability event, while simultaneous failures at multiple single points (that is, the master and the slave failing at once) can be considered (basically) impossible.
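
A minimal sentinel.conf for this setup might look like the following (the master name, address, and timeouts are placeholders):

```
# sentinel.conf for the single Sentinel in Solution 2
port 26379
# Monitor the master named "mymaster" at 192.168.1.1:6379; the trailing 1 is
# the quorum: how many Sentinels must agree the master is down. The slave is
# discovered automatically by querying the master, so it needs no entry here.
sentinel monitor mymaster 192.168.1.1 6379 1
sentinel down-after-milliseconds mymaster 5000   # declare the master down after 5s of silence
sentinel failover-timeout mymaster 60000         # abort a failover that takes longer than 60s
```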

For callers of the Redis service, it is now Redis Sentinel that they connect to, rather than Redis Server. The common call flow is that the client first connects to Redis Sentinel, asks which Redis Server is the master and which is the slave, and then connects to the appropriate Redis Server for its operations. Current third-party libraries generally implement this flow already, so we no longer need to implement it by hand (for example, Node.js's ioredis, PHP's predis, Golang's go-redis/redis, Java's Jedis, and so on).
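
For instance, with Python's redis-py (one of many libraries implementing this flow; the addresses and the service name "mymaster" are placeholders matching the sentinel.conf sketch above):

```python
from redis.sentinel import Sentinel

# Connect to Sentinel and let it tell us the current topology,
# instead of hard-coding the master's address.
sentinel = Sentinel([("192.168.1.1", 26379)], socket_timeout=0.5)

master = sentinel.master_for("mymaster", socket_timeout=0.5)  # read-write connection
slave = sentinel.slave_for("mymaster", socket_timeout=0.5)    # read-only connection

master.set("foo", "bar")
print(slave.get("foo"))  # b'bar' once replication catches up
```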

However, after we implement master-slave switching for the Redis Server service, a new problem is introduced: Redis Sentinel itself is now a single point. Once the Sentinel process dies, the client can no longer connect to any Sentinel. Therefore, the configuration in Solution 2 still does not achieve high availability.

Solution 3: Master/slave Redis Servers, two Sentinel instances

[Figure Q3.png: master/slave Redis Servers monitored by two Sentinels]

To solve the problem in Solution 2, we start one more Redis Sentinel process. Both Sentinel processes provide service discovery for the client, which can connect to either of them to get basic information about the current Redis Server instances. Normally we configure multiple Redis Sentinel addresses on the Client side; when the Client finds one address unreachable, it tries the other Sentinel instances. This again requires no manual implementation: the popular Redis connection libraries in the various languages do it for us. The expectation is that even if one Redis Sentinel dies, another Sentinel is still available.

However, while the vision is good, the reality is brutal: with this architecture, high availability of the Redis service still cannot be achieved. In the diagram for Solution 3, the red line marks the communication between the two servers, and the abnormal scenario we imagined ([Exception 2]) is an entire server going down. Suppose server 1 goes down; only the Redis Sentinel and the slave Redis Server processes on server 2 remain. In this case Sentinel will not promote the remaining slave to master, and the Redis service becomes unavailable, because Redis is designed so that a master/slave switch actually happens only when more than 50% of the Sentinel processes can reach each other and vote for a new master. Here only one of the two Sentinels is reachable, which is exactly 50%, not more, so no failover occurs.
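
Note the distinction Sentinel draws between the configured quorum and this majority rule; a sketch of why two Sentinels are not enough:

```
# With two Sentinels, even a quorum of 1 does not help:
# - "quorum" (the last argument below) only controls how many Sentinels must
#   agree before the master is marked objectively down;
# - actually performing the failover requires a leader elected by a majority
#   of ALL known Sentinels. A majority of 2 is 2, so one surviving Sentinel
#   can never authorize a failover.
sentinel monitor mymaster 192.168.1.1 6379 1
```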

Why does Redis have this 50% rule, you may ask? Suppose a master-slave switch were allowed even when only 50% or fewer of the Sentinels are reachable. Imagine [Exception 3]: the network between server 1 and server 2 goes down, but both servers themselves are up and running.

As shown below:

[Figure Q4.png: network partition between server 1 and server 2]

In fact, from server 2's perspective, server 1 going down outright has exactly the same effect as server 1 being cut off by the network: either way, server 2 suddenly cannot communicate with server 1. Suppose we allowed the Sentinel on server 2 to promote the slave to master during a network outage. The result would be two Redis Servers both serving external clients.

Any add, delete, or update operation performed by the Client might land on either server 1's Redis or server 2's Redis (depending on which Sentinel the Client is talking to), producing inconsistent data. Even after the network between server 1 and server 2 is restored, we cannot reconcile the two data sets: data consistency is completely destroyed.

Solution 4: Master/slave Redis Servers, three Sentinel instances

[Figure Q5.png: master/slave Redis Servers monitored by three Sentinels]

Since Solution 3 is not highly available, our final version is Solution 4, shown in the figure above, and this is what we actually deployed. We introduce server 3 and run one more Redis Sentinel process on it, so that three Sentinel processes now manage two Redis Server instances. With this scheme, whether a single process fails, a single machine fails, or the network between two machines fails, the Redis service can continue to be provided.

In fact, if you have a machine to spare, you can also run a Redis Server on server 3 to form a 1 master + 2 slaves architecture. Each piece of data then has two backups, which improves availability further. Of course, more slaves are not always better: master-slave synchronization has its own time cost.

In Solution 4, once the communication between server 1 and the other servers is completely cut off, servers 2 and 3 will promote the slave to master. For the client there are then two masters available at the same time, and once the network recovers, all new data that landed on server 1 during the outage is lost. To partially mitigate this, you can configure the Redis Server process on server 1 to stop serving as soon as it detects the network problem, so that no new data comes in during the outage (see Redis's two configuration items min-slaves-to-write and min-slaves-max-lag).
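
The relevant settings in the master's redis.conf would look roughly like this (the thresholds are illustrative; in Redis 5 and later the directives are named min-replicas-to-write and min-replicas-max-lag):

```
# redis.conf on the master: refuse writes when the master looks isolated
min-slaves-to-write 1    # require at least 1 connected slave...
min-slaves-max-lag 10    # ...whose replication lag is within 10 seconds
```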

At this point, we have built a highly available Redis service with three machines. An even more machine-saving approach is to put one of the Sentinel processes on the Client's machine rather than on the service provider's. But within a company, the provider and the caller of a service are usually not on the same team, and two teams operating the same machine easily leads to misoperations caused by miscommunication; for this human factor, we still adopt the architecture of Solution 4. Since server 3 only runs a Sentinel process, which consumes few server resources, it can also be used to run other services.

Ease of use: use Redis Sentinel as you would standalone Redis

As service providers, we always talk about user experience, and there is still one area where the Client is not comfortable. With standalone Redis, the Client connects directly to the Redis Server; we only need to hand out one IP and port and the Client can use our service. After switching to Sentinel mode, the Client has to pull in a library that supports Sentinel mode and modify its Redis connection configuration, which "fussy" users may well refuse to accept. Is there a way to give the Client a fixed IP and port, just as with standalone Redis?

[Figure Q6.png: a virtual IP pointing at the current Redis master]

The answer, of course, is yes, and it involves introducing a virtual IP (VIP), as shown in the figure above. We point the virtual IP at the server where the Redis master resides; when a master/slave switch happens, a callback script is triggered that moves the VIP to the server holding the new master. For the Client, it then appears to be using a standalone, yet highly available, Redis service.
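
Sentinel can drive such a callback natively via its client-reconfig-script directive; a sketch (the script path and the VIP handling are assumptions for illustration):

```
# sentinel.conf: run a script whenever a failover promotes a new master.
# Sentinel invokes it as:
#   <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
sentinel client-reconfig-script mymaster /opt/redis/failover.sh
# A hypothetical /opt/redis/failover.sh would drop the VIP on the old master
# (ip addr del) and bind it on the new one (ip addr add), typically followed
# by a gratuitous ARP so clients notice the move quickly.
```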

Conclusion

Getting any service to merely "work" is simple, just like running a standalone version of Redis; once "high availability" is required, things get complicated. The service uses two extra servers, three Sentinel processes, and one slave process, just to ensure that the service is still available when low-probability failures occur. In the actual service, we also use Supervisor to monitor the processes: once a process exits unexpectedly, Supervisor automatically tries to restart it.
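
For reference, a minimal Supervisor entry for a Sentinel process might look like this (paths are illustrative):

```
; supervisord.conf fragment: restart Sentinel automatically if it exits
[program:redis-sentinel]
command=/usr/local/bin/redis-sentinel /etc/redis/sentinel.conf
autostart=true
autorestart=true
stderr_logfile=/var/log/redis/sentinel.err.log
```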