The second case

In the second case, a department built a logging system on top of its existing Redis server. Log data was first written to Redis, and other programs then read, analyzed, and aggregated it to produce data reports.
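The pattern described here is simple enough to sketch. Below is a minimal, hypothetical version in Python with redis-py: producers push JSON log entries onto a Redis list, and a separate consumer pops them off for analysis. The key name `log:queue` and the connection details are assumptions for illustration, not the department's actual code.

```python
import json

import redis  # third-party client: pip install redis

r = redis.Redis(host="localhost", port=6379)

def write_log(event: dict) -> None:
    """Producer: push a JSON-encoded log entry onto a Redis list."""
    r.lpush("log:queue", json.dumps(event))

def consume_logs() -> None:
    """Consumer: pop entries and feed them into report aggregation."""
    while True:
        item = r.brpop("log:queue", timeout=1)  # blocks up to 1s; returns (key, value) or None
        if item is None:
            continue
        _, raw = item
        event = json.loads(raw)
        # ... analyze / aggregate the event into reports here ...
```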

Once the project was finished, the team found the logging component comfortable to use: easy to write to, fast to analyze. Everyone thought it was a great idea. Who needed the company's standard distributed logging service?

As time went by, thousands of clients quietly attached themselves to this Redis instance, pushing tens of thousands of concurrent requests per second, with single-core CPU utilization approaching 90%. The instance was starting to buckle.

The final straw came when a program wrote a 7 MB log entry to the logging component.

Redis blocked on the write, and once it blocked, thousands of clients were disconnected and all logging failed.

A logging failure by itself should not have affected normal business, but because this was not the company's standard distributed logging service, few people were paying attention to it.

The developers who originally wrote it never expected it to be used so heavily, and the operations engineers did not even know this unsanctioned logging service existed.

The service itself was not designed to be fault-tolerant, so it threw exceptions right at the logging call sites. As a result, a significant portion of the company's business systems failed, and the number of 5XX errors in the monitoring system skyrocketed.

A whole group of engineers, close to tears, investigated under enormous pressure; with the blast radius this wide, you can imagine what that troubleshooting felt like.

Problem analysis

On the surface, this case looks like a poorly built logging service or weak development process management. After all, many logging services also use Redis as a buffer for collecting data, so the design itself seems unremarkable.

In fact, a logging system at this scale and traffic level involves an enormous number of technical concerns from collection through analysis; it is far more than a simple write-performance problem.

In this case Redis gave the application an extremely simple performance solution, but that simplicity is relative and bound to a specific scenario.

Here, that kind of simplicity is poison, and swallowing it in ignorance is self-destruction. It is like the saying: a small fish in a ditch should not get cocky just because it has never seen the sea; wait until it reaches the sea...

The other issue in this case is the existence of an unsanctioned logging service: ostensibly a management problem, but at heart a technical one.

Because Redis usage cannot be supervised by DBAs the way relational databases are, operators have no way to manage it or know in advance what data is being stored, and developers can write to Redis and build on it without declaring anything.

So we found that, without management of these scenarios, long-term Redis usage easily spirals out of control. What we needed was a transparent layer through which Redis could be governed and controlled.

These two small cases show that, back when Redis was being used this freely, the engineers relying on it must have been in pain, bombarded by failures such as:

Redis blocked by the KEYS command (a non-blocking alternative is sketched after this list)

Keepalived failing to switch the virtual IP, or the virtual IP being released unexpectedly

Redis being used for computation, driving CPU usage to 100%

Master/slave synchronization failures

Client connection counts exploding
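To illustrate the first failure on that list: KEYS walks the entire keyspace in a single blocking call, which stalls Redis's single command-processing thread. A hedged sketch of the non-blocking alternative, SCAN, in Python with redis-py (the pattern `session:*` is just an example):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# KEYS scans the whole keyspace in one blocking call; on a big instance every
# other client waits until it finishes:
# r.keys("session:*")          # avoid on production instances

# SCAN iterates the keyspace a cursor page at a time, so other commands keep
# being served between pages:
for key in r.scan_iter(match="session:*", count=500):
    pass  # process each matching key here
```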

How to dispel the notion that Redis is no good?

This kind of chaos clearly could not continue; at least in our unit, this way of using Redis could not go on, and its users had gone from loving it to being burned by it.

What to do? It was a heavy burden: it is hard to take a system that has been used badly, like a dish that has been burnt, cook it all over again, and have people applaud the result.

Worse, you cannot simply stop everything, wait for a new system to come online, and then switch over. The job amounted to changing a tire on the highway.

But problems have to be solved. After more thinking and more discussion, we summarized the following points:

We had to have a proper monitoring system that warns us in advance, instead of only discovering problems after they happen.

To control and guide how Redis is used, we needed our own in-house Redis client, so that control and guidance start at the point of use.

Redis's role had to change from storage to cache.

The persistence story had to be redone: we needed a persistence scheme based on the Redis protocol so that users who genuinely need it can use Redis as a database.

High availability had to be split by scenario, with different high-availability solutions for different scenarios.

There was not much time left for the developers: only two months. It was a real challenge, and the moment had come to test whether the development team could change the tire without stopping the car.

The team set about building our own Redis cache system. Let's take a look at the first version, code-named Phoenix:

First, the monitoring system. Existing open-source Redis monitoring is generally just a handful of tools, not a complete monitoring system. Ours had to cover the full link, from the client call all the way to the returned data.

Second, reworking the Redis client. The widely used clients are either too simple, too heavy, or simply not what we wanted.

Take .NET, for example: BookSleeve has not been maintained for a while, and ServiceStack.Redis (used by older .NET applications) has licensing restrictions on its free version.

So we decided to develop our own client and push developers across the company to adopt it in place of whatever they were using.

Into this client we built event logging that records every Redis operation the code performs: elapsed time, key, value size, network disconnections, and so on.
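The in-house client itself is not public, so the sketch below only illustrates the idea in Python: a thin wrapper around redis-py that records elapsed time, key, payload size, and errors for each call, handing the events to whatever collector sits behind it. All names here are hypothetical.

```python
import time

import redis

class InstrumentedRedis:
    """Wrapper that records timing, key and payload size for every operation.
    The `events` list stands in for the real collector pipeline."""

    def __init__(self, client: redis.Redis, events: list):
        self._client = client
        self._events = events

    def set(self, key, value, **kwargs):
        return self._call("SET", key, value, lambda: self._client.set(key, value, **kwargs))

    def get(self, key):
        return self._call("GET", key, None, lambda: self._client.get(key))

    def _call(self, command, key, value, fn):
        start = time.monotonic()
        error = None
        try:
            return fn()
        except redis.RedisError as exc:  # covers disconnects, timeouts, command errors
            error = type(exc).__name__
            raise
        finally:
            self._events.append({
                "cmd": command,
                "key": key,
                "value_bytes": len(value) if isinstance(value, (bytes, str)) else None,
                "elapsed_ms": round((time.monotonic() - start) * 1000, 2),
                "error": error,
            })
```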

These events are gathered in the background and analyzed and processed by a collector. In addition, IP addresses and ports are assigned through a configuration center rather than the client connecting to an IP and port directly.

When a Redis instance has a problem and traffic needs to be switched, the change is made directly in the configuration center, which pushes the new configuration to clients; application teams no longer have to edit configuration files every time Redis is switched over.
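The configuration-center protocol is internal, so the following is only a sketch of the idea: instead of hard-coding an IP and port, the client asks an assumed HTTP endpoint for the current address of a named Redis service, so a failover only requires updating the config center. The URL, service name, and response shape are all assumptions, and a real implementation would push changes to connected clients, as described above, rather than rely on a one-shot lookup.

```python
import redis
import requests  # assumed: the config center exposes a simple HTTP API

CONFIG_CENTER = "http://config-center.internal/api/redis"   # hypothetical endpoint

def connect(service_name: str) -> redis.Redis:
    """Resolve the current address for a named Redis service, then connect."""
    cfg = requests.get(f"{CONFIG_CENTER}/{service_name}", timeout=2).json()
    return redis.Redis(host=cfg["host"], port=cfg["port"])

# Application code refers to a logical service name, never an IP:port.
cache = connect("order-cache")
```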

In addition, Redis commands are split into two groups:

Safe commands can be used directly.

Unsafe commands are enabled only after analysis and approval, and that gate is controlled through the configuration center.

This solves the problem of developers using Redis according to spec, and it returns Redis to the role of a cache, which is how it should be treated unless there is a special need.
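A minimal sketch of what such a client-side gate might look like, again in Python: safe commands pass through, unsafe ones are rejected unless the configuration center has approved them for this application. The command lists and class names are illustrative, not the actual Phoenix implementation.

```python
import redis

# Commands every application may call directly.
SAFE_COMMANDS = {"GET", "SET", "SETEX", "DEL", "EXPIRE", "INCR", "HGET", "HSET"}

# Commands that block or scan the whole instance; only allowed after review.
UNSAFE_COMMANDS = {"KEYS", "FLUSHALL", "FLUSHDB", "MONITOR", "SORT"}

class GuardedRedis:
    """Client wrapper that rejects unsafe commands unless the config center
    has approved them for this application (approved_unsafe)."""

    def __init__(self, client: redis.Redis, approved_unsafe: set):
        self._client = client
        self._approved = {c.upper() for c in approved_unsafe}

    def execute(self, command: str, *args):
        cmd = command.upper()
        if cmd in UNSAFE_COMMANDS and cmd not in self._approved:
            raise PermissionError(f"{cmd} requires approval from the cache platform")
        return self._client.execute_command(cmd, *args)
```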

Finally, the Redis deployment mode was changed from Keepalived-based failover to master-slave plus Sentinel.
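With Sentinel, clients stop caring which node currently holds the master role and ask the Sentinel processes instead. A small example using redis-py's Sentinel support; the host names and the group name `mymaster` are placeholders:

```python
from redis.sentinel import Sentinel

# Sentinel nodes monitor the master/replica pair and promote a replica on failure.
sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)

# "mymaster" is the monitored group name configured in sentinel.conf.
master = sentinel.master_for("mymaster", socket_timeout=0.5)   # route writes here
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # route reads here

master.set("greeting", "hello")
print(replica.get("greeting"))  # may lag briefly until replication catches up
```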

We also implemented Redis sharding ourselves. When a business applies for a large-capacity Redis database, it is split into multiple shards, with a hash algorithm balancing the size of each shard. The sharding is transparent to the application layer.
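The sharding layer itself is in-house, but the routing idea can be sketched: hash the key and pick one of N backend instances, so the application sees a single logical cache. A real deployment would likely use consistent hashing or slot maps to ease resharding; the simple modulo version below is just for illustration, and the shard host names are made up.

```python
import zlib

import redis

# One connection per shard; applications only ever talk to ShardedCache.
SHARDS = [
    redis.Redis(host="redis-shard-0", port=6379),
    redis.Redis(host="redis-shard-1", port=6379),
    redis.Redis(host="redis-shard-2", port=6379),
]

class ShardedCache:
    """Route each key to a shard by hash so one logical cache spans many nodes."""

    def _shard(self, key: str) -> redis.Redis:
        return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

    def set(self, key, value, **kwargs):
        return self._shard(key).set(key, value, **kwargs)

    def get(self, key):
        return self._shard(key).get(key)
```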

Of course, a heavy client alone is not ideal, and what we want to build is a cache platform, not just Redis, so we also built a Redis Proxy to provide a unified entry point.

The Proxy can be deployed in several ways, and no matter which Proxy instance a client connects to, it sees the complete cluster data. This largely solves the problem of choosing different deployment modes for different scenarios.

The Proxy also solves the multi-language problem: for example, the operations system is written in Python and also needs Redis, so it can connect to the Proxy directly and reach the unified Redis platform.
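Because the Proxy speaks the plain Redis protocol, any off-the-shelf client in any language can point at it. For the Python operations system mentioned above, that would look something like this; the proxy hostname and key are assumptions:

```python
import redis

# Connect to the Proxy exactly as if it were a single Redis server.
cache = redis.Redis(host="redis-proxy.internal", port=6379)

cache.set("ops:last_deploy", "ok")
print(cache.get("ops:last_deploy"))
```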

Whether building the client or the Proxy, the goal was not merely to proxy requests but to unify how the Redis cache is used and keep the chaos from reappearing.

With the cache running stably in a manageable, controllable environment, developers can keep using Redis safely and as freely as before; the "mess" is now a virtual mess, because the layer underneath it can be managed.

System Architecture Diagram

Of course, all of this transformation had to be carried out without affecting the business, which posed real challenges, sharding in particular.

Splitting one Redis into several while still letting clients find the keys they need requires great care, because the data lives in memory and could be lost entirely.

During this period we developed a range of synchronization tools, implementing nearly the entire Redis master-slave replication protocol, and finally moved Redis smoothly over to the new mode.
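Our tools spoke the replication protocol itself, which is beyond a short example. A much simpler way to copy data between instances, sketched below, is to walk the source with SCAN and move each key with DUMP/RESTORE, preserving TTLs. This is not the Phoenix migration tool, just an illustration of the shape of the problem; the host names are hypothetical.

```python
import redis

src = redis.Redis(host="old-redis", port=6379)   # hypothetical source instance
dst = redis.Redis(host="new-shard", port=6379)   # hypothetical destination shard

# Walk the source keyspace incrementally and copy each key along with its TTL.
for key in src.scan_iter(count=500):
    payload = src.dump(key)           # value serialized in Redis' internal format
    if payload is None:
        continue                      # key expired or was deleted mid-scan
    ttl_ms = src.pttl(key)
    ttl_ms = ttl_ms if ttl_ms > 0 else 0   # RESTORE treats 0 as "no expiry"
    dst.restore(key, ttl_ms, payload, replace=True)
```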

PS: You may wonder why we did not use Redis Cluster mode. At the time most of our instances were on 2.x and 3.x, with 2.x only slowly being phased out, and the Proxy was added not just for simple sharding but for many other capabilities, such as handling hot single keys. Overall, what we built is a private cache cloud, not merely a container for managing caches.