• Background

• Why K8s

• How K8s

• Why Proxy

• Proxy problems

• The benefits of K8s

• Problems encountered

• Conclusion



Background

Xiaomi uses Redis on a massive scale, with tens of thousands of instances and hundreds of trillions of requests per day, supporting almost every product line and ecosystem company. Previously, all Redis instances were deployed on physical machines without resource isolation, which made management and governance difficult and put the operations staff under great pressure. Nodes taken offline by machine failures or network jitter frequently required manual intervention. And without CPU isolation, a slave node performing an RDB save, or a node whose CPU usage spiked with a surge in QPS, could affect other nodes in the same cluster or in other clusters on the machine, causing unpredictable latency increases.

Sharding uses the Redis Cluster protocol, with each cluster sharding its data independently. While Redis Cluster brings a degree of convenience, it also raises the bar for application development: developers must understand Redis Cluster to some extent and access it through a smart client. These smart clients expose many configuration parameters that application developers cannot fully grasp or tune correctly, which leads to many pitfalls. And because the smart client computes the shard for every key, it also places extra load on the application machines.


Why K8s


Resource isolation

Our Redis Clusters were deployed on physical machines, and to improve resource utilization and save costs, clusters from multiple service lines were co-located on the same machines without CPU isolation. As a result, when the CPU usage of one Redis node spiked, other nodes on the machine, in the same Redis cluster or in other clusters, could not compete for CPU and suffered latency jitter. Because different clusters were mixed together, such problems were hard to locate quickly, hurting O&M efficiency. With containerized deployment on K8s, we can specify a CPU request and a CPU limit for each node, improving resource utilization while avoiding resource contention.
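
As a minimal sketch, the CPU request and limit described above are set per container in the pod spec; the image name and resource values below are illustrative, not our production settings:

  # Illustrative pod spec: the CPU request reserves a share for scheduling,
  # while the CPU limit is a hard ceiling that keeps one noisy Redis node
  # from starving its neighbours.
  apiVersion: v1
  kind: Pod
  metadata:
    name: redis-demo
  spec:
    containers:
    - name: redis
      image: redis:5.0
      resources:
        requests:
          cpu: "1"        # guaranteed share, used by the scheduler
          memory: 4Gi
        limits:
          cpu: "1"        # throttled beyond this ceiling
          memory: 4Gi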

Automated deployment

The current deployment process of a Redis Cluster on physical machines is very tedious: query the metadata database for machines with free resources, modify many configuration files by hand, deploy the nodes one by one, and finally create the cluster with the redis-trib tool. Initializing a new cluster often takes one or two hours.

On K8s, we deploy Redis clusters as StatefulSets and manage configuration files with ConfigMaps. Deploying a new cluster takes only a few minutes, greatly improving O&M efficiency.
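
For illustration, such a ConfigMap might look as follows; this is a sketch, with ordinary redis.conf directives whose values are examples rather than our production settings:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: redis-cluster-config
  data:
    redis.conf: |
      # shared by every pod of the cluster, mounted as a volume
      cluster-enabled yes
      cluster-config-file nodes.conf
      appendonly yes
      maxmemory 4gb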


How K8s


Clients access the service through the VIP of LVS, and requests are forwarded to the Redis Cluster through a Redis proxy, which we introduce for this purpose.


Redis Cluster deployment mode

Redis is deployed as a StatefulSet. For a stateful service this is the most reasonable choice, and it lets each node persist its RDB/AOF files to distributed storage: when a pod restarts or is migrated to another machine, the original RDB/AOF can be retrieved from the mounted PVC (PersistentVolumeClaim) to synchronize data. The PersistentVolume (PV) we chose is the Ceph block service. Ceph's read/write performance is lower than that of local disks, resulting in read/write latency of roughly 100 to 200 ms; but because Redis writes RDB/AOF asynchronously, the latency introduced by distributed storage does not affect the service.
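
A sketch of that StatefulSet shape is below. The image, replica count, and the ceph-rbd StorageClass name are assumptions for illustration; the point is the volumeClaimTemplates section, which gives every replica its own Ceph-backed PVC so the RDB/AOF files survive rescheduling:

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: redis-cluster
  spec:
    serviceName: redis-cluster
    replicas: 6
    selector:
      matchLabels:
        app: redis-cluster
    template:
      metadata:
        labels:
          app: redis-cluster
      spec:
        containers:
        - name: redis
          image: redis:5.0
          command: ["redis-server", "/conf/redis.conf"]
          volumeMounts:
          - name: conf
            mountPath: /conf      # redis.conf from the ConfigMap above
          - name: data
            mountPath: /data      # RDB/AOF live here, on Ceph
        volumes:
        - name: conf
          configMap:
            name: redis-cluster-config
    volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ceph-rbd   # assumed name of the Ceph block StorageClass
        resources:
          requests:
            storage: 10Gi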

The Proxy selection

There are many open-source Redis proxies; the common ones include Twemproxy, Codis, redis-cluster-proxy, Cerberus, and Predixy. Since we wanted to keep using Redis Cluster to manage the clusters, Codis and Twemproxy were ruled out. redis-cluster-proxy is the proxy supporting the Redis Cluster protocol that Redis officially introduced in version 6.0, but it has no stable release yet and cannot be applied at scale for the time being. That left Cerberus and Predixy as candidates. We ran performance tests on both in a K8s environment, with the following setup and results:

The test environment

Test tool: redis-benchmark

Proxy CPU: 2 cores

Client CPU: 2 cores

Redis Cluster: 3 master nodes, 1 CPU per node

The test results



Under the same workload and configuration, Predixy's peak QPS was higher than Cerberus's, with similar latency. Overall, Predixy performed 33% to 60% better than Cerberus, and the larger the key/value, the greater Predixy's advantage, so we chose Predixy. To adapt it to our business and the K8s environment, we made many changes to Predixy before launch, adding features such as dynamic switching of the backend Redis Cluster, black/white lists, and auditing of abnormal operations.

Proxy deployment mode

The proxy is deployed as a Deployment: it is stateless and lightweight, provides services externally through an LB, and can easily be scaled out or in dynamically. At the same time, we developed a dynamic backend-switching function for the proxy, which makes it possible to add and switch Redis Clusters online.
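
A minimal sketch of that proxy tier, assuming an in-house Predixy image (the image name and labels are placeholders; 7617 is Predixy's default listen port):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: redis-proxy
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: redis-proxy
    template:
      metadata:
        labels:
          app: redis-proxy
      spec:
        containers:
        - name: predixy
          image: predixy:custom    # placeholder for the in-house build
          ports:
          - containerPort: 7617    # Predixy's default listen port
  ---
  # Service that the external LB (LVS in our case) targets
  apiVersion: v1
  kind: Service
  metadata:
    name: redis-proxy
  spec:
    type: LoadBalancer
    selector:
      app: redis-proxy
    ports:
    - port: 6379
      targetPort: 7617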

Proxy automatic scaling

We use K8s' native HPA (Horizontal Pod Autoscaler) to scale the proxy dynamically. When the average CPU usage of all proxy pods exceeds a threshold, scale-out is triggered automatically: the HPA increases the proxy's replica count by one, LVS detects the new proxy pod and switches part of the traffic to it, and if CPU usage still exceeds the threshold, the scale-out logic fires again. Scale-in, however, is not triggered within five minutes, no matter how low the CPU usage drops. This way, frequent scaling cannot affect cluster stability.

The HPA can be configured with a minimum (minReplicas) and maximum (maxReplicas) number of pods for the cluster, and the cluster will never shrink below minReplicas. Users are advised to choose minReplicas and maxReplicas according to their actual business load.
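
As a sketch, an HPA of this shape would do; the 50% CPU target and the replica bounds below are illustrative, not our production values:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: redis-proxy
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: redis-proxy
    minReplicas: 3          # never scale in below this
    maxReplicas: 20         # never scale out beyond this
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # scale out when average CPU exceeds 50%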


Why Proxy


Redis Pod restart may cause IP changes

A Redis client using Redis Cluster needs to be configured with some of the cluster's IP addresses and ports, so that it can rediscover the cluster when the client restarts. For a Redis node deployed on a physical machine, the IP address and port stay the same even if the instance or the machine restarts, so the client can still find the Redis Cluster's topology. For a Redis Cluster deployed on K8s, however, a pod restart does not guarantee the same IP address (even if the pod is restarted on the original K8s node), so a restarting client may no longer find an entry point into the Redis Cluster.

By adding a proxy between the client and the Redis Cluster, the cluster's details are hidden from the client. The proxy dynamically tracks topology changes of the Redis Cluster, while the client only needs the IP:Port of LVS as its entry point and forwards all requests to the proxy. The cluster can then be used just like a standalone Redis, without a Redis smart client.


High connection load on Redis

Prior to version 6.0, Redis handled most tasks on a single thread. When a Redis node has many connections, it spends a large amount of CPU processing them, which raises latency. With the proxy in place, the large number of client connections terminates at the proxy, while only a few connections are maintained between the proxy and each Redis instance. This lightens the load on Redis and avoids the latency increase caused by connection growth.


Cluster migration and switchover require an application restart

As the business grows, the data volume of a Redis cluster keeps increasing. When the data on each node gets too large, BGSAVE takes far longer, reducing the cluster's availability; rising QPS also pushes up per-node CPU utilization. Both problems call for adding clusters. Unfortunately, Redis Cluster does not scale out well: the native slot migration is inefficient; when a new node is added, some clients, such as Lettuce, cannot recognize it because of their security mechanisms; the migration time is completely unpredictable; and problems encountered mid-migration cannot be rolled back.

The current expansion procedure for clusters on physical machines is as follows:

  1. Create a new cluster of the required size

  2. Use a synchronization tool to migrate the data from the old cluster to the new one

  3. After verifying that the data is consistent, coordinate with the service to restart it and switch to the new cluster

The whole process is tedious and risky, and the service has to be restarted.

With a proxy layer, the creation, synchronization, and switchover of backend clusters can be hidden from clients. Once synchronization between the old and new clusters is complete, a single command sent to the proxy switches its connection to the new cluster. The cluster can thus be expanded or shrunk with the client completely unaware.


Data security risks

Redis authenticates through AUTH. When clients connect directly to Redis, each client has to store the Redis password. With a proxy, the client only needs the proxy's password to access the proxy and never learns the Redis password. The proxy also restricts commands such as FLUSHDB and CONFIG SET, preventing clients from accidentally flushing data or modifying the Redis configuration, which greatly improves system security.

At the same time, Redis itself provides no auditing capability. We added logging of high-risk operations on the proxy, providing audit capability without affecting overall performance.


Proxy problems


The latency caused by one more hop

The proxy sits between the client and the Redis instance, so every access first hits the proxy and then the Redis node: one extra hop, and therefore extra latency. Our tests show the extra hop adds 0.2 to 0.3 ms, which is generally acceptable for the business.


The IP address changes due to Pod drift

The proxy is deployed as a Deployment on K8s, so its IP may change when a node restarts. K8s' LB scheme can detect the proxy's IP change and dynamically cut LVS traffic over to the restarted proxy.


LVS latency

LVS also introduces latency. Our tests showed that LVS adds less than 0.1 ms of latency for GET/SET operations across different data lengths.


The benefits of K8s


Convenient deployment

The O&M platform deploys clusters by calling the K8s API, which greatly improves O&M efficiency.


Solving port management problems

On physical machines, Xiaomi currently deploys Redis instances by port, and the port of a decommissioned instance cannot be reused; in other words, every Redis instance in the whole company needs a globally unique port number. More than 40,000 of the 65,535 available ports have already been used, and at the current pace of business growth, port resources will be exhausted within two years. With K8s deployment, the K8s pod backing each Redis instance has its own IP, so there is no port exhaustion and no complex port management.


Lowering the barrier to use

Applications only need a standalone-style, non-smart client to connect to the VIP, which lowers the barrier to use and avoids tedious and complex parameter settings. And because the VIP and port are immutable, applications no longer need to manage the Redis Cluster topology themselves.


Improve client performance

Using a non-smart client also reduces load on the client machine, because a smart client must hash every key on the client side to determine which Redis node to send the request to, which consumes client CPU at high QPS. Of course, to ease the migration of client applications, we also made the proxy support the smart-client protocol.


Dynamic upgrade and capacity expansion

We added a dynamic Redis Cluster switchover function to the proxy, so that upgrading and expanding a Redis Cluster is completely transparent to the service. Suppose, for example, a business uses a 30-node Redis Cluster, and growing traffic pushes up both data volume and QPS so that the cluster size must be doubled. On existing physical servers, the expansion would go as follows:

  1. Coordinate resources and deploy a new 60-node cluster

  2. Manually configure the migration tool to migrate data from the current cluster to the new one

  3. After verifying that the data is correct, notify the service to update its Redis Cluster connection pool topology and restart

Although Redis Cluster supports online expansion, slot migration affects online services during the process and its duration is uncontrollable, so this approach is rarely used at present, only occasionally when resources are severely insufficient.

Under the new K8s architecture, the migration process is as follows:

  1. Create a new 60-node cluster with one click through the API

  2. Likewise launch the cluster synchronization tool with one click through the API and migrate the data to the new cluster

  3. After verifying that the data is correct, send a command to the proxy to add the new cluster information and complete the switchover

The whole process is completely transparent to the business.

Cluster upgrades are also convenient: if the business can tolerate a brief latency spike, the upgrade can be done with a StatefulSet rolling update during off-peak hours; if the business is sensitive to latency, a new cluster can be created and the data migrated to it.
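
The rolling-update path relies on the StatefulSet's update strategy; a fragment of the redis-cluster StatefulSet sketched earlier might read as follows (the partition value is illustrative):

  spec:
    updateStrategy:
      type: RollingUpdate      # replace pods one by one, highest ordinal first
      rollingUpdate:
        partition: 0           # raise this to canary only ordinals >= partition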


Improve service stability and resource utilization

K8s' resource isolation makes it possible to co-locate Redis with other types of applications, improving resource utilization while keeping the service stable.


Problems encountered

Pod restarts causing data loss

When a K8s pod hits a problem, the restart can be so fast that the Redis Cluster has not yet detected the failure and completed a failover before the pod is back. If the Redis on the pod is a slave, this does no harm. But if it is a master with no AOF, all of its in-memory data is lost on restart, and Redis reloads the previously saved RDB file, which is not up to date. The slave then resynchronizes from this stale RDB image, discarding the newer data it held, so part of the data is lost.

A StatefulSet is for stateful services, and its pod names follow a fixed format (StatefulSet name plus an ordinal number). When initializing a Redis Cluster, we set adjacent-numbered pods as master and slave. So, before restarting a pod, we identify its slave by pod name and send the CLUSTER FAILOVER command to the slave, forcing a master/slave switch. After the restart, the node automatically rejoins the cluster as a slave.
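
One way to express that failover-before-restart step is a preStop hook on the Redis container. The sketch below is an illustration rather than our exact mechanism, and the PEER_HOST variable (resolving the adjacent-ordinal pod) is hypothetical:

  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - |
          # If this node is currently a master, ask its replica (the
          # adjacent ordinal in the StatefulSet) to take over first.
          ROLE=$(redis-cli role | head -n 1)
          if [ "$ROLE" = "master" ]; then
            redis-cli -h "$PEER_HOST" cluster failover
            sleep 5   # give the cluster time to promote the replica
          fi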

LVS mapping latency

The proxy pods are load-balanced through LVS, and changes to the backend IP:Port mapping take effect with some delay. If a proxy node goes offline abruptly, some connections are lost. To reduce the impact of proxy maintenance on the business, we added the following to the proxy's Deployment template:

  lifecycle:
    preStop:
      exec:
        command:
        - sleep
        - "171"

When a proxy pod is taken offline in a controlled way, such as during scale-in, a rolling update of the proxy version, or any other K8s-managed pod removal, K8s notifies LVS before removing the pod and then waits 171 seconds, giving LVS enough time to gradually shift this pod's traffic to other pods without the business noticing.


K8s StatefulSet cannot fully meet Redis Cluster deployment requirements

The native K8s StatefulSet does not fully meet the deployment requirements of Redis Cluster:

  1. A master and its slave must not be deployed on the same machine. This is easy to understand: if the machine goes down, the data shard becomes unavailable.

  2. A Redis Cluster cannot tolerate more than half of its master nodes failing, because then there are not enough votes to satisfy the gossip protocol. Since masters and slaves can switch roles at any time, we cannot prevent all the nodes on one machine from being masters; therefore no more than 1/4 of a cluster's nodes should be deployed on the same machine.

To meet the above requirements, the native StatefulSet can use the anti-affinity feature to guarantee that a given cluster has at most one node per machine, but this leaves machine utilization low.
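
The anti-affinity rule in question looks roughly like this in the pod template (the label key is illustrative):

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-cluster              # pods of the same Redis cluster
        topologyKey: kubernetes.io/hostname # i.e. at most one such pod per machine

With this hard rule, a 30-node cluster needs 30 machines, which is exactly the utilization problem described above.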

Therefore, we developed a CRD based on StatefulSet, RedisStatefulSet, which deploys Redis nodes according to a variety of policies and adds some Redis management functions. We will explore these in detail in other articles.


Conclusion

At present, dozens of Redis clusters from multiple businesses within the group have been deployed on K8s and running for more than half a year. Thanks to K8s' rapid deployment and failover capabilities, the O&M workload for these clusters is much lower than for Redis clusters on physical machines, and their stability has been fully verified.

We encountered many problems during operation and maintenance, and many of the functions described in this article grew out of actual needs. Many issues still remain, to be solved step by step in the future to further improve resource utilization and service quality.


Mixed vs. dedicated deployment

On physical servers, Redis instances are deployed dedicatedly: each physical machine hosts only Redis instances. This is easy to manage but wastes resources: Redis uses CPU, memory, and network I/O, while most of the storage sits idle. Deploying a Redis instance on K8s, where any other type of service may land on the same machine, improves machine utilization; but for a service like Redis, with strict availability and latency requirements, being evicted because the machine runs out of memory is unacceptable. This forces the O&M staff to monitor the memory of every machine hosting Redis instances and, when memory runs short, fail over the masters and migrate them, which increases the O&M workload.

At the same time, if the co-located services have high network throughput, Redis may also be affected. Although K8s' anti-affinity feature allows Redis instances to be scheduled selectively onto machines without such applications, this cannot be guaranteed when machine resources are tight.


Redis Cluster management

Redis Cluster is a peer-to-peer architecture with no central node, relying on the gossip protocol to propagate and automatically repair cluster state. Node outages and network problems can leave some nodes of a Redis Cluster in a bad state: nodes stuck in the failed or handshake state may appear in the cluster topology, or even a split brain. For such abnormal states, we can add more functionality to the Redis CRD to handle them step by step and further improve O&M efficiency.


Audit and security

Redis itself only provides AUTH password authentication, with no permission management, so its security is poor. With the proxy, we can distinguish client types by password: administrators and ordinary users log in with different passwords and hold different operation permissions, making features such as permission management and operation auditing possible.


Supporting multiple Redis clusters

Limited by the gossip protocol, a single Redis Cluster has limited horizontal scalability: at around 300 nodes, topology changes such as node election become noticeably slower. Meanwhile, because a single Redis instance should not hold too much data, a single Redis Cluster struggles to support datasets beyond the terabyte level. With the proxy, we can logically shard keys so that a single proxy accesses multiple Redis clusters; from the client's perspective, this looks like one Redis cluster that supports a much larger data scale.


Finally, containerized deployment of stateful services like Redis is still not very mature at large Chinese tech companies, and Xiaomi's cloud platform is improving it step by step as we explore. Most of our new clusters are already deployed on K8s, and we plan to migrate the majority of the group's physical-machine Redis clusters to K8s within one to two years. This will effectively reduce the burden on the operations staff and let us maintain more Redis clusters without significantly growing the team.