Background

Zhihu, a well-known Chinese knowledge and content platform, handles a huge number of page views every day. Carrying that traffic while guaranteeing stable, low-latency service is a major challenge that the engineers on Zhihu's technology platform have to face.

The Redis management platform built by Zhihu's storage platform team on top of the open-source Redis components has, through continuous development and iteration, grown into a complete automated operation and maintenance system, providing capabilities such as one-click cluster deployment, one-click automatic scale-out and scale-in, ultra-fine-grained Redis monitoring, and bypass traffic analysis.

Currently, the scale of Redis in Zhihu is as follows:

  • Total machine memory is about 70 TB, of which about 40 TB is actually in use;
  • About 15 million requests per second are handled on average, and about 20 million requests per second at peak;
  • More than 1 trillion requests are processed per day;
  • A single cluster handles a maximum of about 4 million requests per second;
  • There are about 800 cluster and standalone instances in total;
  • Approximately 16,000 Redis instances are running;
  • Redis is mainly the official 3.0.7 release, with a few instances on 4.0.11.

Redis at Zhihu

Based on business requirements, we divide Redis instances into two types: standalone and cluster. Standalone instances are used for small stores with low capacity and performance requirements, while clusters are used where capacity and performance matter.

Standalone

For standalone instances, we adopt the Master-Slave mode for high availability, and only the Master node is exposed externally under normal conditions. Since native Redis is used, standalone instances support all Redis commands.

For standalone instances, we use Redis Sentinel clusters to monitor instance status and perform failover. Sentinel is Redis's high-availability component: after a Redis instance is registered with a Sentinel cluster composed of multiple Sentinels, the Sentinels periodically health-check the instance. When the Redis instance fails, the Sentinels detect the fault via a Gossip protocol and, after agreeing that the node is down, promote a Slave to be the new Master through a simplified Raft protocol.

Usually only one Slave node is kept as a cold standby. If read/write separation is required, multiple read-only Slaves can be created for that purpose.


As shown in the figure, high availability is achieved by registering the Master node with the Sentinel cluster. After the Master's connection information is submitted, Sentinel actively discovers all Slave instances, establishes connections, and periodically checks their health. The client discovers the Master node through a resource discovery policy, currently plain DNS; we plan to migrate to a dedicated resource discovery component such as Consul or etcd.

When the Master node goes down, the Sentinel cluster promotes a Slave to be the new Master and broadcasts the switch on its own Pub/Sub channel +switch-master. The message format is as follows:

switch-master <master name> <oldip> <oldport> <newip> <newport>

After the watcher receives this message, it proactively updates the resource discovery entry, points client connections at the new Master node, and thus completes a failover. For details of the failover process, see the official Redis documentation:

Redis Sentinel Documentation – Redis
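
As an illustration of what such a watcher does, here is a minimal sketch assuming the redis-py library; the Sentinel address and the update_dns() helper are hypothetical stand-ins for the real resource discovery update:

    # Minimal watcher sketch: listen for Sentinel failover events and
    # repoint the resource discovery entry at the new Master.
    import redis

    def update_dns(master_name, new_ip, new_port):
        # Hypothetical helper: update DNS / Consul / etcd for this master.
        print("pointing %s at %s:%s" % (master_name, new_ip, new_port))

    sentinel = redis.StrictRedis(host="sentinel-1.example.com", port=26379)
    pubsub = sentinel.pubsub()
    pubsub.subscribe("+switch-master")

    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        # Payload format: "<master name> <oldip> <oldport> <newip> <newport>"
        name, _old_ip, _old_port, new_ip, new_port = message["data"].decode().split()
        update_dns(name, new_ip, new_port)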

Note the following in actual use:

  • Read-only Slave nodes can have the slave-priority parameter set to 0 as needed, to prevent a read-only node (rather than the hot-standby Slave) from being promoted during failover;
  • Sentinel executes CONFIG REWRITE after a failover. If the CONFIG command has been disabled (renamed) in the Redis configuration, the switchover will fail with an error; this can be worked around by modifying the Sentinel code to use the replacement command;
  • A single Sentinel group should not monitor too many nodes. In our tests, with more than 500 nodes the switchover process occasionally enters the TILT mode; we recommend deploying multiple Sentinel clusters and keeping the number of monitored instances per cluster below 300;
  • Master and Slave nodes should be deployed on different machines, and ideally on different racks; deploying Redis Master/Slave pairs across data centers is not recommended;
  • Sentinel switchover mainly depends on two parameters: down-after-milliseconds and failover-timeout. down-after-milliseconds determines how long Sentinel waits before declaring a Redis node down (Zhihu uses 30000 ms), while failover-timeout determines the minimum interval between two switchover attempts. If a higher switchover success rate is required, failover-timeout can be shortened to the second level; see the official Redis documentation for details (an illustrative configuration follows this list);
  • A network fault on a single node behaves like a machine failure, but if a large-scale fault hits the whole network of a data center, multiple master/slave switches may occur; in that case the resource discovery service may not be updated in time and manual intervention is required.
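
For reference, a configuration along the lines described in these notes might look like the following; the master name, addresses, and quorum are placeholders, and the failover-timeout value is only illustrative:

    # sentinel.conf (illustrative)
    sentinel monitor mymaster 10.0.0.1 6379 2
    sentinel down-after-milliseconds mymaster 30000   # declare the node down after 30s
    sentinel failover-timeout mymaster 60000          # minimum interval between failovers

    # redis.conf on a read-only Slave that must never be promoted
    slave-priority 0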

Cluster

When an instance needs more than 20 GB of capacity or more than 200,000 requests per second of throughput, we use a cluster instance to carry the traffic. A cluster spreads traffic over multiple Redis instances through middleware (a client library, an intermediate proxy, etc.).

Zhihu's Redis cluster solution has gone through two stages: client-side sharding and Twemproxy proxying.

Client Sharding (before 2015)

In the early days, Zhihu used redis-shard for client-side sharding. The redis-shard library implements three hash algorithms (CRC32, MD5, and SHA1) and supports most Redis commands, so users can treat redis-shard as a native Redis client without caring about the underlying sharding.
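
The core idea of client-side sharding is small enough to show in a few lines. The sketch below is not the actual redis-shard code, just a minimal illustration assuming redis-py and CRC32 (one of the hash algorithms redis-shard implements); the shard addresses are placeholders:

    # Minimal client-side sharding sketch: hash the key, take a modulus
    # over the shard list, and send the command to the chosen shard.
    import binascii
    import redis

    SHARDS = [
        redis.StrictRedis(host="10.0.0.1", port=6379),
        redis.StrictRedis(host="10.0.0.2", port=6379),
        redis.StrictRedis(host="10.0.0.3", port=6379),
    ]

    def shard_for(key):
        return SHARDS[binascii.crc32(key.encode()) % len(SHARDS)]

    shard_for("user:123:profile").set("user:123:profile", "...")
    print(shard_for("user:123:profile").get("user:123:profile"))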


Client-based sharding mode has the following advantages:

  • Client-side sharding is the fastest of the cluster solutions: there is no middleware, only one hash computed by the client, no proxy hop, and no MOVED/ASK redirections as in the official cluster solution;
  • There is no need for redundant Proxy machines and no need to consider Proxy deployment and maintenance.
  • You can customize hashing algorithms that are more suitable for production environments.

But there are also the following problems:

  • The client logic has to be implemented in every language. In the early days the whole site was developed in Python, but as business lines grew the languages in use expanded to Python, Golang, Lua, C/C++, and the JVM family (Java, Scala, Kotlin), which made maintenance expensive;
  • Multi-key commands such as MSET and MGET cannot be used normally; hash tags are needed to guarantee that the keys involved land on the same shard (for example, by giving related keys a common tag so they map to the same shard);
  • Upgrading the client requires every service to upgrade, redeploy, and restart, which becomes impossible to roll out once the number of services grows large;
  • Capacity expansion is difficult: storage has to stop writes while a script scans and migrates all keys, and caches can only be expanded with the traditional double-the-modulus approach;
  • Because every client keeps connection pools to all shards, a large number of clients causes too many connections on the Redis side, while too many Redis shards increases the load on the Python client.

For specific features, see:

zhihu/redis-shard

In the early days, most of Zhihu's business was built in Python and Redis capacity fluctuated little, so redis-shard was a reasonably good fit for the needs of that period.

Twemproxy Cluster (2015-now)

Since 2015, the business has grown rapidly and demand for Redis has surged. The original redis-shard model could no longer satisfy the growing need for capacity expansion, so we investigated a variety of cluster solutions and finally chose Twemproxy for its simplicity and efficiency.

Twemproxy, open-sourced by Twitter, has the following advantages:

  • Good performance and sufficient stability, with its own memory pool and buffer reuse, and high code quality;
  • Supports fnv1a_64, murmur, md5 and other hash algorithms;
  • Supports three distribution algorithms: consistent hashing (ketama), modulus hashing (modula), and random.

For specific features, see:

twitter/twemproxy

But the disadvantages are also obvious:

  • Its single-threaded, single-core model becomes a performance bottleneck;
  • It only supports the traditional expansion model: capacity can only be expanded by stopping the service.

To deal with this, we split cluster instances into two modes, cache and storage:

If the business can tolerate losing a small amount of data in exchange for availability, or can rebuild the data in the instance from other storage, the instance is treated as a cache; everything else is storage.

We use different strategies for caching and storage:

Storage


For storage, we use the fnv1a_64 hash algorithm combined with the modula distribution, i.e., keys are sharded by hashing and taking a modulus. Each underlying Redis shard runs in standalone mode with a Sentinel cluster for high availability; by default one Master and one Slave serve each shard, and more Slave nodes can be added if the business needs higher availability.

When a Master node in the cluster goes down, the switchover follows the standalone high-availability process described above, and Twemproxy simply reconnects after the connection drops. For storage-mode clusters, auto_eject_hosts is not set, so nodes are never ejected.

Storage instances also use the noeviction policy by default: when memory usage exceeds the configured limit, an OOM error is returned and keys are never evicted proactively, which preserves data integrity.
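
A storage pool in Twemproxy's nutcracker.yml therefore looks roughly like the sketch below (the listen address and server list are placeholders); the noeviction policy itself is set on the Redis instances, not in Twemproxy:

    # nutcracker.yml - storage pool (illustrative)
    storage_pool:
      listen: 0.0.0.0:22121
      hash: fnv1a_64           # hash algorithm
      distribution: modula     # modulus distribution
      auto_eject_hosts: false  # never eject nodes in storage mode
      redis: true
      servers:                 # each backend is a Master with its own Slave + Sentinel
        - 10.0.0.1:6379:1
        - 10.0.0.2:6379:1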

Because Twemproxy only does high-performance command forwarding and does not separate reads from writes, there is no read/write separation by default. In practice we have not encountered a need for cluster-level read/write separation; if it were ever needed, the resource discovery policy could be used to build a Twemproxy group in front of the Slave nodes, with clients routing reads to it themselves.

Cache

Considering the pressure on the back ends (MySQL, HBase, RPC services, etc.), most of Zhihu's businesses cannot degrade gracefully without their cache, so cache availability matters more to us than data consistency. However, if cache high availability were built the same way as storage, with one Slave per Master, the online environment could only tolerate the loss of a single physical node; tolerating N failed physical nodes would require at least N Slave nodes, which wastes resources.


Therefore, we build Redis cache clusters with Twemproxy's consistent hashing strategy combined with the auto_eject_hosts automatic ejection policy.

For cache we still use fnv1a_64 for hashing, but keys are distributed with ketama (consistent hashing). Cache nodes have no master/slave pairs; each shard has a single Master node carrying the traffic.

With auto_eject_hosts configured, Twemproxy ejects a node once its connection failures exceed server_failure_limit and retries it after server_retry_timeout. After ejection, the hash ring is recalculated with the ketama consistent hashing algorithm and service resumes, so the cluster can keep serving even when several physical nodes are down.


Note the following points in the actual production environment:

  • After a node is ejected, the hit ratio drops for a short period, so back-end storage such as MySQL and HBase must be monitored for the extra traffic;
  • Cache shards in the online environment should not be too large (we recommend staying within 20 GB) and should be spread across machines as much as possible, so that even if some nodes go down, the additional pressure on the back end stays manageable;
  • After a machine crashes and restarts, the cache instance must be flushed before it is started again; otherwise the stale cache data conflicts with the newly written data and produces a dirty cache. Simply leaving the cache instance down is also an option, but while the shard is down clients will periodically accumulate server_failure_limit connection failures;
  • server_retry_timeout and server_failure_limit must be chosen carefully. Zhihu uses 10 minutes and 3 times respectively, i.e. a node is ejected after 3 consecutive connection failures and retried 10 minutes later (a sample cache pool configuration follows this list).
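
Putting these values together, a cache pool configuration looks roughly like this (the listen address and server list are placeholders):

    # nutcracker.yml - cache pool (illustrative)
    cache_pool:
      listen: 0.0.0.0:22122
      hash: fnv1a_64
      distribution: ketama          # consistent hashing
      auto_eject_hosts: true        # eject failed nodes and recompute the ring
      server_failure_limit: 3       # eject after 3 consecutive connection failures
      server_retry_timeout: 600000  # retry the ejected node after 10 minutes (ms)
      redis: true
      servers:
        - 10.0.1.1:6379:1
        - 10.0.1.2:6379:1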

Twemproxy deployment

In the early stage of this solution, we deployed Twemproxy on a fixed set of physical machines, with an Agent on each machine starting the instances. At runtime the Agent health-checks the Twemproxy processes and recovers them on failure. Since Twemproxy only exposes cumulative counters, the Agent also periodically computes deltas to derive metrics such as Twemproxy's requests_per_second.

Later, for better fault detection and resource scheduling, we introduced Kubernetes and put Twemproxy and the Agent into two containers of the same Pod. By configuring the underlying Docker network segment, each Pod obtains its own independent IP, which makes management easy.

At the beginning, for simplicity and ease of use, clients discovered resources through DNS A records: every Twemproxy used the same port number, and a single A record resolved to multiple IP addresses corresponding to multiple Twemproxy instances.

This worked at first, but as traffic grew, the number of Twemproxy instances in a single cluster soon exceeded 20. Because DNS over UDP has a 512-byte packet size limit, a single A record can only hold about 20 IP addresses; beyond that, DNS resolution falls back to TCP, and a client that does not handle the TCP fallback reports an error and fails to start.

As an emergency measure, all we could do was split Twemproxy into multiple groups and expose multiple DNS A records to clients, which then chose among them by round-robin or at random. The scheme worked, but it was not elegant.

How we solved Twemproxy's single-CPU performance limitation

Then we modified the source code of Twemproxy and added SO_REUSEPORT support.

Twemproxy with SO_REUSEPORT on Kubernetes

Inside the same container, a Starter process launches multiple Twemproxy instances and binds them all to the same port, letting the operating system balance the load among them. Externally only one port is exposed, while internally the kernel already spreads connections evenly across the Twemproxy processes.
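
The kernel-level balancing relies on SO_REUSEPORT (available on Linux 3.9 and later): several processes can each bind their own listening socket to the same address and port, and the kernel distributes new connections among them. A minimal Python sketch of the idea (the port number is arbitrary):

    # Run this script several times; every process listens on the same port
    # and the kernel balances incoming connections across them.
    import os
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # the key option
    sock.bind(("0.0.0.0", 22121))
    sock.listen(128)

    print("pid %d listening on :22121" % os.getpid())
    while True:
        conn, _addr = sock.accept()
        conn.sendall(b"handled by pid %d\n" % os.getpid())
        conn.close()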

The Starter also periodically polls each Twemproxy's stats port, aggregates their runtime status, and additionally takes on the job of forwarding signals.

Since the original Agent no longer needs to start Twemproxy instances, the Monitor now calls the Starter to obtain the aggregated stats, computes the deltas, and finally exposes the real-time running status to the outside.


Why not use the official Redis clustering solution

In 2015 we investigated a variety of cluster solutions and, after comprehensive evaluation, chose the somewhat dated Twemproxy over the official Redis Cluster and Codis. The specific reasons are as follows:

  • Blocking problems caused by MIGRATE

The official Redis Cluster uses the CRC16 algorithm to hash keys into 16,384 slots, and the user assigns the slots to shards. During capacity expansion, the user selects the slots to move and, iterating over each slot, runs the MIGRATE command on every key in it.

The MIGRATE command (its full form is shown after this list) is implemented in three phases:

  1. DUMP phase: the source instance traverses the memory of the key and serializes its Redis object; the serialization format is the same as the RDB format;
  2. RESTORE phase: the source instance establishes a TCP connection to the target instance and replays the DUMPed content there with the RESTORE command; newer Redis versions cache the connection to the target instance;
  3. DEL phase (optional): if a previous migration failed, a key with the same name may exist on both nodes. The REPLACE option of MIGRATE decides whether the same-named key on the target is overwritten; if so, the target performs a delete, which since version 4.0 can run asynchronously without blocking the main process.
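
For reference, the MIGRATE command behind these three phases has the following form (the KEYS variant for batching several keys was added in 3.0.6); the address and key in the second line are only an example:

    MIGRATE host port key destination-db timeout [COPY] [REPLACE] [KEYS key [key ...]]
    MIGRATE 10.0.2.5 6379 user:123:followers 0 5000 REPLACE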

After investigation, we decided this model is not suitable for Zhihu's production environment. To guarantee migration consistency, all MIGRATE operations are synchronous: while a MIGRATE executes, the Redis instances at both ends are blocked for varying lengths of time.

For small keys this time is negligible, but if a key occupies a lot of memory, a single MIGRATE command can cause P95 latency spikes or even trigger a failover within the cluster, causing unnecessary switchovers.

Meanwhile, when a key in a slot that is in the middle of migrating is accessed, an ASK redirection may occur depending on the migration progress; the client then has to send the ASKING command to the shard the slot is moving to and issue the request again, doubling the request latency.
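
From the client's point of view, the extra round trip looks roughly like this (the slot number, addresses, and key are illustrative): it receives an ASK error, connects to the target node, sends ASKING, and retries the command.

    127.0.0.1:7000> GET user:123:followers
    (error) ASK 866 10.0.2.5:6379
    10.0.2.5:6379> ASKING
    OK
    10.0.2.5:6379> GET user:123:followers
    "..."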

Codis initially used essentially the same scheme: migration is based on a synchronous command similar to MIGRATE, the only difference being that the proxy drives Redis through the migration instead of a third-party script such as redis-trib.rb. It therefore has the same problem as the official Redis Cluster.

Whether huge keys exist is decided entirely by the business side, and sometimes the business genuinely needs them, such as a user's following list. Once a key larger than 1 MB is used improperly, latency climbs to tens of milliseconds, far above Redis's usual sub-millisecond level. And if, during a slot migration, the business keeps writing several such large keys into the slot being moved between the source and target nodes, the migration gets stuck in a dilemma unless a script is written to remove those keys.

The Redis author mentioned a non-blocking MIGRATE in the Redis 4.2 roadmap, but as of now, with Redis 5.0 about to be released, there is still no movement on it. There are already pull requests in the community; the feature will probably be merged into the master branch only in 5.2 or 6.0, and we will keep watching it.

  • The high availability scheme in cache mode is not flexible

Moreover, the only high-availability mechanism in the official cluster solution is master/slave replication, and the achievable availability level is positively correlated with the number of Slaves: with only one Slave per shard, only one physical machine failure can be tolerated. The Redis 4.2 roadmap mentions an auto-eject and re-shard strategy similar to Twemproxy's, but it has not been implemented so far.

  • The built-in Sentinel creates additional traffic load

In addition, the official Redis Cluster builds Sentinel-like functionality into Redis itself, which produces a large amount of PING/INFO/CLUSTER INFO traffic during the Gossip phase once the number of nodes exceeds 100. According to an issue reported in the community, a 200-node cluster built on version 3.2.8 generated roughly 40 Mb/s of traffic on each node even without any client requests. Although Redis later compressed and patched this, by the nature of the cluster's gossip mechanism this traffic cannot be avoided, which is worth noting for users with large-memory machines but only gigabit network cards.

  • Slot storage overhead

Finally, the per-key slot bookkeeping carries a storage overhead that grows large at scale: before 4.x it could reach several times the memory actually used by the data, and although 4.x switched to a RAX structure, the overhead is still considerable. When migrating from a non-official solution to the official cluster, this extra memory has to be accounted for.

To sum up, the official Redis Cluster and Codis are both very good solutions for most scenarios, but after careful research we found they are not a great fit for our number of clusters and our diverse usage patterns; different scenarios simply emphasize different things. We still want to thank the developers of these components for their contributions to the Redis community.

Capacity expansion

Static capacity expansion

For standalone instances, if the scheduler sees that the host machine still has free memory, we simply adjust the instance's maxmemory configuration and its alarm thresholds. Likewise, for cluster instances, the scheduler checks the machine hosting each node; if every machine has free memory, we update maxmemory and the alarms directly, just as for a standalone instance.
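
Static expansion therefore boils down to a configuration change on the running instance, for example the following (the new limit is illustrative), followed by CONFIG REWRITE to persist it when the CONFIG command has not been disabled:

    CONFIG SET maxmemory 32gb
    CONFIG REWRITE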

Dynamic capacity expansion

However, if a machine does not have enough free memory, or if the back-end instances of a standalone or cluster deployment have grown too large, the capacity cannot be adjusted in place and dynamic expansion is required:

  • For a standalone instance, if it exceeds 30 GB and uses no multi-key commands such as SINTERSTORE, we expand it into a cluster instance;
  • For cluster instances, we perform horizontal expansion, which we call the Resharding process.

Resharding process

The vanilla Twemproxy cluster solution does not support capacity expansion, so we developed a data migration tool to expand Twemproxy clusters. In essence, the migration tool is an agent between the upstream and the downstream that copies data from the upstream to the downstream according to the new sharding layout.

Native Redis master/slave replication uses the SYNC/PSYNC command to establish the master/slave link. After receiving SYNC, the Master forks a child process that walks its memory space to generate an RDB file and sends it to the Slave; all write commands received in the meantime are executed and also cached in an in-memory buffer. Once the RDB has been sent, the Master forwards the buffered commands, and subsequent write commands, to the Slave node.

The migration agent we developed sends a SYNC command upstream to pose as a Slave of the upstream instance, then receives and parses the RDB. Since each key in the RDB is serialized in the same format that the RESTORE command accepts, the agent builds RESTORE commands, recomputes each key's hash against the downstream sharding, and sends the commands downstream in batches through a pipeline.
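
Conceptually, the agent's forwarding loop looks like the heavily simplified sketch below; iter_rdb_entries() is a hypothetical stand-in for a real RDB parser that yields (key, ttl_ms, payload) tuples where payload is already in the serialized format RESTORE accepts, redis-py is assumed, and the downstream addresses are placeholders:

    # Simplified sketch of the Resharding agent's RDB-forwarding loop.
    import binascii
    import redis

    DOWNSTREAM = [
        redis.StrictRedis(host="10.0.3.1", port=6379),
        redis.StrictRedis(host="10.0.3.2", port=6379),
    ]

    def shard_index(key):
        # Recompute the hash against the new (downstream) sharding layout.
        return binascii.crc32(key) % len(DOWNSTREAM)

    def forward_rdb(rdb_stream, batch_size=500):
        pipes = [node.pipeline(transaction=False) for node in DOWNSTREAM]
        pending = 0
        for key, ttl_ms, payload in iter_rdb_entries(rdb_stream):  # hypothetical parser
            # Replay the entry downstream with RESTORE, overwriting if present.
            pipes[shard_index(key)].execute_command(
                "RESTORE", key, ttl_ms, payload, "REPLACE")
            pending += 1
            if pending >= batch_size:  # flush downstream in pipelined batches
                for p in pipes:
                    p.execute()
                pending = 0
        for p in pipes:
            p.execute()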

After the RDB has been forwarded, we generate the new Twemproxy configuration for the downstream cluster, bring up Canary instances based on it, and verify that the Resharding is correct by sampling keys from the upstream Redis back ends; during verification, keys are compared by size, type, and TTL.

Once verification passes, for cluster instances we replace the original Twemproxy configuration with the generated one and restart or reload the Twemproxy proxies. We modified the Twemproxy code to add a config-reload feature, but in practice restarting the instances directly proved more controllable. For a standalone instance being expanded, the command sets supported by standalone and cluster instances differ, so the commands in use must be confirmed with the business side, which then restarts and switches over manually.

Since Twemproxy is deployed on Kubernetes, we can do fine-grained gradual rollouts: if the client supports read/write separation, we first point the read traffic at the new cluster and finally switch over all of the traffic.

Compared with the official Redis Cluster solution, the impact on the upstream Redis is minimal, apart from the latency spike caused by copying the page table when the upstream forks for BGSAVE, and the brief connection flicker during the restart.

Problems existing in such capacity expansion:

  • After SYNC is sent upstream, the upstream fork causes a latency spike;
    • For storage instances, we synchronize from the Slave, so the Master serving requests is unaffected;
    • For cache instances, there is no Slave, so the spike is unavoidable; if the business is particularly sensitive to it, we can skip the RDB phase and build the downstream cache directly from the latest SET traffic via PSYNC.
  • During the switchover, writes may land on the downstream while reads still go to the upstream;
    • For clients that support read/write separation, we switch the read traffic to the downstream instances first, and then the write traffic.
  • Consistency: while the agent is synchronizing, two ordered writes to the same key can reach the downstream via two paths, 1) replicated from the upstream to the downstream, or 2) written directly to the downstream. The command that should have executed first may arrive via 1) after the one written via 2), reversing the command order;
    • This is unavoidable during the switchover. Fortunately most applications do not care; if it is unacceptable, ordering can only be guaranteed by stopping writes upstream and draining the Resharding agent before switching;
    • The official Redis Cluster and Codis both use a blocking MIGRATE to guarantee consistency, so they do not have this problem.

In practice, with a reasonable upstream shard layout, migration throughput can reach tens of millions per second, and resharding 1 TB takes only about half an hour. Also, in a real production environment, planning capacity ahead of time is far faster and safer than an emergency expansion after problems appear.

Bypass analysis

To debug problems in production, it is sometimes necessary to monitor the access patterns of online Redis instances. Redis offers several ways to do this, such as the MONITOR command.

However, because Redis is single-threaded, the built-in MONITOR command pushes CPU usage even higher when the instance is already under heavy load, which is too dangerous for production. Other mechanisms, such as keyspace notifications, only report write events, not reads, so they do not allow detailed observation.

For this reason, we developed a bypass analysis tool based on libpcap: it copies traffic at the system level, parses the protocol at the application layer, and thus implements an out-of-band MONITOR. Measurements show it has little impact on the running instances.
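
The underlying idea is the same as capturing the instance's traffic and decoding the Redis protocol offline; for illustration, a raw capture can be taken with tcpdump (itself built on libpcap) and then parsed:

    tcpdump -i eth0 -s 0 'tcp port 6379' -w redis-6379.pcap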

The bypass analysis tool can also analyze Twemproxy, which has no MONITOR command of its own. Because most production services are deployed with Kubernetes in Docker and each container has its own independent IP, the tool can identify which application a client belongs to and analyze how that business uses Redis, preventing abnormal usage.

Future work

With Redis 5.0 about to be released and version 4.0 now stable, we will gradually upgrade our instances to 4.0. This brings features such as the MEMORY commands, Redis Modules, and the new LFU eviction algorithm, which will be a great help to both the operations side and the business side.

Final words

Zhihu's architecture platform team is the foundational technology team supporting Zhihu's entire business. It develops and maintains nearly all of Zhihu's core infrastructure components, including containers, Redis, MySQL, Kafka, LB, and HBase. The team is small but capable, and each engineer independently owns one of the core systems mentioned above.

As Zhihu's business scale grows rapidly and its complexity keeps increasing, our team faces more and more technical challenges. We welcome people who are passionate about technology and hungry for technical challenges to join us and build a stable and efficient Zhihu cloud platform together.

References

  1. Redis Official site redis.io/
  2. Twemproxy Github Page twitter/twemproxy
  3. Codis Github Page CodisLabs/codis
  4. SO_REUSEPORT Man Page socket(7) – Linux manual page
  5. Kubernetes Production-Grade Container Orchestration