This is a study note for the Geek Time column "Redis Core Technology and Practice" by Jiang Dejun. In addition, I would like to thank Kaito, the top contributor in the comment section, for his sharing and help.

✔️ Knowledge overview

First of all, we all know that Redis is a very classic, high-performance, “single-threaded” key-value database.

Why so high-performance? Besides being an in-memory database, Redis also owes its speed to its underlying data structures. Efficient data structures are the basis for Redis to process data quickly.

Apart from data structures, why is Redis so fast when it is "single-threaded"? To answer that, we need to understand the Redis threading model.

For a database, it’s not enough to be fast, it also needs to be robust, which is often referred to as highly available.

For Redis's high availability, memory-based databases have a fatal problem: in the event of an outage, all data in memory is lost. Simply restoring data from a backend database is very performance- and time-consuming, so persistence is a necessity for Redis. But reading and writing disk is a very slow operation, so how does Redis implement persistence while maintaining high performance? For that we need to know a bit about AOF and RDB.

High availability includes not only data recovery after outages, but also keeping service interruptions to a minimum. Redis adopts a master/slave architecture with read/write separation. How is this achieved? How is data synchronized, and how is consistency between master and slave data ensured? And how can the master keep serving external requests during the process? This requires an understanding of the master-slave architecture of Redis.

This raises a new question: what if the master dies? If the master dies, we need a new master, for example by promoting one of the slaves. So the questions to consider are: How do you tell whether the master has really died? Which slave should be selected as the new master? And after the switch, how are the slaves and the clients informed about the new master?

Redis implements automatic switching between master and slave through the sentinel mechanism, which effectively solves the failover problem in master-slave replication mode.

Once the sentinel mechanism is understood, new problems arise: which sentinel should perform the master/slave switch? Can a master/slave switch still be done if a sentinel dies?

Fast enough, robust enough, and finally big enough: what if the amount of data you need to store is huge? Then we need to understand what a sliced cluster is and how it can be implemented.

At this point, we have a general map of the basic knowledge points related to Redis; next, we dig deeper into each one.

✔️ Data types & data structures

What types of data does Redis have? What is the underlying data structure? How do they correspond to each other?

All data in Redis is stored as key-value pairs in a global hash table, and each value can be one of several data types, each backed by one or more underlying data structures; a diagram from the column shows the correspondence:

Why do sorted sets choose skip lists over red-black trees?

The sorted set uses a skip list instead of a red-black tree because, although insertion, deletion, and single-key lookup have the same time complexity in both, range-based queries on a red-black tree are less efficient than on a skip list.

Integer arrays and compressed lists have no great advantage in the time complexity of lookup operations. Why does Redis still choose them as underlying data structures?

First, because Redis is an in-memory database, memory needs to be optimized as much as possible to improve memory utilization. Arrays and compressed lists are very compact data structures that take up less memory than linked lists.

The second reason is that arrays are friendlier to the CPU cache (spatial locality: accessing one array element pulls its neighbors into the cache along with it). Therefore, when a collection has relatively few elements, Redis defaults to a compact array layout, which saves memory and exploits the CPU cache without reducing access speed. When the number of elements exceeds a threshold, the underlying structure is converted to a hash table or skip list to keep queries efficient.
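
You can watch this conversion happen with redis-py; a minimal sketch, assuming a local Redis server with default thresholds (e.g. set-max-intset-entries = 512; exact encoding names vary by Redis version):

```python
# Observe encoding conversion with redis-py (pip install redis).
# Assumes a Redis server on localhost:6379.
import redis

r = redis.Redis()
r.delete("nums")

r.sadd("nums", *range(10))
print(r.object("encoding", "nums"))   # b'intset' -- compact integer array

r.sadd("nums", *range(10_000))
print(r.object("encoding", "nums"))   # b'hashtable' -- converted past the threshold
```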

What is progressive rehash? What does the process look like?

When the global hash table holds more and more data, some collision chains grow too long and query efficiency drops. So Redis rehashes: it increases the number of hash buckets to reduce the number of elements in a single bucket.

To make the rehash operation more efficient, Redis uses two global hash tables, employed alternately. When the number of elements reaches a threshold, a rehash is performed: allocate more memory to the other hash table, copy the data over from the current one, and free the old table's space.

But this process involves a lot of data copying, and a one-shot migration could block the main thread. To avoid this, Redis uses progressive rehash: after allocating the new space, Redis keeps processing requests normally, and with each request it copies all the entries at one bucket index of the old hash table into the new one, moving on to the next index with the next request. This spreads what would be one large copy across many request handlings and avoids blocking the thread.
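
A toy sketch of the idea in Python (this mimics the scheme, not Redis's actual dict implementation): two tables, with at most one bucket migrated per operation:

```python
class ProgressiveDict:
    def __init__(self):
        self.old = [[] for _ in range(4)]  # table currently in use
        self.new = None                    # rehash target table
        self.idx = 0                       # next old bucket to migrate

    def _step(self):
        # Migrate the entries of one bucket per call, if a rehash is running.
        if self.new is None:
            return
        while self.idx < len(self.old) and not self.old[self.idx]:
            self.idx += 1                  # skip already-empty buckets
        if self.idx < len(self.old):
            for k, v in self.old[self.idx]:
                self.new[hash(k) % len(self.new)].append((k, v))
            self.old[self.idx] = []
            self.idx += 1
        if self.idx >= len(self.old):      # migration done: swap tables
            self.old, self.new, self.idx = self.new, None, 0

    def set(self, k, v):
        self._step()                       # piggyback one migration step
        if self.new is None and sum(map(len, self.old)) >= len(self.old):
            self.new = [[] for _ in range(len(self.old) * 2)]  # start rehash
        for t in filter(None, (self.old, self.new)):
            b = t[hash(k) % len(t)]
            b[:] = [e for e in b if e[0] != k]  # drop any stale entry
        t = self.new if self.new is not None else self.old
        t[hash(k) % len(t)].append((k, v))     # new writes go to the new table

    def get(self, k):
        self._step()
        for t in filter(None, (self.old, self.new)):   # check both tables
            for key, val in t[hash(k) % len(t)]:
                if key == k:
                    return val
        return None

d = ProgressiveDict()
for i in range(20):
    d.set(f"k{i}", i)
print(d.get("k7"))   # 7
```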

✔️ Thread model

Why does Redis use a single thread? Why is Redis so fast despite being single-threaded? What does the Redis threading model look like?

Why does Redis use a single thread?

When we write multithreaded programs, throughput rises at first as we add threads, but beyond a certain point the gains flatten out or even reverse. This is because multithreading requires concurrency control for access to shared resources, and switching between threads also consumes resources. Multithreading also increases system complexity and development difficulty, and reduces maintainability. Finally, the performance bottleneck for the vast majority of operations the Redis server runs is not the CPU, so using multiple threads would not help much.

Why is Redis so fast despite being single-threaded?

First, Redis uses efficient data structures, an important reason for its high performance. The other reason is Redis's threading model: I/O multiplexing on a single thread. It is effectively a single-reactor, single-thread design based on the Reactor pattern. Note that Redis is not completely single-threaded: the main network I/O and key-value reads and writes are done by one thread, which is the main path for serving requests, which is why Redis is commonly described as single-threaded. Other functions such as persistence, asynchronous deletion, and cluster data synchronization are performed by additional threads.

What does the Redis threading model look like?

The Redis single-threaded model is primarily a file event handler, consisting of four parts:

  1. Multiple sockets
  2. An I/O multiplexing program
  3. A file event dispatcher
  4. Event handlers (the connection reply handler, the command request handler, and the command reply handler)

One round of communication between the client and the server looks like this:

When Redis initializes, it associates the AE_READABLE event of the server socket with the connection reply handler.

When a client socket requests a connection to Redis, the server socket generates an AE_READABLE event, which the I/O multiplexer listens for; it pushes the socket into a queue. The file event dispatcher takes the socket from the queue and hands it to the connection reply handler, which creates a socket01 for communicating with that client and associates socket01's AE_READABLE event with the command request handler.

If the client then sends a SET request, socket01 generates an AE_READABLE event, and the I/O multiplexer queues socket01. When the event dispatcher takes socket01 from the queue, the event is already associated with the command request handler, so it is handed there. The command request handler executes the command and associates socket01's AE_WRITABLE event with the command reply handler. When the client is ready to receive the result, socket01 generates an AE_WRITABLE event, which is queued; the event dispatcher finds the associated command reply handler, which writes the result of the operation (such as OK) to socket01 and then disassociates socket01's AE_WRITABLE event from the command reply handler. This completes one round of communication.
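
For intuition, here is a minimal single-threaded reactor using Python's selectors module. It is only a toy analogue of Redis's event loop (Redis's actual implementation is the ae library in C), with the handler roles marked in comments:

```python
# Minimal single-threaded Reactor: one thread multiplexes the listening
# socket and all client sockets, mirroring "IO multiplexing + event
# dispatcher + handlers".
import selectors
import socket

sel = selectors.DefaultSelector()

def accept_handler(server):
    """Connection reply handler: accept and register the new client."""
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, request_handler)

def request_handler(conn):
    """Command request handler: read a request, reply, or clean up."""
    data = conn.recv(4096)
    if data:
        conn.sendall(b"+OK\r\n")   # reply inline for brevity
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 7777))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept_handler)

while True:                        # the event loop / file event dispatcher
    for key, _ in sel.select():
        key.data(key.fileobj)      # dispatch to the associated handler
```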

Don't be misled: more than one socket is involved here, but there is only one server socket. (I'm sorry, I forgot where the picture illustrating this flow came from.)

Note that Redis 6.0 introduced multithreading; more on that later.

✔️ Persistence mechanism

What is an AOF log? How is it implemented?

AOF is a write-after log, unlike the write-ahead log (WAL) of a typical database. The AOF log is a text record of every write command Redis has executed, appended to a buffer in memory and then flushed to disk. Because a command is logged only after it executes successfully, no syntax check is needed when logging, and logging does not block the current write operation.

But this also brings two problems. One: if Redis crashes after executing a command but before writing it to the AOF, that data is lost. The other: although logging does not block the current operation, it may block subsequent ones, because the AOF write is also performed in the main thread, and writing data to disk is a relatively slow operation.

We can trade off between these two problems through configuration. The AOF appendfsync option has three possible values:

  1. always: synchronous write-back. The AOF log is flushed to disk immediately after each command executes.
  2. everysec: write back every second. After each command executes, its log record is written to the in-memory AOF buffer, and the buffer's contents are flushed to disk once per second.
  3. no: the operating system decides when to flush the buffer's contents to disk.

None of the three achieves both high performance and high reliability. The first is the most reliable, with a very low probability of data loss, but the worst performance. The second offers moderate performance and can lose up to one second of data. The third performs best but loses the most data on a crash.
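
The policy is the appendfsync line in redis.conf, and it can also be changed at runtime; a small redis-py sketch, assuming a local instance:

```python
# Switching the AOF fsync policy at runtime with redis-py.
import redis

r = redis.Redis()
r.config_set("appendonly", "yes")          # enable AOF logging
r.config_set("appendfsync", "everysec")    # always | everysec | no
print(r.config_get("appendfsync"))         # {'appendfsync': 'everysec'}
```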

What if AOF logs get bigger and bigger? What is the AOF rewrite mechanism? What is the process of rewriting? Where is it likely to be blocked?

The AOF keeps growing as commands are appended. Files that are too large may hit size limits and fail to be saved, appending new content becomes less efficient, and recovering from an outage is slow because the many logged commands must be re-executed one by one. This is where the AOF rewrite mechanism comes in.

An AOF rewrite builds a new AOF log from all the key-value pairs currently in the database, recording one write command per key-value pair. So where the old AOF log may contain many commands for a single key, the rewrite collapses them into one.

To ensure high performance, Redis rewrites without blocking the main thread. The rewrite process can be summarized as **"one copy, two logs"**.

At rewrite time, the main thread forks a background child process, and the fork hands the child a copy of the main thread's memory. The main thread does not block and keeps serving new operations; their commands are still appended to the old AOF log (so the current AOF stays complete even if the rewrite fails), and are also written to the AOF rewrite buffer. After the child finishes rewriting all the copied data, the contents of the AOF rewrite buffer are appended to the new AOF log, and the new AOF can then replace the old one.

Note that Redis relies on copy-on-write to avoid copying a large amount of data at once. At fork time, what the child actually gets is a copy of the main process's page table (the mapping from virtual to physical memory), not the actual memory data; at that point the parent and child processes share the data in memory. When the main process later writes to a key, new memory is allocated and the data is written there, so the parent's and child's data gradually diverge. Note that the granularity of copy-on-write is a memory page: when the main process takes a write, it copies the whole memory page containing the data being written.
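
A tiny Unix-only demonstration of the property that bgsave and AOF rewrite rely on: the forked child sees a point-in-time snapshot of the parent's memory while the parent keeps taking writes. (The page-level copy-on-write itself happens transparently inside the kernel.)

```python
# A forked child sees a fork-time snapshot of the parent's memory.
import os
import time

data = {"counter": 0}

pid = os.fork()
if pid == 0:                      # child: the "background save" process
    time.sleep(0.5)               # parent mutates meanwhile...
    print("child sees:", data)    # {'counter': 0} -- the fork-time snapshot
    os._exit(0)
else:                             # parent: keeps serving writes
    data["counter"] += 1
    print("parent sees:", data)   # {'counter': 1}
    os.waitpid(pid, 0)
```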

There are two potential blocking points to note. First, copying the page table at fork time consumes CPU resources and blocks the main thread for a duration that depends on the memory size of the whole instance. Second, during copy-on-write the copy granularity is a memory page, 4 KB by default; if the key being written is a bigkey, allocating and copying the larger block of memory takes longer.

In addition, if the system has the huge page mechanism enabled (2 MB memory pages), the copying delay after the main process takes a write grows substantially. You are therefore advised to disable huge pages when using Redis. (The huge page feature mainly improves the TLB hit ratio: for the same amount of memory, fewer page entries are needed, so the TLB can cover more pages and the overhead of TLB misses drops.)

In fact, many business scenarios are not sensitive to data loss, and in those AOF is not required.

What is RDB? What is the mechanism of RDB?

RDB: memory snapshots, Redis's other persistence solution, which works best in combination with AOF.

If only AOF is used for persistence, then when the dataset is large and there are many recorded operations, fault recovery has to re-execute commands one by one, which is inefficient. RDB addresses this.

Generating an RDB file means writing all the data in Redis at a point in time to a file on disk. To recover after an outage, the file just needs to be read back into memory, which is more efficient than replaying an AOF.

Redis provides two commands to generate an RDB file:

  1. save: executed in the main thread, which it blocks.
  2. bgsave: forks a child process dedicated to writing the RDB file; this is the default configuration.

The bgsave command creates an RDB file in much the same way as AOF rewriting, again using COW technology: it forks a child process from the main thread (copying the memory page table), which shares memory data with the main thread, and when a write occurs on the main thread, the corresponding memory page is copied.

Note that RDB files should not be generated too frequently. On the one hand, frequent snapshots put a lot of pressure on the disk, and a new RDB may start before the previous one has finished, falling into a vicious circle. On the other hand, forking blocks the main thread and can affect performance.

Incremental snapshots could avoid the overhead of repeated full snapshots: after a full snapshot, record which data was modified, and have the next RDB record only that modified data. But this creates another problem: tracking the modifications consumes a lot of extra memory, and memory is Redis's most precious resource.

So with RDB alone there is no good snapshot frequency: if it is too high, performance suffers; if it is too low, large amounts of data are lost when an outage occurs. Hence the common practice of combining RDB with AOF.

RDB runs at some frequency, and AOF logs the writes in between. Once the next RDB completes, the intermediate AOF can be cleared. When restoring data, the RDB file is used first to restore the bulk of the data, and the AOF restores the rest; this largely achieves having your cake and eating it too.

The default Redis configuration for RDB frequency is to execute bgsave if any of the following three conditions is met:

```
save 900 1     # at least 1 change within 900 seconds
save 300 10    # at least 10 changes within 300 seconds
save 60 10000  # at least 10000 changes within 60 seconds
```

Since Redis 4.0, an AOF rewrite can write the memory data in RDB format at the beginning of the AOF file (the mixed RDB-AOF format). The trade-off is that the RDB portion is not human-readable.

✔️ Master/slave architecture

For high reliability: RDB and AOF ensure that as little data as possible is lost, while the master/slave architecture minimizes service interruptions. Redis's master-slave mode uses read/write separation: both the master and the slaves can serve reads, but only the master takes writes, which it then synchronizes to the slaves.

How does Redis achieve data consistency between master and slave? What is the process of data synchronization?

The first synchronization between master and slave requires a full copy. The slave establishes a connection with the master and tells it that synchronization is about to begin; once the master confirms, synchronization can start.

The slave sends the psync command to the master: psync runID offset. The runID is a random unique ID automatically generated by every Redis instance to identify it; offset is the replication offset. On the first replication the runID is unknown, so ? is sent, with an offset of -1. The master replies with its own runID and its current replication offset.

On the first copy, the master sends all of its data to the slave, a process that relies on RDB: the master generates an RDB file and sends it to the slave, which empties its database and then loads the file into memory. While replicating, the master keeps serving normally, caching new write operations in the replication buffer. Finally, once the RDB file has been sent, the master sends the contents of the replication buffer to the slave, which replays those operations, and the synchronization is complete.

After the first full copy completes, the master and slave maintain a long-lived connection, over which the master propagates all subsequent commands to the slave. Note that if many slave nodes hang off one master node, doing full copies with all of them puts a lot of pressure on the master. A master-slave-slave topology is therefore commonly adopted: some slaves attach to other slaves instead, spreading the master's load.

Another consideration: what if the network connection stalls or even drops? Before Redis 2.8, once the network between master and slave was broken, the slave would do another full copy, which is very expensive. Since 2.8, the slave performs incremental replication instead, specifically by means of the repl_backlog_buffer.

repl_backlog_buffer is a ring buffer that the master allocates when a slave first attaches. Besides synchronizing every command to the slaves, the master records a copy in this buffer. When a slave disconnects and reconnects, it sends its offset to the master; the master locates the disconnection point in repl_backlog_buffer by that offset and sends only the subsequent commands to the slave. Since repl_backlog_buffer is circular, if master and slave stay disconnected too long (or the network is laggy, or the slave executes slowly), new writes overwrite old ones before the slave catches up, and a full copy has to be done again. So the repl_backlog_size parameter must be sized with care. A rough rule: repl_backlog_size = the master's command write rate * command size - the master-to-slave transfer rate * command size, usually multiplied by 2 to allow for unexpected request pressure. If concurrency peaks are especially high, set it even larger, or consider a sliced cluster to share the master's request load (more on that later).
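
A worked example of that sizing rule, with illustrative numbers:

```python
# Rough sizing of repl_backlog_size, per the rule above. The numbers
# are made up for illustration.
master_write_rps = 2000      # write commands/second on the master
slave_sync_rps = 1000        # commands/second actually shipped to slaves
cmd_size_kb = 2              # average command size, in KB

backlog_kb_per_sec = (master_write_rps - slave_sync_rps) * cmd_size_kb
repl_backlog_size_kb = backlog_kb_per_sec * 2   # 2x safety margin
print(f"repl_backlog_size ~= {repl_backlog_size_kb} KB")  # ~= 4000 KB
```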

Note the distinction between the replication buffer and the repl_backlog_buffer.

The former is the buffer the Redis server uses to communicate with a client: each client connection is allocated one; Redis writes data into it and then sends the data to the client socket over the network. A slave is also a client, so this buffer is what carries the master's write commands to the slave. Redis provides the client-output-buffer-limit parameter to cap its size; if a slave exceeds the limit, the master forcibly disconnects it. Without a limit, a slave that processes requests slowly would let the buffer expand indefinitely, resulting in OOM.

The latter exists specifically for master/slave synchronization, to avoid the performance cost of a full copy after every disconnection. It is only used to work out how far apart the master and slave are; the real information transfer still depends on the replication buffer.

What is the Redis sentinel mechanism? What is the basic process?

The sentinel mechanism provides failover for the Redis master/slave cluster: if the master fails, the master/slave switchover is performed automatically.

The sentinel mechanism mainly solves three problems of master-slave switchover:

  1. How do we know whether the master has really died? (monitoring)
  2. Which slave should be selected as the new master? (leader selection)
  3. How are the slaves and the clients informed about the new master? (notification)

The flow of the sentinel mechanism:

While running, the sentinel periodically sends PING commands to all masters and slaves to check whether they are running properly. If a node does not respond within the configured time, the sentinel marks it as "subjectively down".

If it is a slave node, being down matters little: it is simply marked. If it is the master, the switchover cannot start right away, because the judgment could be mistaken, for example due to network congestion or high pressure on the master. A master/slave switchover is expensive, and unnecessary ones must be avoided.

Sentinels are also clustered, so only after enough sentinels (at least the configured quorum, commonly a majority) consider the master "subjectively down" is it marked "objectively down". Only then is the master/slave switchover triggered.

The sentinel selects the new master by filtering and then scoring.

If a slave keeps losing its connection to the master, its network is unreliable and it is unsuitable as a master; such nodes are filtered out.

There are three rounds of scoring for the remaining nodes.

Round one: the slave with the highest priority scores highest. We can configure different priorities for slaves; for example, give the slave on the best-provisioned machine the highest priority, so it is the one selected during a switchover.

Round two judges how closely each slave is synchronized with the old master: the closer, the higher the score. This is determined by the replication offsets kept for master/slave synchronization in repl_backlog_buffer.

If they are still tied, round three: the slave with the smaller ID scores higher.
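
Putting the three rounds together, a toy sketch of filter-then-score (the field names are illustrative, not Redis's internals; note that in Redis's replica-priority, a lower number means higher priority):

```python
# Toy sketch of the sentinel's filter-and-score selection.
def pick_new_master(slaves, max_down_count=10):
    # Filter: drop slaves whose connection to the old master is flaky.
    candidates = [s for s in slaves if s["down_count"] <= max_down_count]
    # Round 1: best priority first (lower number = higher priority).
    # Round 2: largest replication offset (closest to the old master).
    # Round 3: smallest run ID breaks any remaining tie.
    candidates.sort(key=lambda s: (s["priority"], -s["offset"], s["run_id"]))
    return candidates[0] if candidates else None

slaves = [
    {"run_id": "b", "priority": 100, "offset": 900, "down_count": 2},
    {"run_id": "a", "priority": 100, "offset": 900, "down_count": 1},
    {"run_id": "c", "priority": 100, "offset": 800, "down_count": 30},
]
print(pick_new_master(slaves)["run_id"])   # 'a' (tie broken by smaller ID)
```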

Once the new master is selected, the sentinel sends its connection information to the other slaves, which execute replicaof to connect to the new master and synchronize data. At the same time, the sentinel notifies clients to send their requests to the new master node.

Note that during a master/slave switchover, with read/write separation, read requests can still be served normally by the slaves, but write requests fail while the old master is dead and no new master exists yet. If clients are not supposed to notice the switchover, write requests need to be parked in a message queue; after the switchover completes, the client pulls the commands from the queue and executes them.

Clients learn the master and slave node information from the sentinel via broadcast: the sentinel broadcasts the master's address to clients, and clients can also query the sentinel for it. Most SDKs encapsulate this functionality.

Communication within the sentinel cluster goes through Redis's pub/sub mechanism. Once a sentinel establishes a connection with the master, it publishes its own IP and port to a fixed channel on the master, __sentinel__:hello. All sentinels discover the others by subscribing and publishing to this channel, and once discovered, they set up network connections to communicate with each other.

Sentinels also need connections to the slaves for monitoring. This is achieved by sending the INFO command to the master: upon receiving it, the master returns its list of slaves to the sentinel.

The sentinel also connects to clients and broadcasts to them the events that occur during monitoring, leader selection, and switchover. This too uses the pub/sub mechanism: sentinel provides multiple message channels, each carrying different key events, for example the master going subjectively or objectively down, or the master/slave switchover happening.
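
For example, a client can watch these events with redis-py; a sketch, assuming a sentinel on the conventional port 26379:

```python
# Watching sentinel events over pub/sub. The channels (+sdown, +odown,
# +switch-master, ...) are published by sentinel itself.
import redis

sentinel = redis.Redis(port=26379)
p = sentinel.pubsub()
p.psubscribe("+sdown", "+odown", "+switch-master")

for msg in p.listen():                       # blocks, printing events
    if msg["type"] == "pmessage":
        print(msg["channel"], msg["data"])   # e.g. +switch-master mymaster ...
```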

How is it determined which sentinel performs the master/slave switch? Can a master/slave switch still happen if a sentinel dies?

Sentinels use an election mechanism to choose the one that performs the switchover. First, look again at how the master is judged to be down:

When any sentinel finds the master subjectively down, it sends messages to the other sentinels asking them to confirm the master's status immediately. Each of the others returns Y if it also finds the master down, otherwise N. When a sentinel has collected affirmative votes (including its own) greater than or equal to the quorum value in the configuration, it marks the master as objectively down.

At this point, that sentinel sends a message to the other sentinels asking for votes, indicating that it wants to perform the master/slave switch (in effect nominating itself and casting one vote for itself). The sentinel that finally becomes the Leader and performs the switchover must meet two conditions: win more than half of the votes, and the vote count must also reach quorum. With three sentinels and quorum=2, two votes are required to be elected.

Each sentinel may cast only one vote per round. When a sentinel receives vote requests from others, it votes for whichever request reached it first; or, if it has itself judged the master objectively down and initiated the election, it has already voted for itself and cannot vote for another sentinel.

If a round of voting produces no Leader, the sentinels wait for a period of time and then re-elect. Once a Leader is produced, that sentinel performs the master/slave switchover.

Note: if the sentinel cluster has only two instances, a sentinel must obtain two votes to become Leader, so if one sentinel dies, the survivor can never be elected and no master/slave switchover can be performed. That is why you usually deploy at least three sentinels.

✔️ Sliced cluster

What if you need to store a very large amount of data? What is a sliced cluster? How is the official Redis Cluster implemented?

The easy solution is vertical scaling: upgrade the resources of a single Redis instance, such as more memory, more disk, or a better CPU. But this faces two difficulties. One is the limit of hardware and cost: the more memory, the higher the upgrade cost. The other, as mentioned before, is that in RDB persistence the fork time is proportional to the data volume, so a very large dataset blocks the main thread for longer. The advantage of vertical scaling, however, is that it is simple and direct to implement.

Is there a better solution? Yes: horizontal scaling, i.e. the Redis sliced cluster: scale out the number of Redis instances and spread the data evenly across them. This raises two questions: once sliced, how is the data distributed across the instances? And how does a client find out which instance holds the data it wants?

Redis 3.0 provided the Redis Cluster solution to implement the sliced cluster, which uses hash slots to handle the mapping between data and instances. A sliced cluster has 16384 hash slots in total, and each key-value pair is mapped to a slot according to its key: first the CRC16 algorithm computes a 16-bit value from the key, then this value modulo 16384 gives the slot. When we create a cluster with the cluster create command, Redis automatically distributes the 16384 slots evenly across all instances; you can also manually assign the number of slots on each instance. (Caution: with manual assignment, all 16384 slots must be allocated, or the cluster cannot work properly.)
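
The slot computation is easy to reproduce; Redis uses the CCITT (XModem) variant of CRC16. A Python sketch (real clients also special-case {hash tags}, omitted here):

```python
# Reproducing Redis Cluster's slot mapping: CRC16 (XModem variant)
# of the key, modulo 16384.
def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    return crc16_xmodem(key.encode()) % 16384

print(key_slot("hello"))   # some slot in [0, 16383]
```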

So how does a client accessing the cluster determine which instance holds the hash slot of the data it wants? Each Redis instance sends its own hash slot information to the other instances it connects to, spreading the slot-assignment information until every instance in the cluster holds the full mapping from hash slots to instances. The client caches the mapping it receives locally; when a request comes, the key's hash value locates the slot, and the slot locates the instance.

However, instances in the cluster get added and removed, and when that happens Redis redistributes the hash slots across all instances for load balancing. The slot mapping has then changed, so what can the client do? The Redis Cluster solution provides a redirection mechanism: if a client sends a read/write request to an instance that does not hold the key's hash slot, the instance returns a MOVED error carrying the correct instance's address:

```
GET hello:key
(error) MOVED 13320 172.16.19.5:6379
```

Separately, because a hash slot holds many key-value pairs, a slot being accessed may also be in the middle of migrating, with part of it on the new instance and part on the original one. If the client's read/write request reaches the old instance and the data is still there, the command executes normally. If not, the instance returns an ASK error to the client, also carrying the new instance's address, and the client sends an ASKING command to the new instance followed by the actual command. In other words, ASK means "this slot is being migrated; the key you want may already be at the new address, go ask there." Unlike MOVED, ASK does not update the hash slot mapping cached by the client, precisely because the migration is not finished yet, so no data is lost mid-migration.
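
A toy sketch of the client-side handling (the Node class, error types, and method names are hypothetical stand-ins, not a real client library):

```python
# MOVED updates the cached slot map; ASK is a one-shot redirect
# (preceded by ASKING) that deliberately leaves the cache untouched.
class MovedError(Exception):
    def __init__(self, slot, node):
        self.slot, self.node = slot, node

class AskError(Exception):
    def __init__(self, slot, node):
        self.slot, self.node = slot, node

def execute(node, slot_map, command):
    """Run `command` on `node`, following any redirections."""
    while True:
        try:
            return node.send(command)
        except MovedError as e:          # slot has a new permanent owner:
            slot_map[e.slot] = e.node    #   refresh the local slot cache
            node = e.node                #   and retry there
        except AskError as e:            # slot is mid-migration:
            e.node.send("ASKING")        #   one-shot permission on new node
            return e.node.send(command)  #   slot cache left untouched
```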

Why does Redis Cluster map key → hash slot → instance, instead of directly storing the mapping between keys and instances?

The main reasons are as follows:

  1. The total number of keys in a cluster cannot be predicted. Recording a key-to-instance mapping directly would make the mapping table very large when there are many keys, consuming a lot of storage whether it is kept on servers or clients. The total number of hash slots in Redis Cluster is fixed and never grows.
  2. Redis Cluster uses a decentralized model. A client sends a key access to some node; if that node does not hold the key, the client must be corrected and routed to the correct node (MOVED/ASK). This requires every node to hold the full hash slot mapping and exchange it with other nodes. If key-to-instance mappings were exchanged instead, the amount of information passed between nodes would be enormous, consuming network resources.
  3. When instances are added to or removed from the cluster and data is rebalanced, data migrates between nodes; with a key-to-instance mapping, every affected key's entry would have to be modified, a high maintenance cost.
  4. Adding the hash slot as a middle layer between key and instance decouples the data from the nodes. A key is hashed and depends only on which slot it falls into, which consumes little CPU and spreads data evenly. The mapping itself is small, easy for both clients and servers to store, and light to exchange between nodes.
  5. When the number of cluster instances grows or shrinks, data balancing only needs to be performed in units of hash slots, simplifying cluster maintenance and management.

A side note: the hash slot scheme is essentially consistent hashing, a hash ring with 16384 slots. Compared with plain consistent hashing, however, the hash slot approach adds an intermediate layer, the slot, which decouples keys from nodes, makes data migration more convenient, and reduces maintenance difficulty.