Redis is an in-memory database: once the server goes down, all the data in memory is lost. That is why it is critical for Redis to persist data, so that it does not have to be rebuilt from a back-end database after a restart.

At present, Redis persistence mainly has two mechanisms, namely AOF (Append Only File) log and RDB snapshot.

1. The implementation of the AOF log

We know that a database's Write Ahead Log (WAL) records the modified data in a log file before the actual data is written, so that it can be used for recovery after a failure. AOF logging is the opposite: it is a write-after log. "Write after" means that Redis first executes the command and writes the data to memory, and only then logs it, as shown in the following figure:

So why does AOF execute commands before logging?

Whereas traditional database logs, such as the redo log, record the modified data itself, AOF records every write command Redis receives, stored as text. Let's look at an AOF entry, the one Redis writes after receiving the command "set testkey testvalue". "*3" indicates that the command has three parts. Each part starts with "$" followed by a number, and then the actual command, key, or value; the number gives the size in bytes of that part. For example, "$3 set" means this part is three bytes long and is the "set" command.
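To make that encoding concrete, here is a minimal sketch (plain Python, not Redis source code; the function name is just for illustration) that builds the same record:

```python
# Minimal sketch of the AOF/RESP text encoding described above: "*" carries the
# number of parts, and each part is prefixed by "$" plus its length in bytes.
def to_aof_record(*parts: str) -> bytes:
    record = f"*{len(parts)}\r\n"
    for part in parts:
        record += f"${len(part.encode())}\r\n{part}\r\n"
    return record.encode()

print(to_aof_record("set", "testkey", "testvalue").decode())
# *3
# $3
# set
# $7
# testkey
# $9
# testvalue
```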

However, to avoid extra checking overhead, Redis does not check the syntax of these commands when writing them to the AOF log. So if the log were written before the command is executed, an erroneous command could end up in the log, and Redis might fail when replaying the log to recover data.

In write-after logging, the command is executed first and is recorded in the log only if it succeeds; otherwise, an error is reported to the client. So one big benefit of Redis logging after the write is that incorrect commands never reach the log. A second benefit is that, because logging happens after execution, it does not block the current write operation.

AOF also has two potential risks:

First, if Redis goes down right after a command is executed but before it is logged, that command and its data are at risk of being lost. If Redis is only used as a cache, the data can be re-read from the back-end database; but if Redis is used directly as a database, the unlogged command cannot be recovered.

Second, while AOF avoids blocking the current command, it may block the next one. The AOF log is also written by the main thread, and if the disk is under heavy write pressure while the log is being flushed, the write will be slow and subsequent operations cannot proceed.

Both risks are related to the timing of AOF writing back to disk. This means that both risks are eliminated if we can control when AOF logs are written back to disk after a write command is executed.

AOF write-back strategies

The AOF mechanism gives us three choices, namely the three values of the AOF configuration item appendfsync:

  • Always (synchronous write back): the log is synchronously written back to disk immediately after every write command is executed.
  • Everysec (write back every second): after each write command is executed, the log is only written to the in-memory AOF buffer, and the buffer's contents are flushed to disk once per second.
  • No (operating-system-controlled write back): after each write command is executed, the log is only written to the in-memory AOF buffer, and the operating system decides when to flush the buffer to disk.
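As a concrete illustration, here is a hedged sketch of switching between these strategies with the redis-py client; it assumes redis-py is installed and a local Redis server is running:

```python
import redis  # redis-py client; a local Redis server is assumed

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.config_set("appendonly", "yes")        # enable AOF logging
r.config_set("appendfsync", "everysec")  # or "always" / "no"
print(r.config_get("appendfsync"))       # -> {'appendfsync': 'everysec'}
```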

None of the three write-back strategies manages to both avoid blocking the main thread and avoid data loss. Let's look at why.

"Synchronous write back" largely avoids losing data, but every write command is followed by a slow disk write, which inevitably hurts main-thread performance.

With "operating-system-controlled write back", Redis can continue executing commands once the log is in the buffer, but when the data actually reaches disk is no longer in Redis's hands: any AOF records not yet flushed are lost if Redis crashes.

"Write back every second" flushes once per second, avoiding the per-command overhead of synchronous write back. The impact on performance is reduced, but commands from the last second that have not yet reached disk are still lost in a crash. So this is a compromise between not hurting main-thread performance and not losing data.

**To summarize:** for maximum performance, choose the No strategy; for maximum reliability, choose the Always strategy; if you can tolerate a little data loss but don't want performance to suffer much, choose the Everysec strategy.

But a write-back strategy chosen to match the system's performance requirements does not make you "safe and sound" forever. After all, AOF is a file that records every write command received, so as more write commands come in, the AOF file keeps growing. That means we must watch out for the performance problems caused by a large AOF file.

The “performance problem” here mainly lies in the following three aspects:

  • First, the file system itself limits how large a single file can be, so an oversized file cannot be saved.
  • Second, appending new command records to a very large file becomes inefficient.
  • Third, after an outage, every command recorded in the AOF must be re-executed one by one for fault recovery. If the log file is too large, the whole recovery process is very slow, which affects the normal use of Redis.

So we need AOF rewriting

In simple terms, the AOF rewrite mechanism has Redis create a new AOF file from the database's current state during the rewrite: Redis reads every key-value pair in the database and records each one with a single write command.

Why does rewriting make the log file smaller?

In fact, the rewrite mechanism has a "many-to-one" effect: multiple commands for the same key in the old log become a single command in the new log after the rewrite.

The AOF file records the write commands it receives one by one, by appending. When a key-value pair is modified by many write commands, the AOF file contains all of those commands. During a rewrite, however, commands are generated from the key-value pair's current state, so one command is enough to record it in the rewrite log, and during recovery that single command recreates the pair directly.

Here’s an example:

Suppose a list reaches the final state ["D", "C", "N"] after six modifications. A single LPUSH u:list "N" "C" "D" command is enough to restore that data, saving the space of five commands. For key-value pairs that have been modified hundreds or thousands of times, the space saved by rewriting is of course even greater.
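To make the "many-to-one" effect tangible, here is an illustrative sketch (the six commands are made up for the example) of a modification history whose final state is ["D", "C", "N"], assuming redis-py and a local server:

```python
import redis  # redis-py client; a local Redis server is assumed

r = redis.Redis(decode_responses=True)
r.delete("u:list")
r.lpush("u:list", "A")    # ["A"]
r.lpush("u:list", "N")    # ["N", "A"]
r.rpop("u:list")          # ["N"]
r.lpush("u:list", "C")    # ["C", "N"]
r.lpush("u:list", "B")    # ["B", "C", "N"]
r.lset("u:list", 0, "D")  # ["D", "C", "N"]
print(r.lrange("u:list", 0, -1))  # ['D', 'C', 'N']

# After an AOF rewrite, the new log only needs the single equivalent command:
#   LPUSH u:list "N" "C" "D"
```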

However, even though the log file shrinks after an AOF rewrite, writing out the operation log for the latest data of the entire database is still a very time-consuming process. That brings us to the next question: will rewriting block the main thread?

Will AOF rewriting block?

Unlike the AOF log, which is written back by the main thread, the rewrite is performed by a background child process, bgrewriteaof, precisely to avoid blocking the main thread and degrading database performance.
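A rewrite can also be requested manually and observed from the client side; a minimal sketch, assuming redis-py, a local server, and AOF enabled:

```python
import redis  # redis-py client; a local Redis server with AOF enabled is assumed

r = redis.Redis(decode_responses=True)
r.bgrewriteaof()  # ask Redis to start a background AOF rewrite

info = r.info("persistence")
print(info["aof_rewrite_in_progress"])    # 1 while the bgrewriteaof child runs
print(info["aof_last_bgrewrite_status"])  # "ok" after the last rewrite finished
```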

I summarize the process of rewriting as “one copy, two logs”.

"One copy" means that every time a rewrite is performed, the main thread forks the background bgrewriteaof child process. The fork gives the child a copy of the main thread's memory, which contains the database's latest data. The bgrewriteaof child can then turn the copied data into operations and write them, one by one, into the rewrite log without affecting the main thread.

What is “two logs”?

Because the main thread is not blocked, it keeps processing incoming operations. If a write arrives at this point, the first log is the AOF log currently in use: Redis writes the operation to its buffer, so that even if there is an outage, the operations in the AOF log are still complete and ready for recovery.

The second log is the new AOF rewrite log: the same operation is also written to the rewrite log's buffer, so the rewrite log does not miss the latest operations. Once all the operation records for the copied data have been written out, the latest operations recorded in the rewrite buffer are also written to the new AOF file, so that it captures the database's latest state. At that point, the new AOF file can replace the old one.

In summary, every AOF rewrite starts with a memory copy for the rewrite; then two logs ensure that data written during the rewrite is not lost. And because Redis does the rewriting in a child process rather than the main thread, the rewrite itself does not block the main thread.

The AOF rewrite is done by the bgrewriteaof child process, without involving the main thread. When we say "non-blocking" here, we mean that the child does not block the main thread. But do you see any other potential blocking risks in the rewrite process? If so, where does it block?

  1. Fork itself can block (note that the data is not copied to the child all at once). The operating system provides copy-on-write for fork precisely to avoid the long blocking that copying all of the memory at once would cause, but fork still has to copy the process's necessary data structures, one of which is the memory page table (the index mapping virtual memory to physical memory). Copying the page table consumes a lot of CPU, and the main thread is blocked until the copy finishes. The blocking time depends on the instance's memory size: the larger the instance, the larger the page table and the longer fork blocks. After the page table is copied, the child and the parent point to the same memory address space; the child exists, but has not allocated memory equal to the parent's. So when do parent and child really separate their memory? Copy-on-write, as the name implies, copies the actual memory data only when a write happens, and during that process the parent may block, as described next.
  2. After the fork, the child shares the same memory address space as the parent and can perform the AOF rewrite, writing all of the in-memory data to the AOF file. But the parent is still receiving write traffic. If the parent writes to an existing key, it first copies that key's actual memory data into newly allocated memory and modifies the copy; little by little, the parent's and child's data diverge until each has its own independent memory. Memory is allocated in 4 KB pages by default, so if the parent operates on a bigkey, reallocating the larger chunk of memory takes longer and may cause blocking. Moreover, if Huge Pages (2 MB pages) are enabled in the operating system, the probability that the parent blocks while allocating memory rises sharply, so Huge Pages must be disabled on machines running Redis. After every fork for an RDB snapshot or AOF rewrite, the Redis log reports how much memory the parent process has reallocated; you can also query it as in the sketch after this list.
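Both effects discussed above can be observed through the INFO command; a minimal sketch, assuming redis-py and a local server (field names may vary slightly by Redis version):

```python
import redis  # redis-py client; a local Redis server is assumed

r = redis.Redis(decode_responses=True)

# How long the last fork took, in microseconds; grows with the instance's page table.
print(r.info("stats").get("latest_fork_usec"))

# How much memory the parent copied via copy-on-write during the last rewrite/snapshot.
persistence = r.info("persistence")
print(persistence.get("aof_last_cow_size"), persistence.get("rdb_last_cow_size"))
```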

The AOF rewrite uses its own rewrite log. Why doesn't it share the AOF log already in use?

AOF rewriting does not reuse the existing AOF log for two reasons. First, parent and child writing the same file would contend with each other, and controlling that contention would hurt the parent's performance. Second, if the rewrite failed, the original AOF file would effectively be polluted and could no longer be used for recovery. So Redis rewrites into a new file each time: if the rewrite fails, that file is simply deleted and the original AOF file is untouched; when the rewrite completes, the new file just replaces the old one.
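In everyday operation the rewrite is usually triggered automatically once the file grows past configurable thresholds; a hedged sketch of setting them (the values shown are the common defaults), assuming redis-py and a local server:

```python
import redis  # redis-py client; a local Redis server is assumed

r = redis.Redis(decode_responses=True)
# Rewrite when the AOF has grown 100% beyond its size after the last rewrite,
# but only once the file is at least 64 MB.
r.config_set("auto-aof-rewrite-percentage", 100)
r.config_set("auto-aof-rewrite-min-size", 64 * 1024 * 1024)
```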

2. RDB memory snapshot

A memory snapshot is a record of the state of the data in memory at a certain moment. It's like a photo: when you take a picture of a friend, a single shot captures exactly what that friend looked like at that instant.

For Redis, it achieves a photo-like effect by writing the status of a given moment to disk as a file, known as a snapshot. In this way, even if there is a downtime, snapshot files will not be lost, and data reliability is guaranteed. This snapshot file is called an RDB file, where RDB stands for Redis DataBase.

Compared with AOF, RDB records data at a point in time rather than operations, so for data recovery the RDB file can be read straight into memory and recovery completes quickly. That sounds great, but memory snapshots are not as easy to get right as they look. Why?

There are two key issues to consider:

  • What data does the snapshot take? This is related to the efficiency of snapshot execution.
  • Can data be added, deleted, or changed when taking snapshots? This is related to whether Redis is blocked and can handle requests properly at the same time.

What data does the snapshot take?

Redis data is stored in memory. To ensure the reliability of all the data, Redis performs a full snapshot, that is, it records all of the data in memory to disk. The advantage is that everything is recorded in one go, with nothing left out.

Taking a snapshot of all the data in memory and writing it all to disk takes time. And the more data there is, the larger the RDB file and the longer it takes to write it to disk.

For Redis, the single-threaded model means we must avoid anything that blocks the main thread, so for every operation we ask the soul-searching question: "Does it block the main thread?" Whether generating the RDB file blocks the main thread determines whether Redis performance drops.

Redis provides two commands for generating an RDB file: save and bgsave.

  • save: executed in the main thread, so it blocks;
  • bgsave: forks a child process dedicated to writing the RDB file, avoiding blocking of the main thread; this is the default way Redis generates RDB files.

So at this point we can take a full snapshot with the bgsave command, which gives us data reliability while avoiding a performance hit on Redis.
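A minimal sketch of triggering such a snapshot from a client and checking on it, assuming redis-py and a local server:

```python
import redis  # redis-py client; a local Redis server is assumed

r = redis.Redis(decode_responses=True)
r.bgsave()  # fork a child process that writes the RDB file in the background

info = r.info("persistence")
print(info["rdb_bgsave_in_progress"])  # 1 while the snapshot child is running
print(r.lastsave())                    # time of the last successful save
```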

Can data be modified during snapshot?

Let me give you an example. We take a snapshot of memory at time t; suppose the data set is 4GB and the disk write bandwidth is 0.2GB/s, so the snapshot needs at least 20 seconds (4 ÷ 0.2 = 20). If a piece of data A that has not yet been written to disk is modified to A′ at time t+5s, the snapshot's integrity is broken, because A′ is not the state at time t. So, as with taking a photo, we don't want the data to "move", that is, be modified, while the snapshot is being taken.

There are potential problems if the data cannot be modified during snapshot execution. In the example above, if the 4GB of data cannot be modified within 20 seconds of making the snapshot, Redis cannot handle the writes to the data, which will have a significant impact on the business services.

Pausing writes just to take a snapshot is clearly unacceptable. So Redis uses the copy-on-write (COW) technique provided by the operating system to keep handling write operations while the snapshot runs.

In simple terms, the bgsave child process is forked by the main thread and shares all of the main thread's memory data. Once it runs, it starts reading the main thread's memory and writing it to the RDB file.

At this point, if the main thread only reads this data (for example, key-value pair A in the figure), the main thread and the bgsave child do not affect each other. But if the main thread modifies some data (say key-value pair C in the figure), that data is duplicated into a copy (key-value pair C′), and the main thread makes its change on the copy, while the bgsave child keeps writing the original data (key-value pair C) to the RDB file.

This guarantees the snapshot's integrity while allowing the main thread to modify data at the same time, so normal service is not affected. In other words, Redis uses bgsave to snapshot all the data currently in memory; the work is done in the background by the child process, and the main thread can keep modifying data meanwhile.
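For context, automatic bgsave snapshots are scheduled by the save configuration item ("<seconds> <changes>" pairs); a hedged sketch with illustrative values, assuming redis-py and a local server:

```python
import redis  # redis-py client; a local Redis server is assumed

r = redis.Redis(decode_responses=True)
# Illustrative schedule: snapshot if >=1 change in 900s, or >=100 changes in 300s.
r.config_set("save", "900 1 300 100")
print(r.config_get("save"))  # -> {'save': '900 1 300 100'}
```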

If full snapshots are performed frequently, there are also two aspects of overhead.

  • On the one hand, frequently writing the full data set to disk puts a lot of pressure on the disk; successive snapshots compete for the limited disk bandwidth, which easily turns into a vicious circle.
  • On the other hand, the bgsave child has to be forked from the main thread. The child does not block the main thread after it is created, but the fork itself does, and the more memory the main thread uses, the longer the blocking lasts. Forking bgsave children frequently would therefore block the main thread frequently (which is why Redis will not start a second bgsave while one is already running). So, is there a better idea?

In that case, we can take incremental snapshots: after one full snapshot, subsequent snapshots record only the data that has changed, avoiding the overhead of a full snapshot every time.

After the first full snapshot, at time T1 and time T2 we only need to write the modified data into the snapshot file. But to do that, we have to remember which data was modified. Don't underestimate this "remembering": it requires extra metadata to track which data has changed, which brings additional space overhead. As shown below:

If we keep a record for every modified key-value pair, then 10,000 modified key-value pairs need 10,000 extra records. And sometimes the key-value pair itself is tiny, say 32 bytes, while the metadata recording that it was modified might need 8 bytes; just to "remember" the changes we incur a large extra space overhead, a poor trade for Redis, whose memory is precious.

Compared with AOF, snapshot recovery is faster, but the snapshot frequency is hard to get right: if it is too low, a crash between two snapshots can lose a lot of data; if it is too high, the extra overhead kicks in. Is there some other way to get RDB's fast recovery, lose as little data as possible, and keep the overhead small?

Redis 4.0 proposes a hybrid approach using AOF logging and memory snapshots.

In simple terms, memory snapshots are taken at a certain frequency, and the AOF log records only the command operations between two snapshots. Snapshots no longer need to run frequently, which avoids the cost of frequent forks on the main thread; and since the AOF log only covers the operations between snapshots, it no longer records everything, so the file doesn't grow huge and the overhead of rewriting is avoided.

As shown in the following figure, changes made at time T1 and time T2 are recorded in AOF logs. After the second full snapshot is taken, the AOF logs can be cleared because the changes made at this time have been recorded in the snapshot and will not be used in recovery.

This method enjoys the fast recovery that RDB files provide, and also the simplicity of AOF recording only the operation commands, giving you something of the best of both worlds. I recommend using it in practice.
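If you want to try hybrid persistence, Redis 4.0 and later expose it through the aof-use-rdb-preamble configuration item; a minimal sketch, assuming redis-py and a local Redis 4.0+ server:

```python
import redis  # redis-py client; a local Redis 4.0+ server is assumed

r = redis.Redis(decode_responses=True)
r.config_set("appendonly", "yes")            # hybrid mode still needs AOF enabled
r.config_set("aof-use-rdb-preamble", "yes")  # AOF rewrites start with an RDB preamble
```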

Scenario: we run Redis on a cloud host with a 2-core CPU, 4GB of memory, and a 500GB disk. The Redis database holds about 2GB of data, and we use RDB for persistence. The workload is mostly modifications, with a write/read ratio of about 8:2, that is, out of 100 requests, 80 are modifications. Do you see any risk in using RDB for persistence in this scenario?

**Memory resource risk:** Redis forks a child process to perform the RDB persistence. During persistence, copy-on-write duplicates every memory page the parent writes to; with roughly 80% of requests being modifications, on the order of 80% of the data set, about 1.6GB, may need to be copied and newly allocated, pushing memory usage close to the 4GB limit and soon running the machine out of memory. If swap is enabled on the machine, part of Redis's data is pushed out to disk, and when Redis later accesses that data, performance drops sharply, far from the high-performance standard expected of Redis. If swap is not enabled, OOM is triggered and the parent or child process may be killed by the system.

**CPU resource risk:** Although the RDB persistence is done by the child process, generating the RDB snapshot consumes a lot of CPU. And although Redis handles requests on a single thread, Redis Server also runs other background threads, such as the per-second AOF flush and asynchronous closing of file descriptors. Since the machine has only 2 CPU cores, the parent process already occupies more than half of the CPU resources, and a child process doing RDB persistence on top of that can cause CPU contention: the parent's request latency rises, the child takes longer to generate the RDB snapshot, and the whole Redis Server's performance deteriorates. Furthermore, if the Redis process is bound to a CPU, the child inherits the parent's CPU affinity and inevitably competes with the parent for the same core, again hurting the whole Redis Server. So if Redis needs scheduled RDB snapshots and AOF rewrites, the process must not be bound to a CPU.