How can I avoid data loss after a system outage?

Once the server goes down, all the data held in memory is lost. Redis therefore needs a mechanism for data recovery, and logging is a good way to provide one.


Classification of logs

  1. Write-ahead logging: the data about to be modified is recorded in a log file before the write is applied, so it can be used for recovery in case of a fault.
  2. Write-after logging (what Redis calls AOF logging, for Append Only File): Redis executes the command and writes the data to memory first, and only then records it in the log.

What does the AOF record?

Whereas a traditional database log records the modified data, the AOF log records every write command Redis receives, saved as text.

For example, the command set testkey testValue is stored in the AOF file as *3 $3 set $7 testkey $9 testValue. Here *3 indicates that the command has three parts, and $3 indicates that the next part is 3 bytes long, followed by the part itself (set); likewise $7 precedes the 7-byte testkey and $9 the 9-byte testValue.
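The framing described above can be sketched as a small encoder. This is a minimal illustration of the format, not Redis's actual implementation; the function name is mine:

```python
def encode_resp(*parts: str) -> bytes:
    """Encode a command in the length-prefixed text format the AOF uses.

    '*N' gives the number of parts; each part is preceded by '$len',
    its byte length, and every token is terminated by CRLF.
    """
    out = [f"*{len(parts)}\r\n".encode()]
    for p in parts:
        data = p.encode()
        out.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(out)

aof_entry = encode_resp("set", "testkey", "testValue")
# b'*3\r\n$3\r\nset\r\n$7\r\ntestkey\r\n$9\r\ntestValue\r\n'
```

Because the log is plain text in this shape, replaying it is just a matter of reading tokens back and re-executing each command.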


Why use post-write logging?

First, it avoids extra checking overhead. Redis does not check the syntax of commands when writing them to the AOF, so if it logged first and executed afterwards, the log could record invalid commands, which would cause errors when replaying the AOF for data recovery. By executing the command first, Redis ensures that only commands that executed successfully are recorded in the log; otherwise it simply reports the error to the client.

Second, because the log is recorded after the command has been executed, logging does not block the current write operation.
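The execute-first, log-second order can be sketched with a toy store. This is my own illustration of the idea, not Redis code; the class and method names are hypothetical:

```python
class MiniStore:
    """Toy write-after (AOF-style) store: execute first, log only on success."""

    def __init__(self):
        self.data = {}   # in-memory dataset
        self.log = []    # append-only command log

    def execute(self, cmd, *args):
        if cmd != "set" or len(args) != 2:
            # Invalid command: report the error to the client,
            # write nothing to the log.
            raise ValueError(f"bad command: {cmd}")
        key, value = args
        self.data[key] = value              # 1. apply the write in memory
        self.log.append((cmd, key, value))  # 2. only then record it
```

Because logging happens after execution, a malformed command never reaches the log, so replaying the log during recovery cannot hit it.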


Potential risks

First, the risk of data loss: if Redis crashes after a command has executed but before it is logged, that command and its data change are lost.

Second, there is the risk of blocking the next write operation. Write-after logging avoids blocking the current operation, but the AOF record is still appended in the main thread, so if the disk write is slow, subsequent write operations are blocked.

How to avoid risk?

On closer analysis, both risks come down to when the AOF log is written back to disk. If we can control when the AOF log is flushed to disk after a write command executes, these risks can be reduced.

The AOF mechanism provides three options for this, controlled by the appendfsync configuration item.

Three write back strategies
  • Always: synchronous write-back. High reliability, high performance impact: after each write command executes, Redis synchronously writes the log back to disk.
  • Everysec: write back every second. Moderate performance, with the risk of losing up to one second of data: after each write command executes, the log is first written to the in-memory buffer of the AOF file, and the buffer's contents are flushed to disk once per second.
  • No: the operating system controls write-back. Best performance, but the most data may be lost on an outage: after each write command executes, the log is only written to the in-memory buffer of the AOF file, and the operating system decides when to flush the buffer to disk.
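In redis.conf, the strategy is selected with the appendfsync directive. A minimal fragment (directive names and values as in the Redis configuration file):

```
appendonly yes        # enable AOF persistence
appendfsync everysec  # one of: always | everysec | no
```

Everysec is the usual middle ground: it caps the worst-case loss at roughly one second of writes without paying for an fsync on every command.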

None of the three write-back strategies achieves both high reliability and high performance. Meanwhile, the AOF logs every write command it receives to a single file, so as commands accumulate the AOF file grows larger and larger, and we must watch out for the performance problems a large AOF file causes.

Performance issues here are mainly reflected in three aspects:

  1. The file system itself limits file size and cannot store arbitrarily large files.
  2. Appending command records to a file that is already very large is inefficient.
  3. After an outage, the commands recorded in the AOF are re-executed one by one for recovery; if the log file is too large, the recovery process becomes very slow and affects the normal use of Redis.

This brings up another problem: what do we do when the file gets too big?


What if the log file is too large? AOF rewrite mechanism

AOF rewriting means Redis creates a new AOF file based on the current state of the database: it reads all the key/value pairs in the database and records the write of each key/value pair with a single command.

Why does rewriting make the log file smaller?

The rewrite mechanism has a "many-to-one" property: multiple commands in the old log file are replaced by a single command in the rewritten log. Because the AOF appends write commands one by one as they are received, a key/value pair that is modified repeatedly accumulates many commands in the file. The rewrite mechanism instead generates one write command from the latest state of each key/value pair, so a single command in the rewritten log stands in for the entire write history of that key in the original log. During data recovery, executing that one command restores the state that previously required many operations. For keys that are written many times, this saves substantial log space.
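The many-to-one collapse can be demonstrated with a toy rewrite that replays the old log and emits one command per surviving key. This is an illustration of the principle, not Redis's actual rewrite code:

```python
def rewrite(aof_commands):
    """Toy AOF rewrite: replay the log into a state dict, then emit
    one 'set' per key reflecting only its latest value."""
    state = {}
    for cmd, key, value in aof_commands:
        if cmd == "set":
            state[key] = value
    # One command per key replaces that key's whole write history.
    return [("set", k, v) for k, v in state.items()]

old_log = [
    ("set", "testkey", "v1"),
    ("set", "testkey", "v2"),
    ("set", "testkey", "v3"),  # three writes to one key...
]
compacted = rewrite(old_log)   # ...collapse into a single command
```

Replaying either log produces the same final state, but the rewritten one is as small as the dataset itself rather than as large as its history.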

But this raises the question, does the AOF rewriting mechanism block threads and affect Redis performance?

Will AOF rewriting block?

Of course not.

Writing back the AOF log is handled by Redis's main thread, while rewriting is handled by a background child process spawned by bgrewriteaof, which avoids blocking the main thread and hurting Redis performance.

Rewrite the process

It can be summed up as "one copy, two logs".

One copy means that each time a rewrite is performed, the main thread forks a bgrewriteaof child process. The fork gives the child a copy of the memory containing Redis's latest data, so the child can rewrite that data into a new AOF log file without affecting the main thread.

The two logs are the AOF log the main thread is currently using and the new AOF rewrite log. Because the rewrite does not block the main thread, Redis can still handle new operations. When a write arrives, Redis appends it to the buffer of the current AOF log, and also to the buffer of the rewrite log. This way, even if an outage really occurs during the rewrite, the current AOF log is complete and can be used to recover data; and since every new write also lands in the rewrite log's buffer, the rewrite log loses none of the new operations either.

Once the child process has finished turning the copied data into the new AOF file, the latest operations in the rewrite log's buffer are appended to it. The new AOF file then replaces the one the main thread was using, and from that point on the main thread appends directly to the new log.
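The "one copy, two logs" flow can be modeled with a small simulation. This is a schematic of the bookkeeping only (no real fork or disk I/O); the class and attribute names are mine:

```python
class RewriteSim:
    """Toy model of an AOF rewrite: one memory copy, two logs."""

    def __init__(self, data):
        self.data = dict(data)    # live dataset in the main thread
        self.aof = []             # current AOF log, still appended to
        self.snapshot = None      # child's copy of memory (the "fork")
        self.rewrite_buf = None   # buffer of writes arriving mid-rewrite

    def start_rewrite(self):
        self.snapshot = dict(self.data)  # child gets a copy of the data
        self.rewrite_buf = []

    def set(self, key, value):
        self.data[key] = value
        self.aof.append(("set", key, value))       # log 1: current AOF
        if self.rewrite_buf is not None:
            self.rewrite_buf.append(("set", key, value))  # log 2: rewrite

    def finish_rewrite(self):
        # Child emits one command per key in its snapshot, then the
        # buffered mid-rewrite writes are appended; the result replaces
        # the old AOF log.
        new_aof = [("set", k, v) for k, v in self.snapshot.items()]
        new_aof += self.rewrite_buf
        self.aof, self.snapshot, self.rewrite_buf = new_aof, None, None
```

A write issued between start_rewrite and finish_rewrite ends up in both logs, which is exactly why neither an outage during the rewrite nor the swap to the new file loses it.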

The specific process is shown in the figure below: