preface

I recently ran into a disk-related production incident, and this post summarizes what I learned about Linux disk caching along the way.

In general, disk caching exists for two reasons. First, disk access is much slower than memory access, so caching disk contents in memory narrows that gap. Second, by the principle of program locality, data that has been accessed once is likely to be accessed again soon, so keeping it in memory speeds up the program.

Locality principle

Principle of program locality: program execution exhibits locality, meaning that over any given period of time, execution stays within a certain part of the program, and the memory it accesses is confined to a certain region. Locality usually comes in two forms: temporal locality and spatial locality.

Temporal locality: a memory location that has been referenced once is likely to be referenced again in the near future.

Spatial locality: if a memory location is referenced, locations near it are likely to be referenced in the near future as well.
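As a concrete illustration, here is a minimal C sketch (a hypothetical demo, not from the original incident) that sums the same 2-D array twice. The row-major pass walks memory sequentially and benefits from spatial locality; the column-major pass strides across rows, so it typically runs noticeably slower:

```c
#include <stdio.h>
#include <time.h>

#define N 4096

static int a[N][N];  /* 64 MiB of zeros is enough for the demo */

int main(void) {
    long sum = 0;
    clock_t t;

    /* Row-major: consecutive elements, good spatial locality. */
    t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-major:    %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Column-major: strides of N elements, poor spatial locality. */
    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-major: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)(sum & 1);  /* keep the compiler from dropping the loops */
}
```

The same principle is what makes it pay off to cache recently accessed disk contents in memory.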

Page caching

In Linux, to reduce I/O operations on disks, the contents of files read from disk are cached in physical memory. Access to the disk is thereby converted into access to memory, which greatly improves program speed. The physical memory Linux uses to cache disk contents is called the page cache.

The page cache is made up of physical pages in memory whose contents correspond to physical blocks on disk. The size of the page cache adjusts dynamically to the amount of free memory on the system: it can expand to consume free memory and shrink to relieve memory pressure.
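You can observe the current page cache size through the Cached field of /proc/meminfo (roughly the page cache excluding the swap cache). A minimal sketch:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Cached:", 7) == 0) {  /* e.g. "Cached: 1234567 kB" */
            fputs(line, stdout);
            break;
        }
    }
    fclose(f);
    return 0;
}
```

free(1) reports a closely related number in its buff/cache column.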

Before the advent of virtual memory, operating systems used a block-based buffer cache. With virtual memory, operating systems could manage I/O at page granularity, and so adopted the page cache, a page-based, file-oriented caching mechanism.

Reading from the page cache

When reading a file, Linux looks in the page cache first. If the content is not there, the kernel reads it from disk into the page cache, then serves the read from the page cache. The general flow is as follows:

  1. The process calls the read library function to initiate a file read request
  2. The kernel checks the process's table of open files and invokes the read interface provided by the file system
  3. The kernel finds the file's inode and computes which pages need to be read
  4. The kernel looks up the corresponding pages in the page cache through the inode: 1) if the pages are present (a hit), the file contents are returned directly; 2) if they are missing (a miss), the kernel allocates new pages in the page cache, reads the file contents from disk into them, and repeats step 4
  5. The file contents are returned to the process

Therefore, all reads of file contents, whether or not they hit the page cache at first, are ultimately served from the page cache.
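You can observe this from user space. The sketch below (a hypothetical demo; pass a large file that is not yet cached as the argument) reads the same file twice and times each pass; the first pass may have to go to disk, while the second is served from the page cache and is typically much faster:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Read the whole file and return the elapsed wall-clock seconds. */
static double read_all(const char *path) {
    char buf[1 << 16];
    struct timespec t0, t1;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        exit(1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;  /* discard the data; we only care about timing */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    printf("first read:  %.3fs (may go to disk)\n", read_all(argv[1]));
    printf("second read: %.3fs (page cache)\n", read_all(argv[1]));
    return 0;
}
```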

Writing to the page cache

Because of the page cache, when a process calls write, the change is simply written to the file's page cache, the affected pages are marked dirty, and the call returns. The Linux kernel periodically writes dirty pages back to disk and then clears their dirty flags.

Because write operations only touch the page cache, the process does not block on disk I/O; but if the machine crashes at that point, the written changes may never reach the disk. Therefore, for write operations with strict durability requirements, such as database systems, you need to explicitly call fsync (or a similar operation) to force changes to disk in time. Read operations, by contrast, normally block until the data is available. To reduce this latency, Linux uses a technique called readahead: on a read, the kernel also reads additional adjacent pages from disk into the page cache.
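A minimal sketch of that write-then-sync pattern (the file name journal.log is just an illustration): write returns as soon as the page cache has been updated, while fsync blocks until the dirty pages and the file's metadata have reached the disk:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const char msg[] = "important record\n";
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Returns as soon as the page cache is updated; the pages are now dirty. */
    if (write(fd, msg, sizeof msg - 1) < 0) {
        perror("write");
        return 1;
    }
    /* Blocks until the dirty pages and metadata reach the disk. */
    if (fsync(fd) < 0) {
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}
```

fdatasync is a cheaper variant that skips metadata, such as the modification time, when it is not needed for correctness.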

Writeback threads

Writing the page cache back to disk is done by dedicated kernel threads, and writeback is triggered in three situations:

  1. Free memory falls below a threshold. When free memory is insufficient, part of the page cache must be released, and since only clean (non-dirty) pages can be freed, dirty pages first have to be written back to disk so that they become clean and reclaimable.
  2. Dirty pages stay in memory longer than a threshold. This ensures that dirty pages do not remain in memory indefinitely, which reduces the risk of data loss (the thresholds for this and the previous trigger are tunable; see the sketch after this list).
  3. A user process calls the sync or fsync system call. This gives processes with strict durability requirements a way to force writeback.
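Both thresholds live under /proc/sys/vm. The sketch below simply prints the usual knobs: dirty_background_ratio (the percentage of memory that may be dirty before background writeback starts), dirty_ratio (the hard limit at which writing processes are throttled), dirty_expire_centisecs (how long a page may stay dirty before it must be written back), and dirty_writeback_centisecs (how often the writeback threads wake up):

```c
#include <stdio.h>

static void show(const char *path) {
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f && fgets(buf, sizeof buf, f))
        printf("%-40s %s", path, buf);  /* the file content ends in '\n' */
    if (f)
        fclose(f);
}

int main(void) {
    show("/proc/sys/vm/dirty_background_ratio");
    show("/proc/sys/vm/dirty_ratio");
    show("/proc/sys/vm/dirty_expire_centisecs");
    show("/proc/sys/vm/dirty_writeback_centisecs");
    return 0;
}
```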

Implementation of writeback threads

Over time the kernel has used three different designs:

bdflush and kupdated (before 2.6): bdflush runs in the background and is woken when free memory falls below a threshold, while kupdated runs periodically to write dirty pages back. There is only one bdflush thread in the whole system, so under heavy writeback it can end up blocked on the I/O of a single disk, and writeback for the other disks is not performed in time.

pdflush (introduced in 2.6): the number of pdflush threads is dynamic, depending on the I/O load of the system, and the threads handle writeback for all disks globally. But precisely because pdflush is global, several pdflush threads can all block on one congested disk, again delaying writeback for the other disks.

Flusher threads (since 2.6.32): there are multiple flusher threads, and instead of serving all disks, each flusher thread is dedicated to a single disk, so congestion on one disk no longer holds up writeback for the others.

Page cache reclaim

The replacement logic for the page cache in Linux is a modified LRU, also known as the two-list policy. Instead of maintaining one LRU list, Linux maintains two: an active list and an inactive list. Pages on the active list are considered "hot" and are not reclaimed; only pages on the inactive list can be reclaimed. A page enters the inactive list on first access and is promoted to the active list only when it is accessed again while on the inactive list. Both lists behave like pseudo-LRU queues: pages are added at the tail and removed from the head. The two lists are kept in balance: if the active list grows too large relative to the inactive list, pages are moved from the head of the active list back to the inactive list, where they become reclaimable again. The two-list strategy fixes the weakness of classic LRU, where pages accessed only once can evict frequently used pages, and it also makes pseudo-LRU semantics easier to implement. This two-list scheme is also called LRU/2; the general form with n lists is called LRU/n.
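To make the mechanics concrete, here is a self-contained toy simulation in C (my own sketch with tiny fixed-capacity lists and integer page ids, not the kernel's actual data structures). A first access puts a page on the inactive list, a second access promotes it to the active list, eviction always takes the head of the inactive list, and the lists are rebalanced when the active list outgrows the inactive one:

```c
#include <stdio.h>
#include <string.h>

#define CAP 4  /* pages per list, kept tiny for the demo */

typedef struct {
    int page[CAP];
    int n;
} list_t;

static list_t active, inactive;

static void push_tail(list_t *l, int p) { l->page[l->n++] = p; }

static int pop_head(list_t *l) {
    int p = l->page[0];
    memmove(l->page, l->page + 1, --l->n * sizeof(int));
    return p;
}

/* Remove page p from the list if present; returns 1 on success. */
static int remove_page(list_t *l, int p) {
    for (int i = 0; i < l->n; i++)
        if (l->page[i] == p) {
            memmove(l->page + i, l->page + i + 1, (--l->n - i) * sizeof(int));
            return 1;
        }
    return 0;
}

static void touch(int p) {
    if (remove_page(&active, p) || remove_page(&inactive, p)) {
        push_tail(&active, p);  /* accessed again: promote (or refresh) */
    } else {
        if (inactive.n == CAP)  /* reclaim from the inactive head only */
            printf("evict page %d\n", pop_head(&inactive));
        push_tail(&inactive, p);  /* first access: enter the inactive list */
    }
    /* Rebalance: demote the active head when active outgrows inactive. */
    if (active.n > inactive.n && active.n > 1)
        push_tail(&inactive, pop_head(&active));
}

int main(void) {
    int trace[] = {1, 2, 3, 1, 2, 4, 5, 6, 7, 1};

    for (size_t i = 0; i < sizeof trace / sizeof *trace; i++)
        touch(trace[i]);

    printf("active:");
    for (int i = 0; i < active.n; i++)
        printf(" %d", active.page[i]);
    printf("\ninactive:");
    for (int i = 0; i < inactive.n; i++)
        printf(" %d", inactive.page[i]);
    printf("\n");
    return 0;
}
```

Running the trace shows that pages touched only once flow through the inactive list and get evicted, while the twice-touched page survives on the active list.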

conclusion

In this case, the root cause of the production failure was that the business logic used temporary files as a cache. When a temporary file is deleted shortly after it is created, all operations on it happen in the page cache and it is never actually written back to disk. But when the program became slow to respond, the temporary files lived long enough to be written back to disk, which put heavy pressure on the disk and affected the whole system.