[toc]

Original writing is not easy; you are welcome to follow the public account "queer cloud storage" for more quality content.

Is the written data safe?

Consider a question: how far does data have to go before a write can be considered safe?

If a user sends an IO request and you reply "write successful", the data must still be readable after a power failure, a restart, and so on.

So, setting aside silent data corruption, what is the essential requirement for data safety?

Key point: only after the data is on a non-volatile storage medium may you reply "write successful" to the user. Please remember this sentence; storage developers spend 80% of their time thinking about it.

So what are the common volatile and non-volatile media?

  • Volatile media: registers, memory, etc.
  • Non-volatile media: disks, solid-state drives, etc.

Take a look at the simplified classic pyramid:

From top to bottom, speed decreases, capacity increases, and price decreases.

Linux IO in brief

We covered earlier how a file is read and written, whether through the standard library or directly through system calls. Either way, file-based IO passes down through the layers of the file system stack: system call -> VFS -> file system -> block device layer -> hardware driver.

We open a file and write data into it. Now consider the question: when write returns successfully, has the data actually reached the disk?

The answer is: not necessarily.

Because of the file system cache, the default mode is write-back: data counts as written once it reaches memory, and the kernel flushes it to disk asynchronously as circumstances dictate (for example, periodically, or when the amount of dirty data crosses a threshold).

The benefit is write performance: writes appear very fast (which is not the real disk performance, only the speed of writing to memory). The drawback is the risk to data. When the user receives "success", the data may still be only in memory; if the machine loses power at that moment, the data is lost, because memory is a volatile medium. Losing data is the most unacceptable thing for a storage system; data is its lifeblood.


How to ensure the reliability of the data?

Key point: it is still that same sentence. Only after the data has landed on disk should you return success to the user.

So how can we ensure this? There are three ways to do this.

  1. Open the file with the O_DIRECT flag, so that subsequent write/read calls bypass the file system cache and go straight to the disk;
  2. Open the file with the O_SYNC flag, so that every IO is synchronous, or actively call fsync after write to force the data to disk;
  3. Alternatively, read and write the file through mmap, which maps the file into the process's address space; reads and writes to that memory are in fact reads and writes of the file, and after writing you call msync to force the data to disk;

Three safe IO approaches

O_DIRECT mode

Direct IO mode ensures that every IO accesses the disk directly, rather than returning success to the user once the data is in memory, and that is what makes it safe: memory is volatile and its contents are lost on power failure, so data is only safe once it has been written to persistent media.


Data is also read directly from the disk rather than cached in memory, which saves memory for the rest of the system.

The downside is equally obvious: since every IO goes to the disk, performance looks poor (but understand that this is the real performance of the disk).

When O_DIRECT mode is used, the caller must follow the alignment rules, otherwise the IO will fail with an error:

  1. The IO size must be aligned to the sector size (512 bytes);
  2. The IO offset must be aligned to the sector size;
  3. The memory buffer address must also be sector-aligned;

Example in C:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

// Round a pointer up to the next multiple of the alignment a
#define align_ptr(p, a) \
    (char *)(((uintptr_t)(p) + ((uintptr_t)(a) - 1)) & ~((uintptr_t)(a) - 1))

int main(int argc, char **argv)
{
    char timestamp[8192] = {0};
    char *timestamp_buf = NULL;
    int timestamp_len = 0;
    ssize_t n = 0;
    int fd = -1;

    fd = open("./test_directio.txt", O_CREAT | O_RDWR | O_DIRECT, 0644);
    assert(fd >= 0);

    // Find a 512-byte aligned address inside the oversized buffer
    timestamp_buf = (char *)(align_ptr(timestamp, 512));
    timestamp_len = 512;

    // Offset (0), size (512), and buffer address are all sector aligned
    n = pwrite(fd, timestamp_buf, timestamp_len, 0);
    printf("ret (%ld) errno (%s)\n", n, strerror(errno));

    close(fd);
    return 0;
}

Compile command:

gcc -ggdb3 -O0 test.c -D_GNU_SOURCE

Build the binary, run it, and you will see that the write succeeds.

sh-4.4# ./a.out
ret (512) errno (Success)

If you want to see the error case, make the IO offset, the IO size, or the buffer address misaligned with 512 (for example, break the alignment of timestamp_buf and run it again), and you will get the following:

sh-4.4# ./a.out
ret (-1) errno (Invalid argument)
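One way to force that misalignment in the demo above (a hypothetical tweak, not in the original code) is to shift the aligned buffer pointer by a single byte before the pwrite:

```c
// Keep the IO size and offset aligned, but push the buffer address
// off the 512-byte boundary by one byte; pwrite() on the O_DIRECT fd
// then fails with EINVAL.
timestamp_buf = (char *)(align_ptr(timestamp, 512)) + 1;
```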

A question to consider: some readers may be curious. The IO size and offset are easy to make multiples of 512, but how do I make sure the address returned by malloc is aligned to 512?

Indeed, we cannot control the address that malloc returns. There are two solutions to this need:

Method 1: allocate a larger block of memory, then find an aligned address inside it, making sure the IO size does not run past the end of the block.

In the demo above, I allocated an 8192-byte block and found a 512-aligned address inside it. Writing 512 bytes from that address cannot run past the end of such a large block, so alignment is achieved safely.

This approach is simple and universal, but it is a waste of memory.

Method 2: use the POSIX interface posix_memalign to allocate the memory; this interface guarantees that the allocated address has the requested alignment.

For example, allocate a 1 KiB buffer whose address is aligned to 512 bytes:

void *buf = NULL;
int ret = posix_memalign(&buf, 512, 1024);
if (ret) {
    return -1;
}
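Putting the two pieces together, below is a minimal standalone sketch (my own illustration; the file name and sizes are made up) of posix_memalign feeding an O_DIRECT pwrite. Compile it like the earlier demo, with -D_GNU_SOURCE so that O_DIRECT is visible.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
    void *buf = NULL;

    // 1 KiB buffer whose address is guaranteed to be 512-byte aligned
    if (posix_memalign(&buf, 512, 1024) != 0)
        return -1;
    memset(buf, 'A', 1024);

    // hypothetical test file; O_DIRECT needs -D_GNU_SOURCE at compile time
    int fd = open("./test_memalign.txt", O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return -1;
    }

    // size (1024) and offset (0) are both multiples of 512, so this should succeed
    ssize_t n = pwrite(fd, buf, 1024, 0);
    printf("ret (%zd) errno (%s)\n", n, strerror(errno));

    close(fd);
    free(buf);
    return 0;
}
```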

Consider a question: what are the typical application scenarios for O_DIRECT IO?

  • The most common is a database system: a database has its own caching and IO optimizations, so it does not need the kernel to spend memory doing the same work again, which may even do more harm than good;
  • Scenarios where the block device is managed directly, without a file system formatted on it;

Standard IO + sync

The sync family of functions forcibly flushes the kernel buffer cache out to disk.

In Linux's buffered IO mechanism, there is a layer of volatile media between the user and the disk: the kernel-space buffer cache:

  • Data that is read is cached in memory to speed up subsequent reads;
  • Data that is written is acknowledged to the user as soon as it reaches memory, and flushed to disk asynchronously later, to speed up writes;

Read operations are described as follows:

  1. The operating system first checks whether the data is already in the kernel buffer cache; if so, it is returned directly from the cache;
  2. Otherwise, the data is read from disk and then placed in the operating system cache;

The write operations are described as follows:

  1. Once the data has been copied from user space into the kernel's memory cache, success is returned to the user and the write call is complete;
  2. When that in-memory data is actually written to disk is determined by operating system policy (if the machine loses power before then, the user's data is lost);
  3. So, to guarantee that data lands on disk, you must explicitly call sync (or a related call) to flush it to disk; only data that has reached the disk survives a power failure;

Key point: the sync mechanism ensures that all dirty data generated before the current point in time is flushed to disk. There are two ways to use it:

  1. Open the file with the O_SYNC flag;
  2. Explicitly call a system call such as fsync;

Method 1: Use the O_SYNC flag for open.

Example in C:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buffer[512] = {0};
    ssize_t n = 0;
    int fd = -1;

    // O_SYNC: every write returns only after the data has been flushed to disk
    fd = open("./test_sync.txt", O_CREAT | O_RDWR | O_SYNC, 0644);
    assert(fd >= 0);

    n = pwrite(fd, buffer, 512, 0);
    printf("ret (%ld) errno (%s)\n", n, strerror(errno));

    close(fd);
    return 0;
}

This ensures that every single IO is synchronous and must be flushed to disk before it returns. In practice it is rarely used, because performance is poor and it leaves no room for batching optimizations.


Method 2: call fsync separately.

Here you call fsync after write to force the data to disk. This method is used more often because it gives the application room to optimize, but it also demands more of the programmer: choosing when to call fsync is a constant tradeoff between safety and performance.

For example, you could write 10 times and then call fsync once at the end, which still guarantees the data reaches disk while batching the IO, as sketched below.
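Below is a minimal sketch of that batching idea (the file name, record size, and batch count are made up for illustration): issue several writes back to back and a single fsync at the end, so one flush covers the whole batch.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

// Sketch: write 10 records back to back, then flush once.
// The data is only guaranteed to be on disk after the final fsync() returns 0.
int write_batch(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    char record[512];
    memset(record, 'x', sizeof(record));

    for (int i = 0; i < 10; i++) {
        if (pwrite(fd, record, sizeof(record), (off_t)i * sizeof(record)) < 0) {
            close(fd);
            return -1;
        }
    }

    // one flush covers the whole batch: far cheaper than fsync after every write
    int ret = fsync(fd);
    close(fd);
    return ret;
}

int main(void)
{
    printf("batch write %s\n", write_batch("./test_fsync_batch.txt") == 0 ? "flushed" : "failed");
    return 0;
}
```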

There are several similar functions for this purpose, with some differences. Here are a few of them:

// Flush the file's data and metadata to disk
int fsync(int fildes);
// Flush only the file's data (plus the metadata needed to read it back)
int fdatasync(int fildes);
// Flush all dirty buffers in the system out to disk
void sync(void);


mmap + msync

This is a very interesting IO mode: the mmap function maps a region of a file on disk into an equally sized region of the process address space, and accesses to that memory are translated by the kernel into accesses to the corresponding location in the file. It looks like memory access, but in effect it is file IO.

void *
mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);

int
munmap(void *addr, size_t len);

// Force the mapped region to disk (MS_SYNC waits until the flush completes)
int
msync(void *addr, size_t len, int flags);

mmap can eliminate the data copy between user space and kernel space. When the amount of data is large, accessing files through memory mapping can be more efficient, because it reduces memory copies and allows IO to be aggregated so that data reaches the disk in batches, effectively reducing the number of IOs.

Of course, written data is still flushed to disk asynchronously; it does not land in real time. To guarantee it lands on disk, you must call msync.

Advantages of mmap:

  • Fewer system calls: only the single mmap call is needed, and all subsequent operations are plain memory accesses rather than write/read system calls;
  • Fewer data copies;

Example in C:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/stat.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>

int main(void)
{
    int ret = -1;
    int fd = -1;

    fd = open("test_mmap.txt", O_CREAT | O_RDWR, 0644);
    assert(fd >= 0);

    // The file must be large enough to back the mapping
    ret = ftruncate(fd, 512);
    assert(ret >= 0);

    char *const address = (char *)mmap(NULL, 512, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(address != MAP_FAILED);

    // This is where the magic happens (it looks like a memory copy, but it's actually file IO)
    strcpy(address, "hello, world");

    // The mapping stays valid even after the fd is closed
    ret = close(fd);
    assert(ret >= 0);

    // Force the dirty pages to disk
    ret = msync(address, 512, MS_SYNC);
    assert(ret >= 0);

    ret = munmap(address, 512);
    assert(ret >= 0);

    return 0;
}

Let’s compile and run it.

gcc -ggdb3 -O0 test_mmap.c -D_GNU_SOURCE

Afterwards there is a test_mmap.txt file containing "hello, world".


Hardware cache

The approaches above ensure that data makes it through the file system layer, but the disk hardware itself also has a cache, the hardware cache, and that layer is volatile as well. So, to truly guarantee that data lands on persistent media, the hard disk cache should also be turned off.

# Check the write cache status
hdparm -W /dev/sda
# Disable the drive's write cache, so data cannot be lost from it on power failure
hdparm -W 0 /dev/sda
# Enable the drive's write cache (data in it may be lost on power failure)
hdparm -W 1 /dev/sda

With the IO approaches above, once a write IO completes, the data can truly be said to be on disk, non-volatile and safe against power failure.


Conclusion

  1. Data must be written to a non-volatile storage medium before you can reply "write successful" to the user; anything else is cheating and walking a tightrope;
  2. This article summarized the three most fundamental safe IO methods: O_DIRECT writes, standard IO + sync, and mmap writes + msync. Either every write goes synchronously to disk, or you write first and then actively call sync to flush, so that data safety is guaranteed;
  3. O_DIRECT imposes strict requirements on the caller: the IO offset and length must be sector aligned, and the memory buffer address must be sector aligned as well;
  4. Note that the hard drive also has its own cache, which can be switched on and off with the hdparm command;

Afterword

Finally, you can rest assured that the data has made its way to disk. But do you really think the data is safe now? There is a lot more to it: what if the disk suffers a silent error, can the data still be salvaged? How do you make sure nothing goes wrong during network transmission? Or during a memory copy? I'll cover those later.
