Original: Taste of Little Sister (WeChat official account ID: XjjDog). Feel free to share; please retain the source.

I/O issues are generally not a concern for most developers, because most development work is "business" logic, that is, compute-bound code. The I/O problems such developers typically run into amount to logs growing a bit too large or disk writes getting a bit slow, so iowait is not something many people pay attention to.

Unfortunately, the job isn't just about messing around with the CPU; sooner or later the data has to land on disk. Once you get into high-performance disk storage, the I/O problem surfaces. Redis is as popular as it is precisely because it exists to sidestep disk performance problems.

In other words, if my disk were as fast as memory, why bother with memory at all?

Today, let's talk briefly about I/O. The more important topic, of course, is disk I/O, which is what persistence usually deals with. Unless otherwise specified, that is what we mean below.

What does I/O do?

What does I/O do? For us as users, in a nutshell, it comes down to two operations.

  1. Ask the operating system for data and have it loaded into a buffer.

  2. Fill a buffer in the user process and hand it to the operating system to flush to disk.

However, this is not direct reading and writing, because the operating system distinguishes kernel space from user space. To protect kernel memory, user processes cannot directly read the data that the kernel operates on. If you want the data, a copy has to be made.

This back and forth involves switching between user mode and kernel mode, also known as a system call, and the details are certainly more complicated than that.

In the case of reading data, we can divide the read process into the following stages.

  1. The Java process initiates a read request, which eventually invokes the lowest-level code to issue a read() system call.

  2. The operating system reads data from hardware such as the disk via DMA.

  3. DMA reads the data and stores it into a buffer in the kernel. This part of the operation requires no CPU involvement.

  4. The kernel then copies the contents of its own buffer into the Java process's buffer.

As you can see, because a read involves the operating system, the interaction becomes more complex. Typically, when a Java process reads data and the kernel finds the data already in its cache, it simply copies it over. When the data is not in the kernel cache, the Java process blocks there until the required data has been copied to user space.
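For contrast, below is a minimal sketch of this traditional read path in Java: the byte array buf is the user-space buffer that the kernel copies into. The file name test.dat is made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;

public class PlainReadDemo {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096]; // the user-space buffer
        try (FileInputStream in = new FileInputStream("test.dat")) {
            int n;
            // Each read() is a system call: the kernel fills its own
            // buffer via DMA, then copies the bytes into buf. If the
            // data is not yet in the page cache, the call blocks here.
            while ((n = in.read(buf)) != -1) {
                // ... process n bytes of buf
            }
        }
    }
}
```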

To summarize: the kernel holds memory that user processes cannot access directly, so the data must be copied into the user process.

Virtual addresses to the rescue

As the description above shows, the contents of a disk file must pass through the kernel before a user process can use them. Since this is inefficient, why not hand the files on disk directly to the user process?

It is not a question of whether it can be done, but of whether it should be done. Since user processes run on a particular operating system, they must follow that operating system's rules. On Linux, handing these complicated tasks over to the operating system is the safest and most convenient way to program.

But what if I just don't want to follow the rules, and want to squeeze out some more efficiency? What then?

There's no choice but to open up a green channel, a fast path of our own.

If the user process and the operating system kernel can read the same data and manipulate the same buffer, then the job is done.

It is dangerous to give user processes direct access to physical memory addresses owned by the kernel. Sharing this physical memory safely requires virtual memory, and that is the green channel.

Virtual memory is, by definition, relative to physical memory. If you disassemble a binary file, you will see that the addresses it references are fixed. A virtual memory region is a homogeneous region in a process's virtual address space, that is, a contiguous range of addresses with the same properties. The MMU is the hardware component responsible for translating virtual addresses into physical ones; this is the stuff of computer architecture and has become the default mode of operation in modern operating systems.

With the help of virtual memory, different virtual addresses can point to the same physical memory address, achieving memory sharing indirectly and avoiding the data copy between the kernel and the user process.

If we map a kernel-space virtual address and a user-space virtual address to the same physical memory, then both sides can manipulate that memory region simultaneously.

The typical application is mmap

mmap (memory-mapped files) is exactly such a special channel. It maps a file, or another object, into a process's virtual address space.

When we manipulate this segment of memory, we directly affect the file on the operating system at the other end. The operating system still does the actual reading and writing of the file, but we no longer need to call the read and write functions to copy data out of the operating system's memory, which clearly improves the efficiency of file reads and writes.

mmap also makes it possible for processes to share memory, communicate with each other, and cooperate with the kernel. When physical memory runs out, we can even use disk to simulate memory.

This is the magic of abstraction.

The mapping area of mmap must be an integer multiple of the physical page size (page_size); pages are the batch-processing unit the operating system adopts to improve efficiency (the minimum granularity of memory management is the page). Also, mmap cannot map an area larger than the file, so when the file size changes the region needs to be remapped.
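As a small illustration of the page-size constraint, here is a sketch of rounding a requested mapping length up to a page boundary. PAGE_SIZE = 4096 is an assumption; the real value is platform-specific.

```java
// Round a requested mapping length up to a multiple of the page size.
// PAGE_SIZE = 4096 is an assumption; query the platform for the real value.
static long roundUpToPageSize(long length) {
    final long PAGE_SIZE = 4096;
    return (length + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```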

Java has a class, MappedByteBuffer, that deals specifically with mmap. You can obtain an instance of it via FileChannel's map function.

```java
MappedByteBuffer mb = new RandomAccessFile("test", "rw")
        .getChannel()
        .map(FileChannel.MapMode.READ_WRITE, 0, 256);

// The signature of FileChannel's map method:
public abstract MappedByteBuffer map(MapMode mode, long position, long size);
```

As you can see, with the position and size parameters you can map part of a file's contents directly into the mb variable. If different processes create the same mapping, memory usage does not double, because they are all virtual addresses backed by the same physical pages.
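Putting it together, here is a minimal, self-contained sketch; the file name test.dat is made up for illustration.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("test.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map the first 256 bytes of the file read-write.
            MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 256);

            // Writing to the buffer writes through to the page cache;
            // the OS flushes it to the file, with no explicit write() call.
            mb.put(0, (byte) 'X');

            // Reading pulls straight from the mapped pages.
            System.out.println((char) mb.get(0)); // prints X
        }
    }
}
```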

If you have a very large 40GB file, but the operating system only has 2GB of memory, you can still read and modify the file quickly in this way.
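One caveat worth knowing: in Java, a single map() call is limited to Integer.MAX_VALUE bytes, so a 40GB file has to be mapped window by window. A sketch, with a hypothetical file name:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class HugeFileScan {
    // Map a huge file one window at a time; only the pages actually
    // touched are faulted into physical memory.
    static void scan(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            final long WINDOW = 1L << 30; // 1 GiB per mapping
            long size = ch.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                // ... read from mb
            }
        }
    }
}
```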

When using the top command, we often see the swap area, which uses disk files to simulate memory. Manipulating a 40GB file with only 2GB of memory would normally force swapping out: data in memory has to be written to disk. With mmap, no extra swap space is needed for this. When pages have to be evicted, the operating system writes them back to the original file directly, so the modifications in the mapping still take effect on the target file.

This process is very efficient, because apart from the page faults that trigger the underlying file reads and writes, no other buffer is involved.

How is it used?

If you take a closer look at code that uses mmap, you will find that the functions it provides are very, very limited, and there are all sorts of restrictions on what you can do with them.

So to use mmap for efficient operations, we usually need to pair it with index files.

mmap often shows up in databases and middleware, especially those that involve reading and writing large files.

In Kafka and RocketMQ, the commitlog has to be read by offset, and mmap is a great way to speed that up. Kafka's index files, for example, make heavy use of mmap; on the consume path, Kafka hands file data straight to the consumer via sendfile, with mmap alongside it as the file read/write method.
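In Java, the sendfile-style zero-copy path is exposed through FileChannel.transferTo. A minimal sketch, with illustrative names:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class SendfileDemo {
    // Stream a whole file to a socket. On Linux, transferTo can use
    // sendfile under the hood, so the bytes never enter user space.
    static void sendFileTo(String path, SocketChannel socket) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            long pos = 0;
            long size = ch.size();
            while (pos < size) {
                pos += ch.transferTo(pos, size - pos, socket);
            }
        }
    }
}
```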

Mainstream ES also uses mmap extensively. It is practically a cheat code.

Don’t count your chickens before they hatch.

As many benchmarks have measured, mmap does not always perform this well across the various Linux platforms. When the file does not fit in memory, frequent page swapping and page faults can still occur, so real-world validation is required to confirm the true performance of your service.

Therefore, in the embedded database RocksDB, the mmap-related optimization parameters are turned off by default. mmap should exist as magic for specific scenarios, not as a general-purpose optimization.

```
allow_mmap_reads=false
allow_mmap_writes=false
```
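If you use RocksDB from Java, the equivalent switches look roughly like the sketch below; the setter names are from the RocksJava API as I understand it, and the path is made up.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksMmapConfig {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options()
                 .setCreateIfMissing(true)
                 .setAllowMmapReads(false)   // keep the default: plain reads
                 .setAllowMmapWrites(false); // keep the default: plain writes
             RocksDB db = RocksDB.open(opts, "/tmp/rocksdb-demo")) {
            // ... use db
        }
    }
}
```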

In addition, mmap is not very reliable for writing. Data written through mmap is not actually on disk; it is only persisted when the operating system flushes it, for example when flush is called. For this reason, the WAL logs, translog, redo log, and other files used for data recovery are generally not written through mmap.
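In Java, that explicit flush is MappedByteBuffer.force(). Continuing with the mb buffer mapped earlier:

```java
// Writes land in the page cache first; force() asks the OS to flush
// this mapping's dirty pages to the underlying file before returning.
mb.put(0, (byte) 1);
mb.force();
```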

Another serious problem with mmap is unpredictable I/O pauses. Thanks to the operating system's and the application's various buffers, plus read-ahead, ordinary file manipulation feels smooth. But once you use mmap, you may block under unexpected circumstances, or suffer interference from unreasonable read-ahead, and end up with frequent I/O.

End

Performance optimization is always a double-edged sword. Whether the sword kills the enemy or wounds a teammate depends on the skill of whoever wields it. mmap is no exception: it has both advantages and disadvantages. A bit more awe, and conclusions drawn from practice, is the right attitude.

Xjjdog is a public account that keeps programmers from taking detours. It focuses on infrastructure and Linux. Ten years of architecture, tens of billions of daily requests: come discuss the world of high concurrency and get a different taste. My personal WeChat is xjjdog0; feel free to add me as a friend for further communication.
