Foreword

RocketMQ borrows from Kafka in its underlying storage, but it also has its own unique design. This article focuses on the underlying file storage structure that profoundly affects RocketMQ's performance, with a bit of Kafka thrown in for comparison.

Terminology

Commit Log: a collection of files, each 1 GB in size; when one fills up, writing continues in the next. For the purposes of this discussion it can be treated as a single file. All message content is persisted to it.

Consume Queue: a Topic can have multiple Consume Queues. Each file represents a logical queue and stores, for each message, its offset in the Commit Log along with its size and the hash code of its Tag.
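To make the Consume Queue entry concrete, here is a minimal sketch of its fixed 20-byte layout (an 8-byte Commit Log offset, a 4-byte size, and an 8-byte Tag hash code, matching the 20 bytes per entry mentioned later in this article); the class name is illustrative, not RMQ's actual API:

```java
import java.nio.ByteBuffer;

// A Consume Queue entry is a fixed 20-byte record pointing into the Commit Log.
public final class ConsumeQueueEntry {
    public static final int ENTRY_SIZE = 20; // 8 + 4 + 8 bytes

    // Encode one entry: where the message sits in the Commit Log,
    // how many bytes it occupies, and the hash code of its Tag (used for filtering).
    public static ByteBuffer encode(long commitLogOffset, int size, long tagsCode) {
        ByteBuffer entry = ByteBuffer.allocate(ENTRY_SIZE);
        entry.putLong(commitLogOffset);
        entry.putInt(size);
        entry.putLong(tagsCode);
        entry.flip(); // ready to be appended to the Consume Queue file
        return entry;
    }
}
```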

Let me give you a simple example.

Suppose a Topic in the cluster has four Consume Queues, as shown in the figure below, and we send five messages with different contents in order.

Let’s briefly focus on the Commit Log and Consume Queue.

RMQ messages are ordered overall, so the contents of these five messages are persisted to the Commit Log in sequence. The Consume Queues then spread the messages evenly across the different logical queues; in cluster mode, multiple Consumers can consume the Consume Queues' messages in parallel.
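As a rough sketch of how those five messages land in the four queues, the default Producer picks queues round-robin (a simplified illustration of the idea, not RMQ's actual selector code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified round-robin queue selection: message i lands in queue (i-1) % 4.
public class RoundRobinDemo {
    public static void main(String[] args) {
        int queueCount = 4;                          // four Consume Queues
        AtomicInteger counter = new AtomicInteger(); // per-Producer send counter
        for (int i = 1; i <= 5; i++) {
            int queueId = counter.getAndIncrement() % queueCount;
            System.out.printf("message %d -> Consume Queue %d%n", i, queueId);
        }
    }
}
```

Running this puts messages 1 through 4 into queues 0 through 3, with message 5 wrapping back to queue 0.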

Page Cache

Now that you know where each file lives and what it holds, it's time to talk about the performance benefits of this storage design.

Generally, file reads and writes are slow, yet sequential file reads and writes can be almost as fast as random reads and writes in memory. The reason is the Page Cache.

To give you an idea: the machine has 3.7 GB of physical memory in total, of which 2.7 GB is used. That should leave about 1 GB free, yet the OS reports only 175 MB. Of course the math doesn't actually work that way.

When the OS sees a large amount of idle physical memory, it uses the surplus as a file cache, the buff/cache, to improve I/O performance.

When reading from disk, the OS reads the surrounding area into the Cache as well, so that the next read can hit the Cache. When writing to disk, the OS writes the data to the Cache and returns immediately; the kernel's pdflush threads later flush the dirty Cache pages back to disk according to their flush policies.

However, there are so many files on a system that even the generous Page Cache is a precious resource; the OS cannot hand Page Cache to every file indiscriminately. Linux provides mmap to map a file specified by a program into its virtual memory, so that reading and writing the file becomes reading and writing memory, making full use of the Page Cache. But the Page Cache alone is not enough for file I/O: random reads and writes to a mapped file can still cause many page faults in virtual memory.
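In Java, mmap is exposed through FileChannel.map, which is the mechanism RMQ builds on for its storage files. A minimal sketch, assuming a 1 GB segment (the file name here is made up):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        long size = 1024L * 1024 * 1024; // 1 GB, the Commit Log segment size
        try (RandomAccessFile raf = new RandomAccessFile("demo_commitlog", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map the file into the process's virtual address space.
            // No physical memory is allocated yet; pages fault in on first access.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.put("hello".getBytes()); // writing to memory == writing to the Page Cache
        }
    }
}
```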

Each user-space process has its own virtual memory, and every process believes it owns all of physical memory. But virtual memory is only logical memory: to reach the data, a virtual address must be translated to a physical address by the Memory Management Unit (MMU) through a page-table lookup. If the mapped file is very large and the program touches virtual memory that is not currently backed by physical memory, a page fault occurs, and the OS must read the real data from the disk file and load it into memory. That process is relatively slow, much like an application that misses its cache and has to query the database directly, then write the result back into the cache.

With sequential I/O, however, the regions being read and written are hot spots that the OS keeps cached, so no flood of page faults occurs; file I/O performs almost like memory I/O, and performance naturally improves.

The kernel hands free memory to the Page Cache whenever it can. If a program needs to allocate new memory, or a page fault occurs while free memory is scarce, the kernel must reclaim memory from the colder parts of the Page Cache, and reclaiming takes time. That can cause latency spikes on very performance-sensitive systems.

A word about mmap: its full usage is complex and won't be covered here, but keep one thing in mind. Calling mmap with a file's fd merely reserves a contiguous region in the process's address space (virtual memory) for the mapping; the kernel does not actually load the file into physical memory. Memory is ultimately allocated through page faults, which obviously hurts performance, so it is best to call madvise with the WILLNEED flag to warm up the Page Cache for the mapping, and then keep it warm so the pages don't go cold and get reclaimed.
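Java has no direct madvise API, so reaching it requires going through native code; RMQ does this via JNA. A hedged sketch of such a binding, assuming JNA 5.x on the classpath and a Linux host:

```java
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.NativeLong;
import com.sun.jna.Pointer;

// Minimal JNA binding for madvise; Linux-only sketch.
public interface LibC extends Library {
    LibC INSTANCE = Native.load("c", LibC.class);
    int MADV_WILLNEED = 3; // from <sys/mman.h> on Linux

    int madvise(Pointer addr, NativeLong length, int advice);

    // Usage, given the start address and size of a mapped region
    // (the address can be obtained via ((sun.nio.ch.DirectBuffer) buf).address()):
    //   LibC.INSTANCE.madvise(new Pointer(address), new NativeLong(size),
    //                         LibC.MADV_WILLNEED);
}
```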

Flushing to Disk

Flushing generally comes in two flavors: synchronous flush and asynchronous flush.

Synchronous Flush

A success response is returned to the Producer only after the message has actually hit the disk. As long as the disk is not damaged, messages are never lost.

This is generally used only in financial scenarios. It is not the focus of this article, because RMQ optimizes synchronous flushing with GroupCommit rather than by exploiting the Page Cache.

Asynchronous Flush

Reads and writes both make full use of the Page Cache: writing to the Page Cache already counts as success, and a response is returned to the Producer. RMQ has two asynchronous flushing modes, but the overall principle is the same.
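A minimal sketch of the idea, assuming a MappedByteBuffer field named mappedBuffer like the one from the earlier mmap sketch; the 500 ms interval matches RMQ's default flush interval:

```java
// Background flusher: the writer returns as soon as data hits the Page Cache;
// this thread periodically forces the mapping's dirty pages to disk.
Thread flusher = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            Thread.sleep(500);    // flush interval; RMQ's default is 500 ms
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
        mappedBuffer.force();     // flush this mapping's dirty pages to disk
    }
});
flusher.setDaemon(true);
flusher.start();
```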

Flushing is driven by both the application and the OS

Start with the OS. When a program writes a file sequentially, the data is written to the Cache first. The modified-but-not-yet-flushed portions of memory are inconsistent with the disk; these inconsistent pages are called dirty pages.

If the dirty-page thresholds are set too low, flushes become frequent and performance suffers. If they are set too high, performance improves, but if the OS crashes before the dirty pages are flushed, those messages are lost.

The preceding figure shows CentOS's default configuration: vm.dirty_ratio is the threshold for blocking flushes, while vm.dirty_background_ratio triggers non-blocking background flushing. For better performance you can raise both values, then benchmark.
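On a Linux machine you can read the current thresholds from /proc; a small sketch:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class DirtySettings {
    public static void main(String[] args) throws Exception {
        // The same values exposed as vm.dirty_ratio / vm.dirty_background_ratio.
        System.out.println("dirty_ratio = "
                + Files.readString(Path.of("/proc/sys/vm/dirty_ratio")).trim());
        System.out.println("dirty_background_ratio = "
                + Files.readString(Path.of("/proc/sys/vm/dirty_background_ratio")).trim());
    }
}
```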

RMQ pursues high performance: a message being sent is written to the Page Cache rather than directly to disk, and a message being received is fetched straight from the Page Cache rather than faulted in from disk.

Both RMQ's Commit Log and Consume Queue do their I/O through mmap'ed files.

RMQ Send Logic

When sending, the Producer does not interact with the Consume Queue directly. As mentioned above, all RMQ messages are stored in the Commit Log, and a lock is acquired before writing to the Commit Log so that messages cannot be stored interleaved.

Once the lock is held, message persistence is serialized, and the Commit Log is written sequentially, also known as the Append operation. Thanks to the Page Cache, RMQ writes the Commit Log very efficiently.
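A minimal sketch of that locked, sequential Append, loosely modeled on RMQ's CommitLog (the names are illustrative, not its real API):

```java
import java.nio.MappedByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

public class CommitLogAppender {
    private final ReentrantLock putMessageLock = new ReentrantLock();
    private final MappedByteBuffer writeBuffer; // the mmap'ed Commit Log segment

    public CommitLogAppender(MappedByteBuffer writeBuffer) {
        this.writeBuffer = writeBuffer;
    }

    public void append(byte[] serializedMsg) {
        putMessageLock.lock(); // serialize writers so entries never interleave
        try {
            writeBuffer.put(serializedMsg); // sequential write lands in the Page Cache
        } finally {
            putMessageLock.unlock();
        }
    }
}
```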

After a message is persisted to the Commit Log, its data is dispatched to the corresponding Consume Queue.

Each Consume Queue represents a logical queue. Entries are appended to it by the ReputMessageService in a single-threaded loop, so these writes are clearly sequential as well.
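A hedged sketch of that single-threaded dispatch loop; DispatchRequest and the fields and methods on commitLog and consumeQueues below are illustrative stand-ins, not RMQ's real ReputMessageService API:

```java
// One thread replays the Commit Log and appends 20-byte index entries
// to the matching logical queue.
void dispatchLoop() throws InterruptedException {
    while (running) {
        DispatchRequest req = commitLog.readNext(reputOffset); // next persisted message
        if (req == null) {        // caught up with the writer, wait a moment
            Thread.sleep(1);
            continue;
        }
        consumeQueues.get(req.topic(), req.queueId())
                     .putEntry(req.commitLogOffset(), req.size(), req.tagsCode());
        reputOffset += req.size(); // advance the replay position
    }
}
```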

RMQ Consume Logic

When consuming, the Consumer does not deal with the Commit Log directly; instead it pulls data from the Consume Queue.

Pulls proceed from old to new, so each Consume Queue is read sequentially, again making full use of the Page Cache.

The Consume Queue stores no message data, only references into the Commit Log, so a second read goes to the Commit Log.

The Commit Log performs random reads

However, the entire RMQ instance has only one Commit Log. Even though it is read randomly, the reads are on the whole ordered from older offsets toward newer ones, so as long as the relevant region is in the Page Cache, the Page Cache is still used to full effect.
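Putting the two reads together, a sketch of the consume-side lookup, assuming cqBuffer is positioned at the next Consume Queue entry and commitLogBuffer maps the Commit Log (a single-segment simplification):

```java
// 1) Sequential read: one 20-byte Consume Queue entry.
long commitLogOffset = cqBuffer.getLong(); // where the message sits in the Commit Log
int size = cqBuffer.getInt();              // how many bytes the message occupies
long tagsCode = cqBuffer.getLong();        // Tag hash code, usable for filtering

// 2) Random read: fetch the message body from the Commit Log mapping.
ByteBuffer msg = commitLogBuffer.duplicate();
msg.position((int) commitLogOffset);       // single-segment simplification
byte[] body = new byte[size];
msg.get(body);                             // served from the Page Cache when hot
```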

Watch the network and disk on a real MQ machine: even while Consumers keep pulling messages, you will hardly ever see the process read from disk; the data goes straight from the Page Cache to the Consumer over the socket.

Comparison with Kafka

RMQ takes its ideas from Kafka but breaks with Kafka's underlying storage design.

Kafka has only one kind of file for storing messages: the Partition (setting aside the finer-grained segments). It performs both of the duties that the Commit Log and the Consume Queue split between them in RMQ: it is partitioned logically to increase consumption parallelism, and it stores the real message content internally.

This looks perfect: for both producer and consumer, each individual Partition file is logically written and consumed sequentially, enjoying the huge performance boost of the Page Cache. But each Topic is divided into N Partitions, and once many such files are read and written concurrently, all that "sequential" I/O turns into random I/O from the OS's point of view.

For some reason, the game whack-a-mole comes to mind. At a single hole, I always hit the moles in order; but if there are 10,000 holes, only one of me, and countless moles popping in and out of every hole, then my hits are effectively random. Picture that scene.

Admittedly, RMQ's Consume Queue is in a similar position: each file is sequential I/O, but across all the files the I/O is random. Keep in mind, though, that RMQ's Consume Queue does not store message content; each entry takes only 20 bytes, so the files stay very small and most accesses hit the Page Cache rather than the disk. In a production deployment, the Commit Log and the Consume Queues can also be placed on different physical SSDs to avoid I/O contention between the two classes of files.

Afterword

For more articles like this, follow my WeChat official account: Eric's Technological arena.