【 Introduction 】

This is a problem analysis full of twists and turns: little data, a lot of code, little experience, a lot of concepts. When kernel mode, user mode, DIF, LBA, huge-page memory, SGL, RDMA, NVMe and SSD all come together, is the problem a single-point accident, or a collective failure?

To fix it in memory, and to share it in the hope that it gives others some inspiration, I have written down the analysis process here.


Colleague L needed to write to disk over NVMF in a project and found that the writes failed, with an error code being printed over and over:

The screenshot captures only a few lines, but in practice the error is printed non-stop.

A brief description of the fault:

Writing to disk through NVMF fails, and error code 15 is printed continuously;

By contrast, writing to the local disk works fine.

Note: “disk” here always refers to an SSD. The model currently used in the laboratory is the company’s V3 version (HWE3XXX).


Here are some of the basic abbreviations involved:

Once I get used to treating abbreviations as plain nouns, I tend to overlook the meaning behind them, yet analyzing a problem requires a deeper understanding of exactly these terms. At the beginning I understood neither the terms nor the data processing flow well, so it was very hard to get started.

Analysis steps (1)

Varying the IO size and queue depth when sending IO shows that small IOs almost never trigger the problem, while sending IOs of 1M directly always triggers it.

Therefore, it is obvious that the size of IO is closely related to the occurrence of problems.

Verifying the problem by running the real business workload is cumbersome, but the setup can be reduced to a server side and a requester side, with which the problem reproduces reliably. They are:

1. Run SPDK’s own app, the nvmf_tgt program, which is the NVMF server;

  • After entering the SPDK directory, configure 2M huge pages;
  • Configure the nvmf.conf file, assuming the file is in /opt/yy. Refer to the appendix for the configuration file;
  • Run ./app/nvmf_tgt/nvmf_tgt -c /opt/yy/nvmf.conf;

2. The requester side can be run in either of two ways:

  • One is SPDK’s own perf program, at ./examples/nvme/perf/perf, with the necessary parameters configured. Note: the system also ships a tool called perf; that is not the one meant here. SPDK’s perf is a testing tool that generates a large number of random writes. It is good for verifying whether a fix works, but not helpful for the initial analysis of the problem.
  • The other is a modified version of the HelloWorld program under the NVME directory (the initial version was provided by my colleague C and later improved; below it is called the DEMO program). See the appendix for the code;

Because both run in user mode, debugging is very convenient. With debug mode enabled at both ends and single-step tracing, the error code turns out to be picked up while polling for asynchronous completions, as shown in the figure

The function name says it all: this is where the results of completed requests are processed;

The call comes from here, line 383:

Set a breakpoint at line 303 below. According to the stack information, the error code may come from an asynchronous call inside SPDK, or it may come from the device. Searching the SPDK code finds no place that sets an error code of 15, so it can basically be inferred that the code is returned by the SSD.

According to the initial information, the size of the IO data will affect the occurrence of the problem. When the IO data is small, it will not occur. So where is the cut-off point?

Trying values by bisection in the DEMO program shows that an LBA count of 15 is the cut-off point.
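The bisection can be sketched roughly as follows. This is only an illustration: `write_ok` is a hypothetical stand-in for "run the DEMO program with this many blocks and see whether the write succeeds", the upper bound of 2048 is arbitrary, and the search assumes that writes succeed up to some threshold and fail above it.

```python
def find_max_safe_blocks(write_ok, low=1, high=2048):
    """Return the largest n in [low, high] for which write_ok(n) is True.

    Assumes write_ok is monotonic: True up to a threshold, False above it.
    """
    if not write_ok(low):
        raise ValueError("even the smallest write fails")
    while low < high:
        mid = (low + high + 1) // 2  # bias upward so the loop terminates
        if write_ok(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Hypothetical stand-in for the real test: writes of up to 15 blocks succeed.
def fake_write_ok(num_blocks):
    return num_blocks <= 15


print(find_max_safe_blocks(fake_write_ok))  # prints 15
```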

So how can this be put to use?

Single-stepping brings one parameter into view: the sectors_per_max_io parameter of the namespace (per the NVMe specification, an SSD has one controller and several namespaces).

Modifying this parameter controls the size of each write that finally reaches the disk. With this change, the problem disappeared in the DEMO program test.

However, when the IO size and queue depth are large, either an out-of-memory error code appears or the original error still occurs, and it is very easy to reproduce in multi-disk scenarios.

This gives conditional solution 1:

(1) Modify the parameter at the position above;

(2) Limit the IO size and the number of disks when the service issues IO;

In actual use multiple disks are required, so restricting the service to a single disk is very difficult. This is not an ideal solution.

It was also found that the safe threshold differs between disk versions. The safest value is 7, but from here on the analysis mainly used one disk whose safe line is 15.

Analysis steps (2)

To solve the problem quickly, I started asking widely for help: with such an obvious problem, has anyone else run into it?

After scouring hi3ms, Googling, and asking the relevant colleagues, it turned out that there was not a single other case!

What’s even more strange is that Intel’s baseline report clearly shows NVMF tests with large IO volumes and normal results.

Why is this a problem here?


  • Intel definitely uses Intel disks;
  • We use the company’s own disks;

Is it because of this?

In terms of hardware, there’s not that much difference in theory.

After some exploration, it is found that NVMF is also normal when the hard disk is formatted without DIF. If the hard disk is formatted with DIF, that is, 512+8 format, the problem will appear.

So the reason Intel sees no problem is basically settled: they use the format without DIF. Without DIF the latency is also a little lower, which is easy to understand.

One puzzle remains unanswered: why does the problem not appear with local writes, yet appear with NVMF writes?

This is the most important question to answer.

As a foundation, take a brief look at how a write to an NVMe disk works.

The process is asynchronous;

Before the write, the program prepares the data in the form the queue requires (such as an SGL), then notifies the SSD, and the program’s part is finished;

The SSD then fetches the data from host memory and writes it to the media. When it is done, it notifies the program, and the program checks the completion queue.

So what the program calls “writing the disk” mainly means getting the data ready in the queue; the rest is handled by the SSD device.

After the SPDK API is called, SPDK prepares the queue and then submits. The actual saving of the data is done by the controller in the SSD…
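The asynchronous submit-then-poll pattern described above can be illustrated with a toy model. This is not SPDK or NVMe code: the class, its method names and the status value are all invented for illustration, and the "device" is just a method call.

```python
from collections import deque


class ToyQueuePair:
    """Toy model of a submission/completion queue pair (not real NVMe code)."""

    def __init__(self):
        self.submission = deque()
        self.completion = deque()

    def submit_write(self, lba, num_blocks, callback):
        # Host side: queue the command and return immediately.
        self.submission.append((lba, num_blocks, callback))

    def device_process(self):
        # Device side: fetch queued commands, "write" them, post completions.
        while self.submission:
            lba, num_blocks, callback = self.submission.popleft()
            status = 0  # 0 stands for success in this toy model
            self.completion.append((callback, lba, num_blocks, status))

    def poll_completions(self):
        # Host side: check the completion queue and run the callbacks.
        handled = 0
        while self.completion:
            callback, lba, num_blocks, status = self.completion.popleft()
            callback(lba, num_blocks, status)
            handled += 1
        return handled


results = []
qp = ToyQueuePair()
qp.submit_write(0, 17, lambda lba, n, status: results.append((lba, n, status)))
assert qp.poll_completions() == 0  # nothing has completed yet
qp.device_process()                # the device does its part later
qp.poll_completions()
print(results)  # prints [(0, 17, 0)]
```

The point of the model is only that the submit call returns before the write has happened, and that any error status shows up later, in the polling step.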

But what about an NVMF write? After all, there is a network in the middle; how is that handled…

To make the analysis easier, we chose to modify the DEMO, mainly because perf is more complex, and its random LBAs and large data volume interfere heavily with the analysis.

In the DEMO program, data is submitted starting at LBA 0, 17 blocks at a time (total length 17*520=8840 bytes).

So why 17 blocks?

According to the previous analysis, the safe line of this SSD is 15, so 15 and below cause no problem. 16 is 2 to the fourth power, and in computing powers of 2 are too special, so the ordinary number 17 was chosen.

Next, keep everything else exactly the same and vary only the initialization, to form two modes: a local write and an NVMF write;

As shown in the figure, manually change the parameter in the red box. TR_RDMA and TR_PCIE switch between the two modes.

The purpose of this is to be able to make a complete comparison, to align all the conditions that can be aligned, to see where the problem is in the NVMF.

After a preliminary step trace of the invocation process, we can tease out the basic processing flow of local and NVMF writes:

The local write:

  1. On the requester side, a contiguous 1M buffer is allocated, aligned to 4K.
  2. 17 blocks (that is, only 17*520 bytes of the 1M are used) are written by calling the SPDK API;
  3. The SPDK API calls the PCIe-mode interface (callback functions are registered during system initialization, and the parameter in the red box in the figure above determines at initialization that the call goes to the PCIe interface).
  4. Prepare the data queue, submit the write request to the SSD, and return;
  5. Poll the completion-processing interface and get the write-success notification;

The NVMF write:

Requester side:

(1) On the requester side, a contiguous 1M buffer is allocated, aligned to 4K;

(2) 17 blocks (that is, only 17*520 bytes of the 1M are used) are written by calling the SPDK API;

(3) The SPDK API calls the RDMA-mode interface (as above, the RDMA callback functions are registered during initialization, and the parameter in the red box in the figure above determines that the call goes to the RDMA interface);

(4) Prepare the data queue, send it to the server over the RDMA network, and return;

Server side:

(5) While polling, the server’s RDMA layer is notified that data has arrived;

(6) Assemble the data structures needed for the internal API calls;

(7) The request passes through the BDEV, SPDK and NVMe APIs along the way, the address is converted to a physical address, and finally the PCIe data interface is called to submit it;

(8) Ring the submission doorbell as the specification requires, then return;

Both sides are asynchronous (after submitting a request, one can only wait for the result to arrive asynchronously).

(9) The requester polls the completion-processing interface. If an error occurs, it is printed;

Debug shows that the error code is 15

(10) The server polls the completion-processing interface. If there is an error, it is printed:

Running the same data (starting from LBA 0, 17 blocks) repeatedly through the local path and the NVMF path, and comparing them process by process and parameter by parameter (a dual-screen setup is a great help here), indeed revealed many similarities and differences:

(1) The local write process is almost the same as the requester side of the NVMF write. The difference is that the local write submits the data to the SSD, while the NVMF write calls the RDMA interface.

(2) The NVMF server has a very long call stack (30 frames deep), which does not exist at all in the local write process;

(3) After a series of calls, the NVMF server finally reaches the same function as a local write, nvme_transport_qpair_submit_request;

The conclusion seems obvious: NVMe over RDMA is really NVMe over PCIe once the data has been transferred through RDMA;

(4) In the local write there is only one SGL, with only one SGE in it. On the NVMF requester, before RDMA is called, there is likewise only one SGL with only one SGE;

(5) On the NVMF server, before the disk write, there is only one SGL, but it contains two SGEs;

The whole process is described as follows:

As shown in figure:

This is an important finding. It basically explains why part (1) of solution 1 works in some cases (at the safe line of 15 the data size is less than 8K, which guarantees only 1 SGE in the SGL), but it does not explain why there are cases where it fails.
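The arithmetic behind this observation can be checked with a few lines. The sketch assumes that the server holds incoming data in units of 8192 bytes (the IOUnitSize value in the appendix configuration) and that each unit becomes one SGE; the function name is made up for illustration.

```python
BLOCK_SIZE = 520     # 512 bytes of data + 8 bytes of DIF
IO_UNIT_SIZE = 8192  # IOUnitSize from the appendix configuration


def sge_count(num_blocks, block_size=BLOCK_SIZE, unit=IO_UNIT_SIZE):
    """Number of unit-sized pieces needed to hold num_blocks blocks."""
    total = num_blocks * block_size
    return -(-total // unit)  # ceiling division


for n in (15, 16, 17):
    print(n, n * BLOCK_SIZE, sge_count(n))
# prints:
# 15 7800 1
# 16 8320 2
# 17 8840 2
```

With 15 blocks the data fits in one 8K unit, so there is a single SGE; from 16 blocks on, a second SGE is needed.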

If you think about it, it’s much clearer:

What RDMA receives from the NVMF requester is 1 SGL containing 1 SGE. After passing through RDMA, what the NVMF server has is 1 SGL containing 2 SGEs.

At this point the culprit seems to be basically “locked in”: it is RDMA!

However, after reading the RDMA and SSD documentation, it turns out that 1 SGL with 1 SGE and 1 SGL with 2 SGEs are both perfectly permissible.

RDMA splitting 1 SGE into 2 SGEs after receiving the data does look suspicious, but according to the documentation this by itself should not constitute a problem.

To verify whether multiple SGEs in one SGL are a problem, we modified the DEMO again: before constructing the write, the data is divided into multiple SGEs, as shown in the figure:

I tried NVMF first and found that the problem reproduced, just as with the previous NVMF runs.

Next I tried the local path and found no problem. In other words, the doubt was neither confirmed nor removed.

Analysis steps (3)

With no other way forward, I went back and started all over again. By chance, in one NVMF submission, I noticed that of the 2 SGE addresses, the second SGE’s address came before the first SGE’s address. Watching more closely, even in the DEMO program the addresses are somewhat random: most of the time they are in order, and in a minority of cases they are reversed. Either way, one SGE is not contiguous with the other; there is a gap between SGE1 and SGE2.

I immediately constructed the same layout,

wrote it locally, and the problem reproduced!

This is an “important finding”! The problem can be reproduced locally too!

It can almost be concluded that NVMF is not the key! That would rule out RDMA!

When writing the disk, there is no problem if the data regions of the SGEs are completely contiguous, and there is a problem if they are not.

From this it is easy to infer that the current SSD does not support discontiguous SGEs! So is it the SSD?!

And then… (Omit a paragraph here…)



It turns out there is nothing wrong with the SSD. The problem is the length 8192; the correct length is 8320!

What’s 8320? What’s 8192?

8192 is 512 times 16;

8320 is 520 times 16;


The truth is that the length of the data in the SGE is not aligned to the basic block size of 520! The current format is 512+8=520, including the DIF area.

The error code is saying that the data blocks are not aligned and that the length in the SGE is invalid!
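A small sketch makes the misalignment concrete. It assumes the 17-block DEMO write (8840 bytes) is cut into consecutive pieces of at most one unit each, once with a unit of 8192 and once with 8320; the helper names are illustrative only.

```python
def split_lengths(total, unit):
    """Split total bytes into consecutive pieces of at most unit bytes."""
    lengths = []
    while total > 0:
        piece = min(total, unit)
        lengths.append(piece)
        total -= piece
    return lengths


def all_block_aligned(lengths, block_size=520):
    """True if every piece is a whole number of blocks."""
    return all(length % block_size == 0 for length in lengths)


total = 17 * 520  # 8840 bytes, as in the DEMO
for unit in (8192, 8320):
    pieces = split_lengths(total, unit)
    print(unit, pieces, all_block_aligned(pieces))
# prints:
# 8192 [8192, 648] False
# 8320 [8320, 520] True
```

With a unit of 8192 neither piece is a multiple of 520 (8192 = 15*520 + 392), whereas with 8320 the pieces are exactly 16 blocks and 1 block.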

Once this basic parameter is corrected everywhere it is used,

the DEMO works locally,

the DEMO works over NVMF,

and it looks as if the truth is out…

However, the happiness lasted only a few minutes: the problem came back when perf sent 1M IOs!

Analysis steps (4)

Careful tracing showed that although the problem had come back, the errors were not printed as densely as before. Moreover, single-stepping showed that whenever the address of the SGE data ended in FF000, the problem occurred.

Tracing this address back, it is already present at the source, as soon as RDMA receives the data, and it only occasionally ends in FF000, which explains why the errors are less dense.

So it seems RDMA is still the problem

Further analysis reveals that these addresses are not allocated by RDMA on the spot; they are taken from a buffer queue.

You can basically assume that the buffer queue holds many buffers, and occasionally a buffer whose address ends in FF000 is handed out, and that is exactly when the problem occurs.

So why is this kind of address a problem?

Remember the first step, setting up 2M huge pages? SPDK is based on DPDK, and DPDK’s memory queues require huge-page memory, most commonly 2M huge pages.

These buffers are taken from DPDK’s huge pages, and an address ending in FF000 is close to a 2M boundary. For ordinary use of the buffers this is no problem, but the SSD does not accept a data region that crosses a huge-page boundary. So when the submission queue is prepared, a buffer that crosses a huge-page boundary is split into 2 SGEs. A buffer starting at an address ending in FF000 has room for only 4096 bytes before the boundary, so 4096 bytes go into one SGE and the rest into the next. 4096 is not a multiple of 520, hence the problem.
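The split can be reproduced with a short sketch. The buffer address below is hypothetical, chosen so that it sits 4096 bytes before a 2M boundary, and the function is an illustration of the splitting rule described above, not SPDK code.

```python
HUGEPAGE_SIZE = 2 * 1024 * 1024  # 2M huge page
BLOCK_SIZE = 520


def split_at_hugepage(addr, length, page=HUGEPAGE_SIZE):
    """Split [addr, addr+length) into pieces that never cross a page boundary."""
    pieces = []
    while length > 0:
        room = page - (addr % page)  # bytes left before the next boundary
        piece = min(length, room)
        pieces.append((addr, piece))
        addr += piece
        length -= piece
    return pieces


# Hypothetical buffer address ending in 1FF000: 4096 bytes before a 2M boundary.
addr = 0x7F00001FF000
lengths = [size for _, size in split_at_hugepage(addr, 8320)]
print(lengths, [size % BLOCK_SIZE for size in lengths])
# prints: [4096, 4224] [456, 64]
```

Even with a block-aligned total of 8320 bytes, both pieces end up with a length that is not a multiple of 520.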

The targeted fix is to add a check where the buffer address is obtained, and to skip any address of this kind.
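The idea of the check can be sketched as below. This is a simplified model, not the actual change to the SPDK code: the pool is a plain list of hypothetical addresses, and `get_buffer` and `crosses_hugepage` are invented names.

```python
HUGEPAGE_SIZE = 2 * 1024 * 1024


def crosses_hugepage(addr, length, page=HUGEPAGE_SIZE):
    """True if [addr, addr+length) spans a huge-page boundary."""
    return addr // page != (addr + length - 1) // page


def get_buffer(pool, length):
    """Pop buffers from pool until one does not cross a boundary.

    Skipped addresses are returned too, so the caller can decide
    what to do with them.
    """
    skipped = []
    while pool:
        addr = pool.pop(0)
        if crosses_hugepage(addr, length):
            skipped.append(addr)
            continue
        return addr, skipped
    return None, skipped


pool = [0x1FF000, 0x201000]  # hypothetical buffer addresses
buf, skipped = get_buffer(pool, 8320)
print(hex(buf), [hex(a) for a in skipped])
# prints: 0x201000 ['0x1ff000']
```

The cost of this approach is that every skipped buffer is memory that was allocated but is never used for IO.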



Hold your breath…

But, surprisingly, testing with perf at large IO sizes still shows the problem!

Don’t be discouraged, fight again!

Turn on logging and continue the analysis. Because the processing is asynchronous and the test involves a large amount of data, logs had to be added at key points to record the details of address assignment. The main points are: where the request is submitted (the file and code lines shown above, so the code is not posted again), where RDMA first receives the data, and where the completion result is handled.

As the logs show, there is another kind of address assignment anomaly, which also leads to a wrong length in the SGE, as shown in the figure below:

Once again, modify the place where the address is obtained so that it is filtered out, merging the two skip conditions into one.

See the figure (lines 471~475, plus the nvmf_request_get_buffers function, which needs the skip handling added):



All use cases pass the test!

Problem gone!

This gives a second solution, as shown in the code above, which resolves the problem once and for all.

The problem is solved, but skipping some special addresses wastes a little memory,

and this kind of change always felt too crude! It eliminates the problem, but it does not feel right!

Analysis steps (5)

Is there another way?

Keep digging, with these questions in mind.

Since RDMA only takes buffers from the buffer queue, there must be a place where the queue is allocated. Allocating such buffers and then never using them is obviously a bit wasteful, although workable. Better not to allocate such buffers in the first place.

Tracing all the way back, I finally found where the allocation is requested, but it is very complicated and takes time to digest.

I also found a long text description related to how the addresses are assigned.

With this information, you can step through the buffer allocation process and work out how changing one parameter affects the rest of the process.

Red box 1 is the default parameter in the code. It is changed to red box 2, where the two parameters mean single producer and single consumer. The DEMO program matches this pattern exactly.



With this change, the SGE addresses that RDMA obtains grow in one direction.

Problem gone!

One parameter eliminates the problem and, by contrast, is much more comfortable!

【 Summary 】

(1) The NVMF configuration file needs to set IOUnitSize explicitly to an integer multiple of the block size in use. For the 520-byte blocks currently used, 8320 is the recommended value. In addition, modify the parameter used when creating the memory pool; the single parameter shown in the final figure is enough.
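One way to arrive at such a value is to round the configured size up to the next multiple of the block size. The helper below is only an illustration of that arithmetic and is not part of SPDK.

```python
def round_up_to_block(size, block_size):
    """Smallest multiple of block_size that is >= size."""
    return -(-size // block_size) * block_size


print(round_up_to_block(8192, 520))  # prints 8320
print(round_up_to_block(8192, 512))  # prints 8192
```

For 520-byte blocks, rounding 8192 up gives 16*520 = 8320, the value recommended above; for plain 512-byte blocks 8192 is already aligned.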

(2) The process was very tortuous, but as long as you do not give up, follow the code, read the documentation, make bold assumptions, verify them carefully, and keep iterating, you will eventually find the problem. If you are already familiar with the relevant concepts and the processing flow, you will save a lot of time.

(3) Finally, a recommendation: VSC with remote-ssh can present the code on the Linux machine directly for visual debugging, letting you move freely through the code to wherever the confusion is. It was a great help in this analysis.


The NVMF configuration file is as follows:

[Global]

[Nvmf]

[Transport]
  Type RDMA
  InCapsuleDataSize 16384
  IOUnitSize 8192

[Nvme]
  TransportID "trtype:PCIe traddr:0000:04:00.0" Nvme0
  TransportID "trtype:PCIe traddr:0000:05:00.0" Nvme1
  TransportID "trtype:PCIe traddr:0000:82:00.0" Nvme2

[Subsystem1]
  NQN nqn.2020-05.io.spdk:cnode1
  Listen RDMA
  SN SPDK001
  MN SPDK_Controller1
  AllowAnyHost Yes
  Namespace Nvme0n1 1

[Subsystem2]
  NQN nqn.2020-05.io.spdk:cnode2
  Listen RDMA
  SN SPDK002
  MN SPDK_Controller1
  AllowAnyHost Yes
  Namespace Nvme1n1 1

[Subsystem3]
  NQN nqn.2020-05.io.spdk:cnode3
  Listen RDMA
  SN SPDK003
  MN SPDK_Controller1
  AllowAnyHost Yes
  Namespace Nvme2n1 1
