DatenLord | RDMA with a Rust implementation

Author: Wang Pu / Editor: Zhang Handong


RDMA is a high-speed networking technology commonly used in high performance computing (HPC), and also widely used in dedicated scenarios such as storage networks. Its most important feature is the combination of hardware and software: when transmitting data over the network, neither the CPU nor the kernel is involved, which makes high-performance transmission possible. The earliest RDMA required an InfiniBand (IB) network with dedicated IB network adapters and IB switches. Today RDMA can also run over Ethernet switches, but a dedicated RDMA-capable network card is still required. There are software implementations of RDMA on top of ordinary Ethernet cards, but they offer no performance advantage.

Using RDMA in practice requires programming against a specific interface, and since the CPU/kernel is not involved in the RDMA data transfer, much of the underlying work must be done by the RDMA program itself. For example, all the memory management involved in an RDMA transfer has to be done by developers through the RDMA interface, or even implemented by themselves; unlike socket programming, there is no kernel to handle caching and the like. It is this complexity of RDMA programming, together with the high cost of earlier RDMA hardware, that has prevented RDMA from being as widely used as TCP/IP.

This article introduces the problems we encountered when wrapping the RDMA C interface in Rust, and discusses how to build a safe Rust encapsulation of RDMA. It first gives a brief introduction to the basic RDMA programming model, then covers the technical problems we hit when wrapping the C interface of RDMA in Rust, and finally outlines follow-up work. Our Rust RDMA wrappers are open source: rdma-sys, the unsafe wrapper of the RDMA interface, and async-rdma, the safe wrapper (not yet complete).

RDMA programming philosophy

I’ll start with a brief introduction to RDMA programming. Since this article is not about how to program with RDMA, I’ll focus on the concepts. RDMA stands for Remote Direct Memory Access; as the name suggests, RDMA achieves direct access to remote memory, and many of its operations are about how to access memory between a local node and a remote node.

RDMA data operations are divided into “two-sided” (send/receive) and “one-sided” (read/write); in essence both share memory between the local and remote nodes. Two-sided operations require the CPUs of both nodes to participate, while one-sided operations involve only one side’s CPU: for the other side the transfer is completely transparent and does not even trigger an interrupt. As this suggests, one-sided transfers are the main method used to move large amounts of data. But one-sided transfers also face the following challenges:

  1. Since the kernel does not buffer data for RDMA during transfer, RDMA requires that the size of the data written not exceed the size of the shared memory region prepared by the receiver, otherwise an error occurs. Therefore, the sender and receiver must agree on the size of each write before the data is written.

  2. In addition, because the kernel does not participate in the data transfer, it might otherwise swap out the memory that the local node shares with remote nodes via RDMA. The shared memory must therefore be registered with the kernel and pinned as resident in physical memory, so that remote nodes can safely access the local node’s shared memory through RDMA.

  3. Furthermore, although registering the shared memory with the kernel prevents it from being swapped out, the kernel does not guarantee that access to the shared memory is safe. That is, while a program on the local node updates the shared memory, a remote node may be reading it, causing the remote node to read inconsistent data; conversely, while a remote node writes to the shared memory, programs on the local node may also be writing to it, resulting in data conflicts or inconsistencies. Developers who program with RDMA must ensure the consistency of data in shared memory themselves, which is the most complex aspect of RDMA programming.

In short, RDMA bypasses the kernel during data transfer, which greatly improves performance but also introduces a lot of complexity, especially around memory management, which developers must solve themselves.

The unsafe wrapper for RDMA

The main programming interface of RDMA is rdma-core, implemented in C. At first we thought a Rust encapsulation of rdma-core could easily be generated with rust-bindgen, but we encountered many problems in practice.

First of all, a large number of rdma-core interface functions are defined inline, at least several hundred of them. bindgen ignores all inline functions when generating Rust bindings, so we had to implement them manually. There are several other open source projects in the Rust community that wrap rdma-core, but none of them handles inline functions well. When implementing the Rust wrappers for rdma-core’s inline functions ourselves, we kept the function and parameter names unchanged.

Second, there are many macro definitions in rdma-core, and bindgen ignores all of them when generating Rust bindings. Therefore, we had to manually implement some key macros, especially the interface functions and key constants defined as macros in rdma-core.

Furthermore, many data structure definitions in rdma-core use unions, and bindgen does not handle C unions well; it cannot directly convert them into Rust unions. More seriously, the rdma-core data structures also use anonymous unions, as shown below:

struct ibv_wc {
    ...
    union {
        __be32      imm_data;
        uint32_t    invalidated_rkey;
    };
    ...
};

Since Rust does not support anonymous unions, bindgen automatically generates union type names in the Rust bindings for these rdma-core anonymous unions. However, the names bindgen generates are unfriendly to developers, such as ibv_flow_spec__bindgen_ty_1__bindgen_ty_1, so we manually redefined all the anonymous unions, as follows:

#[repr(C)]
pub union imm_data_invalidated_rkey_union_t {
    pub imm_data: __be32,
    pub invalidated_rkey: u32,
}

#[repr(C)]
pub struct ibv_wc {
    ...
    pub imm_data_invalidated_rkey_union: imm_data_invalidated_rkey_union_t,
    ...
}
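As a quick illustration of how such a hand-defined `#[repr(C)]` union behaves from Rust, here is a standalone sketch. Note the assumption: `__be32` is modeled as a plain `u32` alias, which matches its byte width but not rdma-core’s real typedef. Reading any union field requires `unsafe`, since the compiler cannot know which member was last written:

```rust
// Illustrative sketch only: `__be32` is modeled as a plain `u32` alias here,
// not rdma-core's actual big-endian typedef.
#[allow(non_camel_case_types)]
type __be32 = u32;

#[repr(C)]
#[allow(non_camel_case_types)]
pub union imm_data_invalidated_rkey_union_t {
    pub imm_data: __be32,
    pub invalidated_rkey: u32,
}

fn main() {
    let u = imm_data_invalidated_rkey_union_t { imm_data: 0xdead_beef };
    // Reading a union field is unsafe: the compiler cannot track which
    // member is currently active.
    let rkey = unsafe { u.invalidated_rkey };
    // Both fields alias the same four bytes, so the bit pattern is shared.
    assert_eq!(rkey, 0xdead_beef);
}
```

Both fields alias the same four bytes, so writing `imm_data` and reading `invalidated_rkey` yields the same bit pattern, exactly as the C union does.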

Again, rdma-core references many C data structures, such as pthread_mutex_t and sockaddr_in, which are already defined in the Rust libc crate and should not be redefined by bindgen. So we need to configure bindgen not to duplicate the data structures already defined in libc.

To summarize, bindgen only does half the job of generating the rdma-core bindings, leaving a lot of manual work that is quite delicate. The upside, though, is that the RDMA interface is already stable, so this work only needs to be done once, with few subsequent updates.

Safe encapsulation of RDMA

There are two levels to consider regarding the safe encapsulation of RDMA:

  • How to comply with Rust specifications and practices;
  • How to implement memory security for RDMA operations.

First, how can RDMA’s various data structure types be wrapped in Rust-friendly types? rdma-core is full of pointers, most of which bindgen defines as *mut and a few as *const. In Rust, these raw pointer types are neither Sync nor Send and therefore cannot be accessed from multiple threads. Moreover, the data structures behind these raw pointers are allocated and released by rdma-core itself. For example, struct ibv_wq is created by the ibv_create_wq() function and released by the ibv_destroy_wq() function:

struct ibv_wq *ibv_create_wq(...);

int ibv_destroy_wq(struct ibv_wq *wq);

However, when developing RDMA applications in Rust, the Rust code does not directly manage the lifecycle of the struct ibv_wq data structure. Furthermore, the various data structures created by rdma-core are not modified directly by Rust code, which operates on the RDMA data structures/pointers by calling rdma-core’s interface functions. So for Rust code, a pointer to a data structure generated by rdma-core is essentially a handle, and it does not matter whether that handle is a raw pointer type. Therefore, to facilitate multithreaded access from Rust code, we convert all raw pointers returned by rdma-core to usize, and convert back from usize to the corresponding raw pointer types when we need to call rdma-core’s interface functions. This sounds like a hack, but the reasoning behind it is sound. Further, for resources that need to be released manually in rdma-core, we can implement Rust’s Drop trait, calling the corresponding rdma-core interface in the drop() function to release the resource automatically.
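The handle-plus-Drop pattern described above can be sketched without any RDMA dependency. In this sketch, `FakeWq`, `create_resource()`, and `destroy_resource()` are hypothetical stand-ins for rdma-core calls such as ibv_create_wq()/ibv_destroy_wq():

```rust
// Hypothetical stand-ins for a resource that rdma-core would allocate and
// free for us, e.g. via ibv_create_wq() / ibv_destroy_wq().
struct FakeWq {
    id: u32,
}

fn create_resource() -> *mut FakeWq {
    Box::into_raw(Box::new(FakeWq { id: 42 }))
}

fn destroy_resource(ptr: *mut FakeWq) {
    // Safety: `ptr` must come from create_resource() and be freed only once.
    unsafe { drop(Box::from_raw(ptr)) };
}

// Store the raw pointer as usize so the wrapper is Send + Sync; convert
// back to the raw pointer type only at the FFI boundary.
struct WorkQueue {
    handle: usize,
}

impl WorkQueue {
    fn new() -> Self {
        WorkQueue { handle: create_resource() as usize }
    }

    fn id(&self) -> u32 {
        let ptr = self.handle as *mut FakeWq;
        unsafe { (*ptr).id }
    }
}

impl Drop for WorkQueue {
    fn drop(&mut self) {
        // Release the underlying resource exactly once, automatically.
        destroy_resource(self.handle as *mut FakeWq);
    }
}

fn main() {
    let wq = WorkQueue::new();
    assert_eq!(wq.id(), 42);
    // `wq` goes out of scope here and destroy_resource() runs via Drop.
}
```

The usize conversion is what makes the wrapper type Send and Sync; the Drop implementation ties the C-side resource lifetime to the Rust value’s scope.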

Second, our work on RDMA memory safety is not yet complete. At present, the access safety of RDMA shared memory is a hot research topic in academia, and there is no perfect solution yet. In essence, the problem arises because RDMA bypasses the kernel and shares memory directly to achieve high-performance network transmission, while it is normally the kernel that does the heavy lifting of memory management. Since RDMA’s data transfer bypasses the kernel, RDMA cannot rely on the kernel’s memory management mechanisms to guarantee memory safety. Yet moving all of the kernel’s memory management work into user space just to secure RDMA shared memory access would be both too complex and bad for performance.

In practice, there are conventions on how RDMA is used, such as not allowing remote nodes to write to a local node’s shared memory, only to read it. But even if only remote reads are allowed, data inconsistency can still occur. For example, suppose a remote node has read the first half of the shared memory when the local node starts updating it. If the local node updates only a small amount of data while the remote node is reading a large amount, the local node’s update runs faster than the remote node’s read, so the local node may finish updating the second half before the remote node reads it. The remote node then ends up with inconsistent data: a first half without the update and a second half with it. Such data is neither a previous consistent version nor the new version, breaking the guarantee of data consistency.

A common solution to the RDMA memory safety problem is to use lock-free data structures. In essence, lock-free data structures solve the problem of memory safety under concurrent access: when multiple threads modify data concurrently, they guarantee consistent results. The remote-read/local-write scenario above can be handled with a seqlock. That is, each RDMA shared memory region is associated with a sequence number. The local node increments the sequence number before each modification of the shared memory, and the remote node checks whether the sequence number has changed between the beginning and the end of its read. If the sequence number is unchanged, the shared memory was not modified during the read; if it has changed, the shared memory was modified and the remote node may have read inconsistent data, so it reads the shared memory again.
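The sequence-number protocol can be sketched in plain Rust. This is a minimal single-writer model under stated assumptions: in real RDMA use, the data and the sequence number would live in registered shared memory and the remote node would fetch both via one-sided reads, and a production version would need UnsafeCell plus volatile accesses to share the data across threads. The sketch only demonstrates the protocol itself:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Single-writer seqlock sketch. The writer makes the sequence number odd
// while a write is in progress and even when the data is stable; the
// reader retries until it sees the same even sequence number before and
// after copying the data.
struct SeqLock {
    seq: AtomicU64,
    data: [u64; 2],
}

impl SeqLock {
    fn new() -> Self {
        SeqLock { seq: AtomicU64::new(0), data: [0; 2] }
    }

    // Writer side: bump to odd before writing, back to even after.
    fn write(&mut self, value: [u64; 2]) {
        self.seq.fetch_add(1, Ordering::Release); // odd: write in progress
        self.data = value;
        self.seq.fetch_add(1, Ordering::Release); // even: write complete
    }

    // Reader side: retry until a consistent snapshot is observed.
    fn read(&self) -> [u64; 2] {
        loop {
            let start = self.seq.load(Ordering::Acquire);
            if start % 2 != 0 {
                continue; // a write is in progress, try again
            }
            let snapshot = self.data;
            let end = self.seq.load(Ordering::Acquire);
            if start == end {
                return snapshot; // no write overlapped the read
            }
        }
    }
}

fn main() {
    let mut lock = SeqLock::new();
    lock.write([1, 2]);
    assert_eq!(lock.read(), [1, 2]);
}
```

The key property is that a reader can never return a torn snapshot: any overlap with a write changes the sequence number, which forces a retry.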

If the constraints on RDMA usage are relaxed so that both the remote and local nodes can read and write the shared memory, more sophisticated algorithms or lock-free data structures are needed, such as copy-on-write and read-copy-update (RCU). Copy-on-write and RCU are used heavily in the kernel for efficient memory management. There are many technical difficulties in this area of work.

The follow-up work

Next, after completing the safe encapsulation of RDMA, we plan to implement asynchronous calls to the RDMA interface functions in Rust. Because RDMA operations are all I/O operations, they are ideally suited to an asynchronous implementation.

The main work in making the RDMA interface functions asynchronous is handling messages on RDMA’s completion queue. RDMA uses several work queues, including the receive queue (RQ), the send queue (SQ), and the completion queue (CQ), which are generally implemented in RDMA hardware. The roles of the send and receive queues are easy to understand: as the names suggest, they hold messages to be sent and to be received. Each message points to a region of memory; for a send, the region contains the data to be sent, and for a receive, the region is used to store the incoming data. After a send or receive finishes, RDMA places a completion message in the completion queue to indicate whether the corresponding send or receive succeeded. A user-space RDMA program can either poll the completion queue for completion messages, or be notified by the kernel via an interrupt when completions arrive.
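As a rough model of this post-then-poll flow, the sketch below uses a std::sync::mpsc channel as a stand-in for the hardware completion queue; the `Completion` and `WcStatus` types are illustrative, not the real ibv_wc definitions, and in real code the polling loop would call ibv_poll_cq or block on the completion channel:

```rust
use std::sync::mpsc;

// Illustrative types only; not the real rdma-core ibv_wc definitions.
#[derive(Debug, PartialEq)]
enum WcStatus {
    Success,
    Error,
}

struct Completion {
    wr_id: u64,      // identifies which posted work request completed
    status: WcStatus,
}

fn main() {
    // A channel stands in for the hardware-managed completion queue:
    // the "hardware" pushes completions, the application drains them.
    let (cq_tx, cq_rx) = mpsc::channel();

    // The application posts a work request with id 7; some time later the
    // "hardware" reports its completion on the CQ.
    cq_tx.send(Completion { wr_id: 7, status: WcStatus::Success }).unwrap();

    // Poll the completion queue and match the completion back to the
    // posted request by wr_id.
    let wc = cq_rx.recv().unwrap();
    assert_eq!(wc.wr_id, 7);
    assert_eq!(wc.status, WcStatus::Success);
}
```

The important idea the model captures is that sends/receives complete asynchronously, and the application learns the outcome only by consuming completion entries keyed by the work request id.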

In essence, asynchronous I/O on Linux relies on the epoll mechanism: the kernel notifies the user program when I/O is ready. The same applies to the asynchronous processing of RDMA operations. RDMA creates device files through which user-space RDMA programs interact with the RDMA modules in the kernel. After the RDMA device and driver are installed, RDMA creates one or more character device files, /dev/infiniband/uverbsN, with N starting at 0 and one uverbsN device file per RDMA device; if there is only one device, the file is /dev/infiniband/uverbs0. To process completion-queue messages asynchronously, the user-space RDMA program uses the epoll mechanism provided by Linux to watch RDMA’s uverbsN device file, and when a new message arrives in the completion queue, the program is notified and handles it.

Our work on RDMA encapsulation is not yet finished. We plan to implement both the safe encapsulation of RDMA and shared memory management for RDMA, so that programming RDMA in Rust becomes easy, and we welcome anyone interested to join us.

Contents: Rust Chinese Collection (Rust_Magazine), March issue