Datenlord | RDMA Asynchronous Programming in Rust (Part 1): Implementing Asynchronous RDMA Operations Based on epoll

Author: Wang Pu / Editor: Zhang Handong

RDMA is a high-performance network protocol stack, used mostly in high-performance computing and high-performance storage. The RDMA library is implemented in C, and the lack of a Rust binding makes it difficult for Rust developers to use. We are therefore building an RDMA Rust binding that follows Rust conventions and is easy for Rust developers to adopt. Asynchronous programming in particular has become a popular programming style in recent years: implementing IO operations with Rust asynchronous programming avoids process context switches in the operating system and improves performance, and Rust's asynchronous frameworks are steadily maturing. This series of articles explores how to implement RDMA operations asynchronously, first based on Linux's epoll mechanism, and then using Rust asynchronous programming.

Introduction to RDMA operations

RDMA's programming model is message-based: network transmission is realized by exchanging messages, and queues are used to manage the messages to be sent and received. The network-transmission operations in RDMA are essentially all queue operations. For example, a message to be sent is placed into the send queue; after the message has been sent, a send-completion message is placed into the completion queue so the user program can check the sending status. Similarly, when a message arrives in the receive queue, a receive-completion message is also placed into the completion queue, so the user program can discover that there is a new message to process.

As the description above shows, RDMA queues come in several types: the Send Queue (SQ), the Receive Queue (RQ), and the Completion Queue (CQ). An SQ or RQ is called a Work Queue (WQ), and an SQ together with an RQ forms a Queue Pair (QP). In addition, RDMA provides two interfaces, ibv_post_send and ibv_post_recv, which user programs call to send and receive messages, respectively:

  • The user program calls ibv_post_send to insert a Send Request (SR) into the SQ as a new Send Queue Element (SQE);
  • The user program calls ibv_post_recv to insert a Receive Request (RR) into the RQ as a new Receive Queue Element (RQE).

SQEs and RQEs are also referred to as Work Queue Elements (WQEs).
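As a rough illustration (not real RDMA code), the queue model above can be sketched with plain Rust data structures. All type, field, and method names below are made up for illustration; real RDMA queues live in hardware and are driven through the verbs API:

```rust
use std::collections::VecDeque;

// Hypothetical stand-ins for RDMA queue elements.
#[derive(Debug, Clone)]
struct WorkQueueElement {
    msg: String,
}

#[derive(Debug, Clone)]
struct CompletionQueueElement {
    msg: String,
}

// A Queue Pair holds a Send Queue (SQ) and a Receive Queue (RQ);
// completions for both are reported through a Completion Queue (CQ).
struct QueuePair {
    sq: VecDeque<WorkQueueElement>,
    rq: VecDeque<WorkQueueElement>,
    cq: VecDeque<CompletionQueueElement>,
}

impl QueuePair {
    fn new() -> Self {
        QueuePair {
            sq: VecDeque::new(),
            rq: VecDeque::new(),
            cq: VecDeque::new(),
        }
    }

    // Analogous to ibv_post_send: enqueue an SQE on the SQ.
    fn post_send(&mut self, msg: &str) {
        self.sq.push_back(WorkQueueElement { msg: msg.to_string() });
    }

    // Pretend the NIC finished sending: the SQE is consumed and a CQE
    // is placed on the CQ for the user program to query.
    fn complete_one_send(&mut self) {
        if let Some(sqe) = self.sq.pop_front() {
            self.cq.push_back(CompletionQueueElement { msg: sqe.msg });
        }
    }

    // Analogous to ibv_poll_cq reading at most one CQE.
    fn poll_cq(&mut self) -> Option<CompletionQueueElement> {
        self.cq.pop_front()
    }
}

fn main() {
    let mut qp = QueuePair::new();
    qp.post_send("hello");
    // Nothing completed yet, so the CQ is still empty.
    assert!(qp.poll_cq().is_none());
    qp.complete_one_send();
    let cqe = qp.poll_cq().expect("a CQE after completion");
    println!("completed: {}", cqe.msg);
}
```

The point of the sketch is only the data flow: work requests enter the SQ/RQ, and the user program learns of their completion exclusively through CQEs in the CQ.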

When a message in the SQ has been sent, or a new message has been received in the RQ, RDMA notifies the user program by placing a new Completion Queue Element (CQE) into the CQ. A user program has two synchronous ways to query the CQ:

  • The user program calls ibv_poll_cq to poll the CQ, learning promptly when a new CQE arrives, but polling consumes CPU resources;
  • When creating the CQ, the user program specifies a completion event channel, ibv_comp_channel, and then calls ibv_get_cq_event to wait for the channel to notify it of a new CQE; if no new CQE is available, ibv_get_cq_event blocks. This saves CPU resources compared with polling, but blocking degrades program performance.

One thing to note about RDMA's CQEs: for Send and Receive operations, the sender receives a CQE after the send completes and the receiver receives a CQE after the receive completes. For one-sided Read and Write operations, such as node A reading data from node B, or node A writing data to node B, only the initiator, node A, receives a CQE after the operation completes; node B does not perceive the Read or Write initiated by node A at all, and receives no CQE.

Introduction to the Linux epoll Asynchronous Mechanism

Linux's epoll is an asynchronous programming mechanism designed for scenarios with a large number of IO requests. It checks which IO operations are ready, so that the user program does not block on unready IO operations and only processes the ready ones. epoll is more powerful than the two earlier mechanisms, select and poll, and is especially suited to scenarios with many concurrent IO operations. RDMA is such a scenario: each RDMA node maintains many queues at the same time to transfer large amounts of data. epoll can then be used to watch every CQ, learning of RDMA message sends and receives in a timely manner while avoiding the drawbacks of synchronous CQ queries, which either consume a lot of CPU resources or block the user program.

Linux's epoll mechanism provides three APIs:

  • epoll_create creates an epoll instance and returns a handle to it;
  • epoll_ctl adds, modifies, or removes IO-operation events to be checked by an epoll instance;
  • epoll_wait checks whether each IO operation registered with the epoll instance via epoll_ctl is ready, i.e. whether its expected event has occurred.

The concrete use of the three epoll interfaces is shown with code examples later. This section describes epoll's IO event checking rules. As shown in the figure below, epoll has two trigger modes: Edge Trigger (ET) and Level Trigger (LT). Edge triggering and level triggering originate in signal processing: edge triggering fires an event when the signal changes, for example from 0 to 1 or from 1 to 0; level triggering fires events as long as the signal stays in a particular state, for example firing continuously at a high level and never at a low level.

Applied to epoll, level triggering means the user program is notified for as long as an IO operation remains in a particular state. For example, when a socket has data to read, the user program calls epoll_wait to check whether the socket has received data; if the program has not yet read the data the socket received last time, every subsequent epoll_wait call will again report that the socket has data to read. That is, as long as the socket is readable, the user program keeps being notified. Edge triggering means epoll notifies only once after a specific event of an IO operation occurs. For example, if the socket receives data and epoll_wait reports it as readable, the next epoll_wait call will not report the socket as readable again, whether or not the data has been read, until the socket receives new data. That is, the user program is notified only when the socket receives new data, regardless of whether the socket currently still has unread data.
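The difference between the two trigger modes can be modeled in a few lines of plain Rust. This is only a simulation of the notification rules described above, not a real epoll call; the function name and the state encoding are invented for illustration:

```rust
// Simulate which polls produce a notification for a file descriptor whose
// "readable" state over successive polls is given by `ready`:
// - Level trigger (LT): notify on every poll while the fd is readable.
// - Edge trigger (ET): notify only on a transition from not-readable to readable.
fn notifications(ready: &[bool], edge_triggered: bool) -> Vec<bool> {
    let mut prev = false;
    ready
        .iter()
        .map(|&now| {
            let notify = if edge_triggered { now && !prev } else { now };
            prev = now;
            notify
        })
        .collect()
}

fn main() {
    // The fd becomes readable at poll 1 and stays readable
    // (the data is never read by the user program).
    let states = [false, true, true, true];

    // LT keeps notifying while unread data remains...
    assert_eq!(notifications(&states, false), vec![false, true, true, true]);
    // ...while ET notifies exactly once, on the 0 -> 1 transition.
    assert_eq!(notifications(&states, true), vec![false, true, false, false]);

    println!("LT: {:?}", notifications(&states, false));
    println!("ET: {:?}", notifications(&states, true));
}
```

This is why, as discussed later, an edge-triggered consumer must fully drain the data on each notification: no further notification will come until a fresh state change.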

Synchronous and Asynchronous Ways to Read CQEs from the RDMA Completion Queue

This section uses reading CQEs from an RDMA CQ as an example to show how to implement asynchronous operations based on epoll. It first presents RDMA's polling and blocking approaches to reading the CQ, then introduces an epoll-based asynchronous approach. The code below is illustrative and does not compile.

Reading CQEs in RDMA Polling Mode

Polling the CQ in RDMA is very simple: keep calling ibv_poll_cq to read CQEs from the CQ. This is the fastest way to obtain new CQEs, since the user program polls the CQ directly without kernel involvement, but the drawback is obvious: polling consumes a lot of CPU resources.

loop {
    // Try to read one CQE
    poll_result = ibv_poll_cq(cq, 1, &mut cqe);
    if poll_result != 0 {
        // Handle the CQE
        ...
    }
}

Reading CQEs via the Completion Event Channel

Reading CQEs via the completion event channel works as follows:

  • The user program calls ibv_create_comp_channel to create the completion event channel;
  • It then calls ibv_create_cq to create the CQ, associating it with the completion event channel;
  • It then calls ibv_req_notify_cq to tell the CQ to notify the user program through the completion event channel when a new CQE is generated;
  • It then calls ibv_get_cq_event to query the completion event channel, blocking when there is no new CQE and returning when there is one;
  • After returning from ibv_get_cq_event, the user program calls ibv_poll_cq to read the new CQE from the CQ; ibv_poll_cq needs to be called only once, with no polling required.

The following is example code for reading CQEs via the completion event channel:

// Create the completion event channel
let completion_event_channel = ibv_create_comp_channel(...);
// Create a CQ associated with the completion event channel
let cq = ibv_create_cq(completion_event_channel, ...);
loop {
    // Ask the CQ to notify through the completion event channel
    // when the next new CQE is generated
    ibv_req_notify_cq(cq, ...);
    // Block on the completion event channel until a new CQE arrives
    ibv_get_cq_event(completion_event_channel, &mut cq, ...);
    // Read the new CQE from the CQ
    poll_result = ibv_poll_cq(cq, 1, &mut cqe);
    if poll_result != 0 {
        // Handle the CQE
        ...
        // Acknowledge one CQE
        ibv_ack_cq_events(cq, 1);
    }
}

Reading CQEs via the completion event channel essentially means RDMA notifies the user program through the kernel that the CQ has a new CQE. The completion event channel uses a device file, /dev/infiniband/uverbs0 (with multiple RDMA NICs, each NIC has its own device file, numbered from 0 upward), through which the kernel notifies the user program of events. The user program's call to ibv_create_comp_channel to create the completion event channel opens this device file, and its call to ibv_get_cq_event to query the channel reads the opened device file. However, the device file is used only for event notification, telling the user program that there is a new CQE to read; the CQE itself cannot be read from the device file. The user program must still call ibv_poll_cq to read the CQE from the CQ.

Reading CQEs via the completion event channel saves CPU resources compared with polling, but the call to ibv_get_cq_event blocks on the channel, which hurts user program performance.

Reading CQEs Asynchronously Based on epoll

In essence, the user program opens the /dev/infiniband/uverbs0 device file through the completion event channel and reads from it the kernel's notifications about new CQEs. As the definition of the completion event channel, ibv_comp_channel, shows, it contains a Linux file descriptor pointing to the opened device file:

pub struct ibv_comp_channel {
    ...
    pub fd: RawFd,
    ...
}

Therefore, the epoll mechanism can be used to check whether the device file has produced new events, so that the user program does not block when calling ibv_get_cq_event to read the completion event channel (that is, to read the device file).

First, use fcntl to switch the device file descriptor in the completion event channel to non-blocking IO mode:

// Create the completion event channel
let completion_event_channel = ibv_create_comp_channel(...);
// Create a CQ associated with the completion event channel
let cq = ibv_create_cq(completion_event_channel, ...);
// Read the current flags of the device file descriptor
let flags = libc::fcntl((*completion_event_channel).fd, libc::F_GETFL);
// Add the non-blocking IO flag to the device file descriptor
libc::fcntl(
    (*completion_event_channel).fd,
    libc::F_SETFL,
    flags | libc::O_NONBLOCK,
);

Next, create the epoll instance and register the event to be checked with the epoll instance:

use nix::sys::epoll;

// Create an epoll instance
let epoll_fd = epoll::epoll_create()?;
// The device file descriptor inside the completion event channel
let channel_dev_fd = (*completion_event_channel).fd;
// Create an epoll event instance associated with the device file descriptor;
// when the device file has new data to read, notify the user program
// in edge-triggered fashion
let mut epoll_ev = epoll::EpollEvent::new(
    epoll::EpollFlags::EPOLLIN | epoll::EpollFlags::EPOLLET,
    channel_dev_fd as u64,
);
// Register the epoll event instance with the epoll instance created above
epoll::epoll_ctl(
    epoll_fd,
    epoll::EpollOp::EpollCtlAdd,
    channel_dev_fd,
    &mut epoll_ev,
)?;

Two points about the code above deserve attention:

  • EPOLLIN means checking whether the device file has new data/events to read;
  • EPOLLET means epoll uses edge-triggered notification.

Next, call epoll_wait to check whether the device file has new event notifications, and then call ibv_poll_cq to read the new CQEs:

let timeout_ms = 10;
// Create the event list for epoll_wait to check
let mut event_list = [epoll_ev];
loop {
    // Ask the CQ to notify through the completion event channel
    // when the next new CQE is generated
    ibv_req_notify_cq(cq, ...);
    // Check with a timeout whether the device file has new events
    let nfds = epoll::epoll_wait(epoll_fd, &mut event_list, timeout_ms)?;
    if nfds > 0 {
        // The device file has new events; the completion event channel
        // can now be read without blocking
        ibv_get_cq_event(completion_event_channel, &mut cq, ...);
        loop {
            // Read CQEs until the CQ is empty
            poll_result = ibv_poll_cq(cq, 1, &mut cqe);
            if poll_result != 0 {
                // Handle the CQE
                ...
                // Acknowledge one CQE
                ibv_ack_cq_events(cq, 1);
            } else {
                break;
            }
        }
    }
}

One caveat about this code: because epoll notification here is edge-triggered, every time a new-CQE notification arrives, ibv_poll_cq must be called repeatedly until the CQ is empty. Consider the scenario where several new CQEs are generated at the same time but edge-triggered epoll notifies the user program only once: if the user program does not read the CQ until it is empty after receiving the notification, epoll will produce no further notification until yet another new CQE is generated, and the unread CQEs sit in the CQ unprocessed in the meantime.
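The pitfall can be made concrete with a small simulation in plain Rust (no real epoll or RDMA calls; a VecDeque stands in for the CQ): three CQEs arrive together, edge-triggered epoll delivers a single notification, so the handler must drain the queue rather than read one element per notification.

```rust
use std::collections::VecDeque;

// Drain the mock CQ completely, as the edge-triggered rule requires:
// keep polling until no CQE is left.
fn drain_cq(cq: &mut VecDeque<u32>) -> Vec<u32> {
    let mut handled = Vec::new();
    while let Some(cqe) = cq.pop_front() {
        handled.push(cqe);
    }
    handled
}

fn main() {
    // Three CQEs were generated at (almost) the same time, but
    // edge-triggered epoll delivered only ONE notification for them.
    let mut cq: VecDeque<u32> = VecDeque::from(vec![1, 2, 3]);

    // Wrong handling: read just one CQE per notification. CQEs 2 and 3
    // are stranded, because no further notification will arrive until
    // some future CQE is generated.
    let first = cq.pop_front();
    assert_eq!(first, Some(1));
    assert_eq!(cq.len(), 2); // leftover, unprocessed CQEs

    // Correct handling: drain the CQ on the single notification.
    let handled = drain_cq(&mut cq);
    assert_eq!(handled, vec![2, 3]);
    assert!(cq.is_empty());
}
```

The same drain-until-empty discipline applies to any edge-triggered epoll consumer, not just RDMA completion channels.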

In conclusion, this article used an epoll-based example of reading RDMA CQEs asynchronously to show how to implement asynchronous RDMA operations. Similar RDMA operations can likewise be implemented asynchronously on top of the epoll mechanism.

For those interested in Rust and RDMA, check out our open source project github.com/datenlord/a…