Summary: If you do not want to retrofit an existing application for RDMA, is there a technology that lets it benefit from RDMA without any retrofitting?

Dragon Lizard Community High Performance Network SIG

Introduction

Shared Memory Communication over RDMA (SMC-R) is a kernel network protocol that is based on RDMA technology and compatible with the socket interface. It was proposed by IBM and contributed to the Linux kernel in 2017. SMC-R helps TCP network applications transparently use RDMA to obtain high-bandwidth, low-latency network communication services. Alibaba Cloud Linux 3 and Anolis 8, the Dragon Lizard community's open-source operating system, together with Alibaba Cloud Elastic RDMA (eRDMA), bring SMC-R to the cloud for the first time, helping cloud applications obtain better network performance: on the fourth-generation Shenlong architecture released by Alibaba Cloud, SMC-R improves network performance by 20%.

Given the wide use of RDMA technology in data centers, the Dragon Lizard Community High Performance Network SIG believes that SMC-R will become one of the important directions of the next-generation data center kernel protocol stack. To that end, we have made many optimizations and actively contributed them back to the upstream Linux community. **The Dragon Lizard Community High Performance Network SIG is currently the largest SMC-R code-contributing community after IBM.** Since there is very little information about SMC-R in Chinese, we hope this series of articles lets more domestic readers learn about and try SMC-R. We also welcome interested readers to join the Dragon Lizard Community High Performance Network SIG to exchange ideas (see the QR code at the end of the article). This article, the first in the series, takes the reader through SMC-R from a macro perspective.

1. Starting with RDMA

The name Shared Memory Communication over RDMA reveals a defining feature of the SMC-R protocol: it is based on RDMA. Therefore, before introducing SMC-R, let us first look at the dominant technology in the high-performance network field: Remote Direct Memory Access (RDMA).

1.1 Why RDMA?

With the rapid development of data centers, distributed systems, and high-performance computing, the performance of network devices has improved significantly. The bandwidth of mainstream physical networks has reached 25-100 Gb/s, and network latency has entered the ten-microsecond era. However, as network devices get faster, a problem emerges: the mismatch between network performance and CPU computing power. In traditional networks, the CPU is responsible for packet encapsulation and parsing and for moving data between user mode and kernel mode, and it increasingly cannot keep up with the rapid growth of network bandwidth.

Take the sending and receiving process of a TCP/IP network as an example. The CPU of the sending node copies data from user-mode memory to kernel-mode memory and encapsulates it into packets in the kernel protocol stack; the DMA controller then carries the encapsulated packets to the NIC, which sends them to the peer. The NIC at the receiving end obtains the packets and carries them into kernel-mode memory through the DMA controller; the kernel protocol stack parses them, stripping the frame and packet headers layer by layer, and finally the CPU copies the payload into user-mode memory to complete one data transfer.

(Figure/Traditional TCP/IP network transport model)

In this process, the CPU is responsible for:

1) Data copy between user mode and kernel mode.

2) Encapsulation and parsing of network packets.

These tasks are “repetitive and low-level”, yet occupy a large amount of CPU resources (for example, it takes multiple CPU cores to drive a 100 Gb/s NIC at full bandwidth), which prevents the CPU from putting its computing power to better use in data-intensive scenarios.
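For concreteness, here is a minimal sketch of the classic send path using standard POSIX sockets (the helper below is illustrative, not from any library): every send() call pays the user-to-kernel copy and the protocol processing described above.

#include <sys/types.h>
#include <sys/socket.h>

/* Send all len bytes of buf over a connected TCP socket fd. */
ssize_t send_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        /* each call copies user data into kernel socket memory,
         * where the CPU then runs TCP/IP encapsulation */
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    return (ssize_t)off;
}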

Therefore, solving the mismatch between network performance and CPU computing power has become the key to the development of high-performance networks. Considering that Moore's law is gradually failing and CPU performance will grow only slowly in the short term, offloading network data processing from the CPU to hardware devices has become the mainstream solution. As a result, RDMA, which used to be confined to specific high-performance niches, is increasingly used in general-purpose scenarios.

1.2 Advantages of RDMA

Remote Direct Memory Access (RDMA) is a technology that lets one host access another host's memory directly, and after more than 20 years of development it has become an important building block of high-performance networks. So how does RDMA perform a data transfer?

(Figure/User-mode RDMA network transmission model)

In an RDMA network (user mode), the RDMA-capable NIC (RNIC) fetches data directly from user-mode memory at the sending end, encapsulates it inside the NIC, and transmits it to the receiving end; the RNIC at the receiving end parses and strips the headers and places the payload directly into user-mode memory. In this process the CPU does not participate in data transfer apart from the necessary control plane functions: the data is written directly into the memory of the remote node by the RNIC. Thus, compared with traditional networks, RDMA frees the CPU from network transfers and makes a network transfer as convenient and fast as directly accessing remote memory.

(Figure/Comparison between traditional network and RDMA network protocol stack)

Compared with traditional network protocols, RDMA network protocols have the following three characteristics:

1. Software stack bypass

The RDMA network relies on the RNIC to complete packet encapsulation and parsing inside the NIC, bypassing the software protocol stack related to network transmission. For user-mode applications, the RDMA data path bypasses the entire kernel; for kernel applications, it bypasses part of the protocol stack in the kernel. By bypassing the software stack and offloading data processing to a hardware device, RDMA effectively reduces network latency.

2. CPU offloading

In an RDMA network, the CPU is responsible only for the control plane. On the data path, the payload is copied between the application buffer and the NIC buffer by the DMA module of the RNIC (provided the application buffer has been registered in advance and the NIC has been granted access), so the CPU no longer has to move the data itself, reducing CPU usage during network transmission.
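The "registered in advance" precondition can be made concrete with a small user-mode ibverbs sketch (pd is assumed to be an already-allocated protection domain; the helper name is ours): registration pins the buffer and hands the RNIC the keys it needs for DMA.

#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register an application buffer so the RNIC's DMA engine may
 * access it directly; the returned mr carries the lkey/rkey
 * used later in work requests. */
struct ibv_mr *register_app_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);

    if (!buf)
        return NULL;
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE);
}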

3. Direct memory access

In an RDMA network, once the RNIC has been granted access to remote memory, it can directly write to or read from that memory without the participation of the remote node, which is well suited to bulk data transfer.
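As an illustration of such one-sided access, the following sketch posts a single RDMA WRITE through the ibverbs interface. It assumes an established queue pair qp, a registered local buffer mr, and a remote_addr/rkey pair obtained out of band; the remote CPU is not involved at all.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one RDMA WRITE of len bytes from mr into remote memory. */
int rdma_write_once(struct ibv_qp *qp, struct ibv_mr *mr,
                    uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* completion on local CQ */
    wr.wr.rdma.remote_addr = remote_addr;        /* granted by the peer */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}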

2. Back to SMC-R

Through the above introduction, readers should now have a preliminary understanding of the main features and performance advantages of RDMA. However, although RDMA technology can deliver promising network performance improvements, it is difficult for existing TCP applications to use RDMA transparently, because using an RDMA network relies on a new set of semantic interfaces, including the ibverbs interface and the rdmacm interface.

(Figure/Some ibverbs and rdmacm interfaces [1])

Compared with the traditional POSIX socket interface, the verbs interfaces are more numerous and closer to hardware semantics. For existing TCP applications built on the POSIX socket interface, enjoying the performance bonus of RDMA would require substantial modification of the application, at enormous cost.

Therefore, we want to keep the socket interface while using the RDMA network underneath, so that existing socket applications can enjoy RDMA services transparently.

In response to this demand, the industry proposed the following two solutions:

**The first is the user-mode scheme based on libvma.** libvma uses LD_PRELOAD to redirect all of an application's socket calls into a custom implementation, which sends and receives data through the verbs interface. However, because libvma is implemented in user mode, it lacks unified kernel resource management on the one hand, and on the other hand its compatibility with the socket interface is incomplete.

**The second is the kernel-mode scheme based on SMC-R.** As a kernel-mode protocol stack, SMC-R offers far better compatibility with TCP applications than the user-mode scheme. This 100% compatibility means a very low cost of adoption and reuse. In addition, running in the kernel allows the RDMA resources in the SMC-R stack to be shared by different user-mode processes, improving resource utilization and reducing the overhead of frequently creating and releasing resources. However, full socket compatibility means giving up some extreme RDMA performance (a user-mode RDMA program can bypass the kernel and achieve zero copy, whereas SMC-R cannot do zero copy while remaining socket compatible) in exchange for compatibility, ease of use, and a transparent performance improvement over the TCP stack. In the future, we plan to extend the interface and bring zero-copy features to SMC-R, further improving its performance at a small cost in compatibility.

2.1 Transparently replacing TCP

SMC-R is an open sockets over RDMA protocol that provides transparent exploitation of RDMA (for TCP based applications) while preserving key functions and qualities of service from the TCP/IP ecosystem that enterprise level servers/network depend on!

From:

www.openfabrics.org/images/even…

SMC-R is a kernel protocol stack parallel to TCP/IP: it is compatible with the socket interface above and uses RDMA to perform shared memory communication below. It is designed to provide a transparent RDMA service to TCP applications while retaining key functions of the TCP/IP ecosystem. To this end, SMC-R defines a new network protocol family, AF_SMC, in the kernel, whose proto_ops behaves exactly like TCP.

/* must look like tcp */
static const struct proto_ops smc_sock_ops = {
  .family    = PF_SMC,
  .owner    = THIS_MODULE,
  .release  = smc_release,
  .bind    = smc_bind,
  .connect  = smc_connect,
  .socketpair  = sock_no_socketpair,
  .accept    = smc_accept,
  .getname  = smc_getname,
  .poll    = smc_poll,
  .ioctl    = smc_ioctl,
  .listen    = smc_listen,
  .shutdown  = smc_shutdown,
  .setsockopt  = smc_setsockopt,
  .getsockopt  = smc_getsockopt,
  .sendmsg  = smc_sendmsg,
  .recvmsg  = smc_recvmsg,
  .mmap    = sock_no_mmap,
  .sendpage  = smc_sendpage,
  .splice_read  = smc_splice_read,
};

Because SMC-R provides socket interfaces whose behavior matches TCP, using SMC-R is very simple. In general, there are two methods:

(Figure /SMC-R usage)

**First, develop directly against the SMC-R protocol family AF_SMC.** By creating sockets of type AF_SMC, the application's traffic enters the SMC-R protocol stack; a minimal sketch follows.
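A sketch of this first method (AF_SMC is 43 in the kernel's linux/socket.h; older libc headers may not define it yet, hence the fallback below, and the helper name is ours):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43 /* from linux/socket.h */
#endif

/* Connect to ip:port over SMC-R instead of plain TCP. */
int connect_over_smc(const char *ip, unsigned short port)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(port),
    };
    /* the only change: AF_SMC instead of AF_INET */
    int fd = socket(AF_SMC, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    inet_pton(AF_INET, ip, &addr.sin_addr);
    /* from here on the socket behaves exactly like TCP */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    return fd;
}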

**Second, transparently replace the protocol stack.** Transparently replace the TCP sockets created by the application with SMC sockets. Transparent replacement can be implemented in two ways:

  • Use LD_PRELOAD to replace the protocol stack transparently. Preload a dynamic library when running the TCP application; the library implements a custom socket() function that converts the AF_INET sockets created by the application into AF_SMC sockets and then calls the standard socket-creation process, steering the TCP application's traffic into the SMC-R protocol stack, as in the snippet below.

    int socket(int domain, int type, int protocol)
    {
        int rc;

        if (!dl_handle)
            initialize();

        /* check if socket is eligible for AF_SMC */
        if ((domain == AF_INET || domain == AF_INET6) &&
            /* see kernel code, include/linux/net.h, SOCK_TYPE_MASK */
            (type & 0xf) == SOCK_STREAM &&
            (protocol == IPPROTO_IP || protocol == IPPROTO_TCP)) {
            dbg_msg(stderr, "libsmc-preload: map sock to AF_SMC\n");
            if (domain == AF_INET)
                protocol = SMCPROTO_SMC;
            else /* AF_INET6 */
                protocol = SMCPROTO_SMC6;

            domain = AF_SMC;
        }

        rc = (*orig_socket)(domain, type, protocol);

        return rc;
    }

The smc_run command in the open-source user-mode tool suite smc-tools implements the above function [2]; for example, smc_run ./app runs ./app with the preload library so that its TCP sockets become SMC sockets.

  • Use TCP ULP + eBPF to replace the protocol stack transparently. SMC-R support for TCP ULP is a new feature contributed to the upstream Linux community by the Dragon Lizard Community High Performance Network SIG. Users can use setsockopt() to convert a newly created TCP socket into an SMC socket. Meanwhile, to avoid modifying the application, users can use eBPF to inject the setsockopt() call at appropriate hook points (such as BPF_CGROUP_INET_SOCK_CREATE, BPF_CGROUP_INET4_BIND, BPF_CGROUP_INET6_BIND) to achieve transparent replacement, as in the two snippets below. This mode is better suited to container scenarios, where protocols can be converted in batches according to user-defined rules.

    static int smc_ulp_init(struct sock *sk)
    {
        struct socket *tcp = sk->sk_socket;
        struct net *net = sock_net(sk);
        struct socket *smcsock;
        int protocol, ret;

        /* only TCP can be replaced */
        if (tcp->type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP ||
            (sk->sk_family != AF_INET && sk->sk_family != AF_INET6))
            return -ESOCKTNOSUPPORT;
        /* don't handle wq now */
        if (tcp->state != SS_UNCONNECTED || !tcp->file || tcp->wq.fasync_list)
            return -ENOTCONN;

        if (sk->sk_family == AF_INET)
            protocol = SMCPROTO_SMC;
        else
            protocol = SMCPROTO_SMC6;

        smcsock = sock_alloc();
        if (!smcsock)
            return -ENFILE;
        <...>
    }

    SEC("cgroup/connect4")
    int replace_to_smc(struct bpf_sock_addr *addr)
    {
        int pid = bpf_get_current_pid_tgid() >> 32;
        long ret;

        /* user-defined rules/filters, such as pid, tcp src/dst address, etc... */
        if (pid != DESIRED_PID)
            return 0;
        <...>
        ret = bpf_setsockopt(addr, SOL_TCP, TCP_ULP, "smc", sizeof("smc"));
        if (ret) {
            bpf_printk("replace TCP with SMC error: %ld\n", ret);
            return 0;
        }
        return 0;
    }
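For reference, the conversion that the eBPF program injects can also be performed directly from user space on a freshly created, not yet connected TCP socket; a minimal sketch (assuming a kernel with the SMC ULP support described above):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_ULP
#define TCP_ULP 31 /* from linux/tcp.h */
#endif

/* Create a TCP socket and switch it to the SMC protocol stack. */
int make_smc_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    if (fd < 0)
        return -1;
    /* must be done before connect()/listen(); on kernels without
     * the SMC ULP this fails and the socket stays plain TCP */
    setsockopt(fd, SOL_TCP, TCP_ULP, "smc", sizeof("smc"));
    return fd;
}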

In summary, TCP applications can transparently use RDMA services through either of the two methods above.

2.2 SMC-R architecture

(Figure /SMC-R architecture)

The SMC-R protocol stack sits below the socket layer and above the RDMA kernel verbs layer, and is a kernel network protocol stack with a “hybrid” character. Here, “hybrid” mainly refers to the mixture of RDMA flows and TCP flows in the SMC-R protocol stack:

Data traffic is transmitted over RDMA networks

SMC-R uses the RDMA network to transfer the data of user applications, letting applications transparently enjoy the performance benefits of RDMA, as shown in yellow in the figure above. On the sending side, application data travels from the application buffer through the socket interface into kernel memory; it is then written directly across the RDMA network into a kernel ring buffer of the remote node, the Remote Memory Buffer (RMB); finally, the SMC-R protocol stack of the remote node copies the data from the RMB into the receiving application's buffer.

(Figure /SMC-R shared memory communication)

Evidently, the shared memory communication in SMC-R's name refers to communication based on the remote node's RMB. Compared with traditional local shared memory communication, SMC-R extends the two communicating parties to two separate nodes and uses RDMA to communicate through “remote” shared memory.

(Figure/Mainstream RDMA implementation)

Currently, there are three mainstream implementations of RDMA networks: InfiniBand, RoCE, and iWARP. Among them, RoCE, a trade-off between high performance and high cost, is compatible with Ethernet while using RDMA, which both ensures good network performance and reduces the cost of network construction, so it is favored by enterprises. The upstream Linux community version of SMC-R accordingly uses RoCE v1 and v2 as its RDMA implementation.

iWARP, in turn, implements RDMA on top of TCP, breaking through the other two implementations' rigid requirement for a lossless network. iWARP scales much better and is well suited to cloud scenarios. Alibaba Cloud Elastic RDMA (eRDMA), based on iWARP, brings RDMA technology to the cloud; SMC-R in Alibaba Cloud Linux 3 and in Anolis 8, the Dragon Lizard community's open-source operating system, further supports eRDMA (iWARP), letting cloud users use the RDMA network transparently, without noticing it.

Relying on TCP flows to establish connections

In addition to the RDMA flow, SMC-R pairs every SMC-R connection with a TCP connection, and the two share the same lifetime. The TCP flow has the following main responsibilities in the SMC-R stack:

1) Dynamic discovery of peer SMC-R capability

Before an SMC-R connection is established, neither end knows whether the peer also supports SMC-R. Therefore, the two ends first establish a TCP connection; during the three-way handshake, each announces its SMC-R capability by sending a SYN packet carrying a special TCP option, and checks for that TCP option in the SYN packet sent by the peer.

(Figure/TCP options representing SMC-R capabilities)
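On the wire, this capability announcement is a TCP experimental option (kind 254) carrying the EBCDIC “SMCR” eyecatcher 0xE2D4C3D9, the constants the kernel names TCPOPT_EXP and TCPOPT_SMC_MAGIC. A parser sketch for illustration (the function is ours, not kernel code):

#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_EXP       254         /* experimental TCP option kind */
#define TCPOPT_SMC_MAGIC 0xE2D4C3D9  /* "SMCR" in EBCDIC */

/* opt points at one TCP option: kind, length, then a 4-byte
 * magic in network byte order for the SMC experimental option. */
static bool tcp_option_is_smc(const uint8_t *opt, size_t remaining)
{
    uint32_t magic;

    if (remaining < 6 || opt[0] != TCPOPT_EXP || opt[1] != 6)
        return false;
    memcpy(&magic, opt + 2, sizeof(magic));
    return ntohl(magic) == TCPOPT_SMC_MAGIC;
}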

2) Fallback

If, in the preceding process, one end cannot support SMC-R, or the SMC-R connection cannot be completed during establishment, the SMC-R protocol stack falls back to the TCP protocol stack. During the fallback, the SMC-R stack replaces the socket behind the file descriptor held by the application with the socket of the TCP connection, and the application's traffic is carried by this TCP connection, ensuring that data transfer is not interrupted.

3) Helping establish the SMC-R connection

If both ends support SMC-R, SMC-R connection-establishment messages are exchanged over the TCP connection (a process similar to the SSL handshake). The TCP connection is also used to exchange the RDMA resource information of the two sides, helping to establish the RDMA link used for data transfer.

Through the above introduction, readers should now have a preliminary impression of the overall architecture of SMC-R. As a “hybrid” solution, SMC-R makes full use of the generality of TCP flows and the high performance of RDMA flows. A later article in this series will analyze a complete SMC-R communication process, giving readers a further appreciation of this “hybrid” characteristic.

This article, the first in the SMC-R series, is intended as a primer. Looking back, we mainly answered the following questions:

1. Why RDMA?

Because RDMA can improve network performance (throughput, latency, CPU usage);

2. Why does RDMA provide performance gains?

By bypassing a large stack of software protocols, it frees the CPU from the network transmission process and makes data transmission as simple as writing directly to remote memory.

3. Why is SMC-R needed?

Because RDMA applications are written against the verbs interface, and modifying existing TCP socket applications to use it is costly;

4. What are SMC-R's advantages?

SMC-R is fully compatible with the socket interface and mimics the behavior of TCP sockets, enabling user-mode TCP applications to use RDMA services transparently and enjoy RDMA's performance benefits without any modification;

5. What are the architectural features of SMC-R?

The SMC-R architecture is “hybrid”, fusing an RDMA flow with a TCP flow: the SMC-R protocol transfers application data over the RDMA network, and uses the TCP flow to confirm the peer's SMC-R capability and to help establish the RDMA link.

References:

[1] : network.nvidia.com/pdf/prod\_s…

[2] : github.com/ibm-s390-li…


This article is original content of Alibaba Cloud and may not be reproduced without permission.