
This post was published by Willko in the Cloud+ Community.

1. The situation and trends of network IO

As users we can feel network speeds improving all the time, and network technology itself has evolved from 1GE to 10GE, 25GE, 40GE, and 100GE. From this we can conclude that single-machine network IO capability must keep up with the times.

1. Traditional telecommunications

Devices at the IP layer and below, such as routers, switches, firewalls, and base stations, all use hardware solutions: some based on dedicated network processors (NP), some on FPGAs, and most on ASICs. But the drawbacks of hardware are obvious: bugs are hard to fix and hard to debug and maintain, and network technology keeps changing, as with the succession of 2G/3G/4G/5G mobile technologies. The challenge facing the traditional field is the urgent need for a high-performance network IO development framework implemented in software.

2. Cloud development

With the emergence of private clouds, sharing hardware through network function virtualization (NFV) has become a trend. NFV is defined as implementing traditional or new network functions on standard servers and switches. This, too, urgently needs a high-performance network IO development framework built on commodity systems and standard servers.

3. Soaring single-machine performance

NICs have grown from 1G to 100G, and CPUs from single-core to multi-core to multi-socket; single-server capacity keeps reaching new highs through scale-up. But software development has not kept pace, and single-machine processing power no longer matches the hardware. How do we build services that move with the times, with high throughput and millions of concurrent connections per machine? Even though some services have low QPS requirements and are mainly CPU-intensive, applications such as big data analysis and artificial intelligence still need to move large volumes of data between distributed servers to complete their tasks. This should be what we, as Internet backend developers, care about most; it is also what is most relevant to us.

2. Linux + x86 network IO bottlenecks

A few years ago I wrote “How Network Cards Work and Tuning for High Concurrency,” describing the process of sending and receiving packets on Linux. As a rule of thumb, on a C1 (8-core) machine, every 10,000 packets processed per second costs about 1% of a CPU in soft interrupts, which puts the single-machine ceiling at about 1 million PPS (packets per second). TGW (the Netfilter version) achieves 1 million PPS, and Ali's optimized LVS reaches 1.5 million PPS, on quite well-configured servers. But to saturate a 10GE NIC with 64-byte packets, we need 20 million PPS (see Bandwidth, Packets Per Second, and Other Network Performance Metrics); that is, each packet may take no more than 50 nanoseconds to process. Yet a Cache Miss, whether TLB, data cache, or instruction cache, costs about 65 nanoseconds to read back from memory, and cross-node access on a NUMA system costs about 40 nanoseconds. So even with no business logic at all, merely receiving and sending packets is this hard. We must control cache hit ratios, understand the computer's architecture, and ensure cross-node communication never happens.
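To make the arithmetic explicit: 10 Gbit/s ÷ (64 bytes × 8 bits per byte) ≈ 19.5 million packets per second, so the budget is 1 s ÷ 20,000,000 = 50 ns per packet. (Counting the Ethernet preamble and inter-frame gap, the actual 64-byte line rate is about 14.88 million PPS; the conclusion is the same.)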

I give these numbers in the hope of conveying a first-hand sense of the scale of the challenge, and of the gap between the ideal and the reality that we must balance. The problems are all of a piece:

1. Traditional packet sending and receiving must communicate through hard interrupts, and each hard interrupt costs about 100 microseconds, not counting the Cache Misses caused by the interrupted context.

2. Data must be copied between kernel mode and user mode, causing heavy CPU consumption and global lock contention.

3. Sending and receiving packets has system call overhead.

4. The kernel works across all cores and must stay globally consistent; even with lock-free techniques, the performance cost of bus locks and memory barriers cannot be avoided.

5. The path from the network adapter to the business process is too long, and some stages, such as the Netfilter framework, are unnecessary; they add cost and invite Cache Misses.

3. Basic principles of DPDK

From the analysis above we know where the bottleneck lies: the IO model, the kernel's own overhead, and the uncontrollable factors in how data flows through it are all implemented in the kernel, and the kernel is the cause of the bottleneck. To solve the problem, bypass the kernel. So the mainstream solutions all bypass the NIC IO path, handling it directly in user mode to remove the kernel bottleneck.

The Linux community also provides Netmap as a bypass mechanism. The official figure is 14 million PPS on a 10G NIC, but Netmap is not widely used. There are several reasons for this:

1. Netmap requires driver support; that is, NIC vendors must buy into the solution.

2. Netmap still relies on interrupt notification, so it does not completely remove the bottleneck.

3. Netmap is more like a handful of system calls for sending and receiving packets directly in user mode. The functionality is too primitive, there is no network development framework built on top of it, and the community is not mature.

Now let's look at DPDK, which has been developed for more than ten years, from Intel-led beginnings to the participation of major players such as Huawei, Cisco, and AWS; the core players are all in this circle, with a complete community and a closed ecosystem loop. Early applications were mainly at layer 3 and below in the traditional telecom field, at companies such as Huawei, China Telecom, and China Mobile, with switches, routers, and gateways as the main scenarios. But with the demands of upper-layer business and the maturing of DPDK, higher-level applications are gradually emerging.

DPDK bypass principle:

On the left is the original path: data flows from NIC -> driver -> protocol stack -> Socket interface -> business.

On the right is the DPDK path, which bypasses the kernel based on UIO (Userspace I/O): data flows from NIC -> DPDK polling mode -> DPDK base libraries -> business.

The advantages of user mode are easier development and maintenance and good flexibility. In addition, a crash does not affect the kernel, so robustness is strong.

CPU architectures supported by DPDK: x86, ARM, PowerPC (PPC)

List of network cards supported by DPDK: core.dpdk.org/supported/. We mainly use the Intel 82599 (optical ports) and Intel X540 (electrical ports).

4. UIO, the cornerstone of DPDK

To get drivers running in user mode, Linux provides the UIO mechanism. With UIO, interrupts can be sensed through read() and communication with the NIC realized through mmap().

UIO principle:

There are several steps to developing a user-mode driver:

1. Develop the small UIO module that runs in the kernel, because hard interrupts can only be handled in the kernel

2. Read /dev/uioX to sense interrupts

3. Share memory with the device through mmap()
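As a minimal sketch of the user-mode side, assuming the NIC has already been bound to a UIO driver and shows up as /dev/uio0 (the mapping size and register access below are illustrative):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Assumed device node, created after binding the NIC to a UIO driver. */
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the device's first memory region (e.g. BAR0 registers) into
     * user space; 0x1000 is an illustrative size. */
    void *bar0 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *regs = bar0;

    /* read() blocks until the next hardware interrupt: the kernel-side
     * UIO module handles the hard IRQ and wakes us with a counter. */
    uint32_t irq_count;
    while (read(fd, &irq_count, sizeof(irq_count)) == (ssize_t)sizeof(irq_count)) {
        printf("interrupt #%u, reg0=0x%08x\n", irq_count, regs[0]);
        /* Some UIO drivers also require a write() here to re-enable
         * the interrupt before the next read(). */
    }
    return 0;
}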

5. DPDK's core optimization: PMD

DPDK's UIO driver shields hardware interrupts and instead uses active polling in user mode; this mode is called Poll Mode Driver (PMD).

With UIO bypassing the kernel and active polling removing hard interrupts, DPDK can send and receive packets entirely in user mode, which gives zero copy and no system calls. Working synchronously also reduces the Cache Misses caused by context switches.

A core running PMD sits in user mode at 100% CPU.
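What such a core actually executes is essentially a tight loop over the receive queue. A minimal sketch using DPDK's burst API (the port/queue IDs and the per-packet handler are illustrative):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

void handle_packet(struct rte_mbuf *m);  /* hypothetical application handler */

static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {  /* no interrupts, no system calls: the core spins at 100% */
        uint16_t nb = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            handle_packet(bufs[i]);
            rte_pktmbuf_free(bufs[i]);
        }
    }
}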

When the network is idle, the CPU spins uselessly for long stretches, wasting power, so DPDK introduced the Interrupt DPDK mode.

Interrupt DPDK:

The principle is very similar to NAPI: when there are no packets to process, the core goes to sleep and waits for an interrupt notification instead. It can then even share a CPU core with other processes, with the DPDK process given a higher scheduling priority.
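A sketch of that adaptive loop, modeled loosely on DPDK's l3fwd-power example (the empty-poll threshold is illustrative, and the interrupt-event registration plus the queue re-check needed after arming are omitted for brevity):

#include <rte_ethdev.h>
#include <rte_interrupts.h>
#include <rte_mbuf.h>

#define IDLE_THRESHOLD 1000  /* illustrative: empty polls before sleeping */

static void adaptive_rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *bufs[32];
    unsigned int idle = 0;

    for (;;) {
        uint16_t nb = rte_eth_rx_burst(port, queue, bufs, 32);
        if (nb > 0) {
            idle = 0;
            for (uint16_t i = 0; i < nb; i++)
                rte_pktmbuf_free(bufs[i]);  /* real handler omitted */
            continue;
        }
        if (++idle < IDLE_THRESHOLD)
            continue;

        /* Nothing arriving for a while: arm the RX interrupt, sleep
         * until the NIC signals new packets, then resume polling. */
        rte_eth_dev_rx_intr_enable(port, queue);
        struct rte_epoll_event ev;
        rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1 /* block */);
        rte_eth_dev_rx_intr_disable(port, queue);
        idle = 0;
    }
}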

6. DPDK high-performance code implementation

1. Use HugePage to reduce TLB misses

By default, Linux uses 4KB pages. The smaller the page, the more pages the same amount of memory needs, so the larger the page tables and the more memory they occupy. The CPU's Translation Lookaside Buffer (TLB) is expensive, so it can hold only a few hundred to a few thousand page entries. If a process wants to use 64GB of memory, that is 64GB / 4KB = 16,000,000 (16 million) pages, whose page entries alone take 16,000,000 × 4B ≈ 62MB. Using HugePages at 2MB per page instead, only 64GB / 2MB = 32,768 pages are needed; the counts are not even in the same order of magnitude.

DPDK adopts HugePages, supporting 2MB and 1GB page sizes under x86-64, which geometrically reduces the number of page entries and thus TLB misses. It also provides basic libraries such as Mempool, MBuf, Ring, and Bitmap. In our experience, frequent allocations on the data plane must use the memory pool; do not call rte_malloc directly, since DPDK's memory allocator is very simple, not nearly as good as ptmalloc.
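For example, a packet-buffer pool is created once at startup, and data-plane allocation then becomes a pop from a per-core cache rather than a malloc (the pool name and sizes below are illustrative):

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* One-time setup: 8191 mbufs with a 250-mbuf per-lcore cache,
 * allocated on the local NUMA socket. Tune for the real workload. */
static struct rte_mempool *create_pool(void)
{
    return rte_pktmbuf_pool_create("rx_pool", 8191, 250, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE,
                                   rte_socket_id());
}

/* Data plane: alloc/free are pops and pushes on the per-core cache. */
static void use_pool(struct rte_mempool *pool)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
    if (m != NULL)
        rte_pktmbuf_free(m);
}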

2. SNA (Shared-Nothing Architecture)

Decentralize the software architecture and avoid global sharing as much as possible, since it brings global contention and destroys the ability to scale horizontally. Under NUMA, do not remotely access memory across nodes.
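In C this often takes the shape of per-core, cache-line-aligned slots, so that the fast path touches no lock and no shared cache line (a sketch; the core count and the stats layout are illustrative):

#include <stdint.h>

#define CACHE_LINE 64
#define MAX_CORES  64  /* illustrative upper bound */

/* One slot per core, each on its own cache line: writes from different
 * cores never contend, and totals are merged only when stats are read. */
struct per_core_stats {
    uint64_t rx_packets;
    uint64_t tx_packets;
} __attribute__((aligned(CACHE_LINE)));

static struct per_core_stats stats[MAX_CORES];

static inline void count_rx(unsigned int core_id)
{
    stats[core_id].rx_packets++;  /* no lock, no shared cache line */
}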

3. SIMD (Single Instruction Multiple Data)

From the earliest MMX/SSE to the latest AVX2, SIMD capability has kept growing. DPDK batches multiple packets together and then uses vector programming to process them all in one pass of the loop. memcpy, for example, uses SIMD for speed.

SIMD is common in game backends, but if other services have similar batch-processing scenarios, it is worth checking whether they can benefit too.
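As a small illustration of the principle, a copy loop that moves 16 bytes per instruction with SSE2 intrinsics (a sketch only; DPDK's real rte_memcpy and its vector rx/tx paths are considerably more elaborate):

#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes 16 at a time; the tail is handled byte by byte. */
static void copy_sse2(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    for (; i < len; i++)
        dst[i] = src[i];
}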

4. Don't use slow APIs

Here "slow API" needs redefining. Take gettimeofday: on 64-bit systems, thanks to the vDSO, it no longer traps into kernel mode; it is a pure memory access that can reach tens of millions of calls per second. But don't forget that at 10GE we are processing tens of millions of packets per second, so even gettimeofday counts as a slow API. DPDK provides cycle-counter interfaces instead, such as rte_get_tsc_cycles, implemented on top of HPET or the TSC.

On x86-64 this uses the RDTSC instruction, which reads directly from the register and returns its result in two 32-bit halves (EDX:EAX). A common implementation:

static inline uint64_t
rte_rdtsc(void)
{
      uint32_t lo, hi;

      __asm__ __volatile__ (
                 "rdtsc" : "=a"(lo), "=d"(hi)
                 );

      return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

This is correct, but not taken to the extreme: it still needs a shift and an OR to assemble the two halves into the result. Let's see how DPDK does it:

static inline uint64_t
rte_rdtsc(void)
{
	union {
		uint64_t tsc_64;
		struct {
			uint32_t lo_32;
			uint32_t hi_32;
		};
	} tsc;

	asm volatile("rdtsc" :
		     "=a" (tsc.lo_32),
		     "=d" (tsc.hi_32));
	return tsc.tsc_64;
}

It cleverly uses a C union's shared memory to assign the result directly, eliminating the unnecessary operations. But there are still some TSC issues to face and resolve:

  1. CPU affinity, to solve the inaccuracy of TSC readings across different cores

  2. Memory barriers, to solve the inaccuracy caused by out-of-order execution

  3. Disable frequency scaling and Intel Turbo Boost, fixing the CPU frequency, to eliminate the error caused by frequency changes

5. Compiler optimizations

  1. Branch prediction

Modern CPUs improve parallelism through pipelining and superscalar execution, and to exploit that parallelism further they perform branch prediction: on reaching a branch, the CPU guesses which side will be taken and starts processing that side early, fetching instructions and reading registers in advance; if the guess fails, all the speculative work is discarded. In business code we often know quite well whether a branch is usually true or false, so we can intervene by hand to generate more compact code and raise the CPU's branch-prediction success rate.

#pragma once

#if !__GLIBC_PREREQ(2, 3)
# if !defined(__builtin_expect)
#  define __builtin_expect(x, expected_value) (x)
# endif
#endif

#if !defined(likely)
#define likely(x) (__builtin_expect(!!(x), 1))
#endif

#if !defined(unlikely)
#define unlikely(x) (__builtin_expect(!!(x), 0))
#endif
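Usage is then a one-word annotation on the branch. A hypothetical hot-path check (struct packet and handle are stand-ins):

struct packet { unsigned int len; /* ... */ };
int handle(struct packet *pkt);  /* hypothetical handler */

static int process(struct packet *pkt)
{
    if (unlikely(pkt == NULL))
        return -1;           /* cold path: pushed out of the hot code */
    if (likely(pkt->len > 0))
        return handle(pkt);  /* hot path: falls straight through */
    return 0;
}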
  2. CPU Cache prefetching

The cost of a Cache Miss is very high, about 65 nanoseconds to read back from memory, but it can be optimized by actively pushing soon-to-be-accessed data into the cache. A typical scenario is traversing a linked list: the next node sits at a random memory address, so the CPU cannot prefetch it automatically, but while we process the current node we can push the next one into the cache with a CPU instruction.

API documentation: doc.dpdk.org/api/rte__pr…

static inline void rte_prefetch0(const volatile void *p)
{
	asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
}
#if !defined(prefetch)
#define prefetch(x) __builtin_prefetch(x)
#endif

…and so on, with variants for the different cache levels.
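For the linked-list scenario described above, using the rte_prefetch0 shown earlier, a sketch looks like this (the node type and work() are illustrative):

struct node {
    struct node *next;
    /* ... payload ... */
};

void work(struct node *n);  /* hypothetical per-node processing */

static void walk(struct node *head)
{
    for (struct node *n = head; n != NULL; n = n->next) {
        /* Start pulling the next node into cache while the current one
         * is processed, hiding the ~65 ns trip to memory. */
        if (n->next != NULL)
            rte_prefetch0(n->next);
        work(n);
    }
}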

  3. Memory alignment

Memory alignment has two benefits:

- Avoid structure members that span a Cache Line: they require two reads and a merge into a register, hurting performance. Sort structure members from largest to smallest, and force alignment. See Data Alignment: Straighten Up and Fly Right.

#define __rte_packed __attribute__((__packed__))

- Multithreaded writes to the same Cache Line cause false sharing and Cache Misses; align such structures to the Cache Line boundary.

#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif

#ifndef aligined
#define aligined(a) __attribute__((__aligned__(a)))
#endif
  4. Constant optimization

Operations that depend only on constants can be completed in the compile phase. C++11, for example, introduced constexpr for this. With GCC you can use __builtin_constant_p to test whether a value is a compile-time constant and, if so, do the computation at compile time. Example: network/host byte-order conversion.

#define rte_bswap32(x) ((uint32_t)(__builtin_constant_p(x) ? \
				   rte_constant_bswap32(x) :		\
				   rte_arch_bswap32(x)))

The constant path, rte_constant_bswap32, is implemented via a static macro:

#define RTE_STATIC_BSWAP32(v) \
	((((uint32_t)(v) & UINT32_C(0x000000ff)) << 24) | \
	 (((uint32_t)(v) & UINT32_C(0x0000ff00)) <<  8) | \
	 (((uint32_t)(v) & UINT32_C(0x00ff0000)) >>  8) | \
	 (((uint32_t)(v) & UINT32_C(0xff000000)) >> 24))
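With this split, a call like rte_bswap32(0x12345678) is folded by the compiler into the constant 0x78563412, while a value known only at runtime goes through rte_arch_bswap32, the single-instruction path shown next.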

  5. Use CPU instructions

Modern CPUs provide many instructions that perform common functions directly; for byte-order conversion, for example, x86 has direct support via the bswap instruction.

static inline uint64_t rte_arch_bswap64(uint64_t _x)
{
	register uint64_t x = _x;
	asm volatile ("bswap %[x]"
		      : [x] "+r" (x)
		      );
	return x;
}

This implementation, which is also glibc's, goes from constant optimization to CPU-instruction optimization, falling back to plain hand-written code only as a last resort. After all, top programmers demand different things of languages, compilers, and implementations; understand the wheel before you build one.

Google's open-source cpu_features library can detect which features the current CPU supports, enabling optimizations for specific CPUs. High-performance programming never ends: the deeper your understanding of hardware, kernel, compiler, and language, the further you can go.

7. The DPDK ecosystem

For Internet backend development like ours, the capabilities DPDK itself provides are fairly bare. For example, to use DPDK you must first implement basic ARP and IP-layer functionality yourself, which makes it hard to get started. If higher-level business wants to use it, user-mode transport protocol support is also needed. Using DPDK directly is not recommended.

The most complete application-layer project at present is FD.io (The Fast Data Project), anchored by Cisco's open-source VPP, with relatively complete protocol support: ARP, VLAN, Multipath, IPv4/v6, MPLS, and so on. For user-mode transport protocols (UDP/TCP) there is TLDK. From project positioning to community support, it is a relatively reliable framework.

Tencent Cloud's open-source F-Stack is also worth attention: it is easier to develop with and directly provides a POSIX interface.

Seastar is also very powerful and flexible; it can switch between kernel mode and DPDK at will, and it has its own transport protocol, the Seastar native TCP/IP stack.

Our GBN gateway project needs to support L3/IP-layer access as a WAN gateway at 20GE per machine, and is developed on DPDK.

