Notes from reading www.52im.net/thread-561-…

1. Concurrency constraints

1.1 Restrictions on file handles

Every TCP connection consumes a file descriptor. Once the file descriptors are used up, new connections fail with the error “Socket/File: Can’t open so many files”.

  • Process limit

    If ulimit -n prints 1024, a process can open at most 1024 files, so with the default configuration you can only sustain on the order of a thousand concurrent TCP connections per process. Temporary change: ulimit -n 1000000. However, this temporary change only applies to the current login session and becomes invalid after the system restarts or the user logs out. (A C sketch of raising the same limit from inside a program follows this list.)

    To keep the change from being lost after a reboot (although in my test on CentOS 6.5 the setting did not actually revert after a restart), edit /etc/security/limits.conf and add:

    *    soft    nofile    1000000
    *    hard    nofile    1000000

    Another way to make the change permanent: edit /etc/rc.local and add the following:

    ulimit -SHn 1000000
  • Global limits

    Running cat /proc/sys/fs/file-nr prints something like 9344 0 592026; the three fields are, respectively:

    • Number of allocated file handles,
    • Number of file handles allocated but not used,
    • Maximum number of file handles.

    On 2.6 kernels, however, the second value is always 0. This is not an error; it actually means that none of the allocated file handles are wasted: they are all in use.

    Edit /etc/sysctl.conf with root privileges (apply the changes with sysctl -p):

    fs.file-max = 1000000
    net.ipv4.ip_conntrack_max = 1000000
    net.ipv4.netfilter.ip_conntrack_max = 1000000
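
The same per-process limit can also be inspected and raised from inside a program with getrlimit()/setrlimit(). The sketch below is only an illustration (not from the article); the value 1000000 mirrors the examples above, and raising the hard limit still requires sufficient privileges.

    /* Minimal sketch: inspect and raise this process's open-file limit (RLIMIT_NOFILE).
       Raising the hard limit requires root (CAP_SYS_RESOURCE). */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl;
        getrlimit(RLIMIT_NOFILE, &rl);
        printf("soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

        rl.rlim_cur = 1000000;   /* soft limit: what the process may actually use */
        rl.rlim_max = 1000000;   /* hard limit: the ceiling for the soft limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            perror("setrlimit"); /* fails without sufficient privileges */
        return 0;
    }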

2. The C10K problem

Early servers were based on the process/thread model: each new TCP connection was given its own process (or thread). Processes are among the most expensive resources an operating system has, and a single machine cannot create very many of them. For C10K, 10,000 processes would have to be created, which is more than the operating system on a single machine can bear (often leading to severe inefficiency or even complete paralysis).
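
As a concrete illustration of that model (a minimal sketch, not the article's code), here is a process-per-connection TCP server: every accept() is followed by a fork(), so C10K literally means ten thousand processes. Port 8080 and handle_client() are placeholders, and error handling is omitted.

    /* Classic process-per-connection model: one child process per accepted socket. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void handle_client(int fd) { /* ... application protocol ... */ close(fd); }

    int main(void) {
        signal(SIGCHLD, SIG_IGN);                /* let the kernel reap exited children */
        int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);             /* arbitrary example port */
        bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(listen_fd, 128);

        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);
            if (fork() == 0) {                   /* one process per connection:      */
                close(listen_fd);                /* C10K means 10,000 of these exist */
                handle_client(conn);
                _exit(0);
            }
            close(conn);                         /* the parent keeps only the listener */
        }
    }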

2.1 The nature of the C10K problem

Too many threads are created, data is copied frequently (buffered I/O: the kernel copies data into user-space process memory, blocking the caller), and process/thread context switches are expensive; together these bring the operating system down. That is the essence of the C10K problem.

Therefore, the key to solving the C10K problem is to reduce the consumption of core computing resources such as the CPU as much as possible, squeezing every bit of performance out of a single server and breaking through the bottleneck described by the C10K problem.

2.2 Discussion of solutions to the C10K problem

The key idea: have each process/thread handle multiple connections simultaneously (I/O multiplexing).

This idea has been realized through the following progression:

  • Method 1: a thread handles its connections one at a time, finishing with one socket before moving on to the next.
    • Problem: when a socket has no data to process it blocks, and that blocks the entire thread. (Non-blocking sockets are not considered yet.)
  • Method 2: the select scheme. Instead of blocking on one socket at a time, hand the whole set of sockets to the kernel via select() and let the kernel report which ones are ready for processing; hence the select scheme.
    • Problems: there is an upper limit on the number of handles (FD_SETSIZE), the descriptor sets must be re-initialized before every call, and all file handles are still checked one by one, which is inefficient.
  • Method 3: the poll scheme. poll addresses the first two problems of select: it removes the file-handle cap by passing the events of interest to the kernel in a pollfd array, and it avoids repeated initialization by using separate fields for the events of interest and the events that occurred.
    • Problem: all file handles are still checked one by one, which is inefficient.
  • Method 4: the epoll scheme. Since walking through the state of every file handle is inefficient, it is much faster if the call returns only the file handles whose state has changed (most likely: data ready). The epoll model thus became the definitive answer to the C10K problem; a minimal sketch follows this list.
    • Limitation: platform-dependent (Linux only).
  • Method 5: the libevent/libuv scheme: wrap each platform's I/O multiplexing mechanism behind a common interface.
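
Below is a minimal sketch of Method 4: a single-threaded epoll event loop that accepts connections and echoes data back. It is an illustration, not the article's code; error handling is omitted and port 8080 is an arbitrary example.

    /* Minimal epoll-based echo server: one thread, many connections. */
    #define _GNU_SOURCE               /* for accept4() and SOCK_NONBLOCK */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(listen_fd, SOMAXCONN);

        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[1024];
        char buf[4096];
        for (;;) {
            /* Only descriptors whose state changed are returned. */
            int n = epoll_wait(epfd, events, 1024, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    /* New connection: register it with the epoll instance. */
                    int conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {
                    /* Data ready: echo it back; close on EOF or error. */
                    ssize_t r = read(fd, buf, sizeof(buf));
                    if (r <= 0) close(fd);  /* a closed fd leaves the epoll set automatically */
                    else        (void)write(fd, buf, (size_t)r);
                }
            }
        }
    }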

3. The C10M problem

Right now, an x86 server with 40 Gbps networking, 32 cores, and 256 GB of RAM is quoted on Newegg for a few thousand dollars. Hardware with this configuration can in fact handle more than 10 million concurrent connections; if it cannot, it is because you chose the wrong software, not because of the underlying hardware.

Over the next 10 years, with IPv6 putting the potential number of connections per server into the millions, it is entirely possible for a single server to handle millions (or even tens of millions) of concurrent connections, but we need to re-examine how network programming is implemented in today's mainstream operating systems.

3.1 Solution

Unix was not designed as a server operating system but as a control system for telephone networks. The telephone network itself carries the data, so there is a clear boundary between the control layer and the data layer. The problem is that today we use Unix servers as part of the data layer, which is something we should not be doing at all.

Don’t let the OS kernel do all the heavy lifting: move packet processing, memory management, processor scheduling, and so on out of the kernel and into the application, where they can be done efficiently; let an OS like Linux handle only the control layer and leave the data layer entirely to the application.

To sum up, solving the C10M problem comes down mainly to the following aspects:

  • Network adapter problem: going through the kernel is inefficient. Solution: use your own driver and manage the NIC yourself, keeping the adapter away from the operating system.
  • CPU problem: using the conventional kernel mechanisms to coordinate the application does not scale. Solution: Linux manages the first two CPUs, the application manages the rest, and interrupts are allowed only on the CPUs you designate.
  • Memory problem: memory must be handled with care to be efficient. Solution: at system startup, allocate most of the memory as huge pages that the application manages itself.

In the case of Linux, the idea is to hand the control layer to Linux while the application manages the data: on the data path there is no interaction between the application and the kernel, no thread scheduling, no system calls, no interrupts, nothing.
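
As a small illustration of the memory point above (not from the article), the sketch below reserves a pool of huge pages up front with mmap() so the application can carve buffers out of it itself, instead of going back to the kernel allocator on the hot path. It assumes a 2 MB huge page size and that huge pages have already been reserved (for example via vm.nr_hugepages).

    /* Reserve a huge-page backed memory pool at startup. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* assumed 2 MB huge pages */

    int main(void) {
        size_t len = 64 * HUGE_PAGE_SIZE;        /* 128 MB pool, example size */
        void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (pool == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

        /* The application manages this pool itself, so the hot path
           never has to call back into the kernel's allocator. */
        printf("huge-page pool at %p (%zu bytes)\n", pool, len);
        munmap(pool, len);
        return 0;
    }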

4. Theoretical exploration of high-performance network applications from C10K to C10M

4.1 CPU Affinity & Memory Localization

In both the multi-process and multi-threaded models, all scheduling is handed over to the operating system, which allocates hardware resources for us. Common server operating systems are time-sharing systems whose schedulers pursue fairness as far as possible and do not specially optimize for any particular kind of task. If the system is only running one specific workload, the default scheduling policy may therefore cost some performance. Suppose I run a task A: in the first scheduling period it runs on core 0, and in the second it may run on core 1. Such frequent migration causes a lot of context switching and thus affects performance.

Data locality is a similar problem. x86 servers use the NUMA architecture, in which each CPU has its own local memory; if the data the current CPU needs has to be fetched from memory managed by another CPU, extra latency is incurred. So we try, as far as possible, to keep a task and its data on the same CPU core and the same memory node at all times. Linux provides the sched_setaffinity system call, which lets us bind a task to specified CPU cores in code, and some Linux distributions also ship the user-space tools numactl and taskset, which make it easy to run a program on a given core or node.
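
A minimal sketch of the programmatic route (the call is sched_setaffinity; pinning to core 0 is just an example): bind the current process to one core so its working set stays warm in that core's caches. From the shell, taskset -c 0 (or numactl) achieves a similar effect.

    /* Pin the calling process to CPU core 0. */
    #define _GNU_SOURCE               /* for sched_setaffinity() and sched_getcpu() */
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                                    /* allow only core 0 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned; currently running on CPU %d\n", sched_getcpu());
        return 0;
    }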

4.2 RSS, RPS, RFS, XPS

These features were added to Linux in recent years to optimize network performance. RPS, RFS, and XPS were contributed to the community by Google. RSS requires hardware support; today's mainstream NICs support it and are commonly called multi-queue NICs. The goal of all of them is to make full use of multiple CPU cores by spreading the packet-processing load across them.

RPS and RFS were added in Linux 2.6.35 and are generally used together. They use software to simulate RSS-like behaviour on NICs that do not support it, binding a given data flow to a specified core to maximize network-processing performance. XPS was added in Linux 2.6.38 to optimize sending on multi-queue NICs: when a packet is transmitted, the corresponding NIC queue is selected according to a CPU map.
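
As an illustration (not from the article), RPS is configured by writing a CPU bitmask to a receive queue's rps_cpus file in sysfs. In the sketch below, eth0 and rx-0 are placeholders for your NIC and queue, and the mask f spreads that queue's packet processing over CPUs 0-3; in practice this is usually done with a one-line shell command, the C version just makes the mechanism explicit.

    /* Enable RPS for one rx queue by writing a CPU mask to sysfs (requires root). */
    #include <stdio.h>

    int main(void) {
        const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus"; /* placeholder NIC/queue */
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        fputs("f\n", f);   /* hex CPU mask: 0xf = cores 0-3 */
        fclose(f);
        return 0;
    }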

4.3 IRQ optimization

There are two main points to IRQ optimization. The first is interrupt coalescing. In the early days the network adapter raised an interrupt for every packet it received; when the packet rate is very high the number of interrupts becomes enormous, most of the computing resources go into handling interrupts, and performance degrades. NAPI and then newer NAPI were introduced to reduce the number of interrupts and to get more work done per interrupt. The second point is IRQ affinity, similar to the CPU affinity discussed earlier: bind the interrupt handling of different NIC queues to specified CPU cores, which applies to NICs with the RSS feature.

TSO: the Ethernet MTU is usually 1500 bytes; subtracting the TCP/IP headers leaves a TCP maximum segment size (MSS) of 1460. Normally the protocol stack segments any TCP payload larger than 1460 so that the IP packets it produces never exceed the MTU. With a NIC that supports TSO/GSO, the stack no longer needs to do this: it can hand a larger TCP payload to the NIC driver, which performs the segmentation and packet encapsulation itself. The computation is thus offloaded from the CPU to the network card, further improving overall performance. GSO is a generalization of TSO and is not limited to TCP.

LRO works in the opposite direction from TSO: instead of handing every small received packet to the protocol stack individually, multiple TCP payloads are merged and passed up together, improving the stack's processing efficiency. GRO is an updated version of LRO that fixes some of LRO's problems. These features only help in certain scenarios; if you do not understand your workload, enabling them may actually degrade performance.

4.4 Kernel optimization

Kernel network tuning mainly involves two groups of parameters: the net.ipv4.* parameters and the net.core.* parameters.

They are mainly used to adjust timeouts, buffer sizes, and so on. Articles on tuning these parameters are easy to find with a search engine, but it is recommended to read the kernel documentation carefully and run your own tests to verify whether a given change actually improves performance or has drawbacks.