From the network card to the application, a packet passes through a whole series of components. What does the driver do? What does the kernel do? What can we do to optimize? The process involves many finely tunable hardware and software parameters that influence one another, and there is no once-and-for-all "silver bullet". In this article, Yang Peng, senior engineer for system development at Youpai Cloud, combines his practical experience to introduce how to arrive at an optimal "scenario-based" configuration built on a deep understanding of the underlying mechanisms.

This article is based on Yang Peng's keynote speech "Performance Optimization: Faster Data Reception" at the Youpaiyun Open Talk technology salon in Beijing. The live video and slides are available.

Hello everyone, I am Yang Peng, a development engineer at Youpaiyun. I have been with Youpaiyun for four years, working on the development of the underlying CDN system and responsible for scheduling, caching, load balancing and other core CDN components. I am glad to share my experience and thoughts on network data processing with you. Today's topic is "How to receive data faster", and it mainly introduces methods and practices for speeding up network data processing. I hope it will help you understand how to optimize at the system level so that applications benefit as much as possible, ideally without noticing anything. Let's get to the point.

What is the first thing that comes to mind when trying to do any optimization? I think it is measurement. Before making any change or optimization, make sure you know exactly which metrics reflect the current problem. Then, after making the corresponding adjustments or changes, you can verify the actual effect through those same metrics.

The topics shared here all revolve around that core principle of metrics. In the final analysis, optimization at the network level comes down to one point: if the packet loss rate at each layer of the network stack can be monitored, those core indicators tell you clearly which layer has the problem. With clear, monitorable indicators, it becomes easy to make adjustments and verify the actual results. Of course, these two points are still rather abstract; the rest of the talk gets more concrete.

As shown in the figure above, when a packet is received, the data flows through many stages from the network card up to the application layer. For now, instead of focusing on every step, focus on a few core critical paths:

  • First, the packet arrives at the NIC.

  • Second, the NIC receives the packet and generates an interrupt to tell the CPU that data has arrived.

  • Third, the kernel takes over from there, pulling the data out of the NIC and passing it to the kernel protocol stack.

Those are the three key paths. The sketch on the right of the image above corresponds to these three steps and intentionally uses two colors. The distinction matters because the next two parts follow it: one covers the upper, driver part, the other the lower, kernel part. Of course, the kernel does far more; this article only touches on the kernel network subsystem, and more specifically on the part where the kernel interacts with the driver.

NIC driver

The NIC is the hardware part, and the driver is the software, which covers most of the NIC side of the story. This section is briefly divided into four parts: initialization, startup, monitoring, and tuning, starting with the initialization process.

NIC driver – Initialization

The driver initialization process is hardware dependent and does not need much attention here. What matters is ethtool, a powerful tool that lets you perform a variety of operations on a network card: you can not only read a NIC's configuration, but also change its configuration parameters.

So how does ethtool control the network card? During initialization, each NIC driver registers, through a common interface, the set of operations that ethtool may perform on it. ethtool is a common set of interfaces: for example, if ethtool supports 100 functions, a given NIC model may support only a subset, and the functions it actually supports are declared in this step.

The screenshot above shows the assignment of that structure at initialization, i.e. the callback functions used to operate this NIC. The most important of these are open and close. Anyone who has used ifconfig to operate a NIC will find them familiar: when you ifconfig up/down a NIC, it calls exactly the functions the driver registered at initialization.
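As a small illustration of what those registered operations expose, here is a minimal sketch of reading a NIC's configuration with ethtool, assuming the interface is named eth0; which queries actually work depends on what the driver registered:

$ ethtool -i eth0        # driver name, version, firmware
$ ethtool eth0           # link settings: speed, duplex, auto-negotiation
$ ethtool -k eth0        # offload features currently enabled by the driver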

NIC driver – Start

The driver initialization process is followed by the open (start) process, which consists of four steps: allocating RX/TX queue memory, enabling NAPI, registering interrupt handlers, and enabling interrupts. Registering interrupt handlers and enabling interrupts need no explanation; any hardware attached to the machine works this way: when an event arrives later, the device must notify the system with an interrupt, so interrupts have to be turned on.

NAPI, step two, will be explained later; for now, focus on the memory allocation during startup. When the NIC receives data, it must copy the data from the link layer into the machine's memory, memory that the NIC requests from the kernel and operating system through an interface when it starts up. Once the memory is allocated and its address determined, the NIC can later transfer received packets directly into that fixed memory address via DMA, without even involving the CPU.

The allocation of memory to queues can be seen in the figure above. Long ago, network cards used a single-queue mechanism, but modern NICs are mostly multi-queue. The advantage is that packet reception on the machine's NIC can be load balanced across multiple CPUs, hence multiple queues, a concept that will come up again later.
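The size of those RX/TX rings can be inspected and, if the driver allows, adjusted with ethtool; a sketch, again assuming eth0 (the reported maximums are hardware and driver dependent, and the value below is only illustrative):

$ ethtool -g eth0                  # current and maximum ring sizes
$ sudo ethtool -G eth0 rx 4096     # try to enlarge the RX ring, if the hardware supports it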

The second step of the startup process, NAPI, is an important part of the modern network packet processing framework, and it has played a big role in supporting 10G, 20G and 25G high-speed network cards. NAPI itself is not complicated; its core is two things: interrupts and polling. In the traditional model, the NIC generates an interrupt for every packet it receives, and the packet is handled inside the interrupt handler, in a cycle of receive a packet, handle an interrupt, receive the next packet, handle the next interrupt. The advantage of NAPI is that only one interrupt is needed; after receiving it, all the data sitting in the queue memory can be drained by polling, which is very efficient.

NIC driver – Monitoring

Next comes the monitoring that can be done at the driver level, and where that data comes from.


$ sudo ethtool -S eth0
NIC statistics:
     rx_packets: 597028087
     tx_packets: 5924278060
     rx_bytes: 112643393747
     tx_bytes: 990080156714
     rx_broadcast: 96
     tx_broadcast: 116
     rx_multicast: 20294528
     ...

First and foremost, ethtool can retrieve data from the network card: the number of packets received, the traffic processed, and other general information. What deserves more attention is the exception information.
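Since the exception information is what matters, a simple way (assuming eth0) is to filter the statistics for error- and drop-related counters; the exact counter names vary by driver:

$ ethtool -S eth0 | grep -iE 'err|drop|fifo|miss'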


$ cat /sys/class/net/eth0/statistics/rx_dropped
2

Through the sysfs interface, you can see the number of packets dropped by the network card, a sign that something in the system is abnormal.

There is a third way to get this kind of information, similar to the previous two, but its format is a bit messy; it is enough to know it exists.
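One such interface, which I assume is the messy-format one being referred to here, is /proc/net/dev: it carries per-interface receive and transmit counters, including errors and drops, on one wide line per NIC:

$ cat /proc/net/dev         # all interfaces, one wide line each
$ grep eth0 /proc/net/dev   # the receive columns include errs, drop, fifo and frame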

The above is an example from production. At the time there was an anomaly in the business, and after investigation the network card layer was suspected, which required further analysis. The ifconfig tool can be used to view some NIC statistics intuitively. In the figure you can see that the NIC's errors counter is very high, indicating an obvious problem. What is even more interesting is that the frame counter to the right of errors has exactly the same value. The errors counter is the accumulation of several kinds of NIC errors, and its neighbours dropped and overruns are both zero; in other words, at that time almost all of the NIC's errors came from frame errors.

Of course, that is only a snapshot; the bottom part of the figure is the monitoring data over time, where the fluctuation is clearly visible: the machine really was abnormal. A frame error means the NIC's CRC check failed on a received packet: when a packet arrives, the NIC verifies its checksum, and if it does not match, the packet is considered damaged and is discarded.

This kind of problem is relatively easy to analyze: two points, one line. The machine's NIC is connected by a network cable to the upstream switch. When something goes wrong, it is either the cable, the machine's NIC, or the port on the peer side, i.e. the upstream switch port. Working through the most likely cause first, we coordinated with operations to replace the machine's network cable, and the indicators reflected the effect: the errors dropped sharply until they disappeared completely, and the business at the upper layer quickly recovered to normal.

NIC driver – Tuning

After monitoring, let's look at tuning. At this level there is not much to adjust; it is mainly about the NIC's multiple queues, which is fairly intuitive: you can adjust the number and size of the queues, the weights between queues, and even the fields used in the hash.

$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:   0
TX:   0
Other:    0
Combined: 8
Current hardware settings:
RX:   0
TX:   0
Other:    0
Combined: 4

The figure above shows the adjustment of multiple queues. To illustrate the concept with an example: suppose a web server is bound to CPU2 on a multi-CPU machine, the machine's NIC is multi-queue, and one of the queues is processed by CPU2. Here lies a problem: because the NIC has multiple queues, the traffic on port 80 may be hashed to a queue that CPU2 does not process. Then, when the lower layers receive the data and hand it to the application layer, the data has to be moved across CPUs, say from CPU1 to CPU2, which invalidates the CPU cache, a costly operation for a fast CPU.

So what can be done? With the tools mentioned earlier, we can steer the TCP traffic on port 80 specifically to the NIC queue handled by CPU2. The effect is that the packet stays on the same CPU from the NIC through the kernel to the application layer. The biggest benefit is that the CPU cache is always hot, so overall latency and performance are very good. Of course this example is not very practical; it is mainly meant to illustrate the effect that can be achieved.
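A sketch of how that steering could be done with ethtool, assuming eth0, a driver that supports ntuple filters, and that queue 2 is the one whose interrupt is handled by CPU2 — all of which has to be verified on the real machine:

$ sudo ethtool -K eth0 ntuple on                            # enable flow steering, if the NIC supports it
$ sudo ethtool -N eth0 flow-type tcp4 dst-port 80 action 2  # direct TCP port 80 traffic to RX queue 2
$ ethtool -n eth0                                           # list the rules now installed

On top of this, the interrupt of queue 2 would still need to be pinned to CPU2 using the affinity settings described later.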

Kernel network subsystem

Having covered the NIC driver, the next part explains the kernel subsystem, split into two pieces: soft interrupts and the initialization of the network subsystem.

softirqs

NETDEV is the annual conference of the Linux networking subsystem. One interesting detail is that the session number is written in a special way: the figure shows 0x15, a hexadecimal number, and 0x15 is exactly 21, matching the year, which is quite geeky. Those of you interested in the network subsystem should take a look.

Back to the topic: the kernel has many mechanisms for deferring work, and soft interrupts are just one of them. The diagram above shows the basic structure of Linux: the top layer is user space, the middle is the kernel, and the bottom is hardware. User space and the kernel interact in two main ways: through system calls, or through exceptions that trap into kernel mode. How does the underlying hardware interact with the kernel? The answer is interrupts: hardware must interact with the kernel through interrupts, and any event has to generate an interrupt signal to inform the CPU and the kernel.

Such a mechanism is usually fine, but for network data, one interrupt per packet leads to two obvious problems.

Problem 1: While an interrupt is being processed, further interrupt signals are masked. If an interrupt takes a long time to handle, interrupt signals that arrive in the meantime are lost. Suppose a packet takes ten seconds to process and five more packets arrive during those ten seconds; because their interrupt signals were lost, even after the first packet is finished, the later packets are never processed. On the TCP side: the client sends a packet, the server spends a few seconds processing it, the client sends three more packets during that time, but the server never learns about them and thinks it only received one packet, while the client keeps waiting for the server's reply. Both sides end up stuck, which shows that losing interrupt signals is a very serious problem.

Problem 2: If every packet triggers interrupt processing, then a flood of packets generates a flood of interrupts. At 100,000, 500,000 or even a million PPS, the CPU has so many network interrupts to handle that it can do nothing else.

The solution to both problems is to make interrupt handling as short as possible. Concretely, the heavy work is pulled out of the hardware interrupt handler and handed to the soft interrupt mechanism. The practical result is that the hardware interrupt handler does very little, leaving the necessary work, such as actually receiving the data, to the soft interrupt; that is why soft interrupts exist.

static struct smp_hotplug_thread softirq_threads = {
  .store             = &ksoftirqd,
  .thread_should_run = ksoftirqd_should_run,
  .thread_fn         = run_ksoftirqd,
  .thread_comm       = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
  register_cpu_notifier(&cpu_nfb);
  BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
  return 0;
}
early_initcall(spawn_ksoftirqd);

The soft interrupt mechanism is implemented with kernel threads; the code above shows the corresponding kernel thread setup. Each CPU on the server has a kernel thread called ksoftirqd, so a multi-CPU machine has one per CPU. The last member of the structure, "ksoftirqd/%u", names them: a machine with three CPUs will have the threads ksoftirqd/0, ksoftirqd/1 and ksoftirqd/2.
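These per-CPU threads can be seen directly on any Linux machine; a quick check:

$ ps -e -o pid,psr,comm | grep ksoftirqd    # one ksoftirqd/N thread per CPU, running on CPU N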

Information about the soft interrupt mechanism can be found under /proc/softirqs. There are only a handful of soft interrupt types; the network-related ones to watch are NET_TX and NET_RX, covering the sending and receiving of network data.
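A sketch of watching those counters (one column per CPU; NET_RX is the one that matters for receiving):

$ grep -E 'NET_TX|NET_RX' /proc/softirqs       # cumulative soft interrupt counts per CPU
$ watch -d -n1 'grep NET_RX /proc/softirqs'    # -d highlights the CPUs where the count is actually moving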

Kernel initialization

With soft interrupts in place, let's look at the initialization of the kernel network subsystem. There are two main steps:

  • For each CPU, create a data structure with many members hanging off it, which is closely related to the subsequent processing;

  • Register the soft interrupt handlers corresponding to the NET_TX and NET_RX soft interrupts seen above.

Above is a hand-drawn packet processing flow:

  • Step 1: the NIC receives the packet;

  • Step 2: the packet is copied into memory via DMA;

  • Step 3: an interrupt is generated to tell the CPU, and interrupt handling begins. The key interrupt handling consists of two steps: masking the interrupt signal, and waking up the NAPI mechanism.


static irqreturn_t igb_msix_ring(int irq, void *data)
{
  struct igb_q_vector *q_vector = data;
  
  /* Write the ITR value calculated from the previous interrupt. */
  igb_write_itr(q_vector);
  
  napi_schedule(&q_vector->napi);
  
  return IRQ_HANDLED;
}

The code above is what the igb NIC driver's interrupt handler does. Leaving out the initial variable declaration and the final return, the interrupt handler is only two lines of code, very short. The key point is that in the hardware interrupt handler it does nothing except schedule NAPI, activating the NAPI soft interrupt processing, so the hardware interrupt handler returns very quickly.

NAPI activation


/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd, struct napi_struct *napi)
{
  list_add_tail(&napi->poll_list, &sd->poll_list);
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Activating NAPI is also simple and consists of two main steps. When the kernel network subsystem was initialized, each CPU got a softnet_data structure; here the queue's NAPI information is inserted into that structure's linked list. In other words, when data arrives, each NIC queue must add its information to the poll list of the corresponding CPU, binding the two together so that a given CPU processes a given queue.

Besides that, the soft interrupt must be raised, just as with hard interrupts. The following diagram puts many of the steps together, so the earlier ones need not be repeated; it focuses on how the soft interrupt is triggered. Much like hard interrupts, soft interrupts have an interrupt vector table: each interrupt number corresponds to a handler function, and handling an interrupt is simply a matter of looking it up in that table, exactly as with hard interrupts.

Data reception – Monitoring

Now that we've covered how it works, let's see where it can be monitored. Under /proc there is plenty to look at for how interrupts are handled. The first column of /proc/interrupts is the interrupt number; each device has its own fixed interrupt number. For networking, you only need to pay attention to the interrupt numbers belonging to the NIC queues, for example 65, 66, 67 and 68. Of course, the absolute counts are not the point; what matters is the distribution, i.e. whether the interrupts are being handled by different CPUs. If they are all landing on one CPU, something needs to be done to spread them out.
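A sketch of that check, assuming the NIC queues' interrupt lines contain the interface name (the naming convention varies by driver):

$ grep eth0 /proc/interrupts                   # one row per queue IRQ, one column per CPU
$ watch -d -n1 'grep eth0 /proc/interrupts'    # watch whether the counts grow on one CPU or on many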

Data reception – tuning

Interrupts can be tuned in two ways: interrupt coalescing and interrupt affinity.

Adaptive interrupt coalescing

  • rx-usecs: how long to delay generating an interrupt after a data frame arrives, in microseconds

  • rx-frames: the maximum number of data frames to accumulate before an interrupt is triggered

  • rx-usecs-irq: how long to delay the interrupt to the CPU if interrupt processing is currently in progress

  • rx-frames-irq: the maximum number of data frames to accumulate if interrupt processing is currently in progress

These are capabilities supported by the NIC hardware. NAPI is essentially also an interrupt coalescing mechanism: if many packets arrive, NAPI can get by with a single interrupt, so it does not need the hardware to coalesce interrupts; the practical effect is the same, reducing the total number of interrupts.
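A sketch of inspecting and changing those parameters with ethtool, assuming eth0 and a driver that supports coalescing; the values below are purely illustrative, not recommendations:

$ ethtool -c eth0                                  # current coalescing settings
$ sudo ethtool -C eth0 adaptive-rx on              # let the driver adapt the thresholds, if supported
$ sudo ethtool -C eth0 rx-usecs 50 rx-frames 64    # or set fixed thresholds by hand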

Interrupt affinity

$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'

This is closely related to NIC multi-queue. If the NIC has multiple queues, you can manually specify which CPU handles each one, distributing the data-processing load evenly across the machine's available CPUs. The configuration is simple: write a number into the corresponding file under /proc. The value is a bitmask; read in binary, it selects the CPU. Writing 1 means CPU0; writing 4, which is 100 in binary, hands the interrupt to CPU2.
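A sketch of pinning queue interrupts, assuming IRQs 65 and 66 belong to two NIC queues (as in the /proc/interrupts example above) and that we want them on CPU2 and CPU3; the values are hexadecimal CPU bitmasks:

$ sudo bash -c 'echo 4 > /proc/irq/65/smp_affinity'    # 4 = binary 100  -> CPU2
$ sudo bash -c 'echo 8 > /proc/irq/66/smp_affinity'    # 8 = binary 1000 -> CPU3
$ cat /proc/irq/65/smp_affinity_list                   # the same setting as a readable CPU list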

One more small thing to note: many distributions ship a daemon called irqbalance (irqbalance.github.io/irqbalance)…

Kernel – Data processing

Finally, the data-processing part. Once the data has reached the NIC and been written into the queue memory, the kernel has to pull it out of that memory. If the machine is doing 100,000 or even a million PPS and the CPU does nothing but handle network data, none of the other business logic can run, so packet processing must not monopolize the CPU. The core question is how to limit it.

For this problem there are mainly two limits: an overall limit and a single-round limit.

while (!list_empty(&sd->poll_list)) {
  struct napi_struct *n;
  int work, weight;

  /* If softirq window is exhausted then punt.
   * Allow this to run for 2 jiffies since which will allow
   * an average latency of 1.5/HZ.
   */
  if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
    goto softnet_break;

The overall limit is easy to understand: ideally one CPU handles one queue, but if the number of CPUs is smaller than the number of queues, one CPU may need to process more than one queue, and the overall budget in the code above is shared across everything that CPU polls.

weight = n->weight;

work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
        work = n->poll(n,weight);
        trace_napi_poll(n);
}

WARN_ON_ONCE(work > weight);

budget -= work;

The single-round limit (the weight) caps the number of packets a queue can process in one round; when the limit is reached, processing stops and waits for the next round.

softnet_break:
  sd->time_squeeze++;
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
  goto out;

That forced stop is the key point, and fortunately there is a counter for it: time_squeeze records how many times processing was cut short. With this information we can judge whether the machine has a bottleneck in network processing and how often it is forcibly interrupted.
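When that counter keeps climbing, one knob to look at is the overall budget of the polling loop; a sketch (net.core.netdev_budget is the long-standing sysctl, net.core.netdev_budget_usecs only exists on newer kernels, and the new value is illustrative, to be validated against the squeeze count):

$ sysctl net.core.netdev_budget               # total packets one softirq round may process, commonly 300 by default
$ sudo sysctl -w net.core.netdev_budget=600   # raise it, then re-check whether the squeeze count stops growing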

The figure above shows the source of this monitoring data, /proc/net/softnet_stat. The format is simple: each row corresponds to one CPU and the hexadecimal values are separated by spaces. So what does each column represent? Unfortunately, there is no documentation; you have to check the kernel version you are running and read the corresponding code.

seq_printf(seq,
           "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
           sd->processed, sd->dropped, sd->time_squeeze, 0,
           0, 0, 0, 0, /* was fastroute */
           sd->cpu_collision, sd->received_rps, flow_limit_count);

Here is where each field in the file comes from. The actual situation may vary, because the number and order of the fields can change as the kernel version iterates; the squeeze field is the one related to how often network data processing was cut short (a sketch of reading these columns with a shell one-liner follows the list):

  • sd->processed: number of processed packets (in multi-NIC bond mode this may be more than the actual number of packets received)

  • sd->dropped: number of packets dropped because the input queue was full

  • sd->time_squeeze: number of times the net_rx_action soft interrupt processing was forcibly cut short

  • sd->cpu_collision: number of device-lock collisions when sending data, for example when multiple CPUs send at the same time

  • sd->received_rps: number of times the current CPU was woken up (via inter-processor interrupt)

  • sd->flow_limit_count: number of times the flow limit was triggered
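A sketch of pulling the squeeze column out of /proc/net/softnet_stat, assuming the third column is still time_squeeze on the kernel in question (as noted above, the layout can change between versions; strtonum requires gawk):

$ cat /proc/net/softnet_stat      # one hex row per CPU
$ awk '{ printf "CPU%d squeeze=%d\n", NR-1, strtonum("0x" $3) }' /proc/net/softnet_stat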

The following figure shows a related problem encountered in production, which was finally traced to the CPU level. Figure 1 is the output of the top command, showing the usage of each CPU; CPU4, in the red box, is anomalous, in particular the si usage in the penultimate column, which reached 89%. si is short for softirq and indicates the percentage of time the CPU spends on soft interrupt processing; in the figure, CPU4 was spending far too much time there. Figure 2 is the softnet_stat output for the same period: CPU4 corresponds to row 5, and its third column is significantly higher than that of the other CPUs, showing that it was frequently cut short while processing network data.

From the above, we inferred that CPU4 had suffered some performance degradation, perhaps due to poor quality or other reasons. To verify the degradation, we wrote a simple Python script that just increments a counter in an endless loop; on each run the script was bound to a particular CPU, and the time taken was compared across CPUs. The comparison showed that CPU4 took several times longer than the other CPUs, confirming the earlier suspicion. After operations replaced the CPU, the relevant indicators returned to normal.
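The same per-CPU check can be reproduced without Python; a sketch that times an identical busy loop pinned to each CPU in turn (the loop count is purely illustrative, and a clearly slower CPU stands out in the "real" times):

#!/usr/bin/env bash
# Time an identical CPU-bound loop on every CPU; compare the wall-clock times.
for cpu in $(seq 0 $(( $(nproc) - 1 ))); do
  echo -n "CPU$cpu: "
  taskset -c "$cpu" bash -c 'time for ((i = 0; i < 5000000; i++)); do :; done' 2>&1 | grep real
done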

Conclusion

All of the operations above happen while the packet travels from the network card into the kernel, before any common protocol is even touched; only the first step has been completed. A series of steps follows, such as packet aggregation (GRO), software load balancing across NIC queues (RPS), and RFS, which on top of RPS takes flow characteristics, i.e. the IP/port four-tuple, into account; finally the data is delivered to the IP layer and on to the familiar TCP layer.

In general, today's sharing has revolved around the driver. The point I want to emphasize is that the core of performance optimization lies in metrics: what cannot be measured is hard to improve. There must be metrics, so that every optimization is meaningful.