  • Author: DavidDi
  • Original link: www.ebpf.top/post/ebpf_n…
  • **Copyright Notice:** This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. For non-commercial reprint, please indicate the source (author and a link to the original). For commercial reprint, please contact the author for authorization.

1. Introduction

Qutoutiao has been running containerized workloads for more than a year: nearly 1,000 services have been containerized, and the microservice cluster has grown to more than 1,000 nodes. As the number of containerized services and the cluster size keep increasing, beyond the routine tuning of API Server parameters, the Scheduler, and so on, we recently ran into network jitter caused by the IPVS module that underpins Kubernetes load balancing. This article summarizes the analysis, troubleshooting, and resolution of the whole problem, in the hope of offering a path for solving similar issues.

The Kubernetes cluster and host operating system versions involved are as follows:

  • Kubernetes: Alibaba Cloud ACK 1.14.8; the network model is the Terway CNI plugin in terway-eniip mode;
  • Operating system: CentOS 7.7.1908, kernel 3.10.0-1062.9.1.el7.x86_64.

2. The network jitter anomaly

During initial testing of service A, which had just been deployed into the container cluster, we found that its call latency to downstream service B (in the same container cluster, reached through service registration and discovery) occasionally jittered at the 999 line. The test QPS was fairly small, so the jitter was obvious in the business monitoring, with the maximum latency reaching 200 ms.

Figure 2-1 Service invocation delay

Services call each other over gRPC, and node discovery is based on service registration and discovery in Consul. After capturing and analyzing packets inside the service A container, we ran the following comparison tests:

  • Jitter still existed for the abnormal nodes registered in service B;

  • Switching the latency test to an HTTP interface did not improve the jitter;

  • Deploying service A on a VM (ECS) for testing did not improve the jitter either.

After these comparison tests, we gradually narrowed the scope down to jitter in the underlying network of the host where service B is located.

After repeated ping tests, we found that the pattern of ping latency jitter between host A and host B closely matched the pattern of the service invocation latency. Since ping packets are much simpler and more direct to analyze than gRPC traffic, we switched to ping testing of the underlying network.

The following figure shows the environment in which the problem could be stably reproduced: pinging the container instance 172.23.14.144 on host B from host A exhibits the latency jitter.

```bash
# Pod IP address on host B
# ip route | grep 172.23.14.144
172.23.14.144 dev cali95f3fd83a87 scope link
```

![ping_host_container](imgs/ping_host_container.png)

Figure 2-2 Topology of hosts and containers involved in a ping test

The comparison between pinging eth1 on host B and pinging the container-side cali-xxx interface is as follows:

Figure 2-3 Ping comparison of the host network and the container network

Across multiple tests, pinging the host network of node B showed no jitter, while pinging the container network (cali-xxx) showed large jitter, up to 133 ms.

During the ping test, we captured packets on host A and host B with tcpdump and found a 133 ms delay between eth1 and cali95f3fd83a87 on host B.

Figure 2-4 Ping packet delay on host B
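The comparison can be reproduced with captures like the following (a hedged sketch; the interface names are the ones from this environment, and the two captures are meant to run side by side so the timestamps of the same icmp seq can be compared):

```bash
# Run in two terminals on host B: capture the same ICMP stream on the host NIC
# and on the pod's cali interface, then compare timestamps of matching icmp seq numbers.
tcpdump -i eth1 -nn icmp and host 172.23.14.144
tcpdump -i cali95f3fd83a87 -nn icmp
```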

At this point the problem had become clearer: ping packets received on host B are delayed by more than 100 ms while being forwarded internally. So what causes this forwarding delay on host B?

3. Problem analysis

Before digging into the ping forwarding delay, let's briefly review how the kernel processes network packets and the path the data takes.

3.1 How the kernel processes network packets

In the kernel, network device drivers receive and process packets via interrupts. When data arrives at the network adapter, a hardware interrupt is triggered to notify the CPU; the handler for this kind of interrupt is commonly called an ISR (Interrupt Service Routine). An ISR must not do too much work, otherwise the device's interrupts cannot be serviced in time, so Linux splits interrupt handling into a top half and a bottom half. The top half does only the minimum amount of work, finishes quickly, and releases the CPU; the bulk of the work is left to the bottom half, whose logic is handled by kernel threads at an appropriate time.

Since Linux 2.4, the bottom half has been implemented with softirqs, which are handled entirely by the ksoftirqd kernel threads; under normal conditions, each CPU core has its own softirq pending bitmap and its own ksoftirqd kernel thread. Raising a softirq simply sets the corresponding bit in memory. Softirqs are processed at two main points:

  • when a hardware interrupt exits, in irq_exit();
  • when the ksoftirqd kernel thread is woken up to handle pending softirqs.
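A quick way to see these per-CPU ksoftirqd threads on a machine (purely illustrative):

```bash
# One ksoftirqd kernel thread per CPU core; the PSR column shows the CPU each thread runs on
ps -eo pid,psr,comm | grep ksoftirqd
```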

Common types of soft interrupts are as follows:

```c
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    ...
};
```

Code 3-1 Linux soft interrupt type

Priority decreases from top to bottom, so HI_SOFTIRQ has the highest priority. NET_TX_SOFTIRQ corresponds to sending network packets and NET_RX_SOFTIRQ to receiving them; together they handle the transmit and receive paths.

The handler registered for NET_RX_SOFTIRQ is net_rx_action(). net_rx_action() calls the poll function registered by the network device driver; IP packets are then handed to ip_rcv(), and if the upper-layer protocol is ICMP, icmp_rcv() takes over the subsequent processing.

Figure 3-1 Diagram for receiving data packets from the network adapter

```c
// net/core/dev.c
static int __init net_dev_init(void)
{
    ......
    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);   // list of poll functions run during softirq processing
        //......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);   // register the softirq for packet transmission
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);   // register the softirq for packet reception
}
subsys_initcall(net_dev_init);
```

Code 3-2 softnet_data initialization and network softirq registration

In most scenarios, the latency of sending and receiving network data is related to how the system processes softirqs, so we focused on softirq behavior during the ping jitter. We used the BCC-based traceicmpsoftirq.py script to help locate the kernel context in which the ping packets are processed.

The traceicmpsoftirq.py script depends on the BCC library, so the BCC project needs to be installed first.

The way the traceicmpsoftirq.py script is written differs between the Linux 3.10 kernel and Linux 4.x kernels.

Using traceicmpsoftirq.py on host B, we found that the kernel thread running when the jitter occurs is ksoftirqd/0.

```bash
# host A
# ping -c 150 -i 0.01 172.23.14.144 | grep -E "[0-9]{2}[\.0-9]+ ms"

# host B
# ./traceicmpsoftirq.py
tgid    pid     comm            icmp_seq
...
0       0       swapper/0       128
6       6       ksoftirqd/0     129
6       6       ksoftirqd/0     130
...
```

Code 3-3 traceicmpsoftirq.py output when pinging the container IP on host B

The result ksoftirqd/0 gives us two important pieces of information:

  • When pinging the container IP on host B from host A, every packet is processed on CPU#0;
  • When the delay occurs, CPU#0 is running the softirq-handling kernel thread ksoftirqd/0, i.e. the packets are processed in the ksoftirqd softirq path rather than the other softirq entry point described above, irq_exit() on hardware interrupt exit.

Since pinging the container IP on host B always lands on CPU#0, and our earlier tests showed no jitter when pinging host B's host IP, the host IP traffic presumably lands on a CPU other than #0. We continued the test against host B's host IP address:

```bash
# host A
# ping -c 150 -i 0.01 172.23.14.144 | grep -E "[0-9]{2}[\.0-9]+ ms"

# host B
# ./traceicmpsoftirq.py
tgid    pid     comm            icmp_seq
...
0       0       swapper/19      55
0       0       swapper/19      56
0       0       swapper/19      57
...
```

Code 3-4 traceicmpsoftirq.py output when pinging host B's host IP address

Actual testing confirmed that pinging host B's host IP address always lands on CPU#19. At this point we can be sure that CPU#0 and CPU#19 carry different softirq-processing loads, but this raises another question: why do our ping packets always land on the same CPU core? The reason is that RPS is enabled by default on the host. RPS (Receive Packet Steering) is a kernel patch contributed by Google engineer Tom Herbert and merged in Linux 2.6.35; it emulates the behavior of multi-queue NICs in software, spreading the receive load across the CPUs of a multi-CPU system and steering softirq processing to individual CPUs without hardware support, which greatly improves network performance. Put simply, the softirq handler net_rx_action() hashes the received packet headers (for example the source IP address and port) and, according to the RPS configuration, steers the packet to the corresponding CPU core; see the get_rps_cpu() function for the details of the algorithm.

To check the RPS configuration in Linux, run the following command:

```bash
# cat /sys/class/net/*/queues/rx-*/rps_cpus
```
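Each value is a per-receive-queue CPU bitmask. As an illustration (the interface and queue names below are examples, not taken from the hosts above):

```bash
# Each rps_cpus file holds a hexadecimal bitmask of the CPUs that may process packets
# steered from that receive queue (bit 0 = CPU#0, bit 1 = CPU#1, ...).
cat /sys/class/net/eth0/queues/rx-0/rps_cpus
# e.g. a value of fffff has bits 0-19 set, so packets can be steered to CPU#0..CPU#19

# Restricting this queue to CPUs 1-3 only (mask 0x0e) would look like:
echo 0e > /sys/class/net/eth0/queues/rx-0/rps_cpus
```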

Putting the above together, we concluded that the problem lies in how CPU#0 handles softirqs in its kernel thread.

3.2 CPU softirq processing

Next we focus on the kernel-mode performance metrics of CPU#0 to see whether some function running there is delaying softirq processing.

First, we use the perf command to analyze the kernel mode usage of CPU#0.

```bash
# perf top -C 0 -U
```

Figure 3-2 perf top CPU#0 kernel performance data

With perf top we noticed that the estimation_timer function accounts for a very high share of CPU#0's kernel time. We also generated a flame graph for CPU#0, and the result was largely consistent with the perf top output.

Figure 3-3 Flame graph of estimation_timer on CPU#0 in the kernel

To figure out the kernel-side cost of estimation_timer, we used the funcgraph tool from Brendan Gregg's open-source perf-tools project to analyze the call graph of estimation_timer in the kernel and the time it occupies.

```bash
# ./funcgraph -m 1 -a -d 6 estimation_timer
```

Figure 3-4 Kernel call graph of the estimation_timer function

We also noticed that a single traversal in estimation_timer on CPU#0 lasted 119 ms. That is far too long for softirq context and is bound to delay the processing of other softirqs.

To further confirm the softirq situation on CPU#0, we used the softirqs.py script from the BCC project (slightly modified locally) to observe how the softirq counts on CPU#0 change and how their execution time is distributed. The softirq counts on CPU#0 were not growing unusually fast, but the TIMER histogram was clearly abnormal: analyzing the timer data within a 10 s window showed 5 records whose execution time fell in the 65–130 ms range. This matches exactly the estimation_timer latency on CPU#0 captured with the funcgraph tool.

```bash
# -d prints histograms, "10 1" aggregates over one 10 s interval;
# -C is a parameter we added ourselves to filter on CPU#0
# /usr/share/bcc/tools/softirqs -d 10 1 -C 0
```

Figure 3-5 Histogram of TIMER softirq execution time on CPU#0
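Without BCC, a coarser cross-check is to watch the raw per-CPU softirq counters exposed by the kernel; this only shows how fast the counts grow, not the latency histogram the script produces:

```bash
# Per-CPU softirq counters: columns are CPUs, rows are softirq types.
watch -d 'grep -E "TIMER|NET_RX" /proc/softirqs'
```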

From the analysis above we know that estimation_timer comes from the IPVS module (see Figure 3-4). Load balancing in the Kubernetes kube-proxy component is based on the IPVS module, so the problem essentially originates from the kube-proxy process.

We kept only the container instance under test on host B, stopped the kubelet service, and then manually stopped the kube-proxy container process. After retesting, the ping latency jitter disappeared as expected.
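The verification amounts to the following steps (a rough sketch; the exact commands for stopping the kube-proxy container depend on the container runtime, and the container id is a placeholder):

```bash
systemctl stop kubelet                  # keep kubelet from restarting kube-proxy
docker ps | grep kube-proxy             # locate the kube-proxy container on host B
docker stop <kube-proxy-container-id>   # stop it manually

# Re-run the ping test from host A against the pod IP on host B
ping -c 150 -i 0.01 172.23.14.144 | grep -E "[0-9]{2}[\.0-9]+ ms"
```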

The root cause can now be pinned down: the estimation_timer function in the IPVS kernel module used by kube-proxy takes a long time to execute, which delays the processing of network softirqs and thus causes the ping jitter. So what does estimation_timer [ipvs] actually do, and why does it take so long?

3.3 The IPVS estimation_timer timer

The mystery will finally be revealed!

The estimation_timer() [ipvs] timer is set up via __ip_vs_init in ip_vs_core.c whenever a network namespace is created.

```c
/*
 * Initialize IP Virtual Server netns mem.
 */
static int __net_init __ip_vs_init(struct net *net)
{
    struct netns_ipvs *ipvs;
    // ...
    if (ip_vs_estimator_net_init(ipvs) < 0)   // initialize the estimator
        goto estimator_fail;
    // ...
}
```

Code 3-5 IPVS initialization function

The ip_vs_estimator_net_init function is defined in file ip_vs_est.c as follows:

```c
int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs)
{
    INIT_LIST_HEAD(&ipvs->est_list);
    spin_lock_init(&ipvs->est_lock);
    timer_setup(&ipvs->est_timer, estimation_timer, 0);   // set estimation_timer as the timer callback
    mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);        // start the first timer, firing after 2 seconds
    return 0;
}
```

Code 3-6 IPVS estimator initialization function

estimation_timer is also defined in the ip_vs_est.c file:

```c
static void estimation_timer(struct timer_list *t)
{
    // ...
    spin_lock(&ipvs->est_lock);
    list_for_each_entry(e, &ipvs->est_list, list) {
        s = container_of(e, struct ip_vs_stats, est);

        spin_lock(&s->lock);
        ip_vs_read_cpu_stats(&s->kstats, s->cpustats);

        /* scaled by 2^10, but divided 2 seconds */
        rate = (s->kstats.conns - e->last_conns) << 9;
        e->last_conns = s->kstats.conns;
        e->cps += ((s64)rate - (s64)e->cps) >> 2;
        // ...
    }
    spin_unlock(&ipvs->est_lock);
    mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);   // start a new round of statistics after 2 seconds
}
```

Code 3-7 ipvs estimation_timer function

As the implementation shows, estimation_timer first takes a spin_lock and then traverses all the IPVS estimator entries in the current network namespace. For historical reasons there are a great many Services in our production cluster, so this traversal takes a long time.

The statistics gathered by this function are what ipvsadm --stats ultimately displays (Conns InPkts OutPkts InBytes OutBytes):

```bash
# ipvsadm -Ln --stats
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port        Conns InPkts OutPkts InBytes OutBytes   # related statistics
  -> RemoteAddress:Port
TCP  10.85.0.10:9153              0      0       0       0        0
  -> 172.22.34.187:9153           0      0       0       0        0
```

The number of IPVS rules in our cluster is around 30,000.

```bash
# ipvsadm -Ln --stats | wc -l
```
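A rough back-of-the-envelope check ties the numbers together: assuming the single 119 ms traversal measured with funcgraph covers roughly these 30,000 entries, that is about 119 ms / 30,000 ≈ 4 µs per entry for the per-entry locking and per-CPU stats aggregation, which is why the total cost grows linearly with the number of Services and endpoints.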

Since every network namespace has its own estimation_timer traversal, why does only CPU#0 end up traversing so many rules?

This is because only the host network namespace holds the full set of IPVS rules, which can be verified by running ipvsadm -Ln in the host network namespace. CPU#0 happens to be the CPU that processes the host network namespace's IPVS timer, chosen when the IPVS module was loaded; which CPU that turns out to be is essentially random.
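This is easy to confirm by comparing rule counts between the host network namespace and a pod's network namespace (a sketch; <pod-pid> is a placeholder for any pod's PID on the host):

```bash
# Host network namespace: the full rule set (about 30,000 lines here)
ipvsadm -Ln | wc -l

# Inside a pod's network namespace: only a handful of lines, if any
nsenter -t <pod-pid> -n ipvsadm -Ln | wc -l
```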

4. Problem solving

4.1 Solution

By now the problem had been thoroughly located. For historical reasons in how our services were originally deployed, reducing the number of Services within a short time would mean a large amount of migration work, and a large share of the rules are generated for the cloud vendor's SLB, so the rule count cannot simply be eliminated. We considered the following options:

  1. Dynamically detect which CPU handles the ipvs estimation_timer of the host network namespace and exclude that CPU from the RPS mapping (see the sketch after this list);

    This requires adjusting the RPS configuration. The CPU core handling the host network namespace's IPVS timer is not fixed, so it has to be detected and the configuration adjusted accordingly, and a change of that CPU after a restart also has to be handled.

  2. Since we do not need the statistics IPVS collects, the problem can be avoided by modifying the IPVS kernel module;

    Modifying the IPVS module requires reloading it, which may briefly interrupt services on the host.

  3. Change the IPVS module so that the statistics traversal is moved out of the kernel timer into a separate kernel thread.

    Traversing all IPVS rules inside a kernel timer is not appropriate at the scale IPVS reaches when used with Kubernetes, and the community will likely need to move this traversal out of the timer, but that requires upstream effort and is not achievable in the short term.
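For option 1, the adjustment referenced above would look roughly like the following (an illustrative sketch only: it assumes the detected core is CPU#0, a single NIC named eth1, and masks that fit in one 32-bit group):

```bash
# Drop CPU#0 (bit 0) from every rx queue's RPS mask on eth1, so steered packets
# avoid the core running the host-netns estimation_timer.
# Hosts with more than 32 CPUs use comma-separated mask groups and need extra handling.
for q in /sys/class/net/eth1/queues/rx-*/rps_cpus; do
    cur=0x$(cat "$q")
    printf '%x\n' $(( cur & ~0x1 )) > "$q"
done
```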

Comparing the three options, none of them solves the current jitter problem easily. Weighing production stability against implementation difficulty, we finally turned to kpatch, a Linux kernel live-patching solution. The livepatch capability implemented by kpatch applies fixes to the running kernel in real time, without rebooting the system.

4.2 kpatch livepatch

kpatch is a Linux kernel live-patching tool from Red Hat. The earliest hot-patching tool was Ksplice, but after Ksplice was acquired by Oracle, distribution vendors had to develop their own tools: kpatch from Red Hat and kGraft from SUSE. Driven by these two vendors, the livepatch infrastructure was merged into kernel 4.0. Although kpatch is developed by Red Hat, it also supports Ubuntu, Debian, Oracle Linux, and other distributions.

Here we briefly go through the steps we followed; more documentation is available in the kpatch project.

4.2.1 Obtaining, compiling, and installing kpatch

```bash
$ git clone https://github.com/dynup/kpatch.git
$ cd kpatch
$ source test/integration/lib.sh
# kpatch_dependencies installs the required packages via yum;
# how long this takes depends on the network
$ sudo kpatch_dependencies
# installs into /usr/local; note that kpatch-build ends up in /usr/local/bin/
# and kpatch in /usr/local/sbin/
$ sudo make install
```

4.2.2 Generating the kernel source patch

Using kpatch requires the source code of the running kernel; refer to the usual instructions for fetching the kernel source that matches your kernel version.

```bash
$ rpm2cpio kernel-3.10.0-1062.9.1.el7.src.rpm | cpio -div
$ xz -d linux-3.10.0-1062.9.1.el7.tar.xz
$ tar xvf linux-3.10.0-1062.9.1.el7.tar
$ cp -ra linux-3.10.0-1062.9.1.el7/ linux-3.10.0-1062.9.1.el7-patch
```

Here we change the implementation of the estimation_timer function into an empty one:

```c
static void estimation_timer(unsigned long arg)
{
    printk("hotfix estimation_timer patched\n");
    return;
}
```

Then generate the corresponding patch file:

```bash
# diff -u linux-3.10.0-1062.9.1.el7/net/netfilter/ipvs/ip_vs_est.c \
         linux-3.10.0-1062.9.1.el7-patch/net/netfilter/ipvs/ip_vs_est.c > ip_vs_timer_v1.patch
```

4.2.3 Building the kernel patch and applying the livepatch

Then build the patch .ko module and apply it to the kernel:

```bash
# /usr/local/bin/kpatch-build ip_vs_timer_v1.patch --skip-gcc-check --skip-cleanup -r
# after the build, livepatch-ip_vs_timer_v1.ko is generated in the current directory
# /usr/local/sbin/kpatch load livepatch-ip_vs_timer_v1.ko
```

Check the kernel log

```bash
$ dmesg -T
[Thu Dec  3 19:50:50 2020] livepatch: enabling patch 'livepatch_ip_vs_timer_v1'
[Thu Dec  3 19:50:50 2020] livepatch: 'livepatch_ip_vs_timer_v1': starting patching transition
[Thu Dec  3 19:50:50 2020] hotfix estimation_timer patched
```

At this point our livepatch has successfully replaced the estimation_timer invocation, and everything looks good. Running the funcgraph tool again shows that estimation_timer no longer appears in the call graph.

Note that making the function an empty implementation effectively disables estimation_timer for good: because the patched function no longer re-arms the timer, even unloading the livepatch does not bring the calls back. In a production environment it is therefore better to keep re-arming the timer, but with an acceptably long interval instead of 2 s, say 5 minutes; then after the livepatch is unloaded, estimation_timer calls resume within 5 minutes.

4.3 Notes on using kpatch

  • The livepatch is a .ko kernel module built against a specific kernel version; you must make sure the kernel version of the machine applying the livepatch is exactly the same as the one it was built for (a quick check is sketched after this list).

  • A manually loaded livepatch does not survive a reboot; to keep the patch in effect after the machine restarts, install it and enable the kpatch service:

    /usr/local/sbin/kpatch install livepatch-ip_vs_timer_v1.ko

    systemctl start kpatch

  • To apply the livepatch on other machines, you only need the kpatch tool, livepatch-ip_vs_timer_v1.ko, and kpatch.service (so that the patch takes effect again after install and reboot).
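A minimal version check before loading, as referenced in the first note above (the module name is the one used in this article):

```bash
# Make sure the running kernel matches the kernel the livepatch module was built against
uname -r
modinfo livepatch-ip_vs_timer_v1.ko | grep vermagic
```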

5. Summary

Network jitter problems involve the application layer, the network protocol stack, and the kernel's internal mechanisms all at once. Troubleshooting them is a step-by-step process of checking and gradually narrowing the scope, and having the right tools along the way is essential. In this investigation, BPF-based tools played a vital role in pointing us in the right direction. BPF gives us far more flexible data collection and analysis capabilities for observing and tracing the kernel; we already use it widely in production, for example to monitor low-level network retransmissions and jitter, and it has greatly improved our efficiency in troubleshooting unexpected problems. We hope more people can benefit from BPF technology.