Kubernetes Services provide service discovery and load balancing between workloads in a cluster. The community implementation currently offers three modes: userspace, iptables, and IPVS. IPVS mode has the best performance, but there is still room for optimization: in this mode, DNAT is performed by the IPVS kernel module, while SNAT relies on nf_conntrack/iptables. nf_conntrack is designed as a general-purpose connection tracker; its complex internal state machine and processing flow incur significant performance overhead.
The Tencent TKE team developed a new IPVS-BPF mode that completely bypasses the nf_conntrack processing logic and implements SNAT with eBPF. For the most common scenario, a Pod accessing a ClusterIP, short-connection performance improves by 40% and p99 latency drops by 31%; NodePort scenarios improve even more, as detailed in the table below and in the Performance Measurement section.
I. Current state of container networking
The iptables mode
1. Poor scalability. Once the number of Services reaches the thousands, the performance of both the control plane and the data plane degrades sharply. On the control plane, the iptables interface requires traversing and rewriting all rules whenever a single rule is added, so control-plane cost is O(n²). On the data plane, rules are organized as a linked list that is matched linearly, so per-packet cost is O(n).
2. The load-balancing algorithm supports only random forwarding.
IPVS is designed specifically for load balancing. It manages Services in a hash table, so Service lookup takes O(1) time. However, the IPVS kernel module has no SNAT capability, so it borrows the SNAT functionality of iptables: after IPVS performs DNAT on a packet, the connection information is saved in nf_conntrack, and iptables performs the corresponding reverse SNAT. This mode is currently the best-performing choice for Kubernetes networking, but the complexity of nf_conntrack still causes a significant performance loss.
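The complexity difference between the two data planes can be illustrated with a toy model (Python; all names and the rule layout are hypothetical simplifications, not the real kernel data structures). An iptables-style chain scans its rules linearly, while an IPVS-style lookup hashes the Service key directly:

```python
# Toy model of iptables O(n) rule matching vs. IPVS O(1) service lookup.

iptables_rules = [  # linear chain: each packet scans from the top
    {"dst": f"10.0.0.{i}", "dport": 80, "target": f"svc-{i}"} for i in range(1000)
]

ipvs_services = {  # hash table keyed by (vip, port)
    (f"10.0.0.{i}", 80): f"svc-{i}" for i in range(1000)
}

def iptables_lookup(dst, dport):
    steps = 0
    for rule in iptables_rules:          # worst case: walk every rule
        steps += 1
        if rule["dst"] == dst and rule["dport"] == dport:
            return rule["target"], steps
    return None, steps

def ipvs_lookup(dst, dport):
    return ipvs_services.get((dst, dport)), 1   # single hash probe

print(iptables_lookup("10.0.0.999", 80))  # ('svc-999', 1000)
print(ipvs_lookup("10.0.0.999", 80))      # ('svc-999', 1)
```

With 1000 Services, matching the last rule in the chain costs 1000 comparisons, while the hash lookup is constant regardless of table size.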
II. Introduction to the IPVS-BPF scheme
eBPF is a virtual machine implemented in software inside the Linux kernel. Users compile eBPF programs into eBPF bytecode and load it at specific kernel attach points through the bpf() system call; execution of the eBPF instructions is triggered by specific events. When a program is loaded, the kernel fully verifies it to prevent eBPF code from compromising kernel security or stability. The kernel also JIT-compiles the eBPF bytecode into native instructions to reduce runtime overhead.
The kernel provides many eBPF attach points along the network processing path, such as XDP, qdisc (tc), TCP-BPF, and sockets. eBPF programs loaded at these points can call specific kernel helper APIs to inspect and modify network packets, and can store and exchange data through BPF map data structures.
The IPVS-BPF optimization scheme based on eBPF
To address the performance problems caused by nf_conntrack, the Tencent TKE team designed and implemented IPVS-BPF. The core idea is to bypass nf_conntrack and reduce the number of instructions needed to process each packet, saving CPU and improving performance. The main logic is as follows:
1. A switch is introduced into the IPVS kernel module to toggle between the native IPVS logic and the IPVS-BPF logic
2. In IPVS-BPF mode, the IPVS hook point is moved forward from LOCAL_IN to PREROUTING, so that requests to a Service bypass nf_conntrack
3. Session information is added to and deleted from an eBPF map in the IPVS connection-creation and connection-deletion code paths
4. eBPF SNAT code is attached at the qdisc hook and performs SNAT according to the session information in the eBPF map
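The division of labor in steps 3 and 4 can be sketched in user-space pseudocode (Python; the function and field names are hypothetical, and a dict stands in for the eBPF map — the real implementation is kernel C/eBPF). IPVS performs DNAT and records the session, and the eBPF program at the qdisc hook later rewrites the reply's source address from that shared map:

```python
# Illustrative sketch of the IPVS-BPF session flow (not kernel code).
# session_map stands in for the eBPF map shared between IPVS and the qdisc hook.

session_map = {}

def ipvs_dnat(pkt, vip, vport, backend_ip, backend_port):
    """IPVS connection creation: DNAT the request and record the session."""
    pkt = dict(pkt, dst=backend_ip, dport=backend_port)
    # key: the backend-side flow; value: the original virtual address
    session_map[(backend_ip, backend_port, pkt["src"], pkt["sport"])] = (vip, vport)
    return pkt

def qdisc_ebpf_snat(reply):
    """eBPF at the qdisc hook: restore the virtual address on the reply path."""
    key = (reply["src"], reply["sport"], reply["dst"], reply["dport"])
    if key in session_map:
        vip, vport = session_map[key]
        reply = dict(reply, src=vip, sport=vport)   # SNAT back to the VIP
    return reply

request = {"src": "10.1.0.5", "sport": 40000, "dst": "172.16.0.1", "dport": 80}
forwarded = ipvs_dnat(request, "172.16.0.1", 80, "10.1.0.9", 8080)
reply = {"src": "10.1.0.9", "sport": 8080, "dst": "10.1.0.5", "dport": 40000}
print(qdisc_ebpf_snat(reply))  # src restored to 172.16.0.1:80
```

Because the session is published to the map at connection creation, the reply path needs only a single map lookup instead of a full nf_conntrack traversal.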
In addition, ICMP and IP fragmentation require special handling; the background and details of fragmentation handling will be presented at the upcoming QCon online conference. You are welcome to discuss them with us.
Comparison of the packet processing flow before and after optimization
As the figures show, the packet processing flow is greatly simplified.
Why not implement everything in eBPF?
Many readers may ask: why integrate eBPF with the IPVS module instead of implementing all Service functionality directly in eBPF?
We studied this question carefully during the design phase; the main considerations were:
• nf_conntrack consumes far more CPU instructions and adds more latency than the IPVS module; it is the number-one performance killer on the forwarding path. IPVS itself was designed for high performance and is not the bottleneck
• IPVS is nearly 20 years old and widely used in production environments, with proven performance and maturity
• IPVS ages out its session table internally with timers, whereas eBPF does not support timers; a session table could only be co-maintained with user-space code
• IPVS supports a rich set of scheduling policies. Rewriting them in eBPF would require a large amount of code, and many scheduling policies need loop statements, which eBPF does not support
Our goal was an optimization scheme that could be implemented with a manageable amount of code. Based on the considerations above, we chose to reuse the IPVS module, bypass nf_conntrack, and implement SNAT with eBPF. The final data-plane code consists of 500+ lines of BPF code and 1000+ lines of IPVS module changes (mostly new code to manage the SNAT map).
III. Performance measurement
This section takes a quantitative approach: we read CPU performance counters with the perf tool to explain the macro-level performance numbers from a micro-level perspective. The load-generation tools used are wrk and iperf.
The test environment
To reproduce this test, note two points:
1. Different clusters and machines, even of the same model, may show baseline performance differences due to differing host-machine and rack topologies. To reduce such errors, we compared IPVS mode and IPVS-BPF mode using the same cluster, the same set of backend Pods, and the same LB node: we first measured performance in IPVS mode, then switched the LB node to IPVS-BPF mode and measured again. (Note: the mode switch was done by switching the control plane from kube-proxy to kube-proxy-bpf in the background; this is not supported as a product feature.)
2. The goal of this test is to measure the impact of the software-module optimization on Service access performance at the LB node, so the bandwidth and CPU of the client and of the backend (RS) servers must not become bottlenecks. Therefore, the LB node under test uses a 1-core instance and runs no backend Pod instances, while the nodes running the backend service use 8-core instances.
To collect metrics such as CPI, the LB node (in red) here is a Blackstone bare-metal machine on which only one core is enabled via hotplug, with the rest turned off.
Here the LB node (the node on the left) uses an SA2 1-core 1 GB instance.
Measurement results
Compared with IPVS mode, NodePort and ClusterIP short-connection performance in IPVS-BPF mode improved by 64% and 40% respectively.
NodePort benefits more because NodePort access requires SNAT, and our eBPF SNAT is more efficient than iptables SNAT.
As shown in the figure above, IPVS-BPF improved performance by 22% over IPVS mode in the iperf bandwidth test.
In the figure above, wrk tests show a 47% reduction in NodePort short-connection p99 latency.
In the figure above, wrk tests show a 31% reduction in ClusterIP short-connection p99 latency.
Instruction count and CPI
As the figure above shows, perf reports that the average number of CPU instructions per request decreased by 38% in IPVS-BPF mode. This is the main reason for the performance improvement.
CPI in IPVS-BPF mode increased slightly, by about 16%.
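These two counters are consistent with the measured throughput gains: cycles per request equal instructions per request times CPI, so a 38% drop in instructions combined with a 16% rise in CPI still yields roughly 28% fewer cycles per request. A back-of-the-envelope check:

```python
# cycles/request = (instructions/request) * CPI
instr_ratio = 1 - 0.38   # instructions per request down 38%
cpi_ratio   = 1 + 0.16   # cycles per instruction up 16%

cycles_ratio = instr_ratio * cpi_ratio
print(f"cycles per request: {cycles_ratio:.2f}x of IPVS mode")  # ~0.72x
```

That is, each request costs about 72% of the CPU cycles it did in IPVS mode, which lines up with the order of magnitude of the CPS improvements above.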
Service type    Short-connection CPS    Short-connection p99 latency    Long-connection throughput
ClusterIP       +40%                    -31%                            see below
NodePort        +64%                    -47%                            +22%
As the table above shows, compared with native IPVS mode, IPVS-BPF mode improves NodePort short-connection performance by 64%, reduces p99 latency by 47%, and improves long-connection bandwidth by 22%; for ClusterIP, short-connection throughput improves by 40% and p99 latency drops by 31%.
When testing ClusterIP long-connection throughput, iperf itself consumed 99% of the CPU, making the optimization effect hard to measure directly. We also observed the CPI increase in IPVS-BPF mode, which merits further study.
IV. Other optimizations, feature limitations, and follow-up
While developing the IPVS-BPF scheme, we also solved or optimized some other problems:
• The conn_reuse_mode=1 and "no route to host" problems
The problem is that when a client initiates a large number of new TCP connections, new connections are forwarded to terminating Pods, causing continuous packet loss. This does not occur when IPVS conn_reuse_mode=1; however, conn_reuse_mode=1 has another bug that drastically degrades the performance of newly created connections, so it is generally set to conn_reuse_mode=0. We have completely fixed both problems in the TencentOS kernel (commits ef8004f8, 8ec35911, 07a6e5ff63) and are in the process of submitting the fixes to the upstream kernel community.
• Occasional 5-second delays in DNS resolution
In iptables SNAT, the allocated lport is inserted into nf_conntrack using an optimistic locking mechanism; if two connections race to insert entries with the same lport and 5-tuple, packets are dropped. In IPVS-BPF mode, selecting the lport and inserting it into the hash table happen in the same loop, with a maximum of 5 retries, which reduces the probability of this problem.
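The retry loop described above can be sketched as follows (Python; the function names and table layout are hypothetical simplifications of the kernel logic). The key point is that insertion into the table doubles as the conflict check, so there is no separate window between choosing a port and publishing it:

```python
import random

MAX_TRIES = 5  # matches the maximum of 5 cycles described above

def snat_pick_lport(snat_table, remote):
    """Pick a local port and insert it into the SNAT hash table in one step
    (here the dict insert itself is the conflict check)."""
    for _ in range(MAX_TRIES):
        lport = random.randint(32768, 60999)  # typical ephemeral port range
        key = (lport, remote)                 # lport + remote endpoint
        if key not in snat_table:
            snat_table[key] = True            # insert succeeded -> port claimed
            return lport
    return None  # all 5 attempts collided; caller must handle the failure

table = {}
port = snat_pick_lport(table, ("10.0.0.2", 53))
print(port is not None)  # True: an empty table cannot collide
```

Bounding the loop at 5 iterations keeps the worst-case cost predictable while making a persistent collision (and hence a dropped packet) far less likely than a single optimistic insert.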
• The externalIPs optimization causing CLB health-check failures
See: https://github.com/kubernetes/… 07864
• When a Pod accesses its own Service, IPVS-BPF mode forwards the request to other Pods rather than back to the Pod itself
Follow-up work
• Further optimize ClusterIP performance with eBPF, using the approach proposed by Cilium
• Investigate the cause of the CPI increase in IPVS-BPF mode and explore further performance improvements
V. How to enable IPVS-BPF mode in TKE
As shown in the following figure, when creating a cluster in the Tencent Cloud TKE console, IPVS-BPF can be selected from the kube-proxy mode option under Advanced Settings.
Currently this feature requires a whitelist request; please submit your request through the application page.
VI. Related Patents
The related patent applications for this product are as follows:
A packet transmission method and related apparatus
Load balancing method, apparatus, device, and storage medium
A method for detecting idle network services using eBPF technology
An adaptive load-balancing scheduling algorithm based on real-time host load awareness