Serverless container service discovery

In September 2020, UCloud launched Cube, a Serverless container product with virtual machine-level security isolation, a lightweight system footprint, second-level startup, highly automated elastic scaling, and simple, straightforward usability. Through Virtual Kubelet, Cube connects seamlessly to UCloud’s managed Kubernetes product UK8S, which greatly enriches the elastic capability of Kubernetes clusters. As shown in the figure below, Virtual Kubelet registers a virtual node (VK Node) in the Kubernetes cluster, and each Cube instance is treated as a Pod on that VK Node.

However, Virtual Kubelet only implements elastic scaling of Cube instances in the cluster. For Cube instances to become full members of the K8s cluster, applications running in Cube must be able to take advantage of K8s Service discovery, i.e. access Service addresses.

Why not kube-proxy?

As we all know, kube-proxy implements load balancing of Service traffic in K8s. It constantly watches the mapping between Services and Endpoints and generates forwarding rules for Service IPs. It provides three forwarding mechanisms: userspace, iptables, and IPVS, of which userspace is no longer used due to its high performance cost.

However, we found it inappropriate to deploy kube-proxy directly inside the Cube virtual machine, for the following reasons:

1. **Binary size.** kube-proxy is written in Go, and the compiled binary is large. For example, in K8s v1.19.5 on Linux, the stripped kube-proxy ELF executable is 37MB. In an ordinary K8s environment this size is negligible; but for a Serverless product, where the virtual machine OS and image must be heavily trimmed to stay lightweight, it takes up far too much space. We wanted a proxy controller with a deployed size of no more than 10MB.

2. **Performance.** Also because it is written in Go, kube-proxy carries a higher performance cost than C/C++ or Rust, which have no GC and allow fine-grained control over underlying resources. Cube typically delivers fine-grained resource quotas, such as 0.5C / 500MiB, and we don’t want auxiliary components like kube-proxy to eat into them.

3. **IPVS issues.** Before eBPF became widely known, IPVS was considered the most reasonable implementation of the K8s Service forwarding plane. While iptables has long suffered from scalability problems, IPVS maintains stable forwarding performance and low rule-refresh latency as Services and Endpoints scale up.

But the truth is that IPVS is not perfect and even has serious problems.

For example, although both implement NAT, iptables completes DNAT in PREROUTING or OUTPUT, whereas IPVS requires INPUT and OUTPUT, so its path is longer. As a result, in Service IP stress tests with a small number of Services and Endpoints, IPVS scores worst on both bandwidth and short-connection request latency. In addition, the Service access failures during rolling updates related to the conn_reuse_mode parameter being set to 1 have still not been properly resolved as of this writing (April 2021).

4. **iptables issues.** Poor scalability, slow updates, O(n) rule lookups, and the same problems plague iptables-based NetworkPolicy. iptables versions below 1.6.2 do not even support fully random (--random-fully) port selection, which makes SNAT perform even worse in high-concurrency short-connection scenarios.

What does eBPF bring to container networks?

eBPF has been regarded as a revolutionary Linux technology in recent years. It allows developers to dynamically load and run their own sandboxed programs in the Linux kernel in real time, without changing kernel source code or loading kernel modules. Meanwhile, user-space programs can exchange data with eBPF programs in the kernel in real time through the bpf(2) system call and BPF map structures, as shown in the figure below.

A loaded eBPF program runs in the kernel in an event-triggered fashion. These events can be system call entry and exit, key points on the network transmit and receive path (XDP, tc, qdisc, socket), kernel function entry and exit (kprobes/kretprobes), user-space function entry and exit (uprobes/uretprobes), and so on. eBPF programs loaded at hook points on the network send and receive paths are typically used to inspect and modify network packets for load balancing, security policies, and monitoring/observability.
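As a minimal illustration of this kernel/user-space split (not code from Cilium or Cube; the map and program names are made up), the following eBPF program attaches to the XDP hook and counts packets in a BPF map that a user-space process can read at any time:

```c
// Minimal sketch: an eBPF program at the XDP hook that counts packets in a
// BPF map. Names are illustrative only.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);

    if (val)
        __sync_fetch_and_add(val, 1);   /* kernel side updates the map */
    return XDP_PASS;                    /* let the packet continue as usual */
}

char LICENSE[] SEC("license") = "GPL";
```

After loading this object (for example with libbpf), a user-space process can read pkt_count at any time via bpf_map_lookup_elem() on the map file descriptor, which is exactly the bpf(2)-based data exchange described above.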

The emergence of Cilium brought eBPF squarely into the K8s field of view, and it is profoundly changing K8s networking, security, load balancing, observability, and more. Starting from version 1.6, Cilium can replace kube-proxy 100%, implementing all of kube-proxy’s forwarding functions with eBPF. Let’s first look at the implementation of ClusterIP (east-west traffic).

The implementation of ClusterIP


For both TCP and UDP, the client side only needs DNAT for ClusterIP traffic: the frontend (ClusterIP and Service port) and the corresponding backend (PodIP and port) addresses are recorded in an eBPF map, and this table is the basis for the DNAT.
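A simplified sketch of what such a frontend-to-backend map could look like is shown below; the struct fields and map name are illustrative and do not reflect Cilium’s or CProxy’s actual data layout:

```c
// Illustrative sketch of a frontend -> backend eBPF map for ClusterIP DNAT.
// Field and map names are hypothetical, not an actual Cilium/CProxy layout.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
    __be32 cluster_ip;   /* frontend: ClusterIP (network byte order) */
    __be16 svc_port;     /* frontend: Service port */
    __u8   proto;        /* IPPROTO_TCP or IPPROTO_UDP */
    __u8   pad;
};

struct svc_backend {
    __be32 pod_ip;       /* backend: selected PodIP */
    __be16 pod_port;     /* backend: targetPort */
    __u16  pad;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct svc_key);
    __type(value, struct svc_backend);
} svc_map SEC(".maps");
```

A user-space controller keeps such a map in sync with the Services and Endpoints it watches from the API server, while the eBPF program consults it on the datapath.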

Where does this DNAT happen? With tc-bpf, the DNAT can be performed at tc egress, with tc ingress performing the reverse translation on return traffic, i.e. changing the source address from the real PodIP back to the ClusterIP. After the NAT, the checksums of the IP header and the TCP header have to be recalculated.
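A heavily simplified sketch of the egress half, under the assumptions of IPv4, TCP, no IP options, and a single hard-coded Service-to-Pod mapping (a real program would look up a map like svc_map above), might look like this:

```c
// Hedged sketch of ClusterIP DNAT at tc egress. Assumes IPv4, TCP, no IP
// options, and one hard-coded mapping; addresses and names are hypothetical.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define IP_DST_OFF   (ETH_HLEN + offsetof(struct iphdr, daddr))
#define IP_CSUM_OFF  (ETH_HLEN + offsetof(struct iphdr, check))
#define TCP_CSUM_OFF (ETH_HLEN + sizeof(struct iphdr) + 16) /* tcphdr->check */

SEC("tc")
int clusterip_dnat_egress(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    __be32 cluster_ip = bpf_htonl(0x0A60000A); /* 10.96.0.10, hypothetical */
    __be32 pod_ip     = bpf_htonl(0x0A0A0105); /* 10.10.1.5,  hypothetical */

    if (ip->daddr != cluster_ip)
        return TC_ACT_OK;

    /* Incrementally fix the TCP and IP checksums, then rewrite the address. */
    bpf_l4_csum_replace(skb, TCP_CSUM_OFF, cluster_ip, pod_ip,
                        BPF_F_PSEUDO_HDR | sizeof(pod_ip));
    bpf_l3_csum_replace(skb, IP_CSUM_OFF, cluster_ip, pod_ip, sizeof(pod_ip));
    bpf_skb_store_bytes(skb, IP_DST_OFF, &pod_ip, sizeof(pod_ip), 0);

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```

The tc ingress counterpart would do the mirror operation on reply traffic, swapping the Pod source address back to the ClusterIP and fixing the checksums the same way.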

If the Linux kernel supports cgroup v2, DNAT can instead be done at the cgroup v2 sockaddr hooks. cgroup v2 provides a BPF interception layer (BPF_PROG_TYPE_CGROUP_SOCK_ADDR) for socket system calls that reference L4 addresses, such as connect(2), sendmsg(2), and recvmsg(2). These BPF programs can rewrite the destination address before the packet is even generated, as shown in the figure below.

For TCP and connected UDP traffic (i.e. UDP sockets on which connect(2) has been called), only a single forward translation is needed: the BPF program changes the destination address of outgoing traffic to the Pod address. In this scenario load balancing is the most efficient, because the overhead is one-off and the effect lasts for the lifetime of the connection.
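A minimal sketch of this forward translation at the cgroup/connect4 hook, with a single hard-coded, purely hypothetical mapping (a real implementation would look up a service map and pick a backend):

```c
// Hedged sketch: rewrite a ClusterIP destination to a backend PodIP at
// connect(2) time. Addresses, ports, and names are hypothetical.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("cgroup/connect4")
int clusterip_connect4(struct bpf_sock_addr *ctx)
{
    __be32 cluster_ip = bpf_htonl(0x0A60000A); /* 10.96.0.10, hypothetical */
    __be16 svc_port   = bpf_htons(80);

    if (ctx->user_ip4 == cluster_ip && ctx->user_port == svc_port) {
        /* Forward translation: the socket connects straight to the Pod. */
        ctx->user_ip4  = bpf_htonl(0x0A0A0105); /* 10.10.1.5, hypothetical */
        ctx->user_port = bpf_htons(8080);
    }
    return 1; /* allow the connect(2) to proceed */
}

char LICENSE[] SEC("license") = "GPL";
```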

For unconnected UDP traffic, a reverse translation is also required: for inbound replies coming from the Pod, the source address must be rewritten back to the ClusterIP. Without this step, UDP applications based on recvmsg(2) would not see messages coming from the ClusterIP, because the socket’s peer address has already been rewritten to the Pod address. The traffic flow is shown in the figure below.
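As a companion to the diagram, a sketch of this reverse translation at the cgroup/recvmsg4 hook, again with a hard-coded, hypothetical mapping:

```c
// Hedged sketch: report the ClusterIP, not the backend PodIP, as the peer
// address of unconnected UDP replies. Addresses and names are hypothetical.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("cgroup/recvmsg4")
int clusterip_recvmsg4(struct bpf_sock_addr *ctx)
{
    __be32 pod_ip   = bpf_htonl(0x0A0A0105); /* 10.10.1.5,  hypothetical */
    __be16 pod_port = bpf_htons(8080);

    if (ctx->user_ip4 == pod_ip && ctx->user_port == pod_port) {
        /* Reverse translation: the application sees the Service address. */
        ctx->user_ip4  = bpf_htonl(0x0A60000A); /* 10.96.0.10, hypothetical */
        ctx->user_port = bpf_htons(80);
    }
    return 1;
}

char LICENSE[] SEC("license") = "GPL";
```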

To sum up, this is address translation that is transparent to the user: the client thinks it is connecting to the Service, but the actual TCP connection points directly at the Pod. An instructive comparison: with kube-proxy, running tcpdump inside the Pod still shows the ClusterIP as the destination, because the IPVS or iptables rules live on the host; with Cilium, tcpdump inside the Pod already shows the backend Pod as the destination. This NAT is accomplished without conntrack, has a shorter forwarding path, and performs better than IPVS and iptables. Compared with tc-BPF it is also lighter, since no checksums need to be recalculated.

Cube’s Service discovery


For each Serverless container group that needs ClusterIP access, Cube starts an agent program called CProxy, which implements the core functionality of kube-proxy. Since Cube’s lightweight virtual machine image uses a recent Linux kernel, CProxy adopts the cgroup v2 socket-hook approach described above for ClusterIP forwarding. CProxy is developed in Rust, and the compiled binary is less than 10MiB; its runtime overhead is also significantly lower than kube-proxy’s. The deployment structure is shown below.
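For illustration only (CProxy itself is written in Rust, and this is not its code): attaching a compiled cgroup/connect4 program like the sketch above to the cgroup v2 root with libbpf could look roughly like this, with the object-file and program names being hypothetical:

```c
// Illustration only: load a cgroup/connect4 object with libbpf and attach it
// to the cgroup v2 root, so every socket in that cgroup is subject to the DNAT.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <bpf/libbpf.h>

int main(void)
{
    /* "clusterip_connect4.bpf.o" and the program name are hypothetical. */
    struct bpf_object *obj = bpf_object__open_file("clusterip_connect4.bpf.o", NULL);
    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open/load BPF object\n");
        return 1;
    }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "clusterip_connect4");
    int cgroup_fd = open("/sys/fs/cgroup", O_RDONLY); /* cgroup v2 root */
    if (!prog || cgroup_fd < 0)
        return 1;

    struct bpf_link *link = bpf_program__attach_cgroup(prog, cgroup_fd);
    if (!link) {
        fprintf(stderr, "failed to attach to cgroup\n");
        return 1;
    }

    pause(); /* keep the program attached while this process runs */
    bpf_link__destroy(link);
    close(cgroup_fd);
    bpf_object__close(obj);
    return 0;
}
```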

Here are some comparisons. We used wrk to run a 2000-concurrency HTTP short-connection test against ClusterIP, once with 10 Services and once with 5000 Services, and observed the request latency (unit: ms).

The conclusion is that CProxy performs best regardless of whether the number of Services is small or large. IPVS performs much better than iptables when the number of Services is large, but not as well when the number is small.

SVC number = 10

SVC number = 5000

In the future, we will continue to improve eBPF-based LoadBalancer forwarding and eBPF-based NetworkPolicy.

UCloud container products embrace eBPF


eBPF is changing the cloud ecosystem. Going forward, UCloud’s container product UK8S and Serverless container product Cube will closely follow the latest progress in the industry and explore eBPF applications in networking, load balancing, monitoring, and other areas, providing users with better observation, troubleshooting, and tuning capabilities.


If you are interested in Cube, please scan the QR code to join the Cube testing communication group!