About the authors

Wei Houmin, backend development engineer at Tencent Cloud, focuses on the container, Kubernetes, Cilium, and other open source communities, and is responsible for the TKE hybrid cloud container network and related work.

Zhao Qiyuan, senior engineer at Tencent Cloud, is mainly responsible for the design and development of the Tencent Cloud container network.

Preface

Hybrid cloud is not a new concept, but with the development of container technology, the combination of hybrid cloud and containers is receiving more and more attention. Cloud native technologies such as containers can shield the differences in the underlying compute infrastructure of heterogeneous clusters in a hybrid cloud and unify multi-cloud, IDC, and even edge scenarios. The hybrid cloud is no longer a simple combination of public cloud and private cloud, but a distributed cloud with ubiquitous computing workloads, and it can fully show its advantages in scenarios such as resource expansion, multi-active disaster recovery (DR), and mixed multi-cluster deployment.

The Tencent Cloud TKE managed cluster service has launched support for adding third-party IDC computing nodes to public clusters. This service lets customers reuse their IDC computing resources, frees them from the cost of building, operating, and maintaining Kubernetes clusters locally, and maximizes the utilization of computing resources.

In implementing this solution, the key is to connect the network between the IDC and the public cloud. A Kubernetes cluster may contain computing nodes in different network environments, such as IDC and VPC. To shield the underlying differences between these network environments, TKE proposes a hybrid cloud container network solution: a unified network plane is presented at the container level, so a Pod does not need to care whether it is running on an IDC computing node or a public cloud computing node.

The TKE hybrid cloud container network supports both an Overlay network based on VxLAN tunneling and an Underlay network based on direct routing. Customers who do not want to change their IDC infrastructure can use the Overlay network; customers with high performance requirements for the hybrid cloud container network can use the Underlay network based on direct routing. This article details the challenges container networks face in hybrid cloud and the corresponding solutions, and introduces the implementation of the Overlay network in the TKE hybrid cloud container network. The implementation of the TKE hybrid cloud Underlay container network will be covered in a separate article, so stay tuned.

Challenges faced by hybrid cloud container networks

In a hybrid cloud scenario, the components of a Kubernetes cluster may be distributed across different network planes:

  • Master network: the network plane where control plane components such as the ApiServer run
  • VPC network: the network plane where the TKE public cluster's computing nodes on the cloud reside
  • IDC network: the network plane where the customer's IDC computing nodes reside

In such a complex hybrid cloud scenario, connecting the links between these different network planes poses several challenges for container network design:

Interconnecting the VPC network and the IDC network

In a hybrid cloud scenario, a Kubernetes cluster may contain both public cloud computing nodes in the VPC network and computing nodes in the IDC network. Connecting the node networks across these different environments is the foundation for communication in the container network layered on top of them.

The IDC network proactively accesses the Master network

In a Kubernetes cluster, the most common scenario is that the kubelet on each computing node connects to the ApiServer in the Master network to obtain cluster state and report node information. This requires the IDC network to be able to proactively access the Master network.

The Master network proactively accesses the IDC network

To debug applications in a Kubernetes environment, we often use commands such as kubectl logs and kubectl exec to fetch application Pod logs or log in to the Pod's running environment directly. Taking kubectl exec as an example, the figure below shows how such commands work: when kubectl exec is executed, a request is first sent to the ApiServer, the ApiServer forwards the request to the kubelet on the node where the Pod runs, and the kubelet then forwards it to the exec interface of the container runtime.

For this mechanism to work, a network path is required between the ApiServer in the Master network and the kubelet on the computing node, so that the ApiServer can proactively access the kubelet. Besides kubectl exec and kubectl logs, the kube-scheduler Extender mechanism and the ApiServer Admission Webhook mechanism also rely on this network connection between the Master network and the computing nodes.
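As an illustration of this request path, here is a minimal client-go sketch that does the equivalent of kubectl exec. The pod name, namespace, command, and kubeconfig path are placeholders; the point is that the client only talks to the ApiServer, which in turn must be able to reach the kubelet on the Pod's node.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	// Load the usual kubeconfig; path and pod details below are placeholders.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("HOME")+"/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The exec request goes to the ApiServer, which then opens a streaming
	// connection to the kubelet on the node where the Pod runs.
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("default").
		Name("demo-pod").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{"sh", "-c", "hostname"},
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		panic(err)
	}
	if err := exec.Stream(remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	}); err != nil {
		panic(err)
	}
}
```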

How to shield underlying network differences and unify the container network

In a hybrid cloud scenario, a Kubernetes cluster may contain public cloud nodes in a VPC network, IDC nodes in an IDC network, public cloud nodes from other cloud vendors, and even edge nodes in edge scenarios. Customers sometimes do not want to change the basic network settings of their IDC environment, yet still want a unified container network.

TKE hybrid cloud network solution

To address the challenges container networks face in hybrid cloud scenarios, the Tencent Cloud container team designed the TKE hybrid cloud container network solution with Cilium as the cluster network base. Cilium redesigns the cloud native network on top of eBPF, bypassing iptables, and provides a complete solution covering networking, observability, and security. Cilium supports both tunnel-based Overlay networks and Underlay networks based on direct routing, and performs well when Services scale to large sizes. The Tencent Cloud container team has long been optimizing the Kubernetes network with eBPF technology; combining the hybrid cloud network with Cilium is a further exploration of eBPF.

The main features of TKE hybrid cloud container network are as follows:

  • Connects the container network across the whole link and shields the differences of the underlying networks
  • Supports both an Overlay network based on VxLAN tunneling and an Underlay network based on direct routing
  • Implements Service and NetworkPolicy based on eBPF technology, optimizing Kubernetes network performance
  • Supports custom container IPAM, allowing multiple PodCIDR segments per node and on-demand, dynamic PodCIDR allocation (a sketch of this idea follows the list)
  • Supports observability of network links
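As a rough illustration of the on-demand PodCIDR idea mentioned above, the following sketch carves fixed-size blocks out of a larger cluster CIDR as nodes need them. It is not TKE's actual IPAM implementation; the cluster CIDR and block size are made-up examples.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// nthBlock returns the i-th /blockSize subnet inside base (IPv4 only).
func nthBlock(base *net.IPNet, blockSize, i int) *net.IPNet {
	step := uint32(1) << uint(32-blockSize)
	start := binary.BigEndian.Uint32(base.IP.To4())
	ip := make(net.IP, 4)
	binary.BigEndian.PutUint32(ip, start+uint32(i)*step)
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(blockSize, 32)}
}

func main() {
	// Made-up cluster CIDR and /26 block size, only to illustrate handing
	// out PodCIDR blocks to nodes on demand as they join the cluster.
	_, clusterCIDR, _ := net.ParseCIDR("172.16.0.0/16")
	for node := 0; node < 3; node++ {
		fmt.Printf("node-%d gets PodCIDR %s\n", node, nthBlock(clusterCIDR, 26, node))
	}
}
```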

How to use the TKE hybrid cloud container network

On the basic information page of TKE cluster creation, after selecting "Support importing third-party nodes", choose the hybrid cloud container network mode. Here we can choose Cilium VxLAN to use the hybrid cloud Overlay container network, or Cilium BGP to use the hybrid cloud Underlay container network.
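If you want to confirm from code which tunnel mode a Cilium-based cluster ended up with, one option is to read the Cilium ConfigMap, as sketched below. The ConfigMap name (cilium-config), key name (tunnel), and namespace follow common Cilium defaults and may differ across Cilium versions and TKE releases.

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("HOME")+"/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// Read the Cilium configuration; "vxlan" indicates the Overlay mode
	// described in this article.
	cm, err := clientset.CoreV1().ConfigMaps("kube-system").Get(
		context.Background(), "cilium-config", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("tunnel mode:", cm.Data["tunnel"])
}
```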

TKE hybrid cloud network interconnection solution

Interconnecting the VPC network and the IDC network

To connect the VPC network and the IDC network, we recommend the Cloud Connect Network (CCN) service provided by Tencent Cloud. CCN enables communication both between VPCs on the cloud and between VPCs and IDC networks, and provides multi-point interconnection, route self-learning, link selection optimization, and fast fault convergence capabilities.

The IDC network proactively accesses the Master network

To open up the network path for the IDC network to proactively access the Master network, we rely on Tencent Cloud PrivateLink to let the kubelet in the IDC network proactively access the ApiServer in the Master network.

The Master network proactively accesses the IDC network

To open up the network path for the Master network to proactively access the IDC network, we chose to build on the community apiserver-network-proxy project. The principle is as follows (a simplified sketch follows the list):

  • Create a Konnectivity Server in the Master network as the proxy server
  • Create a Konnectivity Agent in the IDC network, which establishes a long-lived connection to the proxy server in the Master network through PrivateLink
  • When the ApiServer in the Master network proactively accesses the kubelet in the IDC network, it reuses the long-lived connection between the Agent and the Server
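The sketch below illustrates this reverse-tunnel idea in a grossly simplified form: the agent dials out from the IDC network and bridges that outbound connection to the local kubelet, so nothing ever has to dial into the IDC. The real apiserver-network-proxy multiplexes many requests over a long-lived gRPC connection; the addresses and single-connection handling here are placeholders for illustration only.

```go
package main

import (
	"io"
	"log"
	"net"
)

// runAgent runs inside the IDC network. It dials OUT to the proxy server
// in the Master network and bridges that outbound connection to the local
// kubelet, so no connection ever has to be opened into the IDC.
func runAgent(proxyServerAddr, kubeletAddr string) {
	serverConn, err := net.Dial("tcp", proxyServerAddr) // outbound, e.g. via PrivateLink
	if err != nil {
		log.Fatalf("dial proxy server: %v", err)
	}
	kubeletConn, err := net.Dial("tcp", kubeletAddr)
	if err != nil {
		log.Fatalf("dial local kubelet: %v", err)
	}
	// Relay bytes in both directions: the ApiServer's request arrives over
	// serverConn and is forwarded to the kubelet; the response flows back.
	go io.Copy(kubeletConn, serverConn)
	io.Copy(serverConn, kubeletConn)
}

func main() {
	// Placeholder addresses; 10250 is the conventional kubelet port.
	runAgent("konnectivity-server.example:8091", "127.0.0.1:10250")
}
```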

With this, the requirement for the Master network to proactively access the IDC network is also met. Furthermore, based on the same solution, computing nodes in a cloud VPC and edge nodes in edge scenarios can be managed under the same control plane, realizing a truly distributed cloud.

TKE hybrid cloud Overlay container network solution

After the Master network and the IDC network are connected, we can build an Overlay network in tunnel mode on this basis. VxLAN is a tunnel encapsulation protocol widely used in data center networks: it encapsulates packets as MAC in UDP and decapsulates them at the peer end. Cilium's tunnel encapsulation supports VxLAN and Geneve and uses VxLAN by default. Thanks to VxLAN's high scalability, a unified container network can be realized as long as the networks between nodes are open.
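To make the "MAC in UDP" description concrete, the following sketch builds the 8-byte VxLAN header defined in RFC 7348 and prepends it to an inner Ethernet frame; on the wire this payload is carried in an outer UDP packet between nodes. The VNI value and inner frame bytes are placeholders, and the outer IP/UDP headers are added by the kernel, not by this code.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// vxlanHeader builds the 8-byte VxLAN header for a given 24-bit VNI
// (VxLAN Network Identifier) per RFC 7348: a flags byte with the I bit
// set, then the VNI in the upper 24 bits of the last 4 bytes.
func vxlanHeader(vni uint32) []byte {
	h := make([]byte, 8)
	h[0] = 0x08                                // I flag: VNI is valid
	binary.BigEndian.PutUint32(h[4:8], vni<<8) // VNI occupies bits 8..31
	return h
}

func main() {
	// Placeholder for the inner Ethernet frame (Pod MACs, Pod IPs, payload).
	innerFrame := []byte("inner Ethernet frame goes here")
	// The UDP payload exchanged between nodes is header + inner frame; the
	// kernel adds the outer node IPs and UDP port (commonly 8472 on Linux,
	// 4789 is the IANA-assigned VxLAN port).
	payload := append(vxlanHeader(42), innerFrame...)
	fmt.Printf("VxLAN header: % x, total payload: %d bytes\n", payload[:8], len(payload))
}
```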

Pods access each other across nodes

When a packet is sent from a Pod's network interface, it passes through the veth pair to the lxc00aa interface on the node. The eBPF program attached to lxc00aa finds that the destination address is a remote endpoint and forwards the packet to cilium_vxlan for encapsulation. After encapsulation, the outer IP addresses are the node IP addresses and the inner IP addresses are the Pod IP addresses; the packet is then sent to the peer node through the node's physical interface. When it arrives at the peer node, the packet is decapsulated by cilium_vxlan and forwarded to the destination Pod.
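A quick way to see the Cilium devices involved in this path on a node is to check whether they exist under /sys/class/net, as in the sketch below. The device names assume a default Cilium installation in VxLAN mode; lxc interface names (such as lxc00aa above) are generated per Pod, so only the fixed devices are checked here.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Fixed device names created by a default Cilium install in VxLAN mode.
	for _, dev := range []string{"cilium_vxlan", "cilium_host", "cilium_net"} {
		if _, err := os.Stat("/sys/class/net/" + dev); err == nil {
			fmt.Printf("%s: present\n", dev)
		} else {
			fmt.Printf("%s: not found (is Cilium running in VxLAN mode on this node?)\n", dev)
		}
	}
}
```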

The node accesses the remote Pod

When the local node accesses a Pod on a remote node, the packet is forwarded to the cilium_host interface according to the node's routing table. The eBPF program attached to cilium_host forwards the packet to cilium_vxlan for tunnel encapsulation and then sends it to the peer node. After encapsulation, the outer IP addresses are the node IP addresses, the inner source IP address is the CiliumHostIP, and the inner destination IP address is the Pod IP address. The return path is the same as the forward path.

Pod accesses non-ClusterCIDR networks

When a Pod on a computing node accesses an address outside the container network's ClusterCIDR, the packet first travels from the Pod's interface to the lxc00aa interface. The eBPF program there finds that the destination address is not within the container network's ClusterCIDR, so it does not perform Overlay encapsulation but hands the packet to the node's protocol stack for routing. With Cilium's Masquerade option enabled, such packets are masqueraded when they reach the node's physical interface: the source address is replaced with the node IP address, so that return traffic can reach the node and finally be delivered back to the Pod.
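The forwarding decision described above boils down to a CIDR membership test, sketched below. The ClusterCIDR and destination addresses are made-up examples; in the real datapath this check is done by the eBPF program in the kernel, not in userspace Go.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Assumed container ClusterCIDR; replace with the cluster's real value.
	_, clusterCIDR, _ := net.ParseCIDR("172.16.0.0/16")
	for _, dst := range []string{"172.16.3.10", "8.8.8.8"} {
		if clusterCIDR.Contains(net.ParseIP(dst)) {
			fmt.Printf("%-12s inside ClusterCIDR: VxLAN-encapsulate to the peer node\n", dst)
		} else {
			fmt.Printf("%-12s outside ClusterCIDR: route via node stack, masquerade to node IP\n", dst)
		}
	}
}
```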

Summary and outlook

This article has introduced the complex scenarios and performance challenges that the hybrid cloud container network faces when third-party IDC nodes are added to a TKE public cluster, and proposed an Overlay container network solution based on Cilium. This solution is suitable not only for adding IDC nodes, but also for heterogeneous clusters (multi-cloud and edge scenarios) under hybrid cloud, and it solves the management and experience problems caused by running different cluster network plugins in a hybrid cloud. The combination of hybrid cloud and containers is therefore no longer just hybrid cloud; it can unify multi-cloud, IDC, and edge scenarios into a truly ubiquitous distributed cloud.

The Overlay hybrid cloud container network uses tunnel mode to shield the differences of the underlying network environments, presenting a unified network plane at the container level. Other customers have high performance requirements for the hybrid cloud container network and do not want the performance loss introduced by Overlay encapsulation and decapsulation; they prefer to connect the container network directly over the Underlay network with direct routing. A follow-up article will introduce the implementation of the TKE hybrid cloud Underlay container network based on BGP direct routing. Please stay tuned.

References

  1. Mind the Gap: Here Comes Hybrid Cloud
  2. Kubernetes scheduler extender
  3. Kubernetes admission controllers
  4. CNI Benchmark: Understanding Cilium Network Performance
  5. Tencent Cloud bypasses Conntrack and uses eBPF to enhance IPVS to optimize K8s network performance
  6. Tencent Cloud: Cloud Connect Network (CCN)
  7. Kubernetes apiserver-network-proxy
  8. RFC 7348: Virtual eXtensible Local Area Network (VXLAN)

Tencent Kubernetes Engine (TKE): a stable, secure, efficient, and flexibly scalable Kubernetes container platform on Tencent Cloud, with no need to build or maintain it yourself.