Preface

In Tencent Cloud TKE – Cilium-based Unified Hybrid Cloud Container Network (Part I), we introduced the cross-plane network interconnection scheme of TKE hybrid cloud and the TKE hybrid cloud Overlay network scheme. When third-party IDC nodes are added to a public cloud TKE cluster, the TKE hybrid cloud network also provides an Underlay scheme based on direct BGP routing, in order to meet customers' needs in different scenarios, in particular their low tolerance for network performance loss. This network mode is implemented with GoBGP on top of Cilium to connect node-to-Pod and Pod-to-Pod traffic; it guarantees high network performance and supports large-scale cluster expansion.

Before its launch on TKE public cloud, this network scheme had already been deployed at scale in the on-premises environment of TCNS, Tencent Cloud's proprietary agile PaaS platform, and it has been integrated and open-sourced in TKEStack. This article describes in detail the design and implementation of the Underlay container network scheme based on direct BGP routing for TKE hybrid cloud.

Background

The diversity of customer needs, especially low tolerance for network performance loss, makes an Underlay network solution necessary. Why choose BGP? Compared with interior gateway protocols such as OSPF and RIP, BGP focuses on controlling route propagation and selecting the best path. Its biggest advantage is strong scalability, which meets the horizontal scaling requirements of large clusters. BGP is also simple and stable, and there are proven cases in the industry of running BGP-based networks in production.

Depending on the size of the cluster, the BGP routing design has different options. When the cluster is small, the Full Mesh interconnection mode can be used: all BGP speakers within the same AS are fully connected with each other, and all external routing information is redistributed to the other routers in that AS. As the cluster grows, the Full Mesh pattern becomes dramatically less efficient, and Route Reflection is a mature alternative. In the RR scheme, a designated BGP speaker (the Route Reflector) re-advertises the routing information it learns to other BGP peers, greatly reducing the number of BGP peer connections.

Compared with existing schemes, Tencent hybrid cloud adopts an Underlay scheme that implements Cilium's routing on top of GoBGP. The scheme builds its own BGP agent on the clean programming interface provided by GoBGP and scales well (see the sketch after this list). Its characteristics are as follows:

  • Supports scaling to large clusters
  • Supports BGP neighbor discovery
  • Supports network visualization
  • Supports VIP and PodCIDR route announcements
  • Supports ECMP and other advanced routing features
  • Implements Cilium native routing
  • Supports Layer 3 network communication
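
To make this concrete, here is a minimal sketch (not the actual TKE agent code) of a per-node BGP agent built on GoBGP's Go API: it starts an embedded BGP server, sets the local ASN and router ID, and peers with the access layer switch. Package paths and field names follow the GoBGP v3 gRPC API (older v2 releases use As/PeerAs instead of Asn/PeerAsn), and the ASN and addresses are placeholders.

```go
package main

import (
	"context"
	"log"

	api "github.com/osrg/gobgp/v3/api"
	"github.com/osrg/gobgp/v3/pkg/server"
)

func main() {
	ctx := context.Background()

	// Embedded BGP speaker running inside the agent process.
	s := server.NewBgpServer()
	go s.Serve()

	// Local BGP configuration: the ASN is shared with the access layer
	// switch (iBGP), and the router ID is the node IP. Placeholder values.
	if err := s.StartBgp(ctx, &api.StartBgpRequest{
		Global: &api.Global{
			Asn:      64512,
			RouterId: "10.2.0.2",
		},
	}); err != nil {
		log.Fatalf("start bgp: %v", err)
	}

	// Peer with the access layer switch (or a route reflector).
	if err := s.AddPeer(ctx, &api.AddPeerRequest{
		Peer: &api.Peer{
			Conf: &api.PeerConf{
				NeighborAddress: "10.2.0.1", // access layer switch, placeholder
				PeerAsn:         64512,      // same AS, so this is an iBGP session
			},
		},
	}); err != nil {
		log.Fatalf("add peer: %v", err)
	}

	select {} // keep the agent running
}
```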

Tencent’s hybrid cloud Underlay container network solution

Without changing the internal network topology of the IDC data center, BGP sessions are established between the access layer switches and the core layer switches, reusing the routing policies already in place in the data center. PodCIDRs are assigned according to the physical location of each Node, and each Node announces its PodCIDR to the access layer switch via BGP, so that the whole network can communicate (a minimal announcement sketch follows the list below).

  1. Each access layer switch, together with the Nodes under it, forms one AS. Each Node runs a BGP service that announces the node's routing information.
  2. The core layer switch and each access layer switch occupy separate ASes; they are physically directly connected and run BGP between them. The core layer switch learns the routing information of the whole network, while each access layer switch learns the routes of the Nodes directly connected to it.
  3. Each Node has only one default route, which points to the access layer switch. For Nodes under the same access layer switch, the next hop of node-to-node communication points directly to the peer node.
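
Continuing the agent sketch above (it additionally needs the anypb import from google.golang.org/protobuf/types/known/anypb), announcing the node's PodCIDR is a single AddPath call on the embedded GoBGP server; the prefix and next hop below are placeholders:

```go
// announcePodCIDR advertises the node's PodCIDR with the node IP as the next
// hop, so the access layer switch (and, via it, the rest of the data center)
// learns how to reach Pods on this node. Sketch only.
func announcePodCIDR(ctx context.Context, s *server.BgpServer) error {
	nlri, _ := anypb.New(&api.IPAddressPrefix{
		Prefix:    "10.233.64.0", // this node's PodCIDR (placeholder)
		PrefixLen: 24,
	})
	origin, _ := anypb.New(&api.OriginAttribute{Origin: 0}) // origin IGP
	nextHop, _ := anypb.New(&api.NextHopAttribute{NextHop: "10.2.0.2"})

	_, err := s.AddPath(ctx, &api.AddPathRequest{
		Path: &api.Path{
			Family: &api.Family{Afi: api.Family_AFI_IP, Safi: api.Family_SAFI_UNICAST},
			Nlri:   nlri,
			Pattrs: []*anypb.Any{origin, nextHop},
		},
	})
	return err
}
```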

Neighbor discovery

In a cluster network built on BGP, nodes are frequently added and removed. If peers are configured statically, the switch has to be reconfigured every time a peer is added or deleted; the maintenance workload is heavy and does not suit horizontal cluster scaling. To avoid operating the switch manually, we support dynamic BGP neighbor discovery in two ways: by configuring it on the access layer switch, or through a route reflector implemented in software.

Dynamic neighbor discovery via the access layer switch

The access layer switch acts as the border router and enables Dynamic Neighbors; for H3C, Cisco, and Huawei devices, refer to the vendor documentation for the corresponding configuration. The BGP service on each Node actively establishes an iBGP connection with the access layer switch and announces its local routes, which the access layer switch then propagates to the entire data center.

Dynamic neighbor discovery via a route reflector

A physical switch or a Node acts as the Route Reflector (RR). The RR establishes an iBGP connection with the access layer switch, and the BGP service on each Node establishes a connection with the RR. The BGP service on a Node announces its local routes to the RR, which reflects them to the access layer switch, which in turn announces them to the entire data center.
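
When the RR is implemented in software, it can combine a route-reflector-client peer group with dynamic neighbors so that new nodes in a prefix range can peer without per-node configuration on the RR. Below is a minimal sketch against the GoBGP v3 API; the peer group name, AS number, cluster ID, and prefix are all placeholders:

```go
// configureRR turns the embedded GoBGP server into a route reflector that
// accepts dynamic iBGP neighbors from the node subnet. Sketch only.
func configureRR(ctx context.Context, s *server.BgpServer) error {
	// All nodes join one peer group and are treated as RR clients.
	if err := s.AddPeerGroup(ctx, &api.AddPeerGroupRequest{
		PeerGroup: &api.PeerGroup{
			Conf: &api.PeerGroupConf{
				PeerGroupName: "k8s-nodes", // placeholder
				PeerAsn:       64512,       // same AS: iBGP
			},
			RouteReflector: &api.RouteReflector{
				RouteReflectorClient:    true,
				RouteReflectorClusterId: "10.2.0.1", // placeholder cluster ID
			},
		},
	}); err != nil {
		return err
	}

	// Any node in this prefix may establish a session without being
	// configured individually on the RR.
	return s.AddDynamicNeighbor(ctx, &api.AddDynamicNeighborRequest{
		DynamicNeighbor: &api.DynamicNeighbor{
			Prefix:    "10.2.0.0/24", // node subnet, placeholder
			PeerGroup: "k8s-nodes",
		},
	})
}
```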

Next-hop selection

Each Node runs a BGP service that announces the node's own PodCIDR to the access layer switch, so the access layer switch knows the PodCIDRs of all Nodes directly connected to it. Nodes under the same access layer switch learn each other's routes, and traffic between them is forwarded by the access layer switch at Layer 2. For communication across access layer switches, the next hop points to the access layer switch; for communication under the same access layer switch, the next hop points directly to the peer node. The following figure shows route learning for Nodes under the same access layer switch and across access layer switches; the next-hop address can be read directly from the routing table (an illustrative routing table follows the list below).

  • Communication under the same access layer switch: Node 10.2.0.2 and Node 10.2.0.3 are under the same access layer switch and are reachable at Layer 2. Once a packet is encapsulated, it can be sent directly to the peer without Layer 3 forwarding.
  • Communication across access layer switches: Node 10.2.0.2 and Node 10.3.0.3 are under different access layer switches, and packets reach the peer only after being routed by the access layer switches and the core switch.
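
For illustration only (the PodCIDRs below are hypothetical), the routes on Node 10.2.0.2 would look roughly like this: the PodCIDR of the peer node under the same switch resolves to that node directly, while PodCIDRs behind other access layer switches are reached through the default route.

```
# Illustrative routes on node 10.2.0.2 (PodCIDRs are hypothetical)
default via 10.2.0.1 dev eth0            # everything else goes to the access layer switch
10.233.65.0/24 via 10.2.0.3 dev eth0     # PodCIDR of 10.2.0.3 (same switch): next hop is the peer node
# Traffic to the PodCIDR of 10.3.0.3 matches the default route and is
# forwarded by the access layer and core switches.
```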

BMP monitoring

We developed a BMP server based on the BGP Monitoring Protocol (BMP) to monitor the running state of BGP sessions in real time, including the establishment and teardown of peer relationships, routing information, and so on. The collected BMP messages can be used to locate faults directly.
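
With GoBGP as the node-side BGP service, reporting to a BMP collector is a single API call; a sketch, assuming the GoBGP v3 API, with a hypothetical collector address:

```go
// enableBMP makes the embedded BGP speaker report peer and route events to a
// BMP collector. Address and port are placeholders; 11019 is the conventional
// BMP port.
func enableBMP(ctx context.Context, s *server.BgpServer) error {
	return s.AddBmp(ctx, &api.AddBmpRequest{
		Address: "10.0.0.100", // hypothetical BMP server
		Port:    11019,
		Policy:  api.AddBmpRequest_BOTH, // report both pre- and post-policy routes
	})
}
```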

Graceful restart

BGP is a routing protocol that runs over TCP. When the TCP connection is abnormally disconnected, a switch with Graceful Restart enabled does not delete its RIB and FIB; it keeps forwarding packets according to the existing forwarding entries and starts the RIB route aging timer. Graceful Restart takes effect only when it is enabled on both ends of a BGP peering. It effectively prevents BGP link flapping and improves the availability of the underlay network.
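
On the node side, graceful restart is negotiated per peer. A sketch of the corresponding GoBGP v3 peer configuration (the timer value is a placeholder, per-address-family settings are omitted, and the switch side must enable the feature as well):

```go
// addPeerWithGR peers with the access layer switch and negotiates the BGP
// graceful restart capability, so the switch keeps forwarding while the
// node's BGP service restarts. Sketch only.
func addPeerWithGR(ctx context.Context, s *server.BgpServer) error {
	return s.AddPeer(ctx, &api.AddPeerRequest{
		Peer: &api.Peer{
			Conf: &api.PeerConf{
				NeighborAddress: "10.2.0.1", // access layer switch, placeholder
				PeerAsn:         64512,
			},
			GracefulRestart: &api.GracefulRestart{
				Enabled:     true,
				RestartTime: 120, // seconds, placeholder
			},
		},
	})
}
```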

Custom IPAM

In a common Kubernetes setup, nodes are assigned a PodCIDR and routes are configured through the --allocate-node-cidrs and --configure-cloud-routes flags of kube-controller-manager. However, the community solution restricts each node to a single PodCIDR segment that cannot be expanded dynamically. This one-node-one-PodCIDR strategy is too simplistic and leads to low IP utilization: some nodes are given segments that are too small and run out of addresses, while others are given segments that are too large and never use them up.

In the hybrid cloud scenario, we found that customers have higher requirements for IPAM:

  • The node's PodCIDR should support multiple segments
  • The node's PodCIDR should support on-demand dynamic expansion and reclamation

To solve this problem, we implemented custom IPAM with our own tke-ipamd component, as shown in the following figure:

  • Instead of having kube-controller-manager assign node PodCIDRs, the tke-ipamd component assigns PodCIDRs to all nodes uniformly
  • The Cilium agent reads the PodCIDRs assigned by tke-ipamd from the CiliumNode object and allocates IPs to Pods in response to CNI requests
  • tke-ipamd uses the list-watch mechanism to monitor node IP resource usage and dynamically expands a node's PodCIDR when its IP usage is too high (a simplified sketch of this expansion step follows the list)
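
tke-ipamd itself is not reproduced here; the following is a hypothetical, heavily simplified sketch of just the expansion step. It assumes Cilium's CiliumNode CRD (whose spec.ipam.podCIDRs holds the node's PodCIDR list), Cilium's generated clientset, and an assumed 80% utilization threshold; the real component's logic and names differ.

```go
package main

import (
	"context"

	ciliumv2 "github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2"
	ciliumclient "github.com/cilium/cilium/pkg/k8s/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// expandPodCIDR appends an extra PodCIDR to a CiliumNode when the node's IP
// usage crosses a threshold. Hypothetical sketch, not the actual tke-ipamd code.
func expandPodCIDR(ctx context.Context, c ciliumclient.Interface,
	node *ciliumv2.CiliumNode, newCIDR string, usedIPs, totalIPs int) error {
	// Only expand when utilization is high; 0.8 is an assumed threshold.
	if totalIPs == 0 || float64(usedIPs)/float64(totalIPs) < 0.8 {
		return nil
	}
	// Add one more segment; the Cilium agent picks it up by watching CiliumNode.
	node.Spec.IPAM.PodCIDRs = append(node.Spec.IPAM.PodCIDRs, newCIDR)
	_, err := c.CiliumV2().CiliumNodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```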

Performance test

To better understand the performance of the TKE hybrid cloud Underlay container network, we tested it with the netperf tool. The results show that the Underlay scheme has virtually no performance loss in network throughput or bandwidth.

Summary and outlook

Having introduced the cross-plane interconnection scheme of TKE's Cilium-based hybrid cloud container network and the Overlay network scheme for the hybrid cloud scenario, this article focused on the Underlay network scheme based on direct BGP routing. The TKE hybrid cloud Underlay container network solution takes advantage of BGP's scalability to meet the horizontal scaling needs of large clusters, while giving customers higher data plane forwarding performance, with minimal loss compared with the node network. Before its launch on TKE public cloud, this network scheme had already been deployed at scale in the on-premises environment of TCNS, Tencent Cloud's proprietary agile PaaS platform, and it has been integrated and open-sourced in TKEStack.

The combination of hybrid cloud and containers is attracting the attention of more and more enterprise customers. It improves the utilization of existing enterprise computing resources and brings significant benefits in scenarios such as resource scaling, multi-active disaster recovery, and distributed business deployment. The Tencent Cloud container team bridges the differences between the public cloud and IDC environments, provides customers with a unified management view, and unifies the cloud, IDC, and edge scenarios. Beyond unified single-cluster capabilities, the team also provides unified solutions for cluster registration, multi-cluster management, and cross-cloud and cross-cluster access. You are welcome to follow and try them out.

References

Using BIRD to run BGP

Tencent Kubernetes Engine (TKE) is a one-stop cloud-native PaaS platform based on Kubernetes provided by Tencent Cloud. It offers users enterprise-grade services that integrate container cluster scheduling, Helm application orchestration, Docker image management, Istio service governance, automated DevOps, and a full set of monitoring and operations capabilities.

