Computing, storage, and networking are the three basic services of the cloud era, and Kubernetes, as a new generation of infrastructure, is no exception. Of the three, networking is the hardest to master and the most error-prone.

The network model

The Kubernetes network model itself is actually very simple; it boils down to the following:

(1) IP-per-Pod: each Pod has an independent IP address, and all containers in the Pod share one network namespace

(2) All Pods in the cluster are on a directly connected, flat network and can reach each other by IP

(3) A Service's cluster IP address is reachable only from within the cluster; external requests must go through NodePort, LoadBalancer, or Ingress
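As an illustration of point (3), a minimal Service manifest exposing a set of Pods both inside the cluster (via the cluster IP) and outside (via a NodePort) might look like the sketch below; the names and labels are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web            # hypothetical Service name
spec:
  type: NodePort       # reachable from outside via <NodeIP>:30080
  selector:
    app: web           # matches Pods labeled app=web
  ports:
    - port: 80         # cluster IP port, reachable only inside the cluster
      targetPort: 8080 # container port inside the Pod
      nodePort: 30080  # must fall in the default 30000-32767 range
```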

In addition, Pod networks are configured by CNI plug-ins, together with a series of network extensions such as Calico, Flannel, CoreDNS, Nginx Ingress Controller, and Ambassador API Gateway, as well as service meshes such as Linkerd and Istio. Combined, all of these services form a powerful container network, but they also add complexity to it.

Service discovery

One of the most complex aspects of Kubernetes networking is its service discovery and load balancing mechanism. To achieve service discovery and load balancing, several components need to work together:

1. The user creates a Service through the API. 2. kube-controller-manager matches Pods by label and creates an Endpoints object with the same name as the Service. 3. kube-proxy on each Node creates iptables rules for Services and Endpoints to implement load balancing and DNAT.

The core of the Kubernetes network is how kube-proxy implements service discovery and load balancing. The workflow of the default iptables mode is shown in the figure below. Mastering this flow is the key to understanding how the Kubernetes network works and to day-to-day network troubleshooting.
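To make the load-balancing step concrete: in iptables mode, kube-proxy emits one DNAT rule per endpoint and uses the `statistic` module so that, with n endpoints, rule i is tried with probability 1/(n-i), which works out to a uniform random choice overall. A small Python sketch of that selection logic (the endpoint addresses are made up for illustration):

```python
import random

def pick_endpoint(endpoints):
    """Mimic kube-proxy's chained iptables rules: rule i matches
    with probability 1/(n-i), and the last rule always matches."""
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if random.random() < 1.0 / (n - i):
            return ep
    return endpoints[-1]  # unreachable (last probability is 1), kept for safety

# Hypothetical Pod endpoints behind one Service
endpoints = ["10.244.1.5:8080", "10.244.2.7:8080", "10.244.3.9:8080"]
counts = {ep: 0 for ep in endpoints}
for _ in range(30000):
    counts[pick_endpoint(endpoints)] += 1
# Each endpoint ends up receiving roughly a third of the connections.
```

The chained probabilities (1/3, then 1/2 of the remainder, then 1) are exactly how kube-proxy achieves an even spread using stateless iptables rules.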

Network troubleshooting

The network is one of the most problem-prone parts of a Kubernetes cluster, and troubleshooting is often difficult because of the number of modules involved. Mastering network troubleshooting means mastering one of the core and most important parts of Kubernetes. Broadly speaking, Kubernetes network traffic falls into no more than the following three situations:

1. Pods accessing the network outside the container. 2. Clients outside the container accessing the Pod network. 3. Pods accessing each other within the cluster.

Troubleshooting network problems starts from these situations: locate the specific network anomaly, then find a solution. For example, some common network problems include:

1. The CNI network plug-in is misconfigured. 2. Pod network routes are lost. 3. Service ports conflict, or a NetworkPolicy is misconfigured. 4. Security groups, firewalls, or security policies on the host or cloud platform block the container network.
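As an example of the NetworkPolicy pitfall above, a default-deny policy like the following (the name and namespace are hypothetical) silently drops all ingress traffic to the Pods it selects, which often surfaces as unexplained connection timeouts:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical name
  namespace: demo              # hypothetical namespace
spec:
  podSelector: {}              # empty selector: applies to every Pod in the namespace
  policyTypes:
    - Ingress                  # no ingress rules listed, so all inbound traffic is denied
```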

See the network troubleshooting section of the Kubernetes Guide for detailed troubleshooting methods.

For a real-world example, Google Cloud has shared a DNS packet-loss case that is typical of Kubernetes cluster Day 2 operations. It covers most of the common troubleshooting steps taken after a network failure. For details, see cloud.google.com/blog/topics…
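For DNS failures like the one in that case, a typical first-pass check looks like the sketch below; it assumes kubectl access to a live cluster, and the debug image is just one common choice from the Kubernetes documentation:

```shell
# Run a disposable debug Pod with DNS tools
kubectl run -it --rm dnsutils \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- sh

# Inside the Pod: resolve an in-cluster name and an external name
nslookup kubernetes.default
nslookup example.com

# Check that CoreDNS Pods are healthy and inspect their recent logs
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
```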

The future of container networks

As the network troubleshooting example shared by Google shows, even at one of the world's top three public cloud platforms, Google TSEs had to use a whole series of tools and commands to gradually narrow down the root cause of a network problem. And in many cases, customers need to reproduce the problem themselves to obtain first-hand debugging data.

On the one hand, this shows how complex network problems can be; on the other hand, it shows there is still a lot of room for improvement in network troubleshooting, debugging, monitoring, and related areas. Networking solutions that solve these problems well are likely to take off as popular new technologies.

In this respect, Cilium is a typical example. Cilium not only requires a newer kernel, it also uses eBPF to inject a series of network mechanisms into the kernel, including observability, security, and filtering. None of that has stopped Cilium from becoming popular. Many other network solutions are adopting the same approach, using eBPF to accelerate network performance and to provide more transparent network observability.
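As a taste of that eBPF-based observability, the following is a sketch of inspecting a Cilium-managed cluster with the Cilium and Hubble CLIs (it assumes both are installed and connected to a cluster, and the `demo` namespace is hypothetical):

```shell
# Verify the Cilium agent and its eBPF datapath are healthy
cilium status

# Stream flow events captured by eBPF, filtered to dropped packets
hubble observe --verdict DROPPED

# Watch DNS traffic for one namespace
hubble observe --namespace demo --protocol dns
```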

If you are stuck on network performance and observability, consider Cilium and eBPF; they may pleasantly surprise you.


Welcome to follow the Chat Cloud Native WeChat public account to learn more about cloud native.