Introduction

The Docker network stack is "trapped" inside the local network namespace, which makes cross-machine Docker network communication a problem. So how does K8s, as a container scheduling platform, solve it?

Pod network model

K8s does not implement a Pod network communication solution itself. Instead, it integrates the various solutions on the market, such as Flannel, Calico, and VPC-CNI, in the form of plug-ins. However, every solution must meet the following requirements:

  • Pods can communicate with each other directly, without NAT
  • The IP address a Pod sees for itself is the same IP address that other Pods see for it
  • Nodes in a Kubernetes cluster can be physical machines, virtual machines, or any environment that can run Kubernetes, and these nodes can also communicate with Pods without NAT

Why these requirements? This brings us to the limitations of the traditional Docker way of exposing ports.

Docker exposes ports by using iptables to map host ports to container ports. For example, mapping host port 80 to an nginx container's port 80 lets callers reach that nginx through host IP + port 80. But what if there are multiple nginx containers on the same host?
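To make this concrete, here is a minimal sketch (container names and IPs are hypothetical): publishing two nginx containers on one host forces them onto different host ports, and Docker wires the mapping up with iptables DNAT rules.

# only one container can own host port 80, so the second must move to 8080
docker run -d --name web1 -p 80:80 nginx
docker run -d --name web2 -p 8080:80 nginx

# inspect the DNAT rules Docker programs (container IPs are illustrative)
iptables -t nat -nL DOCKER
# DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:80    to:172.17.0.2:80
# DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:8080  to:172.17.0.3:80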

Mapping multiple nginx containers to different host ports, as in the sketch above, is easy, and I believe most people would adopt this approach without a second thought. On closer inspection, though, if K8s adopted this approach, wouldn't it run into problems?

  • What rules should a host follow when mapping its ports to a multitude of containers?
  • Must the caller distinguish target ports when calling the multiple nginx instances? Doesn't that create difficulty and confusion for the caller?
  • In microservice registration and discovery scenarios, after NAT, different services on the same host all register the same IP address with the registry, namely the host's IP; service registration and discovery would inevitably break

Obviously, the NAT port-mapping scheme cannot simply be carried over to K8s, and with that in mind, the Pod network model K8s officially proposes is not hard to understand. What the model describes is a flat network: containers communicate with each other using their own IPs, without NATing through the host's IP, and the IP a container sees for its peer is the peer's real IP.

Flannel

Flannel is a network architecture designed by CoreOS for K8s based on an overlay network. A so-called overlay is really packet nesting: the original container-to-container frame is encapsulated inside another UDP/IP packet and carried across the underlying network. The figure below shows the VXLAN overlay packet format
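A quick way to feel the cost of this nesting is the MTU arithmetic. The VXLAN outer headers add 50 bytes to every inner frame, which is exactly where the FLANNEL_MTU of 8951 seen later in this article comes from, assuming a physical MTU of 9001 (the common jumbo-frame value on AWS, where these example hosts appear to run):

# VXLAN overhead: outer Ethernet (14) + outer IP (20) + outer UDP (8) + VXLAN (8) = 50 bytes
# overlay device MTU = physical MTU - 50, e.g. 9001 - 50 = 8951
ip link show flannel.1 | grep -o 'mtu [0-9]*'
# mtu 8951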

Flannel's backends are classified by the encapsulation protocol they use:

  • udp

UDP mode performs encapsulation and decapsulation in user space, which costs performance

  • host-gw

host-gw applies only when nodes share a layer-2 network; it cannot span layer 3

  • vxlan

VXLAN encapsulation and decapsulation are implemented in kernel space, so performance is high, and it can span subnets and layer-3 networks. This is the backend widely used in the industry

The components of Flannel in VXLAN mode are etcd and flanneld

etcd

If you look carefully, you may have noticed that Docker assigns containers the segment 172.17.0.0/16 by default; start Docker on another host and you will find it assigned the very same segment. A central component is therefore needed to guarantee that container IPs are globally unique. etcd plays exactly this role, storing the network segments every host has allocated to its containers. Of course, etcd is only the carrier of Flannel's metadata; dividing the network segments and registering and reporting subnet and VTEP information is the job of flanneld
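As a sketch of what this looks like inside etcd (assuming Flannel's default key prefix /coreos.com/network and the etcd v2 API; values are illustrative but kept consistent with the outputs shown below):

# the cluster-wide network config that every flanneld reads at startup
etcdctl get /coreos.com/network/config
# {"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan"}}

# one lease per node, recording its subnet and VTEP information
etcdctl ls /coreos.com/network/subnets
# /coreos.com/network/subnets/192.168.1.0-24
# /coreos.com/network/subnets/192.168.2.0-24
etcdctl get /coreos.com/network/subnets/192.168.2.0-24
# {"PublicIP":"172.25.34.198","BackendType":"vxlan","BackendData":{"VtepMAC":"26:1c:b0:22:17:31"}}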

Flanneld

flanneld can be thought of simply as etcd's agent running on each host node. It has the following functions:

  • Obtain the network configuration from etcd
  • Divide a subnet for its node and register it in etcd
  • Record the subnet information to /run/flannel/subnet.env on the host
  • And, most importantly, make VXLAN encapsulation and decapsulation happen (in vxlan mode the packets themselves are handled by the kernel; flanneld supplies the routing, neighbor, and FDB information the kernel needs, as we will see below)
root@ip-172-25-34-198:~# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=192.168.2.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true

ubuntu@ip-172-25-33-13:~$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=192.168.1.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true

As shown above, the two nodes are assigned different container subnets. This information is reported to etcd and injected into Docker's startup parameters, which guarantees the uniqueness of container IPs across the K8s cluster
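One common way this injection is done (a sketch following the mk-docker-opts.sh approach documented by Flannel; --bip and --mtu are real dockerd flags):

# on each node, source the file flanneld wrote, then start docker with it
. /run/flannel/subnet.env
dockerd --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU}
# node A now allocates container IPs from 192.168.1.0/24 and node B
# from 192.168.2.0/24, so no two containers can collide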

VXLAN encapsulation process

Container IPs are now unique across the cluster, so how do containers communicate across nodes?

To keep things easy to follow, let's analyze the data forwarding process in VXLAN mode for cross-node traffic from container 192.168.1.4 to container 192.168.2.2

  1. The data starts from the container at 192.168.1.4. According to the container's routing table, a packet destined for 192.168.2.2 matches the default route, so it is sent out through the container's eth0 interface
root@ip-172-25-33-13:/etc/cni/net.d# docker exec bd69 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8951 qdisc noqueue state UP
    link/ether 96:53:14:4a:4e:bd brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.4/24 scope global eth0
       valid_lft forever preferred_lft forever
root@ip-172-25-33-13:/etc/cni/net.d# docker exec bd69 ip route
default via 192.168.1.1 dev eth0
10.244.0.0/16 via 192.168.1.1 dev eth0
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.4
  2. But where does the data go once it leaves the container? In theory the container and the host sit in two different network namespaces, so their networks are isolated from each other. A veth pair makes this work: it is a "network cable" connecting the two namespaces. With the container's eth0 on one end and the host's cni0 bridge on the other, the packet is forwarded on to the host's cni0 (a minimal wiring sketch follows)
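Here is a minimal sketch of wiring such a "network cable" by hand (namespace and interface names are hypothetical; in a real cluster the CNI plugin does the equivalent automatically):

# a network namespace standing in for a container
ip netns add demo
# create the veth pair: one end stays on the host, the other moves into the namespace
ip link add veth-host type veth peer name veth-ctr
ip link set veth-ctr netns demo
ip netns exec demo ip link set veth-ctr name eth0
ip netns exec demo ip addr add 192.168.1.4/24 dev eth0
ip netns exec demo ip link set eth0 up
# plug the host end into the cni0 bridge
ip link set veth-host master cni0
ip link set veth-host up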

  3. In the host's routing table there is a route for 192.168.2.0/24 whose output device is flannel.1, so the packet is forwarded once more, this time to flannel.1

root@ip-172-25-33-13:/etc/cni/net.d# ip route
default via 172.25.32.1 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.25.32.0/20 dev eth0 proto kernel scope link src 172.25.33.13
192.168.0.0/24 via 192.168.0.0 dev flannel.1 onlink
192.168.1.0/24 dev cni0 proto kernel scope link src 192.168.1.1
192.168.2.0/24 via 192.168.2.0 dev flannel.1 onlink

flannel.1 is a virtual VTEP (VXLAN Tunnel Endpoint) device; it is where VXLAN packets are encapsulated and decapsulated.

When the data packet arrives at flannel.1, it needs to be encapsulated according to the VXLAN protocol. The inner destination IP is 192.168.2.2, but flannel.1 still lacks the destination MAC address needed to complete the inner Ethernet frame.

Unlike traditional layer-2 addressing, flannel.1 does not broadcast an ARP request to resolve the MAC address for 192.168.2.2. Instead, the Linux kernel raises an "L3 miss" event to the user-space flanneld process. On receiving the event, flanneld looks up in etcd the subnet that matches the destination address and answers with the MAC address of the flannel.1 device on that subnet's host, i.e. the host where the destination Pod lives. The complete VXLAN inner packet format is as follows
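The net effect of flanneld answering these L3-miss events is a permanent neighbor entry on the sending node, which you can inspect directly. Output consistent with this example would look like this (illustrative; note the next hop is the remote subnet's gateway 192.168.2.0, i.e. the remote flannel.1):

ip neigh show dev flannel.1
# 192.168.2.0 lladdr 26:1c:b0:22:17:31 PERMANENT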

  4. Following the TCP/IP stack's top-down encapsulation, forming a complete, transmittable data frame still requires an outer destination IP address and an outer destination MAC address. How are these two found?

For the outer destination IP, flannel.1 looks up the forwarding database (FDB) using the peer VTEP MAC address 26:1c:b0:22:17:31 obtained in the previous step. As shown below, the corresponding destination IP is 172.25.34.198

root@ip-172-25-33-13:~# bridge fdb show dev flannel.1
22:5b:16:5a:1b:fc dst 172.25.42.118 self permanent
26:1c:b0:22:17:31 dst 172.25.34.198 self permanent

The MAC address corresponding to 172.25.34.198 can then be resolved with ordinary ARP, and the complete data frame looks like this:

Finally, a data frame with outer destination IP 172.25.34.198 and outer destination UDP port 8472 is sent out from the host's eth0 interface
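To see this nesting on the wire, capture on the sending host's physical interface: the UDP/8472 packets carry the original container-to-container packet inside (output below is representative, not captured from this cluster):

# tcpdump can decode the VXLAN payload of the underlay traffic
tcpdump -ni eth0 udp port 8472
# IP 172.25.33.13.43211 > 172.25.34.198.8472: VXLAN, flags [I], vni 1
# IP 192.168.1.4 > 192.168.2.2: ICMP echo request, id 1, seq 1, length 64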

  5. The data frame is forwarded through the underlay network to port 8472 of the peer host 172.25.34.198. That is the port the VXLAN device created by flanneld listens on; the kernel decapsulates the VXLAN packet there (in kernel space, as noted earlier) and finds that the inner destination IP is 192.168.2.2. The packet matches the host route for 192.168.2.0/24 and is forwarded to cni0
root@ip-172-25-34-198:~# ip route
default via 172.25.32.1 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.25.32.0/20 dev eth0 proto kernel scope link src 172.25.34.198
192.168.0.0/24 via 192.168.0.0 dev flannel.1 onlink
192.168.1.0/24 via 192.168.1.0 dev flannel.1 onlink
192.168.2.0/24 dev cni0 proto kernel scope link src 192.168.2.1
  6. As mentioned above, cni0 is directly connected to the container's network namespace, so the data is forwarded through the veth pair to the container at 192.168.2.2

After all these trials and hardships, the packet finally achieves cross-host container communication

Conclusion

In the Flannel network model, a data packet is like a nested parcel: a small parcel hidden inside a big one, and the flow of data resembles parcel delivery in a logistics network. The small parcel is first routed by its own address label; at the hub it is packed into a big parcel before getting on the expressway; the big parcel is then delivered according to its own address label; after it reaches the destination distribution point, the small parcel is taken out and delivered again by the small parcel's address label. In this way, cross-node data transmission is achieved

Follow Cloud Monkey Life for more knowledge