Docker native network solution

Container networking mainly solves two core problems: allocating IP addresses to containers, and enabling communication between containers. This article focuses on the second problem, and in particular on cross-host container communication.

The simplest way to achieve container communication between hosts is to use the host network directly. In this case the container's IP is the host's IP, so the host's network protocol stack and the underlay network are reused. Since the hosts can already reach each other, the containers can naturally communicate as well, but the most obvious problem is port conflicts.

Therefore, a container is usually given its own IP address, different from that of the host. Because these IP addresses are configured inside the containers, the underlying devices of the underlay network, such as switches and routers, are completely unaware of them, so container IP addresses cannot be routed directly to achieve cross-host communication. To solve this and enable cross-host communication between containers, there are two main approaches:

  • Idea 1: Modify the configuration of the underlying network devices, add IP address management for the container network, and adjust router gateways. This approach is usually combined with SDN.

  • Idea 2: Do not touch the underlying network device configuration at all, and reuse the existing underlay network to achieve cross-host communication between containers. There are two main methods:

    Overlay tunnel transmission. The container's packet is encapsulated inside a layer 3 or layer 4 header of the original host network and transmitted to the destination host over the existing network; the destination host then decapsulates it and forwards it to the container. Typical overlay tunnels are VXLAN and IPIP, and typical overlay container networks are Flannel and Weave.

    Modifying host routes. The container subnets are added to the host routing tables, each host acts as the gateway for its containers, and packets are forwarded to the right host according to routing rules, achieving layer 3 communication between containers (see the sketch after this list). Examples include Flannel host-gw and Calico.
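A minimal sketch of the host-route idea (all addresses and subnets below are made-up placeholders; tools such as Flannel host-gw and Calico program equivalent routes automatically):

# on host A (192.168.1.101), whose containers use 10.244.1.0/24:
# route host B's container subnet via host B's underlay address
ip route add 10.244.2.0/24 via 192.168.1.102 dev eth0
# on host B (192.168.1.102), the mirror route for host A's container subnet
ip route add 10.244.1.0/24 via 192.168.1.101 dev eth0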

libnetwork & CNM

libnetwork is Docker's container network library. Its core is the Container Network Model (CNM), which abstracts the container network into the following three components (a hands-on sketch with plain Linux commands follows the list):

  • The Sandbox is the container's network stack, containing the container's interfaces, routing table, and DNS settings. A Linux network namespace is the standard implementation of a Sandbox. A Sandbox can contain Endpoints from different Networks.
  • The Endpoint connects a Sandbox to a Network. A typical implementation of an Endpoint is a veth pair, illustrated later. An Endpoint can belong to only one Network and only one Sandbox.
  • A Network contains a set of Endpoints. Endpoints of the same Network can communicate with each other directly. A Network can be implemented as a Linux bridge, a VLAN, etc.
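To make the three abstractions concrete, here is a hand-built equivalent using plain Linux tools: the namespace plays the Sandbox, the veth pair the Endpoint, and the bridge the Network (all names and the address are invented for illustration; Docker performs the equivalent steps automatically):

ip netns add sandbox1                               # Sandbox: an isolated network stack
ip link add br-demo type bridge                     # Network: a Linux bridge
ip link set br-demo up
ip link add veth-host type veth peer name veth-ctn  # Endpoint: a veth pair
ip link set veth-ctn netns sandbox1                 # one end goes into the Sandbox...
ip link set veth-host master br-demo                # ...the other end plugs into the Network
ip link set veth-host up
ip netns exec sandbox1 ip addr add 172.18.0.2/24 dev veth-ctn
ip netns exec sandbox1 ip link set veth-ctn up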

Single host network solution

none

As shown below, the Docker container has only a lo loopback interface. After starting the container with --net=none, you can still configure the network for the container manually.

docker run --net=none -ti ubuntu:latest ip addr show
[root@YZ-25-65-49 ~]# docker run --net=none -ti ubuntu:12.04  bash      
root@1accd8ab4f47:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

host

In host mode, the container shares the host's network stack, so it can see and manipulate the host's network configuration. This is dangerous and should be avoided unless absolutely necessary.

docker run -ti --net=host ubuntu:latest bash
root@YZ-25-65-49:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether f0:00:ac:19:41:31 brd ff:ff:ff:ff:ff:ff
    inet 172.25.65.49/23 brd 172.25.65.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f200:acff:fe19:4131/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:f5:a8:2a:78 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:f5ff:fea8:2a78/64 scope link
       valid_lft forever preferred_lft forever

bridge 

[root@YZ-25-65-50 ~]# docker run -d --name busybox busybox sleep 360000
e6e41c28b89bdf0278648103bd59c036f2779ef6a70e97f5609764d65f26e28d
[root@YZ-25-65-50 ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether f0:00:ac:19:41:32 brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 02:42:a7:64:c0:1a brd ff:ff:ff:ff:ff:ff
5: veth18c80ff@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT
    link/ether da:09:fa:86:08:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[root@YZ-25-65-50 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242a764c01a       no              veth18c80ff
[root@YZ-25-65-50 ~]# docker inspect e6 |grep SandboxKey
            "SandboxKey": "/var/run/docker/netns/9812fa7c88bf",
[root@YZ-25-65-50 ~]# nsenter --net=/var/run/docker/netns/9812fa7c88bf
[root@YZ-25-65-50 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Cross-host networking scheme

overlay

To support container communication across hosts, Docker provides the overlay driver, which allows users to create VXLAN-based overlay networks. VXLAN encapsulates layer 2 frames in UDP for transmission; it provides the same layer 2 Ethernet service as VLAN, but with much better scalability and flexibility.

The Docker overlay network requires a key-value store to hold network state, including Networks, Endpoints, and IP addresses. Consul, etcd, and ZooKeeper are the key-value stores supported by Docker.
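A rough sketch of bringing up such an overlay network (addresses and names are placeholders; the --cluster-store/--cluster-advertise daemon options belong to the legacy, pre-Swarm-mode overlay setup and assume a reachable Consul instance):

# on every host: point the Docker daemon at the shared key-value store
dockerd --cluster-store=consul://192.168.1.10:8500 --cluster-advertise=eth0:2376
# on any one host: create the overlay network; the other hosts see it through the store
docker network create -d overlay --subnet=10.0.9.0/24 ov_net1
docker run -d --name bbox_ov --network ov_net1 busybox sleep 360000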

Docker creates a separate network namespace for each overlay network, containing a Linux bridge br0. Endpoints are still implemented as veth pairs: one end attaches to the container (as eth0), the other to br0 inside that namespace. Besides all the endpoints, br0 also connects a VXLAN device that establishes VXLAN tunnels to other hosts; container traffic crosses hosts through these tunnels.
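To see this structure, you can enter the namespace Docker created for the overlay network, in the same way as in the bridge example above (the namespace file name below is a placeholder; the exact name depends on the network ID):

ls /var/run/docker/netns
# "ip -d link" prints driver details, so the bridge and the vxlan device (with its VNI and UDP port) are visible
nsenter --net=/var/run/docker/netns/<overlay-ns> ip -d link show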

macvlan

Macvlan is a Linux kernel module that allows multiple MAC addresses (that is, multiple interfaces) to be configured on the same physical NIC, each with its own IP address. Macvlan is essentially a NIC virtualization technique, so it is no surprise that Docker uses it to implement container networks. Its biggest advantage is performance: unlike other implementations, macvlan needs no Linux bridge and connects containers directly to the physical network through the Ethernet interface.
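A sketch of using the macvlan driver (the subnet, gateway, and parent interface are examples and must match the physical LAN the host is attached to):

docker network create -d macvlan \
  --subnet=172.25.64.0/23 --gateway=172.25.64.1 \
  -o parent=eth0 mac_net1
docker run -d --name bbox1 --network mac_net1 --ip 172.25.65.100 busybox sleep 360000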

VLAN is a network virtualization technology widely used in modern networks. It divides a layer 2 network into up to 4094 logical networks that are isolated from each other at layer 2; these logical networks (VLANs) are distinguished by VLAN IDs in the range 1 to 4094. Linux NICs also support VLANs (apt-get install vlan): the same interface can send and receive packets of multiple VLANs, provided a sub-interface is created for each VLAN. For example, to support VLAN 10 and VLAN 20, create sub-interfaces enp0s9.10 and enp0s9.20. On a switch, a port that carries only one VLAN works in access mode; a port that carries multiple VLANs works in trunk mode.
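For example, a sketch of creating the two sub-interfaces and using them as macvlan parents (the VLAN IDs follow the paragraph above; the subnets are made up):

ip link add link enp0s9 name enp0s9.10 type vlan id 10
ip link add link enp0s9 name enp0s9.20 type vlan id 20
ip link set enp0s9.10 up
ip link set enp0s9.20 up
# each macvlan network then takes one VLAN sub-interface as its parent
docker network create -d macvlan -o parent=enp0s9.10 --subnet=10.10.0.0/24 --gateway=10.10.0.1 mac_net10
docker network create -d macvlan -o parent=enp0s9.20 --subnet=10.20.0.0/24 --gateway=10.20.0.1 mac_net20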

Third-party Network Solutions

flannel

Flannel is a container networking solution developed by CoreOS. Flannel assigns a subnet to each host, and the containers on that host get their IP addresses from it. These addresses are routable between hosts, so containers can communicate across hosts without NAT or port mapping. Each host subnet is carved out of a larger address pool. Flannel runs an agent called flanneld on every host, whose job is to allocate the host's subnet from that pool. To share information between hosts, Flannel uses etcd (a distributed key-value store similar to Consul) to hold the network configuration, the allocated subnets, and the host IP addresses. How packets are forwarded between hosts is determined by the backend; Flannel offers several, the most common being vxlan and host-gw.
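As a sketch, the network configuration flanneld reads from the store looks roughly like this (written with the etcd v2 API; /coreos.com/network/config is flannel's default key prefix, and the CIDR is an example):

etcdctl set /coreos.com/network/config '{"Network": "10.244.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}'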

flannel.1 is a VXLAN device. The Linux kernel recognizes it and encapsulates packets sent through it automatically. To do so, the kernel needs to know which node a packet must be sent to: it looks up the forwarding database (FDB) on the node to find the address of the node hosting the remote VTEP device (whose MAC address is d6:51:2e:80:55:69 in this example). If the entry is not in the FDB, the kernel raises an "L2 miss" event to the flanneld process in user space. On receiving the event, flanneld queries etcd for the public IP of the node that owns that VTEP and installs the entry into the FDB, after which the kernel can look it up and encapsulate the packet (references: https://tonybai.com/2017/01/17/understanding-flannel-network-for-kubernetes/ and https://cizixs.com/2017/09/28/linux-vxlan/):

[k8s@TX-220-54-4 ~]$ ip link show flannel.1
9: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT
    link/ether 4a:c7:ba:58:1d:87 brd ff:ff:ff:ff:ff:ff
[k8s@TX-220-54-4 ~]$ ip a
7: bond0.120@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 24:6e:96:7d:5d:90 brd ff:ff:ff:ff:ff:ff
    inet 10.220.54.4/23 brd 10.220.55.255 scope global bond0.120
       valid_lft forever preferred_lft forever
9: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
    link/ether 4a:c7:ba:58:1d:87 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
10: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP qlen 1000
    link/ether d6:e8:51:68:d2:fd brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.1/24 scope global cni0
       valid_lft forever preferred_lft forever
14213: vethd176601d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP
    link/ether da:03:29:43:c8:31 brd ff:ff:ff:ff:ff:ff link-netnsid 2
[k8s@TX-220-54-4 ~]$ kubectl get po -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP             NODE                             NOMINATED NODE   READINESS GATES
simple-tensorflow-serving-cbbcdcc74-wsgjc   1/1     Running   214        83d   10.244.0.210   tx-220-54-4.h.chinabank.com.cn   <none>           <none>
[k8s@TX-220-54-4 ~]$ kubectl exec -it simple-tensorflow-serving-cbbcdcc74-wsgjc bash
root@simple-tensorflow-serving-cbbcdcc74-wsgjc:/simple_tensorflow_serving# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if14213: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 96:b9:cb:2e:a2:9c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.0.210/24 scope global eth0
       valid_lft forever preferred_lft forever
[k8s@TX-220-54-7 ~]$ ip r
default via 10.220.55.254 dev bond0.120
10.220.54.0/23 dev bond0.120 proto kernel scope link src 10.220.54.7
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
[k8s@TX-220-54-7 ~]$ bridge fdb show dev flannel.1 | grep 54.4
4a:c7:ba:58:1d:87 dst 10.220.54.4 self permanent
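The FDB entry flanneld installs is roughly equivalent to the following command, using the MAC of the remote flannel.1 device and the public IP of its node as seen in the session above:

bridge fdb append 4a:c7:ba:58:1d:87 dev flannel.1 dst 10.220.54.4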

calico

Calico is a pure layer 3 virtual networking solution. Calico assigns an IP address to every container, and every host acts as a router, connecting the containers on different hosts. Unlike VXLAN-based solutions, Calico does not encapsulate packets and needs neither NAT nor port mapping, so it scales and performs well. Calico has another advantage over other container network solutions: network policy. Users can dynamically define ACL rules to control the traffic entering and leaving containers, which meets most business requirements.
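For example, a minimal policy sketch (the names, labels, and port are placeholders; this uses the standard Kubernetes NetworkPolicy API, which Calico enforces):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-from-frontend
spec:
  podSelector:
    matchLabels:
      app: web
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 80
EOF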

fuckcloudnative.io/posts/poke-…

[root@YZ-25-58-1 istio-1.1.6]# kubectl exec -it ceph-pod1 sh
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if303: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1440 qdisc noqueue
    link/ether b2:ef:a6:90:45:a1 brd ff:ff:ff:ff:ff:ff
    inet 10.222.18.238/32 scope global eth0
       valid_lft forever preferred_lft forever
/ # ip route
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
/ # ip neigh
172.25.58.2 dev eth0 lladdr ee:ee:ee:ee:ee:ee used 0/0/0 probes 0 STALE
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee used 0/0/0 probes 4 STALE
[root@YZ-25-58-2 supdev]# ip a
303: calie6c32025bf7@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 55
[root@YZ-25-58-2 supdev]# ip route
10.222.18.238 dev calie6c32025bf7 scope link 
[root@YZ-25-58-2 supdev]# cat /proc/sys/net/ipv4/conf/calie6c32025bf7/proxy_arp
1
[root@YZ-25-58-2 supdev]# tcpdump -i calie6c32025bf7 -e -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on calie6c32025bf7, link-type EN10MB (Ethernet), capture size 262144 bytes
16:12:29.866426 b2:ef:a6:90:45:a1 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 10.222.18.238, length 28
16:12:29.866454 ee:ee:ee:ee:ee:ee > b2:ef:a6:90:45:a1, ethertype ARP (0x0806), length 42: Reply 169.254.1.1 is-at ee:ee:ee:ee:ee:ee, length 28
