Private cloud deployments are becoming increasingly common in enterprises, so before running a Kubernetes + Docker cluster in a private cloud we need to build a network environment that meets Kubernetes' requirements. The open source world offers many components that can bridge the network between Docker containers across hosts and realize the network model Kubernetes requires. Each solution fits different scenarios, so the choice should be driven by actual needs.

 

Kubernetes + Flannel

 

Kubernetes' network model assumes that all Pods live in a flat network space that is directly reachable. This network model is readily available on GCE (Google Compute Engine), so Kubernetes simply assumes the network already exists. When building a Kubernetes cluster in a private cloud, that assumption does not hold; we have to implement it ourselves, first enabling mutual access between Docker containers on different nodes, and then running Kubernetes on top.

 

Flannel is a network planning service designed by the CoreOS team for Kubernetes. Put simply, it gives Docker containers created on different node hosts virtual IP addresses that are unique across the whole cluster, and it builds an overlay network between these addresses through which packets are delivered intact to the target container.

 

Here’s a schematic of its network:

Flannel first creates a virtual network interface named flannel0; one side is connected to the docker0 bridge and the other to a daemon process named flanneld.

 

The flanneld process does quite a lot. It first connects to etcd and uses it to manage the allocation of IP address segments, while also watching the actual address of each Pod in etcd and building a Pod-to-node routing table in memory. It then connects docker0 to the physical network: using the in-memory routing table, it encapsulates the packets handed to it by docker0 and delivers them over the physical network to the flanneld on the target node, completing direct Pod-to-Pod communication.

 

There are many options for the underlying transport between flanneld instances, such as UDP, VXLAN, and AWS VPC, as long as one end can reach the other. The source flanneld encapsulates the packet, the target flanneld decapsulates it, and docker0 ultimately sees the original data; the process is transparent, and the containers are unaware that Flannel sits in the middle.

 

Flannel installation and configuration are covered extensively on the web and will not be repeated here. Note that Flannel uses etcd as its database, so etcd must be installed beforehand.
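
As a minimal sketch of that dependency (the subnet and backend below are illustrative assumptions, and the key prefix is Flannel's default), the cluster-wide network configuration is written into etcd before flanneld starts:

    etcdctl set /coreos.com/network/config '{"Network": "10.1.0.0/16", "Backend": {"Type": "vxlan"}}'
    # each node's flanneld then leases a per-node subnet out of 10.1.0.0/16
    systemctl start flanneld    # assuming flanneld is installed as a systemd service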

 

Here are a few scenarios:

1. Network communication within the same Pod. Containers in the same Pod share the same network namespace and the same Linux protocol stack, so for all network operations they can reach each other's ports directly via the localhost address, just as if they were running on the same machine. This is effectively the same environment as a traditional set of ordinary programs, which can therefore be ported into containers without any network-specific changes. The result is simplicity, security, and efficiency, and it lowers the difficulty of moving existing programs from physical or virtual machines into containers.
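
A quick way to observe this sharing (the Pod, namespace, container, and port names are purely illustrative assumptions):

    # from the "sidecar" container, the "app" container listening on 8080 is simply localhost
    kubectl exec -n demo web-pod -c sidecar -- curl -s http://localhost:8080/healthz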

 

2. Networking from Pod1 to Pod2, which comes in two cases: Pod1 and Pod2 on different hosts, and Pod1 and Pod2 on the same host.

  • First, Pod1 and Pod2 are not on the same host. The Pod addresses are in the same network segment as docker0, but the docker0 segment and the host NIC are two completely different IP networks, so communication between nodes can only go through the hosts' physical NICs. Flannel associates each Pod's IP address with the IP address of the node the Pod runs on, and this association lets Pods reach each other across hosts (see the inspection sketch after these two cases).

 

  • Pod1 and Pod2 are on the same host. In that case the docker0 bridge forwards the traffic to Pod2 directly, without going through Flannel.
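
On a Flannel node, the Pod-IP-to-node-IP association shows up as a per-node subnet plus ordinary host routes; a minimal inspection sketch (the file path is Flannel's default, the addresses are assumptions):

    cat /run/flannel/subnet.env    # e.g. FLANNEL_NETWORK=10.1.0.0/16, FLANNEL_SUBNET=10.1.15.1/24
    ip route                       # routes for other nodes' Pod subnets go via flannel0 / flannel.1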

 

3. Pod to Service networking. When a Service is created, a DNS name pointing to it is created as well, following the pattern {service name}.{namespace}.svc. Service traffic used to be forwarded by kube-proxy in userspace; for performance reasons it is now forwarded by iptables rules, which kube-proxy maintains. A Service supports only UDP and TCP, so ICMP does not work: you cannot ping a Service IP.
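
A quick illustration (the Service, namespace, and Pod names are assumptions; the KUBE-SERVICES chain is what iptables-mode kube-proxy programs):

    kubectl exec -n demo client -- curl -s http://web.demo.svc/     # TCP to the Service works
    kubectl exec -n demo client -- ping -c 1 web.demo.svc           # fails: the virtual IP answers no ICMP
    iptables -t nat -L KUBE-SERVICES -n | head                      # on a node: the Service forwarding rules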

 

4. Pod to the Internet. When a Pod sends a request to an external network, it looks up the routing table and forwards the packet to the host NIC. After the host NIC finishes route selection, iptables performs MASQUERADE, replacing the source IP with the host NIC's IP address, and the request then goes out to the external server.
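
The SNAT step corresponds to a rule of roughly this shape (the Pod CIDR and the excluded interface are assumptions about a particular cluster):

    iptables -t nat -A POSTROUTING -s 10.1.0.0/16 ! -o docker0 -j MASQUERADE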

 

5. External access to a Pod or Service. Because Pods and Services are virtual concepts inside the Kubernetes cluster, external client systems cannot reach them through a Pod IP or through a Service's virtual IP and port. To expose these services to external clients, the Pod or Service port can be mapped onto the host, so that client applications can reach the containerized application through the physical machine.
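
One common form of this mapping is a NodePort Service (the Pod name, namespace, and port are assumptions; the Pod must carry labels for kubectl expose to use as the selector):

    kubectl expose pod web-pod --port=80 --type=NodePort -n demo
    kubectl get svc web-pod -n demo      # note the allocated node port, e.g. 30080
    # the application is then reachable at http://<any-node-ip>:<node-port>/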

 

Conclusion: Flannel makes the Kubernetes network model work, but it introduces extra network components. Traffic has to detour through the flannel0 interface and then through the user-space flanneld process, and the reverse of that path must be traversed at the peer end, so some network latency is added. In addition, Flannel's default transport is UDP. UDP itself is unreliable; although the TCP endpoints still provide reliable transmission end to end, under heavy traffic and high concurrency you need to test and tune repeatedly to avoid transmission-quality problems. Network-sensitive applications in particular should be evaluated carefully.

 

Network customization based on Docker Libnetwork

There are two main approaches to cross-host container network communication: Layer 2 VLAN networks and overlay networks.

 

  • A Layer 2 VLAN network solves cross-host communication by transforming the original network architecture into one large, mutually reachable Layer 2 network and using direct routing on specific network devices to achieve point-to-point communication between containers.
  • An overlay network encapsulates Layer 2 packets on top of IP packets using an agreed communication protocol and a new data format, without changing the existing network infrastructure.

 

Libnetwork is the Docker team's effort to split Docker's networking functionality out of the Docker core into a separate library. Libnetwork provides networking for Docker in the form of plug-ins, and users can implement their own drivers to provide whatever network behavior they need.

 

The network model Libnetwork implements is essentially this: you can create one or more networks (a network is a bridge or a VLAN), and a container can join one or more networks. Containers on the same network can communicate with each other, while containers on different networks are isolated. This is what separating networking from Docker really means: we can create the network before creating the container (creating containers and creating networks are separate steps), and then decide which network to attach a container to.
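
With the docker CLI this model looks roughly as follows (network, container, and image names are illustrative):

    docker network create --driver bridge demo-net     # create the network first
    docker run -d --name c1 --network demo-net nginx   # then start a container attached to it
    docker network create --driver bridge other-net
    docker network connect other-net c1                # a running container can join further networks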

 

Libnetwork implements 5 networking modes:

  1. Bridge: Docker's default container network driver. The container is connected to the docker0 bridge through a veth pair, and Docker dynamically allocates an IP address and configures routes and firewall rules for the container.
  2. Host: The container and host share the same Network Namespace.
  3. Null: The network in the container is empty. You need to manually configure network interfaces and routes for the container.
  4. Remote: The remote driver lets Libnetwork connect to third-party network solutions through an HTTP RESTful API. An SDN solution such as SocketPlane can replace Docker's native network implementation as long as it implements the agreed HTTP URL handlers and the underlying network interface configuration.
  5. Overlay: Docker native cross-host multi-subnet network solution.
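
The driver is selected when a network or container is created; for example (image names are illustrative):

    docker run -d --network bridge nginx          # default bridge driver
    docker run -d --network host nginx            # share the host's network namespace
    docker run -d --network none busybox top      # "null": only a loopback interface; configure the rest by hand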

Docker's own network functionality is relatively simple and cannot satisfy many complex application scenarios, which is why many open source projects exist to improve Docker networking, such as Pipework, Weave, and SocketPlane.

 

Example: Network configuration tool Pipework

 

Pipework is an easy-to-use Docker container network configuration tool, implemented in a bit more than 200 lines of shell script. It uses commands such as ip, brctl, and ovs-vsctl to configure custom network bridges, NICs, routes, and so on for Docker containers. It has the following capabilities:

  • Supports custom Linux bridges and veth pairs for container communication.
  • Supports connecting containers to the local network using macvlan devices.
  • Supports obtaining a container's IP address via DHCP.
  • Supports Open vSwitch.
  • Supports VLAN partitioning.
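
Typical invocations look roughly like this (the bridge, container, and address values are assumptions, and the exact syntax can vary between pipework versions):

    pipework br1 c1 192.168.1.10/24          # Linux bridge plus a veth pair, static address
    pipework eth1 c1 10.0.0.5/24             # macvlan sub-interface on the physical NIC eth1
    pipework br1 c1 dhcp                     # obtain the address via DHCP
    pipework ovsbr0 c1 10.10.10.10/24 @42    # Open vSwitch bridge, VLAN 42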

 

Pipework simplifies the commands needed to wire containers together in complex scenarios and is a powerful tool for building complex network topologies. Docker's own network model is fine for basic applications, but with the rise of cloud computing and microservices we cannot stay stuck at that level forever; we need better performance and more flexible networking. Pipework is a great configuration tool, but it is not a complete solution: we can build on the capabilities it provides and add features as needed to assemble our own solution.

 

The advantage of OVS is that, as mature and stable open source virtual switch software, it supports a variety of network tunneling protocols and has stood the test of OpenStack and other projects. It is well documented on the web, so it is not repeated here.

 

Kubernetes integrates Calico

Calico is a pure Layer 3 data center network solution that integrates seamlessly with IaaS cloud architectures such as OpenStack to provide controlled IP communication between VMs, containers, and bare-metal machines.

 

By scaling the principles of Internet-scale IP networking down to the data center, Calico implements an efficient vRouter for data forwarding on each compute node using the Linux kernel. Each vRouter propagates workload routing information across the entire Calico network via BGP; small deployments can peer directly, while large deployments can use designated BGP route reflectors. All workload traffic is therefore interconnected by plain IP routing.
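
On a node this shows up as ordinary kernel routes learned over BGP; a small inspection sketch (the addresses and interface name are assumptions):

    ip route show proto bird    # e.g. 10.244.1.0/26 via 192.168.0.12 dev eth0 -- another node's workload block
    calicoctl node status       # BGP peering state with the other nodes or route reflectors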

 

Calico node networking can use the data center's existing network fabric directly (whether L2 or L3), without extra NAT, tunnels, or overlay networks.

Calico also provides rich and flexible network policy. Through ACLs enforced with iptables on each node, it guarantees workload multi-tenant isolation, security groups, and other reachability restrictions.
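
In a Kubernetes cluster such restrictions are usually expressed as NetworkPolicy objects, which Calico then enforces with iptables; a minimal sketch (the namespace, labels, and port are assumptions):

    kubectl apply -f - <<'EOF'
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: db-allow-frontend
      namespace: demo
    spec:
      podSelector:
        matchLabels:
          app: db
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: frontend
        ports:
        - protocol: TCP
          port: 5432
    EOF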

 

Calico has two deployment scenarios, depending on whether the etcd cluster is equipped with SSL certificates or not.

 

  • The first connects to etcd over plain HTTP: Calico is deployed in HTTP mode and talks to etcd directly, without certificates.
  • The second connects to an HTTPS-enabled etcd cluster and must load the etcd HTTPS certificates, which is a bit more troublesome.
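
The difference comes down to the etcd connection settings handed to the Calico components; a sketch using calicoctl's environment variables (the endpoints and file paths are assumptions):

    # HTTP mode: plain endpoints, no certificates
    export ETCD_ENDPOINTS=http://10.0.0.10:2379
    # HTTPS mode: TLS endpoints plus the certificate files
    export ETCD_ENDPOINTS=https://10.0.0.10:2379
    export ETCD_CA_CERT_FILE=/etc/calico/certs/ca.pem
    export ETCD_CERT_FILE=/etc/calico/certs/etcd-cert.pem
    export ETCD_KEY_FILE=/etc/calico/certs/etcd-key.pem
    calicoctl get nodes    # verifies that calicoctl can reach the etcd cluster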

 

Conclusion: At present the fastest Kubernetes networking option is Calico, with Flannel somewhat slower as the second choice; pick according to your own network environment. As a virtual networking tool for enterprise data centers, Calico uses BGP, routing tables, and iptables to implement a pure Layer 3 network without packet encapsulation and decapsulation, which also makes it easy to debug. It still has minor flaws, such as the stable release not yet supporting private networks, but hopefully future releases will be more capable.

 

Fixed IP addresses for application containers (based on online references)

 

Docker 1.9 began to support the Contiv netplugin; the convenience of Contiv is that a container can be reached directly via its instance IP.

 

Docker 1.10 supports starting a container with a specified IP address. Since some database applications require fixed instance IPs, it is worth studying how to design a fixed-IP scheme for containers.
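
The Docker mechanism this builds on is the --ip flag for user-defined networks; a minimal sketch (the subnet, names, and image are assumptions):

    docker network create --subnet 172.25.0.0/16 fixed-net
    docker run -d --name db --network fixed-net --ip 172.25.0.10 mysql:5.7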

 

In the default Kubernetes + Contiv network environment, the IP connectivity of container Pods is handled by the Contiv network plug-in, and the Contiv master only performs simple IP address allocation and recycling; it does not guarantee that a Pod's IP stays the same. A new IPAM (IP address management) plug-in at the Pod level can therefore be introduced to ensure that a Pod keeps the same IP when the same application is deployed repeatedly.

 

This functionality can be integrated directly into Kubernetes as Pod-level IPAM. A Pod is Kubernetes' minimum scheduling unit, and the original Kubernetes Pod Registry (which mainly handles all requests concerning Pods and Pod subresources: add, delete, change, query, bind, attach, log, and so on) does not support assigning an IP when a Pod is created; the IP of the Pod infra container is dynamically allocated by Contiv.

 

The Pod Registry has been rewritten and two new resource objects have been introduced in Kubernetes:

 

  1. Pod IP Allocator: an etcd-based IP address allocator that allocates and reclaims Pod IP addresses. It records address allocation in a bitmap and persists the bitmap to etcd.
  2. Pod IP Recycler: an etcd-based IP address recycler, and the core of keeping Pod IPs consistent. It records the IP addresses used by each application, keyed by namespace plus RC name, so that a reclaimed IP can be reused preferentially at the next deployment. The recycler only handles IPs created through an RC; IPs of Pods created through other controllers or directly are not recorded and therefore do not stay constant. In addition, the recycler checks a TTL on each reclaimed IP object; the retention time is set to one day.

Kubelet also needs to be modified, both to create containers with the IP specified in the Pod spec (adding the specified IP to docker run) and to release the IP when the Pod is deleted.

 

There are two main scenarios for Pod creation in PaaS:

  • First deployment and scale-up of an application: IP addresses are allocated randomly from the IP pool.
  • Redeployment of an application: the IP addresses released during redeployment are stored in an IP recycle list keyed by the RC's full name, and addresses are taken from that list first so that the IPs stay fixed.

 

Additional REST APIs have been added to Kubernetes to guard against potential problems with fixed IPs, including querying the assigned IP addresses and manually assigning or releasing IP addresses.

 

The container IP-fixing scheme has been tested and evaluated; it basically works, but its stability needs improvement. Occasionally an old Pod cannot be stopped within the expected time, so its IP cannot be released and reused (the initial cause is that Docker occasionally hangs and the container cannot be stopped within the allotted time); this can be repaired manually. In the long run, though, the IP-fixing solution needs to be hardened and optimized for the specific requirements.

 

Summary: There are many network solutions that support Kubernetes today, such as Flannel, Calico, Canal, and Weave Net. Because they all implement the CNI specification, users get the same network model no matter which one they choose: every Pod has its own IP and Pods communicate with each other directly. The differences lie in the underlying implementations: some build an overlay based on VXLAN, others use an underlay, and they differ in performance and in whether they support Network Policy.

 

Author: Sun Jie. First published on the WeChat account Docker (ID: dockerone); original [link]. Reposted with authorization on the Didi Cloud blog.