Hello everyone! Today I am very glad to have the opportunity to share with you our practice of, and thinking about, cloud native networking. With the popularity of containers, Kubernetes, microservices and other concepts and technologies, the curtain has risen on cloud native. Correspondingly, whether the platform is distributed or the business is split into microservices, a powerful network is needed to support the communication between virtual nodes and between microservices.

First of all, let's take a look at the CNCF landscape chart below, in which CNCF defines where networking sits in the architecture. It can be seen that the Runtime layer is divided into resource management, cloud native networking and cloud native storage.

1. What is a cloud native network

The difference between a cloud native network and a traditional network can be seen in the contrast between a container network and a physical network. One intuitive difference is that a container network changes at a very high rate. For a physical network, once the machines are racked, the switches and routers are wired up and all the physical machines are connected to the switches, the overall network topology will not change, and subsequent additions, deletions and modifications will not be very frequent.

For a container network, the number of nodes, pods and workloads in the cluster changes frequently, and the life cycle of each container is short, so the overall rate of network change is much higher than in a traditional network. At the same time, network policies also change very frequently, which places high requirements on network automation. Imagine a container network that required manual configuration for every change: it could not deliver the value of cloud native at all.

In addition, a container network also has high requirements for self-healing. With changes this frequent, it is difficult to solve the problem with hardware alone; much of the network is implemented in software, and the reliability of that software and of the applications on top of it also affects the network. If every node going online or offline, every container starting or stopping, and every node failure required human intervention to recover, the impact on the overall network would be huge, so the demands on the network's self-healing ability are particularly high.

However, a container network is not exactly equivalent to a cloud native network. Consider the original idea of cloud native: it was meant to be cross-platform, and cloud native applications should not depend on the underlying infrastructure. Ideally the network spans a variety of environments, public clouds, private clouds and physical machines, and can be migrated between clouds at any time. If your network is tied to a particular underlying implementation, such as a specific public or private cloud, it becomes much less cross-platform. In my opinion, a cloud native network is platform-independent and can be easily migrated across clouds.

The basic requirements of a container network are as follows. First, each pod needs its own IP. Second, the network between all pods is directly reachable at layer 3, without passing through any NAT device; that is, when one pod accesses another pod, both sides see the real pod IPs, with no address translation in between. Third, the network also needs to provide a series of network applications on top, such as Service, DNS, NetworkPolicy and Ingress. These network applications, together with the basic container network underneath and the cross-platform nature described earlier, constitute a complete cloud native network.
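As a rough illustration, the Go sketch below shows what these requirements mean from inside a pod; the peer pod IP 10.16.0.5 and the Service name my-service are hypothetical, and cluster DNS is assumed to be configured in the usual way.

```go
// Sketch: the flat, NAT-free pod network seen from inside a pod.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// 1. Direct layer-3 reachability: dial a peer pod IP with no NAT in between;
	//    the peer sees this pod's real IP as the source address.
	conn, err := net.DialTimeout("tcp", "10.16.0.5:8080", 2*time.Second)
	if err == nil {
		fmt.Println("pod-to-pod:", conn.LocalAddr(), "->", conn.RemoteAddr())
		conn.Close()
	}

	// 2. Network applications on top: cluster DNS resolves Services to virtual IPs.
	ips, err := net.LookupIP("my-service.default.svc.cluster.local")
	if err == nil {
		fmt.Println("service resolves to:", ips)
	}
}
```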

2. Open source implementations of the cloud native network

As for open source implementations, CNCF long ago defined a standard, CNI (Container Network Interface), which is in effect a pluggable container network protocol. Flannel, Cilium and Calico are all implemented against CNI. It specifies the interaction protocol between the upper scheduling system and the underlying container network, and that protocol defines two interfaces: ADD and DEL.
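To make the ADD/DEL contract concrete, here is a minimal sketch of a CNI plugin skeleton, assuming the reference library github.com/containernetworking/cni; the handler bodies are placeholders, not any real plugin's logic.

```go
// Sketch of the CNI contract: a plugin is an executable implementing ADD and DEL
// (plus CHECK in recent spec versions).
package main

import (
	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	current "github.com/containernetworking/cni/pkg/types/100"
	"github.com/containernetworking/cni/pkg/version"
)

// ADD: the runtime asks the plugin to wire a container into the network.
func cmdAdd(args *skel.CmdArgs) error {
	// args.ContainerID, args.Netns and args.IfName identify the container;
	// a real plugin would create interfaces and allocate an IP here.
	result := &current.Result{CNIVersion: current.ImplementedSpecVersion}
	return types.PrintResult(result, result.CNIVersion)
}

// DEL: the runtime asks the plugin to tear that wiring down.
func cmdDel(args *skel.CmdArgs) error {
	// Release the IP and remove interfaces/routes created in cmdAdd.
	return nil
}

func cmdCheck(args *skel.CmdArgs) error { return nil }

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "example plugin")
}
```

The runtime selects ADD or DEL through the CNI_COMMAND environment variable and passes the network configuration as JSON on stdin.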

We classify network plug-in implementations from two perspectives: the control plane and the data plane. The control plane determines how endpoints discover each other. A container network is essentially another layer of network on top of the physical machines' network, so how do the two layers learn about each other? How does a container on one node discover a container on another node? What is the mechanism?

These mechanisms can be roughly divided into four categories. The first stores all control information in a distributed key-value store: Flannel and Cilium, for example, keep their control information in etcd and distribute it to each node, while the open source Kube-OVN stores all of its control information in OVSDB (the OVN databases), which is likewise a distributed store based on the Raft protocol.
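A rough sketch of this pattern in Go, assuming the go.etcd.io/etcd/client/v3 library; the key layout and addresses are hypothetical and do not reflect Flannel's or Kube-OVN's actual schema.

```go
// Sketch: a distributed-KV control plane, where each node announces its pod
// subnet and watches the announcements of its peers.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Announce this node's pod subnet (hypothetical key/value scheme).
	_, _ = cli.Put(ctx, "/network/subnets/node-a", "10.16.0.0/24 via 192.168.0.11")

	// Every node watches the same prefix and programs routes or tunnel
	// forwarding entries for subnets announced by its peers.
	for resp := range cli.Watch(ctx, "/network/subnets/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```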

The second category has no separate store and relies directly on the routers and switches of the existing network for discovery. Calico and Kube-Router, for example, run an agent that speaks a routing protocol (BGP) and announces routing information to the external routers and switches, so the external hardware carries the control-plane information and forwards traffic accordingly.
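The end result on each node is ordinary routing state: a route to every other node's pod CIDR via that node's address. The sketch below illustrates this with the github.com/vishvananda/netlink library (an assumption for illustration; the real agents learn these routes via BGP rather than hard-coding them, and the addresses here are hypothetical).

```go
// Sketch: what a routing-based control plane ultimately programs on a node.
package main

import (
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// A pod CIDR hosted on a peer node, reachable via that node's physical IP.
	_, podCIDR, _ := net.ParseCIDR("10.16.1.0/24") // hypothetical peer pod CIDR
	peerNodeIP := net.ParseIP("192.168.0.12")      // hypothetical peer node IP

	// Equivalent to: ip route add 10.16.1.0/24 via 192.168.0.12
	route := &netlink.Route{Dst: podCIDR, Gw: peerNodeIP}
	if err := netlink.RouteAdd(route); err != nil {
		panic(err) // requires root privileges on a real host
	}
}
```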

The third category uses the underlying physical equipment directly. For physical machines, the control plane is the switch: once each machine is connected to the switch, the switch learns how to forward its traffic. If a container can attach directly to the underlying physical device, the same applies; typical examples are MACVLAN and IPVLAN.
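A sketch of the MACVLAN case, again assuming the github.com/vishvananda/netlink library for illustration; the uplink name eth0 is hypothetical.

```go
// Sketch: attaching a container directly to the physical network with MACVLAN,
// so the physical switch acts as the control plane and learns the container's
// MAC like any other host.
package main

import "github.com/vishvananda/netlink"

func main() {
	parent, err := netlink.LinkByName("eth0") // physical uplink (assumed name)
	if err != nil {
		panic(err)
	}

	mv := &netlink.Macvlan{
		LinkAttrs: netlink.LinkAttrs{
			Name:        "macvlan0",
			ParentIndex: parent.Attrs().Index,
		},
		Mode: netlink.MACVLAN_MODE_BRIDGE, // sibling interfaces can reach each other
	}
	// Equivalent to: ip link add macvlan0 link eth0 type macvlan mode bridge
	if err := netlink.LinkAdd(mv); err != nil {
		panic(err)
	}
	// The link would then be moved into the container's network namespace and
	// assigned an address from the underlay subnet.
}
```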

A more unusual but interesting implementation is Weave, which uses a gossip protocol, also known as an epidemic protocol: one node tells another node, and that node keeps telling other nodes, until the information has spread through the whole cluster.
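A toy Go sketch of gossip dissemination, purely to illustrate the idea; it is not Weave's implementation.

```go
// Sketch: epidemic/gossip spread in miniature. Each round, every node that
// already knows a piece of state repeats it to one random peer.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const nodes = 10
	informed := map[int]bool{0: true} // node 0 learns a routing update first

	for round := 1; len(informed) < nodes; round++ {
		senders := make([]int, 0, len(informed))
		for n := range informed {
			senders = append(senders, n)
		}
		for range senders {
			peer := rand.Intn(nodes) // each informed node "tells" one random peer
			informed[peer] = true
		}
		fmt.Printf("round %d: %d/%d nodes know the update\n", round, len(informed), nodes)
	}
}
```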

From the point of view of the data plane, there are mainly two kinds. One is encapsulation (overlay) mode, such as Flannel VXLAN, Kube-OVN Geneve, Calico IPIP and Cilium VXLAN. The other is underlay mode, in contrast to encapsulation: Flannel host-gw, Kube-OVN VLAN and Calico BGP.
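For the encapsulation side, the sketch below creates a VXLAN device with the github.com/vishvananda/netlink library (an illustrative assumption; each plugin manages its own tunnel devices and forwarding entries, and the VNI, device name and multicast group here are hypothetical). Underlay modes skip this extra header and hand traffic to the physical network directly.

```go
// Sketch: the encapsulation data plane. Pod traffic routed onto the VXLAN
// device is wrapped in UDP between node IPs.
package main

import (
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	vtep, err := netlink.LinkByName("eth0") // underlay device carrying the tunnel
	if err != nil {
		panic(err)
	}

	vx := &netlink.Vxlan{
		LinkAttrs:    netlink.LinkAttrs{Name: "vxlan100"},
		VxlanId:      100, // VNI identifying the overlay network
		VtepDevIndex: vtep.Attrs().Index,
		Port:         4789,                     // standard VXLAN UDP port
		Group:        net.ParseIP("239.1.1.1"), // hypothetical multicast group
	}
	// Equivalent to:
	// ip link add vxlan100 type vxlan id 100 dev eth0 dstport 4789 group 239.1.1.1
	if err := netlink.LinkAdd(vx); err != nil {
		panic(err)
	}
}
```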

3. Challenges faced by cloud native networks

When a cloud native network actually lands in production, it faces some challenges, which fall mainly into four areas.

First, at the functional level, the capabilities of cloud native networking are still relatively limited, and many customers run into features that still need to be built when they adopt it.

For example, fixed IPs: when users ask for them, either the network side or the application side has to change. Another is multi-tenancy. Some users have large clusters with many project teams and want tenant management capabilities, where a tenant means not only the isolation of compute resources but also the isolation of network resources, for example independent address spaces and independent network control. Some banks and other financial institutions want the traffic passing through the switches to be encrypted. And some customers do not run all of their business on Kubernetes, yet those applications still need to interact with the ones that do.

Second, monitoring and troubleshooting, which is even more troublesome than the functional level. You will find that most of the work after a container network goes live is operating and maintaining the network and investigating all kinds of network problems, the most typical being that the network is simply unreachable. Moreover, once development environments are involved and microservices are bundled together, the network structure and topology can become particularly complex.

Another common problem is compatibility with the existing technology stack. Many customers already have traditional network monitoring in place, and traditional network monitoring is very mature, but the equivalent is missing in the container network. This leads to gaps in overall monitoring and a lack of accumulated troubleshooting experience, so once a problem appears in the network it is difficult to resolve.

Third, security. Traditional approaches may use network policies based on blacklists and whitelists, or on IP address partitioning, but containers on Kubernetes use NetworkPolicy, which differs from traditional practice in many ways and brings a lot of compatibility problems. Traditional traffic auditing and traffic replay also expect to collect all traffic passing through the cluster; without good underlying infrastructure, many of these security mechanisms are left hanging in the air. These are the problems encountered in real deployments.
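To show the mismatch concretely, the sketch below builds a NetworkPolicy object with the Kubernetes Go API types (k8s.io/api); unlike an IP blacklist or whitelist, it selects traffic by pod labels, and the namespace, names and labels here are hypothetical.

```go
// Sketch: a label-based NetworkPolicy, as opposed to an IP-based rule set.
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-frontend", Namespace: "shop"},
		Spec: networkingv1.NetworkPolicySpec{
			// Applies to backend pods wherever they are scheduled,
			// whatever IPs they happen to receive.
			PodSelector: metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "backend"},
			},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "frontend"},
					},
				}},
			}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}
	fmt.Println("policy selects pods by label, not by IP:", policy.Name)
}
```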

Fourth, performance. When we talk about performance, most people focus on data plane performance. From our point of view, however, the bigger problem at this stage is how large a cluster the network can support and at what point the control plane's performance degrades noticeably, which should be considered when choosing a solution. In many cases the data plane's raw performance is not what stands out in the real world: it needs to be evaluated against the performance the application actually requires, and very often the most extreme performance is not the performance you most need.