CNI (Container Network Interface) is the standard network API for containers. In Kubernetes, CNI plug-ins are used to extend networking. In this article we implement our own CNI network plug-in from scratch.

See all the code in this article:

Github.com/qingwave/my…

CNI overview

Kubernetes provides many extension points. Through CNI network plug-ins it can support different network facilities, which gives the system great flexibility; CNI has become the de facto standard in the container networking area.

The interaction logic between Kubernetes and CNI is as follows:

The kubelet watches for Pods scheduled to the current node and calls the CRI runtime (containerd, CRI-O, etc.) over RPC. The CRI runtime creates the sandbox container, initializes its cgroups and namespaces, then calls the CNI plug-in to allocate an IP, and finally finishes creating and starting the container.

Unlike CRI and CSI, which communicate over RPC, CNI plug-ins are invoked as binaries: the runtime passes the operation through environment variables and the specific network configuration through standard input. The following figure shows the workflow of the Flannel CNI plug-in; Pod IP allocation and network configuration are carried out through a chained invocation of CNI plug-ins:
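
Concretely, when invoking the plug-in binary the runtime sets environment variables such as CNI_COMMAND (ADD/CHECK/DEL), CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME and CNI_PATH, and writes the network configuration JSON to stdin. A minimal configuration for our plug-in might look like this (cniVersion, name and type come from the CNI spec; the bridge and dataDir fields are illustrative assumptions matching the config fields used later):

{
  "cniVersion": "0.4.0",
  "name": "mycni",
  "type": "mycni",
  "bridge": "mycni0",
  "dataDir": "/run/mycni"
}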

Implementing a CNI plug-in

A complete K8s CNI plug-in needs to meet the following requirements:

  1. Pod IP assignment, i.e. the IPAM function
  2. The node can communicate with all Pods on it, so that health checks work
  3. All Pods in the cluster can communicate with each other, whether on the same node or on different nodes
  4. Support for other features, such as hostPort, iptables rules compatible with kube-proxy, etc.

We mainly implement the first three requirements, building a K8s networking scheme out of a Linux bridge, veth pairs, and routes.

The network architecture is as follows:

It consists of two components:

  1. mycni: the CNI plug-in; implements IPAM, assigns an IP to each Pod and configures its routes, and connects Pods on the same node through a bridge
  2. mycnid: the daemon on each node; watches K8s Nodes, obtains each node's PodCIDR, and writes the corresponding routes

mycni

The CNI project officially provides a toolkit (the skel package); we only need to implement the cmdAdd, cmdCheck, and cmdDel interfaces to build a CNI plug-in.

func PluginMainWithError(cmdAdd, cmdCheck, cmdDel func(_ *CmdArgs) error,
	versionInfo version.PluginInfo, about string) *types.Error {
	return (&dispatcher{
		Getenv: os.Getenv,
		Stdin:  os.Stdin,
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	}).pluginMain(cmdAdd, cmdCheck, cmdDel, versionInfo, about)
}
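
In our plug-in, the entry point just hands these three callbacks to the skel dispatcher. A minimal sketch of main (a sketch only; newer CNI library versions expose this as skel.PluginMainFuncs instead):

package main

import (
	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/version"
)

func main() {
	// skel reads CNI_COMMAND from the environment and dispatches
	// ADD/CHECK/DEL to the corresponding callback.
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "mycni")
}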

We focus on the process of network creation, IP assignment and bridge configuration:

func cmdAdd(args *skel.CmdArgs) error {
	// Load the configuration (error handling elided for brevity;
	// mtu and netns are derived from the config and args)
	conf, err := config.LoadCNIConfig(args.StdinData)

	// Store holding the node-local IP allocation list
	s, err := store.NewStore(conf.DataDir, conf.Name)
	defer s.Close()

	// The IPAM service allocates an IP address
	ipam, err := ipam.NewIPAM(conf, s)
	gateway := ipam.Gateway()
	ip, err := ipam.AllocateIP(args.ContainerID, args.IfName)

	// Create the bridge and the veth pair, and attach the host end to the bridge
	br, err := bridge.CreateBridge(conf.Bridge, mtu, ipam.IPNet(gateway))
	bridge.SetupVeth(netns, br, mtu, args.IfName, ipam.IPNet(ip), gateway)

	// Return the network configuration result
	result := &current.Result{
		CNIVersion: current.ImplementedSpecVersion,
		IPs: []*current.IPConfig{
			{
				Address: net.IPNet{IP: ip, Mask: ipam.Mask()},
				Gateway: gateway,
			},
		},
	}

	return types.PrintResult(result, conf.CNIVersion)
}

IPAM

The IPAM service needs to ensure that each Pod gets a unique IP. Since K8s assigns a PodCIDR to every node, our IPAM only has to guarantee that Pod IPs on the same node do not conflict.

Allocated IPs are stored in a local file. When a new Pod is created, we simply check the already-allocated IPs and pick an unused one from the CIDR. The usual practice is to keep IP information in a database (such as etcd); this article uses file storage purely for brevity.

First, IP allocation must not conflict under concurrent requests, which is achieved with a file lock. The store is defined as follows:

type data struct {
	IPs  map[string]containerNetInfo `json:"ips"`  // allocated IPs, keyed by container ID
	Last string                      `json:"last"` // the last assigned IP address
}

type Store struct {
	*filemutex.FileMutex // file lock
	dir      string
	data     *data
	dataFile string
}
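
On disk the store is simply a small JSON document, for example (the fields of containerNetInfo shown here are illustrative assumptions):

{
  "ips": {
    "8baedf2e...": { "ip": "10.244.1.2", "ifname": "eth0" }
  },
  "last": "10.244.1.2"
}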

The IP allocation code:

func (im *IPAM) AllocateIP(id, ifName string) (net.IP, error) {
	im.store.Lock() // lock the store to prevent allocation conflicts
	defer im.store.Unlock()

	im.store.LoadData() // load the allocated-IP data from the store

	// If an IP was already allocated for this container ID, return it
	ip, _ := im.store.GetIPByID(id)
	if len(ip) > 0 {
		return ip, nil
	}

	// Starting from the last assigned IP (`last`, restored from the store's Last
	// field; the loading is elided here), probe successive IPs; the first unused
	// one is recorded in the file and returned
	start := make(net.IP, len(last))
	copy(start, last)
	for {
		next, err := im.NextIP(start)
		if err == IPOverflowError && !last.Equal(im.gateway) {
			// wrap around to the beginning of the CIDR (the gateway) and keep probing
			start = im.gateway
			continue
		} else if err != nil {
			return nil, err
		}

		// Allocate the IP if it is not already in use
		if !im.store.Contain(next) {
			err := im.store.Add(next, id, ifName)
			return next, err
		}

		start = next
		if start.Equal(last) {
			break
		}
	}

	return nil, fmt.Errorf("no available ip")
}

The IP allocation process is:

  1. Take the file lock and read the already-allocated IPs
  2. Starting from the last assigned IP, check each candidate in turn (see the increment sketch below)
  3. If an IP is not in use, record it in the file and return it
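
The only slightly tricky part is stepping through the CIDR. A standalone sketch of the increment logic behind NextIP (names such as IPOverflowError and the exact signature are assumptions, not necessarily what the repository uses):

var IPOverflowError = errors.New("ip overflows the subnet")

// nextIP returns the address following ip, or IPOverflowError once the
// result leaves the subnet. Assumes IPv4 addresses.
func nextIP(ip net.IP, subnet *net.IPNet) (net.IP, error) {
	next := make(net.IP, len(ip))
	copy(next, ip)
	// Increment with carry, starting from the least significant byte.
	for i := len(next) - 1; i >= 0; i-- {
		next[i]++
		if next[i] != 0 {
			break
		}
	}
	if !subnet.Contains(next) {
		return nil, IPOverflowError
	}
	return next, nil
}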

Intra-node communication

Intra-node communication goes through the bridge. We create a veth pair, move one end into the Pod's network namespace and attach the other to the bridge, assign the IP allocated by IPAM, and set the default route. This gives us Node->Pod communication as well as communication between Pods on the same node.

First, if the bridge does not exist, create it with the netlink library:

func CreateBridge(bridge string, mtu int, gateway *net.IPNet) (netlink.Link, error) {
	if l, _ := netlink.LinkByName(bridge); l != nil {
		return l, nil
	}

	br := &netlink.Bridge{
		LinkAttrs: netlink.LinkAttrs{
			Name:   bridge,
			MTU:    mtu,
			TxQLen: -1,
		},
	}
	if err := netlink.LinkAdd(br); err != nil && err != syscall.EEXIST {
		return nil, err
	}

	dev, err := netlink.LinkByName(bridge)
	if err != nil {
		return nil, err
	}

	// Add the address, i.e. the Pods' default gateway address
	if err := netlink.AddrAdd(dev, &netlink.Addr{IPNet: gateway}); err != nil {
		return nil, err
	}
	// Bring the bridge up
	if err := netlink.LinkSetUp(dev); err != nil {
		return nil, err
	}

	return dev, nil
}

Then create the veth pair for the container:

func SetupVeth(netns ns.NetNS, br netlink.Link, mtu int, ifName string, podIP *net.IPNet, gateway net.IP) error {
	hostIface := &current.Interface{}
	err := netns.Do(func(hostNS ns.NetNS) error {
		// Create the veth pair inside the container network namespace
		hostVeth, containerVeth, err := ip.SetupVeth(ifName, mtu, "", hostNS)
		if err != nil {
			return err
		}
		hostIface.Name = hostVeth.Name

		// Set the IP on the container-side veth
		conLink, err := netlink.LinkByName(containerVeth.Name)
		if err != nil {
			return err
		}
		// Bind the Pod IP
		if err := netlink.AddrAdd(conLink, &netlink.Addr{IPNet: podIP}); err != nil {
			return err
		}

		// Bring the interface up
		if err := netlink.LinkSetUp(conLink); err != nil {
			return err
		}

		// Add the default route; the gateway is the bridge address
		if err := ip.AddDefaultRoute(gateway, conLink); err != nil {
			return err
		}

		return nil
	})
	if err != nil {
		return err
	}

	// Need to look up hostVeth again as its index has changed during the ns move
	hostVeth, err := netlink.LinkByName(hostIface.Name)
	if err != nil {
		return fmt.Errorf("failed to lookup %q: %v", hostIface.Name, err)
	}

	// Attach the host end of the veth pair to the bridge
	if err := netlink.LinkSetMaster(hostVeth, br); err != nil {
		return fmt.Errorf("failed to connect %q to bridge %v: %v", hostVeth.Attrs().Name, br.Attrs().Name, err)
	}

	return nil
}

At this point, the CNI plug-in itself is complete.
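
For completeness, cmdDel is essentially the reverse of cmdAdd: release the IP from the store and delete the container-side veth (its host peer disappears with it). A minimal sketch, assuming a ReleaseIP helper on the IPAM that mirrors AllocateIP:

func cmdDel(args *skel.CmdArgs) error {
	conf, err := config.LoadCNIConfig(args.StdinData)
	if err != nil {
		return err
	}

	s, err := store.NewStore(conf.DataDir, conf.Name)
	if err != nil {
		return err
	}
	defer s.Close()

	// Release the IP recorded for this container (ReleaseIP is an assumed helper)
	im, err := ipam.NewIPAM(conf, s)
	if err != nil {
		return err
	}
	if err := im.ReleaseIP(args.ContainerID); err != nil {
		return err
	}

	// DEL may be called without a network namespace; nothing more to clean up then
	if args.Netns == "" {
		return nil
	}

	// Delete the container-side veth; the host-side peer is removed along with it
	return ns.WithNetNSPath(args.Netns, func(_ ns.NetNS) error {
		_, err := ip.DelLinkByNameAddr(args.IfName)
		if err != nil && err != ip.ErrLinkNotFound {
			return err
		}
		return nil
	})
}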

mycnid

mycnid is the daemon process on each node; it enables Pod communication across nodes. Its main functions are:

  1. Watch K8s Nodes and write this node's PodCIDR to a configuration file (by default /run/mycni/subnet.json; an example is shown below)
  2. Add routes to the other nodes (ip route add podCIDR via nodeip)
  3. Some initial configuration: write the default iptables rules and initialize the bridge
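
For example, the subnet file written by mycnid could look like this (the field names are an assumption; it only needs to record the node's PodCIDR and bridge so the mycni binary can derive the gateway and allocation range):

{
  "subnet": "10.244.1.0/24",
  "bridge": "mycni0"
}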

Internode communication

Node resources are watched through a controller, and Reconcile is called to reconcile routes whenever a node is added or deleted:

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	result := reconcile.Result{}
	nodes := &corev1.NodeList{}
	// Get all nodes
	if err := r.client.List(ctx, nodes); err != nil {
		return result, err
	}
	// Build the desired routes from each node's CIDR
	cidrs := make(map[string]netlink.Route)
	for _, node := range nodes.Items {
		if node.Name == r.config.nodeName { // skip the current node
			continue
		}
		// Generate the desired route
		_, cidr, err := net.ParseCIDR(node.Spec.PodCIDR)
		if err != nil {
			return result, err
		}
		nodeip, err := getNodeInternalIP(&node)
		if err != nil {
			return result, err
		}
		route := netlink.Route{
			Dst:       cidr,
			Gw:        nodeip,
			LinkIndex: r.hostLink.Attrs().Index,
		}
		cidrs[cidr.String()] = route
		// Compare with the local routes: replace a differing route, create a missing one
		if currentRoute, ok := r.routes[cidr.String()]; ok {
			if isRouteEqual(route, currentRoute) {
				continue
			}
			if err := r.ReplaceRoute(currentRoute); err != nil {
				return result, err
			}
		} else {
			if err := r.addRoute(route); err != nil {
				return result, err
			}
		}
	}
	// Delete redundant routes
	for cidr, route := range r.routes {
		if _, ok := cidrs[cidr]; !ok {
			if err := r.delRoute(route); err != nil {
				return result, err
			}
		}
	}

	return result, nil
}

// Create a route
func (r *Reconciler) addRoute(route netlink.Route) (err error) {
	defer func() {
		if err == nil {
			r.routes[route.Dst.String()] = route
		}
	}()

	log.Info(fmt.Sprintf("add route: %s", route.String()))
	err = netlink.RouteAdd(&route)
	if err != nil {
		log.Error(err, "failed to add route", "route", route.String())
	}
	return
}

The main steps are:

  1. Whenever a Node changes, obtain the network information (PodCIDR, IP, etc.) of all nodes in the cluster
  2. Compare the desired routes with the host's current routes: add any missing route and delete any redundant route on the host (a sketch of the route comparison follows this list)
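
The route comparison itself only needs to look at the destination CIDR and the gateway. A possible isRouteEqual (a sketch; the real implementation may also compare the link index):

// isRouteEqual reports whether two routes target the same CIDR via the same gateway.
func isRouteEqual(a, b netlink.Route) bool {
	return a.Dst != nil && b.Dst != nil &&
		a.Dst.String() == b.Dst.String() &&
		a.Gw.Equal(b.Gw)
}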

Why not just modify the route for the changed Node?

Obtaining the network information of all nodes and comparing it with the host's routes follows Kubernetes programming conventions: declarative rather than procedural. This way no event can be lost; even if a route is deleted manually, it will be restored at the next synchronization.

Other configuration

If Docker is used, it forbids traffic forwarding on non-Docker bridges by default, so an iptables rule is required:

iptables -A FORWARD -i ${bridge} -j ACCEPT

Most clusters allow Pods to access external networks, so SNAT needs to be configured:

sudo iptables -t nat -A POSTROUTING -s $cidr -j MASQUERADE
# Allow forwarding on the host network interface
iptables -A FORWARD -i $hostNetWork -j ACCEPT

To configure iptables, use github.com/coreos/go-iptables/iptables:

func addIptables(bridgeName, hostDeviceName, nodeCIDR string) error {
	ipt, err := iptables.NewWithProtocol(iptables.ProtocolIPv4)
	if err != nil {
		return err
	}

	if err := ipt.AppendUnique("filter", "FORWARD", "-i", bridgeName, "-j", "ACCEPT"); err != nil {
		return err
	}

	if err := ipt.AppendUnique("filter", "FORWARD", "-i", hostDeviceName, "-j", "ACCEPT"); err != nil {
		return err
	}

	if err := ipt.AppendUnique("nat", "POSTROUTING", "-s", nodeCIDR, "-j", "MASQUERADE"); err != nil {
		return err
	}

	return nil
}

Demo

Use kind to create a multi-node K8s cluster: kind create cluster --config deploy/kind.yaml

# cat deploy/kind.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # Disable the default CNI
  disableDefaultCNI: true
nodes:
# One control-plane (master) node and three workers
- role: control-plane
- role: worker
- role: worker
- role: worker

Once created, you can see that all nodes are NotReady because no CNI is installed yet:

~ kubectl get no
NAME                 STATUS     ROLES                  AGE   VERSION
kind-control-plane   NotReady   control-plane,master   37s   v1.23.4
kind-worker          NotReady   <none>                 ...   v1.23.4
kind-worker2         NotReady   <none>                 ...   v1.23.4
kind-worker3         NotReady   <none>                 ...   v1.23.4

Next, build the mycni image and load it into the cluster with kind load docker-image.

The deployment of the CNI plug-in follows Flannel's approach: a DaemonSet runs a Pod on each node, whose init container copies the mycni binary to /opt/cni/bin and the configuration file to /etc/cni/net.d, after which the mycnid container starts.
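
A condensed sketch of what deploy/mycni.yaml could look like, modeled on Flannel's manifest (the image name, paths and other details here are assumptions, not the actual manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mycni
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: mycni
  template:
    metadata:
      labels:
        app: mycni
    spec:
      hostNetwork: true
      initContainers:
      # Copy the CNI binary and configuration onto the host
      - name: install-cni
        image: mycni:latest
        command: ["sh", "-c", "cp /mycni /opt/cni/bin/ && cp /10-mycni.conf /etc/cni/net.d/"]
        volumeMounts:
        - name: cni-bin
          mountPath: /opt/cni/bin
        - name: cni-cfg
          mountPath: /etc/cni/net.d
      containers:
      # The node daemon that writes subnet.json and node routes
      - name: mycnid
        image: mycni:latest
        command: ["mycnid"]
        securityContext:
          privileged: true
      volumes:
      - name: cni-bin
        hostPath:
          path: /opt/cni/bin
      - name: cni-cfg
        hostPath:
          path: /etc/cni/net.d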

kubectl apply -f deploy/mycni.yaml

After the deployment is complete, all nodes are in the Ready state.

Finally, to test the Pod network, deploy a multi-replica alpine Deployment:

kubectl create deployment cni-test --image=alpine --replicas=6 -- top
kubectl get po -owide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
cni-test-5df744744c-5wthb   1/1     Running   0          12s   10.244.2.3   kind-worker2   <none>           <none>
cni-test-5df744744c-7cdll   1/1     Running   0          12s   10.244.1.2   kind-worker    <none>           <none>
cni-test-5df744744c-jssjk   1/1     Running   0          12s   10.244.3.2   kind-worker3   <none>           <none>
cni-test-5df744744c-jw6xv   1/1     Running   0          12s   10.244.1.3   kind-worker    <none>           <none>
cni-test-5df744744c-klbr4   1/1     Running   0          12s   10.244.3.3   kind-worker3   <none>           <none>
cni-test-5df744744c-w7q9t   1/1     Running   0          12s   10.244.2.2   kind-worker2   <none>           <none>

All Pods start normally. First, test communication between the node and a Pod; the result is as follows:

root@kind-worker:/# ping 10.244.1.2
PING 10.244.1.2 (10.244.1.2) 56(84) bytes of data.
64 bytes from 10.244.1.2: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 10.244.1.2: icmp_seq=2 ttl=64 time=0.049 ms
--- 10.244.1.2 ping statistics ---

Pod-to-Pod communication on the same node:

~ kubectl exec cni-test-5df744744c-7cdll -- ping 10.244.1.3 -c 4
PING 10.244.1.3 (10.244.1.3): 56 data bytes
64 bytes from 10.244.1.3: seq=0 ttl=64 time=0.118 ms
64 bytes from 10.244.1.3: seq=1 ttl=64 time=0.077 ms
64 bytes from 10.244.1.3: seq=2 ttl=64 time=0.082 ms
64 bytes from 10.244.1.3: seq=3 ttl=64 time=0.075 ms

--- 10.244.1.3 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.077/0.090/0.118 ms

Pod-to-Pod communication across nodes:

~ kubectl exec cni-test-5df744744c-7cdll -- ping 10.244.2.2 -c 4
PING 10.244.2.2 (10.244.2.2): 56 data bytes
64 bytes from 10.244.2.2: seq=0 ttl=62 time=0.298 ms
64 bytes from 10.244.2.2: seq=1 ttl=62 time=0.234 ms
64 bytes from 10.244.2.2: seq=2 ttl=62 time=0.180 ms
64 bytes from 10.244.2.2: seq=3 ttl=62 time=0.234 ms

--- 10.244.2.2 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.180/0.236/0.298 ms

Pod access to the external network:

~ kubectl exec cni-test-5df744744c-7cdll -- ping www.baidu.com -c 4
PING www.baidu.com (103.235.46.39): 56 data bytes
64 bytes from 103.235.46.39: seq=0 ttl=47 time=312.115 ms
64 bytes from 103.235.46.39: seq=1 ttl=47 time=311.126 ms
64 bytes from 103.235.46.39: seq=2 ttl=47 time=311.653 ms
64 bytes from 103.235.46.39: seq=3 ttl=47 time=311.250 ms

--- www.baidu.com ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 311.126/311.536/312.115 ms

These tests verify that our mycni meets the requirements of a K8s CNI network plug-in: all Pods in the cluster can communicate with each other, nodes can communicate with Pods, and Pods can reach the external network.

Conclusion

This article first introduced the architecture of CNI; by implementing a CNI network plug-in by hand, you can gain a deeper understanding of how CNI works and of the related Linux networking concepts.

Corrections are welcome. All the code is available at:

Github.com/qingwave/my…

References

  • www.cni.dev/docs/
  • Ronaknathani.com/blog/2020/0…

Explore more in qingwave.github.io