Preface

Among Kubernetes CNI network models, the AWS VPC CNI adopts a flat layer-2 underlay scheme: each pod is assigned a VPC IP address that sits on the same layer-2 network as the worker nodes. Compared with overlay schemes such as Flannel, this is simpler and more direct, and brings two main benefits:

1. IP addresses inside and outside the cluster can communicate with each other directly

When adopting Kubernetes, many enterprises have a strong requirement for direct connectivity between workloads inside and outside the cluster, so that workloads can be migrated smoothly.

2. The encapsulation/decapsulation overhead of overlay schemes is eliminated, which improves network performance.

How it works

Main implementation logic:

ENI (Elastic Network Interface)

  • Each ENI is bound to one primary IP and multiple secondary IPs

  • A local IP address manager (ipamd) runs on each worker node and adds all secondary IP addresses of all ENIs to a local IP address pool

  • When the CNI plugin receives a pod-creation event, it requests an IP from ipamd over gRPC and sets up the pod network stack; when it receives a pod-deletion event, it notifies ipamd to release the IP and tears down the pod network stack
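The add/release contract between the CNI plugin and ipamd can be sketched as a minimal, runnable model. `fakeIPAMD`, and this in-memory pool, are illustrative stand-ins for the real gRPC service, not the actual implementation:

```go
// Hypothetical, simplified model of the CNI <-> ipamd interplay described above:
// the CNI plugin asks ipamd for an IP on pod add and releases it on pod del.
package main

import (
	"errors"
	"fmt"
)

type fakeIPAMD struct {
	free     []string          // local pool of secondary IPs
	assigned map[string]string // containerID -> assigned IP
}

// AddNetwork hands out a free IP from the local pool (illustrative stand-in
// for the real AddNetwork gRPC call).
func (d *fakeIPAMD) AddNetwork(containerID string) (string, error) {
	if len(d.free) == 0 {
		return "", errors.New("no free IPs in local pool")
	}
	ip := d.free[0]
	d.free = d.free[1:]
	d.assigned[containerID] = ip
	return ip, nil
}

// DelNetwork returns the container's IP to the pool.
func (d *fakeIPAMD) DelNetwork(containerID string) error {
	ip, ok := d.assigned[containerID]
	if !ok {
		return errors.New("unknown container")
	}
	delete(d.assigned, containerID)
	d.free = append(d.free, ip)
	return nil
}

func main() {
	ipamd := &fakeIPAMD{
		free:     []string{"10.0.97.30", "10.0.97.31"},
		assigned: map[string]string{},
	}
	ip, _ := ipamd.AddNetwork("container-1")
	fmt.Println("assigned:", ip) // assigned: 10.0.97.30
	_ = ipamd.DelNetwork("container-1")
	fmt.Println("free IPs after release:", len(ipamd.free)) // free IPs after release: 2
}
```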

CNI

The plugin complies with the Kubernetes CNI specification; it mainly implements the cmdAdd and cmdDel entry points, which handle pod network creation and destruction respectively.

  • cmdAdd

Code path: cmd/routed-eni-cni-plugin/cni.go

```go
func cmdAdd(args *skel.CmdArgs) error {
	return add(args, typeswrapper.New(), grpcwrapper.New(), rpcwrapper.New(), driver.New())
}

func add(args *skel.CmdArgs, cniTypes typeswrapper.CNITYPES, grpcClient grpcwrapper.GRPC,
	rpcClient rpcwrapper.RPC, driverClient driver.NetworkAPIs) error {
	conf, log, err := LoadNetConf(args.StdinData)
	...
	var k8sArgs K8sArgs
	if err := cniTypes.LoadArgs(args.Args, &k8sArgs); err != nil {
		log.Errorf("Failed to load k8s config from arg: %v", err)
		return errors.Wrap(err, "add cmd: failed to load k8s config from arg")
	}
	...
	// Open a gRPC connection to the ipamd server
	conn, err := grpcClient.Dial(ipamdAddress, grpc.WithInsecure())
	...
	c := rpcClient.NewCNIBackendClient(conn)
	// Call ipamd's AddNetwork interface to obtain an IP address
	r, err := c.AddNetwork(context.Background(), &pb.AddNetworkRequest{
		ClientVersion:              version,
		K8S_POD_NAME:               string(k8sArgs.K8S_POD_NAME),
		K8S_POD_NAMESPACE:          string(k8sArgs.K8S_POD_NAMESPACE),
		K8S_POD_INFRA_CONTAINER_ID: string(k8sArgs.K8S_POD_INFRA_CONTAINER_ID),
		Netns:                      args.Netns,
		ContainerID:                args.ContainerID,
		NetworkName:                conf.Name,
		IfName:                     args.IfName,
	})
	...
	addr := &net.IPNet{
		IP:   net.ParseIP(r.IPv4Addr),
		Mask: net.IPv4Mask(255, 255, 255, 255),
	}
	...
	// Set up the pod network namespace (veth pair, routes, policy rules)
	err = driverClient.SetupNS(hostVethName, args.IfName, args.Netns, addr,
		int(r.DeviceNumber), r.VPCcidrs, r.UseExternalSNAT, mtu, log)
	...
	ips := []*current.IPConfig{
		{
			Version: "4",
			Address: *addr,
		},
	}
	result := &current.Result{
		IPs: ips,
	}
	return cniTypes.PrintResult(result, conf.CNIVersion)
}
```

Summary: the CNI plugin requests an IP from the ipamd service over gRPC, then calls the driver module to set up the pod network environment with the IP it obtained.

  • cmdDel

cmdDel mirrors cmdAdd: it calls ipamd's DelNetwork interface over gRPC to release the pod's IP address, then calls the driver module's TeardownNS to clean up the pod network environment.


driver


This module provides the primitives for creating and destroying the pod network stack. The driver module's main functions are SetupNS and TeardownNS.

Code path: cmd/routed-eni-cni-plugin/driver.go

Code logic:

  • SetupNS

This function configures the pod network stack: it prepares the pod network environment and sets up policy-based routing (PBR).

In the AWS CNI network model, each ENI on a node gets its own routing table for forwarding from-pod traffic. Under PBR, to-pod traffic is matched first and routed through the main routing table, while from-pod traffic is routed through the table of the ENI that owns the pod's IP. The PBR rules therefore have to be configured when the pod network is set up.

```go
func (os *linuxNetwork) SetupNS(hostVethName string, contVethName string, netnsPath string,
	addr *net.IPNet, deviceNumber int, vpcCIDRs []string, useExternalSNAT bool,
	mtu int, log logger.Logger) error {
	log.Debugf("SetupNS: hostVethName=%s, contVethName=%s, netnsPath=%s, deviceNumber=%d, mtu=%d",
		hostVethName, contVethName, netnsPath, deviceNumber, mtu)
	return setupNS(hostVethName, contVethName, netnsPath, addr, deviceNumber, vpcCIDRs,
		useExternalSNAT, os.netLink, os.ns, mtu, log, os.procSys)
}

func setupNS(hostVethName string, contVethName string, netnsPath string, addr *net.IPNet,
	deviceNumber int, vpcCIDRs []string, useExternalSNAT bool, netLink netlinkwrapper.NetLink,
	ns nswrapper.NS, mtu int, log logger.Logger, procSys procsyswrapper.ProcSys) error {
	// Call setupVeth to set up the pod network environment
	hostVeth, err := setupVeth(hostVethName, contVethName, netnsPath, addr, netLink, ns, mtu, procSys, log)
	...
	addrHostAddr := &net.IPNet{
		IP:   addr.IP,
		Mask: net.CIDRMask(32, 32),
	}
	// Equivalent to "ip route add $ip dev veth-1" in the main routing table
	route := netlink.Route{
		LinkIndex: hostVeth.Attrs().Index,
		Scope:     netlink.SCOPE_LINK,
		Dst:       addrHostAddr,
	}
	// The netLink interface wraps Linux commands such as "ip link", "ip route" and "ip rule"
	if err := netLink.RouteReplace(&route); err != nil {
		return errors.Wrapf(err, "setupNS: unable to add or replace route entry for %s", route.Dst.IP.String())
	}
	// Add the to-pod PBR rule with "ip rule", e.g.
	// 512: from all to 10.0.97.30 lookup main
	err = addContainerRule(netLink, true, addr, mainRouteTable)
	...
	// If the ENI is not the primary ENI, add a PBR rule for traffic leaving the pod, e.g.
	// 1536: from 10.0.97.30 lookup eni-1
	if deviceNumber > 0 {
		tableNumber := deviceNumber + 1
		err = addContainerRule(netLink, false, addr, tableNumber)
		...
	}
	return nil
}
```

The final result:

```
# ip rule list
0:	from all lookup local
512:	from all to 10.0.97.30 lookup main <---------- to-pod traffic
1025:	not from all to 10.0.0.0/16 lookup main
1536:	from 10.0.97.30 lookup eni-1 <-------------- from-pod traffic
```
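The rule layout above can be reproduced with a small helper. `podRules` is purely illustrative (not part of the plugin); it assumes the 512/1536 priorities and the `eni-N` table naming shown in the `ip rule list` output:

```go
// Illustrative sketch: derive the "ip rule" entries that SetupNS produces
// for a given pod IP and ENI device number. Priorities and table naming
// are assumptions taken from the sample output above.
package main

import "fmt"

// podRules returns the to-pod and from-pod policy rules for a pod IP.
// deviceNumber 0 means the primary ENI, whose from-pod traffic already
// goes through the main table, so no from-pod rule is added.
func podRules(podIP string, deviceNumber int) []string {
	rules := []string{
		// to-pod traffic: matched early (priority 512), routed via the main table
		fmt.Sprintf("512: from all to %s lookup main", podIP),
	}
	// from-pod traffic on a secondary ENI: routed via that ENI's table (priority 1536)
	if deviceNumber > 0 {
		rules = append(rules, fmt.Sprintf("1536: from %s lookup eni-%d", podIP, deviceNumber))
	}
	return rules
}

func main() {
	for _, r := range podRules("10.0.97.30", 1) {
		fmt.Println(r)
	}
	// 512: from all to 10.0.97.30 lookup main
	// 1536: from 10.0.97.30 lookup eni-1
}
```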
  • createVethPairContext

The createVethPairContext struct holds the parameters needed to create the veth pair. Its run method is the core of the setupVeth function: it creates the veth pair, brings both ends up, and configures the pod's gateway, routes, and static ARP entry.

```go
func newCreateVethPairContext(contVethName string, hostVethName string, addr *net.IPNet, mtu int) *createVethPairContext {
	return &createVethPairContext{
		contVethName: contVethName,
		hostVethName: hostVethName,
		addr:         addr,
		netLink:      netlinkwrapper.NewNetLink(),
		ip:           ipwrapper.NewIP(),
		mtu:          mtu,
	}
}

// run executes inside the pod's network namespace
func (createVethContext *createVethPairContext) run(hostNS ns.NetNS) error {
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{
			Name:  createVethContext.contVethName,
			Flags: net.FlagUp,
			MTU:   createVethContext.mtu,
		},
		PeerName: createVethContext.hostVethName,
	}
	// "ip link add" - create the veth pair for the pod
	if err := createVethContext.netLink.LinkAdd(veth); err != nil {
		return err
	}
	hostVeth, err := createVethContext.netLink.LinkByName(createVethContext.hostVethName)
	...
	// "ip link set $link up" - bring up the host side of the veth pair
	if err = createVethContext.netLink.LinkSetUp(hostVeth); err != nil {
		return errors.Wrapf(err, "setup NS network: failed to set link %q up", createVethContext.hostVethName)
	}
	contVeth, err := createVethContext.netLink.LinkByName(createVethContext.contVethName)
	if err != nil {
		return errors.Wrapf(err, "setup NS network: failed to find link %q", createVethContext.contVethName)
	}
	// Bring up the pod side of the veth pair
	if err = createVethContext.netLink.LinkSetUp(contVeth); err != nil {
		return errors.Wrapf(err, "setup NS network: failed to set link %q up", createVethContext.contVethName)
	}
	// Add a link-scoped route to the default gateway 169.254.1.1
	if err = createVethContext.netLink.RouteReplace(&netlink.Route{
		LinkIndex: contVeth.Attrs().Index,
		Scope:     netlink.SCOPE_LINK,
		Dst:       gwNet}); err != nil {
		return errors.Wrap(err, "setup NS network: failed to add default gateway")
	}
	// Add the default route: "default via 169.254.1.1 dev eth0"
	if err = createVethContext.ip.AddDefaultRoute(gwNet.IP, contVeth); err != nil {
		return errors.Wrap(err, "setup NS network: failed to add default route")
	}
	// "ip addr add $ip dev eth0" - assign the pod's IP address
	if err = createVethContext.netLink.AddrAdd(contVeth, &netlink.Addr{IPNet: createVethContext.addr}); err != nil {
		return errors.Wrapf(err, "setup NS network: failed to add IP addr to %q", createVethContext.contVethName)
	}
	// Add a static ARP entry for the default gateway
	neigh := &netlink.Neigh{
		LinkIndex:    contVeth.Attrs().Index,
		State:        netlink.NUD_PERMANENT,
		IP:           gwNet.IP,
		HardwareAddr: hostVeth.Attrs().HardwareAddr,
	}
	if err = createVethContext.netLink.NeighAdd(neigh); err != nil {
		return errors.Wrap(err, "setup NS network: failed to add static ARP")
	}
	// Move the host end of the veth pair into the host network namespace
	if err = createVethContext.netLink.LinkSetNsFd(hostVeth, int(hostNS.Fd())); err != nil {
		return errors.Wrap(err, "setup NS network: failed to move veth to host netns")
	}
	return nil
}
```
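The sequence run performs maps onto plain ip(8) commands. `podSetupCommands` below is a hypothetical helper that renders that sequence for reference; the angle-bracket placeholders stand for values only known at runtime and are not real arguments:

```go
// Illustrative only: the netlink operations of run(), expressed as the
// equivalent ip(8) commands. Not part of the plugin source.
package main

import "fmt"

func podSetupCommands(podIP, contVeth, hostVeth string) []string {
	gw := "169.254.1.1" // link-local default gateway used by the AWS CNI
	return []string{
		fmt.Sprintf("ip link add %s type veth peer name %s", contVeth, hostVeth),
		fmt.Sprintf("ip link set %s up", hostVeth),
		fmt.Sprintf("ip link set %s up", contVeth),
		fmt.Sprintf("ip route add %s/32 dev %s scope link", gw, contVeth),
		fmt.Sprintf("ip route add default via %s dev %s", gw, contVeth),
		fmt.Sprintf("ip addr add %s/32 dev %s", podIP, contVeth),
		fmt.Sprintf("ip neigh add %s lladdr <hostVeth-mac> dev %s nud permanent", gw, contVeth),
		fmt.Sprintf("ip link set %s netns <host-netns>", hostVeth),
	}
}

func main() {
	for _, cmd := range podSetupCommands("10.0.97.30", "eth0", "eni-abc123") {
		fmt.Println(cmd)
	}
}
```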
  • TeardownNS

Clears the pod network environment.

```go
func (os *linuxNetwork) TeardownNS(addr *net.IPNet, deviceNumber int, log logger.Logger) error {
	log.Debugf("TeardownNS: addr %s, deviceNumber %d", addr.String(), deviceNumber)
	return tearDownNS(addr, deviceNumber, os.netLink, log)
}

func tearDownNS(addr *net.IPNet, deviceNumber int, netLink netlinkwrapper.NetLink, log logger.Logger) error {
	...
	// Delete the to-pod PBR rule, equivalent to "ip rule del"
	toContainerRule := netLink.NewRule()
	toContainerRule.Dst = addr
	toContainerRule.Priority = toContainerRulePriority
	err := netLink.RuleDel(toContainerRule)
	...
	// If the pod was on a secondary ENI, delete the from-pod PBR rules
	if deviceNumber > 0 {
		err := deleteRuleListBySrc(*addr)
		...
	}
	addrHostAddr := &net.IPNet{
		IP:   addr.IP,
		Mask: net.CIDRMask(32, 32),
	}
	...
	return nil
}
```

IPAMD

The local IP address pool manager runs on each worker node as a DaemonSet and maintains all available IP addresses on the node. So where does the data in the IP address pool come from?

AWS EC2 exposes instance metadata (EC2 metadata), which includes all ENIs attached to the instance and all IP addresses on each ENI, through HTTP endpoints:

```
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/
curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:da:9d:51:47:28/local-ipv4s
```
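Both endpoints return plain-text bodies with one entry per line (the macs/ listing ends each MAC with a trailing "/"), so consuming them is just a matter of splitting on newlines. `parseList` is an illustrative helper, not ipamd code:

```go
// Sketch of parsing EC2 metadata listing bodies. The endpoint responses are
// newline-separated; the macs/ listing suffixes each entry with "/".
package main

import (
	"fmt"
	"strings"
)

// parseList splits a metadata listing body into entries, trimming the
// trailing "/" that the macs/ endpoint appends.
func parseList(body string) []string {
	var out []string
	for _, line := range strings.Split(strings.TrimSpace(body), "\n") {
		line = strings.TrimSuffix(strings.TrimSpace(line), "/")
		if line != "" {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Example body in the shape returned by .../macs/<mac>/local-ipv4s
	body := "10.0.97.5\n10.0.97.30\n10.0.97.31\n"
	fmt.Println(parseList(body))
}
```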

During initialization, ipamd stores this ENI/IP information in its dataStore; this is implemented in nodeInit.

nodeInit

```go
func (c *IPAMContext) nodeInit() error {
	...
	// Query the EC2 metadata service for all attached ENIs
	metadataResult, err := c.awsClient.DescribeAllENIs()
	...
	enis := c.filterUnmanagedENIs(metadataResult.ENIMetadata)
	...
	// Register each ENI's information, retrying on failure
	retry := 0
	for {
		retry++
		if err = c.setupENI(eni.ENIID, eni, isTrunkENI, isEFAENI); err == nil {
			log.Infof("ENI %s set up.", eni.ENIID)
			break
		}
		...
	}
	return nil
}
```
  • setupENI

The main tasks of setupENI are to initialize the dataStore's data, including:

  • Add the ENI to the dataStore
  • For non-primary ENIs, set up the host-side network for the ENI (SetupENINetwork)
  • Add all secondary IP addresses of the ENI to the dataStore
```go
func (c *IPAMContext) setupENI(eni string, eniMetadata awsutils.ENIMetadata, isTrunkENI, isEFAENI bool) error {
	primaryENI := c.awsClient.GetPrimaryENI()

	err := c.dataStore.AddENI(eni, eniMetadata.DeviceNumber, eni == primaryENI, isTrunkENI, isEFAENI)
	...
	c.primaryIP[eni] = eniMetadata.PrimaryIPv4Address()

	if eni != primaryENI {
		err = c.networkClient.SetupENINetwork(c.primaryIP[eni], eniMetadata.MAC, eniMetadata.DeviceNumber, eniMetadata.SubnetIPv4CIDR)
		...
	}
	...
	c.addENIsecondaryIPsToDataStore(eniMetadata.IPv4Addresses, eni)
	c.addENIprefixesToDataStore(eniMetadata.IPv4Prefixes, eni)

	return nil
}
```

dataStore

The dataStore is an in-memory local DB built from Go structs. It maintains the node's ENIs and all IP addresses bound to them. Each IP address is keyed by an IPAMKey: when an IP is assigned, the key is composed of the network name, CNI_CONTAINERID, and CNI_IFNAME; when the IP is unassigned, its IPAMKey is left empty.

Code path: pkg/ipamd/datastore/data_store.go

```go
type DataStore struct {
	total                    int
	assigned                 int
	allocatedPrefix          int
	eniPool                  ENIPool
	lock                     sync.Mutex
	log                      logger.Logger
	CheckpointMigrationPhase int
	backingStore             Checkpointer
	cri                      cri.APIs
	isPDEnabled              bool
}

type ENI struct {
	ID                 string
	createTime         time.Time
	IsPrimary          bool
	IsTrunk            bool
	IsEFA              bool
	DeviceNumber       int
	AvailableIPv4Cidrs map[string]*CidrInfo
}

type AddressInfo struct {
	IPAMKey        IPAMKey
	Address        string
	UnassignedTime time.Time
}

type CidrInfo struct {
	Cidr          net.IPNet // e.g. 192.168.1.1/24
	IPv4Addresses map[string]*AddressInfo
	IsPrefix      bool
}

// ENIPool maps ENI ID -> ENI
type ENIPool map[string]*ENI
```

The dataStore exposes two main methods, AssignPodIPv4Address and UnassignPodIPv4Address; these are what the CNI add and del paths ultimately call to obtain and release an IP address, respectively.
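The two methods can be condensed into a runnable model. This sketch assumes prefix delegation is disabled and ignores locking; the type and method names mirror the real code, but the real implementation lives in data_store.go:

```go
// Condensed model of the dataStore's assign/unassign logic: scan ENIs for a
// free address, mark it with the pod's IPAMKey, and clear the key on release.
package main

import (
	"errors"
	"fmt"
)

// IPAMKey identifies the pod that owns an address; the zero value means free.
type IPAMKey struct {
	NetworkName, ContainerID, IfName string
}

type addressInfo struct {
	key     IPAMKey
	address string
}

type eni struct {
	id           string
	deviceNumber int
	addresses    []*addressInfo
}

type dataStore struct{ enis []*eni }

// AssignPodIPv4Address scans every ENI for a free address and marks it assigned.
func (ds *dataStore) AssignPodIPv4Address(key IPAMKey) (string, int, error) {
	for _, e := range ds.enis {
		for _, a := range e.addresses {
			if a.key == (IPAMKey{}) { // free address
				a.key = key
				return a.address, e.deviceNumber, nil
			}
		}
	}
	return "", 0, errors.New("no available IP addresses")
}

// UnassignPodIPv4Address finds the pod's address by key and marks it free.
func (ds *dataStore) UnassignPodIPv4Address(key IPAMKey) (string, error) {
	for _, e := range ds.enis {
		for _, a := range e.addresses {
			if a.key == key {
				a.key = IPAMKey{} // mark unassigned
				return a.address, nil
			}
		}
	}
	return "", errors.New("address not found")
}

func main() {
	ds := &dataStore{enis: []*eni{{
		id: "eni-1", deviceNumber: 1,
		addresses: []*addressInfo{{address: "10.0.97.30"}, {address: "10.0.97.31"}},
	}}}
	key := IPAMKey{"aws-cni", "container-1", "eth0"}
	ip, dev, _ := ds.AssignPodIPv4Address(key)
	fmt.Println(ip, dev) // 10.0.97.30 1
	released, _ := ds.UnassignPodIPv4Address(key)
	fmt.Println("released:", released) // released: 10.0.97.30
}
```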

  • AssignPodIPv4Address
```go
// Assign an IP address to a pod
func (ds *DataStore) AssignPodIPv4Address(ipamKey IPAMKey) (ipv4Address string, deviceNumber int, err error) {
	// Take a mutex around dataStore operations
	ds.lock.Lock()
	defer ds.lock.Unlock()
	...
	// Scan every ENI, and every CIDR on it, for a free address
	for _, eni := range ds.eniPool {
		for _, availableCidr := range eni.AvailableIPv4Cidrs {
			var addr *AddressInfo
			var strPrivateIPv4 string
			var err error
			if (ds.isPDEnabled && availableCidr.IsPrefix) ||
				(!ds.isPDEnabled && !availableCidr.IsPrefix) {
				strPrivateIPv4, err = ds.getFreeIPv4AddrfromCidr(availableCidr)
				if err != nil {
					ds.log.Debugf("Unable to get IP address from CIDR: %v", err)
					// Check the next CIDR
					continue
				}
			}
			...
			addr = availableCidr.IPv4Addresses[strPrivateIPv4]
			...
			// Mark the IP as assigned by setting its IPAMKey
			ds.assignPodIPv4AddressUnsafe(ipamKey, eni, addr)
			...
			return addr.Address, eni.DeviceNumber, nil
		}
	}
	...
}
```
  • UnAssignPodIPv4Address
```go
func (ds *DataStore) UnassignPodIPv4Address(ipamKey IPAMKey) (e *ENI, ip string, deviceNumber int, err error) {
	...
	// Find the pod's IP address in the eniPool by its IPAMKey
	eni, availableCidr, addr := ds.eniPool.FindAddressForSandbox(ipamKey)
	...
	// Mark the IP as unassigned: reset the address's IPAMKey
	ds.unassignPodIPv4AddressUnsafe(addr)
	...
	addr.UnassignedTime = time.Now()
	...
	return eni, addr.Address, eni.DeviceNumber, nil
}
```