Overview

Our production K8S clusters run Calico deployed in private data centers by Ansible as plain binaries, without the official calico/node container and without felix: since we don't use network policy, only the confd and bird processes are deployed as services. Networking uses the Border Gateway Protocol (BGP), peered with ToR (Top of Rack) routers: each worker node establishes a BGP peering session with its ToR switch, and the ToR switch in turn peers with the upper-layer core switch, so pod IPs are directly reachable on the intranet.

BGP: a protocol that distributes dynamic routes between networks and transmits data over TCP. For example, suppose switch A is connected to 12 worker nodes. A BGP client such as bird or GoBGP is installed on each worker node; each node then advertises its routes to switch A, which aggregates them and continues forwarding them up to the core switch. Note that the routes on switch A are node-level, not pod-level.
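To make this concrete, here is a minimal sketch of the kind of per-peer stanza that ends up in a node's bird.cfg (the protocol name, addresses, and AS numbers are hypothetical, and real Calico-generated configs use templates and export filters omitted here):

# Hypothetical bird.cfg fragment for one worker node peering with its ToR switch.
protocol bgp tor_switch {
  local 10.203.10.20 as 65188;    # this node's address and AS, rendered from Calico Node data
  neighbor 10.203.10.1 as 65001;  # ToR switch address and AS, rendered from BGPPeer data
  import all;                     # accept routes learned from the switch
  export all;                     # advertise this node's pod routes upward (real configs filter)
}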

During maintenance of the K8S cloud platform, we sometimes found that none of the pod IPs on a worker node were reachable from outside the cluster. Investigation showed that the worker node had two intranet NICs, eth0 and eth1. The BGP connection with the switch had been established using the eth0 IP address and its AS number, but the bird startup configuration file bird.cfg used the eth1 NIC's IP address: the ipv4Address in Calico's Node data was inconsistent with the switch-facing peerIP in the BGPPeer data. The Calico data can be inspected with the following commands:


calicoctl get node ${nodeName} -o yaml
calicoctl get bgppeer ${peerName} -o yaml

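On the faulty node the two objects disagreed, roughly like the following reconstruction (all names and addresses are hypothetical):

# calicoctl get node node-1 -o yaml
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  name: node-1
spec:
  bgp:
    ipv4Address: 10.204.20.30/32  # eth1 address; bird.cfg is rendered from this value
    asNumber: 65188

# calicoctl get bgppeer node-1-tor -o yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: node-1-tor
spec:
  node: node-1
  peerIP: 10.203.10.1             # switch address, obtained via eth0
  asNumber: 65001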

The root cause lies in the Ansible deployment. It calls the network API via eth0 to obtain the AS number and peer IP of the switch's BGP peer, and writes the node-specific BGPPeer data into Calico with the ansible task calicoctl apply -f bgp_peer.yaml, so the Calico BGPPeer data carries the eth0-side switch address. However, when the ansible task that renders the bird.cfg configuration file runs, the IP environment variable is taken from the eth1 interface, so the eth1 NIC address is written into the Calico Node data; the confd process then reads that Node data and generates a bird.cfg that uses the eth1 address. It should have been eth0.

Once the cause was found, the problem was happily resolved.
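For the record, the repair amounts to rewriting the Node's BGP address so that confd regenerates bird.cfg from the eth0 NIC; a sketch with hypothetical values, assuming a calicoctl version whose patch subcommand supports Node resources:

# Point the Node's BGP address back at eth0 (hypothetical address), after which
# confd re-renders bird.cfg and bird re-establishes the session with the switch.
calicoctl patch node node-1 --patch '{"spec":{"bgp":{"ipv4Address":"10.203.10.20/32"}}}'
# Also fix the ansible variable that fills the IP environment variable, so it
# always picks the same NIC (eth0) as the one used in bgp_peer.yaml.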

How does Calico write the Node data? The code lives in Calico's startup program, startup.go. The official calico/node container starts several processes such as bird, confd, and felix, managed by runsvdir (a supervisor-like tool). When the container starts it also runs the initialization script, configured here at L11-L13:


# Run the startup initialisation script.
# These ensure the node is correctly configured to run.
calico-node -startup || exit 1


So, let's take a look at what the initialization script does.

Initialization script source code analysis

When the calico-node -startup command runs, what actually executes is the startup.go script, L111-L113:


  func main() {
    // ...
    if *runStartup {
      logrus.SetFormatter(&logutils.Formatter{Component: "startup"})
      startup.Run()
    }
    // ...
  }
  

The startup.go script does three things:

  • Detecting IP address and Network to use for BGP.
  • Configuring the node resource with IP/AS information provided in the environment, or autodetected.
  • Creating default IP Pools for quick-start use (NO_DEFAULT_POOLS can be set to disable this). If an IP Pool already exists in the cluster, creation is skipped anyway, so disabling is optional. Our production K8S ansible deployment disables it; leaving it enabled would do no harm.

So initialization really does one key thing: write a Node record into Calico, which confd later uses to render the bird.cfg configuration. L97-L223:


func Run() {
  // ...
  // Read the name of the current host from environment variables such as
  // NODENAME, HOSTNAME, or the file pointed to by CALICO_NODENAME_FILE
  nodeName := determineNodeName()

  // Create the CalicoClient:
  // If DATASTORE_TYPE is kubernetes, only the KUBECONFIG variable is needed
  // (when running as a k8s pod, not even that), with KubernetesClient as the source
  // of truth; see the calicoctl configuration docs:
  // https://docs.projectcalico.org/getting-started/clis/calicoctl/configure/kdd
  // If DATASTORE_TYPE is etcdv3, the etcd environment variables must be configured:
  // https://docs.projectcalico.org/getting-started/clis/calicoctl/configure/etcd
  // For local use the variables can be exported in ~/.zshrc, see
  // https://docs.projectcalico.org/getting-started/clis/calicoctl/configure/kdd#example-using-environment-variables:
  // export CALICO_DATASTORE_TYPE=kubernetes
  // export CALICO_KUBECONFIG=~/.kube/config
  cfg, cli := calicoclient.CreateClient()
  // ...
  if os.Getenv("WAIT_FOR_DATASTORE") == "true" {
    // Polls the datastore with a dummy Get("foo") until it answers
    waitForConnection(ctx, cli)
  }
  // ...

  // Query the nodeName Node from Calico; if it does not exist, construct a new Node object.
  // The Node object is later updated with the host's IP address
  node := getNode(ctx, cli, nodeName)

  var clientset *kubernetes.Clientset
  var kubeadmConfig, rancherState *v1.ConfigMap

  // If running under kubernetes with secrets to call the k8s API
  if config, err := rest.InClusterConfig(); err == nil {
    // If the k8s cluster was deployed by kubeadm or Rancher, read the kubeadm-config
    // or full-cluster-state ConfigMap, used to fill the ClusterType variable and to
    // create the IPPool. Our production K8S uses neither of these deployment methods

    // ...
  }

  // The key logic: configure the Node object's spec.bgp.ipv4Address.
  // The IPv4 address can be obtained via several policies: the IP environment variable
  // can carry a concrete address such as 10.203.10.20, or the value "autodetect",
  // in which case the detection policy comes from the IP_AUTODETECTION_METHOD
  // environment variable, e.g. can-reach or interface=eth.*; for the available policies see:
  // https://docs.projectcalico.org/archive/v3.17/networking/ip-autodetection
  // Our production K8S uses an ansible variable to put the ipv4 address of eth${interface}
  // into the IP environment variable. On a machine with two intranet NICs either eth0 or
  // eth1 may be picked; it must be the same NIC as the one used to create the BGP peer,
  // and also depends on whether the machine's default gateway sits on eth0 or eth1
  configureAndCheckIPAddressSubnets(ctx, cli, node)

  // We use bird, i.e. CALICO_NETWORKING_BACKEND=bird
  if os.Getenv("CALICO_NETWORKING_BACKEND") != "none" {
    // Read the AS number from the AS environment variable; a default such as 65188 is fine
    configureASNumber(node)
    if clientset != nil {
      // When deployed with the official calico/node container, patch away the current
      // k8s node's NetworkUnavailable condition, which marks the network as unavailable; see
      // https://kubernetes.io/docs/concepts/architecture/nodes/#condition
      // Our production K8S does not deploy the calico/node container, so this branch is
      // not taken; besides, our K8S version is too old and has no NetworkUnavailable
      // condition in the node conditions
      err := setNodeNetworkUnavailableFalse(*clientset, nodeName)
      // ...
    }
  }

  // Node.Spec.OrchRefs for k8s, with the value read from the CALICO_K8S_NODE_REF environment variable
  configureNodeRef(node)
  // Create directories such as /var/run/calico, /var/lib/calico, and /var/log/calico
  ensureFilesystemAsExpected()

  // The Calico Node object is ready to be created or updated.
  // This is the core logic of the startup script: create or update the Node object at initialization
  if _, err := CreateOrUpdate(ctx, cli, node); err != nil {
    // ...
  }

  // Configure the cluster IP Pool, i.e. the pod CIDR for the entire cluster.
  // For example, with a /18 pool and a /27 block per K8S worker node, the cluster can
  // hold at most 2^(27-18)=512 nodes, and each machine can run 2^(32-27)-2=30 pods
  // (the first two addresses are reserved)
  configureIPPools(ctx, cli, kubeadmConfig)

  // Unless DatastoreType is kubernetes, write a global FelixConfiguration object named
  // "default", plus a per-node felix configuration for each Node.
  // We don't use Felix, so we don't need to pay much attention to the felix data
  if err := ensureDefaultConfig(ctx, cfg, cli, node, getOSType(), kubeadmConfig, rancherState); err != nil {
    log.WithError(err).Errorf("Unable to set global default configuration")
    terminate()
  }

  // Write nodeName to the file specified by the CALICO_NODENAME_FILE environment variable
  writeNodeConfig(nodeName)
  // ...
}

// Query the nodeName Node from Calico; if it does not exist, construct a new Node object
func getNode(ctx context.Context, client client.Interface, nodeName string) *api.Node {
  node, err := client.Nodes().Get(ctx, nodeName, options.GetOptions{})
  // ...
  if err != nil {
    // ...
    node = api.NewNode()
    node.Name = nodeName
  }
  return node
}

// Create or update the Node object
func CreateOrUpdate(ctx context.Context, client client.Interface, node *api.Node) (*api.Node, error) {
  if node.ResourceVersion != "" {
    return client.Nodes().Update(ctx, node, options.SetOptions{})
  }
  return client.Nodes().Create(ctx, node, options.SetOptions{})
}

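For reference, the environment variables that drive this code path might be set roughly as follows (example values, not our production settings):

export DATASTORE_TYPE=kubernetes               # or etcdv3 plus the ETCD_* variables
export NODENAME=node-1                         # falls back to HOSTNAME / CALICO_NODENAME_FILE
export CALICO_NETWORKING_BACKEND=bird          # we run bird; "none" skips the BGP configuration
export AS=65188                                # default AS number for the node
export IP=10.203.10.20                         # or IP=autodetect, combined with:
export IP_AUTODETECTION_METHOD=interface=eth0  # or e.g. can-reach=10.0.0.1
export NO_DEFAULT_POOLS=true                   # skip default IP pool creation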

From the code analysis above, two pieces of logic deserve a closer look: one is how the current machine's IP address is obtained; the other is how the pod CIDR is configured for the cluster.

Pod CIDR logic (L858-L1050)


// configureIPPools ensures that default IP pools are created (unless explicitly requested otherwise).
func configureIPPools(ctx context.Context, client client.Interface, kubeadmConfig *v1.ConfigMap) {
  // Read in environment variables for use here and later.
  ipv4Pool := os.Getenv("CALICO_IPV4POOL_CIDR")
  ipv6Pool := os.Getenv("CALICO_IPV6POOL_CIDR")

  if strings.ToLower(os.Getenv("NO_DEFAULT_POOLS")) == "true" {
    // ...
    return
  }
  // ...
  // Read the block size from the CALICO_IPV4POOL_BLOCK_SIZE environment variable,
  // i.e. the subnet mask length of the block assigned to each node, /26 by default.
  // With the default pool 192.168.0.0/16 and a /26 block per node, the cluster can
  // hold 2^(26-16)=1024 machines
  ipv4BlockSizeEnvVar := os.Getenv("CALICO_IPV4POOL_BLOCK_SIZE")
  if ipv4BlockSizeEnvVar != "" {
    ipv4BlockSize = parseBlockSizeEnvironment(ipv4BlockSizeEnvVar)
  } else {
    // DEFAULT_IPV4_POOL_BLOCK_SIZE is the default block size, 26
    ipv4BlockSize = DEFAULT_IPV4_POOL_BLOCK_SIZE
  }
  // ...
  // List all existing IP Pools
  poolList, err := client.IPPools().List(ctx, options.ListOptions{})
  // ...
  // Check for IPv4 and IPv6 pools.
  ipv4Present := false
  ipv6Present := false
  for _, p := range poolList.Items {
    ip, _, err := cnet.ParseCIDR(p.Spec.CIDR)
    if err != nil {
      log.Warnf("Error parsing CIDR '%s'. Skipping the IPPool.", p.Spec.CIDR)
    }
    version := ip.Version()
    ipv4Present = ipv4Present || (version == 4)
    ipv6Present = ipv6Present || (version == 6)
    // If the cluster already has IP pools, createIPPool() below is not called
    if ipv4Present && ipv6Present {
      break
    }
  }
  if ipv4Pool == "" {
    // If no pod CIDR is configured, fall back to the default "192.168.0.0/16"
    ipv4Pool = DEFAULT_IPV4_POOL_CIDR
    // ...
  }
  // ...
  // Only create the default IPv4 pool if none exists in the cluster yet
  if !ipv4Present {
    log.Debug("Create default IPv4 IP pool")
    outgoingNATEnabled := evaluateENVBool("CALICO_IPV4POOL_NAT_OUTGOING", true)

    createIPPool(ctx, client, ipv4Cidr, DEFAULT_IPV4_POOL_NAME, ipv4IpipModeEnvVar, ipv4VXLANModeEnvVar, outgoingNATEnabled, ipv4BlockSize, ipv4NodeSelector)
  }
  // ... IPv6 logic omitted
}

// Create an IP pool
func createIPPool(ctx context.Context, client client.Interface, cidr *cnet.IPNet, poolName, ipipModeName, vxlanModeName string, isNATOutgoingEnabled bool, blockSize int, nodeSelector string) {
  // ...
  pool := &api.IPPool{
    ObjectMeta: metav1.ObjectMeta{
      Name: poolName,
    },
    Spec: api.IPPoolSpec{
      CIDR:         cidr.String(),
      NATOutgoing:  isNATOutgoingEnabled,
      IPIPMode:     ipipMode, // We use BGP in production, so IPIPMode is Never
      VXLANMode:    vxlanMode,
      BlockSize:    blockSize,
      NodeSelector: nodeSelector,
    },
  }
  // Create the IP pool
  if _, err := client.IPPools().Create(ctx, pool, options.SetOptions{}); err != nil {
    // ...
  }
}
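To double-check the capacity arithmetic from the comments above, here is a tiny self-contained Go sketch; the /16 pool and /26 block size are just the example values from the comments:

package main

import "fmt"

func main() {
  // Example: pool 192.168.0.0/16 carved into per-node /26 blocks.
  poolPrefix, blockPrefix := 16, 26
  nodes := 1 << (blockPrefix - poolPrefix) // 2^(26-16) = 1024 blocks, one per node
  addrs := 1 << (32 - blockPrefix)         // 2^(32-26) = 64 addresses per block
  fmt.Printf("max nodes: %d, addresses per node block: %d\n", nodes, addrs)
}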

Then look at the IP address logic, L498-L585:


// Configure an IPv4 address for the Node object
func configureIPsAndSubnets(node *api.Node) (bool, error) {
  // ...
  oldIpv4 := node.Spec.BGP.IPv4Address

  // Read the IP address from the IP environment variable. Our production k8s ansible
  // passes the address directly, but on machines with two intranet NICs the address
  // read here can differ from the one used in bgp_peer.yaml, which defaults to eth0.
  // So the IP here must be the eth0 address.
  ipv4Env := os.Getenv("IP")
  if ipv4Env == "autodetect" || (ipv4Env == "" && node.Spec.BGP.IPv4Address == "") {
    adm := os.Getenv("IP_AUTODETECTION_METHOD")
    // Pick the NIC address according to the autodetection policy; see the code at
    // L701-L746 (https://github.com/projectcalico/node/blob/release-v3.17/pkg/startup/startup.go#L701-L746)
    // and the configuration docs
    // (https://docs.projectcalico.org/archive/v3.17/networking/ip-autodetection).
    // When calico/node is deployed in k8s, can-reach=xxx can save a lot of trouble
    cidr := autoDetectCIDR(adm, 4)
    if cidr != nil {
      // We autodetected an IPv4 address so update the value in the node.
      node.Spec.BGP.IPv4Address = cidr.String()
    } else if node.Spec.BGP.IPv4Address == "" {
      return false, fmt.Errorf("Failed to autodetect an IPv4 address")
    } else {
      // ...
    }
  } else if ipv4Env == "none" && node.Spec.BGP.IPv4Address != "" {
    log.Infof("Autodetection for IPv4 disabled, keeping existing value: %s", node.Spec.BGP.IPv4Address)
    validateIP(node.Spec.BGP.IPv4Address)
  } else if ipv4Env != "none" {
    // Our production k8s ansible takes this branch, using the eth0 address directly;
    // the subnet defaults to /32. See the official docs:
    // https://docs.projectcalico.org/archive/v3.17/networking/ip-autodetection#manually-configure-ip-address-and-subnet-for-a-node
    if ipv4Env != "" {
      node.Spec.BGP.IPv4Address = parseIPEnvironment("IP", ipv4Env, 4)
    }
    validateIP(node.Spec.BGP.IPv4Address)
  }
  // ...
  // Detect if we've seen the IP address change, and flag that we need to check for conflicting Nodes
  if node.Spec.BGP.IPv4Address != oldIpv4 {
    log.Info("Node IPv4 changed, will check for conflicts")
    return true, nil
  }

  return false, nil
}

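So in our ansible case, setting for example IP=10.203.10.20 (a hypothetical eth0 address) ends up stored in the Node data roughly as:

# Resulting Node data (hypothetical values)
spec:
  bgp:
    ipv4Address: 10.203.10.20/32  # the subnet defaults to /32 when IP is set explicitly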

The above is the execution logic of the calico startup script. It is relatively simple, but after learning the code, troubleshooting becomes much more methodical; otherwise we can only guess blindly, and even if a problem happens to get solved we don't know why, so the next similar problem is just as baffling, which wastes time.

Conclusion

This article mainly studied the execution logic of the calico startup script, which chiefly writes the deploying host's Node data into Calico. The error-prone spot is that on a machine with two NICs the Node and BGPPeer data can become inconsistent, in which case bird cannot distribute routes and the machine's pod addresses become unroutable from inside or outside the cluster.

At present our production Calico is deployed as binaries by Ansible, which makes it inconvenient to troubleshoot through logs. It is still recommended to deploy the calico/node container in K8S. The logic that calls the network API to fetch the switch's BGP peer data could live in initContainers, which would then write it into Calico with calicoctl apply -f bgp_peer.yaml. Of course, the migration would no doubt involve stepping into many pits, and cost time and energy.
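A sketch of that idea (the image name and script are hypothetical, and the datastore environment variables calicoctl needs are omitted):

# Hypothetical initContainer fragment for a calico-node DaemonSet
initContainers:
- name: configure-bgp-peer
  image: registry.example.com/bgp-peer-bootstrap:latest
  command: ["/bin/sh", "-c"]
  args:
  - |
    # Query the network API for this node's ToR switch AS number and peer IP,
    # render bgp_peer.yaml, then write it into Calico before calico-node starts.
    /usr/local/bin/render-bgp-peer.sh > /tmp/bgp_peer.yaml
    calicoctl apply -f /tmp/bgp_peer.yaml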

In short, Calico is an excellent K8S CNI implementation. It uses the mature BGP protocol to distribute routes, and packets are routed at layer 3 without SNAT/DNAT operations, so its principles and workflow are easy to understand: kubelet calls calico to create the network objects and NICs, and calico-ipam assigns a subnet block to the current Node and IP addresses to pods.