## Problem symptoms

An independent cluster running v1.18.4 (specifically, a version earlier than v1.18.4-TKE.5) is created on the TKE console. The node information of the cluster is as follows:

There are 3 master nodes and 1 worker node, and the worker node is in a different availability zone from the master nodes.

| node | role | labels |
| --- | --- | --- |
| ss-stg-ma-01 | master | failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200002 |
| ss-stg-ma-02 | master | failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200002 |
| ss-stg-ma-03 | master | failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200002 |
| ss-stg-test-01 | worker | failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200004 |

After the cluster is created, a DaemonSet object is created, and one pod of the DaemonSet gets stuck in the Pending state. The symptoms are as follows:

```
$ kubectl get pod -o wide
NAME          READY   STATUS    RESTARTS   AGE   NODE
debug-4m8lc   1/1     Running   1          89m   ss-stg-ma-01
debug-dn47c   0/1     Pending   0          89m   <none>
debug-lkmfs   1/1     Running   1          89m   ss-stg-ma-02
debug-qwdbc   1/1     Running   1          89m   ss-stg-test-01
```

(Note: The latest version of TKE is v1.18.4-TKE.8, and the latest version is used by default when creating a cluster.)

## Problem conclusion

When the k8s scheduler schedules a pod, it synchronizes a snapshot from the scheduler's internal cache; the snapshot contains the information of the nodes the pod can be scheduled onto. The cause of the problem above (a pod instance of the DaemonSet stuck in the Pending state) is that part of the node information was lost during this synchronization, so some pod instances of the DaemonSet could not be scheduled to their designated nodes and stayed in the Pending state.

The following is a detailed investigation process.

## Log troubleshooting

Node information of the customer's online cluster (as shown in the screenshots):

- k8s master nodes: ss-stg-ma-01, ss-stg-ma-02, ss-stg-ma-03
- k8s worker node: ss-stg-test-01

1. Get the scheduler logs. First, dynamically increase the scheduler's log level, for example directly to V(10), and try to capture some related logs. After the log level was increased, the following key information was captured:

To clarify: when scheduling a pod, it is possible to enter the scheduler's preempt phase, and the log above was printed in the preempt phase. There are 4 nodes in the cluster (3 master nodes and 1 worker node), but only 3 nodes appear in the log, and one master node is missing. Therefore, we suspected that node info was missing from the scheduler's internal cache.

2. k8s v1.18 supports printing the scheduler's internal cache information. The printed scheduler internal cache is as follows:

It can be seen that the node info in the scheduler's internal cache is complete (3 master nodes and 1 worker node). By analyzing the logs, a preliminary conclusion can be drawn: the node info in the scheduler's internal cache is complete, but some node information is missing from the node list actually used when scheduling the pod.

## Back to the problem

Before going any further, let’s reacquaint ourselves with the scheduler pod scheduling process (shown in part) and the nodeTree data structure.

### Pod scheduling process (partially shown)

Combined with the figure above: the scheduling of a pod is one scheduler cycle. At the start of the cycle, the first step is Update Snapshot. The snapshot is a per-cycle cache that stores the node info used for pod scheduling, and Update Snapshot is a synchronization from the nodeTree (the node information stored in the scheduler's internal cache) to the snapshot. The synchronization is performed by calling the nodeTree.next() function, whose logic is as follows:

```go
// next returns the name of the next node. NodeTree iterates over zones and in each zone iterates
// over nodes in a round robin fashion.
func (nt *nodeTree) next() string {
	if len(nt.zones) == 0 {
		return ""
	}
	numExhaustedZones := 0
	for {
		if nt.zoneIndex >= len(nt.zones) {
			nt.zoneIndex = 0
		}
		zone := nt.zones[nt.zoneIndex]
		nt.zoneIndex++
		// We do not check the exhausted zones before calling next() on the zone. This ensures
		// that if more nodes are added to a zone after it is exhausted, we iterate over the new nodes.
		nodeName, exhausted := nt.tree[zone].next()
		if exhausted {
			numExhaustedZones++
			if numExhaustedZones >= len(nt.zones) {
				// all zones are exhausted. we should reset.
				nt.resetExhausted()
			}
		} else {
			return nodeName
		}
	}
}
```
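
For context, the snapshot update calls next() once for every node currently in the scheduler cache in order to rebuild the snapshot's node list. The sketch below shows the shape of that loop; it is not the exact upstream code, and the function name and the explicit totalNodes parameter are ours for illustration:

```go
// Simplified sketch of how the snapshot's node list is rebuilt during Update Snapshot:
// one next() call per node currently in the cache. Because next() resumes from the
// cursors left over by the previous synchronization, the sequence it returns can
// contain duplicates and omissions, which is exactly the problem analyzed below.
func buildSnapshotNodeList(nt *nodeTree, totalNodes int) []string {
	names := make([]string, 0, totalNodes)
	for i := 0; i < totalNodes; i++ {
		names = append(names, nt.next())
	}
	return names
}
```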

Combined with the results of the troubleshooting above, we can further narrow the scope of the problem: the synchronization from the nodeTree (the scheduler's internal cache) to the snapshot lost one node's information.

### nodeTree data structure

In the nodeTree data structure there are two cursors, zoneIndex and lastIndex (one per zone), which control the synchronization from the nodeTree (the scheduler's internal cache) to snapshot.nodeInfoList. Importantly, the cursor values left over from the last synchronization are used as the starting point of the next synchronization.
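
For reference, here is a simplified view of the relevant fields of the two structures (based on the v1.18 code; the comments are ours):

```go
// nodeTree is the scheduler cache's per-zone index of node names.
type nodeTree struct {
	tree      map[string]*nodeArray // zone name -> nodes in that zone
	zones     []string              // list of zone names
	zoneIndex int                   // cursor: which zone next() reads from next
	numNodes  int                   // total number of nodes in the tree
}

// nodeArray holds the node names of one zone.
type nodeArray struct {
	nodes     []string // node names in this zone
	lastIndex int      // cursor: which node in this zone next() reads from next
}
```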

### Reproduce the problem and locate the root cause

When a k8s cluster is created, the master nodes are added first, and the worker nodes are added afterwards.

First round of synchronization: the 3 master nodes are created, and then pod scheduling starts (for example, the CNI plug-in is deployed in the cluster as a DaemonSet), which triggers a synchronization from the nodeTree (the scheduler's internal cache) to the snapshot. After the synchronization, the two cursors of the nodeTree end up as follows:

```
nodeTree.zoneIndex = 1
nodeTree.nodeArray[sh:200002].lastIndex = 3
```

Second round of synchronization: after the worker node joins the cluster, a DaemonSet is created, which triggers the second round of synchronization from the nodeTree (the scheduler's internal cache) to the snapshot. The synchronization proceeds as follows:

1. zoneIndex=1 points to zone sh:200004; nodeArray[sh:200004].lastIndex=0, so we get ss-stg-test-01, and zoneIndex advances to 2.

2. zoneIndex=2 >= len(zones), so zoneIndex is reset to 0 and points to zone sh:200002; nodeArray[sh:200002].lastIndex=3, so this zone is exhausted and nothing is returned.

3. zoneIndex=1 points to zone sh:200004; nodeArray[sh:200004].lastIndex=1, so this zone is exhausted too. Now all zones are exhausted, so all cursors are reset: zoneIndex=0 and every lastIndex=0.

4. zoneIndex=0 points to zone sh:200002; nodeArray[sh:200002].lastIndex=0, so we get ss-stg-ma-01.

5. zoneIndex=1 points to zone sh:200004; nodeArray[sh:200004].lastIndex=0, so we get ss-stg-test-01 again.

6. zoneIndex=2 >= len(zones), so zoneIndex is reset to 0 and points to zone sh:200002; nodeArray[sh:200002].lastIndex=1, so we get ss-stg-ma-02.

After this synchronization, the scheduler's snapshot.nodeInfoList ends up as follows:

```
[
    ss-stg-test-01,
    ss-stg-ma-01,
    ss-stg-test-01,
    ss-stg-ma-02,
]
```

Where did ss-stg-ma-03 go? It was lost during the second round of synchronization.
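
To make the trace above concrete, the following self-contained Go program re-implements the v1.18 nodeTree iteration in simplified form and replays the two rounds of synchronization (the addZone/sync helpers and the main function are ours, for illustration only):

```go
package main

import "fmt"

// nodeArray holds the node names of one zone plus its zone-level cursor.
type nodeArray struct {
	nodes     []string
	lastIndex int // kept across synchronizations
}

func (na *nodeArray) next() (string, bool) {
	if na.lastIndex >= len(na.nodes) {
		return "", true // this zone is exhausted
	}
	name := na.nodes[na.lastIndex]
	na.lastIndex++
	return name, false
}

// nodeTree indexes node names by zone and iterates over them round-robin.
type nodeTree struct {
	tree      map[string]*nodeArray
	zones     []string
	zoneIndex int // kept across synchronizations
	numNodes  int
}

func (nt *nodeTree) addZone(zone string, nodes ...string) {
	nt.tree[zone] = &nodeArray{nodes: nodes}
	nt.zones = append(nt.zones, zone)
	nt.numNodes += len(nodes)
}

func (nt *nodeTree) resetExhausted() {
	for _, na := range nt.tree {
		na.lastIndex = 0
	}
	nt.zoneIndex = 0
}

func (nt *nodeTree) next() string {
	if len(nt.zones) == 0 {
		return ""
	}
	numExhaustedZones := 0
	for {
		if nt.zoneIndex >= len(nt.zones) {
			nt.zoneIndex = 0
		}
		zone := nt.zones[nt.zoneIndex]
		nt.zoneIndex++
		nodeName, exhausted := nt.tree[zone].next()
		if exhausted {
			numExhaustedZones++
			if numExhaustedZones >= len(nt.zones) {
				nt.resetExhausted() // all zones exhausted: reset and start over
			}
		} else {
			return nodeName
		}
	}
}

// sync mimics the snapshot update: one next() call per node in the tree.
func sync(nt *nodeTree) []string {
	list := make([]string, 0, nt.numNodes)
	for i := 0; i < nt.numNodes; i++ {
		list = append(list, nt.next())
	}
	return list
}

func main() {
	nt := &nodeTree{tree: map[string]*nodeArray{}}

	// First round of synchronization: only the 3 master nodes exist.
	nt.addZone("sh:200002", "ss-stg-ma-01", "ss-stg-ma-02", "ss-stg-ma-03")
	fmt.Println("first sync: ", sync(nt))
	// Output: [ss-stg-ma-01 ss-stg-ma-02 ss-stg-ma-03]

	// Second round: the worker node joins in a new zone; the cursors are NOT reset.
	nt.addZone("sh:200004", "ss-stg-test-01")
	fmt.Println("second sync:", sync(nt))
	// Output: [ss-stg-test-01 ss-stg-ma-01 ss-stg-test-01 ss-stg-ma-02]
	// ss-stg-ma-03 is missing and ss-stg-test-01 appears twice.
}
```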

## Solution

The root cause of the problem is that the zoneIndex and lastIndex (zone-level) cursors in the nodeTree data structure are kept across synchronizations, so the solution is to forcibly reset the cursors (back to 0) on every synchronization.

Related issue: github.com/kubernetes/…
Related PR (k8s v1.18): github.com/kubernetes/…
TKE fixed version: v1.18.4-TKE.5
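
A minimal sketch of that idea is shown below; it is illustrative only (see the linked issue and PR for the actual upstream change), and the helper name is ours:

```go
// resetCursors forces both cursors back to their starting positions. If this is
// called before every full synchronization from the nodeTree to the snapshot,
// each zone is iterated from the beginning and no node is skipped.
func (nt *nodeTree) resetCursors() {
	nt.zoneIndex = 0
	for _, na := range nt.tree {
		na.lastIndex = 0
	}
}
```

With such a reset in place, the second round of synchronization in the example above returns all four nodes (interleaved across the two zones) instead of losing ss-stg-ma-03.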