kube-controller-manager

The kube-controller-manager periodically checks node status. When a node stays NotReady longer than pod-eviction-timeout, all Pods on that node are evicted to other nodes. The actual eviction speed is also affected by the eviction-rate parameters and the cluster size.

  • --pod-eviction-timeout: how long to wait after a node goes down before evicting its Pods. The default is 5m.
  • --node-eviction-rate: the eviction rate, implemented with a token-bucket flow-control algorithm. The default is 0.1, i.e. 0.1 nodes per second. Note that this is a node rate, not a Pod rate; it is equivalent to draining one node every 10 seconds.
  • --secondary-node-eviction-rate: the secondary eviction rate, used to slow down evictions when too many nodes in the cluster are down. The default is 0.01.
  • --unhealthy-zone-threshold: the threshold at which a zone is considered unhealthy; it determines when the secondary eviction rate takes effect. The default is 0.55.
  • --large-cluster-size-threshold: the large-cluster threshold. A zone with more nodes than this value (default 50) is considered a large cluster. When more than 55% of the nodes in a large cluster are down, the eviction rate is reduced to 0.01; in a small cluster, the eviction rate is reduced directly to 0.
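As a quick sanity check on the default rate, the time to drain N failed nodes at --node-eviction-rate=0.1 is N / 0.1 seconds; a minimal shell sketch:

```shell
# At the default --node-eviction-rate of 0.1 nodes/second, draining
# 6 failed nodes takes 6 / 0.1 = 60 seconds.
nodes=6
rate=0.1
awk -v n="$nodes" -v r="$rate" 'BEGIN { print n / r " seconds" }'
# → 60 seconds
```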

Kubelet

Kubelet periodically checks the node's memory and disk resources. When available resources fall below the eviction thresholds, Pods are evicted in priority order. Two filesystems are monitored:

  1. Nodefs: the filesystem kubelet uses for volumes and daemon logs.
  2. Imagefs: the filesystem that stores container images and the writable layers of running containers.
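To see whether kubelet has currently flagged one of these pressures, the node conditions (MemoryPressure, DiskPressure) can be inspected; a sketch assuming a working cluster, with <node-name> as a placeholder:

```shell
# List each node condition as type=status (e.g. MemoryPressure=False)
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```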

Soft eviction

  • --eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that trigger Pod eviction only after the condition has persisted longer than the corresponding grace period.
  • --eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that define how long a soft threshold must remain exceeded before eviction is triggered.
  • --eviction-max-pod-grace-period: the maximum grace period, in seconds, granted to a Pod being evicted after a soft threshold has been reached.
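A sketch of how the three soft-eviction flags might be combined on the kubelet command line (the threshold values are the examples from the text, not recommendations):

```shell
kubelet \
  --eviction-soft="memory.available<1.5Gi" \
  --eviction-soft-grace-period="memory.available=1m30s" \
  --eviction-max-pod-grace-period=60
```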

Hard eviction

--eviction-hard: a set of eviction thresholds (e.g. memory.available<1Gi) that trigger Pod eviction as soon as they are reached, with no grace period.

--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi

Container inspection interval

--housekeeping-interval

Kubelet periodically reports node status

--node-status-update-frequency
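The two intervals above are set as kubelet flags; a sketch using the upstream defaults (10s for both):

```shell
kubelet \
  --housekeeping-interval=10s \
  --node-status-update-frequency=10s
```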

Node state fluctuation

If a node's resource usage fluctuates around a soft threshold without ever exceeding the grace period, the node's pressure condition will keep flapping between true and false, which ultimately disturbs scheduling decisions.

To prevent this, the following flag tells kubelet how long it must wait before leaving a pressure state.

--eviction-pressure-transition-period defines the time to wait before transitioning out of a pressure state.

Before setting the pressure condition back to false, kubelet verifies that the node has stayed below the eviction threshold for this entire period.
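A sketch of the flag (5m0s is the upstream default):

```shell
# Require 5 minutes below the threshold before clearing the pressure condition
kubelet --eviction-pressure-transition-period=5m0s
```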

--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"

To mitigate repeated small evictions, kubelet can define a minimum reclaim amount (--eviction-minimum-reclaim) for each resource. Once kubelet detects resource pressure, it reclaims at least this amount to bring resource consumption back into the desired range.

Evicting user Pods

Kubelet decides which Pods to evict based on the following criteria:

  • The Pod's quality of service (QoS) class
  • The Pod's consumption of the starved resource relative to its scheduling requests

Next, the Pod is expelled in the following order:

  • BestEffort: Pods that consume the most of the starved resource are evicted first.
  • Burstable: Pods whose usage of the starved resource most exceeds their requests are evicted first. If no Pod exceeds its requests, the policy targets the Pod with the largest consumption of the starved resource.
  • Guaranteed: Pods whose usage of the starved resource most exceeds their requests are evicted first. If no Pod exceeds its requests, the policy targets the Pod with the largest consumption of the starved resource.
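To check which QoS class a running Pod was assigned (assumes a working cluster; <pod-name> is a placeholder):

```shell
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
# Prints one of: Guaranteed, Burstable, BestEffort
```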

Reserving resources for the system and the kubelet service

--system-reserved=memory=1.5Gi,ephemeral-storage=1Gi
--system-reserved-cgroup=/system.slice
--enforce-node-allocatable=pods,kube-reserved,system-reserved
--kube-reserved=cpu=1000m,memory=8Gi,ephemeral-storage=1Gi
--kube-reserved-cgroup=/kubelet.service