Author | high phase (chan Lin wong)

Cluster upgrade is one of the most important parts of the Kubernetes cluster lifecycle. To explain what a cluster upgrade involves, this article first covers why upgrades are necessary and why they are difficult, then walks step by step through the pre-checks that must be done before upgrading. Next, it describes two common upgrade methods. Finally, it explains the three steps of a cluster upgrade to help readers move from theory to practice.

The necessity & difficulty of upgrading

Thanks to an active open source community, Kubernetes iterates quickly and currently releases a new version every quarter. Each new version brings new features, more comprehensive security hardening, and bug fixes. The community recently completed the official release of version 1.19.

With such a fast-moving open source project, keeping up with the community matters even more, and a cluster upgrade capability is the best way to do so. The need for cluster upgrades can be seen from two perspectives:

  • For users of a Kubernetes cluster: a newer version of Kubernetes means newer features, more comprehensive security patches, and many bug fixes. Cluster upgrades let us enjoy the benefits of the active Kubernetes open source community.

  • For operators of Kubernetes clusters: the cluster upgrade function brings managed clusters up to the same version and reduces version fragmentation, which lowers the management and maintenance costs that fragmentation causes.

Having covered why cluster upgrades are necessary, let’s look at why they are difficult.

Currently, most Kubernetes users are very conservative about cluster upgrades, fearing that something unexpected will happen during the process. Some users describe a cluster upgrade as “changing the engine of a plane in flight.” So what are the main reasons for this conservative attitude? I think they are the following:

  • After running for a long time, a Kubernetes cluster has accumulated complex runtime state;

  • Operators configure each Kubernetes cluster differently according to the business it serves, so every cluster ends up with its own configuration: “a thousand clusters, a thousand faces”;

  • A Kubernetes cluster running on the cloud uses many underlying cloud resources, and each of them brings its own uncertainty.

Because of “a thousand clusters, a thousand faces,” a single set of upgrade logic has to complete the upgrade correctly across all of these different situations. That is what makes cluster upgrades difficult.

Upgrade pre-check

As we said earlier, upgrading a Kubernetes cluster that is in service is like “changing the engine of an airplane in flight.” Because cluster upgrades face so many difficulties, many cluster maintainers are nervous about them.

A detailed pre-check before the upgrade removes much of this uncertainty, and it lets us address the difficulties listed above one by one. The upgrade pre-check can be divided into three parts:

  • Core component health check
  • Node configuration check
  • Cloud resource check

1. Check the health status of core components

Before looking at the core component health check, consider why cluster health matters for an upgrade. An unhealthy cluster is likely to run into all kinds of abnormal problems during the upgrade, and even if the upgrade completes, problems will gradually surface in later use.

Some might say: my cluster looked healthy, but something went wrong after I upgraded. In general, this happens because the problem already existed before the upgrade and only surfaced after it.

Now that we understand why core component health checks are needed, let’s look at which components need to be checked:

  • Network components: ensure that the network component version is compatible with the target Kubernetes version;
  • APIService: ensure that all APIService objects in the cluster are available;
  • Nodes: ensure that all nodes are healthy.
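
The node and APIService checks above can be sketched roughly as follows. This is a minimal illustration, not the tooling the article describes: it assumes the objects have already been fetched (for example, the `items` of `kubectl get nodes -o json` and `kubectl get apiservices -o json`) and only inspects their `status.conditions` lists. The function names and sample objects are hypothetical, and the network component compatibility check is not covered.

```python
def condition_is_true(obj, condition_type):
    """Return True if obj.status.conditions holds condition_type with status "True"."""
    conditions = obj.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )


def precheck_core_components(nodes, apiservices):
    """Collect human-readable problems; an empty list means the check passed."""
    problems = []
    for node in nodes:
        if not condition_is_true(node, "Ready"):
            problems.append(f"node {node['metadata']['name']} is not Ready")
    for svc in apiservices:
        if not condition_is_true(svc, "Available"):
            problems.append(f"apiservice {svc['metadata']['name']} is not Available")
    return problems


# Hand-written sample objects in the shape of the Kubernetes API (illustrative only).
nodes = [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]
apiservices = [
    {"metadata": {"name": "v1beta1.metrics.k8s.io"},
     "status": {"conditions": [{"type": "Available", "status": "False"}]}},
]

for problem in precheck_core_components(nodes, apiservices):
    print(problem)
```

Returning a list of problems instead of stopping at the first one lets an operator see everything that needs fixing before the upgrade starts.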

2. Check the node configuration

As the underlying compute resources that host Kubernetes, nodes not only run important system processes such as kubelet and Docker, but also act as the interface between the cluster and the underlying hardware.

Ensuring that nodes are healthy and correctly configured is an important part of ensuring the health of the whole cluster. The required checks are as follows:

  • Operating system configuration: check that basic system components (such as yum, systemd, and NTP) and kernel parameters are properly configured;
  • kubelet: ensure that the kubelet process is healthy and correctly configured;
  • Docker: ensure that the Docker process is healthy and correctly configured.
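
As one illustration of the kernel parameter check, the sketch below compares `key = value` text, the format printed by `sysctl -a`, against an expected baseline. The two parameters in `EXPECTED_SYSCTLS` are an assumed baseline chosen for the example, not a list taken from the article, and the function names are hypothetical.

```python
# Assumed baseline for illustration; adjust to your own cluster's requirements.
EXPECTED_SYSCTLS = {
    "net.ipv4.ip_forward": "1",
    "net.bridge.bridge-nf-call-iptables": "1",
}


def parse_sysctl_output(text):
    """Parse lines of the form 'key = value' into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            values[key.strip()] = value.strip()
    return values


def check_kernel_parameters(sysctl_text, expected=EXPECTED_SYSCTLS):
    """Return a list of mismatches between expected and actual kernel parameters."""
    actual = parse_sysctl_output(sysctl_text)
    mismatches = []
    for key, want in expected.items():
        got = actual.get(key)  # None when the parameter is missing entirely
        if got != want:
            mismatches.append(f"{key}: expected {want}, got {got}")
    return mismatches


# Sample text standing in for real `sysctl -a` output.
sample = """net.ipv4.ip_forward = 0
net.bridge.bridge-nf-call-iptables = 1
"""
print(check_kernel_parameters(sample))
```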

3. Check cloud resources

A Kubernetes cluster running on the cloud depends on many cloud resources. If any of them is unhealthy or incorrectly configured, the normal operation of the whole cluster is affected. We pre-check the status and configuration of the following cloud resources:

  • SLB used by the apiserver: check instance health and port configuration (forwarding configuration, access control configuration, and so on);
  • VPCs and vSwitches used by the cluster: check the health status of the instances;
  • ECS instances in the cluster: check their health and network configuration.

Two common upgrade methods

In software upgrades generally, there are two mainstream methods: in-place upgrade and replacement upgrade. Both also apply to Kubernetes clusters. They take different approaches, and each has its own advantages and disadvantages. Let’s look at them one by one.

1. In-place upgrade

An in-place upgrade is a fine-grained upgrade that makes relatively small changes to the cluster. When worker nodes are upgraded, this mode replaces the Kubernetes components (mainly kubelet and its related components) in place on the ECS instance to complete the upgrade of the whole cluster. Alibaba Cloud Container Service for Kubernetes provides cluster upgrades based on this approach.

Take upgrading Kubernetes from 1.14 to 1.16 as an example. We upgrade kubelet and its configuration on ECS A from 1.14 to 1.16; once the component upgrade on ECS A is complete, that node has been upgraded to 1.16. We then do the same for ECS B, which completes the cluster upgrade.

During this process, the nodes remain running and the ECS configuration is not modified. As shown in the figure:

1) Advantages

  • An in-place upgrade upgrades the node version by replacing the kubelet components in place, so Pods on the node are not rebuilt because of the cluster upgrade and business continuity is preserved;

  • This mode does not modify or replace the underlying ECS instance, so services that depend on being scheduled to specific nodes keep running normally, which is friendlier to ECS customers.

2) Disadvantages

  • An in-place upgrade has to perform a series of operations on a node to complete its upgrade. The process is therefore not atomic: it may fail at an intermediate step, causing the node upgrade to fail;

  • An in-place upgrade also requires a certain amount of resources to be reserved on the node. The upgrade program can only complete the upgrade on an ECS instance when resources are sufficient.
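
The non-atomic nature of an in-place upgrade can be illustrated with a small sketch. The step names below are hypothetical placeholders, not the real sequence used by any upgrade program. The point is only that a failure partway through leaves some steps applied and others not, so the runner has to report exactly where it stopped.

```python
def run_in_place_upgrade(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            # Earlier steps stay applied: the node is now in an intermediate state.
            return completed, name
        completed.append(name)
    return completed, None


def ok():
    """Stand-in for a step that succeeds."""


def fail():
    """Stand-in for a step that fails."""
    raise RuntimeError("simulated failure")


# Hypothetical step names; a real upgrade program defines its own sequence.
steps = [
    ("replace kubelet binary", ok),
    ("update kubelet configuration", fail),
    ("restart kubelet", ok),
]
print(run_in_place_upgrade(steps))
```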

2. Replacement upgrade

A replacement upgrade is also called a rotation upgrade. Compared with an in-place upgrade, it is coarser-grained and more atomic: it upgrades the Kubernetes cluster by removing old-version nodes one by one and replacing them with new-version nodes.

Again, take upgrading Kubernetes from 1.14 to 1.16 as an example. In a replacement upgrade, the 1.14 nodes are drained and removed from the cluster one by one, and 1.16 nodes are added directly: ECS A (1.14) is removed and ECS C (1.16) is added, then ECS B is removed and ECS D is added.
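
The rotation described above can be written down as an ordered plan. The sketch below only produces the list of operations in the order the example gives them (drain and remove an old node, then add its replacement). It does not call any cluster or cloud API, and the function name is hypothetical.

```python
def plan_replacement_upgrade(old_nodes, new_nodes):
    """Pair each old node with its replacement and list the operations in order."""
    if len(old_nodes) != len(new_nodes):
        raise ValueError("each old node needs exactly one replacement")
    plan = []
    for old, new in zip(old_nodes, new_nodes):
        plan.append(("drain", old))   # evict Pods from the old-version node
        plan.append(("remove", old))  # take the old node out of the cluster
        plan.append(("add", new))     # join the new-version node
    return plan


for action, node in plan_replacement_upgrade(["ECS A", "ECS B"], ["ECS C", "ECS D"]):
    print(action, node)
```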

This completes the rotation of all nodes, and the cluster is upgraded to 1.16. As shown in the figure:

1) Advantages

  • A replacement upgrade completes the cluster upgrade by replacing old-version nodes with new-version nodes. The process is more atomic than an in-place upgrade and has fewer complex intermediate states, so fewer pre-checks are needed before upgrading and fewer odd problems arise during the upgrade.

2) Disadvantages

  • A replacement upgrade replaces and resets every node in the cluster, and every node is drained. This causes a large number of Pods to be migrated and rebuilt, which is unfriendly to services with low tolerance for Pod rebuilds, to applications with only one replica, and to StatefulSet applications; they may become unavailable or fail as a result;

  • Every node is reset, so data stored on a node’s local disk is lost;

  • This method may change host IP addresses and cause related issues, and it is unfriendly to users on yearly or monthly subscription billing.

The three steps of cluster upgrade

A Kubernetes cluster consists mainly of the master, which manages the cluster; the workers, which run workloads; and a number of functional system components. Upgrading a Kubernetes cluster means upgrading these three parts in turn.

The three steps of a cluster upgrade are therefore:

  • Rolling upgrade of the master
  • Upgrade of workers in batches
  • Upgrade of system components (mainly core components such as CoreDNS and kube-proxy)

Let’s go through the three steps in detail.

1. Rolling upgrade of the master

As the brain of the cluster, the master is responsible for user interaction, task scheduling, and various other processing tasks. The master can be deployed in several ways, including as static Pods, as local processes, and as Kubernetes on Kubernetes.

Whichever deployment mode is used, upgrading the master mainly means upgrading three components:

  • Upgrade kube-apiserver
  • Upgrade kube-controller-manager
  • Upgrade kube-scheduler

To keep the Kubernetes apiserver available without interruption, the master must run at least two kube-apiserver instances. A rolling upgrade can then be performed without any apiserver downtime.
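
The reason for requiring at least two kube-apiserver instances can be seen in a minimal sketch of a rolling plan. It upgrades one instance at a time and records which instances keep serving during each step. The instance names and the function are hypothetical, and no real upgrade is performed.

```python
def plan_rolling_master_upgrade(apiservers):
    """Return one step per instance: (instance_to_upgrade, instances_still_serving)."""
    if len(apiservers) < 2:
        # With a single instance, upgrading it would leave nothing to serve requests.
        raise ValueError("a rolling upgrade needs at least two kube-apiserver instances")
    steps = []
    for current in apiservers:
        still_serving = [name for name in apiservers if name != current]
        steps.append((current, still_serving))
    return steps


for upgrading, serving in plan_rolling_master_upgrade(["master-1", "master-2", "master-3"]):
    print(f"upgrade {upgrading}; traffic served by {', '.join(serving)}")
```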

2. Upgrade workers in batches

After the master upgrade is complete, we can start upgrading the workers. A worker upgrade mainly upgrades kubelet and the components it depends on (such as CNI) on each node. To avoid restarting a large number of kubelets in the cluster at the same time, we upgrade worker nodes in batches.
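
Batching itself is simple to sketch. The example below splits a list of worker names into fixed-size batches; the batch size of 2 and the node names are arbitrary choices for illustration, and how a real upgrade program sizes its batches is not specified here.

```python
def split_into_batches(nodes, batch_size):
    """Split nodes into consecutive batches of at most batch_size each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


workers = [f"worker-{i}" for i in range(1, 6)]
for number, batch in enumerate(split_into_batches(workers, 2), start=1):
    print(f"batch {number}: {batch}")
```

A smaller batch size limits how many kubelets restart at once, at the cost of a longer overall upgrade.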

Note that the master must be upgraded before the workers. A higher-version kubelet is likely to hit incompatibilities when connecting to a lower-version master, leaving the node NotReady. For a lower-version kubelet connecting to a higher-version apiserver, the open source community guarantees compatibility for a kubelet that is up to two minor versions behind: a 1.14 kubelet can connect to a 1.16 apiserver without compatibility issues.
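
The ordering rule can be expressed as a small version check. The sketch assumes a kubelet may lag the apiserver by at most two minor versions and must never be newer, which is consistent with the 1.14 and 1.16 example. It compares only minor version numbers, so it also assumes both components share the same major version. The function names are hypothetical.

```python
MAX_KUBELET_SKEW = 2  # assumed: minor versions a kubelet may lag behind the apiserver


def minor_version(version):
    """Extract the minor number from a version such as '1.16' or 'v1.16.3'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def kubelet_compatible(kubelet_version, apiserver_version):
    """True if the kubelet is not newer than the apiserver and within the allowed skew."""
    skew = minor_version(apiserver_version) - minor_version(kubelet_version)
    return 0 <= skew <= MAX_KUBELET_SKEW


print(kubelet_compatible("1.14", "1.16"))  # kubelet two minor versions behind
print(kubelet_compatible("1.16", "1.14"))  # kubelet ahead of the apiserver
```

Upgrading the master first keeps every worker on the older side of this check throughout the upgrade.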

3. Upgrade core system components

To keep all components in the cluster compatible, we also upgrade the core system components as part of the cluster upgrade, including:

  • DNS component: the CoreDNS version is upgraded to the version that corresponds to the cluster version, according to the community version compatibility matrix;

Community version compatibility matrix: github.com/coredns/dep…

  • Network forwarding component: kube-proxy versions track Kubernetes versions, so kube-proxy is upgraded to the same version as Kubernetes.
