In a rapidly changing technology world, it is critical to provide users with continuous, rapid innovation. Kubernetes is a great engine for driving innovation in the cloud, on premises, and at the edge. As a result, Kubernetes and its entire ecosystem iterate rapidly, and keeping Kubernetes up to date, both for security and for access to new features, is critical to any deployment.

What is a zero downtime cluster upgrade

Rancher 2.4 was released last week, and with it we officially introduced the zero downtime cluster upgrade feature. In layman’s terms, this feature lets you change engines in mid-flight without any disruption: developers can continue to deploy applications to the cluster, and users can continue to use services without interruption. When combined with Rancher’s out-of-band (OOB) Kubernetes updates, cluster operators can safely apply maintenance and security updates within hours of a version’s release.

In previous versions of Rancher, RKE first upgraded the etcd nodes, taking care not to disrupt quorum. Rancher then upgraded all control plane nodes and all worker nodes at once, which led to temporary failures in API and workload availability. In addition, once the control plane was updated, Rancher reported the cluster status as “active,” leaving the operator potentially unaware that worker nodes were still being upgraded.

In Rancher 2.4, we optimized the entire upgrade process to ensure that CI/CD pipelines keep delivering and workloads keep serving traffic. Throughout the process, Rancher shows the cluster in an updating state, so the operator can see at a glance that something is happening in the cluster.

Rancher still starts with the etcd nodes, upgrading one node at a time and taking care not to disrupt quorum. As an additional precaution, Rancher takes snapshots of etcd and the Kubernetes configuration before upgrading. If you need to roll back, the entire cluster can be restored to its pre-upgrade state.
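For reference, RKE can also be configured to take recurring etcd snapshots on its own, independent of upgrades. The sketch below shows such a schedule in cluster.yml; the backup_config field names follow RKE’s documented etcd service options, and the interval and retention values here are arbitrary examples:

```yaml
services:
  etcd:
    backup_config:
      interval_hours: 12   # take a snapshot every 12 hours
      retention: 6         # keep only the 6 most recent snapshots
```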

As you know, deploying an application to a cluster requires the Kubernetes API to be available. In Rancher 2.4, Kubernetes control plane nodes are also upgraded one at a time: the first server is taken offline, upgraded, and returned to the cluster, and the next control plane node begins upgrading only after the previous node reports a healthy status. This behavior ensures that the API keeps responding to requests throughout the upgrade.
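In RKE’s cluster.yml, this one-at-a-time behavior corresponds to the control plane setting in the upgrade_strategy block. A minimal sketch, assuming the documented field name (1 is also the default):

```yaml
upgrade_strategy:
  max_unavailable_controlplane: 1   # take down at most one control plane node at a time
```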

Two major changes to worker node upgrades in Rancher 2.4

Most of the activity on a cluster happens on the worker nodes. In Rancher 2.4, there are two major changes to the way worker nodes are upgraded. The first is that the operator can set how many worker nodes are upgraded at once: for conservative rollouts or smaller clusters, the operator can upgrade just one node at a time, while operators of larger clusters can raise the setting to upgrade in larger batches. This option strikes a balance between risk and time and provides maximum flexibility. The second change is that the operator can choose to drain workloads from a worker node before it is upgraded. Draining nodes first minimizes the impact of pod restarts during minor Kubernetes version upgrades.
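Both options live in the same upgrade_strategy block of cluster.yml. A minimal sketch, assuming RKE’s documented field names; the specific values are illustrative:

```yaml
upgrade_strategy:
  max_unavailable_worker: 10%   # batch size: a percentage or an absolute node count
  drain: true                   # drain each worker node before upgrading it
  node_drain_input:
    ignore_daemonsets: true     # DaemonSet pods cannot be evicted
    grace_period: 60            # seconds each pod gets to shut down cleanly
    timeout: 120                # give up on draining a node after this many seconds
```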

Add-on services such as CoreDNS, the NGINX ingress controller, and the CNI drivers are upgraded along with the worker nodes. Rancher 2.4 exposes the upgrade strategy for each of these add-on deployments, which lets add-on upgrades use Kubernetes’ native availability constructs.
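For example, the ingress and DNS add-ons can each carry an update strategy in cluster.yml. A sketch under the assumption that these blocks mirror the native Kubernetes DaemonSet and Deployment rolling update fields; the percentages are arbitrary:

```yaml
ingress:
  provider: nginx
  update_strategy:        # the NGINX ingress controller runs as a DaemonSet
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # replace one ingress pod at a time
dns:
  provider: coredns
  update_strategy:        # CoreDNS runs as a Deployment
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 20%
      maxSurge: 15%
```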