Author | Siyu Wang (Jiuzhu)  Source | Alibaba Cloud Native official account

Preface

OpenKruise is an open source cloud-native application automation suite from Alibaba Cloud, and it is currently a Sandbox project hosted by the Cloud Native Computing Foundation (CNCF). It distills Alibaba’s years of experience with containerization and cloud-native technology. It is a set of standard Kubernetes extension components used at large scale in Alibaba’s internal production environment, and it embodies technical ideas and best practices that stay close to upstream community standards while being adapted to Internet-scale scenarios.

OpenKruise released its latest version, v0.8.0 (ChangeLog), on March 4, 2021. One of the major changes is the addition of image preheating. This article is compiled from the talk “Implementing Large-scale Cluster Image Preheating through OpenKruise & Deployment and Release Acceleration Practice”, and introduces the requirements behind the image preheating capability we provide, how it is implemented, and where it is used.

Background: Why image preheating is necessary

The image is one of the major innovations Docker brought to the container field. Before Docker, Linux already provided cgroup isolation, and Alibaba had been gradually containerizing on LXC since 2011, but both lacked a way to package the runtime environment the way an image does. Despite the benefits of images, there is no denying that in practice we also face various problems caused by image pulling. One of the most common is how long a pull takes.

We have heard many expectations for containerization in the past, such as “extreme elasticity”, “scaling in seconds”, and “efficient release”. But measured against the standard Pod creation process in Kubernetes, there is still a gap between reality and what users expect (assuming the Pod contains a sidecar container and an app container):

Operations such as scheduling, allocating and mounting remote disks, and allocating network resources normally take little time in a small cluster. In a large cluster they need some optimization, but they remain manageable. Image pulling, however, is especially difficult in a large elastic cluster. Even with optimizations such as P2P distribution, pulling a large service image can take a long time, which falls short of users’ expectations for scale-out and release speed.

If we can pull the sidecar image and the base image of the service container onto the node in advance, Pod creation can be shortened considerably, and the image pull time can even be cut by more than 70%:

Kubernetes itself provides no image operation capability, and the ecosystem around Kubernetes has no mature product for large-scale image preheating. This is why we provide image preheating in OpenKruise. The capability is already widely deployed in Alibaba’s internal cloud-native environment, and our basic usage is briefly introduced in the practice sections below.

How OpenKruise implements image preheating

To understand how OpenKruise implements image preheating, start with its runtime architecture:

Starting from v0.8.0, an installed Kruise has two components in the kruise-system namespace: kruise-manager and kruise-daemon. The former is a centralized component deployed as a Deployment. One kruise-manager container (process) contains multiple controllers and webhooks. The latter is deployed to the nodes of the cluster as a DaemonSet. It talks to the CRI directly, bypassing the kubelet, to provide extended capabilities (such as pulling images and restarting containers).

Kruise therefore creates a custom resource, NodeImage, for each Node, with the same name as the Node. The NodeImage of a node specifies which images need to be preheated on that node, so the kruise-daemon on the node only has to pull images according to its NodeImage:

As shown in the figure above, in a NodeImage we can specify the name and tag of the image to pull, together with a pull policy: the timeout for a single pull, the number of retries on failure, the deadline of the task, the TTL, and so on.
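As a rough sketch, a NodeImage carrying a single preheating task might look like the following. The node name `node-xxx` and the image `ubuntu:latest` are placeholders, and the field layout is assumed from the `apps.kruise.io/v1alpha1` API, so check it against the CRD installed in your cluster.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: NodeImage
metadata:
  name: node-xxx                        # placeholder: same name as the Node
spec:
  images:
    ubuntu:                             # placeholder image name
      tags:
      - tag: latest                     # image tag to pull
        pullPolicy:
          timeoutSeconds: 600           # timeout for a single pull
          backoffLimit: 3               # number of retries after a failed pull
          activeDeadlineSeconds: 1200   # deadline for the whole pull task
          ttlSecondsAfterFinished: 300  # how long the finished task is kept
```

In normal use this resource is written by Kruise rather than edited by hand, as the next paragraphs explain.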

NodeImage gives us the most basic image preheating capability, but it is not enough for preheating at scale. In a cluster of 5,000 nodes, asking users to update NodeImage resources one by one is not user-friendly. Kruise therefore provides a higher-level custom resource, ImagePullJob:

As shown in the figure above, in an ImagePullJob the user can specify which image to preheat, on which range of nodes and in what batch size, together with the pull policy and the life cycle of the job. Once created, an ImagePullJob is picked up by the imagepulljob-controller in kruise-manager, which breaks it down and writes it into the NodeImages of all matching nodes, completing preheating at scale.
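For illustration, an ImagePullJob that targets a few nodes by name rather than by label could be written roughly like this. The job name, node names and image are placeholders. `selector.names` is assumed here to be the alternative to `matchLabels` for picking nodes, with only one of the two set at a time.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: named-nodes-job          # hypothetical job name
spec:
  image: xxx/some-image:latest   # placeholder image to preheat
  parallelism: 2                 # how many nodes pull at the same time
  selector:
    names:                       # assumed field: explicit list of node names
    - node-1
    - node-2
  completionPolicy:
    type: Always                 # run once, then finish
    activeDeadlineSeconds: 1200  # timeout for the whole job
    ttlSecondsAfterFinished: 300 # delete the job 5 minutes after it finishes
  pullPolicy:
    backoffLimit: 3              # retries per node
    timeoutSeconds: 300          # timeout for a single pull
```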

The overall process is as follows:

With image preheating available, how do we use it, and which scenarios call for it? Next we introduce several common ways image preheating is used at Alibaba.

Common ways to use image preheating

1. Base images – preheating at the cluster level

The most common preheating scenario is to keep some base images warm across the whole cluster:

apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: base-image-job
spec:
  image: xxx/base-image:latest
  parallelism: 10
  completionPolicy:
    type: Never
  pullPolicy:
    backoffLimit: 3
    timeoutSeconds: 300

The ImagePullJob above has the following characteristics:

  1. With no selector configured, the whole cluster is preheated by default
    1. All existing nodes are preheated uniformly
    2. Newly added (imported) nodes are also preheated immediately
  2. The completionPolicy type Never keeps the job running long-term
    1. Under the Never policy the job keeps preheating and never finishes (unless it is deleted)
    2. Under the Never policy, the ImagePullJob triggers a retry on all matching nodes every 24 hours. In other words, it makes sure once a day that the image exists

In our experience, a cluster has about 10 to 30 ImagePullJobs preheating base images, depending on the cluster and its service scenarios.

2. Sidecar images – preheating at the cluster level

We can also preheat some sidecar images, especially the basic sidecars that come with almost every business Pod:

apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: sidecar-image-job
spec:
  image: xxx/sidecar-image:latest
  parallelism: 20
  completionPolicy:
    type: Always
    activeDeadlineSeconds: 1800
    ttlSecondsAfterFinished: 300
  pullPolicy:
    backoffLimit: 3
    timeoutSeconds: 300

The ImagePullJob above has the following characteristics:

  1. With no selector configured, the whole cluster is preheated by default, as with base images
  2. The Always policy preheats once
    1. All nodes are preheated once
    2. The whole job times out after 30 minutes
    3. The job is automatically deleted 5 minutes after it completes

Of course, sidecar preheating can also use the Never policy, depending on the scenario. In our experience, especially when a sidecar is going through a version iteration and image upgrade, preheating the image at scale beforehand greatly speeds up subsequent Pod scale-out and release.

3. Specific service images – preheating at the resource pool level

In some multi-tenant Kubernetes clusters there may be several different service resource pools, and specific service images may need to be preheated per resource pool:

apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: serverless-job
spec:
  image: xxx/serverless-image:latest
  parallelism: 10
  completionPolicy:
    type: Never
  pullPolicy:
    backoffLimit: 3
    timeoutSeconds: 300
  selector:
    matchLabels:
      resource-pool: serverless

The ImagePullJob above has the following characteristics:

  1. The Never policy keeps preheating long-term
  2. The selector limits the preheating range to nodes matching the label resource-pool=serverless

Of course, resource pools are only an example here. You can define which nodes to preheat an image on according to your own scenario.

Version preview: combining in-place upgrade with preheating

Finally, what enhancements will the next version of OpenKruise (v0.9.0) build on top of the current image preheating?

If you are already familiar with OpenKruise, you know that one of the features we provide is “in-place upgrade”. It breaks with the delete-and-recreate model that Kubernetes native workloads use on release, and supports updating only the image of one container in the original Pod. If you are interested in how in-place upgrade works, see the article “Revealed: How to Implement in-place Upgrading for Kubernetes?”.

Because in-place upgrade avoids deleting and recreating the Pod, it already brings the following benefits:

  • Scheduling time is saved, because the Pod’s location and resources do not change
  • Network allocation time is saved, because the Pod keeps its original IP address
  • Time for allocating and mounting remote disks is saved, because the Pod keeps its original PV (already mounted on the node)
  • Most of the image pull time is saved, because the old image of the application is already on the node and only a few layers need to be downloaded when pulling the new version
  • While one container in the Pod is upgraded in place, the other containers keep running normally, and network and storage are unaffected

With “most of the image pull time” already saved, only some of the upper layers of the new image still need to be downloaded. Can this remaining pull time be eliminated entirely? The answer is yes.

As shown above, in the next version OpenKruise’s CloneSet will support automatic image preheating during a release. While the user is still upgrading the first batch of Pods as a canary, Kruise will preheat the new image on the nodes where the remaining Pods are located. That way the new image is already on the node when subsequent batches of Pods are upgraded in place, which saves the image pull time during the actual release.
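For context, a staged in-place release of the kind described here might be declared on a CloneSet roughly as follows. The names, replica count and image are hypothetical. The sketch shows only the in-place update strategy and the `partition` field that holds back later batches, and it deliberately contains no preheating-specific setting, since that belongs to the upcoming version.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: demo-app                 # hypothetical workload name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: app
        image: xxx/app-image:v2  # placeholder: the new image being released
  updateStrategy:
    type: InPlaceIfPossible      # update the container image without recreating the Pod
    partition: 9                 # keep 9 Pods on the old revision, so 1 Pod upgrades first
```

Lowering `partition` step by step releases the remaining batches. These are the Pods whose nodes would benefit from having the new image pulled ahead of time.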

Of course, this “release + preheat” mode only applies to OpenKruise’s in-place upgrade scenario. For native workloads such as Deployment, Pods are newly created on release and we cannot know in advance which nodes they will be scheduled to, so the image cannot be preheated ahead of time.

If you are interested in the OpenKruise project and have any topics you would like to discuss, please visit the OpenKruise website and GitHub, or search for DingTalk group number 23330762 to join the discussion group!