Author: Wine Wishes (Wang Siyu)

OpenKruise, a cloud native application automation management suite and CNCF Sandbox project, has recently released v1.1.

OpenKruise [1] is a suite of enhanced capabilities for Kubernetes, focused on the deployment, upgrade, operations, and stability protection of cloud native applications. All features are extended through standard mechanisms such as CRDs, and can be used in any Kubernetes cluster at version 1.16 or above. Kruise can be installed with a single helm command, with no further configuration required.

Release overview

In v1.1, OpenKruise extends and enhances many existing features and optimizes performance in large-scale clusters. Below is a brief introduction to some of the features in v1.1.

It’s worth noting that OpenKruise v1.1 has upgraded its Kubernetes code dependency to v1.22, which means users can use new fields introduced up to v1.22 in the pod templates of workloads such as CloneSet. However, the Kubernetes cluster versions that an OpenKruise installation is compatible with remain >= v1.16.

In-place upgrade supports container order priority

In v1.0, released at the end of last year, OpenKruise introduced container startup sequence control [2], which supports defining different weight relationships among the containers in a Pod, and controls the startup order of the containers according to those weights when the Pod is created.

In v1.0, this capability only took effect during the Pod creation phase. After creation, if multiple containers in a Pod were upgraded in place at once, they would all be upgraded simultaneously.

Recently, the community has talked with companies such as LinkedIn to gather more input on user scenarios. In some scenarios, multiple containers in a Pod are associated: for example, when the business container is upgraded, the other containers in the Pod also need to update their configuration to match the new version; or multiple containers need to avoid being upgraded in parallel, so that a log-collection sidecar container does not lose logs from the business container.

Therefore, in v1.1, OpenKruise supports in-place upgrades ordered by container priority. In actual use, users do not need to configure any additional parameters. As long as the Pod was created with container startup priorities, not only will higher-priority containers start before lower-priority ones during the Pod creation phase, but also, if multiple containers are upgraded at once in a single in-place upgrade, higher-priority containers are upgraded first, and lower-priority containers are upgraded only after the higher-priority ones finish upgrading.

In-place upgrades here include both image upgrades and upgrades of env values that come from metadata. For details, see the introduction to in-place upgrade [3].

  • For Pods without container startup priorities, there is no ordering guarantee when multiple containers are upgraded in place.

  • For Pods with container startup priorities:

    • If the containers being upgraded in place have different priorities, the upgrade order is controlled by priority.

    • If the containers being upgraded in place have the same priority, there is no ordering guarantee among them.

For example, a CloneSet containing two containers in different startup sequences looks like this:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... config v1 ..."
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_CONTAINER_PRIORITY
          value: "10"
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
      - name: main
        image: main-image:v1
  updateStrategy:
    type: InPlaceIfPossible

When we update the CloneSet, modifying both the app-config annotation and the main container's image, both the sidecar and main containers need to be updated. Kruise will first in-place upgrade the Pod by recreating the sidecar container, so that it picks up the new env from the annotation.
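For instance, an update that triggers this two-batch in-place upgrade might look like the following sketch (the v2 values are illustrative, not from the original example):

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  ...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        app-config: "... config v2 ..."   # updated annotation, consumed by sidecar via env
    spec:
      containers:
      - name: sidecar
        env:
        - name: KRUISE_CONTAINER_PRIORITY
          value: "10"
        - name: APP_CONFIG
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['app-config']
      - name: main
        image: main-image:v2              # updated image
  updateStrategy:
    type: InPlaceIfPossible
```

Because the sidecar has the higher priority, Kruise upgrades it in the first batch and only then moves on to the main container.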

Next, we can see the apps.kruise.io/inplace-update-state annotation and its value in the upgraded Pod:

{
  "revision": "{CLONESET_NAME}-{HASH}",                             // the updated revision name
  "updateTimestamp": "2022-03-22T09:06:55Z",
  "nextContainerImages": {"main": "main-image:v2"},                 // the next containers' images to be updated
  "nextContainerRefMetadata": {...},                                // the next containers' env from labels/annotations
  "preCheckBeforeNext": {"containersRequiredReady": ["sidecar"]},   // pre-check conditions; the next batch of containers can only be upgraded in place after these are satisfied
  "containerBatchesRecord": [
    {"timestamp": "2022-03-22T09:06:55Z", "containers": ["sidecar"]}  // the first batch of containers that have been updated (this only means the containers' spec has been updated, e.g. the image in pod.spec.containers or labels/annotations, not that the real containers on the node have finished upgrading)
  ]
}

When the sidecar container has been successfully upgraded, Kruise will then upgrade the main container. You will eventually see the following apps.kruise.io/inplace-update-state annotation in the Pod:

{
  "revision": "{CLONESET_NAME}-{HASH}",
  "updateTimestamp": "2022-03-22T09:06:55Z",
  "lastContainerStatuses":{"main":{"imageID":"THE IMAGE ID OF OLD MAIN CONTAINER"}},
  "containerBatchesRecord":[
    {"timestamp":"2022-03-22T09:06:55Z","containers":["sidecar"]},
    {"timestamp":"2022-03-22T09:07:20Z","containers":["main"]}
  ]
}

Typically, users only need to pay attention to containerBatchesRecord to confirm that the containers are upgraded in batches. If a Pod gets stuck during an in-place upgrade, you can check the nextContainerImages/nextContainerRefMetadata fields, and check in preCheckBeforeNext whether the containers from the previous batch have been successfully upgraded and become ready.

StatefulSetAutoDeletePVC function

Starting with Kubernetes v1.23, the native StatefulSet added the StatefulSetAutoDeletePVC feature, which, based on a given policy, decides whether to retain or automatically delete the PVC objects created by the StatefulSet. See the reference document [4].

Therefore, in v1.1, Advanced StatefulSet has synchronized this feature from upstream, allowing users to specify the automatic cleanup policy through the spec.persistentVolumeClaimRetentionPolicy field. This requires you to enable the StatefulSetAutoDeletePVC feature-gate when installing or upgrading Kruise.

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  ...
  persistentVolumeClaimRetentionPolicy:  # optional
    whenDeleted: Retain | Delete
    whenScaled: Retain | Delete

The two policy fields are:

  • whenDeleted: the retain/delete policy for PVCs when the Advanced StatefulSet is deleted.
  • whenScaled: the retain/delete policy for the PVCs associated with the Pods removed when the Advanced StatefulSet scales down.

Each policy can be configured with the following two values:

  • Retain (default): behaves as StatefulSet always has; when a Pod is deleted, its associated PVCs are retained.
  • Delete: when a Pod is deleted, its associated PVC objects are automatically deleted.

In addition, there are a few points to note:

  1. The StatefulSetAutoDeletePVC feature only cleans up PVCs defined and created by volumeClaimTemplates, not PVCs that users created themselves and associated with the StatefulSet's Pods.
  2. The cleanup above only happens when the Advanced StatefulSet is deleted or proactively scaled down. If a Pod is, for example, evicted and recreated due to node failure, it will still reuse the existing PVC.
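As a concrete sketch, an Advanced StatefulSet whose templated PVCs are cleaned up on scale-down but kept when the workload is deleted might look like this (names and sizes are illustrative, not from the original example):

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
  name: sample
spec:
  serviceName: sample
  replicas: 3
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: main
        image: main-image:v1
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:            # only PVCs created from here are auto-cleaned
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete             # scale-down deletes the removed Pods' PVCs
    whenDeleted: Retain            # deleting the StatefulSet keeps all PVCs
```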

Advanced DaemonSet reconstructs and supports lifecycle hooks

The earlier implementation of Advanced DaemonSet differed quite a bit from the upstream controller. For example, for not-ready and unschedulable nodes, it required additional configuration fields to decide whether to handle them, which added cost and burden for users.

In v1.1, we did a small refactoring of Advanced DaemonSet to realign it with the upstream controller. As a result, all default behaviors of Advanced DaemonSet are basically the same as those of the native DaemonSet, and, just as with Advanced StatefulSet, users can convert a native DaemonSet to an Advanced DaemonSet simply by modifying the apiVersion.

We also added lifecycle hooks for Advanced DaemonSet, starting with support for preDelete hooks that allow users to execute some custom logic before the Daemon Pod is removed.

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  ...
  # define with label
  lifecycle:
    preDelete:
      labelsHandler:
        example.io/block-deleting: "true"

When the DaemonSet deletes a Pod (including scale-down and recreate upgrades):

  • If no lifecycle hook is defined, or the Pod does not match the preDelete conditions, the Pod is deleted directly.
  • Otherwise, the Pod is first set to the PreparingDelete state, and the controller waits for the user-defined controller to remove the label/finalizer associated with the Pod before actually deleting it.
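The user-side half of this flow can be sketched in Go. The snippet below is a minimal, self-contained illustration rather than Kruise's API: it assumes the Pod's lifecycle state is exposed via the label lifecycle.apps.kruise.io/state (an assumption here), and operates on a plain in-memory label map instead of a real client.

```go
package main

import "fmt"

// finishPreDeleteHook sketches the custom controller's side of a preDelete
// lifecycle hook: once Kruise has moved the Pod into the PreparingDelete
// state and our own cleanup is done, we remove the blocking label so that
// Kruise may proceed with the actual Pod deletion.
// NOTE: the state label name is an assumption for this sketch.
func finishPreDeleteHook(labels map[string]string, cleanupDone bool) bool {
	if labels["lifecycle.apps.kruise.io/state"] != "PreparingDelete" {
		return false // Kruise has not asked us to prepare for deletion yet
	}
	if !cleanupDone {
		return false // keep blocking until our custom logic finishes
	}
	delete(labels, "example.io/block-deleting") // unblock deletion
	return true
}

func main() {
	labels := map[string]string{
		"lifecycle.apps.kruise.io/state": "PreparingDelete",
		"example.io/block-deleting":      "true",
	}
	fmt.Println(finishPreDeleteHook(labels, true)) // hook finished
	_, stillBlocked := labels["example.io/block-deleting"]
	fmt.Println(stillBlocked) // blocking label removed
}
```

In a real controller this logic would run inside a Reconcile loop and patch the Pod through the Kubernetes API instead of mutating a local map.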

Disable DeepCopy performance optimization

By default, when we write an Operator/controller with controller-runtime and use its client (sigs.k8s.io/controller-runtime/pkg/client) to get/list typed objects, the objects are retrieved from and returned by the in-memory informer cache. Most people know this.

What many people don’t know is that, behind these get/list operations, controller-runtime makes a deep copy of every object retrieved from the informer before returning it.

The goal of this design is to prevent developers from accidentally tampering with the objects in the informer directly. Thanks to the deep copy, any changes a developer makes to the objects returned by get/list will not affect the objects in the informer, which are only synchronized via the ListWatch requests to kube-apiserver.
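To illustrate why the copy matters, here is a minimal, self-contained Go sketch using a toy struct rather than real API types (whose generated DeepCopy methods serve the same purpose): mutating a deep copy leaves the cached object untouched, whereas mutating the returned pointer directly would corrupt the shared cache.

```go
package main

import "fmt"

// CachedPod is a stand-in for an object held in an informer cache.
type CachedPod struct {
	Labels map[string]string
}

// DeepCopy returns an independent copy, mimicking the generated
// DeepCopy methods on Kubernetes API types.
func (p *CachedPod) DeepCopy() *CachedPod {
	out := &CachedPod{Labels: map[string]string{}}
	for k, v := range p.Labels {
		out.Labels[k] = v
	}
	return out
}

func main() {
	cache := &CachedPod{Labels: map[string]string{"app": "demo"}}

	// With deep copy disabled, a get/list would hand back this same pointer,
	// so a careless mutation would change the shared cache. Copying first
	// keeps the cache intact:
	safe := cache.DeepCopy()
	safe.Labels["app"] = "changed"

	fmt.Println(cache.Labels["app"]) // cache untouched: "demo"
	fmt.Println(safe.Labels["app"])  // only the copy changed: "changed"
}
```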

However, in some large-scale clusters, all of OpenKruise's controllers run at the same time, each with several workers performing Reconcile, which can produce a huge number of deep copy operations. For example, a cluster may contain many CloneSet objects, some of which manage very large numbers of Pods; during Reconcile, each worker lists all the Pod objects under a CloneSet, and multiple workers operating in parallel can cause a surge in kruise-manager's CPU and memory pressure, and even a risk of OOM when the memory quota is insufficient.

In upstream controller-runtime, I submitted the DisableDeepCopy feature [5] last year; it was merged and is included in controller-runtime v0.10 and above. It allows developers to specify that, for certain resource types, get/list queries skip the deep copy and return pointers to the objects in the informer directly.

For example, when initializing the Manager in main.go, you can add a cache option to configure resource types such as Pod not to be deep copied:

mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    ...
    NewCache: cache.BuilderWithOptions(cache.Options{
        UnsafeDisableDeepCopyByObject: map[client.Object]bool{
            &v1.Pod{}: true,
        },
    }),
})

However, in Kruise v1.1, instead of using this capability directly, we re-encapsulated it in the DelegatingClient [6]: wherever a list query is performed, the developer can pass the DisableDeepCopy ListOption to disable deep copy for that single list operation.

if err := r.List(context.TODO(), &podList, client.InNamespace("default"), utilclient.DisableDeepCopy); err != nil {
    return nil, nil, err
}

The advantage is that this is more flexible to use, and it avoids the risk that community contributors, unaware that deep copy has been disabled for an entire resource type, might incorrectly modify objects in the informer.

Other changes

You can view more changes and their authors and commit records on the Github Release [7] page.

Community participation

You are welcome to join the OpenKruise community via Github/Slack/DingTalk/WeChat. Do you have something you want to discuss with our community? Share your voice at our biweekly community meetings [8], or join the discussion through the following channels:

  • Join the community Slack channel [9] (English)
  • Join the community DingTalk group: search group number 23330762 (Chinese)
  • Join the community WeChat group (new): add the user OpenKruise and let the bot pull you into the group (Chinese)

Related links

[1]OpenKruise

​​https://openkruise.io/​​

[2] Container start sequence control

​​https://openkruise.io/zh/docs/user-manuals/containerlaunchpriority/​​

[3] Introduction to in-place upgrade

​​https://openkruise.io/zh/docs/core-concepts/inplace-update​​

[4] StatefulSet PersistentVolumeClaim retention

​​https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#persistentvolumeclaim-retention​​

[5] DisableDeepCopy function

​​https://github.com/kubernetes-sigs/controller-runtime/pull/1274 ​​

[6]Delegating Client

​​https://github.com/openkruise/kruise/blob/master/pkg/util/client/delegating_client.go​​

[7]Github release

​​https://github.com/openkruise/kruise/releases​​

[8] Community biweekly meeting

​​https://shimo.im/docs/gXqmeQOYBehZ4vqo​​

[9]Slack channel

​​https://kubernetes.slack.com/?redir=%2Farchives%2Fopenkruise​​

Click here to view the OpenKruise project official homepage and documentation!