Brief introduction: OpenKruise is an open source suite from Alibaba Cloud for automated management of cloud-native applications, and a Sandbox project currently hosted by the Cloud Native Computing Foundation (CNCF). It grew out of Alibaba’s years of accumulated experience with containerization and cloud-native technology. It is a set of standard Kubernetes extension components used at large scale in Alibaba’s internal production environment, and it embodies technical concepts and best practices that follow upstream community standards and adapt to Internet-scale scenarios.

Author | si-yu wang (the wish)
Photo credit @ si-yu wang (the wish)


OpenKruise released its latest version, v0.9.0 (ChangeLog), on May 20, 2021. It includes Pod container restart, cascading deletion protection for resources, and other important features. Here is an overview of the new version.

Pod container restart/rebuild

“Restart” is a very simple requirement. Even in day-to-day operations and maintenance, it is a common recovery technique. However, native Kubernetes offers no way to operate at container granularity: the Pod is the minimum unit of operation and supports only two operations, create and delete.

Some readers may ask why, in the cloud-native era, users still care about an operation like restarting a container. In an ideal serverless model, shouldn’t the business only need to care about the service itself?

The answer lies in the difference between cloud-native architecture and the traditional infrastructure of the past. In the era of physical and virtual machines, multiple application instances were often deployed on one machine, and the machine and the applications had different life cycles. Restarting an application instance might then be just a systemctl or supervisor command, not a reboot of the whole machine. In the container and cloud-native model, however, the application life cycle is bound to the Pod’s containers: normally a container runs only one application process, and a Pod serves only one application instance.

Because of these limitations, native Kubernetes currently has no API that offers container (application) restart to upper-layer services. Kruise v0.9.0 provides container restart at the granularity of a single Pod, compatible with standard Kubernetes clusters of version 1.16 and above. After installing or upgrading Kruise, you only need to create a ContainerRecreateRequest (CRR) object to request a restart. The simplest YAML looks like this:

apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar

Here, the namespace must be the same as that of the Pod to be operated on, and the name can be anything you choose. In the spec, podName is the name of the Pod, and the containers list names one or more containers in that Pod to restart. In addition to these required fields, CRR also provides a variety of optional restart policies:

spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
  activeDeadlineSeconds: 300
  ttlSecondsAfterFinished: 1800

  • failurePolicy: Fail or Ignore, default Fail. Fail means the CRR ends as soon as stopping or rebuilding any container fails.
  • orderedRecreate: Default false. True means that if the list has more than one container, CRR waits until the previous container has been rebuilt before rebuilding the next one.
  • terminationGracePeriodSeconds: How long to wait for the container to exit gracefully. If not set, the time defined in the Pod is used.
  • unreadyGracePeriodSeconds: Before rebuilding, set the Pod to not ready, wait for this period, and then start rebuilding.
    • Note: This feature relies on the KruisePodReadinessGate feature-gate being enabled, which injects a readinessGate into every Pod at creation time. Otherwise, by default only Pods created by Kruise workloads are injected with the readinessGate, which means only those Pods can use unreadyGracePeriodSeconds during a CRR rebuild.
  • minStartedSeconds: A rebuilt container is considered successfully rebuilt only after it has been running for at least this long.
  • activeDeadlineSeconds: If the CRR runs longer than this, it is marked as ended directly (containers that have not finished are marked as failed).
  • ttlSecondsAfterFinished: After the CRR ends, it is automatically deleted after this period.
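
As a rough illustration of how the fields above fit together, the following Python sketch assembles a CRR manifest as a plain dictionary and prints it as JSON (which kubectl also accepts). The helper name build_crr is hypothetical, and placing the policy fields under spec.strategy with apiVersion apps.kruise.io/v1alpha1 is an assumption based on the Kruise API; treat it as a sketch, not an official client.

```python
import json


def build_crr(namespace, name, pod_name, containers, strategy=None,
              active_deadline_seconds=None, ttl_seconds_after_finished=None):
    """Hypothetical helper: return a ContainerRecreateRequest manifest as a dict."""
    spec = {
        "podName": pod_name,
        # One entry per container that should be restarted.
        "containers": [{"name": c} for c in containers],
    }
    if strategy:
        # Assumption: restart policies are nested under spec.strategy.
        spec["strategy"] = dict(strategy)
    if active_deadline_seconds is not None:
        spec["activeDeadlineSeconds"] = active_deadline_seconds
    if ttl_seconds_after_finished is not None:
        spec["ttlSecondsAfterFinished"] = ttl_seconds_after_finished
    return {
        "apiVersion": "apps.kruise.io/v1alpha1",
        "kind": "ContainerRecreateRequest",
        # The CRR must live in the same namespace as the target Pod.
        "metadata": {"namespace": namespace, "name": name},
        "spec": spec,
    }


crr = build_crr(
    namespace="pod-namespace",
    name="xxx",
    pod_name="pod-name",
    containers=["app", "sidecar"],
    strategy={"failurePolicy": "Fail", "orderedRecreate": False},
    active_deadline_seconds=300,
    ttl_seconds_after_finished=1800,
)
print(json.dumps(crr, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl, assuming Kruise v0.9.0 or later is installed in the cluster.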

How it works: after the user creates a CRR, it is first processed centrally by kruise-manager, and is then received and executed by the kruise-daemon on the node where the Pod runs. The process is as follows:

  1. If the Pod’s container defines preStop, kruise-daemon first uses the CRI runtime to exec the preStop hook inside the container.
  2. If there is no preStop, or once it has finished, kruise-daemon calls the CRI interface to stop the container.
  3. When kubelet detects that the container has exited, it creates a new container with an incremented “serial number” and starts it (running postStart if defined).
  4. kruise-daemon detects that the new container has started successfully and reports that the CRR restart is complete.

The container “serial number” above corresponds to the restartCount that kubelet reports in the Pod status. Therefore, you will see the Pod’s restartCount increase after the container is restarted. Also, because the container has been rebuilt, files temporarily written to the old container’s rootfs are lost, but data in mounted volumes is preserved.

Cascading delete protection

Kubernetes’ final-state-oriented automation is a “double-edged sword”: it brings declarative deployment capabilities to applications, but it can also amplify a mistaken operation until it becomes the final state. For example, under its “cascading deletion” mechanism, deleting a parent resource in the normal (non-orphan) way also deletes all of its child resources:

  1. Delete a CRD, and all of its corresponding CRs are removed.
  2. Delete a namespace, and all resources in it, including Pods, are deleted.
  3. Delete a workload (Deployment/StatefulSet/…), and all of its Pods are deleted.

We have heard many complaints from K8s users and developers in the community about such “cascading deletes”. For any enterprise, an accidental deletion of this scale in production is unbearable, and Alibaba is no exception. Therefore, in Kruise v0.9.0, we are contributing the cascading-deletion protection that Alibaba built internally to the community, hoping it brings stability to more users. To use this feature in the current version, you need to explicitly enable the ResourcesDeletionProtection feature-gate when installing or upgrading Kruise. For a resource object that needs to be protected, add the label policy.kruise.io/delete-protection, which takes one of two values:

  • Always: indicates that this object is not allowed to be deleted unless the label above is removed.
  • Cascading: This object is prohibited from being deleted if subordinate resources are still available.
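
The two label values can be read as a simple admission rule. The Python sketch below is only an illustration of that rule, not Kruise’s actual webhook code: the function name deletion_allowed and the active_children parameter (the number of subordinate resources that still exist) are hypothetical, and the label key policy.kruise.io/delete-protection is assumed from the Kruise documentation.

```python
# Assumed label key for deletion protection.
PROTECTION_LABEL = "policy.kruise.io/delete-protection"


def deletion_allowed(labels, active_children):
    """Return True if a delete request for the object should be admitted.

    labels: the object's labels as a dict.
    active_children: hypothetical count of subordinate resources still present.
    """
    value = labels.get(PROTECTION_LABEL)
    if value == "Always":
        # Never deletable until the label itself is removed.
        return False
    if value == "Cascading":
        # Deletable only when no subordinate resources remain.
        return active_children == 0
    # Unlabeled objects are not protected.
    return True


print(deletion_allowed({PROTECTION_LABEL: "Always"}, 0))
print(deletion_allowed({PROTECTION_LABEL: "Cascading"}, 3))
print(deletion_allowed({PROTECTION_LABEL: "Cascading"}, 0))
```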

The resource types and cascading relationships currently supported are as follows:

New features for CloneSet

1. Delete priority

The controller.kubernetes.io/pod-deletion-cost annotation was added in Kubernetes 1.21; when scaling down, ReplicaSet refers to this cost value when sorting Pods. CloneSet has also supported this feature since Kruise v0.9.0. Users can set this annotation on a Pod. Its value is an int that represents the “deletion cost” of this Pod relative to other Pods in the same CloneSet: the lower the cost, the higher the deletion priority. A Pod without the annotation has a default deletion cost of 0. Note that this deletion order is not absolute, because the actual Pod deletion order is determined roughly as follows:

  1. Unscheduled < scheduled
  2. PodPending < PodUnknown < PodRunning
  3. Not ready < ready
  4. Smaller pod-deletion-cost < larger pod-deletion-cost
  5. Shorter time in ready state < longer
  6. More container restarts < fewer
  7. Shorter time since creation < longer
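
The ordering above can be pictured as a multi-level sort key, where the Pod that sorts first is deleted first. The following Python sketch is an illustration under that reading, not the controller’s actual code; the PodInfo class and its field names are hypothetical.

```python
from dataclasses import dataclass

# Lower rank is deleted earlier.
PHASE_RANK = {"Pending": 0, "Unknown": 1, "Running": 2}


@dataclass
class PodInfo:
    # Hypothetical, simplified view of a Pod for ranking purposes.
    name: str
    scheduled: bool
    phase: str
    ready: bool
    deletion_cost: int = 0      # default when the annotation is absent
    ready_seconds: float = 0.0  # how long the Pod has been ready
    restarts: int = 0
    age_seconds: float = 0.0    # time since creation


def deletion_key(pod):
    """Tuple compared left to right; smaller tuples are deleted first."""
    return (
        pod.scheduled,            # unscheduled (False) before scheduled
        PHASE_RANK[pod.phase],    # Pending < Unknown < Running
        pod.ready,                # not ready before ready
        pod.deletion_cost,        # smaller cost first
        pod.ready_seconds,        # ready for a shorter time first
        -pod.restarts,            # more restarts first
        pod.age_seconds,          # more recently created first
    )


pods = [
    PodInfo("a", True, "Running", True, 0, 100, 0, 200),
    PodInfo("b", True, "Running", True, -5, 500, 0, 600),
    PodInfo("c", False, "Pending", False, 100, 0, 0, 10),
    PodInfo("d", True, "Running", False, 50, 0, 3, 300),
]
order = [p.name for p in sorted(pods, key=deletion_key)]
print(order)  # ['c', 'd', 'b', 'a']
```

Note how Pod c is deleted first despite its high deletion cost: scheduling state, phase, and readiness are compared before the cost.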

2. Image pre-download for in-place update

When an application is updated in place with CloneSet, only the container image is updated and the Pod is not recreated. This guarantees that the Pod stays on the same node before and after the update. Therefore, if CloneSet pulls the new image onto all the Pods’ nodes in advance, the in-place update in subsequent release batches becomes much faster.

To use this feature in the current version, you need to explicitly enable the PreDownloadImageForInPlaceUpdate feature-gate when installing or upgrading Kruise. Once it is enabled, when the user updates the image in the CloneSet template and the update strategy supports in-place update, CloneSet automatically creates an ImagePullJob object (the batch image pre-download feature provided by OpenKruise) for the new image, to pre-download it on the nodes where the Pods run.

By default, CloneSet configures the ImagePullJob with a parallelism of 1, that is, images are pulled node by node. To adjust this, set the parallelism for image pre-download in an annotation on the CloneSet:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"

3. Pod replacement by scaling up first, then scaling down

In past releases, CloneSet’s maxUnavailable and maxSurge policies only took effect during application releases. Starting with Kruise v0.9.0, these two policies also apply to specified Pod deletion. That is, when the user uses podsToDelete or the apps.kruise.io/specified-delete: true label to mark one or more Pods for deletion, CloneSet deletes them only if the number of currently unavailable Pods (relative to the total number of replicas) is less than maxUnavailable. In addition, if the user has configured maxSurge, CloneSet may create a new Pod first, wait for it to be ready, and then delete the specified old Pod. Which replacement method is used depends on maxUnavailable and the actual number of unavailable Pods at the time. For example:

  • For a CloneSet with maxUnavailable=2, maxSurge=1 and one unavailable Pod pod-a: if you mark another Pod, pod-b, for deletion, CloneSet deletes it immediately and creates a new Pod.
  • For a CloneSet with maxUnavailable=1, maxSurge=1 and one unavailable Pod pod-a: if you mark another Pod, pod-b, for deletion, CloneSet creates a new Pod first, waits for it to be ready, and then deletes pod-b.
  • For a CloneSet with maxUnavailable=1, maxSurge=1 and one unavailable Pod pod-a: if you mark pod-a itself for deletion, CloneSet deletes it immediately and creates a new Pod.
  • ...
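
The three cases above can be summarized as a small decision function. This Python sketch is a simplified reading of the examples, not CloneSet’s actual algorithm: the function name, its parameters, and the final “wait” outcome (for the case where neither budget allows progress) are assumptions made for illustration.

```python
def plan_specified_delete(target_available, unavailable_count,
                          max_unavailable, max_surge):
    """Decide how to handle one Pod that was marked for deletion.

    target_available: whether the marked Pod is currently available.
    unavailable_count: number of Pods in the CloneSet that are unavailable now.
    """
    if not target_available:
        # Deleting an already-unavailable Pod does not reduce availability.
        return "delete-now"
    if unavailable_count < max_unavailable:
        # There is still unavailability budget left.
        return "delete-now"
    if max_surge > 0:
        # Create the replacement first, delete the old Pod once it is ready.
        return "surge-first"
    # Assumption: with no budget and no surge, the deletion waits.
    return "wait"


# maxUnavailable=2, maxSurge=1, pod-a unavailable, pod-b marked
print(plan_specified_delete(True, 1, 2, 1))   # delete-now
# maxUnavailable=1, maxSurge=1, pod-a unavailable, pod-b marked
print(plan_specified_delete(True, 1, 1, 1))   # surge-first
# maxUnavailable=1, maxSurge=1, pod-a unavailable, pod-a itself marked
print(plan_specified_delete(False, 1, 1, 1))  # delete-now
```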

4. Efficient rollback based on the partition final state

Among native workloads, Deployment itself does not support canary (grayscale) releases, while StatefulSet has partition semantics that let the user control how many Pods are updated. Kruise workloads such as CloneSet and Advanced StatefulSet also provide partition to support canary releases in batches. For CloneSet, the partition is the number or percentage of old-version Pods to retain. For example, a CloneSet with 100 replicas can be released in 5 batches by changing the partition step by step: 80 -> 60 -> 40 -> 20 -> 0.

In the past, however, whether with Deployment, StatefulSet, or CloneSet, rolling back during a release required changing the template back to the old version. For the latter two, lowering the partition during a canary release upgrades old-version Pods to the new version, but raising the partition again had no effect.

Starting with v0.9.0, the CloneSet partition supports “final state rollback”. If the CloneSetPartitionRollback feature-gate is enabled when installing or upgrading Kruise, then when the user raises the partition, CloneSet rolls the corresponding number of new-version Pods back to the old version. The benefit is clear: during a canary release, you only need to adjust the partition value up and down to flexibly control the ratio of new-version to old-version Pods. Note, however, that the “new and old versions” CloneSet relies on correspond to updateRevision and currentRevision in its status:

  • updateRevision: The template version currently defined by the CloneSet.
  • currentRevision: The template version of the CloneSet’s last successful full release.
5. Short hash

By default, CloneSet sets the controller-revision-hash value in the Pod label to the full name of the ControllerRevision, for example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994

It is the CloneSet name joined to the ControllerRevision hash. Hash values are typically 8 to 10 characters long, and a label value in Kubernetes cannot exceed 63 characters. As a result, a CloneSet name generally cannot exceed 52 characters; with a longer name, Pods cannot be created. Version 0.9.0 introduces a new feature-gate, CloneSetShortHash. When it is enabled, CloneSet sets the controller-revision-hash value in the Pod label to just the hash, such as 956df7994, so this label no longer restricts the length of the CloneSet name. (Even with this feature enabled, CloneSet still recognizes and manages existing Pods whose revision label is in the full format.)
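
The 52-character figure follows from the label length limit: 63 characters, minus one hyphen, minus a hash of up to 10 characters. The Python sketch below works through that arithmetic; the function names are hypothetical and the code only models the label value, not any other Kubernetes naming limits.

```python
MAX_LABEL_VALUE_LEN = 63  # Kubernetes limit for a label value


def revision_label(cloneset_name, revision_hash, short_hash=False):
    """Value of controller-revision-hash with or without the short-hash mode."""
    if short_hash:
        return revision_hash
    return f"{cloneset_name}-{revision_hash}"


def label_fits(cloneset_name, revision_hash, short_hash=False):
    value = revision_label(cloneset_name, revision_hash, short_hash)
    return len(value) <= MAX_LABEL_VALUE_LEN


print(revision_label("demo-cloneset", "956df7994"))  # demo-cloneset-956df7994

ten_char_hash = "0123456789"
print(label_fits("a" * 52, ten_char_hash))        # True: 52 + 1 + 10 = 63
print(label_fits("a" * 53, ten_char_hash))        # False: 64 characters
print(label_fits("a" * 53, ten_char_hash, True))  # True: only the hash is stored
```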


Sidecar hot upgrade feature

SidecarSet is the workload that Kruise provides for managing sidecar containers independently. Through SidecarSet, users can inject and upgrade specified sidecar containers across a range of Pods. By default, an independent in-place upgrade of a sidecar stops the old version of the container and then creates the new version. This approach suits sidecar containers that do not affect Pod service availability, such as a log collection agent, but it is problematic for many proxy or runtime sidecar containers, such as Istio Envoy. Envoy acts as a proxy container in the Pod and proxies all of its traffic, so restarting it directly for an upgrade affects the availability of the Pod’s service. Upgrading the Envoy sidecar on its own therefore requires sophisticated graceful termination and coordination mechanisms. For this reason, we provide a new solution for upgrading this kind of sidecar container: hot upgrade.

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0

  • upgradeType: HotUpgrade means that this sidecar container uses the hot upgrade scheme.
  • hotUpgradeEmptyImage: When hot-upgrading a sidecar container, the business must provide an empty container used for container switching during the hot upgrade. The empty container has the same configuration as the sidecar container (except for the image address), such as command, lifecycle, and probe, but it does not do any work.
  • lifecycle.postStart: State migration. This hook transfers state during the hot upgrade. The script must be implemented by the business according to its own characteristics. For example, an Nginx hot upgrade needs to share the listen FD and reload.

For details on Sidecar injection and hot upgrade process, please refer to the documentation on the official website.

Finally

For more information on these capabilities, visit the documentation on the official website. Anyone interested in OpenKruise is welcome to take part in building our community. If you are already using the OpenKruise project, please register in the issue.

Search for group number 23330762 in DingTalk to join the DingTalk discussion group!
