Author | Si-Yu Wang

Background

OpenKruise is an open-source cloud-native application automation management suite from Alibaba Cloud, currently hosted as a Sandbox project under the Cloud Native Computing Foundation (CNCF). It distills Alibaba's years of containerization and cloud-native experience: a set of standard Kubernetes extension components used at large scale in Alibaba's internal production environment, together with technical concepts and best practices that stay close to upstream community standards while being adapted to Internet-scale scenarios. OpenKruise released its latest version, v0.9.0 (ChangeLog), on May 20, 2021, adding Pod container restart, cascading deletion protection for resources, and other important features. This article gives an overview of the new version.

Pod container restart/rebuild

"Restart" is a very simple need, a daily operations request and a common recovery measure in the technical field. However, native Kubernetes provides no container-granularity operations at all; a Pod, as the smallest operational unit, supports only two operations: create and delete.

Some may ask: in the cloud-native era, why would users still care about restarting a container? In an ideal Serverless model, shouldn't the business only need to care about the service itself?

The answer comes from the difference between cloud-native architecture and traditional infrastructure. In the era of physical machines and virtual machines, multiple application instances were deployed and run on one machine, and the lifecycles of the machine and the applications were separate. In that world, restarting an application instance might be just a systemctl or supervisor command, with no need to restart the entire machine. In the container/cloud-native model, however, the application lifecycle is tied to the Pod's containers: normally one container runs only one application process, and one Pod serves only one application instance. Because of these constraints, native Kubernetes offers no API that exposes container (application) restart capability to upper-layer business.

Kruise v0.9.0 provides a container restart capability at single-Pod granularity, compatible with standard Kubernetes clusters v1.16 and above. After installing or upgrading Kruise, you only need to create a ContainerRecreateRequest (CRR) object to specify a restart. The simplest YAML is as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar

The namespace must match the namespace of the Pod to be operated on, and the name can be any value. In the spec, podName is the Pod's name, and the containers list names one or more containers in that Pod to restart. Beyond these required fields, CRR offers a variety of optional restart policies:

spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
  activeDeadlineSeconds: 300
  ttlSecondsAfterFinished: 1800
  • failurePolicy: Fail or Ignore; defaults to Fail, which means the CRR ends as soon as any container fails to stop or recreate.
  • orderedRecreate: defaults to false; if true and the list has multiple containers, each container is recreated only after the previous one has finished.
  • terminationGracePeriodSeconds: how long to wait for a container to exit gracefully; if unset, the time defined on the Pod is used.
  • unreadyGracePeriodSeconds: set the Pod to not-ready before recreating, and wait this long before starting the recreation.
    • Note: this feature relies on the KruisePodReadinessGate feature-gate being enabled, which injects a readinessGate into every Pod at creation time. Otherwise, by default a readinessGate is injected only into Pods created by Kruise workloads, which means only those Pods can use unreadyGracePeriodSeconds during CRR recreation.
  • minStartedSeconds: the new container is considered successfully recreated only after it has been running for at least this long.
  • activeDeadlineSeconds: if the CRR has been executing for longer than this, it is marked as finished directly (containers not yet recreated are marked as failed).
  • ttlSecondsAfterFinished: the CRR is automatically deleted this long after it finishes.

Implementation principle: after a user creates a CRR, it is handled first by kruise-manager and then picked up by the kruise-daemon on the node where the Pod runs, which executes it as follows:

  1. If the Pod's container defines a preStop hook, kruise-daemon first execs into the container through the CRI runtime to run preStop.
  2. If there is no preStop, or once it has completed, kruise-daemon calls the CRI interface to stop the container.
  3. When kubelet notices that the container has exited, it creates a new container with an incremented "serial number" and starts it (running postStart along the way).
  4. kruise-daemon detects that the new container has started successfully and reports in the CRR that this container has been recreated.

The container "serial number" above corresponds to the restartCount that kubelet reports in the Pod status, so you will see the Pod's restartCount increase after the container restarts. Also, files that were temporarily written to the old container's rootfs are lost because the container has been rebuilt, but data in volume mounts remains.
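As a small illustration (the values below are hypothetical), after a CRR recreates the app container, the Pod status shows the increment through the standard restartCount field:

apiVersion: v1
kind: Pod
# ...
status:
  containerStatuses:
  - name: app
    ready: true
    restartCount: 1   # was 0 before the CRR ran; incremented by the recreation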

Cascading deletion protection

Kubernetes' final-state-oriented automation is a "double-edged sword": it gives applications declarative deployment capabilities, but the pursuit of the final state can also amplify the effects of a mistaken operation. Consider its "cascading deletion" mechanism: under normal conditions (i.e. not an orphan deletion), once a parent resource is deleted, all of its child resources are deleted along with it:

  1. Deleting a CRD cleans up all of its corresponding CRs.
  2. Deleting a namespace deletes all resources in it, including Pods, in one go.
  3. Deleting a workload (Deployment/StatefulSet/...) deletes all of its Pods.

We have heard complaints from K8s users and developers in the community about failures caused by such "cascading deletions". For any enterprise, a mistaken deletion at this scale in the production environment is an unbearable pain, and Alibaba is no exception. Therefore, in Kruise v0.9.0 we have brought Alibaba's internal protection against cascading deletion to the community, hoping it provides a stability guarantee for more users. To use this feature in the current version, you need to explicitly enable the ResourcesDeletionProtection feature-gate when installing or upgrading Kruise. You can then add the label policy.kruise.io/delete-protection to a resource object that needs protection; its value can be either of the following:

  • Always: the object cannot be deleted unless the label is removed.
  • Cascading: the object cannot be deleted as long as it still has subordinate resources.

The resource types and cascading relationships currently supported are listed in the documentation on the official website.
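As a minimal sketch of the label in use (the namespace name ns-demo is illustrative), protecting a namespace from accidental deletion looks like this:

apiVersion: v1
kind: Namespace
metadata:
  name: ns-demo
  labels:
    # deletion requests are rejected until this label is removed
    policy.kruise.io/delete-protection: Always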

New features in CloneSet

1. Deletion priority

The controller.kubernetes.io/pod-deletion-cost annotation was added in Kubernetes 1.21; ReplicaSet uses this cost value to sort Pods when scaling down. CloneSet supports the same annotation since Kruise v0.9.0. A user can set this annotation on a Pod; its value is an integer representing the "deletion cost" of that Pod relative to other Pods under the same CloneSet, and Pods with a smaller cost have a higher deletion priority. Pods without the annotation default to a deletion cost of 0. Note that this order is not an absolute guarantee, because real Pod deletion follows an order similar to the list below (a usage sketch of the annotation follows the list):

  1. Unscheduled < scheduled
  2. PodPending < PodUnknown < PodRunning
  3. Not ready < ready
  4. Smaller pod-deletion-cost < larger pod-deletion-cost
  5. Shorter time in Ready < longer time in Ready
  6. More container restarts < fewer container restarts
  7. Shorter creation time < longer creation time
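As a sketch of the annotation in use (the value -100 is purely illustrative; the value is an integer encoded as a string), marking a Pod as a preferred deletion candidate under its CloneSet looks like this:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # lower cost = deleted earlier than sibling Pods (default cost is 0)
    controller.kubernetes.io/pod-deletion-cost: "-100"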

2. Configure image pre-heating for in-place upgrades

When CloneSet performs an in-place upgrade, only the container image is updated and no Pod is rebuilt, which guarantees that the node a Pod runs on does not change across the upgrade. Therefore, if CloneSet pulls the new image onto the nodes of all its Pods in advance, the in-place upgrades in subsequent release batches become much faster. To use this feature in the current version, you need to explicitly enable the PreDownloadImageForInPlaceUpdate feature-gate when installing or upgrading Kruise. Once enabled, when a user updates the image in the CloneSet template and the publishing policy allows in-place upgrade, CloneSet automatically creates an ImagePullJob object (the batched image pre-heating capability provided by OpenKruise) to pre-pull the new image on the Pods' nodes. By default, CloneSet sets the ImagePullJob concurrency to 1, i.e. the image is pulled node by node. If you need to adjust this, you can set the pre-heating concurrency via an annotation on the CloneSet:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"

3. Pod replacement by scaling out first, then scaling in

In previous versions, CloneSet's maxUnavailable and maxSurge policies applied only during the application release process. Starting from Kruise v0.9.0, these two policies also take effect for specified Pod deletion. That is, when a user specifies one or more Pods for deletion via podsToDelete or the apps.kruise.io/specified-delete: true label, CloneSet only performs the deletion when the number of currently unavailable Pods (relative to the total replicas) is below maxUnavailable. Additionally, if the user has configured maxSurge, CloneSet may first create a new Pod, wait for it to become ready, and only then delete the specified old Pod. The exact replacement behavior depends on maxUnavailable and the actual number of unavailable Pods at that moment, as in the following examples (a labeling sketch follows the list):

  • For a CloneSet with maxUnavailable=2 and maxSurge=1, if a pod-a is already unavailable and another pod-b is specified for deletion, CloneSet deletes pod-b immediately and creates a new Pod.
  • For a CloneSet with maxUnavailable=1 and maxSurge=1, if a pod-a is already unavailable and another pod-b is specified for deletion, CloneSet first creates a new Pod, waits for it to become ready, and then deletes pod-b.
  • For a CloneSet with maxUnavailable=1 and maxSurge=1, if pod-a is already unavailable and pod-a itself is specified for deletion, CloneSet deletes pod-a immediately and creates a new Pod.
  • ...
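As a sketch of specifying a Pod for deletion with the label (the Pod name is illustrative; the label value is the string "true"):

apiVersion: v1
kind: Pod
metadata:
  name: pod-b
  labels:
    # asks the owning CloneSet to replace this Pod, honoring
    # its maxUnavailable / maxSurge policies as described above
    apps.kruise.io/specified-delete: "true"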

4. Efficient rollback based on partition final state

Among the native workloads, Deployment does not itself support grayscale release, while StatefulSet offers partition semantics that let users control how many Pods are upgraded in a grayscale. Kruise workloads such as CloneSet and Advanced StatefulSet also provide partition to support grayscale in batches. For CloneSet, the semantics of partition is the number or percentage of Pods to keep on the old version. For example, a CloneSet with 100 replicas can be released in 5 batches by changing partition step by step, 80 -> 60 -> 40 -> 20 -> 0, while upgrading its image.

In the past, however, whether with Deployment, StatefulSet, or CloneSet, the only way to roll back during a release was to change the template (the image) back to the old version. For the latter two, reducing the partition during the grayscale process triggers upgrades from the old version to the new one, but increasing the partition again does nothing.

Starting with v0.9.0, the CloneSet partition supports a "final-state rollback". If the CloneSetPartitionRollback feature-gate was enabled when installing or upgrading Kruise, then when a user increases the partition, CloneSet rolls back the corresponding number of new-version Pods to the old version. The advantage is obvious: during a grayscale release, you only need to adjust the partition value back and forth to flexibly control how many old- and new-version Pods exist. Note, however, that the "old and new versions" CloneSet relies on correspond to updateRevision and currentRevision in its status:

  • updateRevision: the version of the template currently defined in the CloneSet.
  • currentRevision: the template version of this CloneSet's last successful full release.
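A minimal sketch of the batched release described above (the replica count and partition value come from the example; selector and template are omitted): with 100 replicas, partition: 80 keeps 80 Pods on the old version, and with the CloneSetPartitionRollback feature-gate on, raising the value again rolls the corresponding Pods back.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  # selector/template omitted
  replicas: 100
  updateStrategy:
    # number (or percentage) of Pods to keep on the old revision;
    # step 80 -> 60 -> 40 -> 20 -> 0 to release in 5 batches
    partition: 80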

5. Short hash

By default, CloneSet sets the controller-revision-hash label on its Pods to the full name of the ControllerRevision, for example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994

This value is the concatenation of the CloneSet name and the ControllerRevision hash. The hash itself is usually 8 to 10 characters long, and a label value in Kubernetes cannot exceed 63 characters, so CloneSet names were effectively limited to 52 characters; with a longer name, Pods could not be created. v0.9.0 introduces the new CloneSetShortHash feature-gate. If it is enabled, CloneSet sets the controller-revision-hash value on Pods to the hash alone, such as 956df7994, removing the limit on CloneSet name length. (Even with this feature enabled, CloneSet still recognizes and manages existing Pods whose revision label uses the full format.)
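For comparison, with the feature-gate on, a newly created Pod carries only the short form of the label (hash value taken from the example above):

apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: "956df7994"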

SidecarSet

Sidecar hot upgrade

SidecarSet is Kruise's workload for managing sidecar containers independently. Users can inject and upgrade specified sidecar containers across a range of Pods with a SidecarSet. By default, the standalone in-place upgrade of a sidecar first stops the old container and then creates the new one. This approach is suitable for sidecar containers that do not affect Pod service availability, such as log-collection agents, but for many proxy or runtime sidecar containers, such as Istio's Envoy, this upgrade path is problematic: Envoy acts as the proxy container in the Pod and handles all of its traffic, so stopping and restarting it directly affects the availability of the Pod's service. Upgrading an Envoy sidecar on its own would require sophisticated graceful-termination and coordination mechanisms. Therefore, we provide a new solution for upgrading such sidecar containers: hot upgrade.

apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0
  • upgradeType: HotUpgrade marks this sidecar container as the hot-upgrade type, so the hot-upgrade scheme is used when it is upgraded.
  • hotUpgradeEmptyImage: for a hot upgrade, the business must provide an "empty" container used for switching during the upgrade. The empty image has the same configuration as the sidecar image except for the image address, e.g. command, lifecycle, and probes, but it does nothing at all.
  • lifecycle.postStart: state migration; this process completes the migration of state during the hot upgrade.

For details of the sidecar injection and hot-upgrade flow, see the documentation on the official website.

Finally

To learn more about these capabilities, visit the documentation on our website. Anyone interested in OpenKruise is welcome to take part in building our community, and users who already run the OpenKruise project are invited to register in this issue.

Join the OpenKruise community discussion group on DingTalk by searching for group number 23330762!