Author | Gu Zhiguang, Senior Development Engineer at Alibaba

This article is based on Lecture 30 of the “CNCF x Alibaba Cloud Native Technology Open Class”.

Where the need for RuntimeClass comes from

The evolution of the container runtime

Let’s take a look at the evolution of the container runtime, which can be divided into three phases:

  • Phase 1: June 2014

Kubernetes was officially open-sourced; Docker was the only, and the default, container runtime at the time;

  • Phase 2: Kubernetes v1.3

rkt was merged into the Kubernetes main tree and became the second container runtime;

  • Phase 3: Kubernetes v1.5

More and more container runtimes wanted to plug into Kubernetes. Building each of them into Kubernetes the way Docker and rkt were would pose serious challenges to Kubernetes code maintenance and quality assurance.

The community recognized this, so it introduced CRI, the Container Runtime Interface, in version 1.5. The benefit is that runtimes and Kubernetes are decoupled: the community no longer has to adapt Kubernetes to each runtime, nor worry about version-maintenance problems caused by runtime and Kubernetes release cycles drifting apart. A typical example is the cri-plugin in containerd, which lets runtimes such as kata-containers and gVisor integrate with Kubernetes simply by plugging into containerd.

As more container runtimes appeared, each suited to different scenarios, the need arose to run multiple container runtimes in a single cluster. Doing so, however, raises the following questions:

  • What container runtimes are available in the cluster?
  • How do I choose the right container runtime for a Pod?
  • How do I schedule a Pod to a node that has the specified container runtime?
  • Running containers also incurs some extra overhead beyond the containers themselves. How is this “overhead” accounted for?

Workflow of RuntimeClass

To address these issues, the community introduced RuntimeClass. It first appeared in Kubernetes v1.12, originally in the form of a CRD. From v1.14 onward, RuntimeClass became a built-in cluster resource object, and v1.16 extended it with Scheduling and Overhead capabilities on top of v1.14.

The following uses v1.16 as an example to illustrate the RuntimeClass workflow. In the figure above, the workflow flowchart is on the left and a YAML file is on the right.

The YAML file contains two parts: the upper part creates a RuntimeClass object named runv, and the lower part creates a Pod that references that RuntimeClass through spec.runtimeClassName.

At the heart of the RuntimeClass object is the handler, which represents a program that receives container-creation requests and corresponds to one container runtime. In this example, the Pod will ultimately be created by the runv container runtime; Scheduling determines which nodes the Pod can be scheduled to.
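The figure is not reproduced here, but based on the description above, the YAML on the right presumably resembles the following sketch. The Pod name, container image, and the runtime: runv node label are illustrative assumptions; the apiVersion is the built-in node.k8s.io group used since v1.14.

```yaml
# RuntimeClass object named runv; the handler names the container runtime
# that will receive container-creation requests.
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runv
handler: runv
scheduling:
  nodeSelector:
    runtime: runv          # illustrative label; the target nodes must already carry it
---
# Pod that references the runv RuntimeClass through spec.runtimeClassName.
apiVersion: v1
kind: Pod
metadata:
  name: my-runv-pod        # illustrative name
spec:
  runtimeClassName: runv
  containers:
  - name: app              # illustrative container
    image: nginx
```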

To illustrate the workflow of RuntimeClass:

  1. The Kubernetes master receives a request to create a Pod;
  2. The squares represent three kinds of nodes. Each node carries labels identifying the container runtimes it supports. A node may have one or more handlers, and each handler corresponds to one container runtime. For example, the second box represents a node with handlers for both the runc and runv container runtimes, while the third box represents a node whose handler supports the runhcs container runtime;
  3. Based on scheduling.nodeSelector, the Pod is eventually scheduled to the node in the middle box, and the runv handler creates the Pod there.

RuntimeClass

Structure definition of RuntimeClass

Again, take RuntimeClass in Kubernetes v1.16 as an example. Let’s start with the structure definition of RuntimeClass.

A RuntimeClass object represents a container runtime, and its structure contains Handler, Overhead, and Scheduling fields.

  • We also mentioned Handler in the previous example, which represents a program that receives a request to create a container and corresponds to a container runtime.
  • Overhead is a field introduced in v1.16 that represents the additional resources a Pod consumes beyond those required by the business containers themselves;
  • The third field, Scheduling, was also introduced in v1.16; its configuration is automatically injected into the Pod’s nodeSelector.

RuntimeClass resource definition example

Referencing a RuntimeClass in a Pod is very simple: set the RuntimeClass name in the Pod’s runtimeClassName field.

Structure definition of Scheduling

As the name implies, Scheduling relates to scheduling: not the scheduling of the RuntimeClass object itself, but the scheduling of Pods that reference the RuntimeClass.

Scheduling contains two fields, NodeSelector and Tolerations. These are very similar to the NodeSelector and Tolerations included in Pod itself.

NodeSelector represents the list of labels that must exist on nodes supporting this RuntimeClass. When a Pod references the RuntimeClass, the RuntimeClass admission controller merges this label list with the label list in the Pod’s own nodeSelector. If the two conflict, that is, if they contain the same key with different values, the Pod is rejected by admission. Note also that RuntimeClass does not automatically set labels on nodes; you must label the nodes before using the RuntimeClass.

Tolerations represents the RuntimeClass’s toleration list. When a Pod references the RuntimeClass, admission also merges this list with the Pod’s own tolerations; if two tolerations have the same configuration, they are combined into one.
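Putting the two fields together, a hedged sketch of a RuntimeClass with a scheduling section might look like this; the vm-runtime label key and the dedicated=kata taint are invented for illustration, and the nodes must carry the label beforehand:

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runv
handler: runv
scheduling:
  # Merged into the Pod's nodeSelector by admission; the nodes must already
  # carry this label, e.g. set beforehand with:
  #   kubectl label nodes <node-name> vm-runtime=kata
  nodeSelector:
    vm-runtime: kata
  # Merged into the Pod's tolerations by admission; duplicates are collapsed.
  tolerations:
  - key: dedicated
    operator: Equal
    value: kata
    effect: NoSchedule
```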

Why was Pod Overhead introduced?

The image above shows a Docker Pod on the left and a Kata Pod on the right. A Docker Pod contains a pause container in addition to the business containers, but the pause container can be ignored when calculating the Pod’s overhead. A Kata Pod, besides its business containers, also runs kata-agent, pause, and a guest kernel, and none of these costs are counted anywhere. Since this overhead can sometimes exceed 100 MB, it cannot be ignored.

That’s why we introduced Pod Overhead. Its structure is defined as follows:

The definition is very simple: a single field, PodFixed. It is a map whose key is a ResourceName and whose value is a Quantity, where each Quantity represents the amount of one resource. PodFixed therefore describes the fixed overhead of various resources, such as CPU and memory.
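A hedged sketch of how this looks on a Kata-style RuntimeClass follows; the 250m CPU and 160Mi memory figures are illustrative assumptions rather than measured Kata overhead, and in v1.16 the PodOverhead feature gate must be enabled for the field to take effect.

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runv
handler: runv
overhead:
  podFixed:            # map of ResourceName -> Quantity
    cpu: 250m          # illustrative fixed CPU cost of the sandbox
    memory: 160Mi      # illustrative fixed memory cost (agent, pause, guest kernel)
```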

Usage scenarios and limitations for Pod Overhead

Pod Overhead has three main usage scenarios:

  • Pod scheduling

Before Overhead, a Pod could be scheduled onto a node as long as the node’s available resources were greater than or equal to the Pod’s requests. With Overhead introduced, a Pod can be scheduled onto a node only if the node’s available resources are greater than or equal to the Pod’s requests plus its Overhead.

  • ResourceQuota

ResourceQuota is a namespace-level resource quota. Suppose a namespace has a 1 GB memory quota and a Pod whose memory request is 500 MB: up to two such Pods can be scheduled in that namespace. If each Pod also carries 200 MB of Overhead, at most one such Pod can be scheduled in the namespace (see the sketch after this list).

  • Kubelet Pod eviction

Overhead is added to a node’s used resources, raising the node’s resource utilization, which in turn affects Kubelet’s Pod eviction decisions.
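To make the ResourceQuota arithmetic above concrete, here is a minimal sketch; the quota name and namespace are made up, and the numbers are the 1 GB / 500 MB / 200 MB figures from the example, written as Gi/Mi for simplicity.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-quota        # illustrative name
  namespace: demo        # illustrative namespace
spec:
  hard:
    requests.memory: 1Gi # the namespace-wide memory budget from the example
# Without Overhead: 2 x 500Mi = 1000Mi <= 1Gi, so two such Pods fit.
# With 200Mi Overhead per Pod: 2 x (500Mi + 200Mi) = 1400Mi > 1Gi,
# while 1 x 700Mi <= 1Gi, so only one Pod fits.
```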

Those are the usage scenarios for Pod Overhead. Beyond them, Pod Overhead has some restrictions and caveats:

  • Pod Overhead is permanently injected into the Pod and cannot be changed manually; it remains in effect even after the RuntimeClass is removed or updated (see the sketch after this list);
  • Pod Overhead can only be injected automatically by the RuntimeClass admission controller (at least for now); attempts to add or change it manually are rejected;
  • Pod Overhead does not affect HPA and VPA aggregation based on container-level metric data.
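For illustration only, this is roughly what a Pod looks like when read back after RuntimeClass admission has injected the overhead; the Pod name and values are assumptions carried over from the earlier sketches, and spec.overhead is written by admission, not by the user.

```yaml
# Roughly what the stored Pod looks like after RuntimeClass admission has run.
apiVersion: v1
kind: Pod
metadata:
  name: my-runv-pod
spec:
  runtimeClassName: runv
  overhead:              # injected automatically; manual edits are rejected
    cpu: 250m
    memory: 160Mi
  containers:
  - name: app
    image: nginx
```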

The multi-container runtime example

At present, Alibaba Cloud ACK’s secure sandboxed containers already support multiple container runtimes. The following uses that environment as an example to illustrate how multiple container runtimes work together.

The figure above shows two Pods: on the left a runc Pod that references the runc RuntimeClass, and on the right a runv Pod that references the runv RuntimeClass. The corresponding requests are color-coded, blue for runc and red for runv. The core of the lower half of the diagram is containerd, which can be configured with multiple container runtimes; the requests above eventually arrive here and are forwarded to the right runtime.

Let’s follow the runc request first. It arrives at kube-apiserver, which forwards it to the kubelet; the kubelet then sends the request to the cri-plugin (a plugin that implements CRI). The cri-plugin looks up the runc handler in the containerd configuration file, finds that it uses Shim API runtime v1, and asks containerd-shim to create the container. That is the runc flow.

The runv flow is similar. The request goes to kube-apiserver, then to the kubelet, and then to the cri-plugin, which matches the runv entry in the containerd configuration. Since that entry uses Shim API runtime v2, containerd-shim-kata-v2 is started, which in turn creates a Kata Pod.

Let’s take a look at containerd’s configuration.

By default, containerd’s configuration lives at /etc/containerd/config.toml. The core configuration sits under plugins.cri.containerd. Each runtime entry shares the prefix plugins.cri.containerd.runtimes, followed by runc and runv, the two RuntimeClasses discussed above; runc and runv correspond to the handler names in the RuntimeClass objects shown earlier. There is also one special entry, plugins.cri.containerd.runtimes.default_runtime, which means that if a Pod specifies no RuntimeClass but is scheduled to this node, the runc container runtime is used by default.
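As a rough, illustrative sketch of such a config.toml; the exact keys and runtime_type values differ between containerd versions and distributions, so treat this as an assumption rather than the exact ACK configuration.

```toml
# /etc/containerd/config.toml (illustrative excerpt)
[plugins.cri.containerd.runtimes.runc]
  # Shim API runtime v1: containers are created via containerd-shim
  runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.runv]
  # Shim API runtime v2: containers are created via containerd-shim-kata-v2
  runtime_type = "io.containerd.kata.v2"

# A default runtime (runc in this article) is configured separately for Pods
# that specify no RuntimeClass; the exact key differs across containerd versions.
```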

The following example creates the runc and runv RuntimeClass objects. All container runtimes currently available in the cluster can be listed with kubectl get runtimeclass.
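The original shows the objects as an image; an equivalent YAML sketch, with handler names matching the containerd configuration above, would be roughly:

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runc
handler: runc            # matches plugins.cri.containerd.runtimes.runc
---
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runv
handler: runv            # matches plugins.cri.containerd.runtimes.runv
```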

The image below shows, from left to right, a runc Pod and a runv Pod. The key part of each is the runtimeClassName field, which references the runc and runv container runtimes respectively.
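The image itself is likewise not reproduced; the two Pod manifests presumably look something like the sketch below. The container names and the nginx image are assumptions; the Pod names match the kubectl output described next.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runc-pod
spec:
  runtimeClassName: runc   # use the runc container runtime
  containers:
  - name: app              # illustrative container
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: runv-pod
spec:
  runtimeClassName: runv   # use the runv (Kata) container runtime
  containers:
  - name: app
    image: nginx
```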

After the Pods are created, we can use kubectl to check each Pod’s running status and the container runtime it uses. There are now two Pods in the cluster: runc-pod and runv-pod, referencing the runc and runv RuntimeClasses respectively, and both are in the Running state.

Summary

That concludes the main content of this article. Here is a brief summary:

  • RuntimeClass is a built-in cluster resource in Kubernetes, designed mainly to solve the problem of running multiple container runtimes in the same cluster;
  • Configuring Scheduling in a RuntimeClass automatically schedules Pods that reference it to nodes running the specified container runtime, but users must label those nodes in advance;
  • Configuring Overhead in a RuntimeClass accounts for the resources a Pod consumes beyond its business containers, which makes scheduling, ResourceQuota accounting, and Kubelet Pod eviction more accurate.

