
Lecturer: Ni Pengfei/Microsoft Senior Software Engineer

Proofreading: Summer

Editor: Little Junjun

Kubernetes has become the de facto standard in the field of container scheduling. Its excellent architecture not only delivers rich container scheduling functionality but also provides extension interfaces at every level to meet users' customization needs. The container runtime is the key component with which Kubernetes manages and runs containers, and it too has an easy-to-use extension interface: the Container Runtime Interface (CRI). CRI has allowed container runtimes to flourish and has brought more options to complex scenarios such as strong isolation and multi-tenancy.

This article describes the evolution of the Kubernetes container runtime, how the community uses the container runtime to address a variety of complex scenarios such as multi-tenancy, and the Kubernetes community's vision for the container runtime's future. The main content falls into four parts:

  • A brief introduction to the Kubernetes architecture, especially the Kubelet architecture, locating the container runtime's position and role within it;

  • The evolution of the container runtime over the course of Kubernetes' development;

  • The design of the Container Runtime Interface and how to implement a new container runtime;

  • The Kubernetes community's vision for the container runtime.

Introduction to Kubernetes

As we know, Kubernetes is Google's open source container cluster management system. It has grown very rapidly and has become the most popular and active container orchestration system. It provides comprehensive cluster management capabilities, including multi-level security protection and access mechanisms, multi-tenant application support, transparent service registration and service discovery, built-in load balancing, fault detection and self-healing, rolling upgrades and online scaling, an extensible automatic scheduling mechanism, and multi-granularity resource quota management.

In terms of architecture, Kubernetes components can be divided into two parts: Master and Node. The Master is the brain of the whole cluster, responsible for all orchestration, scheduling, API access, and so on.

Specifically, Master consists of the following components:

  • etcd stores the state of the entire cluster;

  • kube-apiserver provides the sole entry point for resource operations, along with mechanisms for authentication, authorization, access control, API registration, and discovery. All components, inside and outside the cluster, must access cluster data through the API Server;

  • kube-controller-manager is responsible for maintaining the state of the cluster and contains the controllers for many resources. These controllers are the brains that make the Kubernetes declarative API work, handling fault detection, automatic scaling, rolling updates, and so on;

  • kube-scheduler is responsible for resource scheduling, placing Pods onto appropriate Nodes according to a predetermined scheduling policy.

Node is responsible for running a specific container and providing storage, networking, and other necessary functions for the container:

  • Kubelet is responsible for maintaining the lifecycle of the container, as well as managing the Volume (CSI) and network (CNI);

  • The container runtime is responsible for image management and for actually running Pods and containers (via CRI). The default container runtime is Docker.

  • kube-proxy is responsible for providing service discovery and load balancing for Services within the cluster.

  • The Network Plugin is responsible for configuring networking for the container.

In addition to these core components, Kubernetes offers many rich features that are deployed as add-ons. kube-dns and metrics-server, for example, are deployed as containers in the cluster and provide APIs for other components to call.

Kubelet architecture

Kubelet is responsible for maintaining the container lifecycle. It also works with kube-controller-manager to manage the containers' volumes and with CNI to manage the containers' networks. Below is the architecture of Kubelet.

Components of Kubelet include:

  • Kubelet Server exposes APIs for kube-apiserver, metrics-server, and other services to invoke. For example, kubectl exec needs to interact with the container through the Kubelet API /exec/{token};

  • Container Manager manages container resources, such as cgroups, QoS, cpusets, and devices;

  • Volume Manager manages the containers' storage volumes, for example formatting storage disks, mounting them to the node, and finally passing the mount path to the containers;

  • Eviction evicts low-priority containers when node resources run low, ensuring that high-priority containers keep running;

  • cAdvisor is responsible for providing metrics for containers;

  • Metrics and stats provide container and node metrics. For example, the metrics that metrics-server pulls via /stats/summary are the basis for HPA autoscaling;

  • Generic Runtime Manager is the container runtime manager, responsible for interacting with the runtime through CRI and managing containers and images.

  • Under CRI, there are two implementations of the container runtime:

  • One is the built-in Dockershim, which implements support for the Docker container engine and for CNI network plugins (including kubenet);

  • The other is external container runtimes, with support for runc, containerd, gVisor, and others.

Kubelet interacts with the external container runtime through the CRI interface. Components include:

  • CRI Server: CRI gRPC Server, listening on Unix sockets.

  • Streaming Server: Provides the Streaming API, including Exec, Attach, Port Forward;

  • Container and image management, such as pulling images, creating and starting containers, etc.

  • CNI network plug-in support for configuring networks for containers;

  • Container engine management, such as support for runc, containerd, or even multiple container engines.

The container runtime in Kubernetes can be divided into three parts according to different functions:

  • Part 1: the container runtime management inside Kubelet, which manages containers and images through CRI;

  • Part 2: the Container Runtime Interface, the communication interface between Kubelet and the external container runtime;

  • Part 3: the concrete container runtime implementations, including Kubelet's built-in Dockershim and external container runtimes (such as CRI-O, cri-containerd, and Frakti).

The evolution of the container runtime

Let's look at the evolution of the container runtime through these three parts.

The evolution of the container runtime can be divided into three phases:

In the first phase, before Kubernetes v1.5, Kubelet had built-in support for Docker and rkt and configured container networks for them through CNI network plugins. Customizing runtime functionality was painful at this stage: it required modifying Kubelet's code, and those modifications might not be accepted by the upstream community, forcing users to maintain a fork of their own, with all the attendant maintenance and upgrade burden.

In the second phase, different container runtime implementations had different strengths, and many users wanted Kubernetes to support more runtimes. So, starting with v1.5, the CRI interface was added to remove these barriers through a container runtime abstraction layer, allowing Kubelet to support multiple container runtimes without modification.

The CRI interface includes a set of Protocol Buffers definitions, gRPC APIs, libraries for the streaming interfaces, and a series of tools for debugging and validation. At this stage, the built-in Docker implementation was also gradually migrated onto the CRI interface. rkt, however, was not fully migrated at this point; its CRI support (rktlet) is developed in a separate repository for easier maintenance and management.

In the third phase, starting with v1.11, the rkt code built into Kubelet was removed and the CNI implementation was moved into Dockershim. From then on, all container runtimes other than Docker are accessed through CRI. An external container runtime, commonly referred to as a CRI shim, is responsible for configuring the container's network in addition to implementing the CRI interface. CNI is recommended because it supports the community's many network plugins, but it is not mandatory; the network plugin only needs to satisfy Kubernetes' basic network assumption of IP-per-Pod, i.e. all Pods and Nodes can reach each other directly by IP.

Container Runtime Interface (CRI)

The Container Runtime Interface (CRI) is a gRPC-based interface for extending the container runtime. Users do not need to care about the internal communication logic; they only need to implement the defined interfaces (RuntimeService and ImageService):

  • RuntimeService is responsible for managing the Pod and container lifecycle;

  • ImageService is responsible for managing the image lifecycle;

In addition to the gRPC API, CRI includes libraries for implementing Streaming Server (for Exec, Attach, PortForward, and so on) and CRI Tools.

A container runtime based on the CRI interface is often called a CRI shim. It is a gRPC server listening on a local Unix socket, while Kubelet acts as the gRPC client calling the CRI interface. In addition, the external container runtime manages the container's network itself; CNI is recommended so that it stays consistent with Kubernetes' network model.

The introduction of CRI has brought new prosperity to the container community. A number of container runtimes, such as CRI-O, Frakti, and cri-containerd, emerged for different scenarios:

  • cri-containerd: a containerd-based container runtime;

  • CRI-O: an OCI-based container runtime;

  • Frakti: a virtualization-based container runtime.

On top of these container runtimes, you can also easily plug in new container engines. For example, new container engines such as Clear Containers and gVisor can connect to Kubernetes through CRI-O or cri-containerd, extending Kubernetes to the strong-isolation and multi-tenant scenarios that traditional IaaS can offer.

When using CRI, you need to set Kubelet's --container-runtime parameter to remote and set --container-runtime-endpoint to the Unix socket the runtime listens on (a TCP port on Windows).
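
For example, a minimal sketch of the corresponding Kubelet flags, assuming containerd's conventional socket path (the actual path depends on your runtime):

# Point Kubelet at an external CRI runtime over its Unix socket.
kubelet --container-runtime=remote \
        --container-runtime-endpoint=unix:///run/containerd/containerd.sock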

CRI interface

The CRI interface consists of two services, RuntimeService and ImageService, which can be implemented in a single gRPC server or split into two independent services. Most community runtimes today implement both in one gRPC server.
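
As a sketch of this split, crictl (a CRI client tool introduced later in this article) lets you point the two services at separate endpoints; the endpoints coincide when one gRPC server implements both. The socket path here is a placeholder:

# Query the runtime's version via the runtime and image endpoints.
crictl --runtime-endpoint unix:///run/containerd/containerd.sock \
       --image-endpoint unix:///run/containerd/containerd.sock \
       version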

The ImageService that manages images provides five interfaces:

  • List images;

  • Pull an image to the local node;

  • Query the status of an image;

  • Remove a local image;

  • Query the filesystem space occupied by images.

These are all easily mapped to the Docker API or CLI.
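
As a rough illustration, here is how those five RPCs (named here per the CRI v1alpha2 definition) line up with familiar Docker CLI commands and their crictl counterparts; the mapping is approximate rather than one-to-one:

# ImageService RPC      Docker CLI           crictl
# ListImages        ->  docker images        crictl images
# PullImage         ->  docker pull          crictl pull
# ImageStatus       ->  docker inspect       crictl inspecti
# RemoveImage       ->  docker rmi           crictl rmi
# ImageFsInfo       ->  docker system df     crictl imagefsinfo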

RuntimeService provides more interfaces, which can be divided into four groups based on functionality:

  • PodSandbox management interface: PodSandbox is Kubernetes' abstraction of a Pod, providing containers with an isolated environment (for example, placing them under the same cgroup) and shared namespaces such as the network namespace. A PodSandbox usually corresponds to a pause container or a virtual machine;

  • Container management interface: creates, starts, stops, and deletes containers within a specified PodSandbox (a crictl walkthrough of these two groups follows this list);

  • Streaming API interface: the Exec, Attach, and PortForward interfaces for data interaction with containers; these three return the URL of the runtime's Streaming Server rather than interacting with the container directly;

  • Status interface: queries the API version and the runtime status.
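
To make the first two groups concrete, here is a hedged sketch of driving them by hand with crictl; the JSON config file names are placeholders, and their format follows the cri-tools documentation:

# Create a PodSandbox from a pod config, then run a container inside it.
POD_ID=$(crictl runp pod-config.json)                                    # RunPodSandbox
CTR_ID=$(crictl create "$POD_ID" container-config.json pod-config.json) # CreateContainer
crictl start "$CTR_ID"                                                  # StartContainer

# Tear everything down again.
crictl stop "$CTR_ID" && crictl rm "$CTR_ID"    # StopContainer / RemoveContainer
crictl stopp "$POD_ID" && crictl rmp "$POD_ID"  # StopPodSandbox / RemovePodSandbox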

Streaming API

The Streaming API is used for interaction between the client and the container, and includes the Exec, PortForward, and Attach interfaces. Kubelet's built-in Docker support implements these features through nsenter, socat, and the like, but those tools are not necessarily applicable to other runtimes, nor do they support platforms other than Linux. CRI therefore defines these APIs explicitly and requires the container runtime to return a Streaming Server URL so that the streaming requests sent by the API Server can be redirected.

If all of a container's streaming requests passed through Kubelet, they could become a bottleneck for the node's network traffic. CRI therefore requires the container runtime to start a separate streaming server for these requests and return its address to Kubelet. Kubelet passes this information on to the Kubernetes API Server, which opens a streaming connection directly to the runtime's server and communicates with the client through it.

Such a complete Exec process can be divided into several phases as shown in the figure above:

  • The client runs kubectl exec -i -t <pod> -- <command>;

  • kube-apiserver sends a streaming request to Kubelet at /exec/;

  • Kubelet requests the Exec URL from the CRI shim through the CRI interface;

  • The CRI shim returns the Exec URL to Kubelet;

  • Kubelet returns a redirect response to kube-apiserver;

  • kube-apiserver redirects the streaming request to the Exec URL, and the Streaming Server inside the CRI shim then interacts directly with kube-apiserver to complete the Exec request and response.

In v1.10 and earlier, the container runtime must return a URL that the API Server can access directly (usually the same listening address as Kubelet). Starting with v1.11, Kubelet adds a --redirect-container-streaming option (default false): when redirection is disabled, Kubelet proxies the streaming request itself instead of forwarding a redirect, so the runtime can return a localhost URL and no longer needs TLS.
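
A hedged sketch of the two modes as Kubelet flags, per the semantics described above:

# Default in v1.11: Kubelet proxies streaming requests itself, so the
# runtime's streaming server may listen on localhost without TLS.
kubelet --redirect-container-streaming=false

# Redirect mode: the API Server connects to the runtime's streaming
# server directly, so its URL must be reachable and secured.
kubelet --redirect-container-streaming=true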

Container runtime examples

Here are a few examples of common container runtimes, each with its own strengths and supporting different container engines.

multi-tenant

In multi-tenant scenarios, strong isolation, especially at the virtualization level, is a fundamental requirement.

In the past, Kubernetes could not meet multi-tenant needs because it only supported Docker containers, which provide just kernel namespace isolation, supplemented by basic security controls such as SELinux and AppArmor. Some in the community therefore once proposed achieving tenant isolation through dedicated nodes, with each container or tenant monopolizing a virtual machine, but the waste of resources is obvious.

With CRI, you can plug in virtualization-based container engines such as Kata Containers and Clear Containers. Virtualization then provides strong isolation for containers, and containers of different tenants can run on the same Node, greatly improving resource utilization.

Multi-tenancy requires not only strong isolation of the container itself, but also many other functions such as:

  • Network isolation: for example, CNI can be used to build a new network plugin that connects different tenants' Pods to isolated virtual networks;

  • Resource management: for example, tenant APIs and tenant controllers built on CRDs to manage tenants and their resources;

  • Authentication, authorization, quota management, and more can also be built on top of the Kubernetes API.

CRI Tools

CRI Tools is a set of auxiliary tools for the CRI interface developed by the community's SIG Node. It includes two tools: crictl and critest.

crictl is a command-line interface for CRI container runtimes and a useful tool for system and application debugging. When Docker is the runtime, we might use commands such as docker ps and docker inspect to check application processes and system information while debugging.

But other CRI-based container runtimes may not have command-line tools of their own; even if they do, their interfaces are not necessarily consistent with the concepts in Kubernetes. There are also many commands that are useless to Kubernetes and can even damage the system (e.g. docker rename). We therefore recommend crictl as the successor to the Docker CLI for debugging pods, containers, and images on Kubernetes nodes.

crictl provides a Docker-CLI-like experience and supports all CRI-compliant container runtimes. It also offers a more Kubernetes-friendly view of containers: it is designed for Kubernetes and has separate commands for interacting with pods and containers. For example, crictl pods lists Pod information, while crictl ps lists only application containers, as the session below illustrates.
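
A quick debugging session on a node might look like the following sketch (the container ID is a placeholder):

# List PodSandboxes, then the application containers inside them.
crictl pods
crictl ps -a

# Inspect a container and follow its logs.
crictl inspect <container-id>
crictl logs -f <container-id>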

critest is a validation test suite for container runtimes, verifying that a runtime complies with Kubelet's CRI requirements. In addition to validation tests, critest also provides performance tests of the CRI interface, i.e. the critest benchmark.
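
A hedged sketch of running both against a runtime's socket; the flag names assume the cri-tools conventions of this era, so check critest --help for your version:

# Validate CRI conformance of the runtime behind the given endpoint.
critest --runtime-endpoint unix:///run/containerd/containerd.sock

# Run the CRI performance benchmarks instead of the validation suite.
critest --benchmark --runtime-endpoint unix:///run/containerd/containerd.sock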

It is recommended to integrate critest into the container runtime's DevOps pipeline so that no change breaks the basic functionality of CRI. In addition, critest results can be submitted, together with Kubernetes Node E2E results, to SIG Node's TestGrid for the community and users to view.

future

The last part is about the future of the container runtime. Here I will elaborate on the following two points:

  • Multiple container runtimes (RuntimeClass);

  • Serverless container services.

Multiple container runtimes

Multiple container runtimes serve different purposes within one cluster: for example, a virtualized container engine runs untrusted applications and multi-tenant applications, while Docker runs system components or containers that cannot be virtualized (such as those requiring host network).

Typical use cases are:

  • Kata Containers/gVisor + runc

  • Windows Process isolation + Hyper-V isolation containers

Previously, multiple container runtimes were usually selected through annotations, as in CRI-O, Frakti, and others. But this is far from elegant, and it does not allow Pods to be scheduled based on the runtime they require. Kubernetes therefore begins to add a new API object called RuntimeClass in v1.12 to support multiple container runtimes.

RuntimeClass represents a container runtime. Before it can be used, you need to enable the RuntimeClass feature gate and create the RuntimeClass CRD:

kubectl apply -f https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/runtimeclass/runtimeclass_crd.yaml

You can then define a RuntimeClass object:

apiVersion: node.k8s.io/v1alpha1  # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: myclass  # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
spec:
  runtimeHandler: myconfiguration  # The name of the corresponding CRI configuration

You can then define which RuntimeClass to use in the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  runtimeClassName: myclass
  # ...

In future releases, RuntimeClass will also support scheduling pods based on the actual container runtime running on Node.

Serverless container services

Serverless is a very popular direction right now, and cloud platforms offer many kinds of serverless computing services, such as Azure Container Instances and AWS Fargate. Their advantage is that users manage only the containers, not the underlying infrastructure, and are usually billed for the containers' actual running time. Users thus avoid the tedious work of managing infrastructure and save money as well.

So what are the application scenarios for CRI here? If you are a manager of a cloud platform and want to build a serverless container service, using CRI with multiple container runtimes is a good idea.

The application scenarios are as follows:

  • Kubernetes can be used to provide scheduling and orchestration for the entire platform;

  • Tenant management function can be built based on Kubernetes API;

  • Strong isolation of multi-tenant container operation can be realized based on CRI.

  • Multi-tenant network isolation can be implemented based on CNI.

What about users of cloud platforms? These serverless container services typically provide simple functionality without orchestration. With the Virtual Kubelet project, however, you can use Kubernetes to provide orchestration for containers on these platforms.

Virtual Kubelet is a virtual Kubernetes Node designed for serverless container platforms. It simulates Kubelet's functions and abstracts the serverless container platform as a virtual Node with unlimited resources, so that you can manage the containers on it through the Kubernetes API.

Virtual Kubelet currently supports many cloud platforms, including:

  • Azure Container Instance

  • AWS Fargate

  • Service Fabric

  • Hyper.sh

  • IoT Edge

conclusion

That's all for "The Evolution of the Kubernetes Container Runtime." The container runtime has evolved from built-in support that was hard to customize, through solving problems such as network configuration, to the open CRI interface. Looking ahead, multiple container runtimes will serve different purposes side by side, and developers will build serverless container services on top of them.

Ni Pengfei/Senior Software Engineer, Microsoft

Microsoft software engineer, mainly responsible for open source Kubernetes development and its implementation on Azure. He is also a maintainer of the Kubernetes project. He has worked for Shanda, Tencent, and HyperHQ, and has years of hands-on experience in cloud computing, SDN networking, and container scheduling. He maintains the open source Kubernetes Guide, which keeps up to date with the community.