Jiang Biao, senior engineer at Tencent Cloud, has focused on operating system technology for more than 10 years and is a long-time Linux kernel enthusiast. He is currently responsible for the research and development of Tencent Cloud's native OS and for OS/virtualization performance optimization.

Introduction:

TencentOS Server (also known as Tencent Linux) is a Linux operating system developed by Tencent for cloud scenarios. It provides specialized features and performance optimizations to deliver a high-performance, secure, and reliable operating environment for applications running in cloud server instances. Tencent Linux is free to use; applications developed on CentOS (and compatible distributions) can run on it directly, and users receive continuous updates, maintenance, and technical support from Tencent Cloud.

TencentOS has gone through more than 10 years of iteration and evolution inside Tencent, carrying and supporting all of Tencent's businesses, with over 3 million deployed nodes, and has withstood the test of massive, complex workloads in extreme scenarios.

Common OS Architecture

Definition of a traditional OS (borrowing from a classic textbook):

An operating system is a program that controls the execution of application programs and acts as the interface between applications and the computer hardware.

The operating system has three goals:

  • Convenience: makes the computer easier to use
  • Efficiency: allows computer resources to be used more efficiently
  • Ability to evolve: allows effective development, testing, and introduction of new system functions without disrupting existing services

The typical architecture of a traditional general-purpose OS (Linux) is shown above. To achieve the three goals above, the operating system contains a variety of functional modules and interfaces. Broadly speaking, they fall into two parts:

  • Kernel: provides a basic abstraction of the underlying hardware (the computer). Different kernel modules provide different hardware-management and related auxiliary functions, and serve upper-layer applications through system calls.
  • Base libraries and related service components (user mode): provide the basic environment in which real business workloads run

OS in IaaS scenario

In the IaaS scenario, the OS is mainly used to provide the running environment for cloud hosts (virtual machines). The types and number of tasks running in the OS are controllable, making the scenario much simpler than a general-purpose one. There are basically only a few types of tasks:

  • VM-related threads (typically QEMU + VCPU threads)
  • Management agents for the various control planes
  • A few control threads required by the OS itself (e.g. per-CPU workers)

In the IaaS scenario, to bring virtual machine performance infinitely close to (or even beyond) that of the physical machine, the usual approach is subtraction: improving performance by cutting virtualization and OS-level overhead as far as possible. Some typical methods are listed below (a minimal sketch of the first one follows the list):

  • Core binding (CPU pinning) at the CPU level
  • Preallocation at the memory level
  • Various kernel-bypass techniques at the IO level
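As a concrete illustration of core binding: a VM management agent typically pins each VCPU thread to a dedicated physical core. A minimal sketch using the standard Linux affinity API (the CPU number here is arbitrary):

```c
/* Pin the calling thread (e.g. a VCPU thread) to one physical CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    if (pin_to_cpu(1) == 0)
        printf("pinned to CPU 1\n");
    return 0;
}
```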

For the OS, the end result is:

OS is getting thinner and may eventually disappear

Another View of OS (Cloud Native Perspective)

When the cloud native wave hits and you look at the OS from a different perspective (the cloud native perspective), you see a different view. Cloud native scenarios pose new challenges to the OS and inject new impetus into the further development and evolution of OS technology.

In the cloud native scenario, the OS provides the underlying support and services for different kinds of applications (apps, containers, functions, sandboxes). The most obvious difference from the IaaS scenario is this:

The boundary between application and system has moved up dramatically: everything below the application is OS

For the OS, the end result is:

The OS gets thicker and thicker (breeding infinite possibilities), in stark contrast to the IaaS scenario

TencentOS For Cloud Native

Against the background of the cloud native wave sweeping the industry, and with the rapid transformation of Tencent's own business architectures, the containerization, microservices, and serverless adoption of these businesses posed new challenges and requirements to the underlying infrastructure (including the OS at its core). TencentOS transformed rapidly as well: based on Tencent's own cloud native scenarios and needs, it was deeply restructured and redesigned, fully embracing cloud native and stepping toward the goal of a cloud native OS.

The overall architecture

TencentOS currently implements the following cloud native features:

  • Cloud Native Scheduler – Tencent Cloud Native Scheduler (TCNS)
  • Cloud native resource QoS – RUE
  • Quality Monitor
  • Cloud native SLI
  • Cgroupfs

Cloud Native Scheduler – Tencent Cloud Native Scheduler (TCNS)

TCNS is the overall kernel scheduler solution provided by TencentOS for cloud native scenarios, covering containers, secure containers, and general scenarios. It is effective for business scenarios that need CPU isolation when workloads of multiple priorities are mixed, as well as for those with extreme requirements on real-time performance and stability. This article explains in detail the need for CPU isolation in mixed-deployment (co-location) scenarios and the possible solutions; the discussion of the kernel scheduler's real-time guarantees will follow in subsequent OS articles, so stay tuned.

TCNS mainly includes three modules:

  • BT Scheduler
  • VMF Scheduler
  • ECFS

BT Scheduler

The BT Scheduler is a new scheduling class designed by TencentOS for CPU isolation in mixed-deployment scenarios, as shown in the figure below:

Its core design: a new scheduling class with lower priority than the default CFS class, which runs only when no other higher-priority tasks are runnable, dedicated to running offline tasks (online tasks use the CFS class).

The core benefit of this design: online businesses can absolutely preempt offline businesses, achieving a nearly perfect CPU isolation effect in mixed-deployment scenarios.

The BT scheduling class is itself a new, fully functional scheduling class that provides a complete set of capabilities similar to CFS.
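How tasks are placed into the BT class is not spelled out here. Assuming a policy-number interface in the style of sched_setscheduler(), a hypothetical sketch might look as follows (SCHED_BT and its numeric value are placeholders, not the real TencentOS ABI):

```c
/* Hypothetical sketch: move a task into an offline scheduling class.
 * SCHED_BT is TencentOS-specific; the value below is a placeholder. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

#ifndef SCHED_BT
#define SCHED_BT 7   /* placeholder: check the TencentOS headers */
#endif

int make_offline(pid_t pid)
{
    struct sched_param param = { .sched_priority = 0 };

    if (sched_setscheduler(pid, SCHED_BT, &param) != 0) {
        perror("sched_setscheduler(SCHED_BT)");
        return -1;
    }
    return 0;
}

int main(void)
{
    return make_offline(0);   /* 0 = the calling process */
}
```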

In addition, the BT scheduling class implements a number of further features beyond those of CFS.

Additional information about BT is available at the following link:

https://cloud.tencent.com/dev…

Note: that content is somewhat dated (the new version has been through several more rounds of iteration), but it is still useful for reference. An up-to-date introduction to BT will also be published in a follow-up article; please look forward to it.

VMF Scheduler

The VMF (VM First) scheduler is a kernel scheduler solution designed by TencentOS specifically for secure-container and virtual machine scenarios: a completely new kernel scheduler, re-implemented from scratch.

The main motivation for rewriting the scheduler is that the existing CFS scheduler, built on the principle of "complete fairness", cannot guarantee timely scheduling of virtual machine (secure container) threads.

The core design of VMF includes:

  • An unfair scheduler: CPU resources are tilted toward virtual machine processes, ensuring that virtual machine (secure container) threads are scheduled first
  • Scheduling based on task type rather than fine-grained priorities. By contrast, we believe CFS priorities cannot accurately describe the running characteristics of different processes. Kernel threads are a typical example: they are clearly important, and each of their executions is very short, yet it is hard to define a suitable priority for them; neither a high nor a low value accurately describes their behavior. In VMF, all processes are classified by profiling and modeling the characteristics of different process types, and detailed scheduling policies are designed for each type, which can satisfy the extreme real-time requirements of cloud native scenarios.
  • One-way aggressive preemption: a VM process can preempt other tasks as quickly as possible in any situation, but never the other way around. This guarantees the real-time behavior of VM processes to the greatest extent without hurting the scheduler's throughput (a minimal sketch of this rule follows the list)
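Here is that one-way rule in miniature, with the task classification reduced to a hypothetical enum (the real VMF logic lives inside the kernel scheduler and is certainly richer):

```c
/* Minimal sketch of one-way aggressive preemption. The enum and the
 * fallback policy are illustrative assumptions, not VMF internals. */
#include <stdbool.h>
#include <stdio.h>

enum task_class { CLASS_VM, CLASS_NORMAL };

struct task {
    enum task_class class;
};

static bool should_preempt(const struct task *curr,
                           const struct task *waking)
{
    if (waking->class == CLASS_VM && curr->class != CLASS_VM)
        return true;    /* a waking VM thread always wins */
    if (curr->class == CLASS_VM)
        return false;   /* never kick a running VM thread */
    return false;       /* otherwise defer to the default policy (elided) */
}

int main(void)
{
    struct task vm = { CLASS_VM }, other = { CLASS_NORMAL };

    printf("VM preempts other: %d\n", should_preempt(&other, &vm));  /* 1 */
    printf("other preempts VM: %d\n", should_preempt(&vm, &other));  /* 0 */
    return 0;
}
```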

In addition, we have designed many other features for other scenarios and requirements; there is not enough space to expand on them here, and we plan to introduce them separately in a dedicated article later.

Overall, with our self-developed VMF scheduler, we gain several key benefits:

  • Extremely low scheduling latency (excellent real-time behavior); the maximum latency in extreme cases is at the microsecond level
  • The new scheduler is much lighter than CFS, with less than a third of its code size
  • The real-time behavior of virtual machine (secure container) threads is guaranteed even in the presence of interference
  • VMF's classification-based design can provide different levels of CPU QoS guarantees for different types of processes
  • With a fully self-developed scheduler, we can implement many impressive customizations that would otherwise be unimaginable. Anyone who has experience tuning CFS knows how painful it is to customize on top of CFS

A full explanation of VMF is also planned as a separate article, so please look forward to it. In addition, the talk “TIANUS Hypervisor – ‘Zero Loss’ Tencent Cloud Lightweight Virtualization Technology” from the virtualization session of OS2ATC 2020 covers part of it:

https://www.bilibili.com/vide…

(Note: the relevant part starts at 1:24:00)

ECFS Scheduler

ECFS is an optimization for general scenarios (the upstream route), based on the community's mainline CFS scheduler. Core optimization points:

  • Introduces a new task scheduling type to distinguish online tasks from offline tasks
  • Optimizes the preemption logic: online tasks are guaranteed to preempt offline tasks, while unnecessary preemption of online tasks by offline tasks is avoided
  • Absolute preemption design
  • Hyperthreading interference isolation

The exact principles will not be explained here yet; look forward to a future OS article.

Cloud native resource QoS – RUE

RUE (Resource Utilization Enhancement), whose Chinese brand name is "Ruyi", is the product in the TencentOS product matrix dedicated to server resource QoS in cloud native scenarios; it improves resource utilization and reduces operating costs. Ruyi manages the CPU, memory, IO, and network resources of cloud machines in a unified way. Compared with conventional server resource-management schemes, it fits elastic cloud scenarios better, can significantly improve the resource-usage efficiency of cloud machines, and reduces customers' operating costs, providing resource value-added services for public cloud, hybrid cloud, and private cloud customers. Ruyi's core technology keeps businesses of different priorities from interfering with each other, achieving an efficient unity of resource utilization, resource isolation, and resource quality of service.

Architecture

RUE includes the following main modules:

Cgroup Priority

A globally unified pod-priority concept is introduced into the kernel; it runs through the processing stacks of all CPU, memory, IO, and network resources, providing unified priority control.
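As an illustration, a management agent could tag each pod's cgroup with its global priority through a per-cgroup file. The file name rue.priority and the value convention below are assumptions made for this sketch, not the documented TencentOS interface:

```c
/* Hypothetical sketch: write a pod's global priority into its cgroup.
 * "rue.priority" and the value range are assumptions, not the real ABI. */
#include <stdio.h>

int set_pod_priority(const char *cgroup_path, int prio)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path), "%s/rue.priority", cgroup_path);
    f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "%d\n", prio);
    return fclose(f);
}

int main(void)
{
    /* assumption: 0 = highest priority, for an online pod */
    return set_pod_priority("/sys/fs/cgroup/pod-online", 0);
}
```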

CPU QoS

Based on the TCNS implementation described in the previous section, absolute preemption and nearly perfect isolation can be achieved at the CPU level.

Memory QoS

Through priority awareness in the allocation and reclaim paths, containers of different priorities receive different levels of memory-allocation QoS (the memory availability of low-priority containers is sacrificed to guarantee the memory QoS of high-priority containers). Several original features have been implemented which, taken together, bound the memory-allocation latency of high-priority containers, one of the key capabilities the upstream kernel lacks.

IO QoS

Allows users to divide containers into different IO priorities and allocates IO resources according to priority, ensuring that low-priority containers do not interfere with high-priority containers while still letting low-priority containers use idle IO resources, which improves resource utilization. IO resource QoS covers three aspects: bandwidth QoS, latency QoS, and writeback QoS. In addition, a minimum-bandwidth guarantee prevents the potential priority inversion that starvation of low-priority containers could cause.
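For comparison, the upstream cgroup v2 io.max interface already offers a per-device hard bandwidth cap; RUE's IO QoS goes beyond it with priorities, latency targets, writeback QoS, and a minimum-bandwidth floor. A sketch of the upstream knob (the device number and cgroup path are examples):

```c
/* Cap a cgroup's write bandwidth on device 253:0 to 10 MiB/s using the
 * upstream cgroup v2 "io.max" file. Paths/device numbers are examples. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/fs/cgroup/pod-offline/io.max", "w");

    if (!f) {
        perror("fopen io.max");
        return 1;
    }
    fprintf(f, "253:0 wbps=10485760\n");   /* 10 MiB/s write cap */
    return fclose(f);
}
```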

Net QoS

Allows users to allocate a server's NIC bandwidth to containers by priority, letting low-priority containers use idle bandwidth resources without interfering with the network bandwidth of high-priority containers. It likewise provides a minimum-bandwidth guarantee to prevent the potential priority inversion that starvation of low-priority containers could cause.

The overall structure of RUE is relatively complex, involving a large number of changes and optimizations to the upstream kernel. The related features are too extensive to cover in this article; we will discuss them one by one in dedicated follow-up articles.

The overall effect

  • Introduces a globally unified pod-priority concept, providing unified priority control
  • Suited to mixed deployment of containers (pods/tasks) of multiple priorities, which can greatly improve resource utilization

Quality Monitor

In mixed-deployment scenarios, to improve whole-machine resource utilization, operators tend to overcommit as much as possible. With the underlying isolation technology (resource QoS) as a guarantee, interference can be isolated to a certain extent, but two main challenges remain:

  • How do we evaluate the QoS effect and perceive “interference”?
  • How do we troubleshoot “interference” effectively?

On the other hand, the upper-layer scheduler (K8s) also needs the bottom layer (kernel) to provide more meaningful indicators (service-quality evaluation and more detailed metrics) for refined operations, to improve the overall performance of mixed-deployment clusters and the competitiveness of the mixed-deployment solution as a whole.

The existing system does have scattered statistics across different dimensions, but:

  • They are not “friendly” enough for the upper-layer scheduler (K8s) to understand; more meaningful abstracted data is needed as a basis for scheduling decisions
  • They are not “professional” enough; targeted monitoring data for mixed-deployment scenarios is needed, so that K8s can perform more refined operations based on it

Moreover, the existing system lacks normally-on measurement facilities that can capture the scene the moment "interference" occurs (or a high-priority container jitters), along with effective means of analysis and localization. The shortcomings of existing tools:

  • Most must be deployed after the fact (ftrace/kprobe, etc.), but business jitter may be hard to reproduce, or transient and sporadic, and thus hard to capture
  • High overhead makes normally-on deployment difficult

PSI, which arrived with cgroup v2, is a very good attempt and reflects the health of the system to some extent, but it is still a bit thin for evaluating QoS effects in mixed-deployment scenarios.

TencentOS therefore designed Quality Monitor, dedicated to evaluating containers' quality of service (QoS) in all dimensions. It provides a normally-on, low-overhead, event-triggered monitoring mechanism that can promptly and effectively capture the abnormal context whenever service quality degrades (falls below the standard).

Quality Monitor mainly consists of two modules:

Score

Service quality scores, defined as:

Per-priority score = 100 - the percentage of time the container is stalled by processes of other priorities (cross-priority resource contention)

Per-cgroup score = 100 - the percentage of time the container is stalled by processes of other cgroups (cross-cgroup resource contention)

Note: The interference here includes both software and hardware interference
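Expressed directly in code, the score is simply a perfect 100 minus the stalled share of the measurement window:

```c
/* The score definition above: 100 minus the percentage of the window
 * spent stalled by processes of other priorities/cgroups. */
#include <stdio.h>

static double quality_score(double stalled_ns, double window_ns)
{
    return 100.0 - 100.0 * stalled_ns / window_ns;
}

int main(void)
{
    /* e.g. stalled 50 ms out of a 1 s window -> score 95.0 */
    printf("score = %.1f\n", quality_score(50e6, 1e9));
    return 0;
}
```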

Monitor Buffer

A normally-on memory area that monitors interference and jitter. When key metrics (which rely on cloud native SLI) fail to meet expectations (exceed their thresholds), the relevant context information is recorded automatically.

Overall effect:

  • Provides quality-of-service scores in both the priority dimension and the pod dimension, to evaluate containers’ quality of service (QoS)
  • When service quality degrades (interference occurs), the abnormal context can be captured through the Monitor Buffer

Cloud native SLI

Definition

SLI (Service Level Indicator) is an indicator used to measure service level, such as latency, throughput, or error rate;

An SLO is a target specified on the basis of SLIs;

From a cloud native perspective, cloud native SLI can be understood (narrowly) as the indicators of a container that can be used to observe its service level, i.e. key metrics from the container's perspective, which are the basis for defining a container's SLO.

On the other hand, the basic statistics and monitoring that the existing upstream kernel provides in cgroups are primitive and crude, offering only basic counters such as memory/CPU usage, and lacking usable, container-perspective SLI data collection and abstraction.

TencentOS designed cloud native SLI, which collects and computes the data in real time inside the kernel (with low overhead) and provides sufficient, professional SLI indicators across different dimensions for the upper layer (K8s) to use; users can define corresponding SLOs on top of them.

Cloud native SLI includes the following modules:

CPU SLI

Collects and computes CPU-dimension SLIs, including scheduling latency, kernel-mode blocking latency, load, context-switch frequency, etc.

Memory SLI

Collects and computes memory-dimension SLIs, including memory-allocation latency, memory-allocation speed, direct-reclaim latency, memory-compaction latency, memory-reclaim latency, memory-allocation failure rate, etc.

IO SLI

Collects and computes IO-dimension SLIs, including IO latency, IO throughput, IO error rate, etc.

NET SLI

Collects and computes network-dimension SLIs, including network latency, network throughput, error rate, etc.
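Assuming the SLIs are exported as per-cgroup files (the name cpu.sli below is a hypothetical placeholder, not the documented interface), an agent feeding K8s could read them like this:

```c
/* Hypothetical sketch: dump a per-cgroup SLI file. The file name and
 * its format are assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/fs/cgroup/pod-online/cpu.sli", "r");

    if (!f) {
        perror("fopen cpu.sli");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    return fclose(f);
}
```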

The overall effect

  • Provides fine-grained SLI metrics at the container level
  • K8s and other modules (such as Quality Monitor) can perform refined operations based on these indicators

Cgroupfs

In the cloud native scenario, underlying resource-isolation technologies such as namespaces and cgroups achieve the basic isolation of resources (from the container's perspective), but overall container isolation is still very incomplete. Some resource statistics in the /proc and /sys file systems are not fully containerized (namespace-aware), so common commands from the physical/virtual machine world (such as free and top), when run inside a container, cannot accurately display container-perspective information; by default they display system-wide global information, such as the system's total and free memory. This has been a persistent problem in cloud native (container) scenarios.

The direct cause is that the relevant information has not been containerized; the root cause is that container isolation is still insufficient.

To address the problem of key information in the /proc file system not being containerized, the solution recommended by the community is:

lxcfs

lxcfs is a virtual file system customized for the scenario above. It is a user-mode file system implemented on top of FUSE that provides containerized statistics for the /proc file system, plus a small amount of individual information from the /sys file system. The implementation is relatively simple and direct.

lxcfs basically solves the problem of using common base commands (free, top, vmstat, etc.) inside containers, but several shortcomings remain:

  • It depends on an additional component, lxcfs, which is hard to integrate deeply with containers and is not controllable
  • It is implemented in user mode on top of FUSE; its overhead is higher than a kernel implementation, and its information accuracy is lower
  • lxcfs stability is poor (according to user feedback); it frequently runs into problems such as hangs (with high overhead) and failures to return information
  • The current implementation is based entirely on the basic information visible from user mode (which is rather limited); deeper customization (driven by user needs) will hit a capability bottleneck imposed by the user-mode implementation

TencentOS provides a solution in kernel mode, named _cgroupfs_.

Its core design is a new virtual file system (placed in the root file system) containing the container-view /proc, /sys, and whatever other filesystems we need to implement, with a directory structure kept consistent with the global procfs and sysfs to ensure compatibility with user-space tools. When the relevant files are actually read, the corresponding container information view is generated dynamically from the context of the process reading cgroupfs.
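The user-visible difference can be pictured as reading the same relative path from two roots. The mount point /cgroupfs below is an assumption made for this sketch:

```c
/* Sketch: the same relative path, read from the global procfs and from
 * the per-container view. "/cgroupfs" is a hypothetical mount point. */
#include <stdio.h>

static void dump_first_line(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");

    if (f && fgets(line, sizeof(line), f))
        printf("%-28s %s", path, line);
    if (f)
        fclose(f);
}

int main(void)
{
    dump_first_line("/proc/meminfo");          /* host-wide MemTotal */
    dump_first_line("/cgroupfs/proc/meminfo"); /* container-view MemTotal */
    return 0;
}
```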

The directory structure is as follows:

The overall effect

  • Virtualized /proc and /sys file systems that isolate global information and support common commands (top, free, iotop, vmstat, etc.)
  • Designed for cgroup v2, with a unified hierarchy
  • Can be deeply customized and extended according to need

TencentOS For Kubernetes

In the cloud native wave, Kubernetes, as the de facto industry standard, bears the brunt. As cloud adoption enters deep water, businesses pay more attention to the actual benefits gained after moving to the cloud, and resource utilization and cost are valued more and more. In the existing Kubernetes system, workloads of different priorities can be mixed into the same cluster through Service QoS classes and priorities, increasing resource utilization and reducing resource operating costs. However, this "user-mode" behavior is limited by the design of Linux kernel cgroups: the isolation granularity is inherently insufficient, businesses suffer from resource contention after being mixed, and sometimes the loss outweighs the gain.

In this context, TencentOS's design for cloud native and priority solves this problem neatly. By matching Kubernetes Service QoS classes one-to-one with TencentOS priorities, we make the kernel layer natively priority-aware and give the bottom layer a strong isolation mechanism, guaranteeing service quality after mixing to the greatest extent. This priority mechanism runs through the entire cgroups subsystem (a hypothetical sketch of the mapping follows).
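A hypothetical sketch of that one-to-one mapping; the numeric priority values are placeholders, not the real TencentOS ABI:

```c
/* Hypothetical mapping from Kubernetes Service QoS Classes onto
 * TencentOS priorities. The numeric values are placeholders. */
#include <stdio.h>

enum k8s_qos_class { GUARANTEED, BURSTABLE, BEST_EFFORT };

static int tencentos_priority(enum k8s_qos_class qos)
{
    switch (qos) {
    case GUARANTEED:  return 0;  /* highest: latency-sensitive online pods */
    case BURSTABLE:   return 1;
    case BEST_EFFORT: return 2;  /* lowest: offline, preemptible pods */
    }
    return 2;
}

int main(void)
{
    printf("BestEffort -> prio %d\n", tencentos_priority(BEST_EFFORT));
    return 0;
}
```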

The Tencent Cloud container team has made its TKE release available externally, and this feature will be supported in the next version; interested users can keep an eye on the community for updates.

More

Beyond its cloud native focus, TencentOS Server is itself a general-purpose server OS. Over more than 10 years of focused kernel work, many features, large and small, have been developed or customized:

https://github.com/Tencent/Te…

Conclusion

TencentOS keeps thinking about and exploring its own cloud native path. The journey has begun, but it is far from over!