Introduction

TencentOS Server (also known as Tencent Linux, or Tlinux for short) is a Linux operating system developed by Tencent for cloud scenarios. It provides targeted features and performance optimizations to deliver a high-performance, secure, and reliable runtime environment for applications on cloud server instances. Tencent Linux is free to use; applications developed on CentOS (and compatible distributions) can run on it directly, and users continue to receive updates, maintenance, and technical support from Tencent Cloud.

TencentOS has gone through more than 10 years of iteration and evolution inside Tencent. It supports all of Tencent's businesses, runs on more than 3 million commercially deployed nodes, and has withstood the test of massive, complex workloads in extreme scenarios.

Common OS Architecture

Definition of a traditional OS (borrowed from a classic textbook):

An operating system is a program that controls the execution of application programs and acts as the interface between applications and the computer hardware.

The operating system has three goals:

  • Convenience: makes the computer easier to use
  • Efficiency: allows computer resources to be used more efficiently
  • Ability to evolve: permits the effective development, testing, and introduction of new system functions without disrupting existing services

The typical architecture of a traditional general-purpose OS (Linux) is shown below. The operating system contains various functional modules and interfaces to achieve the above three goals. Broadly speaking, it is divided into two parts:

  • Kernel: provides the basic abstraction of the underlying hardware. Different kernel modules provide different hardware management and auxiliary functions, exposing services to upper-layer applications through system calls.
  • Base libraries and related service components (in user mode): provide the basic runtime environment for actual services

OS in IaaS scenario

In the IaaS scenario, the operating system provides the running environment for cloud hosts (VMs). The types and number of tasks running on the OS are controllable, making this scenario much simpler than the general-purpose one. There are basically only the following types of tasks:

  • VM-related threads (typically QEMU + vCPU threads)
  • Management agents for the various control planes
  • A few control threads required by the OS itself (such as per-CPU workers)

In IaaS scenarios, to make VM performance approach or even exceed that of physical machines, the usual approach is subtraction: reducing virtualization and OS overhead. Typical methods include:

  • Core binding (CPU pinning) at the CPU level (see the sketch after this list)
  • Preallocation at the memory level
  • Various kernel-bypass technologies at the I/O level
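To make the first item concrete, here is a minimal sketch of core binding using the standard Linux sched_setaffinity(2) interface; the choice of CPU 2 is only an example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread (e.g. a vCPU thread) to a single CPU so it
 * no longer migrates and suffers less scheduling interference. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);              /* allow exactly one CPU */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    if (pin_to_cpu(2) == 0)          /* CPU 2 is just an example */
        printf("pinned to CPU 2\n");
    return 0;
}
```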

The end result for OS is:

The OS is getting thinner and thinner and may eventually disappear

Looking at the OS from a different perspective (the cloud native perspective)

As the cloud native wave hits, looking at the OS from the cloud native perspective reveals a different view. Cloud native scenarios pose new challenges to the OS and inject fresh impetus into the further development and evolution of OS technologies.

In the cloud native scenario, the OS provides underlying support and services for different kinds of applications (apps, containers, functions, sandboxes). The most obvious difference compared with the IaaS scenario is:

The boundary between the application and the system has moved up dramatically, with everything below the application belonging to the OS

The end result for OS is:

The OS is getting thicker (spawning endless possibilities), in stark contrast to the IaaS scenario

TencentOS for cloud native

Against the background of the cloud native wave sweeping the industry, Tencent's own business architectures have been turning around quickly: containerization, microservices, Serverless. This placed new challenges and requirements on the underlying infrastructure, including the OS at its core, and TencentOS transformed rapidly in response. Targeting Tencent's own cloud native scenarios and needs, it has undergone deep restructuring and redesign, fully embracing cloud native and striding step by step toward the goal of a cloud native OS.

The overall architecture

TencentOS currently implements the following cloud native features, mainly at the kernel layer:

  • Tencent Cloud Native Scheduler (TCNS)
  • Cloud native resource QoS (RUE)
  • Quality Monitor
  • Cloud native SLI
  • Cgroupfs

Tencent Cloud Native Scheduler (TCNS)

TCNS is the overall kernel scheduler solution provided by TencentOS for cloud native scenarios. It covers containers, secure containers, and general scenarios, and is effective both for mixed deployments of multi-priority services that require CPU isolation and for business scenarios with extreme real-time/stability requirements. The requirements and possible solutions for CPU isolation in colocation scenarios are explained in detail in the article "The Breakdown of Hybrid Deployment – CPU Isolation of Cloud Native Resource Isolation Technology (1)". The technical discussion of the kernel scheduler's real-time guarantees will be covered in subsequent articles in this OS series. Please stay tuned.

TCNS mainly consists of three modules:

  • BT Scheduler
  • VMF Scheduler
  • ECFS

BT Scheduler

BT Scheduler is a new scheduling class designed by TencentOS for CPU isolation in colocation (container) scenarios, as shown in the following figure:

Its core design is a new scheduling class with a lower priority than the default CFS class: its tasks run only when no task of a higher-priority class is runnable. It is dedicated to running offline tasks, while online tasks use the CFS class.

The core benefit of this design: online services can absolutely preempt offline services, so CPU isolation in colocation scenarios is nearly perfect.

The BT scheduling class is itself a fully functional scheduling class, providing within BT a complete set of capabilities similar to those of CFS.
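As a hedged illustration, moving a task into the BT class from user space might look like the sketch below. The SCHED_BT policy value is an assumption made for illustration; consult the TencentOS kernel headers for the real definition.

```c
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumed policy id of the offline (BT) scheduling class; the value
 * here is illustrative, not taken from the TencentOS headers. */
#ifndef SCHED_BT
#define SCHED_BT 7
#endif

/* Put a task into the BT class so it runs only when no CFS (online)
 * task is runnable. */
static int make_offline(pid_t pid)
{
    struct sched_param param = { .sched_priority = 0 };

    if (sched_setscheduler(pid, SCHED_BT, &param) != 0) {
        perror("sched_setscheduler(SCHED_BT)");
        return -1;
    }
    return 0;
}
```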

In addition, the BT scheduling class implements a number of further extended features.

Additional information about BT can be found in the following links:

cloud.tencent.com/developer/a…

Note: although that content is somewhat dated and the new version has been iterated several times since, it is still useful for reference. An up-to-date introduction to BT will follow in later articles; please look forward to it.

VMF Scheduler

The VMF (VM First) scheduler is a kernel scheduler solution specially designed by TencentOS for secure container scenarios (and virtual machine scenarios), re-implemented as a brand-new kernel scheduler.

The main motivation for rewriting the scheduler is that the existing CFS scheduler, built on the principle of "complete fairness", cannot guarantee the real-time scheduling of virtual machine (secure container) threads.

The core design of VMF includes:

  • An unfair scheduler: CPU resources are tilted toward virtual machine processes, ensuring that virtual machine (secure container) threads are scheduled first

  • Scheduling by task type rather than fine-grained priority. We believe CFS priorities cannot accurately describe the running characteristics of different processes. Kernel threads are a typical example: their characteristics are obvious (they are very important, and each execution is very short), yet it is hard to assign them a suitable priority; neither a high nor a low value accurately describes their behavior. In VMF we profile and model the characteristics of different processes, classify all processes accordingly, and design fine-grained scheduling strategies for each type, which satisfies the extreme real-time demands of cloud native scenarios.

  • One-way aggressive preemption: a VM process can preempt other tasks as quickly as possible in all cases, but never the other way around. This guarantees the real-time performance of VM processes to the greatest extent without hurting the scheduler's throughput (see the sketch after this list).
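The sketch below illustrates the one-way preemption rule in simplified pseudo-kernel C of my own; it is not the actual TencentOS implementation.

```c
/* Illustrative wakeup-time preemption check: VM threads may preempt
 * anything, while a running VM thread is never preempted by others. */
enum vmf_type { VMF_VM, VMF_KTHREAD, VMF_NORMAL };

struct vmf_task {
    enum vmf_type type;
};

/* Returns nonzero if the newly woken task should preempt current. */
static int vmf_should_preempt(const struct vmf_task *curr,
                              const struct vmf_task *woken)
{
    if (curr->type == VMF_VM)
        return 0;                /* never preempt a running VM thread */
    if (woken->type == VMF_VM)
        return 1;                /* VM threads preempt everyone else  */
    return 0;                    /* other cases: default policy       */
}
```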

In addition, we have designed a number of other features for other scenarios and requirements. There is not enough space to elaborate on them. We plan to introduce them separately in the future.

Overall, the self-developed VMF scheduler brings several key benefits:

  • Extremely low scheduling latency (excellent real-time behavior): even in extreme cases the maximum latency stays at the microsecond level

  • The new scheduler is much lighter than CFS, with less than a third of its code

  • The real-time performance of virtual machine (secure container) threads is ensured even in the presence of partial interference

  • VMF's classification-based design can provide different levels of CPU QoS for different types of processes

  • A completely self-developed scheduler allows many very cool customizations that one would normally not dare imagine. If you have ever tried to optimize or customize CFS, you know how uncomfortable that can be.

A detailed explanation of VMF is also planned for a separate article; please look forward to it. In addition, the OS2ATC 2020 virtualization session talk "Tianus Hypervisor – 'Zero Loss' Tencent Cloud Lightweight Virtualization Technology" contains some explanation:

www.bilibili.com/video/BV1Ky…

Note: start at 1:24:00

ECFS Scheduler

ECFS is based on the mainstream community CFS scheduler and optimized for upstream. The core optimization points are:

  • A new task scheduling type is introduced to distinguish online tasks from offline tasks
  • Optimized preemption logic: online tasks are guaranteed to preempt offline tasks, while unnecessary preemption of online tasks by offline tasks is avoided
  • Absolute preemption design
  • Hyper-threading interference isolation

The specific principles are not expanded here; please look forward to a detailed explanation in the next articles of this OS series.

Cloud native resource QoS (RUE)

RUE (Resource Utilization Enhancement), whose Chinese brand name is "Ruyi", is the product in the TencentOS matrix designed specifically for server resource QoS in cloud native scenarios, improving resource utilization and reducing operating costs. Ruyi schedules a cloud machine's CPU, memory, IO, and network resources in a unified way. Compared with conventional server resource management solutions, it is better suited to elastic cloud scenarios; it can significantly improve the resource efficiency of cloud machines and lower customers' operating costs, and it provides value-added resource services for public cloud, hybrid cloud, and private cloud customers. Ruyi's core technology achieves an efficient combination of high resource utilization, strong resource isolation, and good resource quality of service, without interference between businesses of different priorities.

Architecture

RUE includes the following main modules:

Cgroup Priority

The concept of a globally unified Pod priority is introduced in the kernel. It runs through the processing stacks of all resources, including CPU, memory, IO, and network, to provide unified priority control, as in the hypothetical sketch below.
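What setting such a unified priority could look like from user space; the file name pod.priority and the 0-7 value range are assumptions for illustration, not a documented interface.

```c
#include <stdio.h>

/* Hypothetical: write one globally unified priority (assumed range
 * 0-7, lower = higher priority) to a pod-level cgroup file. In RUE
 * this single value would be honored by the CPU, memory, IO and
 * network stacks alike. */
static int set_pod_priority(const char *pod_cgroup, int prio)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path), "%s/pod.priority", pod_cgroup);
    f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "%d\n", prio);
    fclose(f);
    return 0;
}
```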

CPU QoS

Based on the TCNS implementation described in the previous section, absolute preemption and near-perfect isolation can be achieved at the CPU level.

Memory QoS

Through priority awareness on the allocation and reclamation paths, different levels of memory allocation QoS are provided for containers of different priorities (the memory availability of low-priority containers is sacrificed to guarantee the memory QoS of high-priority containers). Several original features have been implemented that, overall, minimize the memory allocation latency of high-priority containers, a key capability that the upstream kernel lacks.

IO QoS

Containers can be assigned different I/O priorities, and I/O resources are allocated by priority. This ensures that low-priority containers do not interfere with high-priority containers while still letting low-priority containers use idle I/O resources, improving utilization. I/O QoS includes bandwidth QoS, latency QoS, and writeback QoS. In addition, a minimum bandwidth guarantee prevents the priority inversion that could result from starving low-priority containers. For a flavor of per-cgroup I/O control, see the sketch below.
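The sketch uses the upstream cgroup v1 blkio throttle interface, a real but much simpler analog of this kind of per-cgroup bandwidth control; it is not RUE's own interface, which additionally layers priorities and minimum-bandwidth guarantees on top.

```c
#include <stdio.h>

/* Upstream analog of bandwidth QoS: cap one cgroup's read bandwidth
 * on the block device with major:minor 8:0 to 10 MB/s via the
 * cgroup v1 blkio throttle interface. */
static int limit_read_bps(const char *blkio_cgroup)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path),
             "%s/blkio.throttle.read_bps_device", blkio_cgroup);
    f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "8:0 %llu\n", 10ULL * 1024 * 1024);
    fclose(f);
    return 0;
}
```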

Net QoS

Allows users to allocate the bandwidth of the server's network adapters to containers by priority, letting low-priority containers use idle bandwidth without interfering with the network bandwidth of high-priority containers. In addition, a minimum bandwidth guarantee prevents the priority inversion that could result from starving low-priority containers.

The overall structure of RUE is complex, involving a large number of modifications and optimizations to the upstream kernel. The related features are too numerous and extensive to cover one by one in this article.

The overall effect

  • Introduces the concept of a globally unified Pod priority, providing unified priority control
  • Suits mixed deployment of containers (Pods/tasks) of multiple priorities, greatly improving resource utilization

Another advertisement: at QCon Beijing 2021 (the Global Software Development Conference) there is a corresponding talk, "Tencent's Large-Scale Practice of Online/Offline Colocation and Kernel Isolation on Kubernetes"; you are welcome to watch it (the video replay is not yet available).

qcon.infoq.cn/2021/beijin…

Quality Monitor

In colocation scenarios, overcommitment is used to maximize resource utilization. On the basis of resource QoS protection, interference isolation can be guaranteed to a certain extent, but two major challenges remain:

  • How do we evaluate QoS effectiveness and perceive "interference"?
  • How do we troubleshoot "interference" effectively?

On the other hand, the upper-layer scheduler (K8s) also needs more meaningful indicators from the bottom layer (the kernel), both service quality evaluations and more detailed metrics, to carry out refined operations, improve the overall performance of colocation clusters, and strengthen the competitiveness of the colocation solution as a whole.

The existing system does have some scattered statistics of various dimensions, but:

  • They are not "friendly": the upper-layer scheduler (K8s) cannot interpret them and needs more meaningful abstract data as a basic scheduling input.
  • They are not "specialized": colocation scenarios need targeted monitoring data, on which K8s can base more fine-grained operations.

Furthermore, the existing system lacks normally-on debugging facilities that can capture the scene the moment "interference" occurs (or a high-priority container jitters) and support effective analysis and diagnosis. Shortcomings of the existing means:

  • In most cases service jitter is hard to reproduce, or happens only occasionally and cannot be captured.
  • The overhead is high, making always-on deployment difficult.

PSI, which arrived along with cgroup v2, is a very good attempt and reflects the health state of the system to a certain extent, but it is somewhat weak for evaluating QoS effectiveness in colocation scenarios.

TencentOS therefore designed Quality Monitor, dedicated to evaluating the quality of service (QoS) of containers in all dimensions. It provides a normally-on, low-overhead, event-triggered monitoring mechanism that promptly and effectively captures the abnormal context whenever service quality deteriorates (fails to meet expectations).

Quality Monitor mainly consists of two modules:

Score

Service quality score, specifically defined as:

Per-prio score = 100 - (percentage of time stalled due to interference (resource preemption) from processes of other priorities)

Per-cg score = 100 - (percentage of time stalled due to interference (resource preemption) from processes of other cgroups)

Note: interference here includes both software-level and hardware-level interference
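Written as formulas over a sampling window (the notation is mine, not the kernel's), the two scores read:

```latex
\mathrm{score}_{\text{per-prio}}
  = 100\left(1 - \frac{T_{\text{stall, other priorities}}}{T_{\text{window}}}\right),
\qquad
\mathrm{score}_{\text{per-cg}}
  = 100\left(1 - \frac{T_{\text{stall, other cgroups}}}{T_{\text{window}}}\right)
```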

Monitor Buffer

A normally-on, in-memory buffer for monitoring interference and jitter: when key indicators (built on cloud native SLI) fail to meet expectations (exceed their limits), the relevant context information is recorded automatically, as in the simplified sketch below.
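The sketch shows the general shape of such an event-triggered buffer, in a deliberately simplified form of my own rather than the TencentOS implementation.

```c
#include <string.h>
#include <time.h>

#define MB_SLOTS 64

/* One captured context: when it happened, which metric tripped, and
 * the offending value. The real buffer would store far richer state. */
struct mb_record {
    time_t when;
    long   value;
    char   what[32];
};

static struct mb_record ring[MB_SLOTS];
static unsigned int head;

/* Record nothing while metrics meet expectations; on a limit breach,
 * snapshot the context into the ring so the scene is preserved. */
void monitor_check(const char *metric, long value, long limit)
{
    struct mb_record *r;

    if (value <= limit)
        return;                       /* within SLO: stay silent */

    r = &ring[head++ % MB_SLOTS];
    r->when  = time(NULL);
    r->value = value;
    strncpy(r->what, metric, sizeof(r->what) - 1);
    r->what[sizeof(r->what) - 1] = '\0';
}
```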

Overall effect:

  • Provides per-priority and per-Pod service quality scores to evaluate each container's quality of service (QoS)
  • When service quality deteriorates (interference occurs), the abnormal context can be captured in the Monitor Buffer

Cloud native SLI

Definition

A Service Level Indicator (SLI) is an indicator of service level, such as latency, throughput, or error rate.

An SLO is a target set on the basis of SLIs.

From the cloud native perspective, a cloud native SLI can be (narrowly) understood as the metrics of a container that can be used to measure its service level, i.e., key metrics from the container's point of view. These form the basis for defining a container's SLO.

On the other hand, the basic statistics and monitoring that the existing upstream kernel offers for cgroups are primitive and crude: there are only elementary statistics such as memory/CPU usage, and no usable SLI data collection or abstraction from the container's perspective.

TencentOS designed cloud native SLI. Through real-time, low-overhead collection and computation inside the kernel, it provides sufficient, specialized SLI metrics across different dimensions for the upper layer (K8s) to use, and users can define their own SLOs on top of them.

The cloud native SLI consists of the following modules:

CPU SLI

Collects and computes CPU-dimension SLIs, including scheduling latency, kernel-mode blocking latency, load, context-switch frequency, etc.

Memory SLI

Collects and computes memory-dimension SLIs, including memory allocation latency, memory allocation speed, direct reclaim latency, memory compaction latency, memory reclaim latency, and memory allocation failure rate.

IO SLI

Collects and computes I/O-dimension SLIs, including I/O latency, I/O throughput, and I/O error rate.

NET SLI

Collects and computes network-dimension SLIs, including network latency, network throughput, and network error rate. A hypothetical sketch of reading such a metric follows.
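This is how the upper layer might read one such per-cgroup SLI value; the file name memory.latency is an assumption for illustration, and the real interface may differ.

```c
#include <stdio.h>

/* Hypothetical: read a single numeric SLI value exported by the
 * kernel under a container's cgroup directory, e.g.
 * read_sli("/sys/fs/cgroup/mypod", "memory.latency"). */
static long read_sli(const char *cgroup_path, const char *sli_file)
{
    char path[512];
    long value = -1;
    FILE *f;

    snprintf(path, sizeof(path), "%s/%s", cgroup_path, sli_file);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```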

The overall effect

  • Provides fine-grained SLI metrics at the container level
  • K8s and other modules (such as Quality Monitor) can perform refined operations based on these metrics

Cgroupfs

In the cloud native scenario, basic resource isolation from the container's perspective is implemented with underlying technologies such as namespaces and cgroups, but the overall isolation of containers remains very incomplete. Pseudo filesystems such as /proc and /sys are not fully containerized (namespaced), so common commands on physical machines/VMs (such as free and top) cannot accurately display container-view information inside a container (by default they show system-level global information, such as the system's total and free memory). This has been a persistent problem in cloud native (container) scenarios.

The direct cause is that the relevant information has not been containerized; the root cause is that container isolation is still insufficient.

The community-recommended solution to the /proc filesystem exposing non-containerized critical information is:

lxcfs

LXCFS is a virtual filesystem tailor-made for this scenario. Underneath, it implements a user-mode filesystem with FUSE that provides containerized statistics for the /proc filesystem, plus a few individual pieces of information from the /sys filesystem. The implementation is straightforward.

LXCFS basically solves the problem of using common basic commands in containers (free/top/vmstat, etc.), but it still has the following disadvantages:

  • It relies on the extra LXCFS component, which is hard to integrate deeply with containers and hard to control.
  • It is implemented in user mode on top of FUSE, so its overhead is higher than a kernel implementation and its information is less accurate.
  • LXCFS has poor stability (according to user feedback) and often suffers from problems such as hangs (which are costly) and failures to retrieve information.
  • Its customizability is poor: the current implementation is based entirely on the basic information visible from user mode (still rather limited), so deeper customization (driven by user needs) runs into capability bottlenecks (limited by the user-mode implementation).

TencentOS provides a kernel-mode solution named Cgroupfs.

The core design is a new virtual filesystem (placed in the root filesystem) that contains the container-view /proc, /sys, and other filesystems we need, keeping the directory structure consistent with the global procfs and sysfs to ensure compatibility with user-space tools. When the relevant files are actually read, the corresponding container-view information is generated dynamically from the context of the process reading cgroupfs. The sketch below illustrates the underlying idea.
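The sketch shows the idea shared by LXCFS and Cgroupfs: synthesizing a container view of /proc/meminfo from the container's cgroup v1 memory files. It is a simplification of my own, not the actual implementation of either project.

```c
#include <stdio.h>

/* Read one numeric value from the container's memory cgroup. */
static long read_cg(const char *file)
{
    char path[512];
    long v = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s", file);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &v) != 1)
        v = -1;
    fclose(f);
    return v;
}

int main(void)
{
    long limit = read_cg("memory.limit_in_bytes");  /* container "RAM" */
    long usage = read_cg("memory.usage_in_bytes");  /* current usage   */

    /* Present cgroup numbers in /proc/meminfo's format, so tools like
     * free and top would see the container view, not the host's. */
    if (limit > 0 && usage >= 0) {
        printf("MemTotal: %ld kB\n", limit / 1024);
        printf("MemFree:  %ld kB\n", (limit - usage) / 1024);
    }
    return 0;
}
```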

The directory structure is as follows:

The overall effect

  • Kernel-implemented, container-view virtual filesystems (/proc, /sys) that isolate global information and support common commands (top, free, iotop, vmstat, etc.)
  • Designed for cgroup v2 and the unified hierarchy
  • Can be deeply customized and extended on demand

TencentOS for Kubernetes

Under the cloud native wave, Kubernetes, as the de facto industry standard, bears the brunt. As cloud native enters the deep-water zone, businesses pay more attention to the actual gains after moving to the cloud, and resource utilization and cost draw ever more concern. In the original Kubernetes system, workloads of different priorities can be colocated in the same cluster through service QoS classes and priorities to increase resource utilization and reduce operating costs. However, this "user-mode" behavior is limited by the design of Linux kernel cgroups, which inherently lack fine-grained isolation, so businesses can suffer from resource contention, sometimes losing more than they gain. Against this background, TencentOS is designed for cloud native and priority to solve this problem. By mapping Kubernetes service QoS classes one-to-one onto TencentOS priorities, the kernel layer becomes aware of each workload's priority and provides a strong isolation mechanism at the bottom layer, guaranteeing the service quality of colocated workloads to the greatest extent. This priority mechanism runs through the entire cgroups subsystem. A hypothetical sketch of such a mapping follows.
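A minimal, hypothetical sketch of such a one-to-one mapping; the numeric priority values are assumptions for illustration only.

```c
/* Kubernetes QoS classes mapped one-to-one onto (assumed) TencentOS
 * priorities; lower value = higher priority in this sketch. */
enum k8s_qos { QOS_GUARANTEED, QOS_BURSTABLE, QOS_BEST_EFFORT };

static int tencentos_priority(enum k8s_qos qos)
{
    switch (qos) {
    case QOS_GUARANTEED:  return 0;   /* online, highest priority */
    case QOS_BURSTABLE:   return 4;   /* middle tier              */
    case QOS_BEST_EFFORT: return 7;   /* offline tier             */
    }
    return 7;                         /* default to offline       */
}
```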

The Tencent Cloud container team has open-sourced its TKE distribution, and this feature will be supported in its next version; users can follow the community for updates.

More

Beyond its attention to cloud native, TencentOS Server is itself a general-purpose server OS. Over more than 10 years of sustained focus on the kernel, TencentOS Server has also developed and customized many features, large and small.

github.com/Tencent/Ten…

Conclusion

TencentOS has been thinking and exploring its own way to cloud native. The journey has begun, but it is far from over!
