Server resource utilization is generally low, while the total cost of ownership (TCO) of IT infrastructure rises year by year, which is a real headache for companies that operate large fleets of machines. Colocation technology emerged in response. It is still a relatively niche field in the industry: only companies with large resource pools are researching and developing colocation in pursuit of the savings it brings.

At Baidu, applying colocation technology to its main colocation clusters of hundreds of thousands of machines has raised CPU utilization to 40%+ and saved billions of yuan in total. Baidu's container engine product CCE now supports online-offline colocation and has completed large-scale rollout in production. This article takes an in-depth look at Baidu's online-offline colocation technology.

1. What is online-offline colocation?


In colocation, workloads are divided into online services and offline tasks.

How do we distinguish online from offline? Within Baidu, we consider the characteristics of online services to include, but not be limited to: long-running, latency-sensitive, with high stability requirements (any instability is immediately perceived by users and causes losses), and with pronounced peaks and troughs (high during the day, low at night); advertising and search are typical examples. The characteristics of offline tasks include, but are not limited to: latency-insensitive, retryable, and short-running (on the order of tens of minutes); typical examples are internal big-data computing and machine-learning jobs.

Take search as an example of an online service. During the day, when users are working and studying, query volume is very high; at night, when most users are resting, it is far lower than during the day. Offline tasks have no strict timing requirements and can run at any time; users only care that the task eventually completes, not when it runs. If online and offline workloads conflict over the same machine's resources, we throttle, or even evict, the offline task. This is imperceptible to the user: the computing platform simply restarts the task and continues the computation.

Online services and offline tasks therefore complement each other in timing and in resource tolerance. On one hand, online services have higher priority: both the single-node layer and the scheduling layer guarantee online resources first, while offline tasks may be throttled or evicted. On the other hand, throttling and eviction of offline tasks are invisible to users; the tasks only need to complete successfully, so their tolerance is high.

Put simply, deploying online services and offline tasks on the same physical resources, and using resource isolation and scheduling to fully utilize those resources while keeping services stable, is the technology called "colocation".

2. Why is resource utilization low?

The average utilization of pure online clusters is generally low. Before Baidu applied colocation, the CPU utilization of its online clusters was generally around 20%. The main causes are as follows:

2.1 Tidal traffic and resource redundancy

When an online service applies for resources, it usually requests the estimated peak, plus some extra headroom. The service therefore cannot estimate its needs accurately, and the requested resources far exceed what is actually used. Multiple replicas may even be deployed for disaster recovery.

During off-peak hours utilization drops very low, dragging down overall utilization.

2.2 Separate Online and Offline Resource Pools

Data-center planning has one very distinctive trait: offline data centers and online data centers are kept separate. For example, when building a data center in Ningxia, one would naturally consider making it an offline site, because online sites must account for the geographic distribution of user requests; to achieve the best access experience and speed, online data centers are planned near access hot spots such as Beijing, Shanghai, and Guangzhou.

An offline data center, however, has none of these concerns; what it cares about most is how to scale up computing and storage resources and infrastructure. Online and offline services thus have different resource demands and characteristics, which leads to uneven utilization: online pools run at low utilization while offline pools run hot and are chronically short of resources.

For these scenarios, we use online-offline colocation to unify the two resource pools, deploying offline jobs onto online pool nodes so that machine resources are fully used and overall utilization improves.

3. Baidu's cloud-native colocation in detail

With the booming Kubernetes (hereinafter "K8s") ecosystem, many Baidu business lines have built and operated their own K8s clusters, and have also run into problems such as the low resource utilization described above. Drawing on our internal colocation experience, we rebuilt online-offline colocation on K8s with zero invasion of K8s itself, making it portable and able to support large-scale clusters.

Colocation system architecture diagram

3.1 How to Achieve Resource Overcommitment?

Native K8s allocates based on a static resource view

The figure above shows a node's CPU allocation rate and usage. The allocation rate is 89%, while usage stays below 20% from hour 0 to 16 and peaks between 30% and 40% starting at hour 17. There is clearly a large idle gap between Request and Used. To reuse these resources we would need to schedule more Pods onto the node, but from the K8s scheduler's point of view there is no more CPU left to allocate.

A Pod that leaves request and limit unset can still be scheduled, but K8s does not place such BestEffort Pods based on actual usage: they may land on an already busy node, which not only fails to improve utilization but may also worsen service latency on that node.

Dynamic resource view

Since native K8s cannot allocate according to actual resource usage, we introduced a dynamic resource view.

In colocated scheduling, online and offline workloads share the same physical resources but see independent resource views. Resources available to online services are still the statically allocated system resources; resources available to offline tasks are the system resources minus what online services are actually using.

As the figure shows, there is a large gap between what online applications request and what they actually use, mainly because R&D engineers pick container resource specifications somewhat blindly when deploying a service: the request is set above the actual usage, or even above the observed peak. Colocation reclaims this gap by quickly backfilling offline jobs onto it.

High/Medium (online), static allocation: ∑High Request + ∑Medium Request <= Host Quota

Low (offline), dynamic calculation: Low Quota = Host Quota - ∑High Used - ∑Medium Used

Note: these are the idealized formulas. In practice an upper limit is set on offline usage; its effect is omitted here and described in the single-node resource management section below.
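As a concrete illustration, the dynamic offline quota calculation above can be sketched in a few lines. The function name and the optional cap parameter are illustrative, not Baidu's actual implementation:

```python
# Sketch of the dynamic resource view described above (illustrative names,
# not Baidu's actual implementation).

def offline_quota(host_quota, high_used, medium_used, offline_cap=None):
    """Offline-available CPU = host quota minus what the online (High/Medium)
    priorities are actually using, optionally clamped by an upper limit."""
    quota = host_quota - sum(high_used) - sum(medium_used)
    if offline_cap is not None:
        quota = min(quota, offline_cap)
    return max(quota, 0)

# A 48-core host: online Pods requested 40 cores but actually use only 10,
# so 38 cores can be dynamically handed to offline jobs.
print(offline_quota(48, high_used=[6, 2], medium_used=[2]))  # 38
```

The key point is that the offline quota tracks online *usage*, not online *requests*, so it grows automatically during online troughs.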

Since K8s allocation is static, and BestEffort Pods do not occupy request in the K8s QoS model, BestEffort Pods can still be scheduled onto a node even when its resources are fully allocated. We therefore reuse the BestEffort model for offline tasks, as in the architecture diagram above, which brings the following advantages:

  • Resolves view conflicts: offline tasks use the BestEffort model and are invisible to the online view

  • Compatible with community components such as cAdvisor, which can be used directly

  • No need to modify existing components such as kubelet, containerd, or runc, so invasiveness is low: install the colocation system directly and enjoy the resource-efficiency gains it brings.

3.2 Priority

Because offline tasks deployed on the same node may interfere with online services, we prioritize online and offline workloads at both the scheduling layer and the single-node layer.

Priorities are divided into high, medium, and low: online services are high and medium priority, offline tasks are low priority. Each priority further contains several sub-priorities.

Let's first look at the K8s QoS model:

  • Guaranteed: every container in the Pod has request == limit

  • Burstable: some container has request != limit

  • BestEffort: no container in the Pod sets a request or limit
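The classification rules above can be sketched as a small function. This is a simplification for illustration (real K8s checks each resource type, CPU and memory, separately; the data shapes here are hypothetical):

```python
# Simplified sketch of the K8s QoS classification rules listed above.
# Each container is a dict with optional "requests" and "limits" dicts.

def qos_class(containers):
    all_guaranteed = True
    any_set = False
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_set = True
        # Guaranteed requires limits set and equal to requests on every container
        if not lim or req != lim:
            all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "1"}, "limits": {"cpu": "1"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "1"}}]))                          # Burstable
print(qos_class([{}]))                                                  # BestEffort
```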

Compared with the K8s model, the Baidu colocation scheduler makes the following extensions:

3.3 Resource Isolation

Because colocation mixes online services and offline tasks on the same physical resources, resource contention arises when offline load surges. How do we ensure the online SLA is unaffected, and, while guaranteeing the online SLA, also guarantee the overall quality of offline tasks, i.e. their success rate and completion time?

CPU

Cpuset orchestration

For online services that need core binding, the CPU topology can be sensed on the node and cores bound directly, without enabling kubelet's CPU-pinning mechanism. Online services are bound to cores on the same NUMA node wherever possible, avoiding cross-node communication latency.

NUMA scheduling

NUMA is a memory management architecture. Under NUMA, CPUs are divided into multiple nodes, each with its own CPU cores and local memory. Every CPU can access all of the system's memory, but accessing the local node's memory is fastest (no interconnect hop), while accessing another node's memory is slower (it must cross the interconnect). In other words, a CPU's memory access speed depends on the distance to the node holding the memory.

For NUMA-enabled nodes, we sense the NUMA architecture and bind an online service within a single NUMA node to improve its performance. We also monitor per-NUMA-node load and reschedule when there is obvious imbalance between nodes.
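The "bind within one NUMA node where possible" placement can be sketched as follows. The topology representation and fallback behavior are hypothetical, not the real node agent:

```python
# Illustrative sketch of NUMA-aware core selection: prefer placing all of an
# online service's cores on one NUMA node to avoid cross-node memory access.

def pick_cores(topology, free, n):
    """topology: {numa_node_id: [core_ids]}; free: set of free core ids;
    n: cores needed. Prefer a single node; fall back to spilling across nodes."""
    for node, cores in sorted(topology.items()):
        avail = [c for c in cores if c in free]
        if len(avail) >= n:
            return avail[:n]          # all cores land on one NUMA node
    # Fall back: spread across nodes (cross-node latency accepted)
    avail = [c for cores in topology.values() for c in cores if c in free]
    return avail[:n] if len(avail) >= n else None

topo = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
print(pick_cores(topo, free={1, 2, 3, 5, 6}, n=3))  # [1, 2, 3] -> node 0
```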

Offline scheduler

Online services demand real-time responsiveness and low latency. To keep latency low, CPU load must not be too high; when it is, interference between workloads drives latency up. Offline tasks generally run at high CPU utilization but have no hard latency requirements. So if the online latency guarantee can be upheld, running online and offline on the same cores without interfering with online would greatly improve resource utilization.

The scheduling algorithms of today's common Linux kernel schedulers cannot strongly guarantee online services because they cannot distinguish offline ones: online tasks cannot preempt offline CPU time, and during load balancing online and offline tasks may be placed on the same core, degrading online performance.

The offline scheduler is a CPU scheduling algorithm dedicated to offline tasks. It is separate from the online scheduler, which cannot see offline tasks at all; the online scheduler always runs first, and the offline scheduler gets no CPU while online tasks are runnable. Online tasks therefore receive CPU quality similar to what they had before colocation.

Memory

Linux systems constantly write logs and generate backup files; when these files are large, their page cache occupies a lot of system memory. These caches are rarely re-accessed, so the system periodically flushes them to disk, and Linux reclaims them with its cache reclamation algorithm. This creates two problems:

1. A container's page cache cannot be reclaimed in the background, because page-cache management depends on the container's cgroup mechanism: there is no background reclamation, and reclamation only starts at allocation time, when the page cache hits its limit. If allocation outpaces reclamation, OOM can occur.

2. Cache reclamation does not distinguish online from offline, so an online service's cache may be reclaimed before an offline one's. It can even cause IO jitter.

To solve these problems we added a background reclamation mechanism, i.e. asynchronous cache reclamation: based on each workload's online or offline QoS, different background reclamation watermarks are set, so offline caches are reclaimed first.

  • Each container periodically reclaims its own page cache

  • Each container can set its own switch and high/low watermarks
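The watermark-based background reclamation decision might be sketched like this. This is purely illustrative; the real mechanism lives in the kernel, and the names here are hypothetical:

```python
# Sketch of the per-container high/low watermark logic described above.

def pages_to_reclaim(page_cache, high_wm, low_wm, enabled=True):
    """When a container's page cache rises above its high watermark, the
    background reclaimer asynchronously frees cache down to the low watermark."""
    if not enabled or page_cache <= high_wm:
        return 0
    return page_cache - low_wm

# Offline container: aggressive watermarks, so its cache is reclaimed first.
print(pages_to_reclaim(page_cache=900, high_wm=600, low_wm=400))   # 500
# Online container: higher watermarks, left untouched in the same situation.
print(pages_to_reclaim(page_cache=900, high_wm=1000, low_wm=800))  # 0
```

Because reclamation runs in the background, allocation no longer has to wait for synchronous reclaim at the limit, which is what caused the OOM risk described above.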

At the network level, we also developed container-level cgroup subsystems such as egress/ingress bandwidth limiting and traffic marking, which can throttle offline traffic.

More kernel-level isolation mechanisms are shown below:

Dynamic strategy based on eBPF

Existing kernel isolation policies are QoS-based: cgroups are configured at container creation and the kernel manages resources uniformly. But some highly sensitive services cannot get specific resource guarantees even at the highest QoS priority, or need strong guarantees in one particular resource dimension.

Moreover, because the isolation and scheduling policy is globally uniform, a business that wants to tune some isolation behavior for its own characteristics can only request a platform change, which has a long turnaround; and a globally applied change may mis-affect offline tasks or other businesses. Pushing isolation policy up to user space therefore better fits business needs.

For this scenario we built customized policies on eBPF, which is stable, secure, and efficient, and supports hot loading/unloading of programs without restarting Linux. Policies can be delivered and take effect in real time, with low invasiveness and no changes to existing services or platforms; isolation can be customized per service in user space, moving toward undifferentiated colocation.

Single-node Resource Management

How many resources offline tasks may occupy is a perennial colocation question. The answer differs by machine model and by the sensitivity of the online services: the same amount of offline usage affects different online services differently. We therefore set limits on offline-available resources at the cluster dimension, the pool dimension (a batch of machines with the same attributes), and the node dimension, with the finest granularity taking highest precedence.

Take CPU as an example, as shown in the following figure:

We set a CPU threshold X for the machine. When total CPU usage approaches or exceeds it, e.g. X = 50%, offline CPU resources are squeezed.

Here are two simple formulas:

Offline Quota = Host Quota × X - ∑NotOffline Used

Offline Free = Offline Quota - ∑Offline Used

The same kind of limit applies to memory, IO, and network. This lets us easily adjust offline usage for different machine models and businesses, avoiding impact on online performance when online usage rises.
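The two formulas above translate directly into code; a minimal sketch with illustrative names:

```python
# Sketch of the single-node offline throttling formulas above.
# X is the machine-level cap on total usage before offline jobs get squeezed.

def offline_quota(host_quota, x, not_offline_used):
    """Offline Quota = Host Quota * X - sum of non-offline usage."""
    return max(host_quota * x - not_offline_used, 0.0)

def offline_free(host_quota, x, not_offline_used, offline_used):
    """Offline Free = Offline Quota - sum of offline usage."""
    return offline_quota(host_quota, x, not_offline_used) - offline_used

# 48-core host, X = 50%: online currently uses 10 cores, offline uses 8.
print(offline_quota(48, 0.5, 10))    # 14.0 cores available to offline
print(offline_free(48, 0.5, 10, 8))  # 6.0 cores of offline headroom left
```

When online usage rises, the quota shrinks and the node agent throttles offline tasks back under it.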

3.4 High-performance scheduler

Online and offline workloads place different demands on the scheduler. Online services are usually long-lived and change infrequently, so they demand little of it. Offline tasks, however, are short-lived (minutes to hours) and numerous, and the default K8s scheduler's throughput cannot keep up with them. So we developed a high-performance offline scheduler whose scheduling computation reaches 5,000 OPS.

As shown in the figure above, with 150,000 Pods scheduled, computation throughput reached 5K OPS. To keep the scheduling rate from pressuring etcd and the cluster as a whole, we capped the binding rate at 1,500 OPS.
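Capping the binding rate can be done with a standard token bucket; a minimal sketch (not Baidu's actual code, and simplified to a single thread):

```python
# Token-bucket sketch of capping scheduler bind throughput (e.g. 1500 ops/sec)
# so that fast scheduling computation does not overwhelm etcd with bind calls.

import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        """Refill tokens for the elapsed time; spend one token per bind."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1500, burst=1500)
# 5000 bind attempts in a tight loop: roughly the first 1500 pass immediately,
# the rest must wait for refill.
allowed = sum(bucket.allow() for _ in range(5000))
print(allowed)
```

In practice a rejected bind would be requeued rather than dropped, so scheduling computation and binding are decoupled.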

3.5 Resource Profiling

Single-node resource isolation handles tasks already scheduled onto a node: if an offline task noticeably affects the node, the node immediately throttles or evicts it, which hurts the stability of offline tasks. In this scenario, if a node's future online usage could be predicted, offline tasks could be scheduled accordingly.

Compare this with a real-time resource model. Suppose an offline job runs for one hour and is scheduled against the real-time resource view; if the online load rises half an hour in, the offline job will be throttled in favor of online for its remaining half hour, and its running quality suffers.

We instead predict online resource usage over the next one-hour window and schedule offline tasks that fit, ensuring they are not throttled at any point during their run, thereby improving job quality.
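A naive version of this window-based admission might look like the following. The real system uses learned resource profiles; the predictor here is deliberately simplistic, and all names are illustrative:

```python
# Sketch of window-based admission: predict the online peak over the next hour
# and only promise offline work that fits under the machine-level cap X.

def predicted_online_peak(history, window=12):
    """Naive predictor: max of the last `window` samples
    (e.g. twelve 5-minute data points covering the past hour)."""
    return max(history[-window:])

def admissible_offline(host_quota, x, history):
    """Offline CPU we can promise for the next hour without suppression."""
    return max(host_quota * x - predicted_online_peak(history), 0.0)

# 48-core host, X = 50%; online usage samples climbing toward a 14-core peak.
samples = [8, 9, 9, 10, 11, 12, 12, 13, 13, 14, 13, 12]
print(admissible_offline(48, 0.5, samples))  # 10.0
```

Scheduling against the predicted peak rather than the instantaneous usage is what prevents the half-hour suppression scenario described above.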

How stable the overcommitted resources can be depends on how accurate our resource prediction is.

Resource profiles are needed not only for offline scheduling but also for online scheduling, where they effectively avoid hot spots.

  • For online scheduling, the main goal is service availability. Using resource profiles to predict resource usage over a future period avoids hot spots (excessive usage in some resource dimension), both at scheduling time and at rescheduling time.

  • For offline scheduling, the main goals are cluster throughput and shorter job queuing and execution times. Resource profiles improve offline job stability, avoiding the longer execution times caused by eviction, rescheduling, and resource suppression.

4. Future outlook

Baidu currently colocates hundreds of thousands of machines; colocation-cluster CPU utilization is 40% to 80% higher than online-only utilization, saving nearly 100,000 servers in total.

Going forward, Baidu colocation's main goals are to keep expanding colocation scale, save resource costs at larger scale, and support more load types beyond online-offline colocation, achieving undifferentiated colocation.

  • Single-node isolation: support colocating more kinds of services, with better conflict detection and resource isolation.

  • Scheduling: more planned scheduling and finer-grained resource profiles, predicting hot-spot probability more accurately, improving scheduling capability and reducing the hot-spot rate.

We will also invest more in the following directions:

  • Kernel programmability:

Through innovation in eBPF observability, achieve close observation of colocated containers' load performance, making high-density colocation even more lossless;

eBPF's hot load/unload capability lets users push down isolation policies to quickly resolve sensitive resource-quality problems.

  • Heterogeneous resources: better support colocation of heterogeneous resources such as GPUs, improving their efficiency and elasticity and greatly reducing GPU costs.

  • Container-VM convergence: remove the bottleneck of shared-kernel colocation on high-density machine models.

  • Multi-cloud colocation: combine with public cloud for extreme elasticity. For example, with elastic spot instances, automatically add or remove spot instances for offline jobs or colocation based on the user's price-sensitivity settings, achieving elastic multi-cloud colocation.

About the Baidu container engine CCE

Baidu's container engine product CCE provides Docker container lifecycle management, large-scale container cluster operation and maintenance, one-click release and operation of business applications, and more, connecting seamlessly with other Baidu smart cloud products. This flexible, highly available cloud-native Kubernetes container platform supports scenarios such as microservice architectures, DevOps, and containerized AI deep learning.

———- END ———-

Baidu Geek Talk, Baidu's official technology WeChat account, is now live!
