Running data centers at 45% daily CPU utilization: the evolution of Alibaba's large-scale co-location technology

Author introduction: Jiang Ling (Ling Xin), technical expert in Alibaba's System Software Department, product lead for automated big-promotion preparation, and project lead for large-scale e-commerce co-location.

Today I will cover co-location (mixed deployment) technology in the following four parts, focusing on the first two:

First, an introduction to Alibaba's exploration of co-location. Co-location is still a little-studied field in the industry; only when resource volume and cost reach a certain scale does the technology pay off substantially. I will describe how Alibaba explored it.

Second, the co-location scheme and architecture. This talk focuses on the architecture design and on operations and maintenance.

Third, the core co-location technologies. Due to time constraints, this talk only lists the main technical points and directions, without going deep into the core technical details.

Fourth, future prospects.

I. Introduction to Alibaba's co-location exploration

Co-location technology starts from the question of how to balance growing business against rising resource cost. We want to support greater business demand at minimum resource cost. Can existing resources be reused to serve new business? That question is where co-location comes from.

1.1 Why co-locate?

The figure above shows Alibaba's transaction volume each year since the Double 11 shopping festival started in 2009. For our business colleagues this growth curve is beautiful, but for engineers and operations staff it means great challenges and resource pressure.

Anyone who has run an e-commerce platform knows that during a promotion the technical pressure usually arrives in the first second of the sale, as a pulse of peak traffic.

The midnight peak traffic of Alibaba's online business on Double 11, usually measured as transactions created per second, roughly matches this curve. Since 2012, the midnight peak has almost doubled each year. Online business has grown this fast largely because of our promotional activities.

In addition to the online business, Alibaba also has an even larger offline computing business. With the rise of AI and related technologies, computing workloads keep growing. To date, our big data storage has reached the scale of thousands of petabytes, and the number of daily tasks is in the millions.

As the business grows, the infrastructure layer holds a large pool of resources to serve both the online and offline businesses. Because online and offline services use resources in very different ways, they were originally designed to run in two independent data centers. Both data centers have now grown to more than 10,000 servers.

However, we found that although the data centers hold a huge volume of resources, the utilization of some of them is disappointing. In the online data center in particular, daily resource utilization is only about 10%.

Given this background, we looked at how different services use and require resources. On the one hand, different services peak at different times, which allows time-shared reuse of resources.

On the other hand, services differ in how well they tolerate slow resource responses, which allows resources to be competed for and preempted by priority. These two observations led us to explore co-locating different businesses.

1.2 What is co-location?

In short, co-location means deploying different types of business together, so that one set of resources provides the resource equivalent of two different businesses.

First, co-location integrates resources: services that were originally physically separated are deployed on unified physical resources.

Second, resources are shared. The same resource supports both service A and service B, and from the point of view of each service the whole resource is visible at the same time.

Finally, resources are competed for in a controlled way. Since one resource is now presented as two, competition is inevitable, and sound competition mechanisms are needed so that businesses with different resource demands can each meet their own service requirements.

The biggest value of co-location is that resource sharing lets resources be fully reused, creating capacity seemingly out of nothing. Its core goal is to protect high-priority business when resources are contended. We therefore rely on scheduling control and kernel isolation to share resources fully while isolating competition.

1.3 Online and offline co-location

For the online business, the main scenarios covered by co-location are transactions, payments, and browsing requests.

Online business has very strict real-time requirements and cannot be degraded. If a user waits a long time (on the order of seconds) while buying an item, the user will probably abandon the purchase; if the user has to retry, it will probably be hard to retain that user.

In online business, especially e-commerce like ours, traffic follows a very clear pattern tied to users' daily routines: high during the day, when people shop, and low at night.

Another major feature of e-commerce platforms is that daily traffic is very low compared with a big promotion. On the day of a big promotion, the number of transactions created per second may be ten times, or even more than 100 times, the normal peak, so the load is strongly tied to specific times.

Offline business, such as computing jobs, algorithm computation, statistical reports, and data processing, is not latency sensitive compared with online business. Users submit jobs whose own processing time may be seconds, minutes, hours, or even days, so they can finish after running for a while. These jobs also accept retries, although technically we should care about who performs the retry: asking the user to retry is not acceptable, but if the system retries on the user's behalf, the user does not notice.

In addition, offline business is less tied to specific times than online business. It can run at any time, and it even shows the opposite timing pattern to online business: it is less likely to run in the daytime and more likely to run in the early morning.

The reason is again user behavior. For example, a user submits a statistics job, it starts running after midnight, and the user receives the report before going to work the next morning.

Analyzing the runtime characteristics of the two kinds of business shows that online and offline business peak at different times, in both business pressure and resource usage.

On the other hand, online services clearly have higher priority and the ability to preempt resources, while offline services show some tolerance when resources are insufficient. These factors make online and offline co-location feasible.
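The off-peak argument can be made concrete with a small calculation. The sketch below is only an illustration: the two hourly demand curves are invented numbers that follow the shapes described in the text (online high by day, offline high in the early morning), and `capacity_needed` is a hypothetical helper, not anything from Alibaba's systems. It compares the capacity needed by two dedicated clusters with the capacity needed by one shared pool.

```python
# Hypothetical hourly CPU demand (in cores) over one day. The shapes follow the
# text; the numbers themselves are made up for illustration.
ONLINE = [10, 8, 6, 5, 5, 8, 15, 25, 35, 40, 45, 45,
          40, 40, 45, 45, 40, 40, 45, 50, 50, 40, 25, 15]
OFFLINE = [60, 65, 65, 60, 55, 45, 30, 20, 15, 10, 10, 10,
           10, 10, 10, 10, 10, 10, 10, 10, 15, 25, 40, 55]


def capacity_needed(online, offline):
    """Return (separate, shared) peak capacity for two hourly demand curves."""
    # Two dedicated clusters must each be sized for their own peak.
    separate = max(online) + max(offline)
    # One co-located pool only has to cover the peak of the combined demand.
    shared = max(a + b for a, b in zip(online, offline))
    return separate, shared


separate, shared = capacity_needed(ONLINE, OFFLINE)
print(f"separate clusters: {separate} cores, shared pool: {shared} cores")
```

Because the two curves peak at different hours, the shared pool is smaller than the sum of the two separate peaks. The gap shrinks as the peaks move closer together in time.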

1.4 Alibaba's co-location exploration timeline

Before going into the technology, here is a brief account of how Alibaba explored co-location.

In 2014, co-location technology was proposed;
In 2015, we did offline testing and prototype simulation;
In 2016, about 200 machines were put into the production environment and ran for one year, serving internal users as the first batch of users within the company. The production rollout proved effective;
In 2017, co-location ran in production at small scale, reaching thousands of physical machines, serving external users directly and supporting the 2017 Double 11 promotion;
In 2018, we hope to scale up, so that co-location brings substantial technical dividends through economies of scale, and to build a co-location cluster of 10,000 machines.

1.5 Results of Alibaba's large-scale co-location

1. Co-location has reached thousands of machines and has been verified in the core Double 11 transaction scenario. Introducing offline computing tasks into the online cluster raised daily CPU utilization from 10% to 40%.

2. Deploying online services in the offline cluster supports the creation of on the order of 10,000 (W) transactions per second on Double 11.

3. In the co-located environment, interference with online services is less than 5%.

At present, co-location has two scenarios. In the first, the online cluster provides the resources, and online resources give offline business extra computing power. In the second, the offline cluster provides the resources, and offline resources are used to create online transaction capacity (mainly to absorb online traffic peaks such as big promotions).

We have a simple internal naming convention: whichever side provides the machines is named first, which gives us online-offline co-location and offline-online co-location.

On November 11, 2017, our company officially announced a peak of 375,000 transactions created per second, of which the co-location cluster carried on the order of 10,000 transactions per second. We used offline resources to support the online peak and saved a certain amount of resource cost.

At the same time, after the online-offline co-location cluster went live, daily resource utilization of the original online cluster rose from 10% to 40%, providing extra daily computing power for offline business. As shown below:

This data comes from our real monitoring system. The right image shows the non-co-located scenario: between about 7 and 11 o'clock, utilization in the online data center is about 10%. The left image shows the co-located scenario: utilization averages about 40% and fluctuates more, because offline workloads are themselves highly variable.

With so many resources saved, does service quality, especially for online business, get worse?

The following is the response time (RT) curve of the core online service responsible for transaction processing. The green curve shows RT in the co-located cluster, and the yellow curve shows RT in the non-co-located cluster. The two curves basically coincide, and the average RT in the co-located scenario is within 5% of that in the ordinary cluster, which meets the service quality requirement:

II. Co-location scheme and architecture

Because co-location is closely tied to the company's business systems and operations systems, this article may mention various pieces of technical background. Due to space constraints they are only mentioned briefly and may not be explained in detail.

The following describes the solution, including the overall architecture, the service deployment strategy, the resource management and allocation mechanism, and the service operation strategy in the co-location scenario.

2.1 Overall co-location architecture

In the abstract, co-location technology can be divided into three levels:

First, merge resources into an integrated resource pool that both A and B can use.

Second, schedule and allocate resources well.

Before co-location, Alibaba Group already had several resource scheduling platforms. The online resource scheduling system is called Sigma, and the offline resource scheduling system is called Fuxi.

The challenge for co-location is to allocate resources to different services, unify multiple resource scheduling systems, and arbitrate between their decisions.

Third, at runtime, isolate competing workloads and preempt resources by priority.

The architecture in the figure above is layered:

At the bottom is the infrastructure layer. The group's data centers are unified: no matter how the upper layers are used, the hardware and supporting facilities such as machines and networks are the same.

The next layer is the resource layer. To co-locate, we must merge the resource pools and manage the resources together.

The next layer up is the scheduling layer, which is divided into a server side and a client side. Sigma handles online and Fuxi handles offline; we call each business's own resource scheduling platform a layer-1 scheduler. The co-location architecture introduces a layer-0 scheduler, which is mainly responsible for coordinating the resource management and allocation decisions of the two layer-1 schedulers. It also has its own agent.

The top layer is the business-facing resource scheduling and control layer. Some resources are delivered to the business directly by the layer-1 scheduler, and some go through a layer-2 scheduler, such as Hippo.

In addition, the co-location architecture has a dedicated co-location management and control layer, which is mainly responsible for orchestrating and executing the business operation mechanisms in co-location mode, as well as for physical resource configuration and control, business monitoring, and decision making.

That is the resource allocation architecture, which lets machines and resources be allocated to different businesses. But after allocation, how are business priorities and SLAs guaranteed at runtime? Online and offline services run on the same physical machine at the same time, so what happens when they contend for resources? We guarantee resources at runtime through kernel isolation, and we have developed a number of kernel features to support isolation, switching, and degradation for different resource types. Kernel-related mechanisms are covered in Part III.

2.2 Online service deployment strategy in the co-location scenario

This section describes how co-location is applied to online business scenarios to provide transaction creation capacity for the e-commerce platform.

First, because co-location is new and involves many technical changes, we wanted to test it on a small scale within a limited, controllable scope in order to contain risk. We therefore based the deployment strategy on our company's unitized e-commerce (online) deployment architecture and built the co-location cluster as an independent transaction unit. On the one hand, this confines co-location to a local scope without affecting the global system; on the other hand, the unit forms a closed business loop with independently controlled resource allocation.

In the online e-commerce system, we grouped the whole chain of services involved in a buyer's purchase into one service set, which we define as a transaction unit. All requests and orders related to a buyer's transaction are completed in a closed loop within the unit. This is the unitized deployment architecture behind our geo-distributed multi-active setup.

Another constraint on implementing co-location comes from hardware. Offline and online services have different hardware requirements, and the resources of one may not suit the other. In practice we ran into a resource adaptation problem, which showed up most strongly with disks.

The native resources of the offline business contain a large number of low-cost HDDs, and these HDDs are almost fully used while offline jobs run. They are basically unavailable to online business.

To avoid disk IOPS problems, we introduced compute-storage separation, another technology that has evolved within our group. It provides centralized computing and storage services: compute nodes connect to storage centers over the network, which removes the compute nodes' dependence on local disks.

Storage clusters can offer different storage capabilities. Online services demand high storage performance but low throughput, so we use compute-storage separation to obtain remote storage services that meet their IOPS needs.

2.3 Resource allocation in co-location clusters

Having covered the overall architecture, let us look at resource allocation in the co-location cluster from the resource perspective, and at how extra resources appear seemingly out of thin air.

First, the single-machine view, where the main resources are CPU, memory, disk, and network. The following describes how extra resources are obtained for each.

Let's start with the CPU. Daily resource utilization of the pure online cluster is about 10%; in other words, online business cannot make full use of the CPU on a normal day.

Offline tasks are like sponges soaking up water: their workload is huge, and they can use as much CPU as they are given. Given these usage patterns, co-location can make one CPU serve as two.

In the kernel, CPU resources are allocated to different processes by time-slice rotation. We allocate the same CPU core to online services and offline tasks at the same time and give online services higher priority. When the online service is idle, offline tasks can use the CPU; when the online service needs it, the offline tasks are preempted and suspended.
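The priority rule described above can be sketched as a toy simulation of a single core. This is a user-space illustration under simplifying assumptions (fixed time slices, one online service, one offline queue), not the kernel scheduler itself; `run_core` and its parameters are hypothetical names.

```python
def run_core(online_busy, offline_work):
    """Simulate one CPU core shared by an online service and offline tasks.

    online_busy  -- list of booleans, one per time slice; True means the
                    online service has runnable work in that slice.
    offline_work -- number of time slices of offline work waiting to run.

    Returns (timeline, remaining_offline_work).
    """
    timeline = []
    for busy in online_busy:
        if busy:
            # Online has priority: offline is preempted (suspended) this slice.
            timeline.append("online")
        elif offline_work > 0:
            # Online is idle, so offline work soaks up the slice.
            timeline.append("offline")
            offline_work -= 1
        else:
            timeline.append("idle")
    return timeline, offline_work


timeline, left = run_core([True, False, False, True, False], offline_work=2)
print(timeline, left)
```

The offline work makes progress only in slices the online service leaves unused, and whatever has not finished simply stays queued until the next idle slice.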

The Pouch container is the resource unit for online services. Each Pouch container is bound to specific CPU cores for use by one online service, and Sigma treats the entire physical machine as an online machine.

At the same time, the Fuxi scheduler treats this machine as an offline machine and offers the CPU resources of the whole machine to offline tasks as allocable resources. In this way, the CPU resources are effectively doubled.

Assigning the same CPU to two businesses creates a risk of contention, which is handled by core kernel technology for CPU isolation and scheduling, as discussed below.

CPU time slices can be shared by multiple processes, but memory and disk resources are trickier. They are consumable: once one process occupies them, other processes cannot use them, or the data would be overwritten. How to reuse memory therefore became another research focus.

The figure (top right) illustrates memory overselling in co-location. The brackets at the top represent the online memory allocation (blue) and the offline memory allocation (red), and the brackets at the bottom represent online memory usage (blue) and offline memory usage (red).

As the figure shows, offline tasks use part of the memory that has been allocated to online services. This is how memory is oversold.

Why can online memory be oversold? Our company's online business is written in Java, so part of the memory allocated to a container goes to the Java heap, and the rest is used as cache.

This leaves a certain amount of free memory in the online container. By monitoring memory usage closely and adding protection mechanisms, we hand this free memory over to offline use. However, this memory belongs to online services and cannot be firmly guaranteed for offline use, so only low-priority offline tasks that can be degraded are scheduled onto it.

For disks, capacity is sufficient for both businesses, so no restriction is needed. For disk I/O, we apply bandwidth limits that cap the I/O used by offline tasks, so that they cannot completely starve online services and the system of I/O.
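The article does not say how the I/O limit is implemented. One common way to cap a rate is a token bucket, sketched below purely as an assumption: `TokenBucket`, `submit`, and the rates used are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Token-bucket rate limiter driven by an explicit clock (in seconds)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # tokens (for example I/O operations) added per second
        self.burst = burst         # maximum number of tokens that can accumulate
        self.tokens = burst        # start with a full bucket
        self.last = 0.0            # time of the previous refill

    def allow(self, now, cost=1):
        """Return True if an I/O of the given cost may be issued at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def submit(io_class, limiter, now):
    """Online I/O is never throttled here; offline I/O must pass the limiter."""
    if io_class == "online":
        return True
    return limiter.allow(now)


limiter = TokenBucket(rate_per_sec=2, burst=2)
results = [submit("offline", limiter, 0.0) for _ in range(3)]
print(results)
```

In this sketch, offline requests beyond the configured rate are rejected and would have to wait or retry, while online requests bypass the limiter entirely.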

At the single-machine network level, capacity is currently in surplus and is not a bottleneck, so it is not discussed further here.

2.4 Big-promotion resource yielding mechanism: fast site scale-up and scale-down

The sections above covered sharing and isolating resources on a single machine. Now let's look at how overall operations control migrates resources and maximizes their use at the cluster level. With co-location we pursue maximum utilization, so that no resource is wasted on a business scenario that does not need it.

So we proposed the concept of fast site scale-up and scale-down. From the online perspective, as stated earlier, each co-location cluster is an online transaction unit that independently serves the transactions of a small share of users, so we also call it a "site". Scaling the overall capacity of an online site up or down is what we call fast scale-up and scale-down. As shown below:

Online traffic differs hugely between daily operation and special promotions. During Double 11, traffic may be more than 100 times the daily level, which is what makes the fast scale-up and scale-down scheme feasible.

In the figure above, each of the two large blocks represents the capacity of the whole online site. Each row of small squares represents one online service, and the number of containers in the row represents that service's capacity reserve (its total number of containers). By planning the capacity of the entire site, we switch between a daily capacity model and a big-promotion capacity model, and so use resources precisely.

Our e-commerce business usually takes a business target, such as the number of transactions created per second, as the benchmark for site capacity. Generally, in the daily state it is enough for a single site to keep a capacity of thousands of transactions per second (K level). When a big promotion approaches, we switch the site to the big-promotion state, usually at a capacity of tens of thousands of transactions per second (W level).

In this model, online capacity that the site does not need is scaled down so that its resources are fully released, and offline business gets more physical resources. This is the fast scale-up and scale-down mechanism.

Scaling a site up (from low capacity to high capacity) takes less than an hour. Scaling a site down (from high capacity to low capacity) takes less than half an hour.
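Switching between capacity models amounts to recomputing how many containers each service needs for the site's target transaction rate. The sketch below assumes, for illustration only, that container demand scales linearly with the target rate; the service names, the per-service ratios, and the helpers `plan_site` and `scale_delta` are all hypothetical.

```python
import math

# Hypothetical sizing: containers each service needs per 1,000 transactions/s.
CONTAINERS_PER_1K_TPS = {"cart": 4, "order": 6, "payment": 5}
MIN_CONTAINERS = 2  # assumed floor so that every service stays available


def plan_site(target_tps):
    """Return the container count per service for a site-level TPS target."""
    return {
        svc: max(MIN_CONTAINERS, math.ceil(per_1k * target_tps / 1000))
        for svc, per_1k in CONTAINERS_PER_1K_TPS.items()
    }


def scale_delta(current, target):
    """Containers to add (positive) or release (negative) for each service."""
    return {svc: target[svc] - current[svc] for svc in target}


daily = plan_site(1_000)    # daily state: thousands of transactions/s
promo = plan_site(10_000)   # big-promotion state: tens of thousands/s
scale_up = scale_delta(daily, promo)
scale_down = scale_delta(promo, daily)
print(scale_up)
print(scale_down)
```

The containers released by the scale-down plan are the resources that become available to offline business in the daily state.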

In the daily state, the co-location site supports daily online traffic with the minimum capacity model. On the eve of a big promotion or a full-link stress test, the site is quickly scaled up to a relatively high capacity, runs in that state for several hours, and is then quickly scaled back down.

Through this mechanism, online business uses very few resources most of the time, and more than 90% of the resources are fully used by offline business. The figure below details the resource allocation in each phase of scale-up and scale-down:

In the figure above, the three rectangles on the left, in the middle, and on the right show the resource distribution of the co-location cluster in the daily state, the stress-test state, and the big-promotion state, respectively.

Red represents offline and green represents online. Each rectangle has three layers. The upper layer shows the services that are running and their scale. The middle layer shows the distribution of resources (hosts), where the small blue squares represent co-located resources. The lower layer shows the resource allocation ratio and operation mode at the cluster level.

In the daily state (left rectangle), offline business takes the vast majority of resources, some through allocation and some through runtime contention between online and offline.

In the stress-test state (middle) and the big-promotion state (right), offline business gives up resources, and the allocation ratio between offline and online is roughly 50/50. When online pressure is high, offline business does not oversell resources, but during the preparation period (big-promotion state without high pressure) offline business can still take over idle resources.

On the day of the Double 11 promotion, to ensure the stability of online business, we degrade offline business to a certain extent.

2.5 Daily resource yielding mechanism: time-sharing multiplexing

The fast scale-up and scale-down mechanism above switches an online site's capacity between the big-promotion state and the daily state. In addition, online traffic shows a strongly regular pattern of peaks and valleys over the course of each day. To improve resource utilization further, we also proposed a daily resource yielding mechanism: time-sharing multiplexing.

The figure above shows the daily traffic cycle of online services: low in the early morning and high during the day. For each online service, we scale capacity over the day to minimize its resource usage and hand the freed resources to offline use.
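As a rough illustration of daily capacity scaling, the sketch below sizes one online service hour by hour from its expected traffic and treats the remaining containers as yielded to offline use. The traffic values, the per-container throughput, the 20% headroom, and the helper `hourly_plan` are assumptions made for the example.

```python
def hourly_plan(traffic_qps, per_container_qps, total_containers, headroom_pct=120):
    """Return a list of (online_containers, yielded_to_offline), one per hour.

    traffic_qps       -- expected online requests per second for each hour
    per_container_qps -- requests per second that one container can serve
    total_containers  -- containers the service holds at full capacity
    headroom_pct      -- safety margin; 120 means provision for 120% of traffic
    """
    plan = []
    for qps in traffic_qps:
        # Integer ceiling division avoids floating-point rounding surprises.
        numerator = qps * headroom_pct
        denominator = 100 * per_container_qps
        need = (numerator + denominator - 1) // denominator
        online = min(total_containers, max(1, need))
        plan.append((online, total_containers - online))
    return plan


# Hypothetical traffic: quiet early morning, busy daytime.
traffic = [200, 100, 100, 400, 1000, 1500]
plan = hourly_plan(traffic, per_container_qps=100, total_containers=20)
print(plan)
```

In the quiet hours most containers are yielded to offline work, and as daytime traffic rises the online service takes them back.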

III. Core co-location technologies

The core co-location technologies fall into two areas: kernel isolation and resource scheduling. Both are specialized fields, so given the length of this article, the following only lists a series of technical points without going into detail.

3.1 Introduction to kernel isolation technology

We have developed strong isolation features in the kernel for each resource type, covering the CPU, I/O, memory, and network dimensions. Overall, online and offline workloads are divided into groups based on cgroups, which distinguishes the kernel priorities of the two types of business.

In the CPU dimension, we implemented isolation features for hyperthread pairs, the scheduler, the L3 cache, and so on. In the memory dimension, we implemented memory bandwidth isolation and OOM kill priority. In the disk dimension, we implemented I/O bandwidth limiting. In the network dimension, we implemented flow control at the single-machine level and QoS guarantees across the whole link.
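The idea behind OOM kill priority can be shown with a small user-space sketch: when memory runs out, pick a victim from the lower-priority (offline) group before touching any online process. This is not the kernel implementation; the `Proc` type, the priority table, and the tie-breaking rule (largest memory first) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    group: str    # "online" or "offline", as in the cgroup split described above
    rss_mb: int   # resident memory in MB


# Lower value means the group is sacrificed first under memory pressure.
KILL_ORDER = {"offline": 0, "online": 1}


def pick_oom_victim(procs):
    """Choose the process to kill: offline before online, larger before smaller."""
    return min(procs, key=lambda p: (KILL_ORDER[p.group], -p.rss_mb))


procs = [
    Proc("trade-service", "online", 8000),
    Proc("report-job", "offline", 1500),
    Proc("etl-job", "offline", 3000),
]
victim = pick_oom_victim(procs)
print(victim.name)
```

Even though the online process uses the most memory, an offline job is chosen first, which matches the goal of protecting online business.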

Detailed introductions to co-location kernel isolation can be found elsewhere; the following covers only the memory overselling mechanism.

Dynamic memory overselling mechanism:

In the figure above, the solid red and blue brackets represent the memory allocated to the offline and online cgroups respectively, and their sum is the memory that can be allocated on the whole machine (excluding system overhead). The solid purple bracket below represents the oversold memory quota for offline use. Its size changes over time and is determined by monitoring, at runtime, how much allocated online memory is idle.

The dashed brackets in the upper part of the figure show the actual memory usage of online and offline services. Online services usually do not use up their memory, and the remainder is given to offline as the oversold quota. To guard against sudden online memory demand, the mechanism reserves a certain amount of memory as a buffer. In this way, offline tasks use oversold memory.
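The quota described here can be written as simple arithmetic: the oversold quota is the idle part of the online allocation minus the reserved buffer, and never less than zero. The sketch below is a minimal model of that rule; the function names and the sample numbers are hypothetical.

```python
def oversold_quota(online_alloc_mb, online_used_mb, buffer_mb):
    """Memory (MB) that offline may borrow from the online allocation."""
    idle = online_alloc_mb - online_used_mb
    # Keep a buffer for sudden online demand; never return a negative quota.
    return max(0, idle - buffer_mb)


def offline_limit(offline_alloc_mb, online_alloc_mb, online_used_mb, buffer_mb):
    """Total memory offline may use: its own allocation plus the oversold quota."""
    return offline_alloc_mb + oversold_quota(online_alloc_mb, online_used_mb, buffer_mb)


# Recomputed whenever monitoring reports new online usage.
quiet = offline_limit(32_000, 64_000, 40_000, 8_000)
busy = offline_limit(32_000, 64_000, 60_000, 8_000)
print(quiet, busy)
```

When online usage climbs, the quota shrinks toward zero, which is why only degradable offline tasks should be placed on the oversold part.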

3.2 Resource scheduling technology

The second core co-location technology is resource scheduling. Resource scheduling in the co-location scenario consists of the original layer-1 schedulers (Sigma for online resource scheduling and Fuxi for offline resource scheduling) and the layer-0 co-location scheduler.

3.2.1 Online resource scheduling: Sigma

The online resource scheduler schedules and allocates resources based on application resource profiles. This involves bin-packing problems, affinity and mutual-exclusion rules, and global optimization. It also scales application capacity automatically from a global view, supporting time-sharing multiplexing and fast scale-up and scale-down.
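To give a feel for packing with a mutual-exclusion rule, the sketch below uses a first-fit-decreasing heuristic with one anti-affinity constraint: two containers of the same application may not share a host. This is a textbook heuristic chosen for illustration, not Sigma's actual algorithm; the function `place` and the sample data are hypothetical.

```python
def place(containers, hosts):
    """Place containers on hosts with first-fit decreasing and anti-affinity.

    containers -- list of (name, app, cpu) tuples
    hosts      -- dict mapping host name to free CPU cores
    Rule: at most one container of the same app per host.
    Returns a dict mapping container name to host, or raises ValueError.
    """
    free = dict(hosts)                       # do not mutate the caller's dict
    apps_on = {h: set() for h in hosts}      # apps already present on each host
    placement = {}
    # Largest containers first, so big items are not stranded at the end.
    for name, app, cpu in sorted(containers, key=lambda c: -c[2]):
        for host in free:
            if free[host] >= cpu and app not in apps_on[host]:
                free[host] -= cpu
                apps_on[host].add(app)
                placement[name] = host
                break
        else:
            raise ValueError(f"no host can take {name}")
    return placement


containers = [("a1", "A", 4), ("a2", "A", 4), ("b1", "B", 2)]
placement = place(containers, {"h1": 8, "h2": 8})
print(placement)
```

The two replicas of application A land on different hosts even though one host could hold both, which is the kind of trade-off between packing density and fault isolation that such rules express.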

The figure above is the architecture diagram of Sigma, the online layer-1 scheduler. It is compatible with the Kubernetes API, schedules Alibaba's Pouch containers, and has been proven over many years by Alibaba's large-scale traffic and Double 11 promotions.

3.2.2 Offline resource scheduling: Fuxi

The offline cluster scheduler mainly implements hierarchical task scheduling, dynamic memory overselling, lossless and lossy offline degradation schemes, and so on.

This is the operating mechanism diagram of Fuxi, the offline resource scheduler. It schedules by job and provides a data-driven, multi-level, pipelined parallel computing framework for massive data processing and large-scale computing applications.

In terms of expressiveness, it is compatible with programming models such as MapReduce, Map-Reduce-Merge, Cascading, and FlumeJava. It is highly scalable, supports scheduling hundreds of thousands of parallel tasks, and optimizes network overhead based on data distribution.

3.2.3 Unified resource scheduling: layer 0

In the co-location scenario, offline and online services each schedule and allocate resources through their own layer-1 scheduler. Beneath the layer-1 schedulers there is also a unified resource scheduling layer, layer 0, which coordinates and arbitrates the two sides' resources and allocates them sensibly through monitoring and decision making. The following is the overall architecture diagram of co-location resource scheduling.
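The article does not describe the arbitration policy itself. As one plausible reading, the sketch below grants the online scheduler's request first and gives the offline scheduler whatever remains. The priority-first policy, the function `arbitrate`, and the use of CPU cores as the only resource are assumptions for illustration.

```python
def arbitrate(total_cores, online_request, offline_request):
    """Split a shared pool between the two layer-1 schedulers.

    Online (Sigma) is served first, up to the pool size; offline (Fuxi)
    receives what is left, up to its own request.
    """
    online = min(online_request, total_cores)
    offline = min(offline_request, total_cores - online)
    return {"sigma": online, "fuxi": offline}


daily = arbitrate(total_cores=100, online_request=10, offline_request=200)
promo = arbitrate(total_cores=100, online_request=50, offline_request=200)
print(daily, promo)
```

With a small online request the offline side receives most of the pool, and as the online request grows the offline share shrinks accordingly.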

IV. Future prospects

Co-location technology will evolve in three directions: scale, diversification, and refinement.

Scale: in 2018, co-location will reach 10,000 machines, an order-of-magnitude leap. We hope to make co-location a basic capability of the group's internal resource delivery and save resource costs on a larger scale.

Diversification: in the future, we hope to support more business types, more types of hardware resources, and more complex environments. We even hope to connect resources on the cloud, so that Alibaba Cloud (Aliyun) resources and internal resources can be used interchangeably.

Refinement: in the future, we hope that business resource profiles can be more detailed, scheduling more timely and accurate, kernel isolation more fine-grained, and monitoring and operations control more real-time and accurate.
