Background

As a leading lifestyle service platform in China, many of Meituan-Dianping's businesses have pronounced, regular traffic "peaks" and "valleys". During holidays and promotional events in particular, traffic surges explosively within a short time. This places high demands on the elasticity and availability of cluster resources, and sharply increases the complexity and cost of supporting such traffic. What we need to do is maximize the throughput of the cluster with limited resources while safeguarding the user experience.

This article introduces Meituan-Dianping's practice of managing and using Kubernetes clusters, covering an introduction to Meituan-Dianping's cluster management and scheduling system, Kubernetes management and practice, Kubernetes optimization and transformation, and resource management and optimization.

Meituan-Dianping cluster management and scheduling system

Meituan-Dianping has been working on cluster management and resource optimization for many years. In 2013, it began building a resource delivery method based on traditional virtualization technology. In July 2015, HULK, a comprehensive cluster management and scheduling system, was established to drive the containerization of Meituan-Dianping services. In 2016, the company completed a self-developed elastic scaling capability based on Docker container technology to speed up delivery, cope with rapid scale-up and scale-down demands, improve resource utilization and business operation and maintenance efficiency, and reasonably and effectively reduce enterprise IT operation costs. In 2018, Kubernetes-based resource management and scheduling was introduced to further improve the efficiency of resource use.

Initially, Meituan-Dianping built elastic scaling itself on top of Docker container technology, mainly to address the shortcomings of the virtualization-based management and deployment mechanism in responding to rapid service scale-up and scale-down: resource instances were slow to create, operating environments could not be unified, instance deployment and delivery cycles were long, resource reclamation was inefficient, and elasticity was poor. After investigation and testing, combined with practical experience in the industry, we decided to develop our own cluster management and scheduling system based on Docker container technology to cope effectively with rapid scaling demands and improve resource utilization. We called it HULK, and this phase can be seen as HULK 1.0.

Later, through continuous exploration and trials in the production environment, we gradually realized that elastic scaling alone was not enough, and that cost and efficiency would certainly become even harder problems in the future. Drawing on two years of development and operation experience with HULK 1.0, we further optimized and improved the architecture and supporting systems, and empowered HULK with the strength of the ecosystem and open source: we introduced the open-source cluster management and scheduling system Kubernetes, hoping to further improve the efficiency and stability of cluster management and operation while also reducing resource costs. We therefore moved from a self-developed platform to the open-source Kubernetes system and built a more intelligent cluster management and scheduling system on top of it: HULK 2.0.

Architecture overview

At the architectural level, how HULK 2.0 should be layered and decoupled from the upper-layer business and the underlying Kubernetes platform was a priority for us from the very start of the design. We wanted it to be business-friendly and to make the most of Kubernetes' scheduling capabilities, so that the business layer and users do not need to worry about resource details and simply get what they request. At the same time, the publishing, configuration, billing, and load logic is decoupled from the underlying Kubernetes platform, which remains compatible with the native Kubernetes API for accessing the cluster. In this way, the complex, diverse, and inconsistent management requirements of Meituan-Dianping's infrastructure can be met with unified, mainstream, industry-standard approaches.

Architecture introduction

From top to bottom: Meituan-Dianping's cluster management and scheduling platform serves the whole company, including the main business lines, the unified Ops platform, and the Portal platform. HULK cannot customize interfaces and solutions for every platform, so the various businesses and requirements must be abstracted and converged, with the details of the HULK system finally hidden behind the HULK API, decoupling HULK from the upper business side. The HULK API is an abstraction of business-layer and resource requirements, and it is the only way for the outside world to access HULK.

With the upper layer resolved, let's look at decoupling from the lower-layer Kubernetes platform. After HULK receives a resource request from the upper layer, it first performs a series of initialization steps, including parameter verification, resource quota checks, and IP and Hostname allocation; it then asks the Kubernetes platform to allocate machine resources and finally delivers them to the user. The Kubernetes API further converges and transforms the resource requirements, allowing us to leverage Kubernetes' strengths in resource management. The Kubernetes API layer is intended to converge HULK's resource management logic and align it with the industry mainstream. In addition, because it is fully compatible with the Kubernetes API, it lets us draw on the power of the community and ecosystem to build and explore together.
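To make the layering concrete, here is a minimal sketch of the scale-up path from the HULK API down to the Kubernetes API. It is an illustration only, not HULK's real code: every type and function name here (ScaleUpRequest, HandleScaleUp, checkQuota, allocateIPAndHostname) is hypothetical.

```go
package hulk

import (
	"context"
	"fmt"
)

// ScaleUpRequest is a hypothetical, simplified HULK API request.
type ScaleUpRequest struct {
	AppKey   string // business application identifier
	Replicas int    // number of instances to add
	CPU      int    // requested CPU cores per instance
	MemoryGB int    // requested memory per instance
}

// KubernetesClient abstracts the native Kubernetes API that HULK talks to.
type KubernetesClient interface {
	CreatePods(ctx context.Context, appKey string, replicas, cpu, memGB int) error
}

// HandleScaleUp shows the initialization steps described above: validate
// parameters, check quota, allocate IP/Hostname, then hand the converged
// resource request to the Kubernetes layer.
func HandleScaleUp(ctx context.Context, req ScaleUpRequest, kube KubernetesClient) error {
	if req.Replicas <= 0 || req.CPU <= 0 || req.MemoryGB <= 0 {
		return fmt.Errorf("invalid scale-up parameters: %+v", req)
	}
	if err := checkQuota(req.AppKey, req.Replicas); err != nil {
		return err // resource quota check
	}
	if err := allocateIPAndHostname(req.AppKey, req.Replicas); err != nil {
		return err // IP and Hostname allocation before delivery
	}
	// Only now does the request cross the Kubernetes API boundary.
	return kube.CreatePods(ctx, req.AppKey, req.Replicas, req.CPU, req.MemoryGB)
}

// Stubs standing in for HULK's internal services.
func checkQuota(appKey string, replicas int) error            { return nil }
func allocateIPAndHostname(appKey string, replicas int) error { return nil }
```

The point of the sketch is the boundary: everything above the CreatePods call is HULK's own initialization logic, and only the converged request crosses the Kubernetes API.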

As you can see, the HULK API and Kubernetes API divide our entire system into three layers, which allows each layer to focus on its own modules.

Kubernetes management and practice

Why Kubernetes? Kubernetes is not the only cluster management platform on the market (others include Docker Swarm and Mesos). We chose Kubernetes because it is not a point solution but a platform and a capability. This capability lets us extend it based on Meituan-Dianping's actual situation while depending on and reusing years of technical accumulation. It gives us more freedom of choice: we can deploy applications quickly without facing the risks of a traditional platform, scale applications dynamically, and apply better resource allocation policies.

As the resource management and scheduling platform for the whole HULK system, the Kubernetes cluster must meet requirements for stability and scalability, risk control, and cluster throughput.

Cluster Operation Status

  • Cluster size: 100,000+ instances online, deployed across multiple regions, and growing rapidly.
  • Service monitoring alarms: the cluster collects application startup and status data. Container-init automatically integrates service monitoring information, which is pluggable and configurable.
  • Resource health alarms: collects key data on Nodes, Pods, and Containers from a resource perspective and detects their status in a timely manner, such as Node unavailability or Container restarts.
  • Periodic inspection and reconciliation: automatically checks the status of all hosts every day, including remaining disk space (data volumes), the number of D-state processes, and host status, and reconciles AppKey scale-up data with the actual Pod and container data to discover inconsistencies in time.
  • Cluster data visualization: visualizes the current cluster status, including host resource status, number of services, number of Pods, containerization rate, service status, and scale-up/scale-down data. It also provides entry points for interface service configuration, host offlining, and Pod migration operations.
  • Capacity planning and prediction: detects cluster resource status and prepares resources in advance. Traffic and peak times are predicted based on rules and machine learning to ensure services run normally, stably, and efficiently.

Kubernetes optimization and transformation

Kube-scheduler performance optimization

Some of our clusters were running the scheduler from version 1.6. As cluster scale kept growing, the performance and stability problems of the older Kubernetes scheduler (versions before 1.10) became increasingly prominent: the scheduler's throughput was low, service scale-ups failed due to timeouts, and scheduling a single Pod took about 5s. The Kubernetes scheduler is a queue-based scheduler; once too many Pods are waiting during a scale-up peak, subsequent Pods time out. To address this, we optimized the scheduler's performance substantially and achieved a very noticeable improvement. According to verification in our production environment, performance improved by more than 400%.

The Kubernetes scheduler works as follows:

(Figure: Kubernetes scheduler workflow; image from the Internet)

Pre-selection failure interrupt mechanism

When judging whether a Node can serve as the target machine, a scheduling pass is mainly divided into three stages:

  • Pre-selection stage: hard conditions; nodes that do not meet the conditions are filtered out. This process is called Predicates. The Predicates are a set of filtering conditions evaluated in a fixed order, and if any Predicate fails, the Node is discarded.
  • Priority stage: soft conditions; the nodes that pass pre-selection are ranked. This stage is called Priorities. Each Priority is a scoring factor with a certain weight.
  • Selection stage: the node with the highest score is chosen from the ranked list; this is called Select. The selected Node is the machine on which the Pod will eventually be deployed.

A deeper analysis of the scheduling process shows that in the pre-selection stage, even when the scheduler already knows that the current Node fails some filtering condition, it still goes on to evaluate the remaining conditions. With tens of thousands of nodes, this wasted computation is an important reason for the scheduler's poor performance.

We therefore proposed a "pre-selection failure interrupt mechanism": as soon as one pre-selection condition is not met, the Node is abandoned immediately and the remaining pre-selection conditions are not evaluated, which greatly reduces the amount of computation and greatly improves scheduling performance. As shown below:
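The following is a minimal sketch of that short-circuit, not the actual kube-scheduler source; the Pod, Node, and Predicate types are simplified stand-ins.

```go
package scheduler

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct{ Name string }
type Node struct{ Name string }

// Predicate is a simplified hard filtering condition on a node.
type Predicate func(pod Pod, node Node) bool

// podFitsOnNode evaluates predicates in their fixed order.
// With the failure-interrupt optimization, the loop returns as soon as
// one predicate fails instead of evaluating the remaining ones.
func podFitsOnNode(pod Pod, node Node, predicates []Predicate, alwaysCheckAllPredicates bool) bool {
	fits := true
	for _, pred := range predicates {
		if !pred(pod, node) {
			fits = false
			if !alwaysCheckAllPredicates {
				break // short-circuit: skip the remaining predicates
			}
			// alwaysCheckAllPredicates=true keeps the old behaviour of
			// evaluating every predicate even after a failure.
		}
	}
	return fits
}
```

With alwaysCheckAllPredicates left false, a Node that fails its first predicate costs only one check instead of the full predicate chain.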

We contributed this optimization back to the Kubernetes community (see the PR for more details), adding the alwaysCheckAllPredicates policy option; the short-circuit behaviour was released as the default predicates policy in Kubernetes 1.10. You can still restore the original behaviour by setting alwaysCheckAllPredicates=true.

In actual tests, this change improved scheduler performance by at least 40%. If you are currently using a Kube-Scheduler version below 1.10, we recommend trying to upgrade to the new version.

Locally optimal solution

For optimization problems, we always hope to find the globally optimal solution or strategy. But when the problem is too complex, with too many factors to consider and too much information to process, we tend to accept a locally optimal solution, because a local optimum is not necessarily bad. In particular, when we have criteria indicating that a solution is acceptable, we usually accept the locally optimal result. Considering cost, efficiency, and other factors, this is the strategy we actually take in real projects.


In the original scheduling strategy, the scheduler traverses all nodes in the cluster every time in order to find the optimal Node, which is called the BestFit algorithm in scheduling. In a production environment, however, choosing the optimal Node versus a near-optimal one makes little difference; sometimes we even deliberately avoid the optimal Node (for example, to keep applications from being repeatedly created on a freshly added machine, our cluster randomizes among top candidates). In other words, a locally optimal solution is enough to satisfy the demand.

Assume the cluster has 1,000 nodes and 700 of them can pass the Predicates (pre-selection) stage when scheduling Pod A. The original approach iterates over all nodes, finds those 700, and then picks the optimal Node X by score ranking. With the local-optimum algorithm, we instead consider the demand satisfied as long as we find N feasible nodes and select the highest-scoring Node among them. For example, by default we might stop once 100 nodes have passed Predicates and choose the best among those 100. The globally optimal Node X may not be among them, but we can still pick the best Node Y from the 100. In the best case we only have to traverse about 100 nodes to collect the sample; in other cases it might take 200 or 300, and so on. Either way, the computation time drops dramatically without much impact on scheduling results.

The local-optimum strategy was completed jointly with the community, and also involved details of how to keep scheduling fair and how to optimize the computation (see PR1 and PR2 for details). This optimization was released in Kubernetes 1.12 and is now the default scheduling strategy; it greatly improves scheduling performance, and the effect is especially obvious in large clusters.
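A rough sketch of the sampling idea follows, reusing the simplified Pod, Node, and Predicate types from the pre-selection sketch above; it is an illustration only, not the real implementation (upstream kube-scheduler exposes this behaviour through its percentageOfNodesToScore setting).

```go
package scheduler

// feasibleSample stops the pre-selection pass once enough feasible nodes
// have been found, instead of filtering every node in the cluster.
// numNodesToFind would be derived from something like the
// percentageOfNodesToScore setting in the real scheduler configuration.
func feasibleSample(pod Pod, nodes []Node, predicates []Predicate, numNodesToFind int) []Node {
	feasible := make([]Node, 0, numNodesToFind)
	for _, node := range nodes {
		if podFitsOnNode(pod, node, predicates, false) {
			feasible = append(feasible, node)
			if len(feasible) >= numNodesToFind {
				break // local optimum: score only this sample of nodes
			}
		}
	}
	return feasible
}
```

The upstream implementation additionally rotates its starting point across scheduling cycles so that, over time, all nodes are considered, which is the fairness detail mentioned above.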

Kubelet transformation

Risk control

As mentioned earlier, stability and risk control are crucial for large-scale cluster management. Architecturally, Kubelet is the cluster management component closest to the real business. The community version of Kubelet has considerable autonomy in node resource management. Imagine a business that is running normally, but Kubelet kills its container because of an eviction policy. This must not happen in our clusters, so we need to converge and restrain Kubelet's autonomous decision-making: any operation on a business container on a machine should be initiated from the upper-layer platform.

Container Restart Policy

Kernel upgrades are a routine operation. When we upgraded the kernel by restarting the host, we found that the containers on it either could not heal themselves after the restart or healed into the wrong version, which upset the business and put a lot of pressure on operations. We added Reuse and Rebuild policies to Kubelet so that system disk and data disk information is preserved and containers can heal themselves after the host restarts.

IP state retention

Based on Meituan-Dianping's network environment, we developed our own CNI plugin and allocate and reuse IPs keyed on a Pod's unique identity. An application's IP can be reused even after Pod migration or container restart, which brings many benefits for business rollout and operations.

Restricted eviction strategy

Kubelet has automatic node repair capabilities, such as evicting and deleting abnormal or non-compliant containers. This is too risky for us; we tolerate containers being non-compliant on minor points. For example, when Kubelet finds that the number of containers on the current host exceeds the configured maximum, it selects and deletes some containers. Although this rarely happens under normal circumstances, we still need to control it to reduce the risk.

Scalability

Resource allocation

In terms of Kubelet's extensibility, we enhanced the controllability of resources: for example, binding containers to NUMA nodes to improve application stability, setting CPU shares per container according to the application's level to adjust scheduling weight, binding containers to CPU sets, and so on.
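As an illustration of what this looks like at the host level, here is a minimal sketch that writes cgroup v1 files directly; the container ID, cgroup paths (a Docker cgroupfs layout is assumed), and values are hypothetical, and the real Kubelet changes are more involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// bindCPUSet pins a container's cgroup to a fixed set of CPUs and NUMA
// memory nodes, and adjusts its CPU shares (scheduling weight).
func bindCPUSet(containerID, cpus, memNodes string, cpuShares int) error {
	cpusetDir := filepath.Join("/sys/fs/cgroup/cpuset/docker", containerID)
	cpuDir := filepath.Join("/sys/fs/cgroup/cpu/docker", containerID)

	// Restrict the container to specific CPUs (CPU set binding).
	if err := os.WriteFile(filepath.Join(cpusetDir, "cpuset.cpus"), []byte(cpus), 0644); err != nil {
		return err
	}
	// Keep memory on the same NUMA node to avoid cross-node access.
	if err := os.WriteFile(filepath.Join(cpusetDir, "cpuset.mems"), []byte(memNodes), 0644); err != nil {
		return err
	}
	// Adjust the CPU scheduling weight according to application level.
	return os.WriteFile(filepath.Join(cpuDir, "cpu.shares"),
		[]byte(fmt.Sprintf("%d", cpuShares)), 0644)
}

func main() {
	// Hypothetical example: pin a container to CPUs 0-7 on NUMA node 0.
	if err := bindCPUSet("abc123", "0-7", "0", 2048); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```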

Container enhancement

We allow services to configure their containers, extending parameters such as ulimit, IO limits, PID limits, and swap to their containers, and we strengthen the isolation between containers.

Application in-place upgrade

As is well known, by default Kubernetes rebuilds and replaces a Pod whenever key information in it changes, such as the image. This is very costly in a production environment: on the one hand, the IP and Hostname change; on the other hand, frequent rebuilds put more pressure on cluster management and may even cause scheduling failures. To solve this problem, we developed a top-down in-place application upgrade feature that allows application information to be modified dynamically and efficiently and the application to be upgraded in place on the same host.
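For comparison with native Kubernetes: the image field is one of the few Pod fields that can be modified on a running Pod, in which case Kubelet restarts only that container while the Pod keeps its name and IP. Below is a minimal client-go sketch of such a patch, with hypothetical namespace, Pod, container, and image names; it is not the HULK in-place upgrade implementation, which covers far more than the image field.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// patchPodImage updates one container image on an existing Pod in place.
// Kubelet restarts only that container; the Pod keeps its name and IP.
func patchPodImage(ctx context.Context, client kubernetes.Interface,
	namespace, podName, containerName, newImage string) error {
	patch := fmt.Sprintf(
		`{"spec":{"containers":[{"name":%q,"image":%q}]}}`,
		containerName, newImage)
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, podName, types.StrategicMergePatchType,
		[]byte(patch), metav1.PatchOptions{})
	return err
}

func main() {
	// Load a kubeconfig from the default home location (illustrative only).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := patchPodImage(context.Background(), client,
		"default", "demo-pod", "app", "registry.example.com/app:v2"); err != nil {
		panic(err)
	}
}
```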

Image distribution

Image distribution is an important factor affecting container scale-up time. We have taken a series of measures to optimize image distribution for efficiency and stability:

  • Cross-site synchronization: ensures that servers can always pull images from the nearest image registry during scale-up, reducing pull time and cross-site bandwidth consumption.
  • Pre-distribution of base images: Meituan-Dianping's base images are the shared images from which service images are built. The business image layer contains the application code and is usually much smaller than the base image. If the base image is already local during scale-up, only the business image needs to be pulled, which greatly speeds up scale-up. To achieve this, we pre-distribute base images to all servers.
  • P2P image distribution: even with base-image pre-distribution, in some scenarios thousands of servers pull images from the registry at the same time, putting great pressure on the registry service and the bandwidth. We therefore developed P2P image distribution: a server can pull image pieces not only from the registry but also from other servers that already hold them, as shown in the sketch after this list.
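A much-simplified sketch of the pull-path selection described above; the registry addresses and peer lookup are hypothetical, and the real P2P distribution works at the image-layer/fragment level.

```go
package main

import "fmt"

// selectPullSource decides where a server fetches an image from: prefer the
// same-site registry mirror, fall back to P2P peers that already hold the
// image, and only then go to the central registry.
func selectPullSource(site string, siteMirrors map[string]string, peersWithImage []string) string {
	if mirror, ok := siteMirrors[site]; ok {
		return mirror // nearest mirror: lowest latency, no cross-site bandwidth
	}
	if len(peersWithImage) > 0 {
		return peersWithImage[0] // P2P peer: offloads the registry during mass scale-up
	}
	return "registry.example.com" // central registry as a last resort
}

func main() {
	mirrors := map[string]string{"beijing": "registry-bj.example.com"}
	fmt.Println(selectPullSource("beijing", mirrors, nil))
}
```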

Resource management and optimization

Key optimization technologies

  • Service portrait: profiles an application's CPU, memory, network, disk, and network I/O capacity and load to understand its characteristics, resource specifications, application type, and actual resource usage at different times, then analyzes correlations along the service and time dimensions to optimize overall scheduling and deployment.
  • Affinity and mutual exclusion: applications whose combination lowers overall resource consumption while raising throughput have affinity with each other; conversely, applications that compete for resources or interfere with each other are mutually exclusive.
  • Scenario priority: most of Meituan-Dianping's businesses run in relatively stable scenarios, so dividing them by scenario is necessary. For example, one type of service is very sensitive to latency and cannot tolerate much resource competition even at peak times; for this scenario, competition-induced latency should be avoided and reduced, and sufficient resources guaranteed. Another type of business may need to burst beyond the CPU ceiling of its configured specification in certain periods; using CPU sets, we let this kind of business share a pool of resources so that it can break through the per-instance resource limits. The service not only achieves higher performance but also uses otherwise idle resources, further improving resource utilization.
  • Elastic scaling: application deployment includes traffic prediction, automatic scaling, rule-based high/low-watermark scaling, and machine-learning-based scaling, as sketched after this list.
  • Fine-grained resource allocation: fine-grained resource scheduling and allocation based on resource sharing and isolation technologies, such as NUMA binding, task priority, and CPU sets.
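Here is a highly simplified sketch of the rule-based part of that scaling; all thresholds and names are hypothetical, and the traffic prediction and machine-learning parts are not shown.

```go
package main

import "fmt"

// ScaleRule is a hypothetical high/low watermark rule for one application.
type ScaleRule struct {
	HighCPUUtil float64 // scale out above this average CPU utilization
	LowCPUUtil  float64 // scale in below this average CPU utilization
	Step        int     // instances to add or remove per decision
	Min, Max    int     // replica bounds
}

// decideReplicas applies the rule to the current state and returns the
// desired replica count.
func decideReplicas(current int, cpuUtil float64, r ScaleRule) int {
	desired := current
	switch {
	case cpuUtil > r.HighCPUUtil:
		desired = current + r.Step
	case cpuUtil < r.LowCPUUtil:
		desired = current - r.Step
	}
	if desired < r.Min {
		desired = r.Min
	}
	if desired > r.Max {
		desired = r.Max
	}
	return desired
}

func main() {
	rule := ScaleRule{HighCPUUtil: 0.7, LowCPUUtil: 0.3, Step: 2, Min: 2, Max: 50}
	fmt.Println(decideReplicas(10, 0.85, rule)) // prints 12: scale out by two
}
```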

Strategy optimization

The scheduling strategy serves two main purposes: first, to place instances on target machines according to established policies; second, to achieve an optimal arrangement of cluster resources.

  • Affinity: affinity exists between applications that call or depend on each other, or applications whose combination lowers resource consumption while raising throughput. Our CPU set approach uses affinity constraints on CPU preference so that applications with different CPU preferences complement each other.
  • Mutual exclusion: in contrast to affinity, applications that compete with or interfere with each other are deployed separately during scheduling (see the sketch after this list).
  • Application priority: dividing applications by priority lays the groundwork for resolving resource competition. At present, when containers compete for resources we cannot decide who should yield; with application priority, the scheduling layer can limit the number of important applications on a single host, reducing single-machine resource competition and making it possible for the single-machine layer to resolve it. At the host layer, resources are allocated according to application priority, guaranteeing sufficient resources for important applications while still running low-priority ones.
  • Scatterability: application scattering is mainly for disaster recovery and is divided into different levels; we provide different scattering granularities, including host, Tor, machine room, Zone, and so on.
  • Isolation and exclusivity: a special class of applications that must be deployed alone on a dedicated host or in a virtual-machine-isolated environment, such as the search team's services.
  • Special resources: special resources satisfy the special hardware requirements of certain services, such as GPUs, SSDs, and special NICs.
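As a point of comparison with native Kubernetes, mutual exclusion and host-level scattering can both be expressed with Pod anti-affinity. Below is a minimal client-go object sketch; the appkey label and the function are hypothetical, and HULK's own scheduler implements these constraints in its own way.

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityForApp spreads replicas of one application across hosts
// (scatterability) and keeps it away from a competing application
// (mutual exclusion), using native Pod anti-affinity.
func antiAffinityForApp(appKey, competitorAppKey string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					// Do not co-locate two replicas of the same app on one host.
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"appkey": appKey},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
				{
					// Keep away from an application it competes with.
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"appkey": competitorAppKey},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
}
```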

Online cluster optimization

Unlike offline clusters, where resource demand can be predicted and very good results achieved, online clusters face unknown future demand, so it is difficult to reach the same quality of resource layout. To address online cluster optimization, we adopted a series of measures from upper-layer scheduling down to low-level resource usage.

  • NUMA binding: resolves unstable service performance by binding an application's CPU and memory to the most appropriate NUMA node, reducing cross-node access overhead and improving application performance.
  • CPU sets: bind a group of applications with complementary characteristics to the same group of CPUs so that they fully use the CPU resources.
  • Staggered application peaks: stagger application peaks based on service-portrait data to reduce resource competition and mutual interference and improve the service SLA.
  • Rescheduling: optimizes resource placement to deliver better service performance and SLA with fewer resources, and solves fragmentation to improve the resource allocation rate.
  • Interference analysis: identifies abnormal containers based on service monitoring data and container information, improving the service SLA and discovering and handling abnormal applications.

Conclusion

At present, we are actively exploring the following aspects:

  • Hybrid deployment of online and offline services improves resource utilization efficiency.
  • Intelligent scheduling that is aware of service traffic and resource usage, to improve the service SLA.
  • High performance, strong isolation and more secure container technology.

About the author

Guo Liang, senior engineer of Cluster Scheduling Center of Meituan-Dianping Basic R&D platform.

Recruitment information

The Meituan-Dianping Cluster Scheduling Center is committed to building an efficient, industry-leading cluster management and scheduling platform: through an enterprise-grade cluster management platform, it builds industry-leading cloud solutions, improves cluster management capability and stability, reduces IT costs, and accelerates the company's innovation and development. As Kubernetes has become the de facto industry standard, Meituan-Dianping is gradually embracing the community, participating in open source, and has made great progress in the field of cluster scheduling. We look forward to working with colleagues in the industry to improve cluster management and scheduling capabilities together, reduce IT costs across the whole industry, and innovate and develop collaboratively. The Meituan-Dianping basic R&D platform is looking for talent in cluster management and scheduling, elastic scheduling, Kubernetes, and the Linux kernel. If you are interested, please send your resume to [email protected].