Author | Shang Zhimin, Senior Technical Expert, Alibaba Cloud Container Service

During the 2019 Double 11, Container Service ACK supported the containerization of Alibaba's internal core systems and of Alibaba Cloud's own cloud products, and made Alibaba's years of large-scale container technology available, as productized capabilities, to many ecosystem companies participating in Double 11. By supporting container clouds across industries worldwide, Container Service has accumulated the capability to host cloud native applications on a managed platform with a unitized global architecture and a flexible architecture, managing more than 10,000 container clusters. This article introduces Container Service's experience in managing Kubernetes clusters at this scale.

What is Massive Kubernetes cluster management?

You may have seen shared best practices on how Alibaba manages a single Kubernetes cluster with 10,000 nodes. Managing one very large cluster is an interesting challenge, but the massive Kubernetes cluster management discussed here focuses on how to manage more than 10,000 Kubernetes clusters of different specifications. From our conversations with peers, an enterprise typically only needs to manage several to dozens of Kubernetes clusters, so why do we need to consider managing such a large number of clusters?

  • First, Container Service ACK is a cloud product on Alibaba Cloud that provides Kubernetes as a Service to customers worldwide. It is currently available in 20 regions around the world.

  • Second, with the development of the cloud native era, more and more enterprises are embracing Kubernetes, and Kubernetes has gradually become the infrastructure of the cloud native era, the platform of platforms.

Background

First let’s take a look at the pain points of hosting these Kubernetes clusters:

1. Different cluster types: standard, serverless, AI, bare metal, edge, Windows, and other Kubernetes clusters. Different cluster types have different parameters, components, and hosting requirements, and we need to support increasingly vertical-specific Kubernetes offerings;

2. Different cluster sizes: clusters range from a few nodes to tens of thousands of nodes, and from a handful of services to thousands of services, and we need to support the number of clusters growing severalfold every year;

3. Cluster security and compliance: Kubernetes clusters distributed across different geographies and environments must satisfy different compliance requirements. For example, Kubernetes clusters in Europe need to comply with the EU's GDPR, while China's financial industry and government clouds have additional requirements such as classified security protection;

4. Continuous cluster evolution: new Kubernetes versions and features need to be supported continuously.

Design objectives:

  1. Support unitized, gear-based (tiered) management, capacity planning, and water level (utilization) management;
  2. Support global deployment, distribution, disaster recovery, and observability;
  3. Support a flexible architecture that is pluggable, customizable, and evolves continuously like building blocks.

1. Support unitized, gear-based management, capacity planning, and water level management

Unitization

When unitization is mentioned, people generally think of scenarios such as a single data center running out of capacity, or "two sites, three centers" disaster recovery. What does unitization have to do with Kubernetes management?

For us, a single region (such as Hangzhou) may host thousands of user Kubernetes clusters, whose lifecycles need to be managed uniformly. As a team specializing in Kubernetes, the natural idea is to manage these guest K8s masters through multiple Kubernetes meta-clusters, where each Kubernetes meta-cluster forms the boundary of a cell (unit).

We have all heard stories of services failing because a data center's fiber was cut or its power was interrupted. Container Service ACK was designed from the beginning with an intra-city multi-active architecture: the master components of any user Kubernetes cluster are automatically spread across multiple data centers, so cluster stability is not affected by the failure of a single data center. On the other hand, to keep communication between master components stable, ACK's scheduling policy, while spreading the master components, also tries to keep the communication latency between them within milliseconds.
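As a rough illustration only (this is not ACK's actual internal scheduling policy; all names, labels, and the image are placeholders), spreading a guest cluster's apiserver replicas across zones inside a meta-cluster could be expressed with standard Kubernetes pod anti-affinity:

```yaml
# Hypothetical sketch: spread a guest cluster's apiserver replicas across zones
# inside a meta-cluster. All names, labels, and images are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guest-cluster-apiserver
spec:
  replicas: 3
  selector:
    matchLabels:
      app: guest-cluster-apiserver
  template:
    metadata:
      labels:
        app: guest-cluster-apiserver
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: guest-cluster-apiserver
            topologyKey: topology.kubernetes.io/zone   # at most one replica per zone
      containers:
      - name: kube-apiserver
        image: registry.example.com/kube-apiserver:v1.16.0   # placeholder image
```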

Gear shifting

As we all know, the load on a Kubernetes cluster's master components is mainly determined by the cluster's node scale, the number of components on the worker side that interact with kube-apiserver (such as controllers or workloads), and their call frequency. Across tens of thousands of Kubernetes clusters, each user cluster's size and business profile differ greatly, so we cannot use a single standard configuration to manage all user Kubernetes clusters.

At the same time, from a cost point of view, we provide a more flexible and more intelligent hosting capability. Since different resource types put different load pressure on the master, we assign a different weight factor to each resource type. From these we derive a computing formula that calculates which gear (tier) each user cluster's master should run in. We also continuously tune these factor values and the formula based on the real-time metrics of our unified Kubernetes monitoring platform, enabling intelligent and smooth gear shifting.
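The article does not publish the concrete formula or factor values; as a purely hypothetical illustration, such a gear model could be declared roughly as follows, where the score is a weighted sum of object counts and every weight and threshold is made up for the example:

```yaml
# Hypothetical gear model (all weights and thresholds are made-up examples).
# masterScore = sum over resource types of (object count x weight)
factors:
  nodes: 1.0
  pods: 0.1
  services: 0.2
  configmaps: 0.05
gears:
  small:  { maxScore: 500 }      # smallest master resource profile
  medium: { maxScore: 5000 }
  large:  { maxScore: 50000 }    # beyond this, capacity is planned individually
```

A background controller can then periodically re-evaluate the score from live monitoring data and move a master between gears smoothly.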

Capacity planning

For the Kubernetes meta-cluster capacity model: how many user Kubernetes cluster masters can a single meta-cluster host?

  • First, confirm the container network plan. Here we choose Terway, a high-performance container network developed by Alibaba Cloud. On the one hand, it connects the user VPC with the hosted master through ENIs (elastic network interfaces); on the other hand, it provides high performance and rich security policies.

  • Then, plan network segments for Nodes, Pods, and Services based on the IP resources available in the VPC (a generic sketch of such a plan follows this list).

  • Finally, based on statistical rules and a comprehensive consideration of cost, density, performance, resource quotas, gear ratios, and other factors, we decide how many guest Kubernetes clusters of each gear can be deployed in each meta-cluster cell, and reserve a 40% water level as headroom.
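As a generic illustration of where such segment planning ends up (this is a plain kubeadm-style example, not ACK's provisioning interface; all CIDR values are placeholders chosen so that node, Pod, and Service ranges do not overlap within the VPC plan):

```yaml
# Generic kubeadm-style example; with an ENI-based network such as Terway,
# Pod IPs would typically come directly from VPC vSwitches instead.
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  serviceSubnet: 172.21.0.0/20   # Service CIDR
  podSubnet: 10.0.0.0/16         # Pod CIDR
  dnsDomain: cluster.local
```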

2. Support global deployment, distribution, disaster recovery, and observability

Container Service is already available in 20 regions around the world, and we provide fully automated deployment, distribution, disaster recovery, and observability capabilities; here we focus on global observability across data centers.

Global observability across data centers

The observability of a globally distributed, large-scale cluster fleet is critical to its day-to-day stability and security. The key to the observability design is how to collect the real-time status metrics of the target clusters in every data center efficiently, reasonably, securely, and scalably, despite the complicated network environment.

We need to account for collecting observability data at the scale of regional data centers and individual cells, as well as visualizing a global view. Based on this design concept and these requirements, global observability has to be built as a multi-level combination: the edge layer sinks observability into each cluster to observe what is needed locally; the middle layer aggregates monitoring data within a region; and the central layer pools everything to form the global view and drive alerting.

The advantage of this design is that each level can be expanded and adjusted flexibly, which suits the ever-growing cluster scale: the other levels only need parameter adjustments, the hierarchy stays clear, and the simple network structure allows data on internal networks to be transmitted to and aggregated over the public network.

A well-designed monitoring system for this globally distributed fleet is crucial to its efficient operation. Our design concept is to collect and aggregate the data of every data center worldwide in real time, providing a global view, data visualization, fault location, and alert notification.

In the cloud native era, Prometheus, the second project hosted by the CNCF, is a natural fit for container scenarios. Combined with Kubernetes, it provides service discovery and monitoring of dynamically scheduled services, which gives it a clear advantage over other monitoring schemes, and it has become the de facto standard for container monitoring. We therefore chose Prometheus as the basis for our solution.

For each cluster, the following metrics need to be collected:

  1. OS metrics, such as node resource utilization (CPU, memory, disk, etc.) and network throughput;
  2. Metrics from kube-apiserver, kube-controller-manager, kube-scheduler, and other master components;
  3. Kubernetes cluster state collected by kube-state-metrics, and container metrics from cAdvisor;
  4. etcd metrics, such as etcd write latency, DB size, throughput between peers, and so on (a sample alert-rule sketch for these follows this list).
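As a small sketch (rule names and thresholds are illustrative, not the rules we actually ship), the etcd portion of these metrics might be guarded in an edge Prometheus with rules like:

```yaml
# Illustrative etcd alerting rules for an edge Prometheus; thresholds are examples only.
groups:
- name: etcd.rules
  rules:
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "etcd member {{ $labels.instance }} has no leader"
  - alert: EtcdDbSizeHigh
    expr: etcd_mvcc_db_total_size_in_bytes > 4 * 1024 * 1024 * 1024
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "etcd DB size on {{ $labels.instance }} exceeds 4 GiB"
```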

After the global data is aggregated, Alertmanager connects to Prometheus and drives various alert notification channels, such as DingTalk, email, and SMS.
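A minimal Alertmanager configuration sketch for this, where every address and the DingTalk webhook bridge URL are placeholders (SMS would typically go through another gateway or webhook):

```yaml
# Minimal illustrative alertmanager.yml; all endpoints and addresses are placeholders.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
route:
  receiver: default
  group_by: ['alertname', 'region', 'meta_cluster']
receivers:
- name: default
  email_configs:
  - to: 'oncall@example.com'
  webhook_configs:
  - url: 'http://dingtalk-webhook.monitoring.svc:8060/dingtalk/ops/send'  # DingTalk via a webhook bridge
```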

Monitoring and alerting architecture

To distribute the monitoring burden reasonably across multiple levels of Prometheus and still achieve global aggregation, we use Prometheus Federation. In a federated setup, a separate Prometheus is deployed in each data center to collect that data center's monitoring data, and a central Prometheus is responsible for aggregating the monitoring data from multiple data centers.

Based on Federation, the global monitoring architecture is designed as follows. It consists of three parts: the monitoring system, the alerting system, and the display (dashboard) system.

Viewed along the aggregation path, from meta-cluster monitoring up to central monitoring, the monitoring system is a tree structure with three layers:

  1. Edge Prometheus

An edge Prometheus is deployed inside each meta-cluster to monitor the metrics of both the meta-cluster Kubernetes and the user Kubernetes clusters efficiently, while avoiding complex network configuration.

  2. Cascading Prometheus

A cascading Prometheus aggregates monitoring data from multiple regions. There is one cascading Prometheus in each large geographic area, such as China, Europe, the Americas, and Asia, and each large area contains several specific regions, such as Beijing, Shanghai, and Tokyo. As cluster scale grows within a large area, it can be split into new large areas while always keeping one cascading Prometheus per area, enabling flexible expansion and evolution of the architecture.

  3. Central Prometheus

The central Prometheus connects to all cascading Prometheus instances for final data aggregation, the global view, and alerting. To improve reliability, the central Prometheus is deployed active-active: two central Prometheus instances are located in different availability zones, both connected to the same lower-level cascading Prometheus instances.
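As a sketch of how one level federates from the level below (job names, addresses, and match selectors are illustrative, not our production configuration), a cascading or central Prometheus pulls from its children via the /federate endpoint:

```yaml
# Illustrative prometheus.yml fragment for a cascading Prometheus.
scrape_configs:
- job_name: federate-edge
  honor_labels: true               # keep the labels set by the lower-level Prometheus
  metrics_path: /federate
  params:
    'match[]':
    - '{job=~"kubernetes-.*"}'     # Kubernetes-related series from each edge
    - '{job="etcd"}'
  static_configs:
  - targets:
    - edge-prometheus.cn-hangzhou.example.com:9090
    - edge-prometheus.cn-beijing.example.com:9090
```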

Figure 2-1 Global multi-level monitoring architecture based on Prometheus Federation

Optimization strategy

1. Separate monitoring data traffic from API server traffic

The API server's proxy function allows Pods, Nodes, or Services inside a Kubernetes cluster to be accessed from outside the cluster through the API server.

Figure 3-1 Accessing Pod resources in the Kubernetes cluster in API Server proxy mode

A common way to expose in-cluster Prometheus metrics outside a Kubernetes cluster is through the API server's proxy function. The advantage is that the API server's existing port 6443 can be reused to expose the data, which is easy to manage; the disadvantage is equally obvious: it increases the load on the API server.

Considering that customer clusters and nodes will keep growing as the product is sold, the API server proxy model would keep adding pressure and risk to the API server. For this reason, we add a dedicated LoadBalancer Service for Prometheus monitoring traffic, separating it from API server traffic. Even as the number of monitored objects keeps increasing, the API server does not incur the overhead of the proxy function.
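A minimal sketch of such a Service (the name, namespace, selector, and port are illustrative; in practice an internal load balancer annotation would usually be added depending on the network setup):

```yaml
# Illustrative LoadBalancer Service exposing the in-cluster Prometheus for federation.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-federate
  namespace: monitoring
spec:
  type: LoadBalancer
  selector:
    app: prometheus
  ports:
  - name: web
    port: 9090
    targetPort: 9090
```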

2. Collect only specified metrics

The central Prometheus collects only the metrics it actually needs; it must not scrape all metrics, otherwise the excessive network transmission pressure would lead to data loss.
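For example (metric names are illustrative), the 'match[]' selectors of the central federation job shown earlier can be narrowed from whole jobs down to the specific series that are actually used:

```yaml
# Narrowed 'match[]' selectors for the central federation job; metric names are examples.
params:
  'match[]':
  - 'up'
  - 'etcd_server_has_leader'
  - 'apiserver_request_total'                       # API server request counts
  - '{__name__=~"node_(cpu|memory|filesystem).*"}'  # selected node-level series
```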

3. Label management

Labels are attached on each cascading Prometheus to mark the region and meta-cluster, so that the central Prometheus can still locate data at meta-cluster granularity after aggregation. At the same time, we reduce unnecessary labels as much as possible to save data volume.
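In Prometheus this is usually done with external_labels on each edge or cascading instance (label names and values here are illustrative):

```yaml
# Illustrative global section of an edge/cascading prometheus.yml.
global:
  external_labels:
    region: cn-hangzhou
    meta_cluster: meta-01
```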

3. Support a flexible architecture that is pluggable, customizable, and evolves like building blocks

The first two parts briefly describe some of our thinking on managing a massive number of Kubernetes clusters, but global and unitized management alone is far from enough. Kubernetes' success, with its declarative definitions, highly active community, and good architectural abstractions, has made it the Linux of the cloud native era.

We must consider the continuous iteration of Kubernetes versions and CVE fixes, as well as continuous updates of Kubernetes components, whether CSI, CNI, Device Plugins, Scheduler Plugins, and so on. For this we provide full continuous upgrade of clusters and components, with grayscale release, pause, and related controls.

Pluggable components

Component checks

Component upgrades

In June 2019, Alibaba open-sourced its internal cloud native application automation engine, OpenKruise. Here we focus on its BroadcastJob feature, which is very suitable for upgrading a component on every worker machine or running checks on every node. A BroadcastJob runs a Pod on each node in the cluster until that Pod completes. It is similar to the community's DaemonSet, except that a DaemonSet keeps a Pod running on each node indefinitely, while a BroadcastJob's Pods run to completion and then end.
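A minimal sketch of a BroadcastJob against the OpenKruise v1alpha1 API (the image and command are placeholders):

```yaml
# Illustrative BroadcastJob: run a one-off check or upgrade task once on every node.
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: node-component-check
spec:
  template:
    spec:
      containers:
      - name: check
        image: registry.example.com/node-check:v1      # placeholder image
        command: ["/bin/sh", "-c", "run-node-check.sh"]
      restartPolicy: Never
  completionPolicy:
    type: Always   # the job finishes once the Pod on every node has completed
```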

Cluster template

In addition, considering the different usage scenarios of Kubernetes, we provide a variety of Kubernetes cluster templates (profiles), so users can conveniently choose the cluster that fits their needs. We will continue to provide more and better cluster templates based on our extensive cluster practice.

Conclusion

With the development of cloud computing, cloud native technology based on Kubernetes continues to drive the digital transformation of the industry.

Container Service ACK provides a secure, stable, high-performance managed Kubernetes service and has become the best carrier for running Kubernetes on the cloud. During this Double 11, Container Service ACK contributed in many scenarios: it supported the containerization of Alibaba's internal core systems on the cloud, cloud products such as the Alibaba Cloud microservice engine MSE, Video Cloud, and CDN, as well as Double 11 ecosystem companies and ISVs, including the Jushita e-commerce cloud, the Cainiao logistics cloud, payment systems in Southeast Asia, and more.

Container Service ACK will keep moving forward, providing better cloud native container networking, storage, scheduling, and elasticity capabilities, end-to-end full-link security, and Serverless and Service Mesh capabilities.

Interested developers can go to the Alibaba Cloud console and create a Kubernetes cluster to try it out. We also welcome container ecosystem partners to join the Alibaba Cloud container application marketplace and build the cloud native era together with us.

Highlights of the book

  • Detailed accounts of the problems encountered, and the solutions adopted, in running super-large-scale K8s clusters for Double 11
  • The best cloud native combination: Kubernetes + containers + X-Dragon bare metal, with the technical details behind moving core systems 100% to the cloud
  • The large-scale Service Mesh deployment solution for Double 11
