
A single K8s cluster provides namespace-level isolation and supports a maximum of 5,000 nodes and 150,000 pods. Running multiple K8s clusters solves the resource-isolation and fault-isolation problems of a single cluster and breaks through the limits on the number of nodes and pods it can support, but it also increases the complexity of cluster management. In private cloud scenarios in particular, K8s engineers cannot reach the customer environment as quickly as they can in the public cloud, so operation and maintenance costs are magnified further. How to manage multiple K8s clusters at low cost, with high efficiency, automation, and minimal manual intervention has therefore become a common problem in the industry.

Background

Multi-cluster is mainly used in the following scenarios:

1. The product itself requires multi-cluster capability. The product's control plane must be deployed in a K8s cluster, and the product also needs to provide K8s clusters for users. To ensure fault isolation, stability, and security, the control plane of the container service must be deployed in a separate cluster.
2. Users want multiple production clusters for different services, to achieve resource isolation and fault isolation.
3. The same user may need several types of clusters. Edge computing and IoT, for example, require a customized edge K8s cluster.

The difficulties in operating and maintaining K8s clusters can be divided into two parts:

Difficulty 1: Operating and maintaining the K8s control plane

  • How can users spin up a new Kubernetes cluster with one click?
  • How do we upgrade the versions of multiple K8s clusters? When the community discloses a major CVE, do we have to upgrade the clusters one by one?
  • How do we automatically repair faults that occur while multiple K8s clusters are running?
  • How do we maintain each cluster's etcd, including upgrade, backup, recovery, and node migration?

Difficulty 2: Operating and maintaining Worker nodes

  • How do we quickly scale Worker nodes out or in, while ensuring that the versions and configurations of each node's on-host software (components such as Docker and Kubelet that cannot be hosted by K8s itself) stay aligned with the other nodes?
  • How do we upgrade the on-host software on a large number of Worker nodes, and how do we roll out software packages to Worker nodes gradually (gray release)?
  • How do we automatically repair on-host software faults on a large number of Worker nodes? For example, if Docker or Kubelet panics, can it be handled automatically?

As a complex automated operation and maintenance system, K8s handles the release, upgrade, and lifecycle management of the business running on top of it, yet the operation and maintenance of K8s itself is usually carried out in the industry with workflow tools (Ansible, Argo, and so on). The degree of automation in this approach is not high, and it requires operators with fairly specialized K8s knowledge. When multiple K8s clusters need to be operated and maintained, the O&M cost rises at least linearly even in the ideal case, and is magnified again in private cloud scenarios.

Alibaba encountered the problem of operating and maintaining a large number of K8s clusters early on. We abandoned the traditional workflow approach and explored another path: managing K8s with K8s. The CNCF article "Demystifying Kubernetes as a Service — How Alibaba Cloud Manages 10,000s of Kubernetes Clusters" describes Alibaba's exploration and experience in managing K8s clusters at scale.

Of course, there are big differences between the private cloud scenario and Alibaba's internal scenario. Alibaba Group values Kube-on-Kube technology mainly for its economies of scale: one meta K8s cluster can manage hundreds of business K8s clusters. In the private cloud, the scale effect is small; there, Kube-on-Kube is valued mainly for its ability to automate the operation and maintenance of K8s clusters and to be compatible with multiple types of K8s clusters, which enriches the user's usage scenarios while improving stability.

K8s declarative operation and maintenance philosophy: Manage K8s with K8s

The declarative API of K8s subverts the traditional procedural O&M mode and corresponds to end-state oriented O&M: the user declares the desired state in the Spec, and the K8s Controller performs a series of operations to reach that state. As long as the desired state is not met, the Controller keeps retrying.
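
For illustration, a minimal Deployment manifest only declares the desired state (here, three replicas of an nginx workload; the image tag is just an example), and the Deployment controller keeps reconciling the cluster toward it:

```yaml
# Desired state: three replicas of an nginx Pod. If a Pod is deleted or dies,
# the Deployment controller detects the drift and recreates it until the
# observed state matches the Spec again.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
```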

For K8s native resources such as Deployment, the built-in K8s Controllers maintain the end state, while for user-defined resources such as a Cluster, K8s provides the powerful and easy-to-use CRD + Operator mechanism. Custom end-state oriented O&M can be implemented in a few simple steps:

1. Define your own resource type (CRD) and implement your own Operator, which includes a custom Controller;
2. Users submit their own CRs in YAML or JSON form;
3. The Operator watches for CR changes, and the Controller executes the corresponding O&M logic;
4. During operation, if the end state no longer meets the requirements, the Operator observes the change and performs the corresponding recovery actions.
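
As a minimal sketch of these steps (the API group, kind, and fields below are hypothetical illustrations, not the actual CRD used by the product), a business cluster could be declared like this and left to the Operator to reconcile:

```yaml
# Hypothetical Cluster CR: declares the desired end state of a business
# cluster; a custom Controller watches it and converges toward this state.
apiVersion: kok.example.com/v1alpha1
kind: Cluster
metadata:
  name: production-cluster-01
spec:
  kubernetesVersion: "1.20.4"   # upgrading = editing this field; the Operator does the rest
  etcd:
    replicas: 3                 # dedicated etcd, created by the etcd Operator
  controlPlane:
    apiserverReplicas: 2
    exposure: LoadBalancer      # or NodePort when no LoadBalancer capability exists
```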

The Operator pattern is one of the best practices for expressing O&M knowledge as code. Of course, it is only a framework: it saves users some repetitive work, such as the event watch mechanism and the RESTful API plumbing, but the core O&M logic still has to be written by users themselves, case by case.

Cloud native KOK architecture

Kube-on-Kube is not a new concept, and there are many excellent solutions in the industry:

  • Ant Group's solution
  • Community projects

However, the above solutions depend heavily on public cloud infrastructure, and the particularities of the private cloud led us to the following requirements:

  • Lightweight enough to be affordable: customers are often averse to the overhead of dedicated management nodes;
  • No dependence on the public cloud or on Alibaba Group's internal infrastructure;
  • A cloud native architecture.

After considering these three factors, we designed a more general KOK solution:

Terminology:

  • Meta-cluster: the lower Kube of Kube-on-Kube, which hosts the control planes of the business clusters;
  • Production cluster (business cluster): the upper Kube of Kube-on-Kube, which runs the actual workloads;
  • etcd cluster: created and maintained by the etcd Operator running in the meta-cluster; each business cluster can have its own etcd cluster or share one with other business clusters;
  • PC-master-pods: the control-plane pods of a business cluster, namely the kube-apiserver, kube-scheduler, and kube-controller-manager pods, maintained by the Cluster Operator running in the meta-cluster;
  • PC-nodes: the nodes of a business cluster, initialized and added to the business cluster by the Machine Operator, which also maintains them.

etcd Operator

The etcd Operator is responsible for creating, destroying, upgrading, and troubleshooting etcd clusters. It also monitors their status, including cluster health, member health, and the amount of data stored.

The Alibaba Cloud native application platform SRE team has enhanced the open-source etcd Operator with additional O&M functionality and improved stability. This Operator manages a large number of etcd clusters inside Alibaba and has proven stable over a long period of operation.
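
For reference, the open-source etcd-operator that this component builds on declares an etcd cluster with an EtcdCluster CR along the following lines (the name and namespace are illustrative, and the internally enhanced fields are not shown):

```yaml
# EtcdCluster CR in the style of the open-source etcd-operator: the operator
# creates the members and keeps the declared size and version.
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: pc-etcd-01        # illustrative name for one business cluster's etcd
  namespace: kok-meta     # hypothetical namespace in the meta-cluster
spec:
  size: 3                 # desired number of etcd members
  version: "3.2.13"       # changing this triggers a rolling upgrade
```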

Cluster Operator

The Cluster Operator is responsible for creating and maintaining the K8s control-plane components of a business cluster (kube-apiserver, kube-controller-manager, and kube-scheduler), as well as generating the corresponding certificates and kubeconfig files.

We built the Cluster Operator together with Ant Group's PaaS engine team and the proprietary cloud product team, with the ability to customize rendering, trace versions, and dynamically add supported versions.

The K8s control-plane components of a business cluster are deployed in the meta-cluster as ordinary resources, such as Deployments, Secrets, Pods, and PVCs. A business cluster therefore has no concept of a Master node:

  • kube-apiserver: one Deployment plus one Service. kube-apiserver is stateless, so a Deployment is sufficient. The apiserver also needs to connect to the etcd cluster brought up by the etcd Operator. If the user environment supports LoadBalancer Services, we recommend exposing the apiserver through a LoadBalancer Service; otherwise we also support exposing it through a NodePort Service (see the sketch after this list);
  • kube-controller-manager: one Deployment, a stateless application;
  • kube-scheduler: one Deployment, a stateless application.
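
As a sketch of how this looks in practice (the names, namespace, image, and flags below are illustrative assumptions, not the product's actual manifests), the business cluster's kube-apiserver is just an ordinary Deployment in the meta-cluster, exposed through a Service:

```yaml
# Hypothetical manifests: the business cluster's apiserver runs as an ordinary
# Deployment in the meta-cluster and is exposed through a LoadBalancer Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pc01-kube-apiserver
  namespace: kok-meta            # hypothetical namespace in the meta-cluster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pc01-kube-apiserver
  template:
    metadata:
      labels:
        app: pc01-kube-apiserver
    spec:
      containers:
      - name: kube-apiserver
        image: registry.example.com/kube-apiserver:v1.20.4   # illustrative image
        command:
        - kube-apiserver
        - --etcd-servers=https://pc-etcd-01-client:2379      # etcd created by the etcd Operator
        - --secure-port=6443
        # certificates, service-cluster-ip-range, and other required flags omitted for brevity
---
apiVersion: v1
kind: Service
metadata:
  name: pc01-kube-apiserver
  namespace: kok-meta
spec:
  type: LoadBalancer             # fall back to NodePort if no LoadBalancer capability exists
  selector:
    app: pc01-kube-apiserver
  ports:
  - port: 6443
    targetPort: 6443
```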

However, deploying these three components alone does not yield a usable K8s cluster; we also need to cover the following scenarios:

1. Besides etcd and the three control-plane components, a usable business K8s cluster also needs CoreDNS, kube-proxy, and possibly other components;
2. Some components need to be deployed in the meta-cluster, like etcd and the three control-plane components. For example, the Machine Operator brings up nodes and must be able to run before any node exists in the business cluster;
3. Component versions need to be upgradable.

Therefore, to meet these extensibility requirements, we designed a hot-pluggable Addons capability: all Addon components can be imported with a single command, and Addons support dynamic rendering, so their configuration items can be customized per cluster.
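
The configuration format is not shown in this article, but conceptually an addons manifest could look like the following purely hypothetical sketch, with per-cluster values rendered dynamically before the components are applied:

```yaml
# Hypothetical addons manifest for one business cluster; {{ }} placeholders are
# rendered per cluster before the components are applied.
addons:
- name: coredns
  version: "1.8.0"
  values:
    clusterDNS: "{{ .ClusterDNSIP }}"
- name: kube-proxy
  version: "v1.20.4"
  values:
    clusterCIDR: "{{ .PodCIDR }}"
- name: machine-operator       # runs in the meta-cluster, see the next section
  target: meta-cluster
  version: "0.3.1"
```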

Machine Operator

The Machine Operator performs the necessary initialization on a node, creates components such as Docker, K8s, and NVIDIA drivers on the node and maintains their end state, and finally adds the node to the business cluster.

We adopted the KubeNode component maintained by the Alibaba Cloud native application platform Serverless node-management team. This Operator handles node onboarding and offboarding within Alibaba Group and implements an end-state oriented O&M framework whose CRDs can be customized for different architectures or operating systems, which makes it well suited to the heterogeneous private cloud environment.

In short, KubeNode implements an end-state oriented O&M framework built around two concepts: Component and Machine.

1. Users provide O&M scripts according to a template, which generate Component CRs;
2. When a node is to be brought online, a Machine CR is generated that specifies which Components need to be deployed on it;
3. KubeNode watches the Machine CRs and performs the corresponding O&M operations on the node.
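
As a minimal sketch of the Component + Machine model (the API group and field names below are illustrative assumptions, not KubeNode's actual schema):

```yaml
# Hypothetical Component CR: wraps an O&M script template for one piece of
# on-host software (here, the container runtime).
apiVersion: kubenode.example.com/v1alpha1
kind: Component
metadata:
  name: docker-19.03
spec:
  version: "19.03.15"
  install: |
    yum install -y docker-ce-19.03.15
    systemctl enable --now docker
  check: |
    systemctl is-active docker   # used to verify the declared end state
---
# Hypothetical Machine CR: declares which Components a node must converge to
# before it is added to the business cluster.
apiVersion: kubenode.example.com/v1alpha1
kind: Machine
metadata:
  name: node-192-168-0-10
spec:
  ip: 192.168.0.10
  cluster: production-cluster-01
  components:
  - docker-19.03
  - kubelet-1.20.4
  - nvidia-driver-460
```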

In theory, this design can be extended to different architectures or operating systems without changing the Operator source code, giving it high flexibility. At the same time, we are exploring how to combine it with IaaS providers to achieve the goal of RunAnyWhere.

Cost comparison of multi-cluster solutions

By using an automation tool (also a cloud-native Operator, described in a future article) to string together the above processes, we can compress the production time of a cluster to the level of minutes.

The following table compares the cost of the tiled multi-cluster solution (deploying multiple complete clusters directly) and the KOK multi-cluster solution:

Cost item                   Tiled multi-cluster solution    KOK multi-cluster solution
Delivery cost               T*K*G                           T*G + t*G*(K-1)
Upgrade cost                U*G*P*K                         U*G*P + u*G*P*(K-1)
User cost (new clusters)    T*K                             t*K

In the table above, T is the deployment time of a single K8s cluster, t is the deployment time of a single business cluster, K is the number of clusters, G is the number of sites, U is the meta-cluster upgrade time, u is the business cluster upgrade time, and P is the number of upgrades.

According to our practical experience, T and U are each about 1 hour under normal circumstances, and t and u are each about 10 minutes. Based on this, we estimate:

  • The delivery time of multiple clusters (3 clusters) decreases from >3 hours to <1 hour;
  • The cluster upgrade time decreases from >1 hour to 10 minutes;
  • The cost of creating a new cluster decreases from >2 hours to 10 minutes.

Conclusion

The tiled multi-cluster approach inevitably brings a linear increase in O&M complexity and is difficult to maintain. The KOK multi-cluster approach, by contrast, treats the K8s cluster itself as a K8s resource and uses the powerful CRD + Operator capability of K8s to upgrade K8s O&M from the traditional procedural mode to a declarative one, which fundamentally reduces the O&M complexity of K8s clusters.

At the same time, the multi-cluster design described in this article builds on Alibaba Group's years of O&M experience, adopts a cloud native architecture, removes dependence on specific infrastructure, and achieves RunAnyWhere. With only a common IaaS facility, users can enjoy easy-to-use, stable, and lightweight K8s multi-cluster capability.

The cloud native application platform team is hiring!

The Alibaba Cloud native application platform team is currently looking for talent. If you:

  • Are passionate about container and infrastructure-related cloud native technologies, with rich accumulation and outstanding achievements (such as product implementations, innovative technologies put into production, open source contributions, leading academic results) in at least one cloud native infrastructure area such as Kubernetes, Serverless platforms, container networking and storage, or O&M platforms;

  • Have excellent presentation, communication, and teamwork skills; think ahead about technology and business; have a strong sense of ownership, are result-oriented, and are good at decision-making;

  • Are familiar with at least one of the programming languages Java and Golang;

  • Hold a bachelor's degree or above and have at least 3 years of working experience,

then send your resume to [email protected]; if you have any questions, please add WeChat: MDx252525.

Recommended course

To help more developers enjoy the dividends brought by Serverless, we have gathered 10+ technical experts in the Serverless field from Alibaba to create a Serverless open course best suited for developers, one you can learn and apply immediately, and easily embrace Serverless, the new paradigm of cloud computing.

Click for the free course: developer.aliyun.com/learning/ro…

"Alibaba Cloud Native focuses on microservices, Serverless, containers, Service Mesh and other technical fields, follows cloud native trends and large-scale cloud native implementation practices, and is the public account that best understands cloud native developers."