Feng Yong holds a PhD in Computer Science from Northwestern Polytechnical University. She has more than 10 years of design and development experience in high-performance computing, big data, and cloud computing, focusing on scheduling, resource management, and application management. She was also deeply involved in the development and commercialization of related open source projects such as OpenStack, Mesos, Swarm, Kubernetes, and Spark, and once led the related development team at IBM.

Preface

If anyone in the cloud computing world hasn't heard of Kubernetes, it is as surprising as someone not knowing that Chongqing hotpot must have chili. Kubernetes has become the de facto standard platform for managing data centers, much like Android on phones and Windows on laptops. Around Kubernetes, the open source community has built a rich technology ecosystem: whether for CI/CD, monitoring and operations, application frameworks, or security and intrusion prevention, users can find projects and products that suit them. However, once the scenario scales up to a multi-cluster, hybrid cloud environment, the open source technologies users can rely on are few, and they tend to be less mature and comprehensive.

To let developers and users work with their familiar open source projects and products in multi-cluster and hybrid environments as easily as on a single Kubernetes cluster, Red Hat, Ant Group, and Alibaba Cloud jointly launched and open-sourced Open Cluster Management (OCM) to solve the lifecycle management of resources, applications, configurations, policies, and other objects in multi-cluster and hybrid environments. OCM has submitted a Sandbox-level project application to the CNCF TOC.

Project official website: open-cluster-management.io/

History of multi-cluster management

A few years back, when the industry was still debating whether Kubernetes was ready for production, a number of players had already placed early bets on "multi-cluster federation" technology. Most of them were Kubernetes pioneers operating at well above average scale: Red Hat and Google first entered the field with the KubeFed v1 attempt, then were joined by IBM to apply the lessons learned and launch KubeFed v2. Beyond these large enterprises exploring multi-cluster federation in production Kubernetes scenarios, in the commercial market most vendors' Kubernetes-based service products have likewise evolved from single-cluster offerings to multi-cluster and hybrid cloud forms. In fact, both enterprises and business users share common needs, focused on the following aspects:

Multi-geography issues: when clusters need to be deployed on heterogeneous infrastructure or across a wider geographic area

A Kubernetes cluster relies on etcd as its data persistence layer. As a distributed system, etcd places requirements on the network latency between its members and imposes limits on the number of members. Although latency tolerance can be raised by adjusting the heartbeat and other parameters, etcd still cannot meet the requirements of global deployment across countries and continents, nor can it guarantee the number of available zones in large-scale scenarios. Therefore, to keep etcd running stably, clusters are usually planned per region. In addition, hybrid cloud architectures are increasingly accepted by users for business availability and security reasons, and a single etcd cluster is difficult to deploy across cloud providers, so Kubernetes clusters are correspondingly split into multiple clusters. Once the number of clusters grows and administrators become overwhelmed, an integrated management and control system is naturally needed to manage and coordinate multiple clusters at once.
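For reference, the tuning mentioned above maps to two knobs in etcd's YAML configuration file. The sketch below uses illustrative values, not recommendations (the etcd defaults are 100 ms and 1000 ms):

```yaml
# Fragment of an etcd config file (passed via --config-file).
# Raising these values tolerates higher inter-member latency,
# at the cost of slower failure detection and leader election.
heartbeat-interval: 500   # ms between leader heartbeats (default 100)
election-timeout: 5000    # ms before a follower starts an election (default 1000)
```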

Scale issues: When a single cluster hits a scale bottleneck

Admittedly, the open source version of Kubernetes has significant scalability bottlenecks, yet what is worse is that it is hard to truly quantify Kubernetes' scale. The community initially provided the Kubemark suite to verify cluster performance, but the reality was disappointing: all Kubemark did was repeatedly scale workloads up and down under different node counts. In practice, the causes of Kubernetes performance bottlenecks are complex and scenario-dependent, so Kubemark can hardly describe the scale of a cluster comprehensively and objectively, and it can only serve as a very coarse-grained reference. The community later added multi-dimensional measurement of cluster capacity via the scalability envelope, followed by the more advanced load-testing suite, perf-tests. As users become more aware of scale issues, they can plan the distribution of several Kubernetes clusters in advance based on their actual scenarios (such as IDC scale and network topology), and the need for multi-cluster federation emerges accordingly.

Disaster/Isolation issues: When more granular isolation and disaster recovery requirements arise

Application disaster recovery (DR) within a cluster is implemented by deploying applications to infrastructure availability zones of different granularity based on scheduling policies. Combined with network routing, storage, and access control technologies, this can keep services running after an availability zone failure. But what about failures at the cluster level, or even of the cluster management control platform itself?

As a distributed system, etcd can naturally tolerate most node failures, but unfortunately, in practice etcd services can still go down, whether due to operational errors or network partitions. To prevent a failing etcd from destroying the world, disaster recovery strategies are often made more fine-grained by reducing the blast radius. For example, in practice it is preferable to build multiple clusters within a single data center to avoid split-brain problems, while making each cluster an independent, autonomous system that keeps running even under a network partition or when the upper-level control plane is offline, at least holding the line stably. This naturally creates the need to manage multiple Kubernetes clusters simultaneously.

On the other hand, the isolation requirement also stems from clusters' lack of multi-tenancy capabilities, so isolation policies are applied directly at the cluster level. As a side note, the good news is that fairness and multi-tenant isolation on the Kubernetes control plane are being built brick by brick. Through the APIPriorityAndFairness feature, which entered Beta in version 1.20, soft isolation policies for traffic can be proactively customized per scenario, instead of passively penalizing traffic through rate-limiting ACLs. And if the initial planning divides the workload into multiple clusters, the isolation problem is solved naturally; for example, we can allocate dedicated clusters for big data or specific business applications.
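To illustrate the kind of soft isolation APIPriorityAndFairness enables, here is a minimal sketch that maps traffic from a hypothetical big-data tenant to the built-in workload-low priority level. The API version shown is the v1beta1 that shipped as Beta in 1.20; names such as bigdata and spark-operator are assumptions for illustration:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: bigdata-low-priority   # hypothetical name
spec:
  # Send matching requests to the built-in "workload-low" priority level,
  # so a bursty big-data controller cannot starve other tenants.
  priorityLevelConfiguration:
    name: workload-low
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: spark-operator   # hypothetical tenant identity
            namespace: bigdata
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]
```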

The main functions and architecture of OCM

OCM is designed to simplify the administration of multiple Kubernetes clusters deployed in a hybrid environment. It can be used to extend the multi-cluster management capabilities of different management tools in the Kubernetes ecosystem. OCM distills the basic concepts required for multi-cluster management and holds that any management tool must be able to:

1. Understand the definition of a cluster.

2. Select one or more clusters through a scheduling mechanism.

3. Distribute configurations or workloads to one or more clusters.

4. Control user access to the cluster.

5. Deploy management probes to multiple clusters.

OCM adopts a hub-agent architecture and contains several primitives and basic components for multi-cluster management to meet the above requirements (a minimal sketch of the core APIs follows the list):

● The ManagedCluster API defines a managed cluster; OCM installs an agent named Klusterlet in each cluster to complete cluster registration and lifecycle management.

● The Placement API defines how to schedule configurations or workloads to which clusters. The scheduling results are stored in the PlacementDecision API, which other configuration management and application deployment tools can consume to determine which clusters to configure and deploy to.

● The ManifestWork API defines the configuration and resource information to be distributed to a cluster.

● The ManagedClusterSet API groups clusters and provides a boundary for user access to clusters.

● The ManagedClusterAddon API defines how management probes are deployed to multiple clusters and how they communicate securely with the control plane on the hub side.
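As a concrete illustration of these primitives, the sketch below shows minimal manifests for a managed cluster, a placement, and a distributed workload. API versions and details such as the environment: production label are assumptions and may differ across OCM releases:

```yaml
# A managed cluster registered with the hub; the Klusterlet agent inside
# the cluster completes the handshake once the hub accepts it.
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster1
  labels:
    environment: production   # assumed label, used by the Placement below
spec:
  hubAcceptsClient: true
---
# Pick up to two production clusters; results land in a PlacementDecision.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement1
  namespace: default
spec:
  numberOfClusters: 2
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: production
---
# Distribute a resource to cluster1; a ManifestWork lives in the target
# cluster's namespace on the hub.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: example-work
  namespace: cluster1
spec:
  workload:
    manifests:
      - apiVersion: v1
        kind: Namespace
        metadata:
          name: demo   # hypothetical namespace to create on the managed cluster
```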

The architecture is shown in the figure below: Registration is responsible for cluster registration, cluster lifecycle management, and the registration and lifecycle management of management plug-ins; Work is responsible for resource distribution; Placement is responsible for scheduling workloads across clusters. On top of this, developers or SRE teams can easily develop and deploy management tools for different scenarios based on the API primitives provided by OCM.

By using OCM's API primitives, we can simplify the deployment and operation of many other open source multi-cluster management projects and extend many single-cluster Kubernetes management tools with multi-cluster capabilities. For example:

1. Simplify the management of multi-cluster network solutions such as Submariner, using OCM's plug-in (add-on) management capability to centralize Submariner's deployment and configuration on a unified management platform (see the add-on sketch after this list).

2. Provide rich multi-cluster scheduling policies and a reliable resource distribution engine for application deployment tools (KubeVela, ArgoCD, etc.).

3. Extend existing single-cluster Kubernetes security policy governance tools (Open Policy Agent, Falco, etc.) with multi-cluster security policy governance capabilities.
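For point 1, enabling such a plug-in on a managed cluster typically comes down to creating a ManagedClusterAddOn in that cluster's namespace on the hub. The add-on name and install namespace below are illustrative assumptions; the actual values depend on the add-on:

```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: submariner          # assumed add-on name
  namespace: cluster1       # the managed cluster's namespace on the hub
spec:
  installNamespace: submariner-operator   # assumed namespace on the managed cluster
```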

OCM also ships two built-in management plug-ins, one for application deployment and one for security policy management. The application deployment plug-in adopts a subscription model: by defining subscription channels, it can pull application deployment resources from different sources. Its architecture is shown in the figure below.
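As a rough sketch of this subscription model (resource shapes vary by release, and the repository URL and names here are hypothetical), a Channel points at a source of manifests and a Subscription binds it to a placement:

```yaml
# A channel describing where application manifests come from.
apiVersion: apps.open-cluster-management.io/v1
kind: Channel
metadata:
  name: app-repo
  namespace: demo
spec:
  type: Git
  pathname: https://github.com/example/app-manifests.git   # hypothetical repo
---
# A subscription that pulls from the channel and deploys to the
# clusters selected by a placement.
apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: app-sub
  namespace: demo
spec:
  channel: demo/app-repo
  placement:
    placementRef:
      name: placement1      # reuses the Placement sketched earlier
      kind: Placement       # older releases reference a PlacementRule instead
```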

To integrate with the Kubernetes ecosystem, OCM has implemented several design proposals from Kubernetes SIG-Multicluster, including KEP-2149 Cluster ID: github.com/kubernetes/…

and the ClusterSet part of KEP-1645 Multi-Cluster Services API: github.com/kubernetes/…

It is also working with other developers in the community to promote the development of the Work API: github.com/kubernetes-…

The main advantages of OCM

Highly modular: optional, trimmable modules

The overall OCM architecture resembles a "microkernel" operating system: the OCM chassis provides core capabilities such as cluster metadata abstraction, while other extended capabilities are deployed as separate, decoupled components. Apart from the core capabilities, every higher-level capability in the OCM scheme can be trimmed according to actual demand. For example, if we do not need complex cluster topology relationships, we can cut out the cluster grouping modules; if we do not need OCM to distribute any resources and use it only for metadata, we can even cut out the agent component for resource delivery. This also helps users onboard OCM gradually: they may need only a few features at first and can introduce more feature components as their scenarios expand, even hot-plugging them on a running control plane.

More inclusive: a Swiss Army knife for complex usage scenarios

From the beginning, the OCM design considered building advanced capabilities for complex scenarios by integrating mainstream third-party technologies. For example, to support more complex application resource rendering and delivery, OCM supports installing applications as Helm charts and loading remote chart repositories. It also provides the Addon framework, which allows users to meet customized needs through extensibility interfaces; Submariner, the multi-cluster networking solution mentioned above, is built on the Addon framework.

Ease of use: reduced complexity

To reduce complexity for users and ease migration to the OCM solution, OCM provides traditional, imperative-command-style control flows for multi-cluster federation. Note that the following features were still under development at the time of writing and will be officially introduced in a later version:

● ManagedClusterAction allows us to send atomic instructions to each managed cluster, which is the most intuitive way for a central management system to automate the orchestration of clusters. A ManagedClusterAction carries its own instruction type, instruction content, and the concrete state of its execution.

● ManagedClusterView allows us to actively "project" resources from managed clusters into the multi-cluster federated central system. By reading the "projections" of these resources in the central system, we can make more dynamic and accurate decisions in the federated system.
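As a sketch of such a projection (these APIs were still evolving at the time of writing, so the group/version and fields may differ), a ManagedClusterView asking the hub to mirror a deployment from a managed cluster might look like:

```yaml
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  name: view-my-app            # hypothetical name
  namespace: cluster1          # the managed cluster's namespace on the hub
spec:
  scope:
    resource: deployments      # what to project from the managed cluster
    name: my-app               # hypothetical workload name
    namespace: default
# The projected object appears under status.result once the agent syncs it.
```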

OCM’s practice in Ant Group

OCM technology has been applied to Ant Group's infrastructure. As a first step, OCM Klusterlets were deployed one by one to the managed clusters using operational tooling similar to the community Cluster API, thereby unifying the meta-information of dozens of online and offline clusters in the Ant domain into OCM. These Klusterlets provide the upper-layer product platforms with basic multi-cluster management and operations (O&M) capabilities, paving the way for future feature expansion. Specifically, this first step covers the following aspects:

● Certificate-free: In traditional multi-cluster federated systems, we often need to configure the corresponding access certificate for each cluster's metadata, which is a required field in the KubeFed v2 cluster metadata model. Because OCM adopts a pull architecture, agents deployed in each cluster pull tasks from the hub and the hub never actively accesses the actual clusters, so each cluster's metadata is merely a fully desensitized placeholder. At the same time, because certificate information does not need to be stored, there is no risk of certificates being copied or misused in the OCM solution.

● Automated cluster registration: The previous registration process involved much manual intervention, which lengthened collaboration and communication while sacrificing flexibility, such as site-level or machine-room-level elasticity. In many scenarios, manual verification is still essential. By making full use of the auditing and verification capabilities provided by OCM cluster registration and integrating them with the in-house approval process tools, the entire cluster registration process is automated, achieving the following goals:

(1) Simplify the cluster initialization/takeover process;

(2) Exercise clearer control over the permissions of the central control plane.

● Automated cluster resource installation/uninstallation: Taking over a cluster mainly involves two things: (a) installing the application resources required by the management platform into the cluster, and (b) entering the cluster's metadata into the management platform. For (a), resources can be further classified into the cluster level and the namespace level; (b) is generally a critical operation for the upper-layer management system, since the product is considered to have taken over the cluster from the moment the metadata is entered. Before OCM was introduced, all this preparation had to be pushed forward manually, step by step. With OCM, the entire process can be automated, reducing the cost of human collaboration. In essence, this turns cluster takeover into a workflow: by defining states on the cluster metadata, the product hub can automate the "chores" of the takeover process. Once a cluster is registered in OCM, the resource installation and uninstallation processes are clearly defined.

Through the above work, dozens of clusters in the Ant domain are now managed by OCM. Clusters automatically created and deleted during the Double Eleven (Singles' Day) promotion were also automatically added to and removed from management. Going forward, Ant also plans to integrate with KubeVela and other application management technologies to jointly deliver cloud native management of applications and security policies in the Ant domain.

The practice of OCM in Alibaba Cloud

In Alibaba Cloud, the OCM project is one of the core dependencies of KubeVela (github.com/oam-dev/kub…) for consistent application delivery in hybrid environments. KubeVela is a "one-stop" application management and delivery platform based on the Open Application Model (OAM), and is currently the only cloud native application platform project hosted by the CNCF Foundation. Functionally, KubeVela provides developers with an end-to-end application delivery model, along with multi-cluster-oriented operational capabilities such as grayscale releases, elastic scaling, and observability, and it can deliver and manage applications in hybrid environments with a unified workflow. Throughout this process, OCM is the main technology with which KubeVela implements Kubernetes cluster registration, management, and application distribution policies.
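As a rough illustration of such a delivery plan (sketched against a recent KubeVela Application schema; the topology policy, field names, and all names below are assumptions that vary by KubeVela version), an application targeting the clusters registered through OCM might look like:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: demo-app
  namespace: default
spec:
  components:
    - name: web
      type: webservice          # built-in KubeVela component type
      properties:
        image: nginx:1.21       # hypothetical workload image
  policies:
    # Select target clusters by label; under the hood this relies on
    # the clusters registered and scheduled through OCM.
    - name: prod-clusters
      type: topology
      properties:
        clusterLabelSelector:
          environment: production
  workflow:
    steps:
      - name: deploy-to-prod
        type: deploy
        properties:
          policies: ["prod-clusters"]
```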

On the public cloud, the above KubeVela features, combined with Alibaba Cloud ACK's multi-cluster management capabilities, can provide users with a powerful application delivery control plane that easily implements:

● One-click setup of hybrid environments. For example, a typical hybrid environment could be a public cloud ACK cluster (production cluster) plus a local Kubernetes cluster (test cluster), managed through ACK multi-cluster management. In the two environments, application components are often provided by different sources; for example, the database component might be MySQL in the test cluster and an Alibaba Cloud RDS instance on the public cloud. In such a hybrid environment, traditional application deployment and operations are extremely complex. KubeVela, by contrast, makes it easy to define in a single deployment plan the artifacts to be deployed, the delivery workflow, and the differentiated configurations for different environments. This not only eliminates tedious manual configuration, but also greatly reduces release and operational risks thanks to Kubernetes' powerful automation and determinism.

● Multi-cluster microservice application delivery: Microservice applications under cloud native architectures are often composed of diverse components, such as container components, Helm components, middleware components, and cloud service components. KubeVela provides users with a multi-component application delivery model oriented toward microservice architectures and, using the distribution strategies provided by OCM, delivers applications uniformly across multi-cluster and hybrid environments, greatly reducing the difficulty of operating and managing microservice applications.

In the future, the Alibaba Cloud team will work with the Red Hat/OCM community, Oracle, Microsoft, and other partners to further improve KubeVela's application orchestration, delivery, and operations capabilities for hybrid environments, so that delivering and managing microservice applications in the cloud native era can truly be "fast and good".

Join the community

The OCM community is still in an early stage of rapid development, and interested companies, organizations, schools, and individuals are welcome to participate. Here, you can learn, build, and promote OCM together with technical experts from Ant Group, Red Hat, and Alibaba Cloud, as well as core Kubernetes contributors.

● GitHub address: github.com/open-cluste…

On September 10 this year, the INCLUSION·Bund Conference will be held as scheduled. As a global fintech event, it will stay true to its original aspiration of making technology more inclusive. On the afternoon of the 11th, we will hold an open source event on multi-cluster and hybrid cloud architecture, where key developers of the OCM community will present best practices for multi-cluster and hybrid cloud architectures built around OCM. You are welcome to attend the event in person for face-to-face communication. Thank you for your attention to and participation in OCM; please share it with friends who have the same needs, and let us work together to further improve the multi-cluster and hybrid cloud experience!
