Brief introduction: To enable developers and users to use their familiar open source projects and products in multi-cluster and hybrid cloud environments just as they do on a single Kubernetes cluster, Red Hat, Ant Group, and Alibaba Cloud jointly launched and open-sourced OCM (Open Cluster Management), and an application to join the CNCF as a Sandbox-level project has been submitted to the CNCF TOC.

Author: Feng Yong (Lu Jing)

Anyone in cloud computing who hasn’t heard of Kubernetes is like someone who doesn’t know that Chongqing hot pot needs chili. Like Android on phones and Windows on laptops, Kubernetes has become the de facto standard platform for managing data centers. Around Kubernetes, the open source community has built a rich technology ecosystem: whether for CI/CD, monitoring and operations, application frameworks, or security and intrusion prevention, users can find projects and products that fit their needs. However, once the scenario extends to multi-cluster and hybrid cloud environments, the open source technologies users can rely on are few, and often neither mature nor comprehensive.

To enable developers and users to use their familiar open source projects and products in multi-cluster and hybrid cloud environments just as they do on a single Kubernetes cluster, Red Hat, Ant Group, and Alibaba Cloud jointly launched and open-sourced OCM (Open Cluster Management), and an application to join the CNCF as a Sandbox-level project has been submitted to the CNCF TOC.

Open Cluster Management

History of multi-cluster management

Back a few years ago, when the debate still centered on whether Kubernetes was ready for production, some pioneers had already signed up for “multi-cluster federation” technology. Most of them were Kubernetes practitioners operating at far above average scale: from the earliest attempt at KubeFed v1 by Red Hat and Google, to the later collaboration with IBM that drew on those lessons and launched KubeFed v2. Beyond these large enterprises exploring multi-cluster federation in their Kubernetes production practice, in the commercial market most vendors packaging service offerings on top of Kubernetes have likewise evolved from single-cluster products to multi-cluster forms and hybrid cloud scenarios. In fact, enterprises and business users share common needs, focused on the following aspects:

Multi-region issues: when clusters need to be deployed on heterogeneous infrastructure or across broader geographic regions.

A Kubernetes cluster relies on etcd as its data persistence layer. etcd, as a distributed system, is sensitive to network latency between its members and limits how many members it can have; if the latency is tolerable, it can be adapted by tuning parameters such as the heartbeat interval, but it cannot meet the needs of global, cross-continent deployments, nor guarantee enough availability zones in large-scale scenarios. Therefore, to keep etcd at least running stably, etcd deployments are generally planned as multiple clusters by region. In addition, hybrid cloud architectures are increasingly accepted by users on the premise of business availability and security; it is difficult to deploy a single etcd cluster across cloud providers, so Kubernetes clusters are correspondingly split into multiple clusters. As the number of clusters grows and administrators become overwhelmed, it is natural to want an aggregated control system that manages and coordinates multiple clusters at once.
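For illustration, a minimal sketch of the latency tuning mentioned above, expressed as an etcd YAML configuration file; the member names, peer URLs, and timeout values are hypothetical assumptions, not recommendations:

```yaml
# etcd configuration file; timeout values are in milliseconds and illustrative only.
name: etcd-member-1
# Default heartbeat-interval is 100 ms; raising it tolerates higher
# inter-member latency at the cost of slower failure detection.
heartbeat-interval: 500
# The election timeout is conventionally about 10x the heartbeat interval.
election-timeout: 5000
initial-cluster: etcd-member-1=https://10.0.0.1:2380,etcd-member-2=https://10.0.1.1:2380,etcd-member-3=https://10.0.2.1:2380
initial-cluster-state: new
```

Even with such tuning, the tolerable latency has a ceiling, which is why splitting into regional clusters remains the common practice.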

Scale issues: when the scale of a single cluster hits a bottleneck.

Admittedly, the open source version of Kubernetes has obvious scalability bottlenecks, but to make matters worse, it is hard to truly quantify the “scale” of a Kubernetes cluster. The community initially provided the Kubemark suite to verify cluster performance, but the reality was sobering: what Kubemark does is repeatedly scale workloads up and down against different node counts, whereas in practice Kubernetes performance bottlenecks arise from complex causes across numerous scenarios. Kubemark struggles to describe the scale of a cluster comprehensively and objectively, and can only serve as a very coarse-grained reference. Later the community supported measuring cluster capacity along multiple dimensions with scale envelopes, and then came the more advanced cluster stress-testing suite perf-tests. Once users understand the scale problem more clearly, they can plan the distribution of several Kubernetes clusters in advance according to the actual scenario (such as IDC size and network topology), and the need for multi-cluster federation emerges.

DR/isolation issues: when finer-grained isolation and disaster recovery (DR) requirements arise.

Application-level DR is usually implemented by deploying applications across infrastructure availability zones of different granularities, driven by in-cluster scheduling policies; combined with network routing, storage, and access control technologies, this addresses business continuity after an availability zone failure. But how do we solve DR at the cluster level, or even for the cluster management control platform itself?

As a distributed system, etcd can naturally tolerate the failure of a minority of its nodes. Unfortunately, in practice etcd services can still go down, whether through operational mistakes or network partitions. To prevent “world destruction” when etcd fails, the “blast radius” is often reduced to provide a finer-grained disaster recovery strategy. For example, in practice it is preferable to build multiple clusters within a single data center to avoid split-brain problems. At the same time, each cluster can be an independent, autonomous system that keeps running fully, or at least holds its current state stable, even under a network partition or when disconnected from the upper-layer control plane. This naturally leads to the need to manage multiple Kubernetes clusters simultaneously.

On the other hand, the isolation requirement also stems from the cluster’s lack of multi-tenancy capabilities, so isolation policy is applied directly at the cluster level. The good news is that fairness and multi-tenant isolation on the Kubernetes control plane is being built brick by brick: with API Priority and Fairness, in Beta as of version 1.20, you can proactively customize soft traffic isolation policies per scenario, rather than passively punishing excess traffic with rate limits via ACLs. And if the cluster is divided into multiple clusters at planning time, the isolation problem is solved naturally; for example, we can assign a dedicated cluster to big data workloads, or dedicated clusters to specific business applications.
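As a hedged illustration of such a soft isolation policy, the sketch below reserves a dedicated concurrency slice for one tenant using API Priority and Fairness; the tenant name, service account, and share values are hypothetical, and the flowcontrol API group was in Beta as of Kubernetes 1.20:

```yaml
# Reserve a limited concurrency slice for a hypothetical "big-data" tenant,
# instead of rate-limiting it after the fact.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
  name: big-data-tenant
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 10   # relative share of apiserver concurrency
    limitResponse:
      type: Queue
      queuing:
        queues: 8
        queueLengthLimit: 50
        handSize: 4
---
# Route requests from the tenant's service account into that priority level.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: big-data-tenant
spec:
  priorityLevelConfiguration:
    name: big-data-tenant
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: spark-operator       # hypothetical tenant workload identity
        namespace: big-data
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]
```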

The main functions and architecture of OCM

OCM is designed to simplify the administration of multiple Kubernetes clusters deployed in hybrid environments, and can be used to extend multi-cluster management capabilities for different management tools in the Kubernetes ecosystem. OCM distills the basic concepts needed for multi-cluster management, holding that any multi-cluster management tool should have the following capabilities:

  1. Understand the definition of clusters
  2. Select one or more clusters by a scheduling method
  3. Distribute configuration or workload to one or more clusters
  4. Govern user access control over clusters
  5. Deploy management probes to multiple clusters

OCM adopts a hub-agent architecture and provides several multi-cluster management primitives and foundational components to meet the above requirements (a minimal sketch of these APIs follows the list):

  • The ManagedCluster API defines a managed cluster. In addition, OCM installs an agent named Klusterlet in each managed cluster to handle cluster registration, lifecycle management, and other functions.
  • The Placement API defines how a configuration or workload is scheduled to which clusters. The scheduling result is stored in the PlacementDecision API, which other configuration management and application deployment tools read to determine which clusters need to be configured and deployed to.
  • The ManifestWork API defines the configuration and resource information to be distributed to a cluster.
  • The ManagedClusterSet API is used to group clusters and provides a boundary for user access to clusters.
  • The ManagedClusterAddOn API defines how management probes are deployed to multiple clusters and how they communicate securely with the control plane on the hub side.
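A minimal, hedged sketch of these primitives in YAML; the cluster name, labels, and workload are hypothetical, API versions may vary across OCM releases, and the Placement below assumes a ManagedClusterSet has already been bound to its namespace:

```yaml
# Hub-side view of a registered cluster; the Klusterlet agent in the
# managed cluster drives registration, and the hub approves it here.
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster1
  labels:
    environment: production        # hypothetical label, matched by the Placement below
spec:
  hubAcceptsClient: true
---
# Select at most one production cluster; assumes a ManagedClusterSet is
# bound to the "default" namespace via a ManagedClusterSetBinding.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-production
  namespace: default
spec:
  numberOfClusters: 1
  predicates:
  - requiredClusterSelector:
      labelSelector:
        matchLabels:
          environment: production
---
# Distribute a workload to cluster1; the ManifestWork is created in the
# cluster's own namespace on the hub and pulled by its Klusterlet.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: hello-work
  namespace: cluster1
spec:
  workload:
    manifests:
    - apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: hello
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: hello
        template:
          metadata:
            labels:
              app: hello
          spec:
            containers:
            - name: hello
              image: quay.io/example/hello:latest   # hypothetical image
```

The scheduling result of the Placement is written to a PlacementDecision in the same namespace, which downstream tools consume to decide where to create ManifestWorks.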

The architecture is shown in the figure below: registration is responsible for cluster registration, cluster lifecycle management, and the registration and lifecycle management of management addons; work is responsible for resource distribution; placement is responsible for scheduling workloads across clusters. On top of this, a developer or SRE team can easily develop and deploy management tools for different scenarios based on the API primitives provided by OCM.

By leveraging OCM’s API primitives, you can simplify the deployment and operation of many other open source multi-cluster management projects, and extend multi-cluster management capabilities to many single-cluster Kubernetes management tools. For example:

  1. Simplify the management of multi-cluster network solutions such as Submariner, using OCM’s addon management capabilities to centralize Submariner’s deployment and configuration on a unified management platform.
  2. Provide rich multi-cluster workload scheduling policies and a reliable resource distribution engine for application deployment tools (KubeVela, ArgoCD, etc.).
  3. Extend existing Kubernetes single-cluster security policy governance tools (Open Policy Agent, Falco, etc.) with multi-cluster security policy governance capabilities.

OCM also ships two built-in management addons, for application deployment and for security policy management. The application deployment addon adopts a subscriber model and can obtain application deployment resources from different sources by defining subscription channels. Its architecture is shown in the figure below.
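A minimal, hedged sketch of this subscriber model, assuming the Channel and Subscription APIs from OCM’s application lifecycle addon; the repository URL, namespaces, and placement reference are hypothetical:

```yaml
# A Channel points at a source of deployable resources, here a Git repository.
apiVersion: apps.open-cluster-management.io/v1
kind: Channel
metadata:
  name: sample-channel
  namespace: app-samples
spec:
  type: Git
  pathname: https://github.com/example/app-manifests.git   # hypothetical repository
---
# A Subscription consumes the channel and delivers its contents to the
# clusters selected by a placement reference.
apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: sample-subscription
  namespace: app-samples
spec:
  channel: app-samples/sample-channel
  placement:
    placementRef:
      kind: PlacementRule          # newer releases can also reference the Placement API
      name: sample-placement
```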

At the same time, to integrate closely with the Kubernetes ecosystem, OCM has implemented several designs from Kubernetes SIG-Multicluster, including the clusterset concept from KEP-2149 (ClusterID) and KEP-1645 (Multi-Cluster Services API). We are also working with other developers in the community to promote the Work API.

The main advantages of OCM

Highly modular – modules are optional and prunable

The overall OCM architecture resembles a “microkernel” operating system: the OCM chassis provides core capabilities such as cluster metadata abstraction, while other advanced capabilities are deployed as separate components that can be detached. As shown above, apart from the core capabilities, every higher-level capability in the OCM scheme can be tailored to actual needs. For example, if we do not need complex cluster topology relationships, the cluster grouping module can be pruned away; if we do not need OCM to distribute any resources and use it only for metadata, we can even prune away the entire resource-distribution agent component. This also helps guide users into OCM step by step: users may use only a small set of features at the start and then gradually introduce more feature components as their scenarios expand, with support for hot-plugging on a running control plane.

More inclusive – the Swiss Army knife for complex use scenarios

From the beginning of its design, the OCM scheme considered building advanced capabilities for complex scenarios by integrating mainstream third-party technical solutions. For example, to support more complex rendering of application resources, OCM supports installing applications as Helm Charts and loading remote Chart repositories. At the same time, the Addon framework is provided so that users can customize for their own needs through extensible interfaces; Submariner, for example, is a multi-cluster networking solution built on the Addon framework.
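For instance, a hedged sketch of loading a remote Helm Chart repository through the same channel mechanism; the apps.open-cluster-management.io API group is assumed, and the repository URL and chart name are hypothetical:

```yaml
# A HelmRepo channel loads charts from a remote chart repository.
apiVersion: apps.open-cluster-management.io/v1
kind: Channel
metadata:
  name: charts
  namespace: app-samples
spec:
  type: HelmRepo
  pathname: https://charts.example.com/stable   # hypothetical chart repository
---
# Subscribe to a single chart from that repository by name.
apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: nginx
  namespace: app-samples
spec:
  channel: app-samples/charts
  name: nginx-ingress              # chart name within the repository
```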

Ease of use – Reduces complexity

To reduce user complexity and ease migration to the OCM solution, OCM provides a traditional, directive-based multi-cluster federated control flow. It is worth noting that the following features are still in development and will be available in a later release (an illustrative sketch follows the list):

  • ManagedClusterAction enables us to issue atomic instructions to managed clusters one by one, which is also the most intuitive way for a central control system to automate the orchestration of each cluster. A ManagedClusterAction carries its own instruction type, instruction content, and instruction execution status.
  • ManagedClusterView allows us to actively “project” resources from a managed cluster into the multi-cluster federated system. By reading the “projections” of these resources in the federated system, we can make more dynamic and accurate decisions.
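Since both APIs were still in development at the time of writing, the following sketch is purely illustrative; the API groups, versions, and fields are assumptions based on the related open-cluster-management repositories:

```yaml
# Illustrative only: issue a one-shot instruction to a managed cluster.
apiVersion: action.open-cluster-management.io/v1beta1
kind: ManagedClusterAction
metadata:
  name: scale-nginx
  namespace: cluster1              # the managed cluster's namespace on the hub
spec:
  actionType: Update
  kube:
    resource: deployment
    namespace: default
    template:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx
      spec:
        replicas: 2
---
# Illustrative only: project a resource from the managed cluster onto the hub.
apiVersion: view.open-cluster-management.io/v1beta1
kind: ManagedClusterView
metadata:
  name: view-nginx
  namespace: cluster1
spec:
  scope:
    resource: deployment
    name: nginx
    namespace: default
```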

OCM practice at Ant Group

OCM technology is already being applied in Ant Group’s infrastructure. As a first step, operational methods similar to the community’s Cluster API were used to deploy OCM Klusterlets into the managed clusters one by one, unifying the metadata of dozens of online and offline clusters in the Ant domain into OCM. These Klusterlets provide basic multi-cluster management and maintenance (O&M) capabilities for the upper-layer product platforms, laying the groundwork for future feature expansion. Specifically, this first step with OCM covered the following aspects:

  • **Certificate-free:** In traditional multi-cluster federated systems, we often need to configure an access certificate in each cluster’s metadata, which is a required field in KubeFed v2’s cluster metadata model. Because OCM as a whole adopts a pull architecture, the agent deployed in each cluster pulls tasks from the hub and the hub never actively accesses the actual clusters, so each cluster’s metadata is just a thoroughly desensitized placeholder. And because certificate information need not be stored, the OCM scheme carries no risk of certificates being copied and misused.
  • **Automated cluster registration:** The previous registration process involved substantial manual intervention, which lengthened collaborative communication time and sacrificed flexibility, such as site-level or machine-room-level elasticity. In the many scenarios where manual verification is essential, one can take full advantage of the audit and verification capabilities of OCM cluster registration and integrate them with in-house approval process tools to automate the entire cluster registration process, achieving the following goals:

(1) Simplify the cluster initialization/takeover process. (2) Control the permissions of the management center more explicitly.

  • **Automated cluster resource installation/uninstallation:** Takeover mainly involves two things: (a) installing the application resources the management platform requires in the cluster, and (b) recording the cluster metadata into the management platform. For (a), resources can be further divided into cluster level and namespace level; (b) is generally a critical operation for the upper management system, since the product considers the cluster taken over from the moment the metadata is entered. Before OCM was introduced, all the preparatory work was driven manually, step by step. With OCM the whole process can be automated, reducing the cost of human collaboration and communication. The essence is to organize cluster management into a process: defining the concept of state on top of cluster metadata so that the product hub can automate the procedural chores. With OCM, the resource installation and uninstallation flow is clearly defined as soon as the cluster is registered.

Through the above work, dozens of clusters in the Ant domain have been brought under OCM management. During major promotion events, clusters that were automatically created and deleted were likewise automatically registered and removed. Going forward, the team also plans to integrate with application management technologies such as KubeVela to jointly complete cloud-native management of applications, security policies, and more in the Ant domain.

OCM practice at Alibaba Cloud

At Alibaba Cloud, the OCM project is one of KubeVela’s core dependencies for undifferentiated application delivery in hybrid environments. KubeVela is a “one-stop” application management and delivery platform based on the Open Application Model (OAM), and is currently the only cloud native application platform project hosted by the CNCF Foundation. Functionally, KubeVela provides developers with an end-to-end application delivery model, plus multi-cluster-oriented operation capabilities such as canary releases, elastic scaling, and observability, and can deliver and manage applications across hybrid environments in a unified workflow. Throughout this process, OCM is the main technology with which KubeVela implements Kubernetes cluster registration and management and application distribution policies.

On the public cloud, these KubeVela capabilities combined with the multi-cluster management capability of Alibaba Cloud ACK provide users with a powerful application delivery control plane that makes it easy to achieve:

  • One-click setup in a hybrid environment: For example, a typical hybrid environment could be a public cloud ACK cluster (production cluster) plus a local Kubernetes cluster (test cluster) managed through ACK multi-cluster. In these two environments, the providers of application components often differ; for example, the database component might be MySQL in the test cluster and an Alibaba Cloud RDS instance on the public cloud. In such a hybrid environment, traditional application deployment and O&M are extremely complex. KubeVela, on the other hand, makes it easy for users to define in a single deployment plan the artifacts to deploy, the delivery workflow, and the differentiated configuration of each environment. This not only eliminates the tedious manual configuration process, but also significantly reduces release and O&M risk thanks to Kubernetes-grade automation and determinism.
  • Multi-cluster microservice application delivery: Cloud native microservice applications are often composed of diverse components, such as container components, Helm components, middleware components, and cloud service components. KubeVela provides users with a multi-component application delivery model oriented to microservice architectures; with the help of the distribution policies provided by OCM, it can perform unified application delivery across multi-cluster and hybrid environments, greatly reducing the difficulty of operating and managing microservice applications (a minimal sketch follows).
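As a hedged sketch of such a deployment plan, the KubeVela Application below delivers one component to clusters selected by label; the component, image, cluster labels, and the topology policy and deploy workflow step are assumptions that depend on the KubeVela release:

```yaml
# One application, delivered to the clusters selected by a topology policy.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: example-app
  namespace: default
spec:
  components:
  - name: web                      # hypothetical container component
    type: webservice
    properties:
      image: example/web:v1        # hypothetical image
  policies:
  - name: to-production
    type: topology                 # selects target clusters by label
    properties:
      clusterLabelSelector:
        environment: production
  workflow:
    steps:
    - name: deploy-production
      type: deploy                 # deliver using the policy above
      properties:
        policies: ["to-production"]
```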

In the future, the Alibaba Cloud team will work with the Red Hat/OCM community, Oracle, Microsoft, and other partners to further improve KubeVela’s application orchestration, delivery, and O&M capabilities for hybrid environments, so that delivering and managing microservice applications in the cloud native era can truly be “fast and good”.

Join the community

The OCM community is still in the early stage of rapid growth, and interested enterprises, organizations, schools, and individuals are all welcome to participate. Here, you can work alongside technical experts from Ant Group, Red Hat, and Alibaba Cloud, as well as Kubernetes core contributors, to learn, build, and drive OCM adoption together.

  • GitHub (github.com/open-cluste…
  • Learn about OCM through videos (www.youtube.com/channel/UC7…
  • Meet everyone at the weekly community meeting (docs.google.com/document/d/…
  • Communicate freely in the Kubernetes Slack channel #open-cluster-mgmt (slack.k8s.io/)
  • Join the mailing list for key discussions (groups.google.com/g/open-clus…
  • Visit the community website for more information (open-cluster-management.io)


This article is original content from Alibaba Cloud and may not be reproduced without permission.