background

SOFAStack is Ant Group's commercial, financial-grade cloud native architecture product. Based on SOFAStack, users can quickly build a cloud native microservice system and develop cloud native applications that are more reliable, more scalable, and easier to maintain. At the macro architecture level, SOFAStack provides an evolution path from a single-data-center architecture to same-city active-active, two-site three-center, and remote multi-active architectures, enabling system capacity to be expanded and scheduled across multiple data centers, making full use of server resources, providing data-center-level disaster recovery (DR), and ensuring business continuity.

At the application lifecycle management level, SOFAStack provides a multi-mode application PaaS platform, Cloud Application Fabric Engine (SOFAStack CAFE). It provides PaaS capabilities for application management, process orchestration, application deployment, and cluster operation and maintenance, meeting the O&M requirements of both classic and cloud native architectures in financial scenarios, smoothing the architecture evolution for traditional applications, and guarding against fintech risks.

In terms of cloud native operation and maintenance, SOFAStack CAFE provides unitized, multi-cluster publishing and O&M capabilities for cloud native applications through LHC (LDC Hybrid Cloud), enabling hybrid deployment of applications across multiple regions, data centers, and clouds. This article will demystify LHC and detail some of our practices in its underlying Kubernetes multi-cluster publishing system.

challenge

When the LHC product was born, the first problem we faced was choosing an appropriate underlying Kubernetes multi-cluster framework. At that time, the Kubernetes community had just completed its official multi-cluster project KubeFed, which provided a set of multi-cluster infrastructure capabilities such as multi-cluster management, multi-cluster distribution of Kubernetes resources, and status feedback, so it naturally became our best choice at the time.

However, as mentioned earlier, the community framework provides only "basic capabilities", and quite a few of its design points fall short of, or even conflict with, the needs of our unitized hybrid cloud product. The most prominent problem is that the community has no concept of a "deployment unit": its multi-cluster model is a pure multi-Kubernetes-cluster model, and for any multi-cluster Kubernetes resource (called a federated resource in KubeFed), the distribution topology can only be expressed in terms of clusters. In the unitized model, however, the resources of an application service are distributed across multiple deployment units, and the relationship between deployment units and clusters is flexible. In our current model, the relationship between clusters and deployment units is 1:N, that is, one Kubernetes cluster can contain multiple deployment units. Here we hit the biggest point of divergence from the community framework, and our biggest challenge: the upper-layer business needs to manage Kubernetes resources in the deployment unit dimension, while the underlying community framework only understands clusters.

In addition, the basic capabilities covered by KubeFed itself are not enough to meet all of our requirements: for example, it lacks tenant isolation between clusters, does not support delivering resource annotations, and places high demands on network connectivity between the host cluster and member clusters. Resolving these conflicts and filling in the missing capabilities therefore became the key issues in building the underlying multi-cluster capability of the LHC product.

practice

Below we walk through, module by module, our specific practices in building the underlying Kubernetes multi-cluster capability for the LHC product.

Multi-topology federated CRD

In the community KubeFed framework, we distribute Kubernetes resources to multiple clusters through federated CRs. A typical federated CR spec looks like this:
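For reference, here is a minimal sketch modeled on the community FederatedDeployment sample (KubeFed `types.kubefed.io/v1beta1`); the cluster names and the override value are purely illustrative:

```yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: demo-deployment
  namespace: demo
spec:
  placement:
    # clusters to distribute to, referenced by KubeFedCluster name
    clusters:
    - name: cluster-a
    - name: cluster-b
  template:
    # the single-cluster Deployment body
    metadata:
      labels:
        app: demo
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: demo
      template:
        metadata:
          labels:
            app: demo
        spec:
          containers:
          - name: demo
            image: nginx:1.21
  overrides:
  # per-cluster customization of the template
  - clusterName: cluster-b
    clusterOverrides:
    - path: "/spec/replicas"
      value: 3
```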

You can see that it mainly contains three fields: placement specifies the clusters to distribute to, template contains the single-cluster resource body of the federated resource, and overrides specifies, for each member cluster, the customized parts of the resource body in the template.

As mentioned earlier, Kubernetes resources of a unitized application need to be distributed in the deployment unit dimension rather than the cluster dimension, so the community CRD above is clearly not sufficient and needs to be modified. After modification, the spec of the new federated CR looks like this:
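The sketch below only illustrates the idea described in the text: the topologyType field is taken from the article, while the API group/version and the exact shape of the placement and overrides entries are assumptions for illustration.

```yaml
apiVersion: types.sofastack.io/v1alpha1   # illustrative group/version
kind: FederatedDeployment
metadata:
  name: demo-deployment
  namespace: demo
spec:
  # distribute by deployment unit (cell) instead of by cluster
  topologyType: cell
  placement:
    cells:
    - name: rz00a
    - name: rz01a
  template:
    # same single-cluster Deployment body as in the community CRD
    spec:
      replicas: 2
  overrides:
  # per-deployment-unit customization (field names illustrative)
  - cellName: rz01a
    cellOverrides:
    - path: "/spec/replicas"
      value: 3
```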

As you can see, rather than completely discarding the community CRD, we "upgraded" it: by turning the concrete "cluster" into an abstract "topology", the distribution topology of a federated resource becomes fully customizable, breaking the constraint of the single cluster dimension. In the spec above, setting topologyType to cell means the resource is distributed in the deployment unit dimension, while setting it to cluster remains fully compatible with the community's native cluster-dimension distribution.

Of course, defining a new CRD alone doesn't solve the problem; we also need to modify the corresponding implementation to make it work. However, making the community KubeFed Controller aware of the multi-topology model would require extensive changes to its underlying implementation, likely ending up as a half rewrite, with high development cost, and the modifications could not be contributed back upstream, resulting in high maintenance cost as well. We needed a better way to decouple our multi-topology model from KubeFed's native multi-cluster capabilities.

Separate and extend the federated layer ApiServer

Since we don't want to make overly intrusive changes to the community KubeFed Controller, we clearly need a conversion layer to transform the multi-topology federated CRDs described above into the corresponding community CRDs. For a given topology the conversion logic is deterministic, so the simplest and most efficient approach is to let the Kubernetes ApiServer handle the conversion directly. The CRD Conversion Webhook capability of the ApiServer meets the implementation requirements of this conversion layer exactly.

Therefore, we combine the KubeFed Controller with a dedicated Kubernetes ApiServer to form a separate Kubernetes control plane, which we call the "federated layer". This independent control plane contains only federation and multi-cluster related data, ensuring that it does not interfere with other Kubernetes resources and avoiding a heavy dependency on an external Kubernetes cluster at deployment time.

So what is special about the federated-layer ApiServer? Its main body is still the native Kubernetes ApiServer, providing everything a regular ApiServer can provide. What we did was "wrap" it and build in the capabilities that the federated layer needs to extend. Some of the key extension capabilities are described below.

Built-in multi-topology federated CRD conversion capability

As mentioned above, this is the most important capability the federated-layer ApiServer provides. Using the multi-version capability of Kubernetes CRDs, we define our multi-topology federated CRD and the community CRD as two versions of the same CRD, and integrate the Conversion Webhook for this CRD into the federated-layer ApiServer, where we can customize the conversion between the two. In this way, on the federated-layer control plane, any federated CR can be read and written in both forms at the same time: the upper-layer business only cares about deployment units (or other business topologies), while the underlying KubeFed Controller still only cares about clusters, giving it transparent support for the multi-topology federated CRD model.
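As a concrete illustration, a CRD that serves both versions and converts between them could be declared roughly as below. This is a sketch: the version names and the webhook service name are assumptions, and in the design described here the conversion logic is built into the federated-layer ApiServer itself, so the clientConfig is purely illustrative.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: federateddeployments.types.kubefed.io
spec:
  group: types.kubefed.io
  scope: Namespaced
  names:
    kind: FederatedDeployment
    plural: federateddeployments
    singular: federateddeployment
  versions:
  - name: v1beta1        # community cluster-topology version, consumed by the KubeFed Controller
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v1alpha1       # multi-topology version used by the upper-layer business (illustrative name)
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: federation-conversion-webhook   # assumed service name
          namespace: kube-federation-system
          path: /convert
```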

The following takes the deployment unit topology as an example to briefly introduce the conversion between the deployment unit topology and the cluster topology. In KubeFed, a member cluster is managed by creating a KubeFedCluster object that contains the cluster's access configuration, and the clusters to distribute to are then specified in the placement of the federated CR by KubeFedCluster object name. So all our conversion logic needs to do is map the deployment unit names in the multi-topology federated CR to the KubeFedCluster object names of their corresponding clusters. Since the relationship between clusters and deployment units is 1:N, we only need to create, for each deployment unit, an additional KubeFedCluster object containing its cluster's access configuration, and use a unified naming rule so that the object's name can be derived from the deployment unit's namespace (that is, its tenant and workspace group) and name.
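For illustration only (the naming rule and label keys are assumptions consistent with the description above), such a per-deployment-unit KubeFedCluster might look like:

```yaml
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  # name derived from tenant, workspace group and deployment unit (assumed rule)
  name: tenant1-ws1-rz00a
  namespace: kube-federation-system
  labels:
    cafe.sofastack.io/tenant: tenant1          # illustrative label keys
    cafe.sofastack.io/workspace-group: ws1
    cafe.sofastack.io/cell: rz00a
spec:
  # access configuration of the physical cluster that hosts this deployment unit;
  # multiple deployment units in the same cluster share the same endpoint
  apiEndpoint: https://cluster-a.example.com:6443
  secretRef:
    name: cluster-a-credentials
```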

By analogy, we can easily support more topology types in a similar way, greatly increasing our flexibility in using the federated model.

Support for using MySQL/OB directly as the etcd database

An etcd database is an essential dependency of any Kubernetes ApiServer. On Ant's main site we have abundant physical machine resources and a strong DBA team that can keep etcd continuously highly available, but this is not the case in the complex external delivery scenarios of SOFAStack products. Outside the main site, the cost of operating and maintaining etcd is much higher than that of MySQL. In addition, SOFAStack is often delivered together with OceanBase, and we also want to make full use of OB's mature multi-data-center disaster recovery capability to solve the database high availability problem.

Therefore, after some research and experimentation, we integrated Kine, the k3s community's open source etcd-on-MySQL adapter, into the federated-layer ApiServer, enabling it to use an ordinary MySQL database directly as the etcd backend and eliminating the burden of maintaining a separate etcd. We also adapted to some behavioral differences between OB and MySQL (such as auto-increment value jumps after a primary switchover) to make it fully compatible with OB, so as to enjoy the high data availability and strong consistency OB provides.
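As a rough sketch of this setup (not the actual deployment manifest; the image tags, DSN, and flag spellings follow the Kine documentation as we understand it and should be treated as assumptions), Kine can run next to the federated-layer ApiServer and speak the etcd API on its behalf:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: federation-apiserver
  namespace: kube-federation-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: federation-apiserver
  template:
    metadata:
      labels:
        app: federation-apiserver
    spec:
      containers:
      # Kine translates the etcd API into SQL against MySQL/OB
      - name: kine
        image: rancher/kine:v0.9.3          # illustrative tag
        args:
        - --endpoint=mysql://user:password@tcp(oceanbase-proxy:3306)/federation
        - --listen-address=0.0.0.0:2379
      # the federated-layer ApiServer simply points its etcd client at Kine
      - name: apiserver
        image: registry.example.com/federation-apiserver:latest   # illustrative image
        args:
        - --etcd-servers=http://127.0.0.1:2379
```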

In addition, we also integrated some Admission Plugins into the federated-layer ApiServer to validate and initialize federation-related resources. Since most of them are tied to product business semantics, we won't go into details here.

It's worth noting that these extensions could all be split out into separate components and webhooks, so they can also be delivered as community-style plug-in installations that do not depend heavily on a standalone ApiServer. We currently use an independent ApiServer mainly to isolate federated-layer data and to make independent deployment and maintenance easier.

To sum up, at the architecture level the federated-layer ApiServer mainly acts as a north-south bridge for federated resources. As shown in the figure below, southbound it relies on the CRD multi-version capability to provide the KubeFed Controller with ApiServer capabilities that carry the community federated resources, while northbound it provides upper-layer business products with mapping and conversion from business topologies (deployment units) to the cluster topology.

KubeFed Controller capability enhancements

As mentioned earlier, beyond the federated model, the community KubeFed Controller also could not meet all of our requirements at the capability level, so we enhanced it during productization. Some general enhancements, such as support for configuring Controller worker concurrency and member cluster Informer cache sync timeouts, and support for preserving special fields of Service, have been contributed back to the community. Other, higher-level capabilities are implemented behind Feature Gates, so that our code base stays in real-time sync with the upstream community. Here are a few representative enhancements.

Support for member cluster multi-tenant isolation

In SOFAStack products, whether on public or private cloud, all resources are isolated at tenant and workspace (group) granularity, so that different users and their environments do not affect each other. For KubeFed, the main resource of concern is the Kubernetes cluster, and the community implementation does not isolate it at all, as can be seen from the deletion logic of a federated resource: when a federated resource is deleted, KubeFed checks all clusters managed in its control plane to make sure the corresponding single-cluster resource is deleted from every member cluster. Under SOFAStack's product semantics, this is clearly unreasonable and creates the risk of different environments affecting each other.

Therefore, we extended federated resources and the KubeFedCluster object (which represents a managed member cluster in KubeFed) in a non-invasive way: by injecting some well-known labels, they carry business-layer metadata such as tenant and workspace group information. Using this data, the KubeFed Controller pre-selects member clusters when processing a federated resource, so that any processing of the resource is limited to reading and writing within its own tenant and workspace group, achieving complete isolation across tenants and environments.
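A sketch of the idea, with illustrative label keys (the actual well-known labels are not spelled out in the article): the same tenant and workspace group labels appear on both the federated resource and the KubeFedCluster objects shown earlier, and the controller only considers clusters whose labels match, so a delete can never reach clusters belonging to another environment.

```yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: demo-deployment
  namespace: demo
  labels:
    # business-layer metadata injected as well-known labels (keys illustrative)
    cafe.sofastack.io/tenant: tenant1
    cafe.sofastack.io/workspace-group: ws1
spec:
  placement:
    clusters:
    # only KubeFedCluster objects carrying the same tenant/workspace labels are candidates
    - name: tenant1-ws1-rz00a
```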

Support for grayscale publishing

Grayscale publishing is an essential capability for a financial-grade publishing and deployment platform like SOFAStack CAFE. For any change to an application service, we want it to be published gradually, under the user's control, to specified deployment units. This imposes corresponding requirements on the underlying multi-cluster resource management framework.

As can be seen from the introduction of the federated CRD above, we specify the deployment units (or other topologies) a federated resource should be distributed to through placement. On the first delivery of a federated resource we can publish gradually by adding deployment units to the placement one batch at a time; but when we update the resource there is no way to do this: the placement already contains all deployment units, so any change to the federated resource is immediately synchronized to all of them. Nor can we shrink the placement to only the deployment units we want to publish to, because that would cause the resources in the other deployment units to be deleted immediately. To support grayscale publishing, we therefore need the ability to specify which deployment units in the placement should be updated while the rest remain unchanged.

To do this, we introduced the concept of a placement mask. As the name suggests, it acts like a mask over the placement: when the KubeFed Controller processes a federated resource, the set of topologies it updates becomes the intersection of the placement and the placement mask. We then only need to set the placement mask when updating a federated resource to precisely control the range of deployment units affected by the change, achieving fully controllable grayscale publishing.
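A sketch of what this could look like on the multi-topology federated CR (the placementMask field name and its exact position are illustrative, not the product's actual schema):

```yaml
spec:
  topologyType: cell
  placement:
    cells:
    - name: rz00a
    - name: rz01a
  # only the intersection of placement and placementMask is updated;
  # here only rz00a receives this change (field name illustrative)
  placementMask:
    cells:
    - name: rz00a
```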

As shown in the figure below, we add a placement mask containing only deployment unit RZ00A to the federated resource. You can see that the sub-resource in RZ00A has been updated successfully (its generation becomes 2 after the update), while the resource in RZ01A is not processed (its generation is unchanged), achieving the effect of a grayscale release.

It is worth mentioning that the introduction of the placement mask solves not only grayscale publishing but also publishing during disasters. If some clusters (deployment units) become unavailable due to a data center disaster, we can still publish normally to the remaining available deployment units through the placement mask, without the whole multi-cluster release being blocked by a local failure. After the cluster recovers, the placement mask also prevents unexpected automatic changes to the newly recovered deployment units, ensuring that release changes remain strongly controlled.

Support for custom annotation delivery policies

KubeFed follows a principle when delivering resources: it delivers only spec-like attributes, never status-like attributes. The rationale is simple: the spec of member cluster resources should be strongly controlled by the federated layer, while their status should be left alone. For any Kubernetes object, most attributes can be clearly classified as spec-like or status-like: beyond the spec and status fields themselves, metadata such as name and labels behave like spec, while fields such as creationTimestamp and resourceVersion behave like status. But there is an exception to the rule: one attribute that can act as both spec and status is annotations.

In many cases it is not possible to converge everything that is really spec or status of a Kubernetes object into its actual spec and status fields. A typical example is Service. Readers familiar with Service will know that a LoadBalancer-type Service, together with the Cloud Controller Manager (CCM) provided by different cloud vendors, can implement load balancer management on different cloud platforms. Service is a built-in Kubernetes object, so its spec and status are fixed and not extensible, yet different cloud vendors support different load balancer parameters; these configurations therefore naturally end up in the Service's annotations, where they play the role of spec. At the same time, some CCMs also write certain load balancer states back into annotations, such as intermediate states and error messages during load balancer creation; here the Service's annotations play the role of status. So KubeFed faces a question: should the annotations field be delivered or not? The community chose not to deliver annotations at all, which preserves their ability to act as status but gives up any control of annotations as spec.

So is it possible to have both? Of course. In KubeFed, for each Kubernetes resource type that needs multi-cluster distribution, a FederatedTypeConfig object must be created at the federated layer to specify information such as the federated type and the GVK of the single-cluster type. Since which annotation keys are spec-like and which are status-like is also specific to each resource type, we added a propagating annotations configuration to this object: it tells the KubeFed Controller which annotation keys of the resource type should be delivered and strongly controlled as spec, while the remaining keys are treated as status and their values on member cluster resources are never overwritten. With this extension we can flexibly customize the annotation delivery policy for any Kubernetes resource type and achieve complete spec control of multi-cluster resources.

In the case of Service, we configure the FederatedTypeConfig object of this type as follows:
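The sketch below follows the standard FederatedTypeConfig layout; the propagatingAnnotations field is our extension as described above, and its exact name and the annotation key listed are illustrative.

```yaml
apiVersion: core.kubefed.io/v1beta1
kind: FederatedTypeConfig
metadata:
  name: services
  namespace: kube-federation-system
spec:
  propagation: Enabled
  targetType:
    kind: Service
    pluralName: services
    scope: Namespaced
    version: v1
  federatedType:
    group: types.kubefed.io
    kind: FederatedService
    pluralName: federatedservices
    scope: Namespaced
    version: v1beta1
  # extension field (name illustrative): annotation keys delivered and controlled as spec;
  # all other annotation keys are treated as status and left untouched on member clusters
  propagatingAnnotations:
  - service.beta.kubernetes.io/antcloud-loadbalancer-listener-protocol
```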

The first figure below shows the delivery template specified in the FederatedService, and the second shows the managed Service in an actual member cluster. As you can see, the spec-like annotation specified in the federated resource (such as service.beta.kubernetes.io/antcloud-loadbalancer-listener-protocol) has been successfully delivered to the member cluster resource, while the status-like annotation belonging to the resource itself (such as status.cafe.sofastack.io/loadbalancer) remains intact, neither overwritten nor deleted by the strong control of the annotations field.
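For illustration, the managed Service in the member cluster would look roughly like this (annotation values are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-service
  namespace: demo
  annotations:
    # spec-like annotation: delivered from the federated resource and strongly controlled
    service.beta.kubernetes.io/antcloud-loadbalancer-listener-protocol: TCP
    # status-like annotation: written by the CCM in the member cluster and left untouched
    status.cafe.sofastack.io/loadbalancer: '{"status":"Ready"}'
spec:
  type: LoadBalancer
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080
```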

In addition, we enhanced the status feedback capability of the KubeFed Controller so that it can feed back the status-class fields of all federated resource types in real time, and added KMS-encrypted storage of member cluster access configurations at the federated layer to meet financial-grade security compliance requirements; space does not permit covering these in detail.

So far, the federated layer satisfies the vast majority of the needs of the upper-layer unitized application publishing and O&M products. But as mentioned above, we are building a "hybrid cloud" product, and in hybrid cloud scenarios heterogeneous clusters and network connectivity constraints are the most typical problems encountered when operating Kubernetes clusters. For the federated layer, which focuses mainly on Kubernetes application resource management, cluster heterogeneity has little impact: in theory, any cluster that conforms to a certain range of Kubernetes API versions can be managed directly. Connectivity constraints, however, can be fatal: because KubeFed controls member clusters in push mode, the KubeFed Controller must be able to directly access the ApiServer of every member cluster, which places very high demands on network connectivity. In many network environments such requirements simply cannot be met, or can only be met at great cost (for example, by connecting the central cluster to every user network). We therefore had to find a way to lower the federated layer's requirements on inter-cluster network connectivity, so that our product can adapt to more network topologies.

Integrate ApiServer Network Proxy

Since a direct forward connection from the KubeFed Controller to a member cluster's ApiServer may not be possible, we need a proxy between the two that can be connected in the reverse direction, and then provide forward access through the long-lived connection established by that proxy. ApiServer Network Proxy (ANP) is a proxy developed by the community to solve network isolation problems between the ApiServer and cluster-internal networks, and it provides exactly the reverse long-lived connection proxying we need, allowing us to access a member cluster's ApiServer without forward network connectivity.

However, ANP mainly solves the problem of accessing the ApiServer within a single cluster; its connection model is many clients accessing one ApiServer. For multi-cluster control like the federated layer, the connection model is the opposite: one client accessing many ApiServers. We therefore extended ANP's backend connection model to support dynamic routing by cluster name: an agent reports the name of the cluster it can reach when it establishes its long-lived connection to the ANP server, and the server subsequently routes requests carrying that cluster name onto the corresponding tunnel. The resulting architecture is shown below. By integrating this "multi-cluster extension" of ANP, we can easily manage multiple clusters even in demanding network environments.
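As a rough sketch (the flag names follow the upstream konnectivity proxy-agent as we understand them, and using the identifier to carry a cluster name is our assumed extension rather than upstream behavior), the agent deployed in a member cluster might be configured like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anp-agent
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: anp-agent
  template:
    metadata:
      labels:
        app: anp-agent
    spec:
      containers:
      - name: proxy-agent
        image: registry.k8s.io/kas-network-proxy/proxy-agent:v0.1.6   # illustrative tag
        args:
        # reverse tunnel to the ANP server on the federated-layer side
        - --proxy-server-host=federation-anp.example.com
        - --proxy-server-port=8091
        # identify this tunnel by cluster name so the server can route by cluster (assumed extension)
        - --agent-identifiers=host=cluster-a
```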

conclusion

Finally, let's use concrete product capabilities to briefly summarize some of the highlights of SOFAStack CAFE's multi-cluster product compared with the community KubeFed:

● Supports multi-tenancy: underlying Kubernetes cluster resources can be isolated at tenant and workspace level
● Breaks declarative constraints and supports fine-grained multi-cluster grayscale publishing
● Supports advanced capabilities such as custom annotation delivery, full status feedback, and KMS encryption of cluster access credentials
● Uses ANP to keep managing all user clusters in push mode when network connectivity is limited, for example in heterogeneous hybrid clouds
● The multi-cluster control plane can be deployed independently of any Kubernetes cluster and can use MySQL/OB directly as its backend database

At present, SOFAStack has been adopted by more than 50 financial institutions at home and abroad. Among them, enterprises such as Zhejiang Rural Credit and Sichuan Rural Credit are using CAFE's unitized hybrid cloud architecture to carry out full lifecycle management of containerized applications and to build multi-region, highly available multi-cluster management platforms.

future plans

As can be seen from the above practices, our current use of the underlying multi-cluster framework focuses mainly on Kubernetes cluster management and multi-cluster resource governance, but multi-cluster has much broader possibilities. Going forward, we will evolve capabilities that include, but are not limited to:

● Dynamic multi-cluster resource scheduling
● Multi-cluster HPA
● Multi-cluster Kubernetes API proxying
● A lightweight "CRD" capability that uses single-cluster native resources directly as templates

In the future, we will continue to share our thinking and practice around these capabilities. We welcome your continued attention to our multi-cluster products, and we are always open to comments and exchanges.