Abstract: At Huawei Developer Conference (Cloud) 2021, ICBC PaaS Cloud Platform Architect Shen Yifan delivered a keynote speech titled “ICBC Multi-K8s Cluster Management and Disaster Recovery Practice”, sharing ICBC’s hands-on experience with the multi-cloud container orchestration engine Karmada.

This article is shared from the Huawei Cloud community post “Karmada | ICBC Multi-K8s Cluster Management and Disaster Recovery Practice”, original author: Technology Torch Bearer.

The speech mainly includes four aspects:

1) The current state of ICBC's cloud platform
2) Multi-cluster management schemes and selection in the industry
3) Why Karmada?
4) Adoption status and future outlook

ICBC cloud computing business background

In recent years, the rise of the Internet has had a huge impact on the business and service models of the financial industry, forcing us to innovate. At the same time, moving banking business systems to the cloud has become the general trend. To date, ICBC has built an infrastructure cloud, an application platform cloud, a financial ecosystem cloud, and a branch cloud with ICBC characteristics, which together make up our overall cloud platform architecture.

The overall architecture of ICBC's cloud platform

ICBC Cloud Platform Technology Stack

The technology stack adopts industry-leading cloud products and mainstream open-source technologies, deeply customized for our financial business scenarios.

  • Infrastructure cloud: built on Huawei Cloud Stack 8.0 and customized to our operation and maintenance requirements to form a new generation of infrastructure cloud.
  • Application platform cloud: independently developed and built by introducing the open-source container technology Docker and the container orchestration technology Kubernetes.
  • Upper-layer application ecosystem: based on HAProxy, Dubbo, Elasticsearch, etc., supporting capabilities such as load balancing, microservices, holographic monitoring, and a log center.

ICBC financial cloud results

ICBC's financial cloud has also achieved a great deal on the container cloud side, which is first reflected in its large scale of cloud adoption. To date, the application platform cloud runs more than 200,000 containers, of which about 55,000 are business containers, and a number of core business systems already run entirely on the container cloud. Besides having the largest cloud adoption scale in the industry, our business scenarios are very broad: core applications and banking business systems such as personal financial accounts, quick payment, online channels, and commemorative coin reservation have all been containerized and deployed. Core technology support applications such as MySQL, various middleware, and the microservice framework are also on the cloud, as are new technology fields including the Internet of Things, artificial intelligence, and big data.

As more and more core business applications move onto the cloud, the biggest challenge for us is disaster recovery and high availability, and we have done a lot of work in this area:

1) The cloud platform supports a multi-level fault protection mechanism, ensuring that different instances of the same business are evenly distributed across different resource domains in the two-site, three-data-center architecture, so that the overall availability of the business is not affected when a single storage device, a single cluster, or even a single data center fails.

2) When a failure occurs, the cloud platform recovers automatically through container restarts and automatic rescheduling (drift).

We also encountered problems in our overall container cloud practice, the most prominent being the large number of clusters at the PaaS container layer: the total number of K8s clusters in ICBC has reached nearly one hundred, for four main reasons:

1) Cluster variety: as just mentioned, our business scenarios are very broad. GPU workloads need clusters with different GPU devices, while middleware and databases have different requirements for the underlying network and container storage. These inevitably lead to different solutions, so we have to customize clusters for different business scenarios.

2) K8s performance limits: performance bottlenecks in the scheduler, etcd, the API server, and other components put an upper limit on the size of each cluster.

3) Business is expanding very fast.

4) There are many failure domain partitions: our two-site, three-data-center architecture has at least three DCs, and the different network zones within each DC are isolated by firewalls. Multiplying these dimensions produces a large number of cluster failure domains.

Given these four points, our current approach still relies on the existing container cloud solution at the cloud management platform level: the cloud management platform manages these multiple K8s clusters, while upper-layer business applications have to choose a specific K8s cluster themselves based on their preferences, network, region, and so on. Once a K8s cluster is selected, we automatically spread instances across failure domains within it.

However, the existing solution still exposes many problems to upper-layer applications:

1) Upper-layer applications care about the container cloud's ability to scale automatically during business peaks, but autoscaling currently works only within a single cluster, not across clusters as a whole.

2) There is no cross-cluster automatic scheduling capability: scheduling happens only within a cluster, and applications have to select a specific cluster on their own.

3) Clusters are not transparent to upper-layer users.

4) There is no automatic migration capability across cluster failures. We still rely mainly on replica redundancy across the two-site, three-data-center architecture, so automated failure recovery is lacking in this respect.

Multi-cluster management scheme and selection in the industry

Based on this situation, we set out our goals, divided into five modules, and evaluated the solutions available in the industry:

We wanted an open-source project with a certain degree of community support, mainly for three reasons:

  • 1) The overall solution should be independently controllable within the enterprise, which is also a major benefit of open source.
  • 2) We did not want to invest more manpower than necessary in building and maintaining it ourselves.
  • 3) Why not build all the scheduling and failover capabilities into the cloud management platform mentioned earlier? Because we want multi-cluster management to be decoupled from the cloud management platform and to sink into a dedicated multi-cluster management layer beneath it.

Based on these goals, we conducted some research on community solutions.

Kubefed

The first project we investigated was the well-known cluster federation project. Federation is divided into V1 and V2; at the time of our investigation the main version was V2, namely Kubefed.

Kubefed covers part of the solution, providing cluster lifecycle management, override, and basic scheduling capabilities, but for us it has several fatal weaknesses:

1) Its scheduling capability is very basic, the community was not planning to invest much more in scheduling, it does not support custom scheduling, and it does not support scheduling by resource margin.

2) As is well known, it does not support native K8s objects: to manage resources across clusters you have to use its newly defined CRDs. Our upper-layer applications have used native K8s resource objects for a long time, and the cloud management platform integrates with the native APIs, so we would have to do a lot of extra development and the cost would be very high.

3) It basically has no automatic failover capability.

RHACM

The second project we investigated was RHACM, led mainly by Red Hat and IBM. Overall we found its functionality relatively complete, covering the capabilities mentioned above, and its upper layer is positioned closer to the user, like a cloud management platform. However, it only supports OpenShift, which would make retrofitting the large number of K8s clusters we already have too expensive and too heavy. At the time of our research it was not open source, and the overall community support was not strong enough.

Karmada

At that time we discussed the current situation and pain points of multi-cluster management with Huawei, including the community federation projects, and both sides hoped to innovate in this area. The figure below shows Karmada's functional view:

Karmada functional view

From its overall functional view and roadmap, Karmada matches the goals described above very well: it provides cluster lifecycle management, cluster registration, multi-cluster scaling, multi-cluster scheduling, and a unified API, and it supports underlying standard APIs such as CNI and CSI. Its roadmap also covers upper-layer capabilities such as Istio, service mesh, and CI/CD. Because its overall approach fits us so well, ICBC decided to participate in the project, working with Huawei and many other partners to build Karmada and give it back to the community.

Why Karmada

The technical architecture

In my personal understanding, it has the following advantages:

1) Karmada is deployed in K8s-native form, with its own API Server, Controller Manager, and so on. For enterprises that already own many K8s clusters, the transformation cost is relatively small: we only need to deploy a management control plane on top.

2) karmada-controller-manager manages a variety of CRDs, including Cluster, Policy, Binding, and Work, as management-side resource objects, without intruding on the native K8s resource objects we want to deploy.

3) Karmada only manages the scheduling of resources across clusters; scheduling within each member cluster remains highly autonomous.

How does Karmada distribute resources?

  • Step 1: register the member clusters with Karmada (see the Cluster object sketch below).
  • Step 2: define the resource template.
  • Step 3: define the distribution strategy (PropagationPolicy).
  • Step 4: define the OverridePolicy.
  • Step 5: watch Karmada do the work.
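
For the registration step, Karmada supports a push mode (for example via karmadactl join, which creates a Cluster object on the control plane) and a pull mode based on the karmada-agent. The following is only a rough sketch of such a Cluster object; the member name, endpoint, and secret values are placeholders, and field names may vary by Karmada release:

```yaml
# Sketch of a Cluster object on the Karmada control plane after registration.
# Endpoint and secret values are placeholders for illustration only.
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member1
spec:
  apiEndpoint: https://member1.example.com:6443   # member cluster's API server (illustrative)
  syncMode: Push                                  # Push: control plane pushes resources; Pull uses karmada-agent
  secretRef:                                      # credentials Karmada uses to reach the member cluster
    namespace: karmada-cluster
    name: member1
```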

The figure below shows the overall distribution process:

When we define a Deployment, it is matched by a PropagationPolicy, which generates a binding, namely the ResourceBinding. The binding is then processed against the OverridePolicy to generate the individual Work objects. A Work is essentially the encapsulation of a resource object for a member cluster. In my opinion, the Propagation and Work mechanisms are the most important parts here.
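
For example, the override step in this flow can be expressed as an OverridePolicy. The following is only a sketch, assuming an nginx Deployment and a member1 cluster; the replica value is invented for illustration and field names may differ between Karmada versions:

```yaml
# Sketch of an OverridePolicy: adjust the nginx Deployment only for cluster member1.
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-override
  namespace: default
spec:
  resourceSelectors:               # which resource templates this override applies to
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  targetCluster:
    clusterNames:
      - member1
  overriders:
    plaintext:                     # JSON-patch style overrides applied before Work generation
      - path: /spec/replicas
        operator: replace
        value: 2                   # illustrative value
```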

The Propagation mechanism

First, we define the PropagationPolicy. As the YAML shows, we chose a simple strategy and selected a cluster named member1. The policy also specifies which K8s resource templates it should match; here it matches an nginx Deployment in the default namespace. Besides cluster affinity, the policy also supports cluster tolerations, distribution by cluster label, and failure domains.
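
A minimal PropagationPolicy matching that description might look like the sketch below, modeled on Karmada's public examples (exact fields may vary by release):

```yaml
# Sketch of a PropagationPolicy: propagate the nginx Deployment in "default" to member1.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  resourceSelectors:            # which K8s resource templates this policy matches
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:            # cluster affinity; tolerations, labels and failure domains are also supported
      clusterNames:
        - member1
```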

After the PropagationPolicy is defined, the previously defined K8s resource template to be distributed is automatically matched against it. Once matched, the Deployment is distributed to, say, three clusters A, B, and C, and is thereby bound to those three clusters. This binding relationship is called a ResourceBinding.

The ResourceBinding YAML records the clusters that were selected; in this case it is member1. ResourceBinding currently comes in both namespace and cluster scopes, which correspond to different scenarios. The namespace scope is used when namespaces provide tenant isolation within a cluster. The cluster scope (ClusterResourceBinding) covers the case where an entire member cluster is used by a single tenant.
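
For illustration, a namespace-scoped ResourceBinding generated by Karmada might look roughly like this; the name and API group/version are assumptions, since bindings are created by the control plane rather than written by hand:

```yaml
# Sketch of a ResourceBinding produced for the nginx Deployment (not user-authored).
apiVersion: work.karmada.io/v1alpha1   # group/version differs across Karmada releases
kind: ResourceBinding
metadata:
  name: nginx-deployment              # generated name, shown here for illustration
  namespace: default
spec:
  resource:                           # reference to the matched resource template
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
    namespace: default
  clusters:                           # clusters selected for this resource
    - name: member1
```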

The Work mechanism

After the ResourceBinding is established, how do the Work objects distribute the resources?

When the ResourceBinding is created, say for the three clusters A, B, and C (the 1:m in the figure refers to these clusters), the Binding Controller starts working and generates concrete Work objects from the resource template and the binding relationship. A Work object is the encapsulation of a resource for a specific member cluster, and its status carries feedback about that resource from the member cluster. Looking at the Work YAML, the manifests section shows the resource for the member cluster with the overrides already applied in detail: the complete Deployment YAML to be distributed to the member cluster is already there, so a Work is simply a wrapper around a resource.
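
Assuming a member cluster named member1 and the overridden nginx Deployment from the earlier sketches, a Work object might look roughly as follows (the execution-space namespace and values are illustrative and may differ by version):

```yaml
# Sketch of a Work object wrapping the overridden Deployment for member1.
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: nginx-deployment
  namespace: karmada-es-member1    # per-cluster "execution space" on the control plane (illustrative)
spec:
  workload:
    manifests:                     # the full manifest(s) to be applied in the member cluster
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx
          namespace: default
        spec:
          replicas: 2              # value after the OverridePolicy was applied
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.19
```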

Karmada advantages

After specific use and verification, ICBC found that Karmada has the following advantages:

1) Resource scheduling

  • Customize cross-cluster scheduling policies
  • Transparent to upper-layer applications
  • Support for two types of resource binding scheduling

2) Disaster recovery

  • Dynamic binding adjustment
  • Automatically distribute resource objects by cluster label or failure domain

3) Cluster management

  • Support for cluster registration
  • Lifecycle management
  • A standardized API

4) Resource management

  • Support for K8S native objects
  • Work supports retrieval of resource deployment status from member clusters
  • Resource object distribution supports both pull and push

Karmada adoption at ICBC and future outlook

First, let's look at two of Karmada's capabilities: cluster management and resource distribution.

So far, in ICBC's test environment, Karmada already manages some of our existing clusters. A key point in our future planning is how to integrate it with our overall cloud platform.

Cloud Platform Integration

In this respect we want the previously mentioned aspects of multi-cluster management, cross-cluster scheduling, cross-cluster scaling, failover, and overall view of resources all to sink into a control plane like Karmada.

For the upper-layer cloud management platform, we focus more on its management of user business, including user management, application management, and so on, as well as defining the policies derived from Karmada on the cloud platform. The cloud management platform may still need a direct connection to each K8s cluster, for example to find out which node a pod is running on. The Karmada control plane does not concern itself with this, so pod-location details have to be obtained from the member K8s clusters; this is a problem we need to solve during the integration. Of course, this also matches Karmada's own design philosophy: it does not need to care where a pod sits inside a member cluster.

Future Outlook 1 - Cross-cluster scheduling

For cross-cluster scheduling, Karmada already supports the failure-domain spreading, application preference, and weighting mentioned above. However, we also hope it can schedule according to cluster resource capacity and usage, so that resources do not become unbalanced among member clusters. Although this is not implemented yet, the Cluster CRD already has a status section that collects node readiness and the allocatable CPU and memory remaining on the nodes. Once we have this information, implementing custom scheduling on top of it is a matter of planning.
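
For reference, the relevant part of a Cluster object's status might look roughly like this sketch; the numbers are invented and the exact field names may differ between Karmada releases:

```yaml
# Sketch of Cluster status fields that resource-aware scheduling could consume.
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member2
status:
  kubernetesVersion: v1.20.4
  nodeSummary:
    totalNum: 10          # total nodes in the member cluster (illustrative)
    readyNum: 10          # nodes currently Ready
  resourceSummary:
    allocatable:          # total allocatable capacity reported by the nodes
      cpu: "80"
      memory: 320Gi
    allocated:            # resources already requested by scheduled pods
      cpu: "52"
      memory: 210Gi
```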

Once this scheduling is in place, the effect we hope to achieve in ICBC is shown in the figure below: Deployment A is scheduled to Cluster 1 because of a preference setting; Deployment B is spread across Clusters 1, 2, and 3 for failure-domain spreading; Deployment C is also spread across failure domains, but its extra pods are scheduled to a cluster such as Cluster 2 because of its resource margin.

Future Outlook 2 - Cross-cluster scaling

Cross-cluster scaling is currently on Karmada's roadmap, and there are some issues we still need to address:

1) The relationship between cross-cluster scaling and scaling within a member cluster: today our upper-layer business applications usually configure a single-cluster scaling policy. If scaling policies are configured both across clusters and inside member clusters, what is the relationship between them? Should the upper layer manage scaling as a whole, or should one take priority? This is something we need to consider going forward.

2) The relationship between cross-cluster scaling and cross-cluster scheduling: we generally assume a single overall scheduler, with the multi-cluster scaling component responsible only for the scaling decision itself, for example scaling out when CPU or memory utilization reaches 70% to 80% and deciding by how much, while the actual placement is still handled by the overall scheduler.

3) We need to aggregate the metrics of each cluster, which brings its own performance bottlenecks, and the overall working mode needs to be considered in the future.

Future Outlook 3 - Cross-cluster failover and high availability

1) A strategy for judging member-cluster health: a cluster may merely have lost contact with the management cluster while its business containers remain undamaged.

2) A custom failover policy, similar to a pod's RestartPolicy: Always, Never, OnFailure.

3) The relationship between cross-cluster rescheduling and scaling: we expect multi-cluster scheduling to be handled by a single overall scheduler, while scaling controls only its own scaling policy.

Overall, Karmada's current capabilities and roadmap can foreseeably solve the pain points of ICBC's business scenarios. I am very glad to have had the opportunity to join the Karmada project, and I hope more developers will join us in building the community and this new multi-cloud management project.

Attachment: Karmada community technical exchange address

The address of the project: https://github.com/karmada-io… Slack address: https://karmada-io.slack.com
