By Du Kewei (alias: Su Lin)

Senior Development Engineer at Ant Group

Responsible for the stability of Ant's Kubernetes clusters, focusing on cluster component changes and stability risk prevention

15,738 words; about a 20-minute read

Preface

To support the iterative upgrade of Ant's business, Ant Infrastructure launched the Gzone full cloud migration project this year. Gzone must be co-deployed with Rzone in the same cluster, so the actual number of nodes managed by a single Sigma cluster will exceed 10,000, and the workloads carried by a single Sigma cluster will become more complex.

We therefore initiated a performance optimization project for large-scale Sigma clusters, expecting request latency to stay aligned with community standards and not to degrade as cluster size grows.

etcd, as the data store of Sigma clusters, is the cornerstone of the entire cluster and directly determines its performance ceiling. The recommended storage limit for a single etcd cluster is 8 GB, but the storage of a single etcd cluster in Ant's Sigma clusters has already exceeded this limit, and the Gzone cloud migration project is bound to increase the burden on etcd further.

First, Ant's business mixes stream computing, offline computing, and online services, co-locating a large number of Pods whose life cycles are measured in minutes or even seconds. The number of Pods created per day in a single cluster has grown to hundreds of thousands, all of which need etcd's support.

Second, complex business requirements generate a large number of List (List All, List by Namespace, List by Label), Watch, Create, Update, and Delete requests. As the etcd storage scale grows, the performance of these requests deteriorates severely, and can even lead to etcd OOMs, request timeouts, and other anomalies.

Finally, the growth in request volume also amplifies the P99 request latency spikes caused by etcd compact and defrag operations, and can even cause requests to time out, which makes key cluster components such as the scheduler and CNI service Operators intermittently malfunction and renders the cluster unavailable.

Based on previous experience, horizontally splitting the data of the etcd cluster is an effective optimization. A typical split stores Pods and other important data in a dedicated etcd cluster, reducing the storage and request-handling pressure on a single etcd and lowering request latency. However, Pod resource data in a Kubernetes cluster is special and has strict requirements that other resources do not have, so the split must be handled very carefully, especially for K8s clusters already carrying services at a considerable scale.

This article records some of the practical experience and lessons Ant Group gained in the process of splitting out Pod resource data.

Please give us more advice!

PART. 1 Challenges faced

From previous experience with Pod data splitting, we know that it is a high-risk and complex process due to the particularity of Pod data.

A Pod is a group of containers, the smallest schedulable unit in a Sigma cluster, and the final carrier of business workloads. Pod resources are the core and final deliverable of a Sigma cluster.

The core SLO of a Sigma cluster is likewise measured by Pod creation, deletion, and upgrade, so Pod resource data can be considered the most important resource data in a Sigma cluster. At the same time, a Sigma cluster is an event-driven system designed around the desired final state, so Pod data splitting must consider not only basic data consistency before and after the split, but also the impact on other components during the split.

The core operations in the previous experience were data integrity verification and shutting down key service components. Data integrity verification, as the name implies, ensures that the data is consistent before and after the split, while key service components are shut down to avoid unintended consequences from components that keep running during the split, such as unexpected Pod deletion or corrupted Pod status. But applying the same process to an Ant Sigma cluster runs into problems.

As the core infrastructure of Ant Group, Ant Sigma has grown over more than two years into a cloud base with 80+ clusters and 12,000+ nodes in a single cluster. Millions of Pods run on it inside Ant, with short-lived Pods being created 200,000+ times per day. To meet various business needs, the Sigma team has cooperated with the storage, network, PaaS, and other cloud-native teams, and the number of jointly built third-party components has reached the hundreds. Restarting components for a Pod split would require a great deal of communication with the business side and coordination among many people; a careless operation, or an incomplete inventory that misses a few components, could cause unintended consequences.

Given the current state of Ant's Sigma clusters, the problems with the existing Pod data splitting process can be summarized as follows:

  1. Manually restarting a large number of components takes a long time and is error-prone

Dozens of components need to be restarted, and it takes a lot of communication to confirm with each component's owner and sort out which components must be restarted. Omissions can have unintended consequences, such as leftover resources and dirty data.

  2. The full shutdown lasts too long and breaks the SLO

During data splitting, components are completely shut down and cluster functionality is entirely unavailable, and the split operation itself is extremely time-consuming. Based on previous experience it can last 1 to 2 hours, completely breaking the Sigma cluster's SLO commitment.

  3. The data integrity check is weak

The verification tool is simplistic: it reads keys from one etcd and writes them to another, with no support for resuming an interrupted transfer. In addition, because the keys are rewritten into etcd, the important revision fields of the original keys are destroyed, which affects the resourceVersion of Pod data and may cause unexpected consequences (revisions are explained in detail later). The final check only compares whether the number of keys is consistent; if the data of some key in the middle is corrupted, it will not be detected.

PART. 2 Problem analysis

A good expectation

Being lazy, I did not want to communicate with so many component owners about restarts; restarting a large number of components is also prone to operational omissions that cause unexpected problems. And is there a better way to verify data integrity?

If components did not need to be restarted, the entire procedure would evolve into the process below, which simplifies things considerably while still expecting to guarantee safety.

To achieve this expectation, let's go back to the source and review the entire process.

What does data splitting actually do?

As we know, etcd stores the various resource data of a Kubernetes cluster, such as Pods, Services, ConfigMaps, Deployments, and so on.

By default, kube-apiserver stores all resource data in a single etcd cluster. As the storage size grows, the etcd cluster faces performance bottlenecks. Splitting etcd data by resource type to improve kube-apiserver's access performance is a common optimization in the industry; in essence, it reduces the data scale and the QPS of a single etcd cluster.

Based on the size and requirements of the Ant Sigma cluster, we decided to split into four independent etcd clusters, storing Pods, Leases, Events, and all other resource data respectively. The following briefly describes the first three resource types to be split (Pods, Leases, and Events).

Event resources

K8s Event resources are not the events in a Watch; they generally represent events on an associated object, such as a Pod pulling an image or a container starting. On the business side, CI/CD typically needs to display state timelines and therefore pulls Event resource data frequently.

Event data has a validity period (2 hours by default). Apart from observing the life-cycle changes of resource objects through Events, there are generally no important business dependencies on them, so Event data is treated as discardable and does not require data consistency to be guaranteed.

Because of these data characteristics, the Event split is the easiest: you only need to modify the kube-apiserver startup configuration and restart kube-apiserver; no data migration or clean-up of old data is required. Apart from kube-apiserver, no other component needs to be restarted or reconfigured.

Lease resources

Lease resources are generally used for kubelet heartbeat reporting, and they are also the resource type the community recommends for controller leader election.

Each kubelet uses a Lease object to report its heartbeat, every 10 seconds by default. The more nodes there are, the more update requests etcd has to handle: the number of Lease updates per minute is six times the total number of nodes, which for 10,000 nodes is 60,000 updates per minute, still quite significant. Lease resource updates matter for determining whether a Node is Ready, which is why they are split out separately.

The leader-election logic of controller-type components basically comes from the open-source leader-election package, so components that use Leases for leader election share the same election logic, and the kubelet's heartbeat-reporting code is also under our control. Lease resources do not require strict data consistency; we only need Lease data to be updated within a certain period to keep the components that use Leases working normally.

The default window the controller-manager uses to decide whether a kubelet is Ready is 40s; that is, as long as the corresponding Lease resource is updated within 40s, the node is not judged NotReady. This 40s window can also be lengthened, and as long as the Lease is updated within it, normal functionality is not affected. The Lease duration of leader-electing controller components is generally between 5s and 65s and can be configured by each component.

Therefore, the Lease split is more involved than the Event split, but still fairly simple. During the split, the Lease data needs to be synchronized from the old etcd cluster to the new one, which can generally be done with the etcdctl make-mirror tool. While the sync is running, a component updating a Lease object may have its request land on either the old or the new etcd; updates landing on the old etcd are carried over to the new etcd by make-mirror, and because there are few Lease objects the whole process is short and this is not a problem. In addition, after the migration the Lease data in the old etcd should be deleted to release the space it occupies; the space is small, but there is no need to waste it. As with the Event split, the whole process requires no component restarts or configuration changes other than kube-apiserver.
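For reference, the kind of invocation this implies looks roughly like the line below (the endpoints are placeholders; make-mirror reads from the cluster given by --endpoints and writes to the destination endpoint, and --prefix limits the sync to Lease keys):

etcdctl make-mirror --prefix=/registry/leases --endpoints=https://old-etcd.xxx:2xxx https://new-etcd.leases.xxx:2xxx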

Pod resources

Pod resources are probably the resource data we are most familiar with; all workloads are ultimately carried by Pods, and the management core of a K8s cluster is precisely the scheduling and management of Pod resources. Pod resource data requires strict data consistency: no Watch event generated by any Pod update may be missed, otherwise Pod delivery may be affected. These characteristics of Pod resources are also the reason why the traditional Pod data split requires restarting a large number of related components, which will be explained later.

The community kube-apiserver already provides a flag, --etcd-servers-overrides, for configuring separate etcd storage by resource type.

--etcd-servers-overrides strings Per-resource etcd servers overrides, comma separated. The individual override format: group/resource#servers, where servers are URLs, semicolon separated. Note that this applies only to resources compiled into this server binary.

A brief example of a common configuration for resource splitting is as follows:

Events split configuration

--etcd-servers-overrides=/events#etcd1.events.xxx:2xxx; https://etcd2.even…

Leases split configuration

--etcd-servers-overrides=coordination.k8s.io/leases#etcd1.leases.xxx:2xxx; https://etcd2.leas…

Pods split configuration

--etcd-servers-overrides=/pods#etcd1.pods.xxx.net:2xxx; https://etcd2.pods…

Is a component restart required?

To understand whether restarting components is necessary and what the impact of not restarting them would be, we verified in a test environment and found that after the split, new Pods could not be scheduled, existing Pods could not be deleted, and Finalizers could not be removed. Analysis showed that the related components were not aware of Pod create and delete events.

So why does this problem occur? To answer that, we need to work through it thoroughly, from the core design concepts of K8s down to the specific implementation details.

If K8s were an ordinary business system, and splitting Pod resource data only changed where kube-apiserver stores and reads Pod resources, the impact would stop at the kube-apiserver level and this article would not need to exist.

Ordinary business systems have a unified storage access layer; data migration and split operations only affect the configuration of that layer, and the upper-layer business systems are unaware of them.

But K8s is a different kind of firework!

A K8s cluster is a complex system made up of many extension components that work together to provide a variety of capabilities.

Extension components are designed around the final state. In this final-state orientation there are two main state concepts: Desired State and Current State. Every object in the cluster has a desired state and a current state.

  • The desired state, simply put, is the final state described by the YAML of the object we submit to the cluster;

  • The current state is the actual state of the object in the cluster.

The Create, Update, Patch, and Delete requests we issue are all actions that modify the final state, expressing our expectation of it. After these actions execute, the current state of the cluster differs from our desired state. The cluster's various Operators (Controllers) then Reconcile based on this difference, driving each object from its current state toward its final state.

Current Operator-type components are basically developed with open-source frameworks, so their code logic can be considered consistent. An Operator obtains the final state by sending a List request to kube-apiserver, which returns the desired-state YAML of the objects. To reduce the load on kube-apiserver, however, this List request is executed only once, at component startup (barring unexpected errors); after that, whenever an object's YAML changes, kube-apiserver actively pushes a WatchEvent message to the Operator.
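To make this concrete, here is a minimal sketch of how a typical Operator consumes this List-then-Watch stream through a client-go shared informer; the kubeconfig path is a placeholder and the handler bodies are stubs.

package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a kubeconfig (path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The informer performs one List at startup, then relies on Watch events
	// pushed by kube-apiserver to keep its local cache up to date.
	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* reconcile towards the desired state */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* reconcile on change events */ },
		DeleteFunc: func(obj interface{}) { /* clean up */ },
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	// Block until the initial List has been loaded into the local cache.
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)

	<-stopCh // block forever; a real Operator wires this to signal handling
}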

In this sense, a K8s cluster can be described as an event-driven, final-state-oriented design.

The WatchEvent message stream between an Operator and kube-apiserver must ensure that no event is lost: the YAML returned by the initial List request, combined with the subsequent WatchEvent change events, is the final state the Operator should see and the user expects. The key concept for ensuring that events are not lost is resourceVersion.

Every object in the cluster has this field, even resources defined by users through CRDs.

The resourceVersion mentioned above is tied to the revisions of the underlying etcd store, especially for the List requests that Operators use heavily. Splitting and migrating data to a new etcd cluster therefore directly affects the resourceVersion of resource objects.

So what is an etcd revision, and how does it relate to the resourceVersion of K8s resource objects?

The three revisions of etcd

There are three kinds of revision in etcd: Revision, CreateRevision, and ModRevision. Their relationships and characteristics can be summarized as follows: Revision is the logical clock of etcd's MVCC key-value store and is guaranteed to increase with every write or update; CreateRevision is the global Revision at the moment a key was created; and ModRevision is the global Revision of the key's most recent modification.
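To make the three revisions concrete, here is a minimal sketch using the official etcd clientv3 package (the endpoint and key are placeholders): it reads a key and prints the cluster-wide header revision together with the key's CreateRevision, ModRevision, and Version.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/registry/pods/test-namespace/test-pod") // placeholder key
	if err != nil {
		panic(err)
	}

	// Revision: the cluster-wide MVCC logical clock at the time of this read.
	fmt.Println("header revision:", resp.Header.Revision)
	for _, kv := range resp.Kvs {
		// CreateRevision: the global revision when this key was created.
		// ModRevision:    the global revision of the last write to this key.
		// Version:        how many times this key has been modified since creation.
		fmt.Printf("key=%s createRevision=%d modRevision=%d version=%d\n",
			kv.Key, kv.CreateRevision, kv.ModRevision, kv.Version)
	}
}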

K8s ResourceVersion and Etcd Revision

Every object returned by kube-apiserver has a resourceVersion field, which can be used to detect object changes and for concurrency control.

You can see more from the code comments:

// ObjectMeta is metadata that all persisted resources must have, which includes all objects
// users must create.
type ObjectMeta struct {
    ... // omit code here

    // An opaque value that represents the internal version of this object that can
    // be used by clients to determine when objects have changed. May be used for optimistic
    // concurrency, change detection, and the watch operation on a resource or set of resources.
    // Clients must treat these values as opaque and passed unmodified back to the server.
    // They may only be valid for a particular resource or set of resources.
    //
    // Populated by the system.
    // Read-only.
    // Value must be treated as opaque by clients and .
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
    // +optional
    ResourceVersion string `json:"resourceVersion,omitempty" protobuf:"bytes,6,opt,name=resourceVersion"`

    ... // omit code here
}

The Create, Update, Patch, and Delete verbs among kube-apiserver's requests all update the revision in etcd; more precisely, they cause the revision to grow.

The relationship between the resourceVersion field of a K8s resource object and the various etcd revisions can be summarized as follows: the resourceVersion of a single object corresponds to the ModRevision of its key in etcd; the resourceVersion of a List response corresponds to etcd's header Revision (the MVCC logical clock); and the objects carried in Watch events carry the ModRevision of the corresponding change.

Among all kube-apiserver responses, the List response deserves special attention: its resourceVersion is taken from etcd's Header.Revision, i.e. etcd's MVCC logical clock, so a write to any key in etcd affects the resourceVersion returned in a List response.

For example, if you list the Pods under test-namespace, the resourceVersion in the response will most likely grow from call to call even if no Pod resource under test-namespace has been modified, because other keys in etcd are being written.
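A minimal client-go sketch of this behavior (kubeconfig path and namespace are placeholders): the List-level resourceVersion comes from the response's ListMeta, i.e. from etcd's header revision, not from any individual Pod.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods("test-namespace").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// This value can grow between two identical List calls even if no Pod in
	// test-namespace changed, because any write to the same etcd advances its revision.
	fmt.Println("list resourceVersion:", pods.ResourceVersion)
}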

In our Pod data split, only Pod write operations are forbidden; writes to other data are not. During the rolling update of the kube-apiserver configuration, the revision of the old etcd is therefore inevitably much larger than that of the new etcd storing Pod data, which makes the List resourceVersion seriously inconsistent before and after the split.

The resourceVersion is the key to ensuring that an Operator does not lose events. So the etcd data split affects not only kube-apiserver but also a large number of Operator components, and once change events are lost, Pods may fail to be delivered, data may become inconsistent, and other problems may follow.

So far we have seen that the List resourceVersion visible to Operators is inconsistent across the split, with the value returned from the old etcd larger than the value returned from the new etcd. But why would that make an Operator drop Pod update events?

To answer this question, we need to start from the ListAndWatch mechanism at the heart of K8s component collaboration, which means starting from client-go and kube-apiserver.

ListAndWatch in client-go

As we all know, Operator components become aware of events through the open-source client-go package.

How client-go inside an Operator perceives data-object events (schematic):

The key is the ListAndWatch method, in which the resourceVersion that guarantees the client loses no events is obtained through the List request.

ListAndWatch first lists all objects and obtains the resource version at that moment, then watches from that resource version to learn of subsequent changes. The resource version is initially set to 0, so the List() may be served from a cache and lag behind the contents of etcd, but the Reflector catches up via Watch(), keeping the local cache consistent with the etcd data.

The key code is as follows:

// Run repeatedly uses the reflector's ListAndWatch to fetch all the
// objects and subsequent deltas.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
  klog.V(2).Infof("Starting reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
  wait.BackoffUntil(func() {
    if err := r.ListAndWatch(stopCh); err != nil {
      utilruntime.HandleError(err)
    }
  }, r.backoffManager, true, stopCh)
  klog.V(2).Infof("Stopping reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
}
// ListAndWatch first lists all items and get the resource version at the moment of call,
// and then use the resource version to watch.
// It returns error if ListAndWatch didn't even try to initialize watch.
func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
  var resourceVersion string
  // Explicitly set "0" as resource version - it's fine for the List()
  // to be served from cache and potentially be delayed relative to
  // etcd contents. Reflector framework will catch up via Watch() eventually.
  options := metav1.ListOptions{ResourceVersion: "0"}

  if err := func() error {
    var list runtime.Object
      ... // omit code here
    listMetaInterface, err := meta.ListAccessor(list)
      ... // omit code here
    resourceVersion = listMetaInterface.GetResourceVersion()
        ... // omit code here
    r.setLastSyncResourceVersion(resourceVersion)
    ... // omit code here
    return nil
  }(); err != nil {
    return err
  }
    ... // omit code here
  for {
    ... // omit code here
    options = metav1.ListOptions{
      ResourceVersion: resourceVersion,
      ... // omit code here
    }
    w, err := r.listerWatcher.Watch(options)
        ... // omit code here
    if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {
      ... // omit code here
      return nil
    }
  }
}

The flow chart makes this clearer:

Watch processing in kube-apiserver

For each watch request, kube-apiserver creates a new watcher and starts a dedicated goroutine (watchServer) for that request; resource event messages are pushed to the client from this watchServer.

The important point, however, is that the watchRV parameter in the client's watch request comes from the List response in client-go, and kube-apiserver only pushes event messages whose resourceVersion is larger than watchRV. During the split, the client's watchRV can be much larger than the resourceVersion of the events in kube-apiserver's local cache, and this is the root cause of the client losing Pod update events.

Seen from this angle, restarting the Operator components is indeed necessary: a restart triggers client-go's relist, which obtains the latest Pod List resourceVersion, so Pod update events are no longer lost.

PART. 3 Breaking the deadlock

Solving the restart problem

At this point it seemed that restarting the components was unavoidable, but having analyzed the problem and understood its cause, we found a way around it.

The restart problem mainly involves two parties, client-go and kube-apiserver, so the solution should start from these two and look for a breakthrough.

First, on the client-go side, the key is to make ListAndWatch re-issue its List request so that it obtains the latest resourceVersion from kube-apiserver and does not lose subsequent events. If we could get client-go to refresh its local resourceVersion with a List request at the right moment, the problem would be solved. However, changing client-go code would still require releasing and restarting the components for it to take effect. The question therefore becomes: how do we trigger a new List request without modifying client-go?

Reviewing the logic of ListAndWatch, we can see that whether a new List request is issued hinges on the error returned by the Watch method, and that error is determined by kube-apiserver's response to the watch request.

Handling watch requests differently

kube-apiserver's handling of watch requests was introduced in the previous section. By modifying that handling so that it cooperates with client-go, we can achieve our goal.

As established above, client-go's watchRV is much larger than the resourceVersion in kube-apiserver's local watch cache. Exploiting this, kube-apiserver can return a specific error (TooLargeResourceVersionError) that triggers client-go to relist. kube-apiserver needs to be restarted anyway so that the updated storage configuration takes effect, and the restart also rolls out our modified logic.

The transformation logic is illustrated below:
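In rough Go terms, the idea looks like the sketch below. This is a conceptual illustration, not the actual kube-apiserver source; the function and variable names are made up, and in the real change the error returned must be one that client-go's Reflector treats as a signal to abandon the watch and re-List (a "resource version too large / expired"-style status error).

package watchsketch

import "fmt"

// Conceptual sketch only (names are illustrative): reject a watch whose requested
// resourceVersion is far ahead of what the local watch cache knows about, so that the
// client-go Reflector gives up on the watch and falls back to a fresh List.
func maybeRejectStaleWatch(watchRV, cacheRV uint64) error {
	if watchRV > cacheRV {
		// watchRV was obtained from a List served by the old etcd, whose revision is much
		// larger than the new Pod etcd's revision. In the real modification, the error
		// returned here must be one that client-go recognizes as requiring a relist,
		// otherwise the Reflector will simply retry the watch.
		return fmt.Errorf("requested resourceVersion %d is ahead of cache revision %d", watchRV, cacheRV)
	}
	return nil
}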

Technical guarantees for data consistency

Previous experience used the etcd make-mirror tool for data migration. Its advantage is that it is simple and convenient, an open-source tool that works out of the box. Its disadvantage is that its implementation is simplistic: it reads keys from one etcd and writes them into another, does not support resuming an interrupted transfer, and is unfriendly to migrations involving a large amount of data and a long duration. It also destroys the createRevision of the etcd keys. Strict data integrity checks are therefore required after the migration.

Given these problems, we can change our thinking. What we are essentially doing is data migration; etcd has its own storage structure (KeyValue), and we want to keep the data intact before and after. That brings the etcd snapshot tool to mind. The snapshot tool was originally built for etcd disaster recovery: a new etcd instance can be created from snapshot data, and the data restored from a snapshot keeps the original KeyValue intact in the new etcd, which is exactly what we want.

// etcd KeyValue data structure
type KeyValue struct {
  // key is the key in bytes. An empty key is not allowed.
  Key []byte `protobuf:"bytes,1,opt,name=key,proto3" json:"key,omitempty"`
  // create_revision is the revision of last creation on this key.
  CreateRevision int64 `protobuf:"varint,2,opt,name=create_revision,json=createRevision,proto3" json:"create_revision,omitempty"`
  // mod_revision is the revision of last modification on this key.
  ModRevision int64 `protobuf:"varint,3,opt,name=mod_revision,json=modRevision,proto3" json:"mod_revision,omitempty"`
  // version is the version of the key. A deletion resets
  // the version to zero and any modification of the key
  // increases its version.
  Version int64 `protobuf:"varint,4,opt,name=version,proto3" json:"version,omitempty"`
  // value is the value held by the key, in bytes.
  Value []byte `protobuf:"bytes,5,opt,name=value,proto3" json:"value,omitempty"`
  // lease is the ID of the lease that attached to key.
  // When the attached lease expires, the key will be deleted.
  // If lease is 0, then no lease is attached to the key.
  Lease int64 `protobuf:"varint,6,opt,name=lease,proto3" json:"lease,omitempty"`
}

Clipping the migrated data

The etcd snapshot achieves what we want by keeping KeyValue intact, but an etcd rebuilt from the snapshot contains all of the old etcd's data, which is not what we want. We could of course clean up the redundant data after creating the new etcd, but that is not the best approach.

Instead, we can modify the etcd snapshot tool to clip the data during the snapshot. In etcd's storage model there is a list of buckets; a bucket is etcd's storage concept corresponding to a table in a relational database, and each key corresponds to a row in that table. The most important bucket is the one named key, which stores all the resource objects of K8s. The keys of all K8s resource objects follow a fixed format, with a fixed prefix per resource type and namespace; Pod data, for example, is prefixed with /registry/pods/. During the snapshot we can use this prefix to identify Pod data and trim away everything that is not Pod data.
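The sketch below illustrates the general idea of this clipping step, under the assumption that the snapshot is a bbolt file whose key bucket stores revision-keyed entries with mvccpb.KeyValue values (etcd's backend layout); the file name is a placeholder, and this is an illustration rather than the actual tool we built.

package main

import (
	"bytes"
	"log"

	bolt "go.etcd.io/bbolt"
	"go.etcd.io/etcd/api/v3/mvccpb"
)

func main() {
	// Open the snapshot db file (placeholder name).
	db, err := bolt.Open("snapshot.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	prefix := []byte("/registry/pods/")
	err = db.Update(func(tx *bolt.Tx) error {
		// etcd stores all key-value data in the bucket named "key"; the bucket key is a
		// revision and the value is an encoded mvccpb.KeyValue holding the real K8s key.
		b := tx.Bucket([]byte("key"))
		c := b.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			var kv mvccpb.KeyValue
			if err := kv.Unmarshal(v); err != nil {
				return err
			}
			// Keep only Pod data; everything else is clipped from the snapshot.
			if !bytes.HasPrefix(kv.Key, prefix) {
				if err := c.Delete(); err != nil {
					return err
				}
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}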

In addition, a characteristic of etcd is that the size of its snapshot data equals the size of its on-disk database file. That file has two figures: db total size and db inuse size. Db total size is the size of the storage file etcd occupies on disk, which includes a lot of data that has become garbage keys but has not yet been cleaned up; db inuse size is the total size of the data actually in use. When etcd defrag is not run regularly, the total value is generally much larger than the inuse value.

When clipping the data, even though we removed the non-Pod data, the size of the snapshot did not change at all; we needed to defrag to release the redundant storage space.

The diagram below shows how db total changes. In the end the snapshot size equals the Pod data size, which matters a great deal for saving data transfer time.

A pitfall in forbidding Pod writes

In the split process described earlier, we mentioned that forbidding writes to a resource type in K8s can be implemented with a MutatingWebhook that simply returns a deny result, which is relatively simple. Here is a small pitfall we ran into.

Our initial MutatingWebhookConfiguration was as follows, but we found that after applying this configuration we could still receive Pod update event messages.

# The first version of the configuration, which has a problem
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deny-pods-write
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    url: https://extensions.xxx/always-deny
  failurePolicy: Fail
  name: always-deny.extensions.k8s
  namespaceSelector: {}
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - "*"
    resources:
    - pods
  scope: '*'
  sideEffects: NoneOnDryRun

After investigation, we found that it was the Pod's status field that was being updated. Reading the apiserver code, we found that Pod storage involves more than one resource; it includes the following, and from the perspective of apiserver storage, Pod status and Pod are different resources:

"pods":             podStorage.Pod,
"pods/attach":      podStorage.Attach,
"pods/status":      podStorage.Status,
"pods/log":         podStorage.Log,
"pods/exec":        podStorage.Exec,
"pods/portforward": podStorage.PortForward,
"pods/proxy":       podStorage.Proxy,
"pods/binding":     podStorage.Binding,

After adjustment, the configuration below fully disables Pod data updates. Note the resources field.

A small pitfall, recorded here for reference.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deny-pods-write
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    url: https://extensions.xxx/always-deny
  failurePolicy: Fail
  name: always-deny.extensions.k8s
  namespaceSelector: {}
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - "*"
    resources:
    - pods
    - pods/status
    - pods/binding
  scope: '*'
  sideEffects: NoneOnDryRun

The final split process

With the preceding problems solved, our final split process took shape.

The schematic diagram is as follows:

During the data split, only Pod data writes are forbidden; reads are still allowed, and other resources can be read and written normally. The whole process can be automated by a program.

How long Pod writes stay forbidden depends on the size of the Pod data; the time is mainly spent copying the Pod data, and the whole process can basically be completed within a few minutes.

No component restarts are required, except that kube-apiserver unavoidably has to be restarted to pick up the updated storage configuration. This also saves a great deal of time that would have gone into communicating with component owners and avoids many operational uncertainties.

The whole process can be done by one person.

PART. 4 Final summary

Starting from the goal of data splitting, this article drew on previous experience but, based on our actual situation and requirements, broke out of the old rut: through technical innovation we solved the component restart and data consistency problems, improving efficiency while keeping the change technically safe.

The whole thought process and the key implementation points have been described in detail above.


We did not invent anything new; we simply adapted existing logic and tools to accomplish our goal. Behind that adaptation and improvement, however, lies a need to understand the underlying details, something that cannot be grasped from a few boxes in an architecture diagram.

Knowing not only the how but also the why is a requirement in most of our work; it takes up a lot of our time, but it is well worth it.

To end with an old saying:

The ingenuity of application lies in the mind alone.

"References"

(1) 【etcd storage limit】:

etcd.io/docs/v3.3/d…

(2) 【etcd snapshot】:

etcd.io/docs/v3.3/o…

(3) 【Scaling the peak of scale: Optimization practice of Ant Group's large-scale Sigma Cluster ApiServer】: