Zhou Cheng, Tencent Cloud engineer, mainly responsible for the design, development, operation, and maintenance of the Tencent etcd monitoring platform, with experience in large-scale Kubernetes and etcd cluster operations and development.

Tang Cong, senior engineer at Tencent Cloud, author of the Geek Time column "etcd Practice Class" and an active etcd contributor, mainly responsible for the design and R&D of the public etcd platform serving Tencent Cloud Kubernetes clusters, Serverless products, and internal businesses.

Background

As Kubernetes has become the dominant container orchestrator, more and more businesses use it to deploy and manage services at scale in production. Tencent Cloud TKE builds on native Kubernetes to provide a container-centered, highly scalable, high-performance container management service. Since its launch in 2017, with the popularity of Kubernetes, our cluster scale has grown to the tens of thousands. In this process, our basic components, in particular etcd, have faced the following challenges:

  • How can a monitoring system collect the metrics of etcd and other components across TKE clusters?
  • How can we efficiently govern tens of thousands of clusters and proactively discover faults and hidden risks?
  • How can we quickly detect anomalies, respond rapidly, and even self-heal?

To solve these challenges, we built a visual etcd platform on top of the Kubernetes extension mechanism that integrates etcd cluster management, scheduling, migration, monitoring, backup, and inspection. This article focuses on how our etcd monitoring platform solves the challenges above.

Faced with the problem of large-scale monitoring data collection, our solution evolved from a single Prometheus instance at the beginning of TKE, to dynamically building multiple Prometheus instances and adding monitoring targets with Prometheus-Operator, and finally to a horizontally scalable monitoring system based on TKE's cloud native Prometheus product, which now provides stable etcd storage services and monitoring capability for tens of thousands of Kubernetes clusters. The number of Kubernetes clusters governed by the etcd monitoring platform has grown from single digits to thousands and then tens of thousands. Today the number of Prometheus targets at any given time reaches tens of thousands and the number of metric series in a single region reaches tens of millions, yet the availability of monitoring data stays above 99.99%.

In the face of the uncontrollable human errors and the hardware and software failures that can occur in a complex distributed environment, we built a multi-dimensional, extensible inspection system based on the Kubernetes extension mechanism and our accumulated etcd knowledge and experience, which helps us efficiently govern tens of thousands of clusters and proactively discover hidden risks.

In the face of huge volumes of monitoring data and floods of alarms, we established a standardized data operation system on top of highly available monitoring data, combined with our operating scenarios, which significantly reduces invalid alarms and improves alarm accuracy. We further introduced multi-dimensional SLOs to converge alarm indicators and provide intuitive service-level indicators for business teams. Through this standardized data operation system, alarm classification, alarm follow-up, alarm escalation, and self-healing strategies for simple scenarios, faults can be handled quickly or even self-healed.

Below, we walk you through how we solved these three challenges, and we hope these best practices help you quickly build a scalable business monitoring system.

How to build a highly available and scalable monitoring data collection service?

First of all, how do we collect the etcd metrics of TKE clusters through a monitoring system?

As we all know, etcd is an open source distributed key-value store and the metadata storage of Kubernetes. Any instability in etcd directly makes the upper-layer services unavailable, so monitoring etcd is critical.

In 2017, when TKE was born, there were few clusters, so a single Prometheus instance could handle all monitoring needs.

In 2018, as Kubernetes gained wider recognition and our TKE cluster count increased, we introduced Prometheus-Operator to dynamically manage Prometheus instances, which basically covered the monitoring requirements of thousands of Kubernetes clusters. Here is the architecture diagram.

Prometheus-Operator architecture

We deployed Prometheus-Operator in each region and created Prometheus instances for different business types; every time a Kubernetes/etcd cluster was added, we created a ServiceMonitor resource through the API to tell Prometheus to collect the new cluster's monitoring data.
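For illustration, a minimal ServiceMonitor of the kind created for a newly added cluster might look like the sketch below; the names, namespace, port, and TLS settings are assumptions, not our exact configuration.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gz-qcloud-etcd-03            # illustrative cluster name
  namespace: gz
  labels:
    app: etcd                        # must match the serviceMonitorSelector of the Prometheus CRD
spec:
  endpoints:
  - port: metrics                    # named port on the etcd Service
    interval: 15s
    scheme: https
    tlsConfig:
      insecureSkipVerify: true       # or mount the etcd client certificates instead
  namespaceSelector:
    matchNames:
    - gz
  selector:
    matchLabels:
      clusterName: gz-qcloud-etcd-03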

However, as the number of Kubernetes/etcd clusters grew, this scheme put the stability of etcd monitoring and alarming under pressure: the monitoring link became unstable, monitoring curves showed breakpoints, alarms flooded in, many of them were false, and follow-up became difficult.

Pain points

What are the specific problems?

Let's analyze them from two perspectives: monitoring instability and O&M cost.

Monitoring instability

Unstable monitoring components: the large volume of monitoring data often drives Prometheus instances into OOM, and the frequent configuration changes triggered by etcd lifecycle management cause Prometheus instances to hang.

Coupling between monitoring and services: to avoid OOM caused by the large data volume, Prometheus data had to be sharded manually. This not only increased maintenance cost; the manual sharding was also strongly coupled with the automatic management mechanism, which hindered later operation and feature expansion.

Unstable monitoring link: the monitoring link mainly consists of Prometheus-Operator and Top-Prometheus. Because Top-Prometheus is shared with other services, it receives a large amount of data and frequently restarts due to OOM; it also starts up slowly and takes a long time to recover, which magnifies the impact and often causes long data breakpoints.

Operational costs

Self-maintained monitoring components: monitoring data had to be manually split across monitoring instances, and the monitoring components themselves had to be operated and maintained by us to keep monitoring available.

Alarm rules are hard to maintain: alarm rules rely on regular expressions that match etcd cluster names, which makes them hard to maintain. To cover a new alarm scenario, you first have to understand the existing rule configuration and then add cluster-specific exclusion logic to the existing rules, and such additions often affect existing alarms.

Alarms are hard to follow up: with so many indicators and so many alarms, they cannot accurately reflect service problems. Because the alarm indicators carry no business features, business teams find them hard to understand, alarm information cannot be routed directly back to the business side, and follow-up becomes difficult.

In addition, with the open-source-based Prometheus, adding monitoring targets triggered Prometheus anomalies and service restarts, data breakpoints occurred, and the large volume of monitoring data kept the availability of the monitoring service low.

Problem analysis

As shown in the figure above, the monitoring service consists of the lower-layer Prometheus servers and Top-Prometheus.

Why do changes get stuck?

As shown in the figure above, a Secret resource is generated for each etcd cluster; Prometheus-Operator uses the Secret, the Prometheus CRD, and ServiceMonitors to generate the static_config file that the Prometheus instance ultimately relies on for scraping.

As etcd clusters increase, Secrets increase with them; the Prometheus CRD is then updated frequently, static_config is regenerated frequently, and Prometheus fails to work properly because its scrape configuration keeps changing.

Where does the capacity problem come from?

With the continuous growth of TKE clusters and the launch of productized etcd, the number of etcd clusters keeps increasing. Etcd itself exposes a large number of metrics, and on top of that we introduced inspection strategies to govern clusters efficiently and discover hidden dangers in advance, which adds even more data.

Top-Prometheus collects etcd metrics as well as those of other supporting services, so whenever Top-Prometheus hits OOM, the monitoring service becomes unavailable.

Extensible Prometheus architecture

How to solve the above pain points?

TKE's cloud native Prometheus was built precisely to address the pain points of large-scale data scenarios. To guarantee the stability of the standardized data operation base and provide a highly available monitoring service, we decided to fully migrate the etcd monitoring platform to the TKE cloud native Prometheus monitoring system.

TKE cloud native Prometheus introduces a file-sync service to hot-reload configuration files, which avoids the restarts that configuration changes used to cause and resolves the pain point of our core scenario.

At the same time, TKE cloud native Prometheus uses Kvass to shard monitoring data elastically, effectively distributing the huge data volume and achieving stable collection of tens of millions of series.

Most importantly, the Kvass project is open source; its architecture is shown below. See "How to Monitor a 100,000-container Kubernetes Cluster with Prometheus" and the GitHub source code for more details.

Cloud native scalable Prometheus architecture

The diagram above shows our architecture based on the scalable TKE cloud native Prometheus. Let me give you a brief overview of the components.

Introduction of centralized Thanos

Thanos here consists of two services: Thanos Query and Thanos Rule. Thanos Query queries monitoring data, and Thanos Rule aggregates monitoring data and generates alarms.

Thanos Query: by configuring the store field, Thanos Query can query multiple Prometheus instances and aggregate data from both the original Prometheus and TKE cloud native Prometheus. It also provides a unified data source for upper-layer monitoring and alarming, acting as a converged data query entrance.

Thanos Rule: Thanos Rule aggregates the data queried through Thanos Query and fires alarms based on the configured alarm rules. Converging the alarm capability and centralizing the alarm configuration keep the alarm link stable no matter how the underlying Prometheus services change.
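As a rough sketch of how the two services are wired together, the container arguments might look like the following; the store addresses, rule file path, and Alertmanager URL are illustrative, not our actual deployment.

# Thanos Query: aggregates data from the original and TKE cloud native Prometheus instances
args:
- query
- --http-address=0.0.0.0:9090
- --store=tke-prometheus-gz-0.monitoring:10901     # StoreAPI endpoint of a Prometheus shard
- --store=tke-prometheus-gz-1.monitoring:10901
- --store=legacy-prometheus.monitoring:10901       # original Prometheus kept during migration

# Thanos Rule: evaluates centralized alarm rules against the unified Query layer
args:
- rule
- --query=thanos-query.monitoring:9090
- --rule-file=/etc/thanos/rules/*.yaml
- --alertmanagers.url=http://alertmanager.monitoring:9093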

Smooth migration

TKE cloud native Prometheus is fully compatible with the open source Prometheus-Operator solution, so the original Prometheus-Operator configuration can be retained during migration; you only need to add the corresponding labels for TKE cloud native Prometheus to pick it up. However, moving metric exposure from inside the cluster to the external TKE cloud native Prometheus affects services inside and outside the cluster that rely on those monitoring metrics.

External exposure: with the introduction of centralized Thanos Query, the metrics of all regions are exposed through Thanos Query. Behind this centralized query layer, the lower layer can migrate to TKE cloud native Prometheus or scale out in parallel, and external services that rely on monitoring metrics, such as dashboards and alarms, notice nothing.

Internal dependencies: the custom-metrics service inside the cluster depends on monitoring metrics. After switching to TKE cloud native Prometheus, those metrics can no longer be collected through an in-cluster Service, so we created an intranet LB for the cluster where cloud native Prometheus runs, reachable from the supported environments, and configured custom-metrics to collect monitoring metrics through that intranet LB.
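Assuming the in-cluster custom-metrics component is something like the community prometheus-adapter (the article does not name the exact implementation), repointing it at the intranet LB is roughly a one-line change; the image tag and LB address below are illustrative.

containers:
- name: prometheus-adapter
  image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.9.1
  args:
  - --prometheus-url=http://10.0.8.88:9090     # intranet LB in front of TKE cloud native Prometheus
  - --metrics-relist-interval=1m
  - --config=/etc/adapter/config.yaml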

Results of TKE cloud native Prometheus

Monitoring availability: TKE cloud native Prometheus measures the availability of its monitoring service through the metrics Prometheus itself exposes, such as prometheus_tsdb_head_series and up. prometheus_tsdb_head_series measures the total amount of monitoring data being collected, and the up metric indicates whether a collection task is healthy; together these two metrics give a view of the monitoring service's availability.
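These two metrics can be wired into meta-monitoring rules along the lines of the sketch below; the job regex, rule names, and alert duration are assumptions.

groups:
- name: meta-monitoring
  rules:
  - record: monitoring:tsdb_head_series:sum
    expr: sum(prometheus_tsdb_head_series)          # total number of series being collected
  - alert: EtcdScrapeTargetDown
    expr: up{job=~".*etcd.*"} == 0                  # a collection task is unhealthy
    for: 5m
    labels:
      severity: warning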

Data collection success rate: as a business side, we care more about the collection success rate of our concrete business metrics. To measure availability effectively, we sample business metrics and turn them into data points. Data before and after migration was collected at a 15s interval, and the drop rate was computed against the theoretical data volume to reflect the availability of the monitoring service. Statistics for the past 30 days are shown in the figure below:

After introducing TKE cloud native Prometheus, the total amount of monitoring data reached tens of millions, the monitoring and alarm link stayed stable, and the coverage of inspection data exceeded 70%. Apart from short-lived fluctuations caused by the transformation of the etcd service platform, the success rate of metric pulling stays above 99.99%; it has remained at 100% for the past 7 days, and the monitoring service remains highly available.

How to efficiently govern etcd clusters and find hidden risks in advance?

Then comes the second question: how can we efficiently govern tens of thousands of clusters and proactively discover faults and hidden risks?

In the first problem, we solved metrics collection for large numbers of etcd clusters. Metrics let us find some hidden problems, but they are not enough to meet our demands for efficient governance of etcd clusters.

Why do you say that?

Pain points

When using etcd at scale, we may run into various hidden dangers and problems, such as the following:

  • Data inconsistency in an etcd cluster caused by a process or node restart
  • Etcd performance degradation caused by large key-value writes
  • A business abnormally writing a large number of keys, creating potential stability problems
  • A small number of business keys with excessive write QPS, triggering rate limiting and other errors in the etcd cluster
  • Having to manually check cluster health from multiple dimensions after an etcd restart or upgrade
  • Incorrect operations that may split the etcd cluster
  • …

Therefore, to govern etcd clusters effectively, we turned these potential pitfalls into automated checks, for example:

  • How do we effectively monitor etcd data inconsistency?
  • How do we discover large key-values in time?
  • How do we detect abnormal growth in the number of keys in time?
  • How do we monitor abnormal write QPS in time?
  • How do we automate multi-dimensional cluster health checks so that changes are safer?
  • …

How do we feed these etcd best practices into the governance of large-scale etcd clusters in production?

The answer is inspection.

Based on the Kubernetes extension mechanism and our accumulated etcd knowledge and experience, we built a multi-dimensional, extensible inspection system that helps us efficiently govern tens of thousands of clusters and proactively find hidden risks.

Why did we build the etcd platform on the Kubernetes extension mechanism?

Etcd cloud native platform introduction

To solve the series of pain points in our business, we set the following design goals for our etcd cloud native platform:

  • Observability. Cluster creation and migration should be visualized, the current progress should be viewable at any time, and the process should support pausing, rollback, grayscale, and batching.
  • High development efficiency. Fully reuse the community's existing infrastructure components and platforms, focus on the business, iterate rapidly, and develop efficiently.
  • High availability. No component is a single point of failure and every component can scale out; the migration module preempts tasks through distributed locks and supports concurrent migration.
  • Scalability. Abstract the migration object, migration algorithm, cluster management, scheduling policy, inspection policy, and so on into plug-ins, in order to support multiple Kubernetes cluster types, multiple migration algorithms, multiple cluster types (CVM, container, etc.), multiple migration policies, multiple Kubernetes versions, and multiple inspection policies.

Looking back at our design goals, observability and high development efficiency are a good match for Kubernetes and its declarative programming model, as detailed below.

  • Observability. Real-time migration progress is built on Events, and all kinds of tasks can be viewed, started, and paused through kubectl or the visual container console.
  • High development efficiency. The Kubernetes REST API design is elegant: defining a custom API gives us an automatically generated SDK, which greatly reduces the development workload and lets us focus on domain-specific development. At the same time, the monitoring and backup modules can be built on Kubernetes community components through customized extension to meet our functional needs and address the pain points.

Kubernetes is a highly configurable and extensible distributed system with rich extension patterns and extension points in every module. After choosing Kubernetes as the programming model, we needed to abstract etcd clusters, migration tasks, monitoring tasks, backup tasks, migration policies, and so on into Kubernetes custom resources and implement the corresponding controllers.

Below is an architecture diagram of the ETCD cloud native platform.

The following uses etcd cluster creation and allocation as an example to introduce how the etcd platform works:

  • Creating an etcd cluster through kubectl or the visual web console is essentially submitting an EtcdCluster custom resource (a hedged manifest sketch follows this list)
  • Etcd-apiserver writes the custom resource into a dedicated etcd store, and the etcd-lifecycle operator watches for the new cluster. Depending on the backend provider declared in EtcdCluster, it creates the etcd cluster either on CVM or in containers.
  • Once the cluster is created, the etcd-lifecycle operator also attaches a series of backup policies, monitoring policies, and inspection policies, which are themselves CRD resources.
  • When a business needs an etcd cluster, the scheduling service filters out a set of candidate clusters that satisfy the business conditions. How do we then return the best etcd cluster to the user? Here we support several optimization strategies, such as minimum connection count: it fetches the cluster connection counts from Prometheus through the Kubernetes API and returns the cluster with the fewest connections, i.e. a newly created cluster, to the business for immediate allocation.
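A hypothetical EtcdCluster manifest is sketched below; apart from the resource kind and the backend provider concept described above, the API group and every field name are assumptions.

apiVersion: etcd.cloud.tencent.com/v1beta1   # assumed to match the EtcdMonitor group shown later
kind: EtcdCluster
metadata:
  name: gz-qcloud-etcd-03
  namespace: gz
spec:
  size: 3                    # number of etcd members (illustrative)
  version: 3.4.13
  provider: container        # backend provider: container or cvm
  storage:
    diskSize: 100Gi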

An etcd inspection case

How do we add an inspection rule to the inspection system?

An inspection rule corresponds to a CRD resource. The YAML file below adds a node key-diff (data consistency) inspection policy to the gz-qcloud-etcd-03 cluster.

apiVersion: etcd.cloud.tencent.com/v1beta1
kind: EtcdMonitor
metadata:
  creationTimestamp: "2020-06-15T12:19:30Z"
  generation: 1
  labels:
    clusterName: gz-qcloud-etcd-03
    region: gz
    source: etcd-life-cycle-operator
  name: gz-qcloud-etcd-03-etcd-node-key-diff
  namespace: gz
spec:
  clusterId: gz-qcloud-etcd-03
  metricName: etcd-node-key-diff
  metricProviderName: cruiser
  name: gz-qcloud-etcd-03
  productName: tke
  region: gz
status:
  records:
  - endTime: "2021-02-25T11:22:26Z"
    message: collectEtcdNodeKeyDiff,etcd cluster gz-qcloud-etcd-03,total key num is
      122143,nodeKeyDiff is 0
    startTime: "2021-02-25T12:39:28Z"
  updatedAt: "2021-02-25T12:39:28Z"

After the YAML file is created, the inspection service executes the inspection policy and exposes metrics for Prometheus to collect. The result is as follows:
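For instance, the scraped series for this policy might look roughly like the following Prometheus exposition; the metric and label names are hypothetical and simply mirror the metricName and labels of the EtcdMonitor above.

etcd_node_key_diff{clusterName="gz-qcloud-etcd-03",region="gz",productName="tke"} 0
etcd_total_key_num{clusterName="gz-qcloud-etcd-03",region="gz",productName="tke"} 122143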

How to quickly detect anomalies, respond rapidly, and even self-heal?

Based on the stable TKE cloud native Prometheus monitoring link and the comprehensive inspection capability, the etcd platform can now provide all kinds of monitoring metrics related to etcd cluster availability. However, with so many clusters, so many metrics, so many user scenarios, and such a complex deployment environment, it remains difficult to quickly locate the cause of an anomaly, respond rapidly, and recover immediately.

To improve anomaly detection and achieve rapid handling and self-healing, we mainly face the following problems.

  • How to standardize monitoring and alarms in the face of various ETCD clusters and complex service application scenarios?

The business scenarios of etcd differ from its operating scenarios. Based on operational requirements, we standardized etcd cluster access so that it provides the standardized monitoring metrics that operations need, and then further standardized alarms on top of the standardized services and etcd specifications, so that monitoring and alarming are run in a standardized way.

  • In the face of massive indicators, how do we effectively converge them, quickly measure etcd cluster availability, and detect anomalies?

We therefore introduced SLOs to reflect the availability of the etcd service effectively, and built a multi-dimensional monitoring system around the SLO to achieve rapid anomaly detection and problem location, and thus faster recovery.

The following sections address these problems one by one to build an efficient data operation system and achieve rapid anomaly detection.

Access standardization

Etcd O&M information accessed through CRD: the continuous operation of etcd is configured through CRD, fully following the Kubernetes specification. The basic information of etcd is defined in the Spec, and business information is extended through annotations; a single custom resource contains all the information required to operate the etcd cluster.

Cloud native data collection: open source Prometheus uses static_config to configure collection tasks, while TKE cloud native Prometheus makes full use of the ServiceMonitor resources provided by Prometheus-Operator, so a component's metrics can be picked up automatically by configuring just a few filter labels. As a data store, etcd itself generally runs outside the operation and management cluster, so to collect etcd's own monitoring metrics we use a Kubernetes Service without selectors and directly configure the Endpoints with the addresses of the corresponding etcd nodes.
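A sketch of such a selector-less Service plus manually maintained Endpoints follows; the namespace, port, and node IPs are illustrative.

apiVersion: v1
kind: Service
metadata:
  name: gz-qcloud-etcd-03
  namespace: gz
  labels:
    clusterName: gz-qcloud-etcd-03
spec:
  ports:
  - name: metrics
    port: 2379
    targetPort: 2379
---
apiVersion: v1
kind: Endpoints
metadata:
  name: gz-qcloud-etcd-03            # must match the Service name
  namespace: gz
subsets:
- addresses:
  - ip: 10.0.0.11                    # etcd node IPs (illustrative)
  - ip: 10.0.0.12
  - ip: 10.0.0.13
  ports:
  - name: metrics
    port: 2379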

Standardized specifications: through ServiceMonitor's relabel capability, etcd monitoring metrics carry product, scenario, and spec labels that standardize the operational information. The product label reflects the product category of the etcd service object; the scenario label is derived by classifying etcd's application scenarios; the spec label is divided into small, default, and large based on etcd node specifications and user usage.
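This labeling can be expressed with ServiceMonitor relabelings along the lines below; the label names match the text, but the values and source labels are illustrative.

spec:
  endpoints:
  - port: metrics
    relabelings:
    - sourceLabels: [__meta_kubernetes_service_label_clusterName]
      targetLabel: clusterName
    - targetLabel: product
      replacement: tke               # product category of the etcd service object
    - targetLabel: scenario
      replacement: business          # derived from the etcd application scenario
    - targetLabel: spec
      replacement: default           # small / default / large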

Standardized alarms: with the standardization in place, alarm rules no longer rely on piles of regular expressions. The thresholds of alarm metrics are determined by scenario and spec, and combined with the alarm expression this is enough to configure the alarm rules; thanks to the clean segmentation by scenario and spec, new alarm rules can be added without touching the existing ones. At the same time, the scenario and spec labels understood by our self-developed alarm system identify who should handle an alarm, so alarms can be pushed to the right people and classified, and alarm accuracy improves.
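As a hedged example, a generic rule keyed on the spec label rather than on regex-matched cluster names could look like this; the metric is a standard etcd histogram, but the threshold and label values are assumptions.

groups:
- name: etcd-generic-alerts
  rules:
  - alert: EtcdBackendCommitSlow
    expr: |
      histogram_quantile(0.99,
        sum by (clusterName, scenario, spec, le)
        (rate(etcd_disk_backend_commit_duration_seconds_bucket{spec="large"}[5m]))) > 0.25
    for: 10m
    labels:
      severity: warning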

This standardization process applies not only to cloud native components but also to components that run as binaries on machines: their metrics can likewise be collected through a self-built selector-less Service, and once the operational labels of a component are determined from operating information such as its application scenario, the relabel capability of ServiceMonitor can quickly hook it into the TKE cloud native Prometheus monitoring and alarm link and the standardized data operation system.

The standardized process above supports the production operation of productized etcd. As productized etcd launched, the relabel capability of ServiceMonitor let us onboard it as an operational feature without changing the monitoring layer:

Defined access specifications: business and spec operational labels were introduced. Based on these labels, etcd usage scenarios are reflected in the monitoring metrics, which provides the data basis for standardized dashboards and drives alarm rule configuration and O&M.

Directly reusable general alarm rules: general alarm rules are generated from the business and spec operational labels, the monitoring metrics, and their thresholds, implementing alarms across different dimensions.

Analysis views: based on business scenarios and combined with different monitoring metrics, the standardized monitoring views directly generate business-dimension etcd monitoring dashboards.

Building an SLO-centered data operation system

The introduction of SLO

How to abstract an SLO: an SLO is a service level objective; it is mainly used internally to measure service quality. Before the SLO can be determined, the SLIs (service level indicators) must be determined. A service is user-facing, so an important indicator is the user's perception of it, of which error rate and latency are the most visible; the state of the service itself and of the third-party services it depends on also determines service quality. For the etcd service, the SLI elements can therefore be set as request error rate and latency, whether a Leader exists, and node disk I/O. Node disk I/O reflects the error rate and latency of read operations to some extent, so the SLIs are further layered into etcd availability and read/write availability. Combined with Prometheus's real-time computing capability, a preliminary formula for the etcd SLO can be determined.

SLO calculation: the SLO measures service quality, and service quality is determined by user perception, service status, and the underlying services it depends on. The SLO is therefore composed of the latency of etcd's core RPC interfaces (Range, Txn, and Put), disk I/O, the Leader status, and related inspection metrics.

SLO operation: the analysis of the etcd service gives a preliminary SLO formula and concrete SLO indicators, but they are only a first cut; the SLO must be continuously corrected against real anomalies to improve its accuracy. After a period of observation and correction, the SLO indicators became increasingly accurate and gradually formed the operation loop shown in the figure below: linking the SLO with monitoring, alarms, and production issues improves operational efficiency and strengthens proactive service. After a period of operation, SLO alarms surfaced several exceptions through phone alarms, realizing proactive detection of anomalies.

Landing the SLO on TKE cloud native Prometheus

Introducing recording rules

The availability, latency, and other key metrics needed to build the SLO are already collected by TKE cloud native Prometheus, and the SLO can be computed with Prometheus's computing capability. However, the SLO calculation involves many metrics, etcd's data volume is large, and the computation latency is high, which led to frequent breakpoints.

Recording rules are Prometheus's pre-computation mechanism: an expression is set up in advance and its results are stored as a new set of time series. In this way, the complex SLO formula can be decomposed into smaller units, spreading the computation pressure and avoiding data breakpoints, and because the results are persisted, SLO history can be queried very quickly. In addition, Prometheus reloads recording rules when it receives a SIGHUP signal, so rule reloading is effectively real-time, which makes it easy to keep revising the formula and optimizing the SLO in practice.
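The actual formula is internal, but a sketch of how recording rules can split the SLO into smaller units is shown below; the rule names, weights, and thresholds are assumptions, while the underlying etcd and gRPC metrics are standard ones.

groups:
- name: etcd-slo
  interval: 30s
  rules:
  - record: etcd:grpc_request_error_rate:ratio_rate5m
    expr: |
      sum by (clusterName) (rate(grpc_server_handled_total{grpc_service="etcdserverpb.KV",grpc_code!="OK"}[5m]))
      /
      sum by (clusterName) (rate(grpc_server_handled_total{grpc_service="etcdserverpb.KV"}[5m]))
  - record: etcd:has_leader:min
    expr: min by (clusterName) (etcd_server_has_leader)
  - record: etcd:wal_fsync_p99:5m
    expr: |
      histogram_quantile(0.99,
        sum by (clusterName, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
  - record: etcd:slo_availability:ratio
    expr: |
      (1 - etcd:grpc_request_error_rate:ratio_rate5m)
      * etcd:has_leader:min
      * (etcd:wal_fsync_p99:5m < bool 0.5)      # 500ms fsync threshold is illustrative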

Building a data value operation system

With the SLO in place, monitoring and alarming on the etcd platform are unified behind the SLO entrance. Considering that etcd is used in many different scenarios and that daily troubleshooting and problem analysis are difficult, we built SLO-based fast troubleshooting and three-dimensional SLO monitoring around the SLO monitoring system, as shown in the figure below.

Operational demands

Fundamentals: monitoring should provide an overview of etcd, such as capacity information, component stability, and service availability.

Scenario-specific demands: different application scenarios emphasize different aspects of etcd and different monitoring dimensions, and the monitoring dashboards should reflect the characteristics of each scenario.

Troubleshooting: when resources at the underlying IaaS layer jitter, the affected etcd clusters can be identified quickly; when a fault occurs, the blast radius can be identified quickly, and the cause can be further pinned down in the alarm view.

Three-dimensional monitoring

The monitoring views of the etcd platform are shown in the figure below, divided into level-1, level-2, level-3, and troubleshooting views. Level 1 is the overall monitoring dashboard; level 2 is split by three scenarios; level 3 is single-cluster monitoring, the key to pinning down specific problems; and the troubleshooting view links etcd and Kubernetes for bidirectional lookup.

Level-1 monitoring view: the SLO is computed from multiple monitoring metrics; it effectively measures etcd availability and converges the monitoring metrics into a unified entrance. Based on the SLO, a multi-region monitoring dashboard gives an overview of etcd status and quickly identifies the regions affected by a fault.

Level-2 monitoring view: based on etcd application scenarios, the level-2 view is split by business, big-customer, and other scenarios to reflect their characteristics. It shows the overall availability of each business per region, reflecting whether each business region has enough etcd resources; for big customers it also needs to reflect their capacity scale, and whether the cluster is opened to the customer has to be considered as well.

Level-3 monitoring view: the level-3 view is a single-cluster monitoring view; it helps locate etcd problems and rectify faults.

SLO troubleshooting view: etcd is the underlying storage service of Kubernetes, and troubleshooting often requires cross-checking etcd and Kubernetes in both directions. To improve troubleshooting efficiency, the SLO troubleshooting view consists of forward and reverse lookup views between etcd and Kubernetes clusters.

Operating results

The SLO monitoring system basically covers all operating scenarios and has played a key role in actual operations.

Underlying IaaS jitter: with the level-1 monitoring we can quickly identify the blast radius, and with the scenario views we can further identify the affected etcd clusters.

Fault location: after receiving an SLO alarm, the level-3 monitoring helps determine the cause of the alarm and confirm the affected metrics, leading to rapid recovery. At the same time, the forward and reverse lookup between etcd and Kubernetes not only helps confirm etcd problems but is also a sharp tool for confirming Kubernetes problems.

Proactive service: the SLO monitoring dashboard has repeatedly surfaced etcd anomalies in advance, which were proactively fed back to the upper-layer service teams, effectively strangling service failures in the cradle.

Self-healing: etcd node failures affect etcd availability. Through SLO monitoring and alarms, anomalies are sensed quickly, and thanks to containerized deployment, the nodes of the hosted etcd clusters run as Pods; when a node misbehaves, the abnormal Pod is automatically removed and a new node added, achieving fault self-healing without the user noticing.

Conclusion

This article has shared our best practices around three pain points of operating large-scale Kubernetes and etcd clusters. First, TKE's cloud native Prometheus monitoring system solved the problem of stably collecting metrics from tens of thousands of instances. Second, the extensible inspection system solved the problem of automatically and efficiently governing tens of thousands of clusters. Finally, by introducing SLO indicators, building a set of operational monitoring views, and deploying etcd clusters in containers, we achieved fast fault detection and self-healing.

Thanks to the excellent performance of TKE cloud native Prometheus, etcd monitoring availability has stayed above 99.99%, providing a stable data source for the SLO operation system. Relying on this complete data operation system, the etcd monitoring platform will work toward an AIOps-style intelligent system for etcd, achieving a high degree of self-healing and providing intelligent, reliable O&M solutions for etcd and other components. At the same time, considering the characteristics and requirements of business scenarios, the etcd platform will further improve its extensibility, provide pluggable extension methods, build together with the community, and gradually form a complete etcd monitoring and operation system.