Author | segment super

Source | Erda official account


We are a business software company. From the very beginning, we have kept our software delivery process standard and simple: all software is delivered on top of our enterprise digital platform Erda (now open source), which itself runs on Kubernetes. Erda provides cloud-vendor-independent IT services such as DevOps, microservice governance, multi-cloud management, and fast data management.

As the number of customers we serve grows, we now manage a considerable number of Kubernetes clusters. Cluster stability determines both our service quality and our external reputation. For a long time we were passive on this front: Erda support engineers or customers would report a business problem and ask us to check whether the platform had caused it, after which we would operate on the cluster, resolve the issue, and reply to the customer. This may have looked professional and responsive, but in practice it was unmethodical firefighting and left a mess behind.

Usually we rely on the monitoring system to find problems in advance, but monitoring is a forward link and cannot cover every scenario: inconsistent cluster configuration or abnormal low-level resources can leave some functions unavailable even when all monitoring data looks completely normal. To address this we built an inspection system to diagnose the system's weak points and check its consistency. The fly in the ointment was that this system's extensibility was poor, and its management of clusters and inspection items was relatively crude.

Finally, we decided to build a more cloud-native diagnostic tool. It uses an operator to manage clusters and diagnostic items, abstracting both into resources, in order to solve the diagnosis problem for large-scale Kubernetes clusters: the central cluster dispatches diagnostics to the other clusters and collects their diagnostic results in one place. This lets us obtain, from the center and at any time, the running state of all other clusters, achieving effective management and diagnosis of large-scale Kubernetes clusters.


Project introduction

Project address:

Kubeprober is a diagnostic tool designed for large-scale Kubernetes clusters. It executes diagnostic items in Kubernetes clusters to verify that cluster functions are working properly.

  • Support for large-scale clusters: supports multi-cluster management, configuring the relationship between clusters and diagnostic items on the management side, and viewing the diagnostics of all clusters in one place.
  • Cloud native: the core logic is implemented with an operator, providing full Kubernetes API compatibility.
  • Extensible: supports user-defined diagnostic items.

Its core architecture is as follows:

Unlike a monitoring system, Kubeprober verifies whether cluster functions work from the perspective of inspection. Monitoring, as a forward link, cannot cover every scenario in the system: even when all monitoring data in every environment is normal, the system is not guaranteed to be 100% available. We therefore need a tool that proves the system's availability from the opposite direction, finding an unavailable cluster before users do. For example:

  • Whether all nodes in the cluster are schedulable, whether they carry unexpected taints, etc.
  • Whether Pods can be created and destroyed properly, verifying the whole chain from Kubernetes and kubelet down to Docker.
  • Create a Service and test its connectivity to verify that kube-proxy's links are working.
  • Resolve an internal or external domain name to verify that CoreDNS is working properly.
  • Access an Ingress domain name to verify that the Ingress components in the cluster are working properly.
  • Create and delete a namespace to verify that the associated webhooks are working properly.
  • Perform put/get/delete operations against etcd to verify that etcd is running properly.
  • Run operations with mysql-client to verify that MySQL is running properly.
  • Simulate a user logging in to and operating the business system to verify that the main business flow works.
  • Check whether the certificates in each environment have expired.
  • Check whether cloud resources have expired.
  • More…
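As a concrete illustration, a diagnostic item such as the CoreDNS check above can be as simple as a script that exits non-zero when resolution fails. This is a minimal sketch under our own assumptions, not Kubeprober's actual implementation; the default domain name is only a placeholder:

```shell
#!/bin/sh
# dns-check: resolve a domain name and report success or failure.
# "localhost" is only for illustration; inside a cluster you would
# probe something like kubernetes.default.svc.cluster.local.
DOMAIN="${1:-localhost}"

if getent hosts "$DOMAIN" > /dev/null 2>&1; then
  echo "dns-check ok: $DOMAIN resolved"
else
  echo "dns-check failed: cannot resolve $DOMAIN" >&2
  exit 1
fi
```

Packaged into a container image, a script like this becomes one diagnostic item inside a Probe.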

Component introduction

Overall, Kubeprober uses an operator to implement its core logic. Cluster management relies on remotedialer to maintain a heartbeat link between each managed cluster and the master cluster. The managed cluster grants Probe-Agent the minimum permissions it needs through RBAC; over the heartbeat link it reports the managed cluster's metadata in real time, along with an access token for its APIServer, so that the master cluster can operate on the managed cluster's resources.
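The minimum-permission idea can be pictured with a standard Kubernetes RBAC manifest. The role name, the CRD API group, and the exact rule list below are illustrative assumptions, not Kubeprober's shipped configuration:

```yaml
# Illustrative only: a ClusterRole granting probe-agent read access plus
# the ability to create and clean up short-lived diagnostic workloads.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: probe-agent-minimal            # assumed name
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "namespaces"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["kubeprober.erda.cloud"]   # assumed CRD group
    resources: ["probes", "probestatuses"]
    verbs: ["get", "list", "watch", "create", "update"]
```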


Probe-Master

An operator that runs on the master cluster. This operator maintains two CRDs: Cluster, which manages the managed clusters, and Probe, which manages the built-in and user-written diagnostic items. Probe-Master watches these two CRDs and pushes the latest diagnostic configuration to the managed clusters; it also provides an interface for viewing the diagnostic results of the managed clusters.


Probe-Agent

An operator that runs on each managed cluster. It also maintains two CRDs: Probe, identical to the one on Probe-Master, according to whose definition the probe-agent executes the cluster's diagnostic items; and ProbeStatus, which records the diagnostic result of each Probe. Users can view a cluster's diagnostic results with kubectl get probestatus in the managed cluster.

What is a Probe

The diagnostic plan that runs in Kubeprober is called a Probe, and a Probe is a set of diagnostic items. We suggest grouping the diagnostic items of one scenario into a single Probe. The probe-agent component watches Probe resources, executes the diagnostic items defined in each Probe, and writes the results into ProbeStatus resources.
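A Probe is therefore just another Kubernetes resource. The sketch below only guesses at its shape, loosely modeled on the sample files under config/samples/; the apiVersion, field names, and images are all assumptions, so check the repository samples for the real schema:

```yaml
# Hypothetical Probe resource: one "network" scenario grouping two items.
apiVersion: kubeprober.erda.cloud/v1    # assumed group/version
kind: Probe
metadata:
  name: probe-network                   # assumed name
  namespace: kubeprober
spec:
  # Assumed fields: each diagnostic item runs as a container image.
  policy:
    runInterval: 60                     # seconds between runs (assumed)
  template:
    containers:
      - name: dns-check
        image: example.com/checks/dns:latest       # placeholder image
      - name: ingress-check
        image: example.com/checks/ingress:latest   # placeholder image
```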

We want an output from which the cluster's state can be read at a glance, so we suggest that every Probe belong to one of four layers: application, middleware, Kubernetes, and infrastructure. The status display can then show, from top to bottom, exactly which layer is causing a problem in the system.

At present, there are still few probes, and we are still improving them. We also hope to build them together with you.

Customize the Probe

Comparison with other diagnostic tools

At present, the community already has Kuberhealthy and Kubeeye for diagnosing Kubernetes clusters.

Kuberhealthy provides a clear framework that lets you easily write your own diagnostic items and manage them as CRDs, so you can conveniently run a health check on a single Kubernetes cluster in a Kubernetes-native way.

Kubeeye also targets a single cluster. It detects control-plane and node problems mainly by calling the Kubernetes Event API and Node Problem Detector, and it also supports custom diagnostic items.

Kubeprober likewise diagnoses Kubernetes clusters and provides a framework for writing your own diagnostic items. Beyond that, Kubeprober focuses on diagnosing large-scale Kubernetes clusters: taking a centralized approach, it abstracts clusters and diagnostic items into CRDs, so that the central Kubernetes cluster can manage the diagnostic configuration of the other clusters and collect their diagnostic results. In the future it will also address the operation and maintenance of large Kubernetes clusters.

How to use

Kubeprober mainly solves the problem of diagnosing large-scale Kubernetes clusters. Usually we choose one cluster as the master cluster and deploy Probe-Master in it, and treat the other clusters as managed clusters, deploying Probe-Agent in each.

Install Probe-Master

Probe-Master uses a webhook, which needs a valid certificate to run, so you need to deploy the cert-manager service first:

kubectl apply -f

Then install Probe-master:

APP=probe-master make deploy

Install Probe-Agent

Logically, each Cluster resource corresponds to one probe-agent, so before deploying a probe-agent you need to create a Cluster resource first. Probe-master generates a secretKey for each cluster, which serves as the unique credential for interacting with probe-master.
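For example, a Cluster resource for a managed cluster named moon might look roughly like this. The spec fields are assumptions based on the description above (APIServer address and token), so consult the project's sample manifests for the actual schema:

```yaml
# Hypothetical Cluster resource registering the managed cluster "moon".
apiVersion: kubeprober.erda.cloud/v1     # assumed group/version
kind: Cluster
metadata:
  name: moon
spec:
  k8s:
    apiServer: https://moon.example.com:6443   # placeholder address
    token: <apiserver-token>                   # placeholder credential
```

After applying it with kubectl apply in the master cluster, probe-master generates the cluster's secretKey.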


Before deploying Probe-Agent, you need to modify its ConfigMap:

vim config/manager-probe-agent/manager.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: probeagent
  namespace: system
data:
  probe-conf.yaml: |
    probe_master_addr: http://kubeprober-probe-master.kubeprober.svc.cluster.local:8088
    cluster_name: moon
    secret_key: 2f5079a5-425c-4fb7-8518-562e1685c9b4

Finally, install the Probe-Agent:

APP=probe-agent make deploy

After installing Probe-Agent, we can see the cluster's information in the master cluster.

Configuration of the probe

With the clusters in place, we also need to create Probe resources in the master cluster, for example:

kubectl apply -f config/samples/kubeprobe_v1_cron_probe.yaml

Use kubectl get probe to view the list of probes, and kubectl label to associate a probe with a cluster; that cluster's probe-agent will then execute the relevant diagnostic items.

kubectl label cluster moon probe/probe-link-test1=true
kubectl label cluster moon probe/probe-cron-sample=true

Finally, we can use kubectl get probestatus in the diagnosed cluster to view its diagnostic results.

Welcome to Open Source


Currently, Kubeprober has released its first version, 0.0.1, and many features are still imperfect: Probe-Master's management capabilities can be further expanded and explored, and writing Probes needs to become simpler and easier. We hope to build the project together with the community and create a powerful tool for managing large-scale Kubernetes clusters. Everyone is welcome to follow the project, contribute code, and star it!

  • Erda Github address:
  • Erda Cloud Website:
  • Contributing to KubeProber: