Author: Duan Chao | Source: Erda public account

Background

We are an enterprise software company. From the very beginning we have kept the software delivery process standard and simple: all software is delivered on top of our enterprise digital platform Erda (now open source), which is built on Kubernetes. It provides enterprises with DevOps, microservice governance, multi-cloud management, fast data management, and other IT services that are not tied to any cloud vendor.

As the number of customers we serve has grown, we now manage a considerable number of Kubernetes clusters. The stability of these clusters determines our service quality and external reputation, and for a long time we were very passive about it. Erda support staff or customers would report a problem in some business system and ask us to check whether the platform was the cause; we would then investigate, resolve the problem, and reply to the customer. It looked professional and responsive, but in reality it was unstructured firefighting, and a mess.

Usually we rely on the monitoring system to discover problems in advance, but monitoring data is a positive signal and it is hard for it to cover every scenario. Because of cluster configuration drift or anomalies in lower-level resources, some functions of the system can be unavailable even when all monitoring data looks completely normal. For this, we built an inspection system to diagnose weak points and inconsistencies in the system. However, that system did not extend well, and its management of clusters and inspection items was rather crude.

Finally, we decided to build a more cloud-native diagnostic tool, using an operator to manage clusters and diagnostics. By abstracting clusters and diagnostics as resources, it addresses the diagnosis problem for large-scale Kubernetes clusters: diagnostics are delivered from a center to the other clusters, and the diagnosis results of those clusters are collected centrally. The running status of all clusters can then be obtained from the center at any time, achieving effective management and diagnosis of large-scale Kubernetes clusters.

KubeProber

Project introduction

Project address: github.com/erda-projec…

KubeProber is a diagnostic tool designed for large-scale Kubernetes clusters. It performs diagnostic items inside a Kubernetes cluster to prove that the cluster's functions are working. KubeProber has the following characteristics:

  • Large-scale clusters: Supports multi-cluster management. You can configure the relationship between clusters and diagnostic items on the management side and view the diagnosis status of all clusters in a unified manner.
  • Cloud native: The core logic is implemented as operators, providing full Kubernetes API compatibility.
  • Extensible: Supports user-defined inspection items.

Its core architecture is as follows:

Different from a monitoring system, KubeProber proves that the cluster's functions are normal from the perspective of inspection. Monitoring, as a positive link, cannot cover every scenario in the system: even when the monitoring data of every environment is normal, that does not guarantee the system is 100% available. Therefore, a tool is needed to prove the system's availability in reverse, so that unavailable points in the cluster are discovered at the source, before users hit them. For example:

  • Whether all nodes in the cluster are schedulable, and whether unexpected taints exist.
  • Whether Pods can be created and destroyed normally, verifying the whole link from Kubernetes and kubelet down to Docker.
  • Create a Service and test its connectivity to verify that the kube-proxy link is normal.
  • Resolve an internal or external domain name to verify that CoreDNS is working properly.
  • Access an Ingress domain name to verify that the Ingress components in the cluster are working properly.
  • Create and delete a namespace to verify that the related webhooks are working properly.
  • Perform put, get, and delete operations against etcd to verify that etcd runs properly.
  • Run operations through mysql-client to verify that MySQL is running properly.
  • Simulate a user logging in and operating the business system to verify that the main business flow works.
  • Check whether the certificates in each environment have expired.
  • Check the expiration of cloud resources.
  • More…
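To make one of these checks concrete, a diagnosis item such as the CoreDNS check above can be expressed as an ordinary in-cluster workload. The sketch below is purely illustrative of the idea (a check that exits non-zero on failure), not KubeProber's actual probe format:

```yaml
# Illustrative only: a plain Kubernetes Job performing the CoreDNS
# diagnosis item by resolving an in-cluster service name.
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-resolve-check
spec:
  backoffLimit: 0            # a single failed attempt marks the check as failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: dns-check
          image: busybox:1.35
          # Resolving kubernetes.default exercises the CoreDNS link;
          # a non-zero exit code means the diagnosis item failed.
          command: ["nslookup", "kubernetes.default.svc.cluster.local"]
```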

Components

KubeProber's core logic is implemented as operators. For management across clusters, remotedialer is used to maintain a heartbeat link between each managed cluster and the management cluster. A managed cluster grants probe-agent the minimum required permissions through RBAC, and it reports the managed cluster's metadata and a token for accessing its apiserver in real time over the heartbeat link. In this way, the management cluster can operate on the resources of the managed cluster.
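The "minimum required permissions" idea might look roughly like the grant below; the actual rules ship with the project's deploy manifests, so treat this only as an illustrative sketch of a narrow RBAC scope:

```yaml
# Illustrative sketch: the kind of narrow RBAC grant a managed cluster
# could give probe-agent (the real rules come from the project manifests).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: probe-agent-minimal
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "namespaces"]
    verbs: ["get", "list", "watch", "create", "delete"]
```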

probe-master

The operator running on the management cluster. It maintains two CRDs: Cluster, used to manage the managed clusters, and Probe, used to manage built-in and user-written diagnostic items. By watching these two CRDs, probe-master pushes the latest diagnostic configuration to the managed clusters. probe-master also provides an interface for viewing the diagnosis results of the managed clusters.
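On the management side, a Cluster resource is essentially a record of a managed cluster; the sketch below is illustrative (the API group and field names are assumptions, not the exact CRD schema):

```yaml
# Illustrative Cluster resource on the management cluster.
# API group and fields are assumptions, not the exact schema.
apiVersion: kubeprober.erda.cloud/v1
kind: Cluster
metadata:
  name: moon        # one Cluster resource per managed cluster
spec: {}            # apiserver address and access token are reported
                    # by probe-agent over the heartbeat link
```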

probe-agent

The operator running on a managed cluster. It also maintains two CRDs. One is Probe, exactly the same as on probe-master; probe-agent executes the cluster's diagnostic items according to the Probe definitions. The other is ProbeStatus, which records the diagnosis result of each Probe. Users can view the diagnosis results of the cluster with kubectl get probestatus in the managed cluster.
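A ProbeStatus record in the managed cluster might then look something like the following; field names here are illustrative assumptions, not the exact schema:

```yaml
# Illustrative ProbeStatus: one entry per diagnosis item in the Probe.
# Field names are assumptions for illustration only.
apiVersion: kubeprober.erda.cloud/v1
kind: ProbeStatus
metadata:
  name: probe-link-test1
spec:
  checkers:
    - name: dns-resolve-check
      status: PASS                     # PASS / ERROR, plus a message on failure
      lastRun: "2021-09-01T08:00:00Z"  # when this diagnosis item last executed
```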

What is a Probe

A diagnosis plan run by KubeProber is called a Probe. A Probe is a set of diagnosis items, and we suggest running the diagnosis items of one scenario as a single Probe. probe-agent executes the diagnosis items defined in a Probe and writes the results into the ProbeStatus resource. We want the output to show the state of the cluster clearly, so we suggest that every Probe belong to one of four scenario layers: application, middleware, Kubernetes, and infrastructure. The status display can then show, from top to bottom, which layer is causing a problem in the system.

At present there are still few Probes; we are still improving them, and we hope to build them together with you.

Customize the Probe
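In broad strokes, a custom probe packages your own check logic into a container image and references it from a Probe resource, with the check reporting success or failure through its exit status. The sketch below is an assumption-laden illustration (the API group, field names, and image are all hypothetical), not the project's exact schema:

```yaml
# Rough sketch of a user-defined Probe; the image and field names
# are hypothetical, for illustration only.
apiVersion: kubeprober.erda.cloud/v1
kind: Probe
metadata:
  name: probe-custom-sample
spec:
  template:
    spec:
      containers:
        - name: my-check
          # Your own check image: exits 0 on success, non-zero on failure,
          # and probe-agent records the result as a ProbeStatus.
          image: registry.example.com/my-team/my-check:latest
```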

Comparison with other diagnostic tools

The community already has Kuberhealthy and KubeEye for Kubernetes cluster diagnostics.

Kuberhealthy provides a clear framework that lets you easily write your own diagnostic items. It turns diagnostic items into CRDs, so you can easily run a checkup on a single Kubernetes cluster in a Kubernetes-native way.

KubeEye also targets a single cluster. It mainly calls the Kubernetes event API and node-problem-detector to detect problems in the cluster control plane and on the nodes, and it also supports custom diagnostic items.

KubeProber likewise diagnoses Kubernetes clusters and provides a framework for writing your own diagnostics. Beyond that, KubeProber focuses on diagnosing Kubernetes clusters at scale. With a centralized design that abstracts clusters and diagnostics into CRDs, one central Kubernetes cluster can manage the diagnostic configuration of the other clusters and collect their diagnosis results. Operation and maintenance of large-scale Kubernetes clusters will also be addressed in the future.

How to use

KubeProber mainly solves the diagnosis problem for large-scale Kubernetes clusters. Usually, we choose one cluster as the master cluster and deploy probe-master there, and deploy probe-agent on the other, managed clusters.

Install probe-master

probe-master uses a webhook, and webhooks require certificate verification, so cert-manager needs to be deployed first:

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.3.1/cert-manager.yaml

Then install probe-master:

APP=probe-master make deploy

Install probe-agent

Logically, one Cluster resource corresponds to one probe-agent. Before deploying probe-agent, a Cluster resource needs to be created; probe-master generates a secretKey for each cluster, a unique credential used to interact with probe-master.

kubectl apply -f config/samples/kubeprobe_v1_cluster.yaml
kubectl get cluster -o wide

Before deploying probe-agent, modify the probe-agent ConfigMap:

vim config/manager-probe-agent/manager.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: probeagent
  namespace: system
data:
  probe-conf.yaml: |
    probe_master_addr: http://kubeprober-probe-master.kubeprober.svc.cluster.local:8088
    cluster_name: moon
    secret_key: 2f5079a5-425c-4fb7-8518-562e1685c9b4

Finally, install probe-agent:

APP=probe-agent make deploy

After installing probe-agent, we can see the cluster's information in the master cluster.

Configure the Probe

With the Cluster in place, we also need to create Probe resources in the master cluster, for example:

kubectl apply -f config/samples/kubeprobe_v1_cron_probe.yaml
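The cron sample presumably re-runs its diagnosis items on a schedule; an illustrative shape of such a Probe is sketched below (the API group, field names, and image are assumptions, not the exact CRD schema):

```yaml
# Illustrative cron-style Probe: re-runs its checks on a fixed interval.
# Field names and image are assumptions for illustration only.
apiVersion: kubeprober.erda.cloud/v1
kind: Probe
metadata:
  name: probe-cron-sample
spec:
  policy:
    runInterval: 600   # hypothetical field: re-run the checks every 10 minutes
  template:
    spec:
      containers:
        - name: cron-sample
          image: registry.example.com/kubeprober/cron-sample:latest  # hypothetical image
```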

Use kubectl get probe to view the probe list, and use kubectl label to associate a Probe with a Cluster; the probe-agent in the corresponding cluster will then execute the related diagnostic items.

kubectl label cluster moon probe/probe-link-test1=true
kubectl label cluster moon probe/probe-cron-sample=true

Finally, we can use kubectl get probestatus in the diagnosed cluster to view the diagnosis results for that cluster.

Welcome to open source

At present, KubeProber has released its first version, 0.0.1, and many functions are not yet complete. probe-master's management capabilities can be further expanded and explored, and writing a Probe needs to become simpler and easier. We want to work together with the community to create a great tool for managing massive Kubernetes clusters. You are welcome to follow the project, contribute code, and give it a star!

  • Erda GitHub address: https://github.com/erda-project/erda
  • Erda Cloud official website: https://www.erda.cloud/
  • Contributing to KubeProber: github.com/erda-projec…