WeChat official account: Operation and Maintenance Development Story; author: Jock

Preface

In Kubernetes, networking is provided by third-party network plugins, and these plugins are complex enough that troubleshooting network problems often hits a wall. Is there a way to monitor all the network connections in a cluster?

One such project is Kubenurse, which monitors all network connections in a cluster and provides monitoring metrics for Prometheus to collect.

Kubenurse

Kubenurse is deployed on the cluster nodes as a DaemonSet, and the YAML files are provided in the project's example directory, so deployment is very simple.

After a successful deployment, a check request is sent to /alive every five seconds, which triggers a series of internal checks of the cluster network. To avoid generating excessive traffic, the check results are cached for three seconds. As the project's architecture diagram shows, Kubenurse probes the Ingress, DNS, the API server, and kube-proxy.
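As a quick sanity check you can call these endpoints on one of the kubenurse pods yourself. The snippet below is only a sketch and assumes the DaemonSet from the install section is already running in kube-system with the label app=kubenurse:

# Pick one kubenurse pod and forward its port locally
POD=$(kubectl -n kube-system get pods -l app=kubenurse -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system port-forward "$POD" 8080:8080 &
sleep 2

# /alive triggers the checks (results are cached for a few seconds),
# /metrics exposes the Prometheus metrics described below
curl http://localhost:8080/alive
curl -s http://localhost:8080/metrics | grep kubenurse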

All checks export common metrics that can be used to detect:

  • SDN network latency and errors

  • Network latency and errors between kubelets (node to node)

  • Pod-to-apiserver communication failures

  • Ingress round-trip network latency and errors

  • Service round-trip network latency and errors (kube-proxy)

  • kube-apiserver problems

  • kube-dns (CoreDNS) errors

  • External DNS resolution errors (ingress URL resolution)

These data are surfaced mainly through two metrics:

  • kubenurse_errors_total: error counter, partitioned by error type

  • kubenurse_request_duration: distribution of request durations, partitioned by request type

Both metrics carry a type label that identifies the probe target:

  • api_server_direct: probes the API server directly from the node

  • api_server_dns: probes the API server from the node through its DNS name

  • me_ingress: probes the kubenurse service through the Ingress

  • me_service: probes the kubenurse service through the Kubernetes Service

  • path_$KUBELET_HOSTNAME: probes between nodes (each node checks its neighbours)

The request durations can then be broken down into P50, P90, and P99 quantiles, so that the cluster's network condition can be assessed under different scenarios.
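For example, assuming kubenurse_request_duration is exported as a Prometheus histogram (i.e. with _bucket series) and Prometheus is reachable on localhost:9090, the P99 latency per probe type can be computed with a query like the following (a sketch, not the only way to do it):

# P99 request duration per probe type over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le, type) (rate(kubenurse_request_duration_bucket[5m])))'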

Installation and deployment

We use the official deployment files, with a few changes. (1) First, clone the code locally:

git clone https://github.com/postfinance/kubenurse.git


(2) Go to the example directory and modify ingress.yaml, mainly to set the domain name, as follows:

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
  name: kubenurse
  namespace: kube-system
spec:
  rules:
  - host: kubenurse-test.coolops.cn
    http:
      paths:
      - backend:
          serviceName: kubenurse
          servicePort: 8080



(3) Update the daemonset.yaml configuration, mainly to set the ingress entry URL, as shown below.

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: kubenurse
  name: kubenurse
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubenurse
  template:
    metadata:
      labels:
        app: kubenurse
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
        prometheus.io/scheme: "http"
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: nurse
      containers:
      - name: kubenurse
        env:
        - name: KUBENURSE_INGRESS_URL
          value: http://kubenurse-test.coolops.cn
        - name: KUBENURSE_SERVICE_URL
          value: http://kubenurse.kube-system.svc.cluster.local:8080
        - name: KUBENURSE_NAMESPACE
          value: kube-system
        - name: KUBENURSE_NEIGHBOUR_FILTER
          value: "app=kubenurse"
        image: "postfinance/kubenurse:v1.2.0"
        ports:
        - containerPort: 8080
          protocol: TCP
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Equal
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Equal

(4) Create a ServiceMonitor to scrape the metrics, as follows:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubenurse
  namespace: monitoring
  labels:
    k8s-app: kubenurse
spec:
  jobLabel: k8s-app
  endpoints:
  - port: "8080-8080" 
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      app: kubenurse
  namespaceSelector:
    matchNames:
    - kube-system
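Note that the port field in a ServiceMonitor refers to the name of a port on the kubenurse Service, not the port number, so it has to match what the Service in the example directory defines. A quick way to verify (assuming the Service is deployed in kube-system as above):

# Print the port names exposed by the kubenurse Service; the ServiceMonitor's
# "port" value must match one of them
kubectl -n kube-system get service kubenurse -o jsonpath='{.spec.ports[*].name}{"\n"}'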


(5) To deploy the application, run the following command in the example directory:

kubectl apply -f .


(6) Wait for all the pods to reach the Running state, as shown below.

# kubectl get all -n kube-system -l app=kubenurse
NAME                  READY   STATUS    RESTARTS   AGE
pod/kubenurse-fznsw   1/1     Running   0          17h
pod/kubenurse-n52rq   1/1     Running   0          17h
pod/kubenurse-nwtl4   1/1     Running   0          17h
pod/kubenurse-xp92p   1/1     Running   0          17h
pod/kubenurse-z2ksz   1/1     Running   0          17h

NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubenurse   ClusterIP   10.96.229.244   <none>        8080/TCP   17h

NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/kubenurse   5         5         5       5            5           <none>          17h

(7) Check in Prometheus that the kubenurse targets are being scraped and that the metrics look normal. (8) The monitoring data can then be charted in Grafana.
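The same check can also be done from the command line against the Prometheus HTTP API. This is only a sketch and assumes a kube-prometheus style setup where the Prometheus service is prometheus-k8s in the monitoring namespace:

# Forward Prometheus locally and query the kubenurse error rate per probe type
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
sleep 2
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (type) (rate(kubenurse_errors_total[5m]))'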

Reference documentation

【1】https://github.com/postfinance/kubenurse

Public account: Operation and maintenance development story

GitHub: Github.com/orgs/sunsha…

Love life, love operation

If you found this article helpful, please tap the upper-right corner and choose "Send to friends" or "Share to Moments". Your support and encouragement are my biggest motivation. If you like the content, please follow me.