Author | jock


Monitoring is the most important part of the entire operation and maintenance (O&M) and product life cycle. It aims to detect faults in advance, locate problems with monitoring data, and provide data for problem analysis.

1. The purpose of monitoring

Monitoring runs through the entire application lifecycle: design, development, deployment, and decommissioning. Its main service objects are:

  • technology

  • business

The monitoring system lets you understand the status of the technology environment and helps detect, diagnose, and resolve its faults and problems. The ultimate goal of the monitoring system, however, is the business: to better support business operations and ensure its continuous development.

Therefore, the purpose of monitoring can be summarized as follows: 1. monitor the system in real time, 7×24; 2. provide timely feedback on system status; 3.

2. Monitoring modes

Monitoring from top to bottom can be divided into:

  • Business monitoring

  • Application monitoring

  • Operating system monitoring

Business monitoring mainly provides business indicators and data, used for alerting or for displaying growth rates and error rates; it requires specifications, and sometimes instrumentation (burying points), to be defined in advance. Application monitoring consists mainly of probes and introspection. Probes detect application characteristics from the outside, for example by checking whether a port responds. Introspection looks inside an application: by measuring and reporting its internal state, internal components, transactions, and performance, the application can send events, logs, and metrics directly to the monitoring tools. Operating system monitoring covers the usage, saturation, and errors of major components, such as CPU usage and CPU load.

3. Monitoring methods

The main monitoring methods are as follows:

  • Health check. Health check is used to monitor the health status of applications and check whether services are running properly.

  • Logs. Logs provide rich information for locating and fixing faults.

  • Call chain monitoring. Call chain monitoring shows the full picture of a request, including the service call chain, the time spent in each step, and so on.

  • Metric monitoring. Metrics are discrete data points in a time series; through aggregation and calculation they reflect the trend of important indicators.

Of the preceding four monitoring methods, health checks are provided by infrastructure such as cloud platforms. Logs are generally collected, stored, processed, and queried by an independent log center. Call chain monitoring also has independent solutions for instrumenting, collecting, computing, and querying service calls. Metric monitoring scrapes the metrics exposed by targets, then cleans and aggregates the data; the aggregated data is used for display, alerting, and so on.

Note: This solution focuses mainly on metric monitoring.

4. Monitoring tool selection

4.1. Health check

The cloud platform provides the health check capability, which can be directly configured on the cloud platform.

4.2 Logs

The mature open source logging solution is ELK.

4.3. Call chain monitoring

Call chain monitoring uses third-party APM software; common choices are SkyWalking, Zipkin, Pinpoint, Elastic APM, and CAT. Among them, Zipkin and CAT require some degree of code intrusion, while SkyWalking, Pinpoint, and Elastic APM are based on bytecode injection, require no code intrusion, and need minimal changes.

The Pinpoint agent supports only Java and PHP, while SkyWalking and Elastic APM support multiple languages such as Java, Node.js, and Go.

In cloud native environments, SkyWalking and Elastic APM are better suited.

Elastic APM uses Elasticsearch as storage and lets you view application information directly in Kibana, but its topology diagram feature is paid. SkyWalking is an open-source project that originated in China; it has graduated from the Apache Foundation and has a very active community. SkyWalking iterates rapidly and is designed for microservices, cloud-native architectures, and container-based platforms (Docker, Kubernetes, Mesos).

For a comparison of Pinpoint and SkyWalking, see: skywalking.apache.org/zh/blog/201…

4.4 Metric monitoring

In traditional environments, Zabbix is the obvious choice, but traditional monitoring methods are not well suited to cloud-native environments, where Prometheus has become popular for several reasons:

  • Mature community support. Prometheus is a CNCF graduated project and a leading cloud-native monitoring solution, endorsed by many major vendors and backed by a large community.

  • Easy to deploy and operate. The Prometheus core is a single binary with no third-party dependencies, so deployment and maintenance are very convenient.

  • Pull model. Prometheus pulls monitoring data from each target over HTTP. In the push model, by contrast, an agent collects information and pushes it to a collector; each service's agent must be configured with the monitoring items and the address of the server, which increases operational complexity when there are many services. Moreover, with a push model the monitoring server receives a large volume of requests and data at traffic peaks, putting it under heavy pressure and, in severe cases, making it unavailable. (A minimal scrape-configuration sketch follows this list.)

  • Powerful data model. Monitoring data collected by Prometheus is stored as metrics in a built-in time series database; in addition to the basic metric name, custom labels are supported. Labels define arbitrary dimensions, which makes it easy to aggregate and compute over the monitoring data.

  • Powerful query language, PromQL. PromQL lets you query and aggregate monitoring data for visualization and alerting.

  • Rich ecosystem. Prometheus provides integration solutions (exporters) for common operating systems, databases, middleware, and libraries, as well as client SDKs for Java, Golang, Ruby, Python, and other languages, making it quick to implement custom monitoring logic.

  • High performance. A single Prometheus instance can handle a large number of metrics and hundreds of thousands of data points per second, with excellent acquisition and query performance.
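To make the pull model and label-based data model above concrete, here is a minimal prometheus.yml sketch; the job name, target addresses, and label values are assumptions for illustration only.

# prometheus.yml -- minimal sketch (hostnames and label values are assumptions)
global:
  scrape_interval: 15s      # how often Prometheus pulls metrics from targets
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated

scrape_configs:
  # Prometheus pulls /metrics over HTTP from each target; no agent-side push configuration is needed.
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node01:9100", "node02:9100"]
        labels:
          env: "prod"       # custom labels add dimensions for aggregation in PromQL
          team: "ops"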

Note: because data may occasionally be lost, Prometheus is not suitable for scenarios that require the collected data to be 100% accurate.

5. Prometheus Monitoring System Overview

The overall framework of the monitoring system is as follows:

  • Prometheus Server: captures indicators and stores time series data

  • Exporters / targets: expose metrics to be scraped

  • Pushgateway: short-lived jobs push their metric data to this gateway

  • Alertmanager: Alarm component that handles alarms

  • Ad-hoc query clients (web UI, Grafana, API clients): query the data

The process is simple: the Prometheus server scrapes data from targets directly or via the Pushgateway, stores it in the TSDB, and evaluates rules over it; alerts are sent through Alertmanager, and data is displayed through Grafana.
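As a rough sketch (component addresses and file paths are assumptions), the wiring between these components in prometheus.yml might look like this:

# prometheus.yml fragment -- component wiring (addresses and file names are assumptions)
rule_files:
  - "rules/*.yml"                          # alerting rules evaluated by the Prometheus server

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # firing alerts are sent here

scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true                     # keep the labels pushed by short-lived jobs
    static_configs:
      - targets: ["pushgateway:9091"]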

6. Metric monitoring objects

Monitoring systems generally divide monitoring objects by layers. In our monitoring system, we mainly focus on the following types of monitoring objects:

  • Host monitoring refers to the monitoring data of software and hardware resources on host nodes.

  • Container environment monitoring refers to monitoring data of the environment in which services run.

  • Application service monitoring mainly refers to the basic metrics of the service itself, reflecting its running status.

  • Third-party interface monitoring refers to the external service interfaces that our services invoke.

For application services and third-party interface monitoring, we commonly use the following metrics: response time, QPS, success rate.

6.1 Host monitoring

6.1.1. Why is host monitoring required

Hosts are the carriers of the system: all applications run on hosts. If one or more hosts break down, the applications running on them cannot provide services normally, which may even lead to production incidents. Monitoring hosts and alerting early is therefore essential, so that we can act before a failure and avoid serious incidents.

6.1.2 How to judge the resource situation

Host monitoring assesses the host's status comprehensively from the following three aspects (the USE method: Usage, Saturation, Errors):

  • Usage: The average amount of time a resource is busy working, usually as a percentage over time

  • Saturation: The length of the resource queue

  • Error: Count of resource error events

6.1.3 Which resources need to be monitored

The main resource objects of hosts are:

  • CPU

  • memory

  • disk

  • availability

  • Service status

  • network

6.1.4. How to Monitor

In the Prometheus monitoring solution, host resource metrics are collected by Node-Exporter and stored in Prometheus's time series database, and the specific status of each metric can be queried with PromQL.

1. CPU

CPU monitoring covers usage and saturation. Usage is derived from the node_cpu_seconds_total metric (a counter, so irate is used to compute a per-second rate), and an alert is generated when usage exceeds a threshold, for example 80%, within the specified time window. For example, the following expression selects hosts whose average CPU usage over the last 5 minutes is greater than 60%:

100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60


The CPU metrics also include user-mode and kernel-mode data, which can be monitored as the situation requires. CPU saturation usually refers to CPU load (node_load1, node_load5, node_load15). Normally the total load should not exceed the number of CPUs; with two CPUs, for example, the total load should not exceed 2. Load is collected over 1-minute, 5-minute, and 15-minute windows; when configuring monitoring, the 5-minute load is usually chosen. The expression below flags hosts whose 5-minute load exceeds twice their CPU count:

node_load5 > on (instance) 2 * count by(instance)(node_cpu_seconds_total{mode="idle"})

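To turn expressions like the ones above into actual alerts, they are placed in a Prometheus rule file. The following is a hedged sketch; the alert name, threshold, duration, and labels are assumptions:

# rules/host-cpu.yml -- sketch; alert name, threshold, and labels are assumptions
groups:
  - name: host-cpu
    rules:
      - alert: HostHighCpuUsage
        # average non-idle CPU percentage per instance over the last 5 minutes
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 10m                     # only fire if the condition holds for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 60% on {{ $labels.instance }}"
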
2. Memory

Memory is monitored mainly in terms of usage and saturation. (1) Usage. Memory usage gives an intuitive view of overall memory consumption; it is calculated as 1 - (free + buffers + cached) / total. The main metrics are:

  • node_memory_MemTotal_bytes: total memory on the host

  • node_memory_MemFree_bytes: free memory on the host

  • node_memory_Buffers_bytes: memory in the buffer cache

  • node_memory_Cached_bytes: memory in the page cache

For example, the following expression is used to count memory usage greater than 80% :

100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"})by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"})by(instance)*100 > 80


(2) Saturation. Memory saturation is measured by swap activity between memory and disk. The metrics are:

  • node_vmstat_pswpin: data swapped in from disk to memory per second (KB)

  • node_vmstat_pswpout: data swapped out from memory to disk per second (KB)

3. Disk

Disk monitoring is a bit special: we do not follow the USE method here. Utilization alone is not very meaningful, because 20% remaining of 10 GB and 20% remaining of 1 TB affect us very differently, so instead we monitor the growth trend and direction. For example, based on disk growth over the previous hour, we predict whether the disk will be used up within the next 4 hours.

predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600)

Of course, this prediction alone would generate a lot of noisy alerts, because growth may be very fast during one particular hour, so the prediction says the disk will fill within 4 hours, yet when we log in to the host only 40% is actually used; even if an alert fires we would not act on it. So we add another condition, for example: disk usage is greater than 80% and the disk will fill within the next 4 hours. As follows:

(100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100) > 80) and (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!="",device!="rootfs"}[1h], 4*3600) < 0)
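The combined expression can likewise be wrapped in a rule; adding a for clause suppresses short-lived spikes and further reduces noisy alerts. A sketch, with assumed names and durations:

# rules/host-disk.yml -- sketch; name, severity, and durations are assumptions
groups:
  - name: host-disk
    rules:
      - alert: DiskWillFillIn4Hours
        expr: |
          (100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"}
                  / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100) > 80)
          and
          (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!="",device!="rootfs"}[1h], 4*3600) < 0)
        for: 30m                  # ignore transient growth bursts
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} is over 80% full and predicted to fill within 4 hours"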

In addition, disk I/O needs to be monitored, for both cloud and physical disks. Every disk has its IOPS limit; if a host's I/O is high, other problems may follow, such as a busy or heavily loaded system. The metrics are provided by Node-Exporter; we only need to export them and then display the aggregated data or handle alerts. The aggregation expression is as follows:

100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100)

4. Availability

Availability refers to whether the host is reachable. It can be judged from the up metric: 0 means the host is down, 1 means it is alive. The following expression matches hosts that are unavailable:

up{job="node-exporter"}==0

5. Service status

Service status monitoring covers key services such as docker.service, ssh.service, and kubelet.service. The metric is:

  • node_systemd_unit_state

For example, to check that docker.service is active:

node_systemd_unit_state{name="docker.service",state="active"} == 1


Monitor key services so that we can be notified of service problems as soon as possible.

6. Network

For the network, we mainly monitor each host's inbound and outbound traffic and the status of its TCP connections. Node-Exporter collects each host's network interfaces, their inbound and outbound traffic, and the host's TCP states; based on the aggregated metrics we can build dashboards or handle alerts. For example, to count inbound traffic:

((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100)

Collecting statistics on outbound traffic:

((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100)

And the number of TCP connections in the ESTABLISHED state:

node_netstat_Tcp_CurrEstab


You can monitor and generate alarms based on each indicator.

6.2 Container monitoring

6.2.1. Why is container monitoring needed

In the cloud-native era, containers are the carriers of our applications and part of the application infrastructure, so monitoring them is necessary. When we create a container we usually give it CPU and memory limits. Memory is especially important: if a container reaches its memory limit it will be OOM-killed, at which point we either raise the limit or find and fix the cause.

6.2.2 What are the main monitoring targets

The monitoring objects are as follows:

  • CPU

  • Memory

  • Events

6.2.3 How to Monitor

We use cAdvisor to collect container metrics (the kubelet has cAdvisor built in).
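As one possible way to scrape these cAdvisor metrics when Prometheus runs inside the cluster, the sketch below uses Kubernetes node discovery; the TLS settings and service-account paths are assumptions that depend on the cluster setup:

# prometheus.yml fragment -- scraping cAdvisor metrics through the kubelet (a sketch)
scrape_configs:
  - job_name: "kubernetes-cadvisor"
    scheme: https
    metrics_path: /metrics/cadvisor        # cAdvisor is embedded in the kubelet
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true           # assumption: kubelet certs may not match node names
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node                         # one scrape target per cluster node
    relabel_configs:
      - action: labelmap                   # copy node labels onto the scraped series
        regex: __meta_kubernetes_node_label_(.+)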

1. CPU

For containers, we simply judge CPU status by usage, calculated as usage / limit. As follows:

sum(
    node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
  * on(namespace,pod)
    group_left(workload, workload_type) mixin_pod_workload
) by (workload, workload_type, namespace, pod)
/
sum(
    kube_pod_container_resource_limits_cpu_cores
  * on(namespace,pod)
    group_left(workload, workload_type) mixin_pod_workload
) by (workload, workload_type, namespace, pod) * 100 > 80

If the CPU usage continues to exceed our threshold, consider increasing the CPU Limit.

2. Memory

As with CPU, whether a container's memory is sufficient is judged by its usage. As follows:

sum(
            container_memory_working_set_bytes
          * on(namespace,pod)
            group_left(workload, workload_type) mixin_pod_workload
        ) by (namespace,pod) / sum(
            kube_pod_container_resource_limits_memory_bytes
          * on(namespace,pod)
            group_left(workload, workload_type) mixin_pod_workload
        ) by (namespace,pod) * 100 / 2 > 80


If memory usage stays above the threshold we set, consider whether the Pod's memory limit needs to be increased.

3. Events

Events here refer to Kubernetes Pod events. Kubernetes has two types of events: Warning events, which indicate that the state transition that produced the event was between unexpected states, and Normal events, which indicate that the desired state matches the current state. Take the Pod lifecycle as an example: when a Pod is created it first enters the Pending state while the image is pulled; once the image is pulled and the health check passes, the Pod enters the Running state and a Normal event is generated. If the Pod crashes at runtime because of OOM or another reason and enters the Failed state, which is not expected, Kubernetes generates a Warning event. If we monitor these events, we can promptly catch problems that are easily missed by resource monitoring.

In Kubernetes, kube-eventer is used for event monitoring, and alert notifications are then sent for different kinds of events.

6.3 Application service monitoring

6.3.1 Why is Application service Monitoring required

Applications are the carriers of the business and the most direct part of the user experience; application health directly determines how well the business and the user experience fare. If no monitoring measures are taken, the following problems may occur:

  • Failure to identify or diagnose a fault

  • There is no way to measure application performance

  • There is no way to measure the business metrics and success of an application or component, such as tracking sales data or transaction values

6.3.2 What are the monitoring indicators

  • Application monitoring

  • HTTP interface: URL survival, number of requests, time spent, and number of exceptions

  • JVM: number of GC cycles, GC duration, size of each memory region, number of current threads, number of deadlocked threads

  • Thread pool: number of active threads, task queue size, task execution time, number of rejected tasks

  • Connection pool: total connections and active connections

  • Business indicators: depending on the business, such as PV, order volume, etc

6.3.3 How to Monitor

1. Application monitoring

Application metrics measure the performance and state of an application, including what the application's end users experience, such as latency and response time. Behind the scenes we measure the application's throughput: requests, request volume, transactions, and transaction time. (1) HTTP interface monitoring. Prometheus's blackbox_exporter can be used to monitor interface liveness; it can probe HTTP, HTTPS, TCP, DNS, and ICMP to collect data for monitoring.
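A hedged sketch of wiring blackbox_exporter into Prometheus for HTTP probing; the module name, probe targets, and exporter address are assumptions:

# prometheus.yml fragment -- HTTP probing through blackbox_exporter (targets are illustrative)
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]                      # module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com/health        # endpoints to probe (assumed URL)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target          # pass the URL to the exporter as ?target=
      - source_labels: [__param_target]
        target_label: instance                # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # address of the blackbox_exporter itself (assumed)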

(2) JVM monitoring. JVM monitoring with Prometheus is achieved by instrumenting the application to expose JVM data, having Prometheus collect and store it, displaying it on Grafana dashboards, and creating alerts. The steps are as follows. (1) Add the Maven dependency to the pom.xml file:

<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>0.6.0</version>
</dependency>

(2) Add a method that initializes the JVM exporter wherever initialization code runs:

@PostConstruct
    public void initJvmExporter() {
        io.prometheus.client.hotspot.DefaultExports.initialize();
    }


(3) In the src/main/resources/application.properties file, configure the port and path that Prometheus will use:

management.port: 8081
endpoints.prometheus.path: prometheus-metrics


(4) In src/main/java/com/monitise/prometheus_demo/PrometheusDemoApplication.java, enable the HTTP endpoint:

@SpringBootApplication
// sets up the prometheus endpoint /prometheus-metrics
@EnablePrometheusEndpoint
// exports the data at /metrics at a prometheus endpoint
@EnableSpringBootMetricsCollector
public class PrometheusDemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(PrometheusDemoApplication.class, args);
    }
}

The exposed endpoint and its data can then be collected once the application is deployed. Because there are many applications, this is done through automatic discovery: services are discovered automatically based on the following annotations.

prometheus.io/scrape: 'true'
prometheus.io/path: '/prometheus-metrics'
prometheus.io/port: '8081'

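On the Prometheus side, a service-discovery job can honor these annotations; the following relabeling sketch follows the commonly used example configuration (the job name is an assumption):

# prometheus.yml fragment -- discover annotated services automatically (a sketch)
scrape_configs:
  - job_name: "kubernetes-service-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # keep only services annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # honour a custom metrics path such as /prometheus-metrics
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # rewrite the target address to use the annotated port, e.g. 8081
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
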
2. Monitor business indicators

Business metrics sit one layer above application metrics and often overlap with them. If measuring the number of requests to a particular service counts as an application metric, a business metric typically does something with the content of those requests. For example, an application metric might measure the latency of a payment transaction, while the corresponding business metric might be the value of each payment transaction. Business metrics may include the number of new users or customers, the number of sales, sales by value or location, or anything else that helps measure the health of the business.
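As a purely illustrative sketch, assuming the application exposes a hypothetical counter named orders_created_total, a business-level alert rule could look like this (the metric name, threshold, and durations are all assumptions):

# rules/business.yml -- illustrative only; orders_created_total is a hypothetical metric
groups:
  - name: business
    rules:
      - alert: OrderRateDropped
        # order creation rate over 10 minutes compared with a fixed floor (threshold is an assumption)
        expr: sum(rate(orders_created_total[10m])) < 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Order creation rate has dropped below the expected baseline"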

6.4 Third-party interface monitoring

6.4.1. Why is third-party Interface Monitoring required

The performance of third-party interfaces directly affects our own services, so it is very important to monitor their abnormal states. The main concerns are response time, availability, and success rate.

6.4.2 What are the monitoring indicators

  • Response time

  • Availability

  • Success rate

6.4.3 How to Monitor

Prometheus's blackbox_exporter can be used to monitor these interfaces. Through the third-party-interface dimension we can easily associate our own services with the third-party services they use, and show in a unified view which third-party interfaces each service depends on, along with their response times and success rates. When a service misbehaves, this helps locate the fault. At the same time, some internal services may not yet have comprehensive monitoring and alerting of their own, and this style of monitoring can also help them improve service quality.
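Building on the blackbox_exporter probes, availability and response time can be alerted on with its probe_success and probe_duration_seconds metrics. A sketch, assuming the blackbox-http job name from the earlier example and illustrative thresholds:

# rules/third-party.yml -- sketch; job name and thresholds are assumptions
groups:
  - name: third-party-api
    rules:
      - alert: ThirdPartyApiDown
        expr: probe_success{job="blackbox-http"} == 0          # probe failed
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Third-party endpoint {{ $labels.instance }} is unreachable"
      - alert: ThirdPartyApiSlow
        expr: probe_duration_seconds{job="blackbox-http"} > 2  # response time above 2 seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Third-party endpoint {{ $labels.instance }} is responding slowly"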

7. Alarm notification

At what threshold should an alert fire, and what fault level does it correspond to? Alerts that require no action are not good alerts, so defining sensible thresholds is important; otherwise O&M efficiency drops or the monitoring system loses its value.

Prometheus lets alert conditions be defined with PromQL; it evaluates the expressions periodically and sends alerts to Alertmanager when the conditions are met.

When configuring alert rules, alerts can be grouped so that alerts in the same group are aggregated, which makes them easier to configure and view. When Alertmanager receives alerts it can further group, inhibit, silence, and route them to different receivers. Alertmanager supports multiple notification channels, such as email, WeChat Work (enterprise WeChat), and webhooks. We can define notification channels in order of priority, so that different measures are taken depending on the channel.
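A hedged alertmanager.yml sketch showing grouping, routing by severity, and an inhibition rule; all receiver names, addresses, and timings are assumptions:

# alertmanager.yml -- sketch; receivers, addresses, and timings are assumptions
global:
  smtp_smarthost: "smtp.example.com:587"  # assumed SMTP relay
  smtp_from: "alertmanager@example.com"
route:
  group_by: ["alertname", "instance"]     # aggregate alerts of the same group
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "ops-email"
  routes:
    - match:
        severity: critical
      receiver: "ops-webhook"             # e.g. a WeChat Work / on-call webhook bridge
receivers:
  - name: "ops-email"
    email_configs:
      - to: "ops@example.com"
  - name: "ops-webhook"
    webhook_configs:
      - url: "http://alert-bridge:8080/notify"
inhibit_rules:
  # suppress warning alerts when a critical alert for the same instance is already firing
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["instance"]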

8. Troubleshooting process

After a fault alert is received, an on-call mechanism must be in place so that the fault is handled promptly.

8.1 Classification of fault grades

Before handling a fault, you need to understand it and take appropriate measures, so fault levels must be defined in advance. For example, following the Basic Requirements for Classified Protection of Information System Security, system faults are divided into four levels: level 1 and level 2 faults are major faults, while level 3 and level 4 faults are common faults.

8.1.1 Level 1 Fault

A fault is defined as a Level 1 fault if it is expected to severely affect the company's production service systems, interrupting the related systems for more than 1 hour without recovery within 24 hours, and it has one or more of the following characteristics:

  1. The equipment room network of the company and the VPC network of Aliyun are faulty, causing staff and users to fail to access related service systems.

  2. Key servers such as the WEB website and APP system break down or refuse to provide services due to other reasons;

  3. Information system security incidents caused by modification, counterfeiting, leakage or theft of business data by technical means;

  4. Critical service systems cannot provide services due to viruses.

8.1.2 Level 2 Fault

If an information system fault is expected to or has seriously affected the company’s production service system, resulting in the interruption of the relevant production service system for more than one hour and is expected to be recovered within 24 hours, it is defined as a Level 2 fault.

  1. The company’s computer room network and Ali Cloud VPC are faulty.

  2. Key servers such as the WEB website and APP system break down or refuse to provide services due to other reasons;

  3. Level 3 failure that cannot be resolved within 12 hours.

8.1.3 Level 3 Fault

A level 3 fault is defined when one of the following conditions is met.

  1. After a fault occurs, the operating efficiency of the information system is affected and the speed of the information system slows down, but the access to the service system is not affected.

  2. It is expected to recover within 12 hours after failure occurs.

  3. Level 4 failure that cannot be resolved within 24 hours

8.1.4 Level 4 Fault

A level 4 fault is defined when one of the following conditions is met.

  1. When a fault occurs, it can be handled in an emergency at any time without affecting the overall operation of the system.

  2. Network data is occasionally disrupted by virus attacks, but normal access to and operation of the system are not affected.

8.2. Troubleshooting procedures

8.2.1 Fault discovery

After discovering a fault or receiving a fault report, staff should first record the time the fault occurred and was discovered, the discovering department, the discoverer, and their contact phone number, make a preliminary judgment of the fault level, and report to the relevant personnel for handling.

8.2.2 Troubleshooting

  1. The O&M personnel should be notified of the faulty system. They should first check recent changes to equipment and configuration and determine the scope of the fault's impact, in order to establish the fault level and its likely location.

  2. For common faults, report them according to the specified escalation requirements and keep the responsible leader informed of the troubleshooting progress in a timely manner.

  3. For major faults, report them according to the fault escalation requirements and keep the responsible leader informed of the fault resolution in a timely manner.

8.2.3 Fault reporting

Report the fault promptly according to its level and the time elapsed, and record the reporter, the people notified, the time, and the content. Major faults are reported by the troubleshooting group leader; common faults are reported by the troubleshooting personnel. The escalation reporting time limits are as follows:

| Reporting time limit | Level 1 fault | Level 2 fault | Level 3 fault | Level 4 fault |
| Immediately | Director of operations | Operations staff | Operations staff | Operations staff |
| Half an hour | Technical director | Director of operations | | |
| 1 hour | | Technical director | Director of operations | |
| 4 hours | | | Technical director | |
| 12 hours | | | | Director of operations |
| 24 hours | | | | |

8.3 Troubleshooting flow chart


Public account: Operation and maintenance development story

GitHub: Github.com/orgs/sunsha…

Love life, love O&M