Server monitoring is the most basic requirement of day-to-day operation and maintenance (O&M) work. When operating enterprise infrastructure, administrators must know the running status of every server so that they can discover faults early and limit their impact. This is usually done with monitoring software that collects basic metrics from each server and provides centralized viewing, analysis, and alerting. There are many open source and commercial server monitoring systems on the market, such as the long-established Zabbix, Nagios, NewRelic, and CollectD, and the more recently popular Telegraf and Prometheus. Each system has its own strengths, such as Zabbix's mature ecosystem, NewRelic's hosted service, and Prometheus's cloud-native friendliness. Compared with middleware and business monitoring, server monitoring is more basic, and what matters most is ease of use, stability, real-time behavior, the richness of alarms, and the convenience of reports.

This article describes how to use Alibaba Cloud SLS to quickly build a complete real-time monitoring solution for basic server/host metrics.

1. Introduction to SLS time series storage

The SLS log storage engine was released in 2016. It now carries the log data of Alibaba and many other enterprises, with tens of PB of log data written every day. A large part of that data is time series data or is used to compute time series metrics. To let users complete data access, cleaning, processing, extraction, storage, visualization, monitoring, and problem analysis in one place across the whole DevOps life cycle, we launched the time series storage feature. Together with log storage, it covers the storage needs of all kinds of machine data.

SLS time series storage was designed from the start to meet the time series storage needs of Alibaba and many leading enterprise customers. It builds on Alibaba's years of technical accumulation, so it can handle most enterprise-level time series monitoring and analysis requirements. SLS time series storage has the following characteristics:

1. Rich upstream and downstream integration: for data access, SLS supports many collection methods, including various open source agents and the internal monitoring data channels of Alibaba Cloud. The stored time series data can also be connected to various stream computing and offline computing engines, so the data is completely open.
2. High performance: the architecture of SLS separates storage from computing and makes full use of cluster capacity, so end-to-end speed improves significantly, especially with large data volumes.
3. O&M free: SLS time series storage is delivered entirely as a service, so users do not need to operate and maintain instances themselves. All data is stored in three replicas for high reliability, so there is no need to worry about data reliability.
4. Open source friendly: SLS time series storage natively supports Prometheus writes and queries, supports SQL92 analysis, and connects natively to Grafana and other visualization tools.
5. Intelligence: SLS provides a variety of AIOps algorithms for time series, such as multi-cycle estimation, prediction, anomaly detection, and time series classification, which can be used to quickly build an intelligent alarm and diagnosis platform suited to the company's business.

2. Overview of the server monitoring solution

The SLS host monitoring solution is very simple: install Logtail on each host to collect its basic metrics. The server side runs in the cloud and requires no O&M. SLS provides a visual dashboard by default, and more professional visualization is available through Grafana.

Logtail collects the basic metrics commonly used for hosts, such as CPU, memory, network, and disk, and the key metrics are visualized for direct viewing.

3. Data access

The data access process is very simple. You only need to perform operations in the SLS console (for servers outside Alibaba Cloud, you need to run two additional commands on each server). For details about how to access data, see Collecting Host Monitoring Data.

Add the following Logtail collection configuration for each host. Logtail collection configurations are managed centrally in the cloud, so there is no need to log in to each server and configure it manually.

{
  "inputs": [
    {
      "detail": {
        "IntervalMs": 30000
      },
      "type": "metric_system_v2"
    }
  ]
}
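When the same configuration has to be produced for several environments with different collection intervals, it can help to generate the JSON rather than edit it by hand. The following is a minimal sketch, assuming only the fields shown above (`inputs`, `detail.IntervalMs`, and the `metric_system_v2` type); the function name `build_host_metrics_config` is hypothetical and not part of Logtail or SLS.

```python
import json


def build_host_metrics_config(interval_ms: int = 30000) -> str:
    """Return the host metrics collection configuration as JSON text.

    Only the fields that appear in the article's configuration are emitted.
    interval_ms is the collection interval in milliseconds.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    config = {
        "inputs": [
            {
                "detail": {"IntervalMs": interval_ms},
                "type": "metric_system_v2",
            }
        ]
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Print the default configuration (30-second interval).
    print(build_host_metrics_config())
```

The generated text can then be pasted into the collection configuration in the console.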

4. Visualization

Grafana is currently the most widely accepted visualization solution in the O&M field. SLS provides two dashboards specifically for host monitoring: a cluster-level monitoring dashboard and a detailed metrics dashboard for a single machine. Both dashboards can be imported into Grafana with one click.

The configuration process in Grafana is as follows:

1. In Grafana, add the SLS time series store as a Prometheus data source. For the setup method, see Visual configuration of Grafana.
2. Import the SLS templates from the Grafana template market: Hosts monitor cluster indicators and Hosts monitor single-node indicators.
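Step 1 can also be scripted. The sketch below only builds the JSON payload that Grafana's data source HTTP API (`POST /api/datasources`) accepts for a Prometheus-type data source. It rests on assumptions that the article does not state: that the SLS endpoint is reached with basic authentication using an AccessKey pair, and that you already know the Prometheus-compatible URL of your store. The URL in the usage example is a placeholder, and the function name is hypothetical.

```python
import json


def build_prometheus_datasource(name: str, url: str,
                                access_key_id: str,
                                access_key_secret: str) -> dict:
    """Build a Grafana data source payload of type "prometheus".

    Assumption: the endpoint uses HTTP basic auth with the AccessKey ID as
    the user name and the AccessKey secret as the password.
    """
    return {
        "name": name,
        "type": "prometheus",
        "access": "proxy",  # Grafana's backend performs the queries
        "url": url,
        "basicAuth": True,
        "basicAuthUser": access_key_id,
        "secureJsonData": {"basicAuthPassword": access_key_secret},
    }


if __name__ == "__main__":
    payload = build_prometheus_datasource(
        name="sls-host-metrics",
        url="https://example.invalid/prometheus",  # placeholder, not a real SLS URL
        access_key_id="<access-key-id>",
        access_key_secret="<access-key-secret>",
    )
    print(json.dumps(payload, indent=2))
```

The payload would be sent to Grafana with an API token; the linked configuration guide remains the reference for the actual URL format.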

5. Monitoring data analysis and alarm configuration

For a qualified O&M engineer, configuring a good-looking monitoring dashboard is not enough. You also need to set up enough alarm rules for the cluster and be able to locate problems quickly using the query syntax for monitoring data. Both are essentially calculations and statistics over cluster metrics. SLS time series data supports multiple query methods: SQL, PromQL, and SQL+PromQL. PromQL is relatively concise, while SQL can express more powerful semantics. Because host monitoring data is relatively simple, PromQL or SQL+PromQL is recommended.

The following statistical methods are commonly used in alarms and analysis:

1. Calculate the average value of a metric across all machines, for example the average CPU usage.
2. Find the N machines with the highest value of a metric, for example the 5 machines with the highest memory usage.
3. Find the machines on which a metric exceeds X, for example the machines whose network traffic exceeds 10 MB per minute.
4. Calculate the change of a metric on a machine compared with an earlier point in time, for example the change in disk usage on a machine compared with one day ago.
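To make the meaning of these four statistics concrete, here is a small pure-Python sketch that computes them on in-memory sample data. The host names and values are invented for illustration, and the helper names are hypothetical; in practice the query engine performs these calculations on the stored metrics.

```python
from statistics import mean

# Hypothetical sample data: one value per host.
cpu_util = {"host-a": 20.0, "host-b": 60.0, "host-c": 40.0}
mem_util = {"host-a": 71.5, "host-b": 35.0, "host-c": 88.2}
# Bytes transferred (in + out) during the last minute.
net_bytes_last_minute = {
    "host-a": 2 * 1024 * 1024,
    "host-b": 15 * 1024 * 1024,
    "host-c": 512,
}
disk_util_now = {"host-a": 55.0}
disk_util_day_ago = {"host-a": 52.5}


def average(metric):
    """1. Average value of a metric across all hosts."""
    return mean(metric.values())


def top_n(metric, n):
    """2. The n hosts with the highest value, highest first."""
    return sorted(metric.items(), key=lambda kv: kv[1], reverse=True)[:n]


def above(metric, threshold):
    """3. Hosts whose value exceeds the threshold."""
    return {host: value for host, value in metric.items() if value > threshold}


def change(now, before, host):
    """4. Difference between the current and the earlier value for one host."""
    return now[host] - before[host]


if __name__ == "__main__":
    print(average(cpu_util))
    print(top_n(mem_util, 2))
    print(above(net_bytes_last_minute, 10 * 1024 * 1024))
    print(change(disk_util_now, disk_util_day_ago, "host-a"))
```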

All four are easy to implement with PromQL and can be debugged directly on Grafana's Explore page:

1. Average CPU usage: avg(cpu_util)
2. The 5 machines with the highest memory usage: topk(5, mem_util)
3. Machines whose network traffic exceeds 10 MB per minute: (sum_over_time(net_in[1m]) + sum_over_time(net_out[1m])) > (10 * 1024 * 1024)
4. Change in disk usage of a machine compared with one day ago: disk_util{hostname="iZ2ze06ibdlxtgebgtu4xdZ"} - disk_util{hostname="iZ2ze06ibdlxtgebgtu4xdZ"} offset 1d

Alarms can also be configured in Grafana or on the dashboard used to monitor the cluster. For example, an alarm on the average CPU usage of a cluster can use the following rule: every minute, calculate the average CPU usage of the cluster over the last five minutes.
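The evaluation logic of such a rule can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea, not how Grafana or SLS implements alarms. The 80% threshold is an assumption (the article does not give one), and the class name `ClusterCpuAlarm` is hypothetical.

```python
from collections import deque
from statistics import mean


class ClusterCpuAlarm:
    """Evaluate once per minute: average cluster CPU over the last N minutes."""

    def __init__(self, threshold=80.0, window_minutes=5):
        self.threshold = threshold  # assumed value, not from the article
        # Keeps only the most recent window_minutes cluster averages.
        self.window = deque(maxlen=window_minutes)

    def evaluate(self, cpu_by_host):
        """Record this minute's cluster average and report whether to fire.

        Until the window is full, the average covers the minutes seen so far.
        """
        self.window.append(mean(cpu_by_host.values()))
        return mean(self.window) > self.threshold


if __name__ == "__main__":
    alarm = ClusterCpuAlarm()
    for minute, cpu in enumerate([50.0, 90.0, 100.0, 100.0], start=1):
        fired = alarm.evaluate({"host-a": cpu, "host-b": cpu})
        print(minute, fired)
```

Averaging over a window rather than alerting on a single sample keeps short CPU spikes from triggering the alarm.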

6. Summary

Monitoring basic server metrics is one of the most fundamental tasks in the O&M monitoring domain. Building all-round monitoring for enterprise IT still requires a lot of other work, such as cloud monitoring, middleware monitoring, application monitoring, and business monitoring, all of which can be implemented easily using the log and time series storage features of SLS. These related implementations will be presented in future articles.
