Build a server performance monitoring system in 15 minutes with no barrier to entry

Every Internet company attaches great importance to server monitoring and wants to do it as thoroughly as possible. However, the chain from data collection, data processing, and data visualization through to real-time monitoring and alarms is complex and can consume a lot of manpower and time, and that complexity sometimes prevents the monitoring from achieving the expected effect. Only when an incident happens do we discover, with much regret, that an imperfect monitoring system has caused a lot of unnecessary losses.

To solve this problem for enterprises, Qiniu Cloud has launched a solution for quickly building server performance monitoring and alarms. Qiniu Cloud's open source log/information collection tool LogKit, combined with Pandora's big data workflow engine and time series database service, makes it easy to monitor massive volumes of performance metric data from a large number of servers in an all-round way. The entire deployment and setup process takes only 15 minutes.

Monitoring content

LogKit's collection of machine performance metrics currently covers ten modules and hundreds of metrics:

system module: Monitors load1, load5, load15, the number of users, the number of CPU cores, the system boot time, and so on.
processes module: Monitors the number of processes in each state, such as running, suspended, interruptible, idle, and hung.
netstat: Monitors the number of network connections in each state, such as SYN sent and SYN recv.
net: Monitors the status of network devices, such as the number of packets sent and received and the number of bytes sent and received.
mem: Monitors memory status in real time.
swap: Monitors the status of the swap partition, such as swap-in, swap-out, usage, and free size.
cpu: Monitors CPU status in real time, including CPU usage, the proportion of interrupt time, and so on.
kernel: Monitors the number of kernel interrupts, context switches, and process forks.
disk: Monitors disk usage, including disk space usage, inode usage, and so on.
diskio: Monitors disk read and write status, such as the number of reads and writes, the total available, and so on.

For details, see Introduction and configuration of the LogKit System Information Collection module
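
As a rough illustration of the kind of values the system module reports, the following Python sketch samples the load averages and the CPU core count on a Unix host. It is not LogKit's implementation; the function name and the field names in the returned dict are assumptions chosen for this example.

```python
import os
import time


def sample_system_metrics():
    """Return a small dict of system-level metrics (Unix only).

    The key names are illustrative and are not LogKit's field names.
    """
    # os.getloadavg() returns the 1, 5, and 15 minute load averages.
    load1, load5, load15 = os.getloadavg()
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "n_cpus": os.cpu_count(),
        "timestamp": int(time.time()),
    }
```

Calling sample_system_metrics() at a fixed interval gives a simple series of points that could be plotted in the same way as the dashboards described below.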

Monitoring dashboard preview

After the deployment is complete, you can directly load the monitoring template we have built for you and see the following dashboards.

1. Template variables

Template variables filter the displayed data. For example, the hostname template variable lets you view the metrics of one specific server, which makes it easy to manage dozens or hundreds of machines. Similarly, there are template variables for specific resources on a machine, such as disks, CPUs, network cards, and so on.

2. Overview

The global overview displays basic server information, such as system load, memory usage, disk usage, and network bandwidth. It gives the most intuitive view of the running status of the whole system, so that a shortage of basic resources can be discovered and handled in time.

3. CPU Usage

CPU Usage, as the name implies, shows how the CPU is being used. The figure shows the usage of CPU resources by user processes and by the system. If CPU usage is high and overall operation is stable, your services are healthy. If the CPU usage curve fluctuates widely, the service may need optimization, or you can add alarms so that resources can be added during peak times.
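
To make the notion of CPU usage concrete, here is a minimal Python sketch that derives a usage percentage from two samples of the aggregate cpu line in Linux /proc/stat. It is one common way to compute the figure, not necessarily how LogKit does it; the function names are assumptions for this example.

```python
# Column names of the aggregate "cpu" line in Linux /proc/stat.
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse a line such as 'cpu 100 0 50 800 50 0 0 0 0 0' into a dict."""
    values = line.split()[1:]
    return {name: int(value) for name, value in zip(CPU_FIELDS, values)}


def cpu_usage_percent(prev, curr):
    """Return the busy percentage between two cumulative samples."""
    total_delta = sum(curr.values()) - sum(prev.values())
    if total_delta <= 0:
        return 0.0
    # Time spent idle or waiting for I/O does not count as busy.
    idle_delta = (curr["idle"] + curr["iowait"]) - (prev["idle"] + prev["iowait"])
    return 100.0 * (total_delta - idle_delta) / total_delta
```

Because the counters in /proc/stat are cumulative, the usage for an interval always comes from the difference between two samples rather than from a single reading.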

4. Load values and processes of the system

In this figure, you can view the system load over different statistical periods and set alarm thresholds based on the load. You can also see the corresponding process counts, such as the number of running processes, the number of sleeping processes, and notable abnormal processes such as zombie and blocked processes. If zombie processes exist, services are abnormal and need to be handled in a timely manner.
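
A load-based alarm threshold is usually easier to reason about when the load is normalized by the number of CPU cores. The Python sketch below shows one such rule; the function name and the 0.7 and 1.0 per-core thresholds are assumptions for illustration and should be tuned for your environment.

```python
def load_alert_level(load1, n_cpus, warn_per_core=0.7, crit_per_core=1.0):
    """Classify the 1-minute load as 'ok', 'warning', or 'critical'.

    The thresholds are hypothetical defaults, not values from LogKit or Grafana.
    """
    per_core = load1 / max(n_cpus, 1)
    if per_core >= crit_per_core:
        return "critical"
    if per_core >= warn_per_core:
        return "warning"
    return "ok"
```

For example, a load1 of 3.0 on a 4-core machine is 0.75 per core, which this rule reports as a warning.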

5. Memory usage

In this figure, you can view the total, used, and free memory. You can also set alarms based on this information to discover system performance bottlenecks in time. In addition, when the free memory of the system is low or close to zero but the cache holds a lot of memory, the service depends heavily on the memory cache. Although the service can still run normally, it is very likely that the program cannot make maximum use of the memory cache, resulting in performance problems.
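
The distinction between free memory and cache can be sketched in Python by parsing text in the format of Linux /proc/meminfo. This is an illustration only; the function names and the keys of the summary dict are assumptions, and "used" is defined here as total minus free minus buffers and cache.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Split total memory into free, cache, and used (all in kB)."""
    total = info["MemTotal"]
    free = info["MemFree"]
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    used = total - free - cache
    return {
        "total_kb": total,
        "free_kb": free,
        "cache_kb": cache,
        "used_kb": used,
        "used_percent": 100.0 * used / total,
    }
```

A host with little free memory but a large cache_kb value matches the situation described above: the memory is not idle, but much of it is held by the cache rather than by the processes themselves.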

6. Kernel information

The basic kernel information includes kernel context switches, the number of forks, the number of open and maximum file handles, and so on. A server usually has a limit on the number of open handles, and the service will run into problems once that limit is exceeded. Highly concurrent access to the server is likely to result in too many open handles. Monitoring the number of open handles in real time helps you see the health of the service.
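
On Linux, the system-wide handle counts can be read from /proc/sys/fs/file-nr, which contains three numbers: allocated handles, allocated but unused handles, and the maximum. The Python sketch below turns that text into a usage percentage; the function name and returned keys are assumptions for this example.

```python
def handle_usage(file_nr_text):
    """Compute open-handle usage from /proc/sys/fs/file-nr content."""
    allocated, unused, maximum = (int(x) for x in file_nr_text.split())
    in_use = allocated - unused
    return {
        "open": in_use,
        "max": maximum,
        "percent": 100.0 * in_use / maximum,
    }
```

An alarm on the percent value gives early warning before the limit is reached and the service starts failing to open files or sockets.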

7. CPU status

In this figure, you can see the status of each CPU in the server, which is a detailed breakdown of CPU Usage. CPU is an elastic resource: even if CPU usage reaches 100%, the service will not crash directly, but it may respond slowly. Keeping a close eye on CPU usage and setting CPU alarms is therefore essential for operations and maintenance.

8. Networking

This figure shows the number of TCP connections in each state, such as SYN SENT and FIN WAIT. With this data you can check the health of requests in time. For example, a large number of connections in the FIN WAIT and CLOSE WAIT states indicates that many slow requests or connection faults are occurring and need to be fixed. You can view these metrics together with the number of open file handles.
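
The per-state connection counts can be reproduced with a short Python sketch that tallies the st column of text in the Linux /proc/net/tcp format. The state codes below are the kernel's hexadecimal TCP state values; the function name is an assumption, and this is not LogKit's code.

```python
from collections import Counter

# Hexadecimal state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count connections per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts
```

A steadily growing CLOSE_WAIT count in this tally usually means the application is not closing sockets after the peer has disconnected.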

ICMP and IPv4

In this figure, you can view the sending and receiving status of network protocols such as ICMP and IPv4.

In this figure, you can see the number of UDP datagrams and the number of UDP errors. If there are too many UDP errors, the network is in bad condition.

Status of each network adapter

The NIC status displays information such as Network Usage, Network Packets, Network drops, and Network errors for each NIC.

9. Swap partition status

This figure shows the swap-in and swap-out activity of the swap partition, as well as its usage.

10. Disk usage

Disks are critically important: a full disk can have devastating effects on services, so disk usage is definitely something to monitor.

Disk I/O

The Disk I/O information includes Disk I/O requests, Disk I/O bytes, and Disk I/O time. If disk I/O is too high, performance problems may occur.
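
Disk I/O counters are cumulative, so rates such as requests per second come from the difference between two samples. The Python sketch below shows that calculation; the function name and the counter names in the example are assumptions, not LogKit field names.

```python
def io_rates(prev, curr, interval_s):
    """Convert two cumulative counter samples into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {key: (curr[key] - prev[key]) / interval_s for key in prev}
```

For instance, if a hypothetical reads counter grows from 100 to 160 over a 3-second interval, the rate is 20 reads per second.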

Disk Usage

In this figure, you can view the total and used disk space, see disk usage in real time, and set up an alarm mechanism. When the available disk space falls below a certain threshold, an alarm is generated.
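
The threshold rule described above can be sketched in Python with the standard library's shutil.disk_usage. The function names and the 10% default threshold are assumptions for this example; in the setup described in this article the alarm itself would be configured in Grafana.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Return the free space as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def check_disk(path="/", min_free_percent=10.0):
    """Report free space for `path` and whether it is below the threshold."""
    usage = shutil.disk_usage(path)
    pct = free_percent(usage.total, usage.free)
    return {"path": path, "free_percent": pct, "alarm": pct < min_free_percent}
```

Calling check_disk("/") returns the current free percentage for the root file system together with a boolean alarm flag.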

Quick start

The following describes how to use the components provided by Pandora to build an operations monitoring application. It takes only four steps.

Note that to use Pandora's services smoothly, you first need a Qiniu account that has passed real-name verification, and then you need to apply for permission to use Pandora.

Step 1: Download and start LogKit

Download the LogKit build for your operating system from the LogKit Download page. You can refer to the LogKit Wiki for details on how to configure LogKit, but if there are no special requirements, just use the default configuration. To start LogKit, run the following command:

./logkit -f logkit.conf

Step 2: Configure the Metric collector

On the LogKit visual configuration page, you can easily configure the metric information to be collected. Enter the configured URL in the browser to access the LogKit manager (http://127.0.0.1:3000 by default).

Open the LogKit configuration assistant and click the "Add system information collector" button to enter the collector editing page.
Select the metric types to be collected. By default, all metric types are collected. Use the drop-down list to turn off the metric types you do not want to collect.
Then select the fields to collect for each metric. The "Select All/None" button quickly selects or deselects all fields. Note: select at least one field for each metric.
Then fill in the configuration of the relevant metrics. Note: some metrics have no configuration to fill in, so none is displayed for them.
Fill in the destination for the data. To send it to the Pandora platform, select the pandora sender and fill in the ak/sk of your own Pandora account and the reponame. If there are no special requirements, the other options can keep their default values. For detailed sender information, refer to the Senders Wiki.
Finally, review the content of the configuration file to catch any errors that may have been overlooked. You can also customize the frequency at which the metric information is collected (3s in this example).
Click "Confirm and submit", and a runner is created to collect the metric information.

Step 3: Configure the Grafana data source

Open the Grafana application in the Qiniu application market and follow these steps to configure it:

Create the application

1. First, log in to the application platform of the Qiniu Portal and find Grafana

2. Click the Deploy Now button to start creating the Grafana application

Enter your application name and application alias, select the deployment region (note: TSDB data is currently in East China, so Grafana can only be deployed in East China), and click OK to create the application.

Application name: a unique application name that must meet the following conditions: 1. The name can contain only letters, digits, and the minus sign (-), and must start and end with a letter or digit. 2. The length cannot exceed 30 characters.
Application alias: the title used for display.
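
The naming rules above can be checked before submitting the form. The Python sketch below encodes them as a regular expression plus a length check; the function name is an assumption, and "letters" is interpreted here as ASCII letters.

```python
import re

# First and last characters must be a letter or digit; the middle may
# also contain the minus sign.
_APP_NAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_app_name(name):
    """Return True if `name` satisfies the application name rules."""
    return len(name) <= 30 and _APP_NAME_RE.fullmatch(name) is not None
```

For example, is_valid_app_name("grafana-01") is accepted, while names that start or end with a minus sign, contain an underscore, or exceed 30 characters are rejected.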

3. After the application starts, enter a password (at least 6 characters long) and click Confirm

Note that since Grafana App has a public domain name, it is recommended to set a strong password (this password can be changed after entering Grafana App).

4. Access Grafana to go to the Grafana page

Note that the Grafana App is exposed to the public network; bookmark its address for subsequent access.

Configure the TSDB data source

Before we can use Pandora TSDB in Grafana, we need to add the data source.

Log in to Grafana and click Data Sources in the menu
Click the Add Data Source button
Enter the name of the data source in Name and select Pandora TSDB for Type
If pandora_tsdb_reponame is not configured in the sender configuration of the metric collector, fill in Name with the pandora_repo_name configured above

Note: the URL must be set to http://localhost:8999

Step 4: Import the Grafana Dashboard configuration file

Download the Grafana Dashboard configuration file

Download a configuration file template at https://pandora-dl.qiniu.com/MetricMemo.json

Import the downloaded dashboard into Grafana

At this point, you can see a polished operations monitoring dashboard. Of course, no monitoring system is complete without alarms, so let's configure them next.

Configure Grafana alarms

Grafana itself gives you complete alarm functionality.

Set up the Grafana alarm channel

Click the New Channel button. In Type you can choose from a variety of alarm methods, including Slack, Email, Webhook, and so on.
For example, you can send an email alarm when CPU usage is higher than 30% (this threshold is only for testing; set it according to your actual production environment). To configure it, edit the corresponding panel, select Alert, and configure the query to be monitored and its threshold.
Click State History to see the alarm history.
When an alarm is triggered, you receive the alarm email.
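
The logic of a threshold rule such as "average CPU usage over the evaluation window is above 30%" can be sketched in Python as follows. This only illustrates the condition; Grafana evaluates the rule itself, and the function name and sample values are assumptions.

```python
def should_alert(values, threshold=30.0):
    """Return True if the average of `values` is above `threshold`.

    `values` stands for the data points returned by the monitored query
    within the evaluation window; an empty window never triggers.
    """
    if not values:
        return False
    return sum(values) / len(values) > threshold
```

Averaging over a window rather than testing single points keeps one short spike from triggering an email.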

For more information about the Grafana alarm configuration, see the Grafana alarm documentation

At this point, a complete server performance monitoring system is ready. Go and try it out!

Other advanced usage

LogKit detailed configuration documents
Grafana configuration document
Self-developed component monitoring
Configure nginx metric monitoring
Configure PHP-FPM monitoring