There is a saying in the operations world: “No monitoring, no operations.” This is no exaggeration. Monitoring is often called the “third eye”: without it, both infrastructure operations and business operations are blind, so monitoring is the foundation of operations work. This matters even more now that data-driven discussion is so popular and it pays to hold the monitoring data yourself. People joke that operations exists to take the blame. But once you have monitoring and plenty of data, and everything is argued from the data, does operations still need to take the blame? For an operations engineer, then, building a monitoring system is the first job.

Before we begin, let us take a broad view of how to choose operations monitoring tools and how to approach the design of a monitoring platform. If you are new to operations, this column is well suited to you. If you have worked in operations for many years, it may still broaden your thinking and perspective.

I. Common operation and maintenance monitoring tools

There are many operations monitoring tools today. Which is good, which is not, and which suits you only becomes clear once you understand their characteristics, so we start there.

1. Cacti

Cacti is a graphical analysis tool for network traffic monitoring built on PHP, MySQL, SNMP and RRDTool. Simply put, Cacti is a PHP program. It gathers information from remote network devices over SNMP (in practice, with the snmpget and snmpwalk commands from the Net-SNMP package), plots it with RRDTool, and displays the graphs through PHP pages. We can use it to show the status or performance trend of monitored objects over a period of time.

Cacti is a very old monitoring tool. It is more accurate to call it a traffic monitoring tool: its traffic monitoring is fairly precise, but it has many drawbacks. The graphs are not attractive, it does not support distributed deployment, and it has no built-in alarm function, so fewer and fewer people use it.
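To make the SNMP-to-graph idea concrete, here is a minimal sketch of how two readings of an interface octet counter can be turned into a traffic rate. It is not Cacti's own code. The function name and parameters are hypothetical, and it assumes a 32-bit counter that wraps around to zero.

```python
def traffic_rate_bps(prev_octets, curr_octets, interval_seconds, counter_bits=32):
    """Return the average bits per second between two counter readings.

    Assumption: the counter only increases and wraps to 0 after reaching
    2**counter_bits - 1, so a negative difference means one wrap occurred.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped between the two polls.
        delta += 2 ** counter_bits
    return delta * 8 / interval_seconds


# Two polls taken 300 seconds apart (illustrative values).
rate = traffic_rate_bps(1000, 38500, 300)
print(f"{rate:.1f} bit/s")
```

A poller would call this once per interval for each interface and hand the result to the graphing layer.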

2. Nagios

Nagios is a free, open source network monitoring tool that can effectively monitor the status of Windows, Linux and Unix hosts, network devices such as switches and routers, printers, and more. When a system or service becomes abnormal, it sends an email or SMS alarm to notify operations staff right away, and it sends a recovery notification by email or SMS once the status returns to normal.

Nagios’s main strength is monitoring and alarming. Its alarm function is the most powerful of these tools and supports many alarm methods. Its weaknesses are a limited data collection mechanism and very plain graphs. When there are many hosts to monitor, adding hosts is also troublesome: configuration is done entirely in text files, and management and configuration are not supported through the web, which is error-prone and hard to maintain.
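Nagios runs its checks through plugins, and a plugin reports its state through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) plus one line of output. The sketch below shows that convention for a disk usage check. The function names and the 80/90 percent thresholds are illustrative assumptions, not values from this article.

```python
import shutil
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_disk_usage(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status code, message) pair.

    The default thresholds are hypothetical examples.
    """
    if used_percent is None or not 0 <= used_percent <= 100:
        return UNKNOWN, "DISK UNKNOWN - no valid usage value"
    if used_percent >= crit:
        code = CRITICAL
    elif used_percent >= warn:
        code = WARNING
    else:
        code = OK
    # Text after "|" is performance data that graphing add-ons can parse.
    message = (f"DISK {STATUS_NAMES[code]} - {used_percent:.1f}% used"
               f" | used={used_percent:.1f}%;{warn:.0f};{crit:.0f}")
    return code, message


def main(path="/"):
    """Check the filesystem at `path`, print one status line, return the code."""
    usage = shutil.disk_usage(path)
    code, message = check_disk_usage(100.0 * usage.used / usage.total)
    print(message)
    return code

# When saved as a standalone plugin script, finish with: sys.exit(main())
```

Keeping the threshold logic in a pure function makes the check easy to test without touching a real disk.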

3. Zabbix

Zabbix is an enterprise-grade open source solution that provides distributed system monitoring and network monitoring through a web interface. It can monitor a wide range of network parameters to help keep server systems running safely, and it provides a flexible notification mechanism that lets operations staff locate and resolve problems quickly.

Zabbix consists of two parts: the Zabbix Server and the optional Zabbix Agent. The Zabbix Server can monitor remote server and network status and collect data through SNMP, the Zabbix Agent, ping, and port checks. It runs on Linux, Solaris, HP-UX, AIX, FreeBSD, OpenBSD, OS X, and more.

Zabbix addresses both Cacti’s lack of alarms and Nagios’s lack of web-based configuration. It also supports distributed deployment, which helped it become popular quickly. Zabbix is now the most popular operations monitoring platform among small and medium-sized enterprises.

Of course, Zabbix has drawbacks too. It consumes a lot of resources, and when many hosts are monitored, monitoring timeouts and alarm delays may occur.

4. Ganglia

Ganglia is a scalable distributed monitoring system designed for HPC (high performance computing) clusters. It monitors and displays the state of the nodes in a cluster. A gmond daemon running on each node collects data on CPU, memory, disk utilization, I/O load, network traffic, and more. The data is then aggregated by the gmetad daemon, stored with RRDTool, and finally presented as historical curves through PHP pages.

The Ganglia monitoring system consists of three parts: gmond, gmetad, and the web frontend. gmond is installed on each client that needs to be monitored, gmetad is the server side, and the web frontend is a PHP-based web UI.

Ganglia’s main feature is that it collects data and presents it centrally, which is its strength and what sets it apart: all data can be aggregated into a single interface. The gmond program on the client consumes almost no system resources, which makes up for Zabbix’s heavy resource consumption.

Finally, Ganglia is particularly well suited to monitoring big data platforms. With only a configuration file, Ganglia can monitor Hadoop and Spark, and nearly a thousand metrics are available, which fully meets the monitoring needs of big data platforms.

5. Centreon

Centreon is a powerful distributed IT monitoring system that monitors networks, operating systems, and applications through third-party components. First, it is open source and free to use. Second, underneath it uses a Nagios-like monitoring engine. The engine periodically writes monitoring data to a database through the NDOUtils module, and Centreon reads that data in real time and displays it in the web interface. Finally, hosts can be managed and configured with one click through the Centreon web interface. In other words, Centreon can be seen as a management and configuration tool for Nagios: its web configuration interface handles the host and service configuration that Nagios would otherwise require by hand.

Centreon’s strengths are one-click configuration and management and support for distributed monitoring. Everything Nagios can do is available through Centreon. Centreon can also integrate with Ganglia: it takes in the data Ganglia collects, then monitors hosts and generates alarms automatically.

6. Prometheus

Prometheus is an open source system monitoring and alerting framework. It suits machine-centric monitoring, such as server hardware metrics, as well as monitoring of highly dynamic service-oriented architectures. For the microservices that are popular today, Prometheus’s multidimensional data collection and query language are especially powerful. Prometheus was designed for reliability, so that you can quickly locate and diagnose problems when a service fails.

7. Grafana

Grafana is an open source metrics analysis and visualization suite. Put plainly, Grafana is a graphical visualization platform that displays monitoring data through a variety of attractive dashboards. If you do not like Zabbix’s graphs, you can use Grafana for visualization instead. Grafana supports many data sources, including Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch and KairosDB.

8. Comparison diagram

II. Design ideas for a unified operation and maintenance monitoring platform

An operations monitoring platform is not simply a matter of downloading an open source tool and installing it. It requires integration and secondary development tailored to the environment and characteristics of what is being monitored before it can fully meet your needs. So let us look at the design ideas behind such a platform.

To build an intelligent operations monitoring platform, running-state monitoring and fault alarming must be the focus. Around these two aspects, all the network, hardware, software, and database resources that the business systems involve are brought into one unified platform. By smoothing over the differences between management software and between data collection methods, the platform unifies management, standardization, processing, presentation, user login, and permission control across the various data sources, and ultimately achieves standardized, automated, and intelligent operations management.

The intelligent operation and maintenance monitoring platform can be divided into six layers and three modules from low to high, as shown in the figure below:

Data collection layer: the bottom layer. It collects network data, business system data, database data, and operating system data, then standardizes and stores what it collects.

Data display layer: the second layer, a web interface. It presents the data obtained by the data collection layer as line charts, bar charts, pie charts, and so on. The graphs help operations staff understand the running state and trend of a host or network over a period of time, and serve as a basis for troubleshooting and solving problems.

Data extraction layer: the third layer. It normalizes and filters the data from the data collection layer and passes the required data to the monitoring and alarm module. It is the interface between the data collection module and the monitoring and alarm module.

Alarm rule configuration layer: the fourth layer. Based on the data from the third layer, it sets alarm rules, alarm thresholds, alarm contacts, and alarm methods.

Alarm event generation layer: the fifth layer. It records alarm events in real time, stores the results in a database for later use, and produces analysis reports from them, so that failure rates and failure trends over a period can be tallied.

User display management layer: the top layer, a web interface. It presents monitoring statistics and alarm results in one place and provides multi-user, multi-permission management with unified user and permission control.

Across these six layers, the functionality is implemented by three modules: the data collection module, the data extraction module, and the monitoring and alarm module. Their functions are as follows:

Data collection module: collects the basic data and displays it graphically. Data can be collected in many ways: through SNMP, through agent modules, or through custom scripts. Common data collection tools include Cacti and Ganglia.

Data extraction module: filters and gathers data, extracting the required data from the data collection module for the monitoring and alarm module. Extraction can be done through the interface the data collection module provides or through custom scripts.

Monitoring and alarm module: handles the configuration of monitoring scripts, alarm rules, alarm thresholds, and alarm contacts, along with the centralized display and historical record of alarm results. Common monitoring and alarm tools include Nagios and Centreon.
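The division of labor among the three modules can be sketched in a few lines. This is only an illustration under simplifying assumptions: the type names, function names, metric names, and contact address are all hypothetical, and a real platform would put tools such as Ganglia and Nagios or Centreon behind each step.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One standardized measurement stored by the data collection module."""
    host: str
    metric: str
    value: float


@dataclass
class AlarmRule:
    """Threshold and contact configured in the monitoring and alarm module."""
    metric: str
    threshold: float
    contact: str


def collect(raw_readings):
    """Data collection: normalize raw (host, metric, value) tuples."""
    return [Sample(host, metric.lower(), float(value))
            for host, metric, value in raw_readings]


def extract(samples, wanted_metrics):
    """Data extraction: keep only the metrics the alarm module asked for."""
    return [s for s in samples if s.metric in wanted_metrics]


def evaluate(samples, rules):
    """Monitoring and alarm: compare each sample with its rule's threshold."""
    by_metric = {rule.metric: rule for rule in rules}
    alarms = []
    for s in samples:
        rule = by_metric[s.metric]
        if s.value > rule.threshold:
            alarms.append(
                f"notify {rule.contact}: {s.host} {s.metric}={s.value} > {rule.threshold}")
    return alarms


# Illustrative readings in inconsistent formats, as raw collectors might emit.
raw = [("web01", "LOAD1", "7.5"), ("web01", "disk_used_pct", "41"), ("db01", "load1", "0.8")]
rules = [AlarmRule(metric="load1", threshold=4.0, contact="oncall@example.com")]
alarms = evaluate(extract(collect(raw), {r.metric for r in rules}), rules)
print(alarms)
```

Because the extraction step is the only link between collection and alarming, either side can be swapped for a different tool without touching the other.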

With the general design ideas in mind, the following describes in detail how to implement such an intelligent operations monitoring system with software.

The figure below shows the topology of an operations monitoring platform built on the design ideas in the figure above. It consists of three parts: the data collection module, the monitoring and alarm module, and the data extraction module, with the data extraction module carrying data between the other two. The data collection module can consist of one or more data collection servers. Each collects metrics directly from the server group, standardizes the data format, and stores the results locally. The monitoring and alarm module obtains the data it needs from the data collection servers through the data extraction module, applies the configured alarm thresholds and alarm contacts, and raises alarms in real time. Alarms can be sent by SMS, email, and so on, and further alarm methods can be added through plug-ins or custom scripts. With that, the monitoring and alarm platform is essentially complete.

III. Selection of an enterprise operation and maintenance monitoring platform

1. Select Zabbix as the monitoring platform for small and medium-sized enterprises

Zabbix is a comprehensive operation and maintenance monitoring platform that integrates data collection, data display, data extraction, monitoring and alarm configuration, user display and other aspects.

Zabbix is monitoring software that is quick to get started with. It can meet the monitoring and alarm needs of small and medium-sized enterprises, which makes it their preferred platform. However, when Zabbix monitors a large number of servers, problems appear, such as inaccurate monitoring data and delayed alarms, because Zabbix places high demands on the performance of the monitoring server. Once the number of monitored servers exceeds 500, monitoring performance deteriorates sharply. A distributed monitoring deployment is then required, and the monitoring server’s performance needs to be improved.

In terms of reliability, if the Zabbix Agent on a client fails, the data it would have collected is lost, and the Zabbix Server is itself a single point of failure. You may therefore need to set up high availability (HA) for the Zabbix Server to protect the data and keep monitoring available.

2. Choose Ganglia+Centreon as the monitoring platform for large Internet enterprises

Combining open source monitoring tools and adding secondary development is the basic strategy large Internet companies follow when building a monitoring platform. With huge numbers of servers and many complex business systems to monitor, no single tool can meet all of an enterprise’s monitoring requirements on its own. Combining several open source tools with secondary development is therefore where monitoring platforms end up.

Ganglia is recommended because its client software uses very few resources on the monitored server and has a large number of extension plug-ins, so extending monitoring is easy. Combined with Centreon, a professional web monitoring platform, it covers data collection, data display, data extraction, monitoring and alarm configuration, and user display. This is why we recommend the Ganglia+Centreon combination for monitoring very large numbers of servers.

IV. The evolution of our operation and maintenance monitoring platform

This section is a summary of experience. Drawing on how our monitoring platform has evolved over the years, I have summarized the ideas and strategies for building a monitoring platform at different stages and at different numbers of machines.

1. The stage with fewer than 100 machines

With so few machines in this period, monitoring needs are simple. Monitoring mainly serves to notify of problems, locate them quickly, and resolve them. The characteristics required of the monitoring platform at this stage are:

(1) Simple to deploy and easy to use

(2) Stable in operation, without failures

(3) Able to send alarms by email, SMS, and so on

Given these requirements, popular open source monitoring software such as Nagios, Cacti, Zabbix, or Ganglia is a good fit. Popular open source products are well documented, quick to get started with, and backed by plenty of prior experience, so problems are easy to solve.

At first we chose Nagios because it was the first tool to become popular. Later we switched to Zabbix because adding hosts and services in Nagios was inconvenient. At this stage, Zabbix is probably the best choice.

2. The stage with 200 to 1000 machines

At this stage, as the number of machines grew, monitoring requirements became more complex, although monitoring was still mainly used for notification and alarming, for discovering problems, and for preventing the same problems from recurring. Given these characteristics, we did the following work on the monitoring platform during this period:

(1) Classification of monitoring content: with more machines to monitor, there is also more to monitor, so we classified monitoring by purpose into basic system monitoring data, network monitoring data, and business monitoring data.

(2) Full-coverage monitoring: all machines were brought under monitoring, covering both hardware and software. Hardware monitoring mainly covers hardware performance and faults. Software monitoring covers the basic monitoring data mentioned in the first step, plus monitoring of business logic that covers as much of the business process as possible. A large amount of custom monitoring reduced and eliminated recurring problems and kept the business running stably.

(3) Multiple alarm methods, so that nothing is missed: all monitoring items were classified by importance and urgency and sent by email, WeChat, SMS, or phone call accordingly. Each level of monitoring maps to different people, so that every monitoring item has someone responsible for handling it. For important business, notifications repeat continuously until the alarm is handled.
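The routing described in item (3) can be sketched as two lookup tables and a repeat rule. Everything concrete here is a hypothetical example: the article names the channels but does not say which level uses which, and the owner names are placeholders.

```python
from dataclasses import dataclass

# Hypothetical mapping from alarm level to notification channels.
CHANNELS_BY_LEVEL = {
    "info": ["email"],
    "warning": ["email", "wechat"],
    "critical": ["wechat", "sms", "phone"],
}

# Hypothetical owners for each monitoring category from item (1).
OWNERS_BY_CATEGORY = {
    "system": ["ops-oncall"],
    "network": ["net-oncall"],
    "business": ["ops-oncall", "biz-owner"],
}


@dataclass
class Alarm:
    category: str
    level: str
    text: str
    acknowledged: bool = False


def route(alarm):
    """Return the (person, channel) pairs that should receive this alarm."""
    people = OWNERS_BY_CATEGORY.get(alarm.category, ["ops-oncall"])
    channels = CHANNELS_BY_LEVEL.get(alarm.level, ["email"])
    return [(person, channel) for person in people for channel in channels]


def should_renotify(alarm):
    """Critical alarms keep notifying until someone acknowledges them."""
    return alarm.level == "critical" and not alarm.acknowledged


alarm = Alarm(category="business", level="critical", text="order API failing")
for person, channel in route(alarm):
    print(f"{channel} -> {person}: {alarm.text}")
```

A scheduler would call `should_renotify` on each open alarm at a fixed interval and resend through `route` while it returns true.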

The difficulty at this stage is handling alarm information. As more machines and more services are monitored, the volume of alarms explodes, and it is common to receive thousands of alarm emails a day. That many emails defeats the purpose of alarming, because nobody can read every one, and many of them are unnecessary. An email sent every time the system load rises briefly, for example, is entirely unnecessary.

The main task at this stage is therefore to tune and optimize the alarm policy so that unnecessary alarm emails are kept to a minimum. For system load, for example, you can alarm only after the load has stayed above the threshold for several consecutive checks over a set period. By optimizing the alarm policy this way, we cut alarm volume sharply, to a few dozen at most per day, so that no alarm gets overlooked.
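A minimal sketch of the consecutive-breach idea follows. The class name, the threshold of 8.0, and the requirement of three consecutive samples are illustrative assumptions. Monitoring tools offer their own ways to configure this, so treat the code as a model of the logic and not as any tool’s API.

```python
class ConsecutiveBreachAlarm:
    """Fire only after `required` consecutive samples exceed `threshold`."""

    def __init__(self, threshold, required=3):
        self.threshold = threshold
        self.required = required
        self.breaches = 0      # length of the current over-threshold run
        self.firing = False    # True while the alarm is active

    def observe(self, value):
        """Feed one sample and return "ALARM", "RECOVERY", or None."""
        if value > self.threshold:
            self.breaches += 1
            if self.breaches >= self.required and not self.firing:
                self.firing = True
                return "ALARM"
            return None
        # A sample at or below the threshold resets the run.
        self.breaches = 0
        if self.firing:
            self.firing = False
            return "RECOVERY"
        return None


# Illustrative load samples: one brief spike, then a sustained overload.
load_alarm = ConsecutiveBreachAlarm(threshold=8.0, required=3)
samples = [9.1, 3.0, 8.5, 9.0, 9.7, 10.2, 2.1]
events = [load_alarm.observe(v) for v in samples]
print(events)
```

The single spike at the start produces no notification. Only the sustained run triggers one alarm, followed by one recovery message when the load drops.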

3. The stage where the number of machines exceeds 1000

As the business kept growing, more and more servers were needed. Once we passed 1000 servers, the monitoring situation changed and many strange monitoring problems appeared, mainly the following:

(1) The alarm is not timely

Once we passed 1000 servers, our Zabbix often went on strike: sometimes monitoring data could not be displayed in time, and sometimes alarms did not arrive. Alarm delay is the most frightening of these. An online business running 24/7 cannot be allowed to go down, yet although monitoring had detected the anomaly, the alarm from the monitoring system arrived an hour or several hours later. What is the point of monitoring then? Timeliness is the first requirement of a monitoring system, so this problem had to be solved.

How did we solve it? Besides tuning Zabbix itself, for example by deploying distributed proxies and enabling Zabbix active mode, we also extended and optimized data collection. We used Ganglia instead of Zabbix to collect basic data, while Zabbix continued to handle business data. Sharing the collection load this way greatly reduced the load on Zabbix, and the accuracy and timeliness of data collection returned to normal.

(2) A single point of failure occurs in the alarm system

With so many servers, the volume of collected data also grew rapidly. On one occasion the monitoring server went down unexpectedly, and by the time the system was restored and running again an hour had passed, during which operations was blind.

After that outage, we deployed the monitoring servers in a distributed, highly available setup to avoid a single point of failure, and we back up the monitoring data to a remote location. When a monitoring server fails, the system switches to the standby monitoring system automatically, and monitoring data is saved and synchronized automatically.

(3) The alarm monitoring system cannot meet the alarm requirements

As the business grew, customers demanded ever more stability. To keep the business systems running stably, a requirement for business logic monitoring was raised: monitoring the running logic of the business systems and alarming when that logic fails. There are, of course, no ready-made tools or code for monitoring business logic. It can only be developed in-house, on top of Zabbix, according to the business logic. By improving the business logic interfaces and reporting data, we carried out a good deal of secondary development on Zabbix to meet these needs.

Finally, the monitoring platform is an integral part of operations work. How to build a suitable one differs from company to company, and the pain points each operations team faces differ too. But whatever the requirements and however many there are, the essence stays the same: with the monitoring data from your machines in hand, operations can do a great deal. On the road of operations monitoring, let us move forward together.