Operations monitoring, and the operations industry as a whole, is changing rapidly. In the face of so much change and uncertainty, monitoring planning should first protect the sustainability of the technology investment: avoid locking into one particular architecture or scheme, build on the core technologies and key requirements, follow technical trends, evolve smoothly while keeping the technology current, and deliver business value continuously, in phases, as the system evolves. This article introduces the technical selection of several common operation and maintenance (O&M) monitoring systems.

Monitoring system functions

The monitoring system is a core component of the O&M system or platform; it closes the data loop of O&M work. By function, it is divided into modules such as data collection, data reporting, data storage, alarming, large-screen display, and reporting. By technical scenario, it is divided into vertical technical domains such as machine room monitoring, hardware monitoring, network monitoring, operating system monitoring, middleware monitoring, cloud platform monitoring, service monitoring, and dial testing. By service scenario, it is divided into vertical service areas such as resource monitoring, cost monitoring, audit monitoring, quality monitoring, operation monitoring, and security monitoring.

Whichever perspective is taken, the core responsibility of the monitoring system is to ensure that all information on the platform is collected promptly, processed correctly, alerted on accurately, and displayed sensibly.

Position of the monitoring system

O&M is responsible for keeping business modules running normally, which requires building an O&M technology stack up from the underlying cloud or hardware, as shown in the following figure. Generally speaking, from bottom to top the stack consists of the environment (such as the IDC machine room), equipment (such as cloud hosts and hard disks), the basic software system (such as Linux), deployment and management (such as Docker and Kubernetes), middleware (such as a MySQL database), business scheduling, and finally the business modules at the top. Different companies and business scenarios implement the O&M technology stack differently, but all of them fall within the scope shown in the figure.

In the O&M stack, the monitoring system (shown on the right side of the figure) is responsible, along the vertical dimension, for health data collection and risk warning at every level and for every component. Because the monitoring system runs through all levels of the O&M technology stack, it places high demands on technical breadth, reliability, and engineering strength.

The core components of the monitoring system

Data collector

The data collector is a tool for collecting and reporting data, and it supports a plug-in mechanism. It can collect O&M data directly from the system it runs on, pull data from the APIs of other systems, or listen for monitoring data sent by the system or by third-party components.
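The plug-in mechanism described above can be sketched as a small registry in which each plug-in is a function that returns metric samples. This is only an illustration under simple assumptions, not the design of any particular collector; the names `register`, `collect_all`, and both sample plug-ins are hypothetical.

```python
import os
import time
from typing import Callable, Dict, List, Tuple

# A sample is (metric name, value, unix timestamp). All names are illustrative.
Sample = Tuple[str, float, float]
Plugin = Callable[[], List[Sample]]

_PLUGINS: Dict[str, Plugin] = {}


def register(name: str) -> Callable[[Plugin], Plugin]:
    """Decorator that adds a collection function to the plug-in registry."""
    def wrap(func: Plugin) -> Plugin:
        _PLUGINS[name] = func
        return func
    return wrap


@register("process")
def collect_process() -> List[Sample]:
    # Collect directly from the system the collector runs on.
    return [("process_pid", float(os.getpid()), time.time())]


@register("static_demo")
def collect_static() -> List[Sample]:
    # Stand-in for data pulled from another system's API.
    return [("demo_queue_length", 3.0, time.time())]


def collect_all() -> Dict[str, List[Sample]]:
    """Run every plug-in; one failing plug-in must not stop the others."""
    results: Dict[str, List[Sample]] = {}
    for name, plugin in _PLUGINS.items():
        try:
            results[name] = plugin()
        except Exception:
            results[name] = []  # a real collector would log and report this
    return results


if __name__ == "__main__":
    for plugin_name, samples in collect_all().items():
        print(plugin_name, samples)
```

Isolating each plug-in behind a `try` block reflects the requirement that a collector stay reliable even when one data source misbehaves.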

Data storage warehouse

The data storage warehouse is usually a time series database, which handles a large volume of monitoring data writes as well as complex monitoring data queries. It generally provides necessary functions such as data compression, data expiration, and aggregation operations.
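Two of the functions named above, data expiration and aggregation, can be sketched in a few lines over an in-memory list of points. This is a toy model for illustration only; a real time series database does this on compressed on-disk blocks, and the function names `expire` and `downsample_avg` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value)


def expire(points: List[Point], now: int, retention_s: int) -> List[Point]:
    """Data expiration: drop points older than the retention window."""
    cutoff = now - retention_s
    return [(ts, v) for ts, v in points if ts >= cutoff]


def downsample_avg(points: List[Point], bucket_s: int) -> List[Point]:
    """Aggregation: average the values inside each fixed-width time bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_s].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
recent = expire(raw, now=120, retention_s=90)     # keeps timestamps >= 30
per_minute = downsample_avg(recent, bucket_s=60)  # one averaged point per minute
print(per_minute)  # [(0, 3.0), (60, 6.0), (120, 9.0)]
```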

User operation and visualization interface

The user operation interface is the entry point through which users manage the monitoring system. It must make monitoring indicators and alarms easy to manage and maintain. The data visualization interface displays monitoring data. It must support the displays that time series data requires and offer some flexibility in querying.

Data processing engine

The data processing engine processes the time series data in the data storage warehouse and generally supports streaming and batch processing. One of the most important functions of a data processing engine is the calculation of monitoring alarms.
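As an illustration of alarm calculation, the sketch below evaluates one common kind of rule in batch mode: fire only when every sample in a recent window exceeds a threshold, so that a single spike does not trigger an alarm. The rule shape and the name `should_fire` are assumptions made for this example, not taken from a specific product.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value)


def should_fire(points: List[Point], now: int, threshold: float, for_s: int) -> bool:
    """Fire only if every sample in the last `for_s` seconds exceeds the threshold."""
    window = [v for ts, v in points if now - for_s <= ts <= now]
    # An empty window means there is no evidence, so the rule does not fire.
    return bool(window) and all(v > threshold for v in window)


cpu = [(0, 50.0), (60, 92.0), (120, 95.0), (180, 97.0)]
print(should_fire(cpu, now=180, threshold=90.0, for_s=120))  # True
print(should_fire(cpu, now=180, threshold=90.0, for_s=180))  # False
```

A streaming engine would keep the window state per series and update it on each new sample instead of rescanning stored data.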

Key technologies of the monitoring system

A monitoring system covers a wide range of technologies and has a deep technology stack, so it is prone to technical risk. When designing or evaluating a monitoring system, pay special attention to the following key technologies:

The collector

The collector determines where monitoring data comes from, and its quality determines the coverage, quality, and timeliness of that data. A good monitoring system should ship with a large number of collectors for common technical scenarios and provide a convenient interface for custom data. Monitoring data from standard scenarios accounts for about 70% of all monitoring data, so a large set of standard collectors greatly reduces the cost of owning the monitoring system. Custom monitoring data accounts for about 30% of all monitoring data. A well-designed custom data interface makes it easier to schedule, organize, and collect custom data sources, and lays a solid engineering foundation for later secondary development.

Time series storage technology

Managing, storing, and processing time series are the core links in the monitoring closed loop. The key points of time series technology include availability, reliability, compression ratio, cleanup of old data, index management, and multi-dimensional aggregation.

Query language and query efficiency

The query language is the interface for querying monitoring data. A good query language unlocks much of the value of monitoring data, while a poor one limits further processing and use of that data. Some monitoring systems do not support querying data through statements at all; such systems should be avoided. Query efficiency affects the efficiency of the whole monitoring system, especially in scenarios such as alarm calculation, report generation, and data statistics, and slow queries greatly limit how the data can be used.

Alarm policy configuration mode

Alarm policy configuration should be judged on flexibility and maintainability. Hybrid architectures, microservices, and other new technologies have produced more modern business technology stacks, which place higher demands on the flexibility of alarm policies. Alarm policies should support conditional alarms, combined-condition alarms, year-over-year and period-over-period comparison, regression, linear fitting, and other advanced functions. Ideally they also support alarm merging based on clustering algorithms, which is generally regarded in the industry as the most effective and practical merging method at present. Cloud native technology and containers bring highly dynamic server environments, which require alarm policy configuration that is easier to maintain. When the business environment changes frequently, or even automatically, a configuration method that lacks maintainability cannot keep up with those changes. This not only costs a great deal of manpower but also easily leads to missed or incorrect configuration.
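The year-over-year and period-over-period comparisons mentioned above follow the same pattern: compare the current value with the value one period earlier and alarm on a large relative change. The sketch below is a minimal version that assumes samples are stored at exact timestamps; the names `relative_change` and `drop_alarm` and the order-count data are hypothetical.

```python
from typing import Dict, Optional

DAY_S = 86_400  # seconds in one day


def relative_change(series: Dict[int, float], now: int, period_s: int) -> Optional[float]:
    """Relative change between the value at `now` and the value one period earlier."""
    current = series.get(now)
    previous = series.get(now - period_s)
    if current is None or previous is None or previous == 0:
        return None  # not enough data to compare
    return (current - previous) / previous


def drop_alarm(series: Dict[int, float], now: int, period_s: int, max_drop: float) -> bool:
    """Alarm when the value fell by at least `max_drop` (0.3 means 30%)."""
    change = relative_change(series, now, period_s)
    return change is not None and change <= -max_drop


# Hypothetical order counts: same time yesterday, five minutes ago, and now.
orders = {0: 1000.0, DAY_S - 300: 980.0, DAY_S: 600.0}
print(drop_alarm(orders, now=DAY_S, period_s=DAY_S, max_drop=0.3))  # same time yesterday
print(drop_alarm(orders, now=DAY_S, period_s=300, max_drop=0.3))    # previous sample
```

Changing only `period_s` switches between the two kinds of comparison, which is one way a policy format can stay flexible without multiplying rule types.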

API and secondary development interface

As programmable infrastructure gradually turns operations work into software, this trend continues to move up the O&M stack. As the center of the O&M system, the monitoring system needs a powerful API and secondary development interface so that it can work well with the CMDB, the virtualization environment, the deployment system (CI/CD), the O&M automation system, and other O&M subsystems. An isolated monitoring system becomes a data island and a bottleneck in the O&M workflow, and holds back the overall planning and technical evolution of the O&M system.
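One concrete form of this cooperation is keeping the set of monitored hosts in step with the CMDB. The sketch below only computes the reconciliation plan; fetching the two host sets and applying the plan would go through the CMDB and monitoring APIs, which are not shown. `SyncPlan`, `plan_sync`, and the host names are hypothetical.

```python
from typing import List, NamedTuple, Set


class SyncPlan(NamedTuple):
    to_add: List[str]     # in the CMDB but not yet monitored
    to_remove: List[str]  # still monitored but gone from the CMDB


def plan_sync(cmdb_hosts: Set[str], monitored_hosts: Set[str]) -> SyncPlan:
    """Compare the CMDB inventory with the hosts currently monitored."""
    return SyncPlan(
        to_add=sorted(cmdb_hosts - monitored_hosts),
        to_remove=sorted(monitored_hosts - cmdb_hosts),
    )


cmdb = {"web-1", "web-2", "db-1"}
monitored = {"web-1", "old-cache-1"}
plan = plan_sync(cmdb, monitored)
print(plan.to_add)     # ['db-1', 'web-2']
print(plan.to_remove)  # ['old-cache-1']
```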

Common technology selection

Zabbix

In the native Zabbix scheme, Zabbix Agent collects data; Zabbix Server receives and stores the data and calculates alarms; data is displayed through the Zabbix Web UI or Grafana; and custom monitoring data is collected by shell scripts, as shown in the following figure:

Advantages: The scheme is mature and the initial cost of ownership is low.

Disadvantages: Performance and management efficiency become bottlenecks with more than 1,000 servers; custom scripts are costly to maintain; scalability is poor.

Zabbix + secondary development

This scheme builds on the data storage and alarm calculation capabilities of Zabbix Server and uses some of the built-in monitoring indicators of Zabbix Agent. A self-developed data reporter manages and collects custom monitoring indicator data. Alarm configuration and custom collection are managed through the CMDB or a server, and the alarm configuration is synchronized automatically to Zabbix Server and to the data reporter. Zabbix Server generates alarms and sends them to the alarm center, which manages and sends alarms in a unified manner, as shown in the following figure:

Advantages: Alarm configuration and custom collection are managed in a unified manner and can be customized for each service scenario, providing a good O&M experience and high maintainability.

Disadvantages: Performance and management efficiency become bottlenecks with more than 1,000 servers; data capability is weak; the technology investment is not sustainable, and later technical routes such as intelligent O&M and data-driven operations are closed off.

Prometheus

In the native Prometheus scheme, data is collected with collectors from the open source community and stored in Prometheus, alerting rules are evaluated by Prometheus and the resulting alarms are sent through Alertmanager, and data is displayed through Grafana, as shown below:

Advantages: The open source community offers a large number of collectors that can be used directly; data capability and data processing capability are both strong; Prometheus is the de facto standard for the new generation of monitoring systems, so the risk to technology investment is low and the technology dividend is large.

Disadvantages: There is no visual management interface, and the barrier to alarm configuration and data querying is high. The system has many loosely coupled components, so management and maintenance costs are high.

OpsMind

OpsMind is compatible with the core functions of Prometheus. Building on the excellent secondary development interfaces of Prometheus, it adds a self-developed distributed storage engine, alarm engine, indicator management, data query, and other business functions, making full use of the core advantages of Prometheus while making up for its functional gaps, as shown below:

It packages Prometheus as a complete monitoring solution and adds AIOps capabilities:

  1. Provides a complete distributed solution that greatly improves system capacity, performance, and stability

  2. Makes system management visual and encourages the requesting teams to complete basic configuration themselves, which reduces mechanical work and shortens the response time to requirements

  3. Turns data query into a customizable product feature and democratizes data, so that the platform directly delivers data capability and data value

  4. Merges alarms intelligently to reduce missed alarms and repeated alarms

  5. Provides monitoring item and alarm review services based on industry characteristics, drawing on the industry's best monitoring practices

  6. Provides product maintenance, technical support, and customized development to ensure the customer's autonomy and control and to protect the continuity of the technology investment

At present, this scheme is widely recognized and adopted by leading Internet companies. The technology investment is sustainable, and because the scheme follows the direction of the industry, it benefits as much as possible from the industry's technical progress.

Technology trends

In the process of monitoring system technology evolution, we must continue to pay attention to and appropriately follow the following technological trends:

The central role of monitoring systems

The monitoring system plays an increasingly central role in the overall O&M system, and the O&M system is gradually shifting from process-driven to data-driven. We should pay more attention to the openness of the monitoring system, so that it can connect and integrate with all other O&M subsystems and provide data, algorithms, and other technical output to them.

Automatic identification and automatic collection

The wave of cloud native technology has brought mixed technology stacks and highly dynamic server architectures. We should attach importance to the autonomy of the collector: when facing a complex and changing monitored environment, the collector should, as far as possible, identify the environment automatically and collect indicators on its own.

Focus on high-dimensional data management

The advent of clouds, containers, and microservices has increased the number of monitored objects by two or three orders of magnitude, so high-dimensional data management capabilities are particularly important. Our time series management architecture should be well prepared for time series numbering in the billions.
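A rough way to see where the extra orders of magnitude come from is to multiply the number of metrics per target by the number of distinct values of each label dimension. The sketch below computes that upper bound; all figures are made-up illustrations rather than measurements, and `estimate_series` is a hypothetical helper.

```python
from math import prod
from typing import Dict


def estimate_series(label_cardinality: Dict[str, int], metrics_per_target: int) -> int:
    """Upper bound on time series: metrics x product of distinct label values."""
    return metrics_per_target * prod(label_cardinality.values())


# Hypothetical environment before containers: one label dimension (host).
before = estimate_series({"host": 2_000}, metrics_per_target=50)

# Hypothetical containerized environment: cluster x node x container.
after = estimate_series(
    {"cluster": 10, "node": 200, "container": 100}, metrics_per_target=50
)

print(before)           # 100000
print(after)            # 10000000
print(after // before)  # 100, i.e. two orders of magnitude
```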

Introduction of data science and machine learning

Our architecture should support the introduction of data science and machine learning technologies. AIOps is still evolving rapidly, and many algorithms and data methods are still changing, so the architecture should leave enough flexibility for such changes.
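As a minimal example of a data science method applied to monitoring data, the sketch below flags a new sample whose z-score against a recent baseline window is too large. It is a simple statistical baseline chosen for illustration, not a recommendation of any particular AIOps algorithm; `is_anomalous` and the latency values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def is_anomalous(history: List[float], value: float, threshold: float = 3.0) -> bool:
    """Flag `value` if it lies more than `threshold` standard deviations from the baseline."""
    if not history:
        return False  # no baseline to compare against
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu  # flat baseline: any deviation is unusual
    return abs(value - mu) / sigma > threshold


# Hypothetical response times in milliseconds.
latency_ms = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0]
print(is_anomalous(latency_ms, 400.0))  # True
print(is_anomalous(latency_ms, 103.0))  # False
```

Keeping the detector behind a small function boundary like this makes it easier to swap in a different algorithm later, which is the kind of flexibility the architecture should preserve.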

Emphasize data visualization

As the amount of monitoring data grows geometrically, traditional display methods struggle to convey the precise meaning of large-scale data. Beyond simple displays such as line charts, histograms, and scatter plots, we should build up more data visualization techniques suited to O&M big data.

Reflect business value from the O&M perspective

The O&M environment carries the running business, so O&M data must be given business meaning: for example, service requests correspond to business orders, service response time corresponds to user experience, and resource utilization corresponds to the business cost model. We should build on the data capabilities of the monitoring system, uncover the business meaning of monitoring data, and deliver the business value of the monitoring system.