Introduction

In large-scale distributed service scenarios, service versions iterate rapidly, business scale keeps expanding, monitoring scenarios change constantly, online faults can occur at any time, and each platform is complex. How to keep online services running stably while improving operational efficiency and reducing operational cost is the central challenge for a monitoring platform.

1. The Nature of Monitoring

Before designing a monitoring platform, let’s first discuss what monitoring is, starting from the word itself. The original Chinese term, 监控, splits into two characters: “monitor” (监) and “control” (控). To “monitor” means to observe, 7*24 hours and without interruption, the hardware and software resources involved in the platform, collect their runtime information, and continuously detect anomalies. To “control” means to manage: once an anomaly is detected, it must be handled promptly with appropriate control measures, such as sending an alarm to wake up O&M staff for troubleshooting, or triggering predefined self-healing actions for quick recovery.

Therefore, “monitoring” is the means and “control” is the result. Building a monitoring platform is really about continuously enriching the ways and means of “monitoring”, so as to speed up the management and “control” of the online business platform and ensure that every platform and business online runs stably and healthily.

The core business process of the monitoring platform is abstracted into seven steps, as shown in the figure below:

When a fault occurs, the monitoring platform must first detect it in time. Once an anomaly is detected, the platform notifies the result, either to R&D or O&M staff for manual handling, or to the central decision platform to judge whether automatic recovery is possible; in either case the fault loss must be stopped quickly. The platform or the R&D staff then locate the root cause of the detected problem and perform recovery or an upgrade based on the actual situation. Finally comes the fault review: R&D, O&M, and other related staff review the cause of the problem and agree on preventive measures so the same problem does not recur. If the platform’s AI is sufficiently intelligent, it can also train new handling strategies from new faults and scenarios and update the policy library.
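The seven-step flow above can be sketched as a simple ordered pipeline. This is a minimal illustration, not the platform's actual design; the step names and the `Fault` record are assumptions inferred from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Fault:
    service: str
    symptom: str
    log: List[str] = field(default_factory=list)

# The seven steps named in the text, modeled as an ordered pipeline.
def detect(f: Fault) -> None:    f.log.append("detected")
def notify(f: Fault) -> None:    f.log.append("notified")
def decide(f: Fault) -> None:    f.log.append("decision: auto-recovery possible")
def stop_loss(f: Fault) -> None: f.log.append("stop-loss applied")
def locate(f: Fault) -> None:    f.log.append("root cause located")
def recover(f: Fault) -> None:   f.log.append("recovered")
def review(f: Fault) -> None:    f.log.append("post-mortem reviewed")

PIPELINE: List[Callable[[Fault], None]] = [
    detect, notify, decide, stop_loss, locate, recover, review,
]

def handle(fault: Fault) -> Fault:
    """Run a fault through all seven steps in order."""
    for step in PIPELINE:
        step(fault)
    return fault
```

In a real platform each step would be a subsystem (detection engine, notification gateway, decision center, and so on); the point here is only the ordering and the hand-off between steps.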

2. Data Collection

Data collection is the foundation of the monitoring platform; all downstream services process their business flows on top of the collected monitoring data. The table below gives a rough picture of the data to collect; a real environment involves far more metrics than this.

For collectors, open-source monitoring solutions already ship mature data collectors, such as Telegraf in the TICK stack or the exporters in the Prometheus ecosystem. They cover most common machine and application-service data collection, including middleware, and can also be customized to fit your own business.
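The shape of a custom collector can be sketched in a few lines: registered gather functions are polled and each reading is emitted as a timestamped sample. The class and metric names here are illustrative assumptions, and the CPU gauge is faked; a real deployment would read the values from the OS or use an agent such as Telegraf or a Prometheus exporter.

```python
import time
from typing import Callable, Dict, List

class Collector:
    """Minimal custom collector: poll registered gather functions and
    emit each reading as a timestamped sample."""

    def __init__(self) -> None:
        self._gatherers: Dict[str, Callable[[], float]] = {}

    def register(self, metric: str, fn: Callable[[], float]) -> None:
        self._gatherers[metric] = fn

    def collect(self) -> List[dict]:
        now = time.time()
        return [
            {"metric": name, "value": fn(), "ts": now}
            for name, fn in self._gatherers.items()
        ]

# Register a fake CPU gauge for illustration.
c = Collector()
c.register("cpu.usage_percent", lambda: 42.0)
samples = c.collect()
```

A production collector would add scheduling, batching, and shipping to the storage layer, but the register/poll/emit loop is the core of all of them.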

3. CMDB

In the early stage of a business platform, a small number of servers and O&M staff may be enough to meet production needs. But as business scale grows, monitored object categories multiply, and monitoring scenarios get more complex, how to monitor and operate large-scale infrastructure and application services becomes a problem we must face and solve. The core value of a CMDB is to organize all kinds of monitoring and O&M resources efficiently and to serve as the data foundation of the monitoring platform, improving the efficiency of monitoring and operations.

So what exactly is a CMDB? CMDB stands for Configuration Management Database. Depending on the scenario, it can be application-oriented or service-oriented. Simply put, the CMDB manages resource models and base data so that the applications on the monitoring platform can conveniently use resource and model data for business processing and data synchronization. Building a CMDB starts with a high degree of abstraction of the resource model: monitoring and O&M involve many resource objects, such as machine rooms, machines, various devices, and application services, so a universal resource model is needed to cope with different application scenarios. The full life cycle of a resource object, from purchase through use and monitoring to replacement, involves synchronizing resource data across multiple subsystems, so data consistency and accuracy between subsystems must be guaranteed. With those two pieces in place, the value of the CMDB data can be exploited: it must provide strong support for data visualization, automated monitoring and O&M, intelligent monitoring, and data operations.

Monitored objects fall into two categories. One is infrastructure: equipment rooms, racks, servers, network devices, front-end devices, and power supplies. The other is application objects built on top of that infrastructure: application services, middleware, and so on. The following figure shows the relationship between the two types of monitored objects.

Whether a monitored object is a facility object or an application object, it is a kind of resource information, and it needs to be supported and extended through an abstract resource model. From the object relationships above, we can derive the following resource model.
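A universal resource model of the kind described can be sketched as a single record type with free-form attributes and typed relationships, so facility objects and application objects share one shape. The field names, relation kinds, and sample resources below are illustrative assumptions, not the article's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Resource:
    """Generic CMDB resource: a type, free-form attributes, and typed
    relationships to other resources (e.g. 'deployed_on', 'located_in')."""
    id: str
    type: str                                      # e.g. "room", "server", "service_instance"
    attrs: Dict[str, str] = field(default_factory=dict)
    relations: Dict[str, List[str]] = field(default_factory=dict)

    def relate(self, kind: str, other: "Resource") -> None:
        self.relations.setdefault(kind, []).append(other.id)

# A facility chain and an application object sharing the same model.
room = Resource("room-1", "room")
server = Resource("srv-17", "server", {"env": "production"})
inst = Resource("order-svc-0", "service_instance")
server.relate("located_in", room)
inst.relate("deployed_on", server)
```

Because relationships are just named edges, the same model covers tenant and environment isolation: an `env` attribute or a `belongs_to` relation partitions resources by development, test, pre-release, and production without changing the schema.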

In real scenarios, resource isolation is also required. A tenant may apply for a batch of servers whose resources are allocated by development, test, pre-release, and production environment. We therefore need to organize and manage resources along the business axis, combining machine resource management with the application’s business scenarios.

4. Root Cause Tracing

Take a service crash as an example. It can have many causes: a memory overflow triggered by a bug in the service, CPU exhaustion on the server, or a failure in a dependent service. Fault root cause tracing is therefore an important means of helping R&D and O&M staff locate faults; to minimize the loss a fault causes, its root cause must be located quickly and accurately. Root cause tracing can be divided into trend-chart-assisted analysis and independent analysis by the system.

1. Trend-chart-assisted analysis

The monitoring platform needs a panoramic monitoring dashboard with full information and comprehensive coverage, which users can customize to their own service requirements. It should also display the running-trend data of equipment rooms, racks, servers, clusters, service instances, and middleware. When a fault occurs, some data must change, and those changes must show up in the trend charts; R&D and O&M staff can then judge the likely causes of the anomaly from the charts.

2. Independent analysis by the system

The panoramic dashboard and fine-grained charts above require the participation of R&D and O&M staff; they are a means of after-the-fact analysis. But the monitoring platform exists precisely to reduce manual maintenance effort and locate faults fast. Independent analysis and root-cause mining by the system itself are therefore where the real value of the monitoring platform lies, and also where the business and technical difficulty lies.

During fault cause location, key events can be found through fault-area screening and multi-dimensional correlation analysis to pin down the fault. And if AI techniques are brought in to continuously train the corresponding analysis models, fault location can eventually be achieved without manual intervention.

(1) Fault-area screening. When a fault occurs, the equipment rooms and machines where the fault may sit must be screened out first. If a service’s interface responses time out: which applications does the service involve, which service instances correspond to those applications, and in which rooms and on which machines are those instances deployed? What middleware does it depend on, and where is that middleware deployed? The first step is to determine which machines in which rooms, and which services, are likely to have problems. On that basis, check whether the servers in those rooms are under high load, whether the services are healthy, whether the middleware they depend on is running normally, and whether there are persistent error logs, so as to narrow the fault scope further.
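Fault-area screening is essentially a walk over the CMDB's dependency graph from the faulty service outward. A minimal sketch, with an assumed in-memory graph (the service, instance, middleware, machine, and room names are all made up for illustration):

```python
from collections import deque
from typing import Dict, List, Set

# Dependency edges as they might come out of the CMDB:
# service -> instances/middleware, instance -> machine, machine -> room.
GRAPH: Dict[str, List[str]] = {
    "order-svc":   ["order-svc-0", "order-svc-1", "redis"],
    "order-svc-0": ["srv-17"],
    "order-svc-1": ["srv-23"],
    "redis":       ["srv-40"],
    "srv-17":      ["room-1"],
    "srv-23":      ["room-2"],
    "srv-40":      ["room-1"],
}

def fault_area(service: str) -> Set[str]:
    """Breadth-first walk outward from the faulty service, returning every
    object (instance, middleware, machine, room) that could be involved."""
    seen: Set[str] = set()
    queue = deque([service])
    while queue:
        node = queue.popleft()
        for dep in GRAPH.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

The resulting set is the candidate fault area; the load, health, and error-log checks described above then run only against these objects.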

(2) Multi-dimensional correlation analysis. After the fault area has been narrowed, the fault must be explored further to dig out where the platform is really bleeding. Beyond the checks above, the platform’s event information needs to be correlated: for example, was there a release, upgrade, or configuration change shortly before the fault? Based on the locked-down fault area and the corresponding event stream, a comprehensive judgment produces a ranked list of candidate root causes, each with a computed fault-proportion value.
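One way to combine anomaly evidence with event correlation into the fault-proportion values mentioned above is a simple weighted-and-normalized score. This is a toy scoring rule under stated assumptions (the 1.5x boost for a preceding change event and all input values are invented for illustration), not the article's actual algorithm:

```python
from typing import Dict, List, Tuple

def rank_root_causes(anomaly_scores: Dict[str, float],
                     recent_events: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Boost each candidate's anomaly score when a change event (release,
    config change) preceded the fault, then normalize all scores into
    fault-proportion values and rank them."""
    weighted = {}
    for obj, score in anomaly_scores.items():
        boost = 1.5 if recent_events.get(obj) else 1.0
        weighted[obj] = score * boost
    total = sum(weighted.values()) or 1.0
    ranked = [(obj, round(s / total, 2)) for obj, s in weighted.items()]
    return sorted(ranked, key=lambda p: p[1], reverse=True)

ranking = rank_root_causes(
    {"srv-17": 0.8, "redis": 0.4, "srv-23": 0.2},
    {"srv-17": ["release deployed shortly before the fault"]},
)
```

A real system would learn the weights from labeled incidents rather than hard-coding a boost, but the structure, evidence in, ranked proportions out, is the same.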

5. Data Storage

On the monitoring platform, data falls mainly into two types: time-series data and event data. Time-series data includes CPU, memory, disk, network, and traffic metrics. We need to choose a storage platform suited to each data type’s characteristics, finally forming the monitoring platform’s hybrid storage architecture.

Event data on the monitoring platform includes alarm events, fault self-healing events, and log data. Note that log data is also a kind of event data: for a program, emitting a log line always marks a significant event, such as catching an exception or hitting an important logical branch. So we treat log data as event data.

The whole event data storage platform is mainly divided into data access layer and storage analysis layer. Its general architecture is as follows:

The data access layer provides unified traffic access, interface authentication, traffic statistics, and traffic switchover functions.

The storage analysis layer is mainly responsible for storing platform event data, full-text retrieval, and data aggregation and analysis. It therefore needs the following key features:

(1) Support high-performance storage of massive event data;

(2) Support fuzzy search based on event description;

(3) Horizontal expansion of data storage nodes can be conveniently realized;

(4) Support redundant copy storage of data nodes to achieve the purpose of high data availability.

Given the importance of event data storage, the design uses two Elasticsearch clusters in a mutual active-standby configuration to maximize the availability of the event storage platform. If possible, deploy the two clusters in two separate equipment rooms to avoid the platform becoming unavailable when a single room fails. The data access layer double-writes data to the storage analysis layer, while queries and searches go to the primary ES cluster.
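The double-write/read-primary contract of the access layer can be sketched as follows. The `EventStore` here is a stand-in for one ES cluster (a real implementation would use an Elasticsearch client and handle partial write failures); class and method names are assumptions.

```python
from typing import Dict, List

class EventStore:
    """Stand-in for one Elasticsearch cluster."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.docs: List[Dict] = []

    def index(self, doc: Dict) -> None:
        self.docs.append(doc)

    def search(self, term: str) -> List[Dict]:
        return [d for d in self.docs if term in d.get("message", "")]

class AccessLayer:
    """Double-writes every event to both clusters; reads from the primary."""
    def __init__(self, primary: EventStore, standby: EventStore) -> None:
        self.primary, self.standby = primary, standby

    def write(self, doc: Dict) -> None:
        self.primary.index(doc)
        self.standby.index(doc)

    def query(self, term: str) -> List[Dict]:
        return self.primary.search(term)
```

Double-writing at the access layer (rather than replicating between clusters) keeps the two clusters independent, so a failure in one cannot propagate to the other.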

The platform can assess the availability of the primary cluster from indicators such as average response time and error counts. If an anomaly persists across several decision periods, traffic is switched to the standby cluster, achieving high availability for the platform.
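The multi-decision-period rule above can be expressed as a small state machine: only after the primary has been unhealthy for several consecutive periods does traffic switch to the standby. The thresholds and period count below are illustrative assumptions, not values from the article:

```python
class FailoverMonitor:
    """Switch traffic to the standby only after the primary has been
    unhealthy for `periods` consecutive decision periods."""

    def __init__(self, periods: int = 3,
                 max_avg_ms: float = 500.0,
                 max_error_rate: float = 0.05) -> None:
        self.periods = periods
        self.max_avg_ms = max_avg_ms
        self.max_error_rate = max_error_rate
        self.bad_streak = 0
        self.active = "primary"

    def observe(self, avg_ms: float, error_rate: float) -> str:
        """Feed in one decision period's indicators; return the active cluster."""
        unhealthy = avg_ms > self.max_avg_ms or error_rate > self.max_error_rate
        self.bad_streak = self.bad_streak + 1 if unhealthy else 0
        if self.bad_streak >= self.periods:
            self.active = "standby"
        return self.active
```

Requiring a streak of bad periods, rather than reacting to a single bad sample, avoids flapping between clusters on transient spikes.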

6. Summary

As the first part of a series on how to design a monitoring platform, this article has sketched what building such a platform involves, covering data collection, the CMDB, fault location, and data storage. A complete monitoring platform goes beyond this: anomaly detection, fault self-recovery, alarm notification, the monitoring brain (central decision center), and more will be described in the next article.

With the rise of AI technology, we have the opportunity to apply artificial intelligence to the monitoring platform, opening up more application scenarios to imagine. The monitoring platform is gradually evolving from automated monitoring and O&M to AI-driven intelligent monitoring and O&M, in which fault perception, analysis and decision-making, and task scheduling and execution are ultimately completed by the machine itself, achieving unattended, efficient maintenance of the online environment.