1. Background

In a monitoring system, detection and alarming are inseparable: detection finds exceptions, and alarming delivers the problem information to the right people. In the 1.0 era of vivo's monitoring system, each monitoring system maintained its own logic for computing, storage, detection, and alarm convergence. This architecture made it very hard to fuse the underlying data and prevented the monitoring system from serving a wider range of scenarios. Overall planning and an adjustment of the whole monitoring architecture were therefore necessary, and against this background the goal of unified monitoring was established.

In the past, monitoring was split into basic monitoring, general monitoring, call chain, log monitoring, dial-test monitoring, and several other systems. The goal of unified monitoring is unified computing, unified storage, unified detection, unified alarming, and unified display of all monitoring metric data. We will not elaborate here; a later article on the evolution of vivo's monitoring system will explain this further.

That is the big picture of unified monitoring. Previously, each monitoring system performed its own alarm convergence, message assembly, and similar work before sending alarms. To eliminate this redundancy, such work should be handled by one unified service; at the same time, the old alarm center platform was due for an update. We therefore needed to build a unified alarm platform that provides alarm convergence, message assembly, and alarm sending for every business. With this idea, we planned to strip the alarm convergence and alarm sending capabilities out of the individual systems and make the unified alarm service a general-purpose service decoupled from all monitoring services.

2. Status analysis

In the 1.0-era monitoring systems, as shown in Figure 1, each monitoring system first converged its own alarms and then connected to the old alarm center to send alarm messages. Each system maintained its own set of rules, and many functions were duplicated. In fact, these functions are highly generic: it is entirely possible to build a reasonable model that uniformly processes the anomalies produced by the detection services to generate problems, then performs unified message assembly, and finally sends the alarm messages.

(Figure 1 Alarm flow chart of the old monitoring system)

In the monitoring system, there are several important concepts from the detection of an exception to the final alarm:

Exception

Within a detection window (the window size can be customized), an exception is generated when one or more metric values reach the exception threshold defined by the detection rule. As shown in Figure 2, the detection rule states that if at least 3 of the 6 points in a detection window exceed the threshold, an exception is raised; we call this a "3 out of 6" rule. In the first detection window (in the blue dotted box), only two points (6 and 7) exceed the threshold (95), which does not satisfy the 3-out-of-6 condition, so the first window produces no exception. In the second detection window (in the green dotted box), the metric values at points 6, 7, and 8 all exceed the threshold (95), so the second window is an exception.
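To make the rule concrete, here is a minimal sketch of such an "m out of n" window check; the sample values are made up and the real detection service is of course more elaborate.

```python
# A sliding-window "m out of n" rule: the window is abnormal when at least
# `required` of its `window` most recent points exceed `threshold`.
def window_is_abnormal(points, threshold=95.0, window=6, required=3):
    recent = points[-window:]
    breaches = sum(1 for value in recent if value > threshold)
    return breaches >= required

# Made-up series: only two of the last six points exceed 95 -> no exception.
print(window_is_abnormal([80, 82, 90, 93, 96, 97, 94, 88]))  # False
# Three of the last six points exceed 95 -> the window is an exception.
print(window_is_abnormal([82, 90, 93, 88, 96, 97, 98]))      # True
```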

Problem

All exceptions of the same kind that occur within a continuous period form a problem. As shown in Figure 2, the second detection window is an exception, and this exception belongs to problem A. If the third detection window is also an exception, it belongs to problem A as well, so the relationship between a problem and its exceptions is one-to-many.

Alarm

When a problem is reported to users by SMS, phone, or email through the alarm system, it is called an alarm.

Recovery

When the exceptions corresponding to a problem no longer meet the exception conditions defined in the detection rule, all of those exceptions are considered recovered, the problem itself is considered recovered, and a recovery notification is sent.

(Figure 2 Timing data anomaly detection schematic diagram)

3. Measurement indicators

How do we measure a system, improve it, and manage it? “If you can’t measure it, you can’t manage it,” said management guru Peter Drucker. To comprehensively manage and improve a system, we need metrics for its performance, we need to know where its weak points are, and we need to identify the symptoms before we can apply the right remedy.

(Figure 3 Time node relationship diagram of o&M indicators)

Figure 3 shows the relationship between the monitoring system's operational metrics and the corresponding points in time, mainly the correspondence between MTTD, MTTA, MTTF, MTTR, MTBF and those time points. These metrics are highly valuable for improving system performance and helping the O&M team find problems early, and many cloud alarm platforms in the industry also track them. Here we focus on MTTA and MTTR, the two most closely related to an alarm platform:

MTTA (Mean Time To Acknowledge, average response time):

(Figure 4 Calculation formula of MTTA)

  • T[i] — the response time of the O&M team or R&D personnel after problems occur in the i-th service during system operation;

  • R[i] — the total number of problems that occur in the i-th service during system operation.
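Figure 4 itself is not reproduced here; based on the variable definitions above, the formula presumably takes the form

\[
\mathrm{MTTA} = \frac{\sum_{i=1}^{n} T_i}{\sum_{i=1}^{n} R_i}
\]

where n is the number of monitored services.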

Mean time to acknowledge is the average time it takes the O&M or R&D team to respond to all problems. MTTA measures both the responsiveness of the O&M or R&D teams and the efficiency of the alarm system. By tracking and minimizing MTTA, project management teams can optimize processes, improve problem-solving efficiency, guarantee service availability, and improve user satisfaction [1].

MTTR (Mean Time To Repair):

(Figure 5 Calculation formula of MTTR [2])

  • T[ri] — the total time taken for the i-th service to recover after its alarms during system operation;

  • R[i] — the total number of alarms generated by the i-th service during system operation.
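Figure 5 itself is not reproduced here either; under the same reading of the variables, the formula presumably takes the form

\[
\mathrm{MTTR} = \frac{\sum_{i=1}^{n} T_{ri}}{\sum_{i=1}^{n} R_i}
\]

with n again the number of monitored services.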

Mean Time To Repair (MTTR) is the average time it takes to repair a system and restore it to normal operation. The MTTR clock starts when O&M or R&D personnel begin handling the exception and runs until the interrupted service is fully restored, including any required testing time. In the IT service management industry, the R in MTTR does not always stand for repair; it can also mean recovery, response, or resolution. Although all of these metrics are abbreviated MTTR, each has its own meaning, so knowing which MTTR is being used helps us analyze and understand the problem better. Let's take a quick look at what they mean:

1) Mean Time to Recovery: the average time required for a system alarm to recover, covering the entire process from the alarm caused by a service exception to its recovery. This MTTR measures the speed of the overall recovery process.

2) Mean Time to Respond: the average time between the occurrence of the first alarm and the recovery of the system, excluding any delays in the alarm system itself. This MTTR is commonly used in network security to measure a team's effectiveness in mitigating attacks on a system.

3) Mean Time to Resolve: the average time to completely resolve a system fault, including the time required to detect the fault, diagnose the problem, and ensure it does not recur. This MTTR metric is mainly used to measure the resolution of unforeseen incidents, not service requests.

The core of improving MTTA is to reach the right people, and to reach them quickly [3]; only when the people who can handle a problem are found in the shortest time can MTTR be effectively improved. In production practice we often run into "alarm storms": a flood of alarms lands on the O&M staff or developers who have to resolve them. For people sensitive to this pressure, a "cry wolf" effect easily appears, where every incoming alarm causes anxiety; and when a large volume of alarm messages constantly harasses the O&M staff, alarm fatigue sets in. It shows up as too many unimportant events, too few that point to root causes, routine events being handled over and over, and important information drowning in the noise. [4]

(Figure 6 alarm flooding problem [5])

4. Functional design

Based on the analysis of these two key metrics, we concluded that we need to reduce the number of alarms sent, improve alarm accuracy, raise the efficiency of problem solving, and shorten problem recovery time, focusing on alarm volume, alarm convergence, and alarm escalation. The following sections describe, at the system and feature level, how to reduce the number of alarms and send only valuable alarms to users, with particular attention to alarm message convergence.

As Figure 1 shows, the monitoring systems contain many duplicated functional modules, so these modules can be extracted. As shown in Figure 7, capabilities such as alarm convergence, alarm masking, and alarm escalation are unified in the unified alarm service. In this architecture the unified alarm service is completely decoupled from the detection-related services, which makes its capabilities generic. For example, if another team with alarm or message convergence requirements wants to access the unified alarm service, the service must support both message convergence and direct message sending. The unified alarm service therefore provides flexible, configurable sending modes along with simple yet diverse functions to satisfy a variety of requirements.

(Figure 7 Unified alarm system structure)

4.1 Alarm Convergence

The alarm platform generates tens of thousands of alarms every day, and O&M staff and developers need to analyze, prioritize, and troubleshoot them. If these tens of thousands of alarms are not converged and an alarm is sent for every exception, the workload pressure on O&M staff increases; and of course not every alarm needs to be sent out for handling. Various means are therefore needed to converge alarms. The following describes alarm convergence from four aspects.

First-alarm waiting

When an exception occurs, we do not send an alarm immediately; instead we wait for a period of time before sending. This period can generally be customized: if it is too long, alarms are delayed; if it is too short, the merging effect is poor. For example, suppose the first-alarm waiting time is 5 seconds. If metric A on node 1 becomes abnormal and metric A on node 2 becomes abnormal within those 5 seconds, then node 1 and node 2 are combined into a single alarm notification when the alarm is sent.
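A minimal sketch of this buffering step is shown below; the 5-second wait, the problem key, and the exception fields are assumptions for illustration, not the actual implementation.

```python
import time
from collections import defaultdict

FIRST_ALARM_WAIT = 5  # seconds; configurable per rule in practice

class FirstAlarmBuffer:
    """Collect exceptions belonging to the same problem during the wait."""
    def __init__(self):
        self.buffers = defaultdict(list)   # problem_key -> buffered exceptions
        self.deadlines = {}                # problem_key -> flush deadline

    def add(self, problem_key, exception):
        # The first exception of a problem starts the waiting window.
        if problem_key not in self.deadlines:
            self.deadlines[problem_key] = time.time() + FIRST_ALARM_WAIT
        self.buffers[problem_key].append(exception)

    def flush_due(self):
        """Return one combined notification per expired waiting window."""
        now = time.time()
        merged = []
        for key in [k for k, t in self.deadlines.items() if t <= now]:
            nodes = sorted({e["node"] for e in self.buffers.pop(key)})
            merged.append(f"metric A abnormal on nodes: {', '.join(nodes)}")
            del self.deadlines[key]
        return merged
```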

Alarm interval

Until the fault is resolved, the system re-sends the alarm message periodically according to the configured alarm interval, which controls how frequently alarms are sent.
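The interval check itself is simple; a hedged sketch, assuming the interval and the last send time are tracked per problem:

```python
import time

def should_send_again(last_sent_at, interval_seconds, recovered):
    """Re-send only while the problem is still open and the configured
    alarm interval has elapsed since the previous notification."""
    if recovered:
        return False
    return (time.time() - last_sent_at) >= interval_seconds
```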

Exception convergence dimension

The exception convergence dimension is used to merge exceptions that share the same dimension values. For example, for path A on the same node, exceptions generated by the same detection rule are merged according to the configured convergence dimensions when the alarm is sent.

Message merge dimension

When multiple exceptions converge into one problem, message merging takes place when the alarm is sent. The message merge dimension specifies which dimensions can be merged. If this sounds a little abstract, look at the transformation from exceptions to a message in Figure 8.

Suppose an exception has two dimensions, name and sex. When two such exceptions reach the unified alarm service, they are merged according to the configured convergence strategy. In the figure, sex is defined as the exception convergence dimension; a convergence dimension is normally chosen so that the two or more exceptions being merged share the same value for that attribute. After the merge, only that shared value is used, so in the example the {sex} placeholder is replaced with "male". Name is defined as the message merge dimension, which means the names of all merged exceptions must appear in the message text, so during message merging the values corresponding to the {name} placeholder are concatenated one by one into the message text.

(Figure 8 message text replacement diagram)
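Sketched below is the placeholder replacement illustrated in Figure 8; the template syntax and field names are assumptions for the example. A convergence dimension must have a single shared value across the merged exceptions, so its placeholder is replaced with that value, while a merge dimension concatenates the values of every merged exception.

```python
def render_message(template, exceptions, converge_dims, merge_dims):
    """Replace {placeholders}: convergence dims take the shared value,
    merge dims join the values of every merged exception."""
    values = {}
    for dim in converge_dims:
        shared = {e[dim] for e in exceptions}
        assert len(shared) == 1, f"exceptions differ on convergence dim {dim!r}"
        values[dim] = shared.pop()
    for dim in merge_dims:
        values[dim] = ", ".join(e[dim] for e in exceptions)
    return template.format(**values)

exceptions = [{"name": "Tom", "sex": "male"}, {"name": "Jerry", "sex": "male"}]
print(render_message("{sex} users affected: {name}", exceptions,
                     converge_dims=["sex"], merge_dims=["name"]))
# -> "male users affected: Tom, Jerry"
```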

4.2 Alarm Claim

After an alarm is claimed, subsequent alarms for that problem are sent only to the person who claimed it. Alarm claiming mainly avoids sending an alarm to other people once someone is already following up on it, and to some extent prevents the same alarm from being handled repeatedly. A claimed alarm can also be unclaimed.

4.3 Alarm Masking

Alarm masking can be set for a given problem: subsequent alarms for that problem are not sent out. Masking reduces the alarms generated while a fault is being located and fixed or while a service version is being changed, effectively reducing the disturbance that invalid alarms cause to O&M staff. Masking can be periodic or limited to a specific time range, and it can be cancelled.

4.4 Alarm Callback

If an alarm rule is configured with a callback, the callback interface is invoked when an alarm is generated so that the service can be restored automatically. When a service alarm fires, the configured automation can bring the service back to its normal state, shortening the fault recovery time and restoring the service as quickly as possible in the first moments of the incident.
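As an illustration only (the callback URL, payload fields, and HTTP convention below are hypothetical, not the platform's actual interface), invoking a user-configured callback when an alarm fires might look like this:

```python
import requests

def trigger_callback(alarm, callback_url, timeout=5):
    """POST the alarm context to the user-configured callback so that an
    automated recovery action (restart, traffic switch, ...) can run."""
    payload = {
        "problemId": alarm["problem_id"],
        "metric": alarm["metric"],
        "level": alarm["level"],
    }
    try:
        resp = requests.post(callback_url, json=payload, timeout=timeout)
        resp.raise_for_status()
        return True
    except requests.RequestException:
        # A failed callback must not block normal alarm sending.
        return False
```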

4.5 False Alarm Labeling

For a fault, users can annotate whether it was a false alarm. The purpose of false-alarm labeling is to let the system developers know which parts of anomaly detection need to be improved and optimized, so as to raise alarm accuracy and provide users with genuinely useful alarms.

4.6 Alarm Escalation

If an alarm persists for a certain period of time, the system automatically escalates it according to the configuration and sends the escalation information to the corresponding personnel. The purpose of alarm escalation is to shorten MTTA: if an alarm has not recovered for a long time, the fault has not been responded to in time, and higher-level personnel need to step in.
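A hedged sketch of such an escalation rule, assuming a per-level delay and recipient list are configured (the levels and durations below are illustrative):

```python
import time

# Hypothetical escalation policy: more senior recipients are added the
# longer a problem stays unrecovered.
ESCALATION_LEVELS = [
    (0,    ["on-call engineer"]),
    (1800, ["team leader"]),          # after 30 minutes without recovery
    (3600, ["department manager"]),   # after 60 minutes without recovery
]

def current_recipients(problem_started_at, recovered):
    """Return everyone who should receive the alarm at this moment."""
    if recovered:
        return []
    elapsed = time.time() - problem_started_at
    recipients = []
    for after_seconds, people in ESCALATION_LEVELS:
        if elapsed >= after_seconds:
            recipients.extend(people)
    return recipients
```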

As shown in Figure 9, the alarm system sends a large number of alarms every day, distributed of course across the alarm recipients of different services. More alarms is not better; alarms should accurately reflect service exceptions. It is therefore important to raise the proportion of effective alarms, improve alarm accuracy, and reduce alarm volume. The system design and functional design described above can effectively reduce repeated alarm sending.

(Figure 9 Number of host monitoring alarms)

5. Architecture design

Above, we explained at the system and feature level how to solve the various problems of the old architecture. Next, we need an architecture that turns this idea into reality.

Let's look at how to design this architecture. As the last link in the whole monitoring pipeline, the unified alarm service must support both alarm sending and ordinary service notification sending, so it must be generic. It should be decoupled from other services, especially the existing monitoring systems, so that its common capabilities can be released. It also needs to adapt to different business logic in different scenarios; for example, some services need alarm convergence while others do not, so it must provide flexible access modes that satisfy these requirements.

As shown in Figure 10, the core logic of unified alarming is implemented by the convergence service. The convergence service can consume anomalies from Kafka or receive pushed exceptions through a RESTful interface. The exceptions are first processed into problems and stored in a MySQL database. After passing through the alarm convergence module, a problem is pushed into a Redis delay queue, which controls when the message is taken off the queue. Once a message is dequeued, text assembly and other operations are performed, and the message is finally sent out through the configured channels.

(Figure 10 Unified alarm architecture)

The configuration management service is used to manage configuration information such as applications, events, and alarms. The metadata synchronization service is used to synchronize metadata required by alarm convergence from other services.

6. Core implementation

The core of unified alarming is alarm convergence, which reduces the number of repeated alarm messages and prevents alarm recipients from being numbed by a flood of alarms.

As mentioned above, a delay queue is used for alarm convergence. Delay queues are widely used in e-commerce and payment systems; for example, an order is automatically cancelled if it is not paid within 10 minutes of being placed. The purpose of using a delay queue in the alarm system is to merge, within a certain time window, as many exceptions belonging to the same problem as possible, thereby reducing the number of alarms sent. For example, if service A has three nodes and an exception occurs on each node, an alarm would normally be generated for each of them; after alarm convergence, the alarms of the three nodes can be combined into a single notification.

There are many ways to implement a delay queue. Here we chose Redis, mainly because its sorted sets provide high-performance ordering by score, and Redis persistence guarantees that messages are stored and consumed reliably.

As shown in Figure 11, after a problem passes a series of validation and deduplication steps, it enters the Redis delay queue, where the problem with the earliest due time sits at the front. At the same time a listener task continuously checks whether the queue contains tasks that are due; if so, they are taken out, go through message assembly and other operations to form the final message text, and are then sent through different channels according to the configuration.

(Figure 11 Schematic diagram of delayed task execution [6])
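A minimal sketch of this listener loop with redis-py, using a sorted set whose score is the due timestamp; the key name, batch size, and send hook are assumptions for the example:

```python
import json
import time
import redis

r = redis.Redis()                        # assumes a reachable Redis instance
QUEUE_KEY = "unified-alarm:delay-queue"  # hypothetical key name

def enqueue(problem, delay_seconds):
    """Schedule a problem: the sorted-set score is its due timestamp."""
    r.zadd(QUEUE_KEY, {json.dumps(problem): time.time() + delay_seconds})

def poll_forever(send):
    """Take out problems whose due time has passed and hand them to `send`."""
    while True:
        due = r.zrangebyscore(QUEUE_KEY, 0, time.time(), start=0, num=10)
        for member in due:
            # zrem returns 1 only for the worker that actually removed the
            # member, so each problem is assembled and sent exactly once.
            if r.zrem(QUEUE_KEY, member):
                send(json.loads(member))
        if not due:
            time.sleep(1)
```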

7. Future prospects

Given the positioning of the unified alarm service, it must tell O&M staff or developers simply, efficiently, and accurately where a fault needs to be fixed. For the follow-up construction of the service, we should therefore consider how to further reduce manual configuration, strengthen intelligent alarm convergence, and strengthen root cause location. These problems can be solved well with the support of AI. At present, major vendors are pushing toward AIOps, and some products have already been put into use, but large-scale adoption of AIOps will still take some time. Compared with applying AI, the more urgent task is to connect the upstream and downstream services to the unified alarm service, open up the data, pave the way for data flow, increase the degree of automation, and support alarm sending from a higher dimension, so as to provide more accurate information for fault discovery and resolution.

8. Reference materials

[1] What are MTTR, MTBF, MTTF, and MTTA? A guide to incident management metrics

[2] Average repair time

[3] Four key indicators that can't be missed in operation and maintenance!

[4] PIGOSS TOC Smart Service Center makes alarm management smarter

[5] Large-scale intelligent alarm convergence and alarm root cause technology practice

[6] Did you know that Redis can implement delay queuing?

Author: Vivo Internet Server Team -Chen Ningning