
Status of IT O&M alarms

In today's IT operations field, the main way to ensure normal service operation is to monitor related operational indicators in real time and to set rules based on experience. Real-time monitoring data is compared against these rules, and when an indicator's monitored value violates a rule, the situation is judged to be abnormal and the monitoring system sends a corresponding alarm to the alarm platform. After receiving the notification, the alarm platform assigns the alarm to the corresponding O&M personnel, who locate the root cause of the fault and rectify it. As this process shows, everything revolves around the alarm, so alarm quality is crucial.

In actual operation and maintenance, however, this process has many problems. First, the rules of the monitoring system are difficult to set. Because rules are based on expert experience, as the system grows in scale and complexity and monitoring coverage improves, the number of monitoring indicators increases exponentially and their shapes vary widely; rule setting based on expert experience can no longer keep up, and the false alarm rate and missed alarm rate remain high. O&M personnel may face alarm storms and be bombarded with thousands of alarms every day. After a fault occurs, searching through alarms one by one for the root cause is inefficient and greatly increases fault recovery time. Failures remain hard to foresee, and some that could have been avoided still occur.


The concept of intelligent alarm and related technologies

To solve the above problems, the concept of intelligent alarm is introduced in the field of intelligent operation and maintenance. Intelligent alarm mainly solves four problems: 1. Accurate alarming and suppression of alarm storms; 2. Fast fault location; 3. Fault prediction, so that faults can be avoided before they occur; 4. Automated rule setting, so that rules no longer have to be set by manual experience.

The core idea of intelligent anomaly detection is to use machine learning algorithms to automatically learn rules from historical data, thereby automating rule setting. This removes the need to manually set a large number of rules, and in most cases the automatically learned rules are more accurate and reasonable, greatly improving the quality of generated alarms.

Intelligent anomaly detection technologies include indicator anomaly detection, log anomaly detection, root cause analysis, and fault prediction. Indicator anomaly detection is generally divided into single-indicator and multi-indicator anomaly detection. Single-indicator anomaly detection, that is, time series anomaly detection, can be divided into three types: statistics-based algorithms, unsupervised learning algorithms, and supervised classification algorithms. Statistics-based algorithms are very simple and easy to implement, but can only handle simple scenarios. Commonly used unsupervised learning algorithms include Isolation Forest (IForest), Local Outlier Factor (LOF), One-class SVM, and autoencoders; these algorithms do not need labeled data and are fairly accurate, but feature selection is difficult. Commonly used supervised classification algorithms include XGBoost, GBDT, decision trees, and support vector machines; these algorithms are very accurate, but labeled data is difficult to obtain. Multi-indicator anomaly detection judges anomalies by combining multiple indicators: the data is first reduced in dimension, and then the same two families of detection algorithms, supervised classification and unsupervised learning, are applied.
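As a minimal illustration of the unsupervised route, the sketch below applies scikit-learn's Isolation Forest to a single metric using a simple sliding-window feature; the window size and contamination rate are illustrative assumptions, not values from this article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_anomalies(series, window=10, contamination=0.01):
    """Flag anomalous points in a 1-D metric with Isolation Forest.

    Each point is represented by the window of values ending at it,
    so the model can react to sudden level shifts as well as spikes.
    """
    X = np.array([series[i - window:i] for i in range(window, len(series))])
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(X)          # -1 = anomaly, 1 = normal
    # Re-align labels with the original series (first `window` points unscored).
    return [i + window for i, label in enumerate(labels) if label == -1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    metric = rng.normal(100, 2, 1000)
    metric[700] = 160                      # inject an obvious spike
    print(detect_anomalies(metric))
```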

Logs are semi-structured data that carry rich information about the system. With log anomaly detection, anomalies can be identified in real-time system logs, which facilitates fault discovery and location. Log detection methods include natural language processing and log pattern recognition. The natural language processing approach uses text vectorization plus deep learning to understand the textual information in logs and identify potential anomalies, and continuously improves and optimizes the model with manual feedback and annotation. Log pattern recognition extracts the patterns of normal logs, flags logs that deviate from those patterns, and continuously self-learns and adjusts the pattern matching results in combination with feature engineering.
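As a rough sketch of the pattern-recognition idea (not the exact pipeline described here), the snippet below masks variable fields such as numbers and hex IDs to turn each log line into a template, then flags lines whose template is rarely seen:

```python
import re
from collections import Counter

# Masks for variable fields, applied in order so composite IDs are caught first.
MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b[0-9a-fA-F-]{32,}\b"), "<ID>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<NUM>"),
]

def to_template(line):
    """Replace variable fields so lines with the same structure collapse together."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.strip()

def rare_lines(lines, min_count=3):
    """Return log lines whose template occurs fewer than `min_count` times."""
    counts = Counter(to_template(line) for line in lines)
    return [line for line in lines if counts[to_template(line)] < min_count]
```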

The core idea of root cause analysis is to use data mining algorithms to mine the influence relationships between indicators. When related indicators raise alarms at the same time, the alarm on the influencing side is treated as the root-cause alarm and the alarms on the affected side as derivative alarms. O&M personnel then receive only the root-cause alarm and can determine the fault from it alone, which greatly improves root cause locating efficiency. Root cause analysis thus greatly shortens the time needed to locate the root cause and recover from the fault, reducing the loss the fault causes.

Root cause analysis can analyze the influence relationships among indicators. Correlation analysis, using Pearson correlation, Spearman correlation, Kendall correlation, J-measure, and two-sample test algorithms, judges whether multiple indicators often fluctuate or rise together. FP-growth and Apriori are used to associate events that frequently occur together in history and perform frequent itemset analysis. Drill-down analysis, combined with correlation matching, is used to find the root cause.
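As a small example of the correlation step, the sketch below computes Pearson, Spearman, and Kendall coefficients between each pair of metrics with SciPy and keeps pairs that exceed a threshold; the 0.8 cutoff is an assumption, not a value from this article.

```python
from itertools import combinations
from scipy.stats import pearsonr, spearmanr, kendalltau

def correlated_pairs(metrics, threshold=0.8):
    """Return metric pairs whose correlation exceeds `threshold` under any of
    the three measures. `metrics` maps a metric name to its value series."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(metrics.items(), 2):
        scores = {
            "pearson": pearsonr(a, b)[0],
            "spearman": spearmanr(a, b)[0],
            "kendall": kendalltau(a, b)[0],
        }
        if any(abs(score) >= threshold for score in scores.values()):
            pairs.append((name_a, name_b, scores))
    return pairs
```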

The core idea of fault prediction is to use algorithms to learn how an indicator has changed historically, predict its future trend from those patterns, and raise an alarm in advance when a fault is likely to occur. By predicting future failures, O&M personnel can intervene early, avoid the failure, and reduce unnecessary losses.

There are many methods for fault prediction, but they fall mainly into three categories: traditional statistical methods, machine learning algorithms, and deep learning algorithms. Traditional statistical methods include ARIMA, Holt-Winters, Prophet, etc., and are suitable for predicting indicators that are stationary or stationary after differencing. Machine learning algorithms, such as XGBoost and GBDT, use feature engineering to construct input features for the scenario and are suitable for predicting multi-variable indicators. Deep learning algorithms, such as RNN recurrent neural networks and LSTM long short-term memory networks, slide a window over historical data to turn forecasting into a supervised learning problem and let the neural network extract features. These algorithms perform very well on most indicators and are highly accurate, but because of the large amount of computation they demand more resources and run more slowly.
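As an example of the statistical route, a Holt-Winters forecast of an hourly metric with a daily cycle might look like the sketch below; the additive components and the seasonal period of 24 are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Toy hourly metric with a daily cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                       # two weeks of history
series = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

# Additive trend and seasonality, with a 24-hour seasonal period.
model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=24).fit()
forecast = model.forecast(24)                    # next day's expected values
print(forecast[:5])
```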


Intelligent Alarm Practice

Based on the intelligent alarm concept and methods above, and drawing on our past practice, we introduce practical schemes for single-indicator anomaly detection, root cause analysis, and fault prediction.

This picture shows the overall framework of single-indicator anomaly detection, which consists of an offline module and an online module. The offline module studies and analyzes historical data. In a large number of experiments, we found it difficult to find a universal anomaly detection algorithm that performs well on all indicators; an algorithm usually performs well only on specific kinds of indicators, so we first classify the indicators. After classification, we can determine which algorithm to use for each indicator, then train the algorithm's model parameters and evaluate the model using data from the sample database as test data. The offline module runs as a scheduled task that periodically reclassifies and retrains each indicator to keep the model timely and accurate.

The online module uses the model trained by the offline module to detect real-time monitoring data and judge the health status of indicators. The main step is to feed real-time monitoring data into the model to determine whether an indicator is abnormal; if so, a corresponding alarm is generated. In addition, a manual review process is provided for the anomalies the model outputs, and the review results are saved back to the sample database.

After analyzing a large number of indicators in detail, we found that almost all indicators fall into three categories. The first is the periodic type, which has an obvious fluctuation cycle. The second is the trend type, which changes gently, rising or falling slowly, such as disk usage. The third is the stationary type, which tends to fluctuate within a range, such as service response time. Data decomposition likewise breaks data down into these three components.

Inspired by this, we first decompose the time series data, splitting an indicator into a trend component, a periodic component, and a stationary component. We then compute the Pearson correlation coefficient between each of these three components and the original curve, and the component with the largest coefficient determines the indicator's category.
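A minimal sketch of this classification step, assuming an STL decomposition from statsmodels stands in for the decomposition used in practice (the residual plays the role of the stationary component, and the period of 1440 assumes one day of minute-level data):

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.tsa.seasonal import STL

def classify_metric(series, period=1440):
    """Classify a metric as periodic, trend, or stationary.

    Decompose the series, then pick the component whose Pearson correlation
    with the original curve is largest.
    """
    result = STL(np.asarray(series, dtype=float), period=period).fit()
    components = {
        "periodic": result.seasonal,
        "trend": result.trend,
        "stationary": result.resid,
    }
    scores = {name: abs(pearsonr(series, comp)[0])
              for name, comp in components.items()}
    return max(scores, key=scores.get)
```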

After classifying the indicators, we can decide which algorithm to use for detection. To ensure reliability, we provide two algorithms for each type of indicator and make the decision by voting. For periodic indicators, we provide a year-on-year comparison algorithm and a prediction-based anomaly detection algorithm. For trend indicators, we use a period-over-period comparison algorithm and the isolation forest algorithm. For stationary indicators, we use the 3-sigma algorithm and the quartile algorithm; a voting sketch for this case follows below. The algorithm parameters are trained on the indicator's historical data from the past 7 days, and the model parameters are selected by grid search.
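For the stationary case, the voting idea might look like the sketch below, where a point is flagged only if both the 3-sigma rule and the quartile (IQR) rule consider it abnormal; treating agreement of both detectors as the vote is our assumption about how the votes are combined.

```python
import numpy as np

def three_sigma_outlier(history, value):
    """3-sigma rule: abnormal if the value is more than 3 std devs from the mean."""
    mean, std = np.mean(history), np.std(history)
    return abs(value - mean) > 3 * std

def iqr_outlier(history, value, k=1.5):
    """Quartile rule: abnormal if the value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(history, [25, 75])
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

def is_abnormal(history, value):
    """Vote: raise an alarm only when both detectors agree the point is abnormal."""
    return three_sigma_outlier(history, value) and iqr_outlier(history, value)
```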

This diagram shows the main flow of root cause analysis. For relationship analysis, we provide three methods: strong association analysis, frequent item mining, and correlation analysis. These three methods yield the association relationships between indicators. Once the associations between indicators have been established, alarms can be analyzed for their root cause. First, the alarms from each monitoring system are brought into our system and processed in a unified format. Invalid alarms are filtered out by alarm compression, and the remaining valid alarms are analyzed to generate a root cause analysis report, from which O&M personnel locate the fault.
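As an illustrative sketch (using a hypothetical influence graph rather than the production rules), concurrent alarms can be split into root-cause and derivative alarms by checking whether an alarming indicator's upstream influencer is also alarming:

```python
def split_alarms(active_alarms, influences):
    """Split concurrent alarms into root-cause and derivative alarms.

    `active_alarms` is a set of indicator names currently alarming.
    `influences` maps an indicator to the set of indicators it affects,
    as mined offline from association analysis.
    """
    derived = set()
    for cause, affected in influences.items():
        if cause in active_alarms:
            # Anything the alarming cause affects is treated as derivative.
            derived |= affected & active_alarms
    roots = active_alarms - derived
    return roots, derived

# Example: a database alarm drives alarms on two services that depend on it.
graph = {"db_latency": {"svc_a_errors", "svc_b_errors"}}
print(split_alarms({"db_latency", "svc_a_errors", "svc_b_errors"}, graph))
# -> roots: {'db_latency'}, derivative: {'svc_a_errors', 'svc_b_errors'}
```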

In the offline part, the data source is the indicator's historical data and the output is the model. In the online part, prediction data is computed from the model and serves as the data basis for fault prediction. Because LSTM (long short-term memory) networks are good at capturing possible dependencies across the context of a time series and perform very well in most cases, we chose LSTM as the prediction model.
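A minimal sketch of this setup, assuming TensorFlow/Keras and a sliding window that turns the series into supervised training pairs (the window size and layer sizes are illustrative):

```python
import numpy as np
import tensorflow as tf

def make_windows(series, window=60):
    """Slide a window over the series: X is the last `window` points, y the next one."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X[..., np.newaxis], y          # add a feature dimension for the LSTM

def build_model(window=60):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# history = ...  # one metric's historical values
# X, y = make_windows(history)
# model = build_model()
# model.fit(X, y, epochs=10, batch_size=64)
# next_value = model.predict(X[-1:])      # forecast the next point
```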

This is a fault prediction rendering of the number of visits to a web system, with the blue line representing actual values and the yellow line the predicted values. The red dot marks the point in time at which we predict web traffic will reach the limit the system can handle. A notice can therefore be sent in advance, so that O&M personnel can scale up or expand resources ahead of time and prevent the service degradation that heavy traffic would cause.


Outlook for Intelligent Alarms

Fault detection is the decisive link for overall alarm quality, so intelligent anomaly detection should introduce more and better algorithms to keep the false alarm rate and missed alarm rate under 1%. In fault location, the key is to analyze more comprehensive association relationships from the existing data and, combined with multi-dimensional drill-down analysis, give a more accurate fault root cause. In fault handling, fault self-healing can be introduced so that problems are repaired automatically once located, further improving O&M automation.


