Introduction: This article introduces Ctrip's real-time intelligent anomaly detection platform, Prophet. So far, Prophet has covered essentially all of Ctrip's business lines, and the number of monitored metrics has reached 10K+, including all of Ctrip's important business metrics such as orders and payments. Prophet takes time-series data as input and the monitoring platforms as its access objects, uses intelligent alerting to detect anomalies, and relies on the Flink real-time computing engine to raise those alerts in real time, providing a one-stop anomaly detection solution.

I. Background introduction

1. Problems caused by rule-based alerts

Most monitoring platforms implement early warning on monitored metrics through rule-based alerts. Rule-based alerts are built on statistical thresholds, for example, triggering when a metric rises or falls past a certain value. They require users to be familiar with their business metrics in order to configure the thresholds accurately. As a result, configuring rule-based alerts is cumbersome and the alerting quality is poor, and a large amount of manpower is needed to maintain the rules.

Whenever an alert fires, people have to check whether it is genuine and adjust the threshold if it is not. The problem is compounded by scale: Ctrip has three company-level monitoring platforms, and each business unit also builds its own platform for its particular needs and scenarios, so Ctrip runs more than a dozen monitoring platforms of various sizes. Configuring metric alerts on each of them separately is very tedious for users.


II. Prophet

To address the above problems, Ctrip built its own real-time intelligent anomaly detection platform, Prophet. The name was inspired by Facebook's Prophet, but the implementation is different from Facebook's Prophet.

1. One-stop anomaly detection solution

First, Prophet takes time-series data as input. Second, it treats the monitoring platforms as its access objects, with the goal of replacing rule-based alerts. Intelligent anomaly detection is implemented with deep learning algorithms, and real-time detection is implemented on a real-time computing engine, yielding a unified anomaly detection solution.


2. Prophet system architecture

  • Bottom layer: Hadoop. YARN serves as the unified resource scheduler on which the Flink jobs run, and HDFS stores the trained TensorFlow models.
  • Engine layer: Data first lands in a message queue; Prophet uses Kafka. Flink is the computing engine for real-time anomaly alerting, and TensorFlow is the training engine for the deep learning models. Prophet also stores historical data in a time-series database.
  • Platform layer: At the top sits Prophet itself, the layer that provides services externally. Clog collects job logs, Muise is the real-time computing platform, Qconfig stores the configuration items the jobs need, and Hickwall monitors and alerts on the jobs themselves.

3. Why Flink?

At present, the mainstream real-time computing engines include Flink, Storm, and Spark Streaming. Ctrip chose Flink as the real-time computing engine of the Prophet platform mainly because Flink has the following four characteristics (a minimal sketch follows the list):

  • Efficient state management: a large amount of state must be kept during anomaly detection, and Flink's State Backend can store this intermediate state.
  • Rich window support: Flink offers tumbling windows, sliding windows, and more; Prophet processes data with sliding windows.
  • Multiple time semantics: Flink supports several notions of time; Prophet is based on event time.
  • Multiple levels of fault-tolerance semantics: Prophet needs at-least-once or exactly-once guarantees, both of which Flink supports.
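As a rough illustration only, not Ctrip's actual job code, the following PyFlink sketch shows how these features fit together: checkpointing for fault tolerance, event-time semantics, and a sliding window keyed by metric ID. The record layout, window sizes, and the in-memory source are assumptions.

```python
# Minimal PyFlink sketch; assumed record layout: (metric_id, event_ts_ms, value).
from pyflink.common import Duration
from pyflink.common.time import Time
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows


class MetricTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[1]  # event time comes from the record itself


env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # fault tolerance: checkpoint every minute

watermarks = (WatermarkStrategy
              .for_bounded_out_of_orderness(Duration.of_seconds(30))
              .with_timestamp_assigner(MetricTimestampAssigner()))

# A small in-memory source stands in for the real Kafka consumer.
events = env.from_collection(
    [("metric_a", 60_000 * i, 100.0 + i) for i in range(12)])

(events
 .assign_timestamps_and_watermarks(watermarks)
 .key_by(lambda e: e[0])                                # state partitioned per metric
 .window(SlidingEventTimeWindows.of(Time.minutes(10),   # 10-minute window ...
                                    Time.minutes(1)))   # ... sliding by 1 minute
 .reduce(lambda a, b: b)                                # placeholder for detection logic
 .print())

env.execute("prophet-window-sketch")
```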

4. Prophet operation process

Users only need to configure intelligent alerts on the monitoring platform they already use; the monitoring platform connects to the Prophet platform, which completes all subsequent steps. The monitoring platform needs to do two things:

  • First, the monitoring metrics configured by users are synchronized to the Prophet platform.
  • Second, the monitoring platform pushes the real-time data of those metrics to the Kafka message queue (a sketch of the push follows).
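The article does not specify the message format between the monitoring platforms and Prophet; as a hedged sketch, pushing one point of a metric with kafka-python might look like this, where the broker address, topic name, and JSON fields are all assumptions.

```python
# Hypothetical metric push: broker, topic name, and JSON schema are assumptions.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One point of the metric's time series, pushed as it is produced.
producer.send("prophet-metrics", {
    "metric_id": "order.count.total",    # assumed identifier
    "timestamp": int(time.time() * 1000),
    "value": 1234.0,
})
producer.flush()
```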

When Prophet receives a new monitoring metric, it tries to train a model for it with TensorFlow. Model training requires historical data. A monitoring platform can expose a historical-data query interface that follows an agreed specification, and Prophet obtains history through that interface for training; if no such interface exists, Prophet accumulates a training data set from the data in the message queue. After a model is trained, it is uploaded to HDFS, and Prophet updates the configuration in the configuration center to notify Flink that a newly trained model is available to load. All metric values pushed to Kafka in real time are also synchronized into Prophet's time-series database to support anomaly detection.
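The agreed specification of the historical-data interface is not spelled out here; a hypothetical fetch, with a made-up endpoint, parameters, and response shape, might look like:

```python
# Hypothetical history query: URL, parameters, and response shape are assumptions.
import requests


def fetch_history(metric_id: str, start_ms: int, end_ms: int) -> list[float]:
    resp = requests.get(
        "http://monitoring.example.com/api/history",  # assumed endpoint
        params={"metric": metric_id, "start": start_ms, "end": end_ms},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response: {"points": [{"ts": ..., "value": ...}, ...]}
    return [p["value"] for p in resp.json()["points"]]
```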

Once training completes and the Flink job detects the configuration update, it loads the new model and consumes the metric data from Kafka in real time. The detection results and anomaly alerts are written back to Kafka, and each monitoring platform picks up its alert data from there. The whole Prophet operation process is invisible to users, who only need to configure alerts, which is a great convenience.

III. Intelligent and real-time detection

1. Challenges of intelligent detection

Several challenges must be overcome before detection can be made intelligent:

  • Few negative samples: anomalies occur rarely in production, so over many years Ctrip has accumulated only a few thousand negative samples.
  • Many kinds of metrics: they include business metrics such as orders and payments, application metrics such as request counts and response latency, and hardware metrics such as CPU, memory, and disk.
  • Many shapes of metric curves: because the metrics differ in kind, their curves take different forms. Ctrip groups them into three classes: metrics with stable periodic fluctuation; metrics that are stable and fluctuate only mildly; and metrics that fluctuate violently with no stable shape.

2. Deep learning algorithm selection

To address these three problems, Ctrip experimented with deep learning algorithms such as RNN, LSTM, and DNN.

  • RNN: well suited to time-series data, but suffers from the vanishing gradient problem.
  • LSTM: solves the vanishing gradient problem. With RNN- and LSTM-style algorithms, a model is first trained for each metric; the current data set is then fed in, the model predicts its trend, and anomalies are detected by comparing the predicted series with the actual one. This approach gives high detection accuracy, but one model per metric brings higher resource consumption.
  • DNN: a single model can cover all anomaly detection scenarios, but feature extraction is very complex, features in different frequency domains must be extracted, and a large amount of user-labeled data is required.
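The article does not publish Ctrip's model code; a minimal sketch of the per-metric LSTM approach it describes (five points in, one point out) could look like this in TensorFlow/Keras, with layer sizes and the synthetic training series chosen arbitrarily:

```python
import numpy as np
import tensorflow as tf

WINDOW = 5  # use 5 consecutive points to predict the next one


def make_supervised(series: np.ndarray):
    """Slice a 1-D series into (5-point input, next-point target) pairs."""
    X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
    y = series[WINDOW:]
    return X[..., np.newaxis], y  # LSTM expects (batch, timesteps, features)


# One model per metric, as described above; sizes are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Synthetic stand-in for roughly two weeks of history.
series = np.sin(np.linspace(0, 60, 672)).astype("float32")
X, y = make_supervised(series)
model.fit(X, y, epochs=5, verbose=0)
```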

3. Offline model training

Ctrip generally releases a new version every two weeks, so each business metric's model is also retrained every two weeks, and the training input is likewise the most recent two weeks of data. Training proceeds as follows (a preprocessing sketch follows the list):

  • Historical data must be preprocessed before use. For example, the history may contain null values, which are filled in based on the mean and standard deviation.
  • The history also inevitably contains anomalous intervals, and the outliers in those intervals need to be replaced with predicted values.
  • In addition, because data during holidays is complicated, outliers during holidays also need to be replaced. After preprocessing, features are extracted over different time scales and frequencies.
  • Finally, a classification model labels each metric as stationary, aperiodic, or periodic; different types of metrics are trained with different models.
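A hedged sketch of the preprocessing steps above (gap filling from the mean and standard deviation, and replacing values in known-anomalous intervals with predictions); the exact rules are assumptions, not Ctrip's code:

```python
import numpy as np


def preprocess(values: np.ndarray, anomalous: np.ndarray,
               predicted: np.ndarray) -> np.ndarray:
    """values: raw history (float) with NaN gaps; anomalous: bool mask of known
    outlier intervals (e.g. incidents, holidays); predicted: model predictions
    aligned with values."""
    clean = values.copy()

    # 1) Fill missing points from the mean/std of the observed data.
    mean, std = np.nanmean(clean), np.nanstd(clean)
    gaps = np.isnan(clean)
    clean[gaps] = np.random.normal(mean, std, gaps.sum())

    # 2) Replace points inside anomalous intervals with predicted values.
    clean[anomalous] = predicted[anomalous]
    return clean
```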

4. Dynamic model loading

After training, the model must be loaded by the Flink job dynamically; in a real deployment the job cannot be restarted every time a model is retrained. The Prophet platform therefore uploads the trained model to HDFS and notifies the configuration center, after which the Flink job pulls the new model from HDFS. To distribute the models evenly across Task Managers, all monitored metrics are keyed (keyBy) on their own IDs and spread uniformly, so each Task Manager loads only its own share of the models, reducing resource consumption.
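In plain Python, standing in for what a Flink operator would do when the configuration changes, the per-Task-Manager loading described above might look like the following sketch; the key assignment, model directory layout, and use of Python's hash() are all illustrative assumptions (Flink uses its own key hashing internally).

```python
# Sketch of per-subtask model loading; paths and key assignment are assumptions.
import tensorflow as tf


def my_keys(all_metric_ids, subtask_index, parallelism):
    """Mimic keyBy: each subtask owns the metrics hashed to it."""
    return [m for m in all_metric_ids if hash(m) % parallelism == subtask_index]


def reload_models(all_metric_ids, subtask_index, parallelism, model_dir):
    """Called when the configuration center signals a new model version.
    Only this subtask's share of models is loaded, saving memory."""
    models = {}
    for metric_id in my_keys(all_metric_ids, subtask_index, parallelism):
        # e.g. models pulled from HDFS to a local directory beforehand
        models[metric_id] = tf.keras.models.load_model(f"{model_dir}/{metric_id}")
    return models
```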


5. Real-time data consumption and prediction

Once the model is loaded, real-time anomaly detection begins. The real-time data is consumed from the Kafka message queue; Prophet currently works on Flink event time plus a sliding window. Monitored metrics come in several time granularities, for example one point per minute, one point per 5 minutes, or one point per 10 minutes. Take the one-point-per-minute case: the Flink job opens a window of ten time granularities, i.e., ten minutes. When ten data points have accumulated, the first five are used to predict the next one; that is, the points at moments 1-5 predict moment 6, then the points at moments 2-6 predict moment 7, and so on, until predicted values for moments 6-10 are obtained alongside the actual values. The predicted values are then compared with the actual ones. This is the ideal case, with no anomalous data.
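The rolling five-in, one-out prediction over a ten-point window can be written down directly; this sketch assumes a `model.predict`-style callable like the Keras model above.

```python
import numpy as np

WINDOW = 5


def predict_second_half(points, model):
    """points: the 10 actual values in the current window (moments 1..10).
    Returns predictions for moments 6..10, each made from the 5 points
    immediately before it, exactly as described above."""
    preds = []
    for t in range(WINDOW, len(points)):  # t = index of moment 6..10
        history = np.array(points[t - WINDOW:t], dtype="float32")
        pred = model.predict(history.reshape(1, WINDOW, 1), verbose=0)[0, 0]
        preds.append(float(pred))
    return preds  # compare element-wise with points[5:]
```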


6. Data interpolation and replacement

Real scenarios are rarely ideal. For example, only 9 of the 10 points in the window above may arrive, with the point at moment 4 missing; Prophet fills such gaps using the mean and standard deviation. In addition, suppose moments 6-10 were flagged as an anomalous interval in the previous window, with an abnormal drop or rise. The data in that interval is considered anomalous and cannot be used as model input as-is, so the actual value at moment 6 must be replaced with the value the previous window's model predicted for moment 6. Among the five input values at moments 2-6, the value at moment 4 then comes from interpolation, and the value at moment 6 is the prediction that replaced the outlier.

The repaired values are used as model input to obtain a new prediction for moment 7, and prediction then proceeds in turn. The predicted values for moments 6-10 of an anomalous interval must be stored as state in Flink's StateBackend so that subsequent windows can use them.
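Continuing the sketch, the repair step before prediction might look like this: missing points are filled from the window's mean and standard deviation, and points flagged anomalous by an earlier window are swapped for the predictions kept in state (here a plain dict stands in for Flink's StateBackend); the rules are illustrative.

```python
import numpy as np


def repair_window(points, state):
    """points: dict {moment: actual value or None} for the current window;
    state: {moment: predicted value} saved from earlier windows
    (a stand-in for Flink state)."""
    observed = [v for v in points.values() if v is not None]
    mean, std = float(np.mean(observed)), float(np.std(observed))

    repaired = {}
    for t in sorted(points):
        v = points[t]
        if v is None:          # missing point, e.g. moment 4
            v = float(np.random.normal(mean, std))
        if t in state:         # moment flagged anomalous by an earlier window:
            v = state[t]       # use the stored predicted value instead
        repaired[t] = v
    return repaired
```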


7. Real-time anomaly detection

Real-time anomaly detection draws on the following judgments (a combined sketch follows the list):

  • Judgment based on anomaly type and sensitivity: different metrics have different anomaly types, such as spike-up and drop anomalies, and different sensitivities, which can be set to high, medium, or low. A high-sensitivity metric is flagged as a drop anomaly on a single downward jitter; a medium-sensitivity metric is flagged after about two consecutive falling points; a low-sensitivity metric is flagged only when the fall is large.
  • Judgment based on the deviation between the predicted and actual series: if the deviation between the prediction and the actual values is large, the interval of moments 6-10 is marked as a potential anomaly interval.
  • Judgment based on the mean and standard deviation of historical data from the same period: the current values are also compared with the same period last week, and a large deviation from that period is judged anomalous. Once enough anomaly samples are available, a simple machine learning classifier over the predicted and actual values can also make this judgment.
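The three checks can be combined roughly as follows; all thresholds, the sensitivity table, and the consecutive-drop rule are illustrative stand-ins for values that would be tuned per metric.

```python
import numpy as np

# Illustrative sensitivity rules: (falling points required, minimum relative drop).
SENSITIVITY = {"high": (1, 0.05), "medium": (2, 0.10), "low": (1, 0.30)}


def is_anomalous(actual, predicted, last_week_mean, last_week_std,
                 sensitivity="medium"):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    need_drops, min_drop = SENSITIVITY[sensitivity]

    # 1) Anomaly type + sensitivity: count falling points in the interval
    #    and measure the relative size of the overall fall.
    diffs = np.diff(actual)
    falling = int(np.sum(diffs < 0)) >= need_drops
    big_drop = (actual[0] - actual[-1]) / max(abs(actual[0]), 1e-9) >= min_drop

    # 2) Deviation between the predicted and actual series.
    deviation = float(np.mean(np.abs(predicted - actual)))
    deviates = deviation > 3 * last_week_std

    # 3) Comparison with the same period last week (mean/std).
    off_baseline = abs(float(np.mean(actual)) - last_week_mean) > 3 * last_week_std

    return (falling and big_drop) or (deviates and off_baseline)
```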

8. Common scenarios

[Table: common problems and their solutions]

Original article: developer.aliyun.com/article/741…