On the afternoon of September 26th, iQiyi's technical product team held the 19th session of the “I Technology Salon”. The theme of this session was “Exploration and Application of Data Governance”. Several senior experts from Kuaishou, Meituan and Kuaikan joined in-depth discussions of the related technical issues.

Professor Peng Tao, a researcher at iQiyi, gave a talk entitled “Exploration and Practice of iQiyi Data Quality Monitoring”. It mainly introduced the rule engine module of the data governance platform, including the current problems, the objectives, the anomaly detection methods, and the planned follow-up features of the rule engine.

The following is a write-up of “Exploration and Practice of iQiyi Data Quality Monitoring”, compiled from the talk given at [I Technology Conference].

Questions and Objectives: Why data quality monitoring?

Data quality monitoring is much like current epidemic prevention and control work. Nucleic acid testing detects the virus as early as possible, and source tracing helps us understand in which scenarios the virus appears and who is most affected, which makes follow-up tracking easier. Data quality monitoring works the same way: detect problems early, then trace them to their source.

There are many reasons for data problems. We divide the causes of abnormal data into the following three factors:

  • **Product factors:** for example, an app release or a change in the Pingback delivery strategy;

  • **Operational and external factors:** such as channel operations, content-driven traffic, traffic inflation (fake volume), fan behavior, and partner factors;

  • **Technical problems:** missing data or flawed computation logic, which can also have a significant impact on the data.

Given data anomalies with such different causes, how do we control them from a monitoring perspective? At present, iQiyi carries out quality control at three levels:

  • **Pingback layer:** Pingback is the source of every report, so delivery quality is improved at the source;

  • **Data middle layer:** necessary monitoring is added in the data middle layer to keep abnormal data from being passed downstream;

  • **Business report layer:** a very important and intuitive part that faces users and operations. This data is used by a large number of people, and everyone's focus is different, so the aim is to cover the important business lines as far as possible, especially the core data.

In response to these problems, iQiyi set several goals for data quality monitoring.

The first is to find anomalies and deal with them in time.

The second is to locate the causes of anomalies. Some causes may be reasonable, such as operational ones, and these only need to be annotated. Erroneous data has to be handed to developers for repair, including but not limited to front-end, back-end, and data development.

The ultimate goal is to improve data quality and keep data flows and operations healthy.

How to detect anomalies

The figure above is the current flow chart of iQiyi's data quality monitoring. Since the middle layer and the business report layer are both stored as data tables at iQiyi, data quality monitoring merges these two parts and is ultimately divided into Pingback monitoring and report monitoring.

**Data preprocessing module:** responsible for unified formatting of different data sources;

The rule engine is divided into two parts, anomaly detection and intelligent attribution, which are described in detail later.

**Work order processing system:** responsible for the follow-up handling of abnormal data, including annotating anomaly causes, repairing erroneous data, and other offline management work. Finally, whether each case was a real anomaly, together with its cause, is written into the sample database for continuous optimization of subsequent monitoring.

iQiyi internally divides Pingback monitoring into the following three dimensions:

  • **Business dimension:** subdivided into specific businesses and terminals, such as the iQiyi Android client, the iQiyi iPhone client, etc.;

  • **Event type dimension:** monitors different user behaviors, such as start, play, display, click, etc.;

  • **Time dimension:** divided into three levels: 5-minute, hourly, and daily.

For Pingback monitoring along these dimensions, the indicators are standardized so that monitoring can be automated. They include the PV and UV of logs, the null rate and validity rate of fields, the mean of numeric fields, the distribution of enumerated values, etc.
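As a rough illustration of what such standardized indicators might look like in code, the sketch below computes PV, UV, a field's null rate, and the distribution of its enumerated values for one batch of logs. The record layout and the names `device_id` and `event_type` are assumptions made for this example, not iQiyi's actual Pingback schema.

```python
from collections import Counter


def pingback_indicators(logs, user_key="device_id", field="event_type"):
    """Compute basic quality indicators for one batch of Pingback logs.

    `logs` is a list of dicts; `user_key` and `field` are hypothetical names.
    """
    pv = len(logs)  # PV: total number of log records
    # UV: number of distinct, non-empty user identifiers
    uv = len({row[user_key] for row in logs if row.get(user_key)})
    # Null rate: share of records where the monitored field is missing or empty
    nulls = sum(1 for row in logs if row.get(field) in (None, ""))
    null_rate = nulls / pv if pv else 0.0
    # Distribution of enumerated values, ignoring missing ones
    distribution = Counter(
        row[field] for row in logs if row.get(field) not in (None, "")
    )
    return {
        "pv": pv,
        "uv": uv,
        "null_rate": null_rate,
        "distribution": dict(distribution),
    }
```

Computed per business, event type, and time window, values like these give each detector a uniform numeric series to work on.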

As an example of Pingback monitoring, consider iQiyi's startup UV. In the beginning, the iQiyi app delivered only cold-start events (the user opens the app manually). In our data analysis, we found that play UV was higher than startup UV every day. We later found that many startups were never delivered, such as launches triggered by push notifications, users switching to another program and back, and resuming directly from the recent-tasks list. After we added these other startup event types, the delivered startup PV and UV rose sharply, making our DAU calculation more realistic and reasonable.

We want to find such problems at the delivery level as early as possible, so Pingback anomaly monitoring also steps in during gray release to find and fix problems and keep their impact to a minimum.

III. Report monitoring

The factors affecting reports were described in the “Questions and Objectives” section and will not be repeated here. To address these problems, we divide monitoring into dimensions and indicators. Dimensions are the subjects the business cares about, such as overall data, channel data, version data, album data, etc. Indicators are the values calculated within a specific dimension. Taking channel data as an example, the indicators include new UV, next-day retention, and 7-day retention.

4. Anomaly detection methods and detection engine

The figure above shows the detection methods used in the anomaly detection module of iQiyi's data quality monitoring, which combines multiple anomaly detectors at the front with a decision maker at the back.

Each detection method suits different scenarios and needs to be matched to the data. The following briefly introduces the different methods using a real data series from iQiyi, covering each method, its application scenarios, and its advantages and disadvantages.

Thresholds and coincidences: The example is a conversion rate indicator that is normally below 100%, then suddenly rises above 100% on certain days and at times reaches 130%. For this indicator, anomalies can be detected with the threshold method. For example, with the threshold set at 98%, any day whose CTR exceeds it is flagged, and the abnormal dates are easily marked (the yellow points are the anomalies).
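The threshold method can be sketched in a few lines. The function name and the `(day, value)` input format are assumptions for this example; the 0.98 default mirrors the 98% threshold mentioned above.

```python
def threshold_anomalies(series, upper=0.98):
    """Return the (day, value) pairs whose value exceeds the fixed upper threshold.

    `series` is a list of (day, conversion_rate) tuples, with rates as fractions.
    """
    return [(day, value) for day, value in series if value > upper]
```

A fixed threshold is easy to reason about, but it only suits indicators with a known valid range, such as a rate that should never exceed 100%.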

5. Box detection and Gaussian detection

This is a statistical anomaly detection scheme that dynamically tracks the data based on its historical trend. Taking Gaussian detection on the conversion rate as an example, we compute statistics over the past 30 days and treat any value more than 3σ from the mean as an outlier. Outliers are detected effectively when the data fluctuates significantly, while data that stabilizes near the high point and then falls back is judged to be normal. The statistical method is therefore a dynamically adjusting detector, which makes it well suited to business promotion periods. It also has a disadvantage: once an outlier occurs, it enters the history window and strongly affects subsequent predictions, which leads to some false reports.
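A minimal sketch of the ±3σ rule over a trailing 30-point window is shown below. It assumes one value per day, uses the population standard deviation, and makes no attempt to exclude earlier outliers from the window, so it also exhibits the contamination problem described above.

```python
from statistics import mean, pstdev


def gaussian_anomalies(values, window=30, k=3.0):
    """Return indices whose value lies more than k sigma from the trailing-window mean."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the previous `window` observations
        mu = mean(history)
        sigma = pstdev(history)
        # A flat history has sigma == 0; skip it rather than flag every change
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged
```

Once a spike enters the window it inflates σ, so points shortly after it are compared against a much wider band than before.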

Correlation detection

Correlation detection applies to two or more indicators whose trends are strongly correlated, so it suits comparisons between such indicators. Again taking the conversion rate as an example, we split it (conversion rate = B/A) into two indicators, A and B. Computing their historical correlation shows that under normal circumstances A and B are strongly correlated, with a correlation coefficient as high as 0.98. With the correlation coefficient method, we treat any coefficient below 0.8 as abnormal.

**Advantages:** it is convenient for horizontal comparison across businesses and indicators. For example, if the DAU indicator drops significantly, you can compare it horizontally against other businesses, or vertically against other indicators of the same business, to judge whether the drop is abnormal.

**Disadvantages:** it can only tell that the correlation between two indicators is abnormal. It cannot tell whether A, B, or both are abnormal, so it needs to be combined with other detection methods. The delay effect is also particularly serious: when a stable indicator becomes abnormal, it is found quickly on the same day, but the alarm then persists and produces a long run of false positives.
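The correlation check can be sketched with a plain Pearson coefficient over a historical window. The function names and the choice of Pearson correlation are assumptions for this example; the 0.8 cut-off is the one quoted above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no defined correlation; treat as uncorrelated
    return cov / sqrt(vx * vy)


def correlation_is_abnormal(a, b, min_corr=0.8):
    """Flag the pair when the correlation of A and B over the window drops below min_corr."""
    return pearson(a, b) < min_corr
```

As the disadvantages above note, a low coefficient says only that the pair has diverged, not which of the two indicators is at fault.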

Finally, we introduced Facebook Prophet, a time series forecasting model that supports many kinds of series, including saturating forecasts, trend changepoints, and periodic indicators. It also accepts holiday parameters as input, so it applies to more scenarios than the previous methods.

Again, take the conversion rate indicator above to see how Prophet performs. It provides three outputs: the predicted value, the prediction upper bound, and the prediction lower bound. In the figure above, the purple line is the actual conversion rate, the green area is the range between the predicted upper and lower bounds, and the blue line in the middle is the predicted curve. Values outside the green range are marked as outliers.
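The band check itself is independent of how the forecast is produced. The sketch below assumes the bounds have already been computed and are passed in as plain lists; in Prophet's forecast output they correspond to the `yhat_lower` and `yhat_upper` columns. The function name is hypothetical.

```python
def band_anomalies(actuals, lower, upper):
    """Return indices where the actual value falls outside [lower, upper].

    The three sequences must be aligned point by point.
    """
    return [
        i
        for i, (y, lo, hi) in enumerate(zip(actuals, lower, upper))
        if y < lo or y > hi
    ]
```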

This method detects most outliers, and the effect is very obvious. However, it has an obvious problem: once outliers enter the training sample, the predicted values deviate from the normal curve and the gap between the predicted upper and lower bounds grows wider and wider, resulting in over-fitting.

In summary, correlation detection, Gaussian detection, Prophet, and the other anomaly detection methods above are all sensitive to the sample data, so outliers need to be handled in actual production. Going forward, we will use the sample database to remove confirmed outliers when building strategies, improving the overall effect of anomaly detection.

6. Follow-up planning

As the introduction above shows, there are many anomaly detection methods, and configuring every dimension + indicator combination creates a lot of work. In the early stage, the methods above are therefore applied only to indicators such as core data. To reduce the configuration workload and improve the efficiency of handling abnormal data, we plan to do two things:

  • Intelligent detection: users no longer configure monitoring policies themselves. Policies are generated automatically from the historical trend of the data, except for correlation indicators, which users still configure manually because the association depends on the business;

  • Intelligent attribution: when an anomaly is found, drill down into the dimensions of the abnormal data to find the factors that influence it most.

The diagram above shows the architecture of the intelligent attribution module.

  • **Dimension drill-down management:** responsible for coordinating the modules and formulating the drill-down logic;

  • **Data Atlas:** an iQiyi data platform product that manages the upstream and downstream relationships of tables and fields and provides data lineage for intelligent attribution;

  • **Expert suggestions:** accumulated historical experience of anomaly causes. Because there are many possible anomaly factors, historical experience helps determine the core direction of analysis, reduce the dimension explosion during drilling, and improve computation efficiency;

  • **Attribution engine:** responsible for the attribution execution logic, including finding the dimension value that contributes most to the anomaly within the drill-down dimensions, and summarizing the anomaly causes of the different dimensions into a readable explanation.
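To make the attribution engine's first step concrete, the sketch below drills into each dimension and returns the dimension value whose metric changed the most between a baseline period and the abnormal period. Ranking by absolute change, the nested-dict input layout, and the function name are all assumptions for this example; the source does not describe the actual scoring logic.

```python
def top_contributor(baseline, current):
    """Find the (dimension, value, delta) with the largest absolute metric change.

    `baseline` and `current` map dimension -> {dimension value -> metric}.
    Returns None when there is nothing to compare.
    """
    best = None
    for dim in baseline:
        now = current.get(dim, {})
        # Consider values present in either period; sort for deterministic order
        for value in sorted(set(baseline[dim]) | set(now)):
            delta = now.get(value, 0) - baseline[dim].get(value, 0)
            if best is None or abs(delta) > abs(best[2]):
                best = (dim, value, delta)
    return best
```

For instance, if total UV falls and channel `B` accounts for most of the drop, the function returns that channel, which can then be turned into a readable cause.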