Introduction

This article shares the problems and challenges our colleagues have run into around data quality and interface semantic monitoring, how they built a functional business monitoring platform to provide data quality and interface semantic monitoring capabilities, and the practical results achieved. It also discusses how development and testing colleagues in other lines of business can reuse and build the same capabilities.

Background

Business characteristics

Strong dependence on data quality

Algorithm-driven business is characterized by a strong dependence on data quality, as captured by a well-known saying in the industry: “Data and features determine the upper limit of machine learning; models and algorithms only approach this upper limit.”

For example, the flow chart of the online algorithm model in the figure below shows that the model training stage needs large amounts of offline data and feature data to train the model. If the data quality is off, the trained model will inevitably produce distorted online predictions. Likewise, online prediction itself relies on real-time and offline data and features as its input; if there is a data quality problem, the predicted results cannot meet business requirements.

As another example, the Hello Map data ETL flow diagram below shows that map data must go through acquisition (purchasing data), cleaning (coarse and fine filtering), fusion (aggregating multiple copies of data), and loading (writing the data into ES), before the SOA services finally expose map data query capabilities to external callers. In such a long and complex ETL pipeline, any data quality problem leads to abnormal data in the online service.

Strong dependency on external services

Another feature of an algorithm-driven business is its strong dependence on external services. As the intelligent customer service system architecture diagram below shows, the system internally relies on the marketing platform, trading platform, payment platform, account platform, and risk control center. Externally, it relies on the voice conversion service (intelligent IVR service), the hotline service (Heli), and Alipay. Any abnormality in a dependent service affects the business flow of the intelligent customer service system.

Problem introduction

Battery swap scheduling

Scenario

Scenario 1: colleagues report that a site has many low-battery vehicles, but the battery swap scheduling algorithm does not push any battery swap task;

Scenario 2: battery swap operations and maintenance colleagues report that they receive a battery swap task, but on arriving at the site find that the vehicle is not low on battery.

Data processing process

Before analyzing the cause of the problem, let us first look at the data processing flow of the real-time data warehouse. As shown in the figure below, various kinds of data, such as app tracking events, binlog, and IoT data, are sent as Kafka messages through the access system to the data warehouse's Flink jobs, and are finally stored in ES for the various business consumers to query.

Cause analysis

Analyzing the problem scenarios against the data processing flow chart above, the causes are as follows:

Cause 1: The data warehouse's Flink jobs were under too much message-processing pressure, so message consumption piled up and data updates were delayed. As a result, the business side could not get the latest vehicle battery level data, which is why sites had many low-battery vehicles but no battery swap task was pushed.

Cause 2: The data warehouse's Flink jobs process battery data differently per battery type, for example first-generation vehicles (591), second-generation vehicles (371), third-generation vehicles (668), and piled vehicles (663). Vehicles whose number starts with 376 were not in the code's processing list, so their battery level defaulted to 0. The algorithm therefore judged 376 vehicles to be low on battery and pushed battery swap tasks to the operations colleagues, while the 376 vehicles at the actual sites were not low on battery at all.
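
To make the failure mode concrete, here is a minimal Python sketch of the kind of per-type branching described above. The type prefixes come from the text, but the function and field names are hypothetical, not the actual warehouse code; the safer pattern is to treat an unknown battery type as an explicit anomaly rather than silently defaulting the level to 0.

    # Hypothetical illustration of the per-battery-type branching described above.
    # Prefixes 591/371/668/663 come from the text; parsing details are assumptions.
    KNOWN_BATTERY_PARSERS = {
        "591": lambda msg: msg["gen1_soc"],   # first-generation vehicles
        "371": lambda msg: msg["gen2_soc"],   # second-generation vehicles
        "668": lambda msg: msg["gen3_soc"],   # third-generation vehicles
        "663": lambda msg: msg["piled_soc"],  # piled vehicles
    }

    def parse_battery_level(vehicle_no: str, msg: dict) -> float:
        parser = KNOWN_BATTERY_PARSERS.get(vehicle_no[:3])
        if parser is None:
            # Original bug: unknown types (e.g. 376) silently fell through to 0,
            # so the scheduler treated healthy vehicles as low on battery.
            raise ValueError(f"unrecognized battery type: {vehicle_no}")
        return parser(msg)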

Difficulties in offline testing

In our view, the difficulty of covering these scenarios in offline testing lies mainly in the high cost:

  1. Dedicated performance tests are required;
  2. The test scenario space is extremely large. For example, the vehicle data involves 10 message topics, each with its own sub-fields and sub-logic, so the number of scenario combinations is huge. At the same time, the message fields are adjusted flexibly and changes must be tracked and retested at any time;
  3. Test environment data is incomplete, so it is impossible to simulate test data rich enough to verify the Flink jobs' many scenarios in the test environment.

Site data update

Scenario

Operations and maintenance feedback: site data is not updated in a timely manner, for example errors in basic information such as site name, latitude and longitude, and capacity.

Data processing process

The site data update process is shown in the figure below. Regional governance and label management data for sites are updated into the oasis service through the APP and PC respectively. Oasis then synchronizes the updated content to other external processing services, such as the battery service, by sending MQ messages. The battery service consumes the MQ messages, performs secondary processing, and, once its local data is updated, exposes SOA services to the business side. In this process, MQ message push and consumption, as well as the battery service's secondary processing, can all introduce data exceptions.

Cause analysis

Analyzed against the data processing flow shown in the figure above, the problem scenarios are mainly caused by data differences introduced by improper MQ message pushing. For example:

Cause 1: When the APP side updates a site, it calls the update-label interface and the update-site interface at the same time. When the update-label interface happens to be called first, the service does not send an MQ message, so the battery service never receives the data change notification. (This issue has been fixed.)

Cause 2: A process problem. The old flow was: 1. update mongoDB; 2. check whether the mongoDB data was updated successfully; 3. send an MQ notification. The problem with this flow is that mongoDB has a primary-replica design: data is written to the primary node and then synchronized to the replicas, which introduces a time lag. When the data had been written to the primary but could not yet be read from a replica, the MQ message was not sent, causing the downstream data to lag. The flow was therefore changed to: 1. update mongoDB; 2. send the MQ notification. (This issue has been fixed.)
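
A minimal sketch of the revised flow, assuming a pymongo collection and a generic MQ producer with a send method (both hypothetical stand-ins for the actual services): the MQ notification is sent right after the primary-node write succeeds, instead of being gated on a read that may hit a lagging replica.

    # Hypothetical sketch of the fixed flow: write to the primary, then notify.
    # `sites` is a pymongo collection; `mq_producer.send` is an assumed MQ client API.
    def update_site(sites, mq_producer, site_id: str, fields: dict) -> None:
        result = sites.update_one({"_id": site_id}, {"$set": fields})
        if result.matched_count == 0:
            raise ValueError(f"site not found: {site_id}")
        # The old flow read back from a possibly lagging replica before notifying,
        # which could swallow the MQ message entirely. Now we notify directly.
        mq_producer.send(topic="site.updated", payload={"siteId": site_id, **fields})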

Difficulties in offline testing

In this scenario, the difficulty of offline testing lies mainly in covering the abnormal scenarios comprehensively, which demands a lot of skill and experience from testers.

Data consistency

Scenario

How do we ensure the consistency and timeliness of data synchronization?

Data Synchronization Process

The asset data synchronization process is shown in the figure below. After asset data lands in the database (DB), the updated data needs to be synchronized to ES in real time so that ES can provide the SOA query service externally. In this synchronization process: 1. How do we ensure that the tens of thousands to millions of daily data changes are correctly synchronized from the DB to ES? 2. How do we monitor the reliability of the synchronization service in real time, so that when it has a problem we can alert quickly and stop the loss in time?

Dependant Service Reliability

Scenario

Customer service colleagues occasionally complain that they cannot see the text converted from a user's recording, or that the converted text is wrong.

Service dependencies

As the dependency architecture diagram of the ASR service shows, the Ping An voice-to-text service is invoked inside the service system rather than accessed through an exposed interface, so we cannot monitor it via interface status codes or response times. In the failure scenario, voice simply cannot be converted to text, so we also cannot determine service unavailability just by checking whether the interface response content is empty. We therefore need to introduce semantic analysis, that is, "verify the response results through specifically designed inputs", in order to judge service availability effectively.

Summary

Given how important data quality and external services are to the algorithm business, and the scenarios exposed by the problems above, we need to build data quality monitoring and interface semantic monitoring capabilities.

The figure below shows the layered view of the online monitoring system. At the QA level, we mainly focus on monitoring dimensions strongly tied to business scenarios and to the correctness of business logic; we call this dimension functional business monitoring.

Solution

Monitoring purposes

Online monitoring generally serves two purposes:

  1. Improve the ability to perceive online problems and quickly find and locate problems
  2. Minimize the impact of online problems with minimal cost

We also hope to complete version 1.0 of data quality monitoring and interface semantics monitoring to complement other monitoring methods in the company.

Monitoring objects and scenarios

Data quality monitoring

Monitoring object

In data quality monitoring, the monitored objects span three layers. The first layer is the database table layer; here we rely on the company's DQC platform to ensure the quality of the data landing in tables after data scheduling tasks complete. The second layer is the data storage medium layer, and the third is the application service layer. For these two layers, data quality is checked mainly by comparing business logic correctness across multiple different data sources to surface data quality problems.

Monitoring scenarios

Taking the data quality monitoring of Hello Map as an example, this section introduces the three layers above and how monitoring scenarios are designed to cover them.

First, as shown in the figure below, data quality monitoring can be divided into single-data source quality analysis and multi-data source comparative analysis.

For single-data-source quality analysis, we focus on the database table layer and provide quality assurance on two fronts: the field level (fields being non-empty, fields being unique, value ranges, field format and precision, etc.) and the table level (data volume fluctuation).
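
As an illustration, here is a minimal Python sketch of this kind of single-source check, assuming the table has been pulled into a pandas DataFrame; the column names and thresholds are hypothetical, not the actual Hello Map rules.

    import pandas as pd

    def check_single_source(df: pd.DataFrame, prev_row_count: int) -> list[str]:
        problems = []
        # Field level: non-empty, unique, value range, format/precision
        if df["site_guid"].isna().any():
            problems.append("site_guid contains null values")
        if df["site_guid"].duplicated().any():
            problems.append("site_guid is not unique")
        if not df["capacity"].between(0, 500).all():
            problems.append("capacity out of expected range [0, 500]")
        if not df["longitude"].between(-180, 180).all():
            problems.append("longitude outside [-180, 180]")
        # Table level: data volume fluctuation compared with the previous run
        if prev_row_count and abs(len(df) - prev_row_count) / prev_row_count > 0.2:
            problems.append("row count fluctuated by more than 20%")
        return problems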

For multi-data-source comparative analysis, we look at data consistency: after stripping out known logical differences, we compare the same data obtained from different data sources.

Second, the Hello Map data quality monitoring scenario in the figure below shows how these two analysis approaches are applied in the data quality testing of Hello Map.

Interface semantic monitoring

Monitoring object

The monitored objects of interface semantic monitoring can be broken down by service type: Web services, RPC services, message services, and storage services.

Monitoring scenarios

From the perspective of service availability monitoring, there are three monitoring scenarios:

  1. Service availability: mainly monitors whether the service is alive and whether the returned status code is normal.
  2. Response time: monitors the end-to-end response time of a service call and raises an alarm when it exceeds the threshold.
  3. Semantic correctness: mainly monitors the content or fields of the service response, for example: a. whether certain rules are met, such as monitoring a returned price field for correct format (numeric format, precision length, etc.); b. whether key fields are valid, for example when our service strongly depends on a particular field from an external service.

The third scenario is covered by our interface semantic monitoring; the other two are covered by the company's monitoring platform. Together they form a good, complementary collaboration with the company platform.
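
For example, a hedged sketch of a semantic-correctness check on a price field, in line with scenario 3 above; the endpoint, field names, and rules are hypothetical, not the actual monitored service.

    import re
    import requests

    PRICE_PATTERN = re.compile(r"^\d+\.\d{2}$")  # numeric with two-decimal precision

    def check_price_semantics(url: str) -> list[str]:
        problems = []
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:                  # scenario 1: availability
            return [f"unexpected status code {resp.status_code}"]
        price = resp.json().get("price")
        if price is None:                            # scenario 3: key field present
            problems.append("price field is missing")
        elif not PRICE_PATTERN.match(str(price)):    # scenario 3: format and precision
            problems.append(f"price '{price}' violates the expected format")
        return problems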

Monitoring platform design ideas

Functional modules

The functional business monitoring platform is designed as four modules plus one platform. The four modules are:

  1. Data & service module: mainly responsible for reading different monitored data objects and supporting different service types;
  2. Rule and policy module: supports user-defined exception monitoring rules, user-defined alarm policies, monitoring schedules for arbitrary periods, user-defined alarm levels, masking rules, and so on (an illustrative configuration sketch follows below);
  3. Alarm and feedback module: mainly implements alarm channel management (individual, group, email, etc.) and the alarm handling process, integrates with the improvement work order system, and closes the loop from monitoring alarm to improvement;
  4. Basic function module: mainly covers user & permission management, task configuration management, alarm group & alarm information management, the data dashboard, report display, and other basic functions.

The "one platform" part refers to achieving ease of use, flexibility, and unified access capability through platform-oriented construction.
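
As a concrete illustration of what such a rule and alarm policy might look like, here is a hypothetical task configuration; the keys and values are illustrative only, not the platform's actual schema.

    # Hypothetical monitoring task configuration; keys and values are illustrative.
    TASK_CONFIG = {
        "task_name": "warehouse_vs_ossmap_battery_level",
        "schedule": "*/5 * * * *",           # run every 5 minutes
        "rules": {
            "level_tolerance": 2,            # allowed per-vehicle difference
            "ignore_fields": ["updated_at"], # fields excluded from comparison
        },
        "alarm": {
            "level": "P2",
            "channels": ["group_chat", "email"],
            "mute_window": ["00:00", "06:00"],  # masking rule: no alarms overnight
        },
    }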

Hierarchical system design

Based on the design principle of four modules plus one platform, we implemented the layered system design shown in the figure below.

Practical results

Algorithm test platform introduction

We ultimately built an AI testing platform on which we implemented the data quality monitoring and interface semantic monitoring capabilities, planning and developing the following five features:

  1. Data dashboard: mainly displays three kinds of information: a. the distribution of tasks across teams; b. key task execution metrics (cumulative & daily); c. a dynamic display of the report results of core or abnormal tasks. The goal is to see all the core information at a glance on the dashboard;
  2. Task management: task-management operations such as filtering, creating, editing, and triggering tasks, fetching execution logs, and displaying execution status;
  3. Monitoring report: displays the monitoring data of all tasks, such as task information, presents monitoring data in various forms, and highlights outliers;
  4. Alarm management: collects all alarm issues and handles the entire alarm process, such as notification, problem confirmation, improvement work order submission, and tracking the improvements to completion, finally closing the monitoring loop;
  5. Data correction: provides correction and repair capabilities for abnormal data discovered by monitoring. Because data correction is high-risk, it also needs functions such as permission management and data rollback.

Case sharing

Data quality monitoring

Overview

By the end of February 2021, the data quality monitoring platform had onboarded a total of 8 data monitoring tasks, coming from the algorithm platform, the supply chain, and the battery service respectively. We found 18 problems in total, of which 61% were found by offline testing and 39% by online monitoring.

Case sharing

The problem introduction section above already described the problem scenarios, the data processing flows, and the root causes in detail. This section mainly covers how we do the data comparison and analyzes the problems found.

Battery swap scheduling
Data comparison method

According to the data processing flow of the real-time data warehouse, the data source is the tracking-event messages reported through Kafka, and the external serving mode is ES queries. Within the company, other teams use the same data source to provide external services, such as the OssMap service of the vehicle service team. We therefore compare the real-time data warehouse's ES query data with the SOA interface data of the vehicle service's OssMap (used as the reference) to verify the data quality of the real-time warehouse.

Some colleagues may wonder: can we directly compare the warehouse's ES data with the OssMap data?

The answer is no. There are two main differences: 1. timeliness; 2. data processing logic.

Timeliness: although both services consume the same Kafka topics, they process messages at different speeds. For fields with high real-time requirements, such as a vehicle's in/out-of-site status or its battery level, the difference in processing efficiency between the two services inevitably produces a certain amount of discrepancy.

Data processing logic: different services handle the same message content differently. For example, the vehicle tag ALERT_OVER (out of the service area) is returned as 8 in the warehouse's ES but as 101 in the OssMap service.

So how do you solve these two differences?

The answer: for the timeliness problem, we use a dynamically configured tolerance to absorb the discrepancy. For example, we configure the battery level tolerance as 2, meaning that for the same vehicle the battery level reported by the two services is allowed to differ by at most 2. That is:

    # Tolerance check for the same vehicle's battery level from the two services
    if abs(warehouse_battery_level - ossmap_battery_level) > 2:
        trigger_alarm()  # the two services diverge beyond the tolerance
    # otherwise the two services are considered consistent and no alarm is raised

For the data processing logic differences, we use a transformation mapping to align the data formats returned by the two services. For example, taking the vehicle service's label format as the baseline, we convert the warehouse's labels into the vehicle service's label format and then compare them.
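
A minimal sketch of this transform-then-compare step: the mapping values for ALERT_OVER come from the text, while everything else is a hypothetical illustration.

    # Map warehouse tag codes onto the OssMap (vehicle service) tag codes before comparing.
    # ALERT_OVER: warehouse returns 8, OssMap returns 101 (from the text); others are assumed.
    WAREHOUSE_TO_OSSMAP_TAG = {
        8: 101,   # ALERT_OVER: vehicle outside the service area
        # ... other tag mappings would be maintained here
    }

    def tags_match(warehouse_tags: list[int], ossmap_tags: list[int]) -> bool:
        normalized = {WAREHOUSE_TO_OSSMAP_TAG.get(t, t) for t in warehouse_tags}
        return normalized == set(ossmap_tags)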

Problem analysis

In the data quality monitoring scenario of battery swap scheduling, we found a total of 8 bugs, including 4 problems found in offline testing and 2 found by online monitoring.

Classified by problem type, they fall into two categories: data processing logic problems (4, or 67%) and non-functional problems (2, or 33%).

Non-functional problems:

  1. Capacity problem: the maximum result size of the service's ES query has to be configured, and the developer set it too small, so the number of mopeds returned for a site was lower than the actual number;
  2. Performance problem: the message pile-up and delayed data updates caused by excessive Flink message consumption pressure mentioned above.

Data processing logic problems:

  1. Errors in filter conditions, for example a vehicle tagged "bike.in.site" while isinsite=0, or a vehicle that has a StationGuid but no CityGuid;
  2. Requirement omissions, that is, some statistical fields or statistical rules for the data were left out.

Site data update
Data comparison method

According to the data processing flow of site data updates, site data is first updated in the oasis service, the battery service is then notified via MQ message to update its own data, and finally the battery service exposes the data externally. In theory, therefore, the oasis service data and the battery service data should be consistent. So after aligning the data formats of the oasis service and the battery service, we can directly compare the data to monitor data quality.

Problem analysis

In the data quality monitoring scenario of site data updates, covering the battery service comparison task mentioned above and the site data comparison task inside the algorithm platform, we found a total of 11 problems, including 7 found in offline testing and 4 found by online monitoring.

Classified by problem type, they fall into two categories: data processing logic problems (7, or 64%) and non-functional problems (4, or 36%).

Non-functional problems:

  1. Message consumption & push problems, such as message consumption failures, or the APP update mechanism mentioned above causing message pushes to be skipped;
  2. Concurrency problems, for example the site data inside the algorithm platform is stored in Redis without a concurrency locking mechanism, so concurrent updates overwrite each other;
  3. Data update synchronization problems, for example when new sites are added in a city, the data warehouse's Redis updates the site information (refreshed offline once a day) but not the city information, so the intelligent scheduling data center cannot find the site's vehicle data when it queries by site and city ID.

Data processing logic problems:

  1. Value errors: a fairly common problem of reading data from the wrong field;
  2. Data processing logic errors, for example logic meant to filter out expired dirty data that, due to a bug, also filtered out other basic data;
  3. Requirement omissions, that is, some statistical fields or statistical rules for the data were left out.

Data consistency
Data comparison method

From the asset data flow, we mainly need to ensure consistency between the two data sources, the asset DB and the ES cluster, to guarantee the reliability of the data synchronization task. Specifically:

  1. With tens of thousands to millions of data changes per day, how do we make sure everything is synchronized correctly? Answer: we periodically query all updated data from the asset DB and compare it with the data in ES to ensure the consistency of all synchronized data.
  2. How do we monitor the reliability of the synchronization service in real time? Answer: we compare DB and ES data consistency in near real time with high-frequency (minute-level), lightweight checks (only the most recently updated data is checked), and from that infer the health of the synchronization service (see the sketch below).
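A minimal sketch of the minute-level lightweight check, with the DB and ES access injected as callables so that no specific client API is assumed; asset_id and version are hypothetical field names.

    from typing import Callable, Optional

    def check_recent_sync(
        fetch_recent_from_db: Callable[[int], list[dict]],
        fetch_from_es: Callable[[str], Optional[dict]],
        window_minutes: int = 5,
    ) -> list[str]:
        """Lightweight consistency check over rows updated in the last few minutes."""
        mismatches = []
        for row in fetch_recent_from_db(window_minutes):
            doc = fetch_from_es(row["asset_id"])
            if doc is None:
                mismatches.append(f"{row['asset_id']}: missing in ES")
            elif doc.get("version") != row.get("version"):
                mismatches.append(f"{row['asset_id']}: ES is stale (version mismatch)")
        return mismatches  # a non-empty list would trigger an alarm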

Interface semantic monitoring

Overview

By the end of February 2021, the interface semantic monitoring platform had onboarded two monitoring tasks, one from the algorithm platform and one from the supply chain. We found a total of 1 online problem.

Case sharing

The problem scenario and the service dependency architecture were described in detail in the problem introduction section above. This section mainly covers how the interface semantic monitoring is done and analyzes the problem found.

Customer service voice conversion
Interface semantic monitoring mode

For such scenarios requiring semantic monitoring of interfaces, we adopt the method of active detection + semantic analysis:

Active detection means proactively calling the Ping An voice-to-text service at high frequency (minute level).

Semantic analysis means preparing input voice files in advance, for example a recording whose content is "hello", calling the voice conversion service, and checking the response content. If the response text is "hello", the check passes; if it is not, an alarm is raised.
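
A hedged sketch of this probe, assuming a hypothetical HTTP wrapper around the voice-to-text service; the URL, payload shape, and response field names are illustrative only.

    import requests

    EXPECTED_TEXT = "hello"            # transcript we expect for the canned audio file
    PROBE_AUDIO = "probe_hello.wav"    # prepared recording whose content is known

    def probe_asr(service_url: str) -> bool:
        """Active detection: call the voice-to-text service with known audio and
        verify the transcript (semantic analysis). Returns True when healthy."""
        with open(PROBE_AUDIO, "rb") as f:
            resp = requests.post(service_url, files={"audio": f}, timeout=10)
        if resp.status_code != 200:
            return False
        transcript = resp.json().get("text", "")
        return transcript.strip().lower() == EXPECTED_TEXT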

Problem analysis

By March 2021 we had detected 3 occasions on which the Ping An voice-to-text service was unavailable. The root cause: after a change to the hotline service (Heli) on which the customer service system depends, too many invalid recordings were pushed, which made the ASR service unavailable.

Platform access

At this point, colleagues with data quality monitoring or interface semantic monitoring needs may ask: how do we get onto the platform? How do we acquire these monitoring capabilities?

For platform access, we first split the roles in two. One is the demand side, that is, colleagues who have monitoring needs and want to use the platform's capabilities to bring data quality and interface semantic monitoring tasks online quickly. The other is the platform side, that is, the maintainers of the monitoring platform (contact the algorithm testing team directly).

For the demand side, bringing a monitoring task online takes only four steps:

  1. Requirement alignment: prepare the requirement document in advance and go through it with the platform side;
  2. Monitoring rule script: write the multi-data-source comparison rules following the monitoring-script Python project template (fully independent, with no constraints on conditions); see the sketch after this list;
  3. Script verification: after submitting the monitoring rule script to GitLab, run the data comparison in the test environment to verify the script;
  4. Monitoring task launch: once offline testing is complete, the monitoring task can go live, be monitored on the platform, and deliver alarm feedback through the alarm channels.
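As an illustration of step 2, here is what a minimal multi-source comparison rule might look like; the article does not show the template's real interface, so the class name, method names, and data access below are all hypothetical.

    # Hypothetical rule script; the real platform template's interface may differ.
    class SiteCapacityRule:
        """Compare site capacity between the oasis service and the battery service."""

        def fetch_sources(self) -> tuple[dict, dict]:
            # In a real script these would call the two services' SOA/HTTP interfaces.
            oasis_data = {"site_001": 20, "site_002": 15}
            battery_data = {"site_001": 20, "site_002": 14}
            return oasis_data, battery_data

        def compare(self) -> list[str]:
            oasis_data, battery_data = self.fetch_sources()
            diffs = []
            for site_id, capacity in oasis_data.items():
                if battery_data.get(site_id) != capacity:
                    diffs.append(f"{site_id}: oasis={capacity}, battery={battery_data.get(site_id)}")
            return diffs  # a non-empty list would trigger an alarm on the platform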

Future planning

Going forward, we will continue to build the functional business monitoring platform along the following three dimensions:

Current capabilities:

By the end of February 2021, we had completed a first version of the data dashboard, task management, and monitoring report capabilities. While continuing to improve these three functions, we plan to build and roll out alarm management and data correction. At the same time, as the monitoring requirements we onboard become more diverse and complex, we will keep improving the platform's reliability and the validity of its alarms.

Business capability expansion:

On the basis of solidifying the data quality monitoring and interface semantic monitoring capabilities, we hope to use dedicated improvement projects to extend the platform into other monitoring dimensions, such as page element monitoring. The intelligent customer service team is currently rolling out a UI automation compatibility testing project, and we are considering whether the three capabilities of APP UI automation testing, the real-device platform, and the monitoring platform can be combined to deliver page element monitoring.

Also, based on the pain points of business testing, can we use business process monitoring to quickly find online business process problems and reduce the incidence of low-level failures?

Co-construction and co-creation:

Our dream is that, through this technical construction, everyone can do software testing, and we will keep working toward it. The dream is big and the road is long, so we sincerely invite all interested colleagues to build the functional business monitoring capability together with us.

At the same time, we will actively promote and support the platform, and we hope it can better empower development and testing colleagues.