Introduction

This article shares the problems and challenges our colleagues have run into around data quality and interface semantic monitoring, how they built a functional business monitoring platform to provide data quality and interface semantic monitoring capabilities, and the practical results achieved. It also discusses how development and testing colleagues in other lines of business can reuse and build the same capabilities.

Background

Business characteristics

Strong dependence on data quality

Algorithm-driven business is characterized by a strong dependence on data quality, as captured by a well-known saying in the industry: “Data and features determine the upper limit of machine learning; models and algorithms only approach this upper limit.”

For example, the flow chart of the online algorithm model in the figure below shows that the model training stage needs large amounts of offline data and feature data to train the model. If the data quality is off, the trained model will inevitably produce distorted online predictions. Likewise, online prediction itself relies on real-time and offline data and features as its input; if there is a data quality problem, the predicted results cannot meet business requirements.

As another example, the Hello Map data ETL flow diagram below shows that map data must go through acquisition (purchasing data), cleaning (coarse and fine filtering), fusion (aggregating multiple copies of data), and loading (writing the data into ES), before the SOA services finally expose map data query capabilities to external callers. In such a long and complex ETL pipeline, any data quality problem leads to abnormal data in the online service.

Strong dependency on external services

Another feature of an algorithm-driven business is its strong dependence on external services. As the intelligent customer service system architecture diagram below shows, the system internally relies on the marketing platform, trading platform, payment platform, account platform, and risk control center. Externally, it relies on the voice conversion service (intelligent IVR service), the hotline service (Heli), and Alipay. Any abnormality in a dependent service affects the business flow of the intelligent customer service system.

Problem introduction

Battery swap scheduling

Scenario

Scenario 1: colleagues report that a site has many low-battery vehicles, but the battery swap scheduling algorithm does not push any battery swap task;

Scenario 2: battery swap operations and maintenance colleagues report that they receive a battery swap task, but on arriving at the site find that the vehicle is not low on battery.

Data processing process

Before analyzing the cause of the problem, let us first look at the data processing flow of the real-time data warehouse. As shown in the figure below, various kinds of data, such as app tracking events, binlog, and IoT data, are sent as Kafka messages through the access system to the data warehouse's Flink jobs, and are finally stored in ES for the various business consumers to query.

Cause analysis

Analyzing the problem scenarios against the data processing flow chart above, the causes are as follows:

Cause 1: The data warehouse's Flink jobs were under too much message-processing pressure, so message consumption piled up and data updates were delayed. As a result, the business side could not get the latest vehicle battery level data, which is why sites had many low-battery vehicles but no battery swap task was pushed.

Cause 2: The data warehouse's Flink jobs process battery data differently per battery type, for example first-generation vehicles (591), second-generation vehicles (371), third-generation vehicles (668), and piled vehicles (663). Vehicles whose number starts with 376 were not in the code's processing list, so their battery level defaulted to 0. The algorithm therefore judged 376 vehicles to be low on battery and pushed battery swap tasks to the operations colleagues, while the 376 vehicles at the actual sites were not low on battery at all.
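
To make the failure mode concrete, here is a minimal Python sketch of the kind of per-type branching described above. The type prefixes come from the text, but the function and field names are hypothetical, not the actual warehouse code; the safer pattern is to treat an unknown battery type as an explicit anomaly rather than silently defaulting the level to 0.

    # Hypothetical illustration of the per-battery-type branching described above.
    # Prefixes 591/371/668/663 come from the text; parsing details are assumptions.
    KNOWN_BATTERY_PARSERS = {
        "591": lambda msg: msg["gen1_soc"],   # first-generation vehicles
        "371": lambda msg: msg["gen2_soc"],   # second-generation vehicles
        "668": lambda msg: msg["gen3_soc"],   # third-generation vehicles
        "663": lambda msg: msg["piled_soc"],  # piled vehicles
    }

    def parse_battery_level(vehicle_no: str, msg: dict) -> float:
        parser = KNOWN_BATTERY_PARSERS.get(vehicle_no[:3])
        if parser is None:
            # Original bug: unknown types (e.g. 376) silently fell through to 0,
            # so the scheduler treated healthy vehicles as low on battery.
            raise ValueError(f"unrecognized battery type: {vehicle_no}")
        return parser(msg)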

Difficulties in offline testing

In our view, the difficulty of covering these scenarios in offline testing lies mainly in the high cost:

  1. Dedicated performance tests are required;
  2. The test scenario space is extremely large. For example, the vehicle data involves 10 message topics, each with its own sub-fields and sub-logic, so the number of scenario combinations is huge. At the same time, the message fields are adjusted flexibly and changes must be tracked and retested at any time;
  3. Test environment data is incomplete, so it is impossible to simulate test data rich enough to verify the Flink jobs' many scenarios in the test environment.

Site data update

Scenario

Operations and maintenance feedback: site data is not updated in a timely manner, for example errors in basic information such as site name, latitude and longitude, and capacity.

Data processing process

The site data update process is shown in the figure below. Regional governance and label management data for sites are updated into the oasis service through the APP and PC respectively. Oasis then synchronizes the updated content to other external processing services, such as the battery service, by sending MQ messages. The battery service consumes the MQ messages, performs secondary processing, and, once its local data is updated, exposes SOA services to the business side. In this process, MQ message push and consumption, as well as the battery service's secondary processing, can all introduce data exceptions.

Cause analysis

Analyzed against the data processing flow shown in the figure above, the problem scenarios are mainly caused by data differences introduced by improper MQ message pushing. For example:

Cause 1: When the APP side updates a site, it calls the update-label interface and the update-site interface at the same time. When the update-label interface happens to be called first, the service does not send an MQ message, so the battery service never receives the data change notification. (This issue has been fixed.)

Cause 2: A process problem. The old flow was: 1. update mongoDB; 2. check whether the mongoDB data was updated successfully; 3. send an MQ notification. The problem with this flow is that mongoDB has a primary-replica design: data is written to the primary node and then synchronized to the replicas, which introduces a time lag. When the data had been written to the primary but could not yet be read from a replica, the MQ message was not sent, causing the downstream data to lag. The flow was therefore changed to: 1. update mongoDB; 2. send the MQ notification. (This issue has been fixed.)
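
A minimal sketch of the revised flow, assuming a pymongo collection and a generic MQ producer with a send method (both hypothetical stand-ins for the actual services): the MQ notification is sent right after the primary-node write succeeds, instead of being gated on a read that may hit a lagging replica.

    # Hypothetical sketch of the fixed flow: write to the primary, then notify.
    # `sites` is a pymongo collection; `mq_producer.send` is an assumed MQ client API.
    def update_site(sites, mq_producer, site_id: str, fields: dict) -> None:
        result = sites.update_one({"_id": site_id}, {"$set": fields})
        if result.matched_count == 0:
            raise ValueError(f"site not found: {site_id}")
        # The old flow read back from a possibly lagging replica before notifying,
        # which could swallow the MQ message entirely. Now we notify directly.
        mq_producer.send(topic="site.updated", payload={"siteId": site_id, **fields})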

Difficulties in offline testing

In this scenario, the difficulty of offline testing lies mainly in covering the abnormal scenarios comprehensively, which demands a lot of skill and experience from testers.

Data consistency

Scenario

How do we ensure the consistency and timeliness of data synchronization?

Data Synchronization Process

The asset data synchronization process is shown in the figure below. After asset data lands in the database (DB), the updated data needs to be synchronized to ES in real time so that ES can provide the SOA query service externally. In this synchronization process: 1. How do we ensure that the tens of thousands to millions of daily data changes are correctly synchronized from the DB to ES? 2. How do we monitor the reliability of the synchronization service in real time, so that when it has a problem we can alert quickly and stop the loss in time?

Dependant Service Reliability

Scenario

Customer service colleagues occasionally complain that they cannot see the text converted from a user's recording, or that the converted text is wrong.

Service dependencies

As the dependency architecture diagram of the ASR service shows, the Ping An voice-to-text service is invoked inside the service system rather than accessed through an exposed interface, so we cannot monitor it via interface status codes or response times. In the failure scenario, voice simply cannot be converted to text, so we also cannot determine service unavailability just by checking whether the interface response content is empty. We therefore need to introduce semantic analysis, that is, "verify the response results through specifically designed inputs", in order to judge service availability effectively.

Summary

Given how important data quality and external services are to the algorithm business, and the scenarios exposed by the problems above, we need to build data quality monitoring and interface semantic monitoring capabilities.

The figure below shows the layered view of the online monitoring system. At the QA level, we mainly focus on monitoring dimensions strongly tied to business scenarios and to the correctness of business logic; we call this dimension functional business monitoring.

Solution

Monitoring purposes

Online monitoring generally serves two purposes:

  1. Improve the ability to perceive online problems and quickly find and locate problems
  2. Minimize the impact of online problems with minimal cost

We also hope to complete version 1.0 of data quality monitoring and interface semantics monitoring to complement other monitoring methods in the company.

Monitoring objects and scenarios

Data quality monitoring

Monitoring object

In data quality monitoring, the monitored objects span three layers. The first layer is the database table layer; here we rely on the company's DQC platform to ensure the quality of the data landing in tables after data scheduling tasks complete. The second layer is the data storage medium layer, and the third is the application service layer. For these two layers, data quality is checked mainly by comparing business logic correctness across multiple different data sources to surface data quality problems.

Monitoring scenarios

Taking the data quality monitoring of Hello Map as an example, this section introduces the three layers above and how monitoring scenarios are designed to cover them.

First, as shown in the figure below, data quality monitoring can be divided into single-data source quality analysis and multi-data source comparative analysis.

For single-data-source quality analysis, we focus on the database table layer and provide quality assurance on two fronts: the field level (fields being non-empty, fields being unique, value ranges, field format and precision, etc.) and the table level (data volume fluctuation).
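
As an illustration, here is a minimal Python sketch of this kind of single-source check, assuming the table has been pulled into a pandas DataFrame; the column names and thresholds are hypothetical, not the actual Hello Map rules.

    import pandas as pd

    def check_single_source(df: pd.DataFrame, prev_row_count: int) -> list[str]:
        problems = []
        # Field level: non-empty, unique, value range, format/precision
        if df["site_guid"].isna().any():
            problems.append("site_guid contains null values")
        if df["site_guid"].duplicated().any():
            problems.append("site_guid is not unique")
        if not df["capacity"].between(0, 500).all():
            problems.append("capacity out of expected range [0, 500]")
        if not df["longitude"].between(-180, 180).all():
            problems.append("longitude outside [-180, 180]")
        # Table level: data volume fluctuation compared with the previous run
        if prev_row_count and abs(len(df) - prev_row_count) / prev_row_count > 0.2:
            problems.append("row count fluctuated by more than 20%")
        return problems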

For multi-data-source comparative analysis, we look at data consistency: after stripping out known logical differences, we compare the same data obtained from different data sources.

Second, the Hello Map data quality monitoring scenario in the figure below shows how these two analysis approaches are applied in the data quality testing of Hello Map.

Interface semantic monitoring

Monitoring object

The monitored objects of interface semantic monitoring can be broken down by service type: Web services, RPC services, message services, and storage services.

Monitoring scenarios

From the perspective of service availability monitoring, there are three monitoring scenarios:

  1. Service availability: mainly monitors whether the service is alive and whether the returned status code is normal.
  2. Response time: monitors the end-to-end response time of a service call and raises an alarm when it exceeds the threshold.
  3. Semantic correctness: mainly monitors the content or fields of the service response, for example: a. whether certain rules are met, such as monitoring a returned price field for correct format (numeric format, precision length, etc.); b. whether key fields are valid, for example when our service strongly depends on a particular field from an external service.

The third scenario is covered by our interface semantic monitoring; the other two are covered by the company's monitoring platform. Together they form a good, complementary collaboration with the company platform.
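
For example, a hedged sketch of a semantic-correctness check on a price field, in line with scenario 3 above; the endpoint, field names, and rules are hypothetical, not the actual monitored service.

    import re
    import requests

    PRICE_PATTERN = re.compile(r"^\d+\.\d{2}$")  # numeric with two-decimal precision

    def check_price_semantics(url: str) -> list[str]:
        problems = []
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:                  # scenario 1: availability
            return [f"unexpected status code {resp.status_code}"]
        price = resp.json().get("price")
        if price is None:                            # scenario 3: key field present
            problems.append("price field is missing")
        elif not PRICE_PATTERN.match(str(price)):    # scenario 3: format and precision
            problems.append(f"price '{price}' violates the expected format")
        return problems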

Monitoring platform design ideas

Functional modules

The functional business monitoring platform is designed as four modules plus one platform. The four modules are:

  1. Data & service module: mainly responsible for reading different monitored data objects and supporting different service types;
  2. Rule and policy module: supports user-defined exception monitoring rules, user-defined alarm policies, monitoring schedules for arbitrary periods, user-defined alarm levels, masking rules, and so on (an illustrative configuration sketch follows below);
  3. Alarm and feedback module: mainly implements alarm channel management (individual, group, email, etc.) and the alarm handling process, integrates with the improvement work order system, and closes the loop from monitoring alarm to improvement;
  4. Basic function module: mainly covers user & permission management, task configuration management, alarm group & alarm information management, the data dashboard, report display, and other basic functions.

The "one platform" part refers to achieving ease of use, flexibility, and unified access capability through platform-oriented construction.
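
As a concrete illustration of what such a rule and alarm policy might look like, here is a hypothetical task configuration; the keys and values are illustrative only, not the platform's actual schema.

    # Hypothetical monitoring task configuration; keys and values are illustrative.
    TASK_CONFIG = {
        "task_name": "warehouse_vs_ossmap_battery_level",
        "schedule": "*/5 * * * *",           # run every 5 minutes
        "rules": {
            "level_tolerance": 2,            # allowed per-vehicle difference
            "ignore_fields": ["updated_at"], # fields excluded from comparison
        },
        "alarm": {
            "level": "P2",
            "channels": ["group_chat", "email"],
            "mute_window": ["00:00", "06:00"],  # masking rule: no alarms overnight
        },
    }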

Hierarchical system design

Based on the design principle of four modules plus one platform, we implemented the layered system design shown in the figure below.

Practical results

Algorithm test platform introduction

We ultimately built an AI testing platform on which we implemented the data quality monitoring and interface semantic monitoring capabilities, planning and developing the following five features:

  1. Data dashboard: mainly displays three kinds of information: a. the distribution of tasks across teams; b. key task execution metrics (cumulative & daily); c. a dynamic display of the report results of core or abnormal tasks. The goal is to see all the core information at a glance on the dashboard;
  2. Task management: task-management operations such as filtering, creating, editing, and triggering tasks, fetching execution logs, and displaying execution status;
  3. Monitoring report: displays the monitoring data of all tasks, such as task information, presents monitoring data in various forms, and highlights outliers;
  4. Alarm management: collects all alarm issues and handles the entire alarm process, such as notification, problem confirmation, improvement work order submission, and tracking the improvements to completion, finally closing the monitoring loop;
  5. Data correction: provides correction and repair capabilities for abnormal data discovered by monitoring. Because data correction is high-risk, it also needs functions such as permission management and data rollback.

Case sharing

Data quality monitoring

Overview

By the end of February 2021, the data quality monitoring platform had onboarded a total of 8 data monitoring tasks, coming from the algorithm platform, the supply chain, and the battery service respectively. We found 18 problems in total, of which 61% were found by offline testing and 39% by online monitoring.

Case sharing

The problem introduction section above already described the problem scenarios, the data processing flows, and the root causes in detail. This section mainly covers how we do the data comparison and analyzes the problems found.

Battery swap scheduling
Data comparison method

According to the data processing flow of the real-time data warehouse, the data source is the tracking-event messages reported through Kafka, and the external serving mode is ES queries. Within the company, other teams use the same data source to provide external services, such as the OssMap service of the vehicle service team. We therefore compare the real-time data warehouse's ES query data with the SOA interface data of the vehicle service's OssMap (used as the reference) to verify the data quality of the real-time warehouse.

Some colleagues may wonder: can we directly compare the warehouse's ES data with the OssMap data?

The answer is no. There are two main differences: 1. timeliness; 2. data processing logic.

Timeliness: although both services consume the same Kafka topics, they process messages at different speeds. For fields with high real-time requirements, such as a vehicle's in/out-of-site status or its battery level, the difference in processing efficiency between the two services inevitably produces a certain amount of discrepancy.

Data processing logic: different services handle the same message content differently. For example, the vehicle tag ALERT_OVER (out of the service area) is returned as 8 in the warehouse's ES but as 101 in the OssMap service.

So how do you solve these two differences?

The answer: for the timeliness problem, we use a dynamically configured tolerance to absorb the discrepancy. For example, we configure the battery level tolerance as 2, meaning that for the same vehicle the battery level reported by the two services is allowed to differ by at most 2. That is:

    # Tolerance check for the same vehicle's battery level from the two services
    if abs(warehouse_battery_level - ossmap_battery_level) > 2:
        trigger_alarm()  # the two services diverge beyond the tolerance
    # otherwise the two services are considered consistent and no alarm is raised

For the data processing logic differences, we use a transformation mapping to align the data formats returned by the two services. For example, taking the vehicle service's label format as the baseline, we convert the warehouse's labels into the vehicle service's label format and then compare them.
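
A minimal sketch of this transform-then-compare step: the mapping values for ALERT_OVER come from the text, while everything else is a hypothetical illustration.

    # Map warehouse tag codes onto the OssMap (vehicle service) tag codes before comparing.
    # ALERT_OVER: warehouse returns 8, OssMap returns 101 (from the text); others are assumed.
    WAREHOUSE_TO_OSSMAP_TAG = {
        8: 101,   # ALERT_OVER: vehicle outside the service area
        # ... other tag mappings would be maintained here
    }

    def tags_match(warehouse_tags: list[int], ossmap_tags: list[int]) -> bool:
        normalized = {WAREHOUSE_TO_OSSMAP_TAG.get(t, t) for t in warehouse_tags}
        return normalized == set(ossmap_tags)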

Problem analysis

In the data quality monitoring scenario of battery swap scheduling, we found a total of 8 bugs, including 4 problems found in offline testing and 2 found by online monitoring.

Classified by problem type, they fall into two categories: data processing logic problems (4, or 67%) and non-functional problems (2, or 33%).

Non-functional problems:

  1. Capacity problem: the maximum result size of the service's ES query has to be configured, and the developer set it too small, so the number of mopeds returned for a site was lower than the actual number;
  2. Performance problem: the message pile-up and delayed data updates caused by excessive Flink message consumption pressure mentioned above.

Data processing logic problems:

  1. Errors in filter conditions, for example a vehicle tagged "bike.in.site" while isinsite=0, or a vehicle that has a StationGuid but no CityGuid;
  2. Requirement omissions, that is, some statistical fields or statistical rules for the data were left out.

Site data update
Data comparison method

According to the data processing flow of site data updates, site data is first updated in the oasis service, the battery service is then notified via MQ message to update its own data, and finally the battery service exposes the data externally. In theory, therefore, the oasis service data and the battery service data should be consistent. So after aligning the data formats of the oasis service and the battery service, we can directly compare the data to monitor data quality.

Problem analysis

In the data quality monitoring scenario of site data updates, covering the battery service comparison task mentioned above and the site data comparison task inside the algorithm platform, we found a total of 11 problems, including 7 found in offline testing and 4 found by online monitoring.

Classified by problem type, they fall into two categories: data processing logic problems (7, or 64%) and non-functional problems (4, or 36%).

Non-functional problems:

  1. Message consumption & push problems, such as message consumption failures, or the APP update mechanism mentioned above causing message pushes to be skipped;
  2. Concurrency problems, for example the site data inside the algorithm platform is stored in Redis without a concurrency locking mechanism, so concurrent updates overwrite each other;
  3. Data update synchronization problems, for example when new sites are added in a city, the data warehouse's Redis updates the site information (refreshed offline once a day) but not the city information, so the intelligent scheduling data center cannot find the site's vehicle data when it queries by site and city ID.

Data processing logic problems:

  1. Value errors: a fairly common problem of reading data from the wrong field;
  2. Data processing logic errors, for example logic meant to filter out expired dirty data that, due to a bug, also filtered out other basic data;
  3. Requirement omissions, that is, some statistical fields or statistical rules for the data were left out.

Data consistency
Data comparison method

From the asset data flow, we mainly need to ensure consistency between the two data sources, the asset DB and the ES cluster, to guarantee the reliability of the data synchronization task. Specifically:

  1. With tens of thousands to millions of data changes per day, how do we make sure everything is synchronized correctly? Answer: we periodically query all updated data from the asset DB and compare it with the data in ES to ensure the consistency of all synchronized data.
  2. How do we monitor the reliability of the synchronization service in real time? Answer: we compare DB and ES data consistency in near real time with high-frequency (minute-level), lightweight checks (only the most recently updated data is checked), and from that infer the health of the synchronization service (see the sketch below).
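A minimal sketch of the minute-level lightweight check, with the DB and ES access injected as callables so that no specific client API is assumed; asset_id and version are hypothetical field names.

    from typing import Callable, Optional

    def check_recent_sync(
        fetch_recent_from_db: Callable[[int], list[dict]],
        fetch_from_es: Callable[[str], Optional[dict]],
        window_minutes: int = 5,
    ) -> list[str]:
        """Lightweight consistency check over rows updated in the last few minutes."""
        mismatches = []
        for row in fetch_recent_from_db(window_minutes):
            doc = fetch_from_es(row["asset_id"])
            if doc is None:
                mismatches.append(f"{row['asset_id']}: missing in ES")
            elif doc.get("version") != row.get("version"):
                mismatches.append(f"{row['asset_id']}: ES is stale (version mismatch)")
        return mismatches  # a non-empty list would trigger an alarm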

Interface semantic monitoring

Overview

By the end of February 2021, the interface semantic monitoring platform had onboarded two monitoring tasks, one from the algorithm platform and one from the supply chain. We found a total of 1 online problem.

Case sharing

The problem scenario and the service dependency architecture were described in detail in the problem introduction section above. This section mainly covers how the interface semantic monitoring is done and analyzes the problem found.

Customer service voice conversion
Interface semantic monitoring mode

For such scenarios requiring semantic monitoring of interfaces, we adopt the method of active detection + semantic analysis:

Active detection means proactively calling the Ping An voice-to-text service at high frequency (minute level).

Semantic analysis means preparing input voice files in advance, for example a recording whose content is "hello", calling the voice conversion service, and checking the response content. If the response text is "hello", the check passes; if it is not, an alarm is raised.
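
A hedged sketch of this probe, assuming a hypothetical HTTP wrapper around the voice-to-text service; the URL, payload shape, and response field names are illustrative only.

    import requests

    EXPECTED_TEXT = "hello"            # transcript we expect for the canned audio file
    PROBE_AUDIO = "probe_hello.wav"    # prepared recording whose content is known

    def probe_asr(service_url: str) -> bool:
        """Active detection: call the voice-to-text service with known audio and
        verify the transcript (semantic analysis). Returns True when healthy."""
        with open(PROBE_AUDIO, "rb") as f:
            resp = requests.post(service_url, files={"audio": f}, timeout=10)
        if resp.status_code != 200:
            return False
        transcript = resp.json().get("text", "")
        return transcript.strip().lower() == EXPECTED_TEXT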

Problem analysis

By March 2021 we had detected 3 occasions on which the Ping An voice-to-text service was unavailable. The root cause: after a change to the hotline service (Heli) on which the customer service system depends, too many invalid recordings were pushed, which made the ASR service unavailable.

Platform access

At this point, colleagues with data quality monitoring or interface semantic monitoring needs may ask: how do we get onto the platform? How do we acquire these monitoring capabilities?

For platform access, we first split the roles in two. One is the demand side, that is, colleagues who have monitoring needs and want to use the platform's capabilities to bring data quality and interface semantic monitoring tasks online quickly. The other is the platform side, that is, the maintainers of the monitoring platform (contact the algorithm testing team directly).

For the demand side, bringing a monitoring task online takes only four steps:

  1. Requirement alignment: prepare the requirement document in advance and go through it with the platform side;
  2. Monitoring rule script: write the multi-data-source comparison rules following the monitoring-script Python project template (fully independent, with no constraints on conditions); see the sketch after this list;
  3. Script verification: after submitting the monitoring rule script to GitLab, run the data comparison in the test environment to verify the script;
  4. Monitoring task launch: once offline testing is complete, the monitoring task can go live, be monitored on the platform, and deliver alarm feedback through the alarm channels.
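As an illustration of step 2, here is what a minimal multi-source comparison rule might look like; the article does not show the template's real interface, so the class name, method names, and data access below are all hypothetical.

    # Hypothetical rule script; the real platform template's interface may differ.
    class SiteCapacityRule:
        """Compare site capacity between the oasis service and the battery service."""

        def fetch_sources(self) -> tuple[dict, dict]:
            # In a real script these would call the two services' SOA/HTTP interfaces.
            oasis_data = {"site_001": 20, "site_002": 15}
            battery_data = {"site_001": 20, "site_002": 14}
            return oasis_data, battery_data

        def compare(self) -> list[str]:
            oasis_data, battery_data = self.fetch_sources()
            diffs = []
            for site_id, capacity in oasis_data.items():
                if battery_data.get(site_id) != capacity:
                    diffs.append(f"{site_id}: oasis={capacity}, battery={battery_data.get(site_id)}")
            return diffs  # a non-empty list would trigger an alarm on the platform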

Future planning

Going forward, we will continue to build the functional business monitoring platform along the following three dimensions:

Current capabilities:

By the end of February 2021, we had completed a first version of the data dashboard, task management, and monitoring report capabilities. While continuing to improve these three functions, we plan to build and roll out alarm management and data correction. At the same time, as the monitoring requirements we onboard become more diverse and complex, we will keep improving the platform's reliability and the validity of its alarms.

Business capability expansion:

On the basis of solidifying the data quality monitoring and interface semantic monitoring capabilities, we hope to use dedicated improvement projects to extend the platform into other monitoring dimensions, such as page element monitoring. The intelligent customer service team is currently rolling out a UI automation compatibility testing project, and we are considering whether the three capabilities of APP UI automation testing, the real-device platform, and the monitoring platform can be combined to deliver page element monitoring.

Also, based on the pain points of business testing, can we use business process monitoring to quickly find online business process problems and reduce the incidence of low-level failures?

Co-construction and co-creation:

Our dream is that, through this technical construction, everyone can do software testing, and we will keep working toward it. The dream is big and the road is long, so we sincerely invite all interested colleagues to build the functional business monitoring capability together with us.

At the same time, we will actively promote and support the platform, and we hope it can better empower development and testing colleagues.