The TiD2020 Quality Competition Conference invited Zhao Ming, head of the Quality and Engineering Efficiency Department of Good Future AI Zhongtai, to deliver a speech titled “AI Engineering Performance Improvement Platform R&D Practice”.

Organized around three keywords, accuracy, speed and stability, Zhao Ming shared the architecture and practice of the AI algorithm metrics and evaluation platform, the AI microservice performance testing platform, and the data set management platform, together with corresponding case demonstrations.

Accuracy: the AI algorithm needs to meet the requirements of actual business scenarios and achieve high accuracy.

Speed: after an image is processed by the algorithm, the result is returned to the user's app/web client. The end-to-end latency has to be evaluated, for example whether a response can come back within 200~300 milliseconds. The microservice performance testing platform was built for this requirement.

Stability: when voice, images or text are sent to the model, there should be no lag, let alone a model crash; this depends heavily on the robustness of the model.

Good Future AI Zhongtai is committed to integrating AI technology into education scenarios, creating a variety of AI capabilities, and continuously improving learning experience and efficiency. Model development around this AI technology falls into three main areas.

**First, speech**, which includes speech recognition, speech evaluation, emotion recognition and sentiment analysis. For example, ASR converts speech into text and is used to measure fluency and accuracy in spoken English, all within education scenarios.

**Second, image**, such as printed-text OCR, photo-based question search, notes grading and content review for education scenarios, where image algorithm models help improve efficiency.

**Third, data mining**, part of which relates to NLP (natural language processing), such as text keyword search, classroom feature statistics, highlight capture, and evaluation of Chinese oral expression ability. This series of model capabilities is deployed on the PaaS platform as AI microservices for each client to call.

(1) Frequent algorithm testing without automation

At the algorithm level, the model changes constantly, for example to improve recall rate, improve robustness, or incorporate new data. On the servitization side, performance optimizations and interface changes also require complete testing, so testing is very frequent. This process had not been automated and was inefficient.

(2) No visibility into industry leadership

Algorithm engineers use industry leadership as an objective basis for formulating KPIs, for example "50% of AI capabilities rank first in the industry." How do we evaluate such a goal? It has to be quantified, and there has to be a platform that presents the metrics. Without such a platform, industry leadership cannot be perceived at all.

(3) High cost of performance evaluation after performance tuning

Whenever the algorithm model or the service is tuned or scaled out, an end-to-end performance evaluation is needed. As the frequency of these evaluations grows, so does the cost.

(4) Data fragmentation

Data is the lifeblood of AI. Every link in AI production and research depends on data, and data needs to flow between roles: products, algorithms, engineering R&D and testing all require unified data management. Without a unified platform, data from different business parties, including data of different versions, becomes highly fragmented and hard to manage and maintain.

Based on these pain points, we made improvements through platform construction, hoping to solve the four kinds of business pain points above by building an algorithm evaluation platform, a performance testing platform and a data set management platform.

(I) Tool platform system construction

Tool building is about empowering each role. For product requirements, including how the product performs online, many tools exist for link tracing. For algorithms, such as the core AI capability models, tool chains are also needed to help improve algorithm output. Servitization wraps the models as interfaces for external invocation. Across the whole process, we need to ensure quality and conduct full-link monitoring through the DevOps operation and maintenance system.

In terms of quality, product is responsible for the quality of requirements, algorithms for the quality of the AI models, development for the quality of the code, testing for the end-to-end quality of the product, and operation and maintenance for the quality of the product after launch. Every link depends on the support of data and quality-related tools.

(II) AI algorithm metrics and evaluation platform architecture and practice

1. Scenario requirements and users

The first scenario, without annotated data: AI algorithm BadCase screening. No data annotation is required in advance; BadCases can be screened out automatically by comparing against competing products.

The second scenario, without annotated data: measuring algorithm accuracy on new data. We take new data and measure algorithm accuracy on this platform.

The third scenario, with annotated data: competitive evaluation of AI algorithm metrics and leadership analysis. Based on well-annotated data, algorithm metrics are computed and evaluated objectively and quantitatively, including analysis of industry leadership.

The fourth scenario, with annotated data: AI algorithm metric evaluation and analysis. For annotated data, the algorithm's metrics can be further analyzed and evaluated, including finding BadCases and feeding them back to the algorithm team for further model optimization.

For these four scenarios, our main users are algorithm engineers, algorithm testers and product managers.

2. Technical architecture

The technical architecture of the AI algorithm metrics and evaluation platform is divided into a basic function and interface call layer, a logical abstraction layer, and a UI layer.

The basic function and interface call layer includes permission management, data source management, storage management, analysis instance management and report management. Above it, calls are made through interfaces, both self-developed interfaces and competing-product interfaces, which are managed uniformly through a script library, for example by standardizing input parameters and return formats.

The logical abstraction layer integrates the basic functions into workflows, either to compute algorithm metrics or to process data end to end. Data needs to be preprocessed: desensitized, noise-reduced, cleaned, rotated, or filtered for empty audio. The evaluation rules for algorithm metrics differ between models; for speech the metric is word error rate, while for images it is recall rate. So how can they be normalized and managed uniformly? AI capability registration unifies newly developed capabilities: how do you register a capability with the platform while the underlying logic and the upper UI remain unchanged? For example, when growing from 10 capabilities to 100, you want to minimize the front-end and back-end changes and refactoring caused by registering new capabilities. In short, this layer separates the underlying infrastructure from the business of the upper AI microservices. For result export, different templates correspond to different algorithm models and produce the corresponding processing result reports, which can be exported as tables for subsequent report consolidation and analysis.
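To make the capability registration idea concrete, here is a minimal, hypothetical sketch (all names are invented, not the platform's actual code) of how a capability could be described declaratively, so that adding the 11th or the 100th capability only adds a registry entry rather than changing the engine or the UI:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Capability:
    name: str                           # e.g. "asr", "print_ocr"
    endpoint: str                       # self-developed service address
    competitor_endpoints: List[str]     # competing-product APIs to compare against
    preprocessors: List[Callable]       # desensitize, denoise, rotate, drop empty audio...
    metric: Callable                    # e.g. WER for speech, recall for images

REGISTRY: dict = {}

def register(cap: Capability) -> None:
    """Registering a capability only touches this table, not the engine or UI."""
    REGISTRY[cap.name] = cap

# Toy registration: a trivial metric and a no-op preprocessor, just to show the shape.
register(Capability(
    name="asr",
    endpoint="http://asr.example.internal/v1/transcribe",
    competitor_endpoints=["http://vendor-a/asr", "http://vendor-b/asr"],
    preprocessors=[lambda audio: audio],
    metric=lambda pred, ref: float(pred == ref),
))
print(REGISTRY["asr"].metric("today is friday", "today is friday"))  # 1.0
```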

On top of the logical abstraction layer there is also a multi-competitor processing engine, which performs automatic BadCase analysis on unlabeled data, including comparison against other competing products, to find our own shortcomings.

The UI layer provides evaluation instances, visual orchestration, BadCase screening and report presentation through visual interfaces.

3. Practical effect

The AI algorithm metrics and evaluation platform can provide algorithm BadCases to algorithm engineers as a reference for model optimization. For example, an OCR recognition error: during image recognition, noise caused by printing problems affects recognition, and in the figure above “____in” is recognized as “-tin”. Therefore the robustness of the model itself needs to be improved, with special tuning for such scenarios. An ASR transcription error is an inconsistency between the speech data and the annotation when converting speech to text. For example, the annotation is “how many offers our department has this year”, but the accuracy of models A and B is not very good: model A transcribed it as “we did not get many points this year”, and model B as “how many points did our department get this year”; “offer” and “points” cannot be well distinguished.

We list these errors, consolidate the BadCases into unified reports, and provide them to algorithm engineers for model optimization.

4. Metrics for algorithm model evaluation

Evaluating a model requires a set of objective metrics.

The statistical metrics for images include accuracy, precision, recall, F1 and so on. Precision, recall and accuracy can be computed from the TP, FN, FP and TN counts, and the resulting recall and accuracy can be compared with the previous version to see what needs improvement. For speech, WER/CER (word/character error rate) serve as the statistical metrics; the lower, the better. The metric is calculated as the number of insertion, deletion and substitution errors divided by the total number of words. Metrics are embedded into the platform in the form of scripts.
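As a reference, a minimal sketch of how such metric scripts might look (simplified, not the platform's actual scripts): classification metrics computed from the TP/FP/FN/TN counts, and WER computed as a word-level edit distance.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (insertions + deletions + substitutions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(classification_metrics(tp=90, fp=10, fn=5, tn=95))
print(word_error_rate("how many offers our department has this year",
                      "how many points our department got this year"))  # 0.25
```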

For performance, CPU, memory, GPU usage and so on serve as the statistical metrics. For stability, crash rate and memory leaks are the main metrics. Through these dimensions we can objectively evaluate whether the model is accurate, fast and stable.

5. Demo 1: Automatic algorithm BadCase screening

This method applies to scenarios without annotated data.

First, in the visual orchestration interface, we drag and drop the data source. The data source is a set of data on FTP, essentially a folder holding the images under test. We preprocess the images, for example desensitization and automatic deskewing; since a picture may not be upright, models are used to straighten it, which makes analysis easier. The desensitized data is then sent simultaneously to three external competing models for comparison.

BadCase screening is similar to a voting method. Suppose there are three models A, B and C. If all three return “Today is Friday” after speech analysis, that consensus is highly credible; if the self-developed result differs from the result of these three models, it is probably a BadCase. The advantage is that no annotation is needed: data can be grabbed from production and screened automatically. After model processing, some post-processing removes punctuation marks, spaces, carriage returns and other non-critical factors. The BadCases are then displayed on the page, where the threshold can be adjusted: the higher the score, the greater the likelihood of a real BadCase. For example, a BadCase score of 0.9 is very likely to be a real BadCase, while 0.2 could be a false positive.
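A minimal sketch of this voting heuristic, with invented function names and a simple normalization step standing in for the real post-processing:

```python
import re
from typing import List

def normalize(text: str) -> str:
    """Post-processing as described above: drop punctuation, spaces and line breaks."""
    return re.sub(r"[\s,.。，！？!?;；:：'\"“”]", "", text).lower()

def badcase_score(self_result: str, competitor_results: List[str]) -> float:
    """Fraction of competitor models that agree with each other but
    disagree with the self-developed result (voting heuristic)."""
    self_norm = normalize(self_result)
    comp_norms = [normalize(r) for r in competitor_results]
    if not comp_norms:
        return 0.0
    # Take the most common competitor answer as the voted reference.
    voted = max(set(comp_norms), key=comp_norms.count)
    if voted == self_norm:
        return 0.0  # self-developed model matches the consensus
    return comp_norms.count(voted) / len(comp_norms)

# Example: three competitors agree, the self-developed model differs.
score = badcase_score("Today is Thursday",
                      ["Today is Friday", "Today is Friday.", "today is friday"])
print(score)  # 1.0 -> very likely a BadCase; a screening threshold (e.g. 0.9) is applied on the page
```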

For printed tables, we carry out OCR evaluation of image-to-text information capture, including GoodCases and BadCases. Unlabeled data can be analyzed automatically through the platform, and the BadCases found can be provided to algorithm engineers for further iteration and optimization of the model.

6. Demo 2: Competitive evaluation of algorithm metrics

This method applies to annotated data.

Start by creating an instance and naming the version in the visual orchestration interface.

We drag in data from FTP for model analysis and feed it into three models at the same time, one of which is the self-developed model. We want to see the gap between the self-developed model and the competing models, and finally we import the annotated data for analysis.

The labeled data serves as the standard answer, so you end up with three sets of metrics, one per model. That way we can analyze how industry-leading our models are, which helps the algorithm team set KPIs.

The quantitative evaluation metrics for OCR have two dimensions: sequence accuracy and character accuracy. These two accuracies evaluate the effect of the model. At different iteration points, the evaluation results for each model will differ: in some dimensions we may be ahead, in others behind. These metrics show our position in the industry.
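As an illustration only (the platform's exact definitions may differ), sequence accuracy can be computed as the exact-match rate across samples, and character accuracy approximated from matching character blocks:

```python
from difflib import SequenceMatcher
from typing import List

def sequence_accuracy(preds: List[str], refs: List[str]) -> float:
    """Fraction of samples whose predicted text exactly matches the annotation."""
    exact = sum(p == r for p, r in zip(preds, refs))
    return exact / max(len(refs), 1)

def character_accuracy(preds: List[str], refs: List[str]) -> float:
    """Share of reference characters recovered (matching-blocks heuristic)."""
    matched = total = 0
    for p, r in zip(preds, refs):
        matched += sum(b.size for b in SequenceMatcher(None, r, p).get_matching_blocks())
        total += len(r)
    return matched / max(total, 1)

refs  = ["x+y=2", "a^2+b^2=c^2"]
preds = ["x+y=2", "a^2+b^2=e^2"]
print(sequence_accuracy(preds, refs))   # 0.5
print(character_accuracy(preds, refs))  # ~0.94
```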

Of course, we can also filter, by different dimensions or by different competing products. Below is a BadCase across the three models; on the far left is the self-developed version.

The self-developed version performs better than the other two competing products, so it has fewer BadCases. This page compares models A, B and C against the standard data: if a result exactly matches the standard data, it is correct; if they are inconsistent, the recognition is wrong.

For the printed formula recognition model, an image is input and its content is output as text. Recognizing some complex formulas requires stronger model robustness, which is difficult. As demonstrated earlier, we can measure this with the same metrics: the platform computes the result according to the defined logic, and then the industry leadership can be seen.

This platform can be used by both algorithm and testing colleagues. Product managers can also use it to evaluate new data without involving algorithms or testing: they do not need to understand the complex computational logic, and through drag-and-drop visual orchestration they can help engineers draw conclusions quickly.

(III) AI microservice performance testing platform architecture and practice

1. Scenario requirements and users

The AI microservice performance testing platform addresses the “fast” problem.

The first requirement is test environment management and sharing for both algorithms and servitization. Algorithm engineers have a local R&D environment, and servitization development has its own environment. Many model versions are involved, as well as versions of base dependencies such as Python libraries. Keeping these environments consistent should be as simple as possible, and the simplest way is containerized management and environment sharing.

The second requirement is automated deployment. Once a build is submitted for testing, everything needed for performance testing, such as the environments and scripts, should be prepared automatically.

The third requirement is automated pressure exploration, that is, automatically probing for the maximum TPS or the maximum number of concurrent requests.

The fourth requirement is performance bottleneck analysis, which is the key part. Processing speed is involved: for the model itself, compression and pruning can improve processing speed and reduce wasted hardware resources.

The users of the AI microservice performance testing platform are service testers, algorithm testers, R&D engineers and product managers. R&D and product managers can take the performance metrics and discuss with customers whether they meet the business needs.

2. Technical architecture

The architecture of the AI microservice performance testing platform includes a data source layer, an interface layer and a UI layer.

At the bottom is the data source layer, where Prometheus is used for remote monitoring and a database provides persistent storage.

The middle layer is the interface layer, containing the automated pressure exploration logic, including remote execution and deployment, metric persistence and calculation, and remote terminal control with one-click login.

The upper layer is the UI layer, accessed through the platform interface, covering resource application and release, automated pressure exploration, bottleneck analysis and so on.

3. Practical effect

The platform realizes unattended automated pressure exploration: using a binary search (dichotomy) algorithm, the system's maximum TPS or maximum number of concurrent requests can be found in the shortest time.

The binary search works like this: concurrency starts at 100 as the initial pressure, with a step size of 100. When the TPS inflection point appears, as shown in the figure above at 300, the pressure is brought back down into the range between 200 and 300, that is, to 250. Without this platform, we would have to change the concurrency manually every time. Since there are many service types, some based on HTTP and some on WebSocket, the underlying implementation is based on JMeter for simplicity and generality, replacing the configuration in JMX files through template substitution logic. The open-source JMeter tool is quite powerful.
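A minimal sketch of this dichotomy-style exploration, assuming a `run_load_test(concurrency)` helper that returns whether a round passed (a stand-in for actually driving JMeter):

```python
def explore_max_concurrency(run_load_test, start=100, step=100, granularity=20):
    """Ramp up by `step` until a round fails, then bisect between the last
    passing and the first failing pressure until the gap is below `granularity`."""
    low, high = 0, start
    # Phase 1: ramp up with a fixed step to bracket the inflection point.
    while run_load_test(high):
        low, high = high, high + step
    # Phase 2: binary search inside [low, high).
    while high - low > granularity:
        mid = (low + high) // 2
        if run_load_test(mid):
            low = mid
        else:
            high = mid
    return low  # highest pressure that still passed

# Toy stand-in: pretend the service degrades above 260 concurrent requests.
print(explore_max_concurrency(lambda c: c <= 260))  # 250
```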

Manual pressure testing averaged about 10 rounds per AI microservice with varying concurrency. One round takes 60 minutes, so 10 rounds take 600 minutes, which is quite time-consuming. Each round requires changing the script, starting the pressure test and observing every indicator: if the server reports errors, reduce the pressure; if the metrics are normal, keep increasing it. The work is labor-intensive and repetitive. With automation the effect is very obvious: the 600 minutes can be reduced to about 10 minutes, and after a final review of the data and report, a conclusion can be drawn directly, very fast.

Demo 3: Automated pressure exploration and bottleneck analysis

First apply for a pressure machine (load generator). Once it is allocated, we can log in with one click through a terminal connection and run any commands there. JMeter-native scripts can then be pushed and deployed to the newly allocated pressure machine, for example a 30-second constant-pressure script, and then monitored. All the work can be completed on this platform: resource application, remote script push, deployment and execution, result output, and monitoring data collection and display. The platform can be used as a workbench to configure various scripts according to actual needs, including the related output of the result report. The report is presented as a JMeter-native report, in which TPS and other indicators are displayed.
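Since the JMX scripts are parameterized through template replacement, the substitution step might look roughly like this (the placeholder names and the property fragment are illustrative and simplified, not the platform's actual template):

```python
from string import Template

# Simplified fragment of a JMeter .jmx test plan with template placeholders.
JMX_TEMPLATE = Template("""
<ThreadGroup>
  <stringProp name="ThreadGroup.num_threads">${threads}</stringProp>
  <stringProp name="ThreadGroup.ramp_time">${ramp_seconds}</stringProp>
  <stringProp name="ThreadGroup.duration">${duration_seconds}</stringProp>
</ThreadGroup>
<HTTPSamplerProxy>
  <stringProp name="HTTPSampler.domain">${target_host}</stringProp>
  <stringProp name="HTTPSampler.path">${api_path}</stringProp>
</HTTPSamplerProxy>
""")

def render_jmx(threads: int, host: str, path: str, duration: int = 60) -> str:
    """Fill one round's pressure parameters into the JMX template before
    pushing the script to the pressure machine."""
    return JMX_TEMPLATE.substitute(
        threads=threads,
        ramp_seconds=5,
        duration_seconds=duration,
        target_host=host,
        api_path=path,
    )

print(render_jmx(threads=100, host="ocr-service.example.internal", path="/v1/recognize"))
```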

At the same time, we can also run automated pressure exploration scenarios. The initial pressure is 100 and the step is 100. Each round here takes about 60 seconds, though in practice it is longer. The granularity is set to 20 and serves as the exit threshold: exploration stops once the difference between the maximum successful pressure and the minimum failed pressure is smaller than this value. The finer the granularity, the more rounds are run.

As can be seen from the monitoring chart, after a period of operation TPS reached an inflection point at a concurrency of 300, that is, a TPS bottleneck appeared, so the pressure needs to be brought back down to 250.

When the run finishes, you can see the report: at a concurrency of 100 the TPS is 551, at 200 it rises to 733, and at 300 it drops back to 722, so the pressure is brought back down to 250 and then 225. This way we know under what concurrency the system reaches its maximum TPS and throughput. This is the TPS scenario; there is also a concurrency scenario: how many concurrent requests can be tolerated without errors. The results link to the JMeter-native report. The measurement is end to end, covering network transmission, servitization processing, and the duration of the model calls.

We can choose a time window to analyze the monitoring results and bottlenecks. We watch the major indicators, such as memory leaks. The memory leak threshold is configurable, with a default of 5%: if memory at the end of the window has grown more than 5% compared with the start, a leak is suspected. This threshold can be adjusted, and we plan to make the check smarter later.
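The leak heuristic can be sketched as a simple start-versus-end comparison over the sampled memory series (a simplified stand-in for the platform's logic):

```python
from typing import List

def suspected_memory_leak(samples_mb: List[float], threshold: float = 0.05) -> bool:
    """Compare memory at the end of the window with memory at the start;
    growth beyond `threshold` (default 5%) flags a suspected leak.
    A few points are averaged at each end to smooth out spikes."""
    if len(samples_mb) < 4:
        return False
    k = max(1, len(samples_mb) // 10)
    start = sum(samples_mb[:k]) / k
    end = sum(samples_mb[-k:]) / k
    return start > 0 and (end - start) / start > threshold

print(suspected_memory_leak([6100, 6150, 6200, 6180, 6210, 6400, 6600, 6900]))  # True
```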

CPU usage is also checked against a threshold: if the threshold allows 80% and the average usage is 90%, an error is reported that the configured ceiling has been exceeded; network traffic is handled the same way. This is the dashboard for overall monitoring. A back-end logic service pulls the data back from Prometheus, processes it, and returns it through the interface for the front end to render. In this way pressure exploration runs unattended: for example, you leave work on Friday night, run an automated script that keeps searching for the maximum TPS or concurrency, and get all the reports on Monday morning; you can even run multiple groups in parallel. This saves time.
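A rough sketch of how such a back-end check might pull data from Prometheus and compare the average against the 80% ceiling (the address and the PromQL expression are placeholders, assuming a node_exporter-style metric):

```python
import time
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder address

def average_cpu_usage(instance: str, minutes: int = 30) -> float:
    """Pull CPU usage for the pressure window from Prometheus and average it."""
    query = (f'100 - avg(rate(node_cpu_seconds_total'
             f'{{mode="idle",instance="{instance}"}}[1m])) * 100')
    end = time.time()
    resp = requests.get(f"{PROMETHEUS}/api/v1/query_range",
                        params={"query": query, "start": end - minutes * 60,
                                "end": end, "step": "15s"},
                        timeout=10)
    values = resp.json()["data"]["result"][0]["values"]
    return sum(float(v) for _, v in values) / len(values)

# Example: flag the round if the average CPU exceeded the 80% ceiling.
# if average_cpu_usage("10.0.0.12:9100") > 80.0:
#     print("CPU usage exceeded the configured 80% ceiling")
```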

This can be seen as agile testing: our tools and platform keep shortening the feedback cycle for algorithms and development.

(IV) Data set management platform architecture and practice

1. Scenario requirements and users

The data is mainly used for training and testing in different scenarios. Classifying it requires different tags, such as which business a piece of data belongs to, which version it belongs to, and whether it is positive or negative data, so that it can be applied to the corresponding categories of algorithm models; this is essentially the evaluation taxonomy.

Algorithm engineers may use some of this data. Black-box data, which is invisible to algorithm engineers, is used for objective evaluation; other data is for algorithm self-testing and can be used together with smoke-test data for algorithm testing.

We also need to query the data and its annotation status: which data was collected in the past, which has already been well annotated, which has not been annotated or was annotated poorly, and which portion needs to be re-annotated.
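A hypothetical sketch of how such data set records and queries might be represented (all field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    dataset_id: str
    business: str        # e.g. "speech_evaluation", "print_ocr"
    version: str         # e.g. "v2.3"
    polarity: str        # "positive" / "negative"
    annotated: bool      # whether ground-truth labels exist
    visibility: str      # "black_box" (hidden from algorithm engineers) or "self_test"

def pick(records, **conditions):
    """Filter data sets by any combination of the tags above."""
    return [r for r in records if all(getattr(r, k) == v for k, v in conditions.items())]

catalog = [
    DatasetRecord("ds-001", "print_ocr", "v2.3", "positive", True, "black_box"),
    DatasetRecord("ds-002", "print_ocr", "v2.3", "negative", False, "self_test"),
]
print(pick(catalog, business="print_ocr", annotated=True))   # only the annotated set
```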

We also need to build a chaos testing system to improve the stability and robustness of the models.

The data set management platform also has many users, including algorithm engineers, algorithm testers, product managers and data operations.

2. Process

First the data set is selected, then it is downloaded automatically, then the model-related processing is run, and finally the results are checked. During model processing, the various indicators are monitored to confirm normal operation; accuracy, speed and stability are tracked throughout to verify that the actual effect meets the expectations of users and the business.
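A minimal skeleton of this flow, with every helper replaced by a toy stand-in (all names are hypothetical):

```python
import random
import time

def download_dataset(dataset_id: str) -> list:
    return [f"sample-{i}" for i in range(5)]           # stand-in for the real download

def call_model(endpoint: str, sample: str):
    time.sleep(0.01)                                    # stand-in for the real HTTP call
    return f"result-for-{sample}", random.uniform(80, 300)

def run_dataset_evaluation(dataset_id: str, endpoint: str) -> dict:
    samples = download_dataset(dataset_id)              # select + automatic download
    indicators = {"errors": 0, "latencies_ms": []}
    results = []
    for sample in samples:
        try:
            prediction, latency_ms = call_model(endpoint, sample)
            results.append(prediction)
            indicators["latencies_ms"].append(latency_ms)   # "fast"
        except Exception:
            indicators["errors"] += 1                       # "stable": failed calls / crashes
    return {"results": results, "indicators": indicators}   # checked at the end

print(run_dataset_evaluation("ds-001", "http://ocr.example.internal/v1/recognize"))
```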

For different types of data, such as noisy data, empty data, blurred images, illegal formats, zoomed images, low-quality data, oversized data, fragmented data, images with light-source spots, and adversarial data, we observe the behavior of the algorithm model along different dimensions.
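As a byte-level illustration of a few of these chaos inputs (real image or audio perturbations such as blurring, zooming or adversarial noise would need dedicated tooling):

```python
import os

# Synthetic stand-in for a real test image: a JPEG-like header plus noise bytes.
original = b"\xff\xd8\xff\xe0" + os.urandom(4096)

def make_chaos_inputs(payload: bytes) -> dict:
    return {
        "empty_data": b"",                                 # empty payload
        "noise_data": os.urandom(len(payload) or 1024),    # pure random bytes
        "illegal_format": b"GIF8" + payload[4:],           # mismatched magic header
        "fragmented": payload[: len(payload) // 3],        # truncated upload
        "over_limit": payload * 50,                        # oversized payload
    }

for name, data in make_chaos_inputs(original).items():
    print(name, len(data), "bytes")
```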

3. Practical effect

Monitoring memory usage and CPU usage simultaneously reveals whether the model is stable. As shown in the figure above, memory usage keeps changing but fluctuates within the range of 6 GB to 8 GB.

Meanwhile CPU usage fluctuates normally, peaking at around 50%, which is also acceptable. Processing this batch of data therefore poses little resource risk, indicating that the stability of the model meets the delivery standard.

(1) Objectives

Considering evaluation criteria across different dimensions, in the future we will focus on high quality, efficiency improvement and cost reduction, using tool chain and platform R&D to help the product, research and testing teams deliver AI products and microservices more quickly. That is our goal.

(2) Direction

For “accurate”: we will continue to give the platform intelligent recommendation capabilities to support the optimization of algorithm metrics. Based on BadCase analysis, the platform can intelligently recommend which model to adopt and which parameters to use, and tell algorithm engineers the direction of the next optimization.

For “fast”: intelligently locate and analyze bottlenecks in the algorithm system, for example whether model processing efficiency can be improved, whether data compression can be applied at the engineering level, whether network transmission can be made faster, and whether the service should scale out horizontally when applying for resources.

For “stable”: memory leak scanning and localization for both algorithms and servitization. Memory leaks affect not only servitization but also overall project quality; they are the number one killer of system stability. Locating them quickly is both important and difficult: typically, analyzing a memory leak takes R&D and algorithm engineers several days, which runs counter to agile iteration. We therefore keep studying how to combine static and dynamic scanning tools to locate leaks more accurately.