The TiD2020 Quality Competition Conference invited Zhao Ming, head of the Quality and Engineering Efficiency Department of Good Future AI Zhongtai, to deliver a speech titled “AI Engineering Performance Improvement Platform R&D Practice”.

Organized around three keywords, accuracy, speed and stability, Zhao Ming shared the architecture and practice of the AI algorithm metrics and evaluation platform, the AI microservice performance testing platform, and the data set management platform, together with corresponding case demonstrations.

Accuracy: the AI algorithm needs to meet the requirements of actual business scenarios and achieve high accuracy.

Speed: after an image is processed by the algorithm, the result is returned to the user's app/web client. The end-to-end latency has to be evaluated, for example whether a response can come back within 200~300 milliseconds. The microservice performance testing platform was built for this requirement.

Stability: when voice, images or text are sent to the model, there should be no lag, let alone a model crash; this depends heavily on the robustness of the model.

Good Future AI Zhongtai is committed to integrating AI technology into education scenarios, creating a variety of AI capabilities, and continuously improving learning experience and efficiency. Model development around this AI technology falls into three main areas.

**First, speech**, which includes speech recognition, speech evaluation, emotion recognition and sentiment analysis. For example, ASR converts speech into text and is used to measure fluency and accuracy in spoken English, all within education scenarios.

**Second, image**, such as printed-text OCR, photo-based question search, notes grading and content review for education scenarios, where image algorithm models help improve efficiency.

**Third, data mining**, part of which relates to NLP (natural language processing), such as text keyword search, classroom feature statistics, highlight capture, and evaluation of Chinese oral expression ability. This series of model capabilities is deployed on the PaaS platform as AI microservices for each client to call.

(1) Frequent algorithm testing without automation

At the algorithm level, the model changes constantly, for example to improve recall rate, improve robustness, or incorporate new data. On the servitization side, performance optimizations and interface changes also require complete testing, so testing is very frequent. This process had not been automated and was inefficient.

(2) No visibility into industry leadership

Algorithm engineers use industry leadership as an objective basis for formulating KPIs, for example "50% of AI capabilities rank first in the industry." How do we evaluate such a goal? It has to be quantified, and there has to be a platform that presents the metrics. Without such a platform, industry leadership cannot be perceived at all.

(3) High cost of performance evaluation after performance tuning

Whenever the algorithm model or the service is tuned or scaled out, an end-to-end performance evaluation is needed. As the frequency of these evaluations grows, so does the cost.

(4) Data fragmentation

Data is the lifeblood of AI. Every link in AI production and research depends on data, and data needs to flow between roles: products, algorithms, engineering R&D and testing all require unified data management. Without a unified platform, data from different business parties, including data of different versions, becomes highly fragmented and hard to manage and maintain.

Based on these pain points, we made improvements through platform construction, hoping to solve the four kinds of business pain points above by building an algorithm evaluation platform, a performance testing platform and a data set management platform.

(I) Tool platform system construction

Tool building is about empowering each role. For product requirements, including how the product performs online, many tools exist for link tracing. For algorithms, such as the core AI capability models, tool chains are also needed to help improve algorithm output. Servitization wraps the models as interfaces for external invocation. Across the whole process, we need to ensure quality and conduct full-link monitoring through the DevOps operation and maintenance system.

In terms of quality, product is responsible for the quality of requirements, algorithms for the quality of the AI models, development for the quality of the code, testing for the end-to-end quality of the product, and operation and maintenance for the quality of the product after launch. Every link depends on the support of data and quality-related tools.

(II) AI algorithm metrics and evaluation platform architecture and practice

1. Scenario requirements and users

The first scenario, without annotated data: AI algorithm BadCase screening. No data annotation is required in advance; BadCases can be screened out automatically by comparing against competing products.

The second scenario, without annotated data: measuring algorithm accuracy on new data. We take new data and measure algorithm accuracy on this platform.

The third scenario, with annotated data: competitive evaluation of AI algorithm metrics and leadership analysis. Based on well-annotated data, algorithm metrics are computed and evaluated objectively and quantitatively, including analysis of industry leadership.

The fourth scenario, with annotated data: AI algorithm metric evaluation and analysis. For annotated data, the algorithm's metrics can be further analyzed and evaluated, including finding BadCases and feeding them back to the algorithm team for further model optimization.

For these four scenarios, our main users are algorithm engineers, algorithm testers and product managers.

2. Technical architecture

The technical architecture of the AI algorithm metrics and evaluation platform is divided into a basic function and interface call layer, a logical abstraction layer, and a UI layer.

The basic function and interface call layer includes permission management, data source management, storage management, analysis instance management and report management. Above it, calls are made through interfaces, both self-developed interfaces and competing-product interfaces, which are managed uniformly through a script library, for example by standardizing input parameters and return formats.

The logical abstraction layer integrates the basic functions into workflows, either to compute algorithm metrics or to process data end to end. Data needs to be preprocessed: desensitized, noise-reduced, cleaned, rotated, or filtered for empty audio. The evaluation rules for algorithm metrics differ between models; for speech the metric is word error rate, while for images it is recall rate. So how can they be normalized and managed uniformly? AI capability registration unifies newly developed capabilities: how do you register a capability with the platform while the underlying logic and the upper UI remain unchanged? For example, when growing from 10 capabilities to 100, you want to minimize the front-end and back-end changes and refactoring caused by registering new capabilities. In short, this layer separates the underlying infrastructure from the business of the upper AI microservices. For result export, different templates correspond to different algorithm models and produce the corresponding processing result reports, which can be exported as tables for subsequent report consolidation and analysis.
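To make the capability registration idea concrete, here is a minimal, hypothetical sketch (all names are invented, not the platform's actual code) of how a capability could be described declaratively, so that adding the 11th or the 100th capability only adds a registry entry rather than changing the engine or the UI:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Capability:
    name: str                           # e.g. "asr", "print_ocr"
    endpoint: str                       # self-developed service address
    competitor_endpoints: List[str]     # competing-product APIs to compare against
    preprocessors: List[Callable]       # desensitize, denoise, rotate, drop empty audio...
    metric: Callable                    # e.g. WER for speech, recall for images

REGISTRY: dict = {}

def register(cap: Capability) -> None:
    """Registering a capability only touches this table, not the engine or UI."""
    REGISTRY[cap.name] = cap

# Toy registration: a trivial metric and a no-op preprocessor, just to show the shape.
register(Capability(
    name="asr",
    endpoint="http://asr.example.internal/v1/transcribe",
    competitor_endpoints=["http://vendor-a/asr", "http://vendor-b/asr"],
    preprocessors=[lambda audio: audio],
    metric=lambda pred, ref: float(pred == ref),
))
print(REGISTRY["asr"].metric("today is friday", "today is friday"))  # 1.0
```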

On top of the logical abstraction layer there is also a multi-competitor processing engine, which performs automatic BadCase analysis on unlabeled data, including comparison against other competing products, to find our own shortcomings.

The UI layer provides evaluation instances, visual orchestration, BadCase screening and report presentation through visual interfaces.

3. Practical effect

The AI algorithm metrics and evaluation platform can provide algorithm BadCases to algorithm engineers as a reference for model optimization. For example, an OCR recognition error: during image recognition, noise caused by printing problems affects recognition, and in the figure above “____in” is recognized as “-tin”. Therefore the robustness of the model itself needs to be improved, with special tuning for such scenarios. An ASR transcription error is an inconsistency between the speech data and the annotation when converting speech to text. For example, the annotation is “how many offers our department has this year”, but the accuracy of models A and B is not very good: model A transcribed it as “we did not get many points this year”, and model B as “how many points did our department get this year”; “offer” and “points” cannot be well distinguished.

We list these errors, consolidate the BadCases into unified reports, and provide them to algorithm engineers for model optimization.

4. Metrics for algorithm model evaluation

Evaluating a model requires a set of objective metrics.

The statistical metrics for images include accuracy, precision, recall, F1 and so on. Precision, recall and accuracy can be computed from the TP, FN, FP and TN counts, and the resulting recall and accuracy can be compared with the previous version to see what needs improvement. For speech, WER/CER (word/character error rate) serve as the statistical metrics; the lower, the better. The metric is calculated as the number of insertion, deletion and substitution errors divided by the total number of words. Metrics are embedded into the platform in the form of scripts.
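As a reference, a minimal sketch of how such metric scripts might look (simplified, not the platform's actual scripts): classification metrics computed from the TP/FP/FN/TN counts, and WER computed as a word-level edit distance.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (insertions + deletions + substitutions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(classification_metrics(tp=90, fp=10, fn=5, tn=95))
print(word_error_rate("how many offers our department has this year",
                      "how many points our department got this year"))  # 0.25
```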

For performance, CPU, memory, GPU usage and so on serve as the statistical metrics. For stability, crash rate and memory leaks are the main metrics. Through these dimensions we can objectively evaluate whether the model is accurate, fast and stable.

5. Demo 1: Automatic algorithm BadCase screening

This method applies to scenarios without annotated data.

First, in the visual orchestration interface, we drag and drop the data source. The data source is a set of data on FTP, essentially a folder holding the images under test. We preprocess the images, for example desensitization and automatic deskewing; since a picture may not be upright, models are used to straighten it, which makes analysis easier. The desensitized data is then sent simultaneously to three external competing models for comparison.

BadCase screening is similar to a voting method. Suppose there are three models A, B and C. If all three return “Today is Friday” after speech analysis, that consensus is highly credible; if the self-developed result differs from the result of these three models, it is probably a BadCase. The advantage is that no annotation is needed: data can be grabbed from production and screened automatically. After model processing, some post-processing removes punctuation marks, spaces, carriage returns and other non-critical factors. The BadCases are then displayed on the page, where the threshold can be adjusted: the higher the score, the greater the likelihood of a real BadCase. For example, a BadCase score of 0.9 is very likely to be a real BadCase, while 0.2 could be a false positive.
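A minimal sketch of this voting heuristic, with invented function names and a simple normalization step standing in for the real post-processing:

```python
import re
from typing import List

def normalize(text: str) -> str:
    """Post-processing as described above: drop punctuation, spaces and line breaks."""
    return re.sub(r"[\s,.。，！？!?;；:：'\"“”]", "", text).lower()

def badcase_score(self_result: str, competitor_results: List[str]) -> float:
    """Fraction of competitor models that agree with each other but
    disagree with the self-developed result (voting heuristic)."""
    self_norm = normalize(self_result)
    comp_norms = [normalize(r) for r in competitor_results]
    if not comp_norms:
        return 0.0
    # Take the most common competitor answer as the voted reference.
    voted = max(set(comp_norms), key=comp_norms.count)
    if voted == self_norm:
        return 0.0  # self-developed model matches the consensus
    return comp_norms.count(voted) / len(comp_norms)

# Example: three competitors agree, the self-developed model differs.
score = badcase_score("Today is Thursday",
                      ["Today is Friday", "Today is Friday.", "today is friday"])
print(score)  # 1.0 -> very likely a BadCase; a screening threshold (e.g. 0.9) is applied on the page
```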

For printed tables, we carry out OCR evaluation of image-to-text information capture, including GoodCases and BadCases. Unlabeled data can be analyzed automatically through the platform, and the BadCases found can be provided to algorithm engineers for further iteration and optimization of the model.

6. Demo 2: Competitive evaluation of algorithm metrics

This method applies to annotated data.

Start by creating an instance and naming the version in the visual orchestration interface.

We drag in data from FTP for model analysis and feed it into three models at the same time, one of which is the self-developed model. We want to see the gap between the self-developed model and the competing models, and finally we import the annotated data for analysis.

The labeled data serves as the standard answer, so you end up with three sets of metrics, one per model. That way we can analyze how industry-leading our models are, which helps the algorithm team set KPIs.

The quantitative evaluation metrics for OCR have two dimensions: sequence accuracy and character accuracy. These two accuracies evaluate the effect of the model. At different iteration points, the evaluation results for each model will differ: in some dimensions we may be ahead, in others behind. These metrics show our position in the industry.
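As an illustration only (the platform's exact definitions may differ), sequence accuracy can be computed as the exact-match rate across samples, and character accuracy approximated from matching character blocks:

```python
from difflib import SequenceMatcher
from typing import List

def sequence_accuracy(preds: List[str], refs: List[str]) -> float:
    """Fraction of samples whose predicted text exactly matches the annotation."""
    exact = sum(p == r for p, r in zip(preds, refs))
    return exact / max(len(refs), 1)

def character_accuracy(preds: List[str], refs: List[str]) -> float:
    """Share of reference characters recovered (matching-blocks heuristic)."""
    matched = total = 0
    for p, r in zip(preds, refs):
        matched += sum(b.size for b in SequenceMatcher(None, r, p).get_matching_blocks())
        total += len(r)
    return matched / max(total, 1)

refs  = ["x+y=2", "a^2+b^2=c^2"]
preds = ["x+y=2", "a^2+b^2=e^2"]
print(sequence_accuracy(preds, refs))   # 0.5
print(character_accuracy(preds, refs))  # ~0.94
```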

Of course, we can also filter, by different dimensions or by different competing products. Below is a BadCase across the three models; on the far left is the self-developed version.

The self-developed version performs better than the other two competing products, so it has fewer BadCases. This page compares models A, B and C against the standard data: if a result exactly matches the standard data, it is correct; if they are inconsistent, the recognition is wrong.

For the printed formula recognition model, an image is input and its content is output as text. Recognizing some complex formulas requires stronger model robustness, which is difficult. As demonstrated earlier, we can measure this with the same metrics: the platform computes the result according to the defined logic, and then the industry leadership can be seen.

This platform can be used by both algorithm and testing colleagues. Product managers can also use it to evaluate new data without involving algorithms or testing: they do not need to understand the complex computational logic, and through drag-and-drop visual orchestration they can help engineers draw conclusions quickly.

(III) AI microservice performance testing platform architecture and practice

1. Scenario requirements and users

The AI microservice performance testing platform addresses the “fast” problem.

The first requirement is test environment management and sharing for both algorithms and servitization. Algorithm engineers have a local R&D environment, and servitization development has its own environment. Many model versions are involved, as well as versions of base dependencies such as Python libraries. Keeping these environments consistent should be as simple as possible, and the simplest way is containerized management and environment sharing.

The second requirement is automated deployment. Once a build is submitted for testing, everything needed for performance testing, such as the environments and scripts, should be prepared automatically.

The third requirement is automated pressure exploration, that is, automatically probing for the maximum TPS or the maximum number of concurrent requests.

The fourth requirement is performance bottleneck analysis, which is the key part. Processing speed is involved: for the model itself, compression and pruning can improve processing speed and reduce wasted hardware resources.

The users of the AI microservice performance testing platform are service testers, algorithm testers, R&D engineers and product managers. R&D and product managers can take the performance metrics and discuss with customers whether they meet the business needs.

2. Technical architecture

The architecture of the AI microservice performance testing platform includes a data source layer, an interface layer and a UI layer.

At the bottom is the data source layer, where Prometheus is used for remote monitoring and a database provides persistent storage.

The middle layer is the interface layer, containing the automated pressure exploration logic, including remote execution and deployment, metric persistence and calculation, and remote terminal control with one-click login.

The upper layer is the UI layer, accessed through the platform interface, covering resource application and release, automated pressure exploration, bottleneck analysis and so on.

3. Practical effect

The platform realizes unattended automated pressure exploration: using a binary search (dichotomy) algorithm, the system's maximum TPS or maximum number of concurrent requests can be found in the shortest time.

The binary search works like this: concurrency starts at 100 as the initial pressure, with a step size of 100. When the TPS inflection point appears, as shown in the figure above at 300, the pressure is brought back down into the range between 200 and 300, that is, to 250. Without this platform, we would have to change the concurrency manually every time. Since there are many service types, some based on HTTP and some on WebSocket, the underlying implementation is based on JMeter for simplicity and generality, replacing the configuration in JMX files through template substitution logic. The open-source JMeter tool is quite powerful.
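A minimal sketch of this dichotomy-style exploration, assuming a `run_load_test(concurrency)` helper that returns whether a round passed (a stand-in for actually driving JMeter):

```python
def explore_max_concurrency(run_load_test, start=100, step=100, granularity=20):
    """Ramp up by `step` until a round fails, then bisect between the last
    passing and the first failing pressure until the gap is below `granularity`."""
    low, high = 0, start
    # Phase 1: ramp up with a fixed step to bracket the inflection point.
    while run_load_test(high):
        low, high = high, high + step
    # Phase 2: binary search inside [low, high).
    while high - low > granularity:
        mid = (low + high) // 2
        if run_load_test(mid):
            low = mid
        else:
            high = mid
    return low  # highest pressure that still passed

# Toy stand-in: pretend the service degrades above 260 concurrent requests.
print(explore_max_concurrency(lambda c: c <= 260))  # 250
```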

Manual pressure testing averaged about 10 rounds per AI microservice with varying concurrency. One round takes 60 minutes, so 10 rounds take 600 minutes, which is quite time-consuming. Each round requires changing the script, starting the pressure test and observing every indicator: if the server reports errors, reduce the pressure; if the metrics are normal, keep increasing it. The work is labor-intensive and repetitive. With automation the effect is very obvious: the 600 minutes can be reduced to about 10 minutes, and after a final review of the data and report, a conclusion can be drawn directly, very fast.

Demo 3: Automated pressure exploration and bottleneck analysis

First apply for a pressure machine (load generator). Once it is allocated, we can log in with one click through a terminal connection and run any commands there. JMeter-native scripts can then be pushed and deployed to the newly allocated pressure machine, for example a 30-second constant-pressure script, and then monitored. All the work can be completed on this platform: resource application, remote script push, deployment and execution, result output, and monitoring data collection and display. The platform can be used as a workbench to configure various scripts according to actual needs, including the related output of the result report. The report is presented as a JMeter-native report, in which TPS and other indicators are displayed.
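Since the JMX scripts are parameterized through template replacement, the substitution step might look roughly like this (the placeholder names and the property fragment are illustrative and simplified, not the platform's actual template):

```python
from string import Template

# Simplified fragment of a JMeter .jmx test plan with template placeholders.
JMX_TEMPLATE = Template("""
<ThreadGroup>
  <stringProp name="ThreadGroup.num_threads">${threads}</stringProp>
  <stringProp name="ThreadGroup.ramp_time">${ramp_seconds}</stringProp>
  <stringProp name="ThreadGroup.duration">${duration_seconds}</stringProp>
</ThreadGroup>
<HTTPSamplerProxy>
  <stringProp name="HTTPSampler.domain">${target_host}</stringProp>
  <stringProp name="HTTPSampler.path">${api_path}</stringProp>
</HTTPSamplerProxy>
""")

def render_jmx(threads: int, host: str, path: str, duration: int = 60) -> str:
    """Fill one round's pressure parameters into the JMX template before
    pushing the script to the pressure machine."""
    return JMX_TEMPLATE.substitute(
        threads=threads,
        ramp_seconds=5,
        duration_seconds=duration,
        target_host=host,
        api_path=path,
    )

print(render_jmx(threads=100, host="ocr-service.example.internal", path="/v1/recognize"))
```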

At the same time, we can also run automated pressure exploration scenarios. The initial pressure is 100 and the step is 100. Each round here takes about 60 seconds, though in practice it is longer. The granularity is set to 20 and serves as the exit threshold: exploration stops once the difference between the maximum successful pressure and the minimum failed pressure is smaller than this value. The finer the granularity, the more rounds are run.

As can be seen from the monitoring chart, after a period of operation TPS reached an inflection point at a concurrency of 300, that is, a TPS bottleneck appeared, so the pressure needs to be brought back down to 250.

When the run finishes, you can see the report: at a concurrency of 100 the TPS is 551, at 200 it rises to 733, and at 300 it drops back to 722, so the pressure is brought back down to 250 and then 225. This way we know under what concurrency the system reaches its maximum TPS and throughput. This is the TPS scenario; there is also a concurrency scenario: how many concurrent requests can be tolerated without errors. The results link to the JMeter-native report. The measurement is end to end, covering network transmission, servitization processing, and the duration of the model calls.

We can choose a time window to analyze the monitoring results and bottlenecks. We watch the major indicators, such as memory leaks. The memory leak threshold is configurable, with a default of 5%: if memory at the end of the window has grown more than 5% compared with the start, a leak is suspected. This threshold can be adjusted, and we plan to make the check smarter later.
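The leak heuristic can be sketched as a simple start-versus-end comparison over the sampled memory series (a simplified stand-in for the platform's logic):

```python
from typing import List

def suspected_memory_leak(samples_mb: List[float], threshold: float = 0.05) -> bool:
    """Compare memory at the end of the window with memory at the start;
    growth beyond `threshold` (default 5%) flags a suspected leak.
    A few points are averaged at each end to smooth out spikes."""
    if len(samples_mb) < 4:
        return False
    k = max(1, len(samples_mb) // 10)
    start = sum(samples_mb[:k]) / k
    end = sum(samples_mb[-k:]) / k
    return start > 0 and (end - start) / start > threshold

print(suspected_memory_leak([6100, 6150, 6200, 6180, 6210, 6400, 6600, 6900]))  # True
```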

CPU usage is also checked against a threshold: if the threshold allows 80% and the average usage is 90%, an error is reported that the configured ceiling has been exceeded; network traffic is handled the same way. This is the dashboard for overall monitoring. A back-end logic service pulls the data back from Prometheus, processes it, and returns it through the interface for the front end to render. In this way pressure exploration runs unattended: for example, you leave work on Friday night, run an automated script that keeps searching for the maximum TPS or concurrency, and get all the reports on Monday morning; you can even run multiple groups in parallel. This saves time.
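A rough sketch of how such a back-end check might pull data from Prometheus and compare the average against the 80% ceiling (the address and the PromQL expression are placeholders, assuming a node_exporter-style metric):

```python
import time
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder address

def average_cpu_usage(instance: str, minutes: int = 30) -> float:
    """Pull CPU usage for the pressure window from Prometheus and average it."""
    query = (f'100 - avg(rate(node_cpu_seconds_total'
             f'{{mode="idle",instance="{instance}"}}[1m])) * 100')
    end = time.time()
    resp = requests.get(f"{PROMETHEUS}/api/v1/query_range",
                        params={"query": query, "start": end - minutes * 60,
                                "end": end, "step": "15s"},
                        timeout=10)
    values = resp.json()["data"]["result"][0]["values"]
    return sum(float(v) for _, v in values) / len(values)

# Example: flag the round if the average CPU exceeded the 80% ceiling.
# if average_cpu_usage("10.0.0.12:9100") > 80.0:
#     print("CPU usage exceeded the configured 80% ceiling")
```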

This can be seen as agile testing: our tools and platform keep shortening the feedback cycle for algorithms and development.

(IV) Data set management platform architecture and practice

1. Scenario requirements and users

The data is mainly used for training and testing in different scenarios. Classifying it requires different tags, such as which business a piece of data belongs to, which version it belongs to, and whether it is positive or negative data, so that it can be applied to the corresponding categories of algorithm models; this is essentially the evaluation taxonomy.

Algorithm engineers may use some of this data. Black-box data, which is invisible to algorithm engineers, is used for objective evaluation; other data is for algorithm self-testing and can be used together with smoke-test data for algorithm testing.

We also need to query the data and its annotation status: which data was collected in the past, which has already been well annotated, which has not been annotated or was annotated poorly, and which portion needs to be re-annotated.
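A hypothetical sketch of how such data set records and queries might be represented (all field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    dataset_id: str
    business: str        # e.g. "speech_evaluation", "print_ocr"
    version: str         # e.g. "v2.3"
    polarity: str        # "positive" / "negative"
    annotated: bool      # whether ground-truth labels exist
    visibility: str      # "black_box" (hidden from algorithm engineers) or "self_test"

def pick(records, **conditions):
    """Filter data sets by any combination of the tags above."""
    return [r for r in records if all(getattr(r, k) == v for k, v in conditions.items())]

catalog = [
    DatasetRecord("ds-001", "print_ocr", "v2.3", "positive", True, "black_box"),
    DatasetRecord("ds-002", "print_ocr", "v2.3", "negative", False, "self_test"),
]
print(pick(catalog, business="print_ocr", annotated=True))   # only the annotated set
```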

We also need to build a chaos testing system to improve the stability and robustness of the models.

The data set management platform also has many users, including algorithm engineers, algorithm testers, product managers and data operations.

2. Process

First the data set is selected, then it is downloaded automatically, then the model-related processing is run, and finally the results are checked. During model processing, the various indicators are monitored to confirm normal operation; accuracy, speed and stability are tracked throughout to verify that the actual effect meets the expectations of users and the business.
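A minimal skeleton of this flow, with every helper replaced by a toy stand-in (all names are hypothetical):

```python
import random
import time

def download_dataset(dataset_id: str) -> list:
    return [f"sample-{i}" for i in range(5)]           # stand-in for the real download

def call_model(endpoint: str, sample: str):
    time.sleep(0.01)                                    # stand-in for the real HTTP call
    return f"result-for-{sample}", random.uniform(80, 300)

def run_dataset_evaluation(dataset_id: str, endpoint: str) -> dict:
    samples = download_dataset(dataset_id)              # select + automatic download
    indicators = {"errors": 0, "latencies_ms": []}
    results = []
    for sample in samples:
        try:
            prediction, latency_ms = call_model(endpoint, sample)
            results.append(prediction)
            indicators["latencies_ms"].append(latency_ms)   # "fast"
        except Exception:
            indicators["errors"] += 1                       # "stable": failed calls / crashes
    return {"results": results, "indicators": indicators}   # checked at the end

print(run_dataset_evaluation("ds-001", "http://ocr.example.internal/v1/recognize"))
```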

For different types of data, such as noisy data, empty data, blurred images, illegal formats, zoomed images, low-quality data, oversized data, fragmented data, images with light-source spots, and adversarial data, we observe the behavior of the algorithm model along different dimensions.
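As a byte-level illustration of a few of these chaos inputs (real image or audio perturbations such as blurring, zooming or adversarial noise would need dedicated tooling):

```python
import os

# Synthetic stand-in for a real test image: a JPEG-like header plus noise bytes.
original = b"\xff\xd8\xff\xe0" + os.urandom(4096)

def make_chaos_inputs(payload: bytes) -> dict:
    return {
        "empty_data": b"",                                 # empty payload
        "noise_data": os.urandom(len(payload) or 1024),    # pure random bytes
        "illegal_format": b"GIF8" + payload[4:],           # mismatched magic header
        "fragmented": payload[: len(payload) // 3],        # truncated upload
        "over_limit": payload * 50,                        # oversized payload
    }

for name, data in make_chaos_inputs(original).items():
    print(name, len(data), "bytes")
```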

3. Practical effect

Monitoring memory usage and CPU usage simultaneously reveals whether the model is stable. As shown in the figure above, memory usage keeps changing but fluctuates within the range of 6 GB to 8 GB.

Meanwhile CPU usage fluctuates normally, peaking at around 50%, which is also acceptable. Processing this batch of data therefore poses little resource risk, indicating that the stability of the model meets the delivery standard.

(1) Objectives

Considering evaluation criteria across different dimensions, in the future we will focus on high quality, efficiency improvement and cost reduction, using tool chain and platform R&D to help the product, research and testing teams deliver AI products and microservices more quickly. That is our goal.

(2) Direction

For “accurate”: we will continue to give the platform intelligent recommendation capabilities to support the optimization of algorithm metrics. Based on BadCase analysis, the platform can intelligently recommend which model to adopt and which parameters to use, and tell algorithm engineers the direction of the next optimization.

For “fast”: intelligently locate and analyze bottlenecks in the algorithm system, for example whether model processing efficiency can be improved, whether data compression can be applied at the engineering level, whether network transmission can be made faster, and whether the service should scale out horizontally when applying for resources.

For “stable”: memory leak scanning and localization for both algorithms and servitization. Memory leaks affect not only servitization but also overall project quality; they are the number one killer of system stability. Locating them quickly is both important and difficult: typically, analyzing a memory leak takes R&D and algorithm engineers several days, which runs counter to agile iteration. We therefore keep studying how to combine static and dynamic scanning tools to locate leaks more accurately.