1. Introduction

As a national travel-service platform with over 100 million daily active users, Autonavi Map (Amap) provides users with retrieval, positioning, and navigation services every day. These services depend on accurate road information, such as the locations of electronic speed-enforcement cameras, road-condition information, and traffic-sign locations. You may wonder how Autonavi senses real-world road information and delivers this data to its users.

In fact, there are many ways to collect real-world road elements and push the updates into the Amap app. One of the most important methods is computer vision: vision algorithms are deployed on the client, and road information is quickly recovered by detecting and recognizing road elements in captured images.

To collect real-world road elements at low cost and in real time, we use the MNN engine (a lightweight deep neural network inference engine) to deploy convolutional neural network models to the client and run model inference on the device, completing the road-element collection task on small client devices with limited compute power and memory.

A traditional CNN (convolutional neural network) requires a large amount of computation, and multiple models need to be deployed in our business scenario. Deploying multiple models on low-performance devices while keeping the application "small and excellent" without sacrificing real-time performance is a big challenge. This article shares hands-on experience in deploying deep learning applications on low-performance devices using the MNN engine.

2. Deployment

2.1 Background

As shown in Figure 2.1.1, the business background is to deploy CNN models related to road-element recognition to the client, run model inference on the device, and extract the location and vector information of road elements.

This business scenario requires 10+ models to be deployed on the device at the same time to meet the information-extraction needs of different road elements, which is a great challenge for low-performance devices.

Figure 2.1.1 Autonavi Data collection

In order to keep the application "small and excellent", we encountered many problems and challenges while deploying models with the MNN engine. The following sections share our experience and the solutions to these problems.

2.2 MNN deployment

2.2.1 Memory Usage

Application memory is always a hot topic for developers, and the memory generated by model inference accounts for a large proportion of an application's running memory. Therefore, to keep inference memory as small as possible, it is important for developers to understand where the memory used during model execution comes from. Based on our deployment experience, the memory of a single deployed model mainly comes from the following four aspects:

Figure 2.2.1 Memory usage for single-model deployment

**ModelBuffer:** the buffer used for model deserialization, which stores the parameters and model information contained in the model file; its size is close to that of the model file.

**FeatureMaps:** feature-map memory, which mainly stores the inputs and outputs of each layer during model inference.

**ModelParams:** memory for model parameters, mainly the Weights, Bias, and Op structures required for inference. Weights take up most of the memory in this part.

**Heap/Stack:** heap and stack memory generated while the application runs.

2.2.2 Memory Optimization

Knowing where memory comes from at run time makes it easy to understand how the model's memory changes during execution. Based on several rounds of deployment practice, we take the following measures to reduce the peak memory of a deployed model:

  • After the model is deserialized (createFromFile) and memory is created (createSession), release the model buffer (releaseModel) to avoid memory piling up.
  • When preparing the model input, reuse memory between the image buffer and the input tensor.
  • When post-processing, reuse memory between the model's output tensor and the output data buffer (a code sketch of this lifecycle follows the list).
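As a rough illustration of these measures, the following is a minimal sketch of a single-model lifecycle using MNN's C++ Interpreter API. The model file name is an illustrative assumption, and the exact points where image and output memory are reused depend on the concrete pre- and post-processing code.

```cpp
// Minimal sketch of the single-model lifecycle with the memory measures above,
// using the MNN C++ Interpreter API. Error handling is omitted for brevity.
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>
#include <memory>

int main() {
    // 1. Deserialize the model file into the model buffer (createFromFile).
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("road_element.mnn"));  // hypothetical model file

    // 2. Create the session: allocates feature-map and parameter memory.
    MNN::ScheduleConfig config;
    config.type      = MNN_FORWARD_CPU;
    config.numThread = 2;
    MNN::Session* session = net->createSession(config);

    // 3. Release the deserialization buffer as early as possible to cut peak memory.
    net->releaseModel();

    // 4. Write the pre-processed image directly into the input tensor instead of
    //    keeping a separate image copy (input/image memory reuse).
    MNN::Tensor* input = net->getSessionInput(session, nullptr);
    // ... fill input->host<float>() with image data ...

    // 5. Run inference.
    net->runSession(session);

    // 6. Let post-processing consume the output tensor in place
    //    (output tensor / output data memory reuse).
    MNN::Tensor* output = net->getSessionOutput(session, nullptr);
    // ... decode road elements from output ...

    // 7. Release the session; application memory returns to its pre-load level.
    net->releaseSession(session);
    return 0;
}
```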

Figure 2.2.2.1 Memory reuse scheme for MNN model deployment

With memory reuse applied, take the deployment of a 2.4 MB vision model as an example. During the model's lifetime, from loading to release, the memory occupied at each intermediate stage changes as shown by the following curve:

Figure 2.2.2.2 Single Model Application Memory Curve (Android MemoryInfo statistics)

  • Before the model runs, the memory it occupies is 0 MB.
  • After the model is loaded (createFromFile) and the session is created (createSession), memory rises to 5.24 MB due to model deserialization and feature-map allocation.
  • Calling releaseModel reduces memory to 3.09 MB, because the buffer left over from model deserialization is released.
  • Reusing the input tensor and image memory raises application memory to 4.25 MB, because tensor memory has been created to hold the model input.
  • RunSession() raises application memory to 5.76 MB, mainly due to the stack memory used during inference.
  • After the model is released, application memory returns to its value before the model was loaded.

After repeated deployment practice, we summarized the following formula for estimating the on-device peak memory of a single model: MemoryPeak ≈ ModelSize + StaticMemory + DynamicMemory + MemoryHS, where:

**MemoryPeak:** the peak memory while a single model is running.

**StaticMemory:** static memory, including the model's Weights, Bias, and Ops.

**DynamicMemory:** dynamic memory, i.e. the memory occupied by feature maps.

**ModelSize:** the size of the model file, i.e. the memory used for model deserialization.

**MemoryHS:** runtime heap/stack memory (a rule of thumb is between 0.5 MB and 2 MB).

2.2.3 Principles of model inference

This section explains how model inference works, so that developers can quickly locate and solve problems when they arise.

Model scheduling before inference: MNN inference is highly flexible. To improve parallelism on heterogeneous systems, we can specify different execution paths through the model and assign different backends to those paths. This step is essentially a scheduling, or task-distribution, process.

For a branched network, you can specify which branch to run, or schedule branches onto different backends, to improve deployment performance. Figure 2.2.3.1 shows a multi-branch model whose two branches output detection results and segmentation results respectively.

Figure 2.2.3.1 Multi-branch network

The following optimizations can be made during deployment:

  • Specify the path to run. When only the detection results are needed, run only the detection branch rather than both branches, which reduces the model's inference time.
  • Assign detection and segmentation to different backends, for example detection on the CPU and segmentation on OpenGL, to improve the parallelism of the model (a code sketch follows this list).
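Below is a minimal sketch of both options using MNN's ScheduleConfig. The tensor names ("input", "det_out", "seg_out") and the choice of backends are illustrative assumptions, not the actual names in our models.

```cpp
// Minimal sketch of path and backend scheduling with MNN's ScheduleConfig.
// Tensor names and backend choices are illustrative assumptions.
#include <MNN/Interpreter.hpp>

void scheduleBranches(MNN::Interpreter* net) {
    // (a) Run only the detection branch: restrict the path from the input tensor
    //     to the detection output, so the segmentation branch is never executed.
    MNN::ScheduleConfig detOnly;
    detOnly.type         = MNN_FORWARD_CPU;
    detOnly.path.inputs  = {"input"};    // hypothetical input tensor name
    detOnly.path.outputs = {"det_out"};  // hypothetical detection output name
    detOnly.path.mode    = MNN::ScheduleConfig::Path::Tensor;
    MNN::Session* detSession = net->createSession(detOnly);

    // (b) Run both branches on different backends: one config per branch,
    //     combined into a single multi-path session.
    MNN::ScheduleConfig detCfg, segCfg;
    detCfg.type         = MNN_FORWARD_CPU;
    detCfg.path.inputs  = {"input"};
    detCfg.path.outputs = {"det_out"};
    detCfg.path.mode    = MNN::ScheduleConfig::Path::Tensor;
    segCfg.type         = MNN_FORWARD_OPENGL;  // unsupported ops fall back to the backup backend
    segCfg.path.inputs  = {"input"};
    segCfg.path.outputs = {"seg_out"};         // hypothetical segmentation output name
    segCfg.path.mode    = MNN::ScheduleConfig::Path::Tensor;
    MNN::Session* multiSession = net->createMultiPathSession({detCfg, segCfg});

    (void)detSession;
    (void)multiSession;
}
```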

Pre-processing before inference: in this stage, pre-processing is carried out according to the scheduling information from the previous step. In essence, a Session (which holds the data needed for inference) is created from the model information and the user's input configuration.

Figure 2.2.3.2 Creating sessions according to Schedule

In this stage, operator scheduling is performed based on the deserialized model information and the user's scheduling configuration, creating the runtime pipelines and the corresponding compute backends. See Figure 2.2.3.3.

Figure 2.2.3.3 Session creation

Model inference: model inference is essentially the process of executing operators one by one based on the Session created in the previous step. Each layer of the model is executed along the path and on the backend specified during pre-processing. It is worth mentioning that if the specified backend does not support an operator, that operator falls back to the backup backend by default.

Figure 2.2.3.4 Inference calculation diagram of model
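The backup backend mentioned above, together with precision/power/memory preferences, is part of the user configuration consumed when the session is created. A minimal sketch follows; all concrete values are illustrative, not the configuration used in our deployment.

```cpp
// Minimal sketch of the user configuration consumed during session creation.
// ScheduleConfig chooses the backend, thread count and backup backend;
// BackendConfig tunes precision/power/memory trade-offs. Values are illustrative.
#include <MNN/Interpreter.hpp>

MNN::Session* createConfiguredSession(MNN::Interpreter* net) {
    MNN::BackendConfig backendConfig;
    backendConfig.precision = MNN::BackendConfig::Precision_Low;  // allow lower-precision arithmetic where supported
    backendConfig.memory    = MNN::BackendConfig::Memory_Low;     // prefer a smaller memory footprint

    MNN::ScheduleConfig config;
    config.type          = MNN_FORWARD_OPENCL;  // requested backend
    config.backupType    = MNN_FORWARD_CPU;     // ops unsupported on OpenCL fall back here
    config.numThread     = 2;
    config.backendConfig = &backendConfig;

    return net->createSession(config);
}
```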

2.2.4 Model deployment time

This part measures the time consumed by each stage of the single-model deployment process, so that developers know where time is spent and can design the code architecture accordingly. (Performance varies across devices; the timing data is for reference only.)

Figure 2.2.4.1 Time consumed in each stage of single-model deployment

Model deserialization and session creation are relatively time-consuming; when running inference repeatedly (e.g. on multiple images), they should be performed only once if possible.
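A simple way to obtain such per-stage numbers on a target device is to wrap each stage with a wall-clock timer. A minimal sketch, with a hypothetical model file name:

```cpp
// Minimal sketch of timing each deployment stage with a wall-clock timer.
#include <MNN/Interpreter.hpp>
#include <chrono>
#include <cstdio>
#include <memory>

static double millisSince(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("road_element.mnn"));  // hypothetical model file
    printf("deserialize:    %.2f ms\n", millisSince(t0));

    auto t1 = std::chrono::steady_clock::now();
    MNN::ScheduleConfig config;
    config.type = MNN_FORWARD_CPU;
    MNN::Session* session = net->createSession(config);
    net->releaseModel();
    printf("create session: %.2f ms\n", millisSince(t1));

    auto t2 = std::chrono::steady_clock::now();
    net->runSession(session);
    printf("inference:      %.2f ms\n", millisSince(t2));
    return 0;
}
```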

2.2.5 Model error analysis

During model deployment, developers inevitably encounter deviations between the model's output on the deployment side and the output of the trained model on the x86 side (PyTorch, Caffe, TensorFlow). The causes of such errors, how to locate them, and how to resolve them are shared below.

A schematic diagram of model inference is shown in Figure 2.2.5.1:

Figure 2.2.5.1 Model Inference Schematic Diagram

Determining whether there is a model error: the most intuitive way is to feed the same fixed input values to the deployment-side model and the x86-side model, run inference on each separately, and then compare the outputs to confirm whether an error exists.

Locating the model error: once an error is confirmed, output errors caused by input errors should be ruled out first. Because the floating-point precision of x86 and some Arm devices differs, input errors can accumulate in some models and produce large output errors. How can input-related errors be ruled out? One way is to set every model input to 0.46875 (this value is represented identically on x86 and Arm devices, essentially because it is exactly representable in binary floating point) and then check whether the outputs are consistent.
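A minimal sketch of this fixed-input check on the deployment side, assuming a single float input tensor on a CPU session (the function name and placement are illustrative):

```cpp
// Minimal sketch of the fixed-input check: fill the input tensor with 0.46875
// on both the deployment side and the x86 side, then compare the outputs.
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>
#include <memory>

void runWithConstantInput(MNN::Interpreter* net, MNN::Session* session) {
    MNN::Tensor* input = net->getSessionInput(session, nullptr);

    // Build a host tensor with the same shape and fill it with 0.46875,
    // a value that is exactly representable in binary floating point.
    std::unique_ptr<MNN::Tensor> host(
        new MNN::Tensor(input, input->getDimensionType()));
    for (int i = 0; i < host->elementSize(); ++i) {
        host->host<float>()[i] = 0.46875f;
    }
    input->copyFromHostTensor(host.get());

    net->runSession(session);
    // ... dump the output tensor and diff it against the x86-side result ...
}
```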

Locating the source of the error inside the model: once input-related errors have been ruled out (i.e. the inputs are identical but the outputs still differ), the error is most likely introduced by certain operators in the model. How do we find which op of the model causes the error? The cause can be located through the following steps:

1) Use runSessionWithCallBack to dump the intermediate results computed by each op of the model (see the sketch after these steps). The goal is to find the op at which the model's output starts to diverge.

2) Once that layer is located, identify the operator that introduces the error.

3) Once the operator is located, use the specified backend information to find the code that executes it.

4) Once the execution code is found, debug it to locate the line that produces the error, and thus the root cause of the model error.
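A minimal sketch of step 1 using runSessionWithCallBack is shown below. It assumes a CPU session with float tensors, so that host memory can be read directly; on GPU backends the tensors would first need to be copied back to host tensors.

```cpp
// Minimal sketch of dumping per-op outputs with runSessionWithCallBack.
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>
#include <algorithm>
#include <cstdio>

void dumpOpOutputs(MNN::Interpreter* net, MNN::Session* session) {
    // Called before each op executes; nothing to inspect here, just continue.
    MNN::TensorCallBack before = [](const std::vector<MNN::Tensor*>&,
                                    const std::string&) { return true; };

    // Called after each op executes: print the op name and the first few output
    // values so they can be compared with the x86-side intermediate results.
    MNN::TensorCallBack after = [](const std::vector<MNN::Tensor*>& outputs,
                                   const std::string& opName) {
        for (const auto* t : outputs) {
            const int n = std::min(4, t->elementSize());
            printf("[%s] ", opName.c_str());
            for (int i = 0; i < n; ++i) {
                printf("%f ", t->host<float>()[i]);  // assumes float CPU tensors
            }
            printf("\n");
        }
        return true;  // keep running the remaining ops
    };

    net->runSessionWithCallBack(session, before, after);
}
```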

3. Summary

MNN is an excellent on-device inference engine. As developers, when deploying models and optimizing performance, we need to pay attention not only to business-logic optimization but also to the engine's computation process, its framework design, and the ideas behind model acceleration. This in turn helps us optimize the business code better and truly make the application "small but excellent".

4. Future plans

As device performance improves across the board, future deployments will run on higher-performance devices. We will use a richer set of compute backends, such as OpenCL and OpenGL, to accelerate model inference.

In the future, more models will be deployed on the client to recover more types of road-element information. We will also use the MNN engine to explore a more efficient, real-time deployment framework to better serve the map-collection business.

We are the Amap data research and development team, and we have many open positions. If you are interested in Java backend, platform architecture, algorithm engineering (C++), or front-end development, please send your resume to [email protected] with the email subject in the format "Name – Technical Direction – from Autonavi Technology". We are eager for talent and look forward to your joining us.