1. Introduction
As a national travel-services platform with a DAU of more than 100 million, AutoNavi Maps provides users with search, positioning, and navigation services every day. Delivering these services requires accurate road information, such as the locations of traffic cameras, road conditions, and traffic signs. You may wonder how AutoNavi perceives road information in the real world and delivers this data to users.
In fact, we have many ways to collect road elements from the real world and update them in the AutoNavi Maps app. One of the most important is computer vision: we deploy vision algorithms to the client and quickly extract road information through image detection and recognition.
To collect road elements at low cost and in near real time, we use the MNN engine (a lightweight deep neural network inference engine) to deploy convolutional neural network models to the client and run inference on the device, so that road features can be collected on clients with low compute power and limited memory.
Traditional CNNs (convolutional neural networks) are computationally intensive, and our business scenario requires deploying multiple models. Deploying multiple models on low-performance devices while keeping the application "small and good" without sacrificing real-time performance is a major challenge. This article shares our practical experience of deploying deep learning applications on low-performance devices with the MNN engine.
2. Deployment
2.1 Background
As shown in Figure 2.1.1, the business context is to deploy CNN models for road element recognition to the client, run inference on the device, and extract information such as the location and vector representation of road elements.
To meet the needs of this business scenario, 10 or more models must be deployed on the device at the same time to extract information for different kinds of road elements, which is a major challenge for low-performance devices.
Figure 2.1.1 AutoNavi data collection
To keep the application "small and good", we ran into many problems and challenges while deploying models with the MNN engine. The following shares some of our experience and the solutions to these problems.
2.2 MNN deployment
2.2.1 Memory usage
Application memory is an unavoidable topic for developers, and the memory generated by model inference accounts for a large share of it. To keep inference memory as small as possible, developers must understand where the model's memory comes from. In our deployment experience, when a single model is deployed, memory mainly comes from the following four sources:
Figure 2.2.1 Single-model deployment memory footprint
- ModelBuffer: the model deserialization buffer, which mainly stores the parameters and model information from the model file; its size is close to the size of the model file.
- FeatureMaps: feature-map memory, which mainly stores the input and output of each layer during model inference.
- ModelParams: model parameter memory, which mainly stores the Weights, Bias, Ops, and other data required for inference; Weights accounts for most of this memory.
- Heap/Stack: heap and stack memory generated while the application runs.
2.2.2 Memory optimization
Once the sources of model memory are known, it is easy to understand how memory changes while the model runs. After several deployment projects, we took the following measures to reduce the peak memory of deployed models (a code sketch follows Figure 2.2.2.1):
- After the model is deserialized (createFromFile) and the session is created (createSession), release the model buffer (releaseModel) to avoid memory accumulation.
- When processing the model input, reuse memory between the image buffer and the input Tensor.
- When post-processing, reuse memory between the model's output Tensor and the output data buffer.
Figure 2.2.2.1 MNN model deployment memory reuse plan
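Assuming the public MNN C++ interpreter API and a placeholder model file `road_element.mnn`, a minimal sketch of this call order might look as follows; the key points are that releaseModel is called right after createSession, and that the input and output Tensors are reused instead of keeping separate image and result buffers:

```cpp
#include <MNN/Interpreter.hpp>
#include <MNN/MNNForwardType.h>
#include <MNN/Tensor.hpp>
#include <memory>

using namespace MNN;

int main() {
    // Deserialize the model file; the model buffer lives inside the Interpreter.
    std::shared_ptr<Interpreter> net(Interpreter::createFromFile("road_element.mnn"));

    // Create the inference session; feature-map and parameter memory is allocated here.
    ScheduleConfig config;
    config.type      = MNN_FORWARD_CPU;
    config.numThread = 2;
    Session* session = net->createSession(config);

    // Release the deserialization buffer immediately, so the model buffer and
    // the session memory do not coexist and push up the peak.
    net->releaseModel();

    // Reuse the input Tensor's memory: write the preprocessed image directly
    // into the host tensor instead of keeping a separate image buffer.
    Tensor* inputTensor = net->getSessionInput(session, nullptr);
    std::unique_ptr<Tensor> hostInput(new Tensor(inputTensor, Tensor::CAFFE));
    // ... fill hostInput->host<float>() with the preprocessed image ...
    inputTensor->copyFromHostTensor(hostInput.get());

    net->runSession(session);

    // Post-process directly on the copied output tensor instead of an extra buffer.
    Tensor* outputTensor = net->getSessionOutput(session, nullptr);
    std::unique_ptr<Tensor> hostOutput(new Tensor(outputTensor, Tensor::CAFFE));
    outputTensor->copyToHostTensor(hostOutput.get());

    net->releaseSession(session);
    return 0;
}
```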
With memory reuse in place, take the deployment of a 2.4M vision model as an example. From loading to release, the memory occupied at each stage of the model's life cycle can be represented by the following curve:
Figure 2.2.2.2 Single-model application memory curve (Android memoryinfo statistics)
- Before the model runs, the memory occupied by the model is 0M.
- After the model is loaded (createFromFile) and the session is created (createSession), memory rises to 5.24M, which comes from model deserialization and feature-map allocation.
- Calling releaseModel reduces memory to 3.09M, because the model's deserialization buffer is released.
- Reusing the input Tensor with the image memory raises application memory to 4.25M, because Tensor memory is created to store the model input.
- Calling RunSession() raises application memory to 5.76M, due to the stack memory used during RunSession.
- After the model is released, application memory returns to the value before the model was loaded.
Based on many model deployments, the end-to-end peak memory of a single deployed model can be estimated with the following formula:
MemoryPeak = StaticMemory + DynamicMemory + ModelSize + MemoryHS
- MemoryPeak: the peak memory while a single model is running.
- StaticMemory: static memory, including the memory occupied by the model's Weights, Bias, and Ops.
- DynamicMemory: dynamic memory, including the memory occupied by feature maps.
- ModelSize: the model file size, i.e. the memory occupied by model deserialization.
- MemoryHS: runtime heap/stack memory (empirically between 0.5M and 2M).
2.2.3 Model inference principles
This section explains how model inference works, so that developers can quickly locate and solve related problems when they arise.
Model scheduling before inference: MNN inference is highly flexible. You can specify different execution paths within a model, and assign different backends to different paths, to improve the parallelism of heterogeneous systems. This stage is essentially a scheduling, or task-distribution, process.
For a branched network, you can specify which branch to run, or schedule the branches onto different backends to improve deployment performance. Figure 2.2.3.1 shows a multi-branch model whose two branches output detection results and segmentation results respectively.
Figure 2.2.3.1 Multi-branch network
The following optimizations can be made during deployment (a configuration sketch follows the list):
- Specify the execution path. When only the detection result is needed, run only the detection branch instead of both branches, reducing inference time.
- Specify different backends for detection and segmentation, for example CPU for detection and OpenGL for segmentation, to improve model parallelism.
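As a sketch of how these two optimizations map onto MNN's ScheduleConfig (the tensor names "input", "detection_output", and "segmentation_output" are hypothetical placeholders, and the path/backend fields follow the public MNN headers as we understand them):

```cpp
#include <MNN/Interpreter.hpp>
#include <MNN/MNNForwardType.h>
#include <string>
#include <vector>

using namespace MNN;

// Run only the detection branch: restrict the execution path from the shared
// input tensor to the detection output, so the segmentation branch is skipped.
Session* createDetectionOnlySession(Interpreter* net) {
    ScheduleConfig config;
    config.type         = MNN_FORWARD_CPU;
    config.path.inputs  = {"input"};
    config.path.outputs = {"detection_output"};
    config.path.mode    = ScheduleConfig::Path::Tensor;
    return net->createSession(config);
}

// Run the two branches on different backends: CPU for detection, OpenGL for
// segmentation, with CPU as the fallback for unsupported ops.
Session* createDualBackendSession(Interpreter* net) {
    ScheduleConfig detect;
    detect.type         = MNN_FORWARD_CPU;
    detect.path.inputs  = {"input"};
    detect.path.outputs = {"detection_output"};
    detect.path.mode    = ScheduleConfig::Path::Tensor;

    ScheduleConfig segment;
    segment.type         = MNN_FORWARD_OPENGL;
    segment.backupType   = MNN_FORWARD_CPU;
    segment.path.inputs  = {"input"};
    segment.path.outputs = {"segmentation_output"};
    segment.path.mode    = ScheduleConfig::Path::Tensor;

    return net->createMultiPathSession({detect, segment});
}
```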
Pre-processing before model inference: this stage pre-processes the model according to the scheduling information from the previous step. In essence, it uses the model information and the user's configuration to create a Session (the object holding the model's inference data).
Figure 2.2.3.2 Create Session according to Schedule
In this stage, operator scheduling is performed according to the deserialized model information and the user's scheduling configuration, and the corresponding Pipelines and computing backends are created, as shown in Figure 2.2.3.3.
Figure 2.2.3.3 Session creation
Model inference: inference is essentially the execution of operators over the Session created in the previous step. The computation runs each layer of the model along the path and on the backend specified during pre-processing. It is worth mentioning that when an operator is not supported on the specified backend, it falls back to the backup backend for computation.
Figure 2.2.3.4 Model inference calculation diagram
2.2.4 Model deployment time
This section measures the time consumed by each stage of single-model deployment, so that developers understand where time goes and can design their code architecture accordingly. (Different devices have different performance; the timing data is for reference only.)
Figure 2.2.4.1 Time consumption of each model deployment stage
Model deserialization and session creation are relatively time-consuming. When running inference on multiple images, execute them only once, as in the timing sketch below.
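A simple way to see this on a given device is to time each stage separately, keeping deserialization and session creation outside the per-frame loop. A sketch, again with `road_element.mnn` as a placeholder model:

```cpp
#include <MNN/Interpreter.hpp>
#include <chrono>
#include <cstdio>
#include <memory>

using namespace MNN;
using Clock = std::chrono::steady_clock;

static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    auto t0 = Clock::now();
    std::shared_ptr<Interpreter> net(Interpreter::createFromFile("road_element.mnn"));
    printf("deserialize:   %.2f ms\n", msSince(t0));

    ScheduleConfig config;
    t0 = Clock::now();
    Session* session = net->createSession(config);
    printf("createSession: %.2f ms\n", msSince(t0));
    net->releaseModel();

    // Deserialization and session creation run once; only the steps below
    // repeat for every image.
    for (int frame = 0; frame < 100; ++frame) {
        // ... fill the session input with the current image ...
        t0 = Clock::now();
        net->runSession(session);
        printf("runSession:    %.2f ms\n", msSince(t0));
    }

    net->releaseSession(session);
    return 0;
}
```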
2.2.5 Model error analysis
When deploying a model, developers inevitably encounter cases where the output of the deployed model differs from the output of the trained model on the x86 side (PyTorch, Caffe, TensorFlow). The causes of such errors, the approach to locating them, and the solutions are shared below.
The schematic diagram of model inference is shown in Figure 2.2.5.1:
Figure 2.2.5.1 Schematic diagram of model Inference
Determining whether there is a model error: the most intuitive check is to fix the inputs of the deployed model and the x86 model to the same values, run inference separately, and compare the outputs to confirm whether there is an error.
Locating the source of the error: once a model error is confirmed, first rule out output errors caused by input errors. Because the floating-point precision of x86 and some Arm devices differs, input errors accumulate in some models and lead to larger output errors. How can input-induced errors be ruled out? We set every model input to 0.46875, a value that is represented identically on x86 devices and these Arm devices (essentially, a floating-point number obtained by shifting is identical on both ends), and then check whether the outputs are consistent. A sketch of this check follows.
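A minimal sketch of this fixed-input check with the MNN C++ API, assuming the interpreter and session created earlier; the same constant would be fed to the x86 reference model for comparison:

```cpp
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>
#include <memory>

using namespace MNN;

// Fill every input element with 0.46875, a value represented identically on
// x86 and the Arm devices in question, so that any remaining output difference
// cannot come from the input.
void runWithFixedInput(Interpreter* net, Session* session) {
    Tensor* inputTensor = net->getSessionInput(session, nullptr);
    std::unique_ptr<Tensor> host(new Tensor(inputTensor, Tensor::CAFFE));
    float* data = host->host<float>();
    for (int i = 0; i < host->elementSize(); ++i) {
        data[i] = 0.46875f;
    }
    inputTensor->copyFromHostTensor(host.get());
    net->runSession(session);
    // Compare the tensors from getSessionOutput against the x86 reference.
}
```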
Narrowing the error down inside the model: once input-induced errors are ruled out (that is, the inputs are consistent but the outputs are not), the error is most likely caused by some operator in the model. How do we find which Op causes it? The following steps locate the cause inside the model (a code sketch follows the list):
1) Use runSessionWithCallBack to call back the intermediate results of every Op in the model, in order to find the Op where the error first appears.
2) After locating that layer, identify the operator that causes the error.
3) After locating the operator, find the corresponding operator execution code from the specified backend information.
4) After locating the execution code, debug it to find the line that causes the error, and thus the root cause of the model error.
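For step 1, a sketch of per-Op dumping with runSessionWithCallBack, printing only the first value of each Op's first output; the callback signature follows the public MNN headers:

```cpp
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

using namespace MNN;

// Dump every Op's output during inference so that the first layer that
// diverges from the x86 reference can be identified.
void runWithPerOpDump(Interpreter* net, Session* session) {
    TensorCallBack before = [](const std::vector<Tensor*>&, const std::string&) {
        return true;  // continue execution
    };
    TensorCallBack after = [](const std::vector<Tensor*>& outputs,
                              const std::string& opName) {
        std::unique_ptr<Tensor> host(new Tensor(outputs[0], Tensor::CAFFE));
        outputs[0]->copyToHostTensor(host.get());
        if (host->elementSize() > 0) {
            printf("%s -> first output value: %f\n",
                   opName.c_str(), host->host<float>()[0]);
        }
        return true;
    };
    net->runSessionWithCallBack(session, before, after);
}
```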
3. Summary
The MNN engine is an excellent on-device inference engine. When deploying models to the device and optimizing performance, developers should pay attention not only to business logic optimization but also to the engine's computation flow, framework design, and model acceleration techniques. This allows the business code to be optimized further and produces a truly "small and good" application.
4. Future planning
As device performance improves across the board, subsequent services will run on higher-performance devices. We will use richer computing backends, such as OpenCL and OpenGL, to accelerate model inference.
In the future, more models will be deployed to the client to extract more categories of road element information. We will also use the MNN engine to explore more efficient deployment frameworks with better real-time performance, to better serve the map data collection business.
We are the Map Data R&D team, and we have many open positions. We welcome anyone interested in Java backend, platform architecture, algorithm engineering (C++), or front-end development to join us. Please send your resume to gdtech@alibaba-inc.com with the email subject in the format: name - technical direction - from AutoNavi. We are eager for talent and look forward to your joining.