Abstract: so-called full-scenario AI refers to the ability to quickly apply deep learning technology to hardware devices in different scenarios across the cloud, the edge, and the device side, including cloud servers, mobile terminals, and IoT devices, and to have them run efficiently and collaborate effectively.
This article is shared from the HUAWEI Cloud community article "AI Framework and MindSpore's Solution", original author: HWCloudAI.
The challenge of a unified AI framework for all scenarios
So-called full-scenario AI refers to the ability to quickly apply deep learning technology to hardware devices in different scenarios across the cloud, the edge, and the device side, including cloud servers, mobile terminals, and IoT devices, and to have them run efficiently and collaborate effectively. For the framework, this involves three major challenges: rapid deployment, efficient operation, and device-cloud collaboration.
Rapid deployment
How can a trained model be quickly deployed to cloud servers, mobile terminals, and various IoT devices for inference and even incremental training?
Inference on cloud servers is usually deployed as a service: the trained model is pushed to the cloud server, and users invoke the cloud inference service through remote interface calls (gRPC/REST) to run inference. For mobile terminals and IoT devices, hardware resources are constrained, and the cloud-side model and inference framework are too large to deploy directly. Model compression and a lightweight runtime therefore become the key to deployment on mobile terminals and IoT devices.
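As a minimal illustration of service-style deployment, an inference request over REST might look like the sketch below. The endpoint URL and payload schema are hypothetical placeholders, not a specific MindSpore Serving API.

```python
import requests

# Hypothetical REST endpoint of a cloud inference service; the URL and
# payload schema are illustrative only, not a real MindSpore Serving API.
ENDPOINT = "https://example.com/v1/models/resnet50:predict"

def remote_infer(image_pixels):
    # Send the preprocessed input as JSON and return the predicted scores.
    payload = {"instances": [{"input": image_pixels}]}
    resp = requests.post(ENDPOINT, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["predictions"]

# Example call with a dummy 3x224x224 input flattened to a list:
# scores = remote_infer([0.0] * (3 * 224 * 224))
```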
Facing the lightweight requirements of mobile terminals and IoT devices, a good solution is to provide an independent, lightweight device-side AI framework. Such a lightweight framework may need more than one form, because rich devices such as smartphones and thin devices such as earphones face different challenges: rich devices generally have ample storage and a certain amount of computing power, while conditions on thin devices are much harsher, with the framework's baseline footprint needing to stay at around 100 KB, so a full runtime simply cannot fit. At the same time, AI developers still need a general solution.
With a lightweight device-side framework and good model compression and conversion technology, can rapid deployment be achieved? In fact, problems remain. If the device-side architecture is separate from the cloud-side architecture and the implementations are inconsistent, for example with different model IRs, different operator definitions, and different inference API interfaces, then a model trained on the cloud side may fail to convert for device-side execution, and inference code written for the cloud cannot be reused on the device.
A conventional framework's process from cloud-side training to device-side deployment is as follows:
There are currently some problems with this approach:
- The first problem: it is difficult to keep the two sets of model definitions consistent. For example, operators on the cloud side and the device side often do not match one-to-one, which causes model conversion to fail.
- The second problem: functionality required by both the cloud and the device gets developed twice and may diverge. For example, the fusion optimizations that improve inference performance have to be implemented on both sides, and inconsistencies in data processing cause accuracy problems.
- The third problem: a model trained on the cloud side requires relatively complex conversion before it can be incrementally trained (online training) on the device side.
Can a standard such as ONNX solve the inconsistency of separate device and cloud frameworks? It is difficult: the AI industry is developing rapidly and new operator types keep emerging, so such standards struggle to keep up. The solution therefore still has to come from the AI framework itself.
Efficient operation
Efficient operation across all scenarios can be broken down into efficient operators, an efficient runtime, and efficient models, so as to extract the maximum computing power from heterogeneous hardware and improve the performance and energy efficiency of AI algorithms.
Operator performance needs to be optimized at both the algorithm level and the instruction level. Take convolution as an example: compared with Im2Col+GEMM, the Winograd algorithm delivers a significant performance improvement on many classic convolutional neural networks.
However, Winograd is not better than Im2Col+GEMM in every scenario. In the figure below, when the shape is 224x224x3x64, Winograd's performance actually deteriorates. Therefore, choosing the best algorithm under each set of conditions is critical to performance.
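For intuition, here are the standard multiplication counts of the Winograd minimal filtering algorithm (a textbook result stated generically, not a MindSpore measurement):

```latex
% 1-D tile: output size m, filter size r
\text{direct convolution: } m \cdot r \ \text{multiplications}, \qquad
F(m, r) \text{ (Winograd): } m + r - 1 \ \text{multiplications}

% 2-D case: 3x3 filter with a 2x2 output tile
F(2\times2,\,3\times3): \ (2+3-1)^2 = 16 \ \text{multiplications vs. } 2^2 \cdot 3^2 = 36
\ \Rightarrow\ \text{about } 2.25\times \text{ fewer multiplications,}
\ \text{offset by input/output transform overhead that hurts some shapes.}
```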
Algorithm-level optimization improves performance mainly by reducing the number of runtime computations (multiplications), while instruction-level optimization makes fuller use of the hardware's computing power. For CPUs, the key factors affecting instruction execution speed are the L1/L2 cache hit rate and instruction pipelining. Common optimization methods are:
- Choosing a reasonable data layout, such as NHWC or NC4HW4
- Reasonable register allocation: registers can be divided by purpose into feature-map registers, weight registers, and output registers, and allocating them well reduces the number of data loads
- Data prefetching: instructions such as prefetch/preload bring data into the cache ahead of time
- Instruction reordering to minimize pipeline stalls
- Vectorized computation using SIMD instructions, such as ARM NEON or x86 SSE/AVX
These optimizations require an in-depth understanding of the hardware architecture.
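As a small, framework-agnostic illustration of the data-layout point above, the sketch below uses NumPy only as a stand-in for what hand-written SIMD kernels do with vector registers:

```python
import numpy as np

# A tensor stored as NCHW (batch, channels, height, width).
x_nchw = np.random.rand(1, 16, 32, 32).astype(np.float32)

# Re-arrange to NHWC so the channel values of one pixel are contiguous,
# which is friendlier to per-pixel vectorized (SIMD) processing on many CPUs.
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))

# With contiguous channels, a per-channel operation maps naturally onto
# vector registers; NumPy broadcasting stands in for NEON/SSE code here.
scale = np.random.rand(16).astype(np.float32)   # per-channel scale
y = x_nhwc * scale                              # broadcasts over the last axis
```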
The performance of the device-side runtime mainly faces the challenges of heterogeneity and asynchronous parallelism. From the model's point of view, most models appear to execute serially during inference, but once an operator is opened up into fine-grained kernels, the overall execution flow is still a dataflow graph and there are many opportunities for asynchronous parallelism. At the same time, there are many heterogeneous devices on the device side, and when a model uses several kinds of devices during execution, different pipelines arise between them as well.
Model performance mainly relies on offline optimization and tuning, an area with plenty of established practice. The general idea is to combine rule-based fusion passes with offline operator tuning.
Device-cloud collaboration
Device-cloud collaboration mainly involves three parts: cloud-side training followed by device-side inference; cloud-side training followed by device-side incremental training and inference; and cloud/device federated learning.
Cloud-side training followed by device-side inference focuses on how to generate the most suitable device-side model, including model compression and adaptive model generation. Model compression technology has already been introduced above. Neural architecture search (NAS) is usually used to generate models that satisfy certain constraints (for example, the extreme memory limitations of a microcontroller); the biggest problem with NAS is how to shorten the time needed to search for a model.
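As a toy sketch of constraint-driven model search, the snippet below shows generic random search under a memory budget; the search space, memory estimator, and score function are all made-up placeholders, not MindSpore's NAS implementation.

```python
import random

# Hypothetical search space: width multiplier and number of blocks.
SEARCH_SPACE = {"width": [0.25, 0.5, 0.75, 1.0], "blocks": [4, 8, 12, 16]}
MEMORY_BUDGET_KB = 256  # e.g. an MCU-class constraint

def estimate_memory_kb(width, blocks):
    # Crude stand-in for a real parameter/activation memory estimator.
    return 40 * width * blocks

def proxy_score(width, blocks):
    # Stand-in for accuracy prediction (real NAS would train or use a predictor).
    return width * 0.7 + blocks * 0.02

def random_search(trials=100):
    best = None
    for _ in range(trials):
        w = random.choice(SEARCH_SPACE["width"])
        b = random.choice(SEARCH_SPACE["blocks"])
        if estimate_memory_kb(w, b) > MEMORY_BUDGET_KB:
            continue  # violates the device constraint, discard
        score = proxy_score(w, b)
        if best is None or score > best[0]:
            best = (score, {"width": w, "blocks": b})
    return best

# best_candidate = random_search()
```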
Cloud-side training followed by device-side incremental training focuses on efficient conversion between cloud and device models, which was introduced in the previous section.
Federated learning currently has two main technical schools in the industry. One is horizontal federated learning, which aggregates by samples; its typical application is privacy protection on mobile devices, for example advertising scenarios that need to build a federated model across millions of mobile devices without uploading users' private data to a data center. The other is vertical federated learning, which aggregates by features; it focuses more on big-data cooperation across institutions and organizations, especially data security and privacy protection in banking and financial scenarios.
The architecture of mobile device privacy protection
Cross-institution and cross-organization big data cooperation framework
Federated learning still faces many technical challenges. Cross-device system heterogeneity and the communication required during algorithm iterations affect the efficiency and accuracy of the final federated aggregation. Model encryption is needed during the federated learning process, because private information can be inferred even from weights alone, and there are also client poisoning attacks, adversarial examples, and so on. Another challenge is architectural: there is currently no unified federated-learning architecture that supports both horizontal and vertical federated learning.
MindSpore unified architecture solution for all scenarios
Device-cloud unified kernel
MindSpore adopts a layered design, decoupling the data structures and modules shared by the device and the cloud. While meeting device-side lightweight requirements, it keeps the device and cloud architectures consistent, truly enabling seamless device-cloud deployment of trained models.
[Unified IR] The unified IR in MindSpore Core ensures consistent model and operator definitions between device and cloud, so that a model trained on the cloud side can be deployed seamlessly on the device side. For device-side training, the same IR can be used to retrain the model just as on the cloud side.
The unified IR defines the logical structure of the model and the attributes of operators, and is decoupled from how the model is persisted. The most widely used persistence formats in open-source projects are Protocol Buffers (protobuf) and FlatBuffers. Protobuf is more powerful and more flexible to use, but correspondingly heavier; FlatBuffers is lighter and faster to deserialize. MindSpore persists the logical data of the unified IR into different physical forms: the cloud side persists it as protobuf and the device side as FlatBuffers, balancing data consistency with lightweight deployment.
[Common passes] To improve performance, a trained model needs certain optimizations before inference, including fusion, constant folding, data-layout adjustment, and so on. The optimizations shared by device and cloud are also placed in the MindSpore Core module, but for cloud-side inference these optimizations run during online inference, whereas for mobile terminals they are done offline, before inference.
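To make the idea of a shared optimization pass concrete, here is a toy constant-folding pass over a minimal, made-up graph representation; it is illustrative only, as MindSpore's real passes operate on its unified IR.

```python
# Toy graph: each node is (op, inputs); "const" nodes carry a value.
graph = {
    "a":   ("const", 2.0),
    "b":   ("const", 3.0),
    "sum": ("add", ["a", "b"]),     # both inputs constant -> foldable
    "out": ("mul", ["sum", "x"]),   # "x" is a runtime input, not foldable
}

def constant_fold(graph):
    """Replace ops whose inputs are all constants with a single const node."""
    ops = {"add": lambda l, r: l + r, "mul": lambda l, r: l * r}
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(graph.items()):
            if op in ops and all(graph.get(a, ("", None))[0] == "const" for a in args):
                value = ops[op](graph[args[0]][1], graph[args[1]][1])
                graph[name] = ("const", value)
                changed = True
    return graph

folded = constant_fold(graph)   # "sum" becomes ("const", 5.0)
```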
[Unified interface] MindSpore provides a unified C++ interface for both device and cloud. The usage of the unified C++ interface is kept as consistent as possible with the Python interface, which reduces the learning cost. Through the unified interface, users can run inference with one set of code on different hardware.
Lightweight technology
[MindSpore for micro] Compared with mobile terminals, the MCU chips in IoT devices have even more limited resources, so deploying deep learning models on IoT devices is even more challenging.
In the table above, the left side shows the memory and storage capacities of cloud servers, mobile phones, and MCUs, and the right side shows the storage and memory occupied by ResNet-50, MobileNet-V2, and an int8-quantized MobileNet-V2.
For IoT devices, MindSpore designed the MindSpore for micro solution.
The inference frameworks deployed on cloud servers and mobile terminals interpret the model at runtime. This approach can support multiple models across hardware platforms, but it requires extra runtime memory, the most precious resource on an MCU, to store meta-information such as the model structure and parameters. The CodeGen approach of MindSpore for micro moves the model's operator sequence from runtime to compile time and generates only the code needed to execute that model. This not only avoids runtime interpretation overhead but also frees up memory so that larger models can run. The binary generated this way is very small, so storage efficiency is high.
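A minimal sketch of the CodeGen idea follows: a hypothetical generator turns a fixed operator sequence into C source at build time. The operator names, kernel-library calls, and templates are all made up and do not reflect MindSpore for micro's actual generator.

```python
# Operator sequence fixed at model-conversion time (names are made up).
OP_SEQUENCE = [
    ("conv2d", "input",    "conv_out"),
    ("relu",   "conv_out", "relu_out"),
    ("dense",  "relu_out", "logits"),
]

C_TEMPLATES = {
    # Each op maps to a call into a pre-built kernel library (hypothetical names).
    "conv2d": "  micro_conv2d({src}, {dst});",
    "relu":   "  micro_relu({src}, {dst});",
    "dense":  "  micro_dense({src}, {dst});",
}

def generate_model_c(ops):
    """Emit a C function that runs the model with no runtime graph interpreter.
    Buffer declarations are omitted for brevity."""
    lines = ["void run_model(const float *input, float *logits) {"]
    for op, src, dst in ops:
        lines.append(C_TEMPLATES[op].format(src=src, dst=dst))
    lines.append("}")
    return "\n".join(lines)

print(generate_model_c(OP_SEQUENCE))
```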
The features of MindSpore for micro will be open sourced in version 1.2.
[Quantization]
MindSpore adaptive mixed low-bit quantization: based on the model structure and the target compression rate, the number of quantization bits for each layer is searched automatically, without requiring deep involvement from quantization experts. The quantization factors are trainable, which in low-bit quantization scenarios can greatly improve training efficiency and reduce quantization loss. On image classification and object detection models, it has been verified to achieve better accuracy than current industry quantization algorithms at 8-10x compression.
MindSpore post-training quantization: compared with quantization-aware retraining, post-training quantization has two obvious advantages: it does not need a large training dataset, and it does not need retraining, converting quickly offline. MindSpore uses a pipelined combination of quantization methods. The first stage applies conventional linear quantization to weights and activations, and the second stage analyzes the quantization error and uses statistical methods to correct the quantized model, compensating for the accuracy loss caused by quantization.
Pipelined combination quantization
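As a sketch of the first, linear-quantization stage described above, here is generic asymmetric uint8 quantization; the second, error-correction stage and MindSpore's actual statistical methods are not shown.

```python
import numpy as np

def quantize_linear(w, num_bits=8):
    """Asymmetric linear quantization of a weight tensor to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (qmax - qmin) if w_max > w_min else 1.0
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s, zp = quantize_linear(w)
err = np.abs(w - dequantize(q, s, zp)).mean()   # quantization error that the
                                                # second stage would compensate
```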
Efficient runtime
[Device-cloud unified runtime] To provide a unified parallel execution framework for device-cloud training and inference, MindSpore designed a unified device-cloud runtime based on the Actor model.
AI training or inference ultimately executes a DAG computation graph: each node in the graph is an op, and each edge is a tensor (or a group of tensors). In the figure below, the left side is a schematic of the actor model and the right side is a schematic of an AI computing task.
We define each op as an actor, and tensors are passed between actors. In the actor model, messages are stateless and not reused, but in AI computing tasks tensors are usually reused to improve efficiency. To solve this, MindRT introduces a tensor manager that manages tensors uniformly, and all ops obtain tensors through the tensor manager.
The tensor manager supports tensor reference counting and memory management. The device-cloud unified runtime will be open sourced in MindSpore 1.2/1.3.
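A toy sketch of the tensor-manager idea: reference counting so that a tensor shared between actors is released as soon as its last consumer finishes. This is a simplified illustration, not MindRT's actual implementation.

```python
class TensorManager:
    """Tracks how many downstream ops still need each tensor."""
    def __init__(self):
        self.refcount = {}
        self.storage = {}

    def register(self, name, data, consumers):
        self.storage[name] = data
        self.refcount[name] = consumers

    def acquire(self, name):
        return self.storage[name]

    def release(self, name):
        # Called by an op (actor) when it has finished reading the tensor.
        self.refcount[name] -= 1
        if self.refcount[name] == 0:
            del self.storage[name]   # memory can be reused immediately

manager = TensorManager()
manager.register("conv_out", data=[0.0] * 1024, consumers=2)  # used by 2 ops
x = manager.acquire("conv_out"); manager.release("conv_out")
y = manager.acquire("conv_out"); manager.release("conv_out")  # freed here
```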
[Software-hardware co-optimization] MindSpore is natively and deeply integrated with device-side NPU chips, maximizing the performance advantages of dedicated chips.
[Operator optimization] On mobile-phone CPUs, MindSpore supports a variety of convolution algorithms: Sliding Window, Im2Col+GEMM, Strassen, Winograd, Indirect Convolution, FFT, and so on. There are usually three ways to choose the optimal convolution algorithm under different conditions:
- Manual selection based on experience
- A cost model built through mathematical modeling
- A machine-learning model trained offline on an existing dataset to obtain a reliable convolution algorithm selector
At present, MindSpore supports methods 2 and 3 for selecting the optimal convolution algorithm.
Besides performance, algorithm selection also has to respect the memory limits of specific scenarios. For example, on hardware in IoT scenarios, the common Im2Col+GEMM algorithm has to flatten the input and the convolution kernel in memory during computation, which occupies a large amount of memory; for this scenario, MindSpore instead chooses the Indirect Convolution algorithm, which uses less memory.
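A simplified sketch of cost-model-based selection (method 2 above): the cost formulas and overhead constants here are illustrative placeholders, not MindSpore's actual model.

```python
# Illustrative cost models (placeholders, not MindSpore's real formulas).
def cost_im2col_gemm(h, w, cin, cout, k):
    # Roughly proportional to the number of multiply-accumulates.
    return h * w * cin * cout * k * k

def cost_winograd(h, w, cin, cout, k, tile_overhead=8.0):
    if k != 3:
        return float("inf")            # the classic F(2x2,3x3) variant targets 3x3
    # ~2.25x fewer multiplications, plus input/output transform overhead that
    # is amortized poorly when channel counts are small.
    return h * w * cin * cout * k * k / 2.25 + tile_overhead * h * w * (cin + cout)

def pick_conv_algorithm(h, w, cin, cout, k=3):
    costs = {
        "Im2Col+GEMM": cost_im2col_gemm(h, w, cin, cout, k),
        "Winograd":    cost_winograd(h, w, cin, cout, k),
    }
    return min(costs, key=costs.get)

# The selector is queried per layer shape; the third, ML-based approach would
# replace these hand-written formulas with a trained classifier.
algo = pick_conv_algorithm(56, 56, 64, 64)
```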
Federated learning
MindSpore federated learning supports both cross-device (ToC) and cross-silo (ToB) scenarios, enabling multi-party joint modeling while data stays within its own domain, helping enterprise applications improve efficiency, reduce costs, and drive intelligent upgrades across industries. In terms of security, MindSpore provides a variety of model encryption methods that can be applied to large-scale stateless terminal devices, including differential privacy, secret sharing, and secure aggregation, and users can customize the security level.
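As a toy illustration of the cross-device idea, the sketch below shows plain federated averaging with optional Gaussian noise on the updates; it is a generic sketch of the technique, not MindSpore Federated's protocol or API.

```python
import numpy as np

def local_update(weights, data_noise=0.1):
    # Stand-in for one client's local training step on its private data.
    return weights + np.random.randn(*weights.shape) * data_noise

def federated_round(global_weights, num_clients=10, dp_sigma=0.0):
    """Average client updates; only updates (never raw data) leave the device."""
    updates = []
    for _ in range(num_clients):
        w = local_update(global_weights.copy())
        delta = w - global_weights
        if dp_sigma > 0:                       # optional DP-style noise
            delta += np.random.randn(*delta.shape) * dp_sigma
        updates.append(delta)
    return global_weights + np.mean(updates, axis=0)

weights = np.zeros(128, dtype=np.float32)
for _ in range(5):
    weights = federated_round(weights, dp_sigma=0.01)
```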
Now that you have seen the major advantages of the MindSpore AI framework, hurry up and [click the link] to [register now]: you can work through a classic case on the ModelArts platform and master deep learning based on MindSpore!