Author: Xiu Yutong (Sound String)
The MNN Workbench is a one-stop AI R&D platform for device-side development, built by Alibaba's Taobao intelligence team and open to the public for free. It is based on the open-source MNN deep learning inference engine (open-source address: https://github.com/alibaba/MNN) and provides a series of capabilities such as built-in model tools, standalone pre-training templates, an out-of-the-box algorithm set, real-device breakpoint debugging, and an original three-terminal integrated deployment solution. Since its release, the MNN Workbench has been committed to solving the two core problems of device-side AI development: lowering the participation threshold for developers interested in AI, and improving collaboration efficiency between algorithm and engineering teams. It has been warmly received by device-side developers.
But beyond the praise, the MNN Workbench team kept asking itself: in making iteration more efficient, deployment verification more coherent, and model production easier, have we really gone as far as we can? When targeting developers of different levels, are we really addressing all of their core needs?
Therefore, after communicating with different teams and developers, we focused our attention on two aspects:
- The deployment and debugging experience of the MNN Workbench is very smooth, but it lacks first-class training capability, which splits the overall training-to-deployment workflow;
- Algorithms are gradually trending toward complexity, multi-tasking, and multi-model collaboration; providing an effective solution to this is also an urgent challenge.
Building Professional-Grade Training Capabilities
The MNN Workbench provides an out-of-the-box algorithm model market and a number of built-in pre-training templates, giving developers an extremely simple way to produce models and, to a certain extent, laying the groundwork for popularizing device-side AI. However, as more and more professional algorithm users joined, the existing training capabilities of the MNN Workbench exposed several drawbacks:
Figure 1 - The previous one-stop workflow
- The model market and pre-training templates target simple, general scenarios and expose only a very small number of model parameters, so they cannot be customized to the demands of professional algorithm work;
- The previous training process of the MNN Workbench ran MNN's training mechanism on a single device, which limited dataset size and training parameters to a certain extent, so the resulting models could not meet the quality requirements of complex business scenarios.
For these reasons, embedding professional-grade training capability into the MNN Workbench, turning it into a complete workflow integrating inference and training, became urgent. Considering that training requires large-scale cluster resources and scheduling management, we chose to cooperate with the Alibaba Cloud PAI-DLC team, leveraging its powerful cloud infrastructure to build training capabilities that fit the device-side AI R&D process. Throughout the implementation, the MNN Workbench did not simply copy cloud training concepts and wire up APIs; it combined the pain points of device-side deployment verification with its own insights and designed the workflow from the user's perspective:
Figure 2 - Inference-training integrated workflow based on PAI-DLC
- PAI-DLC is an independent training cluster: code involved in training must be submitted through Git, and data, samples, and models must be provided through NAS / OSS. Configuring these links manually is costly, so the MNN Workbench integrates with the related systems as a whole. With just a few mouse clicks, the relevant datasets are automatically synchronized to the right locations, and automation scripts then take over the tedious, repetitive work;
- There is still some distance between producing a model and deploying it to a terminal device, and this last mile should not be underestimated. The traditional device-side deployment process requires intervention and support from the engineering team, and tuning the related parameters requires continuous communication and joint debugging, draining the algorithm engineer's patience and energy. With the MNN Workbench's three-terminal integration, algorithm engineers can download the model file with one click and write a few lines of code to verify the effect on a real device, a truly smooth one-stop train-and-deploy experience;
- Given the lightweight nature of device-side models, we also incorporated MNN's quantization and sparsification capabilities into the workflow, so that algorithm engineers can efficiently compare the effects of different model compression settings.
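To make the quantization step above concrete, here is a minimal pure-Python sketch of symmetric int8 weight quantization, the kind of compression whose effect the workflow lets you compare against the float model. It is a conceptual stand-in only, not MNN's actual quantization tool.

```python
# Illustrative symmetric int8 quantization: map float weights to
# int8 values plus a single scale factor, then reconstruct them.
# (Conceptual sketch only -- not MNN's real quantizer.)

def quantize_int8(weights):
    """Map float weights to int8 values in [-127, 127] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.31, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
print(q)                      # int8 values, e.g. [31, -127, 5, 90]
print(dequantize(q, scale))   # close to the original weights
```

Each weight is stored in one byte instead of four, at the cost of a bounded rounding error of at most half the scale; this is the basic trade-off the workflow lets you evaluate end to end.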
Figure 3 - Demonstration of the inference-training integrated workflow
Feature Improvements: Multiple Windows, Joint Debugging, and Git Flow
With the inference-training integrated workflow gradually taking shape, a new problem emerged: cloud training is time-consuming, so how can multiple models be trained at the same time to save time?
Until then, the Workbench supported only a single-window "exclusive" mode: only one project at a time could enjoy the one-stop workflow, and to work on another project the user had to switch away from the current one. Much like feature-phone interaction before the smartphone era, this was very inconvenient. In addition, some complex on-device computing scenarios require joint development and debugging across multiple projects, so adding multi-window support to the Workbench was imperative.
So while the inference-training integrated platform was still in its infancy, we began the multi-window transformation. We reworked the Workbench's underlying process model and IPC model and added one-to-many connection support to the device-side DebugSDK, letting the Workbench leap from the "feature phone" to the "smartphone" era:
Figure 4 - MNN Workbench Multi-Window Architecture
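The one-to-many connection can be pictured as a small dispatcher that routes debug messages between several Workbench windows and a single device session. The sketch below is a simplified, hypothetical model of that idea (all names are invented for illustration); it is not the actual DebugSDK protocol.

```python
from collections import defaultdict, deque

class DebugDispatcher:
    """Toy model of a one-to-many debug connection: one device
    session serving several Workbench windows at once.
    (Illustrative only -- not the real DebugSDK.)"""

    def __init__(self):
        self.inboxes = defaultdict(deque)  # window_id -> messages from device
        self.device_queue = deque()        # messages bound for the device

    def attach(self, window_id):
        self.inboxes[window_id]            # register an inbox for a window

    def send_to_device(self, window_id, payload):
        # Tag each message so the device can answer the right window.
        self.device_queue.append({"from": window_id, "payload": payload})

    def device_reply(self, window_id, payload):
        self.inboxes[window_id].append(payload)

    def broadcast(self, payload):
        # e.g. a device log line that every open window should see
        for inbox in self.inboxes.values():
            inbox.append(payload)

d = DebugDispatcher()
d.attach("projA")
d.attach("projB")
d.send_to_device("projA", "run model")
d.broadcast("device connected")
print(len(d.device_queue))       # 1
print(list(d.inboxes["projB"]))  # ['device connected']
```

The key point is the tagging: because every device-bound message records which window sent it, one physical device connection can be multiplexed among any number of debugging projects.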
With the Workbench now in the "smartphone" era, device-side algorithm development gains many more possibilities. You no longer need to close the current project to use other functions: you can open multiple projects at the same time, develop and submit multiple sets of training code, debug multiple on-device computing projects on the same phone, use model tools while developing, and more:
Figure 5 - MNN Workbench Multitasking Joint Debugging
Figure 6 - Simultaneous access to multiple model tools with multi-window capabilities
Moving from a single window to multiple windows means users have more projects to manage; to avoid confusion, it is best to complete every operation of the full workflow inside the corresponding Workbench window. Reviewing the details, we found that whether for on-device publishing or remote training, a Git-based code publishing process is indispensable; if the Workbench could not support a complete Git workflow, the overall flow would still be fragmented. The built-in Git component supports reviewing changes in a Diff Editor, visually editing the Git workspace, and performing Pull / Push / Stash and other operations through the command menu, so code inspection and release can be completed with ease. Together with Git Flow, the inference-training integrated platform truly brings the full device-side algorithm-engineering workflow into a single workspace:
Figure 7 - Visual version control via the Git Flow component
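Under the hood, the actions such a Git component exposes correspond to ordinary git commands. The thin wrapper below is purely hypothetical (the wrapper names are invented; only the git subcommands themselves are standard):

```python
import subprocess

def git_cmd(op, *args):
    """Build the git invocation for a command-menu action (illustrative)."""
    return ["git", op, *args]

def run_git(op, *args, cwd="."):
    """Run a git command in the project workspace and return its stdout."""
    result = subprocess.run(git_cmd(op, *args), cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout

# The command-menu actions map onto plain git operations:
print(git_cmd("pull"))            # ['git', 'pull']
print(git_cmd("stash"))           # ['git', 'stash']
print(git_cmd("diff", "--stat"))  # ['git', 'diff', '--stat']
```

The value of the component is not these commands themselves but keeping them, and the diff review, inside the same window as training and debugging.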
Best Practices
Based on the inference-training integrated workflow and the Workbench's integrated development environment, we can quickly build a device-side algorithm project from 0 to 1. First, create a PAI training project, write the training code, and verify training locally:
Figure 8 - Running the training project using the local environment
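As a stand-in for real training code, here is a minimal pure-Python training loop of the kind one might sanity-check locally before submitting to the cloud; the model and data are toy placeholders, and a real project would use an actual training framework.

```python
# Toy training code: fit y = 2x + 1 with batch gradient descent.
# Purely illustrative of "verify locally, then push to PAI-DLC".

def train(samples, lr=0.05, epochs=500):
    """Fit a 1-D linear model w*x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = (w * x + b) - y
            grad_w += 2 * err * x / len(samples)
            grad_b += 2 * err / len(samples)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = [(x, 2 * x + 1) for x in range(-5, 6)]
w, b = train(data)
print(round(w, 3), round(b, 3))  # converges to roughly 2.0 and 1.0
```

If a loop like this behaves sensibly on a small local sample, the same code can be pushed unchanged to the cloud container to run against the full dataset.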
After local verification passes, we can push the project to a cloud PAI-DLC container for training with one click; when training completes, we can pull the training artifacts back to the local machine with one click:
Figure 9 - Cloud training and downloading training artifacts
Cloud-trained models usually need to be processed by model tools such as conversion and quantization before they can run on the device. With the Workbench's multi-window capability, you can right-click a model to open any number of model tools for efficient model processing:
Figure 10 - One-click execution of multiple model tools
Finally, we can switch to the three-terminal integrated device-side development and deployment environment for verification. If we encounter problems during verification, we can seamlessly switch back to the training project and repeat the previous steps:
Figure 11 - Device side deployment and debugging
The inference-training integrated workflow completely solves the split between device-side algorithm training and deployment. The training part combines device and cloud so that developers no longer perceive any difference between the two, while the inference and verification part relies on three-terminal integration and the Workbench's powerful debugging capabilities, allowing algorithm engineers to independently debug and deploy on the device side. These capabilities are now available: download MNN Workbench version 1.6.0 or later at https://www.mnn.zone to try them out. For how to use inference-training integration, please refer to the MNN Workbench inference-training integrated platform operation manual.
Epilogue
Over the past six months, the MNN Workbench has adhered to the principle of "solving users' real needs", continuously communicating with engineering and algorithm colleagues about the problems encountered in device-side AI development. From debugging and real-device verification to real-time performance evaluation, the series of features we have launched demonstrates our determination and ability to solve the full-link problems of device-side AI R&D. With the release of the MNN Workbench version that integrates professional training capabilities, we believe it will bring you an even better experience.
Follow the [Alibaba Mobile Technology] WeChat official account for three mobile technology practices & insights every week to get you thinking!