A paper by the Taobao technical team has been accepted to OSDI, one of the top international academic conferences in the field of computer systems. This is the first time a Taobao systems paper has been accepted at this venue. The paper describes Walle ("Wall-E"), the team's end-cloud collaborative machine learning system. David Tennenhouse, an invited speaker at OSDI, specifically recommended the Walle system in his conference keynote and praised its technical sophistication and application results. Today, Walle serves as Alibaba's machine learning infrastructure, supporting 300+ algorithm tasks across 30+ apps.

OSDI Conference

USENIX OSDI (Operating Systems Design and Implementation) is one of the top international academic conferences in computer systems, often called the Oscars of the operating systems field. It carries high academic standing and influence, bringing together cutting-edge thinking and breakthrough results from systems researchers in academia and industry worldwide. This year's OSDI invited David Tennenhouse to deliver a keynote. He is an IEEE Fellow who has led research organizations at Intel, Amazon/A9.com, Microsoft, VMware, and DARPA, and has also taught at MIT. In his keynote, Tennenhouse specifically recommended the Walle system, and he actively communicated with the authors over email and Slack before and after the Walle talk.

Origin of Walle System Name

Walle takes its name from the 2008 movie "WALL-E". In the film, the robot WALL-E cleans up Earth's garbage, turning waste into treasure. Walle's architects hold a similar intention: they hope that the device-cloud collaborative machine learning system they designed and built can, like WALL-E, effectively utilize the user data on billions of mobile devices, fully releasing its neglected value to provide users with better intelligent services.

Walle System Design Philosophy

Figure 1: Walle workflow from the perspective of a machine learning task developer

To break the bottlenecks of mainstream cloud-based machine learning frameworks, namely high latency, high overhead, high server load, and high privacy and security risks, Walle adopts the new paradigm of device-cloud collaborative machine learning, fully exploiting the natural advantage of mobile devices being close to users and data, and letting devices and the cloud complement each other. Unlike existing work on device-cloud collaborative learning, which operates mainly at the algorithm level and targets specific inference or training tasks in specific application scenarios, Walle is the first end-to-end, general-purpose, large-scale industrial system for device-cloud collaborative machine learning. Walle lets machine learning tasks exchange any necessary information (such as data, features, samples, models, model updates, and intermediate results) between device and cloud. Walle follows an end-to-end architecture design oriented toward machine learning tasks: from the developer's perspective, it covers task development, deployment, and runtime, and supports each stage on both the device side and the cloud side. Walle also follows a general-purpose system design, rather than bundling large numbers of solutions customized for specific applications and platforms. Downward, Walle smooths over hardware and software differences across devices and the cloud and keeps mobile apps lightweight; upward, it supports large-scale industrial application of many kinds of machine learning tasks.

Walle System Architecture

Figure 2: Walle's overall architecture

Walle mainly comprises the following three core system modules:

  1. The deployment platform manages large-scale machine learning tasks and deploys them to billions of devices in a timely manner;
  2. The data pipeline covers the pre-processing stage of machine learning tasks and provides task input for both the device side and the cloud side;
  3. The compute container provides a cross-platform, high-performance execution environment for machine learning tasks while meeting the practical need for day-level task iteration.

  Specifically:

  4. At the bottom of the compute container sits the MNN deep learning framework, which includes a high-performance tensor compute engine plus standard data processing and model runtime libraries, and exposes a unified external interface through a reworked Python thread-level virtual machine, supporting full-link execution of diverse machine learning tasks and multi-task parallelism. MNN's core technical innovations are two new mechanisms: geometric computing and semi-automatic search. Geometric computing decomposes shape-transforming operators, greatly reducing the work of manually optimizing hundreds of operators for more than ten kinds of hardware backends; semi-automatic search then quickly finds, at runtime, the optimal available backend and execution scheme for a computational graph. The Python thread-level virtual machine abandons the Global Interpreter Lock (GIL) and supports multi-task, multi-thread parallelism for the first time; driven by the practical needs of mobile apps, it was also ported to the device side for the first time through trimming and adaptation;
  5. The data pipeline introduces a new device-side stream processing framework, following the basic principle of "stateful computation over unbounded data streams on a single resource-constrained mobile device", so that user behavior data can be processed efficiently near the data source. It also adds a trie-based task trigger management mechanism that batch-triggers related stream processing tasks on the device. In addition, a real-time transmission channel between device and cloud supports data upload and distribution at the 100-millisecond level;
  6. The deployment platform implements fine-grained task management via a git-based mechanism, combines push and pull with multi-batch task release to balance timeliness and stability, and supports both unified and customized multi-granularity deployment strategies.
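The trie-based trigger management in point 5 can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, not Walle's actual interface. Tasks register under event-key prefixes, and one incoming event walks the trie once to batch-collect every related task to execute.

```python
class _Node:
    def __init__(self):
        self.children = {}   # event segment -> child node
        self.tasks = []      # tasks registered at this prefix

class TriggerTrie:
    """Hypothetical sketch of trie-based trigger management: stream-processing
    tasks subscribe to prefixes of an event key, and a single incoming event
    batch-triggers all tasks registered along its path."""

    def __init__(self):
        self.root = _Node()

    def register(self, event_path, task):
        node = self.root
        for seg in event_path:
            node = node.children.setdefault(seg, _Node())
        node.tasks.append(task)

    def trigger(self, event_path):
        # Walk the event's path once, collecting tasks at every matching
        # prefix, so related tasks are dispatched together in one pass.
        fired, node = [], self.root
        for seg in event_path:
            node = node.children.get(seg)
            if node is None:
                break
            fired.extend(node.tasks)
        return fired

trie = TriggerTrie()
trie.register(("page", "item"), "ipv_feature_task")
trie.register(("page", "item", "cart"), "cart_intent_task")
print(trie.trigger(("page", "item", "cart")))  # fires both tasks
```

A single lookup over the shared prefix replaces per-task event matching, which is what makes batch triggering of many related tasks cheap on a resource-constrained device.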

System performance in typical real-world applications

Figure 3: Device-cloud collaborative highlight recognition in the e-commerce live-streaming scenario

In the Taobao live-streaming scenario, the intelligent highlight task uses machine learning to automatically locate the moments where a streamer introduces a product's selling points (i.e., the information that makes the product attractive to buyers), improving the user experience. Compared with the previous cloud-only highlight link, the new device-cloud collaborative link introduced with Walle reduces the average cloud-side load per produced highlight by 87%, increases the number of streamers covered by intelligent highlights by 123%, and increases the number of highlights produced per unit of cloud compute by 74%. Real-device tests show that the average total time per highlight-recognition task is 130.97 ms on a Huawei P50 Pro and 90.42 ms on an iPhone 11. These results demonstrate the practicality of the device-cloud collaborative learning framework and the high performance of Walle's compute container.
Figure 4: Production of the IPV feature based on the Walle data pipeline in the e-commerce recommendation scenario

The IPV feature characterizes a user's in-depth behaviors on item detail pages (such as favoriting, adding to cart, and placing orders), and it plays a very important role in the recommendation model. The original cloud-side IPV feature production link had an average per-feature delay of 33.73 seconds, consumed substantial compute, communication, and storage resources, and had an error rate of 0.7%. In contrast, Walle's new data pipeline completes IPV feature production on the device side with an average on-device delay of only 44.16 milliseconds, while cutting data volume by more than 90% and guaranteeing feature correctness. These results show that, compared with mainstream cloud-based data pipelines, Walle's new data pipeline greatly improves the timeliness, efficiency, and correctness of feature production and consumption.
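The on-device stateful stream computation behind such a feature can be sketched roughly as below. This is a hedged illustration under assumed names and fields (the class, event schema, and action labels are hypothetical, not Walle's actual pipeline): raw item-page events are folded into a compact per-item summary near the data source, so only the aggregate, not the raw event stream, needs to leave the device.

```python
from collections import Counter

class IPVFeatureAggregator:
    """Illustrative sketch of on-device stateful stream processing for an
    item-page-view style feature. All names here are hypothetical; the point
    is folding an unbounded event stream into small per-item state."""

    def __init__(self):
        self._state = {}  # item_id -> Counter of in-page actions

    def on_event(self, item_id, action):
        # Stateful per-event update, e.g. "favorite", "add_cart", "order".
        self._state.setdefault(item_id, Counter())[action] += 1

    def feature(self, item_id):
        # Snapshot consumed as model input when a recommendation is requested.
        return dict(self._state.get(item_id, {}))

agg = IPVFeatureAggregator()
agg.on_event(101, "add_cart")
agg.on_event(101, "favorite")
print(agg.feature(101))
```

Because the state is a few counters per item rather than the full event log, this style of aggregation is one way the reported 90%+ reduction in transferred data volume becomes plausible.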
Figure 5: Deployment of a randomly selected online machine learning task

To test the timeliness and scale of the Walle deployment platform, an online machine learning task was randomly selected and its entire deployment to the target device group was monitored. While guaranteeing task stability, the Walle deployment platform successfully covered the 7 million mobile devices that were online within 7 minutes, and covered all 22 million target devices within 22 minutes.
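The multi-batch release strategy mentioned earlier can be sketched as follows. This is a minimal sketch under assumed names (the function, batch fractions, and health check are illustrative, not the platform's real mechanism): the task is pushed to progressively larger device batches, and the rollout halts if a batch's health check fails, which is how timeliness is traded off against stability.

```python
def staged_rollout(devices, batch_fracs=(0.01, 0.1, 0.5, 1.0),
                   healthy=lambda batch: True):
    """Hedged sketch of multi-batch task release: push to growing fractions
    of the device group, stopping early if a batch reports problems.
    Returns the number of devices that received the task."""
    released = 0
    for frac in batch_fracs:
        target = int(len(devices) * frac)
        batch = devices[released:target]
        if not healthy(batch):   # e.g. crash/error-rate check between batches
            return released      # halt the rollout for stability
        released = target
    return released

# With all health checks passing, the whole group is covered batch by batch.
print(staged_rollout(list(range(1000))))
```

In a real platform the health check would aggregate device-side error telemetry between batches; the sketch only shows the control flow.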

Benchmark test results of core modules

Figure 6: MNN vs. TensorFlow (Lite), PyTorch (Mobile)
MNN was benchmarked against TensorFlow (Lite) and PyTorch (Mobile) on the mainstream hardware backends of Android and iOS mobile devices and Linux servers, using 7 models commonly used in vision, natural language understanding, and recommendation. The results show that MNN outperforms the other deep learning frameworks in almost all test cases. Beyond raw performance, MNN can also run every model on all mobile hardware backends, while TensorFlow Lite and PyTorch Mobile cannot support some backends or models, making MNN the more versatile framework.

Figure 7: MNN vs. TVM
In addition, MNN was compared against TVM, with TVM's automatic tuning and compilation hosted on a MacBook Pro 2019 and an NVIDIA GeForce RTX 2080 Ti. On one hand, TVM's automatic tuning and compilation take on the order of thousands of seconds, whereas MNN's runtime semi-automatic search takes only hundreds of milliseconds. Combined with the differences between MNN and TVM in design and actual deployment (in particular, TVM's inability to deploy models dynamically on iOS devices; see the slides and paper for details), this means MNN can support industrial scenarios with large-scale heterogeneous hardware backends and frequent, rapid task iteration, where TVM is not feasible. On the other hand, MNN's per-model inference time is also lower than TVM's on every hardware backend, especially on GPU servers, owing mainly to MNN's manual operator optimization.
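The intuition behind runtime search being so much cheaper than offline tuning can be sketched as below. This is only a toy illustration in the spirit of semi-automatic search, not MNN's real interface: candidate execution schemes are briefly timed on the actual device and the fastest is kept, so the cost is a handful of trial runs rather than an offline compile-and-tune cycle per model and backend.

```python
import time

def pick_backend(run_fns, warmup=1, trials=3):
    """Toy sketch of runtime scheme selection: briefly benchmark each
    candidate backend's execution function on-device and keep the fastest.
    `run_fns` maps a backend name to a callable that runs the graph once."""
    best, best_t = None, float("inf")
    for name, fn in run_fns.items():
        for _ in range(warmup):          # discard cold-start effects
            fn()
        t0 = time.perf_counter()
        for _ in range(trials):
            fn()
        elapsed = (time.perf_counter() - t0) / trials
        if elapsed < best_t:
            best, best_t = name, elapsed
    return best
```

Real semi-automatic search prunes the candidate space analytically before measuring (hence "semi-automatic"), but the key property survives in the sketch: the whole selection runs in milliseconds at task startup.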
Figure 8: Python thread-level virtual machine vs. CPython (based on statistical analysis of 30 million machine learning task executions online)
Finally, a performance comparison between the Python thread-level virtual machine and CPython was carried out. The results show that the thread-level virtual machine delivers large performance gains on three types of tasks with different amounts of computation, mainly thanks to removing the GIL and supporting task-level multi-thread concurrency.
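The task-level multithreading model being compared can be sketched as below. Note the hedge: this snippet runs on stock CPython, where the GIL still serializes CPU-bound bytecode across threads; it only illustrates the execution model of running independent tasks on separate threads, which is exactly what Walle's GIL-free thread-level virtual machine makes truly parallel.

```python
import threading

def run_tasks_concurrently(tasks):
    """Sketch of task-level multithreading: each ML task runs on its own
    thread. Under stock CPython the GIL serializes CPU-bound work; a
    GIL-free VM lets these threads execute in parallel."""
    threads = [threading.Thread(target=t) for t in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

results = []
run_tasks_concurrently([lambda: results.append("task_a"),
                        lambda: results.append("task_b")])
print(sorted(results))
```

With the GIL removed, independent tasks written this way no longer contend for a single interpreter lock, which is the source of the reported speedups on compute-heavy tasks.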

Production adoption

At present, Walle serves as Alibaba Group's machine learning infrastructure: it is invoked more than 100 billion times a day and supports 300+ vision, recommendation, and other tasks across 30+ mobile apps (including Mobile Taobao, Ele.me, AliExpress, Cainiao Baobao, etc.). In addition, MNN has been open-sourced on GitHub, where it currently has 6.8k stars and 1.4k forks; it was also selected for the 2021 "Technology China" open-source innovation list and has been adopted in production by more than 10 other companies.

Author and citation information

Chengfei Lv, Chaoyue Niu, Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bin Zou, Peng Lan, Guohuan Xu, Fei Wu, Shaojie Tang, Fan Wu, and Guihai Chen. Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 249-265, Carlsbad, CA, USA, July 11-13, 2022. https://www.usenix.org/conference/osdi22/presentation/lv

Paper related information

OSDI 2022 Walle talk slides and full paper download: https://files.alicdn.com/tpsservice/d8b31c9ed4b072b9f89f5ec8d5b371ba.zip


Taobao Technology

Taobao Technology is the flagship technology team for Alibaba Group's New Retail, supporting core e-commerce businesses such as Taobao and Tmall. Relying on Taobao's rich business forms and massive user base, the team continues to drive product and business innovation with technology, constantly exploring and incubating disruptive new Internet technologies, with more intelligent...