Author | Wang Rongping
As cloud-native technology matures and sees wider industrial adoption, machine learning on the cloud is rapidly moving toward large-scale, industrialized deployment.
Recently, Morphling, an independent sub-project of Alibaba's open-source KubeDL, became a Cloud Native Computing Foundation (CNCF) Sandbox project. Morphling targets the deployment of machine learning model inference services at industrial scale: it automates deployment-configuration tuning, testing, and recommendation, helping enterprises take full advantage of cloud native as GPU virtualization and sharing technologies mature, optimize the performance of online machine learning services, reduce service deployment costs, and efficiently address the performance and cost challenges machine learning faces in real industrial deployments. In addition, the academic paper related to the project, "Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving," was accepted by the ACM Symposium on Cloud Computing 2021 (ACM SoCC 2021).
Morphling takes its name from the hero in the game Dota (known in Chinese as "Waterman"), who can flexibly change his form to suit the environment and optimize his combat performance. In the same spirit, we hope the Morphling project enables flexible, intelligent changes to the deployment configuration of machine learning inference jobs, optimizing service performance and reducing service deployment costs.
Morphling GitHub: https://github.com/kubedl-io/morphling
Morphling website: https://kubedl.io/tuning/intro/
Background
The machine learning workflow on the cloud can be divided into two parts: model training and model serving. After a model is trained and tuned offline, it is deployed as an online containerized application that provides users with uninterrupted, high-quality inference services, such as identifying products in live-stream videos, online language translation, and online image classification. For example, the Machine Vision Application Platform (MVAP) of Alibaba's Taobao content social platform uses an online machine learning inference engine to support product feature recognition in Taobao live streams, de-duplication of live-stream cover images, and classification of shopping images and text. According to Intel, the era of large-scale inference ("Inference at Scale") is approaching: by 2020, the ratio of inference to training cycles would exceed 5:1. Amazon's data shows that in 2019, the infrastructure cost of model inference services on AWS accounted for more than 90% of the total cost of machine learning workloads. Machine learning inference has become the key to putting artificial intelligence into production and monetizing it.
Inference tasks on the cloud
An inference service is itself a special kind of long-running microservice. As the volume of inference services deployed on the cloud grows, cost and service performance become critical optimization targets. This requires the operations team to tune the inference container's configuration before deployment, including hardware resource allocations and service runtime parameters. These configurations play a vital role in balancing service performance (such as response time and throughput) against resource efficiency. In practice, our tests found that different deployment configurations can produce up to a tenfold gap in throughput per unit of resources.
Drawing on Alibaba's extensive experience with AI inference services, we first summarized how inference workloads differ from traditional service deployment configurations:
- They use expensive GPU resources but have low memory footprints. The maturation of GPU virtualization and time-sharing technologies gives us the opportunity to run multiple inference services on a single GPU, significantly reducing costs. Unlike a training task, an inference task uses an already trained neural network model to turn user input into output; this involves only the forward pass of the network and therefore requires relatively little GPU memory. By contrast, training involves backpropagation and must store a large amount of intermediate results, putting far more pressure on GPU memory. Our cluster data shows that allocating an entire GPU to a single inference task wastes considerable resources. However, choosing an appropriate GPU resource specification for an inference service, especially for incompressible GPU memory, remains a key problem.
- Performance bottlenecks are diverse. Besides GPU resources, inference tasks also involve complex data pre-processing (converting user input into parameters that match the model's input) and post-processing (producing output in a format users can consume). These steps usually run on the CPU, while model inference usually runs on the GPU. Depending on the service, GPU, CPU, or other hardware resources can each become the dominant factor in response time and thus the resource bottleneck.
- Container runtime parameters are another dimension that deployers need to tune. Beyond compute resources, runtime parameters such as the number of concurrent threads in the container and the batch size of the inference service directly affect metrics such as RT and QPS (see the sketch after this list).
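To make these dimensions concrete, here is a minimal sketch of the tuning space discussed above, expressed as a plain Python search space. The parameter names and candidate values are illustrative assumptions, not Morphling's actual API or defaults.

```python
# Illustrative tuning dimensions for an inference container; names and
# candidate values are assumptions for this sketch only.
from itertools import product

search_space = {
    "cpu_cores":       [1, 2, 4, 8],        # CPU for pre-/post-processing
    "gpu_memory_gb":   [1, 2, 4, 8],        # share of a virtualized GPU
    "batch_size":      [1, 4, 8, 16, 32],   # inference batch size
    "max_concurrency": [1, 2, 4, 8],        # serving threads in the container
}

combinations = list(product(*search_space.values()))
print(f"{len(combinations)} candidate configurations")  # 4 * 4 * 5 * 4 = 320
```

Even this modest example already yields hundreds of candidate combinations, which is why exhaustively stress-testing every configuration quickly becomes impractical.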
Optimizing inference service deployment configuration
Cloud-native technology centered on Kubernetes is being widely adopted for new kinds of application workloads. Building machine learning tasks (both training and inference) on Kubernetes and deploying them stably, efficiently, and at low cost has become key for major companies advancing AI projects and cloud services. For configuring inference containers under the Kubernetes framework, the industry is still exploring:
- The most common approach is to set parameters manually based on experience, which is simple but inefficient. In practice, to guarantee service quality, deployers tend to over-provision resources, sacrificing efficiency for stability and wasting a great deal of resources; or they simply leave runtime parameters at their defaults, missing opportunities for performance optimization.
- An alternative is to refine resource allocation based on the historical resource usage profile. However, our observations in practice show that day-to-day resource usage does not reflect peak load under stress testing, so it cannot be used to estimate the service's capacity ceiling. Moreover, newly launched services generally lack reliable historical data to draw on; due to the characteristics of machine learning frameworks, historical GPU memory usage usually does not accurately reflect an application's real memory needs; and historical data offers little support for tuning the runtime parameters of programs inside the container.
In general, although the Kubernetes community has some research and products for automatic parameter recommendation in the broader area of hyper-parameter tuning, the industry lacks a cloud-native parameter configuration system aimed directly at machine learning inference services.
Drawing on Alibaba's extensive AI inference service experience, we concluded that the pain points of inference configuration optimization are:
- The lack of a framework for automated performance testing and parameter tuning: manually iterating through configuration adjustment and stress testing imposes a huge burden on deployment testing, making this approach impractical in reality.
- The need for a stable, non-intrusive performance testing process: deploying and testing services directly in the production environment would affect the user experience.
- The need for an efficient parameter-combination tuning algorithm: as the number of configurable parameters grows, jointly optimizing combinations of multi-dimensional parameters places higher efficiency demands on the tuning algorithm.
Morphling
To address these problems, the Alibaba cloud-native cluster management team developed and open-sourced Morphling, a Kubernetes-based configuration framework for machine learning inference services. It automates the entire parameter-combination tuning process and, combined with efficient intelligent tuning algorithms, lets inference configuration tuning run efficiently on Kubernetes, solving the performance and cost challenges of machine learning in real industrial deployments.
Morphling applies cloud-native abstractions at different levels of the parameter-tuning process, giving users a simple and flexible configuration interface while encapsulating the underlying container operations, data communication, sampling algorithms, and storage management in its controllers. Specifically, Morphling's parameter-tuning and performance stress-testing workflow is organized around experiments and trials:
- An experiment is the abstraction closest to the user. Through it, the user specifies the storage location of the machine learning model, the configuration parameters to tune, the upper limit on the number of tests, and so on, thereby defining a specific parameter-tuning job.
- For each tuning experiment, Morphling defines another level of abstraction: the trial. A trial encapsulates one performance test of a specific parameter combination and covers the underlying Kubernetes container operations. In each trial, Morphling configures and starts the inference service container according to the parameter combination under test, checks the service's availability and health, and then stress-tests the service to measure its performance under that configuration, such as response latency, throughput, and resource utilization. The results are stored in a database and fed back to the experiment.
- Morphling uses an intelligent hyper-parameter tuning algorithm to select a small number of configuration combinations for performance testing (trials), and each round of test results is fed back to efficiently choose the next set of parameters to test. To avoid exhaustively sampling candidate specifications, we use Bayesian optimization as the core of the sampling algorithm: by continuously refining the fitted surrogate function, it produces near-optimal container specification recommendations at a low sampling rate (<20%) of stress-testing overhead.
Through this iterative sample-and-test loop, the recommended optimized configuration is finally returned to the service deployer.
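The following is a hedged sketch of this sample, stress-test, and feedback loop, using scikit-optimize's Gaussian-process optimizer as a stand-in for Morphling's internal Bayesian-optimization sampler. The search dimensions, the objective, and the stress_test() stub are illustrative assumptions, not Morphling's actual code or API.

```python
# Iterative Bayesian-optimization loop over an inference-container config
# space; scikit-optimize is used here as a stand-in for Morphling's sampler.
from skopt import gp_minimize
from skopt.space import Categorical, Integer

space = [
    Integer(1, 8, name="cpu_cores"),
    Categorical([1, 2, 4, 8], name="gpu_memory_gb"),
    Categorical([1, 4, 8, 16, 32], name="batch_size"),
]

def stress_test(cpu_cores, gpu_memory_gb, batch_size):
    # Stand-in for deploying the inference container with these settings,
    # waiting for readiness, and load-testing it; here we fabricate a smooth
    # score so the sketch runs end to end.
    return (cpu_cores * gpu_memory_gb) / (1.0 + abs(batch_size - 16) / 16.0)

def run_trial(params):
    cpu_cores, gpu_memory_gb, batch_size = params
    rps_per_cost = stress_test(cpu_cores, gpu_memory_gb, batch_size)
    return -rps_per_cost  # gp_minimize minimizes, so negate to maximize

# ~15 trials out of 8 * 4 * 5 = 160 combinations, i.e. <10% of the space
result = gp_minimize(run_trial, space, n_calls=15, random_state=0)
print("recommended configuration:", result.x, "objective:", -result.fun)
```

In Morphling itself, the equivalent of run_trial() is carried out by the trial controller (container startup, health check, stress test, result reporting), while the experiment controller drives the sampling loop.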
At the same time, Morphling provides a management suite, Morphling-UI, which makes it easy for business deployment teams to launch an inference configuration tuning experiment, monitor the tuning process, and compare tuning results through simple operations.
Morphling in practice on the Taobao content social platform
Alibaba's rich online machine learning inference scenarios and the large number of inference service instances they require provided first-hand practice and test feedback for validating Morphling. Among them, the Machine Vision Application Platform (MVAP) team of Alibaba's Taobao content social platform uses an online machine learning inference engine to support product feature recognition in Taobao live streams, de-duplication of live-stream cover images, and classification of shopping images and text.
During Double Eleven in 2020, we used Morphling to test and optimize AI inference containers to find the best balance between performance and cost. At the same time, the algorithm engineering team further analyzed the resource-intensive inference models, such as the video-viewing services in the Taobao ecosystem, performing targeted model quantization, analysis, and optimization from the perspective of AI model design. This supported the peak traffic of Double Eleven with minimal resources while keeping business performance intact, greatly improving GPU utilization and reducing cost.
Academic exploration
To improve the efficiency of the inference-service parameter tuning process, the Alibaba cloud-native cluster management team further explored, based on the characteristics of inference workloads, the use of meta-learning and few-shot regression to build a configuration tuning algorithm that is more efficient and has lower sampling cost, meeting the industry's practical requirements of fast tuning, small-sample sampling, and low test cost, together with a cloud-native, automated tuning framework. The related academic paper, "Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving," was accepted by the ACM Symposium on Cloud Computing 2021 (ACM SoCC 2021).
In recent years, topics related to optimizing and deploying AI inference tasks on the cloud have been active in major cloud computing and systems journals and conferences, becoming a hot spot for academic exploration. Topics explored include dynamic selection of AI models, dynamic scaling of deployment instances, traffic scheduling for user requests, and full utilization of GPU resources (such as dynamic model loading and batch-size optimization). Starting from large-scale industry practice, however, this is the first work to study the problem of optimizing the deployment configuration of container-level inference services.
In terms of algorithms, performance tuning is a classic hyper-parameter tuning problem, but traditional methods such as Bayesian optimization struggle with high dimensionality (many configuration items) and large search spaces. For an AI inference task, we perform combined hyper-parameter optimization across four dimensions (configuration items): number of CPU cores, GPU memory size, batch size, and GPU model, with each configuration item having 5 to 8 candidate values. The combined parameter search space therefore contains more than 700 points. Based on our testing experience in production clusters, each test of one parameter set for an AI inference container takes several minutes for service startup, stress testing, and data reporting. At the same time, there are many kinds of AI inference services, they are updated and iterated frequently, deployment engineers are scarce, and test-cluster budgets are limited. Efficiently finding the optimal configuration in such a large search space poses a new challenge for hyper-parameter tuning algorithms.
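A back-of-the-envelope illustration of this argument follows. The per-dimension candidate counts are assumptions chosen to land in the "more than 700" range mentioned above; actual counts vary per service.

```python
# Rough cost of exhaustive profiling vs. a sampled tuning budget.
candidates = {"cpu_cores": 6, "gpu_memory": 5, "batch_size": 6, "gpu_model": 4}

total = 1
for n in candidates.values():
    total *= n
print(f"exhaustive search: {total} configurations")  # 6 * 5 * 6 * 4 = 720

minutes_per_trial = 5  # assumed cost of startup, stress test, and reporting
print(f"~{total * minutes_per_trial / 60:.0f} hours per model if tested exhaustively")
print(f"vs. {int(total * 0.2)} trials at a 20% sampling budget, "
      f"or {int(total * 0.05)} at 5%")
```

At an assumed five minutes per trial, exhaustively profiling a single model would take on the order of days of cluster time, which is untenable given how many inference services exist and how often they are updated.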
The core observation of the paper is that, across different AI inference workloads, the effect of the configurations to be tuned (such as GPU memory and batch size) on container service performance (such as QPS) follows a stable and similar trend. On the visualized "configuration-performance" surface, this shows up as a similar surface shape across different AI inference instances, while the magnitude of each configuration's impact and the positions of the key inflection points differ numerically.
The figure in the paper visualizes, for three AI inference models, the impact of the two-dimensional configuration <number of CPU cores, GPU memory size> on container throughput (RPS). The paper proposes using Model-Agnostic Meta-Learning (MAML) to learn these commonalities in advance and train a meta-model, so that when profiling a new AI inference service, the search starts from the meta-model, quickly locates the key points on the surface, and achieves an accurate fit with a small sample (5%) of configurations.
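To make the idea concrete, here is a heavily simplified, first-order sketch of meta-learning for few-shot "configuration to performance" regression. It is illustrative only: the toy quadratic curve, the synthetic tasks, and the step sizes are assumptions, and it is not the paper's MAML implementation.

```python
# First-order meta-learning sketch: learn an initialization from historical
# "configuration -> performance" curves, then adapt to a new service with
# only a few sampled points. All models and data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def predict(theta, x):
    a, b, c = theta
    return a * x**2 + b * x + c  # toy "configuration -> performance" curve

def grad(theta, x, y):
    err = predict(theta, x) - y  # gradient of MSE w.r.t. (a, b, c)
    return np.array([np.mean(2 * err * x**2),
                     np.mean(2 * err * x),
                     np.mean(2 * err)])

def sample_task():
    # Each "task" is one historical inference service: the curves share a
    # similar shape but differ in scale and offset (the commonality the
    # meta-model exploits).
    a, b, c = rng.uniform([-2.0, 1.0, 0.0], [-1.0, 3.0, 2.0])
    x = rng.uniform(0.0, 2.0, size=10)
    return x, a * x**2 + b * x + c

theta = np.zeros(3)              # meta-learned initialization
inner_lr, outer_lr = 0.05, 0.01
for _ in range(2000):            # meta-training over historical services
    x, y = sample_task()
    adapted = theta - inner_lr * grad(theta, x[:5], y[:5])   # few-shot inner step
    theta -= outer_lr * grad(adapted, x[5:], y[5:])          # first-order outer update

# Adapting to a "new" service with only 3 sampled configurations
x_new, y_new = sample_task()
adapted = theta - inner_lr * grad(theta, x_new[:3], y_new[:3])
print("few-shot fit error:", np.mean((predict(adapted, x_new[3:]) - y_new[3:]) ** 2))
```

The point of the sketch is the structure: meta-training over many similar historical profiles yields an initialization from which a handful of stress tests on a new service is enough to fit its configuration-performance surface.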
Summary
Morphling, a Kubernetes-based configuration framework for machine learning inference services, combined with a tuning algorithm that is fast, small-sample, and low cost to test, realizes a cloud-native, automated, stable, and efficient tuning process for AI inference deployment. It shortens the deployment optimization and iteration cycle and accelerates bringing machine learning applications online. Combining Morphling with KubeDL will also make the AI experience from model training to inference deployment configuration tuning smoother.
References
Morphling GitHub: https://github.com/kubedl-io/morphling
Morphling website: https://kubedl.io/tuning/intro/
KubeDL GitHub: https://github.com/kubedl-io/kubedl
KubeDL website: https://kubedl.io/