Introduction to KubeDL
Author | KubeDL Team
On June 23, 2021, the Cloud Native Computing Foundation (CNCF) announced that KubeDL had been accepted as a CNCF Sandbox project through a global TOC vote. KubeDL is Alibaba's open-source, Kubernetes-based AI workload management framework; the name is an abbreviation of "Kubernetes-Deep-Learning". The project aims to feed Alibaba's experience in scheduling and managing large-scale machine learning jobs back to the community.
Project address: http://kubedl.io
Project Introduction
With the continuous maturation of mainstream AI frameworks such as TensorFlow, PyTorch, and XGBoost, and the emergence of heterogeneous AI computing chips represented by GPUs and TPUs, artificial intelligence is rapidly entering the stage of "large-scale industrialization". From an algorithm engineer designing the first layer of a neural network to the final online service in a real application scenario, a great deal of infrastructure-level system support is required beyond the AI algorithm R&D itself, including data collection and cleaning, distributed training engines, resource scheduling and orchestration, model management, inference service tuning, observability, and more. As the classic illustration shows, many system components must work together to form a complete machine learning pipeline.
At the same time, cloud native technologies represented by Kubernetes are booming. Through excellent abstractions and powerful extensibility, the application layer is cleanly decoupled from the IaaS (Infrastructure as a Service) layer: applications can use resources on demand without worrying about the complexity of the underlying infrastructure, freeing productivity and letting teams focus on innovation in their own domains.
Kubernetes solves the problem of delivering cloud resources efficiently, but it does not provide good native support for highly complex workloads such as AI. How to integrate the differences between the various frameworks while preserving their generality, and at the same time build a complete ecosystem of tools around the runtime of AI workloads, is something the industry is still exploring. In practice, we found that running AI workloads in the Kubernetes ecosystem faces the following challenges:
- Machine learning frameworks flourish, each with its own optimization direction and applicable scenarios, yet the life cycle management of distributed training jobs has much in common, and they share requirements for advanced features (such as network modes, image/code separation, metadata persistence, cache acceleration, and so on). Implementing a separate operator for each framework's workload means independent processes that cannot share state; the lack of a global view makes global job-level scheduling and queueing hard to implement, and it also hinders the abstraction and reuse of common functionality, leading to duplicated work at the code level.
- Native Kubernetes cannot meet the diverse scheduling requirements of offline tasks. Kubernetes' Pod-oriented scheduling model is a natural fit for long-running workloads such as microservices, but for the high throughput, gang scheduling (all-or-nothing), elastic capacity, and other scheduling requirements of offline tasks, the community has evolved many scheduling solutions. Take gang scheduling, which is very common when scheduling distributed machine learning training jobs, as an example: the community currently has YuniKorn, Volcano, Coscheduling, and other schedulers, each offering a different interaction protocol, so a pluggable mechanism is needed to enable the corresponding scheduling protocol. In addition, business-specific attributes, such as the startup dependencies between different roles like PS and worker, create DAG orchestration requirements that have to be implemented in the controller;
- The result of distributed training is usually a model stored in a distributed file system or object store (such as Alibaba Cloud OSS/NAS). But how to manage the model from the perspective of the training job, make it part of the "immutable infrastructure" of the AI service the way a container image is, and achieve simple, clear version management and traceability is something the industry still lacks best practices for. At the same time, the "training" and "inference" stages are relatively independent; the "training -> model -> inference" pipeline seen from the algorithm scientist's perspective has a gap in the middle, and the "model", as the intermediate artifact of the two stages, is well placed to bridge it;
- Brute force can still work miracles in distributed training, but sizing an inference service is a delicate task. Variables such as GPU memory, the number of CPU cores, batch size, and the number of threads can all affect the quality of the inference service. Capacity estimation based purely on resource usage does not reflect the real resource requirements of the business, because engines such as TensorFlow pre-allocate GPU memory. In theory there is an optimal balance point between service quality and resource efficiency, but it is hard to pin down even though we know it exists. As GPU virtualization technology matures, the value of this balance point grows: a better specification can significantly improve the deployment density on a single GPU card and save substantial cost.
The inference service itself is a special form of long-running microservice. Beyond a basic Deployment, it still lacks instance and traffic management strategies for different inference scenarios, for example:
1) Algorithm scientists usually deploy two or more model versions at the same time for A/B testing to verify the best serving effect, which requires fine-grained, weight-based traffic control;
2) The number of instances should scale out and in automatically based on traffic levels and the metrics of the inference service itself, minimizing resource cost while fully guaranteeing service availability.
KubeDL
To address the above problems, Alibaba's Cloud Native, Cluster Management, and PAI teams have distilled their experience in managing large-scale machine learning workloads into a general runtime management framework, KubeDL. It covers distributed training, model management, inference services, and the other stages of the machine learning pipeline, enabling AI workloads to run efficiently on Kubernetes.
1. Distributed training
KubeDL supports the mainstream distributed machine learning training frameworks (TensorFlow, PyTorch, MPI, XGBoost, Mars, and so on). Mars is the tensor-based large-scale data computing framework open-sourced by Alibaba's computing platform; it accelerates data processing frameworks such as numpy and pandas in a distributed fashion, and KubeDL helps Mars jobs integrate into the cloud native big data ecosystem in a more native way.
We abstract the common parts of the life cycle management of different training jobs into a general runtime library that is reused by the controllers of each distributed training job type; users can also quickly extend it with custom workloads and reuse the existing capabilities in their own controllers. Relying on declarative APIs and the Kubernetes network/storage model, KubeDL handles the allocation and reclamation of compute resources, service discovery and communication between job roles, runtime fail-over, and more. Algorithm developers only need to declare the Job Roles the training depends on, the number of replicas per role, and the compute/heterogeneous resources required, then submit the job (a minimal submission sketch follows the list below). In addition, we have designed many features to improve training efficiency and experience, targeting the pain points of the training field:
- Different training frameworks often define different Job Roles, such as PS/Chief/Worker in TensorFlow and Master/Worker in PyTorch, and there are often implicit dependencies between roles. For example, a Worker can only start normally after the Master has begun computing; a disordered startup sequence not only tends to leave resources idle for a long time, it may even cause the job to fail outright. KubeDL designed a scheduling and orchestration control flow based on a DAG (Directed Acyclic Graph), which resolves the startup dependency order between roles and can be flexibly extended.
- The training time of large models is often constrained by the communication efficiency between compute nodes. High-performance network technologies such as RDMA can greatly increase data transfer speed, but these customized networks often require compute nodes to communicate over the host network, and environmental constraints sometimes make Service-based service discovery impossible. The job management engine therefore needs to support service discovery in host network mode, handle port allocation for each compute node, and, in combination with the characteristics of each training framework, handle network connectivity after node fail-over. KubeDL supports high-performance distributed training in host network mode.
- Gang scheduling is a common requirement when scheduling distributed training jobs: the group of Pods that make up a single training job often has to be scheduled at the same time to avoid livelocks caused by resource contention between jobs when cluster capacity is tight. Kubernetes' extensibility also allows different schedulers to implement different gang scheduling protocols, such as YuniKorn, kube-batch, and so on. To avoid coupling to a specific scheduler and to adapt to different user environments, KubeDL makes the gang scheduling protocol pluggable, enabling the corresponding plug-in on demand to cooperate with the scheduler and achieve batch scheduling of jobs.
- A Job is a one-shot workload, but in real production we often encounter recurring or scheduled training scenarios, such as pulling an offline table within a certain time window every day, cleaning the data, and re-training the model. KubeDL provides a separate workload, Cron, to handle scheduled training requests, and it supports any type of training job (such as TFJob, PyTorchJob, etc.). Users submit a crontab-style schedule along with a job template, and can track the history of training jobs as well as the jobs currently in progress in the Cron resource's status.
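The following is a minimal sketch of what declaring and submitting such a training job could look like with the official Kubernetes Python client. The group/version, kind, and spec field names are illustrative assumptions modeled on common TFJob-style CRDs, not a verbatim copy of KubeDL's schema; consult the project documentation (http://kubedl.io) for the actual API.

```python
# Sketch: declare Job Roles, replica counts, and resources, then submit the job.
# apiVersion/kind/plural and the image name are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster
api = client.CustomObjectsApi()

tf_job = {
    "apiVersion": "training.kubedl.io/v1alpha1",  # assumed group/version
    "kind": "TFJob",
    "metadata": {"name": "mnist-demo", "namespace": "default"},
    "spec": {
        # Declare the roles the training depends on, the replica count of
        # each role, and the resources each replica needs.
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "registry.example.com/mnist:latest",  # hypothetical image
                    "resources": {"limits": {"cpu": "4", "memory": "8Gi"}},
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "registry.example.com/mnist:latest",
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }]}},
            },
        }
    },
}

api.create_namespaced_custom_object(
    group="training.kubedl.io",  # assumed
    version="v1alpha1",          # assumed
    namespace="default",
    plural="tfjobs",             # assumed plural
    body=tf_job,
)
```

A Cron workload would wrap a template like the one above together with a crontab-style schedule, so the same declaration can be re-submitted on a timer.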
To meet the need to store massive amounts of offline job metadata for a long time (the metadata is destroyed in etcd once the Job CRD is deleted), KubeDL also has built-in metadata persistence: it watches changes to resource objects such as Jobs, Pods, and Events in real time, converts them into the corresponding database schema objects, and persists them in a storage backend. The storage backend design is also pluggable: users can implement a storage plug-in that suits their environment and enable it at deployment time. By default, Jobs and Pods in KubeDL are persisted via the MySQL protocol, and Events are collected into the Alibaba Cloud SLS service.
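To make the idea concrete, here is a toy sketch of the watch-and-persist pattern: observe job objects and mirror their key fields into a relational table so the records outlive the CRD in etcd. This is not KubeDL's actual plugin code; KubeDL's default backend speaks the MySQL protocol, while SQLite (standard library) is used here only to keep the example self-contained, and the group/version/plural are assumptions.

```python
# Toy sketch: watch TFJob-like custom objects and persist key metadata.
import sqlite3
from kubernetes import client, config, watch

config.load_kube_config()
api = client.CustomObjectsApi()

db = sqlite3.connect("kubedl_jobs.db")
db.execute("""
CREATE TABLE IF NOT EXISTS jobs (
    uid        TEXT PRIMARY KEY,
    name       TEXT,
    namespace  TEXT,
    kind       TEXT,
    phase      TEXT,
    created_at TEXT
)
""")

w = watch.Watch()
# Group/version/plural are assumptions; adjust to the installed CRDs.
for event in w.stream(api.list_cluster_custom_object,
                      group="training.kubedl.io",
                      version="v1alpha1",
                      plural="tfjobs"):
    obj = event["object"]
    meta = obj["metadata"]
    # Take the latest condition type as a rough "phase" for the record.
    conditions = obj.get("status", {}).get("conditions") or [{}]
    phase = conditions[-1].get("type", "Unknown")
    db.execute(
        "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
        (meta["uid"], meta["name"], meta["namespace"],
         obj.get("kind", "TFJob"), phase, meta.get("creationTimestamp")),
    )
    db.commit()
```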
We also provide a management suite, KubeDL-Dashboard, so that users do not need to understand Kubernetes' many APIs or wrestle with kubectl commands in order to work with machine learning jobs from an easy-to-use interface. The persisted metadata can be consumed directly by the Dashboard. The Dashboard provides job submission, job management, event/log viewing, cluster resource views, and other features, helping machine learning users start experimenting with a very low learning curve.
2. Inference service specification tuning
The development and maturity of GPU virtualization and time-sharing technologies give us the opportunity to run multiple inference services on a single GPU at the same time, significantly reducing cost. However, choosing a suitable GPU resource specification for an inference service, especially for incompressible GPU memory, becomes a key problem. On the one hand, frequent model iterations leave algorithm engineers no time to accurately estimate each model's resource requirements, and dynamic traffic changes also make resource evaluation inaccurate, so they tend to over-provision GPU resources: forced to choose between stability and efficiency, they sacrifice the latter, which causes a lot of waste. On the other hand, since machine learning frameworks such as TensorFlow tend to fill up all available GPU memory, estimating an inference workload's resource requirements from its historical GPU memory usage, as a cluster administrator would, is also very inaccurate. In the KubeDL-Morphling component, we implement automatic specification tuning for inference services: through active stress testing, the service's performance is profiled under different resource configurations, and the most suitable container specification is finally recommended. The profiling process is highly efficient: to avoid exhaustively sampling every candidate specification, we use Bayesian optimization as the core driver of the sampling algorithm. By continually refining the fitted function, it delivers a container specification recommendation close to the optimum at a low sampling (stress testing) rate of under 20%.
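The sketch below illustrates the general idea of surrogate-driven specification search, in the spirit of what Morphling does but not its actual implementation: fit a Gaussian-process surrogate to the few specifications measured so far and pick the next candidate by an upper-confidence-bound rule. The candidate grid, the `measure_qps` stress-test hook, and the stopping budget are all hypothetical placeholders.

```python
# Toy sketch of Bayesian-optimization-style spec search for an inference service.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Candidate grid: (CPU cores, GPU memory in GiB, batch size).
candidates = np.array([[c, m, b]
                       for c in (1, 2, 4, 8)
                       for m in (2, 4, 8)
                       for b in (1, 8, 32)], dtype=float)

def measure_qps(spec):
    """Placeholder for a real stress test against a deployed instance.
    A real profile would show diminishing returns / a knee point."""
    cpu, gpu_mem, batch = spec
    return float(100 * np.log1p(cpu) + 30 * np.log1p(gpu_mem) + 5 * np.sqrt(batch))

rng = np.random.default_rng(0)
tried = list(rng.choice(len(candidates), size=3, replace=False))  # seed samples
observed = [measure_qps(candidates[i]) for i in tried]

budget = int(0.2 * len(candidates))  # stay well under 20% of the grid
while len(tried) < budget:
    # Fit the surrogate to everything measured so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(candidates[tried], observed)
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + 1.5 * std        # upper confidence bound: explore vs. exploit
    ucb[tried] = -np.inf          # never re-sample an already-measured spec
    nxt = int(np.argmax(ucb))
    tried.append(nxt)
    observed.append(measure_qps(candidates[nxt]))

best = candidates[tried[int(np.argmax(observed))]]
print("recommended spec (cpu, gpu_mem_gib, batch):", best)
```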
3. Model management and inference services
A model is the product of training, the concentrated essence of computation and algorithm combined. Usually, models are collected and maintained by hosting them on cloud storage and organizing the file system to achieve unified management. Such an approach relies on strict process conventions and permission control, and does not achieve immutable model management at the system level. The birth of container images solved the problems of RootFS construction, distribution, and immutability; KubeDL combines the two to implement image-based model management. After a training job completes successfully, the ModelVersion specified in the Job Spec automatically triggers the build of a model image. In ModelVersion.Spec, users specify basic information such as the storage path of the model and the target image registry, and the output of each training run is pushed to the corresponding image repository.
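A hedged sketch of what such a ModelVersion object might look like follows: it records where the training job wrote its artifacts and which image registry the packaged model should be pushed to. The group/version, kind, and field names are assumptions chosen for illustration; the real schema is defined by KubeDL's CRDs.

```python
# Sketch: register a trained model as a ModelVersion so an image build is triggered.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

model_version = {
    "apiVersion": "model.kubedl.io/v1alpha1",   # assumed group/version
    "kind": "ModelVersion",
    "metadata": {"name": "mnist-v20210623", "namespace": "default"},
    "spec": {
        "modelName": "mnist",
        # Where the finished training job left the model artifacts (hypothetical).
        "storage": {"nfs": {"path": "/models/mnist", "server": "nas.example.com"}},
        # Registry the built model image is pushed to (hypothetical).
        "imageRepo": "registry.example.com/models/mnist",
    },
}

api.create_namespaced_custom_object(
    group="model.kubedl.io", version="v1alpha1",   # assumed
    namespace="default", plural="modelversions",   # assumed plural
    body=model_version,
)
```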
At the same time, the image, as the output of training and the input of the inference service, connects the two stages well, realizing a complete machine learning pipeline of distributed training -> model building and management -> inference service deployment. KubeDL provides an Inference resource object for deploying inference services and controlling them at runtime. A complete Inference service can consist of one or more Predictors; each Predictor corresponds to a model produced by a prior training run, and the model is automatically pulled and mounted into the main container's Volume. When Predictors for multiple model versions coexist, traffic can be distributed and controlled according to assigned weights to achieve an A/B test. In the future, we will explore more for inference scenarios, such as batched inference and auto-scaling.
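The sketch below shows what an Inference object with two weighted Predictors could look like for such an A/B test. As with the earlier sketches, the group/version, kind, plural, and field names are illustrative assumptions rather than KubeDL's exact schema.

```python
# Sketch: two Predictors serving different model versions with weighted traffic.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

inference = {
    "apiVersion": "serving.kubedl.io/v1alpha1",   # assumed group/version
    "kind": "Inference",
    "metadata": {"name": "mnist-serving", "namespace": "default"},
    "spec": {
        "framework": "TensorFlow",
        "predictors": [
            {   # stable version takes most of the traffic
                "name": "v1",
                "modelVersion": "mnist-v20210601",   # hypothetical ModelVersion name
                "replicas": 2,
                "trafficWeight": 90,
            },
            {   # candidate version gets a small weighted share for the A/B test
                "name": "v2",
                "modelVersion": "mnist-v20210623",
                "replicas": 1,
                "trafficWeight": 10,
            },
        ],
    },
}

api.create_namespaced_custom_object(
    group="serving.kubedl.io", version="v1alpha1",  # assumed
    namespace="default", plural="inferences",       # assumed plural
    body=inference,
)
```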
KubeDL distributed training practice on public cloud
- PAI-DLC
As cloud computing gains popularity and more and more business is run in a cloud native way, the machine learning team of Alibaba Cloud's computing platform PAI launched the deep learning platform product DLC (Deep Learning Cloud). DLC adopts a fully cloud native architecture: the bottom layer uses Kubernetes as the resource foundation, and the training part is fully managed by KubeDL, making DLC a large-scale practice of KubeDL in deep learning cloud computing scenarios.
DLC broadly supports many businesses within the Alibaba Group, including the deep learning computing needs of business units such as image and video, natural language, speech, multi-modal understanding, and autonomous driving across the Taobao security department and DAMO Academy. While serving cutting-edge production driven by deep learning, the PAI team has accumulated rich experience in framework and platform construction, building framework and platform capabilities that are compatible with the community (e.g., TensorFlow/PyTorch), carry distinctive features, and have been proven in large-scale industrial practice, such as training the trillion-parameter M6 model, the industrial-grade graph neural network system Graph-Learn, and extreme resource management and reuse capabilities.
Today, PAI-DLC's capabilities are also fully embracing the public cloud, offering developers and enterprises a cloud native, one-stop deep learning training platform and a flexible, stable, easy-to-use, high-performance machine learning training environment. It comprehensively supports a variety of community frameworks as well as PAI's deeply optimized algorithm frameworks, runs ultra-large-scale distributed deep learning tasks with high performance and stability, and reduces cost and improves efficiency for developers and enterprises.
As the distillation of the Alibaba Group's machine learning platform best practices, the public cloud DLC has absorbed valuable engineering experience in product details, framework optimization, and platform services. Moreover, DLC took the unique characteristics of the public cloud scenario into account from the beginning of its design, providing features such as spot instances, automatic fail-over, and elastic scaling to further reduce the cost of AI computing power for customers.
Furthermore, DLC is integrated with PAI's other public cloud products, such as DSW for algorithm engineers' modeling, the enterprise-grade automated AutoML, and the online inference service EAS, to build a full-pipeline AI flagship product.