Distributed training and Kubeflow

When developers want to bring distributed deep learning training to a Kubernetes cluster, the first thing that comes to mind is the family of operators in the Kubeflow community, such as tf-operator and mpi-operator.

Their main work in serving the various deep learning frameworks (TensorFlow, PyTorch, MXNet, etc.) is to:

  1. Create a Pod on the Kubernetes cluster for each training process
  2. Configure the service-discovery information (such as TF_CONFIG; see the sketch after this list) and create the related Kubernetes resources (such as Services)
  3. Monitor and update the status of the entire job
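
By way of example, here is a minimal sketch (not the operator's actual code) of the kind of TF_CONFIG value an operator injects into each training Pod so that the TensorFlow processes can find each other; the host names are placeholders for the headless Services the operator creates:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // TF_CONFIG carries the cluster layout plus this Pod's own role and index.
        tfConfig := map[string]interface{}{
            "cluster": map[string][]string{
                "worker": {"tfjob-worker-0:2222", "tfjob-worker-1:2222"},
                "ps":     {"tfjob-ps-0:2222"},
            },
            "task": map[string]interface{}{"type": "worker", "index": 0},
        }
        b, _ := json.Marshal(tfConfig)
        // The operator would set this string as an environment variable on the container.
        fmt.Println("TF_CONFIG=" + string(b))
    }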

In fact, Kubeflow's training operators have become the de facto standard for running distributed training jobs on Kubernetes.

Major public cloud vendors have by and large shipped or integrated the Kubeflow training operators, and other community projects around deep learning training (such as Katib for automated machine learning and Flyte for workflow orchestration) also use the Kubeflow operators as their way of launching distributed training jobs.

Problems with Kubeflow Operators

In early 2019, the Kubeflow community launched the kubeflow/common project to host the code shared between operators. After more than a year of iteration and refactoring, the project gradually stabilized in mid-2020 and the training operators began to adopt it. Currently, tf-operator, mxnet-operator, and xgboost-operator are built on top of kubeflow/common.

However, maintaining the whole family of Kubeflow training operators still poses many challenges.

The main ones are:

  1. A great deal of developer effort goes into feature enhancements and bug fixes for each individual training framework
  2. Basic functionality, as well as testing and release infrastructure, is hard to reuse across operators
  3. Third-party components have to integrate with a large number of different operators
  4. A new training framework needs a complete new operator before it can be used, which makes the development cost too high
  5. The sheer number of operators presents a steep learning curve for developers who are new to Kubeflow

The problems above are faced by Kubeflow's developers and maintainers. In addition, users of these operators face problems of their own:

  1. Users have to install multiple operator components to get support for multiple training APIs
  2. The JobSpecs of the various Kubeflow Jobs look similar yet differ in subtle ways, so they do not provide a unified experience

The root cause is that each training framework corresponds to an operator maintained independently in its own repository. This separate maintenance model makes it hard to unify things such as the build environment, test environment, deployment method, and code logic.

Although the set of deep learning frameworks is converging, new frameworks will keep appearing and hoping to quickly run distributed training on Kubernetes through Kubeflow, and each new addition makes the problem worse.

Proposal: All-in-One

In response to the issues above, and after many discussions in community meetings, the community decided to try consolidating the code of the multiple Kubeflow training operators into a single repository.

At the same time, following the one-Manager-multiple-Controllers pattern recommended by controller-runtime, the controllers handling different APIs can share a single Manager and its cache, which cuts down the redundant APIServer requests that arise when multiple operators are deployed side by side:

    // Create a single Manager (and informer cache) shared by all enabled controllers.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{...})
    ...
    // Register only the reconcilers selected via --enable-scheme.
    for _, s := range enabledSchemes {
        setupFunc, supported := controller_v1.SupportedSchemeReconciler[s]
        if !supported {
            os.Exit(1)
        }
        if err = setupFunc(mgr, enableGangScheduling); err != nil {
            setupLog.Error(err, "unable to create controller", "controller", s)
            os.Exit(1)
        }
    }

All controllers (Reconcilers) need to register themselves in SupportedSchemeReconciler ahead of time:

    var SupportedSchemeReconciler = map[string]ReconcilerSetupFunc{
        tensorflowv1.Kind: func(mgr manager.Manager, enableGangScheduling bool) error {
            return tensorflowcontroller.NewReconciler(mgr, enableGangScheduling).SetupWithManager(mgr)
        },
        pytorchv1.Kind: func(mgr manager.Manager, enableGangScheduling bool) error {
            return pytorchcontroller.NewReconciler(mgr, enableGangScheduling).SetupWithManager(mgr)
        },
        ...,
    }

Users can specify which APIs to enable via --enable-scheme when starting the operator process. When new controllers are added later, they can likewise be registered first and then switched on selectively in this on-demand way.
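
As a rough illustration only (the operator's actual flag handling may differ), a comma-separated --enable-scheme flag could be turned into the enabledSchemes list used in the registration loop above like this:

    package main

    import (
        "flag"
        "fmt"
        "strings"
    )

    func main() {
        // Hypothetical wiring: accept a comma-separated list of job kinds,
        // e.g. --enable-scheme=TFJob,PyTorchJob.
        raw := flag.String("enable-scheme", "", "job kinds to enable (empty = all registered kinds)")
        flag.Parse()

        var enabledSchemes []string
        if *raw != "" {
            enabledSchemes = strings.Split(*raw, ",")
        }
        // An empty list would fall back to every kind registered in
        // SupportedSchemeReconciler; otherwise only the selected reconcilers
        // are set up against the shared Manager.
        fmt.Println("enabled schemes:", enabledSchemes)
    }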

Progress and near-term planning

The current consolidation work has been officially merged into the master branch of tf-operator. Users will soon be able to experience the integrated tf-operator in the upcoming Kubeflow 1.4 release: deploying a single operator provides support for four APIs, including TFJob, PyTorchJob, MXNetJob, and XGBoostJob.

Consolidation at the repository level is only the first step towards the next stage of the Kubeflow Training Operator. It mainly solves consistency at the project-operations level, including reuse of environments and overall code management. Low-code development for developers, covering new feature enhancements, bug fixes, and the onboarding of new APIs, will be the next goal of our plan.

Under this design, developers only need to implement a very small number of functions to onboard a new API.

These functions mainly include:

    // Fetch the corresponding custom Job from the ctrl.Request.
    GetJob(ctx context.Context, req ctrl.Request) (client.Object, error)
    // Extract the ReplicasSpecs from the custom Job as map[commonv1.ReplicaType]*commonv1.ReplicaSpec.
    ExtractReplicasSpec(job client.Object) (map[commonv1.ReplicaType]*commonv1.ReplicaSpec, error)
    // Extract the RunPolicy from the custom Job.
    ExtractRunPolicy(job client.Object) (*commonv1.RunPolicy, error)
    // Extract the JobStatus from the custom Job.
    ExtractJobStatus(job client.Object) (*commonv1.JobStatus, error)
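
As a hedged sketch of what implementing these hooks might look like, the snippet below fills in the four functions for a hypothetical custom resource MyJob. The MyJob type, its fields, and the MyJobReconciler struct are illustrative assumptions rather than anything shipped in kubeflow/common; real code would also rely on generated deepcopy functions.

    package mycontroller

    import (
        "context"

        commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // MyJob is a stand-in custom resource used only for illustration.
    type MyJob struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`
        Spec              MyJobSpec          `json:"spec"`
        Status            commonv1.JobStatus `json:"status,omitempty"`
    }

    // MyJobSpec holds assumed fields that map directly onto the common types.
    type MyJobSpec struct {
        RunPolicy    commonv1.RunPolicy                             `json:"runPolicy"`
        ReplicaSpecs map[commonv1.ReplicaType]*commonv1.ReplicaSpec `json:"replicaSpecs"`
    }

    // DeepCopyObject satisfies runtime.Object; in real code this is generated.
    func (in *MyJob) DeepCopyObject() runtime.Object {
        out := *in // a shallow copy is enough for this sketch
        return &out
    }

    // MyJobReconciler is a simplified stand-in for a reconciler built on reconciler.v1.
    type MyJobReconciler struct {
        client.Client
    }

    // GetJob fetches the MyJob instance named in the request.
    func (r *MyJobReconciler) GetJob(ctx context.Context, req ctrl.Request) (client.Object, error) {
        job := &MyJob{}
        err := r.Get(ctx, req.NamespacedName, job)
        return job, err
    }

    // ExtractReplicasSpec maps replica types (e.g. Worker) to their specs.
    func (r *MyJobReconciler) ExtractReplicasSpec(job client.Object) (map[commonv1.ReplicaType]*commonv1.ReplicaSpec, error) {
        return job.(*MyJob).Spec.ReplicaSpecs, nil
    }

    // ExtractRunPolicy returns the job-level run policy (clean-up, TTL, and so on).
    func (r *MyJobReconciler) ExtractRunPolicy(job client.Object) (*commonv1.RunPolicy, error) {
        return &job.(*MyJob).Spec.RunPolicy, nil
    }

    // ExtractJobStatus returns the common status block tracked by the reconciler.
    func (r *MyJobReconciler) ExtractJobStatus(job client.Object) (*commonv1.JobStatus, error) {
        return &job.(*MyJob).Status, nil
    }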

If developers need to inject environment variables for service discovery, they can override the method DecoratePod(rtype commonv1.ReplicaType, podTemplate *corev1.PodTemplateSpec, job client.Object) to modify the Pod template before the creation request is submitted to the APIServer.
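
Continuing the hypothetical MyJob sketch above, such an override might look as follows; the MYJOB_MASTER_ADDR variable name and the "<job>-master-0:23456" addressing convention are assumptions made purely for illustration:

    import (
        "fmt"

        commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // DecoratePod injects an assumed MYJOB_MASTER_ADDR variable into every
    // container of the Pod template before it is submitted to the APIServer.
    func (r *MyJobReconciler) DecoratePod(rtype commonv1.ReplicaType, podTemplate *corev1.PodTemplateSpec, job client.Object) {
        masterAddr := fmt.Sprintf("%s-master-0:%d", job.GetName(), 23456)
        for i := range podTemplate.Spec.Containers {
            podTemplate.Spec.Containers[i].Env = append(
                podTemplate.Spec.Containers[i].Env,
                corev1.EnvVar{Name: "MYJOB_MASTER_ADDR", Value: masterAddr},
            )
        }
    }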

The low-code development experience described above has already been merged into the kubeflow/common repository as the pkg/reconciler.v1 package. Soon we will also bring the reconciler.v1 package into tf-operator, hoping to validate reconciler.v1 while offering a more convenient way for common use cases to run on Kubernetes.

Developers who prefer to build their controller against lower-level APIs can instead use the pkg/controller.v1 package, which is designed to meet that need.

Vision

Although the optimization and refactoring of the Kubeflow Training Operator is still in progress, we are not stopping there. For the future development of the Training Operator, we believe the following areas are worth continued investment:

  1. First, further improve the flexibility of the Kubeflow Training Operator when adapting to customized jobs. We plan to propose a Job API decoupled from any specific deep learning framework to support broader task definitions, and to let users do customized development with controller.v1 and reconciler.v1 in kubeflow/common at a low learning cost. Even then, development costs remain non-trivial, so in the future junior developers may not need to modify the operator at all, but only add or modify webhooks or decorator servers to achieve their customizations.
  2. Second, further improve how conveniently the Kubeflow Training Operator integrates with third-party components. We hope that developers who use the Kubeflow Training Operator to build AI platforms can easily connect it with other modules to implement functions such as task queues, pipelines, and hyperparameter search.
  3. Last and most critical, we hope to further improve the stability of the Kubeflow Training Operator.

We welcome everyone to try out Kubeflow and get involved in the Kubeflow project.

Reference

[1] add reconciler.v1: https://github.com/kubeflow/common/pull/141

[2] reconciler.v1 implementation: https://github.com/kubeflow/common/tree/master/pkg/reconciler.v1/common

[3] All-in-one Kubeflow Training Operator: https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit

