Author: Chen Qiukai
Foreword
KubeDL is Alibaba's open-source, Kubernetes-based AI workload management framework; the name is short for "Kubernetes-Deep-Learning". The goal is to feed the experience gained from scheduling and managing large-scale machine learning jobs in Alibaba's own scenarios back to the community. KubeDL has been accepted into the CNCF Sandbox for incubation, and we will keep exploring best practices for cloud-native AI to help algorithm scientists innovate easily and efficiently.
KubeDL introduces a HostNetwork mode for distributed training jobs, letting compute nodes communicate with each other over the host network for better performance and adapting to the network environments of new high-performance data center architectures such as RDMA and SCC. For the problems HostNetwork brings with it, such as making peers aware of new ports after a FailOver, KubeDL also provides new solutions.
GitHub: https://github.com/kubedl-io/kubedl
Website: https://kubedl.io/model/intro/
Overlay is not a silver bullet
Kubernetes' native container network model defines a set of "Pod-to-Pod" communication conventions that do not rely on NAT. Overlay networks based on VxLAN (the classic Flannel, for example) implement this model well and solve many pain points of network management in large-scale container orchestration systems:
- Transparent Pod migration: an Overlay network is a virtual Layer 2 network built on top of the physical network, so a Pod IP is not bound to any node. When a node goes down or suffers some other hardware failure, the corresponding Pod can be restarted on another node and still be reached at the same IP, and service availability is unaffected as long as the underlying physical network stays connected. This is also the premise of KubeDL's compute-node failover (FailOver) in large-scale distributed machine learning training: "a Pod may drift, but the Service is fixed";
- Scale of network nodes: classic address resolution usually relies on the ARP broadcast protocol to automatically learn the IP-to-MAC mappings of neighboring nodes, but at large node counts a broadcast can easily trigger an ARP storm and congest the network. An Overlay network only needs to know the MAC addresses of a few VTEP nodes to forward packets, which greatly reduces the pressure on the network;
- Tenant network isolation: the strong extensibility of Kubernetes network plug-ins, together with the design of the VxLAN protocol, makes it easy to carve the virtual network into subnets and isolate tenants from one another.
These are the benefits of a virtual container network, but the cost of virtualization is lost network performance: a Pod is connected to the host network through a pair of Veth virtual devices so that their network namespaces stay isolated from each other, and every packet exchanged between Pods has to go through "encapsulation - routing - Ethernet - routing - decapsulation" before it reaches the peer Pod. This slows down the network and puts extra processing pressure, and therefore load, on the host kernel's network stack.
With the rise of distributed training patterns such as multimodal model training and training of models with large, dense parameters, and with the explosion of dataset sizes and feature counts, network communication has become the bottleneck of distributed training efficiency. The most direct way to optimize network performance is to use the host network (HostNetwork) and avoid the overhead of container network virtualization. At the same time, as technologies such as RDMA (RoCE) and NVIDIA GPU Direct mature, these new high-performance networking technologies are gradually being applied in large-scale commercial production environments and greatly improve model training efficiency: by bypassing the overhead of the kernel network stack and reading data directly with zero copy, they make full use of the network bandwidth. Efficiency is money! These native high-performance communication library primitives (such as RDMA_CM) also rely on the host network and cannot communicate directly over the Pod's virtual network.
On top of its support for distributed training over the standard container network, KubeDL extends the communication model to the host network. It solves common problems in distributed training such as port conflicts and making peers aware of new ports after FailOver, and makes high-performance networks easy to use.
Enabling the high-performance host network
Standard container network topology
In the standard container network communication model, the different workload roles such as Master/Worker/PS achieve service discovery through Headless Services: Pods communicate with each other via constant domain names, and CoreDNS resolves those domain names to Pod IPs. A Pod may drift, but the Service and its associated domain name stay constant, so even if some Pods fail, FailOver works well: once a failed Pod is pulled up again, it simply reconnects to the other Pods.
apiVersion: training.kubedl.io/v1alpha1
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: kubedl
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubedl/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
              resources:
                limits:
                  cpu: 2048m
                  memory: 2Gi
                requests:
                  cpu: 1024m
                  memory: 1Gi
          volumes:
            - name: "training"
              hostPath:
                path: /tmp/data
                type: DirectoryOrCreate
    Worker:
      replicas: 3
      restartPolicy: ExitCode
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubedl/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
              resources:
                limits:
                  cpu: 2048m
                  memory: 2Gi
                requests:
                  cpu: 1024m
                  memory: 1Gi
          volumes:
            - name: "training"
              hostPath:
                path: /tmp/data
                type: DirectoryOrCreate
Take a TensorFlow distributed training job with the classic PS-Worker architecture as an example: Workers are responsible for computing the gradients of the parameters, while PS instances aggregate, update, and broadcast the parameters, so every PS may establish a connection and communicate with every Worker, and vice versa.
In TensorFlow's implementation, this inter-job topology is described by a TF Cluster Spec structure. Each Role instance (PS or Worker) carries an Index that identifies its own sequence number, and through Role + Index it can obtain the service address of itself or of any other Role instance, establish connections, and start communicating. In the standard container network mode, the user submits a TFJob like the one above, KubeDL generates the TF Cluster Spec, passes it in as an environment variable, and the framework receives it. At the same time, KubeDL prepares a Headless Service for each Role instance whose Endpoint domain name corresponds to the service address in the TF Cluster Spec. Since each Pod has an independent Linux Network Namespace, the Pods' port spaces are isolated from one another, and the same container port can be used even when Pods are scheduled to the same node.
At this point, instances of the different Roles can start distributed training and communicate in TensorFlow's native way.
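To make the service-discovery picture concrete, below is a minimal sketch of what the per-replica plumbing might look like for the job above: a Headless Service for one Role instance plus a TF_CONFIG-style cluster spec injected as an environment variable. The resource names, the port 2222, and the label keys are illustrative assumptions rather than the exact objects KubeDL generates.
# Hypothetical Headless Service for worker 0 of the "mnist" job above.
# clusterIP: None means DNS resolves the Service name directly to the Pod IP.
apiVersion: v1
kind: Service
metadata:
  name: mnist-worker-0
  namespace: kubedl
spec:
  clusterIP: None                            # Headless: no virtual IP
  selector:                                  # illustrative label keys
    training.kubedl.io/job-name: mnist
    training.kubedl.io/replica-type: worker
    training.kubedl.io/replica-index: "0"
  ports:
    - port: 2222                             # constant container port
      targetPort: 2222
---
# The TF Cluster Spec is delivered to each container as the TF_CONFIG
# environment variable (JSON). For worker 0 it might look like:
#
#   {
#     "cluster": {
#       "ps":     ["mnist-ps-0.kubedl:2222", "mnist-ps-1.kubedl:2222"],
#       "worker": ["mnist-worker-0.kubedl:2222",
#                  "mnist-worker-1.kubedl:2222",
#                  "mnist-worker-2.kubedl:2222"]
#     },
#     "task": {"type": "worker", "index": 0}
#   }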
The benefits of the standard container network are obvious: a simple, intuitive network setup and FailOver-friendly fault tolerance make this solution suitable for most scenarios. But how should it work in scenarios that require a high-performance network? KubeDL's answer is the host network.
Host container network topology
Continuing with the example above, enabling the host network is very simple: just add an annotation to the TFJob; the rest of the job configuration needs no special modification, as shown below:
apiVersion: training.kubedl.io/v1alpha1
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: kubedl
  annotations:
    kubedl.io/network-mode: host
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    PS:
      ...
    Worker:
      ...
When KubeDL detects that a job declares the host network, it completes the network setup through the following steps:
- When creating a Pod, instead of using a fixed port, a host port is randomly selected within a certain range and set as the corresponding exposed container port; the chosen port is passed to the subsequent control flow through the context;
- HostNetwork is enabled for the Pod, and its DNS resolution policy is set to host-first;
- Instead of a Headless Service, a normal Service that forwards traffic is created: its exposed port keeps the original constant value, while its target port is the Pod's real host port;
- In the generated TF Cluster Spec, the entry for the Pod's own Role + Index carries the real local host port, while the address ports of the other Role instances stay constant; no matter how a peer Pod drifts, traffic can still be forwarded to it correctly through its Service;
- When FailOver happens, KubeDL re-selects a port for the rebuilt Pod, and the newly started Pod obtains its new local port through TF_CONFIG. KubeDL also makes sure that the target port of the corresponding Service is updated correctly, so the other Roles connected to it can continue communicating once the Service target port has been updated (see the Service sketch below).
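As a rough illustration of the forwarding Service in the steps above, the manifest below is a minimal sketch assuming the rebuilt worker 0 was assigned the random host port 30047; the names, labels, and port numbers are hypothetical, and only the relationship between the constant port and the changing targetPort matters.
# Hypothetical forwarding Service for worker 0 in host-network mode.
# Peers always dial mnist-worker-0:2222; the Service forwards the traffic
# to whatever host port the current Pod actually listens on.
apiVersion: v1
kind: Service
metadata:
  name: mnist-worker-0
  namespace: kubedl
spec:
  type: ClusterIP                            # a normal Service, not Headless
  selector:                                  # illustrative label keys
    training.kubedl.io/job-name: mnist
    training.kubedl.io/replica-type: worker
    training.kubedl.io/replica-index: "0"
  ports:
    - port: 2222                             # constant port seen by other Roles
      targetPort: 30047                      # real host port, updated by KubeDL after FailOver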
With this, a host network built according to the topology of the training job is ready. The difference from before is that all Pods now share a Network Namespace with the host, and therefore also share the host's port space, and communication between Pods changes from resolving a domain name to a Pod IP and connecting directly, to having traffic forwarded by the Service. Meanwhile, although the TF Cluster Spec has changed, TensorFlow's native behavior has not: the current Pod simply obtains its local port and listens on it, while the addresses of the other Pods appear constant, because the domain name and exposed port of each Service never change; only the target port may keep changing with FailOver. All of this is handled by KubeDL and is invisible to the user.
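To show the resulting asymmetry concretely, here is a hedged sketch of the TF_CONFIG that worker 0 might receive in host-network mode; 30047 is again an invented host port, and only the replica's own entry carries it.
# Hypothetical TF_CONFIG env entry for worker 0 in host-network mode:
# its own entry carries the real host port it listens on, while the
# peers keep the constant Service port 2222 and are reached via forwarding.
env:
  - name: TF_CONFIG
    value: |
      {
        "cluster": {
          "ps": ["mnist-ps-0.kubedl:2222",
                 "mnist-ps-1.kubedl:2222"],
          "worker": ["mnist-worker-0.kubedl:30047",
                     "mnist-worker-1.kubedl:2222",
                     "mnist-worker-2.kubedl:2222"]
        },
        "task": {"type": "worker", "index": 0}
      }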
We use TensorFlow as the host-network example because the complexity of its Cluster Spec is quite representative, but for KubeDL's other built-in workloads (such as PyTorch and XGBoost) we have also implemented the host-network counterpart of each framework's topology setup.
Summary
By extending the existing standard container-network communication pattern of distributed training jobs, KubeDL implements a communication mode based on the native host network. Besides the network performance gains in common training scenarios, it also adapts perfectly to high-performance network architectures such as RDMA/SCC. This communication mode has been widely used in Alibaba's internal production clusters; for example, the latest AliceMind super-large model released by DAMO Academy at the Yunqi Conference was trained in a high-performance computing cluster through KubeDL's host network plus RDMA. We look forward to more developers joining the KubeDL community to build it together and to optimize the scheduling and runtime efficiency of deep learning workloads!
Click here to learn more about the KubeDL project now!