Background
Compared with traditional workloads, machine learning workloads have a notable characteristic: a strong demand for GPUs. As introduced in the previous articles ( https://mp.weixin.qq.com/s/Nasm-cXLtJObjLwLQHALmw and https://mp.weixin.qq.com/s/X4VDynLfKdbe-tyciQ16128Q), GPU memory currently cannot keep up with the growth of model parameter counts. With the emergence of new model structures such as the Transformer, this problem has become more and more pronounced. Algorithm engineers need more and more resources to train their models, and distributed training has become the standard approach to model training in the industry.
Elastic training can dynamically adjust the number of instances participating in a job while training is in progress, which greatly improves cluster resource utilization. Combined with cloud resource types such as spot instances, models can be tuned at lower cost, further reducing costs and increasing efficiency. In the latest PyTorch release, 1.9.0, the original distributed launcher torch.distributed.launch is about to be deprecated, and users are encouraged to switch to the elastic distributed training entry point torch.distributed.run.
We take this opportunity to briefly introduce this new feature, compare it with Horovod Elastic, and finally summarize the issues that need attention when using elastic training.
Design before PyTorch 1.9.0
PyTorch is one of the most popular deep learning frameworks, and the quality it is most praised for is its ease of use. Whether for single-machine training or distributed training, PyTorch provides a concise API.
Before PyTorch 1.9.0, distributed training was usually launched in the following way.
python -m torch.distributed.launch
--nnodes=NODE_SIZE
--nproc_per_node=TRAINERS_PER_NODE
--node_rank=NODE_RANK
--master_port=HOST_PORT
--master_addr=HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Here nnodes is the number of nodes participating in the training and nproc_per_node is the number of processes running on each node. node_rank is the identifier of the current node, while master_addr and master_port are the address and port that the master listens on. torch.distributed.launch sets a number of environment variables, including WORLD_SIZE, MASTER_PORT, MASTER_ADDR and so on.
The corresponding processes are then created on the current machine for training. Each machine runs TRAINERS_PER_NODE processes, which form a local worker group. A total of NODE_SIZE machines participate in the training, for a total of NODE_SIZE * TRAINERS_PER_NODE processes. To launch a distributed training task, the corresponding command has to be executed on every machine.
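To make the division of labor concrete, here is a minimal sketch of the kind of training script (YOUR_TRAINING_SCRIPT.py) that works with this launcher; the model and data are placeholders. torch.distributed.launch passes --local_rank to each worker process by default and exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which init_method="env://" reads:

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker by default
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # from the environment variables set by the launcher
    # (on GPU one would use backend="nccl" and torch.cuda.set_device(args.local_rank))
    dist.init_process_group(backend="gloo", init_method="env://")

    # placeholder model; wrapping it with DDP synchronizes gradients across workers
    model = DDP(torch.nn.Linear(10, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        inputs = torch.randn(8, 10)   # placeholder data
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()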
New design in PyTorch 1.9.0
In PyTorch 1.9, torch.distributed.launch is about to be deprecated and replaced by torch.distributed.run, which is based on pytorch/elastic (TorchElastic). The new entry point changes the usage slightly compared with the previous one, as shown below.
python -m torch.distributed.run
--nnodes=MIN_SIZE:MAX_SIZE
--nproc_per_node=TRAINERS_PER_NODE
--rdzv_id=JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
It provides several new capabilities: first, better fault tolerance: when a worker fails, the workers are automatically restarted so training can continue; second, the RANK and WORLD_SIZE fields no longer need to be set manually; finally, and most importantly, it supports elastic training, dynamically increasing or decreasing the number of workers participating in training. In the example above, nnodes is no longer a fixed value but an interval, and the training job can tolerate changes in the number of workers within this range.
To support elasticity, the training code also needs to be modified.
def main():
    args = parse_args(sys.argv[1:])
    state = load_checkpoint(args.checkpoint_path)
    initialize(state)

    # torch.distributed.run ensures that this will work
    # by exporting all the env vars needed to initialize the process group
    torch.distributed.init_process_group(backend=args.backend)

    for i in range(state.epoch, state.total_num_epochs):
        for batch in iter(state.dataset):
            train(batch, state.model)

        state.epoch += 1
        save_checkpoint(state)
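The load_checkpoint and save_checkpoint helpers above are not provided by PyTorch. A minimal sketch of what they could look like, assuming a simple State container, a placeholder model and a checkpoint path on storage visible to all nodes (none of these names are part of the torch.distributed API):

import os

import torch


class State:
    """Illustrative container for everything that must survive a restart."""

    def __init__(self, model, checkpoint_path, epoch=0, total_num_epochs=10):
        self.model = model
        self.checkpoint_path = checkpoint_path
        self.epoch = epoch
        self.total_num_epochs = total_num_epochs


def load_checkpoint(checkpoint_path):
    # build the (placeholder) model first, then overwrite its weights with the
    # latest snapshot if one exists on the shared path
    state = State(torch.nn.Linear(10, 10), checkpoint_path)
    if os.path.exists(checkpoint_path):
        snapshot = torch.load(checkpoint_path, map_location="cpu")
        state.model.load_state_dict(snapshot["model"])
        state.epoch = snapshot["epoch"]
    return state


def save_checkpoint(state):
    # only one worker needs to write; rank 0 is the usual choice
    if int(os.environ.get("RANK", "0")) == 0:
        torch.save(
            {"epoch": state.epoch, "model": state.model.state_dict()},
            state.checkpoint_path,
        )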
One of the more visible changes, as just illustrated, is that users need to handle checkpoints themselves. This is because when a worker fails, all workers are restarted, so a checkpoint mechanism is needed to ensure that training can resume after the restart. This new distributed training method also introduces several new concepts, including the agent and rendezvous. Next, we will walk through these new designs, starting from the entry point of torch.distributed.run:
def run(args):
    if args.standalone:
        args.rdzv_backend = "c10d"
        args.rdzv_endpoint = "localhost:29400"
        args.rdzv_id = str(uuid.uuid4())
        log.info(
            f"\n**************************************\n"
            f"Rendezvous info:\n"
            f"--rdzv_backend={args.rdzv_backend} "
            f"--rdzv_endpoint={args.rdzv_endpoint} "
            f"--rdzv_id={args.rdzv_id}\n"
            f"**************************************\n"
        )

    config, cmd, cmd_args = config_from_args(args)
    elastic_launch(
        config=config,
        entrypoint=cmd,
    )(*cmd_args)
Two modes are distinguished here: standalone mode and distributed mode. Standalone mode is a special case of distributed mode: it mainly provides convenient defaults for the single-machine multi-worker case, so that redundant parameters such as rdzv_backend and rdzv_endpoint no longer need to be set.
Both modes eventually start the real training processes through elastic_launch, which manages the life cycle of the workers via the elastic agent and returns the output of each worker.
class elastic_launch:
    ...

    def __call__(self, *args):
        return launch_agent(self._config, self._entrypoint, list(args))


def launch_agent(
    config: LaunchConfig,
    entrypoint: Union[Callable, str, None],
    args: List[Any],
) -> Dict[int, Any]:
    ...
    agent = LocalElasticAgent(
        spec=spec, start_method=config.start_method, log_dir=config.log_dir
    )
    ...
    result = agent.run()
    ...
    return result.return_values
Design of Elastic Agent: How to manage multiple worker processes
The elastic agent is an independent process responsible for managing the workers under it. It plays a role similar to supervisor in process management systems, making sure each worker is started with the correct settings. Since WORLD_SIZE and RANK no longer need to be provided by the user, the elastic agent takes care of them.
In addition, worker failures are also captured and handled by the elastic agent. It is fair to say that the elastic agent is the core abstraction in elastic training.
The working principle is as follows: elastic agents on different nodes use rendezvous to discover each other's workers and to synchronize membership changes, and at the same time each agent monitors its worker processes to capture failures during training. The core logic is encapsulated in LocalElasticAgent.run():
def run(self, role: str = DEFAULT_ROLE) -> RunResult:
    ...
    result = self._invoke_run(role)
    return result


def _invoke_run(self, role: str = DEFAULT_ROLE) -> RunResult:
    ...
    self._initialize_workers(self._worker_group)
    while True:
        ...
        run_result = self._monitor_workers(self._worker_group)
        state = run_result.state
        ...
        if state == WorkerState.SUCCEEDED:
            ...
            return run_result
        elif state in {WorkerState.UNHEALTHY, WorkerState.FAILED}:
            if self._remaining_restarts > 0:
                ...
                self._restart_workers(self._worker_group)
            else:
                ...
                return run_result
        elif state == WorkerState.HEALTHY:
            ...
            if num_nodes_waiting > 0:
                ...
                self._restart_workers(self._worker_group)
        else:
            raise Exception(f"[{role}] Worker group in {state.name} state")
As you can see, the core control loop lives in _invoke_run. Among the steps, _initialize_workers performs most of the initialization work, including assigning a RANK to each worker. In the default implementation, the elastic agent and the worker processes are on the same machine, so self._monitor_workers(self._worker_group) obtains the running status of the workers through multiprocessing, and then handles each state differently.
The elastic agent is designed to be highly extensible. In version 1.9.0 there are three agent classes in total, namely ElasticAgent, SimpleElasticAgent and LocalElasticAgent.
Among them, ElasticAgent is an abstract class, SimpleElasticAgent implements part of its functionality, and LocalElasticAgent implements an elastic agent that manages all worker processes on a single machine.
SimpleElasticAgent mainly exists to make it easier to extend new agent implementations. For example, if you want one agent to manage all workers across multiple machines rather than just the workers on the local machine, you can do so by extending SimpleElasticAgent, as sketched below.
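A minimal sketch of such an extension is shown below. It assumes the torch.distributed.elastic.agent.server API of version 1.9.0, where SimpleElasticAgent leaves _start_workers, _stop_workers, _monitor_workers and _shutdown to subclasses; the remote-execution details are only hinted at with comments, so this is a skeleton rather than a working implementation:

from typing import Any, Dict

from torch.distributed.elastic.agent.server.api import (
    RunResult,
    SimpleElasticAgent,
    WorkerGroup,
    WorkerSpec,
)


class MultiMachineElasticAgent(SimpleElasticAgent):
    """Hypothetical agent that would drive workers on several machines."""

    def __init__(self, spec: WorkerSpec, exit_barrier_timeout: float = 300):
        super().__init__(spec, exit_barrier_timeout)

    def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
        # here one would ssh / call a remote API to start each worker and
        # return a mapping of local_rank -> worker id
        raise NotImplementedError

    def _stop_workers(self, worker_group: WorkerGroup) -> None:
        # tear down the remote worker processes
        raise NotImplementedError

    def _monitor_workers(self, worker_group: WorkerGroup) -> RunResult:
        # poll the remote workers and translate their status into a RunResult
        raise NotImplementedError

    def _shutdown(self) -> None:
        # release any remote resources held by the agent
        raise NotImplementedError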
The design of rendezvous: How to determine RANK across different nodes
Next, let's look at the other core abstraction, rendezvous. To achieve elastic training, the membership of the worker group must be able to change dynamically, and rendezvous is the synchronization component that realizes this. Its core method is:
@abstractmethod
def next_rendezvous(
    self,
) -> Tuple[Store, int, int]:
    """Main entry-point into the rendezvous barrier.

    Blocks until the rendezvous is complete and the current process is
    included in the formed worker group, or a timeout occurs, or the
    rendezvous was marked closed.

    Returns:
        A tuple of :py:class:`torch.distributed.Store`, ``rank``, and
        ``world size``.

    Raises:
        RendezvousClosedError:
            The rendezvous is closed.
        RendezvousConnectionError:
            The connection to the rendezvous backend has failed.
        RendezvousStateError:
            The rendezvous state is corrupt.
        RendezvousTimeoutError:
            The rendezvous did not complete on time.
    """
As the docstring indicates, this call blocks until the required number of workers has arrived. It is invoked when a worker is initialized or restarted, and when it returns, each worker uses the returned rank as its unique identifier. There are four rendezvous implementations in total, namely etcd, etcd-v2, c10d and static.
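Before looking at the concrete implementations, here is a minimal sketch of how an agent-side caller obtains a handler and joins the barrier. It assumes the 1.9.0 module layout (torch.distributed.elastic.rendezvous) and uses the c10d backend so no external store is needed; torch.distributed.run normally performs these calls, including the handler registration, for you:

from torch.distributed.elastic.rendezvous import RendezvousParameters, registry

# torch.distributed.run registers the built-in backends before launching agents;
# when driving the registry directly we do it ourselves
registry._register_default_handlers()

# describe the job: backend, endpoint, job id and the allowed node range
params = RendezvousParameters(
    backend="c10d",
    endpoint="localhost:29400",
    run_id="example-job",
    min_nodes=1,
    max_nodes=4,
)

handler = registry.get_rendezvous_handler(params)

# blocks until enough participants have joined, then returns a Store shared
# by the worker group plus this node's rank and the group size
store, rank, world_size = handler.next_rendezvous()
print(f"joined rendezvous as rank {rank} of {world_size}")

The etcd-based EtcdRendezvousHandler, for example, implements next_rendezvous by delegating to its internal rendezvous_barrier: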
class EtcdRendezvousHandler(RendezvousHandler):
    def next_rendezvous(self):
        rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()

        log.info("Creating EtcdStore as the c10d::Store implementation")
        store = self._rdzv_impl.setup_kv_store(rdzv_version)

        return store, rank, world_size
Among them, etcd is the previously recommended implementation; since c10d was introduced it is no longer the preferred one. In the etcd implementation, the state shared between the different workers is stored through etcd's KV interface. The process of determining which instances participate in training and their corresponding RANKs proceeds as follows.
First, the agent tries to write the value status: setup to /rdzv/active_version. Throughout the process, /rdzv/active_version serves both as the KV store holding the intermediate state of the rendezvous procedure and as an exclusive lock over it.
If the write fails, it means that a rendezvous process is already in progress.
After a successful write, /rdzv/version_counter is incremented by one and a directory /rdzv/v_${version_counter} is created. Once these operations are done, the status in /rdzv/active_version is set to joinable, and the join phase begins.
In the join phase, under the protection of the lock, the agents take turns adding themselves to participants under /rdzv/active_version and are assigned incrementing ranks. The rank here is not the global rank of each worker process but the agent's own rank; the worker processes' ranks are later derived from the agent rank. This design is quite easy to confuse, and in my opinion there is room for improvement; a rough illustration of the derivation is sketched below.
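The following is only a rough illustration of that derivation, not the exact code in torch.distributed.elastic; it assumes every node runs the same number of workers (nproc_per_node), which is the common case with torch.distributed.run:

def derive_worker_ranks(agent_rank: int, nproc_per_node: int, num_agents: int):
    """Illustrative only: map an agent rank to its workers' global ranks."""
    world_size = num_agents * nproc_per_node
    ranks = []
    for local_rank in range(nproc_per_node):
        global_rank = agent_rank * nproc_per_node + local_rank
        ranks.append((local_rank, global_rank, world_size))
    return ranks


# e.g. the agent with rank 1 in a 3-node job with 2 workers per node
for local_rank, global_rank, world_size in derive_worker_ranks(1, 2, 3):
    print(f"LOCAL_RANK={local_rank} RANK={global_rank} WORLD_SIZE={world_size}")

Back in the etcd implementation, the entry into these phases is handled by init_phase: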
def init_phase(self):
    try:
        active_version = self.try_create_rendezvous()
        state = json.loads(active_version.value)
        log.info("New rendezvous state created: " + str(state))
    except etcd.EtcdAlreadyExist:
        # a rendezvous process already exists
        active_version, state = self.get_rdzv_state()
        # Note: it is possible for above query to fail (etcd.EtcdKeyNotFound),
        # but this is ok for us - just means we'll restart from beginning.
        log.info("Observed existing rendezvous state: " + str(state))

    if state["status"] == "closed":
        raise RendezvousClosedError()

    if state["status"] == "joinable":
        return self.join_phase(state["version"])

    if state["status"] == "final":
        self.handle_existing_rendezvous(state["version"])
        raise EtcdRendezvousRetryImmediately()

    self.try_wait_for_state_change(etcd_index=active_version.etcd_index + 1)
    raise EtcdRendezvousRetryableFailure()
When the number of nodes participating in training reaches the minimum value given by the nnodes command-line parameter, rendezvous waits for a certain amount of time. When that wait expires, or when the number of participating nodes reaches the maximum value of nnodes, it enters the frozen phase.
In the frozen phase, every node participating in training must confirm its participation by writing a value to /rdzv/v_${version_counter}/rank_${agent_rank}. After all nodes have confirmed, the final phase begins.
In the final phase, agents that join afterwards are left pending. The agents on the nodes that completed the rendezvous assign a RANK to each worker process they manage; the worker with RANK 0 acts as the master. The corresponding worker processes are then created directly: in the default LocalElasticAgent, multiple processes are spawned locally via Python multiprocessing.
@prof
def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
    spec = worker_group.spec
    store = worker_group.store
    ...
    for worker in worker_group.workers:
        local_rank = worker.local_rank
        worker_env = {
            "LOCAL_RANK": str(local_rank),
            "RANK": str(worker.global_rank),
            ...
        }
        ...
        args[local_rank] = tuple(worker_args)
    ...
    self._pcontext = start_processes(
        name=spec.role,
        entrypoint=spec.entrypoint,
        args=args,
        envs=envs,
        log_dir=attempt_log_dir,
        start_method=self._start_method,
        redirects=spec.redirects,
        tee=spec.tee,
    )

    return self._pcontext.pids()
The new c10d-based design
The previous section introduced the etcd-based rendezvous implementation, which guarantees strong consistency of the training membership across instances, but it also introduces an additional dependency for running PyTorch training jobs. Therefore, PyTorch also provides a built-in implementation, c10d. Compared with the etcd-based implementation, c10d performs synchronization over TCP.
def create_backend(params: RendezvousParameters) -> Tuple[C10dRendezvousBackend, Store]:
    ...
    if store_type == "file":
        store = _create_file_store(params)
    elif store_type == "tcp":
        store = _create_tcp_store(params)
    ...
    backend = C10dRendezvousBackend(store, params.run_id)


def _create_tcp_store(params: RendezvousParameters) -> TCPStore:
    host, port = parse_rendezvous_endpoint(params.endpoint, default_port=29400)
    ...
    for is_server in [is_host, False]:
        ...
        store = TCPStore(
            host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)
        )
        ...
        break

    return store
c10d uses a client-server architecture. One of the agents runs the c10d TCPServer, which listens on a given port and provides primitives such as compareAndSet and add. It can be thought of as a simplified in-memory database with a KV interface, similar to Redis. The rendezvous synchronization is then performed by all agents through the c10d TCPServer hosted on that one centralized agent. It is easy to see that such an implementation falls somewhat short of etcd in terms of availability, but it wins on ease of use: with c10d, users no longer need to operate and maintain an etcd cluster.
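To get a feel for this KV interface, here is a small self-contained sketch that uses torch.distributed.TCPStore directly (the key names are arbitrary; in a real job the store is created and driven for you by the rendezvous backend):

from datetime import timedelta

from torch.distributed import TCPStore

# host, port, world_size, is_master, timeout: one process acts as the server
# (is_master=True); other processes would connect to the same host and port
# with is_master=False
store = TCPStore("localhost", 29500, 1, True, timedelta(seconds=30))

# simple KV operations backed by the TCP server
store.set("status", "joinable")
print(store.get("status"))               # b'joinable'

# add() atomically increments an integer counter, handy for counting participants
print(store.add("num_participants", 1))  # 1
print(store.add("num_participants", 1))  # 2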
PyTorch Elastic on Kubernetes
In order to enjoy the convenience of elastic training, PyTorch also provides support on Kubernetes. Compared with versions before 1.9.0, the new distributed training adds some new parameters, so the PyTorch community has made some modifications to the CRD based on the Kubeflow PyTorch operator. A typical elastic training example looks as follows:
apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: imagenet
  namespace: elastic-job
spec:
  # Use "etcd-service:2379" if you already apply etcd.yaml
  rdzvEndpoint: "<your_etcd_endpoint>:<your_etcd_port>"
  minReplicas: 1
  maxReplicas: 2
  replicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: elasticjob-worker
              image: torchelastic/examples:0.2.0
              imagePullPolicy: Always
              args:
                - "--nproc_per_node=1"
                - "/workspace/examples/imagenet/main.py"
                - "--arch=resnet18"
                - "--epochs=20"
                - "--batch-size=32"
                # number of data loader workers (NOT trainers)
                # zero means load the data on the same process as the trainer
                # this is set so that the container does not OOM since
                # pytorch data loaders use shm
                - "--workers=0"
                - "/workspace/data/tiny-imagenet-200"
              resources:
                limits:
                  nvidia.com/gpu: 1
At the moment, c10d-based rendezvous is not yet supported here, so rdzvEndpoint in the CRD needs to point to an already deployed etcd cluster. The user also needs to specify minReplicas and maxReplicas. Beyond that, it is no different from a Kubeflow PyTorchJob.
PyTorch Elastic and Horovod Elastic
At present, the two designs are the same in principle. Compared with Horovod Elastic, PyTorch Elastic offers more flexible extensibility: it exposes interfaces such as the agent and rendezvous that users can extend according to their own needs. From another perspective, however, Horovod is simpler to use.
PyTorch provides no built-in support for saving training state. To be able to rebuild the training job when a worker process fails, the user has to implement the logic for saving and loading checkpoints, whereas Horovod provides a built-in implementation.
Horovod and PyTorch also differ considerably in their synchronization mechanisms. Horovod Elastic requires the user to provide a script, discover_hosts.sh, that tells it at runtime which nodes are participating in training.
$ horovodrun -np 8 --host-discovery-script discover_hosts.sh python train.py
...
$ ./discover_hosts.sh
host-1:29500
host-2:29500
host-3:29500
This effectively hands the node discovery logic over to the user to implement. PyTorch, by contrast, solves the mutual discovery problem between nodes with components such as etcd or its own c10d implementation, which is more refined.
Summary
At the end of the article, we summarize the current issues that need attention when implementing elastic training.
First and foremost, elastic training requires a mechanism for nodes and training processes to discover one another. During training, nodes dynamically join and leave, and making the other nodes aware of these changes is the main problem this mechanism has to solve. In the current designs, Horovod delegates the problem to the user: it periodically executes user-defined logic to discover the current set of nodes. PyTorch achieves highly available node discovery through the third-party distributed consistency middleware etcd. There is also exploratory work that synchronizes membership via Gossip-based protocols, balancing high availability against the number of extra components introduced.
Secondly, elastic training also needs to capture training failures. Both Horovod and PyTorch implement this through a background process (the Driver in Horovod and the per-node local elastic agent in PyTorch). When a training process crashes or runs into problems during gradient communication, the background process captures the failure, re-runs node discovery, and restarts the training.
Finally, the data sharding logic and the learning rate / batch size settings need to be adjusted during training. Since the number of training processes grows and shrinks dynamically, the learning rate and the data distribution logic may have to be reset according to the new number of processes to avoid hurting model convergence; a rough sketch of this is given below.
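As a rough sketch of what such an adjustment could look like on each restart (the linear learning-rate scaling rule and a rebuilt DistributedSampler are common choices, not something torch.distributed.run does automatically; base_lr and base_world_size are illustrative parameters):

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def rebuild_for_new_world(dataset, optimizer, base_lr, base_world_size):
    # called after each (re)initialization of the process group
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # linear scaling rule: keep the learning rate proportional to the global batch size
    for group in optimizer.param_groups:
        group["lr"] = base_lr * world_size / base_world_size

    # reshard the data across the current set of workers
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    return loader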
In this article, we first introduced the design and implementation of elastic training in PyTorch 1.9.0, and then analyzed and summarized how elastic training is achieved and how the designs of different frameworks differ. From our point of view, elastic training fits the cloud-native trend well: reducing costs and increasing resource utilization through extreme elasticity is the direction things are heading. We are therefore actively contributing to elastic training in communities such as TensorFlow, PyTorch, and Kubeflow. More related articles will be published in the future; thank you for your attention.