Background
Compared with traditional workloads, machine learning workloads have a notable characteristic: a strong demand for GPUs. As introduced in the previous articles ( https://mp.weixin.qq.com/s/Nasm-cXLtJObjLwLQHALmw and https://mp.weixin.qq.com/s/X4VDynLfKdbe-tyciQ16128Q), GPU memory currently cannot keep up with the growth of model parameter counts. With the emergence of new model structures such as the Transformer, this problem has become more and more pronounced. Algorithm engineers need more and more resources to train their models, and distributed training has become the standard approach to model training in the industry.
Elastic training can dynamically adjust the number of instances participating in a job while training is in progress, which greatly improves cluster resource utilization. Combined with cloud resource types such as spot instances, models can be tuned at lower cost, further reducing costs and increasing efficiency. In the latest PyTorch release, 1.9.0, the original distributed launcher torch.distributed.launch is about to be deprecated, and users are encouraged to switch to the elastic distributed training entry point torch.distributed.run.
We take this opportunity to briefly introduce this new feature, compare it with Horovod Elastic, and finally summarize the issues that need attention when using elastic training.
Design before PyTorch 1.9.0
PyTorch is one of the most popular deep learning frameworks, and the quality it is most praised for is its ease of use. Whether for single-machine training or distributed training, PyTorch provides a concise API.
Before PyTorch 1.9.0, distributed training was usually launched in the following way.
python -m torch.distributed.launch
--nnodes=NODE_SIZE
--nproc_per_node=TRAINERS_PER_NODE
--node_rank=NODE_RANK
--master_port=HOST_PORT
--master_addr=HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Here nnodes is the number of nodes participating in the training and nproc_per_node is the number of processes running on each node. node_rank is the identifier of the current node, while master_addr and master_port are the address and port that the master listens on. torch.distributed.launch sets a number of environment variables, including WORLD_SIZE, MASTER_PORT, MASTER_ADDR and so on.
The corresponding processes are then created on the current machine for training. Each machine runs TRAINERS_PER_NODE processes, which form a local worker group. A total of NODE_SIZE machines participate in the training, for a total of NODE_SIZE * TRAINERS_PER_NODE processes. To launch a distributed training task, the corresponding command has to be executed on every machine.
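To make the division of labor concrete, here is a minimal sketch of the kind of training script (YOUR_TRAINING_SCRIPT.py) that works with this launcher; the model and data are placeholders. torch.distributed.launch passes --local_rank to each worker process by default and exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which init_method="env://" reads:

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker by default
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # from the environment variables set by the launcher
    # (on GPU one would use backend="nccl" and torch.cuda.set_device(args.local_rank))
    dist.init_process_group(backend="gloo", init_method="env://")

    # placeholder model; wrapping it with DDP synchronizes gradients across workers
    model = DDP(torch.nn.Linear(10, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        inputs = torch.randn(8, 10)   # placeholder data
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()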
New design in PyTorch 1.9.0
In PyTorch 1.9, torch.distributed.launch is about to be deprecated and replaced by torch.distributed.run, which is based on pytorch/elastic (TorchElastic). The new entry point changes the usage slightly compared with the previous one, as shown below.
python -m torch.distributed.run
--nnodes=MIN_SIZE:MAX_SIZE
--nproc_per_node=TRAINERS_PER_NODE
--rdzv_id=JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=HOST_NODE_ADDR
YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
It provides several new capabilities: first, better fault tolerance: when a worker fails, the workers are automatically restarted so training can continue; second, the RANK and WORLD_SIZE fields no longer need to be set manually; finally, and most importantly, it supports elastic training, dynamically increasing or decreasing the number of workers participating in training. In the example above, nnodes is no longer a fixed value but an interval, and the training job can tolerate changes in the number of workers within this range.
To support elasticity, the training code also needs to be modified.
def main():
    args = parse_args(sys.argv[1:])
    state = load_checkpoint(args.checkpoint_path)
    initialize(state)

    # torch.distributed.run ensures that this will work
    # by exporting all the env vars needed to initialize the process group
    torch.distributed.init_process_group(backend=args.backend)

    for i in range(state.epoch, state.total_num_epochs):
        for batch in iter(state.dataset):
            train(batch, state.model)

        state.epoch += 1
        save_checkpoint(state)
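The load_checkpoint and save_checkpoint helpers above are not provided by PyTorch. A minimal sketch of what they could look like, assuming a simple State container, a placeholder model and a checkpoint path on storage visible to all nodes (none of these names are part of the torch.distributed API):

import os

import torch


class State:
    """Illustrative container for everything that must survive a restart."""

    def __init__(self, model, checkpoint_path, epoch=0, total_num_epochs=10):
        self.model = model
        self.checkpoint_path = checkpoint_path
        self.epoch = epoch
        self.total_num_epochs = total_num_epochs


def load_checkpoint(checkpoint_path):
    # build the (placeholder) model first, then overwrite its weights with the
    # latest snapshot if one exists on the shared path
    state = State(torch.nn.Linear(10, 10), checkpoint_path)
    if os.path.exists(checkpoint_path):
        snapshot = torch.load(checkpoint_path, map_location="cpu")
        state.model.load_state_dict(snapshot["model"])
        state.epoch = snapshot["epoch"]
    return state


def save_checkpoint(state):
    # only one worker needs to write; rank 0 is the usual choice
    if int(os.environ.get("RANK", "0")) == 0:
        torch.save(
            {"epoch": state.epoch, "model": state.model.state_dict()},
            state.checkpoint_path,
        )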
One of the more visible changes, as just illustrated, is that users need to handle checkpoints themselves. This is because when a worker fails, all workers are restarted, so a checkpoint mechanism is needed to ensure that training can resume after the restart. This new distributed training method also introduces several new concepts, including the agent and rendezvous. Next, we will walk through these new designs, starting from the entry point of torch.distributed.run:
def run(args):
    if args.standalone:
        args.rdzv_backend = "c10d"
        args.rdzv_endpoint = "localhost:29400"
        args.rdzv_id = str(uuid.uuid4())
        log.info(
            f"\n**************************************\n"
            f"Rendezvous info:\n"
            f"--rdzv_backend={args.rdzv_backend} "
            f"--rdzv_endpoint={args.rdzv_endpoint} "
            f"--rdzv_id={args.rdzv_id}\n"
            f"**************************************\n"
        )

    config, cmd, cmd_args = config_from_args(args)
    elastic_launch(
        config=config,
        entrypoint=cmd,
    )(*cmd_args)
Two modes are distinguished here: standalone mode and distributed mode. Standalone mode is a special case of distributed mode: it mainly provides convenient defaults for the single-machine multi-worker case, so that redundant parameters such as rdzv_backend and rdzv_endpoint no longer need to be set.
Both modes eventually start the real training processes through elastic_launch, which manages the life cycle of the workers via the elastic agent and returns the output of each worker.
class elastic_launch:
    ...

    def __call__(self, *args):
        return launch_agent(self._config, self._entrypoint, list(args))


def launch_agent(
    config: LaunchConfig,
    entrypoint: Union[Callable, str, None],
    args: List[Any],
) -> Dict[int, Any]:
    ...
    agent = LocalElasticAgent(
        spec=spec, start_method=config.start_method, log_dir=config.log_dir
    )
    ...
    result = agent.run()
    ...
    return result.return_values
Design of Elastic Agent: How to manage multiple worker processes
The elastic agent is an independent process responsible for managing the workers under it. It plays a role similar to supervisor in process management systems, making sure each worker is started with the correct settings. Since WORLD_SIZE and RANK no longer need to be provided by the user, the elastic agent takes care of them.
In addition, worker failures are also captured and handled by the elastic agent. It is fair to say that the elastic agent is the core abstraction in elastic training.
The working principle is as follows: elastic agents on different nodes use rendezvous to discover each other's workers and to synchronize membership changes, and at the same time each agent monitors its worker processes to capture failures during training. The core logic is encapsulated in LocalElasticAgent.run():
def run(self, role: str = DEFAULT_ROLE) -> RunResult:
    ...
    result = self._invoke_run(role)
    return result


def _invoke_run(self, role: str = DEFAULT_ROLE) -> RunResult:
    ...
    self._initialize_workers(self._worker_group)
    while True:
        ...
        run_result = self._monitor_workers(self._worker_group)
        state = run_result.state
        ...
        if state == WorkerState.SUCCEEDED:
            ...
            return run_result
        elif state in {WorkerState.UNHEALTHY, WorkerState.FAILED}:
            if self._remaining_restarts > 0:
                ...
                self._restart_workers(self._worker_group)
            else:
                ...
                return run_result
        elif state == WorkerState.HEALTHY:
            ...
            if num_nodes_waiting > 0:
                ...
                self._restart_workers(self._worker_group)
        else:
            raise Exception(f"[{role}] Worker group in {state.name} state")
As you can see, the core control loop lives in _invoke_run. Among the steps, _initialize_workers performs most of the initialization work, including assigning a RANK to each worker. In the default implementation, the elastic agent and the worker processes are on the same machine, so self._monitor_workers(self._worker_group) obtains the running status of the workers through multiprocessing, and then handles each state differently.
The elastic agent is designed to be highly extensible. In version 1.9.0 there are three agent classes in total, namely ElasticAgent, SimpleElasticAgent and LocalElasticAgent.
Among them, ElasticAgent is an abstract class, SimpleElasticAgent implements part of its functionality, and LocalElasticAgent implements an elastic agent that manages all worker processes on a single machine.
SimpleElasticAgent mainly exists to make it easier to extend new agent implementations. For example, if you want one agent to manage all workers across multiple machines rather than just the workers on the local machine, you can do so by extending SimpleElasticAgent, as sketched below.
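A minimal sketch of such an extension is shown below. It assumes the torch.distributed.elastic.agent.server API of version 1.9.0, where SimpleElasticAgent leaves _start_workers, _stop_workers, _monitor_workers and _shutdown to subclasses; the remote-execution details are only hinted at with comments, so this is a skeleton rather than a working implementation:

from typing import Any, Dict

from torch.distributed.elastic.agent.server.api import (
    RunResult,
    SimpleElasticAgent,
    WorkerGroup,
    WorkerSpec,
)


class MultiMachineElasticAgent(SimpleElasticAgent):
    """Hypothetical agent that would drive workers on several machines."""

    def __init__(self, spec: WorkerSpec, exit_barrier_timeout: float = 300):
        super().__init__(spec, exit_barrier_timeout)

    def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
        # here one would ssh / call a remote API to start each worker and
        # return a mapping of local_rank -> worker id
        raise NotImplementedError

    def _stop_workers(self, worker_group: WorkerGroup) -> None:
        # tear down the remote worker processes
        raise NotImplementedError

    def _monitor_workers(self, worker_group: WorkerGroup) -> RunResult:
        # poll the remote workers and translate their status into a RunResult
        raise NotImplementedError

    def _shutdown(self) -> None:
        # release any remote resources held by the agent
        raise NotImplementedError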
The design of rendezvous: How to determine RANK across different nodes
Next, let's look at the other core abstraction, rendezvous. To achieve elastic training, the membership of the worker group must be able to change dynamically, and rendezvous is the synchronization component that realizes this. Its core method is:
@abstractmethod
def next_rendezvous(
    self,
) -> Tuple[Store, int, int]:
    """Main entry-point into the rendezvous barrier.

    Blocks until the rendezvous is complete and the current process is
    included in the formed worker group, or a timeout occurs, or the
    rendezvous was marked closed.

    Returns:
        A tuple of :py:class:`torch.distributed.Store`, ``rank``, and
        ``world size``.

    Raises:
        RendezvousClosedError:
            The rendezvous is closed.
        RendezvousConnectionError:
            The connection to the rendezvous backend has failed.
        RendezvousStateError:
            The rendezvous state is corrupt.
        RendezvousTimeoutError:
            The rendezvous did not complete on time.
    """
As the docstring indicates, this call blocks until the required number of workers has arrived. It is invoked when a worker is initialized or restarted, and when it returns, each worker uses the returned rank as its unique identifier. There are four rendezvous implementations in total, namely etcd, etcd-v2, c10d and static.
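Before looking at the concrete implementations, here is a minimal sketch of how an agent-side caller obtains a handler and joins the barrier. It assumes the 1.9.0 module layout (torch.distributed.elastic.rendezvous) and uses the c10d backend so no external store is needed; torch.distributed.run normally performs these calls, including the handler registration, for you:

from torch.distributed.elastic.rendezvous import RendezvousParameters, registry

# torch.distributed.run registers the built-in backends before launching agents;
# when driving the registry directly we do it ourselves
registry._register_default_handlers()

# describe the job: backend, endpoint, job id and the allowed node range
params = RendezvousParameters(
    backend="c10d",
    endpoint="localhost:29400",
    run_id="example-job",
    min_nodes=1,
    max_nodes=4,
)

handler = registry.get_rendezvous_handler(params)

# blocks until enough participants have joined, then returns a Store shared
# by the worker group plus this node's rank and the group size
store, rank, world_size = handler.next_rendezvous()
print(f"joined rendezvous as rank {rank} of {world_size}")

The etcd-based EtcdRendezvousHandler, for example, implements next_rendezvous by delegating to its internal rendezvous_barrier: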
class EtcdRendezvousHandler(RendezvousHandler):
    def next_rendezvous(self):
        rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()

        log.info("Creating EtcdStore as the c10d::Store implementation")
        store = self._rdzv_impl.setup_kv_store(rdzv_version)

        return store, rank, world_size
Among them, etcd is the previously recommended implementation; since c10d was introduced it is no longer the preferred one. In the etcd implementation, the state shared between the different workers is stored through etcd's KV interface. The process of determining which instances participate in training and their corresponding RANKs proceeds as follows.
First, the agent tries to write the value status: setup to /rdzv/active_version. Throughout the process, /rdzv/active_version serves both as the KV store holding the intermediate state of the rendezvous procedure and as an exclusive lock over it.
If the write fails, it means that a rendezvous process is already in progress.
After a successful write, /rdzv/version_counter is incremented by one and a directory /rdzv/v_${version_counter} is created. Once these operations are done, the status in /rdzv/active_version is set to joinable, and the join phase begins.
In the join phase, under the protection of the lock, the agents take turns adding themselves to participants under /rdzv/active_version and are assigned incrementing ranks. The rank here is not the global rank of each worker process but the agent's own rank; the worker processes' ranks are later derived from the agent rank. This design is quite easy to confuse, and in my opinion there is room for improvement; a rough illustration of the derivation is sketched below.
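The following is only a rough illustration of that derivation, not the exact code in torch.distributed.elastic; it assumes every node runs the same number of workers (nproc_per_node), which is the common case with torch.distributed.run:

def derive_worker_ranks(agent_rank: int, nproc_per_node: int, num_agents: int):
    """Illustrative only: map an agent rank to its workers' global ranks."""
    world_size = num_agents * nproc_per_node
    ranks = []
    for local_rank in range(nproc_per_node):
        global_rank = agent_rank * nproc_per_node + local_rank
        ranks.append((local_rank, global_rank, world_size))
    return ranks


# e.g. the agent with rank 1 in a 3-node job with 2 workers per node
for local_rank, global_rank, world_size in derive_worker_ranks(1, 2, 3):
    print(f"LOCAL_RANK={local_rank} RANK={global_rank} WORLD_SIZE={world_size}")

Back in the etcd implementation, the entry into these phases is handled by init_phase: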
def init_phase(self):
    try:
        active_version = self.try_create_rendezvous()
        state = json.loads(active_version.value)
        log.info("New rendezvous state created: " + str(state))
    except etcd.EtcdAlreadyExist:
        # a rendezvous process already exists
        active_version, state = self.get_rdzv_state()
        # Note: it is possible for above query to fail (etcd.EtcdKeyNotFound),
        # but this is ok for us - just means we'll restart from beginning.
        log.info("Observed existing rendezvous state: " + str(state))

    if state["status"] == "closed":
        raise RendezvousClosedError()

    if state["status"] == "joinable":
        return self.join_phase(state["version"])

    if state["status"] == "final":
        self.handle_existing_rendezvous(state["version"])
        raise EtcdRendezvousRetryImmediately()

    self.try_wait_for_state_change(etcd_index=active_version.etcd_index + 1)
    raise EtcdRendezvousRetryableFailure()
When the number of nodes participating in training reaches the minimum value given by the nnodes command-line parameter, rendezvous waits for a certain amount of time. When that wait expires, or when the number of participating nodes reaches the maximum value of nnodes, it enters the frozen phase.
In the frozen phase, every node participating in training must confirm its participation by writing a value to /rdzv/v_${version_counter}/rank_${agent_rank}. After all nodes have confirmed, the final phase begins.
In the final phase, agents that join afterwards are left pending. The agents on the nodes that completed the rendezvous assign a RANK to each worker process they manage; the worker with RANK 0 acts as the master. The corresponding worker processes are then created directly: in the default LocalElasticAgent, multiple processes are spawned locally via Python multiprocessing.
@prof
def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
    spec = worker_group.spec
    store = worker_group.store
    ...
    for worker in worker_group.workers:
        local_rank = worker.local_rank
        worker_env = {
            "LOCAL_RANK": str(local_rank),
            "RANK": str(worker.global_rank),
            ...
        }
        ...
        args[local_rank] = tuple(worker_args)
    ...
    self._pcontext = start_processes(
        name=spec.role,
        entrypoint=spec.entrypoint,
        args=args,
        envs=envs,
        log_dir=attempt_log_dir,
        start_method=self._start_method,
        redirects=spec.redirects,
        tee=spec.tee,
    )

    return self._pcontext.pids()
The new c10d-based design
The previous section introduced the etcd-based rendezvous implementation, which guarantees strong consistency of the training membership across instances, but it also introduces an additional dependency for running PyTorch training jobs. Therefore, PyTorch also provides a built-in implementation, c10d. Compared with the etcd-based implementation, c10d performs synchronization over TCP.
def create_backend(params: RendezvousParameters) -> Tuple[C10dRendezvousBackend, Store]:
    ...
    if store_type == "file":
        store = _create_file_store(params)
    elif store_type == "tcp":
        store = _create_tcp_store(params)
    ...
    backend = C10dRendezvousBackend(store, params.run_id)


def _create_tcp_store(params: RendezvousParameters) -> TCPStore:
    host, port = parse_rendezvous_endpoint(params.endpoint, default_port=29400)
    ...
    for is_server in [is_host, False]:
        ...
        store = TCPStore(
            host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)
        )
        ...
        break

    return store
c10d uses a client-server architecture. One of the agents runs the c10d TCPServer, which listens on a given port and provides primitives such as compareAndSet and add. It can be thought of as a simplified in-memory database with a KV interface, similar to Redis. The rendezvous synchronization is then performed by all agents through the c10d TCPServer hosted on that one centralized agent. It is easy to see that such an implementation falls somewhat short of etcd in terms of availability, but it wins on ease of use: with c10d, users no longer need to operate and maintain an etcd cluster.
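To get a feel for this KV interface, here is a small self-contained sketch that uses torch.distributed.TCPStore directly (the key names are arbitrary; in a real job the store is created and driven for you by the rendezvous backend):

from datetime import timedelta

from torch.distributed import TCPStore

# host, port, world_size, is_master, timeout: one process acts as the server
# (is_master=True); other processes would connect to the same host and port
# with is_master=False
store = TCPStore("localhost", 29500, 1, True, timedelta(seconds=30))

# simple KV operations backed by the TCP server
store.set("status", "joinable")
print(store.get("status"))               # b'joinable'

# add() atomically increments an integer counter, handy for counting participants
print(store.add("num_participants", 1))  # 1
print(store.add("num_participants", 1))  # 2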
PyTorch Elastic on Kubernetes
In order to enjoy the convenience of elastic training, PyTorch also provides support on Kubernetes. Compared with versions before 1.9.0, the new distributed training adds some new parameters, so the PyTorch community has made some modifications to the CRD based on the Kubeflow PyTorch operator. A typical elastic training example looks as follows:
apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: imagenet
  namespace: elastic-job
spec:
  # Use "etcd-service:2379" if you already apply etcd.yaml
  rdzvEndpoint: "<your_etcd_endpoint>:<your_etcd_port>"
  minReplicas: 1
  maxReplicas: 2
  replicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: elasticjob-worker
              image: torchelastic/examples:0.2.0
              imagePullPolicy: Always
              args:
                - "--nproc_per_node=1"
                - "/workspace/examples/imagenet/main.py"
                - "--arch=resnet18"
                - "--epochs=20"
                - "--batch-size=32"
                # number of data loader workers (NOT trainers)
                # zero means load the data on the same process as the trainer
                # this is set so that the container does not OOM since
                # pytorch data loaders use shm
                - "--workers=0"
                - "/workspace/data/tiny-imagenet-200"
              resources:
                limits:
                  nvidia.com/gpu: 1
At the moment, c10d-based rendezvous is not yet supported here, so rdzvEndpoint in the CRD needs to point to an already deployed etcd cluster. The user also needs to specify minReplicas and maxReplicas. Beyond that, it is no different from a Kubeflow PyTorchJob.
PyTorch Elastic and Horovod Elastic
At present, the two designs are the same in principle. Compared with Horovod Elastic, PyTorch Elastic offers more flexible extensibility: it exposes interfaces such as the agent and rendezvous that users can extend according to their own needs. From another perspective, however, Horovod is simpler to use.
PyTorch provides no built-in support for saving training state. To be able to rebuild the training job when a worker process fails, the user has to implement the logic for saving and loading checkpoints, whereas Horovod provides a built-in implementation.
Horovod and PyTorch also differ considerably in their synchronization mechanisms. Horovod Elastic requires the user to provide a script, discover_hosts.sh, that tells it at runtime which nodes are participating in training.
$ horovodrun -np 8 --host-discovery-script discover_hosts.sh python train.py
...
$ ./discover_hosts.sh
host-1:29500
host-2:29500
host-3:29500
This effectively hands the node discovery logic over to the user to implement. PyTorch, by contrast, solves the mutual discovery problem between nodes with components such as etcd or its own c10d implementation, which is more refined.
Summary
At the end of the article, we summarize the current issues that need attention when implementing elastic training.
First and foremost, elastic training requires a mechanism for nodes and training processes to discover one another. During training, nodes dynamically join and leave, and making the other nodes aware of these changes is the main problem this mechanism has to solve. In the current designs, Horovod delegates the problem to the user: it periodically executes user-defined logic to discover the current set of nodes. PyTorch achieves highly available node discovery through the third-party distributed consistency middleware etcd. There is also exploratory work that synchronizes membership via Gossip-based protocols, balancing high availability against the number of extra components introduced.
Secondly, elastic training also needs to capture training failures. Both Horovod and PyTorch implement this through a background process (the Driver in Horovod and the per-node local elastic agent in PyTorch). When a training process crashes or runs into problems during gradient communication, the background process captures the failure, re-runs node discovery, and restarts the training.
Finally, the data sharding logic and the learning rate / batch size settings need to be adjusted during training. Since the number of training processes grows and shrinks dynamically, the learning rate and the data distribution logic may have to be reset according to the new number of processes to avoid hurting model convergence; a rough sketch of this is given below.
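As a rough sketch of what such an adjustment could look like on each restart (the linear learning-rate scaling rule and a rebuilt DistributedSampler are common choices, not something torch.distributed.run does automatically; base_lr and base_world_size are illustrative parameters):

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def rebuild_for_new_world(dataset, optimizer, base_lr, base_world_size):
    # called after each (re)initialization of the process group
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # linear scaling rule: keep the learning rate proportional to the global batch size
    for group in optimizer.param_groups:
        group["lr"] = base_lr * world_size / base_world_size

    # reshard the data across the current set of workers
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    return loader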
In this article, we first introduced the design and implementation of elastic training in PyTorch 1.9.0, and then analyzed and summarized how elastic training is achieved and how the designs of different frameworks differ. From our point of view, elastic training fits the cloud-native trend well: reducing costs and increasing resource utilization through extreme elasticity is the direction things are heading. We are therefore actively contributing to elastic training in communities such as TensorFlow, PyTorch, and Kubeflow. More related articles will be published in the future; thank you for your attention.