1. Why Distributed Training Is Needed

With the development of artificial intelligence and deep learning, large-scale and ultra-large-scale models are drawing more and more attention from the industry. Take NLP as an example: from the roughly 100 million parameters of the original BERT-base, to the 100-billion-parameter GPT-3, and then to WuDao 2.0 ("Enlightenment 2.0"), released in June this year as the world's largest pre-trained model with an astonishing 1.75 trillion parameters, the whole industry is clearly moving toward ever larger models. Training such huge models inevitably requires huge amounts of data, and without the computing power of distributed training, a single epoch could take practically forever. Even setting aside the industry's race to build super-large models, for an ordinary algorithm engineer in the AI industry, distributed training also greatly accelerates model training, parameter tuning, and version iteration in daily work. When time is precious, I believe no engineer would turn down the benefits of distributed training. So today let's talk about distributed training in deep learning.

2. Distributed Training Strategies

Distributed training strategies differ in how the work is parallelized, and can be roughly divided into data parallelism and model parallelism.

2.1 Data Parallelism

Data parallelism means keeping a full copy of the model on each GPU, feeding different data to different GPUs for computation, and finally merging the results from all GPUs, so that model training is accelerated. Because data parallelism involves merging the computation results of different GPUs and then updating the model, it can be divided into synchronous updates and asynchronous updates according to how the update is performed. In data parallelism, each GPU computes only a part of each batch.

With synchronous updates, all GPUs must finish their computation before the gradients are merged, the network weights are updated once, and the new weights are broadcast to all GPUs; only then does the next round of computation begin. Asynchronous updates work differently: after a GPU finishes its own computation, it does not wait for the other GPUs; it updates the global weights immediately, broadcasts them to the other GPUs, and moves straight on to the next round.

Synchronous updates therefore have to wait for every GPU to finish before the weights can be updated. If one GPU in the cluster trains slowly, or the network communication jitters, the training speed of the whole cluster suffers, much like the barrel effect where the shortest stave determines the capacity. Asynchronous updates do not wait for other GPU nodes, so overall training is faster, but they suffer from a serious stale-gradient problem: each node updates the shared weights as soon as it finishes, so the parameters other nodes used at the start of their round no longer match the parameters being updated, and their gradients are already out of date. As a result, although asynchronous updates are fast, the model often falls into a sub-optimal solution because of stale gradients.
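To make the synchronous case concrete, the snippet below is a minimal sketch of a single synchronous update step built on torch.distributed. The synchronous_step helper is hypothetical and only meant to illustrate the gradient averaging; it is not taken from any framework.

import torch.distributed as dist

# Minimal sketch of one synchronous data-parallel update step (synchronous_step is a
# hypothetical helper). Assumes dist.init_process_group() has already been called and
# every process holds an identical replica of the model.
def synchronous_step(model, loss, optimizer):
    optimizer.zero_grad()
    loss.backward()                    # each GPU computes gradients on its own slice of the batch
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across all GPUs
            param.grad /= world_size                           # average them
    optimizer.step()                   # every worker applies the same averaged update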

2.2 Model Parallelism

Different from data parallelism, model parallelism in distributed training means splitting the neural network model itself and distributing the pieces to different GPUs, with each GPU responsible for computing a different part of the network. It is usually used when the model is so large that the memory of a single GPU cannot hold the whole network. Since a deep learning model usually consists of many layers that run sequentially, and forward propagation and backward gradient computation make adjacent layers depend on each other as input and output, this serial logic puts certain limits on the achievable speed-up. In return, however, model parallelism lets us train super-large models that simply would not fit on a single GPU at all, as the sketch below illustrates.
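For illustration, a naive model-parallel module in PyTorch can simply place different layers on different GPUs and move the activations between devices in forward(). The layer sizes and device names below are made up purely for the example.

import torch.nn as nn

# Hypothetical two-GPU split: the first part of the network lives on cuda:0,
# the second part on cuda:1, and activations are moved between devices in forward().
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))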
In contrast, then, with model parallelism each GPU loads only part of the network, and the parts depend on each other, so it scales poorly: the number of GPUs cannot be increased or decreased at will, and model parallelism is therefore not used much in practice. Data parallelism, on the other hand, keeps the GPUs independent of each other, which makes it easy to scale the number of GPUs up or down and gives a good speed-up, so it is used far more often in practice. In some cases, of course, data parallelism and model parallelism can also be combined.

3. Distributed Training Based on PyTorch

PyTorch provides two multi-GPU distributed training solutions: torch.nn.DataParallel (DP) and torch.nn.parallel.DistributedDataParallel (DDP).

3.1 DataParallel

The DP mode is very easy to use: you only need to add one line to single-GPU code to get it running. However, because DP uses a Parameter Server (PS) architecture, it suffers from load imbalance; the main card often becomes the training bottleneck, so training is slower than in DDP mode. In addition, DP only supports a single machine with multiple cards, and a machine can generally hold at most 8 cards, which becomes quite tight for particularly large training tasks, so there are certain limits.

# use DataParallel: a single extra line wraps the model when more than one GPU is available
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = torch.nn.DataParallel(model)
model.to(device)
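This one line is also where the load imbalance mentioned above comes from: DataParallel splits each input batch along the first dimension, scatters the chunks to the GPUs, and gathers all outputs back on the primary device. The shapes below are invented purely for illustration.

# Illustration only (hypothetical shapes): with 4 GPUs and a batch of 128 images,
# DataParallel scatters 32 samples to each GPU, then gathers the outputs back on
# the primary card, which is why that card does extra work and becomes the bottleneck.
inputs = torch.randn(128, 3, 224, 224).to(device)
outputs = model(inputs)   # outputs are collected on the default (primary) GPU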

3.2 DistributedDataParallel

Unlike the DP mode, DDP is designed for multi-machine, multi-card training, although it can of course also be used on a single machine with multiple cards. DDP uses an all-reduce architecture, which largely solves the problem in the PS architecture that communication cost grows linearly with the number of GPUs. Even on a single machine with multiple cards, where DP would work, DDP is usually faster, so DDP is also the mode officially recommended by PyTorch. Modifying existing code to use DDP is also very convenient and can be done with the following steps.

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

# 1. init the process group with the nccl backend (one process per GPU)
torch.distributed.init_process_group(backend='nccl')
# 2. config gpu: get_rank() returns the global rank, which equals the local rank on a single machine
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
# 3. use DistributedSampler so each process sees a different shard of the data
training_loader = DataLoader(training_set, batch_size=TRAIN_BATCH_SIZE, sampler=DistributedSampler(training_set))
# 4. move model to gpu
model.to(device)
# 5. wrap the model with DistributedDataParallel
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
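The script is then launched with one process per GPU, typically via torch.distributed.launch or the newer torchrun. One detail that is easy to miss: DistributedSampler has to be told the current epoch, otherwise every epoch sees the same shuffling. A minimal sketch of the training loop, where EPOCHS and train_one_epoch are hypothetical placeholders:

# Sketch only: call set_epoch() at the start of every epoch so that DistributedSampler
# reshuffles the data differently each epoch (EPOCHS and train_one_epoch are placeholders).
for epoch in range(EPOCHS):
    training_loader.sampler.set_epoch(epoch)
    train_one_epoch(model, training_loader, optimizer, device)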

3.3 Horovod

In addition to the DP and DDP modes natively provided by PyTorch, there are also many excellent third-party distributed training tools, of which Horovod is one of the more commonly used. Horovod is Uber's open-source cross-platform distributed training framework (the name comes from a Russian folk dance in which the dancers hold hands and dance in a circle, an analogy for the communication pattern between GPU devices; if the framework had been developed in China, I guess it might have been called "Guozhuang" ^-^). As the name suggests, Horovod uses an all-reduce architecture to improve the communication efficiency of distributed devices. Horovod supports not only PyTorch but also other deep learning frameworks such as TensorFlow. Using Horovod in training actually requires relatively few changes to the code, as shown below.

import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# 1. init horovod
hvd.init()
# 2. Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())
# 3. Partition dataset among workers using DistributedSampler
train_sampler = DistributedSampler(training_set, num_replicas=hvd.size(), rank=hvd.rank())
training_loader = DataLoader(training_set, batch_size=TRAIN_BATCH_SIZE, sampler=train_sampler)
# 4. Wrap the optimizer with Horovod's DistributedOptimizer (averages gradients via all-reduce)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# 5. Horovod: broadcast parameters from rank 0 to all other processes
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
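Two small details are often paired with the snippet above in typical Horovod usage: the optimizer state is usually broadcast from rank 0 as well, so that every worker starts from identical optimizer buffers, and the script is launched with horovodrun, for example horovodrun -np 4 python train.py on a single 4-GPU machine.

# Often added right after broadcast_parameters: also sync the optimizer state from rank 0.
hvd.broadcast_optimizer_state(optimizer, root_rank=0)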

In addition, ByteDance has open-sourced BytePS, a high-performance distributed deep learning training framework (project GitHub address: https://github.com/bytedance/byteps). Instead of the popular all-reduce, it adopts a PS architecture and improves communication performance by, among other measures, using spare CPU resources as parameter servers; it is said to outperform Horovod. A few days ago, Kuaishou and ETH Zurich also announced the open-sourcing of Bagua, a distributed training framework that designs optimization algorithms specifically for distributed scenarios, achieving joint optimization at the algorithm and system levels, with performance reportedly up to 60% better than comparable frameworks. Interested readers can also take a look; project GitHub address: https://github.com/BaguaSys/bagua

4. Experimental Comparison

Here we compare PyTorch's native DP and DDP modes, and also include the third-party Horovod framework for comparison. In the experiment, a text classification task was trained on a pre-trained language model based on bert-base. The specific experimental settings are as follows: GPU model: V100, learning_rate: 2e-5, batch_size: 128, max_len: 128, epochs: 1, training set size: 480,000 samples.

Since both DDP and Horovod use the all-reduce architecture, their performance is comparable, and PyTorch's native DDP mode already does very well. DP performs worse than the other modes. Therefore, in actual work, DDP or Horovod is recommended for distributed training.

Summary

This article has discussed the model-parallel and data-parallel strategies for distributed training in deep learning, and introduced the native DP and DDP modes of the PyTorch framework as well as the third-party Horovod distributed training framework. From the experimental comparison above, DDP or Horovod is the better choice for daily work. Distributed training is a very important part of deep learning. Besides Horovod, other major companies have also open-sourced their own distributed training frameworks, such as BytePS, DeepSpeed, and Bagua. The open-sourcing of these frameworks will further promote the development of this field and provide better tools for deep learning.

Author profile
Hongyu, Senior NLP Algorithm Engineer at OPPO
Mainly working on NLP, knowledge graphs, and related fields


