Abstract: Today's models and their parameters are becoming increasingly complex. One or two cards can no longer meet the requirements of today's training scale, and distributed training has emerged.
This article is shared from the Huawei Cloud Community post "Distributed Training Allreduce Algorithm", original author: I must win the lottery.
Today's models and their parameters are becoming more and more complex, and one or two cards can no longer meet the demands of modern training scales, so distributed training has emerged as the times require.
What does distributed training look like? Why use the Allreduce algorithm? How do the machines in distributed training communicate with each other? This article walks you through the Allreduce algorithm that large-model training depends on.
Collective communication concepts
Computer algorithms are built from combinations of basic operations, so before explaining distributed training algorithms we first need to understand the primitives they are built on: the basic concepts of collective communication.
Broadcast: the root server (root rank) sends its data to all other servers (ranks).
As shown in the figure, when a server in distributed training has computed part of the parameter data and wants to send that data to all other servers at the same time, this operation is called broadcast.
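To make the semantics concrete, here is a minimal single-process sketch in Python (with NumPy) that simulates what broadcast does to the per-rank buffers. The names `world_size`, `root`, and `rank_data` are illustrative only; in a real framework this would be a single collective call (for example an MPI_Bcast-style operation).

```python
import numpy as np

# Toy single-process simulation of broadcast: 4 ranks, rank 0 is the root.
world_size = 4
root = 0

rank_data = [None] * world_size
rank_data[root] = np.array([1.0, 2.0, 3.0, 4.0])  # only the root holds the data

# Broadcast: every rank ends up with a copy of the root's buffer.
for rank in range(world_size):
    rank_data[rank] = rank_data[root].copy()

print(rank_data)  # all four ranks now hold [1. 2. 3. 4.]
```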
Scatter: splits the data on the root server into equal-sized blocks, and each of the other servers receives one block.
As shown in the figure, a server has computed part of its own parameter data, but all of the parameter data on that server may be too large to send as a whole. We therefore divide the data on this server into several equal-sized blocks (buffers) and send one block to each of the other servers in rank-index order. This is called scatter.
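The following toy sketch simulates the scatter semantics in the same single-process style; the buffer sizes and names are assumptions made for illustration, not a real API.

```python
import numpy as np

# Toy simulation of scatter: the root splits its buffer into world_size
# equal chunks and rank i receives chunk i.
world_size = 4
root_buffer = np.arange(8.0)              # data living on the root rank
chunks = np.split(root_buffer, world_size)

rank_data = [chunks[rank].copy() for rank in range(world_size)]
print(rank_data)  # rank 0 -> [0. 1.], rank 1 -> [2. 3.], ...
```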
Gather: collects the data blocks from the other servers and splices them together, so that the root server (root rank) obtains all of the data.
As shown in the figure, after a scatter each server holds one data block. The operation in which a single server collects those blocks and stitches them back together is called gather.
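A matching toy sketch of gather, again as a single-process simulation with illustrative names (in practice this is one collective call on the root rank):

```python
import numpy as np

# Toy simulation of gather: each rank holds one chunk; the root concatenates
# them in rank order.
world_size = 4
rank_data = [np.array([float(rank), float(rank)]) for rank in range(world_size)]

root_result = np.concatenate(rank_data)   # only the root holds this result
print(root_result)  # [0. 0. 1. 1. 2. 2. 3. 3.]
```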
AllGather (all-gather): every server performs the gather operation above, so every server ends up with all the data from all servers.
As shown in the figure, when every server splices together the data blocks it receives (that is, every server performs the gather), the operation is AllGather.
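The same simulation style shows the only difference from gather: the concatenated result is left on every rank, not just the root. Names are illustrative.

```python
import numpy as np

# Toy simulation of all-gather: same concatenation as gather, but every rank
# ends up with the full result.
world_size = 4
rank_data = [np.array([float(rank), float(rank)]) for rank in range(world_size)]

gathered = np.concatenate(rank_data)
rank_result = [gathered.copy() for _ in range(world_size)]
print(rank_result[2])  # every rank, e.g. rank 2, holds [0. 0. 1. 1. 2. 2. 3. 3.]
```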
Reduce (reduction): applies a reduction operation (such as max or sum) to the data from all servers and writes the result to the root server.
As shown in the figure, while the other servers send their data (as in a broadcast or scatter), the receiving server applies some reduction operation (commonly sum or max) to the data it receives and stores the result in its own memory. This is called reduce.
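A toy sketch of reduce with sum as the reduction operator, in the same single-process simulation style (names are assumptions for illustration):

```python
import numpy as np

# Toy simulation of reduce with sum: the root receives every rank's buffer
# and accumulates them element-wise.
world_size = 4
rank_data = [np.full(4, float(rank + 1)) for rank in range(world_size)]

root_result = np.sum(rank_data, axis=0)   # only the root holds the reduction
print(root_result)  # [10. 10. 10. 10.]  (1+2+3+4 element-wise)
```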
AllReduce (all-reduce): applies a reduction operation (such as max or sum) to the data from all servers and writes the result to every server.
As shown in the figure, when every server completes the reduction operation described above, the result is an all-reduce (Allreduce). This is the most basic building block of distributed training: the reduction combines the data onto every server, so each server ends up with identical reduced data that contains the computed parameters from all of the original servers.
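The corresponding toy sketch makes the difference from reduce explicit: the reduced result is replicated to every rank. This is only a semantic simulation; efficient implementations use the HD or Ring schedules discussed later.

```python
import numpy as np

# Toy simulation of all-reduce with sum: identical reduction, but the result
# is left on every rank.
world_size = 4
rank_data = [np.full(4, float(rank + 1)) for rank in range(world_size)]

reduced = np.sum(rank_data, axis=0)
rank_result = [reduced.copy() for _ in range(world_size)]
print(rank_result[0], rank_result[3])  # every rank holds [10. 10. 10. 10.]
```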
ReduceScatter (reduce-scatter): each server divides its own data into equal-sized blocks, and each server then performs a reduction on the block matching its index; in effect, a scatter followed by a reduce.
We often encounter the term ReduceScatter. Simply put, we first scatter: the data on each server is divided into equal-sized blocks. Then, following the rank index, each server reduces the block it is responsible for. This is similar to all-gather, except that instead of simply splicing the data together, we apply a reduction operation (such as sum or max).
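A toy sketch of reduce-scatter with sum: each rank's buffer is split into `world_size` chunks, chunk i is summed across all ranks, and rank i receives that reduced chunk. Names and buffer contents are illustrative.

```python
import numpy as np

# Toy simulation of reduce-scatter with sum.
world_size = 4
rank_data = [np.arange(8.0) + rank for rank in range(world_size)]

rank_result = []
for i in range(world_size):
    # Take chunk i from every rank and reduce (sum) it; rank i keeps the result.
    chunk_i = [np.split(rank_data[r], world_size)[i] for r in range(world_size)]
    rank_result.append(np.sum(chunk_i, axis=0))

print(rank_result[0])  # rank 0 holds the sum of everyone's first chunk: [ 6. 10.]
```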
With these basic collective communication concepts in mind, we already have a rough picture of distributed training: the training data is partitioned so that each server computes on its own mini-batch, and the results are then synchronized through operations such as reduce, so that the parameter data on every server stays identical.
Distributed communication algorithms
Parameter Server (PS) algorithm: the root server splits the data into N parts and distributes them to the servers (scatter). Each server trains on its own mini-batch; after computing the gradient parameters (grad), it sends them back to the root server, which accumulates them (reduce), updates the weight parameters, and then broadcasts the new weights to every card (broadcast).
This is the earliest distributed communication scheme, and it is still a common choice for small-scale training on a few cards, but serious problems appear as the scale grows (a minimal sketch of the update loop follows the list below):
- Every training iteration must wait for all cards to synchronize and complete the reduce before it can finish. With many cards in parallel, the straggler (barrel) effect becomes severe: a single slow card slows down the whole cluster and drags down compute efficiency.
- The reducer server carries a heavy load and becomes a bottleneck: every node must exchange data, gradients, and parameters with it. When the model or the data is large, the communication overhead is huge, and the volume of data arriving at the root node forms a bottleneck.
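As promised above, here is a minimal single-process sketch of the parameter-server update loop. The names `workers`, `local_gradient`, and the learning rate are hypothetical stand-ins, not a real framework API; the point is only the broadcast / compute / reduce / update cycle.

```python
import numpy as np

# Minimal single-process sketch of the parameter-server pattern.
np.random.seed(0)
num_workers = 4
weights = np.zeros(6)                     # parameters held by the root / PS

def local_gradient(worker_id, w):
    """Stand-in for a mini-batch forward/backward pass on one worker."""
    return np.random.randn(w.shape[0])

for step in range(3):
    # Broadcast: every worker starts from the same weights.
    worker_weights = [weights.copy() for _ in range(num_workers)]
    # Each worker computes a gradient on its own mini-batch.
    grads = [local_gradient(w, worker_weights[w]) for w in range(num_workers)]
    # Reduce: the PS accumulates (here: averages) the gradients ...
    mean_grad = np.mean(grads, axis=0)
    # ... updates the weights, and broadcasts them again next iteration.
    weights -= 0.1 * mean_grad

print(weights)
```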
Halving and doubling (HD) algorithm: servers communicate in pairs; at each step a server obtains all of its partner's data, and by repeating this pairing with different partners, every server eventually holds all the data.
This algorithm avoids the single-node bottleneck, and every node uses both its send and receive bandwidth. It is currently a common approach for large-scale communication, but it has its own problem: the final steps transfer a large amount of data, which slows things down.
If the number of servers is not a power of two, say 13 servers as shown in the figure below, the 5 extra servers exchange all of their data in one direction before and after the main phase, while the remaining servers communicate according to the power-of-two HD schedule. For details, see Rabenseifner's paper Optimization of Collective Reduction Operations. In practice, however, the largest block containing all of the parameter data produced by the HD computation is sent directly to the extra servers, so this step accounts for a huge share of the total communication time.
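For the power-of-two case, the pairing pattern can be sketched as a recursive-doubling simulation: in round k, rank r exchanges with rank r XOR 2^k and combines the data, so after log2(P) rounds every rank holds the full reduction. This is a simplified sketch of the idea under the assumption of a sum reduction; a production HD implementation also halves message sizes per step, which is not shown here.

```python
import numpy as np

# Sketch of the recursive-doubling pattern for a power-of-two rank count,
# single-process simulation with sum as the reduction.
world_size = 8                                   # must be a power of two here
rank_data = [np.full(4, float(r)) for r in range(world_size)]

step = 1
while step < world_size:
    # In each round, rank r pairs with rank r XOR step; both combine data.
    new_data = []
    for r in range(world_size):
        partner = r ^ step
        new_data.append(rank_data[r] + rank_data[partner])
    rank_data = new_data
    step *= 2

print(rank_data[0], rank_data[7])  # every rank holds the same reduced result
```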
Ring algorithm: the cards are connected in a ring; each card has a left neighbour and a right neighbour, one of which it receives from while it sends to the other. Gradients are accumulated by circulating chunks around the ring, and parameters are then synchronized by circulating them again. The algorithm has two phases: scatter-reduce and all-gather.
A more detailed illustration can be found in the references.
The Ring algorithm is very attractive at medium scale: the amount of data transferred per link is small, there is no single bottleneck, and the bandwidth is fully utilized.
The disadvantage is that in large-scale clusters, with huge amounts of data per server and a very long ring, this way of splitting the data into per-ring chunks is no longer advantageous.
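To tie the two phases together, here is a single-process sketch of ring all-reduce with sum as the reduction: a scatter-reduce pass leaves each rank with one fully reduced chunk, and an all-gather pass circulates those chunks until every rank has the whole result. The chunk schedule and variable names are illustrative, not a specific library's implementation.

```python
import numpy as np

# Single-process sketch of ring all-reduce (scatter-reduce + all-gather, sum).
world_size = 4
# Each rank starts with a buffer of 8 values, split into world_size chunks.
rank_chunks = [np.split(np.arange(8.0) + r, world_size) for r in range(world_size)]

# Phase 1: scatter-reduce. In step s, rank r sends chunk (r - s) to its right
# neighbour, which adds it to its own copy. All sends in a step happen
# "simultaneously", so we snapshot the outgoing chunks first.
for s in range(world_size - 1):
    outgoing = [(r, (r - s) % world_size, rank_chunks[r][(r - s) % world_size].copy())
                for r in range(world_size)]
    for r, idx, chunk in outgoing:
        right = (r + 1) % world_size
        rank_chunks[right][idx] = rank_chunks[right][idx] + chunk
# Now rank r holds the fully reduced chunk (r + 1) % world_size.

# Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) to its
# right neighbour, which simply overwrites its copy.
for s in range(world_size - 1):
    outgoing = [(r, (r + 1 - s) % world_size, rank_chunks[r][(r + 1 - s) % world_size].copy())
                for r in range(world_size)]
    for r, idx, chunk in outgoing:
        right = (r + 1) % world_size
        rank_chunks[right][idx] = chunk

print(rank_chunks[0])  # every rank now holds the element-wise sum of all buffers
```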
References:
- http://research.baidu.com/bringing-hpc-techniques-deep-learning/
- https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
- https://zhuanlan.zhihu.com/p/79030485
- Rabenseifner R. (2004) Optimization of Collective Reduction Operations. In: Bubak M., van Albada G.D., Sloot P.M.A., Dongarra J. (eds) Computational Science - ICCS 2004. ICCS 2004. Lecture Notes in Computer Science, vol 3036. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24685-5_1