In recent years, large transformer-based deep learning models trained on massive datasets have achieved state-of-the-art results across a range of cognitive tasks, and now power new products and features that augment human capabilities. These models have grown by orders of magnitude over the past five years, from the millions of parameters of the original transformer all the way up to the 530-billion-parameter Megatron-Turing NLG (MT-NLG 530B) model shown in the figure below. Customer demand for training and fine-tuning large models at unprecedented scale is growing accordingly.
The landscape of large models and hardware capabilities
Azure Machine Learning (AzureML) offers a host of state-of-the-art GPUs connected by InfiniBand interconnects for large-scale AI training. We have already trained Megatron-Turing and GPT-3 models on Azure. Previously, training these models required users to set up and maintain a complex distributed training infrastructure, often involving several error-prone manual steps that degraded both usability and performance.
Today, we are proud to announce a breakthrough in our software stack: using DeepSpeed and 1,024 NVIDIA A100 GPUs, we can scale training to 2-trillion-parameter models while delivering a streamlined user experience at 1K+ GPU scale. We are bringing these software innovations to you through AzureML, including a fully optimized PyTorch environment that offers excellent performance and an easy-to-use interface for training at scale.
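To give a rough sense of how streamlined job submission can look, here is a minimal sketch using the AzureML Python SDK v2. The workspace identifiers, compute cluster name, environment label, and training script are hypothetical placeholders, not values from this post.

```python
# Minimal sketch: submitting a distributed PyTorch/DeepSpeed training job
# with the AzureML Python SDK v2. All names below are illustrative only.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to a workspace (subscription/resource group/workspace are placeholders).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Define a distributed job: 128 nodes x 8 GPUs per node = 1,024 GPUs.
job = command(
    code="./src",                                # hypothetical training code folder
    command="python train.py --deepspeed ds_config.json",
    environment="acpt-pytorch@latest",           # hypothetical curated PyTorch environment
    compute="a100-cluster",                      # hypothetical InfiniBand A100 cluster
    instance_count=128,
    distribution={"type": "PyTorch", "process_count_per_instance": 8},
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)
```

The SDK handles launching one process per GPU across all nodes, so the training script only needs to be written against the usual PyTorch/DeepSpeed distributed conventions.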
As shown in the figure below, Microsoft takes a full-stack optimization approach in which the hardware, operating system, VM image, Docker image (including optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and user-facing AzureML APIs have all been optimized, integrated, and tested for excellent performance and scalability.
Microsoft's full-stack optimization for scalable distributed training on Azure
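To make the DeepSpeed layer of this stack more concrete, the sketch below shows how a PyTorch model is typically wrapped with DeepSpeed using a ZeRO stage 3 configuration. The batch size and optimizer settings are illustrative assumptions, not the configuration behind the results reported in this post.

```python
# Minimal sketch: wrapping a PyTorch model with DeepSpeed (ZeRO stage 3).
# Configuration values are illustrative assumptions only. Typically launched
# across GPUs via the DeepSpeed launcher, e.g.: deepspeed train.py
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for a large transformer

ds_config = {
    "train_batch_size": 2048,        # global batch size (assumed)
    "fp16": {"enabled": True},       # mixed-precision training
    "zero_optimization": {
        "stage": 3,                  # ZeRO-3: partition params, grads, optimizer state
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4},
    },
}

# deepspeed.initialize returns the wrapped engine plus optimizer/dataloader/scheduler.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One training step: the engine handles loss scaling, gradient partitioning,
# and the optimizer step across GPUs.
x = torch.randn(8, 1024).to(model_engine.device)
loss = model_engine(x).float().pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```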
This optimized stack enables us to efficiently scale the training of large models using DeepSpeed on Azure. Compared with results published by other cloud vendors, we support 2x larger model sizes (2 trillion vs. 1 trillion parameters), scale to 2x as many GPUs (1,024 vs. 512), and achieve up to 1.8x higher compute throughput per GPU (150 TFLOPS vs. 81 TFLOPS).
To learn more about the performance data and how Azure and DeepSpeed make it easy and efficient to train trillion-parameter models at scale, see the related resources at the end of the original blog post.