
In recent years, large transformer-based deep learning models trained on massive amounts of data have achieved strong results on a range of cognitive tasks, and they power new products and features that further augment human capabilities. These models have grown in size by orders of magnitude over the past five years, from the millions of parameters of the original transformer all the way up to the latest 530-billion-parameter Megatron-Turing model (MT-NLG 530B, shown in the figure below), and customer demand for training and fine-tuning large models at unprecedented scale keeps getting stronger.
Figure: Panorama of large models and hardware capabilities

Azure Machine Learning (AzureML) offers a host of state-of-the-art GPUs connected by InfiniBand for large-scale AI training, and we have trained the Megatron-Turing and GPT-3 models on Azure. Previously, training these models required users to set up and maintain a complex distributed training infrastructure, which usually involved several error-prone manual steps and led to a poor experience in both usability and performance.

Today, we are proud to announce a breakthrough in our software stack: using DeepSpeed and 1,024 A100 GPUs, we scale training to 2-trillion-parameter models while delivering a streamlined user experience at 1K+ GPU scale. We are bringing these software innovations to you through AzureML, including a fully optimized PyTorch environment that provides great performance and an easy-to-use interface for training at scale.
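To make the DeepSpeed side of this concrete, below is a minimal sketch of wrapping a PyTorch model with DeepSpeed ZeRO stage 3; the model, batch sizes, and optimizer settings are illustrative assumptions, not the configuration used for the runs described in this post.

```python
# Minimal DeepSpeed ZeRO stage 3 sketch (illustrative values, not the actual run config).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # partition optimizer state, gradients, and parameters across GPUs
        "overlap_comm": True,  # overlap communication with computation
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4, "weight_decay": 0.01}},
}

# Stand-in for a large transformer model (a real run would use a far larger model).
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=24
)

# deepspeed.initialize wraps the model and builds the ZeRO-partitioned optimizer.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

def train_step(batch, labels):
    outputs = model_engine(batch)
    loss = torch.nn.functional.mse_loss(outputs, labels)
    model_engine.backward(loss)  # DeepSpeed handles gradient partitioning and reduction
    model_engine.step()          # optimizer step plus ZeRO bookkeeping
```

In practice such a script is launched with the deepspeed launcher (or torchrun) so that one process runs per GPU on every node.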

As shown in the figure below, Microsoft is taking a full-stack optimization approach in which the hardware, the operating system, the VM image, the Docker image (containing optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and the user-facing Azure ML APIs have all been optimized, integrated, and tested for excellent performance and scalability.
Figure: Microsoft's full-stack optimization for scalable distributed training on Azure
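As a rough illustration of the user-facing side, here is a sketch of submitting a multi-node PyTorch/DeepSpeed job through the Azure ML Python SDK (v2). The workspace details, compute cluster name, environment name, and train.py/ds_config.json are placeholders, not the exact setup behind the results in this post.

```python
# Sketch of a multi-node distributed training job submitted via the Azure ML Python SDK v2.
# All names in angle brackets are placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",  # folder containing train.py and ds_config.json
    command="python train.py --deepspeed_config ds_config.json",
    environment="<optimized-pytorch-deepspeed-environment>",  # image with PyTorch, DeepSpeed, ONNX Runtime
    compute="<a100-gpu-cluster>",  # InfiniBand-connected A100 cluster
    instance_count=128,            # number of nodes; 128 nodes x 8 GPUs = 1,024 GPUs
    distribution={"type": "PyTorch", "process_count_per_instance": 8},
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # monitor the run in Azure ML studio
```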

This optimized stack enables us to efficiently scale the training of large models using DeepSpeed on Azure. Compared to data published by other cloud vendors, we support 2x larger models (2 trillion vs. 1 trillion parameters), scale to 2x as many GPUs (1,024 vs. 512), and achieve up to 1.8x higher compute throughput per GPU (150 TFLOPS vs. 81 TFLOPS).
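As a back-of-the-envelope check on what these figures imply in aggregate, the sketch below uses the common estimate of roughly 6 FLOPs per parameter per training token; it is an illustrative calculation, not a measurement from these runs.

```python
# Back-of-the-envelope aggregate throughput estimate (illustrative, not measured).
num_gpus = 1024
flops_per_gpu = 150e12   # 150 TFLOPS achieved per GPU
params = 2e12            # 2-trillion-parameter model

aggregate_flops = num_gpus * flops_per_gpu              # ~1.5e17 FLOPS (~154 PFLOPS)
flops_per_token = 6 * params                            # ~6 FLOPs per parameter per token
tokens_per_second = aggregate_flops / flops_per_token   # ~1.3e4 tokens/s

print(f"Aggregate throughput: {aggregate_flops / 1e15:.1f} PFLOPS")
print(f"Estimated training throughput: {tokens_per_second:,.0f} tokens/s")
```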


If you want to learn more about the performance data and how Azure and DeepSpeed make it easy and efficient to train trillion-parameter models at scale, please see the original blog post, which lists plenty of related resources at the end.


