
In recent years, large transformer-based deep learning models trained on massive amounts of data have achieved strong results on a range of cognitive tasks, and they power new products and features that further augment human capabilities. These models have grown in size by orders of magnitude over the past five years, from the millions of parameters of the original transformer all the way up to the latest 530-billion-parameter Megatron-Turing model (MT-NLG 530B, shown in the figure below), and customer demand for training and fine-tuning large models at unprecedented scale keeps getting stronger.
Figure: Panorama of large models and hardware capabilities

Azure Machine Learning (AzureML) offers a host of state-of-the-art GPUs connected by InfiniBand for large-scale AI training, and we have trained the Megatron-Turing and GPT-3 models on Azure. Previously, training these models required users to set up and maintain a complex distributed training infrastructure, which usually involved several error-prone manual steps and led to a poor experience in both usability and performance.

Today, we are proud to announce a breakthrough in our software stack: using DeepSpeed and 1,024 A100 GPUs, we scale training to 2-trillion-parameter models while delivering a streamlined user experience at 1K+ GPU scale. We are bringing these software innovations to you through AzureML, including a fully optimized PyTorch environment that provides great performance and an easy-to-use interface for training at scale.
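To make the DeepSpeed side of this concrete, below is a minimal sketch of wrapping a PyTorch model with DeepSpeed ZeRO stage 3; the model, batch sizes, and optimizer settings are illustrative assumptions, not the configuration used for the runs described in this post.

```python
# Minimal DeepSpeed ZeRO stage 3 sketch (illustrative values, not the actual run config).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # partition optimizer state, gradients, and parameters across GPUs
        "overlap_comm": True,  # overlap communication with computation
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4, "weight_decay": 0.01}},
}

# Stand-in for a large transformer model (a real run would use a far larger model).
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16), num_layers=24
)

# deepspeed.initialize wraps the model and builds the ZeRO-partitioned optimizer.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

def train_step(batch, labels):
    outputs = model_engine(batch)
    loss = torch.nn.functional.mse_loss(outputs, labels)
    model_engine.backward(loss)  # DeepSpeed handles gradient partitioning and reduction
    model_engine.step()          # optimizer step plus ZeRO bookkeeping
```

In practice such a script is launched with the deepspeed launcher (or torchrun) so that one process runs per GPU on every node.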

As shown in the figure below, Microsoft is taking a full-stack optimization approach in which the hardware, the operating system, the VM image, the Docker image (containing optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and the user-facing Azure ML APIs have all been optimized, integrated, and tested for excellent performance and scalability.
Figure: Microsoft's full-stack optimization for scalable distributed training on Azure
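As a rough illustration of the user-facing side, here is a sketch of submitting a multi-node PyTorch/DeepSpeed job through the Azure ML Python SDK (v2). The workspace details, compute cluster name, environment name, and train.py/ds_config.json are placeholders, not the exact setup behind the results in this post.

```python
# Sketch of a multi-node distributed training job submitted via the Azure ML Python SDK v2.
# All names in angle brackets are placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",  # folder containing train.py and ds_config.json
    command="python train.py --deepspeed_config ds_config.json",
    environment="<optimized-pytorch-deepspeed-environment>",  # image with PyTorch, DeepSpeed, ONNX Runtime
    compute="<a100-gpu-cluster>",  # InfiniBand-connected A100 cluster
    instance_count=128,            # number of nodes; 128 nodes x 8 GPUs = 1,024 GPUs
    distribution={"type": "PyTorch", "process_count_per_instance": 8},
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # monitor the run in Azure ML studio
```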

This optimized stack enables us to efficiently scale the training of large models using DeepSpeed on Azure. Compared to data published by other cloud vendors, we support 2x larger models (2 trillion vs. 1 trillion parameters), scale to 2x as many GPUs (1,024 vs. 512), and achieve up to 1.8x higher compute throughput per GPU (150 TFLOPS vs. 81 TFLOPS).
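As a back-of-the-envelope check on what these figures imply in aggregate, the sketch below uses the common estimate of roughly 6 FLOPs per parameter per training token; it is an illustrative calculation, not a measurement from these runs.

```python
# Back-of-the-envelope aggregate throughput estimate (illustrative, not measured).
num_gpus = 1024
flops_per_gpu = 150e12   # 150 TFLOPS achieved per GPU
params = 2e12            # 2-trillion-parameter model

aggregate_flops = num_gpus * flops_per_gpu              # ~1.5e17 FLOPS (~154 PFLOPS)
flops_per_token = 6 * params                            # ~6 FLOPs per parameter per token
tokens_per_second = aggregate_flops / flops_per_token   # ~1.3e4 tokens/s

print(f"Aggregate throughput: {aggregate_flops / 1e15:.1f} PFLOPS")
print(f"Estimated training throughput: {tokens_per_second:,.0f} tokens/s")
```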


If you want to learn more about the performance data and how Azure and DeepSpeed make it easy and efficient to train trillion-parameter models at scale, please see the original blog post, which lists plenty of related resources at the end.


