ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed

Microsoft Research

Published April 13, 2021, 16:50

The latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of cost, time, and the difficulty of code integration. With the goal of advancing large model training by improving scale, speed, cost, and usability for model developers across the world, Microsoft made the DeepSpeed library open source in February 2020.

In this webinar, the DeepSpeed team will discuss what DeepSpeed is, how to use it with your existing PyTorch models, and advancements in the ZeRO optimizer that are central to training models with 100–200 billion parameters and beyond. In addition, the team will present deep-dive results on how they obtained the world record for fastest BERT training.

DeepSpeed can efficiently train models with 100–200 billion parameters up to 10 times faster than the state of the art by using a memory optimization system called ZeRO (Zero Redundancy Optimizer). ZeRO is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), one of the largest publicly known language models at 17 billion parameters.
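
To make the memory savings concrete, here is a rough sketch of the per-GPU model-state accounting. The byte counts are an assumption taken from the ZeRO publication in the resource list (mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), not from this description. The function name and the stage numbering (1 = partition optimizer states, 2 = also gradients, 3 = also parameters) are illustrative only. Activations and temporary buffers are ignored.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate bytes of model state held by each GPU under data parallelism.

    Assumed accounting (mixed-precision Adam, per parameter):
      2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes optimizer state.
    stage 0: nothing partitioned (plain data parallelism)
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and weights partitioned
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage == 0:
        per_param = weights + grads + optim
    elif stage == 1:
        per_param = weights + grads + optim / num_gpus
    elif stage == 2:
        per_param = weights + (grads + optim) / num_gpus
    elif stage == 3:
        per_param = (weights + grads + optim) / num_gpus
    else:
        raise ValueError("stage must be 0, 1, 2, or 3")
    return per_param * num_params


if __name__ == "__main__":
    # Example: a 17-billion-parameter model (the size quoted for Turing-NLG)
    # on a hypothetical 64-GPU data-parallel job.
    for stage in range(4):
        gb = model_state_bytes_per_gpu(17e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB of model state per GPU")
```

Under these assumptions, the unpartitioned footprint grows with model size regardless of how many GPUs are added, while the partitioned terms shrink as the GPU count rises. That is the redundancy ZeRO is named for removing.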

DeepSpeed recently set the record for fastest BERT training: 44 minutes on 1024 NVIDIA V100 GPUs. This is a 34% improvement over the best published result, and it comes from improved software efficiency rather than from excessive hardware resources. DeepSpeed can attain a staggering 64 teraflops of single-GPU performance on an NVIDIA V100 GPU, which is over 50% of the hardware peak.
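
The two figures above can be sanity-checked with simple arithmetic. This sketch rests on two assumptions that are not stated in the description: that the V100 hardware peak is NVIDIA's quoted 125 teraflops for mixed-precision Tensor Core math, and that "34% improvement" means a 34% reduction in training time. The function names are illustrative.

```python
# Assumption: NVIDIA's quoted mixed-precision Tensor Core peak for the V100.
V100_TENSOR_PEAK_TFLOPS = 125.0


def fraction_of_peak(achieved_tflops, peak_tflops=V100_TENSOR_PEAK_TFLOPS):
    """Return achieved throughput as a fraction of the hardware peak."""
    return achieved_tflops / peak_tflops


def implied_baseline_minutes(new_minutes, time_reduction):
    """Return the earlier training time implied by a fractional time reduction.

    Assumes new_minutes = baseline * (1 - time_reduction).
    """
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    return new_minutes / (1.0 - time_reduction)


if __name__ == "__main__":
    print(f"fraction of peak: {fraction_of_peak(64.0):.1%}")
    print(f"implied earlier result: {implied_baseline_minutes(44.0, 0.34):.1f} min")
```

If "34% improvement" instead refers to throughput rather than time, the implied earlier result would differ, so treat the second number as an estimate under the stated reading.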

Together, you will explore:

■ DeepSpeed features, optimizations for speed and scale, and a roadmap for the future
■ How to use DeepSpeed to train your own model and other popular models like BERT and GPT-2
■ A deep dive into technology behind the ZeRO optimizer and upcoming features
■ How we achieved the world record for BERT training using this technology
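
As a taste of the "train your own model" topic in the list above, here is a minimal sketch of wiring an existing PyTorch model into DeepSpeed. It is an illustration under assumptions, not the presenters' code: `make_ds_config` and `train` are hypothetical helper names, the hyperparameter values are placeholders, the model's forward pass is assumed to return a scalar loss, and the `config=` keyword of `deepspeed.initialize` is assumed to be available (as in recent DeepSpeed releases). The DeepSpeed import is deferred so the config helper works without the library installed.

```python
def make_ds_config(micro_batch_size=8, grad_accum_steps=1, zero_stage=2, lr=1e-4):
    """Build a DeepSpeed-style config dict (placeholder values, hypothetical helper)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "optimizer": {"type": "Adam", "params": {"lr": lr}},
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }


def train(model, batches, ds_config):
    """Run one pass over `batches` with a DeepSpeed engine.

    Assumes `model(batch)` returns a scalar loss and that each batch is
    already on the right device and in a dtype compatible with fp16 training.
    """
    import deepspeed  # deferred: only needed when training actually runs

    # initialize() wraps the model in an engine that owns the optimizer,
    # mixed precision, and ZeRO partitioning described by ds_config.
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    for batch in batches:
        loss = engine(batch)   # forward pass through the wrapped model
        engine.backward(loss)  # replaces loss.backward()
        engine.step()          # replaces optimizer.step() and zero_grad()
    return engine
```

A script like this is normally started with the `deepspeed` launcher so that one process runs per GPU. The main change to an existing training loop is that the engine's `backward` and `step` calls take the place of the usual PyTorch optimizer calls.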

The DeepSpeed team is a group of system researchers and engineers who are enthusiastic about performance optimization of large-scale systems. Presenters in this webinar include Principal Research Manager Yuxiong He and researchers Samyam Rajbhandari, Jeff Rasley, and Tunji Ruwase.

Resource list:

■ DeepSpeed Website: deepspeed.ai
■ DeepSpeed Library (GitHub): github.com/microsoft/DeepSpeed
■ ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (publication): microsoft.com/en-us/research/p...
■ ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters (blog): microsoft.com/en-us/research/b...
■ ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale (blog): microsoft.com/en-us/research/b...
■ DeepSpeed Fastest BERT deep dive (blog): deepspeed.ai/news/2020/05/27/f...
■ Turing-NLG: A 17-billion-parameter language model by Microsoft (blog): microsoft.com/en-us/research/b...
■ AI at Scale (Project Page): microsoft.com/en-us/research/p...
■ ONNX runtime (GitHub): microsoft.github.io/onnxruntim...

*This on-demand webinar features a previously recorded Q&A session and open captioning.

This webinar originally aired on August 06, 2020

Explore more Microsoft Research webinars: aka.ms/msrwebinars
