Microsoft Research
Published October 4, 2019, 17:07
Over recent years, machine learning techniques have achieved success in many real-world applications. As researchers and practitioners continue to expand machine learning into new application domains and push the boundaries of existing applications, they face critical computational challenges due to growing dataset sizes and increasing model complexity and capacity. These challenges demand new software systems that train large models efficiently and enable machine learning researchers to easily experiment with new ideas.
There exist many opportunities to improve training time and support training larger models by leveraging the structural properties of machine learning computation to design efficient training systems. In this talk, I will present two distributed training systems, Bösen and Orion, which schedule inter-machine network communication and parallel computation to improve training time by reducing data inconsistency in parameter state, without requiring heavy programmer effort. Moreover, by scheduling memory usage in TensorFlow, we reduce GPU memory consumption by 87% and enable training models with 10x more parameters on the same hardware.
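The abstract stays at a high level; as a rough illustration only (this is not Bösen or Orion code, and the class and method names below are invented for the sketch), the toy Python snippet shows the general bounded-staleness idea behind limiting data inconsistency in parameter state: a fast worker is blocked from running more than a fixed number of iterations ahead of the slowest worker.

```python
# Minimal sketch of bounded-staleness parameter updates (illustrative only,
# not the actual Bösen/Orion implementation). Workers apply gradients to a
# shared parameter vector asynchronously, but no worker may advance more than
# `staleness` iterations past the slowest worker, which bounds how stale the
# parameter state any worker reads can be.
import threading

class BoundedStalenessServer:
    def __init__(self, dim, num_workers, staleness):
        self.params = [0.0] * dim          # shared parameter vector
        self.clocks = [0] * num_workers    # per-worker iteration counters
        self.staleness = staleness
        self.cond = threading.Condition()

    def apply_update(self, worker_id, grads, lr=0.1):
        """Apply one worker's gradient, advance its clock, and block if it
        would get more than `staleness` iterations ahead of the slowest worker."""
        with self.cond:
            for i, g in enumerate(grads):
                self.params[i] -= lr * g
            self.clocks[worker_id] += 1
            # Fast workers wait here until slow workers catch up within the bound.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
            self.cond.notify_all()

    def read_params(self):
        with self.cond:
            return list(self.params)
```

In a real distributed system the parameter state is sharded across machines and the communication itself is scheduled over the network; the in-process lock here merely stands in for that coordination.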
See more at microsoft.com/en-us/research/v...