Joe Chau on Challenges and Opportunities in Large-Scale Training on AMD Instinct MI300X GPUs

Published on 30 Jun 2025, 14:00
Joe Chau is Vice President of Engineering at Microsoft, where he leads the Azure HPC and AI infrastructure team. At Advancing AI 2025, he shared valuable insights into optimizing AI training using AMD GPUs and accelerators. This talk covers the practical challenges and solutions encountered during the development of an AI agent for ASHO AI and HPC production usage.

In this session, you'll learn about:
The impressive results achieved with a 17-billion-parameter model trained on AMD Instinct™ MI300X accelerators in just 9 days, processing 4.8 trillion tokens.
Key strategies for reducing GPU checkpoint restore times by up to 73%, ensuring efficient use of resources.
The importance of developing custom benchmarks to validate model performance and reliability.
Practical tips for maintaining server cluster health and ensuring reliable training processes.

Whether you're an AI developer or researcher, this talk offers actionable insights and real-world experience to help you optimize your AI training workflows.

Find the resources you need to develop using AMD products: amd.com/en/developer.html

***

© 2025 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.