Compiling the Full Diffusion Pipeline: 4x Faster Image Generation on MI355X

Published on 14 May 2026, 16:53
Speaker: Chris Lattner, Co-founder & CEO, Modular
Talk Abstract: FLUX.2 is a 32-billion-parameter diffusion model split across four stages: DiT backbone, VAE decoder, text encoder, and scheduler. Most serving stacks run these as separate steps, with Python overhead between them. MLIR-based compilation collapses all four into a single fused execution graph, eliminating that overhead and generating hardware-optimized code for AMD Instinct MI355X. Modular's compiled pipeline runs 3.8x faster than torch.compile, generates 1024x1024 images in under 3.5 seconds, and ships in a container under 700 MB (90% smaller than a typical inference stack). By compiling the same pipeline for both AMD Instinct MI355X and Blackwell GPUs, we show that AMD silicon delivers equivalent generation performance at a 5.5x total cost advantage.

Find the resources you need to develop using AMD products: amd.com/en/developer.html

Join the Developer Community: devcommunity.amd.com

Join the Developer Discord server: discord.gg/amd-dev

***

© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.