Building a Real-Time Inference Stack on AMD Instinct GPUs

Published on 14 May 2026, 16:53
Speakers
Gaël Delalleau, Founder and CEO, Kog
Augustin Verneuil, GPU engineer, Kog

Talk Abstract: In this talk, we share our vision for real-time generative AI and the techniques we developed to achieve the fastest LLM inference on a GPU to date, with a generation speed of 2,500 tokens/s per request. We first showcase our end-to-end stack, optimized for minimal latency on AMD hardware, which spans model re-architecting, a monokernel implementation, and topology-aware algorithms. In the second part, we focus on one of the defining challenges of megakernels: intra-GPU grid synchronization barriers and reduce/gather primitives. Using a chiplet-aware approach grounded in deep hardware insight, we decrease this overhead from 1.5 µs to 600 ns.
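To put the abstract's figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the quoted 1.5 µs and 600 ns are the cost of one grid-wide synchronization. The number of synchronizations per generated token (`ASSUMED_BARRIERS_PER_TOKEN`) is a hypothetical value chosen for illustration, not a number from the talk.

```python
# Figures quoted in the abstract.
TOKENS_PER_SECOND = 2500
BARRIER_BEFORE_NS = 1500  # 1.5 µs
BARRIER_AFTER_NS = 600    # 600 ns

# Hypothetical value for illustration only; not stated in the talk.
ASSUMED_BARRIERS_PER_TOKEN = 100


def token_budget_ns(tokens_per_second):
    """Time available to generate one token, in nanoseconds."""
    return 1_000_000_000 / tokens_per_second


def barrier_share(barrier_ns, barriers_per_token, budget_ns):
    """Fraction of the per-token budget spent in synchronization."""
    return barrier_ns * barriers_per_token / budget_ns


budget = token_budget_ns(TOKENS_PER_SECOND)
share_before = barrier_share(BARRIER_BEFORE_NS, ASSUMED_BARRIERS_PER_TOKEN, budget)
share_after = barrier_share(BARRIER_AFTER_NS, ASSUMED_BARRIERS_PER_TOKEN, budget)

print(f"per-token budget: {budget / 1000:.0f} µs")
print(f"sync share before: {share_before:.1%}, after: {share_after:.1%}")
```

At 2,500 tokens/s each token has a 400 µs budget, so under the assumed barrier count even sub-microsecond savings per synchronization add up to a sizable share of the per-token time.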

Find the resources you need to develop using AMD products: amd.com/en/developer.html

Join the Developer Community: devcommunity.amd.com

Join the Developer Discord server: discord.gg/amd-dev

***

© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.