Lianmin Zheng on Efficient LLM Inference with SGLang

Published June 30, 2025, 14:00
Join Lianmin Zheng, Member of Technical Staff at xAI and leader of the SGLang project, as he speaks at Advancing AI for a second year. This talk gives a high-level overview of SGLang, a fast inference engine for large language models and vision-language models, and its application in large-scale production environments using AMD GPUs. Learn about the latest advancements in prefill-decode disaggregation and expert parallelism, and how these techniques can significantly enhance inference performance and efficiency.
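
The idea behind prefill-decode disaggregation is that prompt processing (compute-bound, one pass over many tokens) and token generation (memory-bound, one token per step) have very different performance profiles, so they can run on separate worker pools. The toy sketch below is only an illustration of that split, not SGLang's actual scheduler; all class and function names here are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[str]                                   # prompt tokens
    max_new_tokens: int
    kv_cache: list[str] = field(default_factory=list)   # stand-in for attention KV entries
    output: list[str] = field(default_factory=list)

def prefill_worker(req: Request) -> Request:
    """Compute-bound stage: process the whole prompt once, filling the KV cache."""
    req.kv_cache = list(req.prompt)                     # one KV entry per prompt token
    return req

def decode_worker(reqs: list[Request]) -> list[Request]:
    """Memory-bound stage: every active request advances one token per step, batched."""
    active = list(reqs)
    while active:
        for req in list(active):
            tok = f"tok{len(req.output)}"               # placeholder for a sampled token
            req.output.append(tok)
            req.kv_cache.append(tok)                    # decode grows the cache one entry at a time
            if len(req.output) >= req.max_new_tokens:
                active.remove(req)
    return reqs

# Disaggregated pipeline: prefill and decode run on separate pools, so a long
# prompt never stalls the token-by-token decode batch. In a real deployment the
# KV cache is transferred from prefill GPUs to decode GPUs between the stages.
reqs = [Request(prompt=["a", "b", "c"], max_new_tokens=2),
        Request(prompt=["x"], max_new_tokens=3)]
reqs = [prefill_worker(r) for r in reqs]                # would run on prefill GPUs
reqs = decode_worker(reqs)                              # would run on decode GPUs
print([len(r.output) for r in reqs])                    # → [2, 3]
```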

Discover how SGLang supports major open-weight models like DeepSeek, Llama, and Qwen, and how it integrates with reinforcement learning workflows. Zheng shares real-world insights from xAI's collaboration with AMD, including the implementation of Day 0 support for DeepSeek V3/R1 and the first open-source implementation of large-scale expert parallelism.

Key takeaways include:
• Efficient design and implementation of prefill-decode disaggregation
• Strategies for large-scale expert parallelism
• Practical insights into deploying SGLang on over 100 GPUs
• Collaboration highlights with AMD for optimized performance
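
To make the expert-parallelism takeaway concrete: in a mixture-of-experts model, each token is routed to its top-k experts, and with the experts sharded across GPUs this routing becomes an all-to-all dispatch. The sketch below is a minimal stdlib-only illustration of that routing step, not SGLang's implementation; the expert counts, the round-robin placement, and every name in it are assumptions made for the example:

```python
import random

NUM_EXPERTS = 8
NUM_GPUS = 4
TOP_K = 2
# Experts sharded round-robin across GPUs (a simplifying assumption).
EXPERT_TO_GPU = {e: e % NUM_GPUS for e in range(NUM_EXPERTS)}

def route(gate_scores: list[float]) -> list[int]:
    """Pick the top-k experts for one token from its gating scores."""
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: gate_scores[e], reverse=True)
    return ranked[:TOP_K]

def dispatch(tokens: list[int]) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by the GPU hosting each expert.
    In a real engine this grouping feeds an all-to-all communication step."""
    per_gpu: dict[int, list[tuple[int, int]]] = {g: [] for g in range(NUM_GPUS)}
    rng = random.Random(0)                          # fixed seed: fake gating scores
    for tok in tokens:
        scores = [rng.random() for _ in range(NUM_EXPERTS)]
        for expert in route(scores):
            per_gpu[EXPERT_TO_GPU[expert]].append((tok, expert))
    return per_gpu

plan = dispatch(list(range(16)))
total = sum(len(pairs) for pairs in plan.values())
print(total)                                        # 16 tokens × top-2 experts = 32 assignments
```

The point of the grouping is load balance: with many more experts than GPUs, spreading experts (and therefore token traffic) evenly across devices is what makes large-scale expert parallelism pay off.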

Learn how to deploy DeepSeek-R1 with SGLang: rocm.docs.amd.com/projects/ai-...
Learn how DeepSeek-V3 is optimized on AMD Instinct™ accelerators: amd.com/en/developer/resources...

Find the resources you need to develop using AMD products: amd.com/en/developer.html

***

© 2025 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.