Efficient LLM Inference with SGLang, Lianmin Zheng, xAI

Published December 18, 2024, 17:35
In this Advancing AI 2024 Luminary Developer Keynote, Dr. Lianmin Zheng introduces SGLang, a high-performance serving framework optimized for inference with LLMs and vision-language models.

SGLang’s core techniques include RadixAttention for improved KV cache reuse and jump-forward decoding for faster grammar-guided decoding. Additional optimizations, such as low-overhead CPU scheduling and torch native enhancements (e.g., torch.compile and torchao), further enhance efficiency. Benchmark results demonstrate that SGLang achieves superior performance compared to other state-of-the-art inference engines.
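To make the RadixAttention idea concrete, here is a minimal sketch (not SGLang's actual implementation) of the underlying mechanism: cached KV state is organized in a trie keyed by token prefixes, so a new request can reuse the KV cache entries of the longest prefix it shares with earlier requests instead of recomputing them.

```python
# Hypothetical sketch of prefix-based KV cache reuse, the idea
# behind RadixAttention. The class names are illustrative only.

class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a request's token sequence as cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                  # first request's tokens
reused = cache.match_prefix([1, 2, 3, 9])   # second request shares [1, 2, 3]
print(reused)  # 3 tokens' KV entries can be reused, not recomputed
```

In a real serving engine, each trie node would also reference GPU memory holding the KV tensors for that prefix, with an eviction policy for cold branches; the sketch shows only the prefix-matching logic.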

As an open-source project with broad adoption, SGLang is also deployed for production serving at xAI.

Speaker: Lianmin Zheng, xAI

Gain access to AMD developer tools and resources.
amd.com/en/developer.html#soft...

The information contained in this video represents the view of AMD or the third-party presenter as of the date presented. AMD and/or the third-party presenters have no obligation to update any forward-looking content in the above presentations. AMD is not responsible for the content of any third-party presentations and does not necessarily endorse the comments made therein. GD-84.


© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.