Efficient LLM Inference with SGLang, Lianmin Zheng, xAI

Published December 18, 2024, 17:35
In this Advancing AI 2024 Luminary Developer Keynote, Dr. Lianmin Zheng introduces SGLang, a high-performance serving framework optimized for inference with LLMs and vision-language models.

SGLang’s core techniques include RadixAttention for improved KV cache reuse and jump-forward decoding for faster grammar-guided decoding. Additional optimizations, such as low-overhead CPU scheduling and torch native enhancements (e.g., torch.compile and torchao), further enhance efficiency. Benchmark results demonstrate that SGLang achieves superior performance compared to other state-of-the-art inference engines.
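To make the RadixAttention idea concrete, here is a minimal sketch (not SGLang's actual implementation) of the underlying mechanism: cached KV state is organized in a trie keyed by token prefixes, so a new request can reuse the KV cache entries of the longest prefix it shares with earlier requests instead of recomputing them.

```python
# Hypothetical sketch of prefix-based KV cache reuse, the idea
# behind RadixAttention. The class names are illustrative only.

class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a request's token sequence as cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                  # first request's tokens
reused = cache.match_prefix([1, 2, 3, 9])   # second request shares [1, 2, 3]
print(reused)  # 3 tokens' KV entries can be reused, not recomputed
```

In a real serving engine, each trie node would also reference GPU memory holding the KV tensors for that prefix, with an eviction policy for cold branches; the sketch shows only the prefix-matching logic.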

As an open-source project with broad adoption, SGLang is also deployed for production serving at xAI.

Speaker: Lianmin Zheng, xAI

Gain access to AMD developer tools and resources.
amd.com/en/developer.html#soft...

The information contained in this video represents the view of AMD or the third-party presenter as of the date presented. AMD and/or the third-party presenters have no obligation to update any forward-looking content in the above presentations. AMD is not responsible for the content of any third-party presentations and does not necessarily endorse the comments made therein. GD-84.


© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.