Scalable advanced ML systems with Ray, Google Kubernetes Engine, and ML accelerators

Published July 1, 2024, 15:41
As machine learning (ML) systems continue to evolve, the ability to scale complex ML workloads has become crucial. Scalability can be considered along two dimensions: large-scale training of large language models (LLMs) and the intricate distribution of reinforcement learning (RL) systems. Each has its own challenges, from the computational demands of LLMs to the complex synchronization required in distributed RL.

This session explores the integration of Ray, Google Kubernetes Engine (GKE), and ML accelerators such as tensor processing units (TPUs) as a powerful combination for developing advanced ML systems at scale. We discuss Ray and its scalable APIs, as well as its mature integration with GKE and ML accelerators, and demonstrate how it has been used for LLMs and to re-implement the powerful RL algorithm MuZero.

Speakers:

Watch more:
All sessions from Google Cloud Next → goo.gle/next24

#GoogleCloudNext

OPS213
Event: Google Cloud Next 2024