Self host Gemma 4: Deploy LLMs on Cloud Run GPUs

10 491
9
Published on 18 Apr 2026, 15:47
GCP credit → goo.gle/handson-ep7-lab1
Lab → goo.gle/guardians

In this episode, we deploy Google's Gemma 4 model to Cloud Run two completely different ways, each with real trade-offs you need to understand before choosing one for production.

🔨 Ollama — model baked into the container. Instant cold starts. Rebuild to update.
⚡ vLLM — model mounted from Cloud Storage via FUSE. Slower first boot, but swap models without redeploying.

Both use Cloud Run GPUs, scale to zero, and ship through automated CI/CD with Cloud Build.

We build both. You decide which fits. 👇
📦 CI/CD with Cloud Build
🖥️ GPU accelerated serverless inference
🔄 Baked in vs. decoupled model architecture
🚀 Scale to zero
⚖️ Cold start speed vs. production agility

Chapters:
0:00 - Intro
6:08 - Getting started with Agentverse lab
7:57 - Laying the foundations of the citadel
16:07 - Forging the power core: Self hosted LLMs
28:02 - Forging the citadel's central core: Deploy vLLM
43:59 - Summary

More resources:
Cloud Run GPU documentation → goo.gle/4sEbTvG
Ollama documentation → goo.gle/3Qdi64w
vLLM documentation → goo.gle/4cvvxE9
Cloud Storage FUSE → goo.gle/4cQAb0V

Watch more Hands on AI → youtube.com/watch?v=qCBreTfjFH...
🔔 Subscribe to Google Cloud Tech → goo.gle/GoogleCloudTech

#Gemma4 #CloudRun

Speakers: Ayo Adedeji, Annie Wang
Products Mentioned: Agent Development Kit, Gemini API, Cloud Run
autotechmusickids