Hi everyone,
I’m looking for the most cost-effective way to deploy MedGemma-27B on GCP. Traffic is irregular, so I want to avoid paying for always-on GPU nodes without incurring massive cold-start delays.
I’d like to scale to zero to save costs, but loading ~20 GB+ of weights into VRAM usually makes the first request unusable.
Are there specific architecture patterns—such as model streaming, specialized loaders, or tiered serving—that effectively reduce spin-up times for models of this size?
Hi @Jay_Patel1
Before I dive into specific architectures, could you share if you have tried any specific serving engines (like vLLM) or specific quantization formats so far? Knowing your current baseline will help me tailor the configuration better.
Also, what does irregular traffic look like for your use case? Are we talking a few requests per day that could come at any time, or more like predictable bursts with quiet periods in between?
Also, what’s your pain threshold for that first request? Is 30 seconds okay? 60 seconds? Or does it need to be near-instant?
A few approaches you can start with:
You can use Cloud Run with NVIDIA L4 GPUs and GCS FUSE to mount the bucket holding your weights as a volume. When combined with a serving engine like vLLM, the model can leverage memory mapping (mmap), so the application starts streaming weights into VRAM rather than waiting for a full disk download first.
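To make the mmap point concrete, here is a minimal, MedGemma-agnostic sketch of why memory mapping helps: the file "opens" instantly, and only the pages you actually touch are fetched from storage (or from the GCS FUSE mount). The dummy file and `open_weights_mmap` helper are illustrative, not part of any real serving stack.

```python
import mmap
import os
import tempfile

def open_weights_mmap(path: str) -> mmap.mmap:
    """Map a weights file read-only; no bytes are read until accessed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the whole file. Python's mmap keeps its own
        # handle, so closing the fd afterwards is safe.
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)

if __name__ == "__main__":
    # Stand-in for a multi-GB safetensors shard on a GCS FUSE mount.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * (16 * 1024 * 1024))  # 16 MiB dummy "weights"
        path = f.name

    weights = open_weights_mmap(path)
    # Only the pages sliced here are actually paged in from storage.
    first_kb = weights[:1024]
    print(len(first_kb))
    weights.close()
    os.unlink(path)
```

This is the same mechanism libraries like safetensors use under the hood, which is why pairing vLLM with a FUSE-mounted bucket avoids the "download everything, then load" serialization.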
Another approach is tiered serving, which you mentioned. Here you can deploy a small model (such as MedGemma 4B) that stays always on, and serve MedGemma-27B from a GPU node once it is warm. This buys you time, so the user doesn’t experience the cold start as a system failure.
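A rough sketch of that routing pattern, under assumptions of my own (`TieredRouter`, `small_model`, `large_model`, and `warmup` are hypothetical stand-ins, not real MedGemma or GCP APIs): the first request triggers the GPU warm-up in the background while the small model answers immediately, and later requests switch to the 27B model once it is ready.

```python
import threading

class TieredRouter:
    """Route to a small always-on model until the large model is warm."""

    def __init__(self, small_model, large_model, warmup):
        self.small = small_model      # fast, always-on fallback
        self.large = large_model      # 27B model on the GPU node
        self._warmup = warmup         # e.g. scale up node + load weights
        self._ready = threading.Event()
        self._warm_once = threading.Lock()

    def _warm(self):
        self._warmup()
        self._ready.set()

    def generate(self, prompt: str) -> str:
        if self._ready.is_set():
            return self.large(prompt)
        # Kick off warm-up exactly once; never block the user on it.
        if self._warm_once.acquire(blocking=False):
            threading.Thread(target=self._warm, daemon=True).start()
        # Serve a fast interim answer from the small model.
        return self.small(prompt)
```

In practice you would also want a timeout and health check around `warmup`, and possibly a banner in the response noting that a lighter model answered, but the core idea is just this two-tier fallback.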
Let me know if these approaches work for you.