Hi everyone,
I’m looking for the most cost-effective way to deploy MedGemma-27B on GCP. Traffic is irregular, so I want to avoid paying for always-on GPU nodes without incurring massive cold-start delays.
I’d like to scale to zero to save costs, but loading ~20 GB+ of weights into VRAM usually makes the first request unusable.
Are there specific architecture patterns—such as model streaming, specialized loaders, or tiered serving—that effectively reduce spin-up times for models of this size?
Hi @Jay_Patel1
Before I dive into specific architectures, could you share if you have tried any specific serving engines (like vLLM) or specific quantization formats so far? Knowing your current baseline will help me tailor the configuration better.
Also, what does irregular traffic look like for your use case? Are we talking a few requests per day that could come at any time, or more like predictable bursts with quiet periods in between?
Also, what’s your pain threshold for that first request? Is 30 seconds okay? 60 seconds? Or does it need to be near-instant?
A few approaches you can start with:
You can use Cloud Run with NVIDIA L4 GPUs and GCS FUSE to mount the bucket holding your weights as a volume. When combined with a serving engine like vLLM, the model can leverage memory mapping (mmap), so the application starts streaming weights into VRAM rather than waiting for a full disk download first.
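To make the mmap point concrete, here is a minimal, MedGemma-agnostic sketch of why memory mapping helps: the file "opens" instantly, and only the pages you actually touch are fetched from storage (or from the GCS FUSE mount). The dummy file and `open_weights_mmap` helper are illustrative, not part of any real serving stack.

```python
import mmap
import os
import tempfile

def open_weights_mmap(path: str) -> mmap.mmap:
    """Map a weights file read-only; no bytes are read until accessed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the whole file. Python's mmap keeps its own
        # handle, so closing the fd afterwards is safe.
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)

if __name__ == "__main__":
    # Stand-in for a multi-GB safetensors shard on a GCS FUSE mount.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * (16 * 1024 * 1024))  # 16 MiB dummy "weights"
        path = f.name

    weights = open_weights_mmap(path)
    # Only the pages sliced here are actually paged in from storage.
    first_kb = weights[:1024]
    print(len(first_kb))
    weights.close()
    os.unlink(path)
```

This is the same mechanism libraries like safetensors use under the hood, which is why pairing vLLM with a FUSE-mounted bucket avoids the "download everything, then load" serialization.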
Another approach is tiered serving, which you mentioned. Here you can deploy a small model (such as MedGemma 4B) that stays always on, and serve MedGemma-27B from a GPU node once it is warm. This buys you time, so the user doesn’t experience the cold start as a system failure.
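A rough sketch of that routing pattern, under assumptions of my own (`TieredRouter`, `small_model`, `large_model`, and `warmup` are hypothetical stand-ins, not real MedGemma or GCP APIs): the first request triggers the GPU warm-up in the background while the small model answers immediately, and later requests switch to the 27B model once it is ready.

```python
import threading

class TieredRouter:
    """Route to a small always-on model until the large model is warm."""

    def __init__(self, small_model, large_model, warmup):
        self.small = small_model      # fast, always-on fallback
        self.large = large_model      # 27B model on the GPU node
        self._warmup = warmup         # e.g. scale up node + load weights
        self._ready = threading.Event()
        self._warm_once = threading.Lock()

    def _warm(self):
        self._warmup()
        self._ready.set()

    def generate(self, prompt: str) -> str:
        if self._ready.is_set():
            return self.large(prompt)
        # Kick off warm-up exactly once; never block the user on it.
        if self._warm_once.acquire(blocking=False):
            threading.Thread(target=self._warm, daemon=True).start()
        # Serve a fast interim answer from the small model.
        return self.small(prompt)
```

In practice you would also want a timeout and health check around `warmup`, and possibly a banner in the response noting that a lighter model answered, but the core idea is just this two-tier fallback.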
Let me know if these approaches work for you.