Gemma 4 e4b latency optimisations

Working on a banking assistant pipeline using Gemma 4 e4b model for ASR + normalization + intent extraction + entity extraction + QnA in a single flow. Current latency is around ~6s on an NVIDIA L4 and ~1.5s on an H200 for end-to-end inference.

Pipeline includes:

  • ASR

  • text normalization

  • fuzzy/phonetic name correction

  • single-pass intent + entity extraction

  • async FastAPI serving

I’m trying to reduce latency further, maybe get less than 500 ms.

Questions:

  1. What are the best optimizations for Gemma 4 inference in production?

  2. Would vLLM/TensorRT-LLM/Flash Attention significantly help for this workload?

  3. Any recommendations around batching, quantization, KV cache, or async pipeline improvements?

  4. Has anyone optimized small structured-output workloads like this on L4 specifically?

Would love suggestions from people deploying Gemma/Qwen/Llama models in real-time systems.

Hi @Yash_Suryavanshi
Thanks for sharing your detailed setup . I can share a few optimizations worth trying for your workflow.

  1. Switch to vLLM or SGLang for serving - vLLM is the more battle-tested path but SGLang has shown better structured-output latency on small models due to how it overlaps grammar mask generation with GPU inference. I would recoomend you to try both.
    For ref : When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse | Runpod Blog

  2. Try the new MTP speculative decoding drafters - Google released official MTP drafters for all Gemma 4 sizes . By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.
    For ref : Multi-token-prediction in Gemma 4

  3. Prefix caching + cap max_new_tokens-Your pipeline has a fixed system prompt on every call Try enabling --enable-prefix-caching in vLLM or if you are using SGLang it t does automatically in RadixAttention.

    Please try these steps and let us know if they help optimize your workflow.

    Thanks

Thank you for the detailed suggestions, really appreciate it.

The SGLang point is especially interesting since most of my workload is structured-output extraction. I’ll benchmark both vLLM and SGLang, and also try the Gemma 4 MTP drafters along with prefix caching and tighter max_new_tokens limits.

Will share the benchmark results once I test them. Thanks again!

Did you test all the 3 options mate? If possible share your experience with all the options you tried. Thank you.