Gemma 4 27B-A4B-it (MoE) on Vertex AI: vLLM Dependency Triangles, LoRA, and Vision Blockers

Hello everyone,

Over the past few weeks, I’ve been working on deploying Google’s Gemma 4 27B-A4B-it (Mixture-of-Experts) model on Vertex AI using vLLM. After extensive testing (30+ container/model versions), I successfully achieved a production-stable, base-model deployment in BF16 on a single NVIDIA A100 80GB GPU.

However, enabling advanced features like LoRA fine-tuning, NF4 quantization, and multimodal/vision capabilities is currently blocked by a few deep framework incompatibilities. I have cataloged 20 distinct failure modes during this process and am reaching out to the community and Google DeepMind/Vertex teams for guidance on the following critical blockers.

Deployment Environment Configuration

The following configuration achieves a production-stable, base-model deployment in BF16 (a hedged deployment sketch follows the list):

  • Model: google/gemma-4-27b-a4b-it (MoE, 27B total, 4B active)

  • Infrastructure: Vertex AI Prediction, 1× NVIDIA A100 80GB (a2-ultragpu-1g)

  • Base Image: pytorch-vllm-serve:gemma4 (Google Custom Build)

  • Internal Dependency Stack: vLLM 0.17.2rc1.dev133, Transformers 5.5.0.dev0, huggingface_hub 1.8.0

  • Active Precision: BF16

  • Context Limit Configured: 8,192 tokens (due to BF16 VRAM budget on 80GB)
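
For anyone trying to reproduce the working baseline: the sketch below shows roughly how a container like this can be registered and deployed with the google-cloud-aiplatform SDK. The project, region, image URI, routes, and serving args are placeholders mirroring the list above; the exact flags the custom image accepts may differ.

```python
# Hedged sketch, assuming the Vertex AI Python SDK (google-cloud-aiplatform) is used
# to register and deploy the custom vLLM container. Project, region, image URI,
# routes, and serving args below are placeholders mirroring the list above; the
# exact arguments the custom image accepts may differ.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="gemma-4-27b-a4b-it-vllm",
    # Placeholder path for the Google custom build of pytorch-vllm-serve:gemma4.
    serving_container_image_uri="us-docker.pkg.dev/my-gcp-project/serving/pytorch-vllm-serve:gemma4",
    serving_container_args=[
        "--model=google/gemma-4-27b-a4b-it",
        "--dtype=bfloat16",           # BF16 precision
        "--max-model-len=8192",       # context cap dictated by the 80 GB VRAM budget
        "--gpu-memory-utilization=0.95",
    ],
    serving_container_ports=[8080],
    # Predict/health routes depend on the serving image; these values are assumptions.
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
)

endpoint = model.deploy(
    machine_type="a2-ultragpu-1g",        # single A100 80GB host
    accelerator_type="NVIDIA_A100_80GB",
    accelerator_count=1,
)
```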


Documented Architecture Blockers

1. Unsolvable Dependency Triangle (PyPI vs. Base Image)

Deployment via standard PyPI packages is blocked by a mutually exclusive version constraint matrix.

  • vLLM Constraint: vLLM 0.19.0 (latest PyPI) requires transformers < 5.0 and huggingface_hub < 1.0.

  • Model Constraint: gemma4 model type registration only exists in transformers >= 5.5.0.

  • Transformers Constraint: transformers 5.5.0 requires huggingface_hub >= 1.5.0.

  • Import Failure Path: Installing transformers 5.6.0.dev0 alongside vLLM 0.19.0 results in ImportError: cannot import name 'is_offline_mode' from 'huggingface_hub'. This occurs because is_offline_mode moved to huggingface_hub.utils in 0.36+, breaking vLLM’s internal import chain (vllm → transformers_utils/config.py → transformers.hub). A shim sketch for this specific failure follows the list.
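
As a stopgap for that specific ImportError only (not the rest of the constraint matrix), the sketch below re-exports the helper at the package top level before vLLM is imported. It assumes, per the failure path above, that the symbol now lives in huggingface_hub.utils; treat it as a sketch, not a supported fix.

```python
# Hedged sketch of a pre-import shim (e.g. dropped into sitecustomize.py), assuming,
# per the failure path above, that is_offline_mode now lives in huggingface_hub.utils.
# It only papers over this ImportError; it does not resolve the other version pins.
import huggingface_hub

if not hasattr(huggingface_hub, "is_offline_mode"):
    try:
        # Re-export the relocated helper at the package top level for vLLM's old import path.
        from huggingface_hub.utils import is_offline_mode
        huggingface_hub.is_offline_mode = is_offline_mode
    except ImportError:
        # Minimal stand-in driven by the HF_HUB_OFFLINE environment variable.
        import os

        def is_offline_mode() -> bool:
            return os.environ.get("HF_HUB_OFFLINE", "0").lower() in ("1", "true", "yes")

        huggingface_hub.is_offline_mode = is_offline_mode

import vllm  # import vLLM only after the shim is in place  # noqa: E402
```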

2. LoRA / MoE Mixin Incompatibility

Dynamic LoRA adapter loading via a GCSFUSE mount fails at the model class level; a minimal repro sketch follows the list.

  • Trace: ValueError: Gemma4ForConditionalGeneration does not support LoRA yet.

  • Root Cause: vLLM 0.17.2rc1.dev133 lacks the complete LoRA mixin for Gemma4ForConditionalGeneration. The MoE architecture’s attention layers are not registered for LoRA patching.
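
For reference, this is roughly how the adapter load was exercised, shown via vLLM's offline LLM API rather than the serving flags. The model ID matches the configuration above; the adapter name and GCSFUSE path are placeholders. On this stack it raises the ValueError above before any adapter weights are read.

```python
# Hedged repro sketch (placeholder adapter name/path): loading a LoRA adapter from a
# GCSFUSE-mounted directory via vLLM's offline API. On vLLM 0.17.2rc1.dev133 this
# fails with "Gemma4ForConditionalGeneration does not support LoRA yet".
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="google/gemma-4-27b-a4b-it",
    dtype="bfloat16",
    max_model_len=8192,
    enable_lora=True,   # triggers the LoRA support check for the model class
)

lora = LoRARequest(
    lora_name="my-adapter",
    lora_int_id=1,
    lora_path="/gcs/my-bucket/adapters/gemma4-lora",  # GCSFUSE mount point (placeholder)
)

outputs = llm.generate(
    ["Summarize the Vertex AI deployment steps."],
    SamplingParams(max_tokens=128),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)
```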

3. NF4 Quantization / Expert Mapping Incompatibility

Attempting to quantize the 52 GB BF16 weights to NF4, in order to free VRAM for LoRA/KV cache, fails during bitsandbytes loading; a repro sketch follows the list.

  • Trace: AttributeError: MoE Model Gemma4ForConditionalGeneration does not support BitsAndBytes quantization yet. Ensure this model has 'get_expert_mapping' method.

  • Root Cause: vLLM’s bitsandbytes loader requires the get_expert_mapping() method for MoE models to manage expert-specific scaling. The Gemma4ForConditionalGeneration class does not implement this.
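
The in-process equivalent of the quantized load attempt looks like the sketch below (offline API, same model ID as above). On some vLLM builds load_format is inferred and can be omitted; on this stack the loader aborts during weight loading with the AttributeError quoted above.

```python
# Hedged sketch of the NF4 attempt (offline-API equivalent of the serving flags).
# With this stack the bitsandbytes loader aborts during weight loading with the
# "Ensure this model has 'get_expert_mapping' method" AttributeError quoted above.
from vllm import LLM

llm = LLM(
    model="google/gemma-4-27b-a4b-it",
    quantization="bitsandbytes",   # requests NF4 via the bitsandbytes loader
    load_format="bitsandbytes",    # may be inferred automatically on newer builds
    dtype="bfloat16",
    max_model_len=8192,
)
```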

4. Jinja2 Sandboxing & Vision Chat Template

The default Gemma 4 chat_template.jinja triggers an inference crash under vLLM’s execution environment; a standalone illustration of the mechanism follows the list.

  • Trace: TypeError: object() takes no arguments (in safe_apply_chat_template, hf.py:483).

  • Root Cause: The native template uses the Jinja2 namespace() object. vLLM 0.17.2rc1.dev133’s sandboxed Jinja2 environment maps namespace to Python’s base object(), which accepts zero arguments.

  • Current Mitigation: A simplified, text-only custom chat_template.jinja that avoids namespace(), at the cost of vision/multimodal capabilities.
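
To illustrate the mechanism outside of vLLM, the standalone snippet below maps namespace to Python's bare object inside a sandboxed Jinja2 environment and renders a template that calls namespace(...), reproducing the same TypeError. This is an illustration of the behavior described above, not vLLM's actual code.

```python
# Standalone illustration of the failure mechanism described above -- NOT vLLM's
# actual code. If a sandboxed Jinja2 environment resolves `namespace` to Python's
# bare `object`, any template calling `namespace(...)` with keyword arguments
# raises "TypeError: object() takes no arguments".
from jinja2.sandbox import ImmutableSandboxedEnvironment

env = ImmutableSandboxedEnvironment()
env.globals["namespace"] = object  # simulate the over-restrictive global mapping

template = env.from_string("{% set ns = namespace(found=false) %}{{ ns.found }}")

try:
    template.render()
except TypeError as exc:
    print(exc)  # -> object() takes no arguments
```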

Supplementary Data

Full forensic logs, container Dockerfiles, patching scripts, and dependency matrix proofs are documented in the repository: https://github.com/Manzela/gemma4-vllm-deployment

Any insights, patches, or updates from the Vertex AI team on future base image releases would be incredibly appreciated.

Thank you!

Hi @Manzela
Thank you for sharing these critical architectural blockers and clear root causes with us.
I have escalated all of these issues to our engineering team for further review.