Veo 3 API - generate_audio parameter not supported & last_frame limitations

We are using the Veo 3 API (veo-3.1-generate-001) via Vertex AI and have encountered two issues:

  1. generate_audio parameter not supported

When setting generate_audio=False in GenerateVideosConfig, we receive the error:

generate_audio parameter is not supported in Gemini API

We need to generate video without audio. Is this parameter currently unsupported for Veo 3, and is there a known workaround?

  2. last_frame with short duration

We want to use last_frame (providing an image for the last frame) with a 4-second video. However, it appears last_frame only works with 8-second videos. Is this an intentional limitation? We would like to generate shorter (4-second) videos with both a first frame and last frame specified.

Your use case is currently not supported. Please refer to Gemini API documentation for current model offering.

Our configuration:

  • Model: veo-3.1-fast-generate-001
  • Resolution: 720p / 1080p
  • Aspect ratio: 9:16 (portrait)
  • SDK: google-genai Python client via Vertex AI

Any clarification on whether these are intended limitations or if there are workarounds would be appreciated. Thank you.

I believe what you are experiencing is a feature: audio generation is built into Veo 3 because it's a multimodal engine. There is no off switch; I wish there was. I have had some success with the following. The result depends mostly on your prompt: if a scene implies some sort of sound, you're going to get it, and there is no way around it. Fast mode might be the best choice.

1. Refine Your Prompts

Veo 3 often “hallucinates” background sounds like studio laughter or generic music if the audio environment isn’t defined.

  • Use Negative Prompts: Explicitly state what to exclude by adding phrases like “(no background music)”, “(no dialogue)”, or “(no ambient sound)” at the end of your prompt.

  • Define a Minimal Soundscape: Instead of leaving the audio to chance, describe a very quiet environment, such as “near-silent room with only a faint clock ticking” or “soft ambient wind” to prevent the AI from adding louder, unwanted tracks.

  • Avoid Quote Marks: To prevent unintended dialogue generation, avoid using quotation marks in your prompt, as the model often interprets quoted text as speech to be rendered with lip-sync.
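
As a rough illustration of the three prompt tips above, here is a small sketch that cleans a prompt before it is sent. The helper name quiet_prompt, the "Audio:" prefix, and the exact suffix phrases are my own assumptions for illustration, not anything Veo requires, and none of this guarantees a silent result.

```python
# Hypothetical suffixes taken from the "Use Negative Prompts" tip above.
NEGATIVE_AUDIO_HINTS = (
    "(no background music)",
    "(no dialogue)",
    "(no ambient sound)",
)

# Straight and curly double quotes; apostrophes are left alone so that
# contractions such as "don't" survive.
_QUOTE_CHARS = "\"\u201c\u201d"


def quiet_prompt(prompt, soundscape=None):
    """Return a prompt with double quotes removed and audio hints appended.

    soundscape is an optional description of a minimal audio environment,
    e.g. "near-silent room with only a faint clock ticking".
    """
    # Drop double quotes so quoted text is less likely to be read as speech.
    cleaned = prompt.translate({ord(c): None for c in _QUOTE_CHARS})
    # Collapse any whitespace runs left behind by the removal.
    cleaned = " ".join(cleaned.split())
    parts = [cleaned.rstrip(".") + "."]
    if soundscape:
        parts.append(f"Audio: {soundscape}.")
    # The negative hints go at the very end of the prompt.
    parts.extend(NEGATIVE_AUDIO_HINTS)
    return " ".join(parts)


if __name__ == "__main__":
    print(quiet_prompt('A chef says "dinner is ready" in a kitchen'))
```

The returned string is what you would pass as the prompt; whether the model honors the hints still depends on the scene.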

2. Adjust Generation Settings

  • Use “Fast” Mode: Switching the generation mode from “Highest Quality” to “Fast” often generates videos without native audio tracks, as full audio generation is typically tied to the high-fidelity experimental modes.

  • Avoid “Text-to-Video” for Silence: Audio and dialogue are most supported in “Text-to-Video” mode. Using other modes like “First Frame to Video” or “Extend/Jump To” significantly reduces the likelihood of complex audio or speech being generated.
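
If the settings above still leave you with an audio track, one fallback is to drop the track after the file has been downloaded. This is plain post-processing, not a Veo feature, and it does not stop the model from generating audio in the first place. A minimal sketch, assuming ffmpeg is installed and on your PATH; the function names and file names are hypothetical:

```python
import subprocess


def strip_audio_cmd(src, dst):
    """Build an ffmpeg command that copies the video stream and drops audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file (the downloaded Veo output)
        "-c:v", "copy",  # copy the video stream without re-encoding
        "-an",           # write no audio stream to the output
        dst,
    ]


def strip_audio(src, dst):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(strip_audio_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command; call strip_audio(...) to actually run it.
    print(" ".join(strip_audio_cmd("veo_output.mp4", "veo_output_silent.mp4")))
```

Because the video stream is copied rather than re-encoded, this should be fast and should not change the picture quality.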

If you come up with a foolproof solution, I’d be very interested in knowing what that is.

Good luck,

Gary