Gemini TTS (3.1 Flash preview): can I reuse voice/director context across many audio chunks?

Hi,

I am generating TTS for an audio segment of about 15 minutes using the Gemini API:


  • Model: gemini-3.1-flash-tts-preview

  • Method: streamGenerateContent

  • Config: responseModalities: ["AUDIO"]

Because the full text is too long to send reliably as a single TTS request, I split it into many smaller chunks and concatenate the generated audio files.
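
To make the chunking step concrete, here is a minimal sketch of one way to split the text, assuming sentence-boundary splitting with a character budget per chunk. The function name `split_into_chunks` and the 1500-character default are illustrative assumptions, not a documented Gemini limit.

```python
import re


def split_into_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    max_chars is an illustrative budget, not an official API limit.
    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each chunk's audio ending on a natural pause, which should make the joins less audible after concatenation.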

I am following the prompt structure recommended by the GenMedia Voice Director skill example:

https://github.com/GoogleCloudPlatform/vertex-ai-creative-studio/blob/main/experiments/mcp-genmedia/skills/genmedia-voice-director/SKILL.md#example-full-prompt-structure

The structure is:


# AUDIO PROFILE: ...

## "..."

## THE SCENE: ...

...

### DIRECTOR'S NOTES

Style: ...

Pace: ...

Accent: ...

#### TRANSCRIPT

[performance tags]

... transcript chunk ...

In my use case, every TTS chunk needs the same global voice/director instructions:

  • AUDIO PROFILE

  • THE SCENE

  • DIRECTOR'S NOTES

Only the TRANSCRIPT changes from chunk to chunk.

Currently, I repeat the full prompt structure for every chunk, for example:


# AUDIO PROFILE: ...

## "..."

## THE SCENE: ...

...

### DIRECTOR'S NOTES

Style: ...

Pace: ...

Accent: ...

#### TRANSCRIPT

... chunk-specific transcript ...

This works, but it wastes tokens because the same voice/director context is sent again and again for every chunk.
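
To put a rough number on that waste, here is a small sketch of the current pattern: one shared header joined to each transcript chunk, plus an estimate of how much of the input is repeated header. The names `build_prompt` and `header_overhead` are hypothetical, and character counts are only a proxy for tokens; real token counts depend on the model's tokenizer.

```python
def build_prompt(header: str, transcript_chunk: str) -> str:
    """Join the shared voice/director header with one chunk-specific transcript."""
    return f"{header.rstrip()}\n\n{transcript_chunk.strip()}\n"


def header_overhead(header: str, chunks: list[str]) -> tuple[int, float]:
    """Estimate the cost of repeating the header for every chunk.

    Returns (redundant_chars, header_share):
      redundant_chars: header characters sent beyond the first copy
      header_share:    fraction of all input characters that are header text
    Characters stand in for tokens here, which is an approximation.
    """
    n = len(chunks)
    if n == 0:
        return 0, 0.0
    header_chars = len(header) * n
    transcript_chars = sum(len(c) for c in chunks)
    redundant_chars = len(header) * (n - 1)
    return redundant_chars, header_chars / (header_chars + transcript_chars)
```

For example, a 2,000-character header sent with 40 chunks of 1,500 characters each gives `header_overhead` a share of about 0.57, so more than half of the input characters would be repeated instructions under these assumed sizes.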

My question:

Is there a supported way to reuse a server-side context for multiple Gemini TTS streamGenerateContent requests, so I can send the voice/director instructions only once and then send only the next TRANSCRIPT chunk for each following request?

More specifically:

  1. Does explicit context caching / cached content work with gemini-3.1-flash-tts-preview and responseModalities: ["AUDIO"]?

  2. Is there a recommended pattern for long-form TTS where many chunks share the same voice/director prompt?

  3. If I use a chat/session style API, will it actually reduce billed/re-sent prompt tokens, or does the full conversation history still count for each TTS request?

I want to keep the prompt structure aligned with the Voice Director skill, but avoid repeating the same global instructions for every small TTS chunk if there is a better supported approach.
