Gemini TTS:(3.1 flash preview) can I reuse voice/director context across many audio chunks?

Soro · May 14, 2026, 10:33am

Hi,

I am generating TTS for an audio segment of about 15 minutes using the Gemini API:


gemini-3.1-flash-tts-preview

streamGenerateContent

responseModalities: ["AUDIO"]

Because the full text is too long to send reliably as a single TTS request, I split it into many smaller chunks and concatenate the generated audio files.

I am following the prompt structure recommended by the GenMedia Voice Director skill example:

https://github.com/GoogleCloudPlatform/vertex-ai-creative-studio/blob/main/experiments/mcp-genmedia/skills/genmedia-voice-director/SKILL.md#example-full-prompt-structure

The structure is:


# AUDIO PROFILE: ...

## "..."

## THE SCENE: ...

...

### DIRECTOR'S NOTES

Style: ...

Pace: ...

Accent: ...

#### TRANSCRIPT

[performance tags]

... transcript chunk ...

In my use case, every TTS chunk needs the same global voice/director instructions:

AUDIO PROFILE
THE SCENE
DIRECTOR'S NOTES

Only the TRANSCRIPT changes from chunk to chunk.

Currently, I repeat the full prompt structure for every chunk, for example:


# AUDIO PROFILE: ...

## "..."

## THE SCENE: ...

...

### DIRECTOR'S NOTES

Style: ...

Pace: ...

Accent: ...

#### TRANSCRIPT

... chunk-specific transcript ...

This works, but it wastes tokens because the same voice/director context is sent again and again for every chunk.

My question:

Is there a supported way to reuse a server-side context for multiple Gemini TTS streamGenerateContent requests, so I can send the voice/director instructions only once and then send only the next TRANSCRIPT chunk for each following request?

More specifically:

Does explicit context caching / cached content work with gemini-3.1-flash-tts-preview and responseModalities: ["AUDIO"]?
Is there a recommended pattern for long-form TTS where many chunks share the same voice/director prompt?
If I use a chat/session style API, will it actually reduce billed/re-sent prompt tokens, or does the full conversation history still count for each TTS request?

I want to keep the prompt structure aligned with the Voice Director skill, but avoid repeating the same global instructions for every small TTS chunk if there is a better supported approach.

Topic		Replies	Views
How to Start a Chat with Gemini Without Resending the File Gemini API api , github	3	314	February 26, 2025
Gemini 2.5 Flash TTS streaming? Gemini API api , audio	12	1422	February 25, 2026
Gemini Live Caching Gemini API audio , context_caching	6	291	March 24, 2026
Gemini 3.1 Flash TTS SSE sometimes returns exactly 20s / 1,280,000 base64 chars and truncated audio Gemini API api , gemini-api , gemini , gemini-flash	0	111	May 14, 2026
Repeat video understanding context - best way to context cache? Gemini API api	2	67	February 2, 2026

Gemini TTS:(3.1 flash preview) can I reuse voice/director context across many audio chunks?

Related topics