Hi,
I am generating TTS for an audio segment of about 15 minutes using the Gemini API:
- gemini-3.1-flash-tts-preview
- streamGenerateContent
- responseModalities: ["AUDIO"]
Because the full text is too long to send reliably as a single TTS request, I split it into many smaller chunks and concatenate the generated audio files.
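A minimal sketch of this kind of chunking, for illustration only. It assumes splitting on sentence boundaries and a character budget per chunk; the function name `split_into_chunks` and the `max_chars` default are hypothetical and not values from my actual pipeline.

```python
import re


def split_into_chunks(text: str, max_chars: int = 600) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk
    rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then become the TRANSCRIPT part of one TTS request.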
I am following the prompt structure recommended by the GenMedia Voice Director skill example. The structure is:
# AUDIO PROFILE: ...
## "..."
## THE SCENE: ...
...
### DIRECTOR'S NOTES
Style: ...
Pace: ...
Accent: ...
#### TRANSCRIPT
[performance tags]
... transcript chunk ...
In my use case, every TTS chunk needs the same global voice/director instructions:
- AUDIO PROFILE
- THE SCENE
- DIRECTOR'S NOTES
Only the TRANSCRIPT changes from chunk to chunk.
Currently, I repeat the full prompt structure for every chunk, for example:
# AUDIO PROFILE: ...
## "..."
## THE SCENE: ...
...
### DIRECTOR'S NOTES
Style: ...
Pace: ...
Accent: ...
#### TRANSCRIPT
... chunk-specific transcript ...
This works, but it wastes tokens because the same voice/director context is sent again and again for every chunk.
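The repetition can be sketched as follows. This is only an illustration: `SHARED_HEADER`, `build_prompt`, and `repeated_header_chars` are hypothetical names, the `...` placeholders are left unfilled, and the character count is a rough proxy for the overhead, not an actual token count.

```python
# Placeholder for the shared voice/director block. The "..." parts are elided
# in this post and are deliberately left unfilled here.
SHARED_HEADER = """# AUDIO PROFILE: ...
## "..."
## THE SCENE: ...
...
### DIRECTOR'S NOTES
Style: ...
Pace: ...
Accent: ...
"""


def build_prompt(shared_header: str, transcript_chunk: str) -> str:
    """Prepend the shared header to one chunk under a TRANSCRIPT heading."""
    return f"{shared_header.rstrip()}\n#### TRANSCRIPT\n{transcript_chunk.strip()}\n"


def repeated_header_chars(shared_header: str, num_chunks: int) -> int:
    """Header characters re-sent beyond the first request (rough proxy only)."""
    return len(shared_header) * max(num_chunks - 1, 0)
```

With N chunks, the shared header is sent N times, so N - 1 of those copies are the overhead I would like to avoid.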
My question:
Is there a supported way to reuse a server-side context for multiple Gemini TTS streamGenerateContent requests, so I can send the voice/director instructions only once and then send only the next TRANSCRIPT chunk for each following request?
More specifically:
- Does explicit context caching / cached content work with gemini-3.1-flash-tts-preview and responseModalities: ["AUDIO"]?
- Is there a recommended pattern for long-form TTS where many chunks share the same voice/director prompt?
- If I use a chat/session style API, will it actually reduce billed/re-sent prompt tokens, or does the full conversation history still count for each TTS request?
I want to keep the prompt structure aligned with the Voice Director skill, but avoid repeating the same global instructions for every small TTS chunk if there is a better supported approach.