Best way to insert external SFX into Gemini TTS output without many separate TTS requests?

Soro · May 7, 2026, 10:16am

Hi everyone,

I’m using Gemini TTS with the google-genai Python SDK to generate narrated audio from longer text.

I want to insert short external sound effects at specific points inside the narration.

Example:

“The character opened the window. [[SFX: wind_soft]] Then the room became quiet again.”

My current workaround is:

1. Split the text at each SFX marker.

2. Send each part as a separate Gemini TTS request.

3. Load the external SFX file in Python.

4. Combine the generated audio chunks and SFX using pydub/ffmpeg.
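For reference, step 1 can be sketched roughly as below. This is a minimal sketch, not my exact code: the helper name `split_at_sfx` and the marker pattern (letters, digits, underscores, and hyphens inside `[[SFX: ...]]`) are assumptions made for illustration.

```python
import re

# Assumed marker syntax: [[SFX: some_name]]; the allowed name characters are a guess.
SFX_MARKER = re.compile(r"\[\[SFX:\s*([A-Za-z0-9_\-]+)\s*\]\]")


def split_at_sfx(text):
    """Return an ordered list of ("text", chunk) and ("sfx", name) segments."""
    segments = []
    pos = 0
    for match in SFX_MARKER.finditer(text):
        # Narration between the previous marker (or the start) and this marker.
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("sfx", match.group(1)))
        pos = match.end()
    # Narration after the last marker.
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = (
    "The character opened the window. [[SFX: wind_soft]] "
    "Then the room became quiet again."
)
segments = split_at_sfx(example)
# Every "text" segment becomes one TTS request in the current workaround.
tts_request_count = sum(1 for kind, _ in segments if kind == "text")
```

Each `"text"` segment is sent as its own TTS request, so the request count grows by roughly one for every marker in the input.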

This works, but when the input text is long and contains many SFX markers, it increases the number of TTS requests, quota usage, and latency, and it may reduce narration continuity.

Is there a better recommended architecture for this use case?

Possible solutions I’m wondering about:

- Can Gemini TTS accept or generate markers that help place external SFX?

- Can Gemini TTS return timing/alignment metadata for the generated audio?

- Is there any supported way to use SSML-style marks or event markers?

- Is there a recommended way to mix external audio assets with Gemini TTS output?

- Should I generate one full narration and then use another tool for alignment?

- Or is splitting the text at each SFX marker currently the best approach?
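To make the single-narration option concrete: if I had a timestamp for each marker, the splice itself would be simple. The sketch below assumes the narration and the SFX are both raw 16-bit mono PCM at 24 kHz, and that the timestamps come from some external alignment step that I do not have yet. The names `insert_sfx` and `insert_all` are hypothetical helpers, not part of any SDK.

```python
SAMPLE_RATE = 24000  # assumption: 24 kHz output
SAMPLE_WIDTH = 2     # assumption: 16-bit mono samples (2 bytes each)


def insert_sfx(narration: bytes, sfx: bytes, at_seconds: float) -> bytes:
    """Insert raw PCM `sfx` into raw PCM `narration` at the given time."""
    # Convert seconds to a byte offset that lands on a sample boundary.
    offset = int(at_seconds * SAMPLE_RATE) * SAMPLE_WIDTH
    # Clamp so a timestamp past the end simply appends the SFX.
    offset = max(0, min(offset, len(narration)))
    return narration[:offset] + sfx + narration[offset:]


def insert_all(narration: bytes, placements) -> bytes:
    """Apply several (seconds, sfx_bytes) placements to one narration."""
    # Work from the latest timestamp to the earliest so that earlier
    # offsets are not shifted by audio inserted after them.
    for at_seconds, sfx in sorted(placements, key=lambda p: p[0], reverse=True):
        narration = insert_sfx(narration, sfx, at_seconds)
    return narration


# One second of placeholder narration and two tiny placeholder effects.
narration = b"\x01\x00" * SAMPLE_RATE
wind = b"\x02\x00" * 10
door = b"\x03\x00" * 10
mixed = insert_all(narration, [(0.25, wind), (0.75, door)])
```

This inserts the effect between two narration samples rather than overlaying it, so it only fits if the marker falls in a natural pause. The missing piece is still where the timestamps would come from.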

Simplified Python call:

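```python
from google import genai
from google.genai import types

# The client reads the API key from the environment (e.g. GEMINI_API_KEY).
client = genai.Client()

# Placeholder narration; the real input is longer text.
prompt = "The character opened the window. Then the room became quiet again."

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"
                )
            )
        ),
    ),
)
```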
