Best way to insert external SFX into Gemini TTS output without many separate TTS requests?

Hi everyone,

I’m using Gemini TTS with the google-genai Python SDK to generate narrated audio from longer text.

I want to insert short external sound effects at specific points inside the narration.

Example:

“The character opened the window. [[SFX: wind_soft]] Then the room became quiet again.”

My current workaround is:

1. Split the text at each SFX marker.

2. Send each part as a separate Gemini TTS request.

3. Load the external SFX file in Python.

4. Combine the generated audio chunks and SFX using pydub/ffmpeg.

This works, but when the input text is long and contains many SFX markers, it increases the number of TTS requests, quota usage, and latency. It may also hurt narration continuity, since each chunk is synthesized independently.
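
For reference, a minimal sketch of the splitting step (step 1), assuming markers always look like `[[SFX: name]]` as in the example above. The name `split_on_sfx` and the segment tuples are only illustrative, not part of any SDK:

```python
import re

# Assumed marker syntax, taken from the example sentence: [[SFX: name]]
SFX_PATTERN = re.compile(r"\[\[SFX:\s*([A-Za-z0-9_\-]+)\s*\]\]")


def split_on_sfx(text):
    """Return an ordered list of ("speech", text) and ("sfx", name) segments."""
    segments = []
    pos = 0
    for match in SFX_PATTERN.finditer(text):
        # Text between the previous marker (or the start) and this marker.
        speech = text[pos:match.start()].strip()
        if speech:
            segments.append(("speech", speech))
        segments.append(("sfx", match.group(1)))
        pos = match.end()
    # Whatever follows the last marker.
    tail = text[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments


example = (
    "The character opened the window. [[SFX: wind_soft]] "
    "Then the room became quiet again."
)
segments = split_on_sfx(example)
```

Each `"speech"` segment currently becomes one TTS request, which is where the request count grows with the number of markers.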

Is there a better recommended architecture for this use case?

Specific things I’m wondering about:

- Can Gemini TTS accept or generate markers that help place external SFX?

- Can Gemini TTS return timing/alignment metadata for the generated audio?

- Is there any supported way to use SSML-style marks or event markers?

- Is there a recommended way to mix external audio assets with Gemini TTS output?

- Should I generate one full narration and then use another tool for alignment?

- Or is splitting the text at each SFX marker currently the best approach?
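
To make the "one full narration" option concrete, here is a rough sketch of the splicing step only. It assumes the marker positions are already known in milliseconds (for example from an external forced aligner or a speech-to-text pass with word timestamps; I don't know whether Gemini TTS can provide them), and that both the narration and the SFX are raw 16-bit mono PCM at 24 kHz. The format constants and the names `ms_to_byte_offset` and `insert_sfx` are assumptions for illustration:

```python
SAMPLE_RATE = 24000  # assumption: 24 kHz output
SAMPLE_WIDTH = 2     # assumption: 16-bit samples
CHANNELS = 1         # assumption: mono


def ms_to_byte_offset(ms):
    """Convert a time in milliseconds to a frame-aligned byte offset."""
    frame_size = SAMPLE_WIDTH * CHANNELS
    frames = int(SAMPLE_RATE * ms / 1000)
    return frames * frame_size


def insert_sfx(narration_pcm, inserts):
    """Splice SFX clips into a single narration buffer.

    inserts: list of (position_ms, sfx_pcm) pairs, with positions measured
    on the original narration timeline. Each clip is inserted (the narration
    after it is pushed later), not mixed on top of the speech.
    """
    out = bytearray()
    pos = 0
    for position_ms, sfx_pcm in sorted(inserts, key=lambda item: item[0]):
        offset = ms_to_byte_offset(position_ms)
        # Clamp so the offset never moves backwards or past the end.
        offset = max(pos, min(offset, len(narration_pcm)))
        out += narration_pcm[pos:offset]
        out += sfx_pcm
        pos = offset
    out += narration_pcm[pos:]
    return bytes(out)


# Synthetic data: 1 second of narration, SFX inserted at 500 ms.
narration = b"\x01\x00" * SAMPLE_RATE
sfx = b"\x02\x00" * 10
result = insert_sfx(narration, [(500, sfx)])
```

With this approach the whole text would need only one TTS request, but the quality of the result depends entirely on how accurate the timestamps are.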

Simplified Python call:

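```python
from google import genai
from google.genai import types

# Reads the API key from the environment (e.g. GEMINI_API_KEY).
client = genai.Client()

# `prompt` holds the narration text for one chunk.
response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"
                )
            )
        ),
    ),
)
```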