[BUG] Gemini 2.5 Flash Native Audio outputs control characters (`<ctrl##>`) instead of audio, causing silent responses

Summary

When using the gemini-2.5-flash-native-audio-preview-12-2025 model via the Live API, the model occasionally outputs control characters (e.g., <ctrl46><ctrl46>) to the transcript stream instead of generating audio. During these episodes, no audio is produced, leaving users in complete silence for 10-15+ seconds.

This is a critical production issue for voice agent deployments: during the silence, users get no indication that the system is still working, which leads to a poor user experience and abandoned sessions.


Affected Model(s)

  • gemini-2.5-flash-native-audio-preview-12-2025 (confirmed)

  • Potentially other native audio preview models in the same family


Environment

| Component | Value |
|-----------|-------|
| API | Gemini Live API via WebSocket |
| Model | gemini-2.5-flash-native-audio-preview-12-2025 |
| Response Modality | AUDIO only |
| Language | Romanian (but likely affects all languages) |
| Use Case | Production voice agent for appointment confirmations |
| SDK | google-genai Python SDK |


Expected Behavior

  1. User speaks to the voice agent

  2. Model processes the input

  3. Model generates audio response

  4. User hears the response immediately


Actual Behavior

  1. User speaks to the voice agent

  2. Model processes the input

  3. Model outputs control characters (<ctrl46><ctrl46>) to the transcript stream

  4. No audio is generated - the audio stream contains silence

  5. User hears nothing for 10-15+ seconds

  6. User asks “Can you hear me?” multiple times (confirming they heard nothing)

  7. Eventually, model may recover and produce normal audio


Evidence

Session Timeline (Anonymized)


Production Session - January 9, 2026

Model: gemini-2.5-flash-native-audio-preview-12-2025

32.7s User: "Nu, nu pot veni. As vrea sa reprogramez."

(Translation: "No, I can't come. I'd like to reschedule.")

33.7s Agent transcript output: "<ctrl46><ctrl46>"

Audio output: NONE (verified via audio recording)

[Internal tool call triggered - searching for appointment slots]

[~13 SECONDS OF COMPLETE SILENCE - NO AUDIO GENERATED]

45.1s User: "M-ati auzit?"

(Translation: "Can you hear me?" - confirms no audio was heard)

47.3s Agent: "Va rog sa asteptati putin..."

(Translation: "Please wait a moment..." - audio resumes normally)

Key Observations

  1. Transcript explicitly contains <ctrl46><ctrl46> - These characters appear in the output_transcription stream where normal text should be

  2. Audio recording confirms complete silence - The OGG recording of the session contains no audio during this roughly 13-second period

  3. User confirmation of silence - The user’s “Can you hear me?” at 45.1s indicates they received no audio

  4. Model eventually recovers - After the silence, normal audio generation resumes
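
Observation 2 was verified by listening to the recording. A programmatic check is also possible; the following is a minimal sketch, assuming the audio has already been decoded to 16-bit little-endian mono PCM (the OGG recording would need decoding first). The function names and the threshold of 50 are illustrative assumptions, not values taken from the session.

```python
import math
import struct


def rms_int16(pcm: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def is_silent(pcm: bytes, threshold: float = 50.0) -> bool:
    """True when the chunk's RMS is below the (hypothetical) threshold."""
    return rms_int16(pcm) < threshold


silence = bytes(4800)  # all-zero samples
tone = struct.pack("<4h", 1000, -1000, 1000, -1000)
print(is_silent(silence))  # True
print(is_silent(tone))     # False
```

An RMS check like this distinguishes "no audio frames were sent" from "frames were sent but contain silence", which the transcript alone cannot show.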


Control Character Details

The control characters observed follow the pattern <ctrl##> where ## is a number. Examples seen:

  • <ctrl46> (most common)

  • Multiple consecutive occurrences: <ctrl46><ctrl46>
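
The pattern above can be matched in transcript text with a simple regular expression. This is a minimal sketch; `find_ctrl_tokens` and `is_ctrl_only` are hypothetical helper names, and it assumes the tokens arrive as the literal text `<ctrl46>` rather than as raw control bytes.

```python
import re

# Matches literal tokens such as "<ctrl46>" as they appear in transcript text.
CTRL_TOKEN = re.compile(r"<ctrl(\d+)>")


def find_ctrl_tokens(text: str) -> list[int]:
    """Return the numeric IDs of every <ctrl##> token found in text."""
    return [int(n) for n in CTRL_TOKEN.findall(text)]


def is_ctrl_only(text: str) -> bool:
    """True when text consists solely of <ctrl##> tokens and whitespace."""
    return bool(find_ctrl_tokens(text)) and not CTRL_TOKEN.sub("", text).strip()


print(find_ctrl_tokens("<ctrl46><ctrl46>"))          # [46, 46]
print(is_ctrl_only("<ctrl46><ctrl46>"))              # True
print(is_ctrl_only("Va rog sa asteptati putin..."))  # False
```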

Questions:

  1. What do these control characters represent internally?

  2. Why are they leaking into the transcript output instead of being processed?

  3. Why does their presence correlate with audio generation failure?


Reproduction

Trigger Conditions (Observed)

This issue appears to occur:

  • After processing user speech that requires a substantive response

  • More frequently when tool/function calls are involved (but not exclusively)

  • Inconsistently - the same input may work sometimes and fail other times

Steps to Reproduce

  1. Set up a Gemini Live API session with gemini-2.5-flash-native-audio-preview-12-2025

  2. Configure for AUDIO-only response modality

  3. Engage in multi-turn conversation

  4. At some point (unpredictable), the model will output <ctrl##> instead of audio

  5. Observe silence in the audio stream

  6. Check transcript to see control characters

Note: Due to the inconsistent nature of the bug, reproduction may require multiple attempts.
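
Because the failure is intermittent, it may help to flag affected turns automatically instead of reviewing each session by hand. The sketch below is SDK-agnostic: `TurnRecord` is a hypothetical per-turn summary that a client's receive loop would fill in, and the classification assumes a failing turn delivers no audio payload at all. If the server sends silent frames instead, the byte count would need to be replaced by an amplitude check.

```python
import re
from dataclasses import dataclass

CTRL_TOKEN = re.compile(r"<ctrl\d+>")


@dataclass
class TurnRecord:
    """Hypothetical per-turn summary filled in by the client's receive loop."""
    transcript: str = ""  # concatenated output transcription text
    audio_bytes: int = 0  # total audio payload bytes received

    def add_transcript(self, text: str) -> None:
        self.transcript += text

    def add_audio(self, chunk: bytes) -> None:
        self.audio_bytes += len(chunk)


def classify_turn(turn: TurnRecord) -> str:
    """Label a finished turn by transcript content and audio volume."""
    has_ctrl = bool(CTRL_TOKEN.search(turn.transcript))
    if has_ctrl and turn.audio_bytes == 0:
        return "ctrl_no_audio"  # the failure mode described in this report
    if has_ctrl:
        return "ctrl_with_audio"
    if turn.audio_bytes == 0:
        return "silent"
    return "ok"


turn = TurnRecord()
turn.add_transcript("<ctrl46><ctrl46>")
print(classify_turn(turn))  # ctrl_no_audio
```

Counting `ctrl_no_audio` turns across many sessions would give a failure rate that can be compared between configurations.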


Impact

User Experience

  • Users hear nothing for 10-15+ seconds

  • Users assume the system is broken or didn’t hear them

  • Users repeatedly ask “Can you hear me?”

  • Sessions are abandoned due to perceived failure

Business Impact

  • Voice agents appear unreliable in production

  • Customer frustration and support burden

  • Cannot deploy native audio models with confidence

Workaround Attempts

  • No reliable workaround has been found

  • The issue appears to occur at the model level, before any application-layer processing
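
A client-side timer cannot restore the missing audio, but it could at least bound how long the user hears nothing. The following is a minimal sketch of such a silence watchdog using `asyncio`; `SilenceWatchdog`, the timeout value, and the `play_filler` callback are hypothetical and not part of the deployment described here.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SilenceWatchdog:
    """Calls on_silence if disarm() is not called within timeout_s of arm()."""

    def __init__(self, timeout_s: float, on_silence: Callable[[], Awaitable[None]]):
        self._timeout_s = timeout_s
        self._on_silence = on_silence
        self._task: Optional[asyncio.Task] = None

    def arm(self) -> None:
        """Start (or restart) the countdown, e.g. when the user stops speaking."""
        self.disarm()
        self._task = asyncio.create_task(self._run())

    def disarm(self) -> None:
        """Cancel the countdown, e.g. when the first audio chunk arrives."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _run(self) -> None:
        await asyncio.sleep(self._timeout_s)
        await self._on_silence()


async def demo() -> list[str]:
    events: list[str] = []

    async def play_filler() -> None:
        events.append("filler")  # stand-in for playing a "please wait" clip

    dog = SilenceWatchdog(timeout_s=0.05, on_silence=play_filler)
    dog.arm()                 # user finished speaking, no audio follows
    await asyncio.sleep(0.1)  # watchdog fires once
    dog.arm()                 # next turn
    dog.disarm()              # audio arrived promptly
    await asyncio.sleep(0.1)  # watchdog stays quiet
    return events


print(asyncio.run(demo()))  # ['filler']
```

In a real session the timeout would be on the order of seconds, and the callback could also log the turn for later analysis.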


Related GitHub Issues

This appears related to other reported audio generation issues with native audio models:

  1. google-gemini/live-api-web-console#117 - Audio cutoff mid-speech

  2. googleapis/python-genai#1725 - Audio generation inconsistency (Closed - Not Planned)

  3. google-gemini/cookbook#977 - LiveAPI stops talking in a second

  4. googleapis/js-genai#707 - Responses cut off with turnComplete

The control character output may be a specific manifestation of the broader audio generation failure pattern described in these issues.


Technical Analysis

Hypothesis

Based on the evidence, it appears that:

  1. The model’s internal audio generation pipeline sometimes fails

  2. When audio generation fails, the model outputs control characters as a fallback or error state

  3. These control characters leak into the transcript stream

  4. The audio stream remains empty/silent

  5. Eventually, the model’s internal state recovers and normal operation resumes

Why This Differs from Audio Cutoff

This is distinct from the “audio cutoff mid-sentence” issue:

  • Audio cutoff: Audio starts playing, then stops early

  • This issue: No audio is ever generated - complete silence from the start


Requests

1. Priority Escalation

This issue is currently tracked as P2 (Moderately-important). Given the production impact on voice agent deployments, we respectfully request escalation to P1.

Justification:

  • Affects production systems with real users

  • No workaround available

  • Causes significant user experience degradation

  • Undermines confidence in native audio models

2. Timeline and Acknowledgment

We request:

  • Confirmation that this issue is being tracked

  • An estimated timeline for a fix or patch

  • Any interim workarounds while a fix is developed

3. Technical Clarification

We would appreciate understanding:

  • What the <ctrl##> characters represent

  • Why audio generation fails when these characters appear

  • Whether there’s a way to detect and handle this state


Additional Information

Session Configuration


from google.genai import types as genai_types

# Model configuration used
model = "gemini-2.5-flash-native-audio-preview-12-2025"

# Response modality
response_modalities = [genai_types.Modality.AUDIO]

# VAD configuration
automatic_activity_detection = genai_types.AutomaticActivityDetection(
    start_of_speech_sensitivity=genai_types.StartSensitivity.START_SENSITIVITY_HIGH,
    end_of_speech_sensitivity=genai_types.EndSensitivity.END_SENSITIVITY_LOW,
    prefix_padding_ms=20,
    silence_duration_ms=300,
)
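
For anyone trying to reproduce this, the values above might be combined into a single connection config along the following lines. This is a sketch under assumptions: the report does not show how the config object is assembled, and `output_audio_transcription` is included only because the `<ctrl##>` tokens were observed in the output transcription stream.

```python
from google.genai import types

# Assumption: the reported values are passed via LiveConnectConfig like this;
# the original setup code for this part is not shown in the report.
config = types.LiveConnectConfig(
    response_modalities=[types.Modality.AUDIO],
    # Enables transcription of the model's audio output.
    output_audio_transcription=types.AudioTranscriptionConfig(),
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(
            start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
            end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
            prefix_padding_ms=20,
            silence_duration_ms=300,
        )
    ),
)
```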

We Can Provide

If helpful for debugging:

  • Full anonymized session logs

  • Audio recordings showing the silence

  • Transcript files with control characters

  • Additional test sessions


Summary

The gemini-2.5-flash-native-audio-preview-12-2025 model occasionally outputs <ctrl##> control characters instead of generating audio, causing complete silence for 10-15+ seconds. This is a critical issue for production voice agents with no available workaround.

We request P1 prioritization, acknowledgment of the issue, and an estimated fix timeline.


Report prepared: January 12, 2026

Observed in production: January 9, 2026


Hi @bunny1, welcome to the community!

I tried multiple times to reproduce the issue, but each time the transcript and audio generation were fine. That is not surprising, since you mentioned the issue only happens occasionally.
Can you please send some additional logs and any specific, relevant configuration that we can use to try and reproduce this issue with the model?

Thanks!

Occasionally, I experience the same issue with the Danish language.