Unable to enforce strictly JSON-parsable output from gemini-2.5-flash-native-audio-preview in Live API session

Hi everyone, I’m integrating the Gemini Live API using the gemini-2.5-flash-native-audio-preview-12-2025 model for a real-time audio interaction use case that requires every model response to be strictly JSON-parsable. I need to stay on the Live API because of its real-time streaming performance and continuous audio interaction flow.

Even with very explicit instructions in the system prompt (e.g., “Output must be strictly JSON, first character {, last }”), the model keeps emitting responses that either contain no JSON at all or wrap the expected structured data in extra natural language, making automatic parsing impossible.

To explain the need for strict JSON output in a concrete but simple way:

Suppose the assistant is helping multiple UI components decide what to show next based on the latest model response.
If the model output were exactly:

{ "nextAction": "showHint", "message": "Choose a drink" }

then the client can simply do:

const { nextAction, message } = JSON.parse(responseText);
handle(nextAction, message);

But if the response contains any extra text (even a single line) outside of the JSON object, JSON.parse() throws a SyntaxError and the pipeline breaks.
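
To make the failure mode concrete, here is a minimal sketch of how the client could guard the parse today. `parseStrict` is a hypothetical helper of my own, not part of any Gemini SDK, and the `noisy` string is only an illustration of the kind of extra text I mean:

```javascript
// Hypothetical helper (not part of any Gemini SDK): returns the parsed
// object, or null when the text is not a single valid JSON object.
function parseStrict(responseText) {
  try {
    const parsed = JSON.parse(responseText);
    // Reject valid JSON that is not a plain object (e.g. a string or array).
    if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
      return null;
    }
    return parsed;
  } catch (err) {
    // JSON.parse throws a SyntaxError when any non-JSON text surrounds the object.
    return null;
  }
}

const clean = '{ "nextAction": "showHint", "message": "Choose a drink" }';
// Illustrative example of a response with natural language before the JSON.
const noisy = "Sure! Here is the JSON:\n" + clean;

console.log(parseStrict(clean)); // parsed object with nextAction and message
console.log(parseStrict(noisy)); // null
```

This only detects the problem; it does not solve it, since a null result still means the UI has nothing to act on for that turn.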

The Gemini API docs describe Structured Outputs and JSON Schema support for the REST API, and I’m aware that Function Calling can also help on that path. However, I haven’t found a way to enforce this deterministically within a Live API (WebSocket) session with native audio models: the Live API stream still returns mixed or unstructured text.

Has anyone successfully forced strictly JSON-parsable output from gemini-2.5-flash-native-audio-preview in a Live API session, without requiring post-processing to extract the JSON? If so, what configuration or prompting pattern works in practice?

Thanks!