I’m using the Live API (gemini-3.1-flash-live-preview) for a speech-to-speech audio session and I’m trying to understand how ContextWindowCompressionConfig / SlidingWindow actually behaves.
Setup: two sessions running the same scripted ~25-turn audio conversation (identical audio input), with exactly one parameter changed between them:
- trigger_tokens = 25000 (held constant)
- Session A: SlidingWindow(target_tokens=512)
- Session B: SlidingWindow(target_tokens=8000)
So target_tokens differs 16x between the two runs.
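For reference, here is a minimal sketch of how the two configs differ, written as plain dicts. The field names mirror the ones above; the `live_config` helper and the `response_modalities` entry are assumptions added for illustration, and this is not my exact production code.

```python
def live_config(target_tokens: int, trigger_tokens: int = 25000) -> dict:
    """Build a config dict for one run; only target_tokens varies between runs.

    Assumption: the SDK accepts this snake_case dict shape for the Live
    connect config. The helper itself is hypothetical.
    """
    return {
        "response_modalities": ["AUDIO"],  # assumed; speech-to-speech session
        "context_window_compression": {
            "trigger_tokens": trigger_tokens,
            "sliding_window": {"target_tokens": target_tokens},
        },
    }


config_a = live_config(512)   # Session A
config_b = live_config(8000)  # Session B
```

Everything except `sliding_window.target_tokens` is identical between `config_a` and `config_b`.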
Observed (reading usage_metadata.prompt_token_count per turn):
- Compression clearly fires in both runs — there are visible post-trigger drops in the prompt token count.
- But the post-compression “landing” sizes differ by only ~10-20% between A and B — nowhere near 16x.
- In both sessions the post-compression size keeps escalating turn over turn, and the window ends at ~71,000 prompt tokens by turn 25.
- Cumulative billed input tokens were nearly the same — in fact the target_tokens=512 run was ~9% HIGHER, not lower.
So a 16x reduction in target_tokens produced essentially no reduction in realized window size or cost; if anything, cost went up slightly.
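In case it helps anyone reproduce this, the sketch below shows roughly how the "landing" sizes can be read off a per-turn series of usage_metadata.prompt_token_count values: any turn whose count is lower than the previous turn's is treated as a compression event. The function name and the sample series are made up for illustration; the numbers are not my measurements.

```python
def compression_landings(prompt_tokens: list[int]) -> list[tuple[int, int, int]]:
    """Return (turn, before, after) for each turn where the prompt token count dropped."""
    landings = []
    for i in range(1, len(prompt_tokens)):
        before, after = prompt_tokens[i - 1], prompt_tokens[i]
        if after < before:
            landings.append((i + 1, before, after))  # turns are 1-indexed
    return landings


# Hypothetical per-turn prompt_token_count values, for illustration only.
series = [4000, 9000, 15000, 22000, 27000, 12000, 19000, 26000, 16000]
landings = compression_landings(series)
print(landings)  # [(6, 27000, 12000), (9, 26000, 16000)]
```

Comparing the `after` values between Session A and Session B is what gives the ~10-20% difference mentioned above.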
Questions:
1. For audio / S2S sessions, is target_tokens expected to influence the post-compression window size at all, or is it effectively a soft hint?
2. Is there a documented incompressible floor — e.g. system instruction + tools + the most recent un-discardable user turn(s)? If audio turns are large, is a small target_tokens simply unreachable?
3. Is SlidingWindow discard quantized to whole turns (and aligned to a user-turn boundary)? That would explain why a small target cannot be reached.
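To make questions 2 and 3 concrete, here is a toy model of the behavior I'm guessing at: a fixed, non-discardable prefix (system instruction + tools), discard in whole turns from the oldest end, and at least the most recent turn always kept. This is purely my hypothesis, not documented behavior, and all sizes are invented for illustration.

```python
def slide(turns: list[int], fixed: int, target: int, keep_last: int = 1) -> int:
    """Drop the oldest whole turns until fixed + remaining <= target.

    Never drops the last `keep_last` turns. Returns the resulting window
    size in tokens. Hypothetical model, not the actual Live API algorithm.
    """
    kept = list(turns)
    while len(kept) > keep_last and fixed + sum(kept) > target:
        kept.pop(0)  # discard the oldest whole turn
    return fixed + sum(kept)


# Invented sizes: 2,000 fixed tokens and eight audio turns of 3,000 tokens each.
turns = [3000] * 8
size_a = slide(turns, fixed=2000, target=512)   # target below the floor
size_b = slide(turns, fixed=2000, target=8000)
print(size_a, size_b)  # 5000 8000
```

Under this model the window can never shrink below the fixed prefix plus the last turn, so a 16x difference in target_tokens yields only a 1.6x difference in landing size. If something like this is what actually happens, it would account for the small gap between A and B, though not for the turn-over-turn escalation.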
I’ve verified the config is constructed and sent correctly (the SlidingWindow.target_tokens field is populated). I’m trying to understand the intended behavior — not just whether this is a bug — so I can decide whether target_tokens is a usable knob for controlling cost in long audio sessions.
Pointers to docs or clarification from the team would be very helpful.
Thanks!