Live API: does ContextWindowCompression `target_tokens` affect the post-compression window for audio (S2S) sessions?

I’m using the Live API (gemini-3.1-flash-live-preview) for a speech-to-speech audio session and I’m trying to understand how ContextWindowCompressionConfig / SlidingWindow actually behaves.

Setup — two sessions, identical scripted ~25-turn audio conversation, same audio input, with exactly one parameter changed between them:

  • trigger_tokens = 25000 (held constant)

  • Session A: SlidingWindow(target_tokens=512)

  • Session B: SlidingWindow(target_tokens=8000)

So target_tokens differs by roughly 16x (8000 / 512 ≈ 15.6) between the two runs.
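For reference, here is a minimal sketch of the two configurations written as plain dicts. The key names mirror the fields named in this post (trigger_tokens, sliding_window, target_tokens); that the SDK accepts exactly this dict shape, and the `response_modalities` entry, are assumptions, and `build_live_config` and `flatten` are hypothetical helpers. The point is only to show that the two sessions differ in a single leaf value.

```python
def build_live_config(target_tokens: int, trigger_tokens: int = 25_000) -> dict:
    """Build a Live session config dict (shape is an assumption, see lead-in)."""
    return {
        "response_modalities": ["AUDIO"],  # assumed setting for an S2S session
        "context_window_compression": {
            "trigger_tokens": trigger_tokens,
            "sliding_window": {"target_tokens": target_tokens},
        },
    }


def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys so two configs can be diffed."""
    out = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out


config_a = build_live_config(target_tokens=512)   # Session A
config_b = build_live_config(target_tokens=8000)  # Session B

flat_a, flat_b = flatten(config_a), flatten(config_b)
diff = {k: (flat_a[k], flat_b[k]) for k in flat_a if flat_a[k] != flat_b[k]}
print(diff)
# {'context_window_compression.sliding_window.target_tokens': (512, 8000)}
```

Diffing the flattened configs before connecting is a cheap way to confirm that nothing else changed between runs.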

Observed (reading usage_metadata.prompt_token_count per turn):

  • Compression clearly fires in both runs — there are visible post-trigger drops in the prompt token count.

  • But the post-compression “landing” sizes differ by only ~10-20% between A and B — nowhere near 16x.

  • In both sessions the post-compression size keeps escalating turn over turn, and the window ends at ~71,000 prompt tokens by turn 25, nearly 3x the trigger_tokens value of 25000.

  • Cumulative billed input tokens were nearly the same — in fact the target_tokens=512 run was ~9% HIGHER, not lower.

So a ~16x reduction in target_tokens produced essentially no reduction in realized window size or cost; if anything, cost went up slightly.
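The comparison above can be reproduced from per-turn prompt_token_count values with something like the sketch below. The two series are made-up placeholders, not my measured data, and the function names are hypothetical. Treating the sum of per-turn prompt counts as cumulative input is also an assumption about how input is billed.

```python
from typing import List, Tuple


def find_compression_events(prompt_tokens: List[int]) -> List[Tuple[int, int, int]]:
    """Return (turn_index, before, after) wherever the count drops turn over turn."""
    events = []
    for i in range(1, len(prompt_tokens)):
        if prompt_tokens[i] < prompt_tokens[i - 1]:
            events.append((i, prompt_tokens[i - 1], prompt_tokens[i]))
    return events


def cumulative_input(prompt_tokens: List[int]) -> int:
    """Sum of per-turn prompt tokens (assumed proxy for billed input)."""
    return sum(prompt_tokens)


# Placeholder series for illustration only; NOT the real measurements.
series_a = [4000, 9000, 15000, 22000, 27000, 18000, 24000, 29000, 21000]
series_b = [4000, 9000, 15000, 22000, 27000, 20000, 26000, 30000, 23000]

events_a = find_compression_events(series_a)
events_b = find_compression_events(series_b)

# Compare the "landing" size after each paired compression event.
for (turn, _, land_a), (_, _, land_b) in zip(events_a, events_b):
    print(f"turn {turn}: A lands at {land_a}, B lands at {land_b}, "
          f"B/A = {land_b / land_a:.2f}")

print("cumulative A:", cumulative_input(series_a))
print("cumulative B:", cumulative_input(series_b))
```

If target_tokens were a hard target, the B/A landing ratio would be expected to sit far above 1 rather than close to it.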

Questions:

1. For audio / S2S sessions, is target_tokens expected to influence the post-compression window size at all, or is it effectively a soft hint?

2. Is there a documented incompressible floor — e.g. system instruction + tools + the most recent un-discardable user turn(s)? If audio turns are large, is a small target_tokens simply unreachable?

3. Is SlidingWindow discard quantized to whole turns (and aligned to a user-turn boundary)? That would explain why a small target cannot be reached.
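To make questions 2 and 3 concrete, here is a toy model of the behavior I am hypothesizing. It is not a description of how the API actually works. It assumes a fixed, never-discarded overhead (system instruction + tools), discards the oldest turns whole, never discards the most recent turn, and applies target_tokens to the turn history only. All numbers are hypothetical, and user-turn alignment is not modeled.

```python
from typing import List


def compress_whole_turns(
    fixed_tokens: int,
    turn_tokens: List[int],
    target_tokens: int,
    keep_last: int = 1,
) -> int:
    """Toy model: drop oldest turns whole until history fits target_tokens.

    The last `keep_last` turns are never dropped, so the result is bounded
    below by fixed_tokens + the size of those turns.
    """
    kept = list(turn_tokens)
    while len(kept) > keep_last and sum(kept) > target_tokens:
        kept.pop(0)  # discard the oldest whole turn
    return fixed_tokens + sum(kept)


# Hypothetical sizes: 3000 tokens of fixed overhead, ten 2500-token audio turns.
fixed = 3000
turns = [2500] * 10

landing_small = compress_whole_turns(fixed, turns, target_tokens=512)
landing_large = compress_whole_turns(fixed, turns, target_tokens=8000)

print(landing_small)  # 5500: floor of fixed overhead + one un-discardable turn
print(landing_large)  # 10500: fixed overhead + three whole turns
```

Under these assumptions a ~16x change in target_tokens moves the landing size by less than 2x, because the small target sits below the floor. If the real floor is larger (more retained turns, bigger audio turns), the two landing sizes would converge even further.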

I’ve verified the config is constructed and sent correctly (the SlidingWindow.target_tokens field is populated). I’m trying to understand the intended behavior — not just whether this is a bug — so I can decide whether target_tokens is a usable knob for controlling cost in long audio sessions.

Pointers to docs or clarification from the team would be very helpful.

Thanks!