Hi, I’m running an INT8 quantized encoder-decoder seq2seq model (fused single graph) through LiteRT with a hardware accelerator delegate. The model loads and runs without errors but generates repeated tokens regardless of input.
The same model in FP32 works correctly. Disabling the delegate and falling back to CPU gives the same broken output, so the issue does not appear to be delegate-specific; it seems to lie in the quantized graph itself or in how the LiteRT runtime executes it.
The decoder appears to stop using encoder context entirely from the first inference step onward.
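To make "stops using encoder context" checkable, here is a minimal sketch of the kind of test I have in mind. It assumes a hypothetical `run_model` callable that wraps the interpreter and returns the generated token ids for one input; that wrapper, and both helper names, are illustrative and not part of LiteRT.

```python
from typing import Callable, List, Sequence


def output_ignores_input(
    run_model: Callable[[Sequence[int]], Sequence[int]],
    inputs: Sequence[Sequence[int]],
) -> bool:
    """Return True if every input produces the exact same output sequence.

    `run_model` is a hypothetical wrapper around the interpreter: it takes
    input token ids and returns generated token ids.
    """
    outputs = [tuple(run_model(x)) for x in inputs]
    # At least two inputs are needed before "identical" means anything.
    return len(inputs) > 1 and len(set(outputs)) == 1


def repeated_token_ratio(tokens: Sequence[int]) -> float:
    """Fraction of positions that repeat the immediately preceding token."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / (len(tokens) - 1)
```

If `output_ignores_input` is True for clearly different inputs on the INT8 model but False on the FP32 one, that would support the idea that the encoder signal is lost somewhere before or inside cross-attention.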
Has anyone seen this behavior with encoder-decoder models on LiteRT? Pointers to any runtime-side configuration (op resolver, delegate options, quantization parameters) that could affect cross-attention execution in INT8 would be very helpful.
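On the quantization-parameter side, one thing that might help narrow it down is ranking tensors by their quantization scale, since a very coarse scale on a cross-attention tensor could flatten the encoder signal. A rough sketch, assuming the input is a list of dicts shaped like the ones returned by the interpreter's `get_tensor_details()` (with `name` and `quantization_parameters["scales"]`); the `name_filter` substring is a hypothetical naming convention and depends on how the converter named the tensors.

```python
from typing import Any, Dict, List, Sequence, Tuple


def rank_by_scale(
    tensor_details: Sequence[Dict[str, Any]],
    name_filter: str = "",
) -> List[Tuple[str, float]]:
    """List quantized tensors as (name, largest scale), coarsest first.

    Tensors without any scales (i.e. not quantized) are skipped.
    `name_filter` is an optional substring to keep only matching names.
    """
    ranked = []
    for detail in tensor_details:
        params = detail.get("quantization_parameters", {})
        scales = params.get("scales", [])
        # len() works for both plain lists and NumPy arrays.
        if len(scales) == 0:
            continue
        if name_filter and name_filter not in detail["name"]:
            continue
        ranked.append((detail["name"], float(max(scales))))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

With a loaded interpreter this would be called as `rank_by_scale(interpreter.get_tensor_details(), name_filter="attention")`, where `"attention"` is only a guess at the tensor naming.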
Reproducible notebook available on request.