INT8 encoder-decoder seq2seq model inference on edge - Repeated tokens output regardless of input when running via LiteRT

AminLO · April 27, 2026, 12:54pm

Hi, I’m running an INT8 quantized encoder-decoder seq2seq model (fused single graph) through LiteRT with a hardware accelerator delegate. The model loads and runs without errors but generates repeated tokens regardless of input.

The same model in FP32 works correctly. Disabling the delegate and falling back to CPU gives the same broken output, so the issue does not appear to be delegate-specific. It seems to lie either in the quantized graph itself or in how the LiteRT runtime executes it.

The decoder appears to stop using encoder context entirely from the first inference step onward.
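One way to make that observation concrete is to feed two clearly different inputs and compare the first-step decoder logits: if they match exactly, the decoder output does not depend on the encoder input at all. The sketch below is only an illustration. `run_first_step` is a hypothetical callable that wraps whatever invokes the model and returns the first-step logits as a flat list of floats, and the tolerance is an assumption.

```python
from typing import Callable, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def ignores_input(
    run_first_step: Callable[[Sequence[int]], Sequence[float]],
    input_a: Sequence[int],
    input_b: Sequence[int],
    tol: float = 0.0,
) -> bool:
    """Return True if two different inputs yield (near-)identical logits.

    run_first_step is a hypothetical wrapper around the model call; it is
    not part of any LiteRT API.
    """
    logits_a = run_first_step(input_a)
    logits_b = run_first_step(input_b)
    return max_abs_diff(logits_a, logits_b) <= tol
```

Running the same check against the FP32 model should return `False` for distinct inputs, which gives a baseline to compare the INT8 result against.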

Has anyone seen this behavior with encoder-decoder models on LiteRT? Any runtime-side configuration (op resolver, delegate options, quantization parameters) that could affect cross-attention execution in INT8 would be very helpful.
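For anyone who wants to look at the runtime side, here is a rough sketch of two checks that might help narrow this down, not a confirmed fix. It assumes the TensorFlow Python package is installed and uses `tf.lite.Interpreter` with the reference op resolver (`BUILTIN_REF`) and `experimental_preserve_all_tensors`; newer LiteRT packages expose a similar interpreter under a different module. The `find_suspect_tensors` helper and its `max_scale` threshold are hypothetical and chosen only for illustration.

```python
def find_suspect_tensors(tensor_details, max_scale=1.0):
    """Flag tensors whose quantization scales look degenerate.

    tensor_details is expected to have the shape returned by
    Interpreter.get_tensor_details(). max_scale is an arbitrary,
    assumed threshold; tune it for the model at hand.
    """
    suspects = []
    for detail in tensor_details:
        params = detail.get("quantization_parameters") or {}
        scales = [float(s) for s in params.get("scales", [])]
        if not scales:
            continue  # tensor is not quantized
        # A zero scale, or a very coarse one, can flatten values such as
        # attention scores into a handful of INT8 buckets.
        if min(scales) <= 0.0 or max(scales) > max_scale:
            suspects.append((detail["name"], max(scales)))
    return suspects


def load_reference_interpreter(model_path):
    """Create a CPU interpreter that uses the reference kernels."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow is installed

    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        experimental_op_resolver_type=(
            tf.lite.experimental.OpResolverType.BUILTIN_REF
        ),
        # Keep intermediate tensors readable after invoke().
        experimental_preserve_all_tensors=True,
    )
    interpreter.allocate_tensors()
    return interpreter
```

A possible usage is `find_suspect_tensors(load_reference_interpreter(path).get_tensor_details())`, where `path` points to the INT8 model file. If the reference kernels reproduce the repeated tokens too, that would point more toward the quantization parameters than toward the optimized kernels.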

Reproducible notebook available on request.
