I'm using the TI Linux SDK for the AM62A (4× Arm Cortex-A53) to run YOLO models (e.g. YOLOv9 tiny) with the TFLite interpreter. For float32 models, XNNPACK and NEON SIMD instructions are enabled and inference runs multi-threaded on all cores. When I use the int8 quantized model (quantized with the Ultralytics export mode), the TFLite interpreter falls back to its GEMM kernels (confirmed with perf) instead of XNNPACK, while still running multi-threaded. With GEMM and NEON SIMD, the int8 model takes almost twice as long as the float32 model, whereas I would normally expect quantized models to be faster than float32 models. Are there any ideas for improving int8 performance on this setup?
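For reference, this is roughly how I invoke the interpreter for both models (a minimal sketch; the model path, dummy input, and iteration count are placeholders, and I'm assuming the standard tflite_runtime Interpreter with num_threads=4):

import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="yolov9t_float32.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Dummy input matching the model's expected shape and dtype (float32 or int8).
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

# Warm-up run, then average a few timed invocations.
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
start = time.perf_counter()
for _ in range(20):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
print("mean inference time [s]:", (time.perf_counter() - start) / 20)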
Hi @Franks, Welcome to the TensorFlow Forum!
Could you please try this with the latest stable version of TensorFlow, and provide a sample code snippet showing how you're quantizing the model so we can better understand the issue? Thank you!
Hi @Divya_Sree_Kayyuri, thanks for your response! The latest TFLite Runtime version I can use is 2.12.0 when using the newest Linux image provided by TI for that SDK. The quantization is done as follows:
from ultralytics import YOLO
model = YOLO("yolov9t.pt")
model.export(format="tflite", int8=True, data="coco8.yaml")
This snippet exports five TFLite models, of which I used yolov9_full_integer_quant.tflite.
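For completeness, this is roughly how I load and run the full-integer model (a minimal sketch; the input is a placeholder, and the real pre- and post-processing is omitted):

import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="yolov9_full_integer_quant.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize a normalized float image into the model's int8 input range.
scale, zero_point = inp["quantization"]
image = np.zeros(inp["shape"], dtype=np.float32)  # placeholder for a preprocessed frame
interpreter.set_tensor(inp["index"], (image / scale + zero_point).astype(inp["dtype"]))
interpreter.invoke()

# Dequantize the raw int8 output back to float for post-processing.
out_scale, out_zero_point = out["quantization"]
raw = interpreter.get_tensor(out["index"]).astype(np.float32)
detections = (raw - out_zero_point) * out_scale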