Hi everyone,
We are working on the efficient deployment of AI models on various devices such as smartphones and smart TVs.
We observed an interesting phenomenon:
- a significant speed-up when moving from float32 models to int8 models on Cortex-A55 (25ms → 5ms inference time)
- a much smaller speed-up when moving from float32 models to int8 models on Cortex-A73 (23ms → 18ms inference time)
Do you have any idea what the reason for this might be? It looks as if XNNPACK (with quantized-model support) does not activate correctly on the A73 CPU.
Some technical details:
- we use TFLite with XNNPACK (latest TFLite version); we tested both uint8 and int8 models
- we use per-tensor quantization
- our convolutional architectures are based on FSMN (https://arxiv.org/pdf/1512.08301) and DeepFSMN (https://arxiv.org/pdf/1803.05030)
- we use the TFLite benchmark tool for our comparisons (Performance measurement | TensorFlow Lite) as well as our own compiled binaries (C++ environment); both benchmarks show similar results (a simplified sketch of how we attach the XNNPACK delegate in the C++ setup is below)
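
For reference, this is roughly how we attach the delegate in our C++ test binary (a simplified sketch, not our exact code; the model path and thread count are placeholders, and the explicit QS8/QU8 flags are our attempt to force quantized operator support on):

```cpp
// Sketch: load a quantized model, attach the XNNPACK delegate with QS8/QU8
// support requested, and check whether the delegate was actually applied.
#include <cstdio>
#include <memory>

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/optional_debug_tools.h"

int main() {
  // "model_int8.tflite" is a placeholder for the quantized model file.
  auto model = tflite::FlatBufferModel::BuildFromFile("model_int8.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // Explicitly request signed (int8) and unsigned (uint8) quantized operator
  // support, in case the default delegate options do not enable it.
  TfLiteXNNPackDelegateOptions opts = TfLiteXNNPackDelegateOptionsDefault();
  opts.num_threads = 4;  // placeholder thread count
  opts.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_QS8 | TFLITE_XNNPACK_DELEGATE_FLAG_QU8;

  TfLiteDelegate* xnnpack = TfLiteXNNPackDelegateCreate(&opts);
  if (interpreter->ModifyGraphWithDelegate(xnnpack) != kTfLiteOk) {
    std::fprintf(stderr, "XNNPACK delegate was not applied\n");
  }
  interpreter->AllocateTensors();

  // Prints the node-to-delegate assignment, so we can see whether the
  // quantized conv / memory-block nodes actually run under XNNPACK.
  tflite::PrintInterpreterState(interpreter.get());

  // ... fill inputs, interpreter->Invoke(), time it, read outputs ...

  // The delegate must outlive the interpreter.
  interpreter.reset();
  TfLiteXNNPackDelegateDelete(xnnpack);
  return 0;
}
```

If the printed interpreter state shows the quantized nodes still assigned to the default CPU kernels on the A73 (while they are delegated on the A55), that would at least confirm where the difference comes from.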
Best regards,
Michal