Recently I have been benchmarking the inference speed of a quantized and a non-quantized TFLite model, both converted from the same pre-trained TensorFlow model (a *.pb file).
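For context, the quantized model was produced with post-training integer quantization roughly along the lines of the sketch below; the file names, input/output tensor names, input shape, and the representative dataset are placeholders rather than my exact setup.

```python
import numpy as np
import tensorflow as tf

# Placeholder calibration data; the shape is illustrative, not my real input shape.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file="model.pb",        # placeholder path
    input_arrays=["input"],           # placeholder tensor names
    output_arrays=["output"])

# Non-quantized (float) TFLite model.
open("model_float.tflite", "wb").write(converter.convert())

# Post-training full-integer quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
open("model_int8.tflite", "wb").write(converter.convert())
```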
The thing is, when I compare the inference speed of these two on Android phones, the integer-quantized version is always slower than the non-quantized one. One thing I’ve learned is that the benchmark may need to run on ARM processors so the quantized kernels can use NEON-based optimizations (INT TFLITE very much slower than FLOAT TFLITE · Issue #21698 · tensorflow/tensorflow · GitHub). Since these phones are ARM devices, I don’t think that’s the problem in my case, yet the issue persists. Does anyone have any comments? Thanks!
PS:
Sorry to bother you again so soon.
Just now I figured out something that could serve as an initial answer to my own question. I set the single command-line option --use_xnnpack=false when running inference on both the quantized and non-quantized TFLite models, and with that flag the results look as expected: the quantized version takes less time than the non-quantized one. As far as I know, XNNPACK can speed up inference for floating-point models, but does it support integer-quantized models as well? According to this post (support for quantized tflite models · Issue #999 · google/XNNPACK · GitHub), quantized support is disabled by default. Is there any more recent information on integer support?
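In case it helps anyone reproduce the comparison outside the Android benchmark tool, here is a rough Python timing sketch. The model paths are placeholders, and I'm assuming that the experimental_op_resolver_type option of tf.lite.Interpreter (BUILTIN_WITHOUT_DEFAULT_DELEGATES, available in recent TF releases) skips the default delegates, including XNNPACK, which should roughly correspond to --use_xnnpack=false:

```python
import time
import numpy as np
import tensorflow as tf

def benchmark(model_path, use_default_delegates, runs=50):
    # BUILTIN_WITHOUT_DEFAULT_DELEGATES skips delegates applied by default,
    # which (as I understand it) includes XNNPACK in recent TF releases.
    resolver = (tf.lite.experimental.OpResolverType.AUTO
                if use_default_delegates
                else tf.lite.experimental.OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES)
    interpreter = tf.lite.Interpreter(model_path=model_path,
                                      experimental_op_resolver_type=resolver)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Dummy input is enough for a rough latency comparison.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1e3  # ms per inference

# Placeholder file names.
for path in ("model_float.tflite", "model_int8.tflite"):
    for default_delegates in (True, False):
        label = "default delegates" if default_delegates else "no default delegates"
        print(f"{path} ({label}): {benchmark(path, default_delegates):.2f} ms")
```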
Any feedback is appreciated!