Why does INT8 inference with the XNNPACK delegate in TensorFlow Lite run slower with 4 threads than with 2 on a Raspberry Pi 5?
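One way to make the comparison concrete is TensorFlow Lite's `benchmark_model` tool, which reports per-inference latency for a given thread count. This is a sketch, not my exact setup: the model path `model_int8.tflite` is a placeholder for whatever quantized model you are testing, and the binary must be built or downloaded for the Pi's ARM64 platform.

```shell
# Hypothetical repro: sweep thread counts with the TFLite benchmark tool.
# Replace model_int8.tflite with your actual quantized model file.
for n in 1 2 4; do
  ./benchmark_model \
    --graph=model_int8.tflite \
    --use_xnnpack=true \
    --num_threads="$n"
done
```

Comparing the reported average inference times across the sweep (rather than a single 2-vs-4 run) helps rule out thermal throttling or scheduler noise as the cause of one slow measurement.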