TensorFlow Serving parallelism

TensorFlow Serving provides two parameters for utilizing the CPU (tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism). Tuning these parameters can have a great impact on model server performance (throughput, latency). I couldn’t find good documentation for them on the TensorFlow Serving website. My main question is:

  • How do these thread pools relate to rest_api_num_threads? Are the thread pools shared between the ops of all requests on the model server?

Hello @Mehran_S

Thank you for using TensorFlow

tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism are parameters used by TensorFlow during model execution, while rest_api_num_threads controls the number of threads available for handling incoming REST API requests in Serving. These thread pools are not directly shared.
Tuning tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism may directly improve model inference performance, while rest_api_num_threads determines how many concurrent client requests the server can accept at once.
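To make the distinction concrete, here is a minimal sketch of launching tensorflow_model_server with all three flags set. The model name and base path are placeholders; the thread counts shown are illustrative starting points, not recommendations:

```shell
# Sketch: serve a model with explicit threading configuration.
# my_model and /models/my_model are placeholder values.
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --tensorflow_intra_op_parallelism=4 \
  --tensorflow_inter_op_parallelism=2 \
  --rest_api_num_threads=16
```

With a configuration like this, up to 16 threads accept and parse REST requests, while the actual graph execution for each request is scheduled on the TensorFlow thread pools: inter-op parallelism bounds how many independent ops run concurrently, and intra-op parallelism bounds how many threads a single op (e.g. a large matmul) may use internally.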