Recently, I built a vanilla neural network model that runs on a CPU. I then implemented serving with gRPC and deployed it on Kubernetes. Despite tuning the number of CPUs, the memory, and the number of service threads, throughput is bottlenecked at around 10 RPS during prediction. I load the model during service initialization and run a prediction for each incoming request, as shown below:
```python
def GetPredict(self, request, context):
    predict = model.predict()
    return service_pb2.Predict(predict=predict)
```
I have verified via the Keras progress-bar (`[=============]`) verbosity that each prediction takes around 100 ms. However, even with more CPU or memory, the RPS does not increase. Sometimes the Kubernetes pods crash when the service is called heavily, so I suspect a memory leak in the prediction path. Additionally, I want this to be a single prediction, not a batch prediction. Is there any way to optimize this directly, without a tool like ONNX? I am using TensorFlow 2.13.
Optimizing the throughput of a TensorFlow Keras model served on CPU behind gRPC and Kubernetes involves several strategies:
- Model Optimization: Simplify your model, consider model quantization, or use TensorFlow Lite for more efficient CPU inference.
- Service Code Optimization: Implement asynchronous processing to handle multiple requests simultaneously and consider small-scale batching for predictions.
- Kubernetes Tuning: Fine-tune your Kubernetes deployment for optimal resource utilization, efficient load balancing, and effective autoscaling.
- gRPC Optimization: Adjust gRPC parameters for better performance and manage connections efficiently.
- Memory Leak Investigation: Use profiling tools to identify and fix any memory leaks, and consider manual garbage collection.
- Monitoring: Implement robust monitoring and logging to identify performance bottlenecks and errors.
- Advanced TensorFlow Features: Explore TensorFlow Serving as an optimized alternative for deploying TensorFlow models.
- Hardware Review: Although you’re using CPUs, evaluate whether other hardware options like GPUs might offer better performance for your workload.
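For the Kubernetes bullet, pod crashes under load are often OOM kills, so it helps to pin explicit resource requests/limits (an `OOMKilled` status then confirms memory growth) and let a HorizontalPodAutoscaler add pods rather than overloading one. A hedged sketch — the names, image, and thresholds are placeholders, not from the question:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predict-service            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: predict-service
  template:
    metadata:
      labels:
        app: predict-service
    spec:
      containers:
        - name: server
          image: predict-service:latest   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 2Gi        # an OOMKilled status here points at a leak
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: predict-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: predict-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

With 100 ms predictions, each extra pod adds roughly 10 RPS of serial capacity, so scaling out is a direct lever even before any in-process concurrency.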
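A quick sanity check ties these strategies back to the numbers in the question: a 100 ms prediction handled strictly one at a time yields a ceiling of exactly 10 requests per second, which matches the reported bottleneck and suggests inference is fully serialized. The ceiling scales with whatever concurrency (workers or batch size) you introduce:

```python
# Serial throughput ceiling: one 100 ms prediction at a time.
latency_s = 0.100            # measured per-prediction latency from the question
workers = 1                  # predictions running concurrently (hypothetical knob)
rps_ceiling = workers / latency_s
print(rps_ceiling)           # 10.0 — matches the observed bottleneck
```

This is why adding CPU or memory alone does not help: nothing in the serving path lets a second prediction start before the first finishes.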
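For the service-code bullet, one common pattern that preserves a single-prediction API is a micro-batcher: a background thread drains a queue for a few milliseconds, runs one prediction over the collected inputs, and hands each caller back only its own row. A stdlib-only sketch — the `predict_batch` stub below stands in for the real Keras model, and all names are illustrative:

```python
import queue
import threading

class MicroBatcher:
    """Collect single requests briefly, then predict once per batch."""

    def __init__(self, predict_batch, max_batch=8, wait_s=0.005):
        self.predict_batch = predict_batch   # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.wait_s = wait_s
        self.inbox = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def predict(self, x):
        """Called per gRPC request; blocks until its single result is ready."""
        done = threading.Event()
        slot = {}
        self.inbox.put((x, done, slot))
        done.wait()
        return slot["result"]

    def _worker(self):
        while True:
            batch = [self.inbox.get()]              # block for the first item
            try:
                while len(batch) < self.max_batch:  # then drain briefly
                    batch.append(self.inbox.get(timeout=self.wait_s))
            except queue.Empty:
                pass                                # window closed; run what we have
            outputs = self.predict_batch([item[0] for item in batch])
            for (_, done, slot), out in zip(batch, outputs):
                slot["result"] = out
                done.set()

# Stub model that doubles each input; in the real service this would stack
# the inputs into one array and make a single model.predict call.
batcher = MicroBatcher(lambda xs: [2 * x for x in xs])
print(batcher.predict(21))
```

Each gRPC handler still calls `batcher.predict(x)` and returns one result, so the external contract stays single-prediction while the model sees batches.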
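For the memory-leak bullet, the standard-library `tracemalloc` module can confirm whether memory actually grows across repeated predictions before you reach for heavier profilers. A sketch with a stub workload standing in for `model.predict`:

```python
import tracemalloc

leaked = []   # stub "leak"; in the real service, replace the loop body with predictions

def handle_request():
    leaked.append(bytearray(10_000))   # simulates per-request memory growth

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
for _ in range(100):
    handle_request()
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

growth = after - before
print(f"grew by ~{growth} bytes over 100 requests")
# Steady growth per request (rather than a flat plateau) points at a leak;
# tracemalloc.take_snapshot().statistics("lineno") shows the allocating lines.
```

Run the same measurement around your real prediction loop inside one pod: if traced memory keeps climbing linearly with request count, the snapshot statistics will name the responsible lines.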
By applying these strategies, you can potentially increase the requests-per-second (RPS) throughput and address issues like service crashes and memory leaks, even without tools like ONNX. Remember to test each change in isolation to confirm it actually improves performance.