Hi, I’m deploying a simple CPU model on a TensorFlow Serving server (the model is just a basic adder of 2 tensors, nothing heavy). I created a test script in Python in which I prepare and send a dozen requests to the server - the goal is to achieve the highest possible throughput. As the serialisation of requests takes a while, I split the work across separate Python processes spawned inside my script. So, each process prepares a subset of requests, sends each of them immediately to the server by calling stub.Predict.future, and then collects the futures’ results. Thanks to this strategy I reached some peak throughput.
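To make the fan-out strategy concrete, here is a hypothetical stdlib-only sketch of what my client does. The actual gRPC call (stub.Predict.future on a PredictRequest) is replaced by a local predict() placeholder that mimics the adder model, so only the pattern itself - split requests across processes, fire them, collect the results - is shown:

```python
from concurrent.futures import ProcessPoolExecutor

def predict(request):
    # Placeholder for stub.Predict.future(request).result(): the toy
    # model just adds two tensors (here, plain Python lists).
    a, b = request
    return [x + y for x, y in zip(a, b)]

def split(requests, n_procs):
    # One interleaved chunk of the request list per process.
    return [requests[i::n_procs] for i in range(n_procs)]

def worker(requests):
    # Each process serialises and fires its own subset of requests,
    # then collects the futures' results.
    return [predict(r) for r in requests]

def run(requests, n_procs=4):
    # Spread the serialisation cost across separate processes.
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        futures = [pool.submit(worker, chunk) for chunk in split(requests, n_procs)]
        return [result for f in futures for result in f.result()]
```

Processes (not threads) matter here because protobuf serialisation is CPU-bound Python work and the GIL would serialise it across threads.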
I then extended my set of models, so now the server serves 2 identical models under 2 different names. I start the server with numactl, pinning its process to some unoccupied cores (from 16 to 128 of them), so it has plenty of resources to put computations on. This time I start 2 clients in parallel, each of them sending requests to a different model (once again I pin their processes to separate sets of cores). Surprisingly, this does not give me doubled throughput: the throughput reported by each client is half the top one achieved when only 1 client exists. However, if I start two separate server instances listening on 2 different ports and run my clients in parallel, each communicating with its dedicated server, the throughput I get is (almost) 2x.
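For reference, serving the same model under two names on one instance is done with a model config file passed via --model_config_file; a minimal sketch (the model names and path are made up for illustration):

```
model_config_list {
  config {
    name: "adder_a"
    base_path: "/models/adder"
    model_platform: "tensorflow"
  }
  config {
    name: "adder_b"
    base_path: "/models/adder"
    model_platform: "tensorflow"
  }
}
```

Each client then sets its own model name in the PredictRequest's model_spec, so the two streams of requests never touch the same model instance.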
So it looks like the server gets saturated.
After trying all the possible tf serving settings I started looking at the internals and came to the conclusion that the likely cause is the configuration of the gRPC server that tf serving uses for client-server communication. The tf serving gRPC server works in sync mode only and, as far as I can see, there is no ready-to-go option for switching it to async mode (which I assume may be the solution for me).
So, I forked tf serving and switched gRPC to async mode (all based on the gRPC tutorial). Inside tf serving I created an array of CompletionQueues (eventually something like 40 of them) and also spawned 40 threads - one per queue. I do not know much about gRPC, so it was driven more by intuition and experiments than by knowledge, but somehow I managed to make it all work. The server responds to my requests and I get correct results.
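For what it’s worth, the structure I ended up with (N completion queues, one polling thread each, requests distributed round-robin) looks roughly like this stdlib sketch. The real fork is C++ and blocks on grpc::ServerCompletionQueue::Next(); queue.Queue here is just a stand-in so the shape of the design is visible:

```python
import queue
import threading

N_QUEUES = 4  # the fork ended up with ~40

def poll(cq, results):
    # Stand-in for the per-queue polling loop: in the real server each
    # thread blocks on CompletionQueue::Next() and drives the RPC state
    # machine; here we just pop work items and "complete" them.
    while True:
        item = cq.get()
        if item is None:  # shutdown sentinel
            break
        tag, payload = item
        results[tag] = sum(payload)  # pretend to run the adder model

cqs = [queue.Queue() for _ in range(N_QUEUES)]
results = {}
threads = [threading.Thread(target=poll, args=(cq, results)) for cq in cqs]
for t in threads:
    t.start()

# Distribute incoming requests round-robin across the queues, as the
# async server spreads new calls across its completion queues.
for i in range(12):
    cqs[i % N_QUEUES].put((i, [i, i + 1]))

for cq in cqs:
    cq.put(None)
for t in threads:
    t.join()
```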
However, the problem still exists - I did not manage to get the same throughput as with 2 separate server instances.
So, now to the questions:
- Why does tf serving use only the gRPC sync mode?
- Is my approach to the solution complete nonsense, or am I going in the right direction?
- I did rework my server but basically did not change my client (it still uses stub.Predict.future as before and, surprisingly, it works) - should I extend the Python stub module with AsyncPred and CompletionQueue to prepare & send requests in the same manner as shown in the gRPC C++ Asynchronous-API tutorial?
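On that last point, my current understanding (which may be wrong) is that the Python client is already asynchronous in the relevant sense: stub.Predict.future returns a future with the same add_done_callback/result interface as concurrent.futures.Future, completed by the channel’s own background machinery. A stdlib sketch of the callback pattern, with a hypothetical fake_predict_future standing in for the gRPC stub:

```python
import threading
from concurrent.futures import Future

# Hypothetical stand-in for stub.Predict.future(request): it returns a
# Future that a background thread completes later, the way the gRPC
# channel completes an in-flight RPC.
def fake_predict_future(a, b):
    f = Future()
    threading.Thread(
        target=lambda: f.set_result([x + y for x, y in zip(a, b)])
    ).start()
    return f

results = []
done = threading.Event()
futures = [fake_predict_future([i], [i]) for i in range(3)]

def on_done(f):
    # Called from the completing thread; no explicit completion queue
    # or polling loop is needed on the client side.
    results.append(f.result())
    if len(results) >= 3:
        done.set()

for f in futures:
    f.add_done_callback(on_done)
done.wait(timeout=5)
```

So the question is really whether a hand-rolled client-side completion queue would buy anything over this callback style.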
Looking forward to your opinions!