I’m trying to quickly load a model to make predictions in a REST API. The tf.keras.models.load_model method takes ~1s, which is too slow for what I’m trying to do.
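For context, a minimal sketch of how that load time can be measured (the "model.h5" path here is just a placeholder, not my actual file):

```python
import time
import tensorflow as tf

# Time a single cold call to load_model; this is where the ~1s goes.
t0 = time.perf_counter()
model = tf.keras.models.load_model("model.h5")
print(f"load_model took {time.perf_counter() - t0:.2f}s")
```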
What is the fastest way to load a model for inference only?
I know there is a TFX serving server that does exactly this efficiently, but I already have a REST API for doing other things, and setting up a specialised server just for predictions feels like overkill. How does the TFX server handle this?
Yes, there might be no other way. However, I’m not sure whether the TFX server loads the model from disk on every request.
What I’m trying to achieve is either to find a very quick way to load the model from disk, or to keep the model in memory somehow so it doesn’t need to be loaded on every request.
I also tried caching, but pickle deserialisation is very expensive and adds ~1.2s. I suspect the built-in load_model does some sort of deserialisation too, which seems to be the killer.
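To illustrate what I mean by keeping the model in memory, here is a minimal sketch assuming Flask and an H5 model file (the framework, path, and endpoint are placeholders, not necessarily what I’m using): the model is loaded once when the process starts and then reused for every prediction request.

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Loaded once at startup (~1s hit here), then kept in memory for all requests.
model = tf.keras.models.load_model("model.h5")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"inputs": [[...], [...]]}
    inputs = np.asarray(request.get_json()["inputs"], dtype=np.float32)
    preds = model.predict(inputs)
    return jsonify({"predictions": preds.tolist()})

if __name__ == "__main__":
    app.run()
```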