I’m running a localhost API in Python. The /predict endpoint takes one base64-encoded image, which is then processed and passed to a previously loaded saved_model for prediction. I’m predicting on my GPU using CUDA. The problem is that the model only predicts one image at a time (due to the GIL, I think), but I’d like to predict multiple images concurrently when several requests arrive at the same time. At the moment my GPU utilization is only 1-2%. Is there any way to accomplish this in Python, similar to the TensorFlow Serving API, which sadly doesn’t work properly with GPU on Windows?
This is my code:
from fastapi import FastAPI
import uvicorn
from pydantic import BaseModel
import tensorflow as tf
import numpy as np
import io
from PIL import Image
import base64
import time

model_path = './saved_model'
model = tf.saved_model.load(model_path)


class PredictionRequest(BaseModel):
    image: str


app = FastAPI()


@app.post("/predict")
async def predict_objects(request: PredictionRequest):
    image_data = base64.b64decode(request.image)
    image = Image.open(io.BytesIO(image_data))
    image_np = np.array(image)
    input_tensor = tf.convert_to_tensor(image_np)
    input_tensor = input_tensor[tf.newaxis, ...]
    ts = time.perf_counter()
    detections = model(input_tensor)
    ts2 = time.perf_counter()
    print(int(ts2 * 1000 - ts * 1000))
    detections = detections['detection_boxes'][0].numpy()
    data = []
    # processing return data
    return data


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
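
One direction that might help, sketched under assumptions rather than tested against the model above: a blocking call inside an `async def` handler holds the event loop, so requests are handled one after another regardless of the GIL. A common workaround is micro-batching: collect requests that arrive within a short window, run them through the model in a single call, and hand each caller its own result. The sketch below uses only `asyncio`. `MicroBatcher`, `fake_predict_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names, and `fake_predict_batch` is a stand-in for something like stacking the decoded images and calling `model(...)` once. That would only work if the saved model accepts a batch dimension larger than 1 and all images in a batch share the same shape.

```python
import asyncio
from typing import Any, Callable, List


class MicroBatcher:
    """Hypothetical helper: groups concurrent requests into one model call."""

    def __init__(self, predict_batch: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, max_wait_s: float = 0.01) -> None:
        self.predict_batch = predict_batch    # blocking function: list in, list out
        self.max_batch_size = max_batch_size  # upper bound on items per model call
        self.max_wait_s = max_wait_s          # how long to wait for more requests
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Called by a request handler; resolves once the item's batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        """Background worker: builds batches and runs them off the event loop."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]      # block until one request exists
            await asyncio.sleep(self.max_wait_s)  # give other requests time to arrive
            while len(batch) < self.max_batch_size and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            items = [item for item, _ in batch]
            try:
                # Run the blocking call in a worker thread so the loop stays responsive.
                results = await loop.run_in_executor(None, self.predict_batch, items)
            except Exception as exc:
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)


def fake_predict_batch(items: List[int]) -> List[int]:
    """Stand-in for a real batched model call (assumption, not the real model)."""
    return [x * 2 for x in items]


async def demo():
    calls: List[int] = []  # records the size of each batch the "model" received

    def predict(items: List[int]) -> List[int]:
        calls.append(len(items))
        return fake_predict_batch(items)

    batcher = MicroBatcher(predict, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Four "requests" arriving at the same time.
    results = await asyncio.gather(*(batcher.submit(i) for i in range(4)))
    worker.cancel()
    return results, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the FastAPI app, the worker would presumably be started once at startup, and the handler would decode the image and then `await batcher.submit(image_np)` instead of calling the model directly. The trade-off is up to `max_wait_s` of extra latency per request in exchange for larger batches on the GPU.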