How to make TFLite handle multiple API calls simultaneously?

Hi all,

I am working on cloud cost optimization on AWS and am currently exploring deployment of TFLite models on AWS t4g instances, which use AWS's custom ARM chip, Graviton2. They offer roughly 2x the performance at half the cost compared to t3 instances. I have created a basic deployment container for InceptionNet-v3 pretrained on ImageNet (no fine-tuning) using FastAPI.

The InceptionNet-v3 TFLite model is quantized to float16. I have shared the deployment code below.

import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image
import copy

def load_model(model_path):
    '''
    Loads the TFLite model once so it can be kept as a global variable
    '''
    model_interpreter = tflite.Interpreter(model_path)
    return model_interpreter

def infer(image, model_interpreter):
    '''
    Runs inference on the input image
    '''
    model_interpreter.allocate_tensors()
    input_details = model_interpreter.get_input_details()[0]
    output_details = model_interpreter.get_output_details()[0]
    # tensor() returns callables that give views into the interpreter's internal buffers
    input_tensor = model_interpreter.tensor(input_details["index"])
    output_tensor = model_interpreter.tensor(output_details["index"])

    # Resize to the model's spatial input size, add a batch dimension and scale to [0, 1]
    image = image.resize(tuple(input_details["shape"][1:-1]))
    image = np.asarray(image, dtype=np.float32)
    image = np.expand_dims(image, 0)
    image = image / 255

    input_tensor()[:] = image
    model_interpreter.invoke()
    # Copy the output so no reference to the interpreter's internal buffer is kept
    results = copy.deepcopy(output_tensor())

    return results
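
For completeness, the helpers are called from a FastAPI endpoint roughly like this. This is a simplified sketch, not the exact serving code; the route, file handling, and model path here are illustrative:

import io
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
model_interpreter = load_model("inception_v3_float16.tflite")  # illustrative path

@app.post("/predict")
def predict(file: UploadFile = File(...)):
    # FastAPI runs sync endpoints in a thread pool, so several requests
    # can reach infer() on the shared interpreter at the same time.
    image = Image.open(io.BytesIO(file.file.read())).convert("RGB")
    results = infer(image, model_interpreter)
    return {"predictions": results.tolist()}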

When I run my load-testing program (written with Locust, a Python load-testing package), I don't see any errors until the user count goes above 3. After that I start seeing the runtime error below:

RuntimeError: There is at least 1 reference to internal data
      in the interpreter in the form of a numpy array or slice. Be sure to
      only hold the function returned from tensor() if you are using raw
      data access.

I don't see the error on every request; it occurs randomly.
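
The load test is essentially this (simplified; the wait time and sample image are illustrative):

from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(0.5, 2)

    @task
    def predict(self):
        # Endpoint path and sample image are illustrative.
        with open("sample.jpg", "rb") as f:
            self.client.post("/predict", files={"file": f})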

When I moved the model-loading lines inside the infer function (roughly the sketch below), the error went away. But that loads a new interpreter on every request. This is fine while the model is small, but as the model grows it will add a significant amount to the inference time.
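
The per-request variant is essentially this (the function name is just for illustration):

def infer_per_request(image, model_path):
    '''
    Workaround: build a fresh interpreter for every request instead of
    sharing a global one. Avoids the concurrency error, but pays the
    model-loading cost on each call.
    '''
    model_interpreter = tflite.Interpreter(model_path)
    return infer(image, model_interpreter)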

Is there a way to avoid this intermittent error while keeping the TFLite interpreter as a global variable?

Thanks.

EDIT:
I tried the signature runner API. I no longer got the error, but for some reason my Docker container kept exiting without any error message. I think it was running out of memory.
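
What I tried looks roughly like this (simplified; the "serving_default" key and the preprocessing shown here are illustrative):

def infer_with_signature(image, model_interpreter):
    '''
    Variant of infer() that goes through the signature runner API
    '''
    # "serving_default" is the usual signature key; adjust if the model
    # was converted with a different one.
    runner = model_interpreter.get_signature_runner("serving_default")
    input_name = next(iter(runner.get_input_details()))

    image = image.resize((299, 299))  # InceptionNet-v3 input size
    image = np.asarray(image, dtype=np.float32)
    image = np.expand_dims(image, 0) / 255

    # The runner returns a dict keyed by output name; take the first entry.
    outputs = runner(**{input_name: image})
    return next(iter(outputs.values()))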

Does the signature runner create multiple interpreter instances in different threads while the current interpreter is running, thereby using up all the available RAM?

Hi, @Prikmm

I apologize for the delayed response. If you still need help with this issue, could you please run the performance benchmark on the model with --enable_op_profiling=true? It will give much more detail.

Please refer to the official documentation for performance best practices. Also, to confirm, did you try the workaround suggested in this Stack Overflow thread? If not, please give it a try and see whether it resolves your issue.

If the issue still persists after trying with the latest version, please let us know with more information so we can investigate this issue further from our end.

Thank you for your cooperation and patience.