Decoding tflite custom object detector output from a model trained with MediaPipe (MobileNetV2)

Hi all, I trained a custom object detector with MediaPipe. Exporting it to tflite worked fine, but when I try to make predictions I get two output tensors, one presumably for bboxes and one for scores. For one image, the outputs are obtained as follows:

output_details = interpreter.get_output_details()

which gives:

[{'name': 'StatefulPartitionedCall:0',
  'index': 425,
  'shape': array([    1, 12276,     4], dtype=int32),
  'shape_signature': array([    1, 12276,     4], dtype=int32),
  'dtype': numpy.float32,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32),
   'quantized_dimension': 0},
  'sparsity_parameters': {}},
 {'name': 'StatefulPartitionedCall:1',
  'index': 423,
  'shape': array([    1, 12276,     4], dtype=int32),
  'shape_signature': array([    1, 12276,     4], dtype=int32),
  'dtype': numpy.float32,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32),
   'quantized_dimension': 0},
  'sparsity_parameters': {}}]

code for prediction:

interpreter.set_tensor(input_details[0]['index'], input_data)  # input_data preprocessed to the model's input shape/dtype
interpreter.invoke()
boxes  = interpreter.get_tensor(output_details[0]['index'])    # (1, 12276, 4) raw box values
scores = interpreter.get_tensor(output_details[1]['index'])    # (1, 12276, 4) per-class scores

boxes:

[[[ 0.01087701 -0.27369365 -0.53198564 -0.8404835 ]
  [ 0.05485853  0.02915781 -1.390534   -1.670182  ]
  [-0.12034623  0.00819616 -0.9961058  -0.395994  ]
  ...
  [-0.3435838  -0.35941318 -1.0712042  -0.43489447]
  [-0.4016505  -0.03572614 -0.67902136 -0.7194235 ]
  [-0.47916242  0.01016152  0.13207799 -0.7979872 ]]]

scores:

[[[0.005811   0.00431303 0.00324296 0.01789892]
  [0.00658012 0.01305784 0.00548336 0.01855727]
  [0.01610166 0.00838473 0.01678689 0.01819396]
  ...
  [0.00505611 0.02350343 0.01970816 0.00919266]
  [0.00427777 0.01386124 0.00888682 0.01396356]
  [0.00742702 0.00696907 0.00702236 0.00696763]]]

My question is: how do you decode this output into a format that can be used for inference and visualization? I don't understand the structure of the box values (there are negative values), nor how the scores relate to the classes; in my case I'm trying to predict objects belonging to one of 3 classes (+ 1 background). I've seen examples where tflite object detectors also provide the labels and the number of detections in two additional output tensors, but here I only get 2. I trained the detector on MobileNetV2 following a MediaPipe example (as recommended by Google developers, since tflite-model-maker has been facing issues and is being migrated to MediaPipe).
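For what it's worth, below is a minimal sketch of what decoding SSD-style raw outputs usually looks like. It rests on assumptions you would need to verify against your training config, not on anything from the MediaPipe docs: that each of the 12276 rows corresponds to one anchor box, that the 4 box values are (ty, tx, th, tw) offsets encoded relative to that anchor (which would explain the negative values) with the common scale factors (10, 10, 5, 5), and that the 4 score columns are per-class probabilities with column 0 = background (your printed scores already look like values in [0, 1], so no sigmoid is applied here). The anchors array must be generated with the same strides/scales/aspect ratios used in training.

import numpy as np

def nms(boxes, scores, classes, iou_thresh):
    """Greedy non-max suppression over (ymin, xmin, ymax, xmax) boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes.
        yx1 = np.maximum(boxes[i, :2], boxes[order[1:], :2])
        yx2 = np.minimum(boxes[i, 2:], boxes[order[1:], 2:])
        inter = np.prod(np.clip(yx2 - yx1, 0, None), axis=1)
        area_i = np.prod(boxes[i, 2:] - boxes[i, :2])
        areas = np.prod(boxes[order[1:], 2:] - boxes[order[1:], :2], axis=1)
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]
    keep = np.array(keep, dtype=int)
    return boxes[keep], scores[keep], classes[keep]

def decode_detections(raw_boxes, raw_scores, anchors,
                      scales=(10.0, 10.0, 5.0, 5.0),
                      score_thresh=0.5, iou_thresh=0.5):
    """Sketch of standard SSD decoding. `anchors` is (N, 4) as
    (y_center, x_center, height, width) in normalized [0, 1] coords;
    the encoding and scale factors are assumptions to verify."""
    raw_boxes = raw_boxes[0]    # drop batch dim -> (N, 4)
    raw_scores = raw_scores[0]  # -> (N, num_classes)

    # Undo the SSD box encoding relative to each anchor.
    ycenter = raw_boxes[:, 0] / scales[0] * anchors[:, 2] + anchors[:, 0]
    xcenter = raw_boxes[:, 1] / scales[1] * anchors[:, 3] + anchors[:, 1]
    h = np.exp(raw_boxes[:, 2] / scales[2]) * anchors[:, 2]
    w = np.exp(raw_boxes[:, 3] / scales[3]) * anchors[:, 3]
    boxes = np.stack([ycenter - h / 2, xcenter - w / 2,
                      ycenter + h / 2, xcenter + w / 2], axis=-1)

    # Assume column 0 is background; take the best foreground class per anchor.
    class_ids = raw_scores[:, 1:].argmax(axis=-1) + 1
    class_scores = raw_scores[np.arange(len(raw_scores)), class_ids]

    keep = class_scores >= score_thresh
    return nms(boxes[keep], class_scores[keep], class_ids[keep], iou_thresh)

Usage would be something like boxes, scores, classes = decode_detections(boxes, scores, anchors). If the decoded boxes come out nonsensical, the most likely culprits are the anchor config or the (ty, tx, th, tw) ordering, both of which vary between exporters.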

Has anyone used this model before and decoded the outputs to make predictions with the exported tflite model? Is there a way to include more information when exporting to tflite, so that the two additional outputs mentioned above are produced, already decoded?
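If the goal is just to get decoded, labelled detections out of the exported model, one option (assuming the exported .tflite keeps the metadata that MediaPipe Model Maker writes into it) is to run it through the MediaPipe Tasks ObjectDetector instead of the raw interpreter; it performs the anchor decoding, score thresholding, and NMS internally. 'model.tflite' and 'test.jpg' below are placeholder paths:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path='model.tflite'),
    score_threshold=0.3,
    max_results=10)
detector = vision.ObjectDetector.create_from_options(options)

image = mp.Image.create_from_file('test.jpg')
result = detector.detect(image)
for det in result.detections:
    cat = det.categories[0]
    bbox = det.bounding_box  # origin_x, origin_y, width, height in pixels
    print(cat.category_name, cat.score, bbox)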

Thanks,
any help will be very useful,
best regards
Carlos


I am looking for a way to make sense of this as well; it would really be nice if someone could share some info.
In my case I tried to import the model into an Android application to run inference with the TFLite vision tasks; however, it expects 4 output tensors and only finds the 2 mentioned in the post. tflite-model-maker has been out of the picture for quite a while now due to version conflicts, and mediapipe-model-maker seems to be dropping output tensors.
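For context, the 4-tensor layout the Task Library expects is, to my understanding, the one produced by the TFLite_Detection_PostProcess op that TF Object Detection API exports append; a model exported without that postprocessing op only exposes the 2 raw tensors. A quick way to check what a given .tflite actually exposes:

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model.tflite')  # placeholder path
interpreter.allocate_tensors()
for d in interpreter.get_output_details():
    print(d['name'], d['shape'])
# A Task-Library-compatible model prints 4 outputs:
#   locations (1, max_detections, 4) as ymin, xmin, ymax, xmax (normalized),
#   classes (1, max_detections), scores (1, max_detections), num_detections (1,)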