I am working on an app to teach kids animal names.
The way it works is that the camera stream looks at whatever they are pointing at, and the app tells them the animal by saying its name as well as the sound it makes. For example, if a kid were to point the phone at a dog, the app would say “dog, woof”.
The model that I am using has a latency of around 2 seconds per prediction. I can’t show the code, but here is how it (kind of) works:
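Process the image: resizing, normalization.
Run the prediction.
Speak the prediction.
If the prediction came quickly, wait for (3 seconds minus the prediction latency).
Get the next available frame, i.e. whatever frame the camera delivers at that moment, not the frame that followed the previous one in the sequence.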
Repeat.
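To make the timing in the steps above concrete, here is a rough TypeScript sketch of a single pass through the loop. It is not my real code: `remainingWaitMs`, `runCycle` and the `Steps` interface are made-up names for illustration, and the only number taken from the description is the 3 second cycle.

```typescript
// Length of one cycle in milliseconds (the 3 s figure from the steps above).
const CYCLE_MS = 3000;

// Time left to wait after a prediction that took elapsedMs; never negative.
function remainingWaitMs(elapsedMs: number, cycleMs: number = CYCLE_MS): number {
  return Math.max(0, cycleMs - elapsedMs);
}

// Hypothetical step functions, injected so the sketch stays self-contained.
interface Steps<F, P> {
  nextFrame(): Promise<F>;        // whatever frame the camera has right now
  predict(frame: F): Promise<P>;  // preprocessing + model inference
  speak(prediction: P): void;     // name + animal sound
  sleep(ms: number): Promise<void>;
}

// One pass through the loop: grab a frame, predict, speak, pad the cycle.
async function runCycle<F, P>(steps: Steps<F, P>): Promise<number> {
  const frame = await steps.nextFrame();
  const start = Date.now();
  const prediction = await steps.predict(frame);
  const elapsedMs = Date.now() - start;
  steps.speak(prediction);
  const waitMs = remainingWaitMs(elapsedMs);
  await steps.sleep(waitMs);
  return waitMs;
}
```

With these assumptions, a prediction that takes 500 ms leaves 2500 ms of waiting, and anything slower than 3 s leaves no wait at all, so the next cycle starts right away.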
A small problem I found: I point the camera at a dog and it says the prediction correctly, then I move the camera to something else. The problem is that it says “dog” again before saying the correct prediction.
But when it says dog again, did the model predict dog on a random image, or is it a delay in your pipeline?
Are you using TFLite?
It would help to give more information about what you’re using.
For image classification, 2 seconds per prediction is usually too much. Which model are you using? If it’s custom, how did you customize it?
Wait, sorry. 2 seconds isn’t the prediction time; prediction takes between 200 and 500 ms, but extra time is needed to say the prediction.
I’m using a TFLite model based on the YOLOv5s architecture. I am detecting around 20 animals and found that YOLO worked a lot better than SSD MobileNet or simple image classification with EfficientNet.
Not sure what you mean by random image, but no, it’s the next available frame.
Here’s a more detailed pipeline (pseudocode):
predLoaded = true
camera.onNextFrame = predict
async predict(image) {
  delay = 3000ms
  if not predLoaded then return
  // assumed: the flag is cleared here so frames arriving mid-prediction are skipped
  predLoaded = false
  image = await preprocess(image)
  prediction = await model.predict(image)
  prediction = await nms(prediction)
  speak(prediction)
  combinedTime = time(preprocess) + time(prediction) + time(nms)
  if combinedTime < delay {
    await wait(delay - combinedTime)
  }
  predLoaded = true
}
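The `predLoaded` guard above is meant to skip frames that arrive while a prediction is still running. Here is a small TypeScript sketch of that idea on its own, assuming the whole preprocess/predict/speak/wait chain is wrapped in one async function. `makeFrameHandler`, `Frame` and `Pipeline` are hypothetical names, not part of the app or of TFLite.

```typescript
// Hypothetical stand-ins for the real camera frame and pipeline.
type Frame = { id: number };
type Pipeline = (frame: Frame) => Promise<void>;

// Wraps a pipeline so that frames arriving while one is in flight are
// dropped instead of being queued up behind it.
function makeFrameHandler(pipeline: Pipeline): (frame: Frame) => Promise<boolean> {
  let busy = false;
  return async (frame: Frame): Promise<boolean> => {
    if (busy) return false; // a prediction is still running: drop this frame
    busy = true;            // mark busy before the first await
    try {
      await pipeline(frame);
      return true;          // this frame was processed
    } finally {
      busy = false;         // accept frames again, even if the pipeline threw
    }
  };
}
```

The flag has to be set before the first `await`. If it is set later, or never set to false at all, every incoming frame starts its own prediction, and an older result can be spoken after the camera has already moved.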
Actually, I just checked: the combined prediction time, including saying the animal name, is around 5 seconds, but on extremely fast devices it can get below the 3 s threshold.