I’m asking Gemini 1.5 to tell me if there’s a dog in an image, and it recognizes it perfectly. But when I ask it for the pixel coordinates where the dog is located, it gets them wrong. Is there any trick, or a model that does this correctly?
This happens with any image and any object I ask for coordinates for: I want to take those coordinates and crop the object out of the image I sent.
A general observation with LLMs is that they tend to perform better if you first show them a couple of examples of what is expected. In your case, that means setting up a sequence (demo image 1, solution for image 1, demo image 2, solution for image 2, the image you really want an answer for, and a placeholder for the model to fill in the answer), prefixed with a description of what is expected, either in the prompt or in the system instruction. This doesn’t mean you send multiple generateContent requests; it’s all one request with multiple Parts.
Note that this is just a general rule; I am not claiming it will work perfectly for the image-cropping use case (although I have seen published research reports claiming that LLMs can handle it reasonably well).
In my testing, I have found that the Gemini models are often confused about right, left, up, and down; they get confused about direction more easily than other popular LLMs. I don’t know whether that will affect the quality of the results you get.
I think trying out the few-shot approach I outlined above will help you. It would help everybody if you reported back whether it did or didn’t.
One more tip: the usual example prompts for object localization include an explicit “What is the size of the input image?” question, or, better yet, state it outright: “The image size is (width, height) = (998, 786)”. That helps the model with orientation. The results will still be approximate; the model is guessing where the bounding box of the target object really is.
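If the model answers in a normalized coordinate system rather than pixels (Google’s documentation describes Gemini bounding boxes as [y_min, x_min, y_max, x_max] on a 0–1000 scale, though whether your prompt elicits that format isn’t guaranteed), you still have to rescale to the stated image size before cropping. A minimal conversion sketch, with the function name my own:

```python
def to_pixel_box(norm_box, width, height, scale=1000):
    """Convert a [y_min, x_min, y_max, x_max] box on a 0..scale grid
    to pixel coordinates (left, upper, right, lower) for cropping."""
    y0, x0, y1, x1 = norm_box
    return (round(x0 * width / scale), round(y0 * height / scale),
            round(x1 * width / scale), round(y1 * height / scale))

# Using the (998, 786) image size mentioned above:
to_pixel_box([120, 250, 600, 900], width=998, height=786)
```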
This is probably not something the model will be able to do with any reasonable amount of reliability or accuracy.
We have no idea whether, or by how much, the image is scaled before going into the model, and even without any scaling, this isn’t the kind of information likely to be in the training data.
If you want to get this type of information, I would recommend using an external utility to overlay a labeled grid of points on the image, then asking for reference points.
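One way to implement the labeled-grid idea, sketched here without a drawing library (in practice you would render the labels onto the image with something like Pillow’s ImageDraw; the function name and the A1-style labeling scheme are my own assumptions):

```python
import string

def grid_points(width, height, cols=5, rows=5):
    """Return {"A1": (x, y), ...} for a cols x rows grid of reference
    points spread evenly over a width x height image. Columns are
    lettered A, B, ...; rows are numbered 1, 2, ...  After drawing the
    labels onto the image (e.g. with ImageDraw.text), you can ask the
    model which labels fall inside the object, instead of asking it to
    produce raw pixel coordinates."""
    points = {}
    for c in range(cols):
        for r in range(rows):
            x = round((c + 0.5) * width / cols)
            y = round((r + 0.5) * height / rows)
            points[f"{string.ascii_uppercase[c]}{r + 1}"] = (x, y)
    return points

grid_points(1000, 800, cols=4, rows=3)
```

The model then only has to name labels it can literally see, which plays to its strengths, and you map those labels back to pixel positions yourself.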
Don’t fight the model, meet it where it is.
I certainly agree that what I described as a guess will be inaccurate, but it isn’t useless. Intrigued by the idea, I tried the prompt style that usually works for GPT-4 on a cartoon. Gemini 1.5 Pro in AI Studio came up with this answer:
The bounding box would encompass the speech bubble on the right side of the image, with the following approximate coordinates:
Top Left: (190, 400)
Bottom Right: (510, 735)
The angel says:
"Everything! The bad people go to hell, the good people go to heaven and the rich people go to the Bahamas!"
It’s not perfect, but it’s kind of OK, and the model explained that it considered the speech bubble part of the “angel” object in the cartoon. For some applications, that could be good enough.
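With corner coordinates like those above, cropping is straightforward once they are packed into Pillow’s (left, upper, right, lower) box order. A small sketch; the clamping helper is my own addition, since model guesses can fall outside the image bounds:

```python
def crop_box(top_left, bottom_right, width, height):
    """Pack model-reported corners into Pillow's (left, upper, right,
    lower) crop-box order, clamped to the image bounds.  The result is
    what you pass to Image.crop()."""
    left, upper = top_left
    right, lower = bottom_right
    return (max(0, left), max(0, upper),
            min(width, right), min(height, lower))

# The corners reported above, for a hypothetical 640x800 cartoon:
crop_box((190, 400), (510, 735), width=640, height=800)
# With Pillow: img.crop(crop_box((190, 400), (510, 735), *img.size))
```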
Google has a number of other Vision products and models available. One of them may better serve your needs.
I have tried Vision AI and it works perfectly. I will have to use both models (Gemini + Vision) to get the result I want. I hope the billing doesn’t end up too high now.