Welcome to the community @BrianHung
One of the advantages of Gemini models is their very large context window. This means you can send a batch of images and ask the model to return a JSON list via structured outputs, where every element is an object containing the details you want about the respective image.
I made a sample in AI Studio, but sharing isn't working, so here's the JSON schema you can use in your own AI Studio instance after enabling JSON mode:
{
  "type": "object",
  "properties": {
    "response": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "caption": {
            "type": "string",
            "description": "Image caption"
          },
          "hasHumans": {
            "type": "boolean",
            "description": "True if the image has humans in it, otherwise false"
          }
        },
        "required": [
          "caption",
          "hasHumans"
        ]
      }
    }
  },
  "required": [
    "response"
  ]
}
Set the system prompt to:
You reply with a description of the images in the respective order in the specified JSON format.
Set the model to “Gemini 1.5 Flash” or Pro, depending on your use case.
Then simply add the images. There's no need to supply any text; just make the generate content request.
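The same steps also work outside AI Studio via the google-generativeai Python SDK. Here's a minimal sketch, assuming the SDK accepts the schema as a plain dict in `generation_config` (`describe_images` and the image paths are placeholder names of my own, not part of the API):

```python
import json

# The same JSON schema as above, expressed as a Python dict.
IMAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "response": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "caption": {
                        "type": "string",
                        "description": "Image caption",
                    },
                    "hasHumans": {
                        "type": "boolean",
                        "description": "True if the image has humans in it, otherwise false",
                    },
                },
                "required": ["caption", "hasHumans"],
            },
        }
    },
    "required": ["response"],
}


def describe_images(image_paths):
    """Send the images to Gemini and parse the structured JSON reply."""
    # Imported here so the schema above can be reused without the SDK installed.
    import google.generativeai as genai  # pip install google-generativeai
    import PIL.Image

    model = genai.GenerativeModel(
        "gemini-1.5-flash",
        system_instruction=(
            "You reply with a description of the images in the respective "
            "order in the specified JSON format."
        ),
        generation_config={
            "response_mime_type": "application/json",
            "response_schema": IMAGE_SCHEMA,
        },
    )
    images = [PIL.Image.open(p) for p in image_paths]
    # No text part needed -- the images alone are the request content.
    reply = model.generate_content(images)
    return json.loads(reply.text)["response"]
```

Each element of the returned list should then be a dict with `caption` and `hasHumans` keys, in the same order as the images you passed in.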
I tried with four of the sample images and got the following response (prettified):
{
  "response": [
    {
      "caption": "A wooden chair with a light brown finish.",
      "hasHumans": false
    },
    {
      "caption": "A tabby kitten looking at the camera.",
      "hasHumans": false
    },
    {
      "caption": "A bowl of salad with tomatoes, cucumbers, red onions, feta cheese, and lettuce.",
      "hasHumans": false
    },
    {
      "caption": "A man riding a bicycle on a road.",
      "hasHumans": true
    }
  ]
}