Bulk Processing Images Without Batching

BrianHung · October 24, 2024, 5:48pm

I have a couple hundred photos that I would like to generate individual captions for, using a prompt. I would like to process them concurrently, without running into rate limit errors and without using batching (I would like to show each caption as soon as its image finishes).

What’s the recommended way of doing this with Gemini API + TypeScript / Swift?

sps · October 25, 2024, 6:19am

Welcome to the community @BrianHung

One of the advantages of Gemini models is the very large context window. This means you can send a batch of images in one request and ask the model to return a JSON list using structured outputs, where each element is an object containing the details you want about the corresponding image.

I made a sample in AI Studio, but sharing isn’t working, so here’s the JSON schema you can use in your own AI Studio instance after enabling JSON mode:

{
  "type": "object",
  "properties": {
    "response": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "caption": {
            "type": "string",
            "description": "Image caption"
          },
          "hasHumans": {
            "type": "boolean",
            "description": "True if the image has humans in it, otherwise false"
          }
        },
        "required": [
          "caption",
          "hasHumans"
        ]
      }
    }
  },
  "required": [
    "response"
  ]
}

Set the system prompt to:

You reply with a description of the images in the respective order in the specified JSON format.

Set the model to “Gemini 1.5 Flash” or Pro, depending on your use case.

Then simply add the images. No need to supply any text, just make the generate content request.
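
Outside AI Studio, the same setup can be expressed as a request body. The following is only a rough sketch, not a verified implementation: the field names (systemInstruction, inlineData, responseMimeType, responseSchema) are my understanding of the REST generateContent body and should be checked against the current API reference, and InlineImage and buildCaptionRequest are hypothetical names made up for illustration. The API may also expect uppercase type names in the schema (for example "OBJECT") rather than the lowercase ones AI Studio accepts.

```typescript
// Hypothetical helper type: one image, already base64-encoded.
interface InlineImage {
  mimeType: string; // e.g. "image/jpeg"
  base64Data: string; // image bytes as base64 text
}

// The schema from the post above, unchanged.
const captionSchema = {
  type: "object",
  properties: {
    response: {
      type: "array",
      items: {
        type: "object",
        properties: {
          caption: { type: "string", description: "Image caption" },
          hasHumans: {
            type: "boolean",
            description: "True if the image has humans in it, otherwise false",
          },
        },
        required: ["caption", "hasHumans"],
      },
    },
  },
  required: ["response"],
};

const SYSTEM_PROMPT =
  "You reply with a description of the images in the respective order in the specified JSON format.";

// Builds a request body whose only user content is the images (no text part).
// Field names are assumptions about the REST API; verify before use.
function buildCaptionRequest(images: InlineImage[]) {
  return {
    systemInstruction: { parts: [{ text: SYSTEM_PROMPT }] },
    contents: [
      {
        role: "user",
        parts: images.map((img) => ({
          inlineData: { mimeType: img.mimeType, data: img.base64Data },
        })),
      },
    ],
    generationConfig: {
      responseMimeType: "application/json", // JSON mode
      responseSchema: captionSchema,
    },
  };
}
```

The order of the image parts matters here, because the system prompt asks for descriptions "in the respective order" and nothing else ties a caption to its image.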

I tried with four of the sample images and got the following response (prettified):

{
  "response": [
    {
      "caption": "A wooden chair with a light brown finish.",
      "hasHumans": false
    },
    {
      "caption": "A tabby kitten looking at the camera.",
      "hasHumans": false
    },
    {
      "caption": "A bowl of salad with tomatoes, cucumbers, red onions, feta cheese, and lettuce.",
      "hasHumans": false
    },
    {
      "caption": "A man riding a bicycle on a road.",
      "hasHumans": true
    }
  ]
}
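
Since the captions come back as a plain list, they have to be matched to the images by position. Here is a minimal sketch of that step, assuming the response text follows the schema above; pairCaptions and the CaptionedImage type are hypothetical names, not part of any SDK. It refuses to pair anything if the number of entries does not match the number of images sent, because positional matching would then be unreliable.

```typescript
interface CaptionedImage {
  image: string; // whatever identifies the image on your side, e.g. a file name
  caption: string;
  hasHumans: boolean;
}

// Pairs each returned entry with the image at the same index.
function pairCaptions(imageNames: string[], responseText: string): CaptionedImage[] {
  const parsed = JSON.parse(responseText) as { response?: unknown };
  if (!Array.isArray(parsed.response)) {
    throw new Error('Expected a top-level "response" array');
  }
  if (parsed.response.length !== imageNames.length) {
    throw new Error(
      `Got ${parsed.response.length} captions for ${imageNames.length} images`,
    );
  }
  return parsed.response.map((item, i) => {
    const entry = item as { caption?: unknown; hasHumans?: unknown };
    if (typeof entry.caption !== "string" || typeof entry.hasHumans !== "boolean") {
      throw new Error(`Entry ${i} does not match the schema`);
    }
    return { image: imageNames[i], caption: entry.caption, hasHumans: entry.hasHumans };
  });
}
```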

BrianHung · October 25, 2024, 7:55am

Thanks sps for the welcome. Do you know whether, with this approach, the generated captions are uncorrelated with each other, as if the images had been sent individually?

I would want a caption to only be dependent on the image, not on preceding images or captions.

sps · October 25, 2024, 11:45am

The outputs are inevitably going to influence each other when the images are sent this way. However, to see how much they are influenced, I ran an experiment with Gemini 1.5 Flash 002 at a constant temperature of 0, first with the batch and then with each image individually.

Here are the results.

Batch

{
  "response": [
    {
      "caption": "A light brown wooden chair with a simple design is shown in a studio setting against a white background.",
      "hasHumans": false
    },
    {
      "caption": "Close-up view of an adorable tabby kitten lying down on a light-colored surface. The kitten's fur is predominantly brown and black striped, and its eyes are a striking green.",
      "hasHumans": false
    },
    {
      "caption": "A fresh and vibrant salad in a bowl. The salad consists of chopped tomatoes, cucumbers, red onions, feta cheese, and fresh herbs like parsley and cilantro. The ingredients are arranged in a visually appealing manner, and the overall presentation suggests a healthy and delicious meal.",
      "hasHumans": false
    },
    {
      "caption": "A man wearing a helmet and a backpack is riding a bicycle on a city road. The view is from behind, showing the man's back and the bicycle. The background includes a cityscape with buildings and a bridge.",
      "hasHumans": true
    }
  ]
}

Image 1

{
  "response": [
    {
      "caption": "A light brown wooden chair with a curved back and seat sits on a white background.",
      "hasHumans": false
    }
  ]
}

Image 2

{
  "response": [
    {
      "caption": "Close-up of a tabby kitten lying down on a light-colored surface. The kitten is looking directly at the camera.",
      "hasHumans": false
    }
  ]
}

Image 3

{
  "response": [
    {
      "caption": "A bowl of fresh salad with feta cheese, tomatoes, cucumbers, red onion, and lettuce.",
      "hasHumans": false
    }
  ]
}

Image 4

{
  "response": [
    {
      "caption": "A man wearing a helmet and a backpack is riding a bicycle on a paved road next to a bridge.",
      "hasHumans": true
    }
  ]
}
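
If each caption must depend only on its own image, the alternative is one request per image with a cap on how many run at once. Below is a minimal sketch of that pattern in plain TypeScript, with no SDK calls: mapWithLimit and withRetry are hypothetical helpers, the concurrency limit and delays are placeholder values to tune against your own quota, and for simplicity every failure is retried, whereas real code would probably retry only rate-limit (HTTP 429) errors.

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a task with exponential backoff: baseDelayMs, then 2x, 4x, ...
// Rethrows the last error once maxAttempts is reached.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Runs worker(item) for every item with at most `limit` calls in flight.
// onResult fires as soon as each item finishes, so results arrive in
// completion order rather than input order.
async function mapWithLimit<I, O>(
  items: I[],
  limit: number,
  worker: (item: I) => Promise<O>,
  onResult: (item: I, output: O) => void,
): Promise<void> {
  let next = 0; // index of the next unclaimed item, shared by all runners
  async function runner(): Promise<void> {
    while (next < items.length) {
      const item = items[next++];
      onResult(item, await worker(item));
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
}
```

With a hypothetical captionImage(path) function that sends a single image, usage would look roughly like mapWithLimit(paths, 4, (p) => withRetry(() => captionImage(p)), (p, caption) => showCaption(p, caption)). Note that if one image still fails after all retries, the returned promise rejects while the other runners keep going, so you may prefer to catch errors inside the worker and report them per image.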
