Bulk Processing Images Without Batching

I have a couple hundred photos that I would like to generate individual captions for, using a prompt. I would like to process them concurrently, without running into rate-limit errors and without using batching (I would like to show each caption as its image finishes).

What’s the recommended way of doing this with Gemini API + TypeScript / Swift?

Welcome to the community @BrianHung

One of the advantages of Gemini models is their very large context window. This means that you can send many images in a single request and ask the model to return a JSON list using structured outputs, where each element is an object containing the details you want about the corresponding image.

I made a sample in AI Studio, but sharing isn’t working, so here’s the JSON schema you can use in your own AI Studio instance after enabling JSON mode:

{
  "type": "object",
  "properties": {
    "response": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "caption": {
            "type": "string",
            "description": "Image caption"
          },
          "hasHumans": {
            "type": "boolean",
            "description": "True if the image has humans in it, otherwise false"
          }
        },
        "required": [
          "caption",
          "hasHumans"
        ]
      }
    }
  },
  "required": [
    "response"
  ]
}
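
If you consume this from TypeScript, it can help to mirror the schema in a type and check the reply at runtime before showing anything. The following is only a sketch: `ImageCaption` and `parseCaptionResponse` are names made up for illustration, and it assumes you already have the model's reply as a JSON string.

```typescript
// Shape of one element in the "response" array defined by the schema above.
interface ImageCaption {
  caption: string;
  hasHumans: boolean;
}

// Hypothetical helper: parses the reply text and checks the required fields.
// Throws if the text is not valid JSON or an element does not match the schema.
function parseCaptionResponse(jsonText: string): ImageCaption[] {
  const parsed: unknown = JSON.parse(jsonText);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Expected a JSON object at the top level");
  }
  const response = (parsed as { response?: unknown }).response;
  if (!Array.isArray(response)) {
    throw new Error('Expected "response" to be an array');
  }
  return response.map((item: unknown, index: number) => {
    const entry = item as Partial<ImageCaption> | null;
    if (
      typeof entry !== "object" ||
      entry === null ||
      typeof entry.caption !== "string" ||
      typeof entry.hasHumans !== "boolean"
    ) {
      throw new Error(`Element ${index} does not match the schema`);
    }
    return { caption: entry.caption, hasHumans: entry.hasHumans };
  });
}
```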

Set the system prompt to:

You reply with a description of the images in the respective order in the specified JSON format.

Set the model to “Gemini 1.5 Flash” or Pro, depending on your use case.

Then simply add the images. There is no need to supply any text; just make the generate content request.

I tried with four of the sample images and got the following response (prettified):

{
  "response": [
    {
      "caption": "A wooden chair with a light brown finish.",
      "hasHumans": false
    },
    {
      "caption": "A tabby kitten looking at the camera.",
      "hasHumans": false
    },
    {
      "caption": "A bowl of salad with tomatoes, cucumbers, red onions, feta cheese, and lettuce.",
      "hasHumans": false
    },
    {
      "caption": "A man riding a bicycle on a road.",
      "hasHumans": true
    }
  ]
}
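
Since the reply carries no file names, the only link between a caption and its image is the position in the list. A minimal sketch of pairing them back up in TypeScript follows; `pairCaptionsWithImages` and the file names in the usage line are hypothetical, and the length check is there because nothing guarantees the model returns exactly one element per image.

```typescript
interface ImageCaption {
  caption: string;
  hasHumans: boolean;
}

interface CaptionedImage extends ImageCaption {
  fileName: string;
}

// Hypothetical helper: matches captions to images by position, in the same
// order the images were attached to the request.
function pairCaptionsWithImages(
  fileNames: string[],
  captions: ImageCaption[],
): CaptionedImage[] {
  if (fileNames.length !== captions.length) {
    // A count mismatch means positions can no longer be trusted.
    throw new Error(
      `Sent ${fileNames.length} images but received ${captions.length} captions`,
    );
  }
  return fileNames.map((fileName, i) => ({ fileName, ...captions[i] }));
}

// Usage with made-up file names:
// pairCaptionsWithImages(["chair.jpg", "kitten.jpg"], parsedCaptions);
```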

Thanks sps for the welcome. Do you know whether, with this approach, the token generations for each caption are uncorrelated with each other, as if the images had been sent individually?

I would want a caption to only be dependent on the image, not on preceding images or captions.

The outputs will inevitably influence each other when the images are sent together this way. To see how much, I ran an experiment with Gemini 1.5 Flash 002 at a constant temperature = 0, first for the batch and then for the individual images.

Here are the results.

Batch

{
  "response": [
    {
      "caption": "A light brown wooden chair with a simple design is shown in a studio setting against a white background.",
      "hasHumans": false
    },
    {
      "caption": "Close-up view of an adorable tabby kitten lying down on a light-colored surface. The kitten's fur is predominantly brown and black striped, and its eyes are a striking green.",
      "hasHumans": false
    },
    {
      "caption": "A fresh and vibrant salad in a bowl. The salad consists of chopped tomatoes, cucumbers, red onions, feta cheese, and fresh herbs like parsley and cilantro. The ingredients are arranged in a visually appealing manner, and the overall presentation suggests a healthy and delicious meal.",
      "hasHumans": false
    },
    {
      "caption": "A man wearing a helmet and a backpack is riding a bicycle on a city road. The view is from behind, showing the man's back and the bicycle. The background includes a cityscape with buildings and a bridge.",
      "hasHumans": true
    }
  ]
}

Image 1

{
  "response": [
    {
      "caption": "A light brown wooden chair with a curved back and seat sits on a white background.",
      "hasHumans": false
    }
  ]
}

Image 2

{
  "response": [
    {
      "caption": "Close-up of a tabby kitten lying down on a light-colored surface. The kitten is looking directly at the camera.",
      "hasHumans": false
    }
  ]
}

Image 3

{
  "response": [
    {
      "caption": "A bowl of fresh salad with feta cheese, tomatoes, cucumbers, red onion, and lettuce.",
      "hasHumans": false
    }
  ]
}

Image 4

{
  "response": [
    {
      "caption": "A man wearing a helmet and a backpack is riding a bicycle on a paved road next to a bridge.",
      "hasHumans": true
    }
  ]
}
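
If each caption has to depend only on its own image, sending one request per image with a cap on how many are in flight is one option. Below is a rough TypeScript sketch, not an official recipe: `captionOne` is a hypothetical function you would write around your Gemini client to caption a single image, and the concurrency limit, attempt count, and backoff delays are placeholder values to tune against your own quota.

```typescript
// Hypothetical signature: sends ONE image to the model and resolves with its caption.
type CaptionFn = (imagePath: string) => Promise<string>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a failing call with exponential backoff. This sketch retries on ANY
// error; in practice you would probably retry only rate-limit responses.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxAttempts) throw error;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Runs at most `concurrency` requests at a time and reports each caption as
// soon as its own request finishes, so results can be shown incrementally.
async function captionAll(
  imagePaths: string[],
  captionOne: CaptionFn,
  onCaption: (imagePath: string, caption: string) => void,
  concurrency = 5,
  baseDelayMs = 500,
): Promise<void> {
  let next = 0; // index of the next image not yet claimed by a worker
  const worker = async (): Promise<void> => {
    while (next < imagePaths.length) {
      const imagePath = imagePaths[next++];
      const caption = await withRetry(() => captionOne(imagePath), 4, baseDelayMs);
      onCaption(imagePath, caption);
    }
  };
  const workerCount = Math.min(concurrency, imagePaths.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}
```

Note that if one image still fails after all attempts, `captionAll` rejects while the other workers keep running; catch errors inside the worker instead if you would rather skip the failed image and continue.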