Let’s say I have a long article with multiple images within it, which I need to include in the prompt, for reference. Imagine photos of various places or people, with descriptions.
In the API, I see that the input is broken down into separate parts (text or image). How do I piece it back together, so the model knows which text parts relate to which image parts?
I think I need either:
A. Somehow keep those images within the original, logical flow of the text
B. Give the images some additional labels, so that the connection can be inferred
Can it be achieved? Or am I missing some other way?
Thanks.
To accurately replicate the flow between text and images (as we humans would see it), the API uses the Content type: https://ai.google.dev/api/rest/v1beta/Content
It consists of a list of Parts. Each Part can be text or an image, and there are two ways to provide image Parts: (a) as inlineData, which is good enough if you have a small number of images, say under 10, as you would expect in something like a magazine article, or (b) as fileData, by including a URI to the image previously uploaded to Google’s cloud.
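For example, a single user Content with interleaved text and image Parts could look roughly like this (just a sketch; the base64 string and file URI are placeholders, field names as in the REST docs):

// One Content object: text and image Parts interleaved in reading order
const content = {
  role: 'user',
  parts: [
    { text: 'Opening paragraphs of the article...' },
    { inlineData: { mimeType: 'image/jpeg', data: '<base64-encoded image>' } },
    { text: 'Description of the place shown in the photo above...' },
    { fileData: { mimeType: 'image/jpeg', fileUri: '<uri of a previously uploaded image>' } },
    { text: 'Rest of the article...' }
  ]
};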
You’ve outlined the two approaches correctly. Though you don’t need to “piece it back together”, since the model will take care of that for you: as it looks at the list of parts, it turns them into a single stream of tokens.
@OrangiaNebula has outlined how to break it into these two types of parts.
In both methods, it makes sense to have chunks of text in between the image parts (either inline or file image data). With your method A, the flow typically works well without specific annotations. With method B, it is common to annotate each image with a text part before it that says something like “This is image 1:” and so forth, and then reference the images later in your full text prompt.
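For example (a rough sketch; the variable names are placeholders), the parts list for method B could look like:

const parts = [
  // Label each image with a short text part immediately before it
  { text: 'This is image 1:' },
  { inlineData: { mimeType: 'image/jpeg', data: placeImageBase64 } },
  { text: 'This is image 2:' },
  { inlineData: { mimeType: 'image/jpeg', data: personImageBase64 } },
  // ...then refer to those labels from the full text prompt
  { text: 'Article text... The place shown in image 1 is ..., and the person in image 2 is ...' }
];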
One issue is that once the text and images are turned into tokens, the fact that a particular part was an image is lost. So you can’t refer to “the image below”, because the model doesn’t see an image there anymore. This is why specifically labeling each image with a text part is important.
model.generateContentStream([
'Article text...',
{ inlineData: { data: smallBase64image, mimeType: 'image/jpeg' } },
'Person A description',
{ inlineData: { data: smallBase64image, mimeType: 'image/jpeg' } },
'Person B description',
{ fileData: { data: largeFileBlobImage } },
'Person C description',
'Rest of the article...'
])
Correct?
Does the fileData version (with the URI) have to be on Google Cloud? I already have the images hosted on my CDN, so it would be much easier to link to them than to re-upload them elsewhere.
Yes, you have clearly understood the concept. There are libraries available for JavaScript that might make it a bit easier to handle mixed-modality content. I’m not strong on js and used the Python examples earlier.
The uri: I’m pretty sure it has to be on Google’s cloud; the way I know works is to use the File API (which is part of the overall API). The upload returns a (storage) File object (see “REST Resource: files” on Google AI for Developers), which contains the uri that you can later use when defining the fileData part.
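Roughly, the flow looks like this (a sketch assuming the GoogleAIFileManager helper from @google/generative-ai/server; I’m more at home on the Python side, so treat it as untested and adjust to your SDK version):

// Upload once, then reuse the returned uri in fileData parts
import { GoogleAIFileManager } from '@google/generative-ai/server';

const fileManager = new GoogleAIFileManager(process.env.GOOGLE_API_KEY);

const upload = await fileManager.uploadFile('./photos/person-c.jpg', {
  mimeType: 'image/jpeg',
  displayName: 'Person C photo',
});

// upload.file.uri is what goes into the fileData part later
console.log(upload.file.uri);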
The File API has some footnotes attached: on the free tier, the files get auto-removed two days later and overall storage is capped at 20 GB. It’s clearly intended to support development, period.
You have the gist of it, correct. (Although the fileData part requires a fileUri field, not a data field.)
The fileData parts must use either a File API URI (if you’re using the AI Studio Gemini API) or a Google Cloud Storage “gs://” URI (if you’re using the Vertex AI Gemini API). You cannot use an external URL.
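So, to adapt your snippet, the third image part would look something like this (with the uri coming from a prior File API upload; the variable name is just illustrative):

// Corrected fileData part: fileUri from the File API, plus the mimeType
{ fileData: { mimeType: 'image/jpeg', fileUri: uploadedFile.uri } },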