Let’s say I have a long article with multiple images within it, which I need to include in the prompt, for reference. Imagine photos of various places or people, with descriptions.
In the API, I see that the input is broken down into separate parts (text or image). How do I piece it back together, so the model knows which text parts relate to which image parts?
I think I need either:
A. Somehow keep those images within the original, logical flow of the text
B. Give the images some additional labels, so that the connection can be inferred
Can it be achieved? Or am I missing some other way?
Thanks.
To accurately replicate the flow between text and images (as we humans would see it), the API uses the Content type: https://ai.google.dev/api/rest/v1beta/Content
It consists of a list of Parts. Each Part can be text or an image, and there are two ways to provide image Parts: (a) as inlineData, which is good enough if you have a small number of images, say under 10, as you would expect in something like a magazine article, or (b) as fileData, by including a URI to the image previously uploaded to Google’s cloud.
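For example, a single user Content with interleaved text and image Parts could look roughly like this (just a sketch; the base64 string and file URI are placeholders, field names as in the REST docs):

// One Content object: text and image Parts interleaved in reading order
const content = {
  role: 'user',
  parts: [
    { text: 'Opening paragraphs of the article...' },
    { inlineData: { mimeType: 'image/jpeg', data: '<base64-encoded image>' } },
    { text: 'Description of the place shown in the photo above...' },
    { fileData: { mimeType: 'image/jpeg', fileUri: '<uri of a previously uploaded image>' } },
    { text: 'Rest of the article...' }
  ]
};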
You’ve outlined the two approaches correctly. Though you don’t need to “piece it back together”, since the model will take care of that for you: as it looks at the list of parts, it turns them into a single stream of tokens.
@OrangiaNebula has outlined how to break it into these two types of parts.
In both methods, it makes sense to have chunks of text in between the image parts (either inline or file image data). With your method A, the flow typically works well without specific annotations. With method B, it is common to annotate each image with a text part before it that says something like “This is image 1:” and so forth, and then reference the images later in your full text prompt.
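For example (a rough sketch; the variable names are placeholders), the parts list for method B could look like:

const parts = [
  // Label each image with a short text part immediately before it
  { text: 'This is image 1:' },
  { inlineData: { mimeType: 'image/jpeg', data: placeImageBase64 } },
  { text: 'This is image 2:' },
  { inlineData: { mimeType: 'image/jpeg', data: personImageBase64 } },
  // ...then refer to those labels from the full text prompt
  { text: 'Article text... The place shown in image 1 is ..., and the person in image 2 is ...' }
];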
One issue is that once the text and images are turned into tokens, the fact that a particular part was an image is lost. So you can’t refer to “the image below”, because the model doesn’t see an image there anymore. This is why specifically labeling each image with a text part is important.
model.generateContentStream([
'Article text...',
{ inlineData: { data: smallBase64image, mimeType: 'image/jpeg' } },
'Person A description',
{ inlineData: { data: smallBase64image, mimeType: 'image/jpeg' } },
'Person B description',
{ fileData: { data: largeFileBlobImage } },
'Person C description',
'Rest of the article...'
])
Correct?
Does the fileData version (with the URI) have to be on Google Cloud? I already have the images hosted on my CDN, so it would be much easier to link to them than to re-upload them elsewhere.
Yes, you have clearly understood the concept. There are libraries available for JavaScript that might make it a bit easier to handle mixed-modality content. I’m not strong on js and used the Python examples earlier.
The uri: I’m pretty sure it has to be on Google’s cloud; the way I know works is to use the File API (which is part of the overall API). The upload returns a (storage) File object (see “REST Resource: files” on Google AI for Developers), which contains the uri that you can later use when defining the fileData part.
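Roughly, the flow looks like this (a sketch assuming the GoogleAIFileManager helper from @google/generative-ai/server; I’m more at home on the Python side, so treat it as untested and adjust to your SDK version):

// Upload once, then reuse the returned uri in fileData parts
import { GoogleAIFileManager } from '@google/generative-ai/server';

const fileManager = new GoogleAIFileManager(process.env.GOOGLE_API_KEY);

const upload = await fileManager.uploadFile('./photos/person-c.jpg', {
  mimeType: 'image/jpeg',
  displayName: 'Person C photo',
});

// upload.file.uri is what goes into the fileData part later
console.log(upload.file.uri);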
The File API has some footnotes attached: on the free tier, the files get auto-removed two days later and overall storage is capped at 20 GB. It’s clearly intended to support development, period.
You have the gist of it, correct. (Although the fileData part requires a fileUri field, not a data field.)
The fileData parts must use either a File API URI (if you’re using the AI Studio Gemini API) or a Google Cloud Storage “gs://” URI (if you’re using the Vertex AI Gemini API). You cannot use an external URL.
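So, to adapt your snippet, the third image part would look something like this (with the uri coming from a prior File API upload; the variable name is just illustrative):

// Corrected fileData part: fileUri from the File API, plus the mimeType
{ fileData: { mimeType: 'image/jpeg', fileUri: uploadedFile.uri } },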