It’s possible this is already well known here, but I just discovered that when I use the API to send a PDF file to the model (and I think the same is true for image files), the input is not just the original PDF image: the model also receives 4 cropped versions of the image, along with the raw OCR text. First, this shows up when I call count_tokens on the Part object containing the PDF file: there are 5x as many tokens as one image should generate. Second, if I follow the file with the prompt "how many total images did i just show you?", the model responds with "You showed me a total of 5 images: one original image and four different crops of that image."
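The token-count check above can be sketched as a small helper. This is only a rough illustration: the function name is hypothetical, and the default of 258 tokens per image is an assumption (substitute whatever count_tokens reports for a single plain image with your model). The total would come from calling count_tokens on the Part; no API call is made here.

```python
def implied_image_count(total_tokens: int, tokens_per_image: int = 258) -> tuple[int, int]:
    """Estimate how many images a token total corresponds to.

    Returns (image_count, leftover_tokens). Leftover tokens may come from
    accompanying text such as OCR output or wrapper instructions.
    tokens_per_image=258 is an assumed per-image cost, not a guaranteed value.
    """
    if tokens_per_image <= 0:
        raise ValueError("tokens_per_image must be positive")
    # divmod gives the whole number of images and the remainder in one step.
    return divmod(total_tokens, tokens_per_image)


if __name__ == "__main__":
    # Hypothetical total, as if reported by count_tokens for one uploaded page.
    images, leftover = implied_image_count(1290)
    print(f"implied images: {images}, leftover tokens: {leftover}")
```

If the result is 5 rather than 1, that matches the "one original plus four crops" behavior described above; a non-zero leftover would be consistent with extra text being sent alongside the images.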
After some prompting, the model will also reveal the text that accompanied the images when they were fed in, which is the following:
Here is the original image:
and here are the different crops of this image to help you see better, use these only as hints:
This is all fine, but the problem is that when I play the same game in AI Studio, it turns out the model there is being fed 9 crops of the image (a 3x3 grid, I would imagine) in addition to the original. This, unsurprisingly, produces a discrepancy between the responses I get in AI Studio and the ones I get when calling the API directly.
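To make the grid guess concrete, here is a minimal sketch of how an n x n set of crop boxes could be computed for a page image. This is purely illustrative of my guess: the function name and the even-grid layout are assumptions, and nothing here reflects what the API or AI Studio actually does internally.

```python
def grid_crop_boxes(width: int, height: int, n: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes for an n x n grid of crops.

    Hypothetical layout: the image is split into n columns and n rows of
    roughly equal size, listed row by row from the top-left corner.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    boxes = []
    for row in range(n):
        for col in range(n):
            # Integer division keeps the boxes contiguous with no gaps.
            left = col * width // n
            top = row * height // n
            right = (col + 1) * width // n
            bottom = (row + 1) * height // n
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    print(len(grid_crop_boxes(900, 600, 2)))  # 2x2 grid, 4 crops
    print(len(grid_crop_boxes(900, 600, 3)))  # 3x3 grid, 9 crops
```

Under this guess, n=2 would give the 4 crops I see through the API and n=3 the 9 crops I see in AI Studio.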
So the question is: is this configurable? Can I decide the level of crop-granularity that I want, or possibly disable this cropping behavior altogether? Or am I stuck here not knowing what will occur under the hood when I upload a file?