Issue with multimodal few-shot prompting in gemini-flash

Context:
I'm trying to create a few-shot prompt for extracting text from receipts and converting the output to JSON. During testing, I noticed that Gemini does not return results for the target image, but instead returns results from the examples.

Here's what my prompt looks like:

System instruction: You are an expert at extracting medical receipt info. …

** BEGIN OF EXAMPLES **


Receipt Image:

[image]

Receipt Analysis JSON:

{"encounters": [xxxxx]}


Receipt Image:

[image]

Receipt Analysis JSON:

{"xxxxx"}


Receipt Image:

[image]

Receipt Analysis JSON:

{xxxx}


Receipt Image:

[image]

Receipt Analysis JSON:

{"encounters": [xxxx]}


Receipt Image:

[image]

Receipt Analysis JSON:

{"encounters": [xxxx]}


** END OF EXAMPLES **

Now, extract the information from the following receipt:

Receipt Image:

[image]

Receipt Analysis JSON:

Note: this is the prompt as entered in Vertex AI Studio's freeform playground; each [image] represents an actual image I uploaded.
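For reference, the same interleaved prompt expressed via the Vertex AI Python SDK would look roughly like the sketch below (the project, location, model name, and gs:// image paths are placeholders, and the example JSON is abbreviated):

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Placeholders: adjust project, location, model name, and image URIs.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are an expert at extracting medical receipt info. ...",
)

# Interleave text and image parts exactly as laid out in the playground prompt.
contents = [
    "** BEGIN OF EXAMPLES **",
    "Receipt Image:",
    Part.from_uri("gs://my-bucket/example_receipt_1.jpg", mime_type="image/jpeg"),
    "Receipt Analysis JSON:",
    '{"encounters": ["..."]}',
    # ...remaining examples go here in the same pattern...
    "** END OF EXAMPLES **",
    "Now, extract the information from the following receipt:",
    "Receipt Image:",
    Part.from_uri("gs://my-bucket/target_receipt.jpg", mime_type="image/jpeg"),
    "Receipt Analysis JSON:",
]

response = model.generate_content(contents)
print(response.text)
```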

I have tried enforcing this at the prompt level (e.g. "do not return anything from the examples"), but it doesn't work. Any suggestions would be appreciated, thanks in advance!

I can confirm I had the exact same observation. Interestingly, if you change the output from JSON to simple text mode (a bullet list), the model behaves as expected: the "example" images are processed, and only the items in the actual question are returned in the bulleted list. It's when using JSON output that the model switches to outputting everything, the examples as well as the last image you are actually asking about.
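For what it's worth, in the Python SDK the only difference between the two runs would be the generation config, roughly like this sketch (the JSON-mode setting is the assumption here; the model call itself is unchanged):

```python
from vertexai.generative_models import GenerationConfig

# JSON output mode: the run where the example receipts leak into the result.
# (On newer SDK versions a response_schema can also be attached here.)
json_config = GenerationConfig(response_mime_type="application/json")

# Plain text mode (bullet list): the run that behaves as expected.
text_config = GenerationConfig()  # default text output

# response = model.generate_content(contents, generation_config=json_config)
```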

Interesting, so essentially to make it work directly, I need to remove the JSON schema and the enforcement at the prompt level. I'm curious if you have tried:
OCR => Gemini? (so the input now becomes text) and compared performance?
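Roughly what I have in mind is a two-step pipeline like the sketch below (assuming the Cloud Vision API for the OCR step; the model name and file path are placeholders):

```python
from google.cloud import vision
from vertexai.generative_models import GenerativeModel

# OCR step: pull the raw text off the receipt image with Cloud Vision.
vision_client = vision.ImageAnnotatorClient()
with open("receipt.jpg", "rb") as f:
    image = vision.Image(content=f.read())
ocr_text = vision_client.document_text_detection(image=image).full_text_annotation.text

# Gemini step: the few-shot examples and the target are now plain text,
# so no images are sent to the model at all.
model = GenerativeModel("gemini-1.5-flash")  # assumes vertexai.init(...) was called
prompt = (
    "Extract the medical receipt info below as JSON.\n\n"
    "Receipt text:\n" + ocr_text + "\n\nReceipt Analysis JSON:"
)
response = model.generate_content(prompt)
print(response.text)
```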

No, my sample images involved collections of objects, not OCR.