Issue with multimodal few-shot prompting in gemini-flash

Context:
I'm trying to create a few-shot prompt for extracting text from receipts and converting the output to JSON. During testing, I noticed that Gemini does not return results for the target image, but instead returns results from the examples.

Here's what my prompt looks like:

System instruction: You are an expert at extracting medical receipt info. …

** BEGIN OF EXAMPLES **


Receipt Image:

[image]

Receipt Analysis JSON:

{"encounters": [xxxxx]}


Receipt Image:

[image]

Receipt Analysis JSON:

{"xxxxx"}


Receipt Image:

[image]

Receipt Analysis JSON:

{xxxx}


Receipt Image:

[image]

Receipt Analysis JSON:

{"encounters": [xxxx]}


Receipt Image:

[image]

Receipt Analysis JSON:

{"encounters": [xxxx]}


** END OF EXAMPLES **

Now, extract the information from the following receipt:

Receipt Image:

[image]

Receipt Analysis JSON:

Note: this is the prompt as entered in Vertex AI Studio's freeform playground; each [image] represents an actual image I uploaded.
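For reference, the same interleaved prompt expressed via the Vertex AI Python SDK would look roughly like the sketch below (the project, location, model name, and gs:// image paths are placeholders, and the example JSON is abbreviated):

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Placeholders: adjust project, location, model name, and image URIs.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are an expert at extracting medical receipt info. ...",
)

# Interleave text and image parts exactly as laid out in the playground prompt.
contents = [
    "** BEGIN OF EXAMPLES **",
    "Receipt Image:",
    Part.from_uri("gs://my-bucket/example_receipt_1.jpg", mime_type="image/jpeg"),
    "Receipt Analysis JSON:",
    '{"encounters": ["..."]}',
    # ...remaining examples go here in the same pattern...
    "** END OF EXAMPLES **",
    "Now, extract the information from the following receipt:",
    "Receipt Image:",
    Part.from_uri("gs://my-bucket/target_receipt.jpg", mime_type="image/jpeg"),
    "Receipt Analysis JSON:",
]

response = model.generate_content(contents)
print(response.text)
```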

I have tried enforcing this at the prompt level (e.g. "do not return anything from the examples"), but it doesn't work. Any suggestions would be appreciated, thanks in advance!

I can confirm I had the exact same observation. Interestingly, if you change the output from JSON to simple text mode (a bullet list), the model behaves as expected: the "example" images are processed, and only the items in the actual question are returned in the bulleted list. It's when using JSON output that the model switches to outputting everything, the examples as well as the last image you are actually asking about.
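For what it's worth, in the Python SDK the only difference between the two runs would be the generation config, roughly like this sketch (the JSON-mode setting is the assumption here; the model call itself is unchanged):

```python
from vertexai.generative_models import GenerationConfig

# JSON output mode: the run where the example receipts leak into the result.
# (On newer SDK versions a response_schema can also be attached here.)
json_config = GenerationConfig(response_mime_type="application/json")

# Plain text mode (bullet list): the run that behaves as expected.
text_config = GenerationConfig()  # default text output

# response = model.generate_content(contents, generation_config=json_config)
```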

Interesting, so essentially to make it work directly, I need to remove the JSON schema and the enforcement at the prompt level. I'm curious if you have tried:
OCR => Gemini? (so the input now becomes text) and compared performance?
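Roughly what I have in mind is a two-step pipeline like the sketch below (assuming the Cloud Vision API for the OCR step; the model name and file path are placeholders):

```python
from google.cloud import vision
from vertexai.generative_models import GenerativeModel

# OCR step: pull the raw text off the receipt image with Cloud Vision.
vision_client = vision.ImageAnnotatorClient()
with open("receipt.jpg", "rb") as f:
    image = vision.Image(content=f.read())
ocr_text = vision_client.document_text_detection(image=image).full_text_annotation.text

# Gemini step: the few-shot examples and the target are now plain text,
# so no images are sent to the model at all.
model = GenerativeModel("gemini-1.5-flash")  # assumes vertexai.init(...) was called
prompt = (
    "Extract the medical receipt info below as JSON.\n\n"
    "Receipt text:\n" + ocr_text + "\n\nReceipt Analysis JSON:"
)
response = model.generate_content(prompt)
print(response.text)
```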

No, my sample images involved collections of objects, not OCR.