Gemini 2.0 Flash results on the website don't match the API results for the same prompt

The model and parameters in my API call are identical to those on the official website.

Hi @yang_fa, welcome to the forum!

Are you using Google AI Studio? Could you please share an example or a screenshot so we can understand the situation better?

Thanks

Yes, I'm using Gemini 2.0 Flash.
Here is an example:


prompt:

Instruction:
Given the coordinates of text boxes in an image, determine whether adjacent boxes belong to the same paragraph by considering both visual and semantic information. Follow these steps:
Analyze Visual Position:
Note that vertical stacking alone does not imply the same paragraph. The visual layout must "look like a paragraph" based on typical paragraph formatting conventions.
The ID of each box is located on the left side of the box.

Evaluate Text Semantic Content:
Check if the text in adjacent boxes is logically coherent (e.g., continuation of the same sentence, same theme, or related context).
Look for punctuation marks (e.g., periods, commas) that might indicate sentence boundaries.
Ensure that the text forms a fluent paragraph with appropriate connecting words.
Remember that content relevance does not necessarily mean they belong to the same paragraph in semantics.
The text should not be treated as individual information units but as parts of a coherent paragraph.
[important!]The text should be treated as parts of a coherent paragraph, not as individual information units.
[important!]When assessing semantic coherence, try concatenating the text from adjacent boxes to see if it forms a fluent sentence or paragraph.
Combine Information:
Use both visual and semantic cues to decide if adjacent boxes belong to the same paragraph.
Boxes with aligned right edges and coherent text content are likely part of the same paragraph.
[important!]Even if boxes are vertically stacked and content-related, they may not belong to the same paragraph unless they visually "look like a paragraph" and form a coherent text.
Output Format:
Return the results exclusively in the format [[0], [1, 2], [3, 4, 5], ...], where each sublist contains the IDs of boxes belonging to the same paragraph.
Do not include any additional text, explanations, or content.
Ensure that all text boxes in the image are accounted for in the output.
Note:
Even if boxes are visually stacked and contain similar content, they should not be considered part of the same paragraph if there is no logical connection between them.
A paragraph should have a fluent flow of sentences and connecting words, not just individual pieces of information.
You should not overly focus on textual coherence and neglect visual formatting cues.
Consider all text boxes in the image and ensure none are omitted from the output.
[important!]Special attention should be given to author names and affiliations: even if they are visually close, they should be treated as separate paragraphs due to their typical similar formatting (font, size, etc.) and semantic independence.
[important!]Special attention should be given to author names: even if they are visually close, they should be treated as separate paragraphs due to their typical similar formatting (font, size, etc.) and semantic independence.
[important!]Do not rely overly on thematic relevance; instead, consider semantic coherence and visual continuity.
Please group title texts together. Titles are typically characterized by larger font sizes, bold formatting, and text that summarizes the main content of a document or section.
Example:
Example 1:
Input: Two adjacent boxes with text "John Doe" and "ABC Corporation".
Output: [[0], [1]]
Explanation: Even though the boxes are adjacent and contain related content (a name and an institution), they do not form a coherent paragraph and should be treated as separate.
Example 2:
Input: Two adjacent boxes with text "Yuan Liu1,2∗ Cheng Lin2∗ Zijiao Zeng2 Xiaoxiao Long1† Lingjie Liu3" and "Taku Komura1 Wenping Wang4".
Output: [[0], [1]]
Explanation: Even though the boxes are adjacent and contain related content (name), they do not form a coherent paragraph and should be treated as separate.

The output from Google AI Studio is completely correct, but the output from the API is: [[0], [1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11], [12, 13, 14], [15, 16, 17, 18], [19, 20, 21, 22, 23], [24, 25, 26], [27]]
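
For anyone who wants to compare the two outputs programmatically rather than by eye, here is a minimal sketch using only the Python standard library. It parses a reply in the `[[0], [1, 2], ...]` format requested by the prompt, checks that every box ID is accounted for, and lists the boxes whose grouping differs between two replies. The helper names (`parse_groups`, `check_coverage`, `differing_boxes`) are hypothetical, and the box count of 28 is an assumption inferred from the highest ID in the API output above. The AI Studio output is not reproduced in this thread, so it is not included here.

```python
import ast


def parse_groups(raw: str) -> list[list[int]]:
    """Parse a model reply such as '[[0], [1, 2]]' into a list of ID groups."""
    groups = ast.literal_eval(raw.strip())
    is_valid = isinstance(groups, list) and all(
        isinstance(g, list)
        and all(isinstance(i, int) and not isinstance(i, bool) for i in g)
        for g in groups
    )
    if not is_valid:
        raise ValueError("expected a list of lists of integer box IDs")
    return groups


def check_coverage(groups: list[list[int]], num_boxes: int):
    """Return (missing, duplicated) IDs for boxes numbered 0..num_boxes-1."""
    seen = [i for g in groups for i in g]
    missing = sorted(set(range(num_boxes)) - set(seen))
    duplicated = sorted({i for i in seen if seen.count(i) > 1})
    return missing, duplicated


def differing_boxes(a: list[list[int]], b: list[list[int]]) -> list[int]:
    """Return box IDs whose paragraph group differs between two outputs."""
    def membership(groups):
        return {i: frozenset(g) for g in groups for i in g}

    ma, mb = membership(a), membership(b)
    return sorted(i for i in ma.keys() | mb.keys() if ma.get(i) != mb.get(i))


# API output quoted in the post above.
api_raw = (
    "[[0], [1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11], [12, 13, 14], "
    "[15, 16, 17, 18], [19, 20, 21, 22, 23], [24, 25, 26], [27]]"
)
api_groups = parse_groups(api_raw)

# Assumption: the image contains 28 boxes, with IDs 0 through 27.
missing, duplicated = check_coverage(api_groups, num_boxes=28)
```

Passing the AI Studio reply through `parse_groups` and then calling `differing_boxes(studio_groups, api_groups)` would show exactly which boxes the two endpoints group differently, which may help narrow down where the results diverge.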


Hey @yang_fa, let me get back to you on this.


Hello, any news on this? I'm also getting very different results from the API vs. Google AI Studio for the exact same instructions, tools, and parameters.