Why is Gemini 2.5 Flash (with thinking_budget=0) slower than Gemini 2.0 Flash?

I am using the official example and only doing text generation, with the same prompt for both models.

Gemini 2.0 Flash:
— 7.378899812698364 seconds —
Prompt tokens: 56292
Output tokens: 2009
Total tokens: 58301

Gemini 2.5 Flash:
— 15.149184942245483 seconds —
Prompt tokens: 56393
Thoughts tokens: None
Output tokens: 3311
Total tokens: 59704
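
For reference, here is a minimal sketch of the kind of timing harness that could produce numbers like the ones above, assuming the google-genai Python SDK; the `GEMINI_API_KEY` environment variable, the placeholder prompt, and the helper function are my own assumptions, not the exact script used here.

```python
import os
import time

from google import genai
from google.genai import types

# Assumes GEMINI_API_KEY is set in the environment.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

prompt = "..."  # the same long article + instructions, used for both models

def time_generation(model, config=None):
    start = time.time()
    response = client.models.generate_content(
        model=model, contents=prompt, config=config
    )
    usage = response.usage_metadata
    print(f"--- {time.time() - start} seconds ---")
    print("Prompt tokens:", usage.prompt_token_count)
    print("Thoughts tokens:", usage.thoughts_token_count)
    print("Output tokens:", usage.candidates_token_count)
    print("Total tokens:", usage.total_token_count)

# Gemini 2.0 Flash has no thinking stage, so no thinking_config is passed.
time_generation("gemini-2.0-flash")

# Gemini 2.5 Flash with thinking disabled via thinking_budget=0.
time_generation(
    "gemini-2.5-flash",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
```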


Hi @Tiffany, welcome to the community.
Thanks for sharing these detailed metrics! It’s helpful to see a direct comparison of the text-generation performance of Gemini 2.0 Flash and Gemini 2.5 Flash with the same prompt.
A few initial observations based on the data you’ve provided:

Latency: Gemini 2.5 Flash took roughly twice as long (about 15.1 s vs. 7.4 s). This difference could be due to various factors, including the model architecture, the complexity of the generation task, and the current server load.
Token Usage: Gemini 2.5 Flash utilized slightly more prompt tokens (56,393 vs. 56,292) and significantly more output tokens (3,311 vs. 2,009) than Gemini 2.0 Flash. This suggests that Gemini 2.5 Flash might be generating longer and potentially more detailed responses for the same input.
To understand this further, could you share a bit more context about your use case? For example:

What is the nature of the prompt you are using? (e.g., creative writing, factual question answering, code generation, etc.)
Are you experiencing similar latency differences consistently?
Have you noticed any significant differences in the quality or relevance of the generated text between the two models?

@Pannaga_J Thank you so much for the follow-up. My test case involves giving the model a long article and asking it to generate 50 question-and-answer pairs. Even if I set thinking_budget=0, is there any possibility that Gemini 2.5 Flash is still engaging in thinking? I’ve found some discussions about this. In this case, I’m using exactly the same prompt, but the prompt tokens show 56,292 and 56,393. Does this difference of 101 tokens represent thinking tokens? I appreciate your help.

Hey @Tiffany, thanks for the follow-up! Even with thinking_budget=0, Gemini 2.5 Flash still needs to process the long article and your request to generate Q&A pairs. This inherent processing is different from the explicit ‘thinking’ that thinking_budget controls.

The 101-token difference in prompt tokens most likely comes from differences in how each model version tokenizes the very long article, not from thinking tokens. Thinking tokens are reported separately (the ‘Thoughts tokens’ field) and count toward the output, not the prompt.
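
If you want to check this yourself, one rough way (a sketch, assuming the google-genai Python SDK; the model IDs and the `GEMINI_API_KEY` variable are my placeholders) is to run count_tokens on the same article against both models and compare the counts, since no generation is involved:

```python
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

article = "..."  # the same long article used in both runs

# count_tokens only tokenizes the input; no generation (and no thinking) happens,
# so any difference in the counts reflects how each model version tokenizes the text.
for model in ("gemini-2.0-flash", "gemini-2.5-flash"):
    result = client.models.count_tokens(model=model, contents=article)
    print(model, "->", result.total_tokens, "tokens")
```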

Hope this clarifies things!

Thank you for the answer. So in your opinion, if the article is long, is Flash 2.0 faster than Flash 2.5? If the prompt is short, will Flash 2.5 be faster? If so, do you have an estimated token-count threshold above which Flash 2.5 provides faster inference than Flash 2.0? In my use case, we would like to try Flash 2.5 only if its inference speed is faster than Flash 2.0’s.