Hi,
Is it possible to enable/disable thinking or set thinking_tokens for Gemini 2.5 Flash using the OpenAI compatible API?
Thanks
I'm also interested in this question regarding controlling thinking with the OpenAI compatible API.
Waiting for updates.
It would be preferable to expose two model endpoints, one with thinking and one without, which would make integration easier. A bit like what OpenRouter currently offers.
Same here. I would like to see OpenAI compatibility with thinking tokens and the reasoning trace.
Same here as well; I want to be able to disable thinking mode.
The official documentation says to set the thinking budget to 0 if you want to disable thinking.
That works for the Gemini API, yes, but not for the OpenAI compatible API.
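For reference, here is a minimal sketch of that Gemini API route using the google-genai Python SDK; the thinking_config / thinking_budget field names and the model string are assumptions based on the documented thinking-budget setting, so check the current SDK reference before relying on them.

# Minimal sketch (native Gemini API, not the OpenAI-compatible endpoint):
# disable thinking by setting the thinking budget to 0. The field names
# below are assumptions based on the documented setting.
from google import genai
from google.genai import types

client = genai.Client(api_key="XXXX")  # your Gemini API key

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Which is larger: 2^30 or 3^20?",
    config=types.GenerateContentConfig(
        # A thinking budget of 0 disables thinking for this request.
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
print(response.text)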
Got one better: find a way to make the AI agent think less and generate more.
I am using a checkpoint approach and am currently building a large-scope project without typing a single line of code. So far the whole app is progressing slowly but stably, and one of the reasons is that I am coding and hosting the project entirely on a phone.
My last test run using my method resulted in:
130k+ tokens on load
thought for 13 s
generated for 200+ s
and the generated content was very precise.
I need some help building AITHER using this method, and I have plans for enhancing the current method.
If you need proof, please tell me how to share the data; this is my first login here.
There is an update from Logan. It's not yet possible, but the team is working on it.
@Stefan_Streichsbier
Is there a time estimate for when it's going to be supported?
thank you!
OpenAI compatibility for Gemini Flash is now available. You just have to set reasoning_effort to none, low, medium, or high.
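For anyone looking for a concrete call, here is a minimal sketch against the OpenAI-compatible endpoint using the openai Python client. The model name mirrors the one in the repro script further down this thread, and reasoning_effort is passed via extra_body so the sketch does not depend on a specific client version exposing it as a named argument.

# Minimal sketch: disabling thinking through the OpenAI-compatible endpoint
# by setting reasoning_effort. Passing the field through extra_body forwards
# it in the request body regardless of the installed openai client version.
from openai import OpenAI

client = OpenAI(
    api_key="XXXX",  # your Gemini API key
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "Which is larger: 2^30 or 3^20?"}],
    extra_body={"reasoning_effort": "none"},  # or "low", "medium", "high"
)
print(completion.choices[0].message.content)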
This is great news! Flash 2.5 is a super promising model for us. However, about 25% of the time we see very long latencies for simple requests (e.g. 5-7 s) that are inconsistent with Flash 2.0 and with other Flash 2.5 calls (which take roughly 1 s). I wonder if reasoning_effort is sometimes ignored, as described in this bug?
Artificialanalysis.ai shows similarly high latencies (Gemini 2.5 Flash: API Provider Performance Benchmarking & Price Analysis | Artificial Analysis), and I imagine this is not the expected latency for the model. This makes it unusable for agentic work.
Update: Here's a minimal example that reproduces the behavior. It's a little tricky to get it to reason consistently, but this seems to produce reasoning tokens about 30% of the time.
I can actually get it to produce the behavior without structured output; it's just rarer (maybe 5% of requests).
import requests

# OpenAI-compatible chat completions endpoint for the Gemini API.
url = "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions"
api_key = 'XXXX'

# Short reasoning-style prompts that tend to trigger thinking.
msgs = [
    "If you remove every other letter from 'SUBSTANTIATION,' what's the resulting word?",
    "If yesterday's tomorrow is Friday, what day is two days before today's yesterday?",
    "How many unique ways can you arrange the letters in 'MISSISSIPPI'?",
    "Explain briefly why mirrors reverse left-to-right but not up-to-down.",
    "If all roses are flowers and some flowers fade quickly, must some roses fade quickly?",
    "Is it logically possible for an omnipotent being to create a rock it can't lift?",
    "Which weighs more: a pound of feathers on Earth or a pound of iron on the Moon?",
    "If all cats chase some mice and all mice fear all dogs, do all cats fear some dogs?",
    "Does the set of all sets that don't contain themselves contain itself?",
    "If two people each flip a fair coin five times, what's the probability their results match exactly?",
    "Explain in one sentence why multiplying two negative numbers yields a positive result.",
    "If there are three apples and you take two, how many apples do you have?",
    "Can a statement be both completely true and completely false simultaneously?",
    "If you always lie and you say 'I always lie,' are you lying or telling the truth?",
    "If today is Wednesday, what is the day 1000 days from now?",
    "Which is larger: 2^30 or 3^20?",
    "A triangle has angles in a 1:2:3 ratio; what are the three angle measurements?",
    "Can the average height of a population increase even if every individual's height decreases?",
    "Does adding salt to water increase or decrease its freezing point?",
    "Which has a greater perimeter: a square with an area of 16 or a rectangle with an area of 16 and dimensions 1 by 16?"
]

for msg in msgs:
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + api_key,
    }
    body = {
        "model": 'gemini-2.5-flash-preview-04-17',
        # Thinking is explicitly disabled for every request.
        "reasoning_effort": "none",
        "messages": [
            {
                "role": "system",
                "content": 'You are a helpful assistant. '
                           'Answer the question using reasoning, with careful step-by-step reasoning before producing an answer.'
            },
            {
                "role": "user",
                "content": msg
            }
        ],
        # Structured output makes the behavior easier to reproduce.
        "response_format": {
            "type": 'json_schema',
            "json_schema": {
                'name': 'result',
                'schema': {
                    'type': 'object',
                    'properties': {
                        'explanation': {
                            'type': 'string'
                        },
                        'answer': {
                            'type': 'string'
                        }
                    },
                    'required': ['explanation', 'answer'],
                }
            }
        }
    }
    r = requests.post(url, headers=headers, json=body)
    js = r.json()
    # Tokens not accounted for by prompt + completion are reasoning tokens.
    reasoning_tokens = (js['usage']['total_tokens'] -
                        js['usage']['prompt_tokens'] -
                        js['usage']['completion_tokens'])
    print(reasoning_tokens)