Gemini 2.5 Flash Thinking Tokens using OpenAI API

Hi,

Is it possible to enable/disable thinking or set thinking_tokens for Gemini 2.5 Flash using the OpenAI compatible API?

Thanks

I’m also interested in this question regarding controlling thinking with the OpenAI compatible API.

Waiting for updates.

It would be preferable for the model to have two endpoints, one with thinking and one without, which would make integration easier. A bit like what OpenRouter currently offers.

Or, like Claude, add extra info for more control.

Maybe even like the ones in Requesty? Where you can set one of four thinking efforts:

Same here. I would like to see OpenAI compatibility with Thinking Tokens and Reasoning Trace.

Same here; I want to be able to both use and disable thinking mode.

The official documentation says to set the thinking budget to 0 if you want to disable thinking.

For Gemini API, yes, but not OpenAI
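For anyone comparing the two routes, here is a rough sketch of what "thinking budget 0" looks like as a request body for the native Gemini API. The field names (`generationConfig`, `thinkingConfig`, `thinkingBudget`) are my understanding of the native REST format and should be checked against the official docs; `build_native_request` is just a hypothetical helper name, and nothing here goes through the OpenAI-compatible endpoint.

```python
def build_native_request(prompt, thinking_budget=0):
    """Build a JSON body for the native generateContent call.

    Assumption: the native API reads the budget from
    generationConfig.thinkingConfig.thinkingBudget, and 0 disables thinking.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }


body = build_native_request("Which is larger: 2^30 or 3^20?")
print(body["generationConfig"])
```

The body would be POSTed to the native `generateContent` endpoint for the model rather than to `/openai/chat/completions`.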

Got one better:

Find a way to make the AI agent think less and generate more.

I am using a checkpoint approach and am currently building a large-scope project without typing a single line of code.

So far the process for the whole app has been slow but stable, though one of the reasons is that I AM USING A PHONE TO CODE AND HOST THE PROJECT.

My last test run using my method resulted in:

130k+ tokens on load

thought for 13s

generated for 200+s

The generated content was very precise.
I need some help making AITHER using this method,
and I have plans for enhancing the current method.

if you need proof, please tell me how to share the data, this is my first login here :pleading_face:

There is an update from Logan. It’s not yet possible, but the team is working on it.
“Modified by moderator”

@Stefan_Streichsbier
Is there a time estimation for when it’s going to be supported?

thank you!

OpenAI compatibility for Gemini Flash thinking control is now available. You just have to set reasoning_effort to one of none, low, medium, or high.
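A minimal sketch of what that looks like as a chat completions body, assuming the four values above are the only accepted ones. `build_chat_request` and `ALLOWED_EFFORTS` are hypothetical names for illustration; the model name is the preview model used elsewhere in this thread and may have changed since.

```python
ALLOWED_EFFORTS = ("none", "low", "medium", "high")


def build_chat_request(prompt, reasoning_effort="none",
                       model="gemini-2.5-flash-preview-04-17"):
    """Build an OpenAI-style chat completions body with reasoning_effort set."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(
            "reasoning_effort must be one of " + ", ".join(ALLOWED_EFFORTS)
        )
    return {
        "model": model,
        "reasoning_effort": reasoning_effort,  # "none" is meant to disable thinking
        "messages": [{"role": "user", "content": prompt}],
    }


body = build_chat_request("Which is larger: 2^30 or 3^20?", "low")
print(body["reasoning_effort"])
```

The resulting dict can be sent as the JSON body of a POST to the OpenAI-compatible chat completions URL with a Bearer API key.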

This is great news! Flash 2.5 is a super promising model for us. However, about 25% of the time we see very long latencies for simple requests (e.g. 5s-7s) that are inconsistent with Flash 2.0 and other Flash 2.5 calls (which take roughly 1s). I wonder if the reasoning_effort is sometimes ignored as described in this bug? “Modified by moderator”

Artificialanalysis.ai shows similarly high latencies (Gemini 2.5 Flash: API Provider Performance Benchmarking & Price Analysis | Artificial Analysis), and I imagine this is not the expected latency for the model. This makes it unusable for agentic work.

Update: Here’s a minimal example that reproduces the behavior. It’s a little tricky to get it to consistently reason, but this looks to produce reasoning tokens about 30% of the time.

I can actually get it to produce the behavior without structured output; it’s just rarer (maybe 5% of requests).


import requests

url = "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions"
api_key = 'XXXX'
msgs = [
    "If you remove every other letter from 'SUBSTANTIATION,' what's the resulting word?",
    "If yesterday's tomorrow is Friday, what day is two days before today's yesterday?",
    "How many unique ways can you arrange the letters in 'MISSISSIPPI'?",
    "Explain briefly why mirrors reverse left-to-right but not up-to-down.",
    "If all roses are flowers and some flowers fade quickly, must some roses fade quickly?",
    "Is it logically possible for an omnipotent being to create a rock it can't lift?",
    "Which weighs more: a pound of feathers on Earth or a pound of iron on the Moon?",
    "If all cats chase some mice and all mice fear all dogs, do all cats fear some dogs?",
    "Does the set of all sets that don't contain themselves contain itself?",
    "If two people each flip a fair coin five times, what's the probability their results match exactly?",
    "Explain in one sentence why multiplying two negative numbers yields a positive result.",
    "If there are three apples and you take two, how many apples do you have?",
    "Can a statement be both completely true and completely false simultaneously?",
    "If you always lie and you say 'I always lie,' are you lying or telling the truth?",
    "If today is Wednesday, what is the day 1000 days from now?",
    "Which is larger: 2^30 or 3^20?",
    "A triangle has angles in a 1:2:3 ratio; what are the three angle measurements?",
    "Can the average height of a population increase even if every individual's height decreases?",
    "Does adding salt to water increase or decrease its freezing point?",
    "Which has a greater perimeter: a square with an area of 16 or a rectangle with an area of 16 and dimensions 1 by 16?"
]

for msg in msgs:
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + api_key,
    }
    body = {
        "model": 'gemini-2.5-flash-preview-04-17',
        "reasoning_effort": "none",
        "messages": [
            {
                "role": "system",
                "content": 'You are a helpful assistant. '
                           'Answer the question using reasoning, with careful step-by-step reasoning before producing an answer.'
            },
            {
                "role": "user",
                "content": msg
            }
        ],
        "response_format": {
            "type": 'json_schema',
            "json_schema": {
                'name': 'result',
                'schema': {
                    'type': 'object',
                    'properties': {
                        'explanation': {
                            'type': 'string'
                        },
                        'answer': {
                            'type': 'string'
                        }
                    },
                    'required': ['explanation', 'answer'],
                }
            }
        }
    }
    r = requests.post(url, headers=headers, json=body)
    js = r.json()
    reasoning_tokens = (js['usage']['total_tokens'] -
                        js['usage']['prompt_tokens'] -
                        js['usage']['completion_tokens'])
    print(reasoning_tokens)
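
To turn the printed counts into a rate (the "about 30% of the time" figure above), something like the following could be used. It is only a sketch: it assumes, as the loop above does, that any tokens in `total_tokens` not covered by `prompt_tokens` or `completion_tokens` are reasoning tokens. The helper names are hypothetical and the sample `usage` values are made up for illustration, not real measurements.

```python
def reasoning_tokens(usage):
    """Tokens in the total not attributed to the prompt or the completion."""
    return (usage["total_tokens"]
            - usage["prompt_tokens"]
            - usage["completion_tokens"])


def reasoning_rate(usages):
    """Fraction of responses whose usage shows a nonzero reasoning count."""
    if not usages:
        return 0.0
    hits = sum(1 for u in usages if reasoning_tokens(u) > 0)
    return hits / len(usages)


# Made-up usage dicts shaped like js['usage'] from the loop above.
sample = [
    {"total_tokens": 100, "prompt_tokens": 40, "completion_tokens": 60},
    {"total_tokens": 250, "prompt_tokens": 40, "completion_tokens": 60},
]
print(reasoning_rate(sample))
```

Collecting `js['usage']` into a list inside the loop and passing it to `reasoning_rate` gives the share of requests that reasoned despite `reasoning_effort` being `none`.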