Hi everyone,
I use Gemma 4 via API, I want to disable thinking (because it’s very slow, it spends more than 60s for a simple question) but I cannot find any document to do it.
Does anyone know how to disable thinking?
Please help me, many thanks.
Hi everyone,
I use Gemma 4 via API, I want to disable thinking (because it’s very slow, it spends more than 60s for a simple question) but I cannot find any document to do it.
Does anyone know how to disable thinking?
Please help me, many thanks.
Hi @letanloc
Could you confirm which API you’re using? Typically, ‘thinking’ is activated by adding the <|think|> token to the beginning of the system prompt. To turn it off, simply remove that token. Alternatively, if you are running Gemma 4 on Cloud Run, you can disable this feature by setting "enable_thinking": False in your configuration.
Thanks
Hi Pannaga,
I know the doco say that but I’m not 100% sure it is. I’m having the same issue
Try adding “include_thoughts”: false to your config. For example:
{
“contents”: [
{
“role”: “user”,
“parts”: [
{
“text”: “Who are you?”
}
]
}
],
“generationConfig”: {
“maxOutputTokens”: 4048,
“include_thoughts”: false
}
}
didn’t work
[
{
"error": {
"code": 400,
"message": "Invalid JSON payload received. Unknown name \\"include_thoughts\\" at 'generation_config': Cannot find field.",
"status": "INVALID_ARGUMENT",
"details": \[
{
"@type": "type.googleapis.com/google.rpc.BadRequest",
"fieldViolations": \[
{
"field": "generation_config",
"description": "Invalid JSON payload received. Unknown name \\"include_thoughts\\" at 'generation_config': Cannot find field."
}
\]
}
\]
}
}
]
Hi @Greg_Obleshchuk
Apologies for late response .
Could you try setting “thinkingLevel”: “MINIMAL” in your request? I used this setting in my test request, for example, and noticed that “thoughtsTokenCount” was not utilized.
Request : curl “https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent?key= your key here” -H “Content-Type: application/json” -X POST -d ‘{“contents”: [{“parts”:[{“text”: “What is the water formula?”}]}],“generationConfig”: {“thinkingConfig”: {“thinkingLevel”: “MINIMAL”}}}’
I got this response back
Response :
{
“candidates”: [
{
"content": {
"parts": \[
{
"text": "",
"thought": true
},
{
"text": "The chemical formula for water is \*\*$\\\\text{H}\_2\\\\text{O}$\*\*.\\n\\nHere is a breakdown of what that means:\\n\\n\* \*\*$\\\\text{H}\_2$\*\*: This indicates there are \*\*two atoms of Hydrogen\*\*.\\n\* \*\*$\\\\text{O}$\*\*: This indicates there is \*\*one atom of Oxygen\*\*.\\n\\nIn a single molecule of water, these three atoms are held together by \*\*covalent bonds\*\*, where the oxygen atom shares electrons with the two hydrogen atoms."
}
\],
"role": "model"
},
"finishReason": "STOP",
"index": 0
}
],
“usageMetadata”: {
"promptTokenCount": 6,
"candidatesTokenCount": 103,
"totalTokenCount": 109,
"promptTokensDetails": \[
{
"modality": "TEXT",
"tokenCount": 6
}
\]
},
“modelVersion”: “gemma-4-26b-a4b-it”,
“responseId”: “_APvaeT9OduKjuMPmtyt6As”
}
Please try this and let me know if it works.
Thanks
Hi ,
Thanks for the reply.
I did try “thinkingConfig”: {“thinkingLevel”: “MINIMAL”}, and I agree ther is no thoughtsTokenCount fileds being returned.
I just didn’t know if this meant it wasn’t using thinking mode or not. THe documentation isn’t really clear at all. I would have imaged that the values woudl have been “thinkingConfig”: {“thinkingLevel”: “NONE”} . That would have been clearer.
I will continue to use “thinkingConfig”: {“thinkingLevel”: “MINIMAL”} .
once again thanks
Greg
The Problem: Users are seeing the model’s internal reasoning or “thought process” output directly into the final text generation.
The Fix: Reasoning models often output their logic block before the final answer, usually enclosed in specific tags (like and ).
Application-Side Parser: The most robust fix is to implement a regex or string-parsing function in the client code to automatically strip out any text between these tags before rendering the output to the end user.
API Flags: Check the specific endpoint documentation (Vertex AI or AI Studio) for generation configuration parameters. Some endpoints allow passing a flag such as include_thinking=false in the JSON payload to suppress the reasoning tokens at the server level.
its part of its normal generation.
But you can offer it an alternative. To poop it in a {“quick_thinking”:" "} structure.
use ThinkingLevel.MINIMAL. For gemma 4 it turns off thinking.