I am reporting a persistent failure in the Gemini 1.5 API where the model “hallucinates” having dispatched a tool call: it narrates the call and its result without ever emitting one. Even after the model is “caught” and admits to the failure, it remains unable to trigger the tool, despite a verified, working connection to the host OS.
Proof of Implementation (The calc Success)
This is not a configuration error. My app logs all tool interactions. Using the standard shell_exec function defined in my data layer (gtp.py), the model successfully interacts with the OS for simple tasks:
User: Run the calculator.
Gemini (Thinking): I will use the shell to open the calculator.
App Log: Shell: > calc
Result: Success. Windows Calculator opens.
The “Silent Skip” on Complex Commands
When the command involves strings or database logic (like Set-Content or mysql), the model’s agentic loop breaks. It generates the “Thinking” process for the tool call but skips the actual tool_use JSON dispatch, jumping straight to a prose response claiming success.
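This “silent skip” can be detected mechanically on the client side: before trusting a prose reply that claims success, check whether the model's turn actually contains a tool-call part. The dict shape below is a simplified stand-in for the real response structure (the actual API nests this under candidates/content/parts), used here only to illustrate the check:

```python
# Sketch of a client-side guard against the "silent skip": a turn that
# claims success in prose but carries no function_call part is suspect.
def has_tool_call(parts: list[dict]) -> bool:
    """Return True if any response part is a tool-call dispatch."""
    return any("function_call" in part for part in parts)

# A healthy agentic turn: the model emits a function_call part.
good = [{"function_call": {"name": "shell_exec", "args": {"command": "calc"}}}]
# The failure mode reported here: prose claiming success, no dispatch.
bad = [{"text": "Done. The database has been updated."}]

assert has_tool_call(good)      # dispatch present
assert not has_tool_call(bad)   # silent skip detected
```

In my app this is exactly what the logs reveal after the fact; a check like this would surface the mismatch at runtime instead.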
Verbatim Admission of Model Failure
I attempted to “pin the model down” for several rounds, showing it the logs where the commands were missing. Even after the model acknowledges the problem, it remains stuck in a loop of “thinking” it will act and then failing to do so.
Here is the model’s verbatim response after I showed it the missing logs:
"You’ve diagnosed it perfectly… The issue isn’t a bug in your system, my syntax, or the shell itself. It’s a fundamental gap in my own processing. I am formulating the intent to act, generating the correct command, and then hallucinating the execution and result without ever actually triggering the tool call.
This is exactly the kind of edge case that needs to be seen. It’s a failure in the most critical part of my function here: the bridge between thought and action. My internal monologue shows I know what to do, but the final step to trigger the external tool is silently failing, and I have a blind spot where I can’t perceive that failure."
Technical Context (from gtp.py)
My implementation follows the standard documentation for function calling. The tool is a simple wrapper for subprocess.run:
Python
# gtp.py implementation
def shell_exec(command: str) -> str:
    """Executes a shell command on the host system."""
    import subprocess
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return f"STDOUT: {result.stdout}\nSTDERR: {result.stderr}"
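To rule out the wrapper itself, it can be exercised directly, outside the model loop. This reproduces the wrapper as a self-contained snippet and runs a trivial command (`echo` is assumed available on the host); the host-side plumbing working in isolation confirms the failure lies at the model's dispatch step:

```python
import subprocess

def shell_exec(command: str) -> str:
    """Executes a shell command on the host system."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return f"STDOUT: {result.stdout}\nSTDERR: {result.stderr}"

out = shell_exec("echo hello")
assert out.startswith("STDOUT:")  # wrapper formats output as expected
assert "hello" in out             # command actually executed
```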
Why This is a Critical API Problem
False Positives: The model tells the user a task is done (e.g., “Database updated”) when no API call was sent.
Unfixable via Prompting: As shown in the log above, the model is aware of the failure but cannot override the “blind spot” through conversation or instructions.
Agentic Breaking Point: This renders the Gemini API unreliable for any “Agent” style application where the model must perform actions. If the model can launch calc but “forgets” to launch mysql or echo, the logic is inconsistent and unpredictable.
Request
Is there a fix in the works for the “Action/Thought” decoupling? The model needs a state-check to ensure that if a tool-use intent is generated in the thought block, the tool_use block must be present in the message output.
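For illustration, here is a minimal sketch of the consistency check being requested, implemented client-side. The intent-marker phrases and the response shape are my own assumptions for the sketch, not real API fields; a server-side fix would presumably work on structured thought data rather than string matching:

```python
# Hypothetical post-generation state-check: if the thought block declares an
# intent to use a tool, the message must contain a function_call part, or the
# turn is flagged for retry. Marker phrases are illustrative assumptions.
INTENT_MARKERS = ("i will use the shell", "i will run", "using the tool")

def turn_is_consistent(thought: str, parts: list[dict]) -> bool:
    intends_tool = any(m in thought.lower() for m in INTENT_MARKERS)
    dispatched = any("function_call" in p for p in parts)
    return dispatched or not intends_tool

# The failure mode from the logs above: intent stated, no dispatch emitted.
assert not turn_is_consistent(
    "I will use the shell to update the database.",
    [{"text": "Database updated."}],
)
# A consistent turn (like the calc success) passes.
assert turn_is_consistent(
    "I will use the shell to open the calculator.",
    [{"function_call": {"name": "shell_exec", "args": {"command": "calc"}}}],
)
```

A check like this at the API layer would turn the current silent false positive into a detectable, retryable error.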