Gemini OpenAI compatibility issue with tool_call + streaming

I am calling the API directly from Python with httpx and json, not through the SDK, using response-parsing code originally developed against the OpenAI API.

My streaming code, written as a chatbot example, is fluent in sending and handling anything the API might do: strict and non-strict tools (functions), parallel tool calls, structured response_format as json_schema in both strict and non-strict modes, and json_object.

However, I turn all that structured output off here and use plain function calling (not via the legacy function-calling parameter, though, which Gemini’s OpenAI compatibility refuses).

Firstly: The documentation here is inadequate.

  • There is no REST specification for structured outputs, just Pydantic: no Pythonic streamable object input to the SDK as response_format, no JSON schema expressed as RESTful JSON, and no strict:false, which a Pydantic BaseModel passed as the SDK’s response_format cannot send anyway (see the JSON sketch after this list).
  • There is no tool call response object example.
  • There are no delta stream examples, either with or without a function being called.
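For the record, here is the RESTful JSON shape of a json_schema response_format as OpenAI’s API defines it (the schema itself is illustrative; whether Gemini’s compatibility layer accepts strict: false here is exactly what the docs leave open):

"response_format": {
    "type": "json_schema",
    "json_schema": {
        "name": "weather_report",
        "strict": false,
        "schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "temperature_f": {"type": "number"}
            },
            "required": ["location", "temperature_f"],
            "additionalProperties": false
        }
    }
}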

Then the fault: sure, I can chat with the AI after pointing my code at the Gemini endpoint and model and reading the standard Gemini environment variable for the key.

But: parsing the tool_call stream fails.


gemini-2.0-flash-exp’s analysis, doing the work for me:

You’ve done a fantastic job of isolating the problem and providing all the necessary information. The issue is clear: the Gemini OpenAI compatibility mode is sending back a streaming response for tool_calls that deviates from the expected OpenAI format, specifically by omitting the index within each tool_call object.

Let’s break this down into a bug report and then propose a solution.

Bug Report: Gemini OpenAI Compatibility Mode - Incorrect tool_calls Streaming Format

Summary:

The Google Gemini OpenAI compatibility mode API, when returning a streaming response (stream: true) that includes a tool_calls (function call), sends a delta chunk that does not conform to the expected OpenAI format. Specifically, the index key is missing within each tool_call object in the delta. This causes parsing errors in code that correctly handles OpenAI’s streaming responses.

Environment:

  • API: Google Gemini OpenAI compatibility mode (e.g., https://generativelanguage.googleapis.com/v1beta/openai/chat/completions)
  • Model: gemini-2.0-flash-exp (and likely others)
  • Streaming: stream: true
  • Trigger: API response includes tool_calls

Steps to Reproduce:

  1. Use the provided Python code with all advanced feature flags set to False (as described in the previous turn).
  2. Set the ai_model to "gemini-2.0-flash-exp".
  3. Send a chat message that will trigger a tool call (e.g., “What’s the weather in Miami today?” when the get_current_weather tool is defined); a standalone reproduction sketch follows this list.
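For concreteness, a minimal reproduction sketch with httpx (standalone, not the code from step 1; assumes the standard GEMINI_API_KEY environment variable and an illustrative get_current_weather tool definition):

import json
import os

import httpx

payload = {
    "model": "gemini-2.0-flash-exp",
    "stream": True,
    "messages": [{"role": "user", "content": "What's the weather in Miami today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }],
}

with httpx.stream(
    "POST",
    "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GEMINI_API_KEY']}"},
    json=payload,
    timeout=60,
) as response:
    for line in response.iter_lines():
        # Server-sent events: each "data:" line carries one JSON chunk
        if line.startswith("data: ") and line != "data: [DONE]":
            print(json.dumps(json.loads(line[len("data: "):]), indent=1))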

Expected Behavior:

The API should return a streaming delta chunk for tool_calls that conforms to the OpenAI format. Each tool_call object within the delta should have an index key, similar to this (inferred from OpenAI’s non-streaming response structure; a note on how OpenAI fragments these follows the example):

{
  "delta": {
    "role": "assistant",
    "tool_calls": [
      {
        "index": 0, // <-- Index is present in expected behavior
        "function": {
          "arguments": "{\"location\":\"Miami\",\"unit\":\"fahrenheit\"}",
          "name": "get_current_weather"
        },
        "id": "0",
        "type": "function"
      }
    ]
  },
  "index": 0
}
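For comparison, the real OpenAI API typically spreads one tool call across several delta chunks, and the index present in every fragment is what lets a streaming parser know which call to append the pieces to (abbreviated, illustrative fragments; the id is made up):

{"delta": {"tool_calls": [{"index": 0, "id": "call_abc123", "type": "function", "function": {"name": "get_current_weather", "arguments": ""}}]}}
{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"location\""}}]}}
{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": ": \"Miami\"}"}}]}}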

Actual Behavior:

The API returns a streaming delta chunk for tool_calls where the index key is missing within each tool_call object:

{
  "delta": {
    "role": "assistant",
    "tool_calls": [
      { 
        // "index": 0, <-- Index is missing!
        "function": {
          "arguments": "{\"location\":\"Miami\",\"unit\":\"fahrenheit\"}",
          "name": "get_current_weather"
        },
        "id": "0",
        "type": "function"
      }
    ]
  },
  "index": 0
}

This causes a KeyError: 'index' in the process_chunk function when it tries to access tool_call['index'].
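Reduced to its essence, the failing access in the accumulator looks like this (paraphrased from the handler shown below):

for tool_call in delta['tool_calls']:
    index = tool_call['index']  # raises KeyError: 'index' on Gemini's deltas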

Impact:

Code that correctly parses OpenAI’s streaming responses will fail to parse streaming responses from the Gemini OpenAI compatibility mode when tool_calls are present. This breaks compatibility and requires code modifications to handle the discrepancy.

Additional Notes:

  • The code works correctly with the actual OpenAI API.
  • The Gemini OpenAI compatibility mode documentation does not specify the format of the streaming response for tool_calls.

Proposed Solution:

(The AI’s solution is a brute-force check on the model name.)

This is specific to my code and its tool event handler class, which is itself written as example usage, but you get some free code dump here.

We need to modify the process_chunk function to gracefully handle the missing index key in the tool_calls delta when using the Gemini API. We can do this by detecting the Gemini API and then providing a default index or using a different method to accumulate the tool_calls data.

Here’s the modified code:

from typing import Any

# Assumed context from the surrounding chatbot example: ResponseState and
# Printer are defined elsewhere; ai_model is a module-level global.
def process_chunk(
                chunk: dict[str, Any],
                state: ResponseState,
                print_client: Printer | None = None
                ) -> None:
    """Processes a chunk of the streaming response and updates the state."""
    CHUNK_LOG_ENABLED = True  # Set to False to disable chunk logging
    import os  # needed for the log-file existence check below

    printing = print_client
    if 'error' in chunk:
        # API returned an error
        raise Exception(f"API Error: {chunk['error']}")

    if 'usage' in chunk and not chunk.get('choices'):
        state.usage = chunk['usage']
        return

    choices = chunk.get('choices', [])
    if not choices:
        return

    choice = choices[0]
    print(choice) # Keep the chunk logging for debugging

    delta = choice.get('delta', {})
    if 'content' in delta and delta['content'] is not None:
        content_piece = delta['content']
        state.content += content_piece

        if CHUNK_LOG_ENABLED:
            # Ensure 'token_log.txt' exists and initialize it with '[\n' if it doesn't
            if not os.path.exists('token_log.txt'):
                with open('token_log.txt', 'w', encoding='utf-8') as f:
                    f.write('[\n')
            # Append the content_piece
            with open('token_log.txt', 'a', encoding='utf-8') as f:
                f.write(f' {repr(content_piece)},\n')

        # This is where the output happens
        if not print_client:
            print(content_piece, end='', flush=True)
        else:
            printing.word(content_piece)  # line printer with wrapping

    if 'function_call' in delta and delta['function_call']:
        for key, value in delta['function_call'].items():
            # Accumulate the function_call parts as they may be streamed in chunks
            state.function_call[key] = state.function_call.get(key, '') + value
    if 'tool_calls' in delta and delta['tool_calls']:
        for tool_call in delta['tool_calls']:
            # Check if we are using the Gemini API
            if "gemini" in ai_model:  # Use the global ai_model to detect
                # Gemini API: assume index 0, since it is not provided
                index = 0
            else:
                # OpenAI API: expect index to be present
                index = tool_call['index']
            # Accumulation is identical for both APIs from here on
            accumulated_tool_call = state.tool_calls.get(index, {
                'index': index,
                'id': tool_call.get('id'),
                'type': tool_call.get('type'),
                'function': {}
            })
            if 'id' in tool_call:
                accumulated_tool_call['id'] = tool_call['id']
            if 'type' in tool_call:
                accumulated_tool_call['type'] = tool_call['type']
            if 'function' in tool_call and tool_call['function']:
                # name/arguments may arrive split across chunks; concatenate fragments
                for k, v in tool_call['function'].items():
                    accumulated_tool_call['function'][k] = accumulated_tool_call['function'].get(k, '') + v
            state.tool_calls[index] = accumulated_tool_call

    if choice.get('finish_reason'):
        state.finish_reason = choice['finish_reason']
        if print_client:
            print_client.reset()  # clear position tracking

Explanation of Changes:

  1. API Detection: We’re using a simple heuristic: checking if "gemini" is present in the ai_model string. This is a reasonable assumption for now, but a more robust method might involve checking the API endpoint URL or adding a separate global variable to explicitly specify the API being used (see the sketch after this list).
  2. Conditional Index Handling:
    • Gemini API: If we detect the Gemini API, we assume an index of 0 because it’s missing. Since Gemini doesn’t seem to support multiple tool calls in a single response (at least not yet), this assumption is safe for now. We then proceed to accumulate the tool_call data as before.
    • OpenAI API: If it’s not the Gemini API (i.e., it’s the real OpenAI API), we use the existing logic that expects the index to be present.
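As a sketch of that more robust detection, keying off the endpoint rather than the model name (hypothetical helper; assumes the request’s base URL is visible where process_chunk runs):

from urllib.parse import urlparse

def needs_default_tool_index(base_url: str) -> bool:
    """True for Google's OpenAI-compatibility endpoint, which omits
    the tool_call index in streamed deltas."""
    return urlparse(base_url).hostname == "generativelanguage.googleapis.com"

# Usage inside process_chunk, replacing the '"gemini" in ai_model' check:
# index = tool_call.get('index', 0) if needs_default_tool_index(base_url) else tool_call['index']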

Advantages of this Solution:

  • Compatibility: It maintains compatibility with both the real OpenAI API and the Gemini OpenAI compatibility mode.
  • Minimal Changes: It only modifies the process_chunk function, keeping the changes localized.
  • Handles Missing Index: It gracefully handles the missing index in the Gemini response.

Further Improvements:

  • Robust API Detection: As mentioned, a more robust API detection mechanism might be desirable in the future.
  • Multiple Tool Calls (Future-Proofing): If Gemini starts supporting multiple tool calls in a single response, you’ll need to revise the logic to handle multiple indices even when they are not explicitly provided. You might need to infer the index based on the order of the tool_calls in the delta (see the sketch after this list) or use a different accumulation strategy.
  • Error Handling: You could add more specific error handling in the Gemini case. For example, if you encounter a situation where you need to infer an index other than 0 and it leads to unexpected behavior, you could raise a custom exception or log a warning.
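A possible shape for that order-based inference, assuming fragments within a single delta arrive in call order (untested speculation, not observed behavior):

for position, tool_call in enumerate(delta.get('tool_calls', [])):
    # Fall back to the fragment's position in the delta when 'index' is absent
    index = tool_call.get('index', position)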

This revised code, along with the detailed bug report, provides a solid foundation for working with both OpenAI and Gemini APIs while addressing the specific issue you encountered. Remember to thoroughly test your code after implementing these changes!

I haven’t dumped out all the Gemini stream chunks here as API-compatibility documentation and comparison.

Actual solution

Google ups its compatibility game to emulate OpenAI API responses completely: streaming, tool calls, and parallel-tool-call readiness included.

The OpenAI SDK also has a beta streaming helper, a parser and collector with events, which becomes relevant as the SDK version to stay compatible with advances.
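For anyone on the SDK path instead of raw httpx, a minimal sketch of that beta helper (per the openai Python package’s beta streaming API; the Gemini base URL and key wiring are assumptions carried over from above):

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.environ["GEMINI_API_KEY"],
)

# tools: the same tool list as in the reproduction sketch above
with client.beta.chat.completions.stream(
    model="gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "What's the weather in Miami today?"}],
    tools=tools,
) as stream:
    for event in stream:
        if event.type == "content.delta":
            print(event.delta, end="", flush=True)
    # The helper accumulates deltas; the final completion has whole tool_calls
    final = stream.get_final_completion()
    print(final.choices[0].message.tool_calls)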
