It looks like it isn’t possible for function responses to contain an image part, only a generic JSON object. Is this planned?
Welcome to the forum!
Exactly what are you trying to do with function calls that involve images? Are you looking for Gemini to return an image by making a function call to some other API?
This isn’t quite how function calls work.
In a way, Function Calls are loosely defined. What they are meant for is for you to declare the functions your program supports (i.e., a function name, its parameters, types for those parameters, and descriptions for all of that). Gemini then identifies which function can best handle the user’s request and replies with a JSON structure giving the function name and the parameter values, based on what the user has asked for.
Your code is then expected to handle that request and, optionally, send the return value back to Gemini so it can generate a plain-text response to the user’s question. Sometimes this may take multiple rounds.
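As a rough sketch of that round trip (the function name, schema, and the model’s reply below are all hypothetical, and the model’s function call is hard-coded here for illustration):

```python
# Hypothetical function declaration you would register with the model:
# name, parameters, types, and descriptions.
get_weather_decl = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
        },
        "required": ["city"],
    },
}

# The model replies with a function call: a name plus JSON arguments.
# Hard-coded here; in a real app this comes back from Gemini.
model_call = {"name": "get_weather", "args": {"city": "Paris"}}

# Your code dispatches the call and produces the "real answer".
def get_weather(city):
    # A real implementation would hit a weather API; stubbed for the sketch.
    return {"city": city, "temp_c": 21}

dispatch = {"get_weather": get_weather}
result = dispatch[model_call["name"]](**model_call["args"])
print(result)  # {'city': 'Paris', 'temp_c': 21}
```

The return value (`result`) is what you would then feed back to the model so it can phrase an answer for the user.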
This is one tool to help provide grounding: LLMs are poor at providing factual information but good at “understanding” what the user has asked, so function calling provides a way to turn that fuzzy human request into a machine representation that you can use to get a “real answer”.
So while you can provide a function or functions that will generate images based on user input, Gemini will just turn that into the parameters that might form the request. It is still up to you, as a developer, to do something that generates the image (for example, calling an image generation model, or searching online stock photos).
Function Calling is a powerful tool - but may not be the tool you’re looking for. What are you looking for, exactly?
Sorry, should have been more specific. As I understand it, the FunctionResponse message part, which is used to define the output of a function call, only supports a JSON object. I would like to return an image that the model can process. This can be done in user/system prompts using ImagePart, but not functions.
An example use case here is a function that returns a screenshot of the user’s screen.
OpenAI has similar limitations: Returning image as result of function call to gpt-4-turbo - Bugs - OpenAI Developer Forum
Your understanding is correct. When you take a closer look, the Gemini API has no way to specify the return type of a tool function. You can specify types for arguments, but even those specifications are largely cosmetic: the model picks argument types far more from the examples given in the function’s description than from the schema you declare.
Once you accept that (which you can verify with extensive testing yourself), you’ll find you can actually return fairly complex types from a function, as long as they are structured name/value combinations. Structured text, not images.
You can possibly get the effect you are looking for (the model gets an image in the context as a result of a function call) like this:
- have the function call provide the parameters for the image;
- have the application code prepare the image and return a dummy value (true and false are popular choices in the sample code shown in the cookbooks);
- receive the next response from the model, which will be text again;
- then invoke generate_content() with text Parts and an image Part containing the image your application generated in response to the specs from the model.
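A minimal sketch of that last step, using REST-style content dicts rather than a live API call (the part and field names follow the REST camel/snake conventions; the text and the PNG bytes are placeholders):

```python
import base64

def build_followup_contents(image_png: bytes):
    """Build the contents for the follow-up generate_content() call:
    a text Part plus an inline image Part the application produced."""
    return [
        {
            "role": "user",
            "parts": [
                {"text": "Here is the image generated for your request."},
                {
                    "inline_data": {
                        "mime_type": "image/png",
                        "data": base64.b64encode(image_png).decode("ascii"),
                    }
                },
            ],
        }
    ]

# Placeholder bytes standing in for the image your application generated.
contents = build_followup_contents(b"\x89PNG\r\n\x1a\n")
print(contents[0]["parts"][1]["inline_data"]["mime_type"])  # image/png
```

You would pass `contents` to generate_content(); the model then sees the image as ordinary user input rather than as a function return value.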
Hope that helps!
Ah! Yes. Good question, and apologies for my misunderstanding.
This is a specific case of a more general issue: what if the response isn’t structured? There are a few example cases, such as:
- The one you’ve given - where an image is the “return value”
- Where Gemini indicates that values are missing and the user needs to supply them, so it becomes another piece of conversational content.
- Where the “function” actually returns a structured error, and a re-prompt with different parameters may be necessary.
You don’t need to return a FunctionResponse at any point. In each of these cases, you can just return normal text and image parts, which become incorporated into the pattern of the prompt.
In some cases you may want to rewrite the “model” response so that, instead of the structured response, you’re sending a generic text message with Gemini asking for more information.
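A sketch of that rewrite, assuming REST-style camelCase part names and a hypothetical get_screenshot function (the replacement text is illustrative; adjust the field names for your SDK):

```python
def rewrite_model_turn(turn):
    """If a 'model' turn contains a functionCall part, replace it with a
    plain-text part so the history reads as ordinary conversation."""
    new_parts = []
    for part in turn["parts"]:
        if "functionCall" in part:
            fc = part["functionCall"]
            # Swap the structured call for a generic request for information.
            new_parts.append(
                {"text": f"I need more information before I can run {fc['name']}."}
            )
        else:
            new_parts.append(part)
    return {"role": turn["role"], "parts": new_parts}

# A model turn as it might appear in the history, with a function call.
model_turn = {
    "role": "model",
    "parts": [{"functionCall": {"name": "get_screenshot", "args": {}}}],
}
print(rewrite_model_turn(model_turn))
```

The rewritten turn can then be placed back into the history before the user’s next message.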
> The Gemini API doesn’t have specifications for the return type of a tool function, when you take a closer look
Yeah, the function output itself can be any valid JSON, but you need to use a specific FunctionResponse part type to tell the model that the message is in response to a specific function call. Even if I passed an image in the function response, the model wouldn’t be able to actually understand it, because images need to be tokenized differently from regular text/JSON input. So I think we’re on the same page there.
The workaround I’m using now seems similar to what you and @afirstenberg mentioned. The image function returns { success: true }, and then the image itself is sent in the next message with the user role. I’ve tested it with gpt-4o and it works well. Unless Gemini has a restriction on the order of messages, it should work there too.
I haven’t built the scenario I described myself (*), but I have done enough testing to be pretty sure it will work. By using generate_content(), the model gets the image you want, but without history. If you want the image in the context and you also want the history, the same can be done from the chat object, with the possible added tweak of manipulating the previous model response like Allen suggests.
(*) part of it I have built: Model comes up with a graph specification for Graphviz in DOT language. I could draw it on the client and send the image back, but I didn’t need to.
Gemini does have restrictions on the order of messages:
- Messages must start with a “user” message.
- Messages must end with either a “user” or “function” role.
- Messages must alternate between (“user” or “function”) roles and “model” roles.
As I said, you may need to rewrite the “model” response when you send it back as part of the history.
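The rules above can be sketched as a small validator, a plain-Python illustration rather than an official API check:

```python
def history_is_valid(roles):
    """Check a sequence of role strings against the ordering rules:
    start with 'user', end with 'user' or 'function', and alternate
    between ('user' or 'function') turns and 'model' turns."""
    if not roles or roles[0] != "user":
        return False
    if roles[-1] not in ("user", "function"):
        return False
    for prev, cur in zip(roles, roles[1:]):
        # Two 'model' turns in a row, or two non-'model' turns in a row,
        # both break the alternation rule.
        if (prev == "model") == (cur == "model"):
            return False
    return True

print(history_is_valid(["user", "model", "function", "model", "user"]))  # True
print(history_is_valid(["user", "user"]))  # False: no alternation
```

In the { success: true } workaround discussed above, the dummy function response plays the “function” role, so the follow-up image message with the user role keeps the alternation intact.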