I have an image and I want to use Gemini 2 Flash to detect the bounding boxes of the different objects in the image. I’d like the output to be a list of objects (see the Object class in the code below).
The problem is that I cannot figure out how to use Pydantic alongside response_schema to make it work. Here is the code:
import os
from pydantic import BaseModel, RootModel
from typing import List
from google import genai
from google.genai import types
from PIL import Image
class Object(BaseModel):
    label: str
    box_2d: list[int]

class ObjectsList(RootModel):
    root: List[Object]
image = Image.open(SOME_IMAGE_PATH)
prompt = "Detect objects box 2d."
client = genai.Client(api_key=os.environ["API_KEY"])
response = client.models.generate_content(
    contents=[image, prompt],
    model="gemini-2.0-flash-exp",
    config=types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ObjectsList,
    ),
)
This results in the following error: Extra inputs are not permitted [type=extra_forbidden, input_value={'Object': {'properties':...2d'], 'type': 'OBJECT'}}, input_type=dict]
Instead of passing ObjectsList to response_schema, I first tried passing list[Object] or List[Object] with no luck.
Is it possible to use Pydantic to have a list of objects as output?
Pydantic actually works with Gemini 2.0 (response_schema). Please check the code below which works for me. The error message is not coming from Gemini 2.0.
import os
from google import genai
from google.genai.types import GenerateContentConfig
from PIL import Image
from pydantic import BaseModel
os.environ["API_KEY"] = "<YOUR_API_KEY>"  # placeholder: replace with your own key
client = genai.Client(api_key=os.environ["API_KEY"])
image = Image.open(SOME_IMAGE_PATH)
prompt = "Detect objects box 2d."
class Object(BaseModel):
    label: str
    box_2d: list[int]
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[image, prompt],
    config=GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Object,
    ),
)
print(response.text)
Your solution does not work. Maybe I was not clear in my initial message / question.
To be more precise, your solution does something (with no error) which is different from what I’ve asked. What I’ve asked was:
Is it possible to use Pydantic to have a list of objects as output?
The Object has two fields: label (a string) and box_2d (a list of integers). By using response_schema=Object in the GenerateContentConfig, you’re asking Gemini to return a single Object rather than a list of them. Here is the output of your solution:
It appears this is not supported out-of-the-box in the SDK. Please refer to this GitHub issue which has a workaround. Alternatively, you can refer to this code below:
import os
import json
from pydantic import BaseModel, RootModel, TypeAdapter
from typing import List
from google import genai
from google.genai import types
from PIL import Image
class Object(BaseModel):
    label: str
    box_2d: list[int]

class ObjectsList(RootModel):
    root: List[Object]
def get_schema(cls: type[BaseModel]) -> dict:
    """
    Converts a Pydantic model to a JSON schema dictionary.
    """
    schema = cls.model_json_schema()
    if "$defs" not in schema:
        return schema

    defs = schema.pop("$defs")

    def _resolve(schema):
        # Replace a "$ref" pointer with the definition it points to
        if "$ref" in schema:
            ref = schema.pop("$ref")
            schema.update(defs[ref.split("/")[-1]])
        # Recurse into object properties and array items
        if "properties" in schema:
            for prop in schema["properties"].values():
                _resolve(prop)
        if "items" in schema:
            _resolve(schema["items"])
        schema.pop("title", None)

    _resolve(schema)
    return schema
SOME_IMAGE_PATH = "Cajun_instruments.jpg"
image = Image.open(SOME_IMAGE_PATH)
prompt = "Detect objects box 2d."
# Create the client (reads the key from the API_KEY environment variable)
client = genai.Client(api_key=os.environ["API_KEY"])
response = client.models.generate_content(
    contents=[image, prompt],
    model="gemini-2.0-flash-exp",
    config=types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=get_schema(ObjectsList),  # Use the get_schema function here
    ),
)
# Load JSON string and validate with Pydantic
obj = TypeAdapter(ObjectsList).validate_python(json.loads(response.text))
print(obj)
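To see what the get_schema workaround does without calling the API, here is a minimal standard-library sketch of the same idea. RAW_SCHEMA is hand-written and only assumed to resemble what Pydantic emits for a list of objects, and inline_refs is a hypothetical name for illustration, not part of Pydantic or the google-genai SDK. It shows how the "$defs" / "$ref" indirection is flattened into one self-contained schema:

```python
import copy
import json

# Hand-written schema, assumed to resemble Pydantic's output for a list of objects.
RAW_SCHEMA = {
    "$defs": {
        "Object": {
            "title": "Object",
            "type": "object",
            "properties": {
                "label": {"title": "Label", "type": "string"},
                "box_2d": {
                    "title": "Box 2D",
                    "type": "array",
                    "items": {"type": "integer"},
                },
            },
            "required": ["label", "box_2d"],
        }
    },
    "title": "ObjectsList",
    "type": "array",
    "items": {"$ref": "#/$defs/Object"},
}


def inline_refs(schema: dict) -> dict:
    """Return a copy of `schema` with local $ref entries inlined and titles dropped."""
    schema = copy.deepcopy(schema)  # leave the caller's dict untouched
    defs = schema.pop("$defs", {})

    def _resolve(node: dict) -> None:
        # Swap a "$ref" pointer for a copy of the definition it names
        if "$ref" in node:
            name = node.pop("$ref").split("/")[-1]
            node.update(copy.deepcopy(defs[name]))
        # Recurse into object properties and array items
        for prop in node.get("properties", {}).values():
            _resolve(prop)
        if "items" in node:
            _resolve(node["items"])
        node.pop("title", None)

    _resolve(schema)
    return schema


flat = inline_refs(RAW_SCHEMA)
print(json.dumps(flat, indent=2))
```

The flattened result is a plain array schema whose items are the inlined Object definition. Note that this sketch, like get_schema above, only follows "properties" and "items" and would recurse forever on a self-referencing model.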
I see that you are using a JSON schema here (which is the solution I’m using too). I understand that Pydantic cannot be used for this simple use case. Thanks for your answer.