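OK, I figured it out myself.
Pass the image to the chat as an inline part (a dict with the MIME type and the base64-encoded bytes), followed by the text prompt: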
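import base64
import httpx

# models[0] is assumed to be an already-configured model object defined elsewhere.
image_path = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Palace_of_Westminster_from_the_dome_on_Methodist_Central_Hall.jpg/2560px-Palace_of_Westminster_from_the_dome_on_Methodist_Central_Hall.jpg"
image = httpx.get(image_path)
image.raise_for_status()  # fail early if the download did not succeed
prompt = "Caption this image."
chat = models[0].start_chat(history=[])
# Send the image part first, then the text prompt, in a single message.
response = chat.send_message([
    {'mime_type': 'image/jpeg',
     'data': base64.b64encode(image.content).decode('utf-8')},
    prompt,
])
print(response.text)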
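If you send more than one image, it may help to wrap the dict construction in a small helper. This is only a sketch: `make_image_part` is a name I made up, not part of any library, and it just builds the same `mime_type`/`data` dict used above from raw bytes.

```python
import base64


def make_image_part(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build an inline image part: MIME type plus base64-encoded bytes as text.

    Hypothetical helper; it mirrors the dict passed to send_message above.
    """
    return {
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("utf-8"),
    }


if __name__ == "__main__":
    # Dummy bytes stand in for image.content so this runs without a download.
    part = make_image_part(b"abc")
    print(part["mime_type"], part["data"])
```

With that, the call above would become `chat.send_message([make_image_part(image.content), prompt])`.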