How to get multi-part responses?

parakeet · November 26, 2024, 10:49pm

I upload a video with the intention of carrying out two tasks:

Identify speakers by caption.
Transcribe the video.

This works well in AI Studio, which gives me code for https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}

The generated code for the corresponding API call includes both the user prompts and the model responses for the two instructions. In my own request, I obviously leave out the model parts.

Can I achieve the two results using one API call? (I think it would save having to poll the big video twice?).

When I tried sending two user parts in the request body, the API returned only one response part, the speaker names; it did not include the transcript. (Is the number of response parts limited, just as candidateCount is?)

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "video/mp4"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "List the speakers in this interview. Their names, companies and job titles are found in chyrons at the bottom of the video. Return the results as a json object with name, company and job title."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "JSON GENERATED HERE OF THE SPEAKERS"
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Transcribe the video."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "VIDEO TRANSCRIPT GENERATED HERE BY LOVELY GEMINI API"
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 40,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}

jkirstaetter · November 27, 2024, 6:08am

Hi, you need to combine the two user parts into one message with two parts - text and fileData.
Furthermore, user and model messages have to be alternating.
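
A minimal sketch of that advice, assuming the request body is assembled in Python before being sent. The helper name merge_consecutive_roles and the FILE_URI_PLACEHOLDER value are hypothetical; the point is only that adjacent messages with the same role are folded into one message, so that roles alternate:

```python
import json

def merge_consecutive_roles(contents):
    """Fold adjacent messages that share a role into one message."""
    merged = []
    for message in contents:
        if merged and merged[-1]["role"] == message["role"]:
            # Same role as the previous message: append its parts there.
            merged[-1]["parts"].extend(message["parts"])
        else:
            # Copy the parts list so the caller's input is not mutated.
            merged.append({"role": message["role"], "parts": list(message["parts"])})
    return merged

# Three separate user messages, as in the original request body.
contents = [
    {"role": "user", "parts": [{"fileData": {"fileUri": "FILE_URI_PLACEHOLDER", "mimeType": "video/mp4"}}]},
    {"role": "user", "parts": [{"text": "List the speakers in this interview."}]},
    {"role": "user", "parts": [{"text": "Transcribe the video."}]},
]

merged = merge_consecutive_roles(contents)
print(json.dumps(merged, indent=2))
```

The result is a single user message whose parts array holds the fileData part followed by the two text parts.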

Susarla_Sai_Manoj · November 27, 2024, 6:25am

Hi @parakeet

In addition to the points mentioned in the comment, you can also incorporate few-shot prompting if possible.

Thanks.

parakeet · November 27, 2024, 7:33am

@jkirstaetter When I made a call with two parts of role user, the response answered only the first part and there was no second part.

Ohh, I see what you’re saying… put fileData, text 1 and text 2 all within a single parts array, so that there is only one node with a parts array.

I’ve just tried the following. While the response returns answers to both prompts, it still comes back with only one item in parts.

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "video/mp4"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "List the speakers in this interview. Their names, companies and job titles are found in chyrons at the bottom of the video. Return the results as a json object with name, company and job title."
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Transcribe the video."
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 40,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}

parakeet · November 27, 2024, 7:34am

So, ask it to 1. Return speaker names and 2. Transcribe the video in the same prompt?
I could probably ask it to return distinct JSONs or JSON nodes for each, and then just parse them post-operation?
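
One way the post-operation parsing could look, as a rough sketch. It assumes the prompt asked for one JSON object with "people" and "transcript" keys (names chosen here for illustration) and that the reply arrives as text under candidates[0].content.parts. The sample_response below is made-up data, not real API output:

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON

def extract_json(response):
    """Join the text parts of the first candidate and parse them as JSON."""
    parts = response["candidates"][0]["content"]["parts"]
    text = "".join(part.get("text", "") for part in parts).strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag)
        # and everything from the closing fence onwards.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)

# Illustrative response only; the names and text are invented.
sample_response = {
    "candidates": [{
        "content": {
            "role": "model",
            "parts": [{
                "text": FENCE + 'json\n{"people": [{"name": "Jane Doe", "company": "Example Corp", '
                        '"job_title": "CEO"}], "transcript": "Jane Doe: Hello."}\n' + FENCE
            }],
        }
    }]
}

result = extract_json(sample_response)
people, transcript = result["people"], result["transcript"]
print(people[0]["name"], "|", transcript)
```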

MOHD_RAMLAN_BIN_M_RO · November 27, 2024, 8:37am

Sure! Here are a few options:

Thanks for the clarification! I’ll give that a try.
Got it, thanks! Combining the user parts into one message makes sense.
Appreciate the tip about few-shot prompting, I’ll look into that.

jkirstaetter · November 27, 2024, 9:33am

Hi,
What you’re trying to achieve sounds more like a ChatSession with history than a single request. See here: Generate text using the Gemini API | Google AI for Developers

Anyway, here is the doc about text + video processing: Explore vision capabilities with the Gemini API | Google AI for Developers
Or have a look at the JS version of the content: Explore vision capabilities with the Gemini API | Google AI for Developers

The value of responseMimeType is not compatible. Compatible MIME types: application/json. Also, your prompt asking for JSON contradicts the text/plain response MIME type set in your request.

Right now, I would either split this into two generateContent requests or use a chatSession with history.
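
The two-request variant could be sketched roughly as follows. This only builds the request bodies and does not call the API; append_turn is a hypothetical helper, FILE_URI_PLACEHOLDER stands in for the uploaded file's URI, and first_reply_text stands in for whatever the first generateContent call returned:

```python
def append_turn(contents, model_text, next_user_text):
    """Return a new history: prior turns, the model's reply, then the next prompt."""
    return contents + [
        {"role": "model", "parts": [{"text": model_text}]},
        {"role": "user", "parts": [{"text": next_user_text}]},
    ]

# Request 1: the video plus the first instruction in a single user message.
first_contents = [
    {
        "role": "user",
        "parts": [
            {"fileData": {"fileUri": "FILE_URI_PLACEHOLDER", "mimeType": "video/mp4"}},
            {"text": "List the speakers in this interview as JSON."},
        ],
    }
]

# Placeholder for the text taken from the first response.
first_reply_text = '{"people": []}'

# Request 2: same history plus the model's answer and the second instruction.
second_contents = append_turn(first_contents, first_reply_text, "Transcribe the video.")

roles = [message["role"] for message in second_contents]
print(roles)
```

Both requests reference the same uploaded file URI, so the video itself is uploaded only once, and the roles in the second request alternate user, model, user.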

Susarla_Sai_Manoj · November 27, 2024, 9:59am

Hi @parakeet

You can show a simple example of how you want the output to look. Provide a basic JSON example for both “Identify speakers by caption” and “Transcribe the video.” You can also check this link.

Thanks

parakeet · November 27, 2024, 11:17am

My starting point was taking the code using the “Get code” option in a multi-message AI Studio chat, but the code it gives does not refer to chat, chatSession or startChat.

Of note: the example code for generateContent after upload (“Now generate content using that file”) includes only one parts item and, within it, only one text. I’m not sure whether that a) suggests I should use only one of each, or b) just means the code example is not exhaustive.

I’m starting to think I should / can only generate a single response from a single API query and, therefore, to minimise usage, I will combine.

Currently, I am having it generate a “people” JSON array for identified speakers, but just outputting a plain-text transcript, which I’m happy with. But I guess I may also have to find a way to output both “people” and transcript as a single JSON object.
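
A possible shape for such a combined request, as a sketch only. It assumes the model being called supports the responseSchema field of generationConfig together with responseMimeType set to application/json. The key names "people", "transcript" and "job_title" and the FILE_URI_PLACEHOLDER value are illustrative choices, not taken from AI Studio's generated code:

```python
import json

request_body = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"fileData": {"fileUri": "FILE_URI_PLACEHOLDER", "mimeType": "video/mp4"}},
                {"text": "List the speakers shown in the chyrons, then transcribe the video."},
            ],
        }
    ],
    "generationConfig": {
        # JSON output instead of text/plain, since the prompt asks for JSON.
        "responseMimeType": "application/json",
        # Assumed schema: one object holding both results.
        "responseSchema": {
            "type": "OBJECT",
            "properties": {
                "people": {
                    "type": "ARRAY",
                    "items": {
                        "type": "OBJECT",
                        "properties": {
                            "name": {"type": "STRING"},
                            "company": {"type": "STRING"},
                            "job_title": {"type": "STRING"},
                        },
                    },
                },
                "transcript": {"type": "STRING"},
            },
            "required": ["people", "transcript"],
        },
    },
}

# Serialized form, ready to be sent as the body of a generateContent request.
payload = json.dumps(request_body)
print(payload[:80])
```

With a body like this, the single text part that comes back should contain one JSON object, which can then be split into the speaker list and the transcript on the client side.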

ChatGPT suggests the API is indeed designed to return only a single response. Therefore, as you say, @jkirstaetter, the options must be either: 1) sequential calls or 2) a chat session.

Thanks, @jkirstaetter, @Susarla_Sai_Manoj for your input.
