How to get multi-part responses?

I upload a video with the intention of carrying out two tasks:

  1. Identify speakers by caption.
  2. Transcribe the video.

This works well in AI Studio, which gives me code for https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}

For the corresponding API call, note that the generated code shows both the user prompt and the model response for each of the two instructions. Obviously, I do not include the model parts in my own request.

Can I achieve the two results using one API call? (I think it would save having to poll the big video twice.)

When I tried feeding two user parts in the request body, the API returned only one part in the response, the speaker names; it did not include the transcript. (Is the number of response parts limited, just as candidateCount is limited?)

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "video/mp4"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "List the speakers in this interview. Their names, companies and job titles are found in chyrons at the bottom of the video. Return the results as a json object with name, company and job title."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "JSON GENERATED HERE OF THE SPEAKERS"
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Transcribe the video."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "VIDEO TRANSCRIPT GENERATED HERE BY LOVELY GEMINI API"
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 40,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}

Hi, you need to combine the two user parts into one message with two parts - text and fileData.
Furthermore, user and model messages have to be alternating.
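As a rough illustration of that advice, here is a minimal Python sketch that only builds the request body; it does not call the API. The helper names `build_single_turn_request` and `roles_alternate` and the placeholder file URI are made up for this example. The `contents`, `role`, `parts`, `fileData` and `text` fields follow the request body posted above.

```python
import json


def build_single_turn_request(file_uri, mime_type, prompts):
    """Build a body with ONE user message holding the file and every prompt."""
    parts = [{"fileData": {"fileUri": file_uri, "mimeType": mime_type}}]
    parts.extend({"text": prompt} for prompt in prompts)
    return {"contents": [{"role": "user", "parts": parts}]}


def roles_alternate(contents):
    """Return True if no two consecutive messages share the same role."""
    roles = [message["role"] for message in contents]
    return all(a != b for a, b in zip(roles, roles[1:]))


# "FILE_URI_PLACEHOLDER" stands in for the URI returned by the file upload.
body = build_single_turn_request(
    "FILE_URI_PLACEHOLDER",
    "video/mp4",
    [
        "List the speakers in this interview as a json object.",
        "Transcribe the video.",
    ],
)
print(json.dumps(body, indent=2))
```

Running `roles_alternate` on a body with several consecutive `user` messages returns False, which is a quick way to catch the non-alternating layout before sending anything.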

Hi @parakeet

In addition to the points mentioned in the comment, you can also incorporate few-shot prompting if possible.

Thanks.

@jkirstaetter When I made a call with two parts of role user, the response only answered the first part and there was no second part.

Ohh, I see what you’re saying… put fileData, text 1 and text 2 all within a single parts array, so that there is only one node with a parts array.

I’ve just tried the following. While the response returns answers to both prompts, it still comes back with only one item in parts.

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "video/mp4"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "List the speakers in this interview. Their names, companies and job titles are found in chyrons at the bottom of the video. Return the results as a json object with name, company and job title."
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Transcribe the video."
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 40,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}

So, ask it to 1. Return speaker names and 2. Transcribe the video in the same prompt?
I could probably ask it to return distinct JSONs or JSON nodes for each, and then just parse them post-operation?
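A possible way to do that post-processing, as a sketch only: the key names `people` and `transcript` and the sample response are invented for illustration, and the code assumes the model returns raw JSON text (with a text/plain response type it may wrap the JSON in other text, which this does not handle). The `candidates[0].content.parts[].text` path is the usual shape of a generateContent response.

```python
import json


def split_combined_answer(response):
    """Join the text of all returned parts and split it into two results."""
    parts = response["candidates"][0]["content"]["parts"]
    text = "".join(part.get("text", "") for part in parts)
    data = json.loads(text)
    return data["people"], data["transcript"]


# Hypothetical response: one part whose text is a single JSON object.
sample_response = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [
                    {
                        "text": json.dumps(
                            {
                                "people": [
                                    {
                                        "name": "Jane Doe",
                                        "company": "Example Corp",
                                        "job_title": "CTO",
                                    }
                                ],
                                "transcript": "Welcome to the interview.",
                            }
                        )
                    }
                ],
            }
        }
    ]
}

people, transcript = split_combined_answer(sample_response)
print(people[0]["name"])
print(transcript)
```

So even though only one item comes back in parts, both answers can be recovered from it as long as the prompt asks for a predictable structure.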

Hi,
What you’re trying to achieve sounds more like a ChatSession with history than a single request. See here: Generate text using the Gemini API  |  Google AI for Developers

Anyway, here is the doc about text + video processing: Explore vision capabilities with the Gemini API  |  Google AI for Developers
Or have a look at the JS version of the content: Explore vision capabilities with the Gemini API  |  Google AI for Developers

The value of responseMimeType is not compatible. Compatible MIME type: application/json. Your prompt asking for JSON contradicts the expected response MIME type.

Right now, I would either split this into two generateContent requests or use a chatSession with history.
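To make the second option concrete, here is a hedged Python sketch of what the two request bodies could look like when the history is carried by hand. The helper names and placeholder strings are assumptions for this example, and nothing is sent over the network. The second body references the same uploaded file URI, so the video is not uploaded again; whether the video is processed again on the second call is not something this sketch can show.

```python
def first_request(file_uri, prompt):
    """First turn: the uploaded video plus the first instruction."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"fileData": {"fileUri": file_uri, "mimeType": "video/mp4"}},
                    {"text": prompt},
                ],
            }
        ]
    }


def follow_up_request(previous_request, model_text, next_prompt):
    """Second turn: earlier history, the model's answer, then the new prompt."""
    history = list(previous_request["contents"])  # copy, do not mutate the input
    history.append({"role": "model", "parts": [{"text": model_text}]})
    history.append({"role": "user", "parts": [{"text": next_prompt}]})
    return {"contents": history}


speakers_request = first_request(
    "FILE_URI_PLACEHOLDER", "List the speakers in this interview."
)
transcript_request = follow_up_request(
    speakers_request,
    "SPEAKER JSON RETURNED BY THE FIRST CALL",
    "Transcribe the video.",
)
print([message["role"] for message in transcript_request["contents"]])
```

The roles come out as user, model, user, which keeps the alternation mentioned earlier in the thread.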

Hi @parakeet

You can show a simple example of how you want the output to look. Provide a basic JSON example for both “Identify speakers by caption” and “Transcribe the video.” You can also check this link.

Thanks

My starting point was taking the code using the “Get code” option in a multi-message AI Studio chat, but the code it gives does not refer to chat, chatSession or startChat.

Of note: the example code for generateContent after upload (“Now generate content using that file”) includes only one parts item and, within it, only one text. I’m not sure whether that a) suggests I should definitely use only a single one of each, or b) just means the code example is not extensive.


I’m starting to think I should / can only generate a single response from a single API query and, therefore, to minimise usage, I will combine.

Currently, I am having it generate a “people” JSON array for identified speakers, but just outputting a plain-text transcript, which I’m happy with. But I guess I may also have to find a way to output both “people” and transcript as a single JSON object.
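One way this might look, sketched in Python as a request body only: the prompt wording, the `job_title` key and the helper name `build_json_request` are invented for illustration. It sets responseMimeType to application/json, as suggested earlier in the thread, and keeps the other generationConfig values from the original request.

```python
import json

# Hypothetical prompt asking for both results in one JSON object.
COMBINED_PROMPT = (
    "Return one JSON object with two keys. "
    '"people": an array of objects with name, company and job_title, '
    "taken from the chyrons at the bottom of the video. "
    '"transcript": the full transcript of the video as a single string.'
)


def build_json_request(file_uri):
    """Single user message with the video and one combined prompt."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"fileData": {"fileUri": file_uri, "mimeType": "video/mp4"}},
                    {"text": COMBINED_PROMPT},
                ],
            }
        ],
        "generationConfig": {
            "temperature": 1,
            "topK": 40,
            "topP": 0.95,
            "maxOutputTokens": 8192,
            # Changed from text/plain so the reply is expected to be JSON.
            "responseMimeType": "application/json",
        },
    }


request_body = build_json_request("FILE_URI_PLACEHOLDER")
print(json.dumps(request_body["generationConfig"], indent=2))
```

One thing to watch: with everything in a single response, a long transcript has to fit within maxOutputTokens together with the people array.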


ChatGPT suggests the API is indeed designed to return only a single response. Therefore, as you say, @jkirstaetter, the options must be either: 1) sequential calls or 2) a chat session.

Thanks, @jkirstaetter, @Susarla_Sai_Manoj for your input.
