How to get multi-part responses?

I upload a video with the intention of carrying out two tasks:

  1. Identify speakers by caption.
  2. Transcribe the video.

This works well in AI Studio, which gives me code for https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}

For the corresponding API call, note that the generated code shows both the user prompt and the model response for each of the two instructions. Obviously, I do not include the model parts in my own request.

Can I achieve the two results using one API call? (I think it would save having to poll the big video twice.)

When I tried sending two user parts in the request body, the API returned only one response part, the speaker names; it did not include the transcript. (Is the number of response parts limited, just as candidateCount is?)

[
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "video/mp4"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "List the speakers in this interview. Their names, companies and job titles are found in chyrons at the bottom of the video. Return the results as a json object with name, company and job title."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "JSON GENERATED HERE OF THE SPEAKERS"
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Transcribe the video."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "VIDEO TRANSCRIPT GENERATED HERE BY LOVELY GEMINI API"
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 40,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}'

Hi, you need to combine the two user messages into a single message with two parts: text and fileData.
Furthermore, user and model messages have to alternate.
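
A minimal sketch of what that suggestion could look like, in Python. It only builds the request body and does not call the API. The top-level "contents" key, the helper names build_request_body and roles_alternate, and the wording of the combined prompt are my assumptions for illustration; the generationConfig values are copied from the question.

```python
import json


def build_request_body(file_uri: str, mime_type: str = "video/mp4") -> dict:
    """Build a request body with ONE user message that holds two parts.

    Assumption: the body is sent under a top-level "contents" key.
    The combined prompt wording below is illustrative only.
    """
    prompt = (
        "Do both of the following tasks and return the results in order.\n"
        "1. List the speakers in this interview. Their names, companies and "
        "job titles are found in chyrons at the bottom of the video. Return "
        "the results as a json object with name, company and job title.\n"
        "2. Transcribe the video."
    )
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    # Part 1: reference to the already uploaded video file.
                    {"fileData": {"fileUri": file_uri, "mimeType": mime_type}},
                    # Part 2: a single text prompt covering both tasks.
                    {"text": prompt},
                ],
            }
        ],
        # Values taken from the request shown in the question.
        "generationConfig": {
            "temperature": 1,
            "topK": 40,
            "topP": 0.95,
            "maxOutputTokens": 8192,
            "responseMimeType": "text/plain",
        },
    }


def roles_alternate(contents: list) -> bool:
    """Return True if no two consecutive messages share the same role."""
    return all(a["role"] != b["role"] for a, b in zip(contents, contents[1:]))
```

For example, json.dumps(build_request_body(file_uri), indent=2) gives a body that can be posted in place of the one in the question. With both tasks in one prompt, I would expect both results to come back within a single text response rather than as two separate parts, so the prompt should ask for a clear separator if you need to split them.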

Hi @parakeet

In addition to the points mentioned in the comment above, you can also use few-shot prompting where possible.
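
As a rough sketch of the few-shot idea, the conversation history can carry example user/model pairs before the real request, as long as the roles alternate. The helper name build_few_shot_contents and the example texts are hypothetical; only the role/parts/fileData shape comes from the request in the question.

```python
def build_few_shot_contents(examples, file_uri, prompt, mime_type="video/mp4"):
    """Build an alternating user/model message list ending with the real request.

    examples: list of (user_text, model_text) pairs used as few-shot examples.
    The final user message carries both the video reference and the prompt.
    """
    contents = []
    for user_text, model_text in examples:
        # Each example contributes one user turn followed by one model turn.
        contents.append({"role": "user", "parts": [{"text": user_text}]})
        contents.append({"role": "model", "parts": [{"text": model_text}]})
    # The real request comes last, as a single user message with two parts.
    contents.append(
        {
            "role": "user",
            "parts": [
                {"fileData": {"fileUri": file_uri, "mimeType": mime_type}},
                {"text": prompt},
            ],
        }
    )
    return contents
```

Here each example pair would show the output layout you want (speakers as JSON, then the transcript), so the final answer is more likely to include both sections.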

Thanks.