I upload a video with the intention of carrying out two tasks:
- Identify speakers by caption.
- Transcribe the video.
This works well in AI Studio, which gives me code for a corresponding API call against https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}. Note that the code shows both the user prompt and the model response for each of the two instructions. Obviously, I do not include the model parts in my own request.
Can I achieve both results using one API call? (I think it would save having to poll the big video twice.)
When I tried sending two user parts in the request body, the API returned only one response part, the speaker names; it did not include the transcript. (Is the number of parts in the response limited, just as candidateCount is limited?)
  [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "video/mp4"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "List the speakers in this interview. Their names, companies and job titles are found in chyrons at the bottom of the video. Return the results as a json object with name, company and job title."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "JSON GENERATED HERE OF THE SPEAKERS"
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Transcribe the video."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "VIDEO TRANSCRIPT GENERATED HERE BY LOVELY GEMINI API"
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 40,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}'
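
For reference, here is a minimal Python sketch of how the request body I tried can be assembled, with the model turns dropped and one user turn per instruction. The helper names strip_model_turns and build_request_body are hypothetical, and the top-level "contents" key is an assumption on my part, since the snippet above starts at the array. It only builds the JSON; it makes no API call and says nothing about how the API will respond.

```python
import json


def strip_model_turns(contents):
    """Drop the example model turns from an AI Studio style contents list."""
    return [turn for turn in contents if turn.get("role") != "model"]


def build_request_body(file_uri, prompts, mime_type="video/mp4"):
    """Build a request body: one file turn, then one user turn per prompt.

    The "contents" key is assumed; generationConfig mirrors the snippet above.
    """
    contents = [
        {
            "role": "user",
            "parts": [{"fileData": {"fileUri": file_uri, "mimeType": mime_type}}],
        }
    ]
    for prompt in prompts:
        contents.append({"role": "user", "parts": [{"text": prompt}]})
    return {
        "contents": contents,
        "generationConfig": {
            "temperature": 1,
            "topK": 40,
            "topP": 0.95,
            "maxOutputTokens": 8192,
            "responseMimeType": "text/plain",
        },
    }


if __name__ == "__main__":
    body = build_request_body(
        "${FILE_URI_0}",  # placeholder, as in the snippet above
        [
            "List the speakers in this interview.",
            "Transcribe the video.",
        ],
    )
    print(json.dumps(body, indent=2))
```

This is the shape that returned only the speaker names for me: the video turn followed by two separate user turns.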