Hi Team,
We are trying to pass example audio clips to the Gemma 3n 4B model as few-shot examples for low-resource languages.
Our Uvicorn server takes about 16 seconds to return output, even on GPU. The bigger issue is that the example audio is not being accepted through the prompt: the audio is ignored and we end up running pure text generation. The warning we get is:
Keyword argument audios is not a valid argument for this processor and will be ignored.
Is there a way to get this working when serving through Uvicorn?
Hi @Dibyajyoti_Mishra,
Gemma 3n is a multimodal model that supports audio input, but audio for few-shot prompting has to be passed in the model’s multimodal chat format through its dedicated processor, not as a generic keyword argument like audios (which is why that argument is being ignored).
To provide audio as part of your prompt (for few-shot examples or direct transcription/translation), you must format the input according to the Gemma 3n multimodal chat template.
An example structure (following the Hugging Face docs for Gemma 3n) is:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "your_audio_data_or_path"},
            {"type": "text", "text": "Please transcribe this audio into English."},
        ],
    },
    {
        "role": "assistant",
        "content": "Transcription result for the user's audio.",
    },
    # Add more few-shot pairs here using the same structure,
    # then finish with the final user query (audio + instruction).
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,  # needed so the audio features are returned along with the token ids
    return_tensors="pt",
)
For few-shot examples, you would place multiple alternating user (with audio and instructions) and assistant (with the correct text output) messages within this messages list before the final query. This is how the model learns the pattern from the examples.
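For concreteness, a few-shot version of the messages list might look like the sketch below (the audio file names and transcriptions are just placeholders; substitute your own low-resource-language examples):

few_shot_messages = [
    # Example 1: input audio + instruction, followed by the correct transcription
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "example_1.wav"},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    },
    {"role": "assistant", "content": "Reference transcription for example 1."},
    # Example 2
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "example_2.wav"},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    },
    {"role": "assistant", "content": "Reference transcription for example 2."},
    # Final query: the audio you actually want transcribed
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "query.wav"},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    },
]

This list then goes through processor.apply_chat_template exactly as shown above; with add_generation_prompt=True, the model’s next turn should be the transcription of the final query.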
Thanks.