We need real audio understanding, like Vision has, not the current so-called 'audio understanding,' which as of this writing is just a vague audio-to-text transcription produced by a separate AudioToText model. Music is emotional and hard to capture in text, and I understand that training an emotionless AI to understand it is a huge challenge, let alone generating tons of such samples.
Still, I have a few ideas for training samples that might help an audio-capable model approximate the emotions/feelings of sounds, capturing the features of music a little better than the current detached, emotionless descriptions:
-
Type 1 (Story With vs. Without Soundtrack): A novel-generation training sample in two versions, one with a soundtrack and one without. The two samples should have contrasting narration in their storytelling, and possibly explicit mentions of parts of the audio or repeated lyrics, if any. Since the two stories are about the exact same thing, with the only difference being the track, the 'emotional' features are successfully captured if the model is trained on tons of contrasting variations like this. (It would teach the model to recognize how the same words can be colored emotionally by background audio: sad, hopeful, chaotic, eerie, etc.)
-
Type 2 (Multi-Persona Reaction): Different systemInstructions for different characters reacting to the same soundtrack or a small sudden sound (Theory of Mind: how different personas would react to the same stimulus).
-
Type 3 (Sound as Instruction): A soundtrack provided as a "mood" for a character inside the systemInstructions, with little to no textual instruction needed: just pure sound, and the character embodying it as if it were 'instructions.'
-
Type 4 (Optional - Music Collaboration): Gemini “helping” a music creator compose a track.
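To make the Type 1 idea concrete, here is a minimal sketch of what one contrastive pair might look like as data: the same story prompt rendered twice, once with a soundtrack attached and once without, so the only variable is the audio's emotional coloring. All field names (`prompt`, `audio_uri`, `narration`) and the example URI are purely illustrative assumptions, not any real Gemini/Vertex AI schema.

```python
# Hypothetical Type 1 contrastive training pair: two samples identical in
# every way except the attached soundtrack and the emotionally colored
# narration it produces. Field names are illustrative, not a real API.

def make_contrastive_pair(story_prompt, audio_uri, narration_with, narration_without):
    """Return (with_track, without_track) samples differing only in audio."""
    base = {"prompt": story_prompt}
    with_track = {**base, "audio_uri": audio_uri, "narration": narration_with}
    without_track = {**base, "audio_uri": None, "narration": narration_without}
    return with_track, without_track

pair = make_contrastive_pair(
    "Write a short story about a lighthouse keeper.",
    "gs://example-bucket/eerie_drone.wav",  # placeholder soundtrack URI
    "The lamp flickered; something in the fog seemed to answer it.",
    "The lamp turned steadily as the keeper logged the evening weather.",
)
```

Generating many such pairs, where the prompt is held fixed and only the track varies, is what would let the model attribute the narrative shift to the audio itself.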
Creating this at scale is tough, but worth it. True audio understanding could let Gemini give meaningful, professional feedback to music creators, instead of just guessing at "synths and drums" or 'advising' me to add the track's name in Morse code because "it'd be cool"…
I truly love this model and hope to see it genuinely capable of that feat one day, not just the currently proclaimed audio understanding.
Big thanks to the Gemini / Vertex AI team.