We need real audio understanding, like Vision has, not the current so-called 'audio understanding,' which as of this writing is just a vague audio-to-text transcription produced by a separate AudioToText model. Music is emotional and hard to capture in text, and I understand that training an emotionless AI to understand it is a huge challenge, let alone generating tons of such samples.
Still, I have a few ideas for training samples that might help an audio-capable model approximate the emotions/feelings of sounds, capturing the features of music a little better than the current detached, emotionless descriptions:
-
Type 1 (Story With vs. Without Soundtrack): A novel-generation training sample in two versions, one with a soundtrack and one without. The two samples should have contrasting narration in their storytelling, and possibly explicit mentions of parts of the audio or repeated lyrics, if any. Since the two stories are about the exact same thing, with the only difference being the track, the 'emotional' features are successfully captured if the model is trained on tons of contrasting variations like this. (It would teach the model to recognize how the same words can be colored emotionally by background audio: sad, hopeful, chaotic, eerie, etc.)
-
Type 2 (Multi-Persona Reaction): Different systemInstructions for different characters reacting to the same soundtrack or a small sudden sound (Theory of Mind: how different personas would react to the same stimulus).
-
Type 3 (Sound as Instruction): A soundtrack provided as a "mood" for a character inside the systemInstructions, with little to no textual instruction needed: just pure sound, and the character embodying it as if it were 'instructions.'
-
Type 4 (Optional - Music Collaboration): Gemini “helping” a music creator compose a track.
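To make the Type 1 idea concrete, here is a minimal sketch of what one contrastive pair might look like as data: the same story prompt rendered twice, once with a soundtrack attached and once without, so the only variable is the audio's emotional coloring. All field names (`prompt`, `audio_uri`, `narration`) and the example URI are purely illustrative assumptions, not any real Gemini/Vertex AI schema.

```python
# Hypothetical Type 1 contrastive training pair: two samples identical in
# every way except the attached soundtrack and the emotionally colored
# narration it produces. Field names are illustrative, not a real API.

def make_contrastive_pair(story_prompt, audio_uri, narration_with, narration_without):
    """Return (with_track, without_track) samples differing only in audio."""
    base = {"prompt": story_prompt}
    with_track = {**base, "audio_uri": audio_uri, "narration": narration_with}
    without_track = {**base, "audio_uri": None, "narration": narration_without}
    return with_track, without_track

pair = make_contrastive_pair(
    "Write a short story about a lighthouse keeper.",
    "gs://example-bucket/eerie_drone.wav",  # placeholder soundtrack URI
    "The lamp flickered; something in the fog seemed to answer it.",
    "The lamp turned steadily as the keeper logged the evening weather.",
)
```

Generating many such pairs, where the prompt is held fixed and only the track varies, is what would let the model attribute the narrative shift to the audio itself.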
Creating this at scale is tough, but worth it. True audio understanding could let Gemini give meaningful, professional feedback to music creators, instead of just guessing at "synths and drums" or 'advising' me to add the track's name in Morse code because "it'd be cool"…
I truly love this model and hope to see it genuinely capable of that feat one day, not just the currently proclaimed audio understanding.
Big thanks to the Gemini / Vertex AI team.