Enigma Sound: Multi-Modal Emotion-to-Music Pipeline Fusing Face, Voice, and Text (IEEE Research)

“Hello Community! I’m sharing my work on Enigma Sound: AI-driven Music Generation with Emotion Intelligence, a tri-modal affective computing project. My research focuses on creating a seamless ‘emotional flow’ by fusing three distinct data streams into real-time generative music.”

The Architecture:

To ensure emotional and temporal consistency, I implemented a custom CNN-LSTM data-fusion pipeline (simplified code sketches for the individual stages follow the list below):

  • Visual: Real-time facial expression recognition (FER) capturing micro-expressions.

  • Auditory: Feature extraction using Librosa for audio prosody (pitch, tempo, and energy).

  • Textual: Natural Language Processing for sentiment intensity in user prompts.

  • Generative Engine: Logic-based MIDI synthesis using Music21 and FluidSynth, driven by a unified valence-arousal coordinate.

  • Ecosystem Integration: A hybrid mode that also redirects the user to Spotify playlists tailored to the detected emotion.
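
For the visual stream, here is a minimal per-frame FER sketch. The Haar cascade ships with OpenCV; the `fer_cnn.h5` path and the seven-label emotion order are hypothetical placeholders standing in for the project's actual trained CNN.

```python
# Minimal facial-expression-recognition pass over a single frame (placeholders noted above).
import cv2
import numpy as np
from tensorflow.keras.models import load_model

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]  # assumed label order
face_detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
fer_model = load_model("fer_cnn.h5")  # hypothetical path to the trained FER CNN

def emotion_from_frame(frame):
    """Return (label, probabilities) for the largest detected face, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])          # keep the largest face
    roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0  # 48x48 grayscale, scaled to [0, 1]
    probs = fer_model.predict(roi.reshape(1, 48, 48, 1), verbose=0)[0]
    return EMOTIONS[int(np.argmax(probs))], probs
```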
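
For the auditory stream, this is roughly what the Librosa prosody extraction looks like; the exact feature set in the paper may differ, but pitch, tempo, and energy map onto these standard calls.

```python
# Prosody features with Librosa: frame-wise pitch, global tempo, and RMS energy.
import numpy as np
import librosa

def prosody_features(path):
    y, sr = librosa.load(path, sr=22050)                        # mono audio at 22.05 kHz
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"), sr=sr)      # frame-wise pitch estimate (Hz)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)              # global tempo in BPM
    rms = librosa.feature.rms(y=y)[0]                           # frame-wise energy
    return {
        "pitch_mean": float(np.nanmean(f0)),
        "pitch_std": float(np.nanstd(f0)),
        "tempo_bpm": float(np.atleast_1d(tempo)[0]),
        "energy_mean": float(rms.mean()),
    }
```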
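
For the textual stream, sentiment intensity on the user prompt can be sketched with VADER (the `vaderSentiment` package); this is only a stand-in for the NLP component described in the paper, but its compound score is a convenient valence proxy.

```python
# Sentiment intensity as a valence proxy: VADER's compound score lies in [-1, 1].
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def text_valence(prompt: str) -> float:
    """Map a user prompt to a valence estimate in [-1, 1]."""
    return analyzer.polarity_scores(prompt)["compound"]

print(text_valence("I just got great news and feel amazing!"))   # strongly positive
print(text_valence("This has been a gloomy, exhausting day."))   # negative
```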
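
The published pipeline fuses the streams with a learned CNN-LSTM, so the confidence-weighted late fusion below is only an illustrative simplification, assuming each modality has already been reduced to a (valence, arousal) estimate plus a confidence score.

```python
# Illustrative late fusion: confidence-weighted average of per-modality (valence, arousal)
# estimates. The actual project learns this mapping with a CNN-LSTM instead.
import numpy as np

def fuse(modalities):
    """modalities: list of (valence, arousal, confidence) tuples, one per stream."""
    va = np.array([[v, a] for v, a, _ in modalities], dtype=float)
    weights = np.array([c for _, _, c in modalities], dtype=float)
    weights = weights / weights.sum()
    valence, arousal = (weights[:, None] * va).sum(axis=0)
    return float(valence), float(arousal)

# Face: mildly positive; voice: highly aroused; text: strongly positive.
print(fuse([(0.3, 0.5, 0.8), (0.1, 0.9, 0.6), (0.8, 0.4, 0.9)]))
```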
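
Finally, for the generative engine, a minimal sketch of how a valence-arousal point can drive rule-based MIDI synthesis with Music21: valence picks major or minor material, arousal scales the tempo. The note pools, tempo range, and rhythm rule are placeholders rather than the paper's exact logic, and the resulting `.mid` file can be rendered to audio with FluidSynth (e.g. `fluidsynth -ni soundfont.sf2 emotion.mid -F emotion.wav`).

```python
# Valence -> mode, arousal -> tempo: a simplified rule-based mapping to MIDI via Music21.
import random
from music21 import stream, note, tempo

MAJOR_POOL = ["C4", "D4", "E4", "G4", "A4", "C5"]    # placeholder major-flavoured pitch pool
MINOR_POOL = ["C4", "D4", "E-4", "G4", "A-4", "C5"]  # placeholder minor-flavoured pitch pool

def generate_midi(valence, arousal, n_notes=16, out_path="emotion.mid"):
    """valence, arousal in [-1, 1]; writes a short MIDI phrase to out_path."""
    pool = MAJOR_POOL if valence >= 0 else MINOR_POOL
    bpm = int(60 + (arousal + 1) * 60)                    # map arousal to roughly 60-180 BPM
    part = stream.Stream()
    part.append(tempo.MetronomeMark(number=bpm))
    for _ in range(n_notes):
        n = note.Note(random.choice(pool))
        n.quarterLength = 0.5 if arousal > 0 else 1.0     # higher arousal -> shorter notes
        part.append(n)
    part.write("midi", fp=out_path)
    return out_path

generate_midi(valence=0.6, arousal=0.7)
```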

Research & Edge Efficiency:

This work was published in IEEE and focused on high-performance execution on edge devices. Using Flutter, I optimized the pipeline to run on mid-range mobile hardware, ensuring low-latency emotional mapping.
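
The deployment details aren't in this post, so the following is only a hedged sketch assuming the CNN-LSTM is a Keras model exported to TensorFlow Lite, a common route onto mid-range phones and into Flutter apps via the tflite_flutter plugin.

```python
# Hedged sketch: converting a Keras CNN-LSTM to a quantized TFLite model for on-device inference.
import tensorflow as tf

model = tf.keras.models.load_model("emotion_cnn_lstm.h5")    # hypothetical trained fusion model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # post-training quantization
# LSTM ops sometimes lack built-in TFLite kernels, so allow TF select ops as a fallback:
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]

with open("emotion_cnn_lstm.tflite", "wb") as f:
    f.write(converter.convert())
```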

The Road Ahead:

The project has already seen over 100 clones on GitHub. I am now exploring how to scale this architecture by migrating the custom LSTM layers to Vertex AI or leveraging Gemini’s multimodal embeddings for more complex orchestration.

I’d love to connect with anyone working on ‘emotional latency’ in real-time generative apps or those who have moved custom LSTM pipelines to Google’s managed services.

Links: