Hey all,
I’m building a real-time meeting assistant where the goal is to track the progress of predefined agenda items during ongoing conversations.
Setup:
- The meeting audio is captured as a raw studio stream (PCM 16kHz).
- There can be 50–100 participants, though only a few speak at any given time, as in any typical active discussion.
- Right now, we send the stream to Deepgram for live transcription, and every 10 seconds we pass the transcribed text to GPT-4.1 Nano (or another LLM) to:
  - Detect which agenda item is being discussed
  - Determine its status: Not Started, In Progress, or Completed
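For context, here is a minimal sketch of that 10-second polling loop. The agenda items are illustrative, the prompt wording is simplified, and `drain_transcript_buffer` is a hypothetical helper standing in for our Deepgram websocket plumbing:

```python
import json
import time

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

AGENDA = ["Budget review", "Q3 roadmap", "Hiring plan"]  # illustrative items
STATUSES = ["Not Started", "In Progress", "Completed"]

def classify_window(transcript_window: str) -> dict:
    """Ask the model which agenda item the snippet covers and its status."""
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You track meeting agenda progress. Given a transcript snippet, "
                    f"pick one agenda item from {AGENDA} and one status from "
                    f'{STATUSES}. Reply as JSON: {{"item": ..., "status": ...}}'
                ),
            },
            {"role": "user", "content": transcript_window},
        ],
    )
    return json.loads(resp.choices[0].message.content)

while True:
    # drain_transcript_buffer() is a hypothetical helper fed by Deepgram's
    # websocket callbacks; it returns whatever text arrived since the last call.
    window = drain_transcript_buffer()
    if window.strip():
        print(classify_window(window))
    time.sleep(10)
```

The 10-second batching is where almost all of our latency comes from, which is what motivates the question below.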
What We Want to Achieve:
We’re aiming for real-time agenda tracking: ideally sub-1-second updates instead of waiting 10 seconds.
To reduce this lag, I’m exploring whether we can skip Deepgram entirely and use the Gemini Live API for both:
- transcription, and
- natural language understanding (i.e., tracking agenda progress in real time)
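To make the question concrete, here is roughly what we imagine the one-hop version looking like with the google-genai Python SDK. This is an untested sketch based on our reading of the Live API docs: the model name is a placeholder, and the exact call shapes (`connect`, `send_realtime_input`, `receive`) may not match the current SDK, so corrections are very welcome.

```python
import asyncio

from google import genai
from google.genai import types  # pip install google-genai

client = genai.Client()  # reads the API key from the environment

MODEL = "gemini-2.0-flash-live-001"  # placeholder; use whatever Live-capable model is current

CONFIG = {
    "response_modalities": ["TEXT"],
    "system_instruction": (
        "You are a meeting agenda tracker. As audio arrives, emit updates of the "
        'form {"item": <agenda item>, "status": "Not Started" | "In Progress" | "Completed"}.'
    ),
}

async def track_agenda(pcm_chunks):
    """Stream raw 16 kHz 16-bit PCM into a Live session and read back agenda updates.

    pcm_chunks is assumed to be an async iterator of bytes from our studio capture.
    """
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:

        async def feed_audio():
            async for chunk in pcm_chunks:
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        feeder = asyncio.create_task(feed_audio())
        async for message in session.receive():
            if message.text:
                print("agenda update:", message.text)
        feeder.cancel()
```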
My Questions:
- Can the Gemini Live API handle raw 16 kHz PCM audio directly?
- If not, what preprocessing is needed to make the stream consumable?
- Can it transcribe and understand intent simultaneously, so we don’t need a separate transcription layer?
- Is there a way to continuously stream live context (like the current agenda list and previous discussion state) to the Gemini Live API? (A rough sketch of what we mean follows this list.)
- Has anyone tried this kind of “one-hop” LLM streaming architecture before?
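For the context-streaming question, our working idea is to push the agenda list and the last-known statuses into the same session as ordinary text turns alongside the audio. Again an untested sketch under the same assumptions as above: `send_client_content` is the call we believe the SDK uses for out-of-band text, `session` is the Live session from the previous sketch, and the payload shape is our own invention:

```python
import json

from google.genai import types

async def push_context(session, agenda: list[str], state: dict[str, str]):
    """Inject the agenda and last-known statuses as a text turn alongside the audio.

    session is the Live session from the sketch above; state maps item -> status.
    """
    payload = json.dumps({"agenda": agenda, "state": state})
    await session.send_client_content(
        turns=types.Content(
            role="user",
            parts=[types.Part(text="Current agenda context: " + payload)],
        ),
        turn_complete=True,
    )
```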
Any pointers, success stories, or even architecture sketches would be incredibly helpful. We’re happy to consider hybrid or fallback options too if a complete replacement isn’t practical yet.
Thanks in advance!