Since the ChatGPT desktop app was announced alongside GPT-4o, I have been dying to build an app that can see and hear. But I can't even get my head around how to design this at a high level.
As far as we know, no multimodal LLM can accept live streaming video as input, so how do we work around this to emulate a similar experience, where your AI assistant hears what you hear and sees what you see?
Do we constantly take one-second video clips and feed them to the LLM? I know a 1 or 2 million token limit sounds like a lot, but it will fill up pretty quickly. Even if we could solve this data ingestion issue, what about the lag caused by constantly uploading videos to the LLM? What about the LLM response delay?
We don't need the LLM to respond to every second of video, obviously. But that also means all the past video needs to be kept in the history as context, so when you ask the model a question, it knows what you have been hearing or seeing in the last 5 minutes. Where are we going to keep that? In a text-only application, there is always a conversation variable that keeps an array of information; each new user prompt and system response is appended to this history variable.
What strategy do we use for the multimodal case? If we keep all the input formats (text, audio, images and video), this variable will grow large quickly too, with little benefit. A 5-minute video of watching a sunset contains little information, yet the video file could be very large. In that case, maybe all the multimodal information should be converted into text embeddings and only the delta kept, so the conversation history variable stays lean. But I have no idea how to do that. Is that even a thing?
I'd love to hear from all the innovative minds out there: how would you build an AI assistant that can hear and see what's going on on your desktop, so you could watch a movie or play a game together? What strategy would you use to reduce the delay and mimic a live experience?
The ChatGPT desktop app can do that, and so can Microsoft Copilot Recall. There must be a technique to achieve this while minimising the delay and keeping the conversational experience.
Welcome to the forums! You certainly decided to jump in with a challenging question. Let's break some of it down bit by bit.
You’re clearly not the only one! This is very much bleeding edge. Google demonstrated their future version of it as part of Project Astra - Google DeepMind.
That's correct. The usual practice has been to sample frames from the video. Gemini's built-in tools do this once a second, while GPT-4o's guidance suggests 2-4 frames per second but doesn't provide a built-in way to do it. Neither, as you note, streams the video.
So we know they’re breaking it up into chunks and feeding them to the model.
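As a rough illustration of that chunking, here's a minimal sketch of sampling about one frame per second from a local recording with OpenCV (the file name and rate are just placeholders):

```python
# Sketch: sample roughly one frame per second from a local video file
# with OpenCV, so individual frames can be sent to the model instead of
# the full stream. File name and sampling rate are placeholders.
import cv2

def sample_frames(path: str, frames_per_second: float = 1.0):
    capture = cv2.VideoCapture(path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(native_fps / frames_per_second), 1)

    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # BGR numpy array; encode to JPEG before upload
        index += 1
    capture.release()
    return frames

frames = sample_frames("screen_recording.mp4", frames_per_second=1.0)
print(f"Kept {len(frames)} frames")
```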
The response only really comes into play when you ask it a question based on the accumulated history. So that involves detecting when a question itself is being asked and making sure it gets processed, as opposed to just accumulating information.
There are many possible approaches to that. From hotword trigger detection ("Hey, Gemini") to proximity detection for audio input to filtering out background audio.
You maintain the same conversational history variable, with the input messages containing multiple parts that represent the stream of input you're getting.
In Gemini, this is represented as an array of Content objects, with the role field specifying whether the message was from the "user" (representing the accumulated information since the last question, plus the new question) or the "model" (representing the response from Gemini).
The Content includes an array of Part objects. Each part can be text or data (data being a file of some sort - the frame image, a bit of audio, etc).
But it is important to remember that Gemini doesn’t see this, really. All it sees is a series of tokens.
In Gemini, this history will contain the parts - text and URI references to the other media. We’re not generally storing the video - just frames of it. But Gemini is also managing that aspect of it.
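To make that concrete, here's a minimal sketch of such a history with the Python SDK; the model name, file names and prompts are all placeholders, and inline JPEG bytes stand in for the URI references a production app might use:

```python
# Sketch: a conversation history where each "user" turn carries the frames
# accumulated since the last question plus the question text.
# Model name, file names and prompts are placeholders.
import pathlib
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

def image_part(path: str) -> dict:
    return {"mime_type": "image/jpeg", "data": pathlib.Path(path).read_bytes()}

history = [
    {
        "role": "user",
        "parts": [image_part("frame_0001.jpg"),
                  image_part("frame_0002.jpg"),
                  "What is on the screen right now?"],
    },
    {"role": "model", "parts": ["You appear to be looking at a chess board."]},
]

chat = model.start_chat(history=history)
response = chat.send_message([image_part("frame_0003.jpg"), "And now?"])
print(response.text)
```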
Well… some thoughts here:
Why convert them into text embeddings? Why not just convert them into embeddings? There are multimodal embedding models that will embed media and text into the same embedding space.
Embeddings aren’t necessarily the best approach here. They’re good for vector searching, but we’re not necessarily trying to search here.
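For reference, a rough sketch of what a multimodal embedding call looks like with Vertex AI's model; the project, file name and text are placeholders:

```python
# Sketch: embed an image frame and a text snippet into the same vector
# space with Vertex AI's multimodal embedding model, so they can be
# compared or searched together. Project, region and file are placeholders.
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

embeddings = model.get_embeddings(
    image=Image.load_from_file("frame_0001.jpg"),
    contextual_text="a sunset over the ocean",
)
# image_embedding and text_embedding are vectors in the same space
print(len(embeddings.image_embedding), len(embeddings.text_embedding))
```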
The powerful feature of these natively multimodal models (such as Gemini 1.5) is that they were trained on multimodal inputs directly. So they have tokens representing the media that are the equivalent tokens to those that come from text.
So, just like text, the input to the model is a series of tokens.
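You can see that directly by asking the SDK to count tokens for a mixed prompt (a quick sketch; the file name is a placeholder):

```python
# Sketch: an image is billed as tokens just like text. File name is a placeholder.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

frame = PIL.Image.open("frame_0001.jpg")
print(model.count_tokens(["Describe this frame.", frame]))
# The image contributes a fixed block of tokens (roughly 258 at the default
# resolution) on top of the text tokens.
```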
There is a lot of talk about "Prompt Caching", but I tend to think that this could (and perhaps should) be used as a form of prompt state management. Based on the current state, we add tokens to advance the state, and the model adds tokens in its reply (and sends the reply to us as well). This leaves things in a new state, so we're only ever adding new things to this state to update it.
Prompt Caching isn’t public yet, so I’m not sure if this is actually available. Or what the time and memory constraints of it might be.
You’re asking all the right questions. But I think everyone is still experimenting and trying to find the answers. And despite what OpenAI seems to think - I think we’re a long way away from it, at least using the current technologies.
Quite honestly, I’m not the least bit interested in my desktop. And I think Microsoft’s focus on it is reflecting their historic bias. Most of what I do here is already captured in various other documents and activity. And if it isn’t - it probably shouldn’t be captured at all. I don’t need something taking pictures of me writing this reply - I have the reply itself. And all of that can be done with embeddings. But embeddings may lose the context of things - they’re not perfect solutions.
My world is bigger than the desktop.
What I am interested in is seeing (literally) how these tools help me understand and work with the world around me. The one I live in. And I think that’s what Project Astra was showing.
But there are definite challenges here. Privacy being the biggest. And not just my privacy, but the privacy of other people that I may encounter. But that’s difficult to keep in mind when you have (and want) a device that collects “everything” because you may need to ask about “anything”.
I want a tool that is there when I need it, but out of my way when I don’t.
Thank you @afirstenberg for the clear explanation. I love how you broke apart my questions and addressed them individually. My head is much clearer now.
I also learned a lot from your reply, such as the multimodal embeddings.
Thank you for confirming that I am asking the right questions, and I agree with all the points you highlighted.
On the topic of capturing high-frequency video clips, is there any recommended method on a Mac? For most computers, there might be a performance issue: you need to start a process, record, save the clip to disk, and restart it the following second. Do we need to consider multiple threads capturing at the same time, with offset timing and a bit of overlap? Do LLMs have a preferred video format?
I'm right there with you. The idea I've had and have been chewing on since 1.5 Preview access was granted is trying to get "close" to real-time with either a webcam or an old deprecated Dropcam (great cameras; shame a company disabled them, I won't name names but you know who you are). The closest I got was to chop it up, finding a sweet spot of roughly 5-second clips, and making sure they were high FPS so they produced slightly better responses.
It was still slow. Considering the overhead costs, latency, inference and so on versus just taking a 30-minute video and processing that, it showed me either a skill issue (and there is most likely plenty of that on my end) or that we don't appear to have granular control over how the API processes the video, i.e. every 5 frames or every frame, what it does with those frames, and how audio is processed and associated with the frames.
I've been looking around and would like to hear from people with experience in format, FPS, resolution and all that good stuff, and what nets better results and why.
@nhtkid great question, and @afirstenberg great responses. One of those responses had me thinking about privacy. While I get it, it reminded me that Tesla footage is often shown in mainstream media during "breaking news" segments. Those are recorded in public, and where I live at least there are rules on what counts as a reasonable expectation of privacy (born from photographers back in the day). But if Teslas are allowed to record all day, all over, everywhere, and that footage is accessible (for media, law, etc.), then one could argue it would be reasonable to assume footage gathered for a "world view" project would be allowed too. I know that's getting into a much larger, less-than-fun topic; just a thought.
That said, an app that can hear and see has been done many times over; the concept isn't new. It's at least a decade or two old from the surveillance perspective, in retail, at home, etc., detecting faces, dogs, porch pirates and so on. All those solutions use what appears to be the less capable approach: not an LLM at all, but machine learning and computer vision, labeling, etc. (and then we go off into the AI vs. ML/CV debate). Nonetheless, I wonder if there is a way to reverse engineer how a Nest Cam or doorbell cam does it.
Or just beg and beg for preview access to Project Astra (and the hardware, and ops tooling) to run it.
Following to see what anyone comes up with. I'm very interested.
One thing to keep in mind is that you don’t need to record a video - you need to take a snapshot.
Gemini (and GPT-4o, for that matter) takes a video and turns it into frames, processing each frame.
So it might be just as easy to start with the frames you capture (how often you do so is another story), and then work out how you capture and handle the audio (which is slightly lighter weight).
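To the earlier Mac question, here's a lightweight sketch of that snapshot loop using the cross-platform mss library; the interval, frame count and file names are arbitrary choices:

```python
# Sketch: grab a screenshot of the primary monitor every few seconds
# instead of recording video. Uses the third-party `mss` library;
# interval, frame count and file names are arbitrary choices.
import time
import mss
import mss.tools

def capture_loop(interval_seconds: float = 2.0, max_frames: int = 10):
    with mss.mss() as screen:
        monitor = screen.monitors[1]  # primary display
        for i in range(max_frames):
            shot = screen.grab(monitor)
            mss.tools.to_png(shot.rgb, shot.size, output=f"frame_{i:04d}.png")
            time.sleep(interval_seconds)

capture_loop()
```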
We know that Astra today runs with Gemini 1.5.
From my brief experiments with it at I/O, it is not terribly good at quick motions, and it does tend to have a small latency when you interrupt it. I wouldn't be surprised if it was just taking image grabs and processing them as a stream.
@Andrew I hear you, my friend. And yes, the idea of AI is not new; it has been part of our lives for the past decade or two, implemented in the shape of machine learning and computer vision, in our factories, airports and big data analysis. But the whole recipe had a missing ingredient, which we now have: the LLM.
I am very excited about the future, where we can apply an LLM layer to all the AI infrastructure we have already built, to feed it data and perform better analysis and reasoning (if it could stop hallucinating).
And thanks for mentioning the elephant in the room, Tesla AI.
Companies pioneering self-driving have to ingest live video. Are they also processing frames at high frequency?
@afirstenberg I will start with the basics: just a screenshot every couple of seconds, trying to play chess online.
So far, I am finding that neither GPT-4o nor Gemini 1.5 is able to read the chess board correctly. I might try other open-source models on Hugging Face too.
But what am I doing wrong?
I first thought there might be an issue with my code.
So I just used Google's Cookbook in Colab and replaced the image with my chess board screenshot; it still can't read each chess piece's position correctly.
Is there a way to configure how the image is processed when it is fed to the LLM? I suspect it might be fed at a reduced resolution, which could be why the LLM can't read it clearly.
Most companies working on autonomous driving aren't using camera images. They're using LIDAR primarily and using images to supplement that, if at all.
But yes, they’re processing at a high frame rate. But also using a custom tuned image model. Not a general purpose LLM.
This doesn't surprise me, to be honest. It is a general model, so an overhead shot of a chess board doesn't give a lot of clues about what each chess piece looks like, and a side shot doesn't clearly identify depth positions.
You might try, as part of your prompt, including images (at the same angle and depth) of the different kinds of chess pieces to see if that helps. Otherwise, I would definitely consider evaluating other models.
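Something along these lines (a hedged sketch; the model name, file names and prompt wording are purely illustrative):

```python
# Sketch: pass labelled reference images of each piece alongside the board
# screenshot, so the model has visual anchors. All file names, the model
# name and the prompt text are illustrative only.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

reference = [
    "Here is what each piece looks like on this board:",
    "White knight:", PIL.Image.open("ref_white_knight.png"),
    "Black queen:", PIL.Image.open("ref_black_queen.png"),
    # ...one labelled reference image per piece type...
]

prompt = reference + [
    "Now list the position of every piece on this board in algebraic notation:",
    PIL.Image.open("board_screenshot.png"),
]
response = model.generate_content(prompt)
print(response.text)
```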
No, there isn't. But that also isn't the issue. Consider that Gemini and GPT-4o are token-based models: they take an image and turn it into a series of tokens in the same way they take a paragraph and turn it into a series of tokens. If you think of it as similar to embeddings, an embedding doesn't necessarily represent where certain text is in the paragraph; it represents the tendencies of the paragraph as a whole.
Autonomous driving really needs lidar. There's another technology domain besides autonomous driving that has been solving the computer vision questions: robotics. Universities have been running competitions and publishing their results for over 20 years now, which in turn means there's tons of open-source code for you to look at on how the problem is tackled.
The solution everyone comes back to are vision pipelines.
imagine a main loop:

inputs:
    guiding prompt
    text
    audio
    video
    other

reasoning:
    here you set up a framework for how to interpret the inputs
    (different for a dog, different for a bird, different for a human)

result of reasoning from the inputs:
    plan: RunAway!!

actions:
    functionCall runAway

The faster that loop can go, the more aware and alive your ROBOT is.
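In Python, a bare-bones sketch of that kind of loop might look like this (every function and the guiding prompt here are placeholders, not a real perception stack):

```python
# Sketch of a vision-pipeline main loop. All functions and the guiding
# prompt are placeholders standing in for real capture, reasoning and
# action components.
import time

def read_inputs() -> dict:
    # Grab the latest frame, audio chunk, user text, other sensor data, etc.
    return {"text": None, "audio": None, "frame": None, "other": None}

def reason(guiding_prompt: str, inputs: dict) -> str:
    # Send guiding_prompt + inputs to a model and get back a plan,
    # e.g. "run_away" if it spots something threatening.
    return "idle"

def act(plan: str) -> None:
    # Dispatch the plan, e.g. as a function call to the robot.
    if plan == "run_away":
        print("Running away!")

GUIDING_PROMPT = "You are the perception loop of a small robot. Interpret the inputs."

while True:
    observations = read_inputs()
    plan = reason(GUIDING_PROMPT, observations)
    act(plan)
    time.sleep(0.1)  # the faster this loop runs, the more aware and alive the robot feels
```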