Hi everyone,
I’m building an AI pipeline using the Gemini API (specifically gemini-3.1-pro via the Files API and Context Caching) to audit security camera footage from a barbershop. My goal is to count the number of haircuts/services performed and extract basic details about each client (clothing, age, gender, action performed).
I am feeding the model 1-hour-long video chunks (around 1080p, which Gemini samples at 1 FPS by default). While the model is very good at understanding the scene, I’m noticing some unreliability when processing these long, mostly static videos:
- Hallucinations over time: It sometimes invents events that didn’t happen or stops following the strict JSON schema I’ve provided.
- Missing events: It will occasionally skip over a 20-minute haircut entirely.
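Both failure modes can at least be detected after the fact. Below is a minimal sketch of a post-check, not a fix: it validates the returned JSON array and flags long stretches with no reported event as candidates for a re-query. The field names (`start_s`, `end_s`, `action`) and the 10-minute gap threshold are assumptions for illustration, not part of my actual schema.

```python
import json

# Hypothetical required fields and their accepted types; adapt to the real schema.
REQUIRED_FIELDS = {"start_s": (int, float), "end_s": (int, float), "action": str}


def validate_events(raw_json, video_length_s):
    """Parse a model response expected to be a JSON array; return (valid_events, problems)."""
    try:
        events = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [], [f"not valid JSON: {exc}"]
    if not isinstance(events, list):
        return [], ["top-level value is not an array"]

    valid, problems = [], []
    for i, ev in enumerate(events):
        if not isinstance(ev, dict):
            problems.append(f"event {i}: not an object")
            continue
        bad = sorted(k for k, t in REQUIRED_FIELDS.items() if not isinstance(ev.get(k), t))
        if bad:
            problems.append(f"event {i}: missing or mistyped fields {bad}")
            continue
        # Timestamps that fall outside the video are a cheap hallucination signal.
        if not (0 <= ev["start_s"] < ev["end_s"] <= video_length_s):
            problems.append(f"event {i}: timestamps outside the video")
            continue
        valid.append(ev)
    return valid, problems


def uncovered_gaps(events, video_length_s, min_gap_s=600):
    """Return (start, end) stretches of at least min_gap_s with no reported event."""
    gaps, cursor = [], 0
    for ev in sorted(events, key=lambda e: e["start_s"]):
        if ev["start_s"] - cursor >= min_gap_s:
            gaps.append((cursor, ev["start_s"]))
        cursor = max(cursor, ev["end_s"])
    if video_length_s - cursor >= min_gap_s:
        gaps.append((cursor, video_length_s))
    return gaps
```

A gap does not prove that a haircut was skipped (the shop may simply have been empty), so it only marks a time range worth asking the model about again.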
My current setup:
- Uploading via the Files API.
- Using Context Caching to cache the 1-hour video + system instructions.
- Sending a lightweight generation query to extract the JSON array.
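In code, the flow is roughly the following sketch. It assumes the `google-genai` Python SDK and an API key in the environment. The function name `extract_events`, the schema fields, and the one-hour cache TTL are illustrative. The `response_schema` and `temperature=0` settings are assumptions added here as one way to constrain the output, not something confirmed above.

```python
import time

# Hypothetical response schema; the field names are illustrative only.
EVENT_SCHEMA = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "start_s": {"type": "NUMBER"},
            "end_s": {"type": "NUMBER"},
            "action": {"type": "STRING"},
            "clothing": {"type": "STRING"},
        },
        "required": ["start_s", "end_s", "action"],
    },
}


def extract_events(video_path, model, system_instruction, query):
    """Upload a video, cache it with the system instructions, and return the raw JSON text."""
    # Imported lazily so the schema above can be inspected without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment

    # 1. Upload via the Files API and wait until the video has been processed.
    video = client.files.upload(file=video_path)
    while video.state is None or video.state.name == "PROCESSING":
        time.sleep(10)
        video = client.files.get(name=video.name)
    if video.state.name != "ACTIVE":
        raise RuntimeError(f"upload ended in state {video.state.name}")

    # 2. Cache the video together with the system instructions.
    cache = client.caches.create(
        model=model,
        config=types.CreateCachedContentConfig(
            system_instruction=system_instruction,
            contents=[video],
            ttl="3600s",
        ),
    )

    # 3. Send a lightweight query against the cache and ask for JSON back.
    response = client.models.generate_content(
        model=model,
        contents=query,
        config=types.GenerateContentConfig(
            cached_content=cache.name,
            response_mime_type="application/json",
            response_schema=EVENT_SCHEMA,
            temperature=0,
        ),
    )
    return response.text
```

The model name is passed in rather than hard-coded, since context caching generally expects an explicit model version string.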
My questions for the community:
- Is analyzing a continuous 1-hour video natively a bad approach? Are people having better success chunking videos into 5-minute segments using FFmpeg before sending them to Gemini?
- Prompting tips for temporal tracking: What are your best prompt engineering tips to force Gemini to scan every single frame of a long video rather than taking “shortcuts” and summarizing?
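For reference, the chunked alternative from the first question could be sketched as below. This is a sketch under assumptions, not something I have validated: the function names, the 30-second overlap (so that a service crossing a chunk boundary appears in full in at least one neighbouring chunk), and the output file pattern are all hypothetical. It assumes `ffmpeg` is on the PATH.

```python
import subprocess


def plan_chunks(total_s, chunk_s=300, overlap_s=30):
    """Return (start, duration) windows in seconds that cover the video with overlap."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")
    windows, start = [], 0
    while start < total_s:
        windows.append((start, min(chunk_s, total_s - start)))
        if start + chunk_s >= total_s:
            break  # this window already reaches the end of the video
        start += chunk_s - overlap_s
    return windows


def ffmpeg_command(src, start_s, duration_s, dst):
    """Build an ffmpeg call that copies one window without re-encoding."""
    # With "-c copy" the cut snaps to keyframes, so boundaries are approximate.
    return ["ffmpeg", "-y", "-ss", str(start_s), "-i", src,
            "-t", str(duration_s), "-c", "copy", dst]


def split_video(src, total_s, out_pattern="chunk_{:03d}.mp4", **plan_kwargs):
    """Write every chunk to disk; return (path, start offset in seconds) pairs."""
    outputs = []
    for i, (start, duration) in enumerate(plan_chunks(total_s, **plan_kwargs)):
        dst = out_pattern.format(i)
        subprocess.run(ffmpeg_command(src, start, duration, dst), check=True)
        outputs.append((dst, start))
    return outputs
```

Keeping each chunk's start offset makes it possible to convert chunk-relative timestamps back to positions in the full hour. With overlapping windows, the same event can be reported by two adjacent chunks, so the merged results would need de-duplication.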
Any tips, SDK tricks, or architectural advice would be hugely appreciated!