Analyzing long video (>1 hour)

Hi everyone,

I’m building an AI pipeline using the Gemini API (specifically gemini-3.1-pro via the Files API and Context Caching) to audit security camera footage from a barbershop. My goal is to count the number of haircuts/services performed and extract basic details about each client (clothing, age, gender, action performed).

I am feeding the model 1-hour-long video chunks (around 1080p, which Gemini automatically samples at 1 FPS). While the model is incredible at understanding the scene, I’m noticing some unreliability when processing these long, static videos:

  1. Hallucinations over time: It sometimes invents events that didn’t happen or drifts away from the strict JSON schema I’ve provided.
  2. Missing events: It will occasionally skip over a 20-minute haircut entirely.
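To make the first failure mode concrete, here is a minimal sketch of a post-hoc check on the returned JSON. It assumes the model is asked for a JSON array of event objects; the field names (`start_s`, `end_s`, and so on) and the `check_events` helper are hypothetical placeholders, not the exact schema. It can only flag schema drift and events whose timestamps fall outside the video, so it would not detect a skipped haircut.

```python
import json

# Hypothetical field names, used only for illustration.
REQUIRED_FIELDS = {"start_s", "end_s", "clothing", "age", "gender", "action"}


def check_events(raw_json, video_length_s):
    """Return a list of human-readable problems found in the model output."""
    try:
        events = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(events, list):
        return ["top-level value is not a JSON array"]

    problems = []
    for i, event in enumerate(events):
        if not isinstance(event, dict):
            problems.append(f"event {i}: not an object")
            continue
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            problems.append(f"event {i}: missing fields {sorted(missing)}")
            continue
        start, end = event["start_s"], event["end_s"]
        if not all(isinstance(t, (int, float)) for t in (start, end)):
            problems.append(f"event {i}: timestamps are not numbers")
            continue
        # An event must start before it ends and lie inside the video.
        if not (0 <= start < end <= video_length_s):
            problems.append(
                f"event {i}: timestamps {start}-{end} outside 0-{video_length_s}"
            )
    return problems
```

An empty list means the output passed these structural checks; anything else could trigger a retry for that video.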

My current setup:

  • Uploading via the Files API.
  • Using Context Caching to cache the 1-hour video + system instructions.
  • Sending a lightweight generation query to extract the JSON array.

My questions for the community:

  1. Is analyzing a continuous 1-hour video natively a bad approach? Are people having better success chunking videos into 5-minute segments using FFmpeg before sending them to Gemini?
  2. Prompting tips for temporal tracking: What are your best prompt engineering tips to force Gemini to scan every single frame of a long video rather than taking “shortcuts” and summarizing?
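For the chunking option in question 1, this is roughly what is meant: a minimal sketch that plans 5-minute segments with a small overlap and builds the matching FFmpeg commands. The 30-second overlap, the helper names, and the file names are assumptions for illustration; the overlap is there so a haircut that straddles a boundary appears in both neighbouring chunks. With `-c copy` the cuts snap to keyframes, so segment boundaries are approximate.

```python
def plan_chunks(total_s, chunk_s=300, overlap_s=30):
    """Return (start, duration) pairs in seconds covering [0, total_s)."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk")
    step = chunk_s - overlap_s
    chunks = []
    start = 0
    while start < total_s:
        duration = min(chunk_s, total_s - start)
        chunks.append((start, duration))
        if start + duration >= total_s:
            break  # the last chunk reached the end of the video
        start += step
    return chunks


def ffmpeg_command(src, start, duration, dst):
    """Build an argument list suitable for subprocess.run."""
    # -ss before -i seeks in the input, -t limits the output duration,
    # -c copy avoids re-encoding (so cuts land on keyframes).
    return [
        "ffmpeg", "-ss", str(start), "-i", src,
        "-t", str(duration), "-c", "copy", dst,
    ]


if __name__ == "__main__":
    for n, (start, duration) in enumerate(plan_chunks(3600)):
        print(" ".join(ffmpeg_command("hour.mp4", start, duration, f"chunk_{n:02d}.mp4")))
```

Each chunk's timestamps would then need the chunk's start offset added back, and events reported twice in an overlap would need deduplicating before counting.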

Any tips, SDK tricks, or architectural advice would be hugely appreciated!