Gemini API responses slower than Gemini on web when files are in chat

Hi there,

I’m experimenting with an app that uses Gemini to process files (PNGs, PDFs, etc.). I assign each file a unique ID before uploading it via the Google Files API, then pass the returned file ID to Gemini. Each subsequent request sends the chat history, including references to the Google file IDs, to Gemini. This works, and files are correctly not re-uploaded to Google Files each time.
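For anyone trying to reproduce this setup, here is a minimal sketch of the upload-once pattern described above. The class name and the injected `upload_fn` are hypothetical; in a real app `upload_fn` would wrap the actual Files API uploader, but injecting it keeps the dedupe logic testable without network calls.

```python
# Hypothetical sketch of "assign a unique ID, upload once, reuse the handle".
# `upload_fn` stands in for the real Files API uploader; it is injected so
# the caching logic can be exercised without touching the network.

class FileRegistry:
    """Maps our own unique file IDs to uploaded file handles."""

    def __init__(self, upload_fn):
        self._upload_fn = upload_fn
        self._handles = {}  # our unique ID -> file handle returned by the API

    def get_handle(self, unique_id, path):
        # Upload only the first time we see this ID; reuse the handle after.
        if unique_id not in self._handles:
            self._handles[unique_id] = self._upload_fn(path)
        return self._handles[unique_id]


# Example with a fake uploader that counts calls:
calls = []

def fake_upload(path):
    calls.append(path)
    return f"files/{len(calls)}"

reg = FileRegistry(fake_upload)
h1 = reg.get_handle("doc-A", "report.pdf")
h2 = reg.get_handle("doc-A", "report.pdf")  # same ID: no re-upload
```

With this in place, `h1` and `h2` are the same handle and `fake_upload` runs exactly once, which mirrors the "not re-uploading each time" behaviour in the post.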

However, I’ve noticed that when I have, say, 5 PDFs (each ~5MB) in the chat, subsequent queries in the same conversation (even if the query isn’t related to the PDF at all) will become extremely slow. For example, saying “Hello” when no PDFs are in the chat will normally take ~2 seconds to get the first token. However, when PDFs are in the chat, it can take up to 15 seconds.

The Gemini website itself does not suffer from this problem: I can upload 5 PDFs, ask a question about them, then ask an unrelated question, and it takes about 3 seconds to the first token.

I am aware that caching is not yet available in Gemini Flash 2.0, but at the same time, I don’t think the number of PDFs I have would even exceed the minimum token threshold required to use caching. So this seems to be a problem unrelated to caching.


Hi @Thomas_Gandy,

Yes, this is a known issue, especially when referencing multiple files or maintaining long chat histories.

Here are a few likely reasons:

  1. File Referencing Overhead: Even if your prompt doesn’t directly use the files, Gemini still processes the full chat context—including file references—on every request. This adds latency, especially with large or multiple files.

  2. No Caching in Gemini Flash 2.0: As you noted, caching is not yet available in Gemini Flash 2.0. This means every request reprocesses the entire context from scratch.

  3. Web UI Optimisations: The Gemini web interface likely uses internal optimisations (e.g. caching, context pruning, or lazy file loading) that are not exposed in the public API.

Here are some strategies to improve performance:

  1. Minimise File References: Only include file IDs in the prompt when they’re actually needed.

  2. Separate Threads: For unrelated queries, start a new chat session without file references.

  3. Switch to Gemini Pro: If latency is critical, Gemini Pro models may offer better performance for large contexts.

  4. Batch File Content: Instead of referencing multiple files, consider summarising them into a single file or embedding only the relevant parts.
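To make the first two strategies concrete, here is a hedged sketch of pruning file references out of the history before sending an unrelated query. The message shape (dicts with `"role"` and `"parts"`) loosely mirrors a chat history structure; `is_file_part` and the `"file_id"` key are assumptions for illustration, not the SDK's actual part format.

```python
# Hypothetical sketch: strip file-reference parts from the chat history
# before an unrelated query, so only text is reprocessed. The exact part
# representation is assumed here (dicts with a "file_id" key for files,
# plain strings for text).

def is_file_part(part):
    # Assumption: file references are dicts carrying a "file_id" key.
    return isinstance(part, dict) and "file_id" in part

def prune_file_parts(history):
    """Return a copy of the history with file-reference parts removed."""
    pruned = []
    for msg in history:
        parts = [p for p in msg["parts"] if not is_file_part(p)]
        if parts:  # drop messages that contained only file references
            pruned.append({"role": msg["role"], "parts": parts})
    return pruned

history = [
    {"role": "user", "parts": [{"file_id": "files/abc"}, "Summarise this PDF"]},
    {"role": "model", "parts": ["Here is a summary..."]},
    {"role": "user", "parts": ["Hello"]},  # unrelated follow-up
]
light = prune_file_parts(history)
```

Sending `light` instead of `history` for unrelated turns keeps the conversation text intact while avoiding the per-request file processing overhead described in point 1 above.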