Gemini API responses slower than Gemini on web when files are in chat

Hi there,

I’m experimenting with an app that uses Gemini to process files (PNGs, PDFs, etc.). I assign each file a unique ID before uploading it via the Google Files API, then pass the returned file ID to Gemini. Each subsequent request sends the chat history, including references to the Google file IDs, to Gemini. This works, and files are correctly not re-uploaded to Google Files each time.
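For anyone trying to reproduce this setup, here is a minimal sketch of the upload-once pattern described above. The class name and the injected `upload_fn` are hypothetical; in a real app `upload_fn` would wrap the actual Files API uploader, but injecting it keeps the dedupe logic testable without network calls.

```python
# Hypothetical sketch of "assign a unique ID, upload once, reuse the handle".
# `upload_fn` stands in for the real Files API uploader; it is injected so
# the caching logic can be exercised without touching the network.

class FileRegistry:
    """Maps our own unique file IDs to uploaded file handles."""

    def __init__(self, upload_fn):
        self._upload_fn = upload_fn
        self._handles = {}  # our unique ID -> file handle returned by the API

    def get_handle(self, unique_id, path):
        # Upload only the first time we see this ID; reuse the handle after.
        if unique_id not in self._handles:
            self._handles[unique_id] = self._upload_fn(path)
        return self._handles[unique_id]


# Example with a fake uploader that counts calls:
calls = []

def fake_upload(path):
    calls.append(path)
    return f"files/{len(calls)}"

reg = FileRegistry(fake_upload)
h1 = reg.get_handle("doc-A", "report.pdf")
h2 = reg.get_handle("doc-A", "report.pdf")  # same ID: no re-upload
```

With this in place, `h1` and `h2` are the same handle and `fake_upload` runs exactly once, which mirrors the "not re-uploading each time" behaviour in the post.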

However, I’ve noticed that when I have, say, 5 PDFs (each ~5MB) in the chat, subsequent queries in the same conversation (even if the query isn’t related to the PDF at all) will become extremely slow. For example, saying “Hello” when no PDFs are in the chat will normally take ~2 seconds to get the first token. However, when PDFs are in the chat, it can take up to 15 seconds.

The Gemini website itself does not suffer from this problem: I can upload 5 PDFs, ask a question about them, then ask an unrelated question, and it takes about 3 seconds to the first token.

I am aware that caching is not yet available in Gemini Flash 2.0, but at the same time, I don’t think the number of PDFs I have would even exceed the minimum token threshold required to use caching. So this seems to be a problem unrelated to caching.


Hi @Thomas_Gandy,

Yes, this is a known issue, especially when referencing multiple files or maintaining long chat histories.

Here are a few likely reasons:

  1. File Referencing Overhead: Even if your prompt doesn’t directly use the files, Gemini still processes the full chat context—including file references—on every request. This adds latency, especially with large or multiple files.

  2. No Caching in Gemini Flash 2.0: As you noted, caching is not yet available in Gemini Flash 2.0. This means every request reprocesses the entire context from scratch.

  3. Web UI Optimisations: The Gemini web interface likely uses internal optimisations (e.g. caching, context pruning, or lazy file loading) that are not exposed in the public API.

Here are some strategies to improve performance:

  1. Minimise File References: Only include file IDs in the prompt when they’re actually needed.

  2. Separate Threads: For unrelated queries, start a new chat session without file references.

  3. Switch to Gemini Pro: If latency is critical, Gemini Pro models may offer better performance for large contexts.

  4. Batch File Content: Instead of referencing multiple files, consider summarising them into a single file or embedding only the relevant parts.
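To make the first two strategies concrete, here is a hedged sketch of pruning file references out of the history before sending an unrelated query. The message shape (dicts with `"role"` and `"parts"`) loosely mirrors a chat history structure; `is_file_part` and the `"file_id"` key are assumptions for illustration, not the SDK's actual part format.

```python
# Hypothetical sketch: strip file-reference parts from the chat history
# before an unrelated query, so only text is reprocessed. The exact part
# representation is assumed here (dicts with a "file_id" key for files,
# plain strings for text).

def is_file_part(part):
    # Assumption: file references are dicts carrying a "file_id" key.
    return isinstance(part, dict) and "file_id" in part

def prune_file_parts(history):
    """Return a copy of the history with file-reference parts removed."""
    pruned = []
    for msg in history:
        parts = [p for p in msg["parts"] if not is_file_part(p)]
        if parts:  # drop messages that contained only file references
            pruned.append({"role": msg["role"], "parts": parts})
    return pruned

history = [
    {"role": "user", "parts": [{"file_id": "files/abc"}, "Summarise this PDF"]},
    {"role": "model", "parts": ["Here is a summary..."]},
    {"role": "user", "parts": ["Hello"]},  # unrelated follow-up
]
light = prune_file_parts(history)
```

Sending `light` instead of `history` for unrelated turns keeps the conversation text intact while avoiding the per-request file processing overhead described in point 1 above.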