PDFs vs. Raw Text Efficiency - What's best?

clarru · March 14, 2025, 5:44am

Hello everyone,

I’m working on a project similar to ChatGPT, but specifically for law enforcement officers.

To make this useful, I need to add around 200 laws to its knowledge base.

When I add 50 of them as PDFs, the model estimates about 100,000 tokens, but when I extract the raw text from those same files, it comes to about 750,000 tokens.

In total, with 200 laws, the raw text would be around 2.5 million tokens, but as PDFs, I’m not sure yet, maybe around 300,000 tokens?
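
For reference, here is a rough sketch of how the 50-document figures above scale to 200 documents if you assume every law is about the same size as the ones already measured. That assumption is mine and probably too simple (laws vary a lot in length), and the function name `extrapolate_tokens` is made up for illustration, so treat the output as a ballpark only.

```python
def extrapolate_tokens(sample_tokens: int, sample_docs: int, total_docs: int) -> int:
    """Linearly scale a token count measured on a sample to the full document set.

    Assumes the remaining documents have the same average size as the sample.
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return round(sample_tokens / sample_docs * total_docs)


# Figures from the post: 50 laws measured, 200 laws planned.
pdf_estimate = extrapolate_tokens(100_000, 50, 200)  # PDF upload estimate
raw_estimate = extrapolate_tokens(750_000, 50, 200)  # extracted raw text

print(f"PDF estimate for 200 laws:      {pdf_estimate:,} tokens")
print(f"Raw text estimate for 200 laws: {raw_estimate:,} tokens")
print(f"Raw text / PDF ratio:           {raw_estimate / pdf_estimate:.1f}x")
```

A straight linear scale-up gives somewhat higher totals than my guesses above, so the real numbers depend on how long the remaining 150 laws are.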

My questions are:

Is it better to use PDFs? Does the model work faster or slower with them?
Has anyone dealt with a similar issue?

If I have to use raw text, I will need to create 4-5 models. This is fine since each model would be for a specific law enforcement structure. However, having a single model would be more convenient, making PDFs an attractive option, but not at the cost of quality or response speed.

Right now, as many others have reported, Gemini's responses are really slow. We're seeing 5-8 minutes per answer with 50 PDFs (~100k tokens), so I'm concerned about what will happen once I add 200 PDFs.
