PDFs vs. Raw Text Efficiency - Which is better?

Hello everyone,

I’m working on a project similar to ChatGPT, but specifically for law enforcement officers.

To make this useful, I need to add around 200 laws to its knowledge base.

When I add 50 of them as PDFs, the model estimates about 100,000 tokens. But when I extract the raw text from those same 50 laws, it comes to about 750,000 tokens.

For all 200 laws, I expect the raw text to come to around 2.5 million tokens. I’m not sure yet about the PDFs, but my guess is around 300,000 tokens.
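
For reference, here is a minimal sketch of how the 50-law measurements scale to 200 laws if every law is assumed to have the same average length (an assumption, since the remaining laws may be shorter or longer). The function name `extrapolate_tokens` is just an illustrative helper. A straight linear scaling lands somewhat higher than my rough guesses: about 3 million tokens for raw text and about 400,000 for PDFs.

```python
def extrapolate_tokens(sample_tokens: int, sample_docs: int, total_docs: int) -> int:
    """Scale a token count measured on a sample to the full set.

    Assumes every document has the same average length as the sample,
    which is only an approximation for real laws.
    """
    return round(sample_tokens / sample_docs * total_docs)


# Measured on the first 50 laws (figures from this post).
raw_text_total = extrapolate_tokens(750_000, 50, 200)
pdf_total = extrapolate_tokens(100_000, 50, 200)

print(f"Raw text, 200 laws: ~{raw_text_total:,} tokens")
print(f"PDFs, 200 laws:     ~{pdf_total:,} tokens")
```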

My questions are:

  • Is it better to use PDFs? Does the model work faster or slower with them?
  • Has anyone dealt with a similar issue?

If I have to use raw text, I will need to split the laws across 4-5 separate models. That is fine, since each model would serve a specific law enforcement structure. A single model would be more convenient, though, which makes PDFs an attractive option, as long as they don’t cost me answer quality or response speed.
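
To show where the 4-5 figure comes from, here is a rough sketch that divides the total raw-text size by a per-model token budget. The budget values below are hypothetical placeholders I picked for illustration, not the documented limit of any particular model, and `models_needed` is just an illustrative helper name.

```python
import math


def models_needed(total_tokens: int, tokens_per_model: int) -> int:
    """Number of separate models required if each one can hold
    at most `tokens_per_model` tokens of law text."""
    return math.ceil(total_tokens / tokens_per_model)


TOTAL_RAW_TOKENS = 2_500_000  # my rough estimate for all 200 laws

# Hypothetical per-model budgets, for illustration only.
for budget in (625_000, 600_000):
    count = models_needed(TOTAL_RAW_TOKENS, budget)
    print(f"Budget {budget:,} tokens per model -> {count} models")
```

In practice I would group the laws by law enforcement structure rather than by raw size, so the real split may be less even than this.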

Right now, as many others have reported, Gemini’s responses are really slow. We’re seeing 5-8 minutes per answer with 50 PDFs (~100k tokens), so I am concerned about what will happen once I add all 200.