Best Practice for Data Extraction: Upload Raw HTML or Convert to PDF First?

Hello everyone,

I’m building a pipeline to parse and analyze large SEC 10-K filings, and I’m trying to determine the most robust and high-quality data pipeline for the Gemini File API.

My workflow has two main stages:

  1. Extraction: Reliably extract a specific section (e.g., “Item 8, Financial Statements”) from the full filing.

  2. Analysis: Send that extracted section to another Gemini prompt for deep analysis (e.g., “Summarize the debt covenants,” “What is the total revenue?”).

I’m debating the best file format to use for this pipeline. The source files from EDGAR are large, “noisy” .htm files.

Part 1: The Extraction Dilemma (HTML vs. PDF)

I’m weighing two options for the extraction step:

Option 1: Upload Raw HTML

  • Workflow: Upload the raw .htm file (mime_type='text/html') and prompt Gemini to find and return the “Financial Statements” section based on text markers.

  • My Concern: Will the model struggle to reliably parse the noisy HTML, which is full of navigation links, scripts, and complex tables?
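If you do stay on the HTML path, one way to reduce that risk is to strip the obvious noise before upload. Here is a minimal stdlib sketch; the tag list is just an assumption about what counts as noise in an EDGAR filing, and `strip_noise` is a hypothetical helper name:

```python
from html.parser import HTMLParser


class NoiseStripper(HTMLParser):
    """Collect visible text, skipping content inside noisy container tags."""

    SKIP = {"script", "style", "nav", "header", "footer"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_noise(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

For example, `strip_noise('<script>var x=1;</script><p>Revenue: $10</p>')` drops the script body and keeps only the visible text. Real 10-K tables carry meaningful structure in their markup, though, so aggressive stripping can hurt as much as it helps.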

Option 2: Convert to PDF First

  • Workflow: Convert the .htm to a PDF first, then upload it (mime_type='application/pdf').

  • My Theory: Is this more reliable? Does Gemini’s layout-aware processing for PDFs make it easier for the model to “see” and isolate a section based on headings, similar to how a human would?

Question 1: For the task of extracting a data section, is it better to feed Gemini the noisy, raw HTML, or is the extra step of converting to a PDF worth it for improved reliability?

Part 2: The Analysis Dilemma (HTML vs. PDF Bytes)

This leads to my follow-up question. Let’s assume we have successfully extracted the “Financial Statements” section using one of the methods above. Now we need to analyze it.

  • If we used the HTML path, we’d be sending a chunk of clean HTML markup to the analysis prompt.

  • If we used the PDF path, we’d be sending the cropped PDF bytes.

Question 2: For analytical tasks, which input format leads to higher accuracy? Does the model understand tables and text better from the original HTML markup, or from the text and layout of a PDF?

So does the model get more accurate semantic meaning from the structured (but messy) HTML, or from the visually “clean” (but unstructured) text of a PDF?

I’m trying to find the optimal format to use from start to finish. Thanks for any insights!
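For concreteness, the PDF-bytes path in stage 2 might look something like this. This is a sketch, not a tested implementation: the model name is an assumption, and `looks_like_pdf` is a hypothetical sanity-check helper (PDF files always begin with the `%PDF-` magic bytes):

```python
def looks_like_pdf(data: bytes) -> bool:
    """PDF files begin with the '%PDF-' magic bytes."""
    return data.startswith(b"%PDF-")


def analyze_section(pdf_bytes: bytes, question: str) -> str:
    """Send cropped PDF bytes inline to Gemini for the analysis prompt.

    Requires `pip install google-genai` and GEMINI_API_KEY in the environment.
    """
    if not looks_like_pdf(pdf_bytes):
        raise ValueError("expected PDF bytes")
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    resp = client.models.generate_content(
        model="gemini-2.5-flash",  # assumption; substitute your model
        contents=[
            types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
            question,
        ],
    )
    return resp.text
```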

Hi @swamooo, thanks for reaching out to us!

PDFs are processed with layout awareness, while other formats like HTML, TXT, or XML are treated as plain text, so visual structure or formatting may not be interpreted.

Given this constraint, PDF is the better format for both stages of your SEC 10-K pipeline:

  1. Extraction: Converting to PDF enables the model’s document vision to reliably locate specific sections like Item 8 based on visual layout and headings.

  2. Analysis: PDF also helps at the analysis stage, because it lets the model interpret all of the content, including text, images, diagrams, charts, and tables, even in long documents.

Please refer to this documentation for more details: https://ai.google.dev/gemini-api/docs/document-processing
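To make that concrete, the upload-and-extract step might look like the sketch below. Treat it as illustrative, not authoritative: the model name is an assumption, and `extract_item8` / `filing_mime_type` are hypothetical helper names:

```python
import mimetypes


def filing_mime_type(path: str) -> str:
    """Guess the MIME type the File API would infer from the file extension."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def extract_item8(pdf_path: str) -> str:
    """Upload the converted PDF and ask Gemini for Item 8 (untested sketch).

    Requires `pip install google-genai` and GEMINI_API_KEY in the environment.
    """
    from google import genai

    client = genai.Client()
    uploaded = client.files.upload(file=pdf_path)  # MIME type inferred from extension
    resp = client.models.generate_content(
        model="gemini-2.5-flash",  # assumption; substitute your model
        contents=[uploaded, "Return the full text of Item 8, Financial Statements."],
    )
    return resp.text
```

Note that `filing_mime_type("filing.htm")` gives `text/html`, which is exactly the plain-text path described above, whereas a `.pdf` path gets the layout-aware treatment.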

Awesome, thanks for your help @Sonali_Kumari1
