Best Practice for Data Extraction: Upload Raw HTML or Convert to PDF First?

Hello everyone,

I’m building a pipeline to parse and analyze large SEC 10-K filings, and I’m trying to work out the most robust, highest-quality input format for the Gemini File API.

My workflow has two main stages:

  1. Extraction: Reliably extract a specific section (e.g., “Item 8, Financial Statements”) from the full filing.

  2. Analysis: Send that extracted section to another Gemini prompt for deep analysis (e.g., “Summarize the debt covenants,” “What is the total revenue?”).

I’m debating the best file format to use for this pipeline. The source files from EDGAR are large, “noisy” .htm files.

Part 1: The Extraction Dilemma (HTML vs. PDF)

I’m weighing two options for the extraction step:

Option 1: Upload Raw HTML

  • Workflow: Upload the raw .htm file (mime_type='text/html') and prompt Gemini to find and return the “Financial Statements” section based on text markers (rough sketch below).

  • My Concern: Will the model struggle to reliably parse the noisy HTML, which is full of navigation links, scripts, and complex tables?
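To make that concrete, here’s roughly what I have in mind for the HTML path. This is a minimal sketch using the google-generativeai Python SDK; the filename, model name, and prompt wording are just placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the raw EDGAR filing as-is, declaring the HTML MIME type (Option 1).
# "aapl-10k.htm" and the model name are placeholders.
filing = genai.upload_file(path="aapl-10k.htm", mime_type="text/html")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    filing,
    "Find the section titled 'Item 8. Financial Statements and Supplementary Data' "
    "and return its full contents verbatim, preserving the table markup.",
])
print(response.text)
```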

Option 2: Convert to PDF First

  • Workflow: Convert the .htm to a PDF first, then upload it (mime_type='application/pdf'); see the sketch below.

  • My Theory: Is this more reliable? Does Gemini’s layout-aware processing for PDFs make it easier for the model to “see” and isolate a section based on headings, similar to how a human would?
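And the PDF path, assuming WeasyPrint for the conversion step (any HTML-to-PDF renderer, e.g. headless Chrome, should work the same way):

```python
import google.generativeai as genai
from weasyprint import HTML  # assumption: WeasyPrint; any HTML-to-PDF renderer would do

genai.configure(api_key="YOUR_API_KEY")

# Render the noisy .htm into a paginated PDF before uploading (Option 2).
HTML(filename="aapl-10k.htm").write_pdf("aapl-10k.pdf")

filing = genai.upload_file(path="aapl-10k.pdf", mime_type="application/pdf")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    filing,
    "Locate 'Item 8. Financial Statements and Supplementary Data' and return "
    "the page range it spans.",
])
print(response.text)
```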

Question 1: For the task of extracting a data section, is it better to feed Gemini the noisy, raw HTML, or is the extra step of converting to a PDF worth it for improved reliability?

Part 2: The Analysis Dilemma (HTML vs. PDF Bytes)

This leads to my follow-up question. Let’s assume we have successfully extracted the “Financial Statements” section using one of the methods above. Now we need to analyze it.

  • If we used the HTML path, we’d be sending a chunk of clean HTML markup to the analysis prompt.

  • If we used the PDF path, we’d be sending the cropped PDF bytes. (Both analysis calls are sketched below.)
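Concretely, I picture the analysis call looking something like this for each path (again just a sketch with the google-generativeai SDK; item8.html and item8.pdf stand in for whatever the extraction step produced):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")
prompt = "Summarize the debt covenants disclosed in this financial statements section."

# HTML path: the extracted markup is sent inline as ordinary text.
section_html = open("item8.html", encoding="utf-8").read()
html_response = model.generate_content([prompt, section_html])

# PDF path: the cropped PDF is sent as an inline bytes part.
section_pdf_bytes = open("item8.pdf", "rb").read()
pdf_response = model.generate_content([
    prompt,
    {"mime_type": "application/pdf", "data": section_pdf_bytes},
])

print(html_response.text)
print(pdf_response.text)
```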

Question 2: For analytical tasks, which input format leads to higher accuracy, particularly on financial tables?

That is, does the model get more accurate semantic meaning from the structured (but messy) HTML markup, or from the visually “clean” (but unstructured) text and layout of a PDF?

I’m trying to find the optimal format to use from start to finish. Thanks for any insights!