Best Practice for Data Extraction: Upload Raw HTML or Convert to PDF First?

Hello everyone,

I’m building a pipeline to parse and analyze large SEC 10-K filings, and I’m trying to work out the most robust, highest-quality input format for the Gemini File API.

My workflow has two main stages:

  1. Extraction: Reliably extract a specific section (e.g., “Item 8, Financial Statements”) from the full filing.

  2. Analysis: Send that extracted section to another Gemini prompt for deep analysis (e.g., “Summarize the debt covenants,” “What is the total revenue?”).

I’m debating the best file format to use for this pipeline. The source files from EDGAR are large, “noisy” .htm files.

Part 1: The Extraction Dilemma (HTML vs. PDF)

I’m weighing two options for the extraction step:

Option 1: Upload Raw HTML

  • Workflow: Upload the raw .htm file (mime_type='text/html') and prompt Gemini to find and return the “Financial Statements” section based on text markers (rough sketch below).

  • My Concern: Will the model struggle to reliably parse the noisy HTML, which is full of navigation links, scripts, and complex tables?
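To make that concrete, here’s roughly what I have in mind for the HTML path. This is a minimal sketch using the google-generativeai Python SDK; the filename, model name, and prompt wording are just placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the raw EDGAR filing as-is, declaring the HTML MIME type (Option 1).
# "aapl-10k.htm" and the model name are placeholders.
filing = genai.upload_file(path="aapl-10k.htm", mime_type="text/html")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    filing,
    "Find the section titled 'Item 8. Financial Statements and Supplementary Data' "
    "and return its full contents verbatim, preserving the table markup.",
])
print(response.text)
```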

Option 2: Convert to PDF First

  • Workflow: Convert the .htm to a PDF first, then upload it (mime_type='application/pdf'); see the sketch below.

  • My Theory: Is this more reliable? Does Gemini’s layout-aware processing for PDFs make it easier for the model to “see” and isolate a section based on headings, similar to how a human would?
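And the PDF path, assuming WeasyPrint for the conversion step (any HTML-to-PDF renderer, e.g. headless Chrome, should work the same way):

```python
import google.generativeai as genai
from weasyprint import HTML  # assumption: WeasyPrint; any HTML-to-PDF renderer would do

genai.configure(api_key="YOUR_API_KEY")

# Render the noisy .htm into a paginated PDF before uploading (Option 2).
HTML(filename="aapl-10k.htm").write_pdf("aapl-10k.pdf")

filing = genai.upload_file(path="aapl-10k.pdf", mime_type="application/pdf")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    filing,
    "Locate 'Item 8. Financial Statements and Supplementary Data' and return "
    "the page range it spans.",
])
print(response.text)
```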

Question 1: For the task of extracting a data section, is it better to feed Gemini the noisy, raw HTML, or is the extra step of converting to a PDF worth it for improved reliability?

Part 2: The Analysis Dilemma (HTML vs. PDF Bytes)

This leads to my follow-up question. Let’s assume we have successfully extracted the “Financial Statements” section using one of the methods above. Now we need to analyze it.

  • If we used the HTML path, we’d be sending a chunk of clean HTML markup to the analysis prompt.

  • If we used the PDF path, we’d be sending the cropped PDF bytes. (Both analysis calls are sketched below.)
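Concretely, I picture the analysis call looking something like this for each path (again just a sketch with the google-generativeai SDK; item8.html and item8.pdf stand in for whatever the extraction step produced):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")
prompt = "Summarize the debt covenants disclosed in this financial statements section."

# HTML path: the extracted markup is sent inline as ordinary text.
section_html = open("item8.html", encoding="utf-8").read()
html_response = model.generate_content([prompt, section_html])

# PDF path: the cropped PDF is sent as an inline bytes part.
section_pdf_bytes = open("item8.pdf", "rb").read()
pdf_response = model.generate_content([
    prompt,
    {"mime_type": "application/pdf", "data": section_pdf_bytes},
])

print(html_response.text)
print(pdf_response.text)
```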

Question 2: For analytical tasks, which input format leads to higher accuracy, particularly on financial tables?

That is, does the model get more accurate semantic meaning from the structured (but messy) HTML markup, or from the visually “clean” (but unstructured) text and layout of a PDF?

I’m trying to find the optimal format to use from start to finish. Thanks for any insights!