File Search Tool fails to process medical PDF documents with complex table structures and embedded charts

Problematic File Characteristics:

I have identified a specific PDF type that consistently fails to process in the File Search Tool: medical cardiopulmonary stress test reports from hospital systems. Due to data privacy regulations, I cannot attach the actual file here, but I can describe the characteristics that likely contribute to the failure:

  1. Complex Mixed Content Structure:

    • Dense multi-column tables with hundreds of rows of numerical data

    • Multiple embedded charts/graphs on the same pages as tabular data

    • Overlapping content layers (text, tables, and vector graphics)

  2. Special Characters and Encoding:

    • Medical measurement units with special symbols (°C, L/min, mmHg, mL/min/kg)

    • German umlauts (ä, ö, ü) in patient data

    • Mathematical symbols (ø, ², %)

  3. High Data Density:

    • 5 pages containing over 300 rows of time-series measurement data

    • 18+ columns per table with mixed data types (text, integers, floats)

    • Multiple nested header rows

  4. Embedded Visualizations:

    • 15+ line charts showing physiological parameters over time

    • Flow-volume loops

    • Charts with gridlines, legends, and multiple data series

Expected Behavior: The File Search Tool should successfully parse, chunk, and index the PDF file, extracting both the tabular data and recognizing the presence of charts.

Actual Behavior: The upload process fails, causing the entire batch upload operation to stall and eventually fail.

Impact: This issue is critical for enterprise use cases where:

  • Users need to upload large document sets containing medical reports, technical specifications, or scientific papers

  • A single problematic file blocks processing of all subsequent files

  • No clear error message indicates which aspect of the file caused the failure
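
Until this is fixed, one possible client-side mitigation is to isolate each file so that a single failure or stall is recorded instead of blocking the rest of the batch. The following is only a sketch: `upload_batch`, `upload_one`, and `timeout_s` are hypothetical names, and `upload_one` stands in for whatever call actually uploads one file to the File Search Tool. It is not part of any official SDK.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Iterable, List, Tuple


def upload_batch(
    paths: Iterable[str],
    upload_one: Callable[[str], object],  # hypothetical: uploads a single file
    timeout_s: float = 120.0,             # assumed per-file wait limit
) -> Tuple[List[str], Dict[str, str]]:
    """Upload files independently and report which ones failed or stalled."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=4)
    try:
        # Submit every file up front so one slow file does not delay the others.
        futures = {path: pool.submit(upload_one, path) for path in paths}
        for path, future in futures.items():
            try:
                future.result(timeout=timeout_s)
                succeeded.append(path)
            except FutureTimeout:
                failed[path] = f"timed out after {timeout_s}s"
            except Exception as exc:
                failed[path] = f"{type(exc).__name__}: {exc}"
    finally:
        # Do not block on stalled workers; drop anything not yet started.
        pool.shutdown(wait=False, cancel_futures=True)
    return succeeded, failed
```

The returned `failed` mapping at least identifies which file caused the problem, which the current batch behavior does not. Note that a timed-out worker thread is abandoned rather than killed, so this limits the damage but does not replace a server-side fix.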

Suspected Root Cause: The File Search Tool’s chunking algorithm likely encounters issues when processing PDFs that contain:

  • Overlapping bounding boxes from tables and embedded graphics

  • Complex text extraction where column boundaries are ambiguous

  • Special Unicode characters that may not be handled correctly during the embedding generation phase

  • Documents where content is rendered as a combination of text objects and vector graphics
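
To help test the Unicode hypothesis without sharing the file, the text layer could be extracted locally with any PDF text extractor and then scanned for non-ASCII characters. The sketch below assumes the extracted text is already available as a Python string; `non_ascii_inventory` is a hypothetical helper name, and it uses only the standard library.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def non_ascii_inventory(text: str) -> List[Tuple[str, str, str, int]]:
    """List each non-ASCII character with its code point, Unicode name, and count."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Made-up sample line, not taken from a real report.
    sample = "Temp 37 °C, VO² ø 21.4 mL/min/kg, Patient: Müller"
    for ch, code, name, n in non_ascii_inventory(sample):
        print(f"{ch!r} {code} {name} x{n}")
```

An inventory like this can be posted publicly because it contains no patient data, and it would show whether the file contains anything unusual (for example private-use code points or the U+FFFD replacement character) beyond ordinary umlauts and unit symbols.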

Request: If needed, I can provide a sample file that reproduces this issue through a secure channel. Because of the sensitive nature of medical data, I cannot attach it to this public bug report.
