File Search Tool fails to process medical PDF documents with complex table structures and embedded charts

Problematic File Characteristics:

I have identified a specific PDF type that consistently fails to process in the File Search Tool: medical cardiopulmonary stress test reports from hospital systems. Due to data privacy regulations, I cannot share the actual file, but I can describe the characteristics that likely contribute to the failure:

  1. Complex Mixed Content Structure:

    • Dense multi-column tables with hundreds of rows of numerical data

    • Multiple embedded charts/graphs on the same pages as tabular data

    • Overlapping content layers (text, tables, and vector graphics)

  2. Special Characters and Encoding:

    • Medical measurement units with special symbols (°C, L/min, mmHg, mL/min/kg)

    • German umlauts (ä, ö, ü) in patient data

    • Mathematical symbols (ø, ², %)

  3. High Data Density:

    • 5 pages containing over 300 rows of time-series measurement data

    • 18+ columns per table with mixed data types (text, integers, floats)

    • Multiple nested header rows

  4. Embedded Visualizations:

    • 15+ line charts showing physiological parameters over time

    • Flow-volume loops

    • Charts with gridlines, legends, and multiple data series

Expected Behavior: The File Search Tool should successfully parse, chunk, and index the PDF file, extracting both the tabular data and recognizing the presence of charts.

Actual Behavior: The upload process fails, causing the entire batch upload operation to stall and eventually fail.

Impact: This issue is critical for enterprise use cases where:

  • Users need to upload large document sets containing medical reports, technical specifications, or scientific papers

  • A single problematic file blocks processing of all subsequent files

  • No clear error message indicates which aspect of the file caused the failure
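Until the root cause is found, one possible client-side mitigation is to upload files one at a time and record failures per file, so that a single bad PDF neither blocks the rest nor hides which file failed. The following is only a minimal sketch: `upload_one` is a hypothetical callable standing in for whatever upload call your client actually uses, and it is assumed to raise an exception on failure. A file that stalls rather than fails would additionally need a timeout inside that callable.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def upload_batch(
    paths: Iterable[str],
    upload_one: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Upload files individually so one failure cannot block the others.

    `upload_one` is a hypothetical wrapper around the real upload call;
    it is assumed to raise an exception when the upload fails.
    """
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            upload_one(path)
        except Exception as exc:
            # Record which file failed and why, then continue with the next one.
            failed[path] = f"{type(exc).__name__}: {exc}"
        else:
            succeeded.append(path)
    return succeeded, failed
```

The returned `failed` mapping gives a per-file error message that can be attached to a bug report without sharing the file itself.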

Suspected Root Cause: The File Search Tool’s chunking algorithm likely encounters issues when processing PDFs that contain:

  • Overlapping bounding boxes from tables and embedded graphics

  • Complex text extraction where column boundaries are ambiguous

  • Special Unicode characters that may not be handled correctly during the embedding generation phase

  • Documents where content is rendered as a combination of text objects and vector graphics
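To narrow down the Unicode hypothesis without sharing the file, one option is to list the non-ASCII characters that appear in the text extracted from the PDF. The sketch below assumes the text has already been extracted with a tool of your choice; the function name `non_ascii_report` is illustrative and not part of any File Search API.

```python
import unicodedata
from collections import Counter
from typing import Dict, Tuple


def non_ascii_report(text: str) -> Dict[str, Tuple[str, int]]:
    """Map each non-ASCII character to its Unicode name and occurrence count."""
    # Normalize to NFC so that decomposed forms (e.g. "u" + combining diaeresis)
    # are counted as the single precomposed character.
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in normalized if ord(ch) > 127)
    return {
        ch: (unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.items()
    }
```

The resulting list of character names can be posted publicly, since it contains no patient data, and it can be used to build a synthetic test document with the same symbols.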

Request: If needed, I can provide a sample file that reproduces this issue through a secure channel, but due to the sensitive nature of medical data, I cannot attach it to this public bug report.

Hello @John_Mavic

I tried to reproduce this on my end using a sample chemical document that includes dense multi-column tables, pharmacokinetic time-series charts, special character encoding, and layered vector graphics.

It is working with the File Search API as expected for me.

Could you DM me a sample code snippet with the prompts you used, along with a sample document, so I can check it on my end?
