Consistent PDF Upload/Processing Failure with Gemini 3.x Models in Google AI Studio (Affects Python SDK too)


Platform: Google AI Studio + Python SDK (google-generative-ai)
Duration: Ongoing for approximately 2 months


Summary

Gemini 3.x generation models (Gemini 3.1 Pro, Gemini 3.5 Flash, Gemini 3.1 Flash-Lite) consistently fail to process a specific subset of standard, well-formed PDF documents. The model ingests the file without raising any error, but responds as if no document was provided. The same PDFs and prompts work with Gemini 2.5 Flash. By analysing four failing files against two working references, I have traced the failure to a reproducible, consistent pattern, described under Root Cause Analysis below.


Affected Models

  • Gemini 3.1 Pro
  • Gemini 3.5 Flash
  • Gemini 3.1 Flash-Lite

Unaffected Models (Working Correctly)

  • :white_check_mark: Gemini 2.5 Flash — reliably processes the same PDFs with no issues

Steps to Reproduce

In Google AI Studio:

  1. Open Google AI Studio and select a Gemini 3.x model (e.g., Gemini 3.5 Flash).
  2. Attach a PDF with the technical characteristics described in the Root Cause Analysis section below (Type1 fonts with custom /Differences encoding and no /ToUnicode map).
  3. Submit a prompt such as: “Extract all data from this PDF.”
  4. Observe the response: the model behaves as if no PDF was received and asks for the content to be pasted as plain text.

Via Python SDK:

Using Part.from_file() to pass PDF data to a Gemini 3.x model via generate_content() produces the same behavior — the model does not process the file content.
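
A minimal sketch of the SDK path described above, for anyone trying to reproduce it. It assumes the google-genai package and an API key available in the environment, and it passes the PDF with types.Part.from_bytes() rather than the Part.from_file() call named in this report, whose exact form is not shown here. The function name ask_model_about_pdf and its parameters are illustrative, and the model ID is left to the caller.

```python
from pathlib import Path


def ask_model_about_pdf(
    pdf_path: str,
    model: str,
    prompt: str = "Extract all data from this PDF.",
) -> str:
    """Send one PDF plus a prompt to a Gemini model and return the text reply."""
    # Imported lazily so this module still loads where the SDK is not installed.
    from google import genai
    from google.genai import types

    # Assumption: the client picks up the API key from the environment.
    client = genai.Client()

    # Inline the PDF bytes as a document part with an explicit MIME type.
    pdf_part = types.Part.from_bytes(
        data=Path(pdf_path).read_bytes(),
        mime_type="application/pdf",
    )
    response = client.models.generate_content(
        model=model,
        contents=[pdf_part, prompt],
    )
    return response.text or ""
```

Calling this once with a Gemini 3.x model ID and once with the Gemini 2.5 Flash model ID on the same file should show the contrast reported here, if the failure reproduces outside AI Studio.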


Observed Behavior

  • No error is raised during file upload or ingestion.
  • The model responds as if the PDF was never attached, asking the user to paste the document content manually.
  • The model’s internal reasoning (visible via the thinking field) explicitly states: “the actual text of the financial document has not been provided in the prompt.”

Expected Behavior

Gemini 3.x models should parse and extract information from uploaded PDF documents at least as reliably as Gemini 2.5 Flash. In particular, they should fall back to the /Differences encoding array when a /ToUnicode map is absent (see Technical Recommendation below).


Root Cause Analysis

I analysed four failing files and two working reference files to isolate the cause. The pattern is clear and fully reproducible.

File matrix

File                         | Producer               | /ToUnicode missing on                | Ingestion result
LU0089290844_KID.pdf         | Neevia docCreator v4.5 | All Type1 fonts (R10, R12, R14)      | :cross_mark: Fails (empty)
LU2533812058_KID.pdf         | Neevia docCreator v4.5 | All Type1 fonts (R10, R12, R14, R35) | :cross_mark: Fails (empty)
LU2314312922_KID.pdf         | Neevia docCreator v4.5 | All Type1 fonts (R10, R12, R14, R39) | :cross_mark: Fails (empty)
LU2526007799_KID.pdf         | Neevia docCreator v5.0 | /R39 on page 2 only                  | :warning: Partial (page 2 text corrupted)
PRIIP_KID_F0GBR04BQM_299.pdf | Neevia docCreator v5.0 | None                                 | :white_check_mark: OK

Primary cause — Missing /ToUnicode maps on Type1 fonts

All failing files share the same structural defect: their Type1 fonts use a custom /Encoding with a /Differences array (a non-standard glyph mapping where character codes do not correspond to Unicode code points) but do not include a /ToUnicode map.

Example font table from LU0089290844_KID.pdf (page 1):

Font | Subtype  | Encoding                                  | /ToUnicode
/R10 | Type1    | Custom (WinAnsiEncoding + /Differences)   | ABSENT
/R12 | Type1    | Custom (WinAnsiEncoding + /Differences)   | ABSENT
/R14 | Type1    | Custom (WinAnsiEncoding + /Differences)   | ABSENT
/R7  | TrueType | -                                         | Present

Without a /ToUnicode map, the text still renders correctly, but an extractor that relies only on /ToUnicode cannot map the character codes to Unicode. Such an extractor obtains unmappable glyphs and effectively sees an empty document, which is precisely the behavior observed with Gemini 3.x.
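
The condition described above can be expressed as a small check. This is only a sketch: the font dictionaries are hand-written stand-ins that mirror the font table for LU0089290844_KID.pdf, the /Differences contents are placeholders rather than values read from the file, and the function name lacks_unicode_mapping is illustrative.

```python
from collections.abc import Mapping


def lacks_unicode_mapping(font: Mapping) -> bool:
    """True for a Type1 font with a custom /Differences array and no /ToUnicode."""
    encoding = font.get("/Encoding")
    has_differences = isinstance(encoding, Mapping) and "/Differences" in encoding
    return (
        font.get("/Subtype") == "/Type1"
        and has_differences
        and "/ToUnicode" not in font
    )


# Placeholder encoding dictionary (assumption: the real arrays are longer).
custom_encoding = {
    "/BaseEncoding": "/WinAnsiEncoding",
    "/Differences": [32, "/space", 46, "/period"],
}

# Stand-ins for the page 1 font resources listed in the table above.
page_fonts = {
    "/R10": {"/Subtype": "/Type1", "/Encoding": custom_encoding},
    "/R12": {"/Subtype": "/Type1", "/Encoding": custom_encoding},
    "/R14": {"/Subtype": "/Type1", "/Encoding": custom_encoding},
    "/R7": {"/Subtype": "/TrueType", "/ToUnicode": "<stream>"},
}

at_risk = sorted(name for name, font in page_fonts.items() if lacks_unicode_mapping(font))
print(at_risk)
```

Because the check only needs dictionary-style access, it could in principle be applied to font dictionaries read by a PDF library once indirect references are resolved.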

The partial-failure case strengthens the diagnosis

LU2526007799_KID.pdf (produced by the newer v5.0) is particularly revealing: only font /R39 on page 2 lacks /ToUnicode. The rest of the document is extracted correctly, but page 2 produces corrupted output (unresolved placeholders such as |num07070oneoffcostsportfolioentrycost| and disordered rows in the cost section). This confirms that the failure is per-font and proportional to the missing maps — not a binary all-or-nothing parser failure.

Correlation with the PDF generator

All files produced by Neevia docCreator v4.5 omit /ToUnicode on Type1 fonts with custom encoding. The v5.0 files include it, with the exception of font /R39 on page 2 of LU2526007799_KID.pdf. This is a defect in the upstream generation tool, but the issue reported here remains valid on Google’s side: the Gemini 3.x ingestion pipeline does not apply the ISO 32000-compliant fallback when /ToUnicode is absent, whereas Gemini 2.5 Flash does.

Secondary factor — Compressed cross-reference stream

The v4.5 files use a compressed xref stream instead of a plain xref table:

/Type /XRef, /Filter /FlateDecode, /DecodeParms << /Columns 5 /Predictor 12 >>, /W [1 3 1]

This alone does not cause the failure, but may reduce tolerance in less robust parsers.
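
For reference, a sketch of what a parser has to do with such a stream: inflate it, undo the PNG predictor row by row (Predictor 12 is the PNG "Up" filter), then split each 5-byte row into fields of widths 1, 3 and 1. The input below is synthetic, built only to exercise the round trip; it is not taken from any of the reported files, and the helper names are illustrative.

```python
import zlib
from typing import List, Tuple


def undo_png_predictor(data: bytes, columns: int) -> bytes:
    """Reverse PNG row filters; only None (0) and Up (2) are handled here."""
    row_len = columns + 1  # each row starts with one filter-type byte
    if len(data) % row_len:
        raise ValueError("data length is not a multiple of the row length")
    prev = bytes(columns)
    out = bytearray()
    for start in range(0, len(data), row_len):
        filter_type = data[start]
        row = data[start + 1:start + row_len]
        if filter_type == 0:
            cur = bytes(row)
        elif filter_type == 2:
            # Up filter: each byte is stored as a delta from the byte above it.
            cur = bytes((b + p) & 0xFF for b, p in zip(row, prev))
        else:
            raise NotImplementedError(f"PNG filter {filter_type} not handled")
        out += cur
        prev = cur
    return bytes(out)


def parse_xref_stream(compressed: bytes, columns: int,
                      widths: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Inflate, un-predict, and split the stream into big-endian integer fields."""
    raw = undo_png_predictor(zlib.decompress(compressed), columns)
    entry_len = sum(widths)
    entries = []
    for start in range(0, len(raw), entry_len):
        pos, fields = start, []
        for width in widths:
            fields.append(int.from_bytes(raw[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries


def encode_up(rows: List[bytes]) -> bytes:
    """Test helper: apply the Up filter and deflate, mimicking the writer side."""
    prev = bytes(len(rows[0]))
    out = bytearray()
    for row in rows:
        out.append(2)
        out += bytes((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return zlib.compress(bytes(out))


# Synthetic rows: (type, 3-byte field, 1-byte field), matching /W [1 3 1].
rows = [bytes([0, 0, 0, 0, 255]), bytes([1, 0, 0, 17, 0]), bytes([2, 0, 0, 5, 3])]
entries = parse_xref_stream(encode_up(rows), columns=5, widths=(1, 3, 1))
print(entries)
```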

Note on structural tagging

None of the files are tagged PDFs (/MarkInfo, /StructTreeRoot, and /Lang are absent in all of them). Tagging is therefore not a discriminating factor, but its absence rules out the structure tree as an alternative text extraction fallback.


Technical Recommendation

When a font has /Encoding with a /Differences array but no /ToUnicode map, the extractor should reconstruct the character-to-Unicode mapping from the glyph names listed in /Differences (e.g., /T, /h, /period, /space) via the standard Adobe Glyph List. This is the fallback implemented by libraries such as pypdf, and it is precisely what allows those files to be extracted correctly outside of Gemini 3.x. Implementing this fallback would resolve all the failure cases described above.
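
A minimal sketch of this fallback, under stated assumptions: AGL_EXCERPT is a tiny hand-picked excerpt standing in for the full Adobe Glyph List, Python's cp1252 codec is used as an approximation of WinAnsiEncoding, and the /Differences array in the example is invented for illustration. The function names are hypothetical and this is not how pypdf or Gemini implement it.

```python
from typing import Dict, Optional

# Tiny excerpt standing in for the full Adobe Glyph List (assumption).
AGL_EXCERPT = {"space": " ", "period": ".", "comma": ",", "hyphen": "-", "Euro": "\u20ac"}


def glyph_to_unicode(name: str) -> Optional[str]:
    """Map a glyph name to a Unicode string, or None if it is unknown."""
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    if len(name) == 1 and name.isascii() and name.isalpha():
        return name  # the glyph names A-Z and a-z map to the same letters
    if name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))  # "uniXXXX" naming convention
        except ValueError:
            return None
    return None


def build_code_map(differences: list) -> Dict[int, str]:
    """Base encoding first, then overrides taken from the /Differences array."""
    code_map: Dict[int, str] = {}
    for code in range(32, 256):
        try:
            code_map[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # code is unassigned in cp1252
    current = 0
    for item in differences:
        if isinstance(item, int):
            current = item  # an integer resets the running character code
            continue
        char = glyph_to_unicode(item.lstrip("/"))
        if char is None:
            code_map.pop(current, None)  # unknown glyph: leave the code unmapped
        else:
            code_map[current] = char
        current += 1  # consecutive names take consecutive codes
    return code_map


def decode_bytes(raw: bytes, code_map: Dict[int, str]) -> str:
    """Decode a string operand byte by byte, marking unmapped codes with U+FFFD."""
    return "".join(code_map.get(b, "\ufffd") for b in raw)


# Invented example: codes 1 to 4 are remapped to T, h, e and a period.
code_map = build_code_map([1, "/T", "/h", "/e", "/period"])
decoded = decode_bytes(b"\x01\x02\x03\x04", code_map)
print(decoded)
```

The key point is that the glyph names alone carry enough information to recover the text, so no /ToUnicode map is needed for fonts whose /Differences entries use standard names.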


Impact

This is a production blocker for data extraction pipelines currently being migrated from Gemini 2.x to Gemini 3.x. The affected PDFs are legitimate, publicly distributed financial documents (Key Information Documents / KIDs) compliant with EU regulatory standards (PRIIPs Regulation).


Happy to share the PDF files or SDK code snippets directly if that helps with reproduction. Thank you for looking into this.

Hi @Marco_Menon

Can you share the code snippet?

I am working on reproducing this.