Consistent PDF Upload/Processing Failure with Gemini 3.x Models in Google AI Studio (Affects Python SDK too)

Bug Report: Consistent PDF Upload/Processing Failure with Gemini 3.x Models in Google AI Studio (Affects Python SDK too)

Platform: Google AI Studio + Python SDK (google-genai)
Duration: Ongoing for approximately 2 months


Summary

Gemini 3.x generation models (Gemini 3.1 Pro, Gemini 3.5 Flash, Gemini 3.1 Flash-Lite) consistently fail to process a specific subset of standard, well-formed PDF documents. The model ingests the file without raising any error, but responds as if no document was provided. The same PDFs and prompts work flawlessly with Gemini 2.5 Flash. Through technical analysis of four failing files (vs. two working references), I have identified the root cause with a reproducible and statistically consistent pattern.


Affected Models

  • Gemini 3.1 Pro
  • Gemini 3.5 Flash
  • Gemini 3.1 Flash-Lite

Unaffected Models (Working Correctly)

  • ✅ Gemini 2.5 Flash: reliably processes the same PDFs with no issues

Steps to Reproduce

In Google AI Studio:

  1. Open Google AI Studio and select a Gemini 3.x model (e.g., Gemini 3.5 Flash).
  2. Attach a PDF with the technical characteristics described in the Root Cause section below (Type1 fonts with custom /Differences encoding and no /ToUnicode map).
  3. Submit a prompt such as: "Extract all data from this PDF."
  4. Observe the response: the model behaves as if no PDF was received and asks for the content to be pasted as plain text.

Via Python SDK:

Passing the PDF bytes to a Gemini 3.x model with Part.from_bytes() and generate_content() produces the same behavior: the model does not process the file content.


Observed Behavior

  • No error is raised during file upload or ingestion.
  • The model responds as if the PDF was never attached, asking the user to paste the document content manually.
  • The model's internal reasoning (visible via the thinking field) explicitly states: "the actual text of the financial document has not been provided in the prompt."

Expected Behavior

Gemini 3.x models should reliably parse and extract information from uploaded PDF documents, at parity with, or better than, Gemini 2.5 Flash. This includes falling back to the /Differences encoding array when a /ToUnicode map is absent (see Technical Recommendation below).


Root Cause Analysis

I analysed four failing files and two working reference files to isolate the cause. The pattern is clear and fully reproducible.

File matrix

| File | Producer | /ToUnicode missing on | Ingestion result |
|---|---|---|---|
| LU0089290844_KID.pdf | Neevia docCreator v4.5 | All Type1 fonts (R10, R12, R14) | ❌ Fails: empty |
| LU2533812058_KID.pdf | Neevia docCreator v4.5 | All Type1 fonts (R10, R12, R14, R35) | ❌ Fails: empty |
| LU2314312922_KID.pdf | Neevia docCreator v4.5 | All Type1 fonts (R10, R12, R14, R39) | ❌ Fails: empty |
| LU2526007799_KID.pdf | Neevia docCreator v5.0 | /R39 on page 2 only | ⚠️ Partial: page 2 text corrupted |
| PRIIP_KID_F0GBR04BQM_299.pdf | Neevia docCreator v5.0 | None | ✅ OK |

Primary cause β€” Missing /ToUnicode maps on Type1 fonts

All failing files share the same structural defect: their Type1 fonts use a custom /Encoding with a /Differences array (a non-standard glyph mapping where character codes do not correspond to Unicode code points) but do not include a /ToUnicode map.

Example font table from LU0089290844_KID.pdf (page 1):

| Font | Subtype | Encoding | /ToUnicode |
|---|---|---|---|
| /R10 | Type1 | Custom (WinAnsiEncoding + /Differences) | ABSENT |
| /R12 | Type1 | Custom (WinAnsiEncoding + /Differences) | ABSENT |
| /R14 | Type1 | Custom (WinAnsiEncoding + /Differences) | ABSENT |
| /R7 | TrueType | - | Present |

Without a /ToUnicode map, the text renders visually correctly but cannot be extracted as Unicode text unless the extractor falls back to the glyph names in /Differences. An extractor that relies only on /ToUnicode obtains unmappable glyphs and effectively sees an empty document, which is precisely the behavior observed with Gemini 3.x.
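To check whether a given PDF has the same defect, the font dictionaries can be inspected directly. The following is a minimal sketch, not the exact tooling used for the analysis above: the helper names are hypothetical, and `scan_pdf` assumes the third-party pypdf package is installed.

```python
from collections.abc import Mapping


def _resolve(obj):
    # pypdf objects expose get_object(); plain Python values do not.
    return obj.get_object() if hasattr(obj, "get_object") else obj


def fonts_missing_tounicode(fonts: Mapping) -> list[str]:
    """Return names of fonts that have /Differences but no /ToUnicode."""
    flagged = []
    for name, font in fonts.items():
        font = _resolve(font)
        # /Encoding is either a name (e.g. /WinAnsiEncoding) or a dictionary
        # that may carry a /Differences array.
        encoding = _resolve(font.get("/Encoding"))
        has_differences = isinstance(encoding, Mapping) and "/Differences" in encoding
        if has_differences and "/ToUnicode" not in font:
            flagged.append(str(name))
    return flagged


def scan_pdf(path: str) -> dict[int, list[str]]:
    """Map 1-based page numbers to the flagged font names on that page."""
    from pypdf import PdfReader  # lazy import: only this function needs pypdf

    report = {}
    for number, page in enumerate(PdfReader(path).pages, start=1):
        resources = _resolve(page.get("/Resources", {}))
        fonts = _resolve(resources.get("/Font", {}))
        flagged = fonts_missing_tounicode(fonts)
        if flagged:
            report[number] = flagged
    return report
```

Calling `scan_pdf("LU0089290844_KID.pdf")` on a local copy should list the Type1 fonts from the table above if the file matches the pattern described here. Fonts nested inside form XObjects are not visited by this sketch.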

The partial-failure case strengthens the diagnosis

LU2526007799_KID.pdf (produced by the newer v5.0) is particularly revealing: only font /R39 on page 2 lacks /ToUnicode. The rest of the document is extracted correctly, but page 2 produces corrupted output (unresolved placeholders such as |num07070oneoffcostsportfolioentrycost| and disordered rows in the cost section). This confirms that the failure is per-font and proportional to the missing maps, not a binary all-or-nothing parser failure.

Correlation with the PDF generator

All files produced by Neevia docCreator v4.5 systematically omit /ToUnicode on Type1 fonts with custom encoding. Files produced by v5.0 include it, with the single exception of /R39 in LU2526007799_KID.pdf noted above. This is a defect in the upstream generation tool, but the issue reported here remains valid on Google's side: the Gemini 3.x ingestion pipeline does not apply the ISO 32000-compliant fallback when /ToUnicode is absent, whereas Gemini 2.5 Flash does.

Secondary factor β€” Compressed cross-reference stream

The v4.5 files use a compressed xref stream instead of a plain xref table:

/Type /XRef, /Filter /FlateDecode, /DecodeParms << /Columns 5 /Predictor 12 >>, /W [1 3 1]

This alone does not cause the failure, but may reduce tolerance in less robust parsers.
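For context on what these parameters ask of a parser: the stream is Flate-compressed, each row carries a PNG filter-type byte followed by 5 data bytes, and /W [1 3 1] splits those 5 bytes into a 1-byte entry type, a 3-byte offset and a 1-byte generation or index. A minimal decoding sketch follows, assuming only PNG filter types 0 (None) and 2 (Up) occur; the function name is hypothetical.

```python
import zlib


def decode_xref_stream(data: bytes, columns: int = 5, widths=(1, 3, 1)):
    """Inflate an xref stream, undo the PNG predictor, split rows per /W."""
    raw = zlib.decompress(data)
    row_len = columns + 1  # one PNG filter-type byte precedes each row
    if len(raw) % row_len:
        raise ValueError("stream length is not a multiple of the row size")

    previous = bytes(columns)  # the row above the first row is all zeros
    entries = []
    for start in range(0, len(raw), row_len):
        tag = raw[start]
        body = raw[start + 1:start + row_len]
        if tag == 0:    # None: bytes are stored as-is
            row = bytes(body)
        elif tag == 2:  # Up: each byte is a delta against the row above
            row = bytes((b + p) & 0xFF for b, p in zip(body, previous))
        else:
            raise ValueError(f"unsupported PNG filter type {tag}")
        previous = row

        # Split the row into big-endian integer fields according to /W.
        fields, pos = [], 0
        for width in widths:
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries
```

Each returned tuple is (type, field 2, field 3), where type 1 means an in-use object at the given byte offset. A parser that mishandles this step cannot locate any object, which is a different failure mode from the per-font text loss described above.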

Note on structural tagging

None of the files are tagged PDFs (/MarkInfo, /StructTreeRoot, and /Lang are absent in all of them). This is therefore not a discriminating factor, but it rules out a structural tree as an alternative text extraction fallback.


Technical Recommendation

When a font has /Encoding with a /Differences array but no /ToUnicode map, the extractor should reconstruct the character-to-Unicode mapping from the glyph names listed in /Differences (e.g., /T, /h, /period, /space) via the standard Adobe Glyph List. This is the fallback implemented by libraries such as pypdf, and it is precisely what allows those files to be extracted correctly outside of Gemini 3.x. Implementing this fallback would resolve all the failure cases described above.
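The fallback can be sketched in a few lines. This is an illustration under stated assumptions, not the code of pypdf or of any Gemini component: `AGL_SUBSET` is a tiny hand-picked stand-in for the full Adobe Glyph List, and the function names are hypothetical.

```python
import re

# Hand-picked subset of the Adobe Glyph List, for illustration only.
AGL_SUBSET = {
    "space": " ", "period": ".", "comma": ",", "hyphen": "-",
    "one": "1", "percent": "%", "Euro": "\u20ac", "eacute": "\u00e9",
}
_UNI_NAME = re.compile(r"^uni([0-9A-F]{4})$")


def glyph_to_unicode(name: str):
    """Map a glyph name such as /period or /T to a Unicode string, or None."""
    name = name.lstrip("/")
    if name in AGL_SUBSET:
        return AGL_SUBSET[name]
    if len(name) == 1 and name.isascii() and name.isalpha():
        return name  # single-letter glyph names denote the letter itself
    match = _UNI_NAME.match(name)
    if match:
        return chr(int(match.group(1), 16))  # "uniXXXX" naming convention
    return None


def differences_to_unicode(differences) -> dict[int, str]:
    """Build a code -> Unicode map from a /Differences array.

    An integer sets the current character code; each glyph name that
    follows is assigned to consecutive codes starting from it.
    """
    mapping, code = {}, None
    for item in differences:
        if isinstance(item, int):
            code = item
            continue
        if code is None:
            raise ValueError("/Differences must start with a character code")
        char = glyph_to_unicode(str(item))
        if char is not None:
            mapping[code] = char
        code += 1
    return mapping


def decode(raw: bytes, mapping: dict[int, str]) -> str:
    """Decode a text-showing string, marking unmapped codes with U+FFFD."""
    return "".join(mapping.get(byte, "\ufffd") for byte in raw)
```

A real implementation would first populate the map from the base encoding (WinAnsiEncoding in these files) and then apply /Differences on top, since the array only lists the codes that deviate from the base.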


Impact

This is a production blocker for data extraction pipelines currently being migrated from Gemini 2.x to Gemini 3.x. The affected PDFs are legitimate, publicly distributed financial documents (Key Information Documents / KIDs) compliant with EU regulatory standards (PRIIPs Regulation).


Happy to share the PDF files or SDK code snippets directly if that helps with reproduction. Thank you for looking into this.

Hi @Marco_Menon

Can you share the code snippet?

I am working on reproducing this.

Hi @Mustan_lokhand , please find below the reproducer script and the full log output.

One important nuance surfaced during testing that slightly refines the affected model list:

| Model | prompt_tokens | Result |
|---|---|---|
| gemini-3.5-flash | 10 | ❌ Fails: PDF not seen |
| gemini-3.1-flash-lite | 10 | ❌ Fails: PDF not seen |
| gemini-3.1-pro-preview | 1606 | ✅ Works: correct answer |
| gemini-2.5-flash | 784 | ✅ Works: correct answer |
| gemini-2.5-pro | 784 | ✅ Works: correct answer |

The prompt_tokens value is the key signal: when the model receives 10 tokens, it has clearly not ingested the PDF (only the text prompt was counted). When it receives 700 to 1600 tokens, the PDF content was processed. This makes the failure programmatically detectable via usage_metadata.prompt_token_count as a workaround, even though no exception is raised.
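As a pipeline guard, that check can be isolated in a small pure function. This is a sketch under assumptions: the function name and the `slack` margin are hypothetical, and the text-only token count (10 for the prompt used here, per the table above) has to be supplied by the caller, for example from a separate token count of the prompt text alone.

```python
def pdf_probably_dropped(prompt_tokens: int, text_only_tokens: int, slack: int = 5) -> bool:
    """Return True when the billed prompt is no larger than the text alone.

    prompt_tokens    -- usage_metadata.prompt_token_count from the response
    text_only_tokens -- token count of the text prompt without the PDF
    slack            -- tolerance for small per-model counting differences
    """
    return prompt_tokens <= text_only_tokens + slack
```

With the values from the table, `pdf_probably_dropped(10, 10)` is True while `pdf_probably_dropped(784, 10)` and `pdf_probably_dropped(1606, 10)` are False, so a pipeline could retry on another model or raise its own error instead of accepting the answer.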

Surprisingly, gemini-3.1-pro-preview works correctly, which means the bug is specific to the Flash and Lite variants of the 3.x family, not the entire generation.

Updated affected models:

  • ❌ gemini-3.5-flash
  • ❌ gemini-3.1-flash-lite
  • ✅ gemini-3.1-pro-preview (works)

Reproducer script

"""
Minimal script to reproduce the bug:
  Gemini 3.x Flash/Lite models fail to read certain PDFs (e.g. PRIIP KID documents
  produced by Neevia docCreator v4.5) while Gemini 2.5 Flash/Pro and
  Gemini 3.1 Pro process them correctly.

Usage:
    python reproduce_gemini3_pdf_bug.py <path_to_pdf>

Requires GOOGLE_CREDENTIALS_JSON (service account JSON) or
GOOGLE_APPLICATION_CREDENTIALS + GOOGLE_CLOUD_PROJECT already set.
"""

import os
import sys
import json
import tempfile
from pathlib import Path

from google import genai
from google.genai.types import Part, GenerateContentConfig

LOCATION = "global"

MODELS_TO_TEST = [
    "gemini-3.5-flash",           # ❌ fails
    "gemini-3.1-flash-lite",      # ❌ fails
    "gemini-3.1-pro-preview",     # ✅ works (surprisingly)
    "gemini-2.5-flash",           # ✅ works
    "gemini-2.5-pro",             # ✅ works
]

PROMPT = "Extract the Manufacturer of the product from this PDF."


def _bootstrap_credentials() -> str:
    """Initialise credentials and return project_id."""
    if os.getenv("GOOGLE_APPLICATION_CREDENTIALS") and os.getenv("GOOGLE_CLOUD_PROJECT"):
        return os.environ["GOOGLE_CLOUD_PROJECT"]

    creds_json = os.getenv("GOOGLE_CREDENTIALS_JSON")
    if not creds_json:
        raise RuntimeError(
            "Set GOOGLE_CREDENTIALS_JSON (service account JSON) or "
            "GOOGLE_APPLICATION_CREDENTIALS + GOOGLE_CLOUD_PROJECT."
        )

    creds = json.loads(creds_json)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    json.dump(creds, tmp)
    tmp.close()
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = tmp.name
    return creds["project_id"]


def run(pdf_path: str) -> None:
    pdf_path = Path(pdf_path).expanduser().resolve()
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    project_id = _bootstrap_credentials()
    client = genai.Client(vertexai=True, project=project_id, location=LOCATION)

    pdf_bytes = pdf_path.read_bytes()
    pdf_part = Part.from_bytes(data=pdf_bytes, mime_type="application/pdf")

    config = GenerateContentConfig(
        temperature=0,
        max_output_tokens=2048,
    )

    print("=" * 78)
    print("Gemini PDF ingestion reproducer")
    print("=" * 78)
    print(f"PDF path       : {pdf_path}")
    print(f"Models to test : {MODELS_TO_TEST}")
    print("=" * 78)

    for model in MODELS_TO_TEST:
        print("\n" + "-" * 78)
        print(f">>> MODEL: {model}")
        print("-" * 78)
        try:
            response = client.models.generate_content(
                model=model,
                contents=[PROMPT, pdf_part],
                config=config,
            )

            usage = getattr(response, "usage_metadata", None)
            if usage is not None:
                prompt_tokens = getattr(usage, "prompt_token_count", None)
                print(f"[usage] prompt_tokens     = {prompt_tokens}")
                print(f"[usage] candidates_tokens = {getattr(usage, 'candidates_token_count', None)}")
                print(f"[usage] total_tokens      = {getattr(usage, 'total_token_count', None)}")

                # Heuristic to detect silent PDF ingestion failure:
                # if prompt_tokens is suspiciously low (close to text-only token count),
                # the PDF was likely not processed.
                if prompt_tokens is not None and prompt_tokens < 50:
                    print(f"[WARNING] prompt_token_count={prompt_tokens} suggests the PDF "
                          "was NOT ingested (text-only token count). "
                          "This call should have raised an error.")

            candidate = response.candidates[0]
            print(f"[finish_reason] {getattr(candidate, 'finish_reason', None)}")

            parts = candidate.content.parts or []
            print(f"[n_parts] {len(parts)}")
            for i, part in enumerate(parts):
                text = getattr(part, "text", None)
                thought = getattr(part, "thought", None)
                print(f"--- part[{i}] thought={thought} ---")
                print(text if text is not None else f"<non-text part: {part}>")

            try:
                print("\n[response.text]")
                print(response.text)
            except Exception:
                pass

        except Exception as e:
            print(f"[ERROR calling {model}] {type(e).__name__}: {e}")

    print("\n" + "=" * 78)
    print("DONE")
    print("=" * 78)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python reproduce_gemini3_pdf_bug.py <path_to_pdf>")
        sys.exit(1)
    run(sys.argv[1])

Full log output

==============================================================================
Gemini PDF ingestion reproducer
==============================================================================
PDF path        : C:\Users\menonm\Downloads\LU2533812058_KID.pdf
Models to test  : ['gemini-3.5-flash', 'gemini-3.1-flash-lite', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemini-2.5-pro']
==============================================================================

------------------------------------------------------------------------------
>>> MODEL: gemini-3.5-flash
------------------------------------------------------------------------------
[usage] prompt_tokens     = 10
[usage] candidates_tokens = 46
[usage] total_tokens      = 175
[finish_reason] FinishReason.STOP
[n_parts] 1
--- part[0] thought=None ---
It looks like you forgot to attach or upload the PDF!

[response.text]
It looks like you forgot to attach or upload the PDF!

------------------------------------------------------------------------------
>>> MODEL: gemini-3.1-flash-lite
------------------------------------------------------------------------------
[usage] prompt_tokens     = 10
[usage] candidates_tokens = 34
[usage] total_tokens      = 44
[finish_reason] FinishReason.STOP
[n_parts] 1
--- part[0] thought=None ---
Please upload the PDF file to this chat.

[response.text]
Please upload the PDF file to this chat.

------------------------------------------------------------------------------
>>> MODEL: gemini-3.1-pro-preview
------------------------------------------------------------------------------
[usage] prompt_tokens     = 1606
[usage] candidates_tokens = 30
[usage] total_tokens      = 1800
[finish_reason] FinishReason.STOP
[n_parts] 1
--- part[0] thought=None ---
Based on the provided document, the Manufacturer of the product is:
**BNP PARIBAS ASSET MANAGEMENT Luxembourg** ("BNPP AM Luxembourg")

[response.text]
Based on the provided document, the Manufacturer of the product is:
**BNP PARIBAS ASSET MANAGEMENT Luxembourg** ("BNPP AM Luxembourg")

------------------------------------------------------------------------------
>>> MODEL: gemini-2.5-flash
------------------------------------------------------------------------------
[usage] prompt_tokens     = 784
[usage] candidates_tokens = 15
[usage] total_tokens      = 911
[finish_reason] FinishReason.STOP
[n_parts] 1
--- part[0] thought=None ---
BNP PARIBAS ASSET MANAGEMENT Luxembourg ("BNPP AM Luxembourg")

------------------------------------------------------------------------------
>>> MODEL: gemini-2.5-pro
------------------------------------------------------------------------------
[usage] prompt_tokens     = 784
[usage] candidates_tokens = 25
[usage] total_tokens      = 992
[finish_reason] FinishReason.STOP
[n_parts] 1
--- part[0] thought=None ---
Based on the provided PDF, the Manufacturer of the product is:
**BNP PARIBAS ASSET MANAGEMENT Luxembourg**

==============================================================================
DONE
==============================================================================

Unfortunately I am not able to attach the affected PDF files directly here, but I am happy to share them through any channel you prefer. Feel free to drop a contact or a preferred transfer method and I will send them over promptly.

Hi @Mustan_lokhand, here is the sample PDF to reproduce the issue: docfinder.bnpparibas-am.com/api/files/183c4a9a-e238-4dd4-9bc5-7609579d2067/4608

As mentioned, this file lacks the /ToUnicode map on Type1 fonts with custom encoding. You can use it with the code snippet provided above to see how Gemini 3.x models fail to extract the text, while older versions process it without issues.

Let me know if you need anything else!