I am extracting structured data from 50 scanned historical PDFs (German corporate directories from the 1930s) using the Gemini API via the google-genai Python SDK with generate_content_stream. 45 files extract correctly, but 5 specific files (36-43 pages, 6-12 MB each) consistently return empty responses: no error is raised and the stream completes normally, but no text chunks are yielded. The same files with the same prompt work correctly in the AI Studio web interface and return the expected extracted data.

I have tried reducing concurrent workers from 50 to 5 to 1, retrying multiple times, increasing timeouts, and re-splitting the PDFs to keep page counts under 45. None of these attempts resolved the issue: the same 5 files fail every time via the API while working in AI Studio.
Actual Code
from google import genai
from google.genai import types
client = genai.Client(api_key=API_KEY)
# Read PDF as bytes (no file upload API used)
with open(pdf_path, "rb") as f:
    pdf_data = f.read()
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part.from_bytes(
                mime_type="application/pdf",
                data=pdf_data,
            ),
            types.Part.from_text(text="Extract all companies from this PDF following the schema."),
        ],
    ),
]
generate_config = types.GenerateContentConfig(
    temperature=0.0,
    system_instruction=[types.Part.from_text(text=system_prompt)],
)
# Stream response
response = client.models.generate_content_stream(
    model="gemini-3-pro-preview",
    contents=contents,
    config=generate_config,
)
extracted_text = ""
for chunk in response:
    if chunk.text:
        extracted_text += chunk.text
# Result: extracted_text is empty string for 5 specific files
# No exception raised, stream completes normally
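
The loop above only looks at chunk.text, so it cannot tell why a stream came back empty. Below is a minimal diagnostic sketch, not something I have confirmed against the failing files. It assumes each streamed chunk exposes `candidates` (each with a `finish_reason`), `prompt_feedback` (with a `block_reason`), and `usage_metadata`, as the SDK's GenerateContentResponse type does. `summarize_stream` is a hypothetical helper name, and the fake chunk at the bottom is a made-up stand-in so the snippet runs without an API call.

```python
from types import SimpleNamespace


def summarize_stream(chunks):
    """Collect the text plus the metadata that may explain an empty response."""
    summary = {"text": "", "finish_reasons": [], "block_reason": None, "usage": None}
    for chunk in chunks:
        # chunk.text can be None when a chunk carries no text parts.
        text = getattr(chunk, "text", None)
        if text:
            summary["text"] += text

        # prompt_feedback.block_reason is populated if the prompt itself was blocked.
        feedback = getattr(chunk, "prompt_feedback", None)
        block_reason = getattr(feedback, "block_reason", None)
        if block_reason:
            summary["block_reason"] = getattr(block_reason, "name", None) or str(block_reason)

        # finish_reason records why generation stopped for each candidate.
        for candidate in getattr(chunk, "candidates", None) or []:
            reason = getattr(candidate, "finish_reason", None)
            if reason is not None:
                summary["finish_reasons"].append(getattr(reason, "name", None) or str(reason))

        # Keep the most recent usage metadata seen in the stream.
        usage = getattr(chunk, "usage_metadata", None)
        if usage is not None:
            summary["usage"] = usage
    return summary


# Stand-in chunk with made-up values, for illustration only. With the real SDK,
# pass the iterator returned by client.models.generate_content_stream(...).
fake_stream = [
    SimpleNamespace(
        text=None,
        prompt_feedback=None,
        candidates=[SimpleNamespace(finish_reason="RECITATION")],
        usage_metadata=None,
    )
]
result = summarize_stream(fake_stream)
print(result["finish_reasons"], result["block_reason"])
```

Running `summarize_stream(response)` on one of the 5 failing files in place of the loop should show whether the stream ended with a finish reason other than a normal stop, or whether the prompt was blocked, either of which would produce an empty extracted_text without raising an exception.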