Does PDF fine-tuning focus solely on text extraction, or does it also perform visual inference?

Dhruv_Shah · March 4, 2025, 12:03pm

Hello, I am planning to fine-tune Gemini 1.5 Flash on PDFs. My question is: does it make inferences only from the OCR text it extracts, or can it also identify the structure of the PDF and the images and logos within it?

Devayani_S · June 13, 2025, 3:03pm

Hi @Dhruv_Shah,

Gemini 1.5 Flash and Pro can process PDFs using both textual and visual signals.
The model is capable of structured output extraction, meaning it can infer layout, tables, and even visual hierarchy from the PDF, not just plain OCR text.
This includes understanding images, logos, and formatting cues like headings, bullet points, and tables.

So yes, Gemini does more than just extract text—it performs multimodal inference on the document structure and visual elements.

However, when it comes to fine-tuning, there are some limitations:

Fine-tuning currently supports input-output pairs in text format only.
You cannot fine-tune Gemini with multi-turn chat or raw PDF files directly.
To fine-tune on PDFs, you’ll need to preprocess them into structured text examples (e.g., question-context-answer triples or instruction-response pairs).
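
As a rough illustration of that preprocessing step, here is a minimal Python sketch that turns already-extracted PDF text into text-only instruction-response pairs and serializes them as JSONL. It assumes you have extracted the text yourself with a tool of your choice. The field names `text_input` and `output`, the helper names, and the sample invoice are placeholders for illustration, so check the tuning documentation for the exact format your tuning job expects.

```python
import json


def build_examples(doc_text, qa_pairs, max_context_chars=2000):
    """Combine extracted PDF text with Q&A pairs into text-only examples.

    The "text_input"/"output" keys are an assumed schema, not a confirmed one.
    """
    # Truncate the context so each example stays within a fixed size budget.
    context = doc_text.strip()[:max_context_chars]
    examples = []
    for question, answer in qa_pairs:
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        examples.append({"text_input": prompt, "output": answer})
    return examples


def to_jsonl(examples):
    """Serialize the examples as one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


# Hypothetical text that was already extracted from one PDF page.
sample_text = "Invoice #1001\nVendor: Acme Corp\nTotal: 250.00 USD"
sample_pairs = [
    ("Who is the vendor?", "Acme Corp"),
    ("What is the total?", "250.00 USD"),
]

examples = build_examples(sample_text, sample_pairs)
jsonl = to_jsonl(examples)
print(jsonl)
```

Because only text reaches the tuning job, any layout or visual cue you want the model to learn from (table structure, heading levels, the presence of a logo) has to be written into the prompt text explicitly.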

If your goal is to improve Gemini’s performance on PDF-based tasks, you can:

Use Gemini’s native PDF understanding for inference.
Fine-tune it using textual representations of those PDFs (e.g., summaries, extracted Q&A, or structured prompts).
