Does PDF fine-tuning focus solely on text extraction, or does it also perform visual inference?

Hello, I am planning to fine-tune Gemini 1.5 Flash on PDFs. My question is: does it only make inferences on the OCR text it extracts, or can it also identify the structure of the PDF and the images and logos within it?

Hi @Dhruv_Shah,

  • Gemini 1.5 Flash and Pro can process PDFs using both textual and visual signals.

  • The model is capable of structured output extraction, meaning it can infer layout, tables, and even visual hierarchy from the PDF, not just plain OCR text.

  • This includes understanding images, logos, and formatting cues like headings, bullet points, and tables.

So yes, Gemini does more than just extract text—it performs multimodal inference on the document structure and visual elements.

However, when it comes to fine-tuning, there are some limitations:

  • Fine-tuning currently supports input-output pairs in text format only.
  • You cannot fine-tune Gemini with multi-turn chat or raw PDF files directly.
  • To fine-tune on PDFs, you’ll need to preprocess them into structured text examples (e.g., question-context-answer triples or instruction-response pairs).
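
As a rough illustration of that preprocessing step, here is a minimal Python sketch that turns already-extracted PDF text plus a few hand-written Q&A pairs into instruction-response records and serializes them as JSONL. The sample text, the helper names (`make_example`, `to_jsonl`), and the `text_input` / `output` field names are assumptions for illustration; check the current tuning documentation for the exact schema your tuning job expects. Text extraction itself is assumed to have happened upstream with whatever tool you prefer.

```python
import json


def make_example(context: str, question: str, answer: str,
                 max_context_chars: int = 4000) -> dict:
    """Build one text-only training record from extracted PDF text.

    The "text_input"/"output" keys are an assumed schema, not a confirmed one.
    """
    # Collapse the irregular whitespace that PDF extraction tends to leave behind,
    # then truncate so a single example stays within a manageable size.
    context = " ".join(context.split())[:max_context_chars]
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {"text_input": prompt, "output": answer}


def to_jsonl(examples: list) -> str:
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


# Hypothetical text, standing in for the output of your own PDF extraction step.
page_text = "Invoice  #1042\nVendor: Acme Corp\nTotal due: 310.00 USD"

# Hypothetical question-answer pairs written or reviewed by a human.
qa_pairs = [
    ("Who is the vendor?", "Acme Corp"),
    ("What is the total due?", "310.00 USD"),
]

examples = [make_example(page_text, q, a) for q, a in qa_pairs]
jsonl = to_jsonl(examples)
print(jsonl)
```

Keep in mind that anything visual (logos, layout, images) is lost in this representation unless you describe it in the text yourself.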

If your goal is to improve Gemini’s performance on PDF-based tasks, you can:

  • Use Gemini’s native PDF understanding for inference.
  • Fine-tune it using textual representations of those PDFs (e.g., summaries, extracted Q&A, or structured prompts).