Preparing PDF files for fine-tuning Gemini with appropriate JSON format

xiphiass · November 22, 2024, 3:22pm

Hi everyone, I’m currently working on fine-tuning Gemini using a set of PDF files, which are mostly astrology-related books. My goal is to prepare these PDFs in an appropriate JSON format for fine-tuning. How should I proceed for the fine-tuning format? I currently only have the PDFs, and I was considering using an open-source LLM to prepare the dataset for fine-tuning.

Do you have any other suggestions or approaches I should follow?

I’m trying to figure out the best workflow, including how to generate suitable outputs for training data. Any ideas or resources would be greatly appreciated!

tocsa · November 24, 2024, 5:38pm

My take:

Chunk the dataset the way you would prepare it for RAG storage.
As you chunk, ask an LLM (of your choice) to generate questions and answers for each chunk.
Collect the QnA pairs and supply them in the LLM provider’s fine-tuning format, which is usually JSON.
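The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive pipeline: it assumes the PDF text has already been extracted to a plain string, `generate_qa` is a hypothetical callable standing in for whatever LLM call you use, and the `text_input` / `output` field names are an assumption about the target format, so check your provider's tuning documentation before relying on them.

```python
import json
from typing import Callable


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + size]
        if chunk.strip():  # skip windows that are only whitespace
            chunks.append(chunk)
        if start + size >= len(text):  # this window reached the end of the text
            break
        start += step
    return chunks


def build_examples(
    chunks: list[str],
    generate_qa: Callable[[str], list[tuple[str, str]]],
) -> list[dict[str, str]]:
    """Run the (hypothetical) QnA generator on each chunk and collect records."""
    examples = []
    for chunk in chunks:
        for question, answer in generate_qa(chunk):
            # Field names are an assumption; adapt them to your provider's format.
            examples.append({"text_input": question.strip(), "output": answer.strip()})
    return examples


def to_jsonl(examples: list[dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

In practice `generate_qa` would wrap a prompt such as "write question/answer pairs that this passage answers" and parse the model's reply. Character windows are the crudest possible chunker, so swap in a smarter one if your PDFs have tables or figures.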

Gotchas:

How do you chunk your PDFs? Are they multimodal, with figures, tables, and images?
I’d go for as many questions as possible, but that means a large number of QnA generation requests, which can cost around $20 if you end up in the range of 10k questions.
You need to post-process the results: there can be duplicate and undesirable QnA pairs, such as when the LLM doesn’t know the answer.
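For the post-processing gotcha, something like the following could be a starting point. It is only a sketch: the list of "doesn't know" phrases is an assumption you would tune to your own generator's output, and exact-match deduplication on a normalized question will not catch paraphrased duplicates.

```python
import re

# Assumed markers of an unhelpful answer; extend after inspecting real output.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "not mentioned in",
    "cannot be determined",
)


def normalize(text: str) -> str:
    """Lowercase, unify apostrophes, and collapse whitespace for comparison."""
    text = text.replace("\u2019", "'").lower()
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop empty pairs, refusal-style answers, and repeated questions."""
    seen = set()
    kept = []
    for question, answer in pairs:
        key = normalize(question)
        answer_norm = normalize(answer)
        if not key or not answer_norm:
            continue  # empty question or answer
        if any(marker in answer_norm for marker in REFUSAL_MARKERS):
            continue  # the model did not actually answer
        if key in seen:
            continue  # same question already kept
        seen.add(key)
        kept.append((question.strip(), answer.strip()))
    return kept
```

Refusals are filtered before the duplicate check, so a question whose first answer was "I don't know" can still be kept later if another chunk produced a real answer for it.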

For a start, you can rely on LangChain’s PDF chunker, if that’s sufficient. I’m not sure how semantic it is, or whether there is a better semantic chunker for PDFs.

If your documents were Markdown, take a look at this repository; you can borrow the prompts from it: GitHub - CsabaConsulting/question_extractor: Generate question/answer training pairs out of a set of Markdown documents.

Emanuele_De_Candia · January 27, 2025, 3:37am

Hi, for fine-tuning with the ‘gemini-1.5-flash-001’ model I recommend reading the documentation. JSON mode is not supported. See here
