Preparing PDF files for fine-tuning Gemini with appropriate JSON format

Hi everyone, I’m currently working on fine-tuning Gemini using a set of PDF files, which are mostly astrology-related books. My goal is to prepare these PDFs in an appropriate JSON format for fine-tuning. How should I proceed for the fine-tuning format? I currently only have the PDFs, and I was considering using an open-source LLM to prepare the dataset for fine-tuning.

Do you have any other suggestions or approaches I should follow?

I’m trying to figure out the best workflow, including how to generate suitable outputs for training data. Any ideas or resources would be greatly appreciated!

My take:

  1. Chunk the dataset like you would prepare for RAG storage.
  2. As you chunk, ask an LLM (of your choice) to generate questions from each chunk.
  3. Collect the QnA pairs and supply them in the LLM provider’s fine-tuning format, usually JSON or JSONL.
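The three steps above can be sketched with stdlib-only Python. This is a rough sketch, not a definitive implementation: the record shape follows the `contents` / `role` / `parts` JSONL schema used for Gemini supervised tuning on Vertex AI (double-check against the current docs), the chunker is deliberately naive, and `ask_llm` is a hypothetical stand-in for whatever model you call to generate question/answer pairs.

```python
import json

def chunk_text(text, max_chars=1000):
    """Naive paragraph-based chunker; swap in a semantic chunker later."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def build_dataset(chunks, ask_llm):
    """ask_llm(chunk) -> list of (question, answer) pairs (hypothetical callable)."""
    records = []
    for chunk in chunks:
        for question, answer in ask_llm(chunk):
            # One user/model turn pair per QnA, in Gemini-style JSONL shape.
            records.append({
                "contents": [
                    {"role": "user", "parts": [{"text": question}]},
                    {"role": "model", "parts": [{"text": answer}]},
                ]
            })
    return records

def write_jsonl(records, path):
    """One JSON object per line, as most tuning endpoints expect."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

You would replace the `ask_llm` callable with a real API call that prompts the model to produce grounded questions and answers from the chunk text.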

Gotchas:

  • How do you chunk your PDFs? Are they multi-modal, with figures, tables, and images?
  • I’d go for as many questions as possible, but that means a lot of QnA generation requests; at around 10k questions the API cost can be on the order of $20.
  • You need to post-process the results: there can be duplicate and undesirable QnA pairs, such as when the LLM doesn’t know the answer.
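For that post-processing step, a simple stdlib filter can drop exact-duplicate questions and refusal-style answers. The refusal markers below are illustrative assumptions; you’d want to tune them against what your chosen LLM actually emits when it doesn’t know.

```python
# Illustrative phrases that suggest the LLM couldn't answer from the chunk.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "cannot be determined",
    "not mentioned in the text",
)

def clean_qna(pairs):
    """Deduplicate by whitespace/case-normalized question and drop refusals."""
    seen, kept = set(), []
    for question, answer in pairs:
        key = " ".join(question.lower().split())
        if key in seen:
            continue
        if any(marker in answer.lower() for marker in REFUSAL_MARKERS):
            continue
        seen.add(key)
        kept.append((question, answer))
    return kept
```

Exact-match dedup is the cheap baseline; for near-duplicates you could layer embedding similarity on top.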

To start, you can rely on LangChain’s PDF loaders and text splitters, if that’s sufficient. I’m not sure how semantic its chunking is, or whether there’s a better semantic chunker for PDFs.

If these were Markdown documents, you could look at this project and reuse the prompts in it: GitHub - CsabaConsulting/question_extractor: Generate question/answer training pairs out of a set of Markdown documents.
