File API - Large files processed in batches but then content is cut off

I want to send a PDF file in batches and extract questions from it.
It is a course book.
The problem is that if I split it into batches of 10 pages or so, I may split a question in half, which leads to a wrongly extracted question or, for that matter, two incomplete questions.

Any idea how to make sure that questions spanning two batches are not split?

Hi @robo,

Sorry for the delayed response. One effective approach is to overlap batches by a few pages, so that any question spanning a boundary is captured in its entirety in at least one batch, and then deduplicate the results. For this to work, the overlap has to be at least as long as the longest question.
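As a rough illustration of the overlap idea, here is a minimal Python sketch. The function names (`overlapping_batches`, `deduplicate`), the default batch size of 10 pages, and the overlap of 2 pages are assumptions made for the example, not part of any File API. It only computes page ranges and removes exact duplicates after normalizing case and whitespace; splitting the PDF and calling the model are left out.

```python
from typing import Iterable, Iterator, List, Tuple


def overlapping_batches(
    total_pages: int, batch_size: int = 10, overlap: int = 2
) -> Iterator[Tuple[int, int]]:
    """Yield 0-based (start, end) page ranges, end exclusive.

    Consecutive ranges share `overlap` pages.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must be >= 0 and smaller than batch_size")
    step = batch_size - overlap
    start = 0
    while start < total_pages:
        end = min(start + batch_size, total_pages)
        yield start, end
        if end == total_pages:
            break  # last batch reached the end of the document
        start += step


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(question.lower().split())


def deduplicate(batches_of_questions: Iterable[Iterable[str]]) -> List[str]:
    """Keep the first occurrence of each question, preserving order."""
    seen = set()
    unique = []
    for questions in batches_of_questions:
        for question in questions:
            key = normalize(question)
            if key not in seen:
                seen.add(key)
                unique.append(question)
    return unique
```

For a 25-page book, `list(overlapping_batches(25))` gives `[(0, 10), (8, 18), (16, 25)]`. Note that exact-match deduplication will not remove a truncated fragment of a question that was cut at a batch edge, so a fuzzier comparison (or an instruction to the model to skip questions that are cut off) may still be needed.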

Additionally, scanning for logical question boundaries before splitting, post-processing the results to identify and combine incomplete questions, and processing the entire document as a single batch when possible can further improve extraction accuracy and reliability.
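The post-processing step could look something like the sketch below. It assumes non-overlapping batches and a deliberately simple heuristic: the last item of a batch is treated as incomplete if it does not end with terminal punctuation, and is then joined with the first item of the next batch. The names `looks_complete` and `merge_split_questions` and the `TERMINATORS` set are hypothetical, and a real course book would likely need a better completeness check (for example, matching question numbers).

```python
from typing import Iterable, List

# Assumed heuristic: a complete question ends with one of these characters.
TERMINATORS = (".", "?", "!", ")")


def looks_complete(text: str) -> bool:
    """Return True if the text ends with terminal punctuation."""
    return text.rstrip().endswith(TERMINATORS)


def merge_split_questions(batches: Iterable[List[str]]) -> List[str]:
    """Join a question cut off at the end of one batch with its continuation."""
    merged = []
    carry = ""  # unfinished tail carried over from the previous batch
    for questions in batches:
        for i, question in enumerate(questions):
            if i == 0 and carry:
                # First item of this batch continues the previous batch's tail.
                question = f"{carry} {question.lstrip()}"
                carry = ""
            is_last = i == len(questions) - 1
            if is_last and not looks_complete(question):
                carry = question  # hold it back until the next batch arrives
            else:
                merged.append(question)
    if carry:
        merged.append(carry)  # nothing left to merge with; keep it as-is
    return merged
```

For example, `[["Q1. What is 2+2?", "Q2. Explain why the sky"], ["is blue.", "Q3. Define entropy."]]` would be merged so that the second question reads "Q2. Explain why the sky is blue."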

Thank you!