Extracting Structured Text from Multi-Page Scanned Documents

Hello Community,

I have been assigned a task to extract structured text from scanned documents. The challenge I’m facing is that some sections span multiple pages rather than being confined to a single page.

I attempted to extract text by processing each scanned page as a separate image and passing it to the gemini-2.0-flash model. However, because each page is handled on its own, sections that continue onto the next page come back fragmented instead of as continuous text.

How can I extract complete text from sections that span multiple pages while maintaining their structure? Any guidance or suggestions would be greatly appreciated!

Thanks in advance!

My code:

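from google import genai
from pydantic import BaseModel

# Assumes the google-genai SDK, with the API key supplied via the
# GOOGLE_API_KEY or GEMINI_API_KEY environment variable.
client = genai.Client()


class Randbeschreibung(BaseModel):
    von: int   # start node
    nach: int  # end node
    pos: str   # position number, or "NA" if unknown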


class RandbeschreibungResponse(BaseModel):
    randbeschreibung: list[Randbeschreibung]

def get_randbescheibung_from_images(img):

    prompt = (
    'This is a scanned technical document written in German. It contains sections titled "Randbeschreibung", '
    'which have various subsections such as "Rand der Position x: Pos x" and special cases like "Aussparung".\n\n'
    'Within these sections, the document includes data in the format:\n'
    '- "Gerade von 123 bis 456", which translates to "connected from node 123 to node 456".\n\n'
    '### Task:\n'
    'Extract all occurrences of this "Gerade von X bis Y" pattern from every subsection under "Randbeschreibung", '
    'including "Rand der Position x: Pos x" and "Aussparung", while strictly ignoring any data found under the section '
    '"Fixpunkte, -sterne, -geraden".\n\n'
    '### Expected Output:\n'
    '- Identify the "von" node (start node).\n'
    '- Identify the "bis" node (end node).\n'
    '- Identify the "Pos " position number (this acts like a subheading that groups the nodes; connecting the nodes under one Pos creates a closed shape).\n'
    '- For an unknown Pos you can enter NA as the position number.\n'
    '- Return a structured list of pairs where each pair represents a connection between two nodes and the Pos number.\n\n'
    '#### Example:\n'
    '**Input:** "Gerade von 123 bis 456"\n'
    '**Output:** {"von": 123, "nach": 456, "pos": "NA"}\n\n'
    'Make sure to extract all valid node connections from the relevant sections while strictly adhering to the given constraints.')
    
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        contents=[prompt,img],
        config={
            'response_mime_type': 'application/json',
            'response_schema': RandbeschreibungResponse,
        },
    )
    return response.text

Here are my two images (note: they are two separate images; I joined them in Paint for this question):

Hi @Kuladeep_K,

Welcome to the forum. One approach I can suggest is to combine all the required page images into a single PDF and send that whole PDF to the Gemini API. The pages are then parsed together in one request, so text that continues across a page break is read as continuous text, and you can still request structured output. Hope this helps.
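A minimal sketch of that idea, in case it helps. It assumes Pillow for building the PDF and the google-genai SDK for the request. The helper names (ordered_pages, images_to_pdf_bytes, extract_from_pdf) and the page file names are made up for illustration; the prompt and response schema are the ones from your post, passed in as arguments.

```python
import io
import os
import re


def ordered_pages(paths):
    """Sort page files by the first number in each file name, so page_10 follows page_2."""
    def page_number(path):
        match = re.search(r"\d+", os.path.basename(path))
        return int(match.group()) if match else 0
    return sorted(paths, key=page_number)


def images_to_pdf_bytes(paths):
    """Combine the scanned page images into one in-memory PDF (assumes Pillow is installed)."""
    from PIL import Image  # third-party; imported here so the sorting helper works without it

    pages = [Image.open(p).convert("RGB") for p in ordered_pages(paths)]
    buffer = io.BytesIO()
    # save_all + append_images writes every page into a single multi-page PDF
    pages[0].save(buffer, format="PDF", save_all=True, append_images=pages[1:])
    return buffer.getvalue()


def extract_from_pdf(client, pdf_bytes, prompt, schema):
    """Send the whole PDF in one request so sections spanning pages stay together."""
    from google.genai import types  # third-party google-genai SDK

    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            prompt,
            types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        ],
        config={
            "response_mime_type": "application/json",
            "response_schema": schema,
        },
    )
    return response.text
```

With your existing client, prompt, and RandbeschreibungResponse, a call could look like extract_from_pdf(client, images_to_pdf_bytes(["page_1.png", "page_2.png"]), prompt, RandbeschreibungResponse). The page order matters here, which is why the files are sorted by number before being merged.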

Cheers!