Hello Community,
I have been assigned a task to extract structured text from scanned documents. The challenge I’m facing is that some sections span multiple pages rather than being confined to a single page.
I attempted to extract text by processing each scanned page as a separate image and passing it to the gemini-2.0-flash model. However, I couldn’t get continuous text across multiple pages: because each page is handled on its own, a section that continues onto the next page comes back fragmented.
How can I extract complete text from sections that span multiple pages while maintaining their structure? Any guidance or suggestions would be greatly appreciated!
Thanks in advance!
My code:
from google import genai
from pydantic import BaseModel

# Assumption: the client is created roughly like this elsewhere in my script
# (the API key is read from the environment).
client = genai.Client()


class Randbeschreibung(BaseModel):
    von: int   # start node ("von")
    nach: int  # end node ("bis")
    pos: str   # position number, or "NA" if unknown


class RandbeschreibungResponse(BaseModel):
    randbeschreibung: list[Randbeschreibung]


def get_randbescheibung_from_images(img):
    # Called once per scanned page; img is a single page image.
    prompt = (
        'This is a scanned technical document written in German. It contains sections titled "Randbeschreibung", '
        'which have various subsections such as "Rand der Position x: Pos x" and special cases like "Aussparung".\n\n'
        'Within these sections, the document includes data in the format:\n'
        '- "Gerade von 123 bis 456", which translates to "connected from node 123 to node 456".\n\n'
        '### Task:\n'
        'Extract all occurrences of this "Gerade von X bis Y" pattern from every subsection under "Randbeschreibung", '
        'including "Rand der Position x: Pos x" and "Aussparung", while strictly ignoring any data found under the section '
        '"Fixpunkte, -sterne, -geraden".\n\n'
        '### Expected Output:\n'
        '- Identify the "von" node (start node).\n'
        '- Identify the "bis" node (end node).\n'
        '- Identify the "Pos " Position number (this is like a sub heading. it helps to group the nodes. connecting nodes under pos will create a closed shape).\n'
        '- For an unknown Pos you can enter NA as the position number.\n'
        '- Return a structured list of pairs where each pair represents a connection between two nodes and the Pos number.\n\n'
        '#### Example:\n'
        '**Input:** "Gerade von 123 bis 456"\n\n'
        'Make sure to extract all valid node connections from the relevant sections while strictly adhering to the given constraints.'
    )
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        contents=[prompt, img],
        config={
            'response_mime_type': 'application/json',
            'response_schema': RandbeschreibungResponse,
        },
    )
    # response.text holds the JSON string that follows the schema above.
    return response.text
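To make the result I’m after concrete, here is a minimal sketch in plain Python. It does not call the model; it assumes the page text is already available as strings. The sample pages, the `extract_edges` name, and the rule that the "Fixpunkte" section ends at the next "Randbeschreibung" heading are all made up for illustration. The point is only that the current Pos and the "ignore" state have to carry over from one page to the next.

```python
import re

POS_RE = re.compile(r"\bPos\s+(\w+)")
EDGE_RE = re.compile(r"Gerade von (\d+) bis (\d+)")

# Hypothetical text of two consecutive scanned pages (not my real data).
PAGE_1 = """Randbeschreibung
Rand der Position 1: Pos 1
Gerade von 101 bis 102
Gerade von 102 bis 103
"""
PAGE_2 = """Gerade von 103 bis 101
Fixpunkte, -sterne, -geraden
Gerade von 900 bis 901
"""


def extract_edges(pages):
    """Collect von/nach/pos records, keeping state across page breaks."""
    pos = "NA"        # last Pos heading seen; survives the page boundary
    ignoring = False  # True while inside the "Fixpunkte" section (assumed rule)
    edges = []
    for page in pages:
        for line in page.splitlines():
            if line.startswith("Fixpunkte"):
                ignoring = True
            elif line.startswith("Randbeschreibung"):
                ignoring = False
            elif line.startswith("Rand der Position"):
                match = POS_RE.search(line)
                pos = match.group(1) if match else "NA"
            elif not ignoring:
                match = EDGE_RE.search(line)
                if match:
                    edges.append({
                        "von": int(match.group(1)),
                        "nach": int(match.group(2)),
                        "pos": pos,
                    })
    return edges


edges = extract_edges([PAGE_1, PAGE_2])
```

With both pages passed together, the first line of the second page is still assigned to Pos 1. Passing only `[PAGE_2]` gives that same line the Pos "NA", which is the kind of fragmentation I get when each page is sent to the model separately.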
Here are my two images (note: they are two separate images; I joined them using Paint for this question):