Hi there!
I am looking for an approach that takes in booking documents (booking date, sender address, receiver address, price, …) and returns structured output (e.g. JSON). The documents are fairly structured, but the layout differs significantly from company to company.
The OCR part is no problem: I can accurately get the raw text (and its coordinates on the page) from the input file. The problem I face now is turning that raw text into a structured form. For instance, I need to be able to detect which text is the sender address and which is the delivery date.
I tried some manual rules based on word position in the document; e.g. the label “sender name” is followed by the name of the sender. Suffice it to say, this approach does not generalize at all across companies.
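To illustrate the kind of rule I mean, here is a minimal sketch (not my actual code). The `Word` type, the `value_after_label` helper, the 5-unit line tolerance, and the sample words and coordinates are all made up for illustration; real OCR output will look different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page (hypothetical units)
    y: float  # top edge of the word on the page (hypothetical units)


def value_after_label(words: List[Word], label: List[str],
                      line_tol: float = 5.0) -> Optional[str]:
    """Return the text to the right of `label` on the same line, or None."""
    n = len(label)
    # Normalize OCR tokens so "Name:" matches "name".
    lowered = [w.text.lower().rstrip(":") for w in words]
    for i in range(len(words) - n + 1):
        if lowered[i:i + n] == label:
            anchor = words[i + n - 1]  # last word of the label
            # Keep words on (roughly) the same line and to the right of the label.
            same_line = [w for w in words[i + n:]
                         if abs(w.y - anchor.y) <= line_tol and w.x > anchor.x]
            same_line.sort(key=lambda w: w.x)
            return " ".join(w.text for w in same_line) or None
    return None


# Made-up OCR output for one document layout.
words = [
    Word("Sender", 10, 100), Word("name:", 60, 100),
    Word("ACME", 120, 100), Word("Logistics", 170, 100),
    Word("Price:", 10, 130), Word("42.00", 60, 130),
]

sender = value_after_label(words, ["sender", "name"])
price = value_after_label(words, ["price"])
missing = value_after_label(words, ["delivery", "date"])
```

This works as long as the label wording and the "value sits to the right of the label" layout hold, which is exactly what breaks as soon as another company puts the value below the label or uses a different label.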
I was wondering whether there are models out there, for instance BERT, that could be trained to classify individual words (i.e. label each word with the field it belongs to).
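To make concrete what I mean by classifying individual words: I imagine each OCR token getting one label, for example in a BIO-style scheme. The sketch below only builds such labels from hand-annotated spans; it does not train anything. The field names, the `bio_tags` helper, and the sample tokens are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def bio_tags(tokens: List[str], fields: Dict[str, Tuple[int, int]]) -> List[str]:
    """Turn field spans (start inclusive, end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)  # "O" = token belongs to no field
    for name, (start, end) in fields.items():
        for i in range(start, end):
            # First token of a field gets "B-", the rest get "I-".
            tags[i] = ("B-" if i == start else "I-") + name
    return tags


# Made-up tokens and hand-annotated spans for one document.
tokens = ["Sender", "name:", "ACME", "Logistics", "Delivery", "date:", "2021-03-04"]
fields = {"SENDER_NAME": (2, 4), "DELIVERY_DATE": (6, 7)}

tags = bio_tags(tokens, fields)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A model would then have to predict the tag column from the token (and ideally its position on the page), and the JSON could be assembled by grouping consecutive tokens that share a field name.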
Any suggestion would be nice, thanks!