While working with Google Document AI for text extraction (OCR), I encountered an issue:
Sometimes the OCR returns characters that look correct but actually belong to a different alphabet.
For example

Or in other cases:
ORIGINAL: ELIZABETH
After converting to lowercase: elizaβετη
Do you have any tips for normalizing these cases so that look-alike characters end up in a single alphabet?
Hi @Isabel_Sanchez_Benit
Thank you for bringing this to our attention.
Could you please share the full payload details along with a sample of the code you are using?
code:
GoogleCredentials credentials = getGoogleCredentials();
DocumentProcessorServiceSettings settings =
    DocumentProcessorServiceSettings.newBuilder()
        .setEndpoint(endpoint)
        .setCredentialsProvider(FixedCredentialsProvider.create(credentials))
        .build();
try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
    String name = String.format("projects/%s/locations/%s/processors/%s", projectId, locationId, invoiceId);
    ByteString content = ByteString.copyFrom(image);
    // "extension" holds the MIME type of the uploaded image
    RawDocument document =
        RawDocument.newBuilder().setContent(content).setMimeType(extension).build();
    ProcessRequest request =
        ProcessRequest.newBuilder().setName(name).setRawDocument(document).build();
    log.info("Send request google cloud");
    ProcessResponse result = client.processDocument(request);
    Document documentResponse = result.getDocument();
    String value = null;
    for (Document.Entity entity : documentResponse.getEntitiesList()) {
        // Fall back to the normalized value when the mention text is empty
        if (entity.getMentionText().isEmpty()) {
            value = entity.getNormalizedValue().getText();
        } else {
            value = entity.getMentionText();
        }
        info.put(entity.getType(), value);
    }
}
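One possible way to handle these values after extraction is to detect strings that mix Latin letters with Greek or Cyrillic ones and map the look-alike capitals back to Latin. The sketch below is only an illustration under assumptions: the class `HomoglyphNormalizer`, its methods, and the small look-alike table are hypothetical and not part of Document AI, and the table is far from complete. It relies only on the JDK's `Character.UnicodeScript`.

```java
import java.lang.Character.UnicodeScript;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the Document AI client library.
final class HomoglyphNormalizer {

    // Assumed, incomplete table: look-alike code point -> Latin code point.
    private static final Map<Integer, Integer> LOOKALIKES = new HashMap<>();

    static {
        // Greek capitals that resemble Latin capitals.
        register("\u0391\u0392\u0395\u0396\u0397\u0399\u039A\u039C\u039D\u039F\u03A1\u03A4\u03A5\u03A7",
                 "ABEZHIKMNOPTYX");
        // Cyrillic capitals that resemble Latin capitals.
        register("\u0410\u0412\u0415\u041A\u041C\u041D\u041E\u0420\u0421\u0422\u0425",
                 "ABEKMHOPCTX");
    }

    private static void register(String foreign, String latin) {
        for (int i = 0; i < foreign.length(); i++) {
            LOOKALIKES.put((int) foreign.charAt(i), (int) latin.charAt(i));
        }
    }

    // True when the text contains Latin letters and also Greek or Cyrillic letters.
    static boolean hasMixedScripts(String text) {
        boolean latin = false;
        boolean other = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.LATIN) {
                latin = true;
            } else if (script == UnicodeScript.GREEK || script == UnicodeScript.CYRILLIC) {
                other = true;
            }
        }
        return latin && other;
    }

    // Replaces known look-alikes with Latin letters, but only in mixed-script text,
    // so that genuinely Greek or Cyrillic values are left untouched.
    static String normalize(String text) {
        if (!hasMixedScripts(text)) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(LOOKALIKES.getOrDefault(cp, cp)));
        return out.toString();
    }
}
```

In the loop above, this could be applied as `value = HomoglyphNormalizer.normalize(value);` before calling `info.put(...)`. Normalizing before any lowercase conversion matters, because the table only covers capital letters. Restricting the replacement to mixed-script strings is a design choice: a string made up entirely of look-alike Greek letters would not be caught, and would need a different rule, for example one based on the expected language of the field.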
resource:
