Why can't I get Gemini to recognize "strikethrough" text in an image?

why? :thinking:

why not just send them in parallel? it'll be faster, too.

I'm not sure I understand. Let's say I have a 30-page PDF. I extract each page as an image, so I now have 30 JPEGs. I want the model to read the 30 pages and extract the text, in the same page order, without the strikethroughs. I'm thinking that I will send the 30 images in one model call. Are you saying to make 30 separate calls?

OK, in case anyone else runs into this, I was able to resolve the issue using sonnet 3.5 instead of gemini. I was stuck on how to send multiple images in one prompt. I finally got code working that:

  1. Converts a PDF to images (I used PyMuPDF)
  2. Saves the images in a local directory.
  3. Uploads the directory to Google Drive (or, in my case, AWS S3).
  4. Creates a prompt that uses the images as content.
  5. Sends the prompt to Claude Sonnet 3.5 and returns the response.

Here is a code snippet of just the prompt code:

import base64

import boto3
import httpx
from anthropic import AnthropicVertex

def encode_image(url, media_type):
    """Fetch an image over HTTP and wrap it as a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(httpx.get(url).content).decode("utf-8"),
        },
    }

def process_images_with_anthropic(s3_bucket, s3_output_key, folder_name, project_id, max_tokens=4096):
    LOCATION = "europe-west1"  # or "us-east5"
    client = AnthropicVertex(region=LOCATION, project_id=project_id)

    prompt = """
    You are a very professional image to text document extractor.
    Please extract the text from these images, treating them as pages of a PDF document. 
    A strikethrough is a horizontal line drawn through text, used to indicate the deletion of an error or the removal of text.  Ensure that all strikethrough text is excluded from the output. 
    Try to format any tables found in the images. 
    Do not include page numbers, page headers, or page footers.
    Please double-check to make sure that any words in all capitalized letters with strikethrough letters are excluded.
    Return only the extracted text.  No commentary.
    **Exclude Strikethrough:** Do not include any strikethrough words in the output. Even if the strikethrough words are in a title.
    **Include Tables:** Tables should be preserved in the extracted text.
    **Exclude Page Headers, Page Footers, and Page Numbers:** Eliminate these elements which are typically not part of the main content.
    """

    s3_client = boto3.client('s3')
    s3_output_key = s3_output_key.rstrip('/')
    response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=f"{s3_output_key}/{folder_name}/")

    content = []
    for obj in response.get('Contents', []):
        if obj['Key'].endswith('.jpg'):  # Ensure we're only processing jpg files
            url = f"https://s3.us-west-2.amazonaws.com/{s3_bucket}/{obj['Key']}"
            content.append(encode_image(url, "image/jpeg"))
    
    content.append({
        "type": "text",
        "text": prompt
    })

    message = client.messages.create(
        max_tokens=max_tokens,
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        model="claude-3-5-sonnet@20240620",
    )

    return message.content[0].text

This was the code sample (from the Google Cloud documentation on AnthropicVertex) that served as my starting point.

But thanks go to @Diet for suggesting claude sonnet. That was the key to solving this puzzle.


Just when I thought I had solved this issue:

Output folder /mnt/temp/Local_52_Studio_Mechanics_SDPA_MOA uploaded to s3://docs.scbbs.com/docs/test/Local_52_Studio_Mechanics_SDPA_MOA
Traceback (most recent call last):
  File "/mnt/temp/claudeUploadImg05.py", line 132, in <module>
    response, usage = process_images_with_anthropic(s3_bucket, full_s3_output_key, project_id)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/temp/claudeUploadImg05.py", line 94, in process_images_with_anthropic
    message = client.messages.create(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_utils/_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/resources/messages.py", line 902, in create
    return self._post(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 1266, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 942, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 1046, in _request
    raise self._make_status_error_from_response(err.response) from None
anthropic.BadRequestError: Error code: 400 - {'error': {'code': 400, 'message': 'The request size (33301382 bytes) exceeds 30.000MB limit.', 'status': 'FAILED_PRECONDITION'}}

The original PDF is ONLY 26 pages, but I had to create 26 images, which apparently exceed this 30MB limit. Where is this limit set? Can I change it? Where the heck can I post AnthropicVertex issues?
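For what it's worth, the limit in the error message applies to the whole request body, and base64 inflates the raw image bytes by about 4/3. A rough pre-flight check (the limit and overhead constants here are my own assumptions, matching the reported error) looks like:

```python
# Pre-flight check against the ~30 MB request cap reported by the 400 error.
import base64

REQUEST_LIMIT_BYTES = 30 * 1000 * 1000  # the error reports a "30.000MB limit"

def base64_payload_size(image_blobs):
    """Total size of the base64-encoded image data that would be sent."""
    return sum(len(base64.b64encode(blob)) for blob in image_blobs)

def fits_in_one_request(image_blobs, overhead=50_000):
    """True if the encoded images plus a rough JSON/prompt overhead fit the cap."""
    return base64_payload_size(image_blobs) + overhead <= REQUEST_LIMIT_BYTES
```

Checking this before calling `client.messages.create` turns an opaque 400 into a decision to split the batch.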

Since the content has been subdivided into individual pages, you could try a workaround: segmented processing with a small overlap. By that I mean, submit pages 1 through 20 in one request, then pages 20 through the last in a second request. There will be some duplicate "junk" when you splice the partial results together, but I suspect it will be easy to detect and remove from the final result. Just an idea in case nothing better shows up.
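The splitting itself is a few lines; this sketch (names are mine) produces batches where each one repeats the last page(s) of the previous batch, as described above:

```python
# Split a page list into batches of `size`, each sharing `overlap`
# pages with the previous batch so the splice point can be detected.
def overlapping_batches(pages, size=20, overlap=1):
    step = size - overlap
    batches = []
    start = 0
    while start < len(pages):
        batches.append(pages[start:start + size])
        if start + size >= len(pages):
            break  # the last batch has consumed the remaining pages
        start += step
    return batches
```

For a 26-page document with `size=20, overlap=1`, this yields pages 1-20 and pages 20-26; page 20 appears in both, marking where to de-duplicate.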


Thanks. I'm going to try something similar, but using gpt-4o:

Submit the request and read the response's "finish_reason". If it is "length" (the output hit the token limit), add the assistant output to the messages and re-submit. Repeat this until the full request is processed.
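That continuation loop, in skeleton form (here `call_model` is a stand-in for the real chat-completions call; a real implementation would return the response text and `finish_reason` from the API):

```python
# Keep re-submitting while the model stops because of the token limit,
# feeding each partial answer back as an assistant message so the model
# resumes where it stopped. `call_model(messages)` is a hypothetical
# wrapper returning (text, finish_reason).
def extract_with_continuation(call_model, messages):
    full_text = ""
    while True:
        text, finish_reason = call_model(messages)
        full_text += text
        if finish_reason != "length":
            return full_text
        messages = messages + [{"role": "assistant", "content": text}]
```

The loop terminates on any finish reason other than "length" (e.g. "stop"), so a normal completion exits after one call.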

At the moment, gpt-4o appears to accept the multiple images without issue. Currently, I don't know if the AnthropicVertex error is a Google issue or an Anthropic issue, and there doesn't appear to be anyone or anyplace I can ask.


OK, so the final, final solution to this was actually to use gpt-4o. What I had to do was create a script that does this:

1. Convert the local PDF to JPEG pages.

2. Upload the JPEG images to an AWS S3 bucket.

3. Submit the JPEG images with the prompt to the OpenAI model in batches.

4. Continue processing if max tokens is exceeded.

5. Write the output to a local txt file.

Since I could not use AnthropicVertex Claude due to the 30MB request limit (which, actually, my new batching methodology works around), I tried using Claude through the AWS Bedrock SDK. The problem there was that that version of Claude does not reliably recognize strikethrough text the way the AnthropicVertex version does. Go figure.

So, gpt-4o becomes the default go-to model for handling strikethrough text.

Finally, by uploading the images in batches (as few as one image per call), I may extend the time it takes to process a large document, but the token difference is negligible, while the reliability of the image processing increases dramatically.


Hey @Ron_Parker, how many tokens does it consume per 10 pages? Is it cost-efficient?

Here are some updated notes from the issue discussion on OpenAI forum: Gpt-4o-mini fails with multiple images in same code that works with gpt-4o - #8 by SomebodySysop - API - OpenAI Developer Forum

You can see how many tokens gpt-4o costs to process a 26-page PDF. This is because OpenAI overcharges in tokens for image processing. There is another thread that discusses that: Super-high token usage with gpt-4o-mini and image - API - OpenAI Developer Forum

The Gemini and Anthropic models do not overcharge, but they either do not capture all overstrikes or run out of output tokens in the process.

The most economical method for doing this is to export the PDF to Word, have Word remove the strikethrough text, then convert back to PDF.
