Why can't I get Gemini to recognize "strikethrough" text in an image?

why? :thinking:

why not just send them in parallel? it'll be faster, too.

I'm not sure I understand. Let's say I have a 30-page PDF. I extract each page as an image, so I now have 30 JPEGs. I want the model to read the 30 pages and extract the text, in the same page order, without the strikethroughs. I'm thinking that I will send the 30 images in one model call. Are you saying to make 30 separate calls?

OK, in case anyone else runs into this, I was able to resolve the issue using sonnet 3.5 instead of gemini. I was stuck on how to send multiple images in one prompt. I finally got code working that:

  1. Converts a PDF to images (I used PyMuPDF)
  2. Saves the images in a local directory.
  3. Uploads the directory to Google Drive (or, in my case, AWS S3).
  4. Creates a prompt that uses the images as content.
  5. Sends the prompt to Claude Sonnet 3.5 and returns the response.

Here is a code snippet of just the prompt code:

import base64

import boto3
import httpx
from anthropic import AnthropicVertex

def encode_image(url, media_type):
    """Fetch an image over HTTP and wrap it as a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(httpx.get(url).content).decode("utf-8"),
        },
    }

def process_images_with_anthropic(s3_bucket, s3_output_key, folder_name, project_id, max_tokens=4096):
    LOCATION = "europe-west1"  # or "us-east5"
    client = AnthropicVertex(region=LOCATION, project_id=project_id)

    prompt = """
    You are a very professional image to text document extractor.
    Please extract the text from these images, treating them as pages of a PDF document. 
    A strikethrough is a horizontal line drawn through text, used to indicate the deletion of an error or the removal of text.  Ensure that all strikethrough text is excluded from the output. 
    Try to format any tables found in the images. 
    Do not include page numbers, page headers, or page footers.
    Please double-check to make sure that any words in all capitalized letters with strikethrough letters are excluded.
    Return only the extracted text.  No commentary.
    **Exclude Strikethrough:** Do not include any strikethrough words in the output. Even if the strikethrough words are in a title.
    **Include Tables:** Tables should be preserved in the extracted text.
    **Exclude Page Headers, Page Footers, and Page Numbers:** Eliminate these elements which are typically not part of the main content.
    """

    s3_client = boto3.client('s3')
    s3_output_key = s3_output_key.rstrip('/')
    response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=f"{s3_output_key}/{folder_name}/")

    content = []
    for obj in response.get('Contents', []):
        if obj['Key'].endswith('.jpg'):  # Ensure we're only processing jpg files
            url = f"https://s3.us-west-2.amazonaws.com/{s3_bucket}/{obj['Key']}"
            content.append(encode_image(url, "image/jpeg"))
    
    content.append({
        "type": "text",
        "text": prompt
    })

    message = client.messages.create(
        max_tokens=max_tokens,
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        model="claude-3-5-sonnet@20240620",
    )

    return message.content[0].text

This was the code sample (from the Google Cloud documentation on AnthropicVertex) that served as my starting point.

But thanks go to @Diet for suggesting claude sonnet. That was the key to solving this puzzle.


Just when I thought I had solved this issue:

Output folder /mnt/temp/Local_52_Studio_Mechanics_SDPA_MOA uploaded to s3://docs.scbbs.com/docs/test/Local_52_Studio_Mechanics_SDPA_MOA
Traceback (most recent call last):
  File "/mnt/temp/claudeUploadImg05.py", line 132, in <module>
    response, usage = process_images_with_anthropic(s3_bucket, full_s3_output_key, project_id)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/temp/claudeUploadImg05.py", line 94, in process_images_with_anthropic
    message = client.messages.create(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_utils/_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/resources/messages.py", line 902, in create
    return self._post(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 1266, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 942, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 1046, in _request
    raise self._make_status_error_from_response(err.response) from None
anthropic.BadRequestError: Error code: 400 - {'error': {'code': 400, 'message': 'The request size (33301382 bytes) exceeds 30.000MB limit.', 'status': 'FAILED_PRECONDITION'}}

The original PDF is ONLY 26 pages, but I had to create 26 images, which apparently exceed this 30MB limit. Where is this limit set? Can I change it? Where the heck can I post AnthropicVertex issues?
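For what it's worth, the limit in the error message applies to the whole request body, and base64 inflates the raw image bytes by about 4/3. A rough pre-flight check (the limit and overhead constants here are my own assumptions, matching the reported error) looks like:

```python
# Pre-flight check against the ~30 MB request cap reported by the 400 error.
import base64

REQUEST_LIMIT_BYTES = 30 * 1000 * 1000  # the error reports a "30.000MB limit"

def base64_payload_size(image_blobs):
    """Total size of the base64-encoded image data that would be sent."""
    return sum(len(base64.b64encode(blob)) for blob in image_blobs)

def fits_in_one_request(image_blobs, overhead=50_000):
    """True if the encoded images plus a rough JSON/prompt overhead fit the cap."""
    return base64_payload_size(image_blobs) + overhead <= REQUEST_LIMIT_BYTES
```

Checking this before calling `client.messages.create` turns an opaque 400 into a decision to split the batch.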

Since the content has been subdivided into individual pages, you could try a workaround: segmented processing with a small overlap. By that I mean, submit pages 1 through 20 in one request, then pages 20 through the last in a second request. There will be some duplicate "junk" when you splice the partial results together, but I suspect it will be easy to detect and remove from the final result. Just an idea in case nothing better shows up.
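The splitting itself is a few lines; this sketch (names are mine) produces batches where each one repeats the last page(s) of the previous batch, as described above:

```python
# Split a page list into batches of `size`, each sharing `overlap`
# pages with the previous batch so the splice point can be detected.
def overlapping_batches(pages, size=20, overlap=1):
    step = size - overlap
    batches = []
    start = 0
    while start < len(pages):
        batches.append(pages[start:start + size])
        if start + size >= len(pages):
            break  # the last batch has consumed the remaining pages
        start += step
    return batches
```

For a 26-page document with `size=20, overlap=1`, this yields pages 1-20 and pages 20-26; page 20 appears in both, marking where to de-duplicate.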


Thanks. I'm going to try something similar, but using gpt-4o:

Submit the request and read the response's "finish_reason". If it is "length" (the output hit the token limit), add the assistant output to the messages and re-submit. Repeat this until the full request is processed.
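That continuation loop, in skeleton form (here `call_model` is a stand-in for the real chat-completions call; a real implementation would return the response text and `finish_reason` from the API):

```python
# Keep re-submitting while the model stops because of the token limit,
# feeding each partial answer back as an assistant message so the model
# resumes where it stopped. `call_model(messages)` is a hypothetical
# wrapper returning (text, finish_reason).
def extract_with_continuation(call_model, messages):
    full_text = ""
    while True:
        text, finish_reason = call_model(messages)
        full_text += text
        if finish_reason != "length":
            return full_text
        messages = messages + [{"role": "assistant", "content": text}]
```

The loop terminates on any finish reason other than "length" (e.g. "stop"), so a normal completion exits after one call.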

At the moment, gpt-4o appears to accept the multiple images without issue. Currently, I don't know if the AnthropicVertex error is a Google issue or an Anthropic issue, and there doesn't appear to be anyone or anyplace I can ask.


OK, so the final, final solution to this was actually to use gpt-4o. What I had to do was create a script that does this:

1. Convert the local PDF to JPEG pages.

2. Upload the JPEG images to an AWS S3 bucket.

3. Submit the JPEG images with the prompt to the OpenAI model in batches.

4. Continue processing if max tokens is exceeded.

5. Write the output to a local txt file.

Since I could not use AnthropicVertex Claude due to the 30MB request limit (which, actually, my new batching methodology works around), I tried using Claude through the AWS Bedrock SDK. The problem there was that that version of Claude does not reliably recognize strikethrough text the way the AnthropicVertex version does. Go figure.

So, gpt-4o becomes the default go-to model for handling strikethrough text.

Finally, by uploading the images in batches (as few as one image per call), I may extend the time it takes to process a large document, but the token difference is negligible, while the reliability of the image processing increases dramatically.


Hey @Ron_Parker, how many tokens does it consume per 10 pages? Is it cost-efficient?

Here are some updated notes from the issue discussion on OpenAI forum: Gpt-4o-mini fails with multiple images in same code that works with gpt-4o - #8 by SomebodySysop - API - OpenAI Developer Forum

You can see how many tokens gpt-4o costs to process a 26-page PDF. This is because OpenAI overcharges in tokens for image processing. There is another thread that discusses that: Super-high token usage with gpt-4o-mini and image - API - OpenAI Developer Forum

The Gemini and Anthropic models do not overcharge, but they either do not capture all overstrikes or run out of output tokens in the process.

The most economical method for doing this is to export the PDF to Word, have Word remove the strikethrough text, then convert back to PDF.
