Why can't I get Gemini to recognize "strikethrough" text in an image

This is the image:

This is my prompt:

    # Start the chat session
    chat_session = model.start_chat(
        history=[
            {
                "role": "user",
                "parts": [
                    files[0],
                ],
            },
        ]
    )

    # Send a message and print the response
    response = chat_session.send_message("Extract the text from this file.  Exclude strikethrough words. A strikethrough is a horizontal line drawn through text, used to indicate the deletion of an error or the removal of text.  Do not extract any word formatted with a horizontal line through its center.")
    print(response.text)

Response:

  1. Sick Leave

Modify Article 9 of the Local #161 Motion Picture Theatrical and TV Series Production Agreement (and make conforming changes to Article 41 of the Local #161 Supplemental Digital Agreement) as follows:

"ARTICLE 9. WAIVER OF NEW YORK CITY EARNED SICK TIME ACT AND SIMILAR LAWS SICK LEAVE

"(a) Paid Sick Leave in the State of New York: The following is applicable only to employees working under this Agreement in the State of New York:

"(1) Commencing [insert the date that is the first Sunday after 30 days following the AMPTP’s receipt of notice of ratification], employees shall accrue one (1)

I get the same result no matter what I try, with either the Flash or the Pro model. If it’s multi-modal and looking at this image visually, why can’t it recognize and act on the strikethrough text?

To be clear, this:

"ARTICLE 9. WAIVER OF NEW YORK CITY EARNED SICK TIME ACT AND SIMILAR LAWS SICK LEAVE

Should be returned as this:

"ARTICLE 9. SICK LEAVE

How do I fix this?


As far as I can tell, you cannot fix it. Google has to improve its model.


If possible, you could change the color of the lines to distinguish them from the text. That might be helpful.


Thanks for the suggestion. Not really possible as we are not the source for these PDFs. If we have to go through and manually edit them, sort of defeats the purpose of an automated solution.

I’ve said this before: PDFs are the lifeblood of modern day-to-day business activity. What is the business use case of recognizing drawings of cats and dogs and ducks when these multi-modal models can’t even describe what’s clearly present in a business document?

Really frustrating.

:rofl::rofl: 100% true fact.

But I wanted to know: did you upload a PDF or an image file?
As far as I know, there is a separate engine that extracts text from PDFs, and that extraction may not use AI. I have also seen that if I send it an image, it analyzes the image once; even if I then give a different instruction and ask it to re-analyze the image, it won’t. It sticks to its first result.

Yes. I have been testing with this image:

And the API absolutely refuses to recognize the strikethrough text. The AI Studio does a better job, but I need this to work in an embedding pipeline, so I’ve got to get the API working.

It looks like strikethrough in OCR is still a non-trivial issue in 2024.

Anthropic’s models are in the google model garden, if that’s a solution. Google Cloud console (you need to request access though)

It’s not perfect, but sonnet 3.5 gave me this:

[
  {
    "style": "bold",
    "content": "10. Sick Leave"
  },
  {
    "style": "normal",
    "content": "\n\nModify Article 9 of the Local #161 Motion Picture Theatrical and TV Series Production\nAgreement (and make conforming changes to Article 41 of the Local #161 Supplemental\nDigital Agreement) as follows:\n\n"
  },
  {
    "style": "bold",
    "content": "\"ARTICLE 9. "
  },
  {
    "style": "bold-strikethrough",
    "content": "WAIVER OF NEW YORK CITY EARNED SICK TIME ACT\nAND SIMILAR LAWS"
  },
  {
    "style": "bold-underlined",
    "content": " SICK LEAVE"
  },
  {
    "style": "bold",
    "content": "\""
  },
  {
    "style": "normal",
    "content": "\n\n\"(a) Paid Sick Leave in the State of New York: The following is applicable\nonly to employees working under this Agreement in the State of New York:\n\n\"(1) Commencing "
  },
  {
    "style": "italic",
    "content": "[insert the date that is the first Sunday after 30 days\nfollowing the AMPTP's receipt of notice of ratification]"
  },
  {
    "style": "normal",
    "content": ", employees shall accrue one (1)"
  }
]
proomt

    Please transcribe the text above, with text blocks filling the following schema:

    {
        style: "normal" | "bold" | "underlined" | "strikethrough" | "italic", // you can combine styles with a dash ("style_a-style_b")
        content: string // the actual content
    }[]

    Only reply with valid JSON. Begin your response with [

    .filter(e => !e.style.match("strikethrough"))
    .map(e => e.content).join("")

  1. Sick Leave

Modify Article 9 of the Local #161 Motion Picture Theatrical and TV Series Production
Agreement (and make conforming changes to Article 41 of the Local #161 Supplemental
Digital Agreement) as follows:

“ARTICLE 9. SICK LEAVE”

"(a) Paid Sick Leave in the State of New York: The following is applicable
only to employees working under this Agreement in the State of New York:

"(1) Commencing [insert the date that is the first Sunday after 30 days
following the AMPTP’s receipt of notice of ratification], employees shall accrue one (1)


Sounds like we’re getting closer. How did you generate the above? Did you manually input it, or was it generated by some code? This is the crucial point for me because I’ll be dealing with hundreds of documents which will have strikethrough text, so I really need to be able to automate the process.

Thank you for the response! I have enabled Claude Sonnet 3.5 (which I didn’t know you could do in Vertex!).

proomt is in the proomt box, just expand it :slight_smile:

I see the prompt. But what I’m not following (my bad, I don’t know the language) is:

a. What is the input text (text above) that the prompt is using? Is this the PDF or image file?
b. How is the following determined?

In other words, I don’t see how the script is determining what is and what isn’t “bold-strikethrough”. I see that the script will match against “strikethrough” and thus eliminate it from the output, which is what I want.

Unless you are saying that this prompt alone:

Will return this output:

In which case, I do understand the concept.

Yeah, you got it. That was JavaScript. You’d send each screen with that prompt and then post-process the data.

eg in py
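
A minimal Python sketch of that post-processing step, assuming the model's reply is a JSON array of `{style, content}` blocks like the one shown earlier in the thread. The function name `strip_strikethrough` and the shortened sample data are made up for illustration.

```python
import json


def strip_strikethrough(reply_text: str) -> str:
    """Join the content of every block whose style does not mention strikethrough."""
    blocks = json.loads(reply_text)
    return "".join(
        block["content"]
        for block in blocks
        # Substring check, so combined styles such as "bold-strikethrough" are dropped too.
        if "strikethrough" not in block["style"]
    )


# Shortened sample in the same shape as the JSON reply earlier in the thread.
sample = json.dumps([
    {"style": "bold", "content": "\"ARTICLE 9. "},
    {"style": "bold-strikethrough",
     "content": "WAIVER OF NEW YORK CITY EARNED SICK TIME ACT\nAND SIMILAR LAWS"},
    {"style": "bold-underlined", "content": " SICK LEAVE"},
    {"style": "bold", "content": "\""},
])

print(strip_strikethrough(sample))
```

Removing a block can leave doubled whitespace at the join, so collapsing repeated spaces afterwards may be worthwhile.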


Beautiful. Haven’t used javascript in a long minute, but I get the gist of it now. I actually have access to sonnet 3 through AWS, but it’s not offering 3.5. I’ll enable it through Vertex and give it a try. Will let you know.

A glimmer of hope. Many, many thanks.


It isn’t really “requesting”. You need to fill out some information, and you get access once you fill it out. (This is mostly information so Anthropic has an idea how many people are using it.)


Ah. I’ve been reluctant to fill it out because google’s rejected me in the past because my website wasn’t “up to snuff” or something.

No problems when uploading it as an image.

Yes, but you are doing it in the playground/chatbox. I need to do it in the API. Totally different responses.

No, this is through the API. Check it here: Google AI Gemini API UZB, or through the Telegram bot: Google AI Gemini API UZB

Thanks. Hopefully this will provide the solution: Why can't I get Gemini to recognize "strikethrough" text in an image - #12 by Diet

Thanks @Diet !!

I finally got something working. Yes, using Claude through Anthropic[Vertex] appears to recognize the strikeout.

I am using code similar to this:

    import base64
    import httpx
    from anthropic import AnthropicVertex

    LOCATION = "europe-west1"  # or "us-east5"

    client = AnthropicVertex(region=LOCATION, project_id="PROJECT_ID")

    image1_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
    image1_media_type = "image/jpeg"
    image1_data = base64.b64encode(httpx.get(image1_url).content).decode("utf-8")

    message = client.messages.create(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": image1_media_type,
                            "data": image1_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Describe this image.",
                    },
                ],
            }
        ],
        model="claude-3-5-sonnet@20240620",
    )
    print(message.model_dump_json(indent=2))

Which can be found in this Google Cloud documentation: Google Cloud console

Last question: I need to extract text from multi-page PDFs. So, task 1 is to convert each page to an image. However, I need to know how to send multiple images to Claude in one API call.

Suggestions?
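
One possible approach, sketched under the assumption that each PDF page has already been rendered to PNG bytes by some PDF rendering step that is not shown here. The Anthropic Messages API accepts several image blocks in the `content` list of a single user message, so the pages can go out together. The helper name `build_page_message`, the "Page N:" labels, and the placeholder bytes are illustrative and not from the thread.

```python
import base64


def build_page_message(pages, prompt, media_type="image/png"):
    """Build one user message holding every page image, followed by the prompt."""
    content = []
    for number, page_bytes in enumerate(pages, start=1):
        # A short text label before each image helps the model tell pages apart.
        content.append({"type": "text", "text": f"Page {number}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.b64encode(page_bytes).decode("utf-8"),
            },
        })
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


# Placeholder bytes stand in for real rendered PNG pages.
pages = [b"page-one-bytes", b"page-two-bytes"]
message = build_page_message(
    pages, "Transcribe every page. Exclude strikethrough words."
)
print(len(message["content"]))
```

The returned dict would be passed as `messages=[message]` to the same `client.messages.create(...)` call shown above. `max_tokens` would probably need to be raised for multi-page transcriptions.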
