Why?
Why not just send them in parallel? It'll be faster, too.
I'm not sure I understand. Let's say I have a 30-page PDF. I extract each page as an image, so I now have 30 JPEGs. I want the model to read the 30 pages and extract the text, in the same page order, without the strikethroughs. I'm thinking that I will send the 30 images in one model call. Are you saying to make 30 separate calls?
OK, in case anyone else runs into this, I was able to resolve the issue using Sonnet 3.5 instead of Gemini. I was stuck on how to send multiple images in one prompt, but I finally got code working that sends all of the page images in a single prompt.
Here is a code snippet of just the prompt code:
```python
import base64

import boto3
import httpx
from anthropic import AnthropicVertex


def encode_image(url, media_type):
    # Download the image and return it as a base64 image block
    # in the format the Anthropic Messages API expects.
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(httpx.get(url).content).decode("utf-8"),
        },
    }


def process_images_with_anthropic(s3_bucket, s3_output_key, folder_name, project_id, max_tokens=4096):
    LOCATION = "europe-west1"  # or "us-east5"
    client = AnthropicVertex(region=LOCATION, project_id=project_id)

    prompt = """
    You are a very professional image to text document extractor.
    Please extract the text from these images, treating them as pages of a PDF document.
    A strikethrough is a horizontal line drawn through text, used to indicate the deletion of an error or the removal of text. Ensure that all strikethrough text is excluded from the output.
    Try to format any tables found in the images.
    Do not include page numbers, page headers, or page footers.
    Please double-check to make sure that any words in all capitalized letters with strikethrough letters are excluded.
    Return only the extracted text. No commentary.
    **Exclude Strikethrough:** Do not include any strikethrough words in the output. Even if the strikethrough words are in a title.
    **Include Tables:** Tables should be preserved in the extracted text.
    **Exclude Page Headers, Page Footers, and Page Numbers:** Eliminate these elements which are typically not part of the main content.
    """

    # List the page images for this document in S3.
    s3_client = boto3.client('s3')
    s3_output_key = s3_output_key.rstrip('/')
    response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=f"{s3_output_key}/{folder_name}/")

    # Build the message content: one image block per page, then the prompt.
    content = []
    for obj in response.get('Contents', []):
        if obj['Key'].endswith('.jpg'):  # Ensure we're only processing jpg files
            url = f"https://s3.us-west-2.amazonaws.com/{s3_bucket}/{obj['Key']}"
            content.append(encode_image(url, "image/jpeg"))

    content.append({
        "type": "text",
        "text": prompt
    })

    message = client.messages.create(
        max_tokens=max_tokens,
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        model="claude-3-5-sonnet@20240620",
    )
    return message.content[0].text
```
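For reference, a call would look something like this (the bucket, prefix, folder, and project values here are placeholders, not the real ones):

```python
# Placeholder values for illustration only.
text = process_images_with_anthropic(
    s3_bucket="my-bucket",
    s3_output_key="docs/test",
    folder_name="My_Document",
    project_id="my-gcp-project",
)
print(text)
```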
This was the code sample (from the Google Cloud documentation on AnthropicVertex) that served as my starting point.
But thanks go to @Diet for suggesting Claude Sonnet. That was the key to solving this puzzle.
Just when I thought I had solved this issue:
```
Output folder /mnt/temp/Local_52_Studio_Mechanics_SDPA_MOA uploaded to s3://docs.scbbs.com/docs/test/Local_52_Studio_Mechanics_SDPA_MOA
Traceback (most recent call last):
  File "/mnt/temp/claudeUploadImg05.py", line 132, in <module>
    response, usage = process_images_with_anthropic(s3_bucket, full_s3_output_key, project_id)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/temp/claudeUploadImg05.py", line 94, in process_images_with_anthropic
    message = client.messages.create(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_utils/_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/resources/messages.py", line 902, in create
    return self._post(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 1266, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 942, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anthropic/_base_client.py", line 1046, in _request
    raise self._make_status_error_from_response(err.response) from None
anthropic.BadRequestError: Error code: 400 - {'error': {'code': 400, 'message': 'The request size (33301382 bytes) exceeds 30.000MB limit.', 'status': 'FAILED_PRECONDITION'}}
```
The original PDF is ONLY 26 pages, but I had to create 26 images, which apparently exceed this 30 MB limit. Where is this limit set? Can I change it? Where the heck can I post AnthropicVertex issues?
Since the content has been subdivided into individual pages, you could try a workaround: segmented processing with a small overlap. By that I mean, submit pages 1 through 20 in one request, then pages 20 through the last in a second request. There will be some duplicate "junk" when you splice the partial results together, but I suspect it will be easy to detect and remove from the final result. Just an idea in case nothing better shows up.
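A rough sketch of that idea, reusing the client and prompt from the code above (the batch size and overlap are illustrative, and the splicing/de-duplication step is left out):

```python
def extract_in_batches(client, page_blocks, prompt, batch_size=20, overlap=1, max_tokens=4096):
    # page_blocks: list of base64 image blocks built by encode_image(), in page order.
    # Sends overlapping slices so each request stays well under the 30 MB limit;
    # consecutive batches share `overlap` pages, which must be de-duplicated later.
    results = []
    start = 0
    while start < len(page_blocks):
        end = min(start + batch_size, len(page_blocks))
        content = page_blocks[start:end] + [{"type": "text", "text": prompt}]
        message = client.messages.create(
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": content}],
            model="claude-3-5-sonnet@20240620",
        )
        results.append(message.content[0].text)
        if end == len(page_blocks):
            break
        start = end - overlap  # step back so the next batch repeats the boundary page
    return results
```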
Thanks. I'm going to try something similar, but using gpt-4o:
Submit the request and read the response "finish reason". If it is "length" (the output exceeded the token limit), then add the assistant output to the messages and re-submit. Repeat this until the full request is processed.
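A minimal sketch of that loop with the OpenAI SDK (the model name, max_tokens, and the "continue" wording are my assumptions, not the actual script):

```python
from openai import OpenAI

client = OpenAI()

def extract_until_done(messages, model="gpt-4o", max_tokens=4096):
    # Keep re-submitting while the output is truncated ("length"),
    # feeding the partial assistant output back into the conversation.
    parts = []
    while True:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )
        choice = response.choices[0]
        parts.append(choice.message.content or "")
        if choice.finish_reason != "length":
            break  # "stop" means the model finished on its own
        # Output was cut off: append it as an assistant turn and ask
        # the model to pick up where it left off.
        messages = messages + [
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": "Continue exactly where you left off."},
        ]
    return "".join(parts)
```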
At the moment, gpt-4o appears to accept the multiple images without issue. Currently, I don't know if the AnthropicVertex error is a Google issue or an Anthropic issue, and there doesn't appear to be anyone or anyplace I can ask.
OK, so the final, final solution to this was actually to use gpt-4o. What I had to do was create a script that sends the page images in small batches (as few as one image per call) and splices the partial results back together, as described below.
Since I could not use Claude on AnthropicVertex due to the 30 MB request limit (which, actually, my new methodology works around), I tried using Claude through the AWS Bedrock SDK. The problem there was that that version of Claude does not recognize strikethrough text as reliably as the AnthropicVertex version does. Go figure.
So, gpt-4o becomes the default go-to model for handling strikethrough text.
Finally, by uploading the images in batches (as few as one image per call), I may extend the time it takes to process a large document, but the token difference is negligible, while the efficiency of the image processing increases dramatically.
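For illustration, the one-image-per-call variant might look like this with the OpenAI SDK (page_urls and the prompt are stand-ins for whatever the actual script uses):

```python
import base64

import httpx
from openai import OpenAI

client = OpenAI()

def extract_page_by_page(page_urls, prompt, model="gpt-4o"):
    # One image per call: slower overall, but each request stays small
    # and the per-page outputs are easy to splice back together in order.
    pages = []
    for url in page_urls:
        b64 = base64.b64encode(httpx.get(url).content).decode("utf-8")
        response = client.chat.completions.create(
            model=model,
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        pages.append(response.choices[0].message.content)
    return "\n\n".join(pages)
```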
Hey @Ron_Parker, how many tokens does it consume per 10 pages? Is it cost-efficient?
Here are some updated notes from the issue discussion on the OpenAI forum: Gpt-4o-mini fails with multiple images in same code that works with gpt-4o - #8 by SomebodySysop - API - OpenAI Developer Forum
You can see there how many tokens it cost gpt-4o to process a 26-page PDF. The counts are high because OpenAI overcharges in tokens for image processing. There is another thread that discusses that: Super-high token usage with gpt-4o-mini and image - API - OpenAI Developer Forum
The Gemini and Anthropic models do not overcharge, but they either do not capture all of the strikethroughs or run out of output tokens in the process.
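For a rough sense of why the gpt-4o image token counts climb so quickly, here is the tile-based estimate OpenAI documented for high-detail images, as I understand it (the constants are from memory of their docs and should be double-checked):

```python
import math

def estimate_image_tokens(width, height, base_tokens=85, tile_tokens=170):
    # Rough estimate of the documented high-detail accounting: scale the image
    # to fit in 2048x2048, scale the short side down to 768, then charge a base
    # cost plus a cost per 512px tile. The constants here are the ones I
    # understand to apply to gpt-4o; gpt-4o-mini used much larger constants,
    # which is what made its image token usage look inflated.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base_tokens + tile_tokens * tiles

# Example: a letter-size page scanned at 150 dpi (1275x1650 px)
print(estimate_image_tokens(1275, 1650))  # 4 tiles -> 765 tokens with these constants
```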
The most economical method for doing this is to export the PDF to Word, have Word remove the strikethrough text, and then convert back to PDF.
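If you want to script that Word step rather than do it by hand, something like this with python-docx (run on the exported .docx, not the PDF) could blank out the struck-through runs; it's a sketch of the idea, not a tested pipeline:

```python
from docx import Document  # pip install python-docx

def remove_strikethrough_text(docx_in, docx_out):
    # Blank out any run whose font is marked strikethrough. This only covers
    # body paragraphs and table cells; headers, footers, and more exotic
    # formatting would need extra handling.
    doc = Document(docx_in)

    def scrub(paragraph):
        for run in paragraph.runs:
            if run.font.strike:
                run.text = ""

    for paragraph in doc.paragraphs:
        scrub(paragraph)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    scrub(paragraph)

    doc.save(docx_out)

# Hypothetical file names for illustration.
remove_strikethrough_text("contract.docx", "contract_clean.docx")
```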