Gemini Pro unable to transcribe text in images

I gave Gemini Pro a screenshot with an extract from a book. Since I wanted to ask a question about it, I first asked it to transcribe the text (I find this to be an effective technique). After attempting to generate some images, it said it was unable to perform the task.

Gemini 1.5 Pro instead transcribed only part of the text, omitting the formulas (this has to do with the inability to display content when using the double dollar sign environment), and then stopped halfway. When I asked why it stopped, it said that it had transcribed the entire text, even though it had not.

This is a pretty simple task ChatGPT can perform rather well. Has anyone succeeded with a request of this kind?

Interesting…hmmm…

And it does not appear to have triggered any flags.

Did you by chance try it in AI Studio a second time to see if it’s unable to produce a result on every attempt?

I tried replicating your issue using some random image on the internet here:

And it successfully printed a complete response:

The only thing that comes to mind might be the formatting. It almost looks like it’s trying to produce some of the information in LaTeX, but AI Studio itself is not set up for that kind of format yet. I don’t know if that’s why it didn’t finish, but it does not seem that AI Studio visualizes it.

This is what I get when I asked for that:

Yes, Gemini 1.5 Pro consistently breaks down when using the double dollar sign from LaTeX. This is an issue I have already reported, and I intend to open a longer thread on Gemini’s limitations when dealing with mathematics.

But I never saw the behaviour where it was convinced that it processed the entire file even though it did not.

Honestly, I think this might be an issue with Google AI Studio and not the model itself. It might have returned the entire response, but something in the app might have cut it off. Keep in mind Google AI Studio is brand new, so kinks are still being worked out.

Yes absolutely, I also think it’s a problem of Google AI Studio and I am reporting it exactly for this reason. Hopefully they fix all these problems little by little. #developersdevelopersdevelopers

From the short time I’ve been here, thankfully it does seem like Google has been very good at both listening to feedback here and deploying fixes pretty quickly. It’s a refreshing change of pace lol. Once the weekend wraps up I don’t doubt it’ll be addressed.

#developersdevelopersdevelopers :smiley:

Yes, I think Google has a big opportunity and only needs to listen to the feedback instead of ignoring it as the competition does.

Actually, Gemini 1.5 failed to transcribe the sample you provided. Look closely. It did fine up to and including the section “Properties”. In the last section, it transcribed 4 items in a bulleted list, but the source section has 9 items in a numbered list. Of the 4 it did, only one (the limit of ln(x) as x approaches zero from the positive side) is in the source document. Gemini made up the other three, although they are technically correct. My hypothesis is that it mistook the heading for a discussion about limits at zero, and then discarded everything that didn’t match what it expected.

Anyway, the counter-example (“look, Gemini works”) isn’t a valid counter-example. Gemini makes mistakes, and so far other LLMs make mistakes too.

Gemini has a BIG problem handling maths.

Completely agree!

In this case, it was the formatting I was paying attention to, not its content. I was also posting that as I was doing it lol. At least that’s what I thought :sweat_smile:.

I think of these more like explorations than anything else.

You know, I wonder if Google has some fun plans up its sleeve with AlphaGeometry? :thinking:

It’s like, they have the language parts, and they have the maths parts, they just need to figure out how to smush them together.

Graphs seem to confuse Gemini a bit more than other LLMs at this time. This example illustrates it. The prompt is “Alice, Bob, Charlie and Doug like to gossip, and each knows a scandal (a for Alice, b for Bob, c for Charlie and d for Doug) which is not known to any of the others. They communicate by mail, and whenever a person writes a letter, they pass on to the recipient as many scandals as they know. How many letters are needed, and in what sequence, before everyone knows about all the scandals?”

Gemini 1.5 response:
Here’s how everyone can learn all the scandals with just 4 letters:
Sequence:
1. Alice to Bob: Alice shares her secret (a) with Bob. Now Bob knows a and b.
2. Bob to Charlie: Bob shares both secrets (a and b) with Charlie. Now Charlie knows a, b, and c.
3. Charlie to Doug: Charlie shares all three secrets (a, b, and c) with Doug. Now Doug knows a, b, c, and d.
4. Doug to Alice: Doug shares all four secrets (a, b, c, and d) with Alice. Now Alice knows all the secrets as well.
With this sequence, everyone learns all the scandals with just 4 letters.

The procedure is insufficient: it takes 6 letters to distribute the scandals. In Gemini’s procedure, Bob never finds out secret (d), for example.

GPT-4 does marginally better. It comes up with 8 letters, and at least satisfies the requirement that everyone must hear all the scandals.

In this particular case, Claude 3 actually solved the message passing test: 6 letters, properly ordered. Claude 3 isn’t always better at math, though; I have examples where it does terribly.
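The letter counts above are easy to check mechanically. Here is a minimal sketch (the function names are illustrative, not from any library) that replays Gemini’s four letters and then runs a breadth-first search for a shortest sequence. It assumes a letter passes along everything the sender knows at the moment of writing, which is how the prompt reads.

```python
from collections import deque
from itertools import permutations

PEOPLE = "ABCD"  # Alice, Bob, Charlie, Doug
N = len(PEOPLE)
FULL = (1 << N) - 1  # bitmask meaning "knows all four scandals"
START = tuple(1 << i for i in range(N))  # each person knows only their own


def send(state, sender, recipient):
    """Return the new state after `sender` mails everything they know."""
    new_state = list(state)
    new_state[recipient] |= state[sender]
    return tuple(new_state)


def simulate(letters):
    """Replay a list of (sender, recipient) initials from the start state."""
    state = START
    for sender, recipient in letters:
        state = send(state, PEOPLE.index(sender), PEOPLE.index(recipient))
    return state


def missing(state):
    """Map each person to the scandals (lowercase initials) they still lack."""
    return {
        PEOPLE[i]: [PEOPLE[j].lower() for j in range(N)
                    if not ((state[i] >> j) & 1)]
        for i in range(N)
    }


def shortest_sequence():
    """Breadth-first search for a shortest sequence that informs everyone."""
    queue = deque([(START, [])])
    seen = {START}
    while queue:
        state, path = queue.popleft()
        if all(known == FULL for known in state):
            return path
        for s, r in permutations(range(N), 2):
            nxt = send(state, s, r)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(PEOPLE[s], PEOPLE[r])]))
    return None


gemini_letters = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(missing(simulate(gemini_letters)))  # Bob lacks c and d, Charlie lacks d
print(len(shortest_sequence()))           # minimum number of letters
```

The search space is tiny (at most 16^4 states), so plain BFS is enough here; nothing smarter is needed for four people.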

Geometry is an area where Gemini 1.5 needs help. Spatial relationships in general are difficult for LLMs. Try this simple geometry problem (it’s a disguised Pythagoras):


Prompt: “We have a right triangle (ABC) in a plane. The base (BC) has length x, the hypotenuse (AC) has length z, and the remaining side (AB) has length y. Inscribe a square inside the triangle such that one side of the square is on the base of the triangle (BC) and another side of the square is on the other leg of the triangle, (AB). The hypotenuse and the square share only one point, label it E. Let D be the corner of the square on (AB). The sides of the square have length s. Show the following equation is true: (x + y) (x + y - 2s) = z^2.”
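
For reference, the identity takes only a few lines. This is a sketch under the assumption that the right angle is at B and the square has one corner at B, so that DE is parallel to BC with AD = y - s and DE = s (the original figure is not reproduced here). Triangle ADE is then similar to triangle ABC, which fixes s, and the rest is Pythagoras:

```latex
\begin{align*}
\frac{AD}{AB} = \frac{DE}{BC}
  &\;\Longrightarrow\; \frac{y - s}{y} = \frac{s}{x}
  \;\Longrightarrow\; xy = s\,(x + y)
  \;\Longrightarrow\; s = \frac{xy}{x + y}, \\[4pt]
(x + y)(x + y - 2s)
  &= (x + y)^2 - 2s\,(x + y) \\
  &= x^2 + 2xy + y^2 - 2xy \\
  &= x^2 + y^2 = z^2 .
\end{align*}
```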

Gemini 1.5 got the prompt and the image. The response is awful.
It applies Pythagoras to 3 collinear points, indulges in writing out formulas and then rearranging, factoring, and simplifying them, botches the algebra too, and in the end proudly announces “Therefore, the equation is proved.”

And that is a definite improvement over Gemini Advanced:

Absolutely, based on the image you sent, you can identify three sets of similar triangles:

  1. Triangle ABC and Triangle BFE:
  • Both triangles share a right angle (∠B).
  • Since ∠E and ∠C are both vertical angles, they are congruent (∠E ≅ ∠C = 90°).
  • The triangles share a common vertex (B).
  2. Triangle ABC and Triangle ADE:
  • Both triangles share a right angle (∠B).
  • Since ∠A and ∠D are both right angles, they are congruent (∠A ≅ ∠D = 90°).
  • The triangles share a common vertex (A).
  3. Triangle BFE and Triangle ADE:
  • Both triangles share the same angle (∠E).
  • The triangles are equiangular to each other since their corresponding angles are congruent due to reasons mentioned in points 1 and 2 above.

Gemini 1.0 adds a second right angle at C right at the first step, even though the triangle ABC is given to have its right angle opposite the hypotenuse. Two right angles in a triangle make the triangle impossible.

Definitely add to here please. This is very relevant and more suited to the thread of maths.