Yes, this is OpenAI o1. This is the screen you see when you visit "Learning to Reason with LLMs | OpenAI", where OpenAI showcases o1's reasoning capabilities.
GPT-4o (and virtually every other model) couldn't solve it, but o1 eventually did.
…
And for a while now, I’ve been trying to get other frontier models to solve this problem.
But right from the start, I hit a wall.
To summarize this ‘wall’:
- Lazy responses (and, further down the line, the 8192-token output limit)
- Bias from its own output (repeating specific sentences or large paragraphs with only a few words swapped; sentence structure and composition also become more uniform as the conversation goes on)
- Bias toward reciting training data (combined with the laziness issue, the model spits out memorized sentences without attempting extended thought)
- Hallucinations affecting the overall response (for example, despite asserting "X is Y," the model's subsequent remarks often contradict that initial statement)
- The model's lack of experience in "how to think" (and its lack of the metacognitive ability to recognize its own mistakes, etc.)
Neither Claude 3.5 Sonnet (New), GPT-4o (latest), nor Gemini 1.5 Pro (002) could overcome the first hurdle. Their reactions largely fell into two categories: either they attempted a bit of thinking and quickly gave up, or they stopped responding after saying things like "I'm trying," "I plan to do this," or "(still thinking)."
However, my favorite model, 1.5 Pro Experimental 0827, didn't do that :D
Anyway, after designing a system prompt that mitigates all of these obstacles as much as possible, my beloved model correctly identified the rules of the problem.
(For reference, I don't like putting hints in the system prompt or telling the model how to approach a specific task, so I only wrote a prompt about how to think.)
It seemed like it was applying the rules correctly, but…
The last word was clearly decoded incorrectly, and the model recognized this, but ultimately failed to find its mistake and instead tried to find other rules. It eventually hit the 8192-token output limit and stopped.
The result decoded solely by the model was “there are three rs in st,” and since I didn’t want to stop the test there, I pulled the model out of the meaningless rule-finding swamp and redirected it.
"there are three rs in st" is accurate up to that point. However, there was a significant error in the decoding process for the subsequent characters. Because it's a long string operation, a rigorous method should be used to avoid confusion.
Since telling it what mistake it made wasn’t very meaningful in this test, I simply instructed it to work “rigorously” again.
And finally…! “strawberry” is visible.
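For reference, the cipher's rule (as shown in OpenAI's demo) is that each pair of ciphertext letters maps to the letter whose alphabetical position is the average of the pair's positions. Below is a minimal Python sketch of that decoding, using the ciphertext string from the demo; it only illustrates the rule itself, not the model's actual reasoning steps.

```python
# Minimal sketch of the decoding rule from OpenAI's o1 demo:
# each pair of ciphertext letters maps to the letter whose alphabetical
# position is the average of the pair's positions.
def decode_word(word: str) -> str:
    plain = []
    for a, b in zip(word[0::2], word[1::2]):
        avg = (ord(a) - ord("a") + ord(b) - ord("a")) // 2  # 0-based average position
        plain.append(chr(avg + ord("a")))
    return "".join(plain)

# Ciphertext from the demo; each word decodes independently.
cipher = "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"
print(" ".join(decode_word(w) for w in cipher.split()))
# -> there are three rs in strawberry
```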
It was a very pleasing result that I'm proud of, but I wasn't originally planning to post about it at this length. The thinking attempts can turn out differently every time, and this experimental model isn't yet as proficient at thinking as o1, so 1.5 Pro Exp 0827 successfully solving the decoding problem was a fairly low-probability event.
However, when I opened Google AI Studio today, the Experimental tab in the model list was gone. Fortunately, the model is 'still' available through the API, but I felt it was necessary to highlight its strengths. I hope Google DeepMind will take into account the experimental methods applied when training the 0827 models, along with the weaknesses of the model revealed in this post, when creating the next generation of reasoning models, and I hope this post contributes to that.
Models trained the way 1.5 Pro Exp 0827 was trained shouldn't just sit in a corner of the Experimental tab and disappear within a few months. Besides, this model appeared even before o1.
At the time, I was looking forward to the subsequent 002 update or the next generation of Gemini, but the stable 002 release didn't retain any of Exp 0827's characteristics, which was disappointing. In any case, I hope that, using that experimental model as a stepping stone, we'll see a more advanced Gemini as the next generation of reasoning models is developed.
Oh, this post has become longer than I expected.
Thank you to everyone who read my long post!
Link to the original conversation
(I don’t know why, but the system prompt was removed from the original conversation, so I saved a copy and restored it separately.)
Translated by Gemini