Microsoft Research published the paper “Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models” in May 2024 ([2404.03622] Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models). Their results are easy to reproduce, but they had not tested the Google models, so I did. The results I obtained were quite disappointing, so I didn’t publish them at the time.
The natural language navigation task uses a 3x3 grid, so the baseline success rate is one in nine, or about 11%: a blindfolded chicken randomly pecking at answers would be expected to succeed 11% of the time. Gemini 1.5 Flash achieved 24%. The Microsoft paper reported success rates well over 50% for GPT-4. Clearly the 1.5 models were not doing well on this task. The difficulty with spatial direction was obvious in other tests as well: visual navigation tasks, images with upside-down numbers, and the model's responses about the relative position of objects in an image. The Gemini 1.5 models (both Flash and Pro) were directionally challenged.
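The chance baseline is easy to sanity-check. A minimal Monte Carlo sketch, under my simplifying assumption that the task reduces to picking the one correct cell out of nine:

```python
import random

def random_guess_success_rate(trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the success rate of guessing a cell on a 3x3 grid at random."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(3) for c in range(3)]
    hits = 0
    for _ in range(trials):
        target = rng.choice(cells)  # the correct destination cell
        guess = rng.choice(cells)   # a blindfolded chicken's peck
        hits += (guess == target)
    return hits / trials

print(f"{random_guess_success_rate():.3f}")  # close to 1/9 ≈ 0.111
```

Anything meaningfully above that 11% line is at least picking up some signal; 24% clears it, but not by much.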
I repeated the natural language navigation task with experimental 0801. The preliminary results are encouraging: the success rate worked out to 78%, though with large error bars, which translates to success somewhere between 3 in 4 and 4 in 5 times. That is a big step up from the roughly 1-in-4 success rate of the non-experimental models, which is only about double the blindfolded chicken's. Just to make sure, I re-ran the test on gemini-1.5-flash-001, and it showed the same disappointing success rate I had measured in May.
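The error bars can be made concrete with a Wilson score interval. A sketch, using a hypothetical run of 50 trials (my trial count is not stated here; 39/50 is just an illustrative split that yields the observed 78%):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: 39 successes in 50 runs gives the observed 78% rate.
lo, hi = wilson_interval(39, 50)
print(f"{lo:.2f} .. {hi:.2f}")  # an interval comfortably containing 0.75-0.80
```

With a sample of that size the interval spans well past the 3-in-4 to 4-in-5 range, which is why I call these results preliminary.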
Model gemini-1.5-pro-exp-0801 also shows significant improvement when dealing with images containing upside-down numbers and the like. It seems to have overcome the directional challenges of the previous model variants.
There are other differences in performance as well: the 0801 model solved a few math problems that Gemini 1.5 Pro had difficulty with, but it works both ways, as the 0801 model has difficulty with a few math problems that Gemini 1.5 Pro solves.
Conclusion: the most dramatic improvement in performance between Gemini 1.5 Pro and the 0801 model that I have observed in testing so far is in spatial navigation and direction. The 0801 model did not suffer the performance degradation from excessive alignment training that I am convinced afflicted the 1.5 models, and that nobody at Google will ever fess up to (since who in their right mind would admit to such a potentially career-limiting mistake).