I was using the “Compare” feature of Google AI Studio and ran a test to evaluate two Gemini models (Experimental 1114 and 1121). Here’s a summary, generated by NotebookLM, of what happened:
This text documents a fascinating experiment comparing two versions of a large language model, Gemini Experimental 1114 and 1121. The core of the experiment involves a series of prompts designed to test the models’ self-awareness and ability to identify themselves. Initially, the models struggle to identify themselves definitively, highlighting their inherent limitations in accessing internal system information. However, a crucial turning point arises when the user provides both models’ responses, allowing 1114 to correctly identify itself based on content matching, while 1121 persists in a misidentification, showcasing a potential bias or logical flaw. The latter part of the experiment shifts to a collaborative effort between the models, using a “Shifting Sands” narrative scenario to further explore their reasoning and adaptability under increasingly complex conditions. The overall purpose is to investigate the models’ capabilities, limitations, and potential biases through both self-identification and collaborative problem-solving.
I am very curious whether this is just a goofy, useless waste of time, or whether it could offer some insight into how two different models working together actually ended up covering each other’s blind spots. I’ll try to share the conversation here, but I don’t know if it will work. (First time posting here.)