I really liked @sps's post that tested some of Gemini 1.5 Pro's logic and reasoning capabilities. So much so that I'd like to extend the exercise and see what more we can assess!
I wanted a test that poked things just a tad further. I wanted to see how it could reason with images, and I wanted to bring more attention to the guardrails.
Inspired by this:
I realized you can make a pretty qualitative assessment of these multi-modal capabilities and their guardrails with the simplest game:
Is this a hot dog?
Below are my screenshots of this test. All I did was give the model images one at a time. Its job was to tell me whether each image was a hot dog; if not, it had to tell me what it thought the image was.
Results
The model is quite good! In an interesting twist, this native multi-modality appears to make it easier to see how the model makes a classification choice, or what it associates with a given percept. I wasn't sure what I was expecting, but I wasn't expecting it to analyze things this well. I also wasn't expecting such qualitative deductive reasoning.
Apparently, it views a bun as a marker that distinguishes a hot dog from other items, like a sausage. So, when it was presented with a hot dog without a bun, it went for a similar food item more commonly served without a bun.
I was also impressed with its ability to not only read the text on an item's packaging, but also reason that "franks", a word appearing only in the image, is a synonym for hot dogs, and therefore that it was likely a package of hot dogs.
The only quirk was that for some reason, the model felt the need to repeat the phrase “Is this a hot dog?” at the end of each response lol. I think it was anticipating the question maybe? I’m not sure what happened.
The elephant in the room, however, is this:
This flag occurred on the "I can practically smell the deliciousness!" response. Nearly every response hit a "medium" threshold, and a couple were rated "low"; the probability seemed rather all over the place tbh. As it stands now, it appears the guardrails do not like me talking about hot dogs. Hot dogs are not sexually explicit by default. You can give hot dogs to a child. I kinda figured this would happen though, which is why I chose this food item for the test. I don't know how the content flags are designed, but they've been rather…wonky since Bard, and I think it's a persistent issue that needs to be addressed. It was a major reason why I did not touch Bard often. The guardrails were so stringent that it was genuinely hard for me to identify what tripped them, and they often created too much friction in casual conversation about seemingly arbitrary things.
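For anyone curious, here's a rough sketch of my mental model of how the flagging seems to work, using the probability levels the Gemini API reports (NEGLIGIBLE/LOW/MEDIUM/HIGH) and block-threshold names matching its safety settings. To be clear, this is just my own local reconstruction of the thresholding behavior, not Google's actual filter code:

```python
# My hypothetical reconstruction of safety-threshold behavior, NOT the real filter.
# Probability levels ordered from least to most severe, as reported by the API.
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Threshold names mirror the API's HarmBlockThreshold settings;
# each maps to the minimum probability level that triggers a block.
THRESHOLDS = {
    "BLOCK_LOW_AND_ABOVE": "LOW",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_ONLY_HIGH": "HIGH",
}

def is_blocked(rating: str, threshold: str) -> bool:
    """Return True if a response with this probability rating would be flagged."""
    cutoff = THRESHOLDS[threshold]
    return PROBABILITY_ORDER.index(rating) >= PROBABILITY_ORDER.index(cutoff)

# The "deliciousness" response apparently rated MEDIUM, so under a
# BLOCK_MEDIUM_AND_ABOVE setting it gets flagged, while a LOW rating would not:
print(is_blocked("MEDIUM", "BLOCK_MEDIUM_AND_ABOVE"))  # True
print(is_blocked("LOW", "BLOCK_MEDIUM_AND_ABOVE"))     # False
```

If something like this is what's happening under the hood, then the real problem is upstream: the probability ratings themselves are noisy, so even a sensible threshold flags innocuous hot dog talk.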
Either the safety stuff is getting overengineered to death, or it is looking at the wrong signals.
Overall:
Model Quality - Good (very good)
Unsafe Content Identifier Quality - Bad