Hello everyone,
I’m currently working with AI Studio as part of my Google fellowship, on a data science team in Cloud, and I’ve run into an interesting issue during code snippet augmentation and evaluation.
My workflow involves using the model to:
- Generate prompts for code augmentation.
- Evaluate the augmented code snippets based on custom metrics (developed with the model’s assistance).
- Compare up to 10 different augmented outputs to identify pathologies and repetitiveness.
The problem arises during the comparison phase. Occasionally, the model seems to get confused and starts comparing an output to itself (e.g., “augmentation 6 is identical to augmentation 6”). I can correct it and the model usually resumes the workflow, but it still incorporates these invalid self-comparisons into its final evaluation summary.
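For context, here’s a minimal sketch (Python) of how I’m thinking of restructuring the comparison step so the model is only ever handed distinct pairs. This is not my actual pipeline; `call_model` is a placeholder for whatever API call you use (Gemini API client, AI Studio export, etc.), and the prompt wording is just illustrative.

```python
from itertools import combinations


def call_model(prompt: str) -> str:
    """Placeholder for the actual LLM call (e.g., a Gemini API client).

    Replace this with your real API invocation.
    """
    raise NotImplementedError


def compare_augmentations(augmentations: list[str]) -> dict[tuple[int, int], str]:
    """Ask the model to compare each *distinct* pair of augmented snippets.

    Enumerating the pairs in code (instead of asking the model to keep
    track of up to 10 outputs at once) means a self-comparison like
    "augmentation 6 vs. augmentation 6" can never be requested.
    """
    results = {}
    for i, j in combinations(range(len(augmentations)), 2):  # i < j, never i == j
        prompt = (
            f"Compare augmentation {i + 1} and augmentation {j + 1} of the same "
            f"code snippet. Note any pathologies or repetitive patterns.\n\n"
            f"Augmentation {i + 1}:\n{augmentations[i]}\n\n"
            f"Augmentation {j + 1}:\n{augmentations[j]}"
        )
        results[(i + 1, j + 1)] = call_model(prompt)
    return results
```

The idea is that the final summary would then be assembled from results keyed by distinct (i, j) pairs, so a stray self-comparison has nowhere to land. I’d still love to hear whether others have found a prompt-level fix that avoids splitting the comparison into separate calls.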
It’s manageable, but I’m curious if anyone has experienced similar issues or has insights into:
- Why the model might be getting confused during the comparison process.
- Strategies for organizing the experiment to minimize this confusion.
- Whether there are any known limitations with the model when conducting these types of sequential evaluations.
I’m particularly interested in hearing from others who have worked with AI Studio for code augmentation, evaluation, or similar multi-step workflows.
Any suggestions or experiences you can share would be greatly appreciated!
Thanks!
BTW - I used Gemini 2.0 Flash to create this post, the title, and the tags!