I have a question about Gemini Vision. According to the documentation, each uploaded image consumes approximately 258 tokens (though this may vary with image dimensions). Currently, vision input can be used with gemini-1.0-pro-vision, gemini-1.5-flash, gemini-1.5-pro, and gemini-2.0-flash.
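For what it's worth, here is a minimal sketch of how I've been checking the per-image token count empirically, assuming the Python google-generativeai SDK (the API key and file path are placeholders):

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# "math_problem.png" is a placeholder path for a sample image.
img = PIL.Image.open("math_problem.png")
model = genai.GenerativeModel("gemini-1.5-flash")

# count_tokens accepts the same content types as generate_content,
# so the reported total should reflect how the image is tokenized.
print(model.count_tokens([img]).total_tokens)
```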
In this case, are these 258 tokens billed at each model's own API pricing? If so, how do these models actually differ in vision understanding? Does gemini-2.0-flash have better vision understanding than gemini-1.5-pro?
gemini-2.0-flash-lite has also been added to the API. Does it support vision input as well? In my opinion, Google's documentation is thin on these details of the vision capabilities.
Specifically, I plan to use it for OCR on mathematical problems, and I want to decide which model would suit that best.
I couldn't find any benchmarks or resources comparing the models on this kind of task.
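For context, here is roughly the call I have in mind (just a sketch of my plan; the model choice, prompt, and file path are my own assumptions, not a recommendation):

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Placeholder image of a math problem to transcribe.
img = PIL.Image.open("math_problem.png")
model = genai.GenerativeModel("gemini-2.0-flash")  # candidate model

# Ask the model to act as OCR and return the problem as LaTeX.
response = model.generate_content(
    ["Transcribe the mathematical problem in this image as LaTeX.", img]
)
print(response.text)
```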
Thank you for your support.