Hi, I’m building a project using Gemini’s generateContent API. I send the API a webpage’s content, including text, images, and audio, and the webpage contains instructions that need to be solved.
I want an approach that lets the model also generate charts, interpret audio, and solve problems based on the provided content. Currently I’m using tool-calling, but it supports only a very limited set of actions.
What API or approach should I use to enable richer capabilities like chart generation and audio understanding?
Hi @Tarang_Jhaveri ,
Thank you for reaching out!
To handle audio understanding, chart generation, and rich multimodal reasoning, use a multimodal LLM API rather than relying on tool-calling alone. Gemini’s multimodal models, for example, accept text, images, audio, and more in a single request.
Recommended Gemini models:
- Gemini 3 Pro – the most capable option for complex multimodal reasoning across text, images, and audio.
- Gemini 3 Flash – a faster, cheaper alternative with strong multimodal capabilities and lower latency.
These models process media inputs directly and can return structured output (such as JSON) that your app can use to build charts or act on what the audio says. Two sketches follow: one for audio understanding, one for chart data.
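For audio understanding, you can pass the audio bytes inline alongside a text prompt in a single generateContent call. Here is a minimal sketch using the google-genai Python SDK (`pip install google-genai`); the file name, prompt, and model id string are illustrative assumptions, so substitute the exact model id you have access to:

```python
# Minimal sketch: send inline audio plus a text prompt to generateContent.
# Assumes GEMINI_API_KEY is set in the environment.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

MODEL = "gemini-3-flash"  # assumption: replace with the exact model id

# Read the audio extracted from the webpage (hypothetical file name).
with open("page_audio.mp3", "rb") as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
        "Transcribe this clip and solve the instructions it contains.",
    ],
)
print(response.text)
```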
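For charts, the model doesn’t render an image for you; instead you request structured JSON that matches a schema and plot it locally. A sketch of that pattern, using the SDK’s response_schema support plus matplotlib (the ChartPoint schema and the prompt are assumptions for illustration):

```python
# Sketch: ask for schema-constrained JSON, then render the chart locally.
from pydantic import BaseModel
import matplotlib.pyplot as plt
from google import genai
from google.genai import types

class ChartPoint(BaseModel):
    label: str
    value: float

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumption: substitute your model id
    contents="Extract the numeric series in this page text as chart points: <page text>",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[ChartPoint],
    ),
)

points = response.parsed  # the SDK parses the JSON into list[ChartPoint]
plt.bar([p.label for p in points], [p.value for p in points])
plt.savefig("chart.png")
```

Rendering the chart yourself keeps styling under your control and avoids depending on the model to produce image bytes.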