Extending LLM Functionality for Charts, Audio Interpretation, and Webpage Analysis

Hi, I’m building a project using Gemini’s generateContent API. The API receives a webpage’s content, including text, images, and audio, and the webpage contains instructions for tasks that need to be solved.

I want an API that lets the LLM also generate charts, interpret audio, and solve problems based on the provided content. Currently I’m using tool-calling, but it supports only a very limited set of actions.

What API or approach should I use to enable richer capabilities like chart generation and audio understanding?

Hi @Tarang_Jhaveri ,

Thank you for reaching out!

To handle audio understanding, chart generation, and rich multimodal reasoning, use a multimodal LLM API rather than relying on tool-calling alone. Gemini’s multimodal models accept text, images, audio, and more in a single generateContent request, as shown in the sketch below.
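
As a rough illustration, here is how an audio clip and a text prompt can go into one request using the google-genai Python SDK. This is a minimal sketch, not production code: the file name is a placeholder, and the model ID should be swapped for whichever model you pick from the list below.

```python
from google import genai
from google.genai import types

# Assumes `pip install google-genai` and an API key set in the
# environment (GEMINI_API_KEY / GOOGLE_API_KEY).
client = genai.Client()

# Placeholder file: the audio extracted from the webpage.
with open("page_audio.mp3", "rb") as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder; substitute your chosen model ID
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
        "Transcribe this clip, then solve the task it describes.",
    ],
)
print(response.text)
```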

Recommended Gemini models:

  • Gemini 3 Pro – the most capable option for multimodal reasoning and complex tasks across text, images, and audio.
  • Gemini 3 Flash – a faster, cheaper alternative with strong multimodal skills and lower latency.

These models can process the media directly (for example, transcribing and reasoning over audio) and, via structured output, return JSON you can feed into a charting library; see the sketch below.
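
For the chart-generation side, one common pattern is to have the model return structured JSON (the LLM itself does not render chart images) and then plot that data locally. A hedged sketch, again assuming the google-genai SDK, a placeholder model ID and prompt, and matplotlib for rendering:

```python
import json

import matplotlib.pyplot as plt
from google import genai
from google.genai import types

client = genai.Client()

# Ask the model to extract plottable data as JSON matching a schema.
response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder; substitute your chosen model ID
    contents="Extract every (year, revenue) pair mentioned in this page text: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "year": {"type": "integer"},
                    "revenue": {"type": "number"},
                },
                "required": ["year", "revenue"],
            },
        },
    ),
)

# Render the structured result as a chart on your side.
rows = json.loads(response.text)
plt.bar([r["year"] for r in rows], [r["revenue"] for r in rows])
plt.xlabel("Year")
plt.ylabel("Revenue")
plt.savefig("chart.png")
```

The division of labor here is deliberate: the model handles understanding and extraction, while deterministic local code handles rendering, which keeps your chart output reproducible.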