Hello, I want to build a Gemini-API-based website summarizer that differs from the existing summarizers, which (IMHO) do a poor job. The differences I want to implement:
- I will visit the website with a real browser, so that dynamically generated HTML is also considered in the summary (other solutions just issue a plain GET request, and that’s also what Gemini’s URL-context feature does)
- I want Gemini to also read and interpret images (others don’t, no idea why; images do contain relevant content, especially in technical articles!)
- The generated summary should not just be text; it should also include links to the most relevant images on the website
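For the scraping step, here is roughly what I have in mind so far. It assumes I already have the fully rendered HTML from a headless browser (e.g. Playwright’s `page.content()`), and it walks the document in order, collecting alternating text and image blocks; the block shape (`{"type": ..., ...}` dicts) is just my own working format, not anything Gemini-specific:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageWalker(HTMLParser):
    """Collect text and <img> references in document order."""

    SKIP = {"script", "style", "noscript"}  # non-content elements to ignore

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.blocks = []        # interleaved {"type": "text"|"image", ...} dicts
        self._skip_depth = 0    # >0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            a = dict(attrs)
            if a.get("src"):
                self.blocks.append({
                    "type": "image",
                    "url": urljoin(self.base_url, a["src"]),  # resolve relative src
                    "alt": a.get("alt", ""),
                })

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            # Merge consecutive text chunks into one block to keep the list short.
            if self.blocks and self.blocks[-1]["type"] == "text":
                self.blocks[-1]["text"] += " " + text
            else:
                self.blocks.append({"type": "text", "text": text})


def extract_blocks(html, base_url):
    """Return the page's text and images as interleaved blocks, in page order."""
    walker = PageWalker(base_url)
    walker.feed(html)
    return walker.blocks
```

(Real pages would probably need more filtering, e.g. dropping nav/footer text, but this captures the ordering idea.)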
I’m aware that this means I have to build my own implementation that scrapes the website’s text and images. However, I’m unsure how exactly the generate_content() call should look for this kind of multi-modal input. The docs show a brief example, which doesn’t help me much. I would need to feed Gemini a (long) contents list in which text and images alternate (in the order they appear on the website). And somehow I would need to give Gemini some context for each image (e.g., the alt text and the URL, so that Gemini can reference the URL in its output).
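My current sketch of the contents assembly looks like this. It turns the scraped blocks (my own format: page-order dicts with a "type" key) into the dict-style parts that the Gemini API accepts, with a small text part before each image carrying its URL and alt text so the model can cite the URL in the summary. The `fetch_image` callable is a hypothetical downloader I’d still have to write; whether this part shape is the best way to do it is exactly what I’m unsure about:

```python
import base64


def build_parts(blocks, fetch_image):
    """Turn interleaved text/image blocks into Gemini-style 'parts' dicts.

    blocks:      page-order list of {"type": "text"|"image", ...} dicts (my format)
    fetch_image: callable url -> (bytes, mime_type); downloading is out of scope here
    """
    parts = []
    for b in blocks:
        if b["type"] == "text":
            parts.append({"text": b["text"]})
        else:
            # Context part just before the image, so the model can
            # reference the URL and alt text in its summary.
            parts.append({"text": f'[Image {b["url"]} | alt: "{b["alt"]}"]'})
            data, mime = fetch_image(b["url"])
            parts.append({"inline_data": {
                "mime_type": mime,
                "data": base64.b64encode(data).decode("ascii"),  # REST wants base64
            }})
    return parts
```

I would then prepend my summarization instructions as the first part and send everything as a single user turn, something like `client.models.generate_content(model=..., contents=[{"role": "user", "parts": [{"text": instructions}] + parts}])` with the google-genai SDK (which, as I understand it, also offers `types.Part.from_bytes(...)` with raw bytes instead of these base64 dicts). Is that the right structure?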
Any suggestions?