The imagen-4.0-generate-preview-05-20 model performs well when generating images from concise descriptions. However, it struggles with long prompts that contain contextual elements needing extraction and interpretation: instead of using the lengthy text as creative guidance, the model naively renders the raw text verbatim as visible text inside the generated image.
I believe a multimodal model would handle this scenario more effectively by internally generating a revised_prompt that is then fed to the image generation component. This appears to be what happens when testing Imagen 4 through the Gemini interface.
Is it possible to access, via the API, a multimodal model that handles image generation with proper contextual prompt processing?
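In the meantime, the behavior can be approximated with a two-step workaround: ask a text model to distill the long contextual brief into a concise visual description, then pass that condensed prompt to Imagen. Below is a minimal sketch assuming the google-genai Python SDK; the text-model ID (gemini-2.0-flash) and the rewrite instruction are illustrative choices, not a confirmed equivalent of whatever the Gemini interface does internally.

```python
# Two-step workaround sketch: emulate an internal revised_prompt step
# by rewriting the long brief before calling Imagen.
# Assumes the google-genai SDK (pip install google-genai); model IDs
# may differ depending on what your API key has access to.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

long_prompt = """<long brief with backstory, context, and constraints...>"""

# Step 1: distill the contextual brief into a concise visual description.
rewrite = client.models.generate_content(
    model="gemini-2.0-flash",  # hypothetical choice of rewriter model
    contents=(
        "Rewrite the following brief as a single concise image-generation "
        "prompt. Describe only what should be visible in the image:\n\n"
        + long_prompt
    ),
)
revised_prompt = rewrite.text

# Step 2: hand the condensed prompt to Imagen.
result = client.models.generate_images(
    model="imagen-4.0-generate-preview-05-20",
    prompt=revised_prompt,
    config=types.GenerateImagesConfig(number_of_images=1),
)

# Save the first generated image to disk.
with open("output.png", "wb") as f:
    f.write(result.generated_images[0].image.image_bytes)
```

This keeps the contextual reasoning in the text model, where it belongs, and only ever shows Imagen a short prompt, which is the regime it handles well. A built-in API option that does this in one call would obviously be preferable.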