Imagen 4.0 API Issue: Long Contextual Prompts Rendered as Text Instead of Creative Guidance - Multimodal Alternative Needed?

The imagen-4.0-generate-preview-05-20 model performs excellently when generating images from concise descriptions. However, it struggles significantly with long prompts containing contextual elements that need to be extracted and interpreted. Instead of using lengthy text as creative guidance, the model naively incorporates the raw text directly into the generated image.

I believe a multimodal model would handle this scenario more effectively by internally generating a revised_prompt that it would then feed to the image generation component. This appears to be what happens when testing Imagen4 through the Gemini interface.

Is it possible to access via API a multimodal model that handles image generation with proper contextual prompt processing?


@Gael0, welcome to the forum.

As you mentioned, Imagen is not a multimodal model; it's a diffusion model, so it renders the prompt text more or less literally. To revise the prompt internally, you can chain a text model with the Imagen model: prompt the text model to understand the scene and produce a concise image prompt, then pass that output to Imagen. That should solve the issue.
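A minimal sketch of that chaining, assuming the google-genai Python SDK (`genai.Client`, `client.models.generate_content`, `client.models.generate_images`); the rewriting instruction and the text-model name (`gemini-2.0-flash`) are illustrative choices, not a prescribed setup:

```python
# Chain a text model with Imagen: distill a long contextual prompt
# into a short image prompt, then send that to the image model.
# The instruction text below is an illustrative example.
REWRITE_INSTRUCTION = (
    "You are writing a prompt for a text-to-image model. "
    "Read the context below, extract the key visual elements, "
    "and return ONE concise image prompt and nothing else.\n\n"
    "Context:\n{context}"
)


def build_rewrite_request(context: str) -> str:
    """Wrap the long contextual prompt in a rewriting instruction."""
    return REWRITE_INSTRUCTION.format(context=context)


def generate_image(client, long_prompt: str):
    """client is a genai.Client from the google-genai SDK (assumed).

    Step 1: a text model turns the long context into a short prompt.
    Step 2: the revised prompt is fed to the Imagen model.
    """
    revised = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed text model name
        contents=build_rewrite_request(long_prompt),
    ).text
    return client.models.generate_images(
        model="imagen-4.0-generate-preview-05-20",
        prompt=revised,
    )
```

The text model does the contextual interpretation that Imagen skips, so the image model only ever sees a short, purely visual description.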