The imagen-4.0-generate-preview-05-20 model performs well when generating images from concise descriptions. However, it struggles with long prompts that contain contextual elements needing extraction and interpretation: instead of using the lengthy text as creative guidance, the model naively renders the raw text verbatim as visible text inside the generated image.
I believe a multimodal model would handle this scenario more effectively by internally generating a revised_prompt that is then fed to the image generation component. This appears to be what happens when testing Imagen 4 through the Gemini interface.
Is it possible to access, via the API, a multimodal model that handles image generation with proper contextual prompt processing?
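In the meantime, the behavior can be approximated with a two-step workaround: ask a text model to distill the long contextual brief into a concise visual description, then pass that condensed prompt to Imagen. Below is a minimal sketch assuming the google-genai Python SDK; the text-model ID (gemini-2.0-flash) and the rewrite instruction are illustrative choices, not a confirmed equivalent of whatever the Gemini interface does internally.

```python
# Two-step workaround sketch: emulate an internal revised_prompt step
# by rewriting the long brief before calling Imagen.
# Assumes the google-genai SDK (pip install google-genai); model IDs
# may differ depending on what your API key has access to.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

long_prompt = """<long brief with backstory, context, and constraints...>"""

# Step 1: distill the contextual brief into a concise visual description.
rewrite = client.models.generate_content(
    model="gemini-2.0-flash",  # hypothetical choice of rewriter model
    contents=(
        "Rewrite the following brief as a single concise image-generation "
        "prompt. Describe only what should be visible in the image:\n\n"
        + long_prompt
    ),
)
revised_prompt = rewrite.text

# Step 2: hand the condensed prompt to Imagen.
result = client.models.generate_images(
    model="imagen-4.0-generate-preview-05-20",
    prompt=revised_prompt,
    config=types.GenerateImagesConfig(number_of_images=1),
)

# Save the first generated image to disk.
with open("output.png", "wb") as f:
    f.write(result.generated_images[0].image.image_bytes)
```

This keeps the contextual reasoning in the text model, where it belongs, and only ever shows Imagen a short prompt, which is the regime it handles well. A built-in API option that does this in one call would obviously be preferable.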