Token counts for image processing inside PDF documents

(Context) I am building a RAG system where I ingest PDF documents containing both text and images. My goal is to convert these PDFs into markdown and use Gemini to explain/describe the images embedded within the documents.

(Question) I need clarification on how the Gemini API counts input tokens for these PDFs, specifically regarding the images:

  1. Tokenization Method: When I send a PDF to the API, are the images converted into a base64 text string first and tokenized as characters (which would be huge)? Or are they processed as native image tokens (visual embeddings)?

  2. Quota Limits: I know a single high-resolution image can exceed 1,000,000 characters when base64 encoded. If the API treats this as text, I would instantly hit the token limit. However, the documentation mentions a 3,000-image limit per prompt. Is the token cost for images counted separately from the 1M-token text context window?

Hi @Panteley_Shmelev, welcome to the community!

Apologies for the delayed response.

  1. Gemini models process PDFs using native vision: each page is rendered and interpreted visually, not converted to a base64 text string, so embedded images are never tokenized character by character. Document Processing
  2. Images and PDF pages share the same 1M-token context window as text, but each page or image has a fixed token cost that is independent of its size in bytes. Tokens
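As a rough sketch of how that fixed cost plays out (assuming the documented flat rate of 258 tokens per PDF page; verify the figure for your model in the current docs):

```python
def pdf_token_estimate(num_pages: int, tokens_per_page: int = 258) -> int:
    """Estimate input tokens for a natively processed PDF.

    Assumes a flat per-page cost (258 tokens per the docs),
    independent of the file size in bytes.
    """
    return num_pages * tokens_per_page

# A 100-page PDF costs roughly 25,800 tokens -- nowhere near the 1M
# window, even though its base64 encoding could run to millions of
# characters.
print(pdf_token_estimate(100))
```

So the base64 size of the file is irrelevant to the token count; only the page/image count matters.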

So, if your PDF is large, upload it with the File API, which supports files up to 2 GB. Using File API

You can also use the media_resolution parameter in the generation config to control costs; setting it to LOW reduces the per-image token cost. Media Resolution
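To see why this matters at RAG-ingestion scale, here is a small sketch assuming the documented per-image costs of roughly 256 tokens at medium resolution and 64 at low (these figures vary by model, so check the current docs):

```python
# Assumed per-image token costs by media resolution (verify per model).
MEDIA_RESOLUTION_TOKENS = {"low": 64, "medium": 256}

def image_budget(num_images: int, resolution: str) -> int:
    """Total input tokens spent on images at a given media resolution."""
    return num_images * MEDIA_RESOLUTION_TOKENS[resolution]

# For a document set with 500 embedded images:
#   medium resolution -> 128,000 tokens
#   low resolution    ->  32,000 tokens
print(image_budget(500, "medium"), image_budget(500, "low"))
```

Low resolution is often sufficient when the model only needs to describe an image at a high level, which fits your use case of generating markdown descriptions.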

Thank you!