Gemini vision pricing

Using gemini-1.5-pro-preview-0514 for Visualizing Files (PDFs/PNGs)

Hi all, I’m currently using gemini-1.5-pro-preview-0514 to process files such as PDFs and PNGs in multimodal prompts. Here’s how I’m currently doing it:

I’m using Vertex AI and converting local files (PDFs and PNGs) to base64 before sending them. Below is a code snippet of how I achieve this:

import base64  # required for the encoding below

def local_image_to_data_url(self, image_path: str) -> str:
    """
    Reads a local file and returns its contents as a base64-encoded string.

    Args:
        image_path (str): The path to the local file.

    Returns:
        str: The base64-encoded contents of the file.
    """
    with open(image_path, "rb") as image_file:
        encoded_string: str = base64.b64encode(image_file.read()).decode("utf-8")
    return encoded_string
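As an aside, despite its name the helper returns a bare base64 string, not an actual data URL; if a real RFC 2397 data URL were ever needed (some APIs expect one), a MIME prefix has to be added. A minimal sketch, with the prefix format being the only assumption:

```python
import base64

def to_data_url(path: str, mime_type: str) -> str:
    """Build an RFC 2397 data URL: MIME prefix plus base64 payload."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type};base64,{payload}"
```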

For each image file, I check the file type and convert it accordingly:

for image_file in self.image_files:
    if image_file.get_filetype() == Filetype.PNG:
        mime_type = "image/png"
    elif image_file.get_filetype() == Filetype.JPG:
        mime_type = "image/jpeg"
    elif image_file.get_filetype() in (Filetype.TruePDF, Filetype.ScanPDF):
        mime_type = "application/pdf"
    else:
        raise ValueError("Unsupported file type")

    encoded_image = self.local_image_to_data_url(image_file.get_filepath())
    print(f"Base64 payload created for {image_file.get_filepath()}")

    parts.append(Part.from_data(data=encoded_image, mime_type=mime_type))
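The per-extension branching could also lean on the standard library’s mimetypes module instead of a custom enum. A sketch (independent of the Filetype enum used above):

```python
import mimetypes

# MIME types the prompt-building code above accepts.
SUPPORTED = {"image/png", "image/jpeg", "application/pdf"}

def guess_supported_mime(path: str) -> str:
    """Map a file path to a supported MIME type, or raise like the loop above."""
    mime, _ = mimetypes.guess_type(path)
    if mime not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {path}")
    return mime
```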

Once all parts are created, I generate the content as follows:

print("Parts created for prompt")
parts.append(self.get_prompt(json_data))

responses = model.generate_content(
    parts,
    generation_config={
        "max_output_tokens": 8192,
        "temperature": 0.2,
        "top_p": 0.95
    },
    safety_settings={
        generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
    stream=False
)

My Concern

I’m currently sending the files as Base64, but I’m concerned about potentially incurring additional costs.

I believe this is how it’s supposed to be done, since I followed the documentation, but could anyone confirm whether this is the correct approach?


Welcome to the forums!

What you’re doing isn’t wrong, but there are some concerns with doing it this way.

First of all, cost is not the concern here. Sending the file as inline base64 data costs no more (in terms of Gemini pricing) than other methods; in some ways it might even be slightly cheaper.

The biggest concern is that you might eventually send files larger than the API allows for inline data. In that case, you’d have to switch to uploading the file to a Google Cloud Storage bucket and providing the “gs://” URI through a fileData part. This incurs GCS storage costs on top of any Gemini costs; however, you can manage those by expiring the uploaded files, either manually or automatically, if that makes sense for your use case.
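A sketch of that switch, assuming the google-cloud-storage and vertexai Python SDKs and a hypothetical bucket name; Part.from_uri wraps the gs:// URI as a fileData part:

```python
def gs_uri(bucket: str, blob: str) -> str:
    """Build the gs:// URI that a fileData part expects."""
    return f"gs://{bucket}/{blob}"

def upload_and_make_part(bucket_name: str, local_path: str,
                         blob_name: str, mime_type: str):
    """Upload a local file to GCS and wrap its URI in a Part.

    Requires GCP credentials; imports are kept local so the pure
    gs_uri helper above works without the SDKs installed.
    """
    from google.cloud import storage                 # pip install google-cloud-storage
    from vertexai.generative_models import Part      # pip install google-cloud-aiplatform

    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
    return Part.from_uri(uri=gs_uri(bucket_name, blob_name), mime_type=mime_type)
```

A lifecycle rule on the bucket (or a deliberate blob.delete() once the response is back) keeps the storage costs bounded.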


Hi,

Thank you for your response.

I am currently having trouble understanding some details from the Gemini documentation, particularly the section about image handling:

Technical Details (Images) from Gemini 1.5 Pro and 1.5 Flash:

  • Both versions support a maximum of 3,600 image files.
  • Supported image formats (MIME types) are:
    • PNG (image/png)
    • JPEG (image/jpeg)
    • WEBP (image/webp)
    • HEIC (image/heic)
    • HEIF (image/heif)
  • Each image counts as 258 tokens, regardless of the image format.

Regarding image sizes:

  • There are no strict limits on pixel dimensions, apart from the model’s context window.
  • Larger images are downscaled to a maximum resolution of 3072x3072, while smaller images are scaled up to 768x768. Original aspect ratios are preserved.
  • There is no cost reduction for smaller images and no performance gain for images with higher resolutions, except in terms of bandwidth usage.

Reference: Gemini API Documentation
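Under the flat per-image rate quoted above, the image portion of the input token count is simple arithmetic. A sketch (the text part of the prompt is tokenized separately and is not covered here):

```python
TOKENS_PER_IMAGE = 258  # flat rate per image, per the quoted documentation

def image_token_estimate(n_images: int) -> int:
    """Tokens contributed by the images alone; prompt text adds more."""
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    return n_images * TOKENS_PER_IMAGE
```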


My Question:

The documentation states that each image equals 258 tokens. However, when I encode my images and PDFs in Base64 as text and prompt the AI, the token count (if considered as text) is significantly higher than 258.

Could you clarify:

  • Is an image always counted as 258 tokens, regardless of its size, when submitted directly (e.g., via Base64 encoding)?
  • Or is it only counted as 258 tokens when I use a Google Cloud Storage (gs:) URI after uploading the image?

If the token count remains the same when using Base64 encoding, I’ll continue with my current method. However, if using a Google Cloud Storage URI is more efficient in terms of token count, I’ll switch to that method.

Thank you for your help!

Best regards

The base64-encoded Part (inlineData) is counted as 258 tokens per image (after the previously discussed preprocessing steps, such as possible rescaling); it is not tokenized as text.

It’s easy to test: The promptTokenCount field of the usageMetadata object lets you see how many input tokens were counted.
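In the Python SDK that check looks roughly like this; a sketch, assuming usage_metadata.prompt_token_count is the SDK-side name for usageMetadata.promptTokenCount, with a stand-in object that only mimics the response shape:

```python
from types import SimpleNamespace

def prompt_token_count(response) -> int:
    """Read the counted input tokens off a generate_content response."""
    return response.usage_metadata.prompt_token_count

# Stand-in shaped like an SDK response, for illustration only:
fake_response = SimpleNamespace(
    usage_metadata=SimpleNamespace(prompt_token_count=258)
)
```

The SDK also exposes a count_tokens method on the model, which should report the same number before the request is actually sent.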

Thanks for your reply,

I’m going to mark this as solved.

Thanks again