Gemini 1.5 Pro charges 6x more tokens than expected on text prompts

Description of the bug:

First, I calculate the ratio between characters and tokens, since it matters whether the model is used in English or, as in my case, in Bulgarian (Cyrillic).

val response = generativeModel.generateContent(content)
val promptTokenCount = response.usageMetadata?.promptTokenCount!!
// Characters-per-token ratio observed on the prompt side
val ratio = promptText.length.toDouble() / promptTokenCount

Although I have limited the candidates to one, I count the characters across all candidates, as shown below.
I compute allCandidateCharsCount by including both the characters of the text and those of functionCalls.args.values.

// Characters produced by the candidate: response text plus function-call argument values
val responseTextLength = response.text?.length ?: 0
val responseArgsSum = response.functionCalls.sumOf { call -> call.args.values.mapNotNull { v -> v?.length }.sum() }
val allCandidateCharsCount = responseTextLength + responseArgsSum
// Tokens the candidate should cost if it tokenized at the same ratio as the prompt
val expectedCandidatesTokenCount = allCandidateCharsCount / ratio
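
The reported figure I compare against comes from the same metadata object. A minimal sketch of that last step, assuming the usageMetadata type also exposes a candidatesTokenCount field alongside promptTokenCount:

// Reported cost of the generated candidate, straight from the response metadata
val candidatesTokenCount = response.usageMetadata?.candidatesTokenCount ?: 0
// In my tests this reported value is roughly 6x the expected value computed above
println("expected ≈ $expectedCandidatesTokenCount, reported = $candidatesTokenCount")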

I have used the following model configuration:

val generativeModel = GenerativeModel(
    modelName = "gemini-1.5-pro-latest",
    apiKey = BuildConfig.apiKey,
    generationConfig = generationConfig {
        temperature = 0.9f
        maxOutputTokens = 4096
        topP = 0.9f
        candidateCount = 1
    },
    tools = listOf(Tool(listOf(functionDeclaration))),
    toolConfig = ToolConfig(FunctionCallingConfig(FunctionCallingConfig.Mode.AUTO)),
)

Actual vs expected behavior:

Using large language models is quite expensive, and costs must be carefully optimized.

Regardless of solutions like Context Caching etc., if the token accounting is not correct, it can be a serious waste of money!

In our case, if you expect to pay $100 at the end of the month, a token miscalculation may leave you paying $600 for the same thing.

I expect to pay $100 per month, but as a result of the token calculation error, I pay $600.

Any other information you’d like to share?

It would be a good idea to provide a credit, as some other products do, so that real costs can be seen. On the free plan, the actual token consumption is not visible; it is not reported anywhere in the billing.

Hi @karloti

Welcome to the dev forum.

Here’s how you can count tokens before making a generate content call.

After you call GenerativeModel.generate_content (or ChatSession.send_message), the response object has a usage_metadata attribute containing both the input and output token counts (prompt_token_count and candidates_token_count).
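
In the Kotlin client from the original post, the equivalent flow is roughly the following (a sketch; countTokens is the Kotlin SDK's pre-flight counterpart, and the usageMetadata fields mirror the Python names above):

// Count the prompt's tokens before committing to a generateContent call
val countResponse = generativeModel.countTokens(content)
println("prompt will cost ${countResponse.totalTokens} tokens")

// After generation, usageMetadata reports both sides of the exchange
val response = generativeModel.generateContent(content)
val usage = response.usageMetadata
println("prompt=${usage?.promptTokenCount}, candidates=${usage?.candidatesTokenCount}")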

Welcome to the forums!

An interesting analysis, though you make a couple of conclusions that… I’m not sure I understand.

In your first step you’re computing a ratio of characters per token. But what sample set is that based on? That number certainly varies with the actual content. So trying to compare it to a ratio generated by… function calling? Did I get that right? It may not prove to be accurate unless the characters between the two are similar.

I agree with your assessment that there needs to be more transparency about actual token usage, but there seem to be bugs in that reporting right now.

Are you also saying that you’re seeing discrepancies between the reported token usage and the billed token usage?

With each prompt I get a response as well as METADATA information.

For the prompt, I can count the characters and see how many tokens were needed.

So I find a ratio, which I notice is about 6x different from the ratio of received tokens to characters.

Usually about 4 characters consume one token on average, but for me, since the text is in Cyrillic, the ratio differs by about 2x. However, as far as I can see, this holds only for sent tokens. That doesn’t matter, though, because I’m comparing the ratio for sent tokens against the ratio for received tokens, so there shouldn’t be much of a difference between them.

Prompt Ratio / Generated Ratio = 215.75%

The tokens are taken from the metadata, and I count the characters in the program, so the formula above is stable. Sometimes the response contains additional information, such as citations, which I have not included, but that information is insignificant.
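
Concretely, the quotient above is computed like this (a rough sketch, reusing the variable names from my earlier code):

val promptRatio = promptText.length.toDouble() / promptTokenCount              // characters per prompt token
val generatedRatio = allCandidateCharsCount.toDouble() / candidatesTokenCount  // characters per generated token
val quotient = promptRatio / generatedRatio                                    // the percentage shown above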

I don’t know where this large expenditure of received tokens comes from.

Also, the generated tokens are much more expensive than the prompt ones.

I continue to argue that the expensive generated tokens are being accounted for unfairly.

I would appreciate it if someone could do the same test.

That’s exactly what I did. I used the metadata and based my calculation on the tokens reported to me. If you wish, you can see my reply to:

Yes, this is because most models have tokenizers optimised for English and thus may consume more tokens when using non-English languages.

That’s why I suggested using the token counter API call before making the generate content call.

The character-to-token ratio is just a rough guide for English usage and shouldn’t be relied upon, especially for non-English languages.

I am not relying on an assumed ratio; I am using the actual one reported in the metadata.

Let’s say that for 1000 characters you used 250 tokens in your prompt. Why, when you get 1000 characters generated back, are 1250 tokens used?!

Let’s say the characters in both cases are in the same language, English for example.
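
To put numbers on that example: the prompt works out to 250 / 1000 = 0.25 tokens per character, while the generated text works out to 1250 / 1000 = 1.25 tokens per character, for the same language and the same number of characters.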

I apologize, I’m still missing your point or how you get some of these numbers.

If we just look at tokens, let’s use these numbers:

You have:
1000 input tokens
200 output tokens

You’re saying you are being billed as if it was:
1600 tokens

Did I get that right?

If so, output tokens are billed at three times the rate of input tokens, so this sounds correct.
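
To spell that out with the numbers above: 1000 input tokens plus 200 output tokens billed at three times the input rate gives 1000 + 3 × 200 = 1600 input-equivalent tokens, which matches the billed figure.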

Or are you saying something else?

Let’s say that for 1000 characters you used 250 tokens in your prompt. Why, when you get 1000 characters generated back, are 1250 tokens used?!

Let’s say the characters in both cases are in the same language, English for example.

I am using an automatic translator, for which I apologize if there is any ambiguity in the text.

In other words:
If I send Gemini a prompt of 1000 characters and it returns 1000 characters, we are talking only about characters, not tokens. The question is: why does the metadata show that the prompt used something like 250 tokens, while the result (which has the same number of characters as the prompt) used 1250 tokens to generate? That is, for the same number of characters, about 6x more tokens are counted.
Please see the formula. It relates tokens and characters both for the prompt and for the generated result. There shouldn’t be that much of a difference; the proportion should be almost 1:1.
I probably can’t explain it well, so I wrote the code in Kotlin. You could also experiment with the returned metadata and measure the characters you sent and received.
Billing the tokens is another matter. It is normal for generated tokens to be more expensive, but here many more tokens per character are consumed when generating. There is probably something happening on the server side that accounts for such large consumption.