Output token limit for fine-tuned Gemini 1.5 Flash

After fine-tuning Gemini 1.5 Flash, I’m facing an issue with text generation. Even when setting the maximum output length to 8192, the model stops generating after only 1024 tokens. Are there any specific limitations on fine-tuned models that could explain this behavior?


Would you like to share more about this?

Yes, the model is expected to generate a complete output, but it keeps stopping after generating only 1024 tokens, resulting in incomplete responses. I haven't found any information about this issue so far. Is it a limitation for free users?

In my experience prompting Gemini 1.5 Flash (Gemini API), the maximum output token setting does not seem to be honored by the system.

I have tried a similar scenario to yours, specifying 8192 output tokens. When the model ran into the infinite-repetition issue, it did not stop at the specified 8192 tokens but kept going until it reached the free-tier maximum allowed output tokens.

If you are trying to force-stretch Gemini's response into a longer one using the output token parameter, it will not work (at least that's what I experienced). YMMV.

Let me know if you eventually find a way to stretch Gemini's response to the desired length programmatically.

Please note that specifying an 8,000-token limit does not guarantee a response of exactly 8,000 tokens. Instead, it indicates the maximum number of tokens the model can generate. The actual response length may be shorter, depending on the content and context of the query.
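If you want to confirm why generation ended, the response exposes a finish reason. A minimal sketch, assuming the google.generativeai client (model name and prompt are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="...")  # your API key
    model = genai.GenerativeModel("tunedModels/MY_MODEL")  # placeholder name

    response = model.generate_content("Your prompt here")
    # STOP means the model finished on its own; MAX_TOKENS means it hit the cap
    print(response.candidates[0].finish_reason)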

It appears that there is an issue with the output length in Google AI Studio. When the output length was set to 100 tokens, the response was 3,000+ tokens.

You can see the video here:

https://www.veed.io/view/3d7c1866-0ed6-48ab-98b5-e1ebc560fe1e?panel=share

It really depends on how your code is functioning. Maybe share the code so we have more context to answer the question properly.

Nevertheless, here are some tips on how this happens:

  • While Gemini 1.5 Flash supports up to 8192 output tokens, certain fine-tuning configurations or input constraints might cause earlier stopping. Double-check that the fine-tuning process hasn't limited the model's output length during configuration.
  • The model may also stop generating prematurely because of stop sequences, or because sampling parameters like temperature or top_p are influencing the output. It's important to review these settings to make sure they aren't cutting the response short; see the config sketch after this list.
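For example, this is roughly what a sanity-checked config could look like (values are illustrative; stop_sequences is shown explicitly to rule out accidental truncation):

    generation_config = {
        "temperature": 1.0,
        "top_p": 0.5,
        "max_output_tokens": 8192,
        "stop_sequences": [],  # any accidental entry here would truncate output early
    }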

Hello, thank you for your response. Here is my code:

    generation_config = {
        "temperature": 1.0,
        "top_p": 0.5,
        "top_k": 64,
        "max_output_tokens": 8096,  # note: the documented maximum for 1.5 Flash is 8192
        "response_mime_type": "text/plain",
    }

    model = genai.GenerativeModel(
        model_name="tunedModels/MY_MODEL",
        generation_config=generation_config,
    )
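For what it's worth, the cutoff can be confirmed from the response metadata; a rough sketch (the prompt is a placeholder):

    response = model.generate_content("Your prompt here")  # placeholder prompt

    # usage_metadata reports the actual token counts for the call,
    # and finish_reason is MAX_TOKENS when the output was capped
    print(response.usage_metadata.candidates_token_count)
    print(response.candidates[0].finish_reason)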

For Gemini Flash fine-tuning, we can't upload a data point with more than 5,000 characters in the output! From what I have seen, the maximum output length is only 1024 tokens; when it reaches that, it stops generating.

Mhmmmm, you could try working around this limitation in your code by implementing a loop that continues generating text until you reach your desired length or a stopping condition.

Maybe try this?

    def generate_long_text(model, prompt, target_length=8096):
        """Request continuations until the combined text reaches target_length characters."""
        generated_text = ""
        while len(generated_text) < target_length:
            # Feed only the last 512 characters as continuation context,
            # so the request does not grow without bound on each iteration
            context = generated_text[-512:]
            response = model.generate_content(prompt + context)
            if response.parts:  # guard: .text raises if the response is blocked/empty
                generated_text += response.text
            else:
                break
        return generated_text[:target_length]  # trim to target length if exceeded

    long_text = generate_long_text(model, "Your initial prompt here")

This way we make multiple requests, and by reusing only the previous 512 characters we lose the context and the purpose of the generation, even if it does help produce more tokens! As I said, the fine-tuning dataset must contain outputs with fewer than 5,000 characters; at roughly four characters per token that is about 1,250 tokens, so maybe this explains the ~1024-token limitation?
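A chat session might preserve the context better, since the client resends the whole conversation history on every request. A rough sketch, assuming the same tuned model (I haven't verified this on tuned models):

    import google.generativeai as genai

    model = genai.GenerativeModel("tunedModels/MY_MODEL")  # placeholder name
    chat = model.start_chat()

    response = chat.send_message("Your initial prompt here")
    full_text = response.text

    # While the model keeps getting cut off by the token cap, ask it to continue;
    # the chat object replays the full history, so context is preserved
    while response.candidates[0].finish_reason.name == "MAX_TOKENS":
        response = chat.send_message("Continue exactly where you left off.")
        full_text += response.text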

Yeah, but you can increase the 512-character window; it's just an example. Also try experimenting with different max_output_tokens values (e.g., 1024, 2048) to see if the behavior changes; something like the sweep below would show where the output actually stops.
I think this could also be an internal problem on Google's side.
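A minimal sweep along those lines (model name and prompt are placeholders):

    import google.generativeai as genai

    # For each cap, print how many tokens actually came back and why it stopped
    for limit in (512, 1024, 2048, 4096, 8192):
        model = genai.GenerativeModel(
            model_name="tunedModels/MY_MODEL",  # placeholder name
            generation_config={"max_output_tokens": limit},
        )
        response = model.generate_content("Your long-form prompt here")
        print(limit,
              response.usage_metadata.candidates_token_count,
              response.candidates[0].finish_reason)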

For long inputs we would exceed the 8k tokens! I have already tried multiple values: if we set a value below 1024, it stops at the provided value; otherwise it is limited to 1024.

Any luck here? I'm experiencing the same thing.

Same issue using the Python API with base_model = "models/gemini-1.5-flash-001-tuning".
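In case it helps reproduce, this is roughly the tuning setup from the official fine-tuning quickstart; the id and training data below are placeholders:

    import google.generativeai as genai

    # Placeholder dataset; real training examples are capped at 5,000 output characters
    operation = genai.create_tuned_model(
        source_model="models/gemini-1.5-flash-001-tuning",
        training_data=[
            {"text_input": "1", "output": "2"},
            {"text_input": "2", "output": "3"},
        ],
        id="my-tuned-model",  # placeholder id
        epoch_count=5,
        batch_size=4,
        learning_rate=0.001,
    )
    result = operation.result()  # blocks until tuning finishes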