Output token limit for fine-tuned Gemini 1.5 Flash

After fine-tuning Gemini 1.5 Flash, I’m facing an issue with text generation. Even when setting the maximum output length to 8192, the model stops generating after only 1024 tokens. Are there any specific limitations on fine-tuned models that could explain this behavior?


Would you like to share more about this?

Yes, the model is expected to generate a complete output, but it keeps stopping after generating only 1024 tokens, resulting in incomplete responses. I haven't found any information about this issue so far. Is it a limitation for free users?

In my experience prompting Gemini 1.5 Flash (Gemini API), the maximum output token setting does not seem to be honored by the system.

I have tried a similar scenario to yours, specifying 8192 output tokens. When the model ran into the infinite-repetition issue, it did not stop at the specified 8192 tokens but kept going until it reached the free-tier maximum allowed output tokens.

If you are trying to force-stretch Gemini's response into a longer one using the output token parameter, it will not work (at least that's what I experienced). YMMV.

Let me know if you eventually find a way to stretch Gemini's response to the desired length programmatically.

Please note that specifying an 8,000-token limit does not guarantee a response of exactly 8,000 tokens. Instead, it indicates the maximum number of tokens the model can generate. The actual response length may be shorter, depending on the content and context of the query.
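If you want to confirm why generation ended, the response exposes a finish reason. A minimal sketch, assuming the google.generativeai client (model name and prompt are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="...")  # your API key
    model = genai.GenerativeModel("tunedModels/MY_MODEL")  # placeholder name

    response = model.generate_content("Your prompt here")
    # STOP means the model finished on its own; MAX_TOKENS means it hit the cap
    print(response.candidates[0].finish_reason)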

It appears that there is an issue with the output length in Google AI Studio. When the output length was set to 100 tokens, the response was 3,000+ tokens.

You can see the video here:

https://www.veed.io/view/3d7c1866-0ed6-48ab-98b5-e1ebc560fe1e?panel=share

It really depends on how your code is functioning. Maybe share the code so we have more context to answer the question properly.

Nevertheless, here are some tips on how this happens:

  • While Gemini 1.5 Flash supports up to 8192 output tokens, certain fine-tuning configurations or input constraints might cause earlier stopping. Double-check that the fine-tuning process hasn't limited the model's output length during configuration.
  • The model may also stop generating prematurely because of stop sequences, or because sampling parameters like temperature or top_p are influencing the output. It's important to review these settings to make sure they aren't cutting the response short; see the config sketch after this list.
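For example, this is roughly what a sanity-checked config could look like (values are illustrative; stop_sequences is shown explicitly to rule out accidental truncation):

    generation_config = {
        "temperature": 1.0,
        "top_p": 0.5,
        "max_output_tokens": 8192,
        "stop_sequences": [],  # any accidental entry here would truncate output early
    }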

Hello, thank you for your response. Here is my code:

    generation_config = {
        "temperature": 1.0,
        "top_p": 0.5,
        "top_k": 64,
        "max_output_tokens": 8096,  # note: the documented maximum for 1.5 Flash is 8192
        "response_mime_type": "text/plain",
    }

    model = genai.GenerativeModel(
        model_name="tunedModels/MY_MODEL",
        generation_config=generation_config,
    )
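For what it's worth, the cutoff can be confirmed from the response metadata; a rough sketch (the prompt is a placeholder):

    response = model.generate_content("Your prompt here")  # placeholder prompt

    # usage_metadata reports the actual token counts for the call,
    # and finish_reason is MAX_TOKENS when the output was capped
    print(response.usage_metadata.candidates_token_count)
    print(response.candidates[0].finish_reason)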

For Gemini Flash fine-tuning, we can't upload a data point with more than 5,000 characters in the output! From what I have seen, the maximum output length is only 1024 tokens; when it reaches that, it stops generating.

Mhmmmm, you could try working around this limitation in your code by implementing a loop that continues generating text until you reach your desired length or a stopping condition.

Maybe try this?

    def generate_long_text(model, prompt, target_length=8096):
        """Request continuations until the combined text reaches target_length characters."""
        generated_text = ""
        while len(generated_text) < target_length:
            # Feed only the last 512 characters as continuation context,
            # so the request does not grow without bound on each iteration
            context = generated_text[-512:]
            response = model.generate_content(prompt + context)
            if response.parts:  # guard: .text raises if the response is blocked/empty
                generated_text += response.text
            else:
                break
        return generated_text[:target_length]  # trim to target length if exceeded

    long_text = generate_long_text(model, "Your initial prompt here")

This way we make multiple requests, and by reusing only the previous 512 characters we lose the context and the purpose of the generation, even if it does help produce more tokens! As I said, the fine-tuning dataset must contain outputs with fewer than 5,000 characters; at roughly four characters per token that is about 1,250 tokens, so maybe this explains the ~1024-token limitation?
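A chat session might preserve the context better, since the client resends the whole conversation history on every request. A rough sketch, assuming the same tuned model (I haven't verified this on tuned models):

    import google.generativeai as genai

    model = genai.GenerativeModel("tunedModels/MY_MODEL")  # placeholder name
    chat = model.start_chat()

    response = chat.send_message("Your initial prompt here")
    full_text = response.text

    # While the model keeps getting cut off by the token cap, ask it to continue;
    # the chat object replays the full history, so context is preserved
    while response.candidates[0].finish_reason.name == "MAX_TOKENS":
        response = chat.send_message("Continue exactly where you left off.")
        full_text += response.text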

Yeah, but you can increase the 512-character window; it's just an example. Also try experimenting with different max_output_tokens values (e.g., 1024, 2048) to see if the behavior changes; something like the sweep below would show where the output actually stops.
I think this could also be an internal problem on Google's side.
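A minimal sweep along those lines (model name and prompt are placeholders):

    import google.generativeai as genai

    # For each cap, print how many tokens actually came back and why it stopped
    for limit in (512, 1024, 2048, 4096, 8192):
        model = genai.GenerativeModel(
            model_name="tunedModels/MY_MODEL",  # placeholder name
            generation_config={"max_output_tokens": limit},
        )
        response = model.generate_content("Your long-form prompt here")
        print(limit,
              response.usage_metadata.candidates_token_count,
              response.candidates[0].finish_reason)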

For long inputs we would exceed the 8k tokens! I have already tried multiple values: if we set a value below 1024, it stops at the provided value; otherwise it is limited to 1024.

Any luck here? I'm experiencing the same thing.

Same issue using the Python API with base_model = "models/gemini-1.5-flash-001-tuning".
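In case it helps reproduce, this is roughly the tuning setup from the official fine-tuning quickstart; the id and training data below are placeholders:

    import google.generativeai as genai

    # Placeholder dataset; real training examples are capped at 5,000 output characters
    operation = genai.create_tuned_model(
        source_model="models/gemini-1.5-flash-001-tuning",
        training_data=[
            {"text_input": "1", "output": "2"},
            {"text_input": "2", "output": "3"},
        ],
        id="my-tuned-model",  # placeholder id
        epoch_count=5,
        batch_size=4,
        learning_rate=0.001,
    )
    result = operation.result()  # blocks until tuning finishes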