Hi everyone, we're building a startup that uses the Gemini API for a specialized OCR use case. Instead of processing full documents, we use the Gemini 1.5 Flash-8B model to read specific numbers from meters and displays. As we approach our official launch, we're running into significant response-latency challenges that affect our real-time user experience. We're currently on Tier 1, and we urgently need higher rate limits (RPM/TPM) to support our growing user base and avoid "429 Too Many Requests" errors. We'd appreciate any advice on how to accelerate our upgrade to Tier 2 or Tier 3. Additionally, since we're extracting very small amounts of data (simple digits) rather than full pages, we're looking for any specific configurations or best practices to minimize time-to-first-token and overall latency in production. Thanks in advance for your help!
Hi @Kontent_Room, welcome to the AI Forum.
Since you are getting 429 errors, first verify that your traffic is within the model's rate limits, and request a quota increase if needed. To keep transient 429 responses from reaching users, you can also implement retries with exponential backoff.
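A minimal backoff sketch in plain Python (SDK-agnostic: `retry_with_backoff` is a name chosen here, not a library function, and you would substitute the 429/ResourceExhausted exception type your client library actually raises):

```python
import random
import time


def retry_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=32.0,
                       retriable=(Exception,)):
    """Retry `call` on retriable errors, sleeping with exponential backoff
    plus full jitter between attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except retriable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # Cap the window at max_delay, then sleep a random amount
            # inside it ("full jitter") so clients don't retry in lockstep.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, delay))


# Usage (hypothetical call site):
# result = retry_with_backoff(lambda: model.generate_content(prompt),
#                             retriable=(SomeRateLimitError,))
```

With the default `base_delay=1.0`, attempts are spaced by random sleeps drawn from windows of up to 1 s, 2 s, 4 s, and so on, which is usually enough for per-minute rate limits to reset.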
Here are the best practices to minimize the time-to-first-token and overall latency in production:
- Use system instructions to control the length of the response.
- Shorter prompts reduce time to first token.
- Use context caching for repeated queries.
- Experiment with the `temperature` parameter to control the randomness of the output.

Please let us know if you are still facing this issue. Thanks!
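For a digits-only use case, the practices above can be sketched as a request configuration. This is a minimal sketch assuming the `google-generativeai` Python SDK; the model name and every value here are illustrative choices, not prescribed settings, so verify names against the current SDK docs:

```python
# Hedged sketch of a low-latency request shape for digit-only OCR.
# All names and values are assumptions for illustration.

# A short, firm system instruction keeps the response to a handful of tokens.
SYSTEM_INSTRUCTION = (
    "You read one meter display from an image. "
    "Reply with the digits only, no words, no punctuation."
)


def make_generation_config() -> dict:
    """Generation settings tuned for tiny, deterministic outputs."""
    return {
        # A few digits fit well under 16 tokens; a tight cap bounds
        # worst-case generation time.
        "max_output_tokens": 16,
        # No randomness needed when the answer is a fixed digit string.
        "temperature": 0.0,
    }


# Illustrative call site (needs an API key, so left commented out):
# import google.generativeai as genai
# genai.configure(api_key=...)
# model = genai.GenerativeModel("gemini-1.5-flash-8b",
#                               system_instruction=SYSTEM_INSTRUCTION)
# response = model.generate_content([image_part],
#                                   generation_config=make_generation_config())
```

If the same system instruction and reference material are sent with every request, that shared prefix is also a natural candidate for context caching.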