Embedding Accuracy, Model Drift, and API Latency Issues When Using Google AI for My Texas Roadhouse Menu Website

I’m running into a series of AI-related issues while building a feature for my Texas Roadhouse menu website that uses Google’s Generative AI APIs. The site relies on semantic search and recommendation features powered by embeddings generated from Gemini models. The first problem is that embedding similarity results have become noticeably inconsistent over the last two weeks. Queries that previously returned highly relevant menu items—like “juicy steak” or “light salad”—are now returning unrelated results or lower-ranked matches. I’m unsure whether the embedding model behavior has changed, or if I’m accidentally processing the text incorrectly before sending it to the API.

Another issue I’m seeing is what feels like “model drift” in the responses. I use Gemini for generating short menu descriptions and categorizing items based on nutrition and ingredients. The same prompt is suddenly producing different-structured responses, sometimes missing fields that my backend expects. For example, the model occasionally skips the “category” or “featured tags” section even though the prompt template hasn’t changed. I’m not sure if this is due to new model versions rolling out or if I should be pinning a specific model explicitly to avoid this variability.

I’m also experiencing random spikes in API latency. Most requests return in under a second, but some take 6–10 seconds, which breaks the smooth browsing experience on my Texas Roadhouse menu pages. The AI-powered parts of the site load recommendations and contextual insights dynamically, so slow API responses cause the page layout to jump or stall. I’m caching aggressively on my end, but the inconsistency makes it hard to maintain predictable performance. I’ve checked my network logs, and nothing indicates packet loss or client-side slowdowns.

Another strange problem is happening with the embeddings endpoint. When I send nearly identical text strings—like two menu descriptions with only minor word differences—the API sometimes generates dramatically different vector magnitudes or embeddings that don’t cluster near each other. This breaks my similarity search logic, which relies on embeddings being stable and consistent. I’ve verified that the request body is identical except for the modified words, so I’m unsure whether this is expected behavior, quantization variance, or something else.

I’m also seeing unexpected token usage patterns. For some menu-generation prompts, the token count reported by the API is higher than expected even though the prompt is static and relatively short. This results in occasional quota overruns during peak traffic when many users view menu pages at the same time. What confuses me is that the exact same request sometimes produces different token counts across calls. I don’t know if this is caused by hidden metadata, internal formatting, or a model update changing tokenization rules.

Overall, I’m trying to determine whether these issues are related to API changes, model versioning, improper request formatting, or something in my embedding storage or caching architecture. The Texas Roadhouse menu website depends heavily on these AI-driven features, so I need stable behavior to keep recommendations and menu search accurate. If anyone has experience with embedding consistency, API latency spikes, or unexpected variation in Gemini responses, I’d appreciate any guidance, debugging steps, or best practices for stabilizing output across model updates. Sorry for the long post!

Hi Joe,

I’ve worked on a few high-availability systems that had similar symptoms, and what you’re describing feels less like a broken API and more like entropy creeping into the system over time. When you move features like your Texas Roadhouse menu search into production, that’s usually the moment where things have to shift from “probabilistic by default” to deterministic on purpose.

Here’s a simple, practical way to stabilize what you’re seeing:


  1. Stop the Model Drift (Pin Exact Versions)

The fact that your results changed over the last couple of weeks is a big tell. Aliases like models/gemini-pro or models/embedding-001 are just pointers, and Google can update what they point to without warning.

What to do:

Don’t use “latest” or alias models in production.

Lock to a specific version, like gemini-1.5-pro-001 or text-embedding-004.

That way, the same “juicy steak” query today gets processed exactly the same way next month.
Also, it’s worth subscribing to Google’s model change/deprecation notices so nothing sneaks up on you later.
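One way to make the pinning stick is to fail fast at startup if an alias name ever sneaks back into config. This is just a sketch; the helper name `assert_pinned` and the specific alias list are my own, so adjust them to whatever model names you actually use:

```python
# Exact, pinned model versions -- never aliases or "-latest".
PINNED_GENERATION_MODEL = "gemini-1.5-pro-001"
PINNED_EMBEDDING_MODEL = "text-embedding-004"

# Known alias-style names that Google can silently repoint.
ALIAS_NAMES = {"gemini-pro", "models/gemini-pro", "models/embedding-001"}

def assert_pinned(model_name: str) -> str:
    """Raise at startup if a model name is an alias that could drift under us."""
    if model_name in ALIAS_NAMES or model_name.endswith("-latest"):
        raise ValueError(
            f"Refusing alias model name {model_name!r}; pin an exact version."
        )
    return model_name
```

Then every place you construct a client goes through the guard, e.g. `genai.GenerativeModel(assert_pinned(PINNED_GENERATION_MODEL))`, so a drifting alias becomes a loud deploy-time error instead of a quiet ranking change weeks later.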


  2. Fix the Vector Magnitude Issue (Normalize Everywhere)

If identical text is producing vectors with different magnitudes, that’s actually expected with raw embeddings. The problem shows up when you’re using Dot Product for similarity—longer vectors can dominate your rankings.

What to do:

Normalize every vector to unit length:

Once when you store the embeddings

Again at query time for the search vector

Mathematically, that’s just:

\hat{v} = \frac{v}{\|v\|}

Once everything is normalized, Dot Product becomes equivalent to Cosine Similarity, which is what you actually want for semantic search. This alone usually clears up a lot of weird ranking behavior.
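Here’s a minimal, dependency-free sketch of that normalization step (in production you’d likely use NumPy, but the math is the same):

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Two vectors pointing the same direction but with very different raw
# magnitudes -- the situation that was skewing the rankings.
a = normalize([3.0, 4.0])   # raw magnitude 5.0
b = normalize([0.3, 0.4])   # raw magnitude 0.5
# After normalizing once at store time and once at query time, dot(a, b)
# is their cosine similarity (1.0 here), and magnitude can no longer
# dominate the similarity score.
```

Apply `normalize` once when you write embeddings into storage and once on each incoming query vector, and the rankings stop depending on raw vector length.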


  3. Enforce Structure (Stop Missing JSON Fields)

For things like menu descriptions, nutrition data, or categories, you don’t want to rely on the model “doing the right thing” from a prompt alone.

What to do:

Use controlled generation instead of free-form text:

Set temperature = 0.0

Set response_mime_type = "application/json" at the API level

If you’re in Python, pass a Pydantic model as the response schema

That way, the model is forced to either:

Fill every required field correctly, or

Fail loudly so you can handle it cleanly

No more silent partial JSON.


  4. Hide the 6–10s Latency from Users

You can’t control Google’s cold starts or inference spikes—but you can make sure your users never feel them.

What to do:

Load your static menu data (images, prices, names) instantly from your normal database.

Fire the AI request in the background.

Attach recommendations, tags, or nutrition info only when the response comes back.

The page should never wait on an LLM call just to become usable. This one change alone can completely transform how “fast” the app feels.
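The decoupling pattern looks roughly like this. It’s a simplified asyncio sketch: `fetch_recommendations` is a stand-in for the real Gemini call, and a real web handler would return the static page first and hydrate the AI section client-side, but the budget-and-degrade logic is the same:

```python
import asyncio

async def fetch_recommendations(item_id: str) -> list[str]:
    """Stand-in for the slow Gemini request."""
    await asyncio.sleep(0.01)  # simulate model latency
    return ["Road Kill", "Caesar Salad"]

async def render_menu_page(item_id: str) -> dict:
    # 1. Static data (names, prices, images) comes straight from the DB.
    page = {"item": item_id, "recommendations": None}

    # 2. Fire the AI request in the background.
    task = asyncio.create_task(fetch_recommendations(item_id))

    # 3. Give it a short budget; if Google spikes to 6-10s, ship the page
    #    without recommendations and let the client fill them in later.
    try:
        page["recommendations"] = await asyncio.wait_for(task, timeout=2.0)
    except asyncio.TimeoutError:
        pass  # degrade gracefully instead of stalling the layout
    return page
```

The key property: a slow or failed AI call can only ever cost you the recommendations widget, never the whole page.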


  5. Stabilize Token Usage (Clean Your Inputs)

A surprising amount of token variance comes from invisible formatting issues.

What to do:

Strip leading/trailing whitespace

Collapse duplicate spaces

Normalize line breaks

Standardize casing and character encoding

“Steak”, " Steak", and "Steak " can tokenize differently. Cleaning that up makes both costs and behavior more predictable.


If you implement these five steps—version pinning, vector normalization, schema enforcement, UX decoupling, and input sanitization—your recommendation engine should settle down and behave in a much more stable, predictable way.

If you want, I’m happy to help translate any of this directly into code for your stack.

Thanks a ton for the detailed breakdown; this is honestly one of the most helpful explanations I’ve gotten so far. The point you made about alias models quietly shifting under the hood makes a lot of sense, especially because my issues started without any code changes on my end. I’ll start pinning exact versions for both the text-generation and embedding models so I can at least eliminate that variable.

I was relying on raw dot-product scoring and assumed the embeddings would stay stable by themselves. I’ll go ahead and normalize everything at storage and query time so the similarity search behaves consistently again.

I really like the structured-JSON approach too. A lot of my headaches come from the model occasionally skipping fields, so forcing a strict schema + temperature 0 should prevent those surprise format changes. And your suggestion about decoupling the AI calls from the initial page load is something I wish I had done earlier. I’m going to rework that part of the flow so users don’t feel those latency spikes.

This gives me a solid path to stabilize the Texas Roadhouse menu features. If I run into trouble implementing the schema enforcement or the normalization pipeline, I might take you up on that offer for code help. Thanks again!
