Issue with text-embedding-004 Returning Identical Vectors for Specific Languages

N.A · September 12, 2024, 8:54am

I have encountered an issue where the text-embedding-004 model returns identical vectors when inputting text in certain languages.

Please refer to the attached image for more details.

I am accessing text-embedding-004 using an API key issued through Google AI Studio.
google-gemini/cookbook/blob/main/quickstarts/Embeddings.ipynb

It appears that for certain languages, presumably those written without spaces between words (such as Chinese, Thai, and Japanese), the model now returns the same vector for every input.

I am certain that this issue did not exist as recently as two days ago.
Has anyone else experienced a similar problem, or does anyone have any insight into it?

I would greatly appreciate any advice or information on this matter. Thank you in advance for your help.

afirstenberg · September 12, 2024, 2:01pm

Can you post text that illustrates this problem?
The images are good - but having the exact text so we (and Google) can cut and paste it to test this out will go a long way to helping figure out what might be happening.

N.A · September 13, 2024, 12:06am

Thank you for your reply and advice.

I apologize for not including the code earlier.

I used the cookbook from the following URL, and my execution environment is Google Colab:

cookbook/quickstarts/Embeddings.ipynb at main · google-gemini/cookbook · GitHub

The only modification I made was to the list of input strings (content). Below is the code necessary to reproduce the issue:

!pip install -q -U "google-generativeai>=0.7.2"

import google.generativeai as genai

from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

# Different embeddings
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
        'Hello!',
        'Good evening!',
        'Good morning!'
    ]
)

for embedding in result['embedding']:
    print(str(embedding)[:50], '... TRIMMED]')

# Same embeddings, Japanese
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
        'こんにちは！',
        'こんばんは！',
        'おはようございます！'
    ]
)

for embedding in result['embedding']:
    print(str(embedding)[:50], '... TRIMMED]')

# Different embeddings, Spanish
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
        '¡Hola!',
        '¡Buenas noches!',
        '¡buen día!'
    ]
)

for embedding in result['embedding']:
    print(str(embedding)[:50], '... TRIMMED]')

# Same embeddings, Chinese
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
        '你好!',
        '晚安!',
        '早安!'
    ]
)

for embedding in result['embedding']:
    print(str(embedding)[:50], '... TRIMMED]')

# Same embeddings, Thai
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
        'สวัสดี!',
        'สวัสดีตอนเย็น!',
        'สวัสดีตอนเช้า!'
    ]
)

for embedding in result['embedding']:
    print(str(embedding)[:50], '... TRIMMED]')

# Different embeddings, Vietnamese
result = genai.embed_content(
    model="models/text-embedding-004",
    content=[
        'Xin chào!',
        'Buổi tối vui vẻ!',
        'Chào buổi sáng!'
    ]
)

for embedding in result['embedding']:
    print(str(embedding)[:50], '... TRIMMED]')
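Printing only the first 50 characters of each vector shows the problem but does not prove that the vectors match in every dimension. Below is a minimal sketch of a stricter check, assuming result['embedding'] is a list of equal-length float lists as in the calls above. The helper name all_identical and the sample vectors are hypothetical, made up for illustration, and are not real model output:

```python
def all_identical(vectors, tol=0.0):
    """Return True if every vector matches the first one within `tol` per component.

    A list with fewer than two vectors is treated as trivially identical.
    """
    if len(vectors) < 2:
        return True
    first = vectors[0]
    for vec in vectors[1:]:
        # Vectors of different length can never be identical.
        if len(vec) != len(first):
            return False
        if any(abs(a - b) > tol for a, b in zip(first, vec)):
            return False
    return True


# Made-up stand-ins for result['embedding']; NOT real API output.
distinct = [[0.01, -0.02, 0.03], [0.04, 0.00, -0.01], [0.02, 0.05, 0.01]]
collapsed = [[0.01, -0.02, 0.03]] * 3

print(all_identical(distinct))   # False
print(all_identical(collapsed))  # True
```

In the notebook, calling all_identical(result['embedding']) after each embed_content call would flag the affected languages without relying on eyeballing the printed prefixes.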

The embedding model I am using is text-embedding-004, as documented at the following URL:

Gemini models | Gemini API | Google AI for Developers

Please let me know if you need any clarification.
Thank you in advance for your time and assistance.

David_Lanzendorfer · December 26, 2024, 9:54pm

Embeddings are vector representations of the semantics of expressions. "Good morning" means the same thing as おはようございます, so it makes sense that their vectors are nearly the same; otherwise Euclidean proximity search wouldn't work across languages. That is the whole point of a semantic embedding. If, however, the Euclidean distance between "Good morning" and "Good evening" is zero, then something is broken, because the two differ semantically in the time of day they refer to.
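To make that distinction concrete, here is a minimal sketch of the two usual comparisons, Euclidean distance and cosine similarity. The three-dimensional vectors are invented for illustration only (real embeddings have far more dimensions), and the function and variable names are hypothetical:

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, NOT real model output.
good_morning_en = [0.90, 0.10, 0.40]
good_morning_ja = [0.88, 0.12, 0.41]  # same meaning: expected to be close, not equal
good_evening_en = [0.20, 0.90, 0.40]  # different meaning: expected to be farther away

near = euclidean_distance(good_morning_en, good_morning_ja)
far = euclidean_distance(good_morning_en, good_evening_en)

print(near < far)                                                 # True
print(euclidean_distance(good_morning_en, good_morning_en) == 0)  # True
print(cosine_similarity(good_morning_en, good_morning_ja))
```

A distance of exactly zero should only occur when a vector is compared with itself, so seeing it for different greetings in the same language points to a fault rather than to semantic closeness.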

tokyoboy_yosuke · June 8, 2025, 12:33pm

This happened to me too, in Japanese.
FYI, the newer model gemini-embedding-exp-03-07 worked.
