Impact of Text Structure on embedding results

I am currently working on a RAG pipeline, and for that I need to embed documents using gemini-embedding-001. My documents are saved in a certain format, so I need to parse them and decode the bytes. When I get the text after processing a file, the original structure seems a little skewed, especially if the text was originally in tables or numbered lists. So I was wondering whether the embedding would be very different if my original text looks like
”1. Building with google ai forum should be fun
1.1. there are theories to why this is the case
1.2. such theories will be discussed below
2. Because developers here are quite smart and helping.
2.1. we will see how the people here got smart
a. study hard
b. solve problems”
but my function that processes the file produces something like the following:
”1. Building with google ai forum should be fun 1.1. there are theories to why this is the case 1.2. such theories will be discussed below 2. Because developers here are quite smart and helping. 2.1. we will see how the people here got smart a. study hard b. solve problems”

So would using one version or the other have a large impact on the resulting embedding, enough to affect the retrieval stage?

Hi @sol-ruh-1, welcome back!

gemini-embedding-001 is a dense embedding model, which means it encodes the semantic meaning of your words rather than where they sit on the page. As long as you have spaces or periods between your list items, the numbering still conveys the hierarchical relationship, and the resulting vector stays very close to the one for the structured text (close, not bit-for-bit identical, since the whitespace differs).

The main way this structure change affects you is if your chunking strategy relies on those newlines. If you usually split documents into paragraphs on \n\n, your flattened text might turn into one massive block that exceeds the model’s input token limit and gets cut off.
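If the newlines are already gone, one option is to split on the list numbering instead. Here is a minimal sketch, assuming top-level items look like "1. ", "2. " and so on; the function name and the regex are my own illustration, not part of any library, and the pattern would also fire on prose such as "in 2023. Then", so treat it as a starting point only.

```python
import re

# Zero-width split point in front of a top-level marker such as "1. " or "12. ".
# The lookahead rejects "1.1. " (a digit, not whitespace, follows the first dot),
# and the lookbehind stops the second number of "1.1. " from matching on its own.
TOP_LEVEL_ITEM = re.compile(r"(?<![\d.])(?=\d+\.\s)")


def split_flattened_list(text: str) -> list[str]:
    """Split flattened text into one chunk per top-level numbered item."""
    parts = TOP_LEVEL_ITEM.split(text)
    return [part.strip() for part in parts if part.strip()]


flattened = (
    "1. Building with google ai forum should be fun "
    "1.1. there are theories to why this is the case "
    "1.2. such theories will be discussed below "
    "2. Because developers here are quite smart and helping. "
    "2.1. we will see how the people here got smart "
    "a. study hard b. solve problems"
)

chunks = split_flattened_list(flattened)
for chunk in chunks:
    print(chunk)
```

Each chunk keeps its sub-items (1.1, 1.2, a, b) together with the parent item, which is usually what you want for retrieval. You would still need a size check on top of this if a single item can get very long.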

In terms of retrieval accuracy, the difference between the structured and flattened versions is negligible. The model captures the meaning of the list items whether they sit on different lines or on the same one.
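If you want to check this on your own documents rather than take it on faith, you can embed both versions and compare them with cosine similarity. A rough sketch: `embed` here is a hypothetical callable that you supply (any function that takes a string and returns a list of floats), and `compare_versions` is just an illustrative helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_versions(embed, structured, flattened):
    """Embed both versions with the supplied callable and return their similarity."""
    return cosine_similarity(embed(structured), embed(flattened))
```

A value very close to 1.0 means the two versions land in nearly the same place in the vector space. For a more realistic check, you could also run a few real queries against both versions and see whether the ranking changes.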

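Focus more on making sure your task type is set to RETRIEVAL_DOCUMENT when you embed the documents (and RETRIEVAL_QUERY when you embed the search queries) to optimize the search results. See the Gemini Embeddings documentation.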
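For reference, here is roughly how the task type can be passed with the google-genai Python SDK. This is a sketch under assumptions: the helper names are mine, `client` is expected to be a `genai.Client()` you create yourself, and I use RETRIEVAL_QUERY on the query side, which is the counterpart task type for search queries.

```python
MODEL = "gemini-embedding-001"


def embed_texts(client, texts, task_type):
    """Embed a list of strings and return one list of floats per string."""
    response = client.models.embed_content(
        model=MODEL,
        contents=texts,
        config={"task_type": task_type},
    )
    return [embedding.values for embedding in response.embeddings]


def embed_documents(client, chunks):
    """Embed the chunks that go into the index."""
    return embed_texts(client, chunks, "RETRIEVAL_DOCUMENT")


def embed_query(client, question):
    """Embed a single search query at retrieval time."""
    return embed_texts(client, [question], "RETRIEVAL_QUERY")[0]


# Usage (needs `pip install google-genai` and an API key in the environment):
#   from google import genai
#   client = genai.Client()
#   doc_vectors = embed_documents(client, ["1. Building with ...", "2. Because ..."])
#   query_vector = embed_query(client, "why should building here be fun?")
```

Passing the client in as an argument keeps the helpers easy to test with a stub and avoids creating a new client on every call.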

Thank you!
