I am currently working on a RAG pipeline, and for that I need to embed documents using gemini-embedding-001. My documents are saved in a certain format, so I need to parse them and decode the bytes. But when I get the text after processing a file, the original structure is a little skewed, especially if the text was originally in tables or numbered lists. So I was wondering whether the embedding would be much different if my original text looks like this:
“1. Building with google ai forum should be fun
1.1. there are theories to why this is the case
1.2. such theories will be discussed below
2. Because developers here are quite smart and helping.
2.1. we will see how the peopel here got smart
a. study hard
b. solve problems”
But my function that processes the file produces something like the following:
“1. Building with google ai forum should be fun 1.1. there are theories to why this is the case 1.2. such theories will be discussed below 2. Because developers here are quite smart and helping. 2.1. we will see how the peopel here got smart a. study hard b. solve problems”
So would using one version or the other have a large impact on the resulting embedding, enough to affect the retrieval stage?
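To make the question concrete, one way I could check this myself is to embed both versions and compare them with cosine similarity. Below is a minimal sketch of that idea, not a definitive implementation. The names `cosine_similarity`, `compare_layouts`, `embed_with_gemini`, `STRUCTURED`, and `FLATTENED` are my own, and `embed_with_gemini` assumes the `google-genai` Python SDK is installed with an API key available in the environment. A single pair of texts would only be a rough indication, not a measure of retrieval quality.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]

# The two versions from the post: same words, different line structure.
STRUCTURED = "\n".join([
    "1. Building with google ai forum should be fun",
    "1.1. there are theories to why this is the case",
    "1.2. such theories will be discussed below",
    "2. Because developers here are quite smart and helping.",
    "2.1. we will see how the peopel here got smart",
    "a. study hard",
    "b. solve problems",
])
# Flattened version: every line break replaced by a single space.
FLATTENED = " ".join(STRUCTURED.splitlines())


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_layouts(
    embed: Callable[[List[str]], List[Vector]],
    structured: str,
    flattened: str,
) -> float:
    """Embed both versions with `embed` and return their cosine similarity."""
    vec_structured, vec_flattened = embed([structured, flattened])
    return cosine_similarity(vec_structured, vec_flattened)


def embed_with_gemini(texts: List[str]) -> List[List[float]]:
    """Assumption: google-genai SDK installed, API key set in the environment."""
    from google import genai  # imported lazily so the rest runs without the SDK

    client = genai.Client()
    response = client.models.embed_content(
        model="gemini-embedding-001",
        contents=texts,
    )
    return [list(embedding.values) for embedding in response.embeddings]


# Example call (needs network access and an API key):
# print(compare_layouts(embed_with_gemini, STRUCTURED, FLATTENED))
```

A score very close to 1.0 would suggest that the lost line breaks barely move the vector for this chunk. To judge the effect on retrieval itself, I would probably need to repeat this with real queries against both versions of the index.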