I see you're on textembedding-gecko@001. I'd try a newer model, like the suggested text-embedding-004 or text-multilingual-embedding-002. The newer models also let you request a lower output dimensionality.
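For example, something like this should work with a recent Vertex AI SDK (the model name and the output_dimensionality keyword are from the newer-model docs; the 256 is just an illustrative value):

from vertexai.language_models import TextEmbeddingModel

# Assumes vertexai.init(...) has already been called for your project.
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

# Newer models accept output_dimensionality to shrink the vectors.
embeddings = model.get_embeddings(
    ["example sentence to embed"],
    output_dimensionality=256,
)
print(len(embeddings[0].values))  # 256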
Conceptual thinking: with embeddings, you want each vector to capture one well-rounded concept rather than a mix of several when you index into that high-dimensional latent space. This is why RAG frameworks chunk documents and embed each chunk separately, in the hope that each chunk is a coherent concept instead of a blend. 5k tokens seems far too big for an ideal chunk size. It's usually around 150 characters or 80-100 tokens, possibly even less. So consider that when architecting your generative AI pipeline.
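To make that concrete, here's a minimal chunking sketch. The whitespace split and the 80-word target are illustrative, not from any particular framework; real pipelines usually use sentence- or token-aware splitters:

from vertexai.language_models import TextEmbeddingModel

def chunk_by_words(text, max_words=80):
    # Naive whitespace-based chunking; good enough to show the idea.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

text = "your long document text here"
chunks = chunk_by_words(text)

# Embed each chunk separately so every vector covers one small, coherent span.
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
chunk_embeddings = model.get_embeddings(chunks)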
Hi, thanks for your response. I was following this tutorial. I just thought the API limitation of 10k bytes (~1500 words) was a little weird. Shouldn't it just truncate my text, or at least give me the option to do so? For my PoC I think it's OK; I didn't want to deal with chunking yet. After I changed my code to use another SDK, it worked (with truncation):
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

texts = [...]  # your input strings
task_type = "RETRIEVAL_DOCUMENT"  # not shown in my original snippet; any supported task type

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

# Wrap each text with its task type, then embed the whole batch.
inputs = [TextEmbeddingInput(text, task_type=task_type) for text in texts]
embeddings = model.get_embeddings(inputs)
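For what it's worth, get_embeddings in this SDK also takes an auto_truncate argument, which appears to default to True; that would explain why the oversized input was silently truncated here instead of being rejected like my earlier API call.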