Poor quality of text embeddings with MediaPipe

Hi all,
Today I tried MediaPipe for the first time.
My goal is to compute cosine similarities between a text query and a set of FAQ entries.
The corpus is in French and contains some industrial technical terms (mainly acronyms and abbreviations).
I am a little disappointed with the results: top-1 accuracy/precision is 0.18 and MRR is 0.26.
I ran some experiments with the model ‘embedder.tflite’. I also tried ‘universal_sentence_encoder.tflite’, but the results are no better.
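For reference, this is how I score the retrieval — a minimal pure-Python sketch of cosine similarity plus top-1 accuracy and MRR (the function names and the toy vectors are just for illustration, not my real data):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def evaluate(query_embs, faq_embs, gold):
    """Top-1 accuracy and MRR of queries ranked against FAQ entries.

    gold[i] is the index of the correct FAQ entry for query i.
    """
    hits, rr_sum = 0, 0.0
    for i, q in enumerate(query_embs):
        sims = [cosine_similarity(q, f) for f in faq_embs]
        ranking = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        rank = ranking.index(gold[i]) + 1  # 1-based rank of the right entry
        hits += (rank == 1)
        rr_sum += 1.0 / rank
    return hits / len(query_embs), rr_sum / len(query_embs)

# Toy example:
# evaluate([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]], gold=[0, 1])
# returns (1.0, 1.0)
```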

Is there any other model I can try with MediaPipe to generate my embeddings and get better results?

Outside the MediaPipe ecosystem, I used all-MiniLM-L6-v2 / multilingual-e5. The results are very good, but quantizing the model and porting the tokenizer to Android (my target) is not easy.

Hi @Lake6985,

The model accuracy is expected to be low: MediaPipe solutions target edge devices and are benchmarked for runtime on CPUs. If anyone has a different view on this, please let me know.

As you said, performance increases with model size. I checked all-MiniLM-L6-v2 / multilingual-e5: they have TensorFlow versions, so you can convert them to TFLite and use them.
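A minimal conversion sketch, assuming you already have the model exported as a TensorFlow SavedModel on disk (the paths are placeholders, and the `tensorflow` package is required):

```python
def convert_saved_model_to_tflite(saved_model_dir, output_path):
    """Convert a TensorFlow SavedModel to a .tflite flatbuffer."""
    import tensorflow as tf  # imported lazily so the sketch loads without TF

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optional: dynamic-range quantization to shrink the model for edge devices.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Transformer embedding models often use ops outside the TFLite built-in set.
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS,  # fall back to TF ops where needed
    ]
    tflite_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_model)

# Example (placeholder paths):
# convert_saved_model_to_tflite("minilm_saved_model", "minilm.tflite")
```

Whether the converted model works well with MediaPipe also depends on the quantization settings, so it is worth comparing the float and quantized versions on your French corpus.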

Hi Joel,
Thank you for your prompt response.
I fully understand that performance on edge devices is the priority. In my work I mainly target edge devices for our application, and it is tough work.
As you suggest, I will look into converting all-MiniLM-L6-v2 to TFLite so that I can use it with MediaPipe. But what about tokenization? I can export the tokenizer's config files (*.json) and model (sentencepiece.bpe.model), but how do I build a SentencePiece tokenizer from those files with MediaPipe? Is there any way to use Google's SentencePiece tokenizer with MediaPipe? Could you please tell me if you see any way to do that?
I really hope I will be able to use MediaPipe to build my FAQ.

Thx