I’ve successfully fine-tuned the Gemma3ForConditionalGeneration model and have been getting great results. My goal now is to deploy this model on mobile devices for offline use, which requires converting it to the TensorFlow Lite (TFLite) format.
I’ve tried several standard conversion methods, but I’m running into challenges, likely due to the model’s complex multimodal architecture. I’m looking for a reliable workflow or script to handle this conversion.
Key Details:
Model Architecture: Gemma3ForConditionalGeneration
Special Tokens: The model uses several special tokens, including <bos> (ID: 2), <image_soft_token> (ID: 262144), <start_of_image> (ID: 255999), and <end_of_image> (ID: 256000).
Input Format: The model expects a specific input sequence combining text and image tokens. Each image is represented by 256 image tokens.
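For concreteness, here is a minimal sketch of how that combined text-plus-image token sequence could be assembled. The token IDs come from the details above; the helper name and the `<start_of_image>`/`<end_of_image>` framing are my assumptions about the layout, not something verified against the Gemma 3 processor:

```python
# Sketch: assembling multimodal input IDs for Gemma 3.
# IDs below are from the post; build_input_ids is an illustrative helper.
BOS_ID = 2
START_OF_IMAGE_ID = 255999   # assumed name for ID 255999
END_OF_IMAGE_ID = 256000
IMAGE_SOFT_TOKEN_ID = 262144
TOKENS_PER_IMAGE = 256       # each image is represented by 256 image tokens

def build_input_ids(text_ids, num_images=1):
    """Prefix text token IDs with BOS and one image block per image."""
    ids = [BOS_ID]
    for _ in range(num_images):
        ids.append(START_OF_IMAGE_ID)
        ids.extend([IMAGE_SOFT_TOKEN_ID] * TOKENS_PER_IMAGE)
        ids.append(END_OF_IMAGE_ID)
    ids.extend(text_ids)
    return ids

ids = build_input_ids([1234, 5678])
print(len(ids))  # 1 (BOS) + 258 (image block) + 2 (text) = 261
```

In practice the processor/tokenizer shipped with the model handles this expansion; the sketch is only to make the expected layout explicit when debugging conversion inputs.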
Has anyone successfully converted a fine-tuned Gemma 3 vision model (or a similar multimodal model like PaliGemma) to TFLite? Any scripts, tutorials, or guidance on the correct process would be extremely helpful.
To convert Gemma models to TFLite format, you can use the MediaPipe AI Edge conversion scripts from the following GitHub repository. The example script provided is for PaliGemma; if you would like to explore Gemma 2 or Gemma 3, see the Gemma and Gemma3 packages under the examples package in the same repository.