What is the best way to store text data and images from a website for RAG

KonradPs · January 18, 2025, 5:05pm

Hi,

I’m looking to extract some domain data from a website. The website contains a mix of text and images. How should I store this for RAG/model grounding? Currently, as a test, I store the unstructured data in a bucket and use a JSONL file to import it into a Data Store. I’m not sure how to handle the image data.

jkirstaetter · January 19, 2025, 8:06am

You would probably create embeddings of the content and store them for later use.
The simplest way would be to store them as JSON files on Cloud Storage; however, a vector-based storage option might be more practical in the long run.
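
To make the JSON idea concrete, here is a rough sketch of what I mean, not a definitive implementation. It writes one JSON record per text chunk (JSONL) with its embedding and then ranks records by cosine similarity. `toy_embed`, `to_jsonl`, and `search` are hypothetical names, and `toy_embed` is only a hashed bag-of-words stand-in so the example runs without any service; in practice you would call a real embedding model there and upload the resulting file to your bucket.

```python
import hashlib
import json
import math


def toy_embed(text, dim=64):
    """Stand-in for a real embedding model (assumption): hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def to_jsonl(chunks):
    """Serialize each chunk with its embedding as one JSON object per line."""
    return "\n".join(
        json.dumps({"id": i, "text": text, "embedding": toy_embed(text)})
        for i, text in enumerate(chunks)
    )


def search(jsonl, query, top_k=1):
    """Brute-force search: score every record against the query and return the best texts."""
    q = toy_embed(query)
    records = [json.loads(line) for line in jsonl.splitlines()]
    # Vectors are already normalized, so the dot product equals cosine similarity.
    records.sort(
        key=lambda r: sum(a * b for a, b in zip(q, r["embedding"])),
        reverse=True,
    )
    return [r["text"] for r in records[:top_k]]
```

The brute-force scan over every record is the part that stops scaling as the file grows, which is why a vector store tends to be more practical later on.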

KonradPs · January 19, 2025, 7:23pm

Thanks. I was looking to use Vertex AI Search, but to maintain the context of the article I was thinking of extracting the text from each image and replacing the image with that text. I’m trying to understand how critical it is to maintain the “location” of the image/summary within the surrounding text. Getting an embedding of the image would let me store it in a vector database, but the image on its own is of no use, since I need the article together with the information the image contains.
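
Roughly what I have in mind, as a sketch only: walk the HTML in document order and put a text marker where each image was, so the description stays next to the paragraph it belongs to. `describe_image` is a hypothetical placeholder that just reuses the alt text; that is where OCR or a captioning model would go. `InlineImageText` and `flatten_article` are my own names too.

```python
from html.parser import HTMLParser


def describe_image(src, alt):
    """Hypothetical placeholder for OCR or captioning; here it only reuses the alt text."""
    return alt or "no description available"


class InlineImageText(HTMLParser):
    """Collects page text in document order, replacing each <img> with a text marker."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            src = attr.get("src") or ""
            desc = describe_image(src, attr.get("alt") or "")
            # The marker sits exactly where the image appeared in the article.
            self.parts.append(f"[Image: {desc} (source: {src})]")


def flatten_article(html):
    """Return the article as plain text with image descriptions kept in place."""
    parser = InlineImageText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

For example, `flatten_article('<p>Step 1</p><img src="a.png" alt="Export dialog"><p>Step 2</p>')` gives three lines with the image marker between the two steps. Keeping the source path in the marker means the original image can still be looked up later if needed.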
