Hi,
I’m looking to extract some domain data from a website. The website contains a mix of text and images. How should I approach storing this for RAG/model grounding? Currently, for testing, I store the unstructured data in a bucket and use a JSONL file to bring it into a Data Store. I’m not sure how to handle the image data.
You would probably create embeddings of the content and store them for later use.
The simplest way would be to store them as JSON files on Cloud Storage; however, a vector-based storage option might be more practical in the long run.
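As a rough illustration of the JSON route, here is a minimal sketch that embeds text chunks, serializes them to JSON, and retrieves the closest chunk by cosine similarity. Everything in it is an assumption for illustration: `toy_embed` is a word-count stand-in for a real embedding model, and the record fields (`id`, `text`, `embedding`) are made-up names, not a required schema.

```python
import json
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (word counts). Swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_records(chunks):
    """Turn text chunks into JSON-serializable records (hypothetical fields)."""
    return [
        {"id": i, "text": text, "embedding": dict(toy_embed(text))}
        for i, text in enumerate(chunks)
    ]


def search(records, query, k=1):
    """Return the k records whose embeddings are closest to the query."""
    q = toy_embed(query)
    ranked = sorted(
        records,
        key=lambda r: cosine(q, Counter(r["embedding"])),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "The pump is installed on the left side of the unit.",
    "Warranty covers parts for two years.",
]
# Round-trip through JSON, as if the records were written to a file
# and read back later.
serialized = json.dumps(build_records(chunks))
restored = json.loads(serialized)
top = search(restored, "where is the pump installed")
```

This linear scan is fine for a small test set; once the number of chunks grows, a vector store does the same nearest-neighbour lookup with an index instead of comparing against every record.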
Thanks. I was planning to use Vertex AI Search, but to preserve the context of the article I was thinking of extracting the text from each image and replacing the image with that text. I’m trying to understand how critical it is to maintain the “location” of the image/summary within the surrounding text. Getting an embedding of the image would let me store it in a vector database, but the image on its own is of no use, since I need the article with all of its information.
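To make the idea concrete, here is a minimal sketch of the in-place replacement, using only Python's standard `html.parser`. It assumes the article is available as HTML and that some function can turn an image source into text (OCR, a caption model, or alt text); `describe_image`, `flatten_article`, and the `[Image: ...]` marker are hypothetical names for illustration, not part of Vertex AI Search.

```python
from html.parser import HTMLParser


class InlineImageText(HTMLParser):
    """Collects article text and swaps each <img> for a text description,
    keeping the description at the image's original position."""

    def __init__(self, describe_image):
        super().__init__()
        self.describe_image = describe_image  # hypothetical: src -> text
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            self.parts.append(f"[Image: {self.describe_image(src)}]")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def flatten_article(html, describe_image):
    """Return the article as plain text with image descriptions inline."""
    parser = InlineImageText(describe_image)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Example with a lookup table standing in for OCR or captioning.
html = (
    "<p>Step 1: open the panel.</p>"
    '<img src="panel.png">'
    "<p>Step 2: close it.</p>"
)
captions = {"panel.png": "diagram of the open panel"}
article_text = flatten_article(
    html, lambda src: captions.get(src, "no description")
)
```

The flattened text can then be chunked and indexed like any other unstructured document, and a chunk that mentions "Step 1" carries the image description right next to it.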