Should I perform dimensionality reduction on vectors before clustering?

I have converted documents into vectors using “text-embedding-004”. Since my goal is to find clusters of semantically similar documents, I have specified task_type="CLUSTERING".

I now have one vector of length 768 for each document.

I can continue in 2 different ways:

  1. reduce the 768 dimensions to, say, 2 or 3 with MDS, t-SNE, or any other popular dimensionality reduction algorithm, then cluster the reduced vectors (e.g. with K-means, DBSCAN, or HDBSCAN)
  2. cluster the full 768-dimensional vectors directly with K-means, DBSCAN, or HDBSCAN.
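To make the two options concrete, here is a minimal sketch using scikit-learn. It rests on several assumptions that are not from my actual setup: the embeddings are replaced by synthetic 768-dimensional blobs from `make_blobs`, the number of clusters (4) is made up, and PCA stands in for "any other popular dimensionality reduction algorithm" because it is fast and deterministic. It only shows how the two pipelines could be compared, not which one is better on real document embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for real document embeddings
# (assumption: 300 documents, 4 underlying topics, 768 dimensions).
X, _ = make_blobs(n_samples=300, n_features=768, centers=4, random_state=0)

# Option 1: reduce to 3 dimensions first, then cluster.
X_reduced = PCA(n_components=3, random_state=0).fit_transform(X)
labels_reduced = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)

# Option 2: cluster the 768-dimensional vectors directly.
labels_full = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means both pipelines produced the same partition.
agreement = adjusted_rand_score(labels_reduced, labels_full)
print(f"Agreement between option 1 and option 2: {agreement:.3f}")
```

On well-separated synthetic blobs like these the two options should agree closely; the interesting case is real embeddings, where the agreement score shows how much the reduction step changes the clusters.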

Bonus question: does specifying task_type="CLUSTERING" influence whether I should choose option 1 or option 2?

Thanks,
Oliver

Hello @Oliver_Angelil,
Welcome to the community.
Reducing the dimensions will help you visualize whether documents with similar content sit close together.
For clustering, it would be better to use all the dimensions.

Here is an example that uses text embeddings and K-means clustering, with t-SNE as the dimensionality reduction step for visualization.

(generative-ai-docs/site/en/gemini-api/tutorials/clustering_with_embeddings.ipynb at main · google/generative-ai-docs · GitHub)
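The pattern described above (cluster on all dimensions, reduce only for plotting) can be sketched roughly as follows. This is not the notebook's code: the embeddings are replaced by synthetic data from `make_blobs`, and the sample count, cluster count, and t-SNE perplexity are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the document embeddings
# (assumption: 200 documents, 3 topics, 768 dimensions).
embeddings, _ = make_blobs(n_samples=200, n_features=768, centers=3, random_state=0)

# Cluster on all 768 dimensions.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# Reduce to 2 dimensions for visualization only; the cluster labels
# above are not affected by this step.
coords_2d = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(embeddings)

# Each row pairs a document's plot position with its cluster label.
for (x, y), cid in list(zip(coords_2d, cluster_ids))[:5]:
    print(f"x={x:.2f}, y={y:.2f}, cluster={cid}")
```

The `coords_2d` array can be passed to any scatter plot, colored by `cluster_ids`, to check visually whether the clusters found in the full space also look separated in 2D.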


Hi Akhilesh

Thanks for your help.

Can you justify this with a source or a technical explanation that takes the curse of dimensionality into account?
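For context, the effect I have in mind is distance concentration: for random points, the nearest and farthest neighbours become almost equally far away as the number of dimensions grows, which can weaken distance-based clustering. A minimal NumPy sketch, assuming uniformly random points (real embeddings are structured, so the effect may be much weaker for them); the helper `relative_contrast` is a hypothetical name introduced only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=1000):
    """Return (farthest - nearest) / nearest distance from one random
    query point to n_points random points in the unit hypercube."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)     # low-dimensional case
contrast_high = relative_contrast(768)  # same length as the embedding vectors

print(f"relative contrast in   2 dims: {contrast_low:.2f}")
print(f"relative contrast in 768 dims: {contrast_high:.2f}")
```

A much smaller contrast in 768 dimensions than in 2 is the concentration effect; whether it actually hurts clustering of these embeddings is exactly what I would like a source or explanation for.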

@Oliver_Angelil,

I don't have a technical source that recommends using all the dimensions.
You can always do dimensionality reduction before clustering. I was just saying you might benefit from doing the clustering with all dimensions.
