Should I perform dimensionality reduction on vectors before clustering?

Oliver_Angelil · May 8, 2025, 3:15pm

I have converted documents into vectors using “text-embedding-004”. Since my goal is to find clusters of semantically similar documents, I have specified task_type="CLUSTERING".

For each document I now have vectors of length 768.

I can continue in 2 different ways:

1) Reduce the 768 dimensions to, say, 2 or 3 with MDS, t-SNE, or another popular dimensionality reduction algorithm, then cluster (e.g. with K-means, DBSCAN, or HDBSCAN).
2) Cluster the 768-dimensional vectors directly with K-means, DBSCAN, or HDBSCAN.

Bonus question: does specifying task_type="CLUSTERING" influence whether 1) or 2) is chosen?

Thanks,
Oliver

Akhilesh_Kambhampati · May 8, 2025, 5:14pm

Hello @Oliver_Angelil,
Welcome to the community.
Reducing the dimensions will help you visualize whether documents with similar context sit close together.
For clustering, it would be better to use all the dimensions.

Here is an example that uses text embeddings and K-means clustering, with t-SNE dimensionality reduction for visualization:

(generative-ai-docs/site/en/gemini-api/tutorials/clustering_with_embeddings.ipynb at main · google/generative-ai-docs · GitHub)
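
As a rough illustration of that workflow (a minimal sketch, not the code from the linked notebook): cluster on the full 768-dimensional vectors and use t-SNE only to get 2-D coordinates for plotting. The synthetic data from make_blobs, the choice of 3 clusters, and the scikit-learn parameters are assumptions made for the example; real embeddings would replace X.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize

# Assumption: synthetic 768-dimensional points stand in for real document embeddings.
X, _ = make_blobs(n_samples=150, n_features=768, centers=3, random_state=0)

# L2-normalize so Euclidean distance ranks pairs the same way cosine distance does.
X = normalize(X)

# Cluster on all 768 dimensions (k=3 is an assumption for this toy data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project to 2-D for plotting only; the labels above come from the full vectors.
coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

print(coords.shape, np.bincount(labels))
```

The coords array can then be passed to a scatter plot colored by labels to check visually whether the clusters look coherent.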

Oliver_Angelil · May 8, 2025, 6:15pm

Hi Akhilesh

Thanks for your help.

Can you justify this with a source or a technical explanation, taking the curse of dimensionality into account?

Akhilesh_Kambhampati · May 8, 2025, 6:39pm

@Oliver_Angelil ,

I don't have a technical source that recommends using all the dimensions.
You can always do dimensionality reduction before clustering. I was just saying that you might benefit from clustering with all the dimensions.
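
Since neither option is backed by a source here, one way to decide is to try both on your own data and see how much the cluster assignments differ. The sketch below is only an illustration under assumptions: it uses synthetic data in place of real embeddings, PCA with 50 components as the reduction step (swapped in for MDS or t-SNE because it is quick and deterministic), and K-means with 3 clusters. Well-separated synthetic blobs are far easier than real document embeddings, so the number it prints says nothing about your corpus.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Assumption: synthetic 768-dimensional points stand in for real document embeddings.
X, _ = make_blobs(n_samples=150, n_features=768, centers=3, random_state=0)


def cluster(vectors, k=3):
    """Run K-means with a fixed seed so the two runs are comparable."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)


# Option 2 from the question: cluster the full vectors directly.
labels_full = cluster(X)

# Option 1 from the question: reduce first (here with PCA), then cluster.
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)
labels_reduced = cluster(X_reduced)

# Adjusted Rand index: 1.0 means the two partitions are identical,
# values near 0 mean they agree no better than chance.
agreement = adjusted_rand_score(labels_full, labels_reduced)
print("Agreement between full and reduced clustering (ARI):", agreement)
```

If the agreement is high on real data, the reduction step is not changing the result much; if it is low, it is worth inspecting a sample of documents from each clustering to see which grouping makes more sense.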
