This repository was archived by the owner on Nov 1, 2024. It is now read-only.
I have two questions regarding SemDeDup:
Why are the dimensions of the embeddings not reduced when clustering?
From my understanding, in high-dimensional spaces the relative difference between the distances to the farthest and closest points shrinks. Clustering based on any distance metric therefore does not prove effective, as the algorithm is unable to distinguish between close and far points.
How to best determine n_clusters given the dataset size?
Is there any empirical method by which the number of clusters can be determined to best remove duplicates in a given dataset?
Lastly, reducing the dimensions would also decrease the computational load.
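To make the first point concrete, here is a minimal sketch of the distance-concentration effect I am describing. It is my own illustration and not taken from the SemDeDup code: the function name `relative_contrast`, the uniform random data, and the chosen dimensions are all assumptions. Real embeddings usually have more structure than uniform noise, so the effect may be weaker in practice.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances
    from one random query point to n_points random points.

    Hypothetical helper: points are drawn uniformly from the unit
    hypercube, which is only a stand-in for real embeddings.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast between the farthest and nearest neighbour
    # shrinks as the dimension grows.
    for dim in (2, 32, 512):
        print(dim, round(relative_contrast(dim), 3))
```

On this toy data the contrast drops sharply as `dim` grows, which is the behaviour that makes me wonder whether the embeddings should be reduced before clustering.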