Question about the paper/package #7

I have two questions regarding SemDeDup: 

1. Why are the dimensions of the embeddings not reduced when clustering? 

From my understanding, in high-dimensional spaces the relative difference between the distances to the farthest and closest points shrinks (the curse of dimensionality). Clustering based on any distance metric therefore does not prove effective, because the algorithm is unable to distinguish between close and far points.
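To illustrate the effect I mean, here is a minimal sketch that measures the relative contrast `(d_max - d_min) / d_min` between one query point and a set of random points as the dimension grows. The function name `relative_contrast` and the uniform random data are my own assumptions for illustration; real embeddings have structure, so they may be affected much less than this synthetic case.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query to n_points random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast should shrink as the dimension increases.
    for dim in (2, 16, 128, 1024):
        print(dim, round(relative_contrast(dim), 3))
```

If the contrast on the actual embeddings stays well above zero, distance-based clustering can presumably still separate near and far points without any reduction.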

2. How to best determine n_clusters given dataset size? 

Is there an empirical method for choosing the number of clusters so that duplicates are removed as effectively as possible for a given dataset?
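One empirical procedure I could imagine (not taken from the paper or the package) is to sweep candidate values on a small sample. The sketch below assumes, as I understand SemDeDup, that duplicates are only searched for inside each k-means cluster using a cosine-similarity threshold of `1 - eps`. It compares each candidate `n_clusters` against a brute-force ground truth and reports how many duplicate pairs still share a cluster (recall) and how many pairwise comparisons remain. All function names, the toy data, and the `eps` value are hypothetical.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans_labels(points, k, iters=10, seed=0):
    """Plain spherical k-means on unit vectors; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the center with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = normalize([sum(col) for col in zip(*members)])
    return labels


def duplicate_pairs(points, eps):
    """Brute-force ground truth: all pairs with cosine similarity > 1 - eps."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dot(points[i], points[j]) > 1.0 - eps]


def sweep(points, candidates, eps=0.05):
    """For each candidate k, return (k, pair recall, within-cluster comparisons)."""
    truth = duplicate_pairs(points, eps)
    rows = []
    for k in candidates:
        labels = kmeans_labels(points, k)
        found = sum(1 for i, j in truth if labels[i] == labels[j])
        recall = found / len(truth) if truth else 1.0
        sizes = [labels.count(c) for c in range(k)]
        comparisons = sum(s * (s - 1) // 2 for s in sizes)
        rows.append((k, recall, comparisons))
    return rows


def toy_embeddings(n_base=150, n_dupes=50, dim=16, noise=0.02, seed=1):
    """Synthetic unit vectors plus noisy copies that act as near-duplicates."""
    rng = random.Random(seed)
    base = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n_base)]
    dupes = [normalize([x + rng.gauss(0, noise) for x in rng.choice(base)])
             for _ in range(n_dupes)]
    return base + dupes


if __name__ == "__main__":
    for k, recall, comparisons in sweep(toy_embeddings(), [1, 2, 4, 8, 16]):
        print(f"n_clusters={k:2d} recall={recall:.3f} comparisons={comparisons}")
```

With such a table one could pick the largest `n_clusters` whose recall stays above an acceptable level, but I do not know whether this matches how the value was chosen in the paper.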

Lastly, regarding the first question, the computational load would also decrease if the dimensions were reduced.
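As a rough sketch of the cost argument: one k-means assignment pass needs about `n_points * n_clusters * dim` multiply-adds, so the cost scales linearly with the embedding dimension. The example below also uses a Gaussian random projection, which is my own choice for illustration and not something SemDeDup does, to check whether a near-duplicate pair stays similar after reduction. All names and sizes are hypothetical.

```python
import math
import random


def assignment_cost(n_points, n_clusters, dim):
    """Approximate multiply-adds for one k-means assignment pass: n * k * d."""
    return n_points * n_clusters * dim


def random_projection(d_in, d_out, seed=0):
    """Gaussian matrix scaled by 1/sqrt(d_out), so squared norms are
    preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0, 1) * scale for _ in range(d_in)] for _ in range(d_out)]


def project(vec, matrix):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def near_duplicate_similarity(d_in=256, d_out=64, noise=0.05, seed=0):
    """Cosine similarity of a synthetic near-duplicate pair before and
    after projecting from d_in to d_out dimensions."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d_in)]
    b = [x + rng.gauss(0, noise) for x in a]
    matrix = random_projection(d_in, d_out, seed=seed + 1)
    return cosine(a, b), cosine(project(a, matrix), project(b, matrix))


if __name__ == "__main__":
    before, after = near_duplicate_similarity()
    print("cosine before:", round(before, 4), "after:", round(after, 4))
    print("cost ratio 768 -> 64 dims:",
          assignment_cost(1000, 10, 768) / assignment_cost(1000, 10, 64))
```

Whether a fixed threshold on the reduced vectors removes the same duplicates as on the full embeddings is exactly what I am unsure about.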
