This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Question about the paper/package #7

@saikot-paul

I have two questions regarding SemDeDup:

  1. Why are the dimensions of the embeddings not reduced when clustering?

From my understanding, in high-dimensional spaces the relative difference between the distances to the farthest and closest points shrinks. Clustering based on any distance metric therefore does not prove effective, as the algorithm is unable to distinguish between close and far points.

  2. How best to determine n_clusters for a given dataset size?

Is there an empirical method for choosing the number of clusters that best removes duplicates from a given dataset?

Lastly, reducing the dimensionality would also decrease the computational load.
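
To make the concern in the first question concrete, here is a minimal sketch of the distance-concentration effect. It is not taken from the SemDeDup code. It assumes uniformly random points and Euclidean distance, which real embeddings may not resemble (learned embeddings often have much lower intrinsic dimension), and the helper name `relative_contrast` is made up for illustration.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """Return (d_max - d_min) / d_min for distances from one random
    query point to n_points random points in the unit hypercube."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Assumption: synthetic uniform data, not real embeddings.
rng = random.Random(0)
contrasts = {dim: relative_contrast(500, dim, rng) for dim in (2, 32, 512)}

for dim, contrast in contrasts.items():
    print(f"dim={dim:4d}  relative contrast={contrast:.3f}")
```

On this synthetic data the relative contrast drops as the dimension grows, which is the behaviour the question describes. Whether it matters for a particular embedding model would have to be checked on the actual embeddings.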
