Skip to content

Cluster embeddings with DBSCAN and collapse duplicates #84

Description

@ionfwsrijan

Description

Use DBSCAN clustering on the embeddings to group near-duplicate findings and collapse each cluster into a single representative finding with metadata about the duplicates.

What to implement

Create backend/app/ml/deduplicator.py:

  • deduplicate(findings: list[dict], epsilon: float = 0.15) -> list[dict]
  • Embeds findings via embedder.py
  • Runs sklearn.cluster.DBSCAN(eps=epsilon, min_samples=2, metric='cosine')
  • For each cluster, keep the finding with the highest raw_severity as representative
  • Attach duplicate_count: int and related_files: list[str] to representative
  • Noise points (label == -1) are returned as-is with duplicate_count: 0

Acceptance criteria

  • A set of 10 identical SQL injection findings across different files collapses to 1
  • duplicate_count accurately reflects the cluster size
  • related_files lists all files in the cluster except the representative's own file

Metadata

Metadata

Assignees

Labels

MediumMedium difficultySSoC26backendBackend issuesmlML related issuestier-1TIER 1 Upgrade issues

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions