Description
Use DBSCAN clustering on the embeddings to group near-duplicate findings and collapse each cluster into a single representative finding with metadata about the duplicates.
What to implement
Create backend/app/ml/deduplicator.py:
deduplicate(findings: list[dict], epsilon: float = 0.15) -> list[dict]
- Embeds findings via embedder.py
- Runs sklearn.cluster.DBSCAN(eps=epsilon, min_samples=2, metric='cosine')
- For each cluster, keeps the finding with the highest raw_severity as the representative
- Attaches duplicate_count: int and related_files: list[str] to the representative
- Noise points (label == -1) are returned as-is with duplicate_count: 0
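A note on the epsilon default: with metric='cosine', scikit-learn measures distance as 1 minus cosine similarity, so eps=0.15 treats two findings as neighbours when their embeddings have a cosine similarity of at least 0.85. The snippet below is only an illustration of that threshold in plain Python. The 2-D vectors are made up, and real embeddings would come from embedder.py.

```python
import math


def cosine_distance(u: list[float], v: list[float]) -> float:
    """Return 1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


EPSILON = 0.15  # default from the deduplicate() signature

# Hypothetical toy embeddings, not output of embedder.py.
near_a = [1.0, 0.0]
near_b = [0.9, 0.1]   # points in almost the same direction as near_a
unrelated = [0.0, 1.0]  # orthogonal to near_a

# near_a and near_b fall within epsilon of each other, so DBSCAN could
# place them in the same cluster; near_a and unrelated do not.
print(cosine_distance(near_a, near_b) <= EPSILON)
print(cosine_distance(near_a, unrelated) <= EPSILON)
```

Lowering epsilon makes the deduplication stricter (fewer findings merged); raising it merges more loosely related findings.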
Acceptance criteria
- duplicate_count accurately reflects the cluster size
- related_files lists all files in the cluster except the representative's own file
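One possible shape for deduplicator.py is sketched below. It is not a definitive implementation and rests on several assumptions that the issue does not settle:

- Each finding stores its path under a "file" key (hypothetical name).
- raw_severity is a comparable number.
- The embedder is passed in as a hypothetical embed callable that returns one vector per finding, because the API of embedder.py is not described here.
- duplicate_count is taken to be the number of other findings in the cluster (cluster size minus one). This is consistent with noise points getting 0; use len(members) instead if the count should include the representative.

```python
from collections import defaultdict
from typing import Callable, Sequence


def collapse_clusters(findings: list[dict], labels: Sequence[int]) -> list[dict]:
    """Collapse findings into representatives, given one DBSCAN label each."""
    clusters: dict[int, list[dict]] = defaultdict(list)
    result: list[dict] = []

    for finding, label in zip(findings, labels):
        if label == -1:
            # Noise point: returned as-is with no duplicates recorded.
            result.append({**finding, "duplicate_count": 0})
        else:
            clusters[int(label)].append(finding)

    for members in clusters.values():
        # max() keeps the first member when raw_severity values tie.
        rep = max(members, key=lambda f: f["raw_severity"])
        # Assumption: the path lives under a "file" key.
        other_files = [f["file"] for f in members if f is not rep]
        # Drop the representative's own file and repeated paths, keeping order.
        related = list(dict.fromkeys(p for p in other_files if p != rep["file"]))
        result.append({
            **rep,
            # Assumption: cluster size minus one (the representative itself).
            "duplicate_count": len(members) - 1,
            "related_files": related,
        })
    return result


def deduplicate(
    findings: list[dict],
    epsilon: float = 0.15,
    *,
    embed: Callable[[list[dict]], Sequence[Sequence[float]]],
) -> list[dict]:
    """Cluster near-duplicate findings with DBSCAN and collapse each cluster.

    `embed` is a hypothetical stand-in for whatever embedder.py exposes.
    """
    if not findings:
        return []
    # Imported here so collapse_clusters can be used without scikit-learn.
    from sklearn.cluster import DBSCAN

    vectors = embed(findings)
    labels = DBSCAN(eps=epsilon, min_samples=2, metric="cosine").fit_predict(vectors)
    return collapse_clusters(findings, labels)
```

In this sketch, noise points come first in the result, followed by one representative per cluster, so the output order differs from the input order. Keeping collapse_clusters separate from the DBSCAN call also lets both acceptance criteria be unit-tested with hand-written labels, without running the embedder.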