feat: integrate deduplicator into scan response #113
@Praneeth2711 Fix failing checks and PR description. Join our dc server to connect with fellow contributors and mentors. Make sure to star the repo.
- apply deduplication after scan aggregation
- add raw_finding_count and finding_count
- support DISABLE_DEDUP
- support configurable DEDUP_EPSILON
- gracefully skip deduplication when sentence-transformers is unavailable
f43229b to c2c5dcc
Linked issue
Closes #85
What this PR does
This PR integrates the semantic embedding-based deduplicator into the
Type of change
ML tier (if applicable)
Stack affected
Changes
Backend
Frontend
New dependencies
Database / schema changes
Testing
How did you test this?
Added integration tests covering:
Ran the backend test suite locally and verified all tests pass.
Checklist
Anything reviewers should focus on
Please focus on the lazy loading of the SentenceTransformer model, deduplication logic, and handling of
Screenshots (if UI changed)
Not applicable (backend-only change).
@Praneeth2711 Update PR description
Linked issue
Closes #
What this PR does
Type of change
ML tier (if applicable)
Stack affected
Changes
Backend
Frontend
New dependencies
Database / schema changes
Testing
How did you test this?
Checklist
Anything reviewers should focus on
Screenshots (if UI changed)
@Praneeth2711 Strictly use this template as it is
@Praneeth2711 Please fix failing checks
@Praneeth2711 Fix failing checks and PR desc
Linked issue
Closes #85
What this PR does
This PR integrates the deduplication logic into the backend scan process. It runs the deduplicator on aggregated findings to collapse duplicate findings based on text embeddings. It supports configuring the deduplication via environment variables (DISABLE_DEDUP and DEDUP_EPSILON) and handles cases where sentence-transformers is not installed.
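For reviewers, the behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`dedup_findings`, `_load_embedder`), the `text` field on each finding, the model name, the default epsilon of 0.15, and the reading of DEDUP_EPSILON as a cosine-distance threshold are all assumptions. Only DISABLE_DEDUP, DEDUP_EPSILON, raw_finding_count and finding_count come from the PR itself.

```python
import math
import os


def _cosine_distance(a, b):
    """Cosine distance between two vectors (1.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def _load_embedder():
    """Lazily import sentence-transformers; return None if it is not installed."""
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError:
        return None
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
    return lambda texts: model.encode(texts).tolist()


def dedup_findings(findings, embed=None):
    """Collapse near-duplicate findings and report counts before and after.

    `embed` is a hypothetical hook (list of str -> list of vectors) so the
    sketch can be exercised without the optional dependency.
    """
    raw_count = len(findings)
    disabled = os.environ.get("DISABLE_DEDUP", "").lower() in {"1", "true", "yes"}
    if not disabled and embed is None:
        embed = _load_embedder()

    if disabled or embed is None or raw_count < 2:
        # Dedup switched off, dependency missing, or nothing to compare.
        kept = list(findings)
    else:
        # Assumed semantics: findings closer than epsilon are duplicates.
        epsilon = float(os.environ.get("DEDUP_EPSILON", "0.15"))
        vectors = embed([f["text"] for f in findings])
        kept, kept_vecs = [], []
        for finding, vec in zip(findings, vectors):
            if all(_cosine_distance(vec, kv) > epsilon for kv in kept_vecs):
                kept.append(finding)
                kept_vecs.append(vec)

    return {
        "findings": kept,
        "raw_finding_count": raw_count,
        "finding_count": len(kept),
    }
```

The greedy pass keeps the first finding of each near-duplicate group. The actual deduplicator may cluster differently, so treat this only as a way to reason about the env-var handling and the two counts.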
Type of change
ML tier (if applicable)
Stack affected
Changes
Backend
Frontend
New dependencies
Database / schema changes
Testing
How did you test this?
Tested by running the backend pytest suite, specifically targeting test_scan_dedup.py. Verified that deduplication runs successfully and filters out duplicate findings, and that it falls back gracefully when sentence-transformers is unavailable or DISABLE_DEDUP is enabled.
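One way to exercise the "sentence-transformers is unavailable" path without uninstalling the package is sketched below. It is not taken from the test file in this PR, and `sentence_transformers_available` is a hypothetical probe standing in for whatever import guard the backend uses. It relies on standard Python behaviour: a `None` entry in `sys.modules` makes the corresponding `import` raise ImportError.

```python
import sys
from unittest import mock


def sentence_transformers_available():
    """Hypothetical probe: True only if the optional dependency imports."""
    try:
        import sentence_transformers  # noqa: F401
    except ImportError:
        return False
    return True


def test_reports_unavailable_when_package_missing():
    # A None entry in sys.modules forces the import inside the probe to fail,
    # whether or not the package is actually installed.
    with mock.patch.dict(sys.modules, {"sentence_transformers": None}):
        assert sentence_transformers_available() is False
```

`mock.patch.dict` restores `sys.modules` when the block exits, so other tests in the same run still see the real package if it is installed.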
Checklist
requirements.txt / package.json updated if new dependencies added
Anything reviewers should focus on
Please check the integration within the _run_single_scan_task background task.
Screenshots (if UI changed)