Skip to content

feat: integrate deduplicator into scan response#113

Open
Praneeth2711 wants to merge 6 commits into
ionfwsrijan:mainfrom
Praneeth2711:feat/integrate-deduplicator-scan-response
Open

feat: integrate deduplicator into scan response#113
Praneeth2711 wants to merge 6 commits into
ionfwsrijan:mainfrom
Praneeth2711:feat/integrate-deduplicator-scan-response

Conversation

@Praneeth2711

@Praneeth2711 Praneeth2711 commented Jun 15, 2026

Copy link
Copy Markdown

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #85

What this PR does

This PR integrates the deduplication logic into the backend scan process. It runs the deduplicator on aggregated findings to collapse duplicate findings based on text embeddings. It supports configuring the deduplication via environment variables (DISABLE_DEDUP and DEDUP_EPSILON) and handles cases where sentence-transformers is not installed.

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

  • Modified �ackend/app/main.py to integrate deduplication into the scanning background task before writing to the database.
  • Read environment variables DISABLE_DEDUP and DEDUP_EPSILON to configure deduplication behavior.
  • Added test coverage in �ackend/tests/test_scan_dedup.py to verify deduplication with different configurations.

Frontend

New dependencies

Database / schema changes


Testing

How did you test this?

Tested by running the backend pytest suite, specifically targeting est_scan_dedup.py. Checked that the deduplication runs successfully and filters out duplicate findings, and falls back gracefully when sentence-transformers is unavailable or if DISABLE_DEDUP is enabled.

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • [x]
    equirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Please check the integration within _run_single_scan_task background task.

Screenshots (if UI changed)

@github-actions

Copy link
Copy Markdown

⚠️ Automated Check: This PR does not strictly follow the required template. Please ensure you have not deleted any checkboxes or mandatory headings, and that you have written explanations under What this PR does and How did you test this?.

@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Fix failing checks and PR description.

Join our dc server to connect with fellow contributors and mentors.

https://discord.gg/FcXuyw2Rs

Make sure to star the repo.

- apply deduplication after scan aggregation

- add raw_finding_count and finding_count

- support DISABLE_DEDUP

- support configurable DEDUP_EPSILON

- gracefully skip deduplication when sentence-transformers is unavailable
@Praneeth2711 Praneeth2711 force-pushed the feat/integrate-deduplicator-scan-response branch from f43229b to c2c5dcc Compare June 15, 2026 11:04
@Praneeth2711

Praneeth2711 commented Jun 15, 2026

Copy link
Copy Markdown
Author

Linked issue

Closes #85

What this PR does

This PR integrates the semantic embedding-based deduplicator into the /scan pipeline. Findings are deduplicated before being returned by the API. It adds support for DISABLE_DEDUP=true, configurable DEDUP_EPSILON, graceful fallback when optional ML dependencies are unavailable, and stores deduplicated findings in the database to keep downstream endpoints and reports consistent.

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

  • Added embedding-based deduplication utility.
  • Added DeduplicatedScanResponse.
  • Added raw_finding_count and finding_count fields to /scan responses.
  • Added support for DISABLE_DEDUP=true.
  • Added support for configurable DEDUP_EPSILON.
  • Gracefully skips deduplication when sentence-transformers is unavailable.
  • Stores deduplicated findings in the database.

Frontend

  • None.

New dependencies

  • None.

Database / schema changes

  • Added DeduplicatedScanResponse response schema.
  • Deduplicated findings are persisted instead of raw findings.

Testing

How did you test this?

Added integration tests covering:

  • Deduplication enabled.
  • DISABLE_DEDUP=true.
  • Missing sentence-transformers.
  • Invalid DEDUP_EPSILON.
  • Backend test suite execution.

Ran the backend test suite locally and verified all tests pass.

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Please focus on the lazy loading of the SentenceTransformer model, deduplication logic, and handling of DISABLE_DEDUP and DEDUP_EPSILON.

Screenshots (if UI changed)

Not applicable (backend-only change).

@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Update PR description

@ionfwsrijan

Copy link
Copy Markdown
Owner

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #

What this PR does

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Stack affected

  • Backend
  • Frontend
  • Both

Changes

Backend

Frontend

New dependencies

Database / schema changes


Testing

How did you test this?

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Screenshots (if UI changed)

@Praneeth2711 Strictly use this template as it is

@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Please fix failing checks

@github-actions github-actions Bot added backend Backend issues feature New feature labels Jun 21, 2026
@ionfwsrijan

Copy link
Copy Markdown
Owner

@Praneeth2711 Fix failing checks and PR desc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Backend issues feature New feature needs-work Work needed SSoC26

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Integrate deduplicator into /scan response

2 participants