Embed findings using `sentence-transformers` #83

Description

@ionfwsrijan

The first step in deduplication is converting each finding into a vector embedding so similar findings can be compared mathematically. This issue adds the embedding step as a utility module.

What to implement

Create backend/app/ml/embedder.py:

  • Load all-MiniLM-L6-v2 from sentence-transformers at module level
  • embed_findings(findings: list[dict]) -> np.ndarray — builds an embedding input string per finding as "{rule_id} {message} {file_path}" and returns a 2D numpy array of shape (n, 384)
  • Cache the model load so it only happens once per process
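The module described above could be sketched roughly as follows. This is an illustration under assumptions, not the final implementation: `build_input`, `get_model`, `EMBEDDING_DIM`, and the optional `model` parameter are hypothetical names that do not appear in the issue, and missing keys are assumed to fall back to empty strings. The sketch loads the model on first use and caches it with `lru_cache`; to load at import as the issue asks, a module-level `_MODEL = get_model()` could be added.

```python
from functools import lru_cache

import numpy as np

MODEL_NAME = "all-MiniLM-L6-v2"
EMBEDDING_DIM = 384  # output width of all-MiniLM-L6-v2, per the issue


def build_input(finding: dict) -> str:
    """Build the "{rule_id} {message} {file_path}" string for one finding."""
    # Assumption: a missing key becomes an empty string instead of raising.
    rule_id = finding.get("rule_id", "")
    message = finding.get("message", "")
    file_path = finding.get("file_path", "")
    return f"{rule_id} {message} {file_path}"


@lru_cache(maxsize=1)
def get_model():
    """Load the model once and reuse it for the rest of the process."""
    # Imported lazily so this sketch can be imported without the dependency.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(MODEL_NAME)


def embed_findings(findings: list[dict], model=None) -> np.ndarray:
    """Return an (n, 384) array with one embedding row per finding."""
    if not findings:
        # Handle the empty case explicitly so the shape stays 2D.
        return np.empty((0, EMBEDDING_DIM), dtype=np.float32)
    # `model` is a hypothetical hook for tests; by default use the cached model.
    encoder = model if model is not None else get_model()
    texts = [build_input(finding) for finding in findings]
    return np.asarray(encoder.encode(texts, convert_to_numpy=True))
```

Passing a stub object with an `encode` method as `model` lets unit tests check the input strings and output shape without downloading the real model.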

Add sentence-transformers to requirements.txt.

Acceptance criteria

  • embed_findings() returns an array of shape (n, 384) for a list of n findings
  • Model loads once at import, not per call
  • Graceful error if sentence-transformers is not installed (clear message, not a traceback)
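One way to approach the last criterion is to wrap the import in a small helper that turns `ImportError` into a short, actionable message. The sketch below is only an illustration: `MissingDependencyError`, `require`, and `check_embedding_dependency` are hypothetical names, and the wording of the message is an assumption.

```python
import importlib
import sys


class MissingDependencyError(RuntimeError):
    """Raised when an optional dependency is not installed."""


def require(module_name: str, pip_name: str):
    """Import a module, or raise an error that says how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # `from None` hides the original ImportError chain from the output.
        raise MissingDependencyError(
            f"{module_name} is required to embed findings. "
            f"Install it with: pip install {pip_name}"
        ) from None


def check_embedding_dependency() -> int:
    """Return 0 if sentence-transformers is importable, else print and return 1."""
    try:
        require("sentence_transformers", "sentence-transformers")
    except MissingDependencyError as err:
        # Print one line to stderr instead of letting a traceback surface.
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0
```

The caller decides where to catch the error, so the embedder module itself can raise while an entry point prints the one-line message.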

Metadata

Labels

  • Medium: Medium difficulty
  • SSoC26
  • backend: Backend issues
  • ml: ML related issues
  • tier-1: TIER 1 Upgrade issues
