Description
The first step in deduplication is converting each finding into a vector embedding so similar findings can be compared mathematically. This issue adds the embedding step as a utility module.
What to implement
Create backend/app/ml/embedder.py:
- Load all-MiniLM-L6-v2 from sentence-transformers at module level
- embed_findings(findings: list[dict]) -> np.ndarray: builds an embedding input string per finding as "{rule_id} {message} {file_path}" and returns a 2D numpy array of shape (n, 384)
- Cache the model load so it only happens once per process
Add sentence-transformers to requirements.txt.
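A minimal sketch of what embedder.py could look like, not a definitive implementation. Several details are assumptions rather than part of this issue: the helper names _get_model and build_input, the constants MODEL_NAME and EMBEDDING_DIM, treating a missing key as an empty string, and returning an empty (0, 384) array for an empty list. The sketch also defers the model load to the first call through functools.lru_cache so the module can be imported without the dependency; a module-level call to _get_model() would match the "load at module level" wording more literally.

```python
from functools import lru_cache

import numpy as np

MODEL_NAME = "all-MiniLM-L6-v2"
EMBEDDING_DIM = 384  # output width of all-MiniLM-L6-v2


@lru_cache(maxsize=1)
def _get_model():
    """Load the model once per process; later calls reuse the cached instance."""
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as exc:
        # Assumption: surface a short, actionable message to the caller.
        raise RuntimeError(
            "sentence-transformers is not installed; "
            "install it with: pip install sentence-transformers"
        ) from exc
    return SentenceTransformer(MODEL_NAME)


def build_input(finding: dict) -> str:
    """Format one finding as "{rule_id} {message} {file_path}"."""
    # Assumption: a missing key becomes an empty string instead of raising.
    return "{} {} {}".format(
        finding.get("rule_id", ""),
        finding.get("message", ""),
        finding.get("file_path", ""),
    )


def embed_findings(findings: list[dict]) -> np.ndarray:
    """Return one embedding row per finding, shape (n, 384)."""
    if not findings:
        # Assumption: an empty input yields an empty 2D array, not an error.
        return np.empty((0, EMBEDDING_DIM), dtype=np.float32)
    texts = [build_input(f) for f in findings]
    return np.asarray(_get_model().encode(texts, convert_to_numpy=True))
```

Because _get_model is wrapped in lru_cache(maxsize=1), repeated calls to embed_findings reuse the same model instance within a process.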
Acceptance criteria
- embed_findings() returns an array of the correct shape for a list of findings
- Fails gracefully if sentence-transformers is not installed (clear message, not a traceback)
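One way the "clear message, not a traceback" criterion could be met is a preflight check at the entry point, sketched below. The function names missing_dependency_message and require_embedding_dependencies are hypothetical, and whether to exit the process or raise an error the caller handles depends on how the backend invokes the embedder.

```python
import importlib.util
import sys
from typing import Optional


def missing_dependency_message(module_name: str, pip_name: str) -> Optional[str]:
    """Return a one-line install hint if module_name cannot be found, else None."""
    if importlib.util.find_spec(module_name) is None:
        return f"{pip_name} is not installed. Install it with: pip install {pip_name}"
    return None


def require_embedding_dependencies() -> None:
    """Exit with a plain message (no traceback) when the dependency is missing."""
    message = missing_dependency_message(
        "sentence_transformers", "sentence-transformers"
    )
    if message is not None:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(message)
```

Calling require_embedding_dependencies() before the first embed_findings() call would let a missing install surface as a single line on stderr.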