fix(embedding-search): guard cosineSimilarity against mismatched-length vectors #451
tirth8205 wants to merge 2 commits into
…th vectors
cosineSimilarity looped over a.length and dereferenced b[i] without checking that the lengths match, so mismatched vectors returned NaN (extra dims in a) or silently overstated similarity (extra dims in b). A NaN score then failed the threshold check in SemanticSearchEngine.search, silently dropping the node.
Guard with an early 'a.length !== b.length' return 0 to honour the documented numeric contract.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
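The two failure modes named in this commit message can be reproduced with a small sketch. It assumes vectors are plain `number[]`; `cosineSimilarityUnguarded` and `cosineSimilarityGuarded` are illustrative names and a plausible reconstruction, not the repository's actual code.

```typescript
// Hypothetical reconstruction of the pre-fix loop: it iterates over a.length only.
function cosineSimilarityUnguarded(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    // Past the end of b, b[i] is undefined at runtime, so these sums become NaN.
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Same computation with the early length check described above.
function cosineSimilarityGuarded(a: number[], b: number[]): number {
  if (a.length !== b.length) return 0;
  return cosineSimilarityUnguarded(a, b);
}

// Extra dimension in a: the unguarded version yields NaN.
const extraInA = cosineSimilarityUnguarded([1, 0, 1], [1, 0]);
// Extra dimension in b: the trailing 5 is never read, so the result is 1.
const extraInB = cosineSimilarityUnguarded([1, 0], [1, 0, 5]);
// With the guard, both mismatched cases return 0.
const guardedA = cosineSimilarityGuarded([1, 0, 1], [1, 0]);
const guardedB = cosineSimilarityGuarded([1, 0], [1, 0, 5]);
```

In the second case the ignored component would otherwise have lowered the similarity, which is why the unguarded result overstates it.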
thejesh23
left a comment
A few concerns.
1. Silent guard hides the upstream bug. A length mismatch means an embedding in the map was produced by a different model/dimension than queryEmbedding — that is a corrupted index, not a normal data point. Returning 0 turns it into "this node is just dissimilar," so it silently disappears from results forever and nobody notices the index is stale. At minimum a one-time console.warn (or a counter on the engine) the first time it fires would surface the real problem; otherwise a user re-running search after a model upgrade gets quietly degraded recall with no signal.
2. Root cause not addressed in SemanticSearchEngine. The realistic source of mismatch (per the PR body) is embeddings persisted from a prior run/model being loaded alongside a fresh queryEmbedding. search() is the natural place to detect this — e.g. record the expected dimension on first call / from the constructor and reject or skip non-conforming stored embeddings up front, instead of paying a .length check inside the hot cosineSimilarity loop for every node on every query. The pure-function guard is fine as a belt; the engine-level check is the suspenders.
3. Test gap: the search() path itself. Both new asserts are on cosineSimilarity directly. There's no test that SemanticSearchEngine.search() with one mis-sized stored embedding (a) doesn't throw, (b) drops only that node, and (c) still returns the correctly-sized neighbours in score order. That's the actual user-visible behavior this PR is defending. The `1 - similarity` scoring + `>= threshold` interaction is also worth pinning: a mis-sized embedding now scores 1, i.e. worst, which is the intended outcome.
Nit: the JSDoc still only documents the zero-magnitude case — worth adding "...or if the vectors have different lengths" so the next reader doesn't re-discover this.
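Points 1 and 2 could be combined roughly as follows. This is only a sketch of one possible engine-side check, not code from the PR: `partitionByDimension`, `warnOnceAboutSkipped`, and the module-level flag are hypothetical names, and stored embeddings are assumed to live in a `Map<string, number[]>`.

```typescript
// Hypothetical helper: split stored embeddings by whether they match the expected dimension.
function partitionByDimension(
  embeddings: Map<string, number[]>,
  expectedDim: number,
): { conforming: Map<string, number[]>; skippedIds: string[] } {
  const conforming = new Map<string, number[]>();
  const skippedIds: string[] = [];
  embeddings.forEach((vec, id) => {
    if (vec.length === expectedDim) {
      conforming.set(id, vec);
    } else {
      skippedIds.push(id);
    }
  });
  return { conforming, skippedIds };
}

// Module-level flag so the warning is emitted at most once per process.
let warnedAboutDimension = false;

function warnOnceAboutSkipped(skippedIds: string[], expectedDim: number): void {
  if (skippedIds.length === 0 || warnedAboutDimension) return;
  warnedAboutDimension = true;
  console.warn(
    `semantic search: skipped ${skippedIds.length} stored embedding(s) ` +
      `whose length is not ${expectedDim}; the index may be stale`,
  );
}

// Usage: take the expected dimension from the query embedding, then filter up front.
const storedExample = new Map<string, number[]>([
  ["fresh-a", [1, 0, 0]],
  ["fresh-b", [0, 1, 0]],
  ["stale", [1, 0]], // produced by a different model/dimension
]);
const queryExample = [1, 0, 0];
const partitioned = partitionByDimension(storedExample, queryExample.length);
warnOnceAboutSkipped(partitioned.skippedIds, queryExample.length);
```

Filtering once before the scoring loop keeps the per-pair check out of the hot path, and the list of skipped ids gives the caller something to count or log.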
Document and test the length-mismatch guard in cosineSimilarity:
- Expand JSDoc to cover the different-lengths case (returns 0).
- Comment why a stale/corrupt-index mismatch is intentionally swallowed as 0 in the pure helper, with a TODO to surface it engine-side (warn-once / skipped-count) as a follow-up.
- Add SemanticSearchEngine.search() tests: one mis-sized stored embedding does not throw, correctly-sized neighbours return in score order, the mismatched node ranks last at the default threshold (0) and is filtered out under a positive threshold, pinning the `1 - similarity` + `>= threshold` interaction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
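The threshold interaction described in this commit message can be illustrated with a stand-in for the engine. The sketch assumes that search keeps nodes with `similarity >= threshold`, scores them as `1 - similarity`, and returns them in ascending score order; `cosineSketch` and `searchSketch` are hypothetical and do not reproduce SemanticSearchEngine.

```typescript
// Guarded cosine similarity: 0 for mismatched lengths or zero magnitude.
function cosineSketch(a: number[], b: number[]): number {
  if (a.length !== b.length) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Assumed search shape: filter on similarity, score as distance, best (lowest) first.
function searchSketch(
  query: number[],
  stored: Map<string, number[]>,
  threshold: number = 0,
): Array<{ id: string; score: number }> {
  const hits: Array<{ id: string; score: number }> = [];
  stored.forEach((vec, id) => {
    const similarity = cosineSketch(query, vec);
    if (similarity >= threshold) {
      hits.push({ id, score: 1 - similarity });
    }
  });
  return hits.sort((x, y) => x.score - y.score);
}

const nodes = new Map<string, number[]>([
  ["stale", [1, 0, 0]], // mis-sized relative to the 2-d query
  ["close", [1, 1]],
  ["exact", [1, 0]],
]);

// Default threshold (0): the mis-sized node is kept but ranks last with score 1.
const atDefault = searchSketch([1, 0], nodes);
// Positive threshold: similarity 0 no longer passes, so the node is filtered out.
const atPositive = searchSketch([1, 0], nodes, 0.5);
```

Under these assumptions a similarity of 0 sits exactly on the default threshold, which is why the mis-sized node survives at threshold 0 and disappears under any positive one.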
Addressed point by point.
3 (test gap, search() path): Added three
nit (JSDoc): Updated to "Returns 0 if either vector has zero magnitude or if the two vectors have different lengths."
1 (silent guard hides corrupt index): Agreed the mismatch shouldn't vanish without a trail. I added a comment + TODO in
2 (detect at engine level): I left this as a follow-up rather than implementing it here. It's a stateful design change (record an expected dimension, then decide skip vs warn vs throw on non-conforming stored vectors) that expands past this PR's NaN-safety scope, and the per-call
Full core suite green (696 tests);
Problem
Fix
Testing
Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and `tsc --noEmit` all pass locally on this branch.
Found via a static correctness audit of the semantic/embedding search.
🤖 Generated with Claude Code