
feat: EmbeddingProvider protocol, hash/tfidf embedders, vector search #36

Open
prakashUXtech wants to merge 3 commits into main from feat/vector-search

Conversation

@prakashUXtech
Contributor

Summary

Implements the vector search layer from the vision spec:

  • EmbeddingProvider protocol — Runtime-checkable interface with dimensions, embed(), embed_batch()
  • HashEmbedder — Deterministic MD5-based embedder for testing (configurable dimensions, L2-normalized)
  • TFIDFEmbedder — Corpus-based TF-IDF embedder with fit() (better similarity, L2-normalized)
  • Similarity functions — cosine_similarity, euclidean_distance, dot_product (pure stdlib)
  • VectorSearchStrategy — Pluggable search strategy using embeddings, threshold filtering, compatible with MemoryEntry
  • Zero external dependencies — everything uses stdlib + pydantic only
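The protocol-plus-embedder design above can be sketched as follows. This is illustrative, not the PR's actual code: the trigram bucketing detail in `HashEmbedder` is an assumption, but it matches the stated contract (MD5-based, deterministic, configurable dimensions, L2-normalized, runtime-checkable).

```python
import hashlib
import math
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    """Any embedding backend: a dimension count plus embed methods."""

    dimensions: int

    def embed(self, text: str) -> list[float]: ...
    def embed_batch(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Deterministic MD5 embedder: hashes character trigrams into buckets."""

    def __init__(self, dimensions: int = 64) -> None:
        self.dimensions = dimensions

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dimensions
        for i in range(max(len(text) - 2, 1)):
            gram = text[i : i + 3]
            bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16)
            vec[bucket % self.dimensions] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # L2-normalize so cosine similarity reduces to a dot product.
        return [v / norm for v in vec] if norm else vec

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [self.embed(t) for t in texts]
```

Because the protocol is `@runtime_checkable`, `isinstance(HashEmbedder(), EmbeddingProvider)` passes with no inheritance, which is what the protocol-compliance tests rely on.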

Test plan

  • Protocol compliance tests (isinstance checks for both embedders)
  • HashEmbedder tests (determinism, dimensions, normalization, batch)
  • TFIDFEmbedder tests (fit, embed, similarity between related texts)
  • Similarity tests (cosine, euclidean, dot product, edge cases)
  • VectorSearchStrategy tests (integration with MemoryEntry, threshold, limit)
  • E2E test (topic-based search across programming/cooking/sports memories)
  • All 148 tests pass, zero regressions

… strategy

Add pluggable embedding subsystem for semantic memory search:
- EmbeddingProvider: runtime-checkable Protocol for any embedding backend
- HashEmbedder: deterministic hash-based embedder for testing pipelines
- TFIDFEmbedder: stdlib-only TF-IDF embedder with corpus fitting
- Similarity functions: cosine_similarity, euclidean_distance, dot_product
- VectorSearchStrategy: semantic search over MemoryEntry candidates

All implementations use only stdlib (no numpy required). 81 new tests
covering protocol compliance, embedder behavior, similarity math,
vector strategy integration, and end-to-end topic-based search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
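The stdlib-only TF-IDF embedder with corpus fitting might look roughly like this. Whitespace/lowercase tokenization is an assumption; the IDF formula is the smoothed sklearn-style one the self-review below refers to, `log((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter


class TFIDFEmbedder:
    """Corpus-fitted TF-IDF; dimensions are fixed by the fitted vocabulary."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}
        self._idf: list[float] = []

    @property
    def dimensions(self) -> int:
        return len(self._vocab)

    def fit(self, corpus: list[str]) -> None:
        docs = [doc.lower().split() for doc in corpus]
        df = Counter(term for tokens in docs for term in set(tokens))
        self._vocab = {term: i for i, term in enumerate(sorted(df))}
        n = len(corpus)
        # Smoothed sklearn-style IDF: never zero, robust to common terms.
        self._idf = [
            math.log((1 + n) / (1 + df[term])) + 1.0 for term in sorted(df)
        ]

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dimensions
        for term, count in Counter(text.lower().split()).items():
            i = self._vocab.get(term)
            if i is not None:
                vec[i] = count * self._idf[i]
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec
```

After `fit()`, texts sharing vocabulary score higher against each other than against unrelated texts, which is what the E2E topic-based search exercises.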
@github-actions

github-actions bot commented Mar 6, 2026

Issues (must fix)

  • No linked issue found. PRs should reference an issue (Fixes #123).
  • No evidence of local testing found. Please include terminal output or screenshots.

Heads up

  • Large PR detected (1399 lines across 17 files). Consider splitting into smaller PRs.

Please update your PR to address these points.

Two-layer architecture: core/ holds protocol primitives (EmbeddingProvider,
cosine_similarity, euclidean_distance, dot_product), while embeddings/
re-exports them for backward compatibility. Implementations (HashEmbedder,
TFIDFEmbedder, VectorSearchStrategy) stay at engine level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prakashUXtech
Contributor Author

Self-Review: PR #36 — EmbeddingProvider + Vector Search

What's Good

  • Core/engine boundary is exactly right — EmbeddingProvider, cosine_similarity, euclidean_distance, dot_product in core/embeddings/
  • Implementations (HashEmbedder, TFIDFEmbedder, VectorSearchStrategy) stay engine-level
  • Re-export shims are clean and minimal
  • HashEmbedder is well-designed: MD5 + character n-grams, deterministic, L2-normalized
  • TFIDFEmbedder uses smoothed sklearn-style IDF formula — correct choice
  • Zero-vector handling in cosine_similarity returns 0.0 (right contract for search)
  • 81 tests, strong coverage including E2E topic-based search
  • Zero external dependencies — stdlib + pydantic only

Issues Found (Blocking)

  1. VectorSearchStrategy.search() ignores the pre-built index — search() re-embeds every candidate on every call, making index() / index_batch() misleading. Users who pre-warm the index still pay full embedding cost. Fix: use the index as a lookup cache in search(), falling back to on-the-fly embedding for unindexed content.

  2. Mismatched-length vectors silently produce wrong answers — cosine_similarity uses zip(a, b), which truncates to the shorter vector. Comparing a 64-dim HashEmbedder output against a 128-dim TFIDFEmbedder output gives a geometrically wrong answer with no warning. Add a length guard:

    if len(a) != len(b):
        raise ValueError(f"Vector length mismatch: {len(a)} vs {len(b)}")
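The cache fix proposed for issue 1 could look like the sketch below. The names (`Entry` as a stand-in for MemoryEntry, the dict-backed index, `CountingEmbedder`) are illustrative, not the PR's actual API; the counting embedder exists only to show that indexed entries are not re-embedded.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """Stand-in for MemoryEntry: just an id and text content."""
    id: str
    content: str


class CountingEmbedder:
    """Toy embedder that counts calls, to demonstrate the cache."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> list[float]:
        self.calls += 1
        return [float(len(text)), 1.0]


class VectorSearchStrategy:
    def __init__(self, embedder) -> None:
        self._embedder = embedder
        self._index: dict[str, list[float]] = {}

    def index(self, entry: Entry) -> None:
        self._index[entry.id] = self._embedder.embed(entry.content)

    def search(self, query: str, candidates: list[Entry],
               threshold: float = 0.0, limit: int = 10) -> list[Entry]:
        qv = self._embedder.embed(query)
        scored = []
        for entry in candidates:
            # Cache hit: reuse the pre-built index, no re-embedding.
            vec = self._index.get(entry.id)
            if vec is None:
                vec = self._embedder.embed(entry.content)
            score = self._cosine(qv, vec)
            if score >= threshold:
                scored.append((score, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:limit]]

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        if na == 0.0 or nb == 0.0:
            return 0.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
```

With this shape, pre-warming via index() means a search over N indexed candidates costs one embedding call (the query) instead of N + 1.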

Issues Found (Non-blocking)

  1. pyproject.toml has vector = ["numpy>=1.24"] extra that nothing uses — Either wire numpy into a fast path or remove the vestigial extra.

  2. TFIDFEmbedder silently returns zero vector before fit() — Consider RuntimeError or warnings.warn() instead of silent zero vector.

  3. test_threshold_filters_results assertion is vacuously true — len(results) <= len(entries) with threshold=0.99 is always true. Should assert len(results) == 0.

  4. Top-level __init__.py doesn't export embedding classes — Users need from soul_protocol.embeddings import ... — if intentional, document it.

  5. TFIDFEmbedder.fit() is not idempotent — Silently replaces vocabulary. Worth documenting.
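Non-blocking item 2 could be addressed with a warning as below. This is a minimal stand-in (the `_fitted` flag and `RuntimeWarning` category are assumptions), showing only the warn-before-fit behavior rather than real TF-IDF logic.

```python
import warnings


class TFIDFEmbedderStub:
    """Minimal stand-in showing the warn-before-fit behavior."""

    def __init__(self, dimensions: int = 8) -> None:
        self.dimensions = dimensions
        self._fitted = False

    def fit(self, corpus: list[str]) -> None:
        self._fitted = True  # real corpus fitting elided

    def embed(self, text: str) -> list[float]:
        if not self._fitted:
            warnings.warn(
                "TFIDFEmbedder.embed() called before fit(); "
                "returning a zero vector",
                RuntimeWarning,
            )
            return [0.0] * self.dimensions
        return [1.0] * self.dimensions  # real TF-IDF elided
```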

Verdict

Changes requested. Two blocking issues: the index/search disconnect and the silent vector length truncation. The overall implementation quality is high — fix those two and this is a strong merge.

…tions

- search() now uses pre-built index as vector cache instead of re-embedding
  every candidate on each call, making index()/index_batch() useful
- Added vector length mismatch guards (ValueError) to cosine_similarity,
  euclidean_distance, and dot_product
- Fixed vacuous assertion in test_threshold_filters_results (was <=, now <)
- Added warnings.warn() to TFIDFEmbedder.embed() when called before fit()
- Added tests for vector length mismatch errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
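Applied to all three similarity functions, the length guards from this commit would look roughly like this (a sketch consistent with the review's suggested ValueError and the zero-vector contract noted earlier):

```python
import math


def _check_lengths(a: list[float], b: list[float]) -> None:
    # Fail loudly instead of letting zip() silently truncate.
    if len(a) != len(b):
        raise ValueError(f"Vector length mismatch: {len(a)} vs {len(b)}")


def dot_product(a: list[float], b: list[float]) -> float:
    _check_lengths(a, b)
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    _check_lengths(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    _check_lengths(a, b)
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # zero-vector contract: 0.0, not NaN
    return dot_product(a, b) / (na * nb)
```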
