
feat: EmbeddingProvider protocol, hash/tfidf embedders, vector search #36

Open
prakashUXtech wants to merge 3 commits into main from feat/vector-search

Conversation

@prakashUXtech
Contributor

Summary

Implements the vector search layer from the vision spec:

  • EmbeddingProvider protocol — Runtime-checkable interface with dimensions, embed(), embed_batch()
  • HashEmbedder — Deterministic MD5-based embedder for testing (configurable dimensions, L2-normalized)
  • TFIDFEmbedder — Corpus-based TF-IDF embedder with fit() (better similarity, L2-normalized)
  • Similarity functions — cosine_similarity, euclidean_distance, dot_product (pure stdlib)
  • VectorSearchStrategy — Pluggable search strategy using embeddings, threshold filtering, compatible with MemoryEntry
  • Zero external dependencies — everything uses stdlib + pydantic only
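The protocol-plus-embedder design above can be sketched as follows. This is illustrative, not the PR's actual code: the trigram bucketing detail in `HashEmbedder` is an assumption, but it matches the stated contract (MD5-based, deterministic, configurable dimensions, L2-normalized, runtime-checkable).

```python
import hashlib
import math
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    """Any embedding backend: a dimension count plus embed methods."""

    dimensions: int

    def embed(self, text: str) -> list[float]: ...
    def embed_batch(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Deterministic MD5 embedder: hashes character trigrams into buckets."""

    def __init__(self, dimensions: int = 64) -> None:
        self.dimensions = dimensions

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dimensions
        for i in range(max(len(text) - 2, 1)):
            gram = text[i : i + 3]
            bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16)
            vec[bucket % self.dimensions] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # L2-normalize so cosine similarity reduces to a dot product.
        return [v / norm for v in vec] if norm else vec

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [self.embed(t) for t in texts]
```

Because the protocol is `@runtime_checkable`, `isinstance(HashEmbedder(), EmbeddingProvider)` passes with no inheritance, which is what the protocol-compliance tests rely on.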

Test plan

  • Protocol compliance tests (isinstance checks for both embedders)
  • HashEmbedder tests (determinism, dimensions, normalization, batch)
  • TFIDFEmbedder tests (fit, embed, similarity between related texts)
  • Similarity tests (cosine, euclidean, dot product, edge cases)
  • VectorSearchStrategy tests (integration with MemoryEntry, threshold, limit)
  • E2E test (topic-based search across programming/cooking/sports memories)
  • All 148 tests pass, zero regressions

… strategy

Add pluggable embedding subsystem for semantic memory search:
- EmbeddingProvider: runtime-checkable Protocol for any embedding backend
- HashEmbedder: deterministic hash-based embedder for testing pipelines
- TFIDFEmbedder: stdlib-only TF-IDF embedder with corpus fitting
- Similarity functions: cosine_similarity, euclidean_distance, dot_product
- VectorSearchStrategy: semantic search over MemoryEntry candidates

All implementations use only stdlib (no numpy required). 81 new tests
covering protocol compliance, embedder behavior, similarity math,
vector strategy integration, and end-to-end topic-based search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
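The stdlib-only TF-IDF embedder with corpus fitting might look roughly like this. Whitespace/lowercase tokenization is an assumption; the IDF formula is the smoothed sklearn-style one the self-review below refers to, `log((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter


class TFIDFEmbedder:
    """Corpus-fitted TF-IDF; dimensions are fixed by the fitted vocabulary."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}
        self._idf: list[float] = []

    @property
    def dimensions(self) -> int:
        return len(self._vocab)

    def fit(self, corpus: list[str]) -> None:
        docs = [doc.lower().split() for doc in corpus]
        df = Counter(term for tokens in docs for term in set(tokens))
        self._vocab = {term: i for i, term in enumerate(sorted(df))}
        n = len(corpus)
        # Smoothed sklearn-style IDF: never zero, robust to common terms.
        self._idf = [
            math.log((1 + n) / (1 + df[term])) + 1.0 for term in sorted(df)
        ]

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dimensions
        for term, count in Counter(text.lower().split()).items():
            i = self._vocab.get(term)
            if i is not None:
                vec[i] = count * self._idf[i]
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec
```

After `fit()`, texts sharing vocabulary score higher against each other than against unrelated texts, which is what the E2E topic-based search exercises.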
@github-actions

github-actions bot commented Mar 6, 2026

Issues (must fix)

  • No linked issue found. PRs should reference an issue (Fixes #123).
  • No evidence of local testing found. Please include terminal output or screenshots.

Heads up

  • Large PR detected (1399 lines across 17 files). Consider splitting into smaller PRs.

Please update your PR to address these points.

Two-layer architecture: core/ holds protocol primitives (EmbeddingProvider,
cosine_similarity, euclidean_distance, dot_product), while embeddings/
re-exports them for backward compatibility. Implementations (HashEmbedder,
TFIDFEmbedder, VectorSearchStrategy) stay at engine level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prakashUXtech
Contributor Author

Self-Review: PR #36 — EmbeddingProvider + Vector Search

What's Good

  • Core/engine boundary is exactly right — EmbeddingProvider, cosine_similarity, euclidean_distance, dot_product in core/embeddings/
  • Implementations (HashEmbedder, TFIDFEmbedder, VectorSearchStrategy) stay engine-level
  • Re-export shims are clean and minimal
  • HashEmbedder is well-designed: MD5 + character n-grams, deterministic, L2-normalized
  • TFIDFEmbedder uses smoothed sklearn-style IDF formula — correct choice
  • Zero-vector handling in cosine_similarity returns 0.0 (right contract for search)
  • 81 tests, strong coverage including E2E topic-based search
  • Zero external dependencies — stdlib + pydantic only

Issues Found (Blocking)

  1. VectorSearchStrategy.search() ignores the pre-built index — search() re-embeds every candidate on every call, making index() / index_batch() misleading. Users who pre-warm the index still pay full embedding cost. Fix: use the index as a lookup cache in search(), falling back to on-the-fly embedding for unindexed content.

  2. Mismatched-length vectors silently produce wrong answers — cosine_similarity uses zip(a, b), which truncates to the shorter vector. Comparing a 64-dim HashEmbedder output against a 128-dim TFIDFEmbedder output gives a geometrically wrong answer with no warning. Add a length guard:

    if len(a) != len(b):
        raise ValueError(f"Vector length mismatch: {len(a)} vs {len(b)}")
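The cache fix proposed for issue 1 could look like the sketch below. The names (`Entry` as a stand-in for MemoryEntry, the dict-backed index, `CountingEmbedder`) are illustrative, not the PR's actual API; the counting embedder exists only to show that indexed entries are not re-embedded.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """Stand-in for MemoryEntry: just an id and text content."""
    id: str
    content: str


class CountingEmbedder:
    """Toy embedder that counts calls, to demonstrate the cache."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> list[float]:
        self.calls += 1
        return [float(len(text)), 1.0]


class VectorSearchStrategy:
    def __init__(self, embedder) -> None:
        self._embedder = embedder
        self._index: dict[str, list[float]] = {}

    def index(self, entry: Entry) -> None:
        self._index[entry.id] = self._embedder.embed(entry.content)

    def search(self, query: str, candidates: list[Entry],
               threshold: float = 0.0, limit: int = 10) -> list[Entry]:
        qv = self._embedder.embed(query)
        scored = []
        for entry in candidates:
            # Cache hit: reuse the pre-built index, no re-embedding.
            vec = self._index.get(entry.id)
            if vec is None:
                vec = self._embedder.embed(entry.content)
            score = self._cosine(qv, vec)
            if score >= threshold:
                scored.append((score, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:limit]]

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        if na == 0.0 or nb == 0.0:
            return 0.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
```

With this shape, pre-warming via index() means a search over N indexed candidates costs one embedding call (the query) instead of N + 1.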

Issues Found (Non-blocking)

  1. pyproject.toml has vector = ["numpy>=1.24"] extra that nothing uses — Either wire numpy into a fast path or remove the vestigial extra.

  2. TFIDFEmbedder silently returns zero vector before fit() — Consider RuntimeError or warnings.warn() instead of silent zero vector.

  3. test_threshold_filters_results assertion is vacuously true — len(results) <= len(entries) with threshold=0.99 is always true. Should assert len(results) == 0.

  4. Top-level __init__.py doesn't export embedding classes — Users need from soul_protocol.embeddings import ... — if intentional, document it.

  5. TFIDFEmbedder.fit() is not idempotent — Silently replaces vocabulary. Worth documenting.
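Non-blocking item 2 could be addressed with a warning as below. This is a minimal stand-in (the `_fitted` flag and `RuntimeWarning` category are assumptions), showing only the warn-before-fit behavior rather than real TF-IDF logic.

```python
import warnings


class TFIDFEmbedderStub:
    """Minimal stand-in showing the warn-before-fit behavior."""

    def __init__(self, dimensions: int = 8) -> None:
        self.dimensions = dimensions
        self._fitted = False

    def fit(self, corpus: list[str]) -> None:
        self._fitted = True  # real corpus fitting elided

    def embed(self, text: str) -> list[float]:
        if not self._fitted:
            warnings.warn(
                "TFIDFEmbedder.embed() called before fit(); "
                "returning a zero vector",
                RuntimeWarning,
            )
            return [0.0] * self.dimensions
        return [1.0] * self.dimensions  # real TF-IDF elided
```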

Verdict

Changes requested. Two blocking issues: the index/search disconnect and the silent vector length truncation. The overall implementation quality is high — fix those two and this is a strong merge.

…tions

- search() now uses pre-built index as vector cache instead of re-embedding
  every candidate on each call, making index()/index_batch() useful
- Added vector length mismatch guards (ValueError) to cosine_similarity,
  euclidean_distance, and dot_product
- Fixed vacuous assertion in test_threshold_filters_results (was <=, now <)
- Added warnings.warn() to TFIDFEmbedder.embed() when called before fit()
- Added tests for vector length mismatch errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
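Applied to all three similarity functions, the length guards from this commit would look roughly like this (a sketch consistent with the review's suggested ValueError and the zero-vector contract noted earlier):

```python
import math


def _check_lengths(a: list[float], b: list[float]) -> None:
    # Fail loudly instead of letting zip() silently truncate.
    if len(a) != len(b):
        raise ValueError(f"Vector length mismatch: {len(a)} vs {len(b)}")


def dot_product(a: list[float], b: list[float]) -> float:
    _check_lengths(a, b)
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    _check_lengths(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    _check_lengths(a, b)
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # zero-vector contract: 0.0, not NaN
    return dot_product(a, b) / (na * nb)
```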
