feat: kbx memory similar — semantic similarity lookup for dedup #71

@tenfourty

Summary

Add a kbx memory similar command and Python/MCP API equivalent that uses existing vector embeddings to find semantically similar facts or entity content before writing — enabling plugins and agents to check for duplicates without needing an LLM.

kbx memory similar "prefers async communication" --entity "Person A" --threshold 0.85

Motivation

Plugins and agents frequently write facts and entity updates to kbx. Without a pre-write similarity check, they produce duplicates:

  • Fact "prefers async Slack over email" already exists → agent writes "favours asynchronous communication via Slack" → near-duplicate
  • Entity role is "Engineering Director" → plugin calls kbx person edit --role "Director of Engineering" → semantically identical update
  • Open item "follow up on migration timeline" → debrief extracts "check migration schedule status" → same action, different words

The vector embeddings already exist in the index. This feature exposes them for pre-write similarity lookup — a cheap, fast check that gives plugins the signal they need to decide CREATE vs MERGE vs SKIP, without burning an LLM call on every write.

Inspiration: OpenViking's MemoryDeduplicator uses a two-step pipeline: a vector pre-filter (embed the candidate, search within the same category) followed by an LLM decision. This issue implements the vector pre-filter step as a standalone capability. The LLM decision step lives in the calling plugin, not in kbx.

Design

1. kbx memory similar Command

Search existing facts for a given entity that are semantically similar to candidate text.

# Find similar facts for an entity
kbx memory similar "prefers async communication" --entity "Person A" --threshold 0.85 --json

# Output:
# {
#   "query": "prefers async communication",
#   "entity": "Person A",
#   "threshold": 0.85,
#   "matches": [
#     {
#       "text": "Favours Slack DMs over email for quick questions",
#       "score": 0.91,
#       "fact_id": 1234,
#       "source_path": "memory/people/person-a.md",
#       "date": "2026-02-15"
#     },
#     {
#       "text": "Prefers written async updates to synchronous meetings",
#       "score": 0.87,
#       "fact_id": 1235,
#       "source_path": "memory/people/person-a.md",
#       "date": "2026-01-20"
#     }
#   ]
# }

# Search across all entities (no --entity filter)
kbx memory similar "deployment rollback procedure" --threshold 0.80 --json

# Search within a specific document's content
kbx memory similar "migration blocked by auth dependency" --path "memory/projects/project-x.md" --json

Parameters:

| Flag | Default | Description |
| --- | --- | --- |
| --entity NAME | None | Scope to facts for a specific entity |
| --path PATH | None | Scope to content within a specific document |
| --threshold FLOAT | 0.85 | Minimum similarity score (0.0–1.0) |
| --limit INT | 5 | Maximum number of matches |
| --json | False | JSON output |

Implementation: Embed the candidate text using the existing embedder, then vector search within the scoped collection (entity facts, document chunks, or global). Filter by threshold. Return matches with scores.
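
The embed, search, and filter steps described above can be sketched in a few lines. This is a minimal illustration, not the kbx implementation: `toy_embed` is a hypothetical stand-in for the existing embedder (a hashed bag-of-words vector), the `Match` type and `similar` function are names invented here, and a real index would use its vector store rather than a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass
class Match:
    text: str
    score: float


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Hypothetical stand-in embedder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dims
    for token in text.lower().split():
        # Deterministic bucket per token (Python's hash() is salted per process).
        vec[sum(ord(c) for c in token) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for unit-length vectors reduces to a dot product."""
    return sum(x * y for x, y in zip(a, b))


def similar(query, candidates, threshold=0.85, limit=5, embed=toy_embed):
    """Embed the query, score every candidate, keep those at or above threshold."""
    q = embed(query)
    scored = [Match(text, cosine(q, embed(text))) for text in candidates]
    kept = [m for m in scored if m.score >= threshold]
    kept.sort(key=lambda m: m.score, reverse=True)
    return kept[:limit]
```

Scoping by `--entity` or `--path` would amount to choosing which `candidates` are passed in; the toy embedder only matches on shared tokens, so it cannot reproduce the paraphrase matches a real embedding model gives.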

2. kbx entity similar-field Command

Check if a proposed entity field update is semantically equivalent to what's already stored.

# Check if a role update is redundant
kbx entity similar-field "Person A" --field role --value "Engineering Director" --json
# {
#   "entity": "Person A",
#   "field": "role",
#   "current_value": "Director of Engineering",
#   "proposed_value": "Engineering Director",
#   "similarity": 0.96,
#   "is_similar": true,
#   "threshold": 0.85
# }

# Check team field
kbx entity similar-field "Person A" --field team --value "Platform & Infrastructure" --json
# {
#   "entity": "Person A",
#   "field": "team",
#   "current_value": "Platform",
#   "proposed_value": "Platform & Infrastructure",
#   "similarity": 0.82,
#   "is_similar": false,
#   "threshold": 0.85
# }

Supported fields: role, team, and any custom metadata field. For short text fields, use direct embedding similarity. For facts and open items (lists), use kbx memory similar under the hood.

Short-text handling: Very short strings (< 10 tokens) may not embed well. For these, also compute normalized Levenshtein distance and take the max of embedding similarity and string similarity:

final_similarity = max(vector_similarity, 1.0 - normalized_levenshtein(current, proposed))
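
A minimal sketch of that formula, assuming normalisation divides the edit distance by the longer string's length and that comparison is case-insensitive (neither detail is specified in this issue):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to 0.0-1.0 by the longer string's length (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def final_similarity(vector_similarity: float, current: str, proposed: str) -> float:
    """Take the stronger of the embedding signal and the string signal."""
    string_similarity = 1.0 - normalized_levenshtein(current.lower(), proposed.lower())
    return max(vector_similarity, string_similarity)
```

Because the result is a max, the string signal can only raise the score: it rescues near-identical short strings that embed poorly, and leaves reordered phrases such as "Director of Engineering" to the embedding.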

3. Threshold Tuning

Different content types need different thresholds:

| Content type | Recommended threshold | Rationale |
| --- | --- | --- |
| Short facts (< 20 words) | 0.90 | Short texts embed less distinctly; higher threshold avoids false positives |
| Long facts (20+ words) | 0.85 | More embedding signal; standard threshold works |
| Role/team fields | 0.85 | Short but domain-specific vocabulary |
| Open items | 0.80 | Action items can be phrased very differently while meaning the same thing |
| Document chunks | 0.80 | Longer text, more variation in phrasing |

Default: 0.85 (good general-purpose threshold). Overridable via --threshold flag.

Configurable in kbx.toml:

[dedup]
default_threshold = 0.85
fact_threshold = 0.90
open_item_threshold = 0.80

4. JSON Output

All commands return structured JSON with --json:

{
  "query": "candidate text",
  "scope": {"entity": "Person A"},
  "threshold": 0.85,
  "matches": [
    {
      "text": "matched text",
      "score": 0.91,
      "fact_id": 1234,
      "source_path": "memory/people/person-a.md",
      "date": "2026-02-15",
      "entity_name": "Person A",
      "match_type": "fact"
    }
  ],
  "has_similar": true,
  "best_score": 0.91
}

Key fields for programmatic consumers:

  • has_similar: boolean, true if any match above threshold (quick check)
  • best_score: highest similarity score (for threshold comparison)
  • matches[].fact_id: enables MERGE operations (kbx memory edit-fact <id>)
  • matches[].match_type: one of fact, open_item, chunk, field (tells consumer what matched)
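
The summary fields can be derived from the match list alone. The sketch below shows one possible derivation; `build_response` is a hypothetical helper, treating "above threshold" as greater-than-or-equal is an assumption, and so is returning `None` for `best_score` when nothing matches (the issue does not say what that case looks like).

```python
import json


def build_response(query: str, scope: dict, threshold: float, matches: list[dict]) -> dict:
    """Filter and sort raw matches, then derive has_similar and best_score."""
    kept = sorted(
        (m for m in matches if m["score"] >= threshold),
        key=lambda m: m["score"],
        reverse=True,
    )
    return {
        "query": query,
        "scope": scope,
        "threshold": threshold,
        "matches": kept,
        "has_similar": bool(kept),
        # Assumption: no match is reported as null rather than 0.0.
        "best_score": kept[0]["score"] if kept else None,
    }


raw = [
    {"text": "matched text", "score": 0.91, "fact_id": 1234, "match_type": "fact"},
    {"text": "unrelated text", "score": 0.42, "fact_id": 1300, "match_type": "fact"},
]
response = build_response("candidate text", {"entity": "Person A"}, 0.85, raw)
payload = json.dumps(response, indent=2)
```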

5. MCP Tool

Expose as an MCP tool for agent-driven dedup:

{
  "name": "kbx_memory_similar",
  "description": "Find existing facts or content semantically similar to candidate text. Use before writing new facts to check for duplicates.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "text": {"type": "string", "description": "Candidate text to check for similarity"},
      "entity": {"type": "string", "description": "Entity name to scope search (optional)"},
      "path": {"type": "string", "description": "Document path to scope search (optional)"},
      "threshold": {"type": "number", "default": 0.85, "description": "Minimum similarity (0.0-1.0)"},
      "limit": {"type": "integer", "default": 5, "description": "Max matches to return"}
    },
    "required": ["text"]
  }
}

MCP tool annotation: readOnlyHint: true — this tool never modifies data.
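
Before a tool call reaches the similarity search, the server side needs to check the required `text` argument and fill in the schema defaults. A minimal sketch of that step in plain Python, independent of any MCP SDK; `prepare_arguments` is a hypothetical helper and the schema below is abbreviated from the one above:

```python
INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "entity": {"type": "string"},
        "path": {"type": "string"},
        "threshold": {"type": "number", "default": 0.85},
        "limit": {"type": "integer", "default": 5},
    },
    "required": ["text"],
}


def prepare_arguments(arguments: dict, schema: dict = INPUT_SCHEMA) -> dict:
    """Reject missing or unknown arguments, then apply schema defaults."""
    missing = [name for name in schema.get("required", []) if name not in arguments]
    if missing:
        raise ValueError(f"missing required argument(s): {', '.join(missing)}")
    unknown = sorted(set(arguments) - set(schema["properties"]))
    if unknown:
        raise ValueError(f"unknown argument(s): {', '.join(unknown)}")
    prepared = {
        name: spec["default"]
        for name, spec in schema["properties"].items()
        if "default" in spec
    }
    prepared.update(arguments)  # caller-supplied values win over defaults
    return prepared
```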

6. Python API

from kb import KnowledgeBase

kb = KnowledgeBase()

# Similar facts for an entity
matches = kb.memory_similar(
    text="prefers async communication",
    entity="Person A",
    threshold=0.85,
    limit=5,
)
# Returns: list[SimilarityMatch]
# SimilarityMatch: {text, score, fact_id, source_path, date, entity_name, match_type}

# Check if similar content exists (boolean shortcut)
if kb.has_similar(text="prefers async communication", entity="Person A"):
    print("Similar fact already exists — consider MERGE instead of CREATE")

# Similar-field check
result = kb.entity_similar_field(
    entity="Person A",
    field="role",
    value="Engineering Director",
)
# Returns: FieldSimilarity {current_value, proposed_value, similarity, is_similar, threshold}
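
The return types are only listed by field name above. As a sketch of what they might look like, assuming dataclasses and guessing at the field types (which fields are optional is not specified in this issue), with `has_similar` written as a thin wrapper over any search callable:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SimilarityMatch:
    text: str
    score: float
    fact_id: Optional[int]       # assumed None for non-fact matches
    source_path: str
    date: str                    # ISO date string, as in the JSON examples
    entity_name: Optional[str]
    match_type: str              # "fact", "open_item", "chunk", or "field"


@dataclass(frozen=True)
class FieldSimilarity:
    current_value: str
    proposed_value: str
    similarity: float
    threshold: float = 0.85

    @property
    def is_similar(self) -> bool:
        # Assumption: a score equal to the threshold counts as similar.
        return self.similarity >= self.threshold


def has_similar(search: Callable[..., list], text: str, **scope) -> bool:
    """Boolean shortcut: one match at or above the threshold is enough."""
    return bool(search(text=text, limit=1, **scope))
```

Deriving `is_similar` from `similarity` and `threshold` keeps the three values from drifting apart; a stored boolean would work equally well if the JSON shape must be mirrored exactly.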

7. Plugin Integration Pattern

The intended workflow for plugins and agents:

# Before writing a new fact
candidate = "Person A prefers written async updates"
# Search at 0.80 (below the 0.85 default) so matches in the 0.80-0.85 range reach the second branch
matches = kb.memory_similar(text=candidate, entity="Person A", threshold=0.80)

if matches and matches[0].score > 0.90:
    # Very similar fact exists, probably a duplicate
    # Surface to LLM for MERGE/SKIP decision
    existing = matches[0].text
    decision = llm_decide(candidate, existing)  # plugin's responsibility
    if decision == "MERGE":
        merged_text = llm_merge(candidate, existing)  # hypothetical helper, also the plugin's responsibility
        kb.edit_fact(matches[0].fact_id, text=merged_text)
    elif decision == "SKIP":
        pass  # don't write
elif matches and matches[0].score > 0.80:
    # Somewhat similar: surface to LLM for CREATE/MERGE decision
    decision = llm_decide(candidate, matches[0].text)
    # ...
else:
    # No similar content, safe to CREATE
    kb.memory_add(title=candidate, entity="Person A")

Key principle: kbx provides the similarity signal. The LLM makes the judgment. This keeps the dedup logic in the plugin (where it has full conversation context) rather than baking opinionated LLM calls into kbx.

Implementation Phases

  1. Phase 1 — memory similar command: Embed candidate, vector search within entity/doc scope, threshold filter, JSON output. ~1-2 days
  2. Phase 2 — entity similar-field: Field-level similarity check with short-text handling (Levenshtein fallback). ~1 day
  3. Phase 3 — Python API: memory_similar(), has_similar(), entity_similar_field() on KnowledgeBase. ~0.5 day
  4. Phase 4 — MCP tool: kbx_memory_similar tool with annotations. ~0.5 day

Open Questions

  • Should memory similar also search document chunks (not just facts)? This would catch duplicates between facts and inline document content, but increases the search space.
  • Should the short-text Levenshtein fallback in similar-field also consider token overlap (Jaccard similarity) as a third signal?
  • Should there be a --dry-run mode on kbx memory add that automatically runs a similarity check before writing? For example, kbx memory add "fact" --entity "Name" --dry-run would show similar facts and ask for confirmation.
  • Should similarity results include a suggested_action field (CREATE/MERGE/SKIP) based on score ranges, or is that too opinionated for kbx?
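
To make the token-overlap question above concrete, here is a sketch of Jaccard similarity as a possible third signal. Whitespace tokenisation, lowercasing, and combining the signals with max are all assumptions for illustration; nothing here is decided.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_similarity(vector_similarity: float, string_similarity: float,
                        current: str, proposed: str) -> float:
    """Hypothetical three-signal variant of the short-text fallback."""
    return max(vector_similarity, string_similarity, jaccard(current, proposed))
```

Jaccard ignores word order, so it scores "Director of Engineering" against "Engineering Director" at 2/3, where character-level edit distance sees two very different strings.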
