Summary
Add a kbx memory similar command, with Python and MCP API equivalents, that uses existing vector embeddings to find semantically similar facts or entity content before writing. This lets plugins and agents check for duplicates without needing an LLM.
kbx memory similar "prefers async communication" --entity "Person A" --threshold 0.85
Motivation
Plugins and agents frequently write facts and entity updates to kbx. Without a pre-write similarity check, they produce duplicates:
- Fact "prefers async Slack over email" already exists → agent writes "favours asynchronous communication via Slack" → near-duplicate
- Entity role is "Engineering Director" → plugin calls kbx person edit --role "Director of Engineering" → semantically identical update
- Open item "follow up on migration timeline" → debrief extracts "check migration schedule status" → same action, different words
The vector embeddings already exist in the index. This feature exposes them for pre-write similarity lookup — a cheap, fast check that gives plugins the signal they need to decide CREATE vs MERGE vs SKIP, without burning an LLM call on every write.
Inspiration: OpenViking's MemoryDeduplicator does a two-step pipeline: vector pre-filter (embed candidate, search within same category) → LLM decision. This issue implements the vector pre-filter step as a standalone capability. The LLM decision step lives in the calling plugin, not in kbx.
Design
1. kbx memory similar Command
Search existing facts that are semantically similar to candidate text, optionally scoped to a single entity or a single document.
# Find similar facts for an entity
kbx memory similar "prefers async communication" --entity "Person A" --threshold 0.85 --json
# Output:
# {
# "query": "prefers async communication",
# "entity": "Person A",
# "threshold": 0.85,
# "matches": [
# {
# "text": "Favours Slack DMs over email for quick questions",
# "score": 0.91,
# "fact_id": 1234,
# "source_path": "memory/people/person-a.md",
# "date": "2026-02-15"
# },
# {
# "text": "Prefers written async updates to synchronous meetings",
# "score": 0.87,
# "fact_id": 1235,
# "source_path": "memory/people/person-a.md",
# "date": "2026-01-20"
# }
# ]
# }
# Search across all entities (no --entity filter)
kbx memory similar "deployment rollback procedure" --threshold 0.80 --json
# Search within a specific document's content
kbx memory similar "migration blocked by auth dependency" --path "memory/projects/project-x.md" --json
Parameters:
| Flag | Default | Description |
| --- | --- | --- |
| --entity NAME | None | Scope to facts for a specific entity |
| --path PATH | None | Scope to content within a specific document |
| --threshold FLOAT | 0.85 | Minimum similarity score (0.0–1.0) |
| --limit INT | 5 | Maximum number of matches |
| --json | False | JSON output |
Implementation: Embed the candidate text using the existing embedder, then vector search within the scoped collection (entity facts, document chunks, or global). Filter by threshold. Return matches with scores.
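The embed, search, and filter steps above can be sketched in a few lines. This is a minimal illustration only: the Match type, the similar function, and the toy three-dimensional vectors are hypothetical stand-ins, and a real implementation would call the existing embedder and vector index rather than scanning a list.

```python
import math
from dataclasses import dataclass


@dataclass
class Match:
    text: str
    score: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar(query_vec, indexed, threshold=0.85, limit=5):
    """Score every (text, vector) pair in scope and keep those at or above threshold."""
    scored = [Match(text, cosine(query_vec, vec)) for text, vec in indexed]
    kept = [m for m in scored if m.score >= threshold]
    kept.sort(key=lambda m: m.score, reverse=True)  # best match first
    return kept[:limit]


# Toy vectors standing in for the real embedder's output (assumption).
facts = [
    ("Favours Slack DMs over email", [0.9, 0.1, 0.0]),
    ("Unrelated fact about something else", [0.0, 0.2, 0.9]),
]
matches = similar([1.0, 0.0, 0.0], facts, threshold=0.85)
```

Scoping by --entity or --path would amount to choosing which (text, vector) pairs are passed in as indexed before scoring.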
2. kbx entity similar-field Command
Check if a proposed entity field update is semantically equivalent to what's already stored.
# Check if a role update is redundant
kbx entity similar-field "Person A" --field role --value "Engineering Director" --json
# {
# "entity": "Person A",
# "field": "role",
# "current_value": "Director of Engineering",
# "proposed_value": "Engineering Director",
# "similarity": 0.96,
# "is_similar": true,
# "threshold": 0.85
# }
# Check team field
kbx entity similar-field "Person A" --field team --value "Platform & Infrastructure" --json
# {
# "entity": "Person A",
# "field": "team",
# "current_value": "Platform",
# "proposed_value": "Platform & Infrastructure",
# "similarity": 0.82,
# "is_similar": false,
# "threshold": 0.85
# }
Supported fields: role, team, and any custom metadata field. For short text fields, use direct embedding similarity. For facts and open items (lists), use kbx memory similar under the hood.
Short-text handling: Very short strings (< 10 tokens) may not embed well. For these, also compute normalized Levenshtein distance and take the max of embedding similarity and string similarity:
final_similarity = max(vector_similarity, 1.0 - normalized_levenshtein(current, proposed))
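The formula above can be made concrete with a small, dependency-free sketch. The function names mirror the formula; normalising the edit distance by the length of the longer string is an assumption, since the proposal does not say which normalisation it intends.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to 0.0-1.0 by the longer string's length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def final_similarity(vector_similarity: float, current: str, proposed: str) -> float:
    """Take the stronger of the embedding signal and the string signal."""
    string_similarity = 1.0 - normalized_levenshtein(current, proposed)
    return max(vector_similarity, string_similarity)
```

Taking the max means the string signal can only raise a score, never lower it. For a reordering such as "Director of Engineering" versus "Engineering Director", the edit distance is large, so the embedding similarity carries the result.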
3. Threshold Tuning
Different content types need different thresholds:
| Content type | Recommended threshold | Rationale |
| --- | --- | --- |
| Short facts (< 20 words) | 0.90 | Short texts embed less distinctly; higher threshold avoids false positives |
| Long facts (20+ words) | 0.85 | More embedding signal; standard threshold works |
| Role/team fields | 0.85 | Short but domain-specific vocabulary |
| Open items | 0.80 | Action items can be phrased very differently while meaning the same thing |
| Document chunks | 0.80 | Longer text, more variation in phrasing |
Default: 0.85 (good general-purpose threshold). Overridable via --threshold flag.
Configurable in kbx.toml:
[dedup]
default_threshold = 0.85
fact_threshold = 0.90
open_item_threshold = 0.80
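One way the effective threshold could be resolved from these settings is sketched below. The precedence (explicit --threshold, then the per-type key, then default_threshold) and the resolve_threshold helper are assumptions for illustration. The dict stands in for the parsed [dedup] table.

```python
def resolve_threshold(content_type, dedup_config, cli_threshold=None):
    """Pick a threshold: explicit flag first, then <type>_threshold, then the default."""
    if cli_threshold is not None:
        return cli_threshold
    key = f"{content_type}_threshold"
    return dedup_config.get(key, dedup_config.get("default_threshold", 0.85))


# Equivalent of the [dedup] table above after TOML parsing.
dedup = {
    "default_threshold": 0.85,
    "fact_threshold": 0.90,
    "open_item_threshold": 0.80,
}

fact_cutoff = resolve_threshold("fact", dedup)    # per-type key is present
chunk_cutoff = resolve_threshold("chunk", dedup)  # no chunk_threshold key, so the default applies
```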
4. JSON Output
All commands return structured JSON with --json:
{
  "query": "candidate text",
  "scope": {"entity": "Person A"},
  "threshold": 0.85,
  "matches": [
    {
      "text": "matched text",
      "score": 0.91,
      "fact_id": 1234,
      "source_path": "memory/people/person-a.md",
      "date": "2026-02-15",
      "entity_name": "Person A",
      "match_type": "fact"
    }
  ],
  "has_similar": true,
  "best_score": 0.91
}
Key fields for programmatic consumers:
- has_similar: boolean, true if at least one match meets the threshold (quick check)
- best_score: highest similarity score among the matches (for threshold comparison)
- matches[].fact_id: enables MERGE operations (kbx memory edit-fact <id>)
- matches[].match_type: one of fact, open_item, chunk, field (tells the consumer what matched)
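A sketch of how the summary fields might be derived from the match list follows. The build_response helper is hypothetical, and returning null for best_score when nothing matches is an assumption the proposal does not settle.

```python
import json


def build_response(query, scope, threshold, matches):
    """Wrap already-filtered matches in the JSON envelope described above."""
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    return {
        "query": query,
        "scope": scope,
        "threshold": threshold,
        "matches": ranked,
        "has_similar": bool(ranked),
        # Assumption: best_score is null (None) when there are no matches.
        "best_score": ranked[0]["score"] if ranked else None,
    }


response = build_response(
    "candidate text",
    {"entity": "Person A"},
    0.85,
    [
        {"text": "second", "score": 0.87, "fact_id": 1235, "match_type": "fact"},
        {"text": "first", "score": 0.91, "fact_id": 1234, "match_type": "fact"},
    ],
)
payload = json.dumps(response, indent=2)
```

A consumer that only needs the quick check can read has_similar and ignore the rest. One that intends to MERGE would take fact_id from the first entry in matches.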
5. MCP Tool
Expose as an MCP tool for agent-driven dedup:
{
  "name": "kbx_memory_similar",
  "description": "Find existing facts or content semantically similar to candidate text. Use before writing new facts to check for duplicates.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "text": {"type": "string", "description": "Candidate text to check for similarity"},
      "entity": {"type": "string", "description": "Entity name to scope search (optional)"},
      "path": {"type": "string", "description": "Document path to scope search (optional)"},
      "threshold": {"type": "number", "default": 0.85, "description": "Minimum similarity (0.0-1.0)"},
      "limit": {"type": "integer", "default": 5, "description": "Max matches to return"}
    },
    "required": ["text"]
  }
}
MCP tool annotation: readOnlyHint: true — this tool never modifies data.
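On the server side, the tool handler would need to apply the schema's defaults and reject bad input before searching. The sketch below uses no MCP SDK. The parse_similar_args helper is hypothetical, and rejecting empty text or a limit below 1 is an assumption beyond what the schema states.

```python
def parse_similar_args(arguments: dict) -> dict:
    """Validate tool arguments and fill in the defaults declared in inputSchema."""
    text = arguments.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' is required and must be a non-empty string")

    threshold = float(arguments.get("threshold", 0.85))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("'threshold' must be between 0.0 and 1.0")

    limit = int(arguments.get("limit", 5))
    if limit < 1:
        raise ValueError("'limit' must be at least 1")

    return {
        "text": text,
        "entity": arguments.get("entity"),  # None means no entity scope
        "path": arguments.get("path"),      # None means no document scope
        "threshold": threshold,
        "limit": limit,
    }


args = parse_similar_args({"text": "prefers async communication", "entity": "Person A"})
```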
6. Python API
from kb import KnowledgeBase

kb = KnowledgeBase()

# Similar facts for an entity
matches = kb.memory_similar(
    text="prefers async communication",
    entity="Person A",
    threshold=0.85,
    limit=5,
)
# Returns: list[SimilarityMatch]
# SimilarityMatch: {text, score, fact_id, source_path, date, entity_name, match_type}

# Check if similar content exists (boolean shortcut)
if kb.has_similar(text="prefers async communication", entity="Person A"):
    print("Similar fact already exists; consider MERGE instead of CREATE")

# Similar-field check
result = kb.entity_similar_field(
    entity="Person A",
    field="role",
    value="Engineering Director",
)
# Returns: FieldSimilarity {current_value, proposed_value, similarity, is_similar, threshold}
7. Plugin Integration Pattern
The intended workflow for plugins and agents:
# Before writing a new fact
candidate = "Person A prefers written async updates"
# Ask for matches down to 0.80: with the default threshold of 0.85,
# the "somewhat similar" branch below would never see scores in 0.80-0.85.
matches = kb.memory_similar(text=candidate, entity="Person A", threshold=0.80)

if matches and matches[0].score > 0.90:
    # Very similar fact exists, probably a duplicate
    # Surface to LLM for MERGE/SKIP decision
    existing = matches[0].text
    decision = llm_decide(candidate, existing)  # plugin's responsibility
    if decision == "MERGE":
        # merged_text is the combined wording, produced by the plugin
        kb.edit_fact(matches[0].fact_id, text=merged_text)
    elif decision == "SKIP":
        pass  # don't write
elif matches and matches[0].score > 0.80:
    # Somewhat similar: surface to LLM for CREATE/MERGE decision
    decision = llm_decide(candidate, matches[0].text)
    # ...
else:
    # No similar content, safe to CREATE
    kb.memory_add(title=candidate, entity="Person A")
Key principle: kbx provides the similarity signal. The LLM makes the judgment. This keeps the dedup logic in the plugin (where it has full conversation context) rather than baking opinionated LLM calls into kbx.
Implementation Phases
- Phase 1 (memory similar command): Embed candidate, vector search within entity/doc scope, threshold filter, JSON output. ~1-2 days
- Phase 2 (entity similar-field): Field-level similarity check with short-text handling (Levenshtein fallback). ~1 day
- Phase 3 (Python API): memory_similar(), has_similar(), entity_similar_field() on KnowledgeBase. ~0.5 day
- Phase 4 (MCP tool): kbx_memory_similar tool with annotations. ~0.5 day
Open Questions
- Should memory similar also search document chunks (not just facts)? This would catch duplicates between facts and inline document content, but increases the search space.
- Should similar-field also consider token overlap (Jaccard similarity) as a third signal?
- Should there be a --dry-run mode on kbx memory add that automatically runs a similarity check before writing? e.g. kbx memory add "fact" --entity "Name" --dry-run → shows similar facts and asks for confirmation.
- Should the output include a suggested_action field (CREATE/MERGE/SKIP) based on score ranges, or is that too opinionated for kbx?