feat: kbx memory similar — semantic similarity lookup for dedup

## Summary

Add a `kbx memory similar` command, with Python API and MCP equivalents, that uses the existing vector embeddings to find semantically similar facts or entity content before a write. Plugins and agents can then check for duplicates without needing an LLM.

```bash
kbx memory similar "prefers async communication" --entity "Person A" --threshold 0.85
```

## Motivation

Plugins and agents frequently write facts and entity updates to kbx. Without a pre-write similarity check, they produce duplicates:

- Fact "prefers async Slack over email" already exists → agent writes "favours asynchronous communication via Slack" → near-duplicate
- Entity role is "Engineering Director" → plugin calls `kbx person edit --role "Director of Engineering"` → semantically identical update
- Open item "follow up on migration timeline" → debrief extracts "check migration schedule status" → same action, different words

The vector embeddings already exist in the index. This feature exposes them for pre-write similarity lookup — a cheap, fast check that gives plugins the signal they need to decide CREATE vs MERGE vs SKIP, without burning an LLM call on every write.

**Inspiration**: OpenViking's `MemoryDeduplicator` does a two-step pipeline: vector pre-filter (embed candidate, search within same category) → LLM decision. This issue implements the vector pre-filter step as a standalone capability. The LLM decision step lives in the calling plugin, not in kbx.

## Design

### 1. `kbx memory similar` Command

Search existing facts for a given entity that are semantically similar to candidate text.

```bash
# Find similar facts for an entity
kbx memory similar "prefers async communication" --entity "Person A" --threshold 0.85 --json

# Output:
# {
#   "query": "prefers async communication",
#   "entity": "Person A",
#   "threshold": 0.85,
#   "matches": [
#     {
#       "text": "Favours Slack DMs over email for quick questions",
#       "score": 0.91,
#       "fact_id": 1234,
#       "source_path": "memory/people/person-a.md",
#       "date": "2026-02-15"
#     },
#     {
#       "text": "Prefers written async updates to synchronous meetings",
#       "score": 0.87,
#       "fact_id": 1235,
#       "source_path": "memory/people/person-a.md",
#       "date": "2026-01-20"
#     }
#   ]
# }

# Search across all entities (no --entity filter)
kbx memory similar "deployment rollback procedure" --threshold 0.80 --json

# Search within a specific document's content
kbx memory similar "migration blocked by auth dependency" --path "memory/projects/project-x.md" --json
```

**Parameters**:
| Flag | Default | Description |
|------|---------|-------------|
| `--entity NAME` | None | Scope to facts for a specific entity |
| `--path PATH` | None | Scope to content within a specific document |
| `--threshold FLOAT` | 0.85 | Minimum similarity score (0.0–1.0) |
| `--limit INT` | 5 | Maximum number of matches |
| `--json` | False | JSON output |

**Implementation**: Embed the candidate text using the existing embedder, then vector search within the scoped collection (entity facts, document chunks, or global). Filter by threshold. Return matches with scores.
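
The filter-and-rank step described above can be sketched as follows. This is a minimal illustration, not the kbx implementation: it assumes cosine similarity as the score, and `Match`, `cosine`, and `find_similar` are hypothetical names. In practice the vectors would come from the existing embedder and the search would run against the index rather than a Python list.

```python
import math
from dataclasses import dataclass


@dataclass
class Match:
    text: str
    score: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(
    query_vec: list[float],
    candidates: list[tuple[str, list[float]]],
    threshold: float = 0.85,
    limit: int = 5,
) -> list[Match]:
    """Score every (text, vector) candidate, keep those at or above threshold, best first."""
    scored = [Match(text, cosine(query_vec, vec)) for text, vec in candidates]
    kept = [m for m in scored if m.score >= threshold]
    kept.sort(key=lambda m: m.score, reverse=True)
    return kept[:limit]
```

Scoping (`--entity`, `--path`) would narrow `candidates` before this step, so the threshold and limit apply only within the chosen scope.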

### 2. `kbx entity similar-field` Command

Check if a proposed entity field update is semantically equivalent to what's already stored.

```bash
# Check if a role update is redundant
kbx entity similar-field "Person A" --field role --value "Engineering Director" --json
# {
#   "entity": "Person A",
#   "field": "role",
#   "current_value": "Director of Engineering",
#   "proposed_value": "Engineering Director",
#   "similarity": 0.96,
#   "is_similar": true,
#   "threshold": 0.85
# }

# Check team field
kbx entity similar-field "Person A" --field team --value "Platform & Infrastructure" --json
# {
#   "entity": "Person A",
#   "field": "team",
#   "current_value": "Platform",
#   "proposed_value": "Platform & Infrastructure",
#   "similarity": 0.82,
#   "is_similar": false,
#   "threshold": 0.85
# }
```

**Supported fields**: `role`, `team`, and any custom metadata field. For short text fields, use direct embedding similarity. For facts and open items (lists), use `kbx memory similar` under the hood.

**Short-text handling**: Very short strings (< 10 tokens) may not embed well. For these, also compute the normalized Levenshtein distance, convert it to a string similarity (1 minus the distance), and take the max of embedding similarity and string similarity:
```python
final_similarity = max(vector_similarity, 1.0 - normalized_levenshtein(current, proposed))
```
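
A self-contained sketch of that fallback is shown below. It assumes "normalized" means dividing the edit distance by the length of the longer string; `levenshtein`, `normalized_levenshtein`, and `field_similarity` are illustrative names, and the embedding similarity is passed in as a precomputed number.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to 0.0-1.0 by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def field_similarity(vector_similarity: float, current: str, proposed: str) -> float:
    """Max of embedding similarity and string similarity, as in the formula above."""
    string_similarity = 1.0 - normalized_levenshtein(current, proposed)
    return max(vector_similarity, string_similarity)
```

Note that reordered phrases such as "Engineering Director" and "Director of Engineering" have a large edit distance, so for them the embedding score still carries the decision; the string signal mainly helps with typos and small spelling variants.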

### 3. Threshold Tuning

Different content types need different thresholds:

| Content type | Recommended threshold | Rationale |
|-------------|----------------------|-----------|
| Short facts (< 20 words) | 0.90 | Short texts embed less distinctly; higher threshold avoids false positives |
| Long facts (20+ words) | 0.85 | More embedding signal; standard threshold works |
| Role/team fields | 0.85 | Short but domain-specific vocabulary |
| Open items | 0.80 | Action items can be phrased very differently while meaning the same thing |
| Document chunks | 0.80 | Longer text, more variation in phrasing |

**Default**: 0.85 (good general-purpose threshold). Overridable via `--threshold` flag.

**Configurable in `kbx.toml`**:
```toml
[dedup]
default_threshold = 0.85
fact_threshold = 0.90
open_item_threshold = 0.80
```
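
One possible way to resolve the effective threshold is sketched below. The precedence (explicit `--threshold` first, then the per-type key, then `default_threshold`) is an assumption rather than something this proposal specifies, and `resolve_threshold` is a hypothetical helper. The dict stands in for the parsed `[dedup]` table from `kbx.toml`.

```python
from typing import Optional

# Stand-in for the parsed kbx.toml contents (e.g. as a TOML parser would return them).
CONFIG = {
    "dedup": {
        "default_threshold": 0.85,
        "fact_threshold": 0.90,
        "open_item_threshold": 0.80,
    }
}

FALLBACK_THRESHOLD = 0.85  # used only if the config has no [dedup] defaults at all


def resolve_threshold(
    config: dict, content_type: str, override: Optional[float] = None
) -> float:
    """Pick a threshold: explicit override, else '<type>_threshold', else the default."""
    if override is not None:
        return override
    dedup = config.get("dedup", {})
    default = dedup.get("default_threshold", FALLBACK_THRESHOLD)
    return dedup.get(f"{content_type}_threshold", default)


print(resolve_threshold(CONFIG, "fact"))
print(resolve_threshold(CONFIG, "chunk"))
print(resolve_threshold(CONFIG, "fact", override=0.7))
```

With the config above, `fact` resolves to its own key, `chunk` falls back to `default_threshold` because no `chunk_threshold` is set, and an explicit override wins over both.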

### 4. JSON Output

All commands return structured JSON with `--json`:

```json
{
  "query": "candidate text",
  "scope": {"entity": "Person A"},
  "threshold": 0.85,
  "matches": [
    {
      "text": "matched text",
      "score": 0.91,
      "fact_id": 1234,
      "source_path": "memory/people/person-a.md",
      "date": "2026-02-15",
      "entity_name": "Person A",
      "match_type": "fact"
    }
  ],
  "has_similar": true,
  "best_score": 0.91
}
```

**Key fields for programmatic consumers**:
- `has_similar` — boolean, true if at least one match scores at or above the threshold (quick check)
- `best_score` — highest similarity score (for threshold comparison)
- `matches[].fact_id` — enables MERGE operations (`kbx memory edit-fact <id>`)
- `matches[].match_type` — `fact`, `open_item`, `chunk`, `field` (tells consumer what matched)
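
As a rough illustration of how a consumer might read these fields, the sketch below parses a payload shaped like the example above and collects the fact IDs that could be passed to a MERGE. `merge_candidates` is a hypothetical helper, and the payload is a trimmed copy of the example, not real output.

```python
import json

# Trimmed copy of the example payload from this section.
PAYLOAD = """
{
  "query": "candidate text",
  "scope": {"entity": "Person A"},
  "threshold": 0.85,
  "matches": [
    {"text": "matched text", "score": 0.91, "fact_id": 1234, "match_type": "fact"}
  ],
  "has_similar": true,
  "best_score": 0.91
}
"""


def merge_candidates(result: dict) -> list[int]:
    """Return fact IDs of fact-type matches, or an empty list if nothing similar was found."""
    if not result.get("has_similar"):
        return []
    return [
        match["fact_id"]
        for match in result.get("matches", [])
        if match.get("match_type") == "fact" and match.get("fact_id") is not None
    ]


result = json.loads(PAYLOAD)
print(merge_candidates(result))
```

Filtering on `match_type` matters here because only `fact` matches carry an ID that `kbx memory edit-fact <id>` can act on.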

### 5. MCP Tool

Expose as an MCP tool for agent-driven dedup:

```json
{
  "name": "kbx_memory_similar",
  "description": "Find existing facts or content semantically similar to candidate text. Use before writing new facts to check for duplicates.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "text": {"type": "string", "description": "Candidate text to check for similarity"},
      "entity": {"type": "string", "description": "Entity name to scope search (optional)"},
      "path": {"type": "string", "description": "Document path to scope search (optional)"},
      "threshold": {"type": "number", "default": 0.85, "description": "Minimum similarity (0.0-1.0)"},
      "limit": {"type": "integer", "default": 5, "description": "Max matches to return"}
    },
    "required": ["text"]
  }
}
```

**MCP tool annotation**: `readOnlyHint: true` — this tool never modifies data.

### 6. Python API

```python
from kb import KnowledgeBase

kb = KnowledgeBase()

# Similar facts for an entity
matches = kb.memory_similar(
    text="prefers async communication",
    entity="Person A",
    threshold=0.85,
    limit=5,
)
# Returns: list[SimilarityMatch]
# SimilarityMatch: {text, score, fact_id, source_path, date, entity_name, match_type}

# Check if similar content exists (boolean shortcut)
if kb.has_similar(text="prefers async communication", entity="Person A"):
    print("Similar fact already exists — consider MERGE instead of CREATE")

# Similar-field check
result = kb.entity_similar_field(
    entity="Person A",
    field="role",
    value="Engineering Director",
)
# Returns: FieldSimilarity {current_value, proposed_value, similarity, is_similar, threshold}
```
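
The two return types could look roughly like the dataclasses below. The field names are taken from the comments above; the field types, the optional fields, and deriving `is_similar` from `similarity >= threshold` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimilarityMatch:
    text: str
    score: float
    fact_id: Optional[int]      # assumed None for non-fact matches such as chunks
    source_path: str
    date: str                   # ISO date string, as in the JSON examples
    entity_name: Optional[str]
    match_type: str             # "fact", "open_item", "chunk", or "field"


@dataclass(frozen=True)
class FieldSimilarity:
    current_value: str
    proposed_value: str
    similarity: float
    threshold: float = 0.85

    @property
    def is_similar(self) -> bool:
        """Derived rather than stored, so it cannot disagree with the score."""
        return self.similarity >= self.threshold


role = FieldSimilarity("Director of Engineering", "Engineering Director", 0.96)
team = FieldSimilarity("Platform", "Platform & Infrastructure", 0.82)
print(role.is_similar, team.is_similar)
```

The two instances reuse the `role` and `team` examples from section 2, where 0.96 counts as similar and 0.82 does not at the default threshold.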

### 7. Plugin Integration Pattern

The intended workflow for plugins and agents:

```python
# Before writing a new fact
candidate = "Person A prefers written async updates"
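# Query at 0.80 so the "somewhat similar" branch below is reachable;
# with the default threshold of 0.85, matches in the 0.80-0.85 band would never be returned.
# matches[0] is assumed to be the highest-scoring match.
matches = kb.memory_similar(text=candidate, entity="Person A", threshold=0.80)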

if matches and matches[0].score > 0.90:
    # Very similar fact exists — probably a duplicate
    # Surface to LLM for MERGE/SKIP decision
    existing = matches[0].text
    decision = llm_decide(candidate, existing)  # plugin's responsibility
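    if decision == "MERGE":
        # llm_merge is a hypothetical plugin-side helper that combines the two texts
        merged_text = llm_merge(candidate, existing)
        kb.edit_fact(matches[0].fact_id, text=merged_text)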
    elif decision == "SKIP":
        pass  # don't write
elif matches and matches[0].score > 0.80:
    # Somewhat similar — surface to LLM for CREATE/MERGE decision
    decision = llm_decide(candidate, matches[0].text)
    # ...
else:
    # No similar content — safe to CREATE
    kb.memory_add(title=candidate, entity="Person A")
```

**Key principle**: kbx provides the similarity signal. The LLM makes the judgment. This keeps the dedup logic in the plugin (where it has full conversation context) rather than baking opinionated LLM calls into kbx.

## Implementation Phases

1. **Phase 1 — `memory similar` command**: Embed candidate, vector search within entity/doc scope, threshold filter, JSON output. `~1-2 days`
2. **Phase 2 — `entity similar-field`**: Field-level similarity check with short-text handling (Levenshtein fallback). `~1 day`
3. **Phase 3 — Python API**: `memory_similar()`, `has_similar()`, `entity_similar_field()` on `KnowledgeBase`. `~0.5 day`
4. **Phase 4 — MCP tool**: `kbx_memory_similar` tool with annotations. `~0.5 day`

## Open Questions

- [ ] Should `memory similar` also search document chunks (not just facts)? This would catch duplicates between facts and inline document content, but increases the search space.
- [ ] Should the short-text Levenshtein fallback in `similar-field` also consider token overlap (Jaccard similarity) as a third signal?
- [ ] Should there be a `--dry-run` mode on `kbx memory add` that automatically runs similarity check before writing? e.g. `kbx memory add "fact" --entity "Name" --dry-run` → shows similar facts and asks for confirmation.
- [ ] Should similarity results include a `suggested_action` field (CREATE/MERGE/SKIP) based on score ranges, or is that too opinionated for kbx?
