feat: vector search should respect --path filtering #73

@tenfourty

Summary

Currently --path filtering works for FTS but is ignored by vector/semantic search. The vector pipeline scans the full embedding table and scores all documents; no path filter is applied before ANN scoring. As a result, kbx search "topic" --path memory/meetings/ returns FTS results scoped to that path but vector results from the entire KB.

Proposed behaviour: Pre-filter the vector candidate set by document path before ANN scoring, so both FTS and vector search respect the same --path argument consistently.

Use Cases

  • Scope to a date range of meetings: kbx search "deployment" --path memory/meetings/2026/03/ — only search March 2026 meetings
  • Search only entity files: kbx search "infrastructure lead" --path memory/people/ — find people matching a role
  • Agent-driven scoped search: Callers that know exactly which KB subtree is relevant can reduce noise from unrelated documents
  • Project-scoped search: kbx search "blockers" --path memory/projects/ — only search project entity files

Current Behaviour

# FTS results are correctly scoped to meetings
kbx search "migration timeline" --fast --path memory/meetings/ --json
# → only results from memory/meetings/

# Hybrid search: FTS results are scoped, but vector results come from everywhere
kbx search "migration timeline" --path memory/meetings/ --json
# → FTS hits from memory/meetings/ + vector hits from memory/people/, memory/notes/, etc.

The inconsistency means --path is unreliable in hybrid mode — users can't trust that all results come from the specified path.

Implementation Notes

LanceDB Pre-Filtering

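LanceDB supports where clauses on vector search. With prefilter=True the filter is applied before ANN scoring rather than to the top-k results afterwards:

# Current: no path filter on vector search
results = table.search(query_vector).limit(limit).to_list()

# Proposed: pre-filter by path prefix
# Double any single quotes so the prefix cannot break out of the SQL string literal.
safe_prefix = path_prefix.replace("'", "''")
results = (
    table.search(query_vector)
    # Pass prefilter explicitly; its default has differed between LanceDB releases.
    .where(f"source_path LIKE '{safe_prefix}%'", prefilter=True)
    .limit(limit)
    .to_list()
)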

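Pre-filtering restricts the ANN candidate set to in-scope rows, so a limit of N yields up to N in-scope results instead of whatever survives from a global top N. It is also potentially faster, since the search space is smaller.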
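To make the difference concrete, here is a toy sketch in plain Python (no LanceDB involved) that compares filtering after a global top-k with filtering before scoring. The rows, embeddings, and function names are invented for illustration and use brute-force cosine similarity rather than a real ANN index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical (source_path, embedding) rows; not real kbx data.
ROWS = [
    ("memory/people/alice.md", [1.0, 0.0]),
    ("memory/notes/infra.md", [0.9, 0.1]),
    ("memory/meetings/2026/03/sync.md", [0.6, 0.8]),
    ("memory/meetings/2026/03/retro.md", [0.0, 1.0]),
]

def top_k(rows, query, k):
    """Brute-force stand-in for the ANN step: best k rows by similarity."""
    return sorted(rows, key=lambda r: cosine(r[1], query), reverse=True)[:k]

def post_filter_search(query, prefix, k):
    # Score the whole table, keep the global top k, then drop out-of-scope rows.
    return [path for path, _ in top_k(ROWS, query, k) if path.startswith(prefix)]

def pre_filter_search(query, prefix, k):
    # Restrict the candidate set first, then take the top k of what is left.
    scoped = [row for row in ROWS if row[0].startswith(prefix)]
    return [path for path, _ in top_k(scoped, query, k)]

QUERY = [1.0, 0.0]
post_hits = post_filter_search(QUERY, "memory/meetings/", k=2)
pre_hits = pre_filter_search(QUERY, "memory/meetings/", k=2)
```

In this toy data the two best global matches live outside memory/meetings/, so the post-filter variant returns nothing while the pre-filter variant still returns two in-scope documents.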

Glob Pattern Support

--path currently accepts glob patterns (e.g. memory/meetings/2026/0[1-3]/). The vector pre-filter needs to handle both forms:

| Pattern | LanceDB where clause |
| --- | --- |
| memory/meetings/ | source_path LIKE 'memory/meetings/%' |
| memory/meetings/2026/03/ | source_path LIKE 'memory/meetings/2026/03/%' |
| memory/people/*.md | source_path LIKE 'memory/people/%.md' (approximate) |
| Complex globs | Fall back to post-filter (fetch more candidates, filter in Python) |

For simple prefix patterns (the common case), translate directly to a LIKE clause. For complex globs with character classes or alternation, fetch a larger candidate set and post-filter — still better than no filtering at all.
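One possible shape for that translation is sketched below. The function names are hypothetical, and the matching semantics are assumptions about how kbx interprets --path: a pattern without glob characters is a plain prefix, and a trailing "/" means "everything under this directory". Brace alternation is not handled, because Python's fnmatch does not support it.

```python
import fnmatch

def plan_path_filter(pattern):
    """Return (where_clause, needs_post_filter) for a --path value.

    where_clause is None when no LIKE approximation is attempted.
    """
    if any(ch in pattern for ch in "?[]"):
        # Character classes and single-char wildcards: post-filter only.
        return None, True
    quoted = pattern.replace("'", "''")  # escape SQL string quotes
    like = quoted.replace("*", "%")      # '*' -> '%' is only approximate
    if "*" not in pattern or pattern.endswith("/"):
        like += "%"                      # prefix semantics: match everything below
    # '%' and '_' are LIKE wildcards, so patterns containing them (or '*')
    # may over-match and should be re-checked exactly in Python.
    needs_post_filter = any(ch in pattern for ch in "*%_")
    return "source_path LIKE '" + like + "'", needs_post_filter

def matches_path(path, pattern):
    """Exact check used by the post-filter (assumed --path semantics)."""
    if not any(ch in pattern for ch in "*?[]"):
        return path.startswith(pattern)
    glob = pattern + "*" if pattern.endswith("/") else pattern
    return fnmatch.fnmatchcase(path, glob)

def post_filter(rows, pattern, limit):
    """Keep rows whose source_path matches; rows should be an over-fetched set."""
    return [r for r in rows if matches_path(r["source_path"], pattern)][:limit]
```

A caller would pass the where clause (when present) to the vector search, request more than limit rows whenever needs_post_filter is true, and then run post_filter on the result. How much to over-fetch is a tuning decision this sketch leaves open.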

FTS Parity

FTS already filters by path correctly. The fix is isolated to the vector search path — ensure it receives and applies the same --path argument that FTS does. The hybrid merge step should then combine two equally-scoped result sets.
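As a rough illustration of that parity, the sketch below passes one path argument to both pipelines and then merges the two ranked lists. Everything here is hypothetical: the function signatures, the fake backends, and the use of reciprocal rank fusion as the merge step are assumptions, not a description of how kbx merges results today.

```python
from collections import defaultdict

def hybrid_search(query, path, fts_search, vector_search, limit=10, rrf_k=60):
    """Run both pipelines with the same path scope, then fuse by rank."""
    fts_hits = fts_search(query, path, limit)     # already path-scoped today
    vec_hits = vector_search(query, path, limit)  # the part this issue adds
    scores = defaultdict(float)
    for hits in (fts_hits, vec_hits):
        for rank, doc_path in enumerate(hits, start=1):
            # Reciprocal rank fusion (an assumed merge strategy).
            scores[doc_path] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by path for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:limit]

# Fake backends standing in for the real FTS and vector pipelines.
CORPUS = ["memory/meetings/a.md", "memory/meetings/b.md", "memory/people/c.md"]

def fake_fts(query, path, limit):
    return [p for p in CORPUS if p.startswith(path)][:limit]

def fake_vector(query, path, limit):
    return [p for p in reversed(CORPUS) if p.startswith(path)][:limit]

merged = hybrid_search("blockers", "memory/meetings/", fake_fts, fake_vector)
```

Because both backends receive the same path, the merge step never sees an out-of-scope document and needs no filtering logic of its own.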

Acceptance Criteria

  • kbx search "query" --path memory/meetings/ returns only results from memory/meetings/ in both FTS and vector results
  • kbx search "query" --path memory/people/ --json — all results have path starting with memory/people/
  • Simple prefix patterns use LanceDB where pre-filter (no post-filter overhead)
  • Complex glob patterns fall back to post-filter with expanded candidate set
  • No regression in search performance for queries without --path
  • --explain (feat: search explain/trace mode for retrieval diagnostics #68) shows the path filter applied to both FTS and vector pipelines
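The scoping criteria above could be checked mechanically along these lines. This is a sketch only: it assumes --json emits a top-level array of objects that each carry a "path" field, which the issue implies but does not spell out, and the sample payload is invented.

```python
import json

def out_of_scope(results_json, prefix, key="path"):
    """Return the paths of results that fall outside the requested prefix."""
    results = json.loads(results_json)
    return [r[key] for r in results if not r[key].startswith(prefix)]

# Invented sample output, shaped per the assumption above.
SAMPLE = json.dumps([
    {"path": "memory/people/alice.md", "score": 0.91},
    {"path": "memory/notes/infra.md", "score": 0.88},
    {"path": "memory/people/bob.md", "score": 0.74},
])

leaks = out_of_scope(SAMPLE, "memory/people/")
```

An empty list means every result respected --path; any entries are the leaked documents to investigate.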
