Skip to content

feat: two-pass hierarchical search with entity-guided retrieval #69

@tenfourty

Description

@tenfourty

Summary

Split search into two stages: first find the most relevant entities (people, projects, teams), then search within documents linked to those entities — blending entity relevance into document scores for dramatically better precision on structured queries.

# Automatic two-pass (when entity matches are confident)
kbx search "what did the platform team decide about migration?"

# Force flat search (bypass hierarchy)
kbx search "what did the platform team decide about migration?" --no-hierarchy

Motivation

kbx currently does flat search across all 3000+ documents. Every document is treated equally regardless of its relationship to entities. This works well for keyword-specific queries but poorly for entity-centric queries like:

  • "What has Person A been working on?" — should prioritise documents where Person A is an attendee or is mentioned heavily, not documents that happen to contain similar keywords
  • "Latest updates on Project X" — should prioritise documents linked to that project entity, not unrelated documents that mention similar terms
  • "What did the infrastructure team discuss last week?" — should scope to meetings with infra team members

The problem: Flat vector search treats "Person A discussed the migration" and "Person B discussed a different migration" as equally relevant to a query about Person A's work. Two-pass search uses the entity graph to filter first, then score within that filtered set.

Inspiration: OpenViking's HierarchicalRetriever uses a priority-queue BFS over its directory tree with score propagation (final = 0.5 * child + 0.5 * parent). Their convergence detection stops early when top-K stabilises. We adapt this concept to kbx's entity→document graph rather than a filesystem hierarchy.

Design

1. Pass 1 — Entity Search

Use the existing entity index to find the most relevant entities for the query.

def pass1_entity_search(query: str, limit: int = 5) -> list[EntityScore]:
    """
    Search entities (people, projects, teams) by name, alias, role, and facts.
    Returns top-K entities with relevance scores.
    """
    # FTS search across entity names, aliases, roles, facts
    fts_matches = entity_fts_search(query, limit=limit * 2)

    # Vector search across entity descriptions/facts
    vector_matches = entity_vector_search(query, limit=limit * 2)

    # Merge and deduplicate
    return merge_entity_scores(fts_matches, vector_matches, limit=limit)

Entity scoring sources:

  • Name/alias exact or fuzzy match (highest signal)
  • Role/team match
  • Fact text match (e.g. entity fact "leads the migration project" matches query "migration")
  • Vector similarity of entity description to query

Output: List of (entity_id, entity_name, entity_type, score) tuples, sorted by score descending.

Confidence threshold: Pass 1 results are only used if the top entity scores above a configurable threshold (e.g. 0.5). Below that, fall back to flat search — the query is probably not entity-centric.

2. Pass 2 — Entity-Constrained Document Search

Search documents linked to the top entities from Pass 1, blending entity relevance into document scores.

def pass2_document_search(
    query: str,
    entity_scores: list[EntityScore],
    alpha: float = 0.5,
    limit: int = 10,
) -> list[SearchResult]:
    """
    Search within documents associated with top entities.
    Blend entity score into document score.
    """
    # Collect document IDs linked to top entities (via entity mention index)
    candidate_doc_ids = set()
    doc_entity_scores = {}  # doc_id -> best parent entity score

    for entity in entity_scores:
        linked_docs = get_documents_mentioning_entity(entity.id)
        for doc_id in linked_docs:
            candidate_doc_ids.add(doc_id)
            # Keep the best entity score per document
            doc_entity_scores[doc_id] = max(
                doc_entity_scores.get(doc_id, 0.0),
                entity.score,
            )

    # Run standard FTS+vector search, but constrained to candidate docs
    doc_results = search_within_documents(query, doc_ids=candidate_doc_ids)

    # Blend scores
    for result in doc_results:
        parent_score = doc_entity_scores.get(result.doc_id, 0.0)
        result.score = alpha * result.score + (1 - alpha) * parent_score

    return sorted(doc_results, key=lambda r: r.score, reverse=True)[:limit]

Document→entity linkage uses the existing entity mention index (find_entity_mentions()). A document is linked to an entity if:

  • The entity is mentioned in the document text (existing linking)
  • The entity is listed as an attendee (for meetings)
  • The entity is referenced in frontmatter metadata

3. Score Propagation Formula

final_score = alpha * doc_relevance_score + (1 - alpha) * parent_entity_score
Alpha Behaviour
1.0 Pure document relevance (flat search)
0.5 Equal weight to document relevance and entity match (default)
0.0 Pure entity proximity (rank by how strongly linked to matched entity)

Default: alpha = 0.5 — a document that's moderately relevant but strongly linked to a matched entity ranks similarly to a highly relevant document with weak entity linkage.

Edge case: Documents linked to multiple matched entities get the best (max) entity score, not the sum. This prevents documents mentioning many entities from dominating purely through quantity.

Configurable in kbx.toml:

[search]
hierarchy_alpha = 0.5              # 0.0 to 1.0
hierarchy_entity_threshold = 0.5   # minimum entity score to trigger two-pass
hierarchy_max_entities = 5         # max entities from Pass 1

4. Fallback to Flat Search

Two-pass search is not always appropriate. Fall back to standard flat search when:

  1. No confident entities: Pass 1 top score < threshold (default 0.5). The query is probably not entity-centric (e.g. "how do we handle deployment rollbacks?" — general topic, no specific entity).

  2. Too many entities: If Pass 1 returns many entities with similar scores, the query is too broad for entity-guided filtering. Fall back if top_score - 5th_score < 0.1.

  3. Explicit opt-out: --no-hierarchy flag.

  4. FTS-only mode: --fast (FTS-only search) skips two-pass by default since entity vector search isn't available. Could still do entity FTS matching in a future iteration.

Fallback is transparent: Results include a search_mode field in --explain output indicating whether two-pass or flat search was used, and why.

5. Convergence Detection

For large entity-linked document sets, implement early stopping:

MAX_CONVERGENCE_ROUNDS = 3

def search_with_convergence(query, candidate_docs, limit):
    """Process candidate docs in batches, stop when top-K stabilises."""
    top_k = []
    stable_rounds = 0

    for batch in chunk(candidate_docs, batch_size=50):
        new_results = score_batch(query, batch)
        prev_top_k = [r.doc_id for r in top_k[:limit]]

        top_k = merge_and_sort(top_k + new_results)[:limit * 2]
        curr_top_k = [r.doc_id for r in top_k[:limit]]

        if prev_top_k == curr_top_k:
            stable_rounds += 1
            if stable_rounds >= MAX_CONVERGENCE_ROUNDS:
                break  # Top-K is stable, stop early
        else:
            stable_rounds = 0

    return top_k[:limit]

This prevents scanning hundreds of entity-linked documents when the top results are already clear after the first few batches. Particularly valuable for entities that appear in 100+ documents.

6. Query Type Suitability

Query type Two-pass benefit Example
Person queries High "What has Person A been focused on?"
Project queries High "Latest updates on Project X"
Team queries High "What did the platform team discuss?"
Topic + person High "What does Person B think about migration?"
Pure topic Low "How do deployment rollbacks work?"
Keyword lookup None "error code 5032"
Date-scoped Low "What happened last Tuesday?"

The confidence threshold in Pass 1 handles this automatically — pure topic queries won't match entities confidently, so they fall back to flat search.

7. CLI Flags

# Default: automatic (two-pass when entities match, flat otherwise)
kbx search "Person A migration status" --json

# Force flat search (bypass entity pass)
kbx search "Person A migration status" --no-hierarchy --json

# Tune alpha (more weight to entity match)
kbx search "Person A migration status" --hierarchy-alpha 0.7 --json

# Explain mode shows which pass was used
kbx search "Person A migration status" --explain --json
# → meta.search_mode: "two_pass"
# → meta.pass1_entities: [{name: "Person A", score: 0.92}]
# → per-result: explain.parent_entity_score: 0.92

8. Performance Implications

Added latency from Pass 1:

  • Entity FTS search: ~5ms (fast, small index)
  • Entity vector search: ~20ms (if embeddings exist for entities)
  • Total Pass 1: ~25ms overhead

Pass 2 can be faster than flat search:

  • Flat search scans all ~3000+ documents
  • Two-pass scans only documents linked to top entities (typically 50-200)
  • Net effect: often faster despite the extra pass

Caching opportunities:

  • Entity search results are stable across queries in the same session. Cache Pass 1 results for queries with overlapping entity matches.
  • Entity→document linkage is stable between index runs. Cache the entity_id → [doc_ids] mapping in memory.
  • LRU cache on entity scores with ~60s TTL would eliminate redundant Pass 1 calls in interactive sessions.

Worst case: Query matches 5 entities, each linked to 200 documents → 1000 candidate docs. With convergence detection (batch size 50, 3 stable rounds), typically evaluates 150-200 before stopping.

Integration with Other Features

Implementation Phases

  1. Phase 1 — Pass 1 entity search: Build entity scoring (FTS + vector on entity names/facts/roles). Confidence threshold for triggering two-pass. ~2 days
  2. Phase 2 — Pass 2 constrained search: Filter document search to entity-linked docs. Score propagation with configurable alpha. Fallback logic. ~2 days
  3. Phase 3 — Convergence detection: Batch processing with early stopping when top-K stabilises. ~1 day
  4. Phase 4 — CLI surface + caching: --no-hierarchy, --hierarchy-alpha flags. Entity score caching. Integration with --explain. ~1 day

Open Questions

  • Should Pass 1 use the existing find_entity_mentions() matching or a dedicated entity search index? The existing matching is string-based; a vector search over entity descriptions would catch semantic matches (e.g. "infrastructure issues" matching an entity whose role is "SRE Lead").
  • Should two-pass be the default for all searches, or opt-in via --hierarchy? Default-on with automatic fallback seems right, but adds complexity.
  • How should team entities work? A team query should expand to all team members' documents. Is this a recursive entity expansion (team → members → docs) or a single-level lookup?
  • Should the entity→document linkage include a "strength" score (e.g. number of mentions) that feeds into the propagation formula? A document mentioning an entity once vs ten times likely has different relevance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions