Summary
Split search into two stages: first find the most relevant entities (people, projects, teams), then search within documents linked to those entities — blending entity relevance into document scores for dramatically better precision on structured queries.
# Automatic two-pass (when entity matches are confident)
kbx search "what did the platform team decide about migration?"
# Force flat search (bypass hierarchy)
kbx search "what did the platform team decide about migration?" --no-hierarchy
Motivation
kbx currently does flat search across all 3000+ documents. Every document is treated equally regardless of its relationship to entities. This works well for keyword-specific queries but poorly for entity-centric queries like:
- "What has Person A been working on?" — should prioritise documents where Person A is an attendee or is mentioned heavily, not documents that happen to contain similar keywords
- "Latest updates on Project X" — should prioritise documents linked to that project entity, not unrelated documents that mention similar terms
- "What did the infrastructure team discuss last week?" — should scope to meetings with infra team members
The problem: Flat vector search treats "Person A discussed the migration" and "Person B discussed a different migration" as equally relevant to a query about Person A's work. Two-pass search uses the entity graph to filter first, then score within that filtered set.
Inspiration: OpenViking's HierarchicalRetriever uses a priority-queue BFS over its directory tree with score propagation (final = 0.5 * child + 0.5 * parent). Their convergence detection stops early when top-K stabilises. We adapt this concept to kbx's entity→document graph rather than a filesystem hierarchy.
Design
1. Pass 1 — Entity Search
Use the existing entity index to find the most relevant entities for the query.
def pass1_entity_search(query: str, limit: int = 5) -> list[EntityScore]:
"""
Search entities (people, projects, teams) by name, alias, role, and facts.
Returns top-K entities with relevance scores.
"""
# FTS search across entity names, aliases, roles, facts
fts_matches = entity_fts_search(query, limit=limit * 2)
# Vector search across entity descriptions/facts
vector_matches = entity_vector_search(query, limit=limit * 2)
# Merge and deduplicate
return merge_entity_scores(fts_matches, vector_matches, limit=limit)
Entity scoring sources:
- Name/alias exact or fuzzy match (highest signal)
- Role/team match
- Fact text match (e.g. entity fact "leads the migration project" matches query "migration")
- Vector similarity of entity description to query
Output: List of (entity_id, entity_name, entity_type, score) tuples, sorted by score descending.
Confidence threshold: Pass 1 results are only used if the top entity scores above a configurable threshold (e.g. 0.5). Below that, fall back to flat search — the query is probably not entity-centric.
2. Pass 2 — Entity-Constrained Document Search
Search documents linked to the top entities from Pass 1, blending entity relevance into document scores.
def pass2_document_search(
query: str,
entity_scores: list[EntityScore],
alpha: float = 0.5,
limit: int = 10,
) -> list[SearchResult]:
"""
Search within documents associated with top entities.
Blend entity score into document score.
"""
# Collect document IDs linked to top entities (via entity mention index)
candidate_doc_ids = set()
doc_entity_scores = {} # doc_id -> best parent entity score
for entity in entity_scores:
linked_docs = get_documents_mentioning_entity(entity.id)
for doc_id in linked_docs:
candidate_doc_ids.add(doc_id)
# Keep the best entity score per document
doc_entity_scores[doc_id] = max(
doc_entity_scores.get(doc_id, 0.0),
entity.score,
)
# Run standard FTS+vector search, but constrained to candidate docs
doc_results = search_within_documents(query, doc_ids=candidate_doc_ids)
# Blend scores
for result in doc_results:
parent_score = doc_entity_scores.get(result.doc_id, 0.0)
result.score = alpha * result.score + (1 - alpha) * parent_score
return sorted(doc_results, key=lambda r: r.score, reverse=True)[:limit]
Document→entity linkage uses the existing entity mention index (find_entity_mentions()). A document is linked to an entity if:
- The entity is mentioned in the document text (existing linking)
- The entity is listed as an attendee (for meetings)
- The entity is referenced in frontmatter metadata
3. Score Propagation Formula
final_score = alpha * doc_relevance_score + (1 - alpha) * parent_entity_score
| Alpha |
Behaviour |
| 1.0 |
Pure document relevance (flat search) |
| 0.5 |
Equal weight to document relevance and entity match (default) |
| 0.0 |
Pure entity proximity (rank by how strongly linked to matched entity) |
Default: alpha = 0.5 — a document that's moderately relevant but strongly linked to a matched entity ranks similarly to a highly relevant document with weak entity linkage.
Edge case: Documents linked to multiple matched entities get the best (max) entity score, not the sum. This prevents documents mentioning many entities from dominating purely through quantity.
Configurable in kbx.toml:
[search]
hierarchy_alpha = 0.5 # 0.0 to 1.0
hierarchy_entity_threshold = 0.5 # minimum entity score to trigger two-pass
hierarchy_max_entities = 5 # max entities from Pass 1
4. Fallback to Flat Search
Two-pass search is not always appropriate. Fall back to standard flat search when:
-
No confident entities: Pass 1 top score < threshold (default 0.5). The query is probably not entity-centric (e.g. "how do we handle deployment rollbacks?" — general topic, no specific entity).
-
Too many entities: If Pass 1 returns many entities with similar scores, the query is too broad for entity-guided filtering. Fall back if top_score - 5th_score < 0.1.
-
Explicit opt-out: --no-hierarchy flag.
-
FTS-only mode: --fast (FTS-only search) skips two-pass by default since entity vector search isn't available. Could still do entity FTS matching in a future iteration.
Fallback is transparent: Results include a search_mode field in --explain output indicating whether two-pass or flat search was used, and why.
5. Convergence Detection
For large entity-linked document sets, implement early stopping:
MAX_CONVERGENCE_ROUNDS = 3
def search_with_convergence(query, candidate_docs, limit):
"""Process candidate docs in batches, stop when top-K stabilises."""
top_k = []
stable_rounds = 0
for batch in chunk(candidate_docs, batch_size=50):
new_results = score_batch(query, batch)
prev_top_k = [r.doc_id for r in top_k[:limit]]
top_k = merge_and_sort(top_k + new_results)[:limit * 2]
curr_top_k = [r.doc_id for r in top_k[:limit]]
if prev_top_k == curr_top_k:
stable_rounds += 1
if stable_rounds >= MAX_CONVERGENCE_ROUNDS:
break # Top-K is stable, stop early
else:
stable_rounds = 0
return top_k[:limit]
This prevents scanning hundreds of entity-linked documents when the top results are already clear after the first few batches. Particularly valuable for entities that appear in 100+ documents.
6. Query Type Suitability
| Query type |
Two-pass benefit |
Example |
| Person queries |
High |
"What has Person A been focused on?" |
| Project queries |
High |
"Latest updates on Project X" |
| Team queries |
High |
"What did the platform team discuss?" |
| Topic + person |
High |
"What does Person B think about migration?" |
| Pure topic |
Low |
"How do deployment rollbacks work?" |
| Keyword lookup |
None |
"error code 5032" |
| Date-scoped |
Low |
"What happened last Tuesday?" |
The confidence threshold in Pass 1 handles this automatically — pure topic queries won't match entities confidently, so they fall back to flat search.
7. CLI Flags
# Default: automatic (two-pass when entities match, flat otherwise)
kbx search "Person A migration status" --json
# Force flat search (bypass entity pass)
kbx search "Person A migration status" --no-hierarchy --json
# Tune alpha (more weight to entity match)
kbx search "Person A migration status" --hierarchy-alpha 0.7 --json
# Explain mode shows which pass was used
kbx search "Person A migration status" --explain --json
# → meta.search_mode: "two_pass"
# → meta.pass1_entities: [{name: "Person A", score: 0.92}]
# → per-result: explain.parent_entity_score: 0.92
8. Performance Implications
Added latency from Pass 1:
- Entity FTS search: ~5ms (fast, small index)
- Entity vector search: ~20ms (if embeddings exist for entities)
- Total Pass 1: ~25ms overhead
Pass 2 can be faster than flat search:
- Flat search scans all ~3000+ documents
- Two-pass scans only documents linked to top entities (typically 50-200)
- Net effect: often faster despite the extra pass
Caching opportunities:
- Entity search results are stable across queries in the same session. Cache Pass 1 results for queries with overlapping entity matches.
- Entity→document linkage is stable between index runs. Cache the
entity_id → [doc_ids] mapping in memory.
- LRU cache on entity scores with ~60s TTL would eliminate redundant Pass 1 calls in interactive sessions.
Worst case: Query matches 5 entities, each linked to 200 documents → 1000 candidate docs. With convergence detection (batch size 50, 3 stable rounds), typically evaluates 150-200 before stopping.
Integration with Other Features
Implementation Phases
- Phase 1 — Pass 1 entity search: Build entity scoring (FTS + vector on entity names/facts/roles). Confidence threshold for triggering two-pass.
~2 days
- Phase 2 — Pass 2 constrained search: Filter document search to entity-linked docs. Score propagation with configurable alpha. Fallback logic.
~2 days
- Phase 3 — Convergence detection: Batch processing with early stopping when top-K stabilises.
~1 day
- Phase 4 — CLI surface + caching:
--no-hierarchy, --hierarchy-alpha flags. Entity score caching. Integration with --explain. ~1 day
Open Questions
Summary
Split search into two stages: first find the most relevant entities (people, projects, teams), then search within documents linked to those entities — blending entity relevance into document scores for dramatically better precision on structured queries.
Motivation
kbx currently does flat search across all 3000+ documents. Every document is treated equally regardless of its relationship to entities. This works well for keyword-specific queries but poorly for entity-centric queries like:
The problem: Flat vector search treats "Person A discussed the migration" and "Person B discussed a different migration" as equally relevant to a query about Person A's work. Two-pass search uses the entity graph to filter first, then score within that filtered set.
Inspiration: OpenViking's
HierarchicalRetrieveruses a priority-queue BFS over its directory tree with score propagation (final = 0.5 * child + 0.5 * parent). Their convergence detection stops early when top-K stabilises. We adapt this concept to kbx's entity→document graph rather than a filesystem hierarchy.Design
1. Pass 1 — Entity Search
Use the existing entity index to find the most relevant entities for the query.
Entity scoring sources:
Output: List of
(entity_id, entity_name, entity_type, score)tuples, sorted by score descending.Confidence threshold: Pass 1 results are only used if the top entity scores above a configurable threshold (e.g. 0.5). Below that, fall back to flat search — the query is probably not entity-centric.
2. Pass 2 — Entity-Constrained Document Search
Search documents linked to the top entities from Pass 1, blending entity relevance into document scores.
Document→entity linkage uses the existing entity mention index (
find_entity_mentions()). A document is linked to an entity if:3. Score Propagation Formula
Default:
alpha = 0.5— a document that's moderately relevant but strongly linked to a matched entity ranks similarly to a highly relevant document with weak entity linkage.Edge case: Documents linked to multiple matched entities get the best (max) entity score, not the sum. This prevents documents mentioning many entities from dominating purely through quantity.
Configurable in
kbx.toml:4. Fallback to Flat Search
Two-pass search is not always appropriate. Fall back to standard flat search when:
No confident entities: Pass 1 top score < threshold (default 0.5). The query is probably not entity-centric (e.g. "how do we handle deployment rollbacks?" — general topic, no specific entity).
Too many entities: If Pass 1 returns many entities with similar scores, the query is too broad for entity-guided filtering. Fall back if
top_score - 5th_score < 0.1.Explicit opt-out:
--no-hierarchyflag.FTS-only mode:
--fast(FTS-only search) skips two-pass by default since entity vector search isn't available. Could still do entity FTS matching in a future iteration.Fallback is transparent: Results include a
search_modefield in--explainoutput indicating whether two-pass or flat search was used, and why.5. Convergence Detection
For large entity-linked document sets, implement early stopping:
This prevents scanning hundreds of entity-linked documents when the top results are already clear after the first few batches. Particularly valuable for entities that appear in 100+ documents.
6. Query Type Suitability
The confidence threshold in Pass 1 handles this automatically — pure topic queries won't match entities confidently, so they fall back to flat search.
7. CLI Flags
8. Performance Implications
Added latency from Pass 1:
Pass 2 can be faster than flat search:
Caching opportunities:
entity_id → [doc_ids]mapping in memory.Worst case: Query matches 5 entities, each linked to 200 documents → 1000 candidate docs. With convergence detection (batch size 50, 3 stable rounds), typically evaluates 150-200 before stopping.
Integration with Other Features
final = (1 - hotness_weight) * two_pass_score + hotness_weight * hotnessImplementation Phases
~2 days~2 days~1 day--no-hierarchy,--hierarchy-alphaflags. Entity score caching. Integration with--explain.~1 dayOpen Questions
find_entity_mentions()matching or a dedicated entity search index? The existing matching is string-based; a vector search over entity descriptions would catch semantic matches (e.g. "infrastructure issues" matching an entity whose role is "SRE Lead").--hierarchy? Default-on with automatic fallback seems right, but adds complexity.