“container” sections embed with low/no semantic content #47

@au-phiware

Thank you for this project. I hope it catches on!

Problem: “container” sections embed with low/no semantic content

Today, parseSections() in src/lattice.ts builds a section tree from headings and computes:

  • startLine: heading line
  • endLine: line before the next heading (regardless of depth)
  • firstParagraph: first paragraph before the next heading (so “container” headings get '')

Indexing (src/search/index.ts) embeds and stores the raw markdown slice for each section (startLine..endLine).
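To make the slicing behavior concrete, here is a minimal sketch of the "end at the line before the next heading" rule. It is a simplified stand-in, not the actual parseSections(): the names `Slice`, `sliceByNextHeading`, and `sliceText` are hypothetical, and it ignores edge cases such as `#` lines inside fenced code blocks.

```typescript
// Hypothetical simplification of the slicing rule described above.
// Line numbers are 1-based and inclusive.
interface Slice {
  heading: string;
  depth: number;
  startLine: number;
  endLine: number;
}

function sliceByNextHeading(markdown: string): Slice[] {
  const lines = markdown.split("\n");
  const slices: Slice[] = [];
  lines.forEach((line, i) => {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!m) return;
    // Close the previous section on the line before this heading,
    // regardless of the depth of either heading.
    const prev = slices[slices.length - 1];
    if (prev) prev.endLine = i;
    slices.push({
      heading: m[2],
      depth: m[1].length,
      startLine: i + 1,
      endLine: lines.length, // provisional: runs to end of file
    });
  });
  return slices;
}

// Returns the raw markdown for one slice (what would be embedded today).
function sliceText(markdown: string, s: Slice): string {
  return markdown.split("\n").slice(s.startLine - 1, s.endLine).join("\n");
}
```

With this rule, a heading that is immediately followed by a subheading gets `startLine === endLine`, so its slice is the heading line alone.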

This produces a poor result for “container” sections that only contain subheadings, e.g.:

## OAuth Flow
### Refresh Tokens
(paragraphs...)

### PKCE
(paragraphs...)

### Error handling
(paragraphs...)

For ## OAuth Flow, the slice is typically just the heading line ## OAuth Flow (plus any trailing whitespace). So the vector DB stores and embeds a near-empty document. These headings then:

  • rank poorly in semantic search
  • don’t help retrieve the right “chapter/overview” node even when the children contain the relevant content

Proposal: improve embedding text for containers via title-path + child summary

Goals

  1. Improve semantic search recall/precision for outline-heavy docs.
  2. Keep results navigable (IDs remain stable; parents/children are still meaningful).
  3. Keep embedding payload bounded (avoid embedding huge subtrees).

Suggested Approach

When building the string that gets embedded/stored for a section, construct a richer “embedding document” that includes:

  1. Heading path / title context
  • Include the full path of headings (and file) derived from section.id
    • e.g. auth#OAuth Flow#Refresh Tokens → path: auth > OAuth Flow > Refresh Tokens
  • This disambiguates common headings and improves retrieval.
  2. Section body content
  • Include the section’s own raw markdown slice as we do today.
  3. Bounded summary of children
  • Append a ToC or short summary constructed from descendants, bounded by limits:
    • take direct children (or descendants) in document order
    • for each child, take its heading title
    • optionally, add its first paragraph (or up to a limited number of characters and/or words)

Example embedding doc for a container:

Path: auth > OAuth Flow

Table of Contents:
- Refresh Tokens
- PKCE
- Error handling
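The three parts listed above could be assembled roughly as follows. This is a sketch under assumptions, not the project's code: `SectionNode`, `EmbedOptions`, `headingPath`, and `buildEmbeddingDoc` are hypothetical names, the default limits are arbitrary, and it assumes `#` only appears in `section.id` as the path separator.

```typescript
// Hypothetical section shape; the real type in src/lattice.ts may differ.
interface SectionNode {
  id: string; // e.g. "auth#OAuth Flow#Refresh Tokens"
  heading: string;
  raw: string; // raw markdown slice, startLine..endLine
  firstParagraph: string; // '' for container headings
  children: SectionNode[];
}

interface EmbedOptions {
  maxChildren: number; // cap on ToC entries
  maxSnippetChars: number; // cap on each child's first-paragraph snippet
}

// "auth#OAuth Flow" -> "auth > OAuth Flow"
function headingPath(id: string): string {
  return id.split("#").join(" > ");
}

function buildEmbeddingDoc(
  section: SectionNode,
  opts: EmbedOptions = { maxChildren: 20, maxSnippetChars: 160 },
): string {
  // 1. Heading path / title context.
  const parts: string[] = [`Path: ${headingPath(section.id)}`];

  // 2. The section's own raw slice, as embedded today.
  const raw = section.raw.trim();
  if (raw) parts.push(raw);

  // 3. Bounded summary of direct children, in document order.
  const toc = section.children.slice(0, opts.maxChildren).map((child) => {
    const snippet = child.firstParagraph.trim().slice(0, opts.maxSnippetChars);
    return snippet ? `- ${child.heading}: ${snippet}` : `- ${child.heading}`;
  });
  if (toc.length > 0) parts.push(["Table of Contents:", ...toc].join("\n"));

  return parts.join("\n\n");
}
```

Because this sketch includes the raw slice unchanged, a container's output also carries its own heading line between the path and the ToC. Whether to drop that line when it is the only content is a separate choice.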

Notes / design questions

  • Summarize direct children vs all descendants until a size limit?
    • Suggestion: direct children first; if still empty, include grandchildren until limits hit.
  • Keep the stored DB heading column unchanged (still the section’s own heading), but store richer content (or add a new column) for use in embeddings and search display.
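One possible reading of the "direct children first, then grandchildren until limits hit" suggestion is a level-by-level traversal with a character budget. The sketch below is only an illustration: `OutlineNode` and `collectHeadings` are hypothetical names, and the budget counts heading characters plus one separator per entry. Note that it continues to deeper levels whenever budget remains, and that its output is in level order rather than strict document order.

```typescript
// Hypothetical minimal outline node; only what the traversal needs.
interface OutlineNode {
  heading: string;
  children: OutlineNode[];
}

// Collect descendant headings level by level (children, then grandchildren, ...)
// and stop as soon as the next heading would exceed the character budget.
function collectHeadings(root: OutlineNode, maxChars: number): string[] {
  const out: string[] = [];
  let used = 0;
  let level: OutlineNode[] = root.children;
  while (level.length > 0) {
    for (const node of level) {
      const cost = node.heading.length + 1; // +1 for a separator/newline
      if (used + cost > maxChars) return out;
      out.push(node.heading);
      used += cost;
    }
    // Move one level deeper, preserving document order within the level.
    const next: OutlineNode[] = [];
    for (const node of level) next.push(...node.children);
    level = next;
  }
  return out;
}
```

Processing direct children before any grandchildren means that a tight budget truncates the deepest headings first, so the top-level outline of the container survives.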

Why this is worth it

This repo’s section model is hierarchical, but embeddings are currently derived from a non-hierarchical slice strategy. Adding heading-path context and a bounded child summary makes embeddings reflect the hierarchy and improves semantic search for outline-heavy documentation.
