Thank you for this project. I hope it catches on!
Problem: “container” sections embed with low/no semantic content
Today, parseSections() in src/lattice.ts builds a section tree from headings and computes:
- startLine: heading line
- endLine: line before the next heading (regardless of depth)
- firstParagraph: first paragraph before the next heading (so “container” headings get '')
Indexing (src/search/index.ts) embeds and stores the raw markdown slice for each section (startLine..endLine).
This produces a poor result for “container” sections that only contain subheadings, e.g.:
## OAuth Flow
### Refresh Tokens
(paragraphs...)
### PKCE
(paragraphs...)
### Error handling
(paragraphs...)
For ## OAuth Flow, the slice is typically just the heading line ## OAuth Flow (plus perhaps some whitespace), so the vector DB embeds and stores a near-empty document. These headings then:
- rank poorly in semantic search
- don’t help retrieve the right “chapter/overview” node even when the children contain the relevant content
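To make the problem concrete, here is a minimal sketch of the slicing behaviour described above. It is not the actual parseSections() from src/lattice.ts: the FlatSection type, the sliceSections and rawSlice helpers, and the 0-based line numbers are all assumptions for illustration, and fenced code blocks are ignored for brevity.

```typescript
// Hypothetical, simplified model of the slicing described above.
// This is NOT the real parseSections(); names and 0-based lines are assumptions.
interface FlatSection {
  heading: string;
  depth: number;
  startLine: number; // index of the heading line
  endLine: number; // index of the last line before the next heading
}

function sliceSections(markdown: string): FlatSection[] {
  const lines = markdown.split("\n");
  const sections: FlatSection[] = [];
  lines.forEach((line, i) => {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!m) return;
    // Close the previous section at the line before this heading,
    // regardless of the depth of either heading.
    const prev = sections[sections.length - 1];
    if (prev) prev.endLine = i - 1;
    sections.push({
      heading: m[2].trim(),
      depth: m[1].length,
      startLine: i,
      endLine: lines.length - 1, // provisional: runs to end of file
    });
  });
  return sections;
}

function rawSlice(markdown: string, s: FlatSection): string {
  return markdown.split("\n").slice(s.startLine, s.endLine + 1).join("\n");
}

const doc = [
  "## OAuth Flow",
  "",
  "### Refresh Tokens",
  "Refresh tokens let a client obtain new access tokens.",
  "",
  "### PKCE",
  "PKCE protects the authorization code exchange.",
].join("\n");

const [container, firstChild] = sliceSections(doc);
// The container's slice holds only its heading line and a blank line.
console.log(JSON.stringify(rawSlice(doc, container)));
// The child's slice holds real prose.
console.log(JSON.stringify(rawSlice(doc, firstChild)));
```

Under these assumptions the container's slice carries no text beyond its own title, which is what ends up embedded.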
Proposal: improve embedding text for containers via title-path + child summary
Goals
- Improve semantic search recall/precision for outline-heavy docs.
- Keep results navigable (IDs remain stable; parents/children are still meaningful).
- Keep embedding payload bounded (avoid embedding huge subtrees).
Suggested Approach
When building the string that gets embedded/stored for a section, construct a richer “embedding document” that includes:
- Heading path / title context
  - Include the full path of headings (and file) derived from section.id
  - e.g. auth#OAuth Flow#Refresh Tokens → path: auth > OAuth Flow > Refresh Tokens
  - This disambiguates common headings and improves retrieval.
- Section body content
  - Include the section’s own raw markdown slice as we do today.
- A bounded summary of children
  - Append a ToC or short summary constructed from descendants, bounded by limits:
    - take direct children (or descendants) in document order
    - for each child, take its heading title
    - optionally, take its first paragraph (or up to a limited number of characters and/or words)
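As a rough sketch of the heading-path step, assuming section ids always have the form file#Heading#Subheading as in the example above (headingPath is a hypothetical helper, not an existing function in the repo):

```typescript
// Hypothetical helper: turn a section id into a readable heading path.
// Assumption: ids are "#"-separated, e.g. "auth#OAuth Flow#Refresh Tokens".
function headingPath(sectionId: string): string {
  return sectionId
    .split("#")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" > ");
}

console.log(headingPath("auth#OAuth Flow#Refresh Tokens"));
// prints: auth > OAuth Flow > Refresh Tokens
```

A heading that itself contains a literal # would be split incorrectly by this sketch, so a real implementation would probably want to build the path from the section tree instead of re-parsing the id.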
Example embedding doc for a container:
Path: auth > OAuth Flow
Table of Contents:
- Refresh Tokens
- PKCE
- Error handling
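One possible way to assemble such a document is sketched below. The SectionNode shape, the EmbeddingLimits defaults, and buildEmbeddingDoc are all hypothetical; in particular, body is assumed to be the section's own text with the heading line already stripped, which may not match how the index stores slices today.

```typescript
// Hypothetical section shape; field names are assumptions for illustration.
interface SectionNode {
  id: string; // e.g. "auth#OAuth Flow"
  heading: string;
  body: string; // the section's own text, heading line stripped (assumption)
  firstParagraph: string; // '' for container sections
  children: SectionNode[];
}

interface EmbeddingLimits {
  maxChildren: number; // cap on listed children
  maxSnippetChars: number; // cap on each child's first-paragraph snippet
}

function buildEmbeddingDoc(
  section: SectionNode,
  limits: EmbeddingLimits = { maxChildren: 20, maxSnippetChars: 160 },
): string {
  // 1. Heading path derived from the id.
  const parts: string[] = [`Path: ${section.id.split("#").join(" > ")}`];

  // 2. The section's own content, if it has any.
  const body = section.body.trim();
  if (body) parts.push(body);

  // 3. Bounded summary of direct children, in document order.
  const kids = section.children.slice(0, limits.maxChildren);
  if (kids.length > 0) {
    const toc = kids.map((child) => {
      const snippet = child.firstParagraph.trim().slice(0, limits.maxSnippetChars);
      return snippet ? `- ${child.heading}: ${snippet}` : `- ${child.heading}`;
    });
    parts.push(["Table of Contents:", ...toc].join("\n"));
  }
  return parts.join("\n");
}

const leaf = (id: string, heading: string): SectionNode => ({
  id,
  heading,
  body: "",
  firstParagraph: "",
  children: [],
});

const oauth: SectionNode = {
  id: "auth#OAuth Flow",
  heading: "OAuth Flow",
  body: "",
  firstParagraph: "",
  children: [
    leaf("auth#OAuth Flow#Refresh Tokens", "Refresh Tokens"),
    leaf("auth#OAuth Flow#PKCE", "PKCE"),
    leaf("auth#OAuth Flow#Error handling", "Error handling"),
  ],
};

console.log(buildEmbeddingDoc(oauth));
```

With empty child paragraphs this reproduces the example document above; when children have a firstParagraph, each ToC entry gains a truncated snippet.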
Notes / design questions
- Summarize direct children vs all descendants until a size limit?
  - Suggestion: direct children first; if still empty, include grandchildren until limits hit.
- Keep the stored DB heading column unchanged (still the section’s own heading), but store a richer content value (or a new column) used for embeddings/search display.
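For the first question, one interpretation of "direct children first, then grandchildren until limits hit" is a level-order walk with a character budget. The sketch below assumes a hypothetical OutlineNode shape and a budget counted over heading titles only; other budgets (tokens, words) would work the same way.

```typescript
// Hypothetical outline node; only what the walk needs.
interface OutlineNode {
  heading: string;
  children: OutlineNode[];
}

// Collect descendant headings level by level (direct children first,
// then grandchildren, ...) until the character budget would be exceeded.
function collectSummaryHeadings(root: OutlineNode, maxChars: number): string[] {
  const picked: string[] = [];
  let used = 0;
  let level = root.children;
  while (level.length > 0) {
    const next: OutlineNode[] = [];
    for (const node of level) {
      if (used + node.heading.length > maxChars) return picked;
      picked.push(node.heading);
      used += node.heading.length;
      next.push(...node.children);
    }
    level = next;
  }
  return picked;
}

const outline: OutlineNode = {
  heading: "OAuth Flow",
  children: [
    {
      heading: "Refresh Tokens",
      children: [{ heading: "Rotation", children: [] }],
    },
    { heading: "PKCE", children: [] },
  ],
};

console.log(collectSummaryHeadings(outline, 200));
```

Because each level is visited in document order, a tight budget degrades gracefully: the deepest headings are dropped first and direct children are kept.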
References
- src/lattice.ts: https://github.com/1st1/lat.md/blob/main/src/lattice.ts
Why this is worth it
This repo’s section model is hierarchical, but embeddings are currently derived from a non-hierarchical slice strategy. Adding heading-path context and a bounded child summary makes embeddings reflect the hierarchy and improves semantic search for outline-heavy documentation.