A PageIndex-inspired architecture for document retrieval through agentic reasoning over hierarchical document trees, replacing flat chunking and single-shot vector similarity with LLM-powered tree traversal.
These are the foundational changes that transform Vectorless from flat section selection into a true PageIndex-style hierarchical tree search system. Everything else builds on this.
Stop flattening `NativeHeading.children` in the extract strategy. Store sections with `parentSectionId` and `level` to preserve the document's natural heading tree. The `ToCManifest` gets a tree variant (`ToCTreeManifest`) with nested `ToCTreeNode[]` that the LLM agent navigates.
- Status: Implemented
- Impact: Critical — enables tree traversal
- Files: `extract.strategy.ts`, `toc/index.ts`, shared types
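Under this scheme, rebuilding the nested tree from the flat rows is a single pass. A minimal sketch, assuming the stored shape below (field names beyond `parentSectionId` and `level` are illustrative):

```typescript
interface FlatSection {
  id: string;
  parentSectionId: string | null;
  level: number;
  title: string;
  summary?: string;
}

interface ToCTreeNode {
  id: string;
  title: string;
  level: number;
  summary?: string;
  children: ToCTreeNode[];
}

// Rebuild the nested tree the agent navigates from flat section rows.
function buildToCTree(sections: FlatSection[]): ToCTreeNode[] {
  const nodes = new Map<string, ToCTreeNode>();
  for (const s of sections) {
    nodes.set(s.id, { id: s.id, title: s.title, level: s.level, summary: s.summary, children: [] });
  }
  const roots: ToCTreeNode[] = [];
  for (const s of sections) {
    const node = nodes.get(s.id)!;
    const parent = s.parentSectionId ? nodes.get(s.parentSectionId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}
```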
Bottom-up LLM summarization across the entire document tree. Leaf nodes get precision retrieval summaries (names entities, concepts, scope boundaries). Branch nodes get scope summaries synthesized from their children's summaries. The root gets a document-level summary. This creates the "multi-resolution map" — the LLM agent can grasp high-level themes at branch level without loading raw text.
- Status: Implemented
- Impact: Critical — makes tree navigation effective
- Files: `toc/summarize.ts`, `hybrid.strategy.ts`
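The bottom-up pass reduces to a post-order traversal. A sketch where the two callbacks stand in for the LLM summarization calls (which are async in the real pipeline):

```typescript
interface TreeNode {
  title: string;
  content?: string; // leaf nodes carry raw text
  summary?: string;
  children: TreeNode[];
}

// Bottom-up: leaves get precision summaries from raw text, branches get
// scope summaries synthesized from their children's summaries. The root
// summary doubles as the document-level summary.
function summarizeTree(
  node: TreeNode,
  summarizeLeaf: (text: string) => string,
  summarizeBranch: (title: string, childSummaries: string[]) => string,
): string {
  if (node.children.length === 0) {
    node.summary = summarizeLeaf(node.content ?? "");
  } else {
    const childSummaries = node.children.map((c) =>
      summarizeTree(c, summarizeLeaf, summarizeBranch),
    );
    node.summary = summarizeBranch(node.title, childSummaries);
  }
  return node.summary;
}
```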
Replace the single-shot LLM call with a multi-step agent built on the Vercel AI SDK's `generateText` + tools. The agent iteratively calls six tools to navigate the document tree:
| Tool | Purpose |
|---|---|
| `getRoots` | See top-level sections (called first) |
| `expandBranch` | Drill into a branch to see children |
| `getContent` | Retrieve leaf content (tracks token budget) |
| `getSiblings` | See adjacent sections for context |
| `searchTree` | Keyword search across all titles/summaries |
| `finish` | Signal completion with final reasoning |
This is the MCTS-style approach: the agent reasons about which branches to explore, drills down through levels, and retrieves content from relevant leaf nodes — producing a full reasoning trace at every step.
- Status: Implemented
- Impact: Critical — the core differentiator
- Files: `retrieval/tree-agent.ts`, `retrieval/model-selector.ts`, `routes/query.ts`
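Stripped of the AI SDK wiring, the core loop is tool dispatch driven by the model's choices. A self-contained sketch that substitutes a `plan` callback for the LLM (everything except the tool names is illustrative):

```typescript
type ToolCall = { tool: string; args: Record<string, unknown> };

// Minimal dispatch loop standing in for generateText + tools: keep
// executing whichever tool the model picks until it calls `finish`
// or the step budget runs out.
function runAgent(
  tools: Record<string, (args: Record<string, unknown>) => unknown>,
  plan: (lastResult: unknown) => ToolCall,
  maxSteps = 10,
): unknown {
  let last: unknown = null;
  for (let step = 0; step < maxSteps; step++) {
    const call = plan(last);
    const result = tools[call.tool](call.args);
    if (call.tool === "finish") return result; // agent signals completion
    last = result;
  }
  return last; // step budget exhausted
}
```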
Track page boundaries through the full pipeline including after section splitting. When a leaf is split, estimate proportional page ranges for each part rather than naively copying the parent's full range. For non-PDF formats, derive page-equivalent boundaries from character offsets.
- Status: Implemented (proportional estimation)
- Impact: High — accuracy for citations
- Files: `splitter/index.ts`
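Proportional estimation can be sketched as splitting the parent's page range by each part's character length (function name and rounding policy are illustrative):

```typescript
// Estimate page ranges for the parts of a split leaf, proportional to
// character length, instead of naively copying the parent's full range.
function estimateSplitPageRanges(
  parentRange: [number, number],
  partLengths: number[],
): [number, number][] {
  const [start, end] = parentRange;
  const pages = end - start + 1;
  const total = partLengths.reduce((a, b) => a + b, 0);
  const ranges: [number, number][] = [];
  let cursor = 0;
  for (const len of partLengths) {
    // Map the part's character span onto the parent's page span.
    const from = start + Math.floor((cursor / total) * pages);
    cursor += len;
    const to = Math.min(end, start + Math.max(0, Math.ceil((cursor / total) * pages) - 1));
    ranges.push([from, Math.max(from, to)]);
  }
  return ranges;
}
```

For non-PDF formats the same mapping applies, with character offsets standing in for pages.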
These build on the tree architecture to make retrieval smarter, faster, and more capable.
Already built into the agent tool parameters. The `getContent` tool requires a `confidence` parameter (0-1). The traversal trace records confidence at each step, enabling post-hoc analysis, low-confidence filtering, and retrieval quality monitoring.
- Effort: Done (built into agent)
- Impact: High
The agent is told its token budget in the system prompt and receives `tokens_remaining` after each `getContent` call. When the budget is exhausted, the agent calls `finish`. The `getContent` tool enforces a hard stop when the budget would be exceeded.
- Effort: Done (built into agent)
- Impact: High
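A sketch of the hard-stop budget tracking, assuming a rough 4-characters-per-token estimate (class and method names are illustrative):

```typescript
// Tracks the agent's content-retrieval budget; getContent consults this
// before returning leaf text and refuses when the budget would be blown.
class TokenBudget {
  private used = 0;
  constructor(private readonly limit: number) {}

  // Rough heuristic: ~4 characters per token (assumption, not a tokenizer).
  static estimate(text: string): number {
    return Math.ceil(text.length / 4);
  }

  tryConsume(text: string): { ok: boolean; tokensRemaining: number } {
    const cost = TokenBudget.estimate(text);
    if (this.used + cost > this.limit) {
      // Hard stop: report remaining budget so the agent can call finish.
      return { ok: false, tokensRemaining: this.limit - this.used };
    }
    this.used += cost;
    return { ok: true, tokensRemaining: this.limit - this.used };
  }
}
```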
The Vercel AI SDK handles parallel tool calling automatically — if the LLM returns multiple `expandBranch` calls in one step, all execute concurrently. The system prompt encourages this: "You can call multiple tools in a single step to explore branches in parallel."
- Effort: Done (inherent in AI SDK)
- Impact: Medium
Fetch tree ToC manifests for all documents in a project and let the agent traverse multiple document trees in a single query. New endpoint: `POST /v1/query` (project-level) with a `doc_ids[]` param. The agent gets a `listDocuments` tool that returns document titles and root-level summaries, then selects which documents to explore.
- Effort: Medium
- Impact: 🔥🔥🔥
- Phase: Next
When retrieval selects a leaf section, the agent can call `getSiblings` to see adjacent sections at the same level. This provides structural context without the blunt overlap of traditional chunking. Particularly useful for sections that reference context established in the preceding section.
- Effort: Done (built into agent tools)
- Impact: Medium
Scan sections for "see Section X" or "as described in Chapter Y" patterns at ingest time and store them in the `crossReferences` JSONB column. Add a `followReferences` tool to the agent that fetches referenced sections alongside selected ones. This ensures the LLM has all the context a human reader would follow.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Next
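The ingest-time scan could be a simple regex pass. The patterns below cover only the phrasings quoted above and would need extending for real corpora:

```typescript
// Match explicit cross-references like "see Section 4.2" or
// "as described in Chapter 3" (patterns illustrative, not exhaustive).
const XREF_PATTERN =
  /\b(?:see|as described in|refer to)\s+(section|chapter|appendix)\s+([\dA-Z][\w.]*)/gi;

function extractCrossReferences(text: string): { kind: string; target: string }[] {
  const refs: { kind: string; target: string }[] = [];
  for (const m of text.matchAll(XREF_PATTERN)) {
    refs.push({ kind: m[1].toLowerCase(), target: m[2] });
  }
  return refs;
}
```

The extracted `{ kind, target }` pairs are what would land in the `crossReferences` column, resolved to section IDs at ingest time.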
Cache the mapping `(docId, queryHash) → sectionIds[]` in an LRU cache. Before running the full agent, check if a similar query has been seen. On cache miss, run the agent and store the result. This eliminates redundant LLM calls for repeated or similar queries.
- Effort: Low
- Impact: 🔥🔥
- Phase: Next
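A minimal sketch of the cache, using a `Map`'s insertion order for LRU eviction and simple lowercase normalization in place of a real query hash (names and policy are illustrative):

```typescript
// LRU cache of retrieval results keyed by docId + normalized query.
class RetrievalCache {
  private cache = new Map<string, string[]>();
  constructor(private readonly maxEntries = 1000) {}

  private key(docId: string, query: string): string {
    // Real implementation would hash a more robust normalization.
    return `${docId}:${query.trim().toLowerCase()}`;
  }

  get(docId: string, query: string): string[] | undefined {
    const k = this.key(docId, query);
    const hit = this.cache.get(k);
    if (hit) {
      // Refresh recency: re-insert so this key moves to the back.
      this.cache.delete(k);
      this.cache.set(k, hit);
    }
    return hit;
  }

  set(docId: string, query: string, sectionIds: string[]): void {
    const k = this.key(docId, query);
    this.cache.delete(k);
    this.cache.set(k, sectionIds);
    if (this.cache.size > this.maxEntries) {
      // Evict least recently used (first key in insertion order).
      this.cache.delete(this.cache.keys().next().value!);
    }
  }
}
```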
Instead of a fixed `maxSteps`, classify query complexity and adjust the step budget automatically:
- Simple factual queries: 3-5 steps
- Multi-aspect queries: 8-12 steps
- Exploratory/survey queries: 15+ steps
The system prompt can analyze the query type and set traversal expectations.
- Effort: Low
- Impact: 🔥
- Phase: Next
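A heuristic classifier along these lines might look like the sketch below (keywords and thresholds are illustrative, chosen to match the ranges above):

```typescript
type Complexity = "simple" | "multi-aspect" | "exploratory";

// Map a query to a complexity class and step budget. A real version
// might instead ask a cheap LLM to classify; this is a pure heuristic.
function classifyQuery(query: string): { complexity: Complexity; maxSteps: number } {
  const q = query.toLowerCase();
  const surveyWords = ["overview", "summarize", "compare", "survey", "all of"];
  // Rough aspect count: clauses joined by "and", commas, or semicolons.
  const aspects = q.split(/\band\b|,|;/).length;
  if (surveyWords.some((w) => q.includes(w))) {
    return { complexity: "exploratory", maxSteps: 15 };
  }
  if (aspects >= 2 || q.length > 120) {
    return { complexity: "multi-aspect", maxSteps: 10 };
  }
  return { complexity: "simple", maxSteps: 5 };
}
```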
Improvements to parsing, extraction, and summary quality that feed into the tree.
Send each PDF page as an image to Gemini Vision API for layout-aware extraction. Captures tables, headings inferred from visual hierarchy (font size, bold, positioning), images, and scanned/OCR documents that text-based parsers miss. ~$0.001 per page with Gemini Flash.
- Effort: Medium
- Impact: 🔥🔥🔥
- Phase: Next
Detect tables via VLM and convert to markdown or HTML table format within section content. Preserves row, column, and header relationships so downstream LLMs can reason over tabular data accurately. Critical for financial, clinical, and legal documents.
- Effort: Medium
- Impact: 🔥🔥🔥
- Phase: Next
When sections contain images, charts, or figures, send them to a VLM and generate searchable text descriptions embedded in section content. Makes visual content discoverable through the tree search and gives LLMs meaningful context about figures.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Next
Prepend document-level context to each section before generating summaries: "Here is a document about X. This section covers..." Drawn from Anthropic's Contextual Retrieval research (49% fewer missed retrievals). Now feeds directly into the recursive summarization pipeline — both leaf and branch summaries benefit.
- Effort: Low
- Impact: 🔥🔥🔥
- Phase: Next
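A sketch of the prompt construction; the exact template wording is an illustrative design choice, not the shipped prompt:

```typescript
// Prepend document-level context so the summarizer knows what the
// section belongs to, per the contextual-retrieval technique.
function contextualSummaryPrompt(
  docTitle: string,
  docSummary: string,
  sectionTitle: string,
  sectionText: string,
): string {
  return [
    `Here is a document titled "${docTitle}": ${docSummary}`,
    `The section below, "${sectionTitle}", is part of that document.`,
    `Summarize the section for retrieval: name key entities, concepts,`,
    `and scope boundaries in 1-3 sentences.`,
    ``,
    sectionText,
  ].join("\n");
}
```

Because branch summaries are synthesized from child summaries, the same document context threads through every level of the tree.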
Parse → validate → re-parse pipeline for problem areas. After initial extraction, check for content gaps, missing sections, and overlapping boundaries. Re-run with more specific prompts on detected failures. Inspired by Reducto's "Agentic OCR" approach.
- Effort: High
- Impact: 🔥🔥
- Phase: Future
After LLM generates a summary, validate automatically: check length (50-500 chars), verify entity coverage (≥60% of top noun phrases appear), reject vague language ("various", "several aspects"). Auto-regenerate summaries that fail. Ensures consistently high quality across the tree.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Future
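A sketch of the validation pass. For brevity it approximates noun-phrase coverage with capitalized-token overlap; the real check would use proper noun-phrase extraction:

```typescript
// Return the list of failed checks; empty array means the summary passes.
function validateSummary(summary: string, sectionText: string): string[] {
  const failures: string[] = [];
  // Length check: 50-500 chars.
  if (summary.length < 50 || summary.length > 500) failures.push("length");
  // Vague-language check (phrase list illustrative).
  const vague = ["various", "several aspects"];
  if (vague.some((v) => summary.toLowerCase().includes(v))) failures.push("vague");
  // Entity coverage: >=60% of capitalized tokens must reappear
  // (stand-in for the noun-phrase check described above).
  const entities = [...new Set(sectionText.match(/\b[A-Z][a-z]{2,}\b/g) ?? [])];
  if (entities.length > 0) {
    const covered = entities.filter((e) => summary.includes(e)).length;
    if (covered / entities.length < 0.6) failures.push("entity-coverage");
  }
  return failures;
}
```

Summaries with a non-empty failure list would be regenerated with the failures fed back into the prompt.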
Expose Vectorless as an MCP (Model Context Protocol) tool server so Claude Desktop, Cursor, Windsurf, and other MCP-compatible clients can use tree-aware document retrieval natively. Position Vectorless as a first-class tool in the agentic ecosystem.
- Status: In progress — local stdio + remote HTTP + OAuth code written, blocked on migration + consent screen + middleware wiring. See `docs/ISSUES.md`.
- Effort: Medium
- Impact: 🔥🔥🔥
- Phase: Launch-blocking (v1.0)
- Implementation plan: `docs/MCP_AND_AUTH_PLAN.md`
- Open blockers: `docs/ISSUES.md` (#1 migration, #2 consent screen, #5 stateless sessions, #6 adapter hardening)
- Shipped:
  - `packages/mcp-tools/` — shared tool definitions, schemas, handlers
  - `apps/mcp-stdio/` — local stdio server, publishes as `vectorless-mcp` on npm
  - `apps/api/src/routes/mcp.ts` + `services/mcp-server.ts` — remote HTTP MCP via Streamable HTTP transport
  - `apps/api/src/routes/oauth.ts` — full OAuth 2.1 provider (DCR, authorize, token, revoke, introspect, well-known metadata)
  - `apps/api/src/middleware/auth.ts` — unified bearer API key + OAuth JWT middleware
  - `oauth_clients`, `oauth_authorization_codes`, `oauth_refresh_tokens`, `oauth_consents`, `oauth_revoked_jtis` tables defined in `schema.ts`
Server-Sent Events for real-time ingestion progress: "parsing... building tree... summarizing level 3... summarizing level 2... storing sections... ready". Replaces polling with push, enables responsive UIs showing recursive summarization progress.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Future
`client.reprocessDocument(docId, { tocStrategy: 'hybrid' })` re-runs ingestion with different parameters without re-uploading the file. Lets users upgrade documents when new parsing capabilities ship, or experiment with different strategies.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Next
Pre-built integrations for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, and AutoGen. Each integration provides idiomatic adapters (e.g., a LangChain Retriever wrapping client.query(), a Vercel AI SDK tool definition) so developers plug Vectorless into their existing AI stack.
- Effort: High
- Impact: 🔥🔥🔥
- Phase: Future
`POST /v1/documents/:id/query/stream` returns Server-Sent Events for each agent traversal step in real time. The frontend can show the agent "thinking" — expanding branches, evaluating summaries, retrieving content — as it happens.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Future
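On the client side, the stream can be handled frame by frame. A minimal SSE frame parser (the `step` event name and payload shape are illustrative):

```typescript
// Parse one or more Server-Sent Events frames ("event:"/"data:" lines
// separated by blank lines) into typed events with JSON payloads.
function parseSseChunk(chunk: string): { event: string; data: unknown }[] {
  return chunk
    .split("\n\n")
    .filter((frame) => frame.trim().length > 0)
    .map((frame) => {
      let event = "message"; // SSE default event name
      let data = "";
      for (const line of frame.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data += line.slice(5).trim();
      }
      return { event, data: JSON.parse(data) };
    });
}
```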
The agentic retrieval engine already produces `TraversalStep[]` traces recording every tool call, argument, result, and reasoning. Surface these in the dashboard as a visual tree explorer — show which branches the agent explored, which it skipped, and the exact path to each retrieved section.
- Effort: Medium
- Impact: 🔥🔥
- Phase: Next
Tiered rate limits: free at 10 req/s, pro at 100 req/s, enterprise custom. Protects platform stability and creates upgrade paths. Separate limits for ingest (heavy) vs query (lighter) vs tree navigation (lightest).
- Status: In progress — plans config, middleware, and quota enforcement written but NOT wired to any route and still using the in-memory store. See `docs/ISSUES.md` #3 and #4.
- Effort: Low
- Impact: 🔥🔥 (hard prerequisite for opening the platform to the public)
- Phase: Launch-blocking (v1.0)
- Implementation plan: `docs/MCP_AND_AUTH_PLAN.md` §6
- Shipped:
  - `apps/api/src/config/plans.ts` — Free / Pro / Enterprise limits matching the plan
  - `apps/api/src/middleware/rate-limit.ts` — `rateLimitFor(kind)` factory (in-memory fallback)
  - `apps/api/src/middleware/quota.ts` — `quotaCheck(kind)` + `recordUsage()`
  - `user_plans` + `usage_records` tables defined in `schema.ts`
- Remaining:
- Generate + apply migration for the two new tables (Blocker #1)
- Wire `rateLimitFor` + `quotaCheck` to `query.ts`, `documents.ts`, `toc.ts`, `sections.ts`, `mcp.ts` (Blocker #3)
- Swap the in-memory store for `@upstash/ratelimit` (Blocker #4)
- Record usage after successful queries + ingests
- Build `dashboard/usage/page.tsx` (stub exists, needs a real chart)
Full project-level isolation with separate API keys, documents, and usage tracking per project. Lets agencies and SaaS platforms manage multiple customers from a single account.
- Effort: High
- Impact: 🔥🔥
- Phase: Future
Support re-uploading new versions of a document while keeping old sections accessible and tracking version history. The tree structure makes diff-aware updates possible — only re-summarize branches that changed.
- Effort: High
- Impact: 🔥🔥
- Phase: Future
Cache tree ToC manifests at the CDN edge using Cloudflare Workers. Since tree ToCs rarely change after ingestion, they can be cached indefinitely with cache-busting on re-processing. Eliminates a database round-trip on every tree navigation request.
- Effort: Low
- Impact: 🔥🔥
- Phase: Next
| Phase | Focus | Items |
|---|---|---|
| Implemented | Architectural core | 1, 2, 3, 4, 5, 6, 7, 9 |
| Launch-blocking (v1.0) | MCP + auth + guardrails | 19, 25 (both code-complete, wiring/migration/consent page outstanding — see docs/ISSUES.md) |
| Next | Retrieval depth + platform | 8, 10, 11, 12, 13, 14, 16, 21, 24, 28 |
| Future | Advanced quality + ecosystem | 15, 17, 18, 20, 22, 23, 26, 27 |
See docs/ISSUES.md for the full audit. Ten blockers identified, ~2 focused days to clear:
- 🔴 Missing Drizzle migration for 7 new tables (`oauth_*`, `user_plans`, `usage_records`)
- 🔴 OAuth consent screen page does not exist in the dashboard
- 🔴 Rate-limit / quota middleware is written but not wired to any route
- 🟡 Rate-limit store is in-memory, not Upstash — breaks on serverless
- 🟡 MCP sessions kept in process memory — serverless fragility
- 🟡 Hand-rolled Node req/res mock in `handleMcpRequest` — needs tests
- 🟡 DCR endpoint has no abuse protection
- 🟢 No cleanup cron for `oauth_revoked_jtis`
- 🟢 Missing env vars in `DEPLOYMENT.md`
- 🟡 No integration tests for MCP / OAuth / rate-limit