
vectorless-engine — roadmap

Living document. Tick boxes as things land. Edit freely; this is the source of truth for "what's done, what's next."

Legend: [x] done · [~] in progress · [ ] not started · [?] idea, not committed · (opt) optional polish


Phase 0 — scaffold (shipped)

One-line: get a single Go binary that builds, boots, and serves an HTTP surface.

  • Go module (github.com/hallelx2/vectorless-engine, Go 1.25+)
  • cmd/engine entry point with graceful shutdown (signal.NotifyContext, 15s drain)
  • Structured logging (slog, JSON + console handlers)
  • config package — YAML + VLE_* env overrides + Validate()
  • HTTP layer (chi router, RequestID / RealIP / Recoverer middleware)
  • Pluggable interfaces: storage.Storage, queue.Queue, llm.Client, retrieval.Strategy
  • Driver stubs for: local / S3 storage · QStash / River / Asynq queue · Anthropic / OpenAI / Gemini LLM
  • tree package — core Tree / Section / View model
  • Dockerfile (multi-stage, distroless) + docker-compose.yml (Postgres / Redis / MinIO)
  • Apache 2.0 license, README with badges and SVG diagrams
  • GitHub repo created, main pushed, topics added

Phase 1 — ingest (shipped)

One-line: raw bytes → queryable, persisted tree.

  • Database layer

    • pgxpool wrapper in internal/db with Open() + ping
    • Embedded SQL migrations + auto-apply at boot (schema_migrations tracked)
    • Schema: documents (lifecycle: pending → parsing → summarizing → ready | failed) + sections (self-referential tree)
    • CRUD helpers: NewDocument, GetDocument, SetDocumentStatus, SetDocumentTitle, DeleteDocument, UpsertSection, UpdateSectionSummary, GetSection, ListSections, LoadTree, ListDocuments
    • (opt) sqlc migration — queries are hand-written right now; revisit once schema stabilizes
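The auto-apply step boils down to "embedded files minus schema_migrations rows, in version order". A minimal sketch of that diff (file names and the applied-set source are hypothetical; the real code reads an embed.FS and SELECTs from schema_migrations):

```go
package main

import "sort"

// pendingMigrations returns the embedded migration files not yet recorded
// in schema_migrations, in lexical (i.e. version-prefix) order.
func pendingMigrations(all []string, applied map[string]bool) []string {
	var out []string
	for _, name := range all {
		if !applied[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}
```
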
  • Parser subsystem

    • Parser interface + Registry that routes by content-type / extension
    • Markdown (goldmark, ATX+Setext headings → level-stack hierarchy)
    • HTML (golang.org/x/net/html, prefers <main>/<article>, strips chrome)
    • DOCX (stdlib archive/zip + encoding/xml, detects Heading 1…9 + Title styles — both Heading2 and Heading 2 spellings)
    • PDF (ledongthuc/pdf, pure Go no cgo)
      • Font-size heuristic for unstructured PDFs
      • /Outlines ground truth when bookmarks exist, with text-matching fallback (< 50% match ⇒ fall back)
      • (opt) OCR for scanned PDFs (Tesseract via shell-out, or LLM vision call)
      • (opt) Encrypted PDF support via NewReaderEncrypted
    • Plain Text single-section fallback
    • Table-driven smoke tests for all five (DOCX test assembles .docx in-memory — no binary fixtures)
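The Registry's routing logic is roughly the following (a sketch, assuming a simplified Parser interface — the real internal/parser signatures return a *tree.Tree and differ in detail): MIME type wins, extension is the tiebreaker, plain text is the floor.

```go
package main

import (
	"path/filepath"
	"strings"
)

// Parser is a stand-in for the real interface, which returns a *tree.Tree.
type Parser interface {
	Parse(data []byte) error
}

type stubParser struct{ name string }

func (s stubParser) Parse([]byte) error { return nil }

// Registry routes by MIME type first, then by file extension, and falls
// back to the plain-text parser so ingest never rejects a document outright.
type Registry struct {
	byMIME   map[string]Parser
	byExt    map[string]Parser
	fallback Parser
}

func (r *Registry) Lookup(contentType, filename string) Parser {
	// Strip any ";charset=..." suffix before matching.
	if mt := strings.TrimSpace(strings.Split(contentType, ";")[0]); mt != "" {
		if p, ok := r.byMIME[mt]; ok {
			return p
		}
	}
	if p, ok := r.byExt[strings.ToLower(filepath.Ext(filename))]; ok {
		return p
	}
	return r.fallback
}
```
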
  • Ingest pipeline

    • Pipeline orchestrates parse → persist tree → summarize (all stages idempotent)
    • Registered against queue.KindIngestDocument
    • Section content lands in object storage (storage.Storage); DB holds only outline + summaries
    • Summarizer calls the llm.Client with a terse one-sentence prompt
    • Graceful degradation when LLM is stubbed — falls back to truncated excerpt so ingest completes end-to-end in dev
    • (opt) Parallel summarization via errgroup + semaphore (today it's sequential)
    • (opt) Retry budget per section; surface summary errors on the document row
  • HTTP API (ingest side)

    • POST /v1/documents — multipart or JSON body, stores bytes, enqueues job, returns 202
    • GET /v1/documents — keyset pagination (?limit, ?cursor, ?status)
    • GET /v1/documents/{id} — metadata + status
    • GET /v1/documents/{id}/tree — compact View
    • GET /v1/sections/{id} — metadata + full content
    • DELETE /v1/documents/{id} — cascades to sections
    • (opt) GET /v1/documents/{id}/source — stream the original bytes back
    • (opt) Presigned URL passthrough when storage.SignedURL is supported
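The keyset ?cursor is easiest to keep opaque by base64-encoding the last row's sort key; the next page is then `WHERE id > $cursor ORDER BY id LIMIT $limit`. A sketch of the encoding half (the exact cursor format is an engine implementation detail and may differ):

```go
package main

import "encoding/base64"

// encodeCursor wraps the last returned ID so clients treat it as opaque.
func encodeCursor(lastID string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(lastID))
}

// decodeCursor recovers the ID to resume from; errors mean a bad ?cursor.
func decodeCursor(cursor string) (string, error) {
	b, err := base64.RawURLEncoding.DecodeString(cursor)
	return string(b), err
}
```
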
  • TLS

    • Plaintext by default (the recommended production setup terminates TLS at a reverse proxy)
    • Opt-in direct TLS via server.tls.{cert_file,key_file,min_version} + VLE_TLS_* env overrides
    • (opt) Autocert / Let's Encrypt integration for single-node deployments
  • Dev ergonomics

    • docker-compose with Postgres, Redis, MinIO
    • engine service gated behind profiles: ["engine"] for one-command containerised stack
    • .gitignore tightened so cmd/engine/main.go stops being ignored

Phase 2 — retrieval (shipped, minus benchmarks)

One-line: turn POST /v1/query from a 501 into the feature the engine exists for.

  • Live LLM clients — extracted to llmgate

    • Anthropic, OpenAI, Gemini all live via langchaingo under a shared llmgate.Client
    • Provider switching is pure config (llm.driver: anthropic | openai | gemini)
    • Retry with exponential backoff + jitter on 429 / 5xx (llmgate/middleware/retry)
    • Cost tracking per call (Usage.CostUSD populated from a static price table)
    • Error classification shared across providers (llmgate.Classify)
    • Real CountTokens via provider endpoints — currently heuristic in llmgate; tracked in the llmgate roadmap, not a blocker here
    • Streaming responses (SSE) — deferred to Phase 4
  • Retrieval strategies

    • SinglePass — real implementation
      • Build prompt from tree.View (titles + summaries + IDs, depth-aware indentation)
      • Request structured output (JSON list of section IDs + reasoning) — JSON-mode via prompt nudge + schema
      • Validate returned IDs against the tree; drop unknown ones (FilterKnownIDs)
      • Tolerate code fences / leading prose in model output (ParseSelection)
    • ChunkedTree — real implementation of the parallel map-reduce design
      • Splitter that slices the tree view into budget-sized chunks with breadcrumb + sibling summaries (structure-aware bin-packing, recurses into oversized subtrees)
      • errgroup + semaphore bounded by MaxParallelCalls (already in scaffold)
      • Merge policies: Union default (dedupe + sorted)
      • (opt) TopN(ranked), Vote(k-of-n) merges
      • Fall back to single slice when the tree fits the budget
      • Filter IDs per-slice so the model can't fabricate IDs from other slices
    • Unit tests with a mock llm.Client that returns canned IDs
      • Happy-path selection, unknown-ID filtering, code-fence tolerance, multi-slice split, ID-fabrication guard, splitter fast path
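The two defensive steps — tolerating fenced/prefixed output, then dropping fabricated IDs — can be sketched as below. Function names mirror ParseSelection/FilterKnownIDs but the real signatures also carry reasoning fields:

```go
package main

import (
	"encoding/json"
	"errors"
	"strings"
)

// parseSelection extracts a JSON array of section IDs from model output
// that may be wrapped in code fences or preceded by prose. Naive slicing
// on the outermost brackets; good enough for a sketch.
func parseSelection(raw string) ([]string, error) {
	start := strings.Index(raw, "[")
	end := strings.LastIndex(raw, "]")
	if start < 0 || end < start {
		return nil, errors.New("no JSON array in model output")
	}
	var ids []string
	if err := json.Unmarshal([]byte(raw[start:end+1]), &ids); err != nil {
		return nil, err
	}
	return ids, nil
}

// filterKnownIDs drops IDs the tree (or the current slice) doesn't
// contain, so the model can't fabricate or leak cross-slice sections.
func filterKnownIDs(ids []string, known map[string]bool) []string {
	out := ids[:0]
	for _, id := range ids {
		if known[id] {
			out = append(out, id)
		}
	}
	return out
}
```
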
  • POST /v1/query handler

    • Parse body { document_id, query, model?, max_tokens?, reserved_for_prompt?, max_parallel_calls?, max_sections? }
    • Load tree via db.LoadTree
    • Run the configured retrieval.Strategy
    • Fetch picked sections' content from storage
    • Return { sections: [...], strategy, model, elapsed_ms }
    • (opt) Include tokens_in / tokens_out in response (Response struct already tracks them — just needs plumbing)
    • (opt) SSE streaming variant for progressively revealing sections as they're picked
  • Benchmarks vs. traditional RAG

    • Pick a corpus (e.g. 50 technical docs + hand-written QA pairs)
    • Baseline: pgvector + OpenAI embeddings + top-K=5
    • Metrics: precision@k, recall, citation correctness, $ per query, p50/p95 latency
    • Publish in benchmarks/README.md
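The two headline metrics are simple set overlaps; a pure-function sketch that a future benchmarks/ harness could reuse (names are placeholders):

```go
package main

// precisionAtK: fraction of the top-k retrieved section IDs that are relevant.
func precisionAtK(retrieved, relevant []string, k int) float64 {
	if k > len(retrieved) {
		k = len(retrieved)
	}
	if k == 0 {
		return 0
	}
	rel := make(map[string]bool, len(relevant))
	for _, id := range relevant {
		rel[id] = true
	}
	hits := 0
	for _, id := range retrieved[:k] {
		if rel[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// recall: fraction of relevant section IDs that were retrieved at all.
func recall(retrieved, relevant []string) float64 {
	if len(relevant) == 0 {
		return 0
	}
	got := make(map[string]bool, len(retrieved))
	for _, id := range retrieved {
		got[id] = true
	}
	hits := 0
	for _, id := range relevant {
		if got[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}
```
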

Phase 3 — ecosystem (soon)

One-line: the engine becomes useful beyond a single go run on a laptop.

  • Queue drivers — flesh out the stubs

    • River live (Postgres-backed, uses the same DB as the data plane)
    • Asynq live (Redis-backed, higher throughput path)
    • QStash webhook signature verification in handleQueueWebhook
    • Dead-letter surface (document row records last error + retry count)
  • Storage drivers

    • S3-compatible live (AWS S3, Cloudflare R2, MinIO, Backblaze B2, DigitalOcean Spaces)
    • SignedURL for providers that support it
    • (opt) GCS driver
    • (opt) Azure Blob driver
  • SDKs (separate repos)

    • @vectorless/sdk-ts — TypeScript, targets node + edge runtimes
    • vectorless Python package — targets 3.10+
    • github.com/hallelx2/vectorless-go — Go client
    • OpenAPI 3 spec generated from route handlers, SDKs generated from it
  • Packaging / deploy

    • GitHub Actions: build + test + lint matrix
    • GHCR image publish on tag (:latest, :vX.Y.Z, :sha-<short>)
    • Release binaries via goreleaser (linux/darwin/windows × amd64/arm64)
    • Helm chart (charts/vectorless-engine)
    • Terraform module (terraform/) for one-click cloud deploys
    • systemd unit file for bare-metal installs

Phase 4 — scale (later)

One-line: push the engine past the "one doc, one query" comfort zone.

  • Multi-document queries — reason across N trees in one call, merge across docs
  • Streaming answers — SSE on /v1/query, tokens as they come
  • Tree caching — cache the View prompt per document+model so repeated queries skip rebuilding
  • Tree compaction — merge adjacent leaf sections with tiny token counts for more efficient reasoning
  • Incremental re-ingest — detect changed sections in a re-uploaded doc, keep stable section IDs for unchanged ones
  • Access control — per-document ACLs + API key scoping (the control-plane's job, but engine needs hooks)
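For the tree-caching idea, the cache key is (document, model) since the View prompt depends on both. A minimal in-process sketch — invalidation on re-ingest is the hard part and is not shown here:

```go
package main

import "sync"

// promptCache memoises the rendered View prompt per (documentID, model).
// Hypothetical names; a real version needs invalidation hooks from ingest.
type promptCache struct {
	mu sync.RWMutex
	m  map[[2]string]string
}

func newPromptCache() *promptCache {
	return &promptCache{m: make(map[[2]string]string)}
}

func (c *promptCache) Get(docID, model string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[[2]string{docID, model}]
	return v, ok
}

func (c *promptCache) Put(docID, model, prompt string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[[2]string{docID, model}] = prompt
}
```
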

Cross-cutting — always on

Observability

  • OpenTelemetry tracing (HTTP + queue jobs + LLM calls)
  • Prometheus metrics endpoint (/metrics): request counters, queue depth, ingest latency, LLM token usage, error rates
  • Structured error wrapping everywhere (sentinel errors + errors.Is)

Security

  • API key auth middleware (pluggable; the control-plane supplies keys)
  • Rate limiting per key
  • Request size limits (already 32MB on multipart; review)
  • SBOM generation + supply-chain signing (cosign on images)

Developer docs

  • docs/API.md — full OpenAPI-driven reference
  • docs/CONTRIBUTING.md — conventions, commit style, local dev loop
  • docs/ADR/ — architecture decision records as we go
  • docs/BENCHMARKS.md — live numbers, updated per release

Testing

  • Unit test coverage ≥ 70% on internal/retrieval, internal/ingest, internal/db, internal/parser
  • Integration test suite that spins docker-compose and runs end-to-end ingest → query
  • Fuzz tests on parsers (malformed markdown, malformed HTML, truncated PDFs)
  • Load test harness with k6 or vegeta scripts

Known issues / deferred

  • Windows CRLF handling in git — benign warnings on every git add
  • PDF parser doesn't handle scanned (image-only) PDFs — needs OCR
  • DOCX parser loses inline formatting (bold/italic/links) — plain text only for now
  • Summarizer is sequential; large trees (> 100 sections) take too long to summarize
  • handleQueueWebhook is a no-op stub; needed when queue.driver=qstash

How to use this doc

  • Before starting a task: flip its box to [~] in a tiny commit so collaborators see it's claimed.
  • On merge: flip to [x] in the same PR that delivers the work.
  • New ideas: drop them under the right phase with [?] — it means "plausible, not committed yet."
  • Removals: if a task turns out not to make sense, delete it rather than leaving a zombie checkbox.