The internal data room the 42nights team runs on — drop in customer contracts, call transcripts, Slack exports, and procurement playbooks, then ask questions and get answers grounded only in the corpus, with verbatim citations back to the source.
The deployed app runs on Convex — documents, chunks, embeddings, threads, and messages all live there, and search is a Convex vector query. The only other outbound calls are OpenAI (embeddings) and Anthropic (generation). A local-only build (LOCAL_MODE) also exists for zero-cloud-touch runs; it monkey-patches fetch to block every non-localhost request.
- Hand-rolled RAG. No `@convex-dev/rag`, no Pinecone, no LanceDB. Chunks and their embeddings are stored in Convex; retrieval is a Convex vector search (`by_embedding`), cosine similarity since every embedding is unit-normalized.
- OpenAI embeddings (`text-embedding-3-small`, 1536d) + Anthropic Claude Opus 4.7 for generation with tool-use schema.
- Citation validator: every quote the model returns must appear verbatim in the chunk it cites. Invalid citations get dropped on the floor. If a fact-asserting answer ends up with zero valid citations after a stricter retry, we override to "I couldn't ground that in the data room."
- Answerability gate in front of the LLM: if the top retrieval score is below threshold, return a canned "I don't have that" with zero tokens spent.
- Secret scrubber server-side: Anthropic / OpenAI / GitHub / AWS / Stripe / Slack / PEM / JWT patterns redacted before anything hits an embedding or generation API.
- Compliance checks: every doc is auto-classified for PHI / PII / PCI / financial / legal-confidential data via a regex pass (checksum-validated) plus a cached Claude Haiku semantic pass. `npm run compliance:audit` writes a full report. See Compliance checks.
- Parsers: PDF (`pdfjs-dist`), DOCX (`mammoth`), XLSX (`xlsx`), CSV (`papaparse`), JSON, MD, TXT.
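The retrieval and gate bullets above rest on one property: once every vector has unit length, a plain dot product equals cosine similarity. The sketch below illustrates that with an in-memory index. It is a minimal illustration, not the project's code: `normalize`, `dot`, `topK`, `passesGate`, and the `GATE_THRESHOLD` value are all assumptions, and in the deployed app a Convex vector query would take the place of `topK`.

```typescript
// Scale a raw embedding to unit length so dot product equals cosine similarity.
function normalize(v: ArrayLike<number>): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  const out = new Float32Array(v.length);
  if (norm === 0) return out; // a zero vector stays zero
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}

function dot(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

interface Hit {
  chunkId: string;
  score: number;
}

// Brute-force scan: score every chunk, keep the k best.
function topK(query: Float32Array, index: Map<string, Float32Array>, k: number): Hit[] {
  const hits: Hit[] = [];
  index.forEach((vec, chunkId) => hits.push({ chunkId, score: dot(query, vec) }));
  return hits.sort((x, y) => y.score - x.score).slice(0, k);
}

// Hypothetical threshold; the real value is not stated in this README.
const GATE_THRESHOLD = 0.3;

// Answerability gate: only call the LLM when the best hit is strong enough.
function passesGate(hits: Hit[], threshold: number = GATE_THRESHOLD): boolean {
  return hits.length > 0 && hits[0].score >= threshold;
}
```

When `passesGate` returns `false`, the caller would return the canned "I don't have that" reply without spending any generation tokens.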
- Next.js 16 (App Router, Turbopack), React 19, TypeScript strict, Tailwind 3
- shadcn-style primitives (button, dialog, dropdown, badge, input, textarea), `sonner` for toasts, `lucide-react` for icons
- Convex for storage + vector search (`convex` client, admin deploy key server-side)
- `@anthropic-ai/sdk` for generation, raw `fetch` for OpenAI embeddings
- `vitest` for tests (30 unit tests covering chunker, scrubber, citations, parsers, vector math, compliance detectors)
- Inter (body) + Crimson Pro (display) + JetBrains Mono (code)
nvm use # node 22+
npm install
cp .env.local.example .env.local # paste your OpenAI + Anthropic keys
npm run dev # next dev on :3000

The schema migrates itself on the first API request. To wipe everything: `rm -rf data/`.
npm test # 30 vitest cases
npm run typecheck # tsc --noEmit
npm run build # Turbopack production build
npm run reingest:stale # re-embed any docs whose prompt_version is out of date
npm run compliance:audit # full corpus compliance report → data/
npm run ingest:fixtures # ingest test_fixtures/ (used by local-mode setup)
npm run smoke # run the 8-query smoke test, write a JSON artifact

Browser ──fetch──▶ Next.js API routes ──▶ lib/ingest, lib/search
│
▼
better-sqlite3 ──▶ ./data/dataroom.db
│
▼
In-memory Map<chunkId, Float32Array>
(hydrated from chunks.embedding on first import)
│
▼
api.openai.com (embeddings) · api.anthropic.com (generation)
The only network calls this app makes are to those two endpoints. No analytics, no telemetry, no third-party DB.
Every numbered chip is a validated citation — click one for the verbatim quote, why-it's-relevant context, and the surrounding chunks pulled from disk.
The /how-it-works page documents what's actually happening between a question and an answer, including the answerability gate and the deterministic citation validator. Same tone as Lightyear's transparency page.
lib/
├─ db/ schema.sql · better-sqlite3 singleton · ensureSchema()
├─ parsers/ pdf · docx · xlsx · csv · json · plain · markdown
├─ chunker.ts markdown-aware splitter (tables stay whole, soft overlap)
├─ scrubber.ts secret patterns → [REDACTED:<kind>]
├─ compliance/ regex + LLM sensitive-data detection (PHI/PII/PCI/financial/legal)
│ ├─ regex.ts deterministic detectors (Luhn/ABA/IBAN checksums, context gating)
│ ├─ llm.ts Claude Haiku semantic pass, cached by content hash + prompt version
│ ├─ types.ts Finding / ComplianceLabel / Severity
│ └─ index.ts orchestration · document_compliance writes · summary · rescan
├─ citations.ts zod schema + Anthropic tool spec + validateCitations()
├─ prompts.ts versioned system prompt + stricter-retry nudge
├─ embeddings.ts OpenAI (cloud) or local (Ollama/LM Studio) batch embed + normalize()
├─ llm.ts callLLM dispatcher — Anthropic (cloud) or local OpenAI-compat
├─ local-mode.ts LOCAL_MODE fetch guard (blocks non-localhost) + key-conflict refusal
├─ vector.ts in-memory index · insert/delete · searchByVector · getChunkWithContext
├─ ingest.ts parse → scrub → chunk → embed → store (transactional)
├─ search.ts retrieve → gate → tool-use → validate → retry
├─ chat-db.ts thread + message CRUD
├─ anthropic.ts thin client wrapper
├─ audit.ts append-only audit_log
└─ api-client.ts browser-side fetch helpers
app/
├─ sources/ list, upload, detail
├─ chat/[threadId]/ threaded chat with citation chips
├─ how-it-works/ methodology page
├─ unlock/ optional passcode form (only used when APP_PASSCODE is set)
└─ api/
├─ upload/ POST multipart
├─ sources/[id]/chunks/[idx]/ citation preview
├─ chat/threads/[id]/messages/ message list
├─ chat/ send + answer
└─ admin/compliance/
├─ summary/ GET corpus-wide stats
├─ document/[id]/ GET full findings for one doc
├─ rescan/[id]/ POST force re-run one doc
└─ rescan-all/ POST full-corpus rescan → job id
| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | Embeddings via `text-embedding-3-small` (cloud mode) |
| `ANTHROPIC_API_KEY` | Generation via Claude Opus 4.7; compliance semantic pass via Claude Haiku 4.5. Unset = compliance runs regex-only. |
| `APP_PASSCODE` | Optional. If set, the whole app gates behind a single-field passcode at `/unlock`. Unset = bypass. |
| `LOCAL_MODE` | `true` = fully local, zero cloud touch (see below). Refuses to start if a cloud key is also set. |
| `LOCAL_LLM_URL` / `LOCAL_LLM_MODEL` | LM Studio OpenAI-compat endpoint + model id for generation. |
| `LOCAL_EMBED_URL` / `LOCAL_EMBED_MODEL` | Ollama `/api/embed` (or LM Studio `/v1/embeddings`) + 768d embedding model. |
LOCAL_MODE=true routes generation to LM Studio and embeddings to Ollama/LM Studio with no network calls to OpenAI or Anthropic — enforced by lib/local-mode.ts, which monkey-patches fetch to throw on any non-localhost request and refuses to start if a cloud key is present. Embedding dimension drops 1536→768, so switching modes requires a wipe + re-ingest. Full write-up + benchmarks: ../findings_local_port.md.
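The guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `lib/local-mode.ts`: the names `hostOf`, `assertLocalUrl`, `guardFetch`, and `assertNoCloudKeys` are hypothetical, the allowed host list is a guess, and the fetch signature is simplified to string URLs only.

```typescript
// Simplified fetch shape (string URLs only) so the sketch has no runtime dependencies.
type FetchLike = (input: string, init?: unknown) => Promise<unknown>;

// Assumed allowlist of loopback hosts.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Extract the host from an absolute URL, skipping any userinfo before "@".
function hostOf(url: string): string | null {
  const m = /^[a-z][a-z0-9+.-]*:\/\/(?:[^@\/?#]*@)?(\[[^\]]+\]|[^:\/?#]+)/i.exec(url);
  return m ? m[1].toLowerCase() : null;
}

// Throw unless the URL points at a loopback host.
function assertLocalUrl(url: string): void {
  const host = hostOf(url);
  if (host === null || !LOCAL_HOSTS.has(host)) {
    throw new Error(`LOCAL_MODE: blocked non-localhost request to ${url}`);
  }
}

// Wrap a fetch-like function so every call is checked before it goes out.
function guardFetch(inner: FetchLike): FetchLike {
  return async (input, init) => {
    assertLocalUrl(input);
    return inner(input, init);
  };
}

// Key-conflict refusal: LOCAL_MODE plus any cloud key is treated as a startup error.
function assertNoCloudKeys(env: Record<string, string | undefined>): void {
  if (env["LOCAL_MODE"] !== "true") return;
  const present = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"].filter((k) => !!env[k]);
  if (present.length > 0) {
    throw new Error(`LOCAL_MODE refuses to start with cloud keys set: ${present.join(", ")}`);
  }
}
```

In a real process the wrapper would be assigned over the global `fetch` once at startup, which is what "monkey-patches" refers to.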
cp .env.cloud.local .env.local # (created on first switch) restore cloud later
# edit .env.local: LOCAL_MODE=true, remove cloud keys, set LOCAL_* (see .env.local.example)
rm -rf data/dataroom.db data/files/ && npm run migrate
npm run ingest:fixtures # ingest test_fixtures/ at 768-dim
npm run smoke # 8-query smoke, writes data/smoke-<mode>-*.json

- Upload (`POST /api/upload`): content-hashes the bytes (SHA-256), dedupes against existing rows. Identical bytes return the existing `documentId` with `deduped: true` and skip every downstream step.
- Parse (`lib/parsers/`): per-mime dispatch to plain markdown. Tables in XLSX/CSV become real markdown tables.
- Scrub (`lib/scrubber.ts`): server-side regex sweep redacts API keys, PATs, AWS access keys, Stripe keys, Slack tokens, PEM private keys, JWTs. Logged but not blocked.
- Chunk (`lib/chunker.ts`): markdown-aware: splits on headings, keeps tables whole up to a hard cap, soft overlap between chunks so context survives boundaries. Defaults: 200/1200/100/4000.
- Compliance (`lib/compliance/`): regex pass + cached Claude Haiku semantic pass classify the doc into `phi` / `pii` / `pci` / `financial` / `legal-confidential` with `block` / `warn` severity. Writes one `document_compliance` row; chunks inherit the doc's labels. See Compliance checks.
- Embed (`lib/embeddings.ts`): batched against OpenAI (96 per request). Every vector unit-normalized on receive.
- Index (`lib/vector.ts`): inserted into both SQLite (`chunks.embedding` as a `BLOB`) and the in-memory `Map<chunkId, Float32Array>`. Status flips to `ready`.
- Query (`POST /api/chat`): embed the question → dot-product against the in-memory map → top-K → answerability gate. If the gate passes, build `<source id="Sn">` blocks (with one chunk of context on either side) and call Claude with tool-use forcing the answer into a shape with citations, confidence, follow-ups, and `answerable`.
- Validate: every cited quote must appear verbatim (whitespace-normalized) in the chunk it points at. Invalid citations dropped. Fact-asserting answer with zero validated citations → one stricter-retry → if still zero, force `answerable: false`.
- Persist: append assistant message with validated `citations`, `followups`, `confidence`, `prompt_version`. Audit row written for every ask.
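The Validate step in the list above is deterministic string matching, which makes it easy to illustrate. The sketch below borrows the name `validateCitations` from `lib/citations.ts`, but the body, the `Citation` shape, and the `nextStep` helper are assumptions made for illustration and may differ from the real module.

```typescript
// Assumed citation shape: which source block was cited and the quoted text.
interface Citation {
  sourceId: string;
  quote: string;
}

// Collapse runs of whitespace so line wraps in the chunk do not break matching.
function squashWhitespace(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// Keep only citations whose quote appears verbatim in the chunk they point at.
function validateCitations(citations: Citation[], chunkText: Map<string, string>): Citation[] {
  return citations.filter((c) => {
    const chunk = chunkText.get(c.sourceId);
    if (chunk === undefined) return false; // cited a source that was never sent
    const quote = squashWhitespace(c.quote);
    return quote.length > 0 && squashWhitespace(chunk).indexOf(quote) !== -1;
  });
}

type Outcome = "accept" | "retry" | "unanswerable";

// Retry policy from the list above: one stricter retry, then force answerable: false.
function nextStep(assertsFacts: boolean, validCount: number, alreadyRetried: boolean): Outcome {
  if (!assertsFacts || validCount > 0) return "accept";
  return alreadyRetried ? "unanswerable" : "retry";
}
```

Because the check is a substring test on normalized text, a paraphrased quote is rejected even when its meaning is right, which is the intended strictness.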
Every ingested document is automatically classified for sensitive data. This is detection and reporting — not HIPAA certification, access control, or encryption. It tells you which documents hold what classes of sensitive data so you can decide what to do.
Two passes, merged into one document_compliance row:
- Regex (`lib/compliance/regex.ts`): deterministic, runs every time. SSN (with grouping sanity check), credit cards (Luhn), CVV/MRN/ICD-10/DOB/phone/email/driver-license/passport, bank-account/routing (ABA checksum)/IBAN (mod-97), and attorney-confidential markers. Several detectors only fire near a context word (a 9-digit run is "routing" only next to routing/aba; a date is a DOB only near born/dob). Snippets are redacted; the raw value never lands in the report.
- LLM (`lib/compliance/llm.ts`): Claude Haiku reads the scrubbed markdown and catches what regex can't: named individuals in prose, contracts, internal commercial strategy. Schema-bound via tool-use, `temperature: 0`, consolidated to ≤3 findings per doc. Cached on `(content_hash, prompt_version, model)`, so an unchanged doc never re-pays for the call.
Labels stack per doc: phi · pii · pci · financial · secrets · legal-confidential · clean. Severity is block (real and dangerous — SSN, card, MRN, account number, live secret) or warn (worth surfacing, possibly a false positive). Secrets at block are already redacted upstream by the scrubber, so compliance sees [REDACTED:…] by the time it runs.
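To make "checksum-validated" and "context gating" concrete, here is a small sketch of two of the checks named above: the Luhn test for card numbers and the ABA routing checksum, plus a context-gated routing detector. The function names and the 40-character look-back window are assumptions for illustration; the real detectors in `lib/compliance/regex.ts` may be structured differently.

```typescript
// Luhn check: double every second digit from the right, sum, expect a multiple of 10.
function luhnValid(digits: string): boolean {
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// ABA routing checksum: weights 3, 7, 1 repeated across the nine digits.
function abaValid(digits: string): boolean {
  if (!/^\d{9}$/.test(digits)) return false;
  const w = [3, 7, 1, 3, 7, 1, 3, 7, 1];
  let sum = 0;
  for (let i = 0; i < 9; i++) sum += w[i] * (digits.charCodeAt(i) - 48);
  return sum % 10 === 0;
}

// Context gating: a 9-digit run only counts as a routing number when
// "routing" or "aba" appears shortly before it (window size is an assumption).
function findRoutingNumbers(text: string, windowChars: number = 40): string[] {
  const found: string[] = [];
  const re = /\b\d{9}\b/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const before = text.slice(Math.max(0, m.index - windowChars), m.index).toLowerCase();
    if (/\b(routing|aba)\b/.test(before) && abaValid(m[0])) found.push(m[0]);
  }
  return found;
}
```

Requiring both the context word and a passing checksum is what keeps an arbitrary 9-digit order id from being reported as a bank identifier.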
npm run compliance:audit # → data/compliance-report-<timestamp>.md

Walks the corpus, runs both passes (cached), then adds an adversarial pass where Claude Opus answers "if a stranger got this doc, what's the worst that happens, and who should it be restricted to?", then writes a markdown report: summary table, per-doc findings with redacted snippets, the adversarial narrative, and a recommendation (redact / restrict / keep). This is what you hand to a customer to show the corpus's compliance posture.
GET /api/admin/compliance/summary corpus stats (by severity, by label, top offenders)
GET /api/admin/compliance/document/:id full findings for one doc
POST /api/admin/compliance/rescan/:id re-run one doc
POST /api/admin/compliance/rescan-all re-run the corpus → { job_id, scanned, summary }
- `block` → there's a real identifier in the doc. Open it, redact the span, re-upload. Re-ingest with identical bytes is a no-op; edit the file first.
- `warn` → review and decide. Most `warn` items (a name, a vendor price) are fine to keep; the label is what the texting agent's access tiers will key off (see `texting_agent_access_tiers.md`), not a reason to delete.
- Reviewed-and-accepted is a manual step for v0: note it in the report. There is no "mark as reviewed" state in the DB yet.
- Bumping `COMPLIANCE_PROMPT_VERSION` (in `lib/compliance/llm.ts`) changes the LLM cache key, so entries cached under the old version no longer match and the next ingest/rescan re-runs the semantic pass. Use this after editing the prompt.
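A small sketch shows why bumping the prompt version forces a re-run: the version is part of the cache key, so old entries simply stop matching. The `SemanticCache` class, the `cacheKey` helper, and the example version and model strings are hypothetical; the README only states that the cache is keyed on `(content_hash, prompt_version, model)`.

```typescript
// The three components the README names as the cache key.
interface CacheKeyParts {
  contentHash: string;
  promptVersion: string;
  model: string;
}

// Join the parts into one lookup key; the separator is an arbitrary choice.
function cacheKey(p: CacheKeyParts): string {
  return [p.contentHash, p.promptVersion, p.model].join("|");
}

// Minimal in-memory cache; a real one would live in the database.
class SemanticCache<T> {
  private store = new Map<string, T>();

  get(p: CacheKeyParts): T | undefined {
    return this.store.get(cacheKey(p));
  }

  set(p: CacheKeyParts, value: T): void {
    this.store.set(cacheKey(p), value);
  }
}
```

Nothing is deleted on a version bump in this sketch; stale entries are just unreachable, which is why an unchanged doc under an unchanged version never pays for a second call.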
A Chrome extension that captures what you read and folds it into this same data
room as a new ingestion source. Captures land as documents rows tagged
source='web_brain', chunked + embedded into the same chunks table, so the
brain rides on the existing retrieval, citation, and compliance plumbing.
Tiered memory. A capture starts at tier='realtime'. Cron jobs consolidate
upward — realtime → daily → weekly → monthly — each tier an LLM-compressed,
denser summary. A consolidated capture is marked `consolidated_into` and ranked
lower; its summary becomes the primary search target, while the original capture
stays available as the citation.
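The tier ladder and the "ranked lower" rule can be pictured with a few lines of code. This is only a sketch: `nextTier`, `rankScore`, the `Capture` shape, and the 0.5 penalty factor are assumptions, since the README does not say how much a consolidated capture is demoted.

```typescript
// Tier order as described above: realtime, daily, weekly, monthly.
const TIERS = ["realtime", "daily", "weekly", "monthly"] as const;
type Tier = (typeof TIERS)[number];

// The tier a capture consolidates into next, or null at the top of the ladder.
function nextTier(tier: Tier): Tier | null {
  const i = TIERS.indexOf(tier);
  return i < TIERS.length - 1 ? TIERS[i + 1] : null;
}

// Assumed capture shape; consolidatedInto holds the id of the summary, if any.
interface Capture {
  id: string;
  tier: Tier;
  consolidatedInto: string | null;
}

// Demote captures that already have a summary so the summary ranks first.
// The penalty factor is a hypothetical value.
function rankScore(similarity: number, capture: Capture, penalty: number = 0.5): number {
  return capture.consolidatedInto === null ? similarity : similarity * penalty;
}
```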
Cloud vs local. Same INFERENCE_MODE split as the rest of the app. Cloud:
the extension sends scrubbed page text, the backend compresses with Claude. Local
(`LOCAL_MODE=true`): the extension compresses in-browser with WebLLM and only the
digest leaves the device. Either way a server-side scrubber
(`lib/webbrain/scrubber.ts`) redacts secrets + PII and drops any capture with
more than 5 redactions before it touches the LLM or storage.
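The scrub-then-drop rule above can be sketched as a single pass that both redacts and counts. The three patterns below are illustrative stand-ins, not the project's real pattern list, and `scrubCapture` is a hypothetical name; only the `[REDACTED:<kind>]` marker format and the more-than-5 threshold come from this README.

```typescript
// Illustrative patterns only; the real scrubber covers more kinds.
const PATTERNS: { kind: string; re: RegExp }[] = [
  { kind: "aws-access-key", re: /\bAKIA[0-9A-Z]{16}\b/g },
  { kind: "github-pat", re: /\bghp_[A-Za-z0-9]{36}\b/g },
  { kind: "email", re: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g },
];

// A capture with more than this many redactions is discarded entirely.
const MAX_REDACTIONS = 5;

interface ScrubResult {
  text: string;
  redactions: number;
  dropped: boolean;
}

// Replace every match with a typed marker and count how many were replaced.
function scrubCapture(text: string): ScrubResult {
  let redactions = 0;
  let out = text;
  for (const { kind, re } of PATTERNS) {
    out = out.replace(re, () => {
      redactions++;
      return `[REDACTED:${kind}]`;
    });
  }
  return { text: out, redactions, dropped: redactions > MAX_REDACTIONS };
}
```

A caller would check `dropped` first and skip compression, embedding, and storage when it is `true`.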
Endpoints (extension calls use Authorization: Bearer <token>):
| Route | Purpose |
|---|---|
| `POST /api/extension/connect` | Mint an extension token (passcode-gated when `APP_PASSCODE` set) |
| `POST /api/webbrain/capture` | Ingest one capture → scrub → compress → embed → chunk |
| `GET /api/webbrain/search` | Tier-aware vector search over captures |
| `POST /api/webbrain/chat` | RAG chat over the brain (cites url + date + tier) |
| `POST /api/webbrain/consolidate` | Cron-fired tier consolidation (`?tier=daily`, `CRON_SECRET`) |
| `POST /api/brain/chat` | Dashboard `/brain` chat (passcode-gated, no token) |
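For orientation, a bearer-authenticated call to the capture route might be assembled like this. The route and the `Authorization: Bearer <token>` header come from the table above; the `CapturePayload` fields and the `buildCaptureRequest` helper are hypothetical, since the request body schema is not documented here.

```typescript
// Hypothetical request body; the real capture schema may differ.
interface CapturePayload {
  url: string;
  title: string;
  text: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the request without sending it, so it can be inspected or tested.
function buildCaptureRequest(baseUrl: string, token: string, payload: CapturePayload): PreparedRequest {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/api/webbrain/capture",
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}
```

The result can be handed to `fetch(req.url, req)`, because `method`, `headers`, and `body` match the fields `fetch` expects in its options object.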
Quick start.
# 1. backend — migration runs automatically on first request; or:
npm run migrate
npm run dev
# 2. extension — build + load unpacked
cd extensions/web-brain && npm install && npm run build
# chrome://extensions → Developer mode → Load unpacked → extensions/web-brain/dist
# (local/privacy mode also needs: npm install @mlc-ai/web-llm)
# 3. connect — open http://localhost:3000/extension/connect, copy the token,
# paste it into the extension's Settings tab.
# 4. consolidation — Vercel crons (vercel.json) hit /api/webbrain/consolidate,
# or run it by hand / backfill history:
npm run webbrain:consolidate # daily+weekly+monthly, trailing windows
npm run webbrain:consolidate all --backfill

Read the full brain over all your reading at `/brain` in the dashboard.
- No streaming: answers complete in a single tool-use call (4–10s).
- Web Brain is single-tenant (no `user_id`): one passcode, one logical owner. Multi-user isolation is a `user_id` column away.
- Web Brain consolidation cadence is UTC (`vercel.json`), not per-user-local; the spec's "user-local" needs a per-user timezone column.
- Pure vector search. FTS5 hybrid is sketched in `lib/db/schema.sql` and is a ~30-line follow-up.
- PDFs with complex tables can lose structure under `pdfjs-dist`. A LlamaParse fallback is sketched in the original spec.
- Designed for ≤ 20 files. The in-memory index grows linearly with chunks; past ~10k chunks, swap to `sqlite-vec` or HNSW (~100-line change).
- No GitHub OAuth or member allowlist. Local-only assumes the person at the keyboard is authorized.
test/
├─ chunker.test.ts 4 tests — bounds, tables, overlap, empty input
├─ scrubber.test.ts 5 tests — Anthropic/GitHub/AWS/PEM patterns + clean text
├─ citations.test.ts 5 tests — quote-in-chunk validation + fact-detection
├─ parsers.test.ts 4 tests — CSV/JSON/MD/plain text round-trips
├─ vector.test.ts 2 tests — dot-product ordering + delete-syncs-index
└─ compliance.test.ts 10 tests — fixture detections, checksum validators, redacted snippets
fixtures/ seeded PHI/PII/clean files the detectors run against
npm test runs all 30 in under a second. The one LLM-pass test self-skips unless ANTHROPIC_API_KEY is set.
Dataroom ships as a Castle template. One click in the Castle dashboard provisions a Vercel project, a fresh data/dataroom.db, and a .env prefilled with the tenant and Castle vars. What you need to know:
Set these in your Vercel project env (or equivalent):
DATAROOM_TENANT_SLUG=acme # activates tenant mode
DATAROOM_TENANT_DISPLAY_NAME=Acme # shown in app title / footer
DATAROOM_TENANT_PUBLIC_URL=https://dataroom.acme.com
DATAROOM_LOGO_URL=https://cdn.acme.com/logo.png # optional favicon
DATAROOM_PRIMARY_COLOR=#0066ff # optional brand color
CASTLE_API_URL=https://api.castle.42nights.com
CASTLE_DEPLOYMENT_ID=<id from Castle dashboard>
CASTLE_WEBHOOK_SECRET=<secret from Castle dashboard>
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
APP_PASSCODE=<strong passcode for the tenant>
All tenant vars are optional individually. Setting DATAROOM_TENANT_SLUG activates tenant mode and requires DATAROOM_TENANT_PUBLIC_URL, CASTLE_DEPLOYMENT_ID, and CASTLE_API_URL — the app will throw at startup if those are missing.
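The startup rule above amounts to a conditional required-variable check. The sketch below shows one way it could look; `validateTenantEnv` and its return shape are assumptions, while the variable names are the ones listed in this section.

```typescript
type Env = Record<string, string | undefined>;

// Variables that become mandatory once DATAROOM_TENANT_SLUG is set.
const REQUIRED_IN_TENANT_MODE = [
  "DATAROOM_TENANT_PUBLIC_URL",
  "CASTLE_DEPLOYMENT_ID",
  "CASTLE_API_URL",
] as const;

interface TenantConfig {
  tenantMode: boolean;
  slug?: string;
}

// Throw at startup if tenant mode is active but a required variable is missing.
function validateTenantEnv(env: Env): TenantConfig {
  const slug = env["DATAROOM_TENANT_SLUG"];
  if (!slug) return { tenantMode: false };
  const missing = REQUIRED_IN_TENANT_MODE.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Tenant mode (${slug}) is missing: ${missing.join(", ")}`);
  }
  return { tenantMode: true, slug };
}
```

Failing fast here means a half-configured tenant never starts serving requests.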
The recommended deployment pattern is one Vercel project per tenant with a persistent volume (or Vercel's file-system on a Pro plan) mounted at ./data/. Each tenant gets:
- Their own `data/dataroom.db` (no row-level multi-tenancy)
- Their own `APP_PASSCODE`
- Their own OpenAI/Anthropic billing (or shared keys with per-tenant cost attribution via `DATAROOM_TENANT_SLUG`)
This means there is no cross-tenant data risk — the DB is physically separate. Multi-tenant-in-one-DB is a future user_id column away.
The embedding dimension is written at ingest time. Cloud mode uses 1536d (text-embedding-3-small); local mode uses 768d. Switching modes after any document has been ingested requires wiping the DB and re-ingesting. This is by design — mixed-dimension indexes silently break cosine similarity. If you need to switch:
rm -rf data/dataroom.db data/files/
npm run migrate
# re-upload documents or run npm run ingest:fixtures

Use `DATAROOM_MODE=local` (or `LOCAL_MODE=true`) to lock the instance to local models. `DATAROOM_MODE` is the preferred var in Castle deployments.
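One way to keep a mixed-dimension index from failing silently is to check every vector against the active mode before it is stored or compared. The sketch below is an assumption about how such a guard could look, not a description of the app's behavior; `EMBED_DIM` and `assertDimension` are hypothetical names, and the two dimensions are the ones stated above.

```typescript
// Embedding width per mode, as stated in this section.
const EMBED_DIM = { cloud: 1536, local: 768 } as const;
type Mode = keyof typeof EMBED_DIM;

// Reject any vector whose length does not match the active mode.
function assertDimension(mode: Mode, vector: ArrayLike<number>): void {
  const expected = EMBED_DIM[mode];
  if (vector.length !== expected) {
    throw new Error(
      `Embedding has ${vector.length} dimensions but ${mode} mode expects ${expected}; wipe and re-ingest before switching modes`
    );
  }
}
```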
Control ingestion-time LLM spend with DATAROOM_COMPLIANCE_LEVEL:
- `standard` (default): regex + Claude Haiku semantic pass. Haiku calls are cached by content hash, so re-ingesting the same bytes is free.
- `off`: regex only. Zero Haiku calls during ingest. Good for bulk-load bootstrapping or cost-constrained tenants. Run `npm run compliance:audit` afterward to fill in the semantic findings.
- `strict`: same as `standard` currently; reserved for a future secondary adversarial pass.
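The three levels reduce to one decision at ingest time: whether to run the semantic pass. A minimal sketch, assuming an unset variable means `standard` and an unknown value is rejected (the README does not say how the app treats unknown values); `parseComplianceLevel` and `runsSemanticPass` are hypothetical names.

```typescript
type ComplianceLevel = "standard" | "off" | "strict";

// Read DATAROOM_COMPLIANCE_LEVEL; unset falls back to the documented default.
function parseComplianceLevel(raw: string | undefined): ComplianceLevel {
  const v = (raw ?? "standard").trim().toLowerCase();
  if (v === "standard" || v === "off" || v === "strict") return v;
  // Assumption: reject unknown values rather than silently defaulting.
  throw new Error(`Unknown DATAROOM_COMPLIANCE_LEVEL: ${raw}`);
}

// Only "off" skips the Haiku pass; "strict" currently behaves like "standard".
function runsSemanticPass(level: ComplianceLevel): boolean {
  return level !== "off";
}
```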
Each tenant's Chrome extension instance should be pointed at the tenant's Dataroom URL. See docs/template-handoff.md for the paste-on-install flow. Captures from the extension land in the same SQLite store as uploaded documents and are fully searchable from the main chat.
Two events fire automatically when Castle vars are configured:
- `document_ingested`: fired after every successful ingest, with `document_id`, `chunk_count`, and the compliance `labels` array.
- `chat_answered`: fired after every answered chat turn, with `thread_id`, `citation_count`, and `answerable`.
Events are fire-and-forget (1 retry, swallowed on failure) and never block the request path.
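Fire-and-forget with one retry can be sketched as a helper that never rejects. The event field names below come from the list above, but the `type` discriminator, the `Sender` signature, and the `emitEvent` helper are assumptions; the transport (an HTTP call to Castle) is left to the caller.

```typescript
// Event payloads as described above; the "type" discriminator is an assumption.
type CastleEvent =
  | { type: "document_ingested"; document_id: string; chunk_count: number; labels: string[] }
  | { type: "chat_answered"; thread_id: string; citation_count: number; answerable: boolean };

// Whatever actually delivers the event, for example an HTTP POST.
type Sender = (event: CastleEvent) => Promise<void>;

// Try once, retry once, then give up silently. The returned promise never rejects.
async function emitEvent(send: Sender, event: CastleEvent, retries: number = 1): Promise<void> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await send(event);
      return;
    } catch (_err) {
      // Swallowed on purpose: event delivery must not affect the request.
    }
  }
}
```

A request handler would call `void emitEvent(send, event)` without awaiting it, so a slow or failing Castle endpoint cannot delay the response.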
See docs/template-handoff.md — env table, first-60-seconds walkthrough, failure modes, Web Brain install note, and first-call checklist.
v0.2 · Convex-backed · hand-rolled RAG · grounded answers or none



