42nights/local_dataroom

Dataroom · by 42nights Inc.

The internal data room the 42nights team runs on — drop in customer contracts, call transcripts, Slack exports, and procurement playbooks, then ask questions and get answers grounded only in the corpus, with verbatim citations back to the source.

The deployed app runs on Convex — documents, chunks, embeddings, threads, and messages all live there, and search is a Convex vector query. The only other outbound calls are OpenAI (embeddings) and Anthropic (generation). A local-only build (LOCAL_MODE) also exists for zero-cloud-touch runs; it monkey-patches fetch to block every non-localhost request.

[Screenshot: Sources]

What's in the box

  • Hand-rolled RAG. No @convex-dev/rag, no Pinecone, no LanceDB. Chunks and their embeddings are stored in Convex; retrieval is a Convex vector search (by_embedding), cosine similarity since every embedding is unit-normalized.
  • OpenAI embeddings (text-embedding-3-small, 1536d) + Anthropic Claude Opus 4.7 for generation with tool-use schema.
  • Citation validator — every quote the model returns must appear verbatim in the chunk it cites. Invalid citations get dropped on the floor. If a fact-asserting answer ends up with zero valid citations after a stricter retry, we override to "I couldn't ground that in the data room."
  • Answerability gate in front of the LLM — if top retrieval score is below threshold, return a canned "I don't have that" with zero tokens spent.
  • Secret scrubber server-side: Anthropic / OpenAI / GitHub / AWS / Stripe / Slack / PEM / JWT patterns redacted before anything hits an embedding or generation API.
  • Compliance checks — every doc is auto-classified for PHI / PII / PCI / financial / legal-confidential data via a regex pass (checksum-validated) plus a cached Claude Haiku semantic pass. npm run compliance:audit writes a full report. See Compliance checks.
  • Parsers: PDF (pdfjs-dist), DOCX (mammoth), XLSX (xlsx), CSV (papaparse), JSON, MD, TXT.
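The secret scrubber listed above can be sketched roughly as follows. The regexes are illustrative assumptions, not the patterns shipped in lib/scrubber.ts; only the [REDACTED:<kind>] replacement format comes from this README.

```typescript
// Illustrative patterns only (assumed); the real list in lib/scrubber.ts is not shown here.
type SecretKind = "anthropic" | "github" | "aws" | "pem";

const PATTERNS: Array<{ kind: SecretKind; re: RegExp }> = [
  { kind: "anthropic", re: /sk-ant-[A-Za-z0-9_-]{20,}/g },
  { kind: "github", re: /ghp_[A-Za-z0-9]{36}/g },
  { kind: "aws", re: /\bAKIA[0-9A-Z]{16}\b/g },
  {
    kind: "pem",
    re: /-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g,
  },
];

interface ScrubResult {
  text: string;
  redactions: Partial<Record<SecretKind, number>>;
}

// Replace every match with a typed placeholder and count hits per kind,
// so the caller can log what was redacted without storing the raw value.
function scrub(input: string): ScrubResult {
  const redactions: Partial<Record<SecretKind, number>> = {};
  let text = input;
  for (const { kind, re } of PATTERNS) {
    text = text.replace(re, () => {
      redactions[kind] = (redactions[kind] ?? 0) + 1;
      return `[REDACTED:${kind}]`;
    });
  }
  return { text, redactions };
}
```

Running the scrub before chunking means neither the embedding API nor the generation API ever sees the raw value.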

Stack

  • Next.js 16 (App Router, Turbopack), React 19, TypeScript strict, Tailwind 3
  • shadcn-style primitives (button, dialog, dropdown, badge, input, textarea), sonner for toasts, lucide-react for icons
  • Convex for storage + vector search (convex client, admin deploy key server-side)
  • @anthropic-ai/sdk for generation, raw fetch for OpenAI embeddings
  • vitest for tests (30 unit tests covering chunker, scrubber, citations, parsers, vector math, compliance detectors)
  • Inter (body) + Crimson Pro (display) + JetBrains Mono (code)

Start

nvm use                                  # node 22+
npm install
cp .env.local.example .env.local         # paste your OpenAI + Anthropic keys
npm run dev                              # next dev on :3000

The schema migrates itself on the first API request. To wipe everything: rm -rf data/.

npm test                                 # 30 vitest cases
npm run typecheck                        # tsc --noEmit
npm run build                            # Turbopack production build
npm run reingest:stale                   # re-embed any docs whose prompt_version is out of date
npm run compliance:audit                 # full corpus compliance report → data/
npm run ingest:fixtures                  # ingest test_fixtures/ (used by local-mode setup)
npm run smoke                            # run the 8-query smoke test, write a JSON artifact

Architecture

   Browser ──fetch──▶ Next.js API routes ──▶ lib/ingest, lib/search
                            │
                            ▼
                  better-sqlite3 ──▶ ./data/dataroom.db
                            │
                            ▼
              In-memory Map<chunkId, Float32Array>
              (hydrated from chunks.embedding on first import)
                            │
                            ▼
       api.openai.com (embeddings) · api.anthropic.com (generation)

The only network calls this app makes are to those two endpoints. No analytics, no telemetry, no third-party DB.
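As a rough illustration of the in-memory index in the diagram, the sketch below does a brute-force dot-product search over a Map<chunkId, Float32Array> and applies an answerability gate to the top score. The function signatures, the Hit type, and the GATE_THRESHOLD value are assumptions made for the example; the real lib/vector.ts and lib/search.ts may differ.

```typescript
// Hypothetical sketch: names and the threshold are assumptions, not the real lib/vector.ts API.
type ChunkId = string;
interface Hit {
  chunkId: ChunkId;
  score: number;
}

// Scale a vector to unit length so that a dot product equals cosine similarity.
function normalize(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  const out = new Float32Array(v.length);
  if (norm === 0) return out; // a zero vector stays zero
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}

function dot(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every stored (already unit-normalized) vector against the query, keep the top K.
function searchByVector(
  index: Map<ChunkId, Float32Array>,
  query: Float32Array,
  topK: number,
): Hit[] {
  const q = normalize(query);
  const hits: Hit[] = [];
  index.forEach((vec, chunkId) => {
    hits.push({ chunkId, score: dot(q, vec) });
  });
  hits.sort((x, y) => y.score - x.score);
  return hits.slice(0, topK);
}

// Assumed value for illustration; the README does not state the real threshold.
const GATE_THRESHOLD = 0.3;

// The gate only looks at the best hit: below the threshold, skip the LLM call entirely.
function passesGate(hits: Hit[], threshold: number = GATE_THRESHOLD): boolean {
  return hits.length > 0 && hits[0].score >= threshold;
}
```

A linear scan like this is what makes the index grow linearly with chunk count, which is why the caveats suggest a different index past roughly 10k chunks.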

Ask a question

[Screenshot: Chat]

Every numbered chip is a validated citation — click one for the verbatim quote, why-it's-relevant context, and the surrounding chunks pulled from disk.

[Screenshot: Citation preview]

Methodology lives in the product

The /how-it-works page documents what's actually happening between a question and an answer, including the answerability gate and the deterministic citation validator. Same tone as Lightyear's transparency page.

[Screenshot: How it works]

File layout

lib/
├─ db/                  schema.sql · better-sqlite3 singleton · ensureSchema()
├─ parsers/             pdf · docx · xlsx · csv · json · plain · markdown
├─ chunker.ts           markdown-aware splitter (tables stay whole, soft overlap)
├─ scrubber.ts          secret patterns → [REDACTED:<kind>]
├─ compliance/          regex + LLM sensitive-data detection (PHI/PII/PCI/financial/legal)
│  ├─ regex.ts          deterministic detectors (Luhn/ABA/IBAN checksums, context gating)
│  ├─ llm.ts            Claude Haiku semantic pass, cached by content hash + prompt version
│  ├─ types.ts          Finding / ComplianceLabel / Severity
│  └─ index.ts          orchestration · document_compliance writes · summary · rescan
├─ citations.ts         zod schema + Anthropic tool spec + validateCitations()
├─ prompts.ts           versioned system prompt + stricter-retry nudge
├─ embeddings.ts        OpenAI (cloud) or local (Ollama/LM Studio) batch embed + normalize()
├─ llm.ts               callLLM dispatcher — Anthropic (cloud) or local OpenAI-compat
├─ local-mode.ts        LOCAL_MODE fetch guard (blocks non-localhost) + key-conflict refusal
├─ vector.ts            in-memory index · insert/delete · searchByVector · getChunkWithContext
├─ ingest.ts            parse → scrub → chunk → embed → store (transactional)
├─ search.ts            retrieve → gate → tool-use → validate → retry
├─ chat-db.ts           thread + message CRUD
├─ anthropic.ts         thin client wrapper
├─ audit.ts             append-only audit_log
└─ api-client.ts        browser-side fetch helpers

app/
├─ sources/             list, upload, detail
├─ chat/[threadId]/     threaded chat with citation chips
├─ how-it-works/        methodology page
├─ unlock/              optional passcode form (only used when APP_PASSCODE is set)
└─ api/
   ├─ upload/                              POST multipart
   ├─ sources/[id]/chunks/[idx]/           citation preview
   ├─ chat/threads/[id]/messages/          message list
   ├─ chat/                                send + answer
   └─ admin/compliance/
      ├─ summary/                          GET  corpus-wide stats
      ├─ document/[id]/                    GET  full findings for one doc
      ├─ rescan/[id]/                      POST force re-run one doc
      └─ rescan-all/                       POST full-corpus rescan → job id

Env

Variable | Purpose
--- | ---
OPENAI_API_KEY | Embeddings via text-embedding-3-small (cloud mode)
ANTHROPIC_API_KEY | Generation via Claude Opus 4.7; compliance semantic pass via Claude Haiku 4.5. Unset = compliance runs regex-only.
APP_PASSCODE | Optional. If set, the whole app gates behind a single-field passcode at /unlock. Unset = bypass.
LOCAL_MODE | true = fully local, zero cloud touch (see below). Refuses to start if a cloud key is also set.
LOCAL_LLM_URL / LOCAL_LLM_MODEL | LM Studio OpenAI-compat endpoint + model id for generation.
LOCAL_EMBED_URL / LOCAL_EMBED_MODEL | Ollama /api/embed (or LM Studio /v1/embeddings) + 768d embedding model.

Local mode (zero cloud touch)

LOCAL_MODE=true routes generation to LM Studio and embeddings to Ollama/LM Studio with no network calls to OpenAI or Anthropic — enforced by lib/local-mode.ts, which monkey-patches fetch to throw on any non-localhost request and refuses to start if a cloud key is present. Embedding dimension drops 1536→768, so switching modes requires a wipe + re-ingest. Full write-up + benchmarks: ../findings_local_port.md.

cp .env.local .env.cloud.local           # back up the cloud config on first switch, to restore cloud later
# edit .env.local: LOCAL_MODE=true, remove cloud keys, set LOCAL_* (see .env.local.example)
rm -rf data/dataroom.db data/files/ && npm run migrate
npm run ingest:fixtures                   # ingest test_fixtures/ at 768-dim
npm run smoke                             # 8-query smoke, writes data/smoke-<mode>-*.json
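The fetch guard and key-conflict refusal described for LOCAL_MODE could look roughly like this. It is a sketch under stated assumptions: the function names and the exact set of hosts treated as local are hypothetical, and lib/local-mode.ts may differ in detail.

```typescript
// Hypothetical sketch of a LOCAL_MODE guard; lib/local-mode.ts may differ in detail.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

function isLocalUrl(url: string): boolean {
  try {
    return LOCAL_HOSTS.has(new URL(url).hostname);
  } catch {
    return false; // unparseable (for example relative) URLs are treated as non-local
  }
}

// Refuse to start when LOCAL_MODE is on and a cloud key is still configured.
function assertNoCloudKeys(env: Record<string, string | undefined>): void {
  if (env.LOCAL_MODE !== "true") return;
  const present = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"].filter((k) => Boolean(env[k]));
  if (present.length > 0) {
    throw new Error(`LOCAL_MODE=true but cloud keys are set: ${present.join(", ")}`);
  }
}

// Replace global fetch with a wrapper that throws before any non-local request is sent.
// Returns a function that restores the original fetch (handy in tests).
function installFetchGuard(): () => void {
  const original = globalThis.fetch;
  const guarded: typeof fetch = (input, init) => {
    const url =
      typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
    if (!isLocalUrl(url)) {
      throw new Error(`LOCAL_MODE: blocked request to ${url}`);
    }
    return original(input, init);
  };
  globalThis.fetch = guarded;
  return () => {
    globalThis.fetch = original;
  };
}
```

Calling assertNoCloudKeys(process.env) and installFetchGuard() once at startup, before any other module issues a request, is the assumed usage.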

Pipeline detail

  1. Upload (POST /api/upload) — content-hashes the bytes (SHA-256), dedupes against existing rows. Identical bytes return the existing documentId with deduped: true and skip every downstream step.
  2. Parse (lib/parsers/) — per-mime dispatch to plain markdown. Tables in XLSX/CSV become real markdown tables.
  3. Scrub (lib/scrubber.ts) — server-side regex sweep redacts API keys, PATs, AWS access keys, Stripe keys, Slack tokens, PEM private keys, JWTs. Logged but not blocked.
  4. Chunk (lib/chunker.ts) — markdown-aware: splits on headings, keeps tables whole up to a hard cap, soft overlap between chunks so context survives boundaries. Defaults: 200/1200/100/4000.
  5. Compliance (lib/compliance/) — regex pass + cached Claude Haiku semantic pass classify the doc into phi/pii/pci/financial/legal-confidential with block/warn severity. Writes one document_compliance row; chunks inherit the doc's labels. See Compliance checks.
  6. Embed (lib/embeddings.ts) — batched against OpenAI (96 per request). Every vector unit-normalized on receive.
  7. Index (lib/vector.ts) — inserted into both SQLite (chunks.embedding as a BLOB) and the in-memory Map<chunkId, Float32Array>. Status flips to ready.
  8. Query (POST /api/chat) — embed the question → dot-product against the in-memory map → top-K → answerability gate. If the gate passes, build <source id="Sn"> blocks (with one chunk of context on either side) and call Claude with tool-use forcing the answer into a shape with citations, confidence, follow-ups, and answerable.
  9. Validate — every cited quote must appear verbatim (whitespace-normalized) in the chunk it points at. Invalid citations dropped. Fact-asserting answer with zero validated citations → one stricter-retry → if still zero, force answerable: false.
  10. Persist — append assistant message with validated citations, followups, confidence, prompt_version. Audit row written for every ask.
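The validation and retry rule in steps 9 and 10 can be sketched as below. The Citation shape, the chunk lookup by source id, and the nextStep helper are assumptions for illustration; the actual validateCitations() in lib/citations.ts may take different arguments.

```typescript
// Hypothetical shapes; the real zod schema in lib/citations.ts is not shown in this README.
interface Citation {
  sourceId: string; // e.g. "S1", matching a <source id="S1"> block
  quote: string;
}

// Collapse all whitespace runs so line wraps in the chunk do not break a verbatim match.
const normalizeWs = (s: string): string => s.replace(/\s+/g, " ").trim();

// A citation is valid only if its quote appears verbatim (after whitespace
// normalization) in the chunk it points at. Everything else is dropped.
function validateCitations(
  citations: Citation[],
  chunks: Map<string, string>,
): { valid: Citation[]; dropped: Citation[] } {
  const valid: Citation[] = [];
  const dropped: Citation[] = [];
  for (const c of citations) {
    const chunk = chunks.get(c.sourceId);
    const quote = normalizeWs(c.quote);
    const ok =
      chunk !== undefined && quote.length > 0 && normalizeWs(chunk).includes(quote);
    (ok ? valid : dropped).push(c);
  }
  return { valid, dropped };
}

// Decide what happens after validation: accept, retry once with the stricter
// prompt, or force answerable: false.
function nextStep(
  assertsFacts: boolean,
  validCount: number,
  alreadyRetried: boolean,
): "accept" | "retry" | "unanswerable" {
  if (!assertsFacts || validCount > 0) return "accept";
  return alreadyRetried ? "unanswerable" : "retry";
}
```

Because the check is a plain substring test, it is deterministic: the same answer and chunks always yield the same set of surviving citations.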

Compliance checks

Every ingested document is automatically classified for sensitive data. This is detection and reporting — not HIPAA certification, access control, or encryption. It tells you which documents hold what classes of sensitive data so you can decide what to do.

Two passes, merged into one document_compliance row:

  • Regex (lib/compliance/regex.ts) — deterministic, runs every time. SSN (with grouping sanity check), credit cards (Luhn), CVV/MRN/ICD-10/DOB/phone/email/driver-license/passport, bank-account/routing (ABA checksum)/IBAN (mod-97), and attorney-confidential markers. Several detectors only fire near a context word (a 9-digit run is "routing" only next to routing/aba; a date is a DOB only near born/dob). Snippets are redacted — the raw value never lands in the report.
  • LLM (lib/compliance/llm.ts) — Claude Haiku reads the scrubbed markdown and catches what regex can't: named individuals in prose, contracts, internal commercial strategy. Schema-bound via tool-use, temperature: 0, consolidated to ≤3 findings per doc. Cached on (content_hash, prompt_version, model) — an unchanged doc never re-pays for the call.
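The three checksums named in the regex pass are standard algorithms, sketched here for reference. The function names and length bounds are assumptions for the example, not the contents of lib/compliance/regex.ts.

```typescript
// Luhn check for card numbers: from the right, double every second digit,
// subtract 9 from any result above 9, and require the total to be divisible by 10.
function luhnValid(digits: string): boolean {
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// ABA routing number: nine digits weighted 3, 7, 1 (repeating); total must be divisible by 10.
function abaValid(digits: string): boolean {
  if (!/^\d{9}$/.test(digits)) return false;
  const weights = [3, 7, 1, 3, 7, 1, 3, 7, 1];
  let sum = 0;
  for (let i = 0; i < 9; i++) sum += (digits.charCodeAt(i) - 48) * weights[i];
  return sum % 10 === 0;
}

// IBAN mod-97: move the first four characters to the end, map letters to
// numbers (A=10 ... Z=35), and require the resulting integer mod 97 to equal 1.
function ibanValid(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48;
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

Checksum validation is what lets a regex detector reject most random digit runs that merely look like a card or routing number.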

Labels stack per doc: phi · pii · pci · financial · secrets · legal-confidential · clean. Severity is block (real and dangerous — SSN, card, MRN, account number, live secret) or warn (worth surfacing, possibly a false positive). Secrets at block are already redacted upstream by the scrubber, so compliance sees [REDACTED:…] by the time it runs.

Run the audit

npm run compliance:audit        # → data/compliance-report-<timestamp>.md

Walks the corpus, runs both passes (cached), then adds an adversarial pass where Claude Opus answers "if a stranger got this doc, what's the worst that happens, and who should it be restricted to?" — then writes a markdown report: summary table, per-doc findings with redacted snippets, the adversarial narrative, and a recommendation (redact / restrict / keep). This is what you hand to a customer to show the corpus's compliance posture.

Endpoints

GET  /api/admin/compliance/summary        corpus stats (by severity, by label, top offenders)
GET  /api/admin/compliance/document/:id    full findings for one doc
POST /api/admin/compliance/rescan/:id      re-run one doc
POST /api/admin/compliance/rescan-all      re-run the corpus → { job_id, scanned, summary }

Interpreting & acting on findings

  • block → there's a real identifier in the doc. Open it, redact the span, re-upload. Re-ingest with identical bytes is a no-op; edit the file first.
  • warn → review and decide. Most warn items (a name, a vendor price) are fine to keep — the label is what the texting agent's access tiers will key off (see texting_agent_access_tiers.md), not a reason to delete.
  • Reviewed-and-accepted is a manual step for v0: note it in the report. There is no "mark as reviewed" state in the DB yet.
  • Bumping COMPLIANCE_PROMPT_VERSION (in lib/compliance/llm.ts) invalidates the LLM cache for the new version, so the next ingest/rescan re-runs the semantic pass. Use this after editing the prompt.
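To illustrate why bumping the prompt version invalidates cached results, here is a minimal sketch of a cache keyed on (content_hash, prompt_version, model). The class, the in-memory Map, and the model id used in the usage note are hypothetical; the real cache in lib/compliance/llm.ts is presumably persisted rather than held in memory.

```typescript
import { createHash } from "node:crypto";

interface CacheKeyParts {
  contentHash: string;
  promptVersion: string;
  model: string;
}

// SHA-256 of the document text, hex-encoded.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// JSON-encode the tuple so that no combination of parts can collide through a separator.
function cacheKey(parts: CacheKeyParts): string {
  return JSON.stringify([parts.contentHash, parts.promptVersion, parts.model]);
}

// Hypothetical in-memory stand-in for the findings cache.
class FindingsCache<T> {
  private readonly store = new Map<string, T>();

  get(parts: CacheKeyParts): T | undefined {
    return this.store.get(cacheKey(parts));
  }

  set(parts: CacheKeyParts, findings: T): void {
    this.store.set(cacheKey(parts), findings);
  }
}
```

With this keying, a lookup for { contentHash, promptVersion: "v2", model } misses even when the same document was cached under "v1", so the next ingest or rescan has to re-run the semantic pass.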

Web Brain (browser memory)

A Chrome extension that captures what you read and folds it into this same data room as a new ingestion source. Captures land as documents rows tagged source='web_brain', chunked + embedded into the same chunks table, so the brain rides on the existing retrieval, citation, and compliance plumbing.

Tiered memory. A capture starts at tier='realtime'. Cron jobs consolidate upward (realtime → daily → weekly → monthly), and each tier is a denser, LLM-compressed summary of the one below it. A consolidated capture is marked consolidated_into and ranked lower; its summary becomes the primary search target, while the original capture stays available as the citation.
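The tier ladder can be sketched as follows. The Capture shape, the consolidate() signature, and the injected compress callback (a stand-in for the LLM compression step) are assumptions for illustration only.

```typescript
const TIERS = ["realtime", "daily", "weekly", "monthly"] as const;
type Tier = (typeof TIERS)[number];

// Hypothetical row shape; only tier and consolidated_into are named in this README.
interface Capture {
  id: string;
  tier: Tier;
  text: string;
  consolidated_into?: string;
}

// The tier a capture is promoted into, or null once it is already monthly.
function nextTier(tier: Tier): Tier | null {
  const i = TIERS.indexOf(tier);
  return i < TIERS.length - 1 ? TIERS[i + 1] : null;
}

// Fold every not-yet-consolidated capture at `from` into one summary at the next tier.
// `compress` stands in for the LLM call that produces the denser summary.
function consolidate(
  captures: Capture[],
  from: Tier,
  summaryId: string,
  compress: (texts: string[]) => string,
): Capture | null {
  const target = nextTier(from);
  const pending = captures.filter(
    (c) => c.tier === from && c.consolidated_into === undefined,
  );
  if (target === null || pending.length === 0) return null;
  const summary: Capture = {
    id: summaryId,
    tier: target,
    text: compress(pending.map((c) => c.text)),
  };
  // The originals are kept and linked to the summary that replaced them in ranking.
  for (const c of pending) c.consolidated_into = summaryId;
  return summary;
}
```

Keeping the originals and linking them through consolidated_into is what lets a summary be the search target while the underlying capture remains citable.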

Cloud vs local. Same INFERENCE_MODE split as the rest of the app. Cloud: the extension sends scrubbed page text, the backend compresses with Claude. Local (LOCAL_MODE=true): the extension compresses in-browser with WebLLM and only the digest leaves the device. Either way a server-side scrubber (lib/webbrain/scrubber.ts) redacts secrets + PII and drops any capture with > 5 redactions before it touches the LLM or storage.

Endpoints (extension calls use Authorization: Bearer <token>):

Route | Purpose
--- | ---
POST /api/extension/connect | Mint an extension token (passcode-gated when APP_PASSCODE set)
POST /api/webbrain/capture | Ingest one capture → scrub → compress → embed → chunk
GET /api/webbrain/search | Tier-aware vector search over captures
POST /api/webbrain/chat | RAG chat over the brain (cites url + date + tier)
POST /api/webbrain/consolidate | Cron-fired tier consolidation (?tier=daily, CRON_SECRET)
POST /api/brain/chat | Dashboard /brain chat (passcode-gated, no token)

Quick start.

# 1. backend — migration runs automatically on first request; or:
npm run migrate
npm run dev

# 2. extension — build + load unpacked
cd extensions/web-brain && npm install && npm run build
#    chrome://extensions → Developer mode → Load unpacked → extensions/web-brain/dist
#    (local/privacy mode also needs: npm install @mlc-ai/web-llm)

# 3. connect — open http://localhost:3000/extension/connect, copy the token,
#    paste it into the extension's Settings tab.

# 4. consolidation — Vercel crons (vercel.json) hit /api/webbrain/consolidate,
#    or run it by hand / backfill history:
npm run webbrain:consolidate            # daily+weekly+monthly, trailing windows
npm run webbrain:consolidate all --backfill

Read the full brain over all your reading at /brain in the dashboard.

Caveats

  • No streaming — answers complete in a single tool-use call (4–10s).
  • Web Brain is single-tenant (no user_id): one passcode, one logical owner. Multi-user isolation is a user_id column away.
  • Web Brain consolidation cadence is UTC (vercel.json), not per-user-local — the spec's "user-local" needs a per-user timezone column.
  • Pure vector search. FTS5 hybrid is sketched in lib/db/schema.sql and is a ~30-line follow-up.
  • PDFs with complex tables can lose structure under pdfjs-dist. A LlamaParse fallback is sketched in the original spec.
  • Designed for ≤ 20 files. The in-memory index grows linearly with chunks; past ~10k chunks, swap to sqlite-vec or HNSW (~100-line change).
  • No GitHub OAuth or member allowlist. Local-only assumes the person at the keyboard is authorized.

Test surface

test/
├─ chunker.test.ts      4 tests — bounds, tables, overlap, empty input
├─ scrubber.test.ts     5 tests — Anthropic/GitHub/AWS/PEM patterns + clean text
├─ citations.test.ts    5 tests — quote-in-chunk validation + fact-detection
├─ parsers.test.ts      4 tests — CSV/JSON/MD/plain text round-trips
├─ vector.test.ts       2 tests — dot-product ordering + delete-syncs-index
└─ compliance.test.ts   10 tests — fixture detections, checksum validators, redacted snippets
   fixtures/            seeded PHI/PII/clean files the detectors run against

npm test runs all 30 in under a second. The one LLM-pass test self-skips unless ANTHROPIC_API_KEY is set.

Deploy as a Castle template

Dataroom ships as a Castle template. One click in the Castle dashboard provisions a Vercel project, a fresh data/dataroom.db, and a .env prefilled with the tenant and Castle vars. What you need to know:

Tenant env contract

Set these in your Vercel project env (or equivalent):

DATAROOM_TENANT_SLUG=acme          # activates tenant mode
DATAROOM_TENANT_DISPLAY_NAME=Acme  # shown in app title / footer
DATAROOM_TENANT_PUBLIC_URL=https://dataroom.acme.com
DATAROOM_LOGO_URL=https://cdn.acme.com/logo.png   # optional favicon
DATAROOM_PRIMARY_COLOR=#0066ff                     # optional brand color
CASTLE_API_URL=https://api.castle.42nights.com
CASTLE_DEPLOYMENT_ID=<id from Castle dashboard>
CASTLE_WEBHOOK_SECRET=<secret from Castle dashboard>
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
APP_PASSCODE=<strong passcode for the tenant>

All tenant vars are optional individually. Setting DATAROOM_TENANT_SLUG activates tenant mode and requires DATAROOM_TENANT_PUBLIC_URL, CASTLE_DEPLOYMENT_ID, and CASTLE_API_URL — the app will throw at startup if those are missing.
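The startup contract above could be enforced with a check along these lines. The TenantConfig shape and the loadTenantConfig name are hypothetical; only the variable names and the "throw at startup" behavior come from this README.

```typescript
// Hypothetical config shape for illustration.
interface TenantConfig {
  slug: string;
  publicUrl: string;
  castleDeploymentId: string;
  castleApiUrl: string;
  displayName?: string;
}

// Variables that become mandatory once DATAROOM_TENANT_SLUG is set.
const REQUIRED_WITH_SLUG = [
  "DATAROOM_TENANT_PUBLIC_URL",
  "CASTLE_DEPLOYMENT_ID",
  "CASTLE_API_URL",
] as const;

// Returns null outside tenant mode; throws with the full list of missing
// variables when tenant mode is active but incompletely configured.
function loadTenantConfig(env: Record<string, string | undefined>): TenantConfig | null {
  const slug = env.DATAROOM_TENANT_SLUG;
  if (!slug) return null;
  const missing = REQUIRED_WITH_SLUG.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`tenant mode (${slug}) requires: ${missing.join(", ")}`);
  }
  return {
    slug,
    publicUrl: env.DATAROOM_TENANT_PUBLIC_URL as string,
    castleDeploymentId: env.CASTLE_DEPLOYMENT_ID as string,
    castleApiUrl: env.CASTLE_API_URL as string,
    displayName: env.DATAROOM_TENANT_DISPLAY_NAME,
  };
}
```

Reporting every missing variable in one error, rather than failing on the first, saves a redeploy cycle per variable.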

Pattern A: per-tenant Vercel project + SQLite

The recommended deployment pattern is one Vercel project per tenant with a persistent volume (or Vercel's file-system on a Pro plan) mounted at ./data/. Each tenant gets:

  • Their own data/dataroom.db (no row-level multi-tenancy)
  • Their own APP_PASSCODE
  • Their own OpenAI/Anthropic billing (or shared keys with per-tenant cost attribution via DATAROOM_TENANT_SLUG)

This means there is no cross-tenant data risk — the DB is physically separate. Multi-tenant-in-one-DB is a future user_id column away.

Mode-lock after first boot

The embedding dimension is written at ingest time. Cloud mode uses 1536d (text-embedding-3-small); local mode uses 768d. Switching modes after any document has been ingested requires wiping the DB and re-ingesting. This is by design — mixed-dimension indexes silently break cosine similarity. If you need to switch:

rm -rf data/dataroom.db data/files/
npm run migrate
# re-upload documents or run npm run ingest:fixtures

Use DATAROOM_MODE=local (or LOCAL_MODE=true) to lock the instance to local models. DATAROOM_MODE is the preferred var in Castle deployments.
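One way to make the mode-lock explicit is to compare the dimension already stored in the index with the dimension the current mode produces, and fail loudly on a mismatch. The helper names below are hypothetical, and treating any DATAROOM_MODE value other than local as cloud is an assumption.

```typescript
type Mode = "cloud" | "local";

// Dimensions stated in this README: text-embedding-3-small vs the local 768d model.
const EMBED_DIMS: Record<Mode, number> = { cloud: 1536, local: 768 };

// DATAROOM_MODE=local or LOCAL_MODE=true selects local; everything else is
// treated as cloud (an assumption for this sketch).
function resolveMode(env: Record<string, string | undefined>): Mode {
  return env.DATAROOM_MODE === "local" || env.LOCAL_MODE === "true" ? "local" : "cloud";
}

// storedDim is the dimension of any embedding already in the index, or null when empty.
function assertModeMatchesIndex(mode: Mode, storedDim: number | null): void {
  if (storedDim === null) return; // nothing ingested yet, either mode is fine
  const expected = EMBED_DIMS[mode];
  if (storedDim !== expected) {
    throw new Error(
      `index holds ${storedDim}d embeddings but ${mode} mode produces ${expected}d; wipe and re-ingest`,
    );
  }
}
```

A check like this turns a silent ranking failure into a startup error with the remedy in the message.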

Compliance levels

Control ingestion-time LLM spend with DATAROOM_COMPLIANCE_LEVEL:

  • standard (default) — regex + Claude Haiku semantic pass. Haiku calls are cached by content hash, so re-ingesting the same bytes is free.
  • off — regex only. Zero Haiku calls during ingest. Good for bulk-load bootstrapping or cost-constrained tenants. Run npm run compliance:audit afterward to fill in the semantic findings.
  • strict — same as standard currently; reserved for a future secondary adversarial pass.
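A minimal sketch of how the level might gate the semantic pass is shown below. Rejecting unknown values with an error is an assumption, as are the function names; the fallback to regex-only when ANTHROPIC_API_KEY is unset follows the Env table.

```typescript
type ComplianceLevel = "standard" | "off" | "strict";

// Unset or empty falls back to the documented default, "standard".
// Throwing on an unrecognized value is an assumption for this sketch.
function parseComplianceLevel(raw: string | undefined): ComplianceLevel {
  if (raw === undefined || raw === "") return "standard";
  if (raw === "standard" || raw === "off" || raw === "strict") return raw;
  throw new Error(`unknown DATAROOM_COMPLIANCE_LEVEL: ${raw}`);
}

// The regex pass always runs; the Haiku pass needs both a level other than
// "off" and an Anthropic key to call with.
function runsSemanticPass(level: ComplianceLevel, hasAnthropicKey: boolean): boolean {
  return level !== "off" && hasAnthropicKey;
}
```

Since strict currently behaves like standard, both return the same answer here.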

Web Brain as an ingestion source

Each tenant's Chrome extension instance should be pointed at the tenant's Dataroom URL. See docs/template-handoff.md for the paste-on-install flow. Captures from the extension land in the same SQLite store as uploaded documents and are fully searchable from the main chat.

Castle events

Two events fire automatically when Castle vars are configured:

  • document_ingested — fired after every successful ingest, with document_id, chunk_count, and the compliance labels array.
  • chat_answered — fired after every answered chat turn, with thread_id, citation_count, and answerable.

Events are fire-and-forget (1 retry, swallowed on failure) and never block the request path.
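The delivery semantics (one retry, failures swallowed, request path never blocked) can be sketched as below. The event field names follow the two bullets above, but the type field, the Sender abstraction, and the function names are assumptions; the real transport to CASTLE_API_URL is not shown in this README.

```typescript
// Payload fields follow the README; the discriminating "type" field is assumed.
type CastleEvent =
  | { type: "document_ingested"; document_id: string; chunk_count: number; labels: string[] }
  | { type: "chat_answered"; thread_id: string; citation_count: number; answerable: boolean };

// Hypothetical transport: anything that sends one event and rejects on failure.
type Sender = (event: CastleEvent) => Promise<void>;

// Try once, retry up to `retries` more times, and report success without ever rejecting.
async function deliver(send: Sender, event: CastleEvent, retries = 1): Promise<boolean> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await send(event);
      return true;
    } catch {
      // Swallowed on purpose: event delivery must not surface errors to the caller.
    }
  }
  return false;
}

// Start delivery and return immediately, so the request handler does not wait on it.
function fireAndForget(send: Sender, event: CastleEvent): void {
  void deliver(send, event);
}
```

The trade-off of this design is that a dropped event is invisible to the caller, so any monitoring has to happen inside the sender.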

Full handoff guide

See docs/template-handoff.md — env table, first-60-seconds walkthrough, failure modes, Web Brain install note, and first-call checklist.


v0.2 · Convex-backed · hand-rolled RAG · grounded answers or none

About

Fully-local fork of the 42nights data room — LM Studio LLM + local embeddings, zero cloud touch.
