Releases · ahmedEid1/thoth

24 May 22:50

v1.0.1

7f53fa6

v1.0.1 — eval CI hardening + public /evals explanations

Patch release. Engineering is complete; this iteration is post-v1.0.0 polish focused on the public /evals dashboard and the weekly eval CI workflow.

Public eval dashboard (/evals)

Metric descriptions on the four aggregate cards (one-line plain-English gloss under each percentage).
"How this works" section between the per-question table and the methodology footer, covering:
- Lifecycle — golden YAMLs → headless agent run → EvalRun rows → latest-per-pair display
- Philosophy — public-not-hidden, vacuous-true scoring, per-metric regression guard
- The four metrics, in detail — formulas + what each catches + caveats
Sweep-completion badge in the hero: "X of Y goldens have data at this commit" with a tooltip explaining why a golden may have no row (not in the latest sweep, or hit a rate-limit / walltime cap).
next.config.outputFileTracingIncludes ensures the YAML files ship into the Vercel function bundle so the badge denominator is accurate on prod.

Eval CI hardening (`.github/workflows/evals.yml` + `scripts/`)

The weekly cron previously got cancelled at the 60-min job cap mid-sweep when Mistral's free-tier rate-limiting was tighter than estimated. Fixes:

6-golden smoke set on the cron (EVAL_GOLDENS=000,001,002,004,005,007). The full 17-golden sweep is now opt-in via workflow_dispatch with goldens: smoke|all.
Per-golden walltime cap (EVAL_GOLDEN_TIMEOUT_MS, default 15 min) via Promise.race — one stuck golden no longer kills the whole run. Timer is now clearTimeout-ed in a finally block so completed goldens don't leak event-loop handles.
Empty-sweep guard in both run-evals.ts and check-eval-regression.ts — if every golden fails the workflow exits non-zero with a clear message instead of falsely greenlighting an empty sweep.
Per-metric regression thresholds in check-eval-regression.ts: 10% for the deterministic citation/coverage metrics (where a single missed paper moves the score by 10–20%), 20% for claim_faithfulness (where one LLM-judge verdict flip moves the score 7–10 points on a small-N denominator).
generateObject maxRetries bumped from 2 (default) to 4 for more exponential backoff before bubbling a rate-limit failure.
Job timeout bumped 60 → 90 min as safety margin.
packageManager pinned to pnpm@9.15.0 to fix ERR_PNPM_LOCKFILE_CONFIG_MISMATCH after an attempted pnpm 10 migration regressed CI.
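The walltime cap and the empty-sweep guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in run-evals.ts: `withTimeout`, `runSweep`, and the `GoldenResult` shape are hypothetical names, and the golden runner is injected so the sketch stays self-contained.

```typescript
// Hypothetical helper: races `work` against a walltime cap.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer whether the work won or lost the race, so a finished
    // golden does not leave a pending handle on the event loop.
    clearTimeout(timer);
  }
}

// Hypothetical result shape for one golden.
type GoldenResult = { id: string; ok: boolean; error?: string };

async function runSweep(
  ids: string[],
  runGolden: (id: string) => Promise<void>, // assumed runner, injected by the caller
  capMs: number,
): Promise<GoldenResult[]> {
  const results: GoldenResult[] = [];
  for (const id of ids) {
    try {
      await withTimeout(runGolden(id), capMs, `golden ${id}`);
      results.push({ id, ok: true });
    } catch (err) {
      // One stuck or failing golden is recorded and the sweep continues.
      results.push({ id, ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }
  // Empty-sweep guard: a sweep in which nothing succeeded must not look green.
  if (results.every((r) => !r.ok)) {
    throw new Error("Empty sweep: no golden completed successfully");
  }
  return results;
}
```

Note that `Promise.race` only stops waiting; the timed-out work keeps running in the background, which is why AbortSignal-based cancellation is a separate, larger change.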

Review process

Both Codex adversarial-review and Claude code-reviewer ran against the post-v1.0.0 diff before tagging. They surfaced 7 real issues — 6 are fixed in this release (commit f946758); the 7th (AbortSignal-plumbed timeout cancellation through the LangGraph + LLM chain) is deferred as a larger change.

Tests

326/326 substantive tests passing. pnpm lint clean. pnpm tsc --noEmit clean.

MCP Registry

server.json is bumped to 1.0.1. Republish to registry.modelcontextprotocol.io is a manual follow-up (interactive GitHub device-code auth required).


24 May 09:42

ahmedEid1

v0.7.0-m5

34cdb8a

v0.7.0-m5 — Authenticated MCP Server

M5 — Authenticated MCP Server

Atlas ships an authenticated, hosted MCP server at /api/mcp/mcp on the
live Vercel deploy. OAuth 2.1 + PKCE + Dynamic Client Registration via
Clerk (resource-server pattern). 3 read-only tools over tenant-scoped
data: list_reviews, get_review_draft, get_citation_audit. DB-backed
audit log of every call; per-user sliding-window rate limits; no raw
input ever logged (SHA-256 of canonical-JSON only).
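One way to log a SHA-256 of canonical JSON instead of the raw input is sketched below. The canonicalization rule (recursively sorted object keys, no whitespace) and the names `canonicalJson` and `hashToolInput` are assumptions made for illustration; the release notes do not specify the exact scheme.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted recursively, so logically equal inputs
// always produce the same string regardless of key order.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalJson(value[key])}`);
  return `{${entries.join(",")}}`;
}

// Hypothetical audit helper: only this digest would be written to the log.
function hashToolInput(input: Json): string {
  return createHash("sha256").update(canonicalJson(input), "utf8").digest("hex");
}
```

Because the keys are sorted before hashing, `{ a: 1, b: 2 }` and `{ b: 2, a: 1 }` map to the same digest, which keeps audit rows comparable across clients that serialize arguments in different orders.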

Verified end-to-end against 3 MCP clients:

MCP Inspector — full OAuth flow + all 3 tools
Claude Desktop (via mcp-remote bridge) — natural-language prompts
claude.ai web (custom Connector with DCR) — 9 audit-log rows across
all clients confirm identity persistence + correct user scoping

Highlights:

199 tests pass, 38 new (auth, audit, rate-limit, handler wrapper,
3 tools, route integration); tsc + lint clean
Spec: docs/superpowers/specs/2026-05-24-m5-mcp-server-design.md
Plan: docs/superpowers/plans/2026-05-24-m5-mcp-server.md
User docs: docs/mcp/tools.md + docs/mcp/security.md
Pre-tag manual checklist in RELEASING.md
Demo + setup GIFs embedded in README

Stack: Next.js 16 / Clerk / @clerk/mcp-tools / mcp-handler /
@modelcontextprotocol/sdk / Prisma / Neon / Vercel.


23 May 22:12

ahmedEid1

v0.6.0-m3.5c

2b42a24

v0.6.0 — M3.5c self-host fallback shipped

Atlas is fully self-hostable

Closes the M3.5 trilogy. With v0.6.0, Atlas can run on a single VM on Oracle Cloud's Always Free ARM tier (4 cores, 24 GB RAM, free forever, no credit card after signup verification) — as an alternative to the Vercel + Neon + R2 + Trigger.dev Cloud + Langfuse Cloud stack used in production.

What ships in this release

infra/self-host/docker-compose.prod.yml — full production stack:
- Caddy reverse proxy with auto-TLS via Let's Encrypt
- Atlas Next.js app (built from the included Dockerfile)
- Postgres 17 (replaces Neon)
- MinIO (replaces Cloudflare R2)
- Full self-hosted Langfuse v3 stack: web + worker + ClickHouse + Redis + dedicated MinIO for events
infra/self-host/Dockerfile — multi-stage Node 22 build. No Python (the v0.5.x Mistral OCR swap eliminated the marker-pdf/Python deploy complexity entirely)
infra/self-host/Caddyfile — single-domain auto-TLS, optional Langfuse subdomain
infra/self-host/.env.prod.example — env template with sensible defaults + clear "MUST CHANGE" comments on secrets
infra/self-host/backup-postgres.sh — cron-runnable pg_dump script with rotation
docs/self-host/oracle-cloud-quickstart.md — step-by-step Oracle Cloud setup walkthrough

Honest scope

What's self-hosted: app + DB + object store + tracing.
What stays cloud (with documented swap paths):

LLM: still uses an API (Mistral free Experiment tier by default; or any of the 6 supported providers — Anthropic, OpenAI, Groq, Gemini, Mistral, Claude Agent SDK)
Auth: Clerk Cloud free tier (NextAuth + Postgres swap deferred to a future M3.5d)
Trigger.dev (background jobs): default is Trigger.dev Cloud free; self-host is REFERENCED via triggerdotdev/self-hosted-trigger.dev but not duplicated here (their compose is large; out of quickstart scope)

Total cost

Item	Cost
Oracle Cloud Ampere A1 (4 cores, 24 GB RAM)	$0/month forever
Domain	~€10/year (you already have one)
Mistral API (Experiment tier)	$0 (small-scale dev)
Marginal cost per real review	~$0.02 (Mistral OCR + LLM calls)
Total	$0/month, ~$0.02/review

Why this matters for the portfolio

"I built a system that runs end-to-end on managed cloud AND on a single $0/month VM" — defensible "I own my deployment" angle
Closes a real gap: free tiers occasionally change terms; having a documented escape hatch is mature engineering
Demonstrates Docker + Caddy + ARM-compatible Linux stack literacy (relevant to AppliedAI / Agentic SWE roles)

Cumulative state at v0.6.0

12 GitHub releases shipped over the project's life (M1 → v0.6.0)
Live production deploy: https://atlas-sooty-delta.vercel.app
Public eval dashboard: https://atlas-sooty-delta.vercel.app/evals (real metrics from Claude Agent SDK runs)
Self-host story now complete

Next

Outstanding from the original roadmap:

M5 (Wk 6): Authenticated MCP server (OAuth 2.1)
M6 (Wk 7): Public launch + recruiter 1-pager + blog series


23 May 21:59

ahmedEid1

v0.5.1-first-live-review

6fd660c

v0.5.1 — First live e2e review completed 🎯

Atlas just generated its first systematic literature review on production

Tonight Atlas ran a full systematic literature review of the ReAct paper (arxiv 2210.03629) end-to-end on live production infrastructure. $0 spend.

What ran

User upload (atlas-sooty-delta.vercel.app)
    ↓
Vercel API route → R2 (PDF stored)
    ↓
Trigger.dev parse-pdf task → Mistral OCR → markdown
    ↓
Trigger.dev summarize-paper task → Mistral text model → structured summary
    ↓
Trigger.dev run-review task:
    ├─ planner (Mistral large) → PICOC + sub-questions + criteria
    ├─ plan_gate HITL → user approves via UI
    ├─ retriever (Mistral large) → per-paper relevance scoring
    ├─ papers_gate HITL → user approves via UI
    ├─ assessor (Mistral large) → claim extraction
    ├─ drafter (Mistral large) → review markdown with [paper_id] citations
    ├─ critic (Mistral large) → rubric scoring
    └─ cite_check (Mistral large) → per-citation verification (some rate-limited, recorded as "unclear")
    ↓
Run COMPLETED → draft visible in run workspace UI with critic + faithfulness widgets

Bugs the live run surfaced (and fixed)

The first end-to-end run was a stress test that found 10 real production issues, each now fixed and committed:

Trigger.dev CLI version mismatch — pinned all @trigger.dev/* deps to exact 4.4.6
Issue triggerdotdev/trigger.dev#1843 — nested requirements.txt fails. Moved to project root.
Eager env validation at module import crashed the deploy indexer. Refactored lib/env.ts, lib/db.ts, lib/object-store.ts to lazy Proxy/factory pattern.
Wrong Trigger.dev key type — TRIGGER_SECRET_KEY needed a tr_prod_* server-side key (we had tr_pat_*).
Missing runtime env vars on Trigger.dev — added syncEnvVars extension to push local .env on every deploy.
Neon WebSocket constructor missing in Trigger.dev's worker Node runtime. Set neonConfig.webSocketConstructor = ws explicitly.
HITL resume payload wrapped — wait.forToken returns { ok, output } but run-review.ts was passing the wrapper to LangGraph's Command resume. Latent M3 bug that mocked tests never caught. Fixed with .unwrap().
Marker-pdf cold-start timeout at 10 min on Trigger.dev's large-1x machine. Swapped to Mistral OCR (~5-10 sec per paper, $0.002/page on free tier).
Groq gpt-oss-20b 8K TPM ceiling exceeded by real 9-page paper input. Added Mistral as a 6th provider; switched default to LLM_PROVIDER=mistral (free Experiment tier with much higher TPM).
Cite_check rate limit on free Mistral tier — parallel batches of 5 hit ~1 RPS cap. Dropped to sequential + per-citation error tolerance so one rate-limited check doesn't fail the whole run.
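The last fix in the list above (sequential checks with per-citation error tolerance) might look roughly like this. It is a sketch under assumptions: the `Citation` and `CheckResult` shapes, the verdict names, and the injected `verify` function are hypothetical stand-ins for the real cite_check node.

```typescript
// Verdict names are assumed; "UNCLEAR" mirrors the "unclear" outcome
// recorded for rate-limited checks.
type Verdict = "SUPPORTED" | "UNSUPPORTED" | "UNCLEAR";
type Citation = { paperId: string; claim: string };
type CheckResult = Citation & { verdict: Verdict; error?: string };

async function checkCitationsSequentially(
  citations: Citation[],
  verify: (citation: Citation) => Promise<Verdict>, // assumed LLM-backed verifier
): Promise<CheckResult[]> {
  const results: CheckResult[] = [];
  // One request at a time keeps the call rate under a low requests-per-second cap.
  for (const citation of citations) {
    try {
      results.push({ ...citation, verdict: await verify(citation) });
    } catch (err) {
      // A failed check (for example a rate-limit error) degrades to UNCLEAR
      // instead of failing the whole run.
      results.push({
        ...citation,
        verdict: "UNCLEAR",
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return results;
}
```

The trade-off is latency: checks no longer overlap, so total time grows linearly with the number of citations.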

The stack that now actually runs reviews end-to-end

Layer	Service	Cost
App	Vercel Hobby (Next.js 16)	Free
DB	Neon Postgres (Frankfurt)	Free
Object store	Cloudflare R2	Free
Background jobs	Trigger.dev Cloud (deployed prod tasks)	Free
Auth	Clerk Cloud	Free
Tracing	Langfuse Cloud	Free
LLM	Mistral (large/small, Experiment tier)	Free
PDF parsing	Mistral OCR	$0.002/page (effectively free on free Experiment credits)
Local-dev evals	Claude Agent SDK via Max subscription	Free

Monthly cost: $0. Per-review marginal cost: ~$0.02 worst case.

The narrative

Built Atlas as a portfolio piece. The first live e2e run flushed out 10 distinct production bugs — half of them latent in the M3 codebase since they were only exercised by mocked tests. Each got diagnosed, fixed cleanly, committed. The harness DID the job an eval harness is supposed to do: surface real problems before they reach a user.

That's the hiring story right there. Not "I built a slick demo" — "I built a system, stress-tested it on real infra, and the failure modes are visible in git history with the fixes underneath."


23 May 20:38

ahmedEid1

v0.5.0-trigger-cloud-deploy

08d2f15

v0.5.0 — Trigger.dev Cloud production deploy

Atlas is fully managed in the cloud 🚀

After today's session, Atlas runs entirely on managed infrastructure with zero dependency on Ahmed's machine being on. Trigger.dev's 3 tasks (parse-pdf, summarize-paper, run-review) are deployed to Trigger.dev Cloud production and execute autonomously.

Four real blockers resolved

Trigger.dev CLI version mismatch — pinned all @trigger.dev/* deps to exact 4.4.6 (caret ranges trip the "CI mode" version check)
Issue triggerdotdev/trigger.dev#1843 — Python extension's auto-generated Dockerfile mis-routes nested `requirements.txt` paths. Workaround: `requirements.txt` moved to project root + trigger config updated.
Eager env validation at module import crashed the Trigger.dev deploy indexer (runs in a sandboxed container without project env vars). Refactored `lib/env.ts`, `lib/db.ts`, and `lib/object-store.ts` to lazy Proxy + factory patterns: importing the modules no longer touches env vars, while real env access at runtime still throws clearly if vars are missing.
Test fixtures updated — 2 env tests previously asserted on import-time throw; now assert on first property access.
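A minimal sketch of the lazy Proxy idea, assuming a small set of required variables. The `Env` type, the variable names, and `createLazyEnv` are hypothetical; the actual `lib/env.ts` may validate differently.

```typescript
// Hypothetical set of required variables, for illustration only.
type Env = { DATABASE_URL: string; R2_BUCKET: string };
const REQUIRED_KEYS: ReadonlyArray<keyof Env> = ["DATABASE_URL", "R2_BUCKET"];

// Creating the proxy reads nothing, so importing a module that exports it is
// safe even where no env vars exist. Validation is deferred to property access.
function createLazyEnv(source: Record<string, string | undefined>): Env {
  return new Proxy({} as Env, {
    get(_target, prop) {
      if (typeof prop !== "string" || !REQUIRED_KEYS.includes(prop as keyof Env)) {
        return undefined;
      }
      const value = source[prop];
      if (value === undefined || value === "") {
        throw new Error(`Missing required env var: ${prop}`);
      }
      return value;
    },
  });
}

// Typical use in an env module (assumed): export const env = createLazyEnv(process.env);
```

This also matches the test change described above: a test can no longer assert that importing the module throws, and instead asserts that the first property access does.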

Stack — completely managed, $0/month

                                                Atlas (master)
                                                       │
                          ┌────────────────────────────┼────────────────────────────┐
                          ▼                            ▼                            ▼
              Vercel Hobby (Next.js)         Trigger.dev Cloud              Neon Postgres
              atlas-sooty-delta.vercel.app   3 deployed tasks               eu-central-1
                                              parse-pdf, summarize-paper,
                                              run-review
                                                       │
                          ┌────────────────────────────┼────────────────────────────┐
                          ▼                            ▼                            ▼
                  Cloudflare R2              Langfuse Cloud                  Groq (default)
                  10 GB free                 50K obs/mo                      free tier
                                                                              + Claude Agent SDK
                                                                              local for evals

Engineering

155 tests passing
Lazy env pattern is broadly useful (multi-context safety, not just for Trigger.dev)
$0 lifetime spend on this milestone

Next

E2E smoke test on the live deploy: sign up → upload PDF → run review (now actually possible)
M3.5c: Self-host fallback docs


23 May 20:16

ahmedEid1

v0.4.3-cite-check-relaxed

86eb678

v0.4.3 — cite_check relaxation + full 12-row eval baseline

Full 12-row eval baseline 🎯

Relaxes `CiteCheckPerCitationSchema.reason` max from 500 → 1500 chars (the model occasionally needs more room for nuanced cite-check explanations). With this, all 3 synthetic goldens complete a full eval run.

Live results via Claude Agent SDK (Claude Code CLI session, $0)

Question	Recall	Precision	Faithfulness	Coverage
000-tdd-web-frameworks	100%	100%	43%	0%
001-llm-code-review-security	25%	100%	29%	0%
002-rag-architecture-patterns	100%	100%	42%	0%

Public dashboard: https://atlas-sooty-delta.vercel.app/evals

100% precision across all 3 = retriever doesn't include irrelevant papers. Mixed recall = retriever is conservative on harder questions. 0% expected-claim-coverage = drafter paraphrases (real signal the harness surfaces).

Engineering

155 tests passing
$0 spend (Claude Max via Agent SDK + CLI auth)
12 EvalRun rows tagged with commit `883ecd5` in Neon

Next

v0.5.0: Trigger.dev Cloud deploy via WSL (proper production worker for the live deploy, no more local-dev-worker dependency)
M3.5c: Self-host fallback docs


23 May 19:48

ahmedEid1

v0.4.2-claude-agent-provider

883ecd5

v0.4.2 — Claude Agent SDK provider + first live eval baseline

First live eval baseline 🎯

The /evals dashboard at https://atlas-sooty-delta.vercel.app/evals now shows real data from a Claude Max subscription via the Agent SDK — $0 spend.

What landed

New provider: LLM_PROVIDER=claude-agent routes Atlas through @anthropic-ai/claude-agent-sdk. When ANTHROPIC_API_KEY is unset, the SDK auto-detects the local Claude Code CLI session — so Claude Max subscribers can run Atlas evals for free.

Adapter (lib/llm/providers/claude-agent.ts):

Single-shot non-interactive calls (no tools, maxTurns: 1)
Uses Zod v4's built-in z.toJSONSchema() to render the actual schema in the prompt
Explicit "CRITICAL OUTPUT CONTRACT" preamble naming common JSON output failure modes
Strips markdown code fences if the model wraps; validates with Zod
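The fence-stripping step in the adapter list above can be sketched as below. This is an illustration only: `stripCodeFences` and `parseModelJson` are hypothetical names, and a plain type guard stands in for the Zod validation so the sketch has no dependencies.

```typescript
const FENCE = "`".repeat(3); // three backticks, built at runtime

// If the whole response is wrapped in one fenced block (with an optional
// language tag on the opening line), return only the inner text.
function stripCodeFences(raw: string): string {
  const text = raw.trim();
  if (!text.startsWith(FENCE) || !text.endsWith(FENCE)) return text;
  const firstNewline = text.indexOf("\n");
  if (firstNewline === -1) return text; // no body line to extract
  return text.slice(firstNewline + 1, text.length - FENCE.length).trim();
}

// Parse the cleaned text, then validate. The guard parameter stands in for
// schema validation (the adapter described above uses Zod for this step).
function parseModelJson<T>(raw: string, isValid: (value: unknown) => value is T): T {
  const parsed: unknown = JSON.parse(stripCodeFences(raw));
  if (!isValid(parsed)) throw new Error("Model output failed schema validation");
  return parsed;
}
```

Stripping before parsing matters because `JSON.parse` rejects the fence markers outright, so an otherwise valid object would be reported as malformed.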

Bypass branch in runLLM: when provider is claude-agent, the dispatcher skips Vercel AI SDK's generateObject and calls the adapter directly. Dynamic import means the bundled Claude Code binary only loads when actually needed (production stays on Groq, never touches it).

First live eval results

Question	Recall	Precision	Faithfulness	Coverage
001-llm-code-review-security	25%	100%	47%	0%
002-rag-architecture-patterns	40%	100%	45%	0%

100% precision shows the retriever doesn't include irrelevant papers. Low recall shows it's conservative (skips some expected). 0% expected-claim-coverage shows the drafter paraphrases away from exact phrasings — surfaced as a real signal by the harness.

Known limitations

Local-dev only: GitHub Actions CI containers have no Claude Code auth, so CI still needs a Groq/OpenAI/Anthropic API key (production is unchanged and stays on the Groq default)
Q0 (000-tdd-web-frameworks) failed on a separate Zod constraint: CiteCheckPerCitationSchema.reason.max(500) was hit by a long model response. Filed as a v0.4.3 follow-up (just relax to .max(1500))
June 15, 2026: Anthropic introduces a separate "Agent SDK credit" pool in Max plans (currently shares interactive limits)

Engineering

155 tests passing (148 baseline + 5 adapter + 2 tier)
pnpm tsc --noEmit + pnpm lint clean
Zero ANTHROPIC_API_KEY required for local eval runs
8 EvalRun rows in Neon — dashboard live with real data

Next

v0.4.3: relax cite_check.reason max so Q0 can complete the full 12-row baseline
v0.4.2-m4b-expand: 7 more real-paper golden questions
M3.5c: Self-host fallback docs (Oracle Cloud Always Free)
M5: Authenticated MCP server (OAuth 2.1)


23 May 17:55

ahmedEid1

v0.4.1-m4b

e36bbb5

v0.4.1-m4b — Eval harness + public dashboard

M4b — Atlas is measurable

Atlas now has a self-contained eval harness with 3 hand-curated synthetic golden SLR questions, run on every push + nightly, with a public dashboard at https://atlas-sooty-delta.vercel.app/evals.

Metrics

Citation recall — % of expected papers Atlas included
Citation precision — % of Atlas's included papers that were expected
Claim faithfulness — % of in-draft citations cite_check verified as SUPPORTED
Expected-claim coverage — % of expected claims found in Atlas's draft (case-insensitive substring)
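The four definitions above translate into simple ratios. The sketch below is one possible reading, not the harness code: the input shapes and `scoreRun` are hypothetical, and returning 1 for an empty denominator is an assumption about how such cases are scored.

```typescript
// Hypothetical shapes for a golden question and one run's output.
type Golden = { expectedPapers: string[]; expectedClaims: string[] };
type RunOutput = { includedPapers: string[]; draft: string; verdicts: string[] };

// Assumption: an empty denominator scores 1 (nothing to miss).
function ratio(hits: number, total: number): number {
  return total === 0 ? 1 : hits / total;
}

function scoreRun(golden: Golden, run: RunOutput) {
  const expected = new Set(golden.expectedPapers);
  const included = new Set(run.includedPapers);
  const overlap = Array.from(included).filter((id) => expected.has(id)).length;

  // Coverage: case-insensitive substring match of each expected claim.
  const draftLower = run.draft.toLowerCase();
  const claimsFound = golden.expectedClaims.filter((claim) =>
    draftLower.includes(claim.toLowerCase()),
  ).length;

  const supported = run.verdicts.filter((v) => v === "SUPPORTED").length;

  return {
    citationRecall: ratio(overlap, expected.size),
    citationPrecision: ratio(overlap, included.size),
    claimFaithfulness: ratio(supported, run.verdicts.length),
    expectedClaimCoverage: ratio(claimsFound, golden.expectedClaims.length),
  };
}
```

The substring rule makes coverage strict: a draft that states an expected claim in different words scores zero for that claim even when the content is right.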

Engineering

Headless graph runner (lib/eval/headless-runner.ts) drives Atlas's M3+M4a LangGraph in-process
HITL gates auto-approved for evals only (production unchanged)
CorpusItems seeded from inline YAML markdown (skips marker-pdf + summarisation for eval speed)
GitHub Actions: push + nightly + 10% regression gate (scripts/check-eval-regression.ts)
Server-rendered Next.js dashboard, no client-side framework
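A regression gate of the kind listed above could be sketched like this. It assumes scores in the range 0 to 1 and treats the 10% threshold as an absolute drop in score; whether scripts/check-eval-regression.ts measures the drop absolutely or relatively is not stated here, and `findRegressions` is a hypothetical name.

```typescript
type Scores = Record<string, number>; // metric name -> score in [0, 1]

// Returns one message per metric that dropped by more than `maxDrop`
// (absolute) relative to the baseline. An empty array means the gate passes.
function findRegressions(baseline: Scores, current: Scores, maxDrop = 0.1): string[] {
  const failures: string[] = [];
  for (const metric of Object.keys(baseline)) {
    const before = baseline[metric];
    const after = current[metric];
    if (after === undefined) {
      failures.push(`${metric}: missing from current run`);
      continue;
    }
    const drop = before - after;
    // Small epsilon so a drop of exactly the threshold is not failed by
    // floating-point noise.
    if (drop > maxDrop + 1e-9) {
      failures.push(`${metric}: dropped ${(drop * 100).toFixed(1)} points`);
    }
  }
  return failures;
}
```

In a CI script, a non-empty result would typically be printed and followed by a non-zero exit code so the workflow fails.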

Known issue

The dashboard currently shows the empty-state because the first live eval run is blocked on vercel/ai#12187 — the Vercel AI SDK's generateObject doesn't reliably get strict JSON from Gemini Flash on the free tier. The harness itself works correctly; once the upstream bug is fixed OR LLM_PROVIDER is switched to a non-Gemini provider (Groq free tier, Anthropic paid, OpenAI paid), evals will run end-to-end. Documented at the top of evals/README.md.

Scorecard

148+ tests passing, all unit/integration LLM calls mocked
tsc --noEmit + pnpm lint clean
$0 spend (Gemini free tier when working; alternative providers available)

Next

v0.4.2-m4b-expand: 10 real-paper golden questions (deferred for time)
M5: Authenticated MCP server (OAuth 2.1) published to MCP registry
M6: Public launch + recruiter 1-pager + blog series


23 May 16:40

ahmedEid1

v0.4.0-m4a

8d54baf

v0.4.0-m4a — Critic + cite_check

M4a — Quality gates added

After M4a, every Atlas review gets a critic pass (LLM-as-judge with up to one revision loop) and a cite_check post-pass that verifies every [paper_id] citation is actually supported by the cited paper.
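The "up to one revision loop" control flow can be illustrated with a small bounded loop. Everything here is an assumption made for the sketch: the `Critique` shape, the pass threshold, and the injected `critic` and `revise` functions are hypothetical and stand in for the real LangGraph nodes.

```typescript
type Critique = { score: number; feedback: string };

async function reviewWithCritic(
  initialDraft: string,
  critic: (draft: string) => Promise<Critique>, // assumed LLM-as-judge call
  revise: (draft: string, feedback: string) => Promise<string>, // assumed drafter call
  passScore: number, // assumed threshold that decides whether to revise
  maxRevisions = 1,
): Promise<{ draft: string; critique: Critique; revisions: number }> {
  let draft = initialDraft;
  let critique = await critic(draft);
  let revisions = 0;
  // Revise only while the score is below the bar and the budget allows it.
  while (critique.score < passScore && revisions < maxRevisions) {
    draft = await revise(draft, critique.feedback);
    critique = await critic(draft);
    revisions += 1;
  }
  return { draft, critique, revisions };
}
```

Capping the loop bounds the number of LLM calls per review: with `maxRevisions = 1` there are at most two critic calls and one revision.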

What changed

lib/agent/nodes/critic.ts — rubric-scored judge node
lib/agent/nodes/cite-check.ts — per-citation LLM verification
lib/agent/cite-extract.ts — pure parser for [paper_id] mentions
New ClaimCheck table; Run.critiqueScore + Run.faithfulnessScore aggregates
UI: critic + faithfulness widgets in run workspace
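A pure parser for [paper_id] mentions, in the spirit of lib/agent/cite-extract.ts, might look like the sketch below. The ID character set (letters, digits, underscore, hyphen) and the rule that skips markdown links are assumptions, as are the names `extractCitations` and `CitationMention`.

```typescript
type CitationMention = { paperId: string; index: number };

// Assumed ID format: letters, digits, "_" and "-". The negative lookahead
// skips markdown links such as [text](url).
const CITATION_SOURCE = "\\[([A-Za-z0-9_-]+)\\](?!\\()";

function extractCitations(markdown: string): CitationMention[] {
  // A fresh regex per call avoids shared lastIndex state between invocations.
  const pattern = new RegExp(CITATION_SOURCE, "g");
  const mentions: CitationMention[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    mentions.push({ paperId: match[1], index: match.index });
  }
  return mentions;
}

// Distinct paper IDs in order of first appearance.
function uniquePaperIds(markdown: string): string[] {
  return Array.from(new Set(extractCitations(markdown).map((m) => m.paperId)));
}
```

Keeping the parser pure (string in, mentions out) means it can be unit-tested without mocking any LLM call, and the verification node only has to iterate over its output.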

Engineering

119+ tests passing, all LLM calls mocked
tsc --noEmit clean
Zero real LLM calls in CI; budget unchanged at $0
Critic + cite_check run on free Gemini quota in production

Next

M4b: Eval harness v1 (10 golden SLR questions + Promptfoo + GitHub Actions + public /evals dashboard)
M5: Authenticated MCP server


23 May 15:33

ahmedEid1

v0.3.6-m3.5b

6c13277

v0.3.6-m3.5b — Cloud deploy LIVE

🚀 Atlas is live

Demo: https://atlas-sooty-delta.vercel.app

M3.5b ships the full free-tier production stack. Atlas is now reachable from anywhere with zero monthly spend.

Stack ($0/month total)

Layer	Service	Free tier
App host	Vercel Hobby	Generous; Next.js 16 native
Database	Neon Postgres (eu-central-1)	0.5 GB storage
Object store	Cloudflare R2	10 GB, zero egress fees
Background jobs	Trigger.dev Cloud	500K runs/month
Auth	Clerk Cloud	10K MAU
Tracing	Langfuse Cloud	50K observations/month
LLM	Google Gemini 2.5 Flash	1500 req/day

Engineering wins

@prisma/adapter-pg → @prisma/adapter-neon swap for serverless-friendly Neon connections
postinstall: prisma generate so Vercel builds find the gitignored Prisma client
Cloudflare R2 works with zero code change to Atlas's existing lib/object-store.ts — only env values flip
Clerk signed webhooks verified end-to-end against the live URL (User row landed in Neon within 1.2s of clicking Send Example in Clerk dashboard)
4 re-runnable smoke scripts: scripts/verify-{gemini,neon,r2,langfuse}.ts

Verified by

pnpm tsx scripts/verify-gemini.ts    # ✓ Gemini ($0)
pnpm tsx scripts/verify-neon.ts      # ✓ All 8 tables queryable
pnpm tsx scripts/verify-r2.ts        # ✓ R2 round-trip + signed URL
pnpm tsx scripts/verify-langfuse.ts  # ✓ Trace landed in Langfuse Cloud

Next

M3.5c: Self-host fallback docs (Oracle Cloud Always Free)
M4: Critic + cite_check + eval harness v1 + public /evals dashboard

What still needs polishing (not blocking)

/dashboard returns 404 for unauthenticated users instead of redirecting to /sign-in (Clerk v7's intentional auth.protect() behavior; one-line fix later)
Real end-to-end review smoke (sign up → create project → upload PDF → run review) not yet performed against the live deploy

