
Explodable


An AI content engine that produces analytical essays about B2B buyer psychology, grounded in a structured knowledge base of 305 behavioral-science findings.

Live demo · Eval methodology writeup · explodable.io


Built entirely through AI pair programming with Claude Code. The pipeline architecture, the evaluation harness, the knowledge graph, and the 20 research reports grounding the design decisions were all produced through agentic engineering — directing AI systems to implement, then measuring whether the output is actually good.


What it produces

The VP of Operations at a $25M logistics company stares at three SaaS demos on her laptop at 4:47 PM on a Thursday. She's been in meetings since 7 AM, the CEO is asking for a vendor recommendation by Friday, and her team of eight stakeholders can't agree on basic requirements. She closes the laptop and defaults to "let's table this until Q2" — the same decision she made last quarter.

The mid-market sits in a decision-making dead zone. Too big for intuitive founder choices. Too small for enterprise decision infrastructure. They don't decide with logic first, then hire fear to audit. They decide with fear first, then hire logic to testify.

"The Mid-Market Trap", generated by the hybrid pipeline from KB findings on scarcity cognition, committee dynamics, and loss aversion

Key findings

Thesis-as-structural-schema is the biggest quality driver. Encoding the thesis ("Buyers don't decide with logic. They decide with fear, then hire logic to testify.") as a structural constraint in the outline stage — where each section must instantiate the fear→testimony mechanism — produced an 8-point improvement over standard prompting. Validated at N=50.
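A minimal sketch of what that constraint can look like in code (names and shapes here are hypothetical; the real version lives in the thesis_outline stage of the experimental pipeline):

# Hypothetical sketch: encode the thesis as a structural constraint, not a style hint.
# Each outline section must name the fear that drives the decision and the
# "logic" the buyer recruits afterward to justify it.

THESIS_SCHEMA = {
    "thesis": "Buyers decide with fear first, then hire logic to testify.",
    "section_slots": ["fear_commit", "logic_recruit", "testimony_deploy"],
}

def validate_outline(outline: list[dict]) -> list[str]:
    """Return violations; an empty list means the outline instantiates the thesis."""
    errors = []
    for i, section in enumerate(outline):
        for slot in THESIS_SCHEMA["section_slots"]:
            if not section.get(slot):
                errors.append(f"section {i}: missing '{slot}'")
    return errors

outline = [{"title": "The Mid-Market Trap",
            "fear_commit": "helplessness under committee deadlock",
            "logic_recruit": "ROI spreadsheet built after the gut call",
            "testimony_deploy": "vendor scorecard presented to the CEO"}]
assert validate_outline(outline) == []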

CAG (full-context stuffing) fails for long-form generation. Tested in a controlled 3-way bake-off: CAG scored 26.3 vs Wiki's 32.0 vs the existing retrieval pipeline's 32.6. Published CAG evaluations cover QA tasks; this is a negative result on a long-form generation task.
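For context, the CAG arm is architecturally trivial: no retrieval step, the whole compiled KB rides along in a cached system prompt on every call. A rough sketch with the Anthropic SDK (the cache file path and prompt are illustrative):

import anthropic

# CAG arm of the bake-off, roughly: the compiled KB is stuffed into the
# system prompt; prompt caching keeps repeated calls affordable.
client = anthropic.Anthropic()
kb_text = open("compiled/cag_cache.txt").read()  # path illustrative

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[{
        "type": "text",
        "text": kb_text,
        "cache_control": {"type": "ephemeral"},  # cache the stuffed context
    }],
    messages=[{"role": "user", "content": "Draft the essay for topic T3."}],
)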

Cross-domain delta, decomposed. On one cross-domain topic (T3: B2B vendor lock-in × religious conversion, N=1):

  • Production retrieval pipeline: 23.0
  • Wiki-style index scanning (Pipeline C bake-off): 32.0 (+9)
  • Full hybrid pipeline (Wiki + graph expansion + thesis-constrained outline): 36.0 (+13)

The +9 Wiki-alone delta is from the Phase 1 bake-off; the +13 hybrid delta is from the Phase 2 smoke test. An N=50 replication showed cross-domain topics scored 32.2 on average with a wide confidence interval (27.7–36.7). Single-topic deltas overstate the N=50 effect; the improvement is real but noisier than the T3 result suggests.

Evaluation methodology

Quality is measured by a 10-criterion LLM-as-judge rubric covering thesis clarity, evidence integration, integrative thinking, counterargument handling, and six other dimensions. The judge was calibrated against a 7-model editorial panel (Gemini 2.5 Pro, Grok 4, DeepSeek V3, Mistral Large, GPT-5, Claude Deep Research, Qwen3).

  • Spearman ρ = 0.841 (Claude Opus 4, claude-opus-4-20250514) / ρ = 0.782 (Claude Sonnet 4, claude-sonnet-4-20250514) against the 5-model tight cluster
  • Validated across 50 test topics spanning dense, medium, sparse, cross-domain, and out-of-distribution conditions
  • Every architectural decision in the pipeline was driven by measured results, not assumptions
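The calibration statistic itself is just rank correlation between judge totals and panel-consensus totals over the test topics. A sketch of that computation (the scores shown are illustrative, not the real data):

from scipy.stats import spearmanr

# judge_scores: rubric totals from the LLM judge, one per test topic
# panel_scores: consensus totals from the 5-model tight cluster, same order
judge_scores = [32.6, 26.3, 36.0, 29.5]   # illustrative values
panel_scores = [33.0, 25.0, 35.5, 30.0]

rho, p_value = spearmanr(judge_scores, panel_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")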

Calibration limit: the 5-model cluster was derived by dropping 2 outlier models (Claude Deep Research, Qwen3) after seeing disagreement. A pre-registered protocol would specify drop criteria in advance. The ρ = 0.841 figure is real but inflated by post-hoc selection; full disclosure in the methodology writeup.

Full eval methodology writeup →

Architecture

Two pipelines share one knowledge base. The production pipeline runs content generation via Celery; the experimental (hybrid) pipeline is used for measurement runs and ablation studies. Full details and file:line references in docs/architecture.md.

Production pipeline (src/content_pipeline/graph.py)

calendar_trigger → kb_retriever → content_selector
     ├─ outline_generator ──── hitl_gate (outline review)
     └─ standalone_post_generator
          │
draft_generator ─── voice profile, inline citation markers
     │
bvcs_scorer ────── voice compliance; auto-revise loop if <70, max 3
     │
hitl_gate (draft review) → publisher → END
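The bvcs_scorer step above is a plain score-and-retry gate. A minimal sketch, with generate/score/revise standing in for the pipeline's LLM-backed stages (names hypothetical):

BVCS_THRESHOLD = 70
MAX_REVISIONS = 3

def draft_with_voice_gate(topic, generate, score, revise):
    """Auto-revise loop: rescore after each revision, up to 3 passes."""
    draft = generate(topic)
    for _ in range(MAX_REVISIONS):
        result = score(draft)          # voice-compliance rubric, 0-100
        if result.total >= BVCS_THRESHOLD:
            break
        draft = revise(draft, result.feedback)
    return draft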

Retrieval includes graph expansion: top-5 semantic results become PPR seeds, one-hop walk via typed relationships, neighbors scored by seed_score × relationship_weight × edge_confidence.
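A sketch of that one-hop scoring rule as described (the adjacency structure is assumed; the production code uses Python igraph for the PPR seeding):

def expand(seeds, graph, top_k=10):
    """One-hop graph expansion from semantic-search seeds.

    seeds: (finding_id, seed_score) pairs from the top-5 semantic results.
    graph[node]: (neighbor, relationship_weight, edge_confidence) triples
    over the typed edges (supports, extends, qualifies, ...). Shapes assumed.
    """
    scored = {}
    for node, seed_score in seeds:
        for neighbor, rel_weight, edge_conf in graph.get(node, []):
            score = seed_score * rel_weight * edge_conf
            scored[neighbor] = max(scored.get(neighbor, 0.0), score)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]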

Experimental pipeline (src/content_pipeline/experimental/hybrid_graph.py)

topic_router ──── [wiki_selector | vector_retriever | graph_walker]
     │
graph_expander ── PPR + MMR diversity reranking (explicit stage)
     │
thesis_outline ── Architecture B: fear-commit → logic-recruit → testimony-deploy
     │
draft_generator → bvcs_scorer → adversarial_critic → revision_gate → publisher
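The MMR diversity reranking in graph_expander trades relevance against redundancy so the expanded set doesn't collapse into near-duplicates. A standard formulation over unit-normalized embeddings (the λ value is illustrative, not the shipped setting):

import numpy as np

def mmr(query_vec, doc_vecs, k=8, lam=0.7):
    """Maximal Marginal Relevance over unit-normalized embedding rows."""
    sim_q = doc_vecs @ query_vec                  # relevance to the topic
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # penalize similarity to anything already selected
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected),
                             default=0.0)
            return lam * sim_q[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected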

The experimental graph is where architecture changes are measured against the calibrated judge on the N=50 test set before being promoted. See src/content_pipeline/experimental/README.md for the production/experimental boundary and promotion criteria.

The knowledge base

305 behavioral-science findings organized by:

  • 5 root anxiety categories — helplessness, insignificance, isolation, meaninglessness, mortality
  • 7 Panksepp affective circuits mapped to buyer behavior
  • 763 typed relationships — supports, extends, qualifies, reframes, subsumes, contradicts
  • 24 cultural domains — from competitive systems (110 findings) to heroism (6)

85% of relationship edges connect findings across different cultural domains. Most RAG systems retrieve semantically similar documents; this one retrieves across taxonomic distance on purpose, because B2B buyer psychology requires pulling from unrelated domains (religious conversion, addiction architecture, competitive sports) to produce non-obvious synthesis. That's the architectural bet — and the cross-domain delta on T3 (single-topic, N=1) is the first evidence it pays off; the N=50 replication shows the effect persists with wide variance.
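In relational terms the KB is essentially two tables. A hedged sketch of the shape (field names are assumed; the actual models live in src/kb/):

from dataclasses import dataclass

@dataclass
class Finding:
    id: int
    summary: str
    anxiety_root: str       # one of the 5: helplessness, insignificance, ...
    affective_circuit: str  # one of the 7 Panksepp circuits
    cultural_domain: str    # one of the 24 domains
    # the embedding lives in pgvector alongside these columns

@dataclass
class Relationship:
    source_id: int
    target_id: int
    rel_type: str           # supports | extends | qualifies | reframes | subsumes | contradicts
    confidence: float       # the edge_confidence used in graph-expansion scoring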

Built with

  • LangGraph — pipeline orchestration with conditional routing and state persistence
  • Claude Sonnet 4 (claude-sonnet-4-20250514) — generation via Anthropic API with prompt caching
  • Claude Opus 4 (claude-opus-4-20250514) — judge calibration
  • Gemini 2.5 Flash — adversarial critique (different model family, free tier)
  • PostgreSQL 16 + pgvector — knowledge base, embeddings, materialized views
  • Python igraph — Personalized PageRank for graph expansion
  • Redis — task queue (Celery)
  • Supabase — cloud KB hosting for the live demo

Project structure

src/content_pipeline/     the engine — graph definitions, retrieval,
                          outline, drafting, evaluation, critique
src/kb/                   knowledge base models, embeddings, ingestion
config/                   voice profiles, rubrics, domain configs
scripts/                  evaluation, compilation, testing
docs/                     architecture, results, eval methodology
docs/research/            20 research reports grounding design decisions
demo/                     Streamlit app (explodable.streamlit.app)

Running it locally

Requires PostgreSQL (with pgvector), Redis, and API keys for Anthropic + Google AI (Gemini).

# Start infrastructure
docker compose -f docker/docker-compose.yml up -d

# Compile read models (wiki index, CAG cache, materialized views)
python scripts/compile_read_models.py

# Run the hybrid pipeline on a topic
python scripts/phase2_smoke_hybrid.py --topic T14

# Score a draft against the calibrated judge
python scripts/score_drafts_and_calibrate.py

# Run the execution gate (checks for scaffolding leaks)
python scripts/check_export_gate.py drafts/

The eval harness, repurposed for code quality

The same evaluation harness that scores long-form essays also scores Python code quality. Methodology transferred without modification: structured rubric in YAML, LLM judge with per-criterion scoring, anchor exemplars at 1/3/5 levels, weighted criteria, veto rules. Six criteria grounded in Clean Code (Martin), A Philosophy of Software Design (Ousterhout), PEP 8/257, and Google's Python Style Guide — naming clarity, readability & structure, architectural fit, documentation quality, error handling, testability.
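A sketch of what one criterion of such a rubric looks like, with the weighted-sum-plus-veto scoring applied in Python (field names and the exact veto semantics are assumptions, not the shipped config):

import yaml

RUBRIC_YAML = """
criteria:
  - name: naming_clarity
    weight: 0.2
    anchors:
      1: "Names require reading the body to decode."
      3: "Mostly descriptive; a few abbreviations leak intent."
      5: "Every name states role and scope; no decoding needed."
"""

rubric = yaml.safe_load(RUBRIC_YAML)

def total(per_criterion: dict[str, int]) -> float:
    """Weighted sum with a veto rule applied (structure assumed)."""
    if min(per_criterion.values()) <= 1:
        return 0.0  # veto: a floor score on any criterion fails the artifact
    return sum(c["weight"] * per_criterion[c["name"]]
               for c in rubric["criteria"])

# e.g. total({"naming_clarity": 4}) == 0.8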

Proposed as a feature contribution to Braintrust's autoevals.

The code judge itself is not yet calibrated against human reviewers — the essay judge's ρ = 0.841 against the 5-model panel does not transfer automatically to the code domain. Calibration against human labels is the next piece of work.

Research foundation

20 research reports in docs/research/ ground the engineering decisions.

Each architectural decision in the content pipeline traces to a specific research finding; the code-judge reports document how the methodology transfers to Python code quality.

What's next

  • Cross-domain application — apply the eval harness methodology to a third domain (legal memos, medical note quality, or research paper review) to validate generalizability beyond essays and Python code.
  • Standalone eval harness package — extract the judge + multi-model calibration protocol into an installable package: pip install, give it a rubric and drafts, get calibrated scores (hypothetical API sketched below).
  • Human-validated calibration for the code judge — close the gap between "the methodology transferred" and "the judge is calibrated" via 10–20 human rankings on a code-quality subset.
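A purely hypothetical sketch of that envisioned package surface; none of these names exist yet:

# Hypothetical API for the extracted package; nothing here exists today.
from eval_harness import Judge, calibrate

judge = Judge(rubric="config/code_quality_rubric.yaml",
              model="claude-opus-4-20250514")
scores = judge.score_all("drafts/")
report = calibrate(scores, panel_scores="panel_consensus.json")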
