ReleaseLens takes a Python packaging PEP and answers a single question: is this specification actually implemented, where, and what does it mean for the codebase that depends on it? It decomposes each PEP into atomic spec-claims, hunts for implementation evidence across pip, uv, and PyPI Warehouse, authors and self-critiques verification tests for the survivors, and renders a Markdown impact report against a target codebase resolved through a pluggable registry connector. An example Markdown report from a pipeline run against PEP-658 can be seen here.
Under the hood it's a multi-agent pipeline orchestrated by LangGraph with a tiered evidence ladder (static grep → changelog archaeology → behavioural probe in a sandboxed venv), an evaluator-optimizer loop for test authoring with asymmetric model routing (Nova Pro author, Nova Lite critic), SQLite-checkpointed state with resume, end-to-end Langfuse tracing through three callback seams, deterministic cassette replay so CI never burns tokens, and a graded eval harness that emits feature/evidence F1 per PEP and pushes them back to Langfuse alongside the run trace.
Where it is today: PEP ingest, feature decomposition, the evidence ladder, test authoring/critique, eval scoring, and Markdown report rendering all run end-to-end against bundled PEPs. The registry-connector layer that resolves a real target codebase is still synthetic: `DevpiPublicConnector` returns canned `ResolvedTarget`s, so today's impact reports describe a stub package rather than your project. The evidence-tool calibration that drives F1 against ground truth is also still weak; the scoring is correct, but the prompts and thresholds aren't tuned yet. See Status for the line-by-line breakdown.
For the full system spec, see docs/architecture.md. For code-generation conventions, see AGENTS.md.
```mermaid
flowchart TD
%% ReleaseLens · LangGraph · ● Pro (gen) · ○ Lite (classify)
START([START]):::se
START ==>|"⚡ per PEP id"| PI1[pep_ingest]:::det
START ==> PI2[pep_ingest]:::det
PI1 & PI2 --> J1{{after_pep_ingest}}:::join
J1 ==>|"⚡ per PEP id"| FE1("feature_extract ● Pro"):::pro
J1 ==> FE2("feature_extract ● Pro"):::pro
FE1 & FE2 --> J2{{after_feature_extract}}:::join
subgraph LOOP ["evaluator–optimizer loop · budget = 2"]
direction LR
TA("test_author ● Pro"):::pro -->|differential test| CR("critic ○ Lite"):::lite
CR -.->|"rubric feedback"| TA
end
J2 ==>|"⚡ per SpecClaim"| TA
CR ==> T_OK([accepted]):::ok
CR ==> T_EX([budget_exhausted]):::bad
CR ==> T_UN([unverifiable]):::neu
T_OK & T_EX & T_UN --> TAG[test_authoring_aggregate]:::det
TAG --> J3{{after_test_authoring}}:::join
subgraph LADDER ["evidence ladder · cost-gated · NOT a loop"]
direction TB
ES["① evidence_static · ripgrep + ○ Lite"]:::lite
EC["② evidence_changelog · GitHub"]:::det
EP["③ evidence_probe · uv sandbox · no LLM"]:::probe
EA[evidence_aggregate]:::det
ES ==>|"conf ≥ 0.8 ✓"| EA
ES -.->|"conf < 0.8 ↓"| EC
EC ==>|"conf ≥ 0.8 ✓"| EA
EC -.->|"conf < 0.8 ↓"| EP
EP --> EA
end
J3 ==>|"⚡ per (feature × tool)"| ES
EA --> MB[matrix_build]:::det
MB ==>|"⚡ per feature"| V1[verify]:::det
MB ==> V2[verify]:::det
V1 --> IS1("impact_scope ● Pro"):::pro
V2 --> IS2("impact_scope ● Pro"):::pro
IS1 & IS2 --> RR[report_render]:::det
RR --> END_N([END]):::se
subgraph SIDE ["cross-cutting · backgrounded"]
direction LR
TOOLS["tools · ripgrep · PyGithub · ChromaDB · uv · diff runner"]:::side
OBS["Langfuse spans · nodes · LLM calls · tools"]:::side
DUR["SqliteSaver · checkpoint & resume"]:::side
end
END_N ~~~ SIDE
classDef se fill:#0f172a,stroke:#e2e8f0,color:#f8fafc
classDef pro fill:#1e1b4b,stroke:#818cf8,color:#e0e7ff
classDef lite fill:#0e4f5c,stroke:#22d3ee,color:#cffafe
classDef det fill:#14532d,stroke:#86efac,color:#dcfce7
classDef probe fill:#7c2d12,stroke:#fdba74,color:#fed7aa
classDef join fill:#78350f,stroke:#fbbf24,color:#fef3c7
classDef ok fill:#166534,stroke:#4ade80,color:#f0fdf4
classDef bad fill:#7f1d1d,stroke:#f87171,color:#fef2f2
classDef neu fill:#1f2937,stroke:#9ca3af,color:#f3f4f6
classDef side fill:#fafafa,stroke:#d4d4d8,color:#71717a
style LOOP fill:#1e293b,stroke:#a5b4fc,color:#f1f5f9
style LADDER fill:#1c1917,stroke:#fdba74,color:#fafaf9
style SIDE fill:#fafafa,stroke:#e4e4e7,color:#71717a
```
```bash
make setup                             # uv sync --all-extras
uv run releaselens --help
uv run releaselens run --pep-ids 658
```

The `run` command drives the pipeline end-to-end against a bundled PEP-658 fixture and writes a Markdown report under `reports/`. The path is printed on completion alongside the `run_id` (use it with `releaselens resume <run_id>` to replay from the SQLite checkpoint at `.releaselens/checkpoints.db`).
The pipeline runs end-to-end without any of the dependencies below (each evidence node records a "skipped" / "no artefacts" note and the escalation ladder takes over), but you'll only see real source references when they're present.
- `ripgrep` on `PATH` (used by `evidence_static`). Not a Python package, so it can't go in `pyproject.toml`. Install with the platform package manager:

  ```bash
  brew install ripgrep      # macOS
  apt-get install ripgrep   # Debian/Ubuntu
  ```
- Local source clones of the tools `evidence_static` should grep. Defaults are `data/sources/{pip,uv,warehouse}/`; override the parent dir with `RELEASELENS_SOURCES_DIR=/path/to/clones`.

  ```bash
  mkdir -p data/sources
  git clone --depth 1 https://github.com/pypa/pip data/sources/pip
  git clone --depth 1 https://github.com/astral-sh/uv data/sources/uv
  git clone --depth 1 https://github.com/pypi/warehouse data/sources/warehouse
  ```
- `GITHUB_TOKEN` in the environment (used by `evidence_changelog` for commit + release archaeology). Unauthenticated GitHub search is rate-limited to ~10 req/min and blocks the live run with cascading 403s; an authenticated token raises the limit to 30 req/min and is more than enough for the demo. A read-only personal access token suffices.

  ```bash
  export GITHUB_TOKEN=ghp_xxx
  ```

- AWS credentials with Amazon Bedrock access for any LLM mode other than `replay` or `stub` (i.e. `record-missing`, `record`, `live`). The pipeline calls Mistral Pixtral Large (Pro tier) and Amazon Nova-2-Lite (cheap tier) via Bedrock through LiteLLM, so the principal (IAM user, role, or SSO session) needs `bedrock:InvokeModel` on `mistral.pixtral-large-2502-v1:0` and `amazon.nova-2-lite-v1:0`, and model access for both must be granted in the Bedrock console (Bedrock → Model access → Manage model access; access is opt-in per account/region and is the most common reason a freshly-credentialed run still 403s). The default `replay` mode reads the bundled cassettes and needs none of this.

  ```bash
  export AWS_ACCESS_KEY_ID=...
  export AWS_SECRET_ACCESS_KEY=...
  export AWS_REGION=eu-west-1   # pinned models live in eu-west-1; AWS_PROFILE works too
  ```
The scaffold is complete: every node, every edge, every schema, and every reducer in docs/architecture.md §3–§7 exists in code and the graph executes end-to-end. What varies is whether each node does real work or returns deterministic stub data sized to the schema.
Core graph. All 12 nodes wired in src/releaselens/graph.py, all five Send fan-out points (PEP, feature, test-author per claim, evidence per tool, impact per feature), the evidence-escalation conditional ladder, and the SqliteSaver checkpointer.
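As one illustrative example of those fan-out points, a LangGraph `Send` edge looks roughly like this. This is a sketch only; the node name matches the graph above, but the state field name is an assumption, not lifted from `src/releaselens/graph.py`:

```python
# Sketch of one Send fan-out (the per-SpecClaim one). Conditional edges
# that return a list of Send objects spawn one parallel branch per item.
from langgraph.types import Send

def fan_out_per_claim(state: dict) -> list[Send]:
    """Return one Send per claim, so test_author runs as parallel
    branches that later fan back in through the state reducers."""
    return [Send("test_author", {"claim": claim}) for claim in state["spec_claims"]]
```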
Schemas & state. All 11 Pydantic v2 record types under src/releaselens/schemas/. PipelineState TypedDict with add / dict_merge reducers on every fan-in field.
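A minimal sketch of that reducer pattern, with hypothetical field names (the real definitions live under `src/releaselens/schemas/`):

```python
import operator
from typing import Annotated, Any, TypedDict

def dict_merge(left: dict[str, Any], right: dict[str, Any]) -> dict[str, Any]:
    """Merge reducer for fan-in fields keyed by claim/feature id; parallel
    branch writes combine instead of clobbering the whole dict."""
    return {**left, **right}

class PipelineState(TypedDict):
    # fan-in lists concatenate results from parallel Send branches
    evidence: Annotated[list[dict[str, Any]], operator.add]
    # keyed fan-in dicts merge branch-by-branch
    verdicts: Annotated[dict[str, Any], dict_merge]
```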
LLM nodes that do real work.
- `pep_ingest` — reads RST from `data/peps/`, parses sections via `releaselens.peps.rst`.
- `feature_extract` — Nova Pro decomposition into atomic `Feature`s + `SpecClaim`s.
- `test_author` / `critic` — evaluator-optimizer loop with retry budget and feedback piping (ADR-0007), routed asymmetrically (Pro author, Lite critic).
- `evidence_static` — derives identifier-like search candidates from claim text, ripgreps the local source clone for each, sends top hits to Nova Lite for found / not_found / ambiguous classification with a confidence score; confidence ≥ threshold short-circuits the escalation ladder.
- `evidence_changelog` — commit and release archaeology via the GitHub wrapper; escalation rung two when static grep fails the confidence gate.
- `evidence_probe` — `uv venv`-sandboxed differential runner with three executors (`static_signature`, `behavioural_probe`, `metadata_assertion`), no LLM in the runner path; the rung-three confirmation step.
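The cost-gated escalation decision can be sketched as a conditional-edge router. The 0.8 gate matches the flowchart above; the record type and function names here are assumptions, not the actual releaselens API:

```python
from dataclasses import dataclass

CONFIDENCE_GATE = 0.8  # the "conf ≥ 0.8" gates in the flowchart

@dataclass
class EvidenceRecord:
    claim_id: str
    verdict: str        # "found" | "not_found" | "ambiguous"
    confidence: float

def route_after_rung(record: EvidenceRecord, next_rung: str) -> str:
    """A confident verdict short-circuits straight to aggregation;
    anything else escalates one rung down the ladder."""
    if record.confidence >= CONFIDENCE_GATE:
        return "evidence_aggregate"
    return next_rung  # "evidence_changelog" after rung ①, "evidence_probe" after rung ②
```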
Deterministic Python nodes (no LLM). evidence_aggregate, matrix_build, verify, impact_scope, report_render.
Eval harness. releaselens eval loads YAML ground-truth fixtures from data/fixtures/, runs the graph end-to-end (replaying cassettes when present), and scores feature decomposition and evidence detection via pure-Python set arithmetic — TP/FP/FN with aggregate, per-tool, and per-method facets. Multi-run averaging via --runs N reports mean ± stdev; feature_f1, evidence_f1, and the per-tool / per-method breakdowns are pushed to Langfuse as scores per run, so eval traces are filterable in the UI alongside the corresponding pipeline traces.
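The scoring core is simple enough to sketch. This is illustrative, not the harness's exact code; only the set-arithmetic TP/FP/FN approach is from the description above:

```python
def f1(predicted: set[str], gold: set[str]) -> float:
    """F1 from set overlap between predicted and ground-truth labels."""
    tp = len(predicted & gold)   # true positives
    fp = len(predicted - gold)   # false positives
    fn = len(gold - predicted)   # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. feature_f1 = f1(extracted_feature_ids, fixture_feature_ids)
```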
Tooling layer. Real implementations with stub-mode parity selected via RELEASELENS_<TOOL>_MODE:
- `tools/ripgrep.py` — streaming stdout wrapper.
- `tools/github.py` — commit-archaeology via PyGithub.
- `tools/rag.py` — ChromaDB collections backed by Bedrock Titan embeddings.
- `tools/uv_sandbox.py` — per-invocation `uv venv` lifecycle.
- `tools/differential_runner.py` — three executors (`static_signature`, `behavioural_probe`, `metadata_assertion`); no LLM in the runner path.
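The selection mechanism is plausibly as small as the helper below. Only the `RELEASELENS_<TOOL>_MODE` env-var contract is from this README; the helper itself is hypothetical:

```python
import os

def select_impl(tool: str, real_impl, stub_impl):
    """Pick the real or stub implementation from RELEASELENS_<TOOL>_MODE.
    Both must expose the same call signature so nodes stay mode-agnostic."""
    mode = os.environ.get(f"RELEASELENS_{tool.upper()}_MODE", "stub")
    return real_impl if mode == "real" else stub_impl

# e.g. search = select_impl("ripgrep", real_search, stub_search)
```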
Bundled corpus. Ten PEP RST documents under data/peps/ (440, 503, 625, 639, 643, 658, 691, 700, 708, 740) — make demo runs the pipeline against 658, 691, and 740 end-to-end. PEP-658 has a graded ground-truth fixture in data/fixtures/PEP-658.yaml that drives the eval harness.
LLM gateway & routing. LiteLLM call wrapper with cassette record/replay (src/releaselens/llm.py); per-node model selection from config/model_routing.yaml (currently Amazon Nova family — see the file header for the rationale).
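The record/replay core can be sketched as follows. This is an illustrative reconstruction, not the code in `src/releaselens/llm.py`; the prompt-SHA keying and the `CassetteMissing` name come from this README, while the storage layout and one-file-per-SHA convention are assumptions:

```python
import hashlib
import json
from pathlib import Path

class CassetteMissing(RuntimeError):
    """replay mode refuses to call Bedrock when no cassette matches."""

CASSETTE_DIR = Path("tests/cassettes")  # assumed layout: one JSON file per prompt SHA

def cassette_key(model: str, messages: list[dict]) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def complete(model: str, messages: list[dict], mode: str = "replay") -> str:
    # stub mode omitted here; it never touches cassettes or Bedrock
    path = CASSETTE_DIR / f"{cassette_key(model, messages)}.json"
    if mode in ("replay", "record-missing") and path.exists():
        return json.loads(path.read_text())["response"]
    if mode == "replay":
        raise CassetteMissing(path.name)
    import litellm  # live call via LiteLLM → Bedrock
    text = litellm.completion(model=model, messages=messages).choices[0].message.content
    if mode in ("record", "record-missing"):
        CASSETTE_DIR.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"response": text}))
    return text
```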
Observability. Langfuse self-host docker compose at infra/langfuse/; tracing wired through three seams (LiteLLM callback for generations, LangGraph CallbackHandler for node spans, tool_span context manager around each tool wrapper). All seams are no-ops without LANGFUSE_* env vars. See "See it in action" §5 for the walkthrough.
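The `tool_span` seam is plausibly a context manager shaped like this; `record_span` is a stand-in for the real Langfuse client call, and only the no-op-without-env behaviour is from the description above:

```python
import os
import time
from contextlib import contextmanager

def record_span(name: str, **attrs) -> None:
    """Placeholder for the real Langfuse client call (hypothetical)."""
    print(f"span {name}: {attrs}")

@contextmanager
def tool_span(name: str, **attrs):
    """Open a tracing span around a tool call; silently no-op when the
    LANGFUSE_* env vars are absent so tests stay infra-free."""
    if not os.environ.get("LANGFUSE_PUBLIC_KEY"):
        yield
        return
    start = time.monotonic()
    try:
        yield
    finally:
        record_span(name, duration_s=time.monotonic() - start, **attrs)
```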
CLI. releaselens run | resume | eval with click; run auto-copies the bundled PEP fixture into data/peps/ if missing.
Tests. Unit tests for the loop nodes, RST parser, smoke tests, integration scaffolding; recorded LiteLLM cassettes under tests/cassettes/.
The gap between architecture-complete and "the numbers are publishable" is the §19 done-state checklist. The headline gaps:
- Evidence-tool quality is the current bottleneck. The escalation ladder is wired and the eval harness scores it correctly, but feature- and evidence-level F1 against the PEP-658 fixture are weak: the candidate-derivation prompts, ripgrep query shapes, and confidence thresholds all need tuning before the numbers are anything to point at. Scaffold is right; calibration is the work.
- Eval ground-truth is one PEP deep. Only `data/fixtures/PEP-658.yaml` is labelled. `--pep-ids 691,740` will run end-to-end but produce no eval scores — every other bundled PEP needs a hand-curated YAML fixture before it can be graded.
- `DevpiPublicConnector` is synthetic. Returns canned `ResolvedTarget`s with `pinned_version="0.0.0-stub"`; no HTTP traffic to a real devpi. The Protocol shape is correct (see the sketch after this list); the network code isn't there.
- No ADRs committed under `docs/adr/` despite ADR-0001 through ADR-0007 being referenced throughout the spec.
- No `docs/cost_model.md` with per-stage figures from a baseline run.
- `tests/integration/` is empty — failure-mode coverage from architecture §12.3 not yet exercised.
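For concreteness, a hypothetical reconstruction of that connector Protocol shape; the actual class and field names in releaselens may differ, but the `"0.0.0-stub"` sentinel is from the item above:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ResolvedTarget:
    package: str
    pinned_version: str   # today's stub connector always returns "0.0.0-stub"

class RegistryConnector(Protocol):
    def resolve(self, package: str) -> ResolvedTarget:
        """Resolve a dependency of the target codebase to a pinned version."""
        ...

class DevpiPublicConnector:
    """Synthetic connector: canned data, no HTTP to a real devpi."""
    def resolve(self, package: str) -> ResolvedTarget:
        return ResolvedTarget(package=package, pinned_version="0.0.0-stub")
```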
```bash
make setup
uv run releaselens run --pep-ids 658
```

What to look for:
- Console prints `run_id: <uuid>` and `report: reports/<run_id>/report.md`.
- The report has Spec / Reality / Impact sections per Feature. The Reality columns are populated by the live evidence ladder when ripgrep + local source clones + `GITHUB_TOKEN` are present (see "Optional system dependencies"); without them, evidence nodes record "skipped" / "no artefacts" notes and the columns show that the rung was unreachable rather than synthetic data.
- The Impact section will fall back to `version_first_seen=0.0.0-stub` because the registry connector (`DevpiPublicConnector`) is still synthetic — see "Not done".
- A SQLite checkpoint database appears at `.releaselens/checkpoints.db`.
```bash
uv run releaselens resume <run_id>   # the run_id printed above
```

Confirms SqliteSaver is wired and the graph compiles against persisted state.
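A minimal sketch of the SqliteSaver resume pattern under LangGraph, assuming the printed `run_id` maps to LangGraph's `thread_id`; the actual wiring in `src/releaselens/graph.py` may differ:

```python
import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver  # langgraph-checkpoint-sqlite

# Open the same checkpoint database the CLI uses.
conn = sqlite3.connect(".releaselens/checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

# graph = build_graph()                              # the StateGraph in graph.py
# app = graph.compile(checkpointer=checkpointer)
# config = {"configurable": {"thread_id": run_id}}   # assumed run_id == thread_id
# app.invoke(None, config)                           # input=None resumes from the checkpoint
```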
```bash
uv run pytest tests/unit/test_loop_integration.py -v
```

Drives the `test_author` → `critic` → `test_author` retry loop using recorded Nova cassettes from `tests/cassettes/`. To re-record against live Bedrock, delete the relevant cassette and re-run with AWS credentials present.
```bash
RELEASELENS_RIPGREP_MODE=real uv run python -c "from releaselens.tools.ripgrep import search; \
[print(line) for line in search(['Feature'], ['src/releaselens/schemas'])]"

RELEASELENS_RAG_MODE=real uv run python -c "from releaselens.tools.rag import PEPCollection; \
c = PEPCollection(); c.upsert('pep-658', 'metadata files via Warehouse'); print(c.query('metadata'))"
```

Stub mode (the default in tests) returns canned data registered per call key; unregistered keys raise `StubNotRegistered` so silent zero-result replays can't mask bugs.
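A minimal sketch of that strict stub registry; the real implementation lives under `releaselens.tools` and its names may differ, but the fail-loud contract is what the paragraph above describes:

```python
class StubNotRegistered(KeyError):
    """Raised instead of silently returning zero results for unknown keys."""

class StubRegistry:
    def __init__(self) -> None:
        self._responses: dict[str, object] = {}

    def register(self, call_key: str, response: object) -> None:
        self._responses[call_key] = response

    def lookup(self, call_key: str) -> object:
        if call_key not in self._responses:
            raise StubNotRegistered(call_key)  # an empty replay can't mask a bug
        return self._responses[call_key]
```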
Tracing is no-op until the three LANGFUSE_* env vars are set, so tests stay infra-free. To see a run end-to-end:
```bash
# 1. start Postgres + Langfuse via docker compose
make langfuse-up
# 2. open http://localhost:3000, create a local user + organisation +
# project, then copy the project's public + secret keys into the env.
# The Postgres volume persists across compose restarts, so this is a
# one-time setup.
export LANGFUSE_HOST=http://localhost:3000
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
# 3. choose an LLM mode (see the table below) and run the pipeline.
# Every node, every LLM call, every tool wrapper invocation is now
# a span. record-missing is the recommended demo mode: one paid run
# captures cassettes, every subsequent run replays them for free
# while still emitting the trace.
RELEASELENS_LLM_MODE=record-missing \
uv run releaselens run --pep-ids 658
# 4. refresh the Langfuse UI → Sessions, filter by the printed run_id.
# LangGraph nodes appear as a trace tree; LLM generations under each
# node carry input_tokens / output_tokens / cost_usd from LiteLLM;
#    test_author / critic spans carry the iteration attribute.
```

| Mode | Behaviour | When to use |
|---|---|---|
| `replay` (default) | Refuses to call Bedrock; requires a cassette to exist for the exact prompt SHA, otherwise raises `CassetteMissing`. The `test_author` node converts that to a `budget_exhausted` terminal state per claim and the run completes with errors recorded in state. | Tests, deterministic CI. Will produce empty test-author / critic spans for any prompts that haven't been recorded — expected. |
| `record-missing` | Replay when a cassette exists; otherwise call Bedrock live and write the cassette. | Recommended for the trace demo: one run with AWS creds present captures every cassette; subsequent runs are free and identical. |
| `record` | Always call Bedrock live and overwrite the cassette. | Re-recording after a prompt change. |
| `live` | Always call Bedrock; never read or write cassettes. | Ad-hoc experimentation. |
| `stub` | Returns the canonical registered stub response per node. No Bedrock calls, no cassette I/O. | Smoke runs without AWS creds. Trace will show node spans but no real LLM generations. |
Default `replay` mode against a fresh PEP intentionally produces a noisy "trace full of `budget_exhausted`" — the cassettes under `tests/cassettes/` are sized to the unit tests, not to the real `pep_ingest` + `feature_extract` output, so prompt SHAs will not match. The graceful-degradation path is doing exactly what architecture §7.3.1 mandates; switch to `record-missing` once to capture the live cassettes.
![Langfuse trace](docs/screenshots/langfuse-trace.png)
`make langfuse-down` stops both containers; the named Postgres volume persists traces and your project keys between sessions. Delete it explicitly (`docker volume rm releaselens-langfuse_langfuse_pg`) if you want a clean slate.
```bash
make lint
make test
```

```bash
uv run releaselens eval            # scores PEP-658 against data/fixtures/PEP-658.yaml
uv run releaselens eval --runs 5   # multi-run averaging: mean ± stdev per facet
```

What to expect right now: the harness runs end-to-end and surfaces feature/evidence F1 with per-tool and per-method breakdowns, but the numbers are weak — the evidence-tool prompts and confidence thresholds are the next chunk of work (see "Not done" above). Adding more graded PEPs is a matter of dropping additional `data/fixtures/PEP-XXX.yaml` files alongside the existing one; the runner picks them up automatically.
With Langfuse running, every eval run is tagged `eval=true` and emits `feature_f1`, `evidence_f1`, plus `per_tool_*_f1` and `per_method_*_f1` as scores against its run trace, so calibration changes are diffable in the UI.