benitezfede · benitezfede · Mar 30, 2026
diff --git a/README.md b/README.md
@@ -73,6 +73,43 @@ GitHub PR URL
 
 ---
 
+## Why flow-guided review?
+
+Traditional code review tools present diffs as a flat file list — alphabetical, with no context about how changes relate to each other. Reviewers mentally reconstruct the code flow: "this handler calls that validator which uses this new utility..." and hope they don't miss a connection.
+
+Flow-guided review structures the same diff as a directed graph: entry points first, then downstream through call chains, with explicit dependency ordering, risk levels, and clusters of tightly coupled changes. The reviewer — human or AI — follows the code flow instead of guessing at it.
+
+### 100-PR evaluation
+
+We evaluated this approach across **100 open-source PRs from 57 repositories** (8 languages, 2-86 files per PR) using a blind 3-agent framework:
+
+1. **Baseline reviewer** — reviews the diff with no structural guidance (standard approach)
+2. **Flow-guided reviewer** — reviews the same diff with the PR Flow Graph review plan
+3. **Blind judge** — scores both reviews on 5 criteria without knowing which used the graph
+
+| Metric | Value |
+|--------|-------|
+| Flow-guided wins | **92 / 100** (92%) |
+| Baseline wins | 0 / 100 (0%) |
+| Ties | 8 / 100 (8%) |
+| Avg overall improvement | **+1.3** (6.0 → 7.3 on a 10-point scale) |
+
+### Per-criterion results
+
+| Criterion | Baseline | Flow-Guided | Delta | What it measures |
+|-----------|----------|-------------|-------|------------------|
+| Completeness | 6.7 | 7.7 | +1.0 | Covered all meaningful changes? |
+| Flow Awareness | 3.9 | 7.0 | **+3.1** | Understood cross-file connections? |
+| Risk Identification | 6.3 | 7.6 | +1.3 | Flagged the riskiest parts? |
+| Actionability | 6.2 | 7.1 | +0.9 | Specific, useful comments? |
+| Efficiency | 7.1 | 7.0 | -0.1 | Avoided noise / false positives? |
+
+The largest gain is in **flow awareness** (+3.1): understanding how changes in one file affect behavior in another, which is what the review plan directly provides. Efficiency is essentially unchanged (-0.1), which suggests the added structure does not introduce noise.
+
+Full results: [`evals/RESULTS.md`](./evals/RESULTS.md) | Methodology: [`evals/README.md`](./evals/README.md)
+
+---
+
 ## Quick start
 
 ### Prerequisites
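
The headline figures in the 100-PR evaluation table (wins, ties, average improvement) are plain aggregates over the per-PR judge verdicts. As a rough sketch only, assuming each result file carries the `judge` object documented in `evals/README.md`, the tally could be computed as below. The `summarize` function and the `JudgeResult` and `Summary` types are hypothetical names, not code from this repository, and the `"baseline"` and `"tie"` winner values are assumptions, since only `"flow_guided"` appears in the documented example.

```typescript
// Subset of the `judge` object from the documented result format.
interface JudgeResult {
  baseline_scores: { overall: number };
  flow_guided_scores: { overall: number };
  // Only "flow_guided" is shown in the docs; "baseline" and "tie" are assumed.
  winner: "flow_guided" | "baseline" | "tie";
}

interface Summary {
  flowGuidedWins: number;
  baselineWins: number;
  ties: number;
  avgBaseline: number;
  avgFlowGuided: number;
  avgImprovement: number;
}

// Round to one decimal place, matching the precision used in the tables.
const round1 = (x: number): number => Math.round(x * 10) / 10;

// Arithmetic mean of a non-empty list of numbers.
const mean = (xs: number[]): number =>
  xs.reduce((acc, x) => acc + x, 0) / xs.length;

function summarize(results: JudgeResult[]): Summary {
  if (results.length === 0) {
    throw new Error("cannot summarize an empty result set");
  }
  const countWinner = (w: JudgeResult["winner"]): number =>
    results.filter((r) => r.winner === w).length;

  const avgBaseline = mean(results.map((r) => r.baseline_scores.overall));
  const avgFlowGuided = mean(results.map((r) => r.flow_guided_scores.overall));

  return {
    flowGuidedWins: countWinner("flow_guided"),
    baselineWins: countWinner("baseline"),
    ties: countWinner("tie"),
    avgBaseline: round1(avgBaseline),
    avgFlowGuided: round1(avgFlowGuided),
    // Difference of the unrounded means, rounded once at the end.
    avgImprovement: round1(avgFlowGuided - avgBaseline),
  };
}
```

The sketch rounds the difference of the unrounded means; the README does not state whether the published +1.3 was computed before or after rounding.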

diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,162 @@
+# PR Flow Graph — Evaluation Framework
+
+Compares code reviews produced **with** vs **without** the PR Flow Graph review plan across 100 open-source PRs from 57 repositories.
+
+## 3-Agent Evaluation Framework
+
+Each PR is evaluated by three independent Claude agents in a blind comparison:
+
+```
+                     GitHub PR
+                    /         \
+                   v           v
+         ┌──────────────┐  ┌───────────────────┐
+         │  Agent A     │  │  Agent B          │
+         │  (Baseline)  │  │  (Flow-Guided)    │
+         │              │  │                   │
+         │  Input:      │  │  Input:           │
+         │  - PR diff   │  │  - PR diff        │
+         │              │  │  - Review plan    │
+         │              │  │    from /api/     │
+         │              │  │    agent/         │
+         │              │  │    review-plan    │
+         └──────┬───────┘  └────────┬──────────┘
+                │                   │
+                v                   v
+         ┌────────────────────────────────┐
+         │  Judge (Blind)                 │
+         │                                │
+         │  Sees: "Review 1" / "Review 2" │
+         │  (randomized order)            │
+         │  Scores each 1-10 on 5 criteria│
+         │  Picks a winner                │
+         └────────────────────────────────┘
+```
+
+### Agent A — Baseline Reviewer
+
+Reviews the PR using only the raw GitHub diff. This mirrors the typical setup for AI code review tools: the model sees a flat list of file diffs and produces comments.
+
+### Agent B — Flow-Guided Reviewer
+
+Reviews the same PR diff but also receives the structured review plan from `/api/agent/review-plan`. The plan provides:
+
+- **Topological review order** — review callees before callers
+- **Node roles** — entry points, internal functions, leaf functions, context-only (unchanged but referenced)
+- **Risk levels** — high/medium/low with reasons (large diff, many callers, entry point)
+- **Clusters** — tightly coupled groups of functions to review together
+- **Dependency chains** — "review X before Y because Y calls X"
+
+### Judge — Blind Evaluator
+
+The judge receives both reviews labeled only as "Review 1" and "Review 2" in **randomized order** (coin flip per PR). It does not know which used the flow graph. It scores each review on 5 criteria and declares a winner.
+
+## Scoring Criteria
+
+Each criterion is scored 1-10 independently.
+
+### Completeness (1-10)
+
+Did the review cover all meaningful changes in the PR?
+
+A high-scoring review identifies and comments on every significant code change: new functions, modified logic, deleted code, configuration changes, and test coverage. A low score means the reviewer missed entire files, skipped important logic paths, or ignored edge cases. For a 14-file PR, a review that only covers 5 files would score low regardless of how good those 5 comments are.
+
+### Flow Awareness (1-10)
+
+Did the review understand how changes connect across files?
+
+This is the core differentiator. A high score means the reviewer recognized cross-file relationships, such as how a change in `handler.ts` affects `validator.ts`, which is called by `middleware.ts`. Such a review catches consistency issues between caller and callee, identifies that a type change in one file breaks assumptions in another, or traces data flow through the call chain. A low score means the review treated each file in isolation, as if the changes were independent.
+
+### Risk Identification (1-10)
+
+Did the review flag the riskiest parts of the PR?
+
+High-risk areas include: entry points with many downstream callers (a bug here cascades), large diffs touching shared state, breaking API changes, missing error handling on new code paths, and security-sensitive changes. A high-scoring review correctly prioritizes these over low-risk cosmetic changes. A low score means the reviewer spent equal time on all changes regardless of impact, or missed the highest-risk modifications entirely.
+
+### Actionability (1-10)
+
+Were the review comments specific and useful?
+
+A high score means comments pointed to exact lines of code, explained *why* something is a problem (not just *that* it is), and suggested concrete fixes or alternatives. Comments like "the null check on line 45 should handle the empty-array case too — `if (!items?.length)`" score high. Comments like "this could be better" or "consider error handling" without specifics score low.
+
+### Efficiency (1-10)
+
+Did the review avoid noise and false positives?
+
+A high score means every comment adds value: no redundant observations, no low-signal nits masquerading as major issues, no incorrect severity ratings, and no comments on code that wasn't actually changed. A review that raises 5 precise issues scores higher on efficiency than one that raises 12 comments of which 7 are trivial or wrong. This criterion counterbalances completeness: a review should not score well simply by commenting on everything.
+
+### Overall Score
+
+The arithmetic mean of all 5 criteria. Range: 1.0-10.0.
+
+## File Format
+
+Each eval produces `evals/<owner>__<repo>__<pr_number>.json`:
+
+```json
+{
+  "pr": {
+    "url": "https://github.com/owner/repo/pull/123",
+    "owner": "owner",
+    "repo": "repo",
+    "number": 123,
+    "title": "PR title",
+    "files_changed": 14,
+    "additions": 200,
+    "deletions": 8,
+    "language": "typescript"
+  },
+  "timestamp": "2026-03-30T...",
+  "baseline_review": {
+    "comments": [
+      {
+        "file": "path/to/file.ts",
+        "line": 42,
+        "severity": "critical|major|minor|nit|positive",
+        "comment": "Specific review comment..."
+      }
+    ],
+    "summary": "2-3 sentence assessment"
+  },
+  "flow_guided_review": {
+    "comments": [...],
+    "summary": "..."
+  },
+  "review_plan": { "totalSteps": 12, "..." : "..." },
+  "judge": {
+    "baseline_scores": {
+      "completeness": 7,
+      "flow_awareness": 5,
+      "risk_identification": 6,
+      "actionability": 6,
+      "efficiency": 7,
+      "overall": 6.2
+    },
+    "flow_guided_scores": {
+      "completeness": 8,
+      "flow_awareness": 8,
+      "risk_identification": 7,
+      "actionability": 7,
+      "efficiency": 7,
+      "overall": 7.4
+    },
+    "reasoning": "1-2 sentence explanation of the winner selection",
+    "winner": "flow_guided"
+  }
+}
+```
+
+## Running the Eval
+
+The eval runner (`run-eval.ts`) processes a single PR through the 3-agent pipeline:
+
+```bash
+# Expects pre-fetched data in /tmp/prflow-evals/ and PR list in /tmp/prflow-eval-prs.json
+npx tsx evals/run-eval.ts <index>
+```
+
+Each run makes 3 API calls (baseline, flow-guided, judge) and writes the result JSON to `evals/`.
+
+## Results
+
+See [RESULTS.md](./RESULTS.md) for the full aggregated results across all 100 PRs.
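
To make the rules in `evals/README.md` concrete, here is a minimal sketch of three pieces it describes: the overall score as the mean of the five criteria, the result file path, and the per-PR coin flip that hides from the judge which review is which. All function and type names are hypothetical and are not taken from `run-eval.ts`. Rounding the mean to one decimal and the injectable `rng` parameter are assumptions made for illustration.

```typescript
// Criterion keys as they appear in the example result JSON.
const CRITERIA = [
  "completeness",
  "flow_awareness",
  "risk_identification",
  "actionability",
  "efficiency",
] as const;

type Criterion = (typeof CRITERIA)[number];
type CriterionScores = Record<Criterion, number>;

// Overall score: arithmetic mean of the five criteria.
// Rounding to one decimal is an assumption based on the example values.
function overallScore(scores: CriterionScores): number {
  let sum = 0;
  for (const criterion of CRITERIA) {
    const value = scores[criterion];
    if (!Number.isFinite(value) || value < 1 || value > 10) {
      throw new RangeError(`${criterion} must be between 1 and 10, got ${value}`);
    }
    sum += value;
  }
  return Math.round((sum / CRITERIA.length) * 10) / 10;
}

// Result path following the documented evals/<owner>__<repo>__<pr_number>.json pattern.
function resultPath(owner: string, repo: string, prNumber: number): string {
  return `evals/${owner}__${repo}__${prNumber}.json`;
}

type ReviewSource = "baseline" | "flow_guided";

interface BlindOrder {
  review1: ReviewSource;
  review2: ReviewSource;
}

// Coin flip per PR: decides which review the judge sees as "Review 1".
// `rng` is injectable (a hypothetical design choice) so the flip can be tested.
function assignBlindOrder(rng: () => number = Math.random): BlindOrder {
  return rng() < 0.5
    ? { review1: "baseline", review2: "flow_guided" }
    : { review1: "flow_guided", review2: "baseline" };
}

// Map the judge's blind pick back to the real source after scoring.
function unblind(order: BlindOrder, pick: "review1" | "review2"): ReviewSource {
  return order[pick];
}
```

For the example scores in the File Format section, `overallScore` yields 6.2 for the baseline review (7, 5, 6, 6, 7) and 7.4 for the flow-guided review (8, 8, 7, 7, 7), matching the `overall` fields shown there.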