37 changes: 37 additions & 0 deletions README.md
@@ -73,6 +73,43 @@ GitHub PR URL

---

## Why flow-guided review?

Traditional code review tools present diffs as a flat file list — alphabetical, with no context about how changes relate to each other. Reviewers mentally reconstruct the code flow: "this handler calls that validator which uses this new utility..." and hope they don't miss a connection.

Flow-guided review structures the same diff as a directed graph: entry points first, then downstream through call chains, with explicit dependency ordering, risk levels, and clusters of tightly coupled changes. The reviewer — human or AI — follows the code flow instead of guessing at it.

### 100-PR evaluation

We evaluated this approach across **100 open-source PRs from 57 repositories** (8 languages, 2-86 files per PR) using a blind 3-agent framework:

1. **Baseline reviewer** — reviews the diff with no structural guidance (standard approach)
2. **Flow-guided reviewer** — reviews the same diff with the PR Flow Graph review plan
3. **Blind judge** — scores both reviews on 5 criteria without knowing which used the graph

| Metric | Value |
|--------|-------|
| Flow-guided wins | **92 / 100** (92%) |
| Baseline wins | 0 / 100 (0%) |
| Ties | 8 / 100 (8%) |
| Avg improvement | **+1.3** (6.0 → 7.3 on 10-point scale) |

### Per-criterion results

| Criterion | Baseline | Flow-Guided | Delta | What it measures |
|-----------|----------|-------------|-------|------------------|
| Completeness | 6.7 | 7.7 | +1.0 | Covered all meaningful changes? |
| Flow Awareness | 3.9 | 7.0 | **+3.1** | Understood cross-file connections? |
| Risk Identification | 6.3 | 7.6 | +1.3 | Flagged the riskiest parts? |
| Actionability | 6.2 | 7.1 | +0.9 | Specific, useful comments? |
| Efficiency | 7.1 | 7.0 | -0.1 | Avoided noise / false positives? |

The largest gain is **flow awareness** (+3.1), which measures how well the review understands how changes in one file affect behavior in another. This is what the review plan directly provides. Efficiency is essentially unchanged (7.1 vs 7.0), which suggests the structured approach does not add noise.

Full results: [`evals/RESULTS.md`](./evals/RESULTS.md) | Methodology: [`evals/README.md`](./evals/README.md)

---

## Quick start

### Prerequisites
162 changes: 162 additions & 0 deletions evals/README.md
@@ -0,0 +1,162 @@
# PR Flow Graph — Evaluation Framework

Compares code reviews produced **with** vs **without** the PR Flow Graph review plan across 100 open-source PRs from 57 repositories.

## 3-Agent Evaluation Framework

Each PR is evaluated by three independent Claude agents in a blind comparison:

```
               GitHub PR
            /             \
          v                 v
┌──────────────┐   ┌───────────────────┐
│  Agent A     │   │  Agent B          │
│  (Baseline)  │   │  (Flow-Guided)    │
│              │   │                   │
│  Input:      │   │  Input:           │
│  - PR diff   │   │  - PR diff        │
│              │   │  - Review plan    │
│              │   │    from /api/     │
│              │   │    agent/         │
│              │   │    review-plan    │
└──────┬───────┘   └────────┬──────────┘
       │                    │
       v                    v
┌──────────────────────────────────────┐
│  Judge (Blind)                       │
│                                      │
│  Sees: "Review 1" / "Review 2"       │
│  (randomized order)                  │
│  Scores each 1-10 on 5 criteria      │
│  Picks a winner                      │
└──────────────────────────────────────┘
```

### Agent A — Baseline Reviewer

Reviews the PR using only the raw GitHub diff. This is how most AI code review tools work today: the model sees a flat list of file diffs and produces comments.

### Agent B — Flow-Guided Reviewer

Reviews the same PR diff but also receives the structured review plan from `/api/agent/review-plan`. The plan provides:

- **Topological review order** — review callees before callers
- **Node roles** — entry points, internal functions, leaf functions, context-only (unchanged but referenced)
- **Risk levels** — high/medium/low with reasons (large diff, many callers, entry point)
- **Clusters** — tightly coupled groups of functions to review together
- **Dependency chains** — "review X before Y because Y calls X"
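
The "callees before callers" ordering in the list above can be illustrated with a depth-first post-order traversal of a call graph. This is a minimal sketch, not the tool's implementation: the `CallGraph` type, the `reviewOrder` function, and the function names in the example graph are all hypothetical, and cycles are handled by simply ignoring the back edge.

```typescript
// Hypothetical call graph: each key is a changed function, each value lists
// the functions it calls. Names are illustrative, not taken from the tool.
type CallGraph = Map<string, string[]>;

// Depth-first post-order: a function is emitted only after all of its
// callees, which gives a "callees before callers" order for acyclic graphs.
function reviewOrder(graph: CallGraph): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (node: string): void => {
    const s = state.get(node);
    // Already emitted, or part of a cycle (e.g. recursion): skip the edge.
    if (s === "done" || s === "visiting") return;
    state.set(node, "visiting");
    for (const callee of graph.get(node) ?? []) visit(callee);
    state.set(node, "done");
    order.push(node);
  };

  for (const node of graph.keys()) visit(node);
  return order;
}

const exampleGraph: CallGraph = new Map([
  ["handleRequest", ["validateInput", "saveRecord"]],
  ["validateInput", ["normalize"]],
  ["saveRecord", ["normalize"]],
  ["normalize", []],
]);

const order = reviewOrder(exampleGraph);
console.log(order);
// prints [ 'normalize', 'validateInput', 'saveRecord', 'handleRequest' ]
```

In this example the shared leaf `normalize` is reviewed first and the entry point `handleRequest` last, so each dependency chain reads "review X before Y because Y calls X".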

### Judge — Blind Evaluator

The judge receives both reviews labeled only as "Review 1" and "Review 2" in **randomized order** (coin flip per PR). It does not know which used the flow graph. It scores each review on 5 criteria and declares a winner.
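
One way to implement the per-PR coin flip is to shuffle the two reviews into positional slots and keep the mapping on the harness side only. The sketch below is an assumption about how this could be done, not the code in `run-eval.ts`: `blind`, `unblind`, `BlindedPair`, and the `"tie"` verdict value are hypothetical names.

```typescript
type Source = "baseline" | "flow_guided";

interface BlindedPair {
  review1: string;
  review2: string;
  // Held by the harness only; never included in the judge prompt.
  key: { review1: Source; review2: Source };
}

// `coin` is injectable so the shuffle can be tested deterministically.
function blind(
  baseline: string,
  flowGuided: string,
  coin: () => boolean = () => Math.random() < 0.5,
): BlindedPair {
  return coin()
    ? {
        review1: baseline,
        review2: flowGuided,
        key: { review1: "baseline", review2: "flow_guided" },
      }
    : {
        review1: flowGuided,
        review2: baseline,
        key: { review1: "flow_guided", review2: "baseline" },
      };
}

// Assumed shape of the judge's positional verdict.
type JudgeVerdict = "review1" | "review2" | "tie";

// Translate the positional verdict back to the real source of the review.
function unblind(verdict: JudgeVerdict, pair: BlindedPair): Source | "tie" {
  return verdict === "tie" ? "tie" : pair.key[verdict];
}

const pair = blind("baseline text", "flow-guided text", () => false);
console.log(unblind("review1", pair)); // prints "flow_guided"
```

Because the judge only ever sees `review1` and `review2`, any positional bias it has is spread evenly across both reviewers over many PRs.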

## Scoring Criteria

Each criterion is scored 1-10 independently.

### Completeness (1-10)

Did the review cover all meaningful changes in the PR?

A high-scoring review identifies and comments on every significant code change: new functions, modified logic, deleted code, configuration changes, and test coverage. A low score means the reviewer missed entire files, skipped important logic paths, or ignored edge cases. For a 14-file PR, a review that covers only 5 files would score low regardless of how good the comments on those 5 files are.

### Flow Awareness (1-10)

Did the review understand how changes connect across files?

This is the core differentiator. A high score means the reviewer recognized cross-file relationships, for example how a change in `handler.ts` affects `validator.ts`, which is called by `middleware.ts`. Such a review catches consistency issues between caller and callee, identifies that a type change in one file breaks assumptions in another, or traces data flow through the call chain. A low score means the review treated each file in isolation, as if the changes were independent.

### Risk Identification (1-10)

Did the review flag the riskiest parts of the PR?

High-risk areas include: entry points with many downstream callers (a bug here cascades), large diffs touching shared state, breaking API changes, missing error handling on new code paths, and security-sensitive changes. A high-scoring review correctly prioritizes these over low-risk cosmetic changes. A low score means the reviewer spent equal time on all changes regardless of impact, or missed the highest-risk modifications entirely.

### Actionability (1-10)

Were the review comments specific and useful?

A high score means comments pointed to exact lines of code, explained *why* something is a problem (not just *that* it is), and suggested concrete fixes or alternatives. Comments like "the null check on line 45 should handle the empty-array case too — `if (!items?.length)`" score high. Comments like "this could be better" or "consider error handling" without specifics score low.

### Efficiency (1-10)

Did the review avoid noise and false positives?

A high score means every comment adds value — no redundant observations, no low-signal nits masquerading as major issues, no incorrect severity ratings, and no comments on code that wasn't actually changed. A review that raises 5 precise issues scores higher on efficiency than one that raises 12 comments where 7 are trivial or wrong. This criterion counterbalances completeness: you shouldn't score well by just commenting on everything.

### Overall Score

The arithmetic mean of all 5 criteria. Range: 1.0-10.0.
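
As a small worked example, the overall score can be computed from the five criterion scores as follows. The `CriterionScores` and `overallScore` names are hypothetical, and rounding to one decimal place is an assumption based on how the scores are displayed in this document.

```typescript
interface CriterionScores {
  completeness: number;
  flow_awareness: number;
  risk_identification: number;
  actionability: number;
  efficiency: number;
}

// Arithmetic mean of the five criteria, rounded to one decimal place
// (the rounding step is an assumption, not stated by the framework).
function overallScore(s: CriterionScores): number {
  const values = [
    s.completeness,
    s.flow_awareness,
    s.risk_identification,
    s.actionability,
    s.efficiency,
  ];
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return Math.round(mean * 10) / 10;
}

const baselineOverall = overallScore({
  completeness: 7,
  flow_awareness: 5,
  risk_identification: 6,
  actionability: 6,
  efficiency: 7,
});
console.log(baselineOverall); // prints 6.2
```

Here (7 + 5 + 6 + 6 + 7) / 5 = 31 / 5 = 6.2.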

## File Format

Each eval produces `evals/<owner>__<repo>__<pr_number>.json`:

```json
{
  "pr": {
    "url": "https://github.com/owner/repo/pull/123",
    "owner": "owner",
    "repo": "repo",
    "number": 123,
    "title": "PR title",
    "files_changed": 14,
    "additions": 200,
    "deletions": 8,
    "language": "typescript"
  },
  "timestamp": "2026-03-30T...",
  "baseline_review": {
    "comments": [
      {
        "file": "path/to/file.ts",
        "line": 42,
        "severity": "critical|major|minor|nit|positive",
        "comment": "Specific review comment..."
      }
    ],
    "summary": "2-3 sentence assessment"
  },
  "flow_guided_review": {
    "comments": [...],
    "summary": "..."
  },
  "review_plan": { "totalSteps": 12, "..." : "..." },
  "judge": {
    "baseline_scores": {
      "completeness": 7,
      "flow_awareness": 5,
      "risk_identification": 6,
      "actionability": 6,
      "efficiency": 7,
      "overall": 6.2
    },
    "flow_guided_scores": {
      "completeness": 8,
      "flow_awareness": 8,
      "risk_identification": 7,
      "actionability": 7,
      "efficiency": 7,
      "overall": 7.4
    },
    "reasoning": "1-2 sentence explanation of the winner selection",
    "winner": "flow_guided"
  }
}
```
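
For consumers of these files, the example above could be mirrored by TypeScript types along these lines. This is a sketch inferred only from the sample JSON: the type names and the `resultFileName` helper are hypothetical, `review_plan` is left loosely typed because its shape is elided in the example, and `winner` is typed as `string` because only the `"flow_guided"` value is shown.

```typescript
type Severity = "critical" | "major" | "minor" | "nit" | "positive";

interface ReviewComment {
  file: string;
  line: number;
  severity: Severity;
  comment: string;
}

interface Review {
  comments: ReviewComment[];
  summary: string;
}

interface Scores {
  completeness: number;
  flow_awareness: number;
  risk_identification: number;
  actionability: number;
  efficiency: number;
  overall: number;
}

interface EvalResult {
  pr: {
    url: string;
    owner: string;
    repo: string;
    number: number;
    title: string;
    files_changed: number;
    additions: number;
    deletions: number;
    language: string;
  };
  timestamp: string;
  baseline_review: Review;
  flow_guided_review: Review;
  // Shape elided in the example, so it is kept loosely typed here.
  review_plan: Record<string, unknown>;
  judge: {
    baseline_scores: Scores;
    flow_guided_scores: Scores;
    reasoning: string;
    winner: string; // only "flow_guided" appears in the example
  };
}

// Builds the documented path pattern evals/<owner>__<repo>__<pr_number>.json.
function resultFileName(owner: string, repo: string, prNumber: number): string {
  return `evals/${owner}__${repo}__${prNumber}.json`;
}

console.log(resultFileName("owner", "repo", 123)); // prints evals/owner__repo__123.json
```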

## Running the Eval

The eval runner (`run-eval.ts`) processes a single PR through the 3-agent pipeline:

```bash
# Expects pre-fetched data in /tmp/prflow-evals/ and PR list in /tmp/prflow-eval-prs.json
npx tsx evals/run-eval.ts <index>
```

Each run makes 3 API calls (baseline, flow-guided, judge) and writes the result JSON to `evals/`.

## Results

See [RESULTS.md](./RESULTS.md) for the full aggregated results across all 100 PRs.
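
The aggregate figures (win counts and average overall scores) could be tallied from the per-PR results roughly as follows. This is a hedged sketch rather than the script that produced `RESULTS.md`: `JudgedResult`, `summarize`, and the `"tie"` winner value are assumptions, and the sample data is made up for illustration.

```typescript
// Hypothetical flattened view of one result file; field names are assumptions.
type Winner = "flow_guided" | "baseline" | "tie";

interface JudgedResult {
  winner: Winner;
  baselineOverall: number;
  flowGuidedOverall: number;
}

interface Summary {
  wins: Record<Winner, number>;
  avgBaseline: number;
  avgFlowGuided: number;
  avgImprovement: number;
}

const round1 = (x: number): number => Math.round(x * 10) / 10;

function summarize(results: JudgedResult[]): Summary {
  if (results.length === 0) throw new Error("no results to summarize");
  const wins: Record<Winner, number> = { flow_guided: 0, baseline: 0, tie: 0 };
  let baselineSum = 0;
  let flowGuidedSum = 0;
  for (const r of results) {
    wins[r.winner] += 1;
    baselineSum += r.baselineOverall;
    flowGuidedSum += r.flowGuidedOverall;
  }
  const n = results.length;
  return {
    wins,
    avgBaseline: round1(baselineSum / n),
    avgFlowGuided: round1(flowGuidedSum / n),
    // Computed from the unrounded sums, then rounded once.
    avgImprovement: round1((flowGuidedSum - baselineSum) / n),
  };
}

// Made-up sample data, not taken from the actual evaluation.
const sample: JudgedResult[] = [
  { winner: "flow_guided", baselineOverall: 6.2, flowGuidedOverall: 7.4 },
  { winner: "tie", baselineOverall: 7.0, flowGuidedOverall: 7.0 },
  { winner: "flow_guided", baselineOverall: 5.8, flowGuidedOverall: 7.2 },
];

const summary = summarize(sample);
console.log(summary);
```

Note that an average improvement computed from unrounded sums can differ by 0.1 from the difference of the two rounded averages, so it helps to state which convention a report uses.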