Add 100-PR eval evidence: flow-guided wins 92% by benitezfede · Pull Request #1 · benitezfede/prflow

benitezfede · 2026-03-30T14:29:39Z

Summary

Adds a 100-PR blind evaluation comparing baseline (diff-only) vs flow-guided code review across 57 open-source repositories
Flow-guided approach wins 92% of comparisons with +1.3 avg score improvement and +3.1 on flow awareness
Adds detailed methodology docs explaining the 3-agent framework and all 5 scoring criteria
Updates main README with a "Why Flow-Guided Review?" section linking to eval evidence

What's included

evals/*.json — 101 individual eval results (one per PR), each containing baseline review, flow-guided review, and blind judge scores
evals/RESULTS.md — aggregated results: win rates, per-criterion breakdown, by-language analysis, full PR table with GitHub links
evals/README.md — methodology: 3-agent framework, detailed criterion definitions (Completeness, Flow Awareness, Risk Identification, Actionability, Efficiency), file format, how to run
evals/run-eval.ts — eval runner script
README.md — new "Why Flow-Guided Review?" section with key results and links to eval docs

Key results

Criterion	Baseline	Flow-Guided	Delta
Completeness	6.7	7.7	+1.0
Flow Awareness	3.9	7.0	+3.1
Risk Identification	6.3	7.6	+1.3
Actionability	6.2	7.1	+0.9
Efficiency	7.1	7.0	-0.1
Overall	6.0	7.3	+1.3
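
The per-criterion rows above can in principle be recomputed from the individual eval files. The following is a minimal sketch of such an aggregation, not the code in `evals/run-eval.ts`: the `EvalResult` shape, the `summarize` and `round1` names, and the choice to take each delta from the rounded means are all assumptions made for illustration, and the real `evals/*.json` schema may differ.

```typescript
// Hypothetical shape of one judged comparison; the real evals/*.json schema may differ.
type Scores = Record<string, number>;

interface EvalResult {
  baseline: Scores;
  flowGuided: Scores;
}

interface CriterionSummary {
  baseline: number;
  flowGuided: number;
  delta: number;
}

// Round to one decimal place, the precision shown in the results table.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

// Mean score per criterion for both reviewers, plus the difference of the rounded means.
// Assumes `results` is non-empty and every result carries a score for every criterion.
function summarize(
  results: EvalResult[],
  criteria: string[],
): Record<string, CriterionSummary> {
  const summary: Record<string, CriterionSummary> = {};
  for (const c of criteria) {
    const mean = (pick: (r: EvalResult) => number): number =>
      results.reduce((sum, r) => sum + pick(r), 0) / results.length;
    const baseline = round1(mean((r) => r.baseline[c]));
    const flowGuided = round1(mean((r) => r.flowGuided[c]));
    summary[c] = { baseline, flowGuided, delta: round1(flowGuided - baseline) };
  }
  return summary;
}
```

Called with the parsed eval files and the five criterion names, this would return one baseline, flow-guided, and delta triple per criterion, which could then be compared against the figures in RESULTS.md.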

Test plan

All 101 eval JSON files parse correctly
RESULTS.md aggregations match individual file data
README links resolve to correct files
[na] No code changes to test — documentation and data only
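
The first test-plan item (every eval file parses) could be checked with something like the sketch below. It is an illustration only: `findUnparseable` and its input shape are hypothetical names, and reading the files from `evals/` is left to the caller.

```typescript
// One file to check: its name and raw text contents.
interface NamedText {
  name: string;
  text: string;
}

// Returns the names of all inputs whose text is not valid JSON.
// An empty result means every file parsed.
function findUnparseable(files: NamedText[]): string[] {
  const failures: string[] = [];
  for (const file of files) {
    try {
      JSON.parse(file.text);
    } catch {
      // JSON.parse throws a SyntaxError on malformed input.
      failures.push(file.name);
    }
  }
  return failures;
}
```

Returning the list of failing names, rather than throwing on the first error, makes it possible to report every broken file in a single run.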

🤖 Generated with Claude Code

… comparisons

Blind 3-agent eval framework (baseline reviewer, flow-guided reviewer, judge) across 100 PRs from 57 open-source repos. Flow-guided approach scored +1.3 avg improvement with +3.1 on flow awareness. Zero baseline wins.

Adds eval results (101 JSON files), methodology docs, and README section explaining why flow-guided review outperforms flat file-list review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

benitezfede wants to merge 1 commit into main from fbenitez/eval-evidence
