Add 100-PR eval evidence: flow-guided wins 92% #1

Open
benitezfede wants to merge 1 commit into main from fbenitez/eval-evidence

Conversation

@benitezfede

Owner

Summary

  • Adds a 100-PR blind evaluation comparing baseline (diff-only) vs flow-guided code review across 57 open-source repositories
  • The flow-guided approach wins 92% of comparisons, with a +1.3 average improvement in overall score and +3.1 on Flow Awareness
  • Adds detailed methodology docs explaining the 3-agent framework and all 5 scoring criteria
  • Updates main README with a "Why Flow-Guided Review?" section linking to eval evidence
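
To illustrate what "blind" can mean in a setup like this, here is a minimal sketch of hiding the two reviews behind anonymous slots before a judge sees them. It is an assumption about how blinding might be done, not a description of `evals/run-eval.ts`; `blindPair`, `unblind`, and the `A`/`B` slot names are hypothetical.

```typescript
// Hypothetical sketch: none of these names come from the PR.
type Arm = "baseline" | "flow-guided";

interface BlindedPair {
  // What the judge sees: two anonymous reviews.
  reviewA: string;
  reviewB: string;
  // Kept away from the judge; used only to unblind the scores afterwards.
  key: { A: Arm; B: Arm };
}

// Randomly assign the two reviews to the anonymous slots A and B.
// `rand` is injectable so that a run can be reproduced.
function blindPair(
  baselineReview: string,
  flowGuidedReview: string,
  rand: () => number = Math.random,
): BlindedPair {
  const baselineFirst = rand() < 0.5;
  return baselineFirst
    ? {
        reviewA: baselineReview,
        reviewB: flowGuidedReview,
        key: { A: "baseline", B: "flow-guided" },
      }
    : {
        reviewA: flowGuidedReview,
        reviewB: baselineReview,
        key: { A: "flow-guided", B: "baseline" },
      };
}

// Map the judge's per-slot scores back to the two arms.
function unblind(
  pair: BlindedPair,
  scores: { A: number; B: number },
): Record<Arm, number> {
  const result: Record<Arm, number> = { baseline: 0, "flow-guided": 0 };
  result[pair.key.A] = scores.A;
  result[pair.key.B] = scores.B;
  return result;
}
```

Randomizing the slot order per PR keeps position in the prompt from leaking which review came from which arm.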

What's included

  • evals/*.json — 101 individual eval results (one per PR), each containing baseline review, flow-guided review, and blind judge scores
  • evals/RESULTS.md — aggregated results: win rates, per-criterion breakdown, by-language analysis, full PR table with GitHub links
  • evals/README.md — methodology: 3-agent framework, detailed criterion definitions (Completeness, Flow Awareness, Risk Identification, Actionability, Efficiency), file format, how to run
  • evals/run-eval.ts — eval runner script
  • README.md — new "Why Flow-Guided Review?" section with key results and links to eval docs
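
For readers who want to consume the per-PR files, the sketch below shows one way a record with a baseline review, a flow-guided review, and judge scores on the five criteria could be typed and validated. The field names (`prUrl`, `baselineReview`, `flowGuidedReview`, `judge`) are assumptions made for illustration; the actual file format is the one documented in evals/README.md.

```typescript
// Hypothetical shape: field names are illustrative, not taken from evals/*.json.
const CRITERIA = [
  "completeness",
  "flowAwareness",
  "riskIdentification",
  "actionability",
  "efficiency",
] as const;

type Criterion = (typeof CRITERIA)[number];
type Scores = Record<Criterion, number>;

interface EvalRecord {
  prUrl: string;
  baselineReview: string;
  flowGuidedReview: string;
  judge: { baseline: Scores; flowGuided: Scores };
}

// True when every criterion is present and holds a finite number.
function isScores(value: unknown): value is Scores {
  if (typeof value !== "object" || value === null) return false;
  const obj = value as Record<string, unknown>;
  return CRITERIA.every((c) => {
    const v = obj[c];
    return typeof v === "number" && Number.isFinite(v);
  });
}

// Parse one file's text; return null instead of throwing on bad input.
function parseEvalRecord(json: string): EvalRecord | null {
  let data: unknown;
  try {
    data = JSON.parse(json);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const r = data as Record<string, unknown>;
  const judge = r.judge as Record<string, unknown> | undefined;
  const ok =
    typeof r.prUrl === "string" &&
    typeof r.baselineReview === "string" &&
    typeof r.flowGuidedReview === "string" &&
    typeof judge === "object" &&
    judge !== null &&
    isScores(judge.baseline) &&
    isScores(judge.flowGuided);
  return ok ? (data as EvalRecord) : null;
}
```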

Key results

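| Criterion           | Baseline | Flow-Guided | Delta |
|---------------------|----------|-------------|-------|
| Completeness        | 6.7      | 7.7         | +1.0  |
| Flow Awareness      | 3.9      | 7.0         | +3.1  |
| Risk Identification | 6.3      | 7.6         | +1.3  |
| Actionability       | 6.2      | 7.1         | +0.9  |
| Efficiency          | 7.1      | 7.0         | -0.1  |
| Overall             | 6.0      | 7.3         | +1.3  |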
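
The per-criterion deltas above can be recomputed directly from the two score columns. The sketch below does that and also takes the unweighted mean of the five criteria, which rounds to the reported Overall values of 6.0 and 7.3. That Overall is computed as an unweighted mean is an assumption; the PR does not say. The sketch does not try to reproduce the Overall delta, which depends on whether rounding happens before or after subtraction.

```typescript
// Mean scores copied from the "Key results" table.
interface Row {
  criterion: string;
  baseline: number;
  flowGuided: number;
}

const rows: Row[] = [
  { criterion: "Completeness", baseline: 6.7, flowGuided: 7.7 },
  { criterion: "Flow Awareness", baseline: 3.9, flowGuided: 7.0 },
  { criterion: "Risk Identification", baseline: 6.3, flowGuided: 7.6 },
  { criterion: "Actionability", baseline: 6.2, flowGuided: 7.1 },
  { criterion: "Efficiency", baseline: 7.1, flowGuided: 7.0 },
];

// Round to one decimal place, matching the precision of the table.
const round1 = (x: number): number => Math.round(x * 10) / 10;

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Per-criterion improvement of flow-guided over baseline.
const deltas = rows.map((r) => ({
  criterion: r.criterion,
  delta: round1(r.flowGuided - r.baseline),
}));

// Assumption: "Overall" is the unweighted mean of the five criteria.
const overallBaseline = round1(mean(rows.map((r) => r.baseline)));
const overallFlowGuided = round1(mean(rows.map((r) => r.flowGuided)));

for (const d of deltas) {
  console.log(`${d.criterion}: ${d.delta >= 0 ? "+" : ""}${d.delta.toFixed(1)}`);
}
console.log(`Overall: ${overallBaseline.toFixed(1)} vs ${overallFlowGuided.toFixed(1)}`);
```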

Test plan

  • All 101 eval JSON files parse correctly
  • RESULTS.md aggregations match individual file data
  • README links resolve to correct files
  • [na] No code changes to test — documentation and data only
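
One way to approach the second check in the test plan is to recompute the win tally from per-PR scores and compare it with RESULTS.md. This is a minimal sketch, assuming each comparison is decided by comparing the two overall judge scores and that equal scores count as a tie; the PR does not state the actual rule. `PairScore` and `tally` are hypothetical names, and the sample data is synthetic, not taken from the eval files.

```typescript
// Hypothetical sketch: the decision rule and names are assumptions.
interface PairScore {
  baseline: number;
  flowGuided: number;
}

interface Tally {
  flowGuidedWins: number;
  baselineWins: number;
  ties: number;
  flowGuidedWinRate: number; // fraction in [0, 1]
}

// Count wins, losses, and ties across all comparisons.
function tally(pairs: PairScore[]): Tally {
  let flowGuidedWins = 0;
  let baselineWins = 0;
  let ties = 0;
  for (const p of pairs) {
    if (p.flowGuided > p.baseline) flowGuidedWins++;
    else if (p.flowGuided < p.baseline) baselineWins++;
    else ties++;
  }
  return {
    flowGuidedWins,
    baselineWins,
    ties,
    flowGuidedWinRate: pairs.length === 0 ? 0 : flowGuidedWins / pairs.length,
  };
}

// Synthetic example data, for illustration only.
const sample: PairScore[] = [
  { baseline: 6.0, flowGuided: 7.5 },
  { baseline: 5.5, flowGuided: 7.0 },
  { baseline: 7.0, flowGuided: 7.0 },
  { baseline: 6.2, flowGuided: 8.1 },
];

const sampleTally = tally(sample);
console.log(sampleTally);
```

Counting ties separately matters here: a win rate below 100% does not by itself imply that the baseline won any comparison.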

🤖 Generated with Claude Code

… comparisons

Blind 3-agent eval framework (baseline reviewer, flow-guided reviewer, judge)
across 100 PRs from 57 open-source repos. Flow-guided approach scored +1.3 avg
improvement with +3.1 on flow awareness. Zero baseline wins.

Adds eval results (101 JSON files), methodology docs, and README section
explaining why flow-guided review outperforms flat file-list review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>