Add completeness principle and online eval pipeline by DhruvBhatia0 · Pull Request #2 · morphllm/examples

DhruvBhatia0 · 2026-03-10T18:49:28Z

Summary

Add "VERIFY COMPLETENESS OF NEW ADDITIONS" investigation principle to the system prompt: it checks resource cleanup, DB constraints, auth parity, and read/write pair consistency
Add skip_post mode to review API for running evals without posting to GitHub
Add online_eval.py pipeline (BigQuery discover → GitHub enrich → Fly review → LLM judge)
Lower confidence thresholds from 0.99 to 0.50-0.70 per category
Gitignore output artifacts and eval result files
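
To make the per-category threshold change concrete, here is a minimal sketch of how review comments might be filtered by confidence. The category names, the exact threshold values, and the `Comment` type are hypothetical; the PR only states that thresholds were lowered from 0.99 to somewhere in the 0.50-0.70 range per category.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical categories and values: the PR gives only the 0.50-0.70 range.
THRESHOLDS: Dict[str, float] = {
    "logic": 0.50,
    "security": 0.60,
    "concurrency": 0.70,
}
# Assumed fallback for categories without an explicit entry.
DEFAULT_THRESHOLD = 0.70


@dataclass(frozen=True)
class Comment:
    category: str
    confidence: float
    body: str


def filter_comments(
    comments: List[Comment],
    thresholds: Dict[str, float] = THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> List[Comment]:
    """Keep only comments whose confidence meets their category's threshold."""
    return [
        c for c in comments
        if c.confidence >= thresholds.get(c.category, default)
    ]
```

With a single 0.99 threshold almost every comment would be dropped; a per-category table lets each category trade precision against recall separately.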

Eval Results (100 PRs, 10 workers, gpt-5.4)

Run	P (GT)	Recall	Mean F1	Key Change
Baseline	0.288	0.120	0.370	0.70 thresholds, extensive search
Attempt 1	0.244	0.111	0.369	+ code quality paragraph (hurt)
Attempt 2	0.301	0.128	0.418	- code quality, + completeness principle
Attempt 3	0.263	0.104	0.363	+ config/infra category (hurt)

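Broadening scope to non-bug categories hurt in both attempts that tried it (Attempts 1 and 3): precision, recall, and mean F1 all fell below the baseline. The completeness principle (Attempt 2) improved all three metrics because it finds more bugs within the existing scope.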
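
The PR does not state how the Mean F1 column in the table above is aggregated. As one hedged reading, the sketch below computes precision, recall, and F1 per PR and then averages F1 across PRs. The function names and the convention of scoring an undefined ratio as 0.0 are assumptions, not taken from online_eval.py. A per-PR mean need not equal the F1 of the aggregate precision and recall: for the Baseline row, the harmonic mean of 0.288 and 0.120 is about 0.169, not 0.370.

```python
from typing import Iterable, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pr_scores(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one PR from judged comment counts.

    Assumption: an undefined ratio (zero denominator) is scored as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1(precision, recall)


def mean_f1(per_pr_counts: Iterable[Tuple[int, int, int]]) -> float:
    """Average the per-PR F1 over (tp, fp, fn) triples."""
    scores = [pr_scores(tp, fp, fn)[2] for tp, fp, fn in per_pr_counts]
    return sum(scores) / len(scores) if scores else 0.0
```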

Test plan

Deployed to Fly and ran 3 eval iterations on ~90 PRs each
Verified skip_post mode returns comments without posting to GitHub
Best prompt (Attempt 2) is deployed
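
The skip_post check in the test plan can be illustrated with a small sketch. Everything here is hypothetical apart from the `skip_post` flag name: `run_review`, its parameters, and the shape of the return value are assumptions used for illustration, not the review API's actual interface.

```python
from typing import Any, Callable, Dict, List


def run_review(
    generate: Callable[[str], List[str]],
    post: Callable[[str, List[str]], None],
    pr_ref: str,
    skip_post: bool = False,
) -> Dict[str, Any]:
    """Generate review comments and post them unless skip_post is set.

    `generate` stands in for the reviewer and `post` for the GitHub
    client; both are injected so an eval can run without side effects.
    """
    comments = generate(pr_ref)
    posted = False
    if not skip_post:
        post(pr_ref, comments)
        posted = True
    # Comments are returned in both modes so an eval can score them.
    return {"comments": comments, "posted": posted}
```

Injecting the posting function keeps the eval path and the production path on the same code, so the comments being scored are the ones that would have been posted.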

🤖 Generated with Claude Code

- Add "VERIFY COMPLETENESS OF NEW ADDITIONS" principle to system prompt (resource cleanup, DB constraints, auth parity, read/write consistency)
- Add skip_post mode to review API for eval without posting to GitHub
- Add online_eval.py pipeline (discover → enrich → review → judge)
- Lower confidence thresholds from 0.99 to 0.50-0.70 per category
- Track eval results in eval_results.md (best: F1=0.418, P=0.301, R=0.128)
- Gitignore output artifacts and eval result files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

DhruvBhatia0 force-pushed the experimental-prompt branch from cf78ed8 to 29068db on March 10, 2026, 20:25

DhruvBhatia0 wants to merge 1 commit into main from experimental-prompt
