Add completeness principle and online eval pipeline#2

Open
DhruvBhatia0 wants to merge 1 commit into main from experimental-prompt

Conversation

@DhruvBhatia0
Collaborator

Summary

  • Add "VERIFY COMPLETENESS OF NEW ADDITIONS" investigation principle to the system prompt. It checks resource cleanup, DB constraints, auth parity, and read/write pair consistency
  • Add skip_post mode to review API for running evals without posting to GitHub
  • Add online_eval.py pipeline (BigQuery discover → GitHub enrich → Fly review → LLM judge)
  • Lower confidence thresholds from 0.99 to 0.50-0.70 per category
  • Gitignore output artifacts and eval result files
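
The four pipeline stages listed above can be sketched as a single loop. This is a minimal illustration, not the contents of online_eval.py: `run_online_eval` and the four stage callables are hypothetical names, and the real stages (BigQuery, GitHub, Fly, the LLM judge) are stood in for by injected functions so the sketch runs without any network access.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical type aliases; the real pipeline's data shapes are not shown in this PR.
PR = Dict[str, Any]


def run_online_eval(
    discover: Callable[[int], Iterable[PR]],   # stands in for the BigQuery discover step
    enrich: Callable[[PR], PR],                # stands in for the GitHub enrich step
    review: Callable[..., List[Any]],          # stands in for the Fly-hosted review API
    judge: Callable[[PR, List[Any]], Any],     # stands in for the LLM judge
    limit: int = 100,
) -> List[Dict[str, Any]]:
    """Run discover -> enrich -> review -> judge over up to `limit` PRs."""
    results = []
    for pr in discover(limit):
        enriched = enrich(pr)
        # skip_post=True: collect review comments without posting to GitHub.
        comments = review(enriched, skip_post=True)
        verdict = judge(enriched, comments)
        results.append({"pr": pr, "comments": comments, "verdict": verdict})
    return results
```

Passing the stages in as callables keeps the loop testable with stubs; whether the actual script is structured this way is an assumption.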

Eval Results (100 PRs, 10 workers, gpt-5.4)

| Run | P (GT) | Recall | Mean F1 | Key Change |
|---|---|---|---|---|
| Baseline | 0.288 | 0.120 | 0.370 | 0.70 thresholds, extensive search |
| Attempt 1 | 0.244 | 0.111 | 0.369 | + code quality paragraph (hurt) |
| Attempt 2 | 0.301 | 0.128 | 0.418 | - code quality, + completeness principle |
| Attempt 3 | 0.263 | 0.104 | 0.363 | + config/infra category (hurt) |

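Broadening scope to non-bug categories hurt in both attempts that tried it: Attempts 1 and 3 each scored below baseline on precision, recall, and mean F1. The completeness principle (Attempt 2) improved all three metrics, which suggests it works because it finds more bugs within the existing scope.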
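
For readers comparing the columns: the reported Mean F1 is higher than the harmonic mean of the listed precision and recall (for the baseline, 2 · 0.288 · 0.120 / (0.288 + 0.120) ≈ 0.169). That would be consistent with F1 being averaged per PR rather than computed from pooled counts, though how the eval actually aggregates is an assumption here. The sketch below shows how the two aggregations can differ; the function names and the `(tp, fp, fn)` tuple layout are hypothetical.

```python
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives) for one PR


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts; 0.0 where a ratio is undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def pooled_prf(per_pr: List[Counts]) -> Tuple[float, float, float]:
    """Sum counts across PRs first, then compute the metrics once."""
    tp = sum(c[0] for c in per_pr)
    fp = sum(c[1] for c in per_pr)
    fn = sum(c[2] for c in per_pr)
    return prf(tp, fp, fn)


def mean_f1(per_pr: List[Counts]) -> float:
    """Compute F1 for each PR separately, then average the per-PR scores."""
    if not per_pr:
        return 0.0
    return sum(prf(*c)[2] for c in per_pr) / len(per_pr)
```

With one PR scored `(1, 0, 0)` and another `(0, 1, 3)`, the per-PR mean F1 is 0.5 while the pooled F1 is about 0.333, so a single well-reviewed PR can lift the mean well above the pooled figure.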

Test plan

  • Deployed to Fly and ran 3 eval iterations on ~90 PRs each
  • Verified skip_post mode returns comments without posting to GitHub
  • Best prompt (Attempt 2) is deployed
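
The skip_post check in the test plan can be illustrated with a small stand-in for the review step. This is a sketch under stated assumptions, not the review API itself: `Comment`, `review`, `THRESHOLDS`, and the category names are hypothetical, and the per-category values are placeholders picked from the 0.50-0.70 range mentioned in the summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical per-category confidence thresholds (placeholders within 0.50-0.70).
THRESHOLDS = {"bug": 0.50, "security": 0.60, "performance": 0.70}
DEFAULT_THRESHOLD = 0.70  # assumed fallback for categories not listed above


@dataclass(frozen=True)
class Comment:
    category: str
    confidence: float
    body: str


def review(
    candidates: Iterable[Comment],
    post: Callable[[Comment], None],
    skip_post: bool = False,
) -> List[Comment]:
    """Keep comments at or above their category threshold; post them unless skip_post is set."""
    kept = [
        c for c in candidates
        if c.confidence >= THRESHOLDS.get(c.category, DEFAULT_THRESHOLD)
    ]
    if not skip_post:
        for c in kept:
            post(c)  # in production this would call GitHub; here it is injected
    return kept
```

Calling `review(candidates, posted.append, skip_post=True)` with a list named `posted` returns the filtered comments and leaves `posted` empty, which mirrors the behavior the test plan verifies.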

🤖 Generated with Claude Code

- Add "VERIFY COMPLETENESS OF NEW ADDITIONS" principle to system prompt
  (resource cleanup, DB constraints, auth parity, read/write consistency)
- Add skip_post mode to review API for eval without posting to GitHub
- Add online_eval.py pipeline (discover → enrich → review → judge)
- Lower confidence thresholds from 0.99 to 0.50-0.70 per category
- Track eval results in eval_results.md (best: F1=0.418, P=0.301, R=0.128)
- Gitignore output artifacts and eval result files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>