3 changes: 3 additions & 0 deletions .gitignore
@@ -3,5 +3,8 @@
**/.wrangler/
pr_clones/
code-review-benchmark/
pr_review_agent/output/
online_eval_results*.json
online_eval.db*
.env
plan.md
85 changes: 85 additions & 0 deletions eval_results.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
# Online Eval Results

Each run evaluates prompt changes against the same set of ~91 PRs (skip-discover, skip-enrich), using 10 workers, gpt-5.4, and skip_post=True.

## Baseline (Run 2 — 0.70 thresholds, no borderline line, extensive search line)

| Metric | All PRs | PRs w/ GT (52) |
|--------|---------|----------------|
| Suggestions | 208 | 125 |
| Matched suggestions | 36 | 36 |
| Ground truth | 300 | 300 |
| Precision | 0.173 | 0.288 |
| Recall | 0.120 | 0.120 |
| Mean F1 | — | 0.370 (26 PRs) |
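The precision and recall columns are simple ratios of the counts above; a minimal sketch of the arithmetic (the helper is illustrative, not the actual eval harness):

```python
def precision_recall_f1(matched: int, suggestions: int, ground_truth: int):
    """Compute precision, recall, and F1 from raw match counts."""
    precision = matched / suggestions if suggestions else 0.0
    recall = matched / ground_truth if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Baseline, PRs with ground truth: 36 matched, 125 suggestions, 300 GT
p, r, _ = precision_recall_f1(36, 125, 300)
assert round(p, 3) == 0.288 and round(r, 3) == 0.120
```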

## Attempt 1 — Add code quality paragraph + restore borderline bug line

**Changes:**
- Restored "it is better to report a borderline real bug than to miss one" in system.py and reviewer.py
- Added paragraph: "ALSO LOOK FOR CODE QUALITY ISSUES THE AUTHOR WOULD FIX" — inconsistent naming/style, unnecessarily complex logic, duplicated code, dead code, unused imports
- Kept 0.70 thresholds and extensive search line

**Results:**

| Metric | All PRs (90) | PRs w/ GT (51) |
|--------|--------------|----------------|
| Suggestions | 195 | 123 |
| Matched suggestions | 30 | 30 |
| Ground truth | 271 | 271 |
| Precision | 0.154 | 0.244 |
| Recall | 0.111 | 0.111 |
| Mean F1 | — | 0.369 (21 PRs) |

**Analysis:** The code quality paragraph was net negative. It generated 5 more non-bug suggestions (21 vs 16), of which only 1 matched, and it crowded out bug-finding: 18 fewer bug suggestions and 6 lost matches. Matches were lost on 12 PRs vs gained on 8.

## Attempt 2 — Remove code quality, add completeness principle

**Changes:**
- Removed "ALSO LOOK FOR CODE QUALITY ISSUES" paragraph (proven harmful)
- Added investigation principle #5: "VERIFY COMPLETENESS OF NEW ADDITIONS" — resource cleanup/teardown, DB constraints matching business logic, API endpoint authorization, read/write pair consistency, feature flag both-paths
- Kept borderline bug line, 0.70 thresholds, extensive search line

**Results:**

| Metric | All PRs (85) | PRs w/ GT (51) |
|--------|--------------|----------------|
| Suggestions | 195 | 123 |
| Matched suggestions | 37 | 37 |
| Ground truth | 290 | 290 |
| Precision | 0.190 | 0.301 |
| Recall | 0.128 | 0.128 |
| Mean F1 | — | 0.418 (25 PRs) |

**Analysis:** Best run so far. Removing the code quality paragraph refocused the model on bugs, and the completeness principle helped it find resource cleanup, constraint, and lifecycle bugs: 13 matches gained vs 9 lost on the shared PR set. The bug match rate improved from 16% to 19%, and the security match rate from 10% to 18%.

## Attempt 3 — Add config/infrastructure category + investigation budget emphasis

**Changes:**
- Added bug category #15: "Configuration / infrastructure" — missing env var mappings, Dockerfile steps, CI workflow permissions, version pins
- Strengthened "DON'T STOP AT THE FIRST FINDING" with "Pay equal attention to config files as to source code"

**Results:**

| Metric | All PRs (91) | PRs w/ GT (52) |
|--------|--------------|----------------|
| Suggestions | 209 | — |
| Matched suggestions | 31 | 31 |
| Ground truth | 299 | 299 |
| Precision | 0.148 | 0.263 |
| Recall | 0.104 | 0.104 |
| Mean F1 | — | 0.363 (24 PRs) |

**Analysis:** Regressed vs Attempt 2, showing the same pattern as Attempt 1: broadening scope dilutes bug-finding focus. The run generated 14 more suggestions but 6 fewer matches, losing matches on 15 shared PRs vs gaining on 8. The config emphasis caused the model to spend cognitive budget on config analysis at the expense of core logic bugs.

## Summary

| Run | P (GT) | Recall | Mean F1 | Key Change |
|-----|--------|--------|---------|------------|
| Baseline | 0.288 | 0.120 | 0.370 | 0.70 thresholds, no borderline, extensive search |
| Attempt 1 | 0.244 | 0.111 | 0.369 | + code quality paragraph, + borderline bug line |
| **Attempt 2** | **0.301** | **0.128** | **0.418** | - code quality, + completeness principle |
| Attempt 3 | 0.263 | 0.104 | 0.363 | + config/infra category, + config emphasis |

**Best: Attempt 2.** Key insight: the model performs best when its scope is tightly focused on runtime defects + structural completeness. Any instruction that broadens scope to non-bug categories (code quality, config/infra) dilutes focus and hurts precision without meaningfully improving recall.
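The "Mean F1" column averages per-PR F1 rather than pooling counts, which is why it can move independently of the aggregate precision/recall. A sketch of that aggregation, assuming the "(N PRs)" annotations count PRs with at least one matched suggestion (the per-PR counts below are illustrative, not real run data):

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def mean_f1(per_pr: list[tuple[int, int, int]]) -> float:
    """Average F1 over PRs that had at least one matched suggestion.

    Each tuple is (matched, suggestions, ground_truth) for one PR.
    """
    scores = [
        f1(m / s, m / g)
        for m, s, g in per_pr
        if m  # PRs with zero matches do not contribute
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Three hypothetical PRs; the third has no matches and is skipped.
print(round(mean_f1([(2, 4, 5), (1, 3, 2), (0, 6, 4)]), 3))
```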
73 changes: 48 additions & 25 deletions github_app/app.py
@@ -103,13 +103,17 @@ class ReviewRequest(BaseModel):
model: str = "" # Override model name (e.g. "gpt-5.4", "gemini-3.1-pro-preview")
openai_api_key: str = "" # Override OpenAI API key (optional, falls back to env)
google_api_key: str = "" # Override Google API key (optional, falls back to env)
skip_post: bool = False # Run review but don't post to GitHub (for evals)


REVIEW_API_SECRET = os.environ.get("REVIEW_API_SECRET", "") or os.environ.get("GHAPP_INTERNAL_SECRET", "")


async def _run_review_from_api(req: ReviewRequest):
"""Run the review pipeline from an API request (not a webhook)."""
async def _run_review_from_api(req: ReviewRequest) -> list | None:
"""Run the review pipeline from an API request (not a webhook).

Returns the list of review comments when skip_post=True, None otherwise.
"""
import time
import uuid
from github_app.telemetry import make_event_emitter
@@ -187,32 +191,39 @@ async def _run_review_from_api(req: ReviewRequest):

t_post = time.monotonic()
num_issues = len(comments) if comments else 0
summary_line = f"Found {num_issues} issue{'s' if num_issues != 1 else ''}"
if req.personality and req.github_username:
review_body = (
f"@{req.github_username}'s review twin\n\n"
f"{summary_line}\n\n---\n"
f"*a2a-review based on @{req.github_username}'s coding preferences*"
)
else:
review_body = f"## Morph Code Review\n\n{summary_line}"

if comments:
await client.post_review(
req.owner, req.repo, req.pr_number, req.head_sha,
comments, diff, review_body,
if req.skip_post:
logger.info(
"skip_post=True, skipping GitHub post for %s PR #%d (%d comments)",
full_name, req.pr_number, num_issues,
)
duration_post = 0.0
else:
# Post summary even with 0 issues so callers can detect completion
await client.post_issue_comment(
req.owner, req.repo, req.pr_number, review_body,
)
duration_post = round(time.monotonic() - t_post, 1)
on_event("review.post", {
"comments_posted": len(comments) if comments else 0,
"duration_s": duration_post,
"success": True,
})
summary_line = f"Found {num_issues} issue{'s' if num_issues != 1 else ''}"
if req.personality and req.github_username:
review_body = (
f"@{req.github_username}'s review twin\n\n"
f"{summary_line}\n\n---\n"
f"*a2a-review based on @{req.github_username}'s coding preferences*"
)
else:
review_body = f"## Morph Code Review\n\n{summary_line}"

if comments:
await client.post_review(
req.owner, req.repo, req.pr_number, req.head_sha,
comments, diff, review_body,
)
else:
await client.post_issue_comment(
req.owner, req.repo, req.pr_number, review_body,
)
duration_post = round(time.monotonic() - t_post, 1)
on_event("review.post", {
"comments_posted": num_issues,
"duration_s": duration_post,
"success": True,
})

logger.info(
"API review completed for %s PR #%d: %d comments",
@@ -240,6 +251,9 @@ async def _run_review_from_api(req: ReviewRequest):
if req.callback_url:
await _callback(req.callback_url, agent_run_id, "completed")

if req.skip_post:
return comments or []

except Exception as exc:
logger.exception("API review failed for %s PR #%d", full_name, req.pr_number)
on_event("review.failed", {
@@ -278,6 +292,15 @@ async def review_api(req: ReviewRequest, request: Request):
if REVIEW_API_SECRET and auth != f"Bearer {REVIEW_API_SECRET}":
raise HTTPException(status_code=401, detail="Unauthorized")

if req.skip_post:
# Run synchronously and return comments directly (for evals)
comments = await _run_review_from_api(req)
return {
"status": "completed",
"agent_run_id": req.agent_run_id or "generated-server-side",
"comments": comments or [],
}

asyncio.create_task(_run_review_from_api(req))
return {"status": "accepted", "agent_run_id": req.agent_run_id or "generated-server-side"}

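The new synchronous path above runs the review in-request and returns the comments in the response body instead of scheduling a background task. A hedged sketch of calling it from an eval client, using only the stdlib (the route path and base URL are assumptions; the payload fields come from `ReviewRequest` as shown in the diff):

```python
import json
import urllib.request

def build_review_request(base_url: str, secret: str,
                         owner: str, repo: str, pr_number: int):
    """Build the POST request for the synchronous skip_post path.

    The "/review" route name and the Bearer auth scheme are assumptions
    consistent with the handler shown in the diff.
    """
    payload = {
        "owner": owner,
        "repo": repo,
        "pr_number": pr_number,
        "skip_post": True,  # run the review, return comments instead of posting
    }
    return urllib.request.Request(
        f"{base_url}/review",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # Required when REVIEW_API_SECRET is set server-side
            "Authorization": f"Bearer {secret}",
        },
        method="POST",
    )

req = build_review_request("http://localhost:8000", "s3cr3t",
                           "acme", "widgets", 123)
# urllib.request.urlopen(req) would yield {"status": "completed",
# "agent_run_id": ..., "comments": [...]} per the handler above.
```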