3 changes: 3 additions & 0 deletions .gitignore
@@ -3,5 +3,8 @@
**/.wrangler/
pr_clones/
code-review-benchmark/
pr_review_agent/output/
online_eval_results*.json
online_eval.db*
.env
plan.md
85 changes: 85 additions & 0 deletions eval_results.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
# Online Eval Results

Each run evaluates prompt changes against the same set of ~91 PRs (skip-discover, skip-enrich), using 10 workers, gpt-5.4, and skip_post=True.

## Baseline (Run 2 — 0.70 thresholds, no borderline line, extensive search line)

| Metric | All PRs | PRs w/ GT (52) |
|--------|---------|----------------|
| Suggestions | 208 | 125 |
| Matched suggestions | 36 | 36 |
| Ground truth | 300 | 300 |
| Precision | 0.173 | 0.288 |
| Recall | 0.120 | 0.120 |
| Mean F1 | — | 0.370 (26 PRs) |
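The precision and recall columns are simple ratios of the counts above; a minimal sketch of the arithmetic (the helper is illustrative, not the actual eval harness):

```python
def precision_recall_f1(matched: int, suggestions: int, ground_truth: int):
    """Compute precision, recall, and F1 from raw match counts."""
    precision = matched / suggestions if suggestions else 0.0
    recall = matched / ground_truth if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Baseline, PRs with ground truth: 36 matched, 125 suggestions, 300 GT
p, r, _ = precision_recall_f1(36, 125, 300)
assert round(p, 3) == 0.288 and round(r, 3) == 0.120
```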

## Attempt 1 — Add code quality paragraph + restore borderline bug line

**Changes:**
- Restored "it is better to report a borderline real bug than to miss one" in system.py and reviewer.py
- Added paragraph: "ALSO LOOK FOR CODE QUALITY ISSUES THE AUTHOR WOULD FIX" — inconsistent naming/style, unnecessarily complex logic, duplicated code, dead code, unused imports
- Kept 0.70 thresholds and extensive search line

**Results:**

| Metric | All PRs (90) | PRs w/ GT (51) |
|--------|--------------|----------------|
| Suggestions | 195 | 123 |
| Matched suggestions | 30 | 30 |
| Ground truth | 271 | 271 |
| Precision | 0.154 | 0.244 |
| Recall | 0.111 | 0.111 |
| Mean F1 | — | 0.369 (21 PRs) |

**Analysis:** The code quality paragraph was net negative. It generated 5 more non-bug suggestions (21 vs 16), of which only 1 matched, and it crowded out bug-finding: 18 fewer bug suggestions and 6 lost matches. Matches were lost on 12 PRs vs gained on 8.

## Attempt 2 — Remove code quality, add completeness principle

**Changes:**
- Removed "ALSO LOOK FOR CODE QUALITY ISSUES" paragraph (proven harmful)
- Added investigation principle #5: "VERIFY COMPLETENESS OF NEW ADDITIONS" — resource cleanup/teardown, DB constraints matching business logic, API endpoint authorization, read/write pair consistency, feature flag both-paths
- Kept borderline bug line, 0.70 thresholds, extensive search line

**Results:**

| Metric | All PRs (85) | PRs w/ GT (51) |
|--------|--------------|----------------|
| Suggestions | 195 | 123 |
| Matched suggestions | 37 | 37 |
| Ground truth | 290 | 290 |
| Precision | 0.190 | 0.301 |
| Recall | 0.128 | 0.128 |
| Mean F1 | — | 0.418 (25 PRs) |

**Analysis:** Best run so far. Removing the code quality paragraph refocused the model on bugs, and the completeness principle helped it find resource cleanup, constraint, and lifecycle bugs: 13 matches gained vs 9 lost on the shared PR set. The bug match rate improved from 16% to 19%, and the security match rate from 10% to 18%.

## Attempt 3 — Add config/infrastructure category + investigation budget emphasis

**Changes:**
- Added bug category #15: "Configuration / infrastructure" — missing env var mappings, Dockerfile steps, CI workflow permissions, version pins
- Strengthened "DON'T STOP AT THE FIRST FINDING" with "Pay equal attention to config files as to source code"

**Results:**

| Metric | All PRs (91) | PRs w/ GT (52) |
|--------|--------------|----------------|
| Suggestions | 209 | — |
| Matched suggestions | 31 | 31 |
| Ground truth | 299 | 299 |
| Precision | 0.148 | 0.263 |
| Recall | 0.104 | 0.104 |
| Mean F1 | — | 0.363 (24 PRs) |

**Analysis:** Regressed vs Attempt 2, showing the same pattern as Attempt 1: broadening scope dilutes bug-finding focus. The run generated 14 more suggestions but 6 fewer matches, losing matches on 15 shared PRs vs gaining on 8. The config emphasis caused the model to spend cognitive budget on config analysis at the expense of core logic bugs.

## Summary

| Run | P (GT) | Recall | Mean F1 | Key Change |
|-----|--------|--------|---------|------------|
| Baseline | 0.288 | 0.120 | 0.370 | 0.70 thresholds, no borderline, extensive search |
| Attempt 1 | 0.244 | 0.111 | 0.369 | + code quality paragraph, + borderline bug line |
| **Attempt 2** | **0.301** | **0.128** | **0.418** | - code quality, + completeness principle |
| Attempt 3 | 0.263 | 0.104 | 0.363 | + config/infra category, + config emphasis |

**Best: Attempt 2.** Key insight: the model performs best when its scope is tightly focused on runtime defects + structural completeness. Any instruction that broadens scope to non-bug categories (code quality, config/infra) dilutes focus and hurts precision without meaningfully improving recall.
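The "Mean F1" column averages per-PR F1 rather than pooling counts, which is why it can move independently of the aggregate precision/recall. A sketch of that aggregation, assuming the "(N PRs)" annotations count PRs with at least one matched suggestion (the per-PR counts below are illustrative, not real run data):

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def mean_f1(per_pr: list[tuple[int, int, int]]) -> float:
    """Average F1 over PRs that had at least one matched suggestion.

    Each tuple is (matched, suggestions, ground_truth) for one PR.
    """
    scores = [
        f1(m / s, m / g)
        for m, s, g in per_pr
        if m  # PRs with zero matches do not contribute
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Three hypothetical PRs; the third has no matches and is skipped.
print(round(mean_f1([(2, 4, 5), (1, 3, 2), (0, 6, 4)]), 3))
```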
73 changes: 48 additions & 25 deletions github_app/app.py
@@ -103,13 +103,17 @@ class ReviewRequest(BaseModel):
model: str = "" # Override model name (e.g. "gpt-5.4", "gemini-3.1-pro-preview")
openai_api_key: str = "" # Override OpenAI API key (optional, falls back to env)
google_api_key: str = "" # Override Google API key (optional, falls back to env)
skip_post: bool = False # Run review but don't post to GitHub (for evals)


REVIEW_API_SECRET = os.environ.get("REVIEW_API_SECRET", "") or os.environ.get("GHAPP_INTERNAL_SECRET", "")


async def _run_review_from_api(req: ReviewRequest):
"""Run the review pipeline from an API request (not a webhook)."""
async def _run_review_from_api(req: ReviewRequest) -> list | None:
"""Run the review pipeline from an API request (not a webhook).

Returns the list of review comments when skip_post=True, None otherwise.
"""
import time
import uuid
from github_app.telemetry import make_event_emitter
@@ -187,32 +191,39 @@ async def _run_review_from_api(req: ReviewRequest):

t_post = time.monotonic()
num_issues = len(comments) if comments else 0
summary_line = f"Found {num_issues} issue{'s' if num_issues != 1 else ''}"
if req.personality and req.github_username:
review_body = (
f"@{req.github_username}'s review twin\n\n"
f"{summary_line}\n\n---\n"
f"*a2a-review based on @{req.github_username}'s coding preferences*"
)
else:
review_body = f"## Morph Code Review\n\n{summary_line}"

if comments:
await client.post_review(
req.owner, req.repo, req.pr_number, req.head_sha,
comments, diff, review_body,
if req.skip_post:
logger.info(
"skip_post=True, skipping GitHub post for %s PR #%d (%d comments)",
full_name, req.pr_number, num_issues,
)
duration_post = 0.0
else:
# Post summary even with 0 issues so callers can detect completion
await client.post_issue_comment(
req.owner, req.repo, req.pr_number, review_body,
)
duration_post = round(time.monotonic() - t_post, 1)
on_event("review.post", {
"comments_posted": len(comments) if comments else 0,
"duration_s": duration_post,
"success": True,
})
summary_line = f"Found {num_issues} issue{'s' if num_issues != 1 else ''}"
if req.personality and req.github_username:
review_body = (
f"@{req.github_username}'s review twin\n\n"
f"{summary_line}\n\n---\n"
f"*a2a-review based on @{req.github_username}'s coding preferences*"
)
else:
review_body = f"## Morph Code Review\n\n{summary_line}"

if comments:
await client.post_review(
req.owner, req.repo, req.pr_number, req.head_sha,
comments, diff, review_body,
)
else:
await client.post_issue_comment(
req.owner, req.repo, req.pr_number, review_body,
)
duration_post = round(time.monotonic() - t_post, 1)
on_event("review.post", {
"comments_posted": num_issues,
"duration_s": duration_post,
"success": True,
})

logger.info(
"API review completed for %s PR #%d: %d comments",
@@ -240,6 +251,9 @@ async def _run_review_from_api(req: ReviewRequest):
if req.callback_url:
await _callback(req.callback_url, agent_run_id, "completed")

if req.skip_post:
return comments or []

except Exception as exc:
logger.exception("API review failed for %s PR #%d", full_name, req.pr_number)
on_event("review.failed", {
@@ -278,6 +292,15 @@ async def review_api(req: ReviewRequest, request: Request):
if REVIEW_API_SECRET and auth != f"Bearer {REVIEW_API_SECRET}":
raise HTTPException(status_code=401, detail="Unauthorized")

if req.skip_post:
# Run synchronously and return comments directly (for evals)
comments = await _run_review_from_api(req)
return {
"status": "completed",
"agent_run_id": req.agent_run_id or "generated-server-side",
"comments": comments or [],
}

asyncio.create_task(_run_review_from_api(req))
return {"status": "accepted", "agent_run_id": req.agent_run_id or "generated-server-side"}

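The new synchronous path above runs the review in-request and returns the comments in the response body instead of scheduling a background task. A hedged sketch of calling it from an eval client, using only the stdlib (the route path and base URL are assumptions; the payload fields come from `ReviewRequest` as shown in the diff):

```python
import json
import urllib.request

def build_review_request(base_url: str, secret: str,
                         owner: str, repo: str, pr_number: int):
    """Build the POST request for the synchronous skip_post path.

    The "/review" route name and the Bearer auth scheme are assumptions
    consistent with the handler shown in the diff.
    """
    payload = {
        "owner": owner,
        "repo": repo,
        "pr_number": pr_number,
        "skip_post": True,  # run the review, return comments instead of posting
    }
    return urllib.request.Request(
        f"{base_url}/review",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # Required when REVIEW_API_SECRET is set server-side
            "Authorization": f"Bearer {secret}",
        },
        method="POST",
    )

req = build_review_request("http://localhost:8000", "s3cr3t",
                           "acme", "widgets", 123)
# urllib.request.urlopen(req) would yield {"status": "completed",
# "agent_run_id": ..., "comments": [...]} per the handler above.
```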