Description
Context
We've been logging the outcomes of local code reviews in a feedback log. Analysis of REJECT entries reveals systematic failure modes in the review skills that can be fixed.
Feedback Log Analysis Data
We analyzed 24 review outcomes. Key REJECT patterns:
| Rejection | Root Cause | Frequency Pattern |
|---|---|---|
| "Line too long" violations that don't exist | Reviewer not running linter before flagging | 4 entries (rows 15-18) |
| "Misleading RunContext pattern" | pydantic-ai docs explicitly support raw function registration | 1 entry |
| "Inconsistent error handling" flagged as bug | Intentional optimization, not a bug | 1 entry |
| "Missing test coverage" for wrong code path | Reviewer didn't trace code before claiming gap | 1 entry |
Sample Feedback Log (Current Format)
date,file,line,issue,verdict,action
2025-12-20,tests/integration/test_cli_flows.py,407,Unused extra_args parameter in parametrization,ACCEPT,Fixed - removed dead parameter
2025-12-20,tests/integration/test_cli_flows.py,237-242,Missing review --local in git repo error test,REJECT,Not applicable - review uses different error path
2025-12-21,amelia/server/orchestrator/service.py,1702,Direct mutation of frozen ExecutionState,ACCEPT,Fixed using model_copy(update={...})
2025-12-23,amelia/drivers/api/tools.py,48-53,Misleading RunContext pattern - should use decorators,REJECT,pydantic-ai docs explicitly support passing raw functions with RunContext to Agent(tools=[])
2025-12-23,amelia/drivers/api/openai.py,102,Line too long (104 > 100),REJECT,ruff check passes - no E501 violation exists
2025-12-23,amelia/drivers/api/openai.py,166,Line too long (139 > 100),REJECT,ruff check passes - no E501 violation exists
2025-12-27,amelia/core/orchestrator.py,190-191,Generic exception handling in get_code_changes_for_review,ACCEPT,Changed Exception to (FileNotFoundError OSError)
2025-12-27,amelia/agents/developer.py,128,Return type list[Any] loses type safety,ACCEPT,Changed to list[AgentMessage] and removed unused Any import
Task 1: Enhance Feedback Log Schema
Create a new skill or schema document that defines an enhanced feedback log format for tracking review outcomes. This will be the foundation for future agent-based skill improvement.
New Schema
date,file,line,rule_source,category,severity,issue,verdict,rationale
| Field | Type | Description | Example Values |
|---|---|---|---|
| `date` | ISO date | When review occurred | 2025-12-23 |
| `file` | path | Relative file path | amelia/agents/developer.py |
| `line` | string | Line number(s) | 128, 190-191 |
| `rule_source` | string | Skill and rule that triggered issue | python-code-review/common-mistakes:unused-variables, pydantic-ai-common-pitfalls:tool-decorator |
| `category` | enum | Issue taxonomy | type-safety, async, error-handling, style, patterns, testing, security |
| `severity` | enum | As flagged by reviewer | critical, major, minor |
| `issue` | string | Brief description | Return type list[Any] loses type safety |
| `verdict` | enum | Human decision | ACCEPT, REJECT, DEFER, ACKNOWLEDGE |
| `rationale` | string | Why verdict was chosen | pydantic-ai docs explicitly support this pattern |
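To make the schema concrete, here is a minimal validation sketch. The `validate_row` helper is hypothetical; the enum values are taken from the table above, and the sample row is adapted from the existing feedback log:

```python
import csv
from io import StringIO

# Allowed enum values, per the schema table above.
CATEGORIES = {"type-safety", "async", "error-handling", "style", "patterns", "testing", "security"}
SEVERITIES = {"critical", "major", "minor"}
VERDICTS = {"ACCEPT", "REJECT", "DEFER", "ACKNOWLEDGE"}
FIELDS = ["date", "file", "line", "rule_source", "category", "severity", "issue", "verdict", "rationale"]

def validate_row(row: dict) -> list[str]:
    """Return a list of validation errors for one feedback-log row."""
    errors = []
    for field in FIELDS:
        if not row.get(field):
            errors.append(f"missing field: {field}")
    if row.get("category") not in CATEGORIES:
        errors.append(f"unknown category: {row.get('category')}")
    if row.get("severity") not in SEVERITIES:
        errors.append(f"unknown severity: {row.get('severity')}")
    if row.get("verdict") not in VERDICTS:
        errors.append(f"unknown verdict: {row.get('verdict')}")
    return errors

sample = StringIO(
    "date,file,line,rule_source,category,severity,issue,verdict,rationale\n"
    "2025-12-23,amelia/drivers/api/openai.py,102,python-code-review/style:line-length,"
    "style,minor,Line too long (104 > 100),REJECT,ruff check passes - no E501 violation exists\n"
)
for row in csv.DictReader(sample):
    print(validate_row(row))  # [] — the sample row is valid
```

A row failing validation would be rejected before it enters the feedback log, keeping downstream analysis clean.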
Deliverable
Create skills/review-feedback-schema/SKILL.md that:
- Defines the schema above
- Explains each verdict type
- Provides examples of good rationale writing
- Explains how this data feeds into skill improvement
Task 2: Update review-python Command
File: commands/review-python.md
Change 1: Add Linter Verification Step (BEFORE Step 2)
Insert a new step after "Identify Changed Files" that mandates running linters before manual review:
## Step 1.5: Verify Linter Status
**CRITICAL**: Run project linters BEFORE flagging any style or type issues.
\`\`\`bash
# Check if ruff config exists and run it
if [ -f "pyproject.toml" ] || [ -f "ruff.toml" ]; then
    ruff check <changed_files>
fi

# Check if mypy config exists and run it
if [ -f "pyproject.toml" ] || [ -f "mypy.ini" ]; then
    mypy <changed_files>
fi
\`\`\`
**Rules:**
- If a linter passes for a specific rule (e.g., line length), DO NOT flag that issue manually
- Linter configuration is authoritative for style rules
- Only flag issues that linters cannot detect (semantic issues, architectural problems)
**Why:** Analysis of 24 review outcomes showed 4 false positives (17%) where reviewers flagged line-length violations that `ruff check` confirmed don't exist. The linter's configuration reflects intentional project decisions.
Change 2: Add "Check Codebase Patterns" Guidance
Add to the Review step:
### Before Flagging Optimization or Pattern Issues
1. **Check CLAUDE.md** for documented intentional patterns
2. **Check code comments** around the flagged area for "intentional", "optimization", or "NOTE:"
3. **Trace the code path** before claiming missing coverage or inconsistent handling
4. **Consider framework idioms** - what looks wrong generically may be correct for the framework
**Why:** Analysis showed rejections where reviewers flagged "inconsistent error handling" that was intentional optimization, and "missing test coverage" for code paths that don't exist.
Task 3: Update pydantic-ai-common-pitfalls Skill
File: skills/pydantic-ai-common-pitfalls/SKILL.md
Add Exception for Raw Function Tool Registration
Insert after the "Tool Decorator Errors" section:
## Valid Patterns (Not Errors)
### Raw Function Tool Registration
The following pattern IS valid and supported by pydantic-ai:
\`\`\`python
from dataclasses import dataclass

from pydantic_ai import Agent, RunContext

@dataclass
class MyDeps:
    db: object  # any object exposing async search() and get_user()

async def search_db(ctx: RunContext[MyDeps], query: str) -> list[dict]:
    """Search the database."""
    return await ctx.deps.db.search(query)

async def get_user(ctx: RunContext[MyDeps], user_id: int) -> dict:
    """Get user by ID."""
    return await ctx.deps.db.get_user(user_id)

# Valid: Pass raw functions to Agent(tools=[...])
agent = Agent(
    'openai:gpt-4o',
    deps_type=MyDeps,
    tools=[search_db, get_user],  # RunContext detected from signature
)
\`\`\`
**Why this works:** PydanticAI inspects function signatures. If the first parameter is `RunContext[T]`, it's treated as a context-aware tool. No decorator required.
**Reference:** https://ai.pydantic.dev/agents/#registering-tools-via-the-tools-argument
**Do NOT flag** code that passes functions with `RunContext` signatures to `Agent(tools=[...])`. This is equivalent to using `@agent.tool` and is explicitly documented.
Task 4: Create Review Improvement Agent Skill (Future Foundation)
Create skills/review-skill-improver/SKILL.md:
---
name: review-skill-improver
description: Analyzes feedback logs to identify patterns and suggest improvements to review skills. Use when you have accumulated feedback data and want to improve review accuracy.
---
# Review Skill Improver
## Purpose
Analyzes structured feedback logs to:
1. Identify rules that produce false positives (high REJECT rate)
2. Identify missing rules (issues that should have been caught)
3. Suggest specific skill modifications
## Input
Feedback log in enhanced schema format (see `review-feedback-schema` skill).
## Analysis Process
### Step 1: Aggregate by Rule Source
For each unique rule_source:
- Count total issues flagged
- Count ACCEPT vs REJECT
- Calculate rejection rate
- Extract rejection rationales
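A minimal sketch of this aggregation, assuming the enhanced CSV schema. The `rejection_rates` name is illustrative, and the 0.30 default threshold matches the 30% cutoff in Step 2:

```python
import csv
from collections import defaultdict

def rejection_rates(log_path: str, threshold: float = 0.30) -> dict:
    """Aggregate feedback entries by rule_source and flag high-rejection rules."""
    stats = defaultdict(lambda: {"total": 0, "rejected": 0, "rationales": []})
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            entry = stats[row["rule_source"]]
            entry["total"] += 1
            if row["verdict"] == "REJECT":
                entry["rejected"] += 1
                entry["rationales"].append(row["rationale"])
    # Attach the rejection rate and flag rules above the investigation threshold
    return {
        rule: {**entry,
               "rate": entry["rejected"] / entry["total"],
               "needs_review": entry["rejected"] / entry["total"] > threshold}
        for rule, entry in stats.items()
    }
```

Rules flagged `needs_review` feed into the rationale analysis of the following steps.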
### Step 2: Identify High-Rejection Rules
Rules with >30% rejection rate warrant investigation:
- Read the rejection rationales
- Identify common themes
- Determine if rule needs refinement or exception
### Step 3: Pattern Analysis
Group rejections by rationale theme:
- "Linter already handles this" → Add linter verification step
- "Framework supports this pattern" → Add exception to skill
- "Intentional design decision" → Add codebase context check
- "Wrong code path assumed" → Add code tracing step
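The theme grouping above can be sketched as simple keyword matching over rationale text. The theme names and keyword lists here are illustrative assumptions, not a fixed taxonomy:

```python
# Map rationale keywords to improvement themes (Step 3). Illustrative only.
THEME_KEYWORDS = {
    "add-linter-verification": ("ruff", "linter", "no e501"),
    "add-skill-exception": ("docs explicitly support", "framework supports"),
    "add-context-check": ("intentional", "optimization", "design decision"),
    "add-code-tracing": ("different error path", "wrong code path", "not applicable"),
}

def classify_rationale(rationale: str) -> str:
    """Return the first theme whose keywords appear in a rejection rationale."""
    text = rationale.lower()
    for theme, keywords in THEME_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return theme
    return "unclassified"

print(classify_rationale("ruff check passes - no E501 violation exists"))
# add-linter-verification
```

Rationales landing in `unclassified` would need manual review, and recurring ones suggest a new theme.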
### Step 4: Generate Improvement Recommendations
For each identified issue, produce:
\`\`\`markdown
## Recommendation: [SHORT_TITLE]
**Affected Skill:** `skill-name/SKILL.md` or `skill-name/references/file.md`
**Problem:** [What's causing false positives]
**Evidence:**
- [X] rejections with rationale "[common theme]"
- Example: [file:line] - [issue] - [rationale]
**Proposed Fix:**
\`\`\`markdown
[Exact text to add/modify in the skill]
\`\`\`
**Expected Impact:** Reduce false positive rate for [rule] from X% to Y%
\`\`\`
## Output Format
\`\`\`markdown
# Review Skill Improvement Report
## Summary
- Feedback entries analyzed: [N]
- Unique rules triggered: [N]
- High-rejection rules identified: [N]
- Recommendations generated: [N]
## High-Rejection Rules
| Rule Source | Total | Rejected | Rate | Theme |
|-------------|-------|----------|------|-------|
| ... | ... | ... | ... | ... |
## Recommendations
[Numbered list of recommendations in format above]
## Rules Performing Well
[Rules with <10% rejection rate - preserve these]
\`\`\`
## Usage
\`\`\`bash
# In a project with feedback log
/review-skill-improver --log .feedback-log.csv --output improvement-report.md
\`\`\`
## Future: Automated Skill Updates
Once confidence is high, this skill can:
1. Generate PRs to beagle with skill improvements
2. Track improvement impact over time
3. A/B test rule variations
Task 5: Update Skill Documentation Index
Update the README or skill index to include:
- `review-feedback-schema` - Schema for tracking review outcomes
- `review-skill-improver` - Analyze feedback to improve review skills
Verification
After making changes:
- Syntax check the markdown files
- Check that the `rule_source` values in the schema match actual skill paths
- Test that `review-python` still loads and executes correctly
- Validate the pydantic-ai exception against current pydantic-ai docs (v0.0.39+)
Success Criteria
- `review-python` includes a linter verification step before manual review
- `pydantic-ai-common-pitfalls` includes an exception for raw function tool registration
- New `review-feedback-schema` skill defines the enhanced log format
- New `review-skill-improver` skill provides the foundation for future agent-based improvement
- All changes include "Why" explanations linking back to feedback data
Future Vision
This creates a feedback loop:
\`\`\`
Review Code → Log Outcomes → Analyze Patterns → Improve Skills → Better Reviews
      ↑                                                                │
      └────────────────────────────────────────────────────────────────┘
\`\`\`
Eventually, an agent can:
- Read accumulated feedback logs across projects
- Identify systematic failure modes
- Generate PRs to improve review skills
- Track if improvements reduce rejection rates
The schema and improver skill are the foundation for this automated improvement cycle.