Improve review skills based on feedback log analysis #6

@anderskev

Description

Context

We've been logging the outcomes of local code reviews in a feedback log. Analysis of REJECT entries reveals systematic failure modes in the review skills that can be fixed.

Feedback Log Analysis Data

We analyzed 24 review outcomes. Key REJECT patterns:

| Rejection | Root Cause | Frequency |
|-----------|-----------|-----------|
| "Line too long" violations that don't exist | Reviewer not running linter before flagging | 4 entries (rows 15-18) |
| "Misleading RunContext pattern" | pydantic-ai docs explicitly support raw function registration | 1 entry |
| "Inconsistent error handling" flagged as bug | Intentional optimization, not a bug | 1 entry |
| "Missing test coverage" for wrong code path | Reviewer didn't trace code before claiming gap | 1 entry |

Sample Feedback Log (Current Format)

date,file,line,issue,verdict,action
2025-12-20,tests/integration/test_cli_flows.py,407,Unused extra_args parameter in parametrization,ACCEPT,Fixed - removed dead parameter
2025-12-20,tests/integration/test_cli_flows.py,237-242,Missing review --local in git repo error test,REJECT,Not applicable - review uses different error path
2025-12-21,amelia/server/orchestrator/service.py,1702,Direct mutation of frozen ExecutionState,ACCEPT,Fixed using model_copy(update={...})
2025-12-23,amelia/drivers/api/tools.py,48-53,Misleading RunContext pattern - should use decorators,REJECT,pydantic-ai docs explicitly support passing raw functions with RunContext to Agent(tools=[])
2025-12-23,amelia/drivers/api/openai.py,102,Line too long (104 > 100),REJECT,ruff check passes - no E501 violation exists
2025-12-23,amelia/drivers/api/openai.py,166,Line too long (139 > 100),REJECT,ruff check passes - no E501 violation exists
2025-12-27,amelia/core/orchestrator.py,190-191,Generic exception handling in get_code_changes_for_review,ACCEPT,"Changed Exception to (FileNotFoundError, OSError)"
2025-12-27,amelia/agents/developer.py,128,Return type list[Any] loses type safety,ACCEPT,Changed to list[AgentMessage] and removed unused Any import

Task 1: Enhance Feedback Log Schema

Create a new skill or schema document that defines an enhanced feedback log format for tracking review outcomes. This will be the foundation for future agent-based skill improvement.

New Schema

date,file,line,rule_source,category,severity,issue,verdict,rationale
| Field | Type | Description | Example Values |
|-------|------|-------------|----------------|
| date | ISO date | When review occurred | 2025-12-23 |
| file | path | Relative file path | amelia/agents/developer.py |
| line | string | Line number(s) | 128, 190-191 |
| rule_source | string | Skill and rule that triggered issue | python-code-review/common-mistakes:unused-variables, pydantic-ai-common-pitfalls:tool-decorator |
| category | enum | Issue taxonomy | type-safety, async, error-handling, style, patterns, testing, security |
| severity | enum | As flagged by reviewer | critical, major, minor |
| issue | string | Brief description | Return type list[Any] loses type safety |
| verdict | enum | Human decision | ACCEPT, REJECT, DEFER, ACKNOWLEDGE |
| rationale | string | Why verdict was chosen | pydantic-ai docs explicitly support this pattern |
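Rows in this format parse cleanly with Python's standard `csv` module. A minimal sketch of loading and validating a log, assuming the nine fields and four verdict values above (`load_feedback_log`, `SCHEMA_FIELDS`, and the sample `rule_source` value are illustrative, not existing tooling):

```python
import csv
import io

# Field order from the schema above.
SCHEMA_FIELDS = [
    "date", "file", "line", "rule_source", "category",
    "severity", "issue", "verdict", "rationale",
]
VALID_VERDICTS = {"ACCEPT", "REJECT", "DEFER", "ACKNOWLEDGE"}

def load_feedback_log(text: str) -> list[dict]:
    """Parse an enhanced feedback log and validate the verdict field."""
    rows = list(csv.DictReader(io.StringIO(text), fieldnames=SCHEMA_FIELDS))
    for row in rows:
        if row["verdict"] not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {row['verdict']!r}")
    return rows

# One row mirroring the sample log (the rule_source value is hypothetical).
sample = (
    "2025-12-23,amelia/drivers/api/openai.py,102,"
    "python-code-review/common-mistakes:line-length,style,minor,"
    '"Line too long (104 > 100)",REJECT,'
    '"ruff check passes - no E501 violation exists"'
)
rows = load_feedback_log(sample)
```

Fields containing commas (such as `rationale`) must be quoted, which `csv` handles transparently on read and write.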

Deliverable

Create skills/review-feedback-schema/SKILL.md that:

  1. Defines the schema above
  2. Explains each verdict type
  3. Provides examples of good rationale writing
  4. Explains how this data feeds into skill improvement

Task 2: Update review-python Command

File: commands/review-python.md

Change 1: Add Linter Verification Step (BEFORE Step 2)

Insert a new step after "Identify Changed Files" that mandates running linters before manual review:

## Step 1.5: Verify Linter Status

**CRITICAL**: Run project linters BEFORE flagging any style or type issues.

```bash
# Check if ruff config exists and run it
if [ -f "pyproject.toml" ] || [ -f "ruff.toml" ]; then
    ruff check <changed_files>
fi

# Check if mypy config exists and run it  
if [ -f "pyproject.toml" ] || [ -f "mypy.ini" ]; then
    mypy <changed_files>
fi
```

**Rules:**
- If a linter passes for a specific rule (e.g., line length), DO NOT flag that issue manually
- Linter configuration is authoritative for style rules
- Only flag issues that linters cannot detect (semantic issues, architectural problems)

**Why:** Analysis of 24 review outcomes showed 4 false positives (17%) where reviewers flagged line-length violations that `ruff check` confirmed don't exist. The linter's configuration reflects intentional project decisions.

Change 2: Add "Check Codebase Patterns" Guidance

Add to the Review step:

### Before Flagging Optimization or Pattern Issues

1. **Check CLAUDE.md** for documented intentional patterns
2. **Check code comments** around the flagged area for "intentional", "optimization", or "NOTE:"
3. **Trace the code path** before claiming missing coverage or inconsistent handling
4. **Consider framework idioms** - what looks wrong generically may be correct for the framework

**Why:** Analysis showed rejections where reviewers flagged "inconsistent error handling" that was intentional optimization, and "missing test coverage" for code paths that don't exist.
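Check 2 can be mechanized with a scan of comments near the flagged line. A minimal illustrative sketch (the function name, marker list, and 3-line radius are assumptions, not an existing tool):

```python
import re

# Markers that signal an intentional pattern (list from the guidance above).
MARKERS = re.compile(r"\b(intentional|optimization|NOTE:)", re.IGNORECASE)

def has_intent_marker(source_lines: list[str], line: int, radius: int = 3) -> bool:
    """Check comment lines within `radius` lines of a 1-indexed flagged line."""
    lo = max(0, line - 1 - radius)
    hi = line - 1 + radius + 1
    return any(MARKERS.search(l) for l in source_lines[lo:hi] if "#" in l)
```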

Task 3: Update pydantic-ai-common-pitfalls Skill

File: skills/pydantic-ai-common-pitfalls/SKILL.md

Add Exception for Raw Function Tool Registration

Insert after the "Tool Decorator Errors" section:

## Valid Patterns (Not Errors)

### Raw Function Tool Registration

The following pattern IS valid and supported by pydantic-ai:

```python
from pydantic_ai import Agent, RunContext

async def search_db(ctx: RunContext[MyDeps], query: str) -> list[dict]:
    """Search the database."""
    return await ctx.deps.db.search(query)

async def get_user(ctx: RunContext[MyDeps], user_id: int) -> dict:
    """Get user by ID."""
    return await ctx.deps.db.get_user(user_id)

# Valid: Pass raw functions to Agent(tools=[...])
agent = Agent(
    'openai:gpt-4o',
    deps_type=MyDeps,
    tools=[search_db, get_user]  # RunContext detected from signature
)
```

**Why this works:** PydanticAI inspects function signatures. If the first parameter is `RunContext[T]`, it's treated as a context-aware tool. No decorator required.

**Reference:** https://ai.pydantic.dev/agents/#registering-tools-via-the-tools-argument

**Do NOT flag** code that passes functions with `RunContext` signatures to `Agent(tools=[...])`. This is equivalent to using `@agent.tool` and is explicitly documented.

Task 4: Create Review Improvement Agent Skill (Future Foundation)

Create skills/review-skill-improver/SKILL.md:

---
name: review-skill-improver
description: Analyzes feedback logs to identify patterns and suggest improvements to review skills. Use when you have accumulated feedback data and want to improve review accuracy.
---

# Review Skill Improver

## Purpose

Analyzes structured feedback logs to:
1. Identify rules that produce false positives (high REJECT rate)
2. Identify missing rules (issues that should have been caught)
3. Suggest specific skill modifications

## Input

Feedback log in enhanced schema format (see `review-feedback-schema` skill).

## Analysis Process

### Step 1: Aggregate by Rule Source

For each unique rule_source:

  • Count total issues flagged
  • Count ACCEPT vs REJECT
  • Calculate rejection rate
  • Extract rejection rationales

### Step 2: Identify High-Rejection Rules

Rules with >30% rejection rate warrant investigation:
- Read the rejection rationales
- Identify common themes
- Determine if rule needs refinement or exception
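Steps 1 and 2 can be sketched together as one aggregation pass over the parsed log. The function names, sample rows, and rule identifiers below are illustrative; only `rule_source` and `verdict` are read:

```python
from collections import defaultdict

def rejection_rates(rows: list[dict]) -> dict[str, tuple[int, int, float]]:
    """Return {rule_source: (total, rejected, rejection_rate)}."""
    totals: dict[str, int] = defaultdict(int)
    rejects: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["rule_source"]] += 1
        if row["verdict"] == "REJECT":
            rejects[row["rule_source"]] += 1
    return {
        rule: (n, rejects[rule], rejects[rule] / n)
        for rule, n in totals.items()
    }

def high_rejection_rules(rows: list[dict], threshold: float = 0.30) -> list[str]:
    """Rules exceeding the 30% rejection-rate investigation threshold."""
    return [
        rule for rule, (_, _, rate) in rejection_rates(rows).items()
        if rate > threshold
    ]

# Hypothetical rows echoing the sample log.
rows = [
    {"rule_source": "python-code-review/style:line-length", "verdict": "REJECT"},
    {"rule_source": "python-code-review/style:line-length", "verdict": "REJECT"},
    {"rule_source": "python-code-review/types:any-return", "verdict": "ACCEPT"},
]
flagged = high_rejection_rules(rows)
```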

### Step 3: Pattern Analysis

Group rejections by rationale theme:
- "Linter already handles this" → Add linter verification step
- "Framework supports this pattern" → Add exception to skill
- "Intentional design decision" → Add codebase context check
- "Wrong code path assumed" → Add code tracing step
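The theme grouping above can be approximated with keyword matching on rejection rationales. The keyword table is a rough illustrative heuristic, not a fixed taxonomy:

```python
# Keyword -> recommended skill fix, mirroring the theme list above.
THEME_ACTIONS = {
    "linter": "Add linter verification step",
    "framework": "Add exception to skill",
    "intentional": "Add codebase context check",
    "code path": "Add code tracing step",
}

def suggest_action(rationale: str) -> str:
    """Map a rejection rationale to a suggested fix via keyword match."""
    text = rationale.lower()
    for keyword, action in THEME_ACTIONS.items():
        if keyword in text:
            return action
    return "Needs manual triage"
```

Rationales that match no theme fall through to manual triage rather than being forced into a bucket.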

### Step 4: Generate Improvement Recommendations

For each identified issue, produce:

````markdown
## Recommendation: [SHORT_TITLE]

**Affected Skill:** `skill-name/SKILL.md` or `skill-name/references/file.md`

**Problem:** [What's causing false positives]

**Evidence:** 
- [X] rejections with rationale "[common theme]"
- Example: [file:line] - [issue] - [rationale]

**Proposed Fix:**
```markdown
[Exact text to add/modify in the skill]
```

**Expected Impact:** Reduce false positive rate for [rule] from X% to Y%
````

## Output Format

```markdown
# Review Skill Improvement Report

## Summary
- Feedback entries analyzed: [N]
- Unique rules triggered: [N]
- High-rejection rules identified: [N]
- Recommendations generated: [N]

## High-Rejection Rules

| Rule Source | Total | Rejected | Rate | Theme |
|-------------|-------|----------|------|-------|
| ... | ... | ... | ... | ... |

## Recommendations

[Numbered list of recommendations in format above]

## Rules Performing Well

[Rules with <10% rejection rate - preserve these]
```

## Usage

```bash
# In a project with feedback log
/review-skill-improver --log .feedback-log.csv --output improvement-report.md
```

## Future: Automated Skill Updates

Once confidence is high, this skill can:
1. Generate PRs to beagle with skill improvements
2. Track improvement impact over time
3. A/B test rule variations

Task 5: Update Skill Documentation Index

Update the README or skill index to include:

  • review-feedback-schema - Schema for tracking review outcomes
  • review-skill-improver - Analyze feedback to improve review skills

Verification

After making changes:

  1. Syntax check the markdown files
  2. Cross-check that the rule_source values in the schema match actual skill paths
  3. Test that review-python still loads and executes correctly
  4. Validate the pydantic-ai exception against current pydantic-ai docs (v0.0.39+)
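Verification step 2 can be scripted: given the schema's rule_source convention (`skill-name/reference:rule-id`), confirm the named skill exists on disk. A sketch, assuming the `skills/<name>/SKILL.md` layout used elsewhere in this issue:

```python
import tempfile
from pathlib import Path

def skill_exists(rule_source: str, skills_root: Path) -> bool:
    """True if the skill named before ':' in rule_source has a SKILL.md."""
    skill = rule_source.split(":", 1)[0].split("/", 1)[0]
    return (skills_root / skill / "SKILL.md").is_file()

# Demo against a throwaway directory (names are illustrative).
root = Path(tempfile.mkdtemp())
(root / "pydantic-ai-common-pitfalls").mkdir()
(root / "pydantic-ai-common-pitfalls" / "SKILL.md").write_text("# skill")
ok = skill_exists("pydantic-ai-common-pitfalls:tool-decorator", root)
missing = skill_exists("no-such-skill:rule", root)
```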

Success Criteria

  • review-python includes linter verification step before manual review
  • pydantic-ai-common-pitfalls includes exception for raw function tool registration
  • New review-feedback-schema skill defines the enhanced log format
  • New review-skill-improver skill provides foundation for future agent-based improvement
  • All changes include "Why" explanations linking back to feedback data

Future Vision

This creates a feedback loop:

Review Code → Log Outcomes → Analyze Patterns → Improve Skills → Better Reviews
     ↑                                                              |
     └──────────────────────────────────────────────────────────────┘

Eventually, an agent can:

  1. Read accumulated feedback logs across projects
  2. Identify systematic failure modes
  3. Generate PRs to improve review skills
  4. Track if improvements reduce rejection rates

The schema and improver skill are the foundation for this automated improvement cycle.

Metadata

Labels: enhancement (New feature or request)