Improve review skills based on feedback log analysis #6

@anderskev

Description

Context

We've been logging the outcomes of local code reviews in a feedback log. Analysis of REJECT entries reveals systematic failure modes in the review skills that can be fixed.

Feedback Log Analysis Data

We analyzed 24 review outcomes. Key REJECT patterns:

| Rejection | Root Cause | Frequency |
|-----------|-----------|-----------|
| "Line too long" violations that don't exist | Reviewer not running linter before flagging | 4 entries (rows 15-18) |
| "Misleading RunContext pattern" | pydantic-ai docs explicitly support raw function registration | 1 entry |
| "Inconsistent error handling" flagged as bug | Intentional optimization, not a bug | 1 entry |
| "Missing test coverage" for wrong code path | Reviewer didn't trace code before claiming gap | 1 entry |

Sample Feedback Log (Current Format)

date,file,line,issue,verdict,action
2025-12-20,tests/integration/test_cli_flows.py,407,Unused extra_args parameter in parametrization,ACCEPT,Fixed - removed dead parameter
2025-12-20,tests/integration/test_cli_flows.py,237-242,Missing review --local in git repo error test,REJECT,Not applicable - review uses different error path
2025-12-21,amelia/server/orchestrator/service.py,1702,Direct mutation of frozen ExecutionState,ACCEPT,Fixed using model_copy(update={...})
2025-12-23,amelia/drivers/api/tools.py,48-53,Misleading RunContext pattern - should use decorators,REJECT,pydantic-ai docs explicitly support passing raw functions with RunContext to Agent(tools=[])
2025-12-23,amelia/drivers/api/openai.py,102,Line too long (104 > 100),REJECT,ruff check passes - no E501 violation exists
2025-12-23,amelia/drivers/api/openai.py,166,Line too long (139 > 100),REJECT,ruff check passes - no E501 violation exists
2025-12-27,amelia/core/orchestrator.py,190-191,Generic exception handling in get_code_changes_for_review,ACCEPT,"Changed Exception to (FileNotFoundError, OSError)"
2025-12-27,amelia/agents/developer.py,128,Return type list[Any] loses type safety,ACCEPT,Changed to list[AgentMessage] and removed unused Any import

Task 1: Enhance Feedback Log Schema

Create a new skill or schema document that defines an enhanced feedback log format for tracking review outcomes. This will be the foundation for future agent-based skill improvement.

New Schema

date,file,line,rule_source,category,severity,issue,verdict,rationale
| Field | Type | Description | Example Values |
|-------|------|-------------|----------------|
| date | ISO date | When review occurred | 2025-12-23 |
| file | path | Relative file path | amelia/agents/developer.py |
| line | string | Line number(s) | 128, 190-191 |
| rule_source | string | Skill and rule that triggered issue | python-code-review/common-mistakes:unused-variables, pydantic-ai-common-pitfalls:tool-decorator |
| category | enum | Issue taxonomy | type-safety, async, error-handling, style, patterns, testing, security |
| severity | enum | As flagged by reviewer | critical, major, minor |
| issue | string | Brief description | Return type list[Any] loses type safety |
| verdict | enum | Human decision | ACCEPT, REJECT, DEFER, ACKNOWLEDGE |
| rationale | string | Why verdict was chosen | pydantic-ai docs explicitly support this pattern |
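Rows in this format parse cleanly with Python's standard `csv` module. A minimal sketch of loading and validating a log, assuming the nine fields and four verdict values above (`load_feedback_log`, `SCHEMA_FIELDS`, and the sample `rule_source` value are illustrative, not existing tooling):

```python
import csv
import io

# Field order from the schema above.
SCHEMA_FIELDS = [
    "date", "file", "line", "rule_source", "category",
    "severity", "issue", "verdict", "rationale",
]
VALID_VERDICTS = {"ACCEPT", "REJECT", "DEFER", "ACKNOWLEDGE"}

def load_feedback_log(text: str) -> list[dict]:
    """Parse an enhanced feedback log and validate the verdict field."""
    rows = list(csv.DictReader(io.StringIO(text), fieldnames=SCHEMA_FIELDS))
    for row in rows:
        if row["verdict"] not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {row['verdict']!r}")
    return rows

# One row mirroring the sample log (the rule_source value is hypothetical).
sample = (
    "2025-12-23,amelia/drivers/api/openai.py,102,"
    "python-code-review/common-mistakes:line-length,style,minor,"
    '"Line too long (104 > 100)",REJECT,'
    '"ruff check passes - no E501 violation exists"'
)
rows = load_feedback_log(sample)
```

Fields containing commas (such as `rationale`) must be quoted, which `csv` handles transparently on read and write.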

Deliverable

Create skills/review-feedback-schema/SKILL.md that:

  1. Defines the schema above
  2. Explains each verdict type
  3. Provides examples of good rationale writing
  4. Explains how this data feeds into skill improvement

Task 2: Update review-python Command

File: commands/review-python.md

Change 1: Add Linter Verification Step (BEFORE Step 2)

Insert a new step after "Identify Changed Files" that mandates running linters before manual review:

## Step 1.5: Verify Linter Status

**CRITICAL**: Run project linters BEFORE flagging any style or type issues.

```bash
# Check if ruff config exists and run it
if [ -f "pyproject.toml" ] || [ -f "ruff.toml" ]; then
    ruff check <changed_files>
fi

# Check if mypy config exists and run it  
if [ -f "pyproject.toml" ] || [ -f "mypy.ini" ]; then
    mypy <changed_files>
fi
```

**Rules:**
- If a linter passes for a specific rule (e.g., line length), DO NOT flag that issue manually
- Linter configuration is authoritative for style rules
- Only flag issues that linters cannot detect (semantic issues, architectural problems)

**Why:** Analysis of 24 review outcomes showed 4 false positives (17%) where reviewers flagged line-length violations that `ruff check` confirmed don't exist. The linter's configuration reflects intentional project decisions.

Change 2: Add "Check Codebase Patterns" Guidance

Add to the Review step:

### Before Flagging Optimization or Pattern Issues

1. **Check CLAUDE.md** for documented intentional patterns
2. **Check code comments** around the flagged area for "intentional", "optimization", or "NOTE:"
3. **Trace the code path** before claiming missing coverage or inconsistent handling
4. **Consider framework idioms** - what looks wrong generically may be correct for the framework

**Why:** Analysis showed rejections where reviewers flagged "inconsistent error handling" that was intentional optimization, and "missing test coverage" for code paths that don't exist.
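Check 2 can be mechanized with a scan of comments near the flagged line. A minimal illustrative sketch (the function name, marker list, and 3-line radius are assumptions, not an existing tool):

```python
import re

# Markers that signal an intentional pattern (list from the guidance above).
MARKERS = re.compile(r"\b(intentional|optimization|NOTE:)", re.IGNORECASE)

def has_intent_marker(source_lines: list[str], line: int, radius: int = 3) -> bool:
    """Check comment lines within `radius` lines of a 1-indexed flagged line."""
    lo = max(0, line - 1 - radius)
    hi = line - 1 + radius + 1
    return any(MARKERS.search(l) for l in source_lines[lo:hi] if "#" in l)
```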

Task 3: Update pydantic-ai-common-pitfalls Skill

File: skills/pydantic-ai-common-pitfalls/SKILL.md

Add Exception for Raw Function Tool Registration

Insert after the "Tool Decorator Errors" section:

## Valid Patterns (Not Errors)

### Raw Function Tool Registration

The following pattern IS valid and supported by pydantic-ai:

```python
from pydantic_ai import Agent, RunContext

async def search_db(ctx: RunContext[MyDeps], query: str) -> list[dict]:
    """Search the database."""
    return await ctx.deps.db.search(query)

async def get_user(ctx: RunContext[MyDeps], user_id: int) -> dict:
    """Get user by ID."""
    return await ctx.deps.db.get_user(user_id)

# Valid: Pass raw functions to Agent(tools=[...])
agent = Agent(
    'openai:gpt-4o',
    deps_type=MyDeps,
    tools=[search_db, get_user]  # RunContext detected from signature
)
```

**Why this works:** PydanticAI inspects function signatures. If the first parameter is `RunContext[T]`, it's treated as a context-aware tool. No decorator required.

**Reference:** https://ai.pydantic.dev/agents/#registering-tools-via-the-tools-argument

**Do NOT flag** code that passes functions with `RunContext` signatures to `Agent(tools=[...])`. This is equivalent to using `@agent.tool` and is explicitly documented.

Task 4: Create Review Improvement Agent Skill (Future Foundation)

Create skills/review-skill-improver/SKILL.md:

---
name: review-skill-improver
description: Analyzes feedback logs to identify patterns and suggest improvements to review skills. Use when you have accumulated feedback data and want to improve review accuracy.
---

# Review Skill Improver

## Purpose

Analyzes structured feedback logs to:
1. Identify rules that produce false positives (high REJECT rate)
2. Identify missing rules (issues that should have been caught)
3. Suggest specific skill modifications

## Input

Feedback log in enhanced schema format (see `review-feedback-schema` skill).

## Analysis Process

### Step 1: Aggregate by Rule Source

For each unique rule_source:

  • Count total issues flagged
  • Count ACCEPT vs REJECT
  • Calculate rejection rate
  • Extract rejection rationales

### Step 2: Identify High-Rejection Rules

Rules with >30% rejection rate warrant investigation:
- Read the rejection rationales
- Identify common themes
- Determine if rule needs refinement or exception
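Steps 1 and 2 can be sketched together as one aggregation pass over the parsed log. The function names, sample rows, and rule identifiers below are illustrative; only `rule_source` and `verdict` are read:

```python
from collections import defaultdict

def rejection_rates(rows: list[dict]) -> dict[str, tuple[int, int, float]]:
    """Return {rule_source: (total, rejected, rejection_rate)}."""
    totals: dict[str, int] = defaultdict(int)
    rejects: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["rule_source"]] += 1
        if row["verdict"] == "REJECT":
            rejects[row["rule_source"]] += 1
    return {
        rule: (n, rejects[rule], rejects[rule] / n)
        for rule, n in totals.items()
    }

def high_rejection_rules(rows: list[dict], threshold: float = 0.30) -> list[str]:
    """Rules exceeding the 30% rejection-rate investigation threshold."""
    return [
        rule for rule, (_, _, rate) in rejection_rates(rows).items()
        if rate > threshold
    ]

# Hypothetical rows echoing the sample log.
rows = [
    {"rule_source": "python-code-review/style:line-length", "verdict": "REJECT"},
    {"rule_source": "python-code-review/style:line-length", "verdict": "REJECT"},
    {"rule_source": "python-code-review/types:any-return", "verdict": "ACCEPT"},
]
flagged = high_rejection_rules(rows)
```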

### Step 3: Pattern Analysis

Group rejections by rationale theme:
- "Linter already handles this" → Add linter verification step
- "Framework supports this pattern" → Add exception to skill
- "Intentional design decision" → Add codebase context check
- "Wrong code path assumed" → Add code tracing step
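The theme grouping above can be approximated with keyword matching on rejection rationales. The keyword table is a rough illustrative heuristic, not a fixed taxonomy:

```python
# Keyword -> recommended skill fix, mirroring the theme list above.
THEME_ACTIONS = {
    "linter": "Add linter verification step",
    "framework": "Add exception to skill",
    "intentional": "Add codebase context check",
    "code path": "Add code tracing step",
}

def suggest_action(rationale: str) -> str:
    """Map a rejection rationale to a suggested fix via keyword match."""
    text = rationale.lower()
    for keyword, action in THEME_ACTIONS.items():
        if keyword in text:
            return action
    return "Needs manual triage"
```

Rationales that match no theme fall through to manual triage rather than being forced into a bucket.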

### Step 4: Generate Improvement Recommendations

For each identified issue, produce:

````markdown
## Recommendation: [SHORT_TITLE]

**Affected Skill:** `skill-name/SKILL.md` or `skill-name/references/file.md`

**Problem:** [What's causing false positives]

**Evidence:** 
- [X] rejections with rationale "[common theme]"
- Example: [file:line] - [issue] - [rationale]

**Proposed Fix:**
```markdown
[Exact text to add/modify in the skill]
```

**Expected Impact:** Reduce false positive rate for [rule] from X% to Y%
````

## Output Format

```markdown
# Review Skill Improvement Report

## Summary
- Feedback entries analyzed: [N]
- Unique rules triggered: [N]
- High-rejection rules identified: [N]
- Recommendations generated: [N]

## High-Rejection Rules

| Rule Source | Total | Rejected | Rate | Theme |
|-------------|-------|----------|------|-------|
| ... | ... | ... | ... | ... |

## Recommendations

[Numbered list of recommendations in format above]

## Rules Performing Well

[Rules with <10% rejection rate - preserve these]
```

## Usage

```bash
# In a project with feedback log
/review-skill-improver --log .feedback-log.csv --output improvement-report.md
```

## Future: Automated Skill Updates

Once confidence is high, this skill can:
1. Generate PRs to beagle with skill improvements
2. Track improvement impact over time
3. A/B test rule variations

Task 5: Update Skill Documentation Index

Update the README or skill index to include:

  • review-feedback-schema - Schema for tracking review outcomes
  • review-skill-improver - Analyze feedback to improve review skills

Verification

After making changes:

  1. Syntax check the markdown files
  2. Cross-check that the rule_source values in the schema match actual skill paths
  3. Test that review-python still loads and executes correctly
  4. Validate the pydantic-ai exception against current pydantic-ai docs (v0.0.39+)
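Verification step 2 can be scripted: given the schema's rule_source convention (`skill-name/reference:rule-id`), confirm the named skill exists on disk. A sketch, assuming the `skills/<name>/SKILL.md` layout used elsewhere in this issue:

```python
import tempfile
from pathlib import Path

def skill_exists(rule_source: str, skills_root: Path) -> bool:
    """True if the skill named before ':' in rule_source has a SKILL.md."""
    skill = rule_source.split(":", 1)[0].split("/", 1)[0]
    return (skills_root / skill / "SKILL.md").is_file()

# Demo against a throwaway directory (names are illustrative).
root = Path(tempfile.mkdtemp())
(root / "pydantic-ai-common-pitfalls").mkdir()
(root / "pydantic-ai-common-pitfalls" / "SKILL.md").write_text("# skill")
ok = skill_exists("pydantic-ai-common-pitfalls:tool-decorator", root)
missing = skill_exists("no-such-skill:rule", root)
```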

Success Criteria

  • review-python includes linter verification step before manual review
  • pydantic-ai-common-pitfalls includes exception for raw function tool registration
  • New review-feedback-schema skill defines the enhanced log format
  • New review-skill-improver skill provides foundation for future agent-based improvement
  • All changes include "Why" explanations linking back to feedback data

Future Vision

This creates a feedback loop:

Review Code → Log Outcomes → Analyze Patterns → Improve Skills → Better Reviews
     ↑                                                              |
     └──────────────────────────────────────────────────────────────┘

Eventually, an agent can:

  1. Read accumulated feedback logs across projects
  2. Identify systematic failure modes
  3. Generate PRs to improve review skills
  4. Track if improvements reduce rejection rates

The schema and improver skill are the foundation for this automated improvement cycle.

Metadata

Labels: enhancement (New feature or request)