fix(core): add skill eval scaffolding and unbreak test-reminder by RBozydar · Pull Request #25 · RBozydar/rbw-claude-code

RBozydar · 2026-04-22T21:29:17Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces a formal evaluation harness for skills, featuring a new evaluate_skill.py script and a standardized skill-evals.yaml specification format for tracking trigger performance. The skill creation templates (simple-skill.md and router-skill.md) have been significantly updated to prioritize trigger examples, verification steps, and 'gotchas' while encouraging flexible workflow application instead of rigid adherence. Furthermore, the PR implements skill name validation in init_skill.py, renames the test-reminder hook for consistency, and adds a robust test suite covering the new evaluation and validation logic. I have no feedback to provide as there were no review comments to assess.

claude · 2026-04-22T21:31:42Z

Code Review — fix(core): add skill eval scaffolding and unbreak test-reminder

Overview

This PR delivers two focused changes:

Skill eval scaffolding — a lightweight YAML-based framework to track whether skills trigger correctly, including a new evaluate_skill.py summary script, evals/skill-evals.yaml spec format, integration into init_skill.py and quick_validate.py, and rewritten templates
test-reminder fix — renames test_reminder.py → reminder_hook.py so pytest stops treating the hook script as a test file, then adds real tests and refactors the hook into proper functions

Both are coherent and well-scoped. The overall direction is good. Issues below are ordered by severity.

Bugs

evaluate_skill.py is missing PEP 723 inline metadata (likely runtime failure)

The script shebang is #!/usr/bin/env python3 and imports yaml, but pyyaml is not a stdlib module and there is no # /// script / dependencies = [...] block. SKILL.md documents running it with uv run python, which won't install pyyaml automatically. Most users hitting the documented command will get ModuleNotFoundError: No module named 'yaml'. All other hook/script files in this repo use #!/usr/bin/env -S uv run --script with inline metadata — this one should too.

#!/usr/bin/env -S uv run --script
# /// script
# dependencies = ["pyyaml"]
# ///

Results keyed by prompt text, not by ID (silent data loss on prompt edits)

summarize_cases matches results to spec entries via case_results.get(prompt, []). The spec already has id fields on every case — but the results JSON is keyed by prompt text. If a prompt is tweaked in the spec after results are collected, the lookup silently returns [] (zero observed trials, 0 % pass rate) with no warning. ID-based matching would be more robust.

Code Quality

exec() used in tests to extract functions from evaluate_skill.py

tests/test_skill_creator.py at line 985 does:

exec(eval_script.read_text(), namespace)
summarize_cases = namespace["summarize_cases"]

This works but is fragile — it runs all top-level side effects, couples the test to the function names being at module scope, and will silently break if the yaml import fails. The test-reminder tests already use importlib.util.spec_from_file_location which is the correct pattern here.

Reminder hook: missing return after early exit_success() calls

In reminder_hook.py main():

if c.tool_name != "Write":
    c.output.exit_success()   # no return
file_path = c.tool_input.get("file_path", "")

If exit_success() calls sys.exit() this is fine. If it doesn't (e.g. it just writes to stdout), execution falls through to the next lines unexpectedly. A return after each exit_success() call would make control flow unambiguous.

Test Coverage

What's covered is solid. Gaps worth noting:

possible_test_paths and has_corresponding_test in reminder_hook.py have no unit tests
validate_skill_name edge cases ("-name", "name-", "name--name") aren't tested explicitly — they're implied by integration but would be clearer as direct unit tests
evaluate_skill.py error paths (load_eval_spec with missing skill field, non-list section, etc.) have no coverage

Skill Clarity

Templates — significantly improved

The rewritten simple-skill.md and router-skill.md are much better for Claude:

✅ Explicit "Trigger Examples" / "Do not use for" sections remove ambiguity about activation
✅ "Gotchas" sections use a structured format (Situation / Wrong instinct / Correct approach / Verification) that is actionable
✅ Removing **Wait for response before proceeding** from the router template and replacing with conditional intake is a real UX improvement — less unnecessary interruption
✅ Replacing "follow it exactly" with "apply with judgment" gives Claude appropriate latitude without undermining safety steps

One residual ambiguity: the template description frontmatter placeholder Use this skill when {{...}}. Do not use it for {{...}}. is clearer than before, but the {{...}} placeholders are visually similar to Jinja2/Handlebars templating. Adding a comment like # Replace {{...}} with your actual trigger description would prevent confusion.

evals/skill-evals.yaml for skill-creator itself — well chosen cases

should_trigger: "Create a new Claude Code skill..." and "Audit this existing skill..." are distinct, realistic, and representative
should_not_trigger: "Refactor this Python function..." and "Summarize this paper..." are appropriately distant from the skill's domain

One documentation gap: there is no explanation of how a skill author translates success_criteria strings into the boolean values in the results JSON. The system is intentionally manual, but a one-paragraph note in the SKILL.md or evaluate_skill.py docstring would close that loop.

Minor

EXAMPLE_EVAL_SPEC in init_skill.py uses escaped quotes (\") inside a triple-quoted string — this works but ' single quotes would be cleaner YAML content
test_test_reminder_hook.py — the double-test_ prefix is unconventional. test_reminder_hook.py placed outside the hooks/ directory would follow the standard pytest layout better
contains_placeholder in quick_validate.py is called but its definition isn't in this diff — worth confirming it correctly gates the eval warning so finished skills aren't accidentally warned during development

Summary

The core goal (give skill authors a repeatable way to evaluate trigger behavior) is sound and the implementation is mostly clean. The critical issue is the missing PEP 723 metadata on evaluate_skill.py — the documented uv run python invocation will fail for anyone without pyyaml pre-installed. Everything else is moderate quality / clarity feedback that doesn't block the approach.

fix(core): add skill eval scaffolding and unbreak test-reminder

e7c2099

RBozydar merged commit 9602075 into main Apr 22, 2026
1 of 3 checks passed

RBozydar deleted the fix/skill-creator-evals-and-reminder-hook branch April 22, 2026 21:29

gemini-code-assist Bot reviewed Apr 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): add skill eval scaffolding and unbreak test-reminder#25

fix(core): add skill eval scaffolding and unbreak test-reminder#25
RBozydar merged 1 commit into
mainfrom
fix/skill-creator-evals-and-reminder-hook

RBozydar commented Apr 22, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

claude Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

RBozydar commented Apr 22, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

claude Bot commented Apr 22, 2026

Code Review — fix(core): add skill eval scaffolding and unbreak test-reminder

Overview

Bugs

Code Quality

Test Coverage

Skill Clarity

Minor

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant