Skip to content

fix(core): add skill eval scaffolding and unbreak test-reminder#25

Merged
RBozydar merged 1 commit into
mainfrom
fix/skill-creator-evals-and-reminder-hook
Apr 22, 2026
Merged

fix(core): add skill eval scaffolding and unbreak test-reminder#25
RBozydar merged 1 commit into
mainfrom
fix/skill-creator-evals-and-reminder-hook

Conversation

@RBozydar
Copy link
Copy Markdown
Owner

No description provided.

@RBozydar RBozydar merged commit 9602075 into main Apr 22, 2026
1 of 3 checks passed
@RBozydar RBozydar deleted the fix/skill-creator-evals-and-reminder-hook branch April 22, 2026 21:29
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a formal evaluation harness for skills, featuring a new evaluate_skill.py script and a standardized skill-evals.yaml specification format for tracking trigger performance. The skill creation templates (simple-skill.md and router-skill.md) have been significantly updated to prioritize trigger examples, verification steps, and 'gotchas' while encouraging flexible workflow application instead of rigid adherence. Furthermore, the PR implements skill name validation in init_skill.py, renames the test-reminder hook for consistency, and adds a robust test suite covering the new evaluation and validation logic. I have no feedback to provide as there were no review comments to assess.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Code Review — fix(core): add skill eval scaffolding and unbreak test-reminder

Overview

This PR delivers two focused changes:

  1. Skill eval scaffolding — a lightweight YAML-based framework to track whether skills trigger correctly, including a new evaluate_skill.py summary script, evals/skill-evals.yaml spec format, integration into init_skill.py and quick_validate.py, and rewritten templates
  2. test-reminder fix — renames test_reminder.pyreminder_hook.py so pytest stops treating the hook script as a test file, then adds real tests and refactors the hook into proper functions

Both are coherent and well-scoped. The overall direction is good. Issues below are ordered by severity.


Bugs

evaluate_skill.py is missing PEP 723 inline metadata (likely runtime failure)

The script shebang is #!/usr/bin/env python3 and imports yaml, but pyyaml is not a stdlib module and there is no # /// script / dependencies = [...] block. SKILL.md documents running it with uv run python, which won't install pyyaml automatically. Most users hitting the documented command will get ModuleNotFoundError: No module named 'yaml'. All other hook/script files in this repo use #!/usr/bin/env -S uv run --script with inline metadata — this one should too.

#!/usr/bin/env -S uv run --script
# /// script
# dependencies = ["pyyaml"]
# ///

Results keyed by prompt text, not by ID (silent data loss on prompt edits)

summarize_cases matches results to spec entries via case_results.get(prompt, []). The spec already has id fields on every case — but the results JSON is keyed by prompt text. If a prompt is tweaked in the spec after results are collected, the lookup silently returns [] (zero observed trials, 0 % pass rate) with no warning. ID-based matching would be more robust.


Code Quality

exec() used in tests to extract functions from evaluate_skill.py

tests/test_skill_creator.py at line 985 does:

exec(eval_script.read_text(), namespace)
summarize_cases = namespace["summarize_cases"]

This works but is fragile — it runs all top-level side effects, couples the test to the function names being at module scope, and will silently break if the yaml import fails. The test-reminder tests already use importlib.util.spec_from_file_location which is the correct pattern here.

Reminder hook: missing return after early exit_success() calls

In reminder_hook.py main():

if c.tool_name != "Write":
    c.output.exit_success()   # no return
file_path = c.tool_input.get("file_path", "")

If exit_success() calls sys.exit() this is fine. If it doesn't (e.g. it just writes to stdout), execution falls through to the next lines unexpectedly. A return after each exit_success() call would make control flow unambiguous.


Test Coverage

What's covered is solid. Gaps worth noting:

  • possible_test_paths and has_corresponding_test in reminder_hook.py have no unit tests
  • validate_skill_name edge cases ("-name", "name-", "name--name") aren't tested explicitly — they're implied by integration but would be clearer as direct unit tests
  • evaluate_skill.py error paths (load_eval_spec with missing skill field, non-list section, etc.) have no coverage

Skill Clarity

Templates — significantly improved

The rewritten simple-skill.md and router-skill.md are much better for Claude:

  • ✅ Explicit "Trigger Examples" / "Do not use for" sections remove ambiguity about activation
  • ✅ "Gotchas" sections use a structured format (Situation / Wrong instinct / Correct approach / Verification) that is actionable
  • ✅ Removing **Wait for response before proceeding** from the router template and replacing with conditional intake is a real UX improvement — less unnecessary interruption
  • ✅ Replacing "follow it exactly" with "apply with judgment" gives Claude appropriate latitude without undermining safety steps

One residual ambiguity: the template description frontmatter placeholder Use this skill when {{...}}. Do not use it for {{...}}. is clearer than before, but the {{...}} placeholders are visually similar to Jinja2/Handlebars templating. Adding a comment like # Replace {{...}} with your actual trigger description would prevent confusion.

evals/skill-evals.yaml for skill-creator itself — well chosen cases

  • should_trigger: "Create a new Claude Code skill..." and "Audit this existing skill..." are distinct, realistic, and representative
  • should_not_trigger: "Refactor this Python function..." and "Summarize this paper..." are appropriately distant from the skill's domain

One documentation gap: there is no explanation of how a skill author translates success_criteria strings into the boolean values in the results JSON. The system is intentionally manual, but a one-paragraph note in the SKILL.md or evaluate_skill.py docstring would close that loop.


Minor

  • EXAMPLE_EVAL_SPEC in init_skill.py uses escaped quotes (\") inside a triple-quoted string — this works but ' single quotes would be cleaner YAML content
  • test_test_reminder_hook.py — the double-test_ prefix is unconventional. test_reminder_hook.py placed outside the hooks/ directory would follow the standard pytest layout better
  • contains_placeholder in quick_validate.py is called but its definition isn't in this diff — worth confirming it correctly gates the eval warning so finished skills aren't accidentally warned during development

Summary

The core goal (give skill authors a repeatable way to evaluate trigger behavior) is sound and the implementation is mostly clean. The critical issue is the missing PEP 723 metadata on evaluate_skill.py — the documented uv run python invocation will fail for anyone without pyyaml pre-installed. Everything else is moderate quality / clarity feedback that doesn't block the approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant