fix(core): add skill eval scaffolding and unbreak test-reminder #25
Code Review
This pull request introduces a formal evaluation harness for skills, featuring a new evaluate_skill.py script and a standardized skill-evals.yaml specification format for tracking trigger performance. The skill creation templates (simple-skill.md and router-skill.md) have been significantly updated to prioritize trigger examples, verification steps, and 'gotchas' while encouraging flexible workflow application instead of rigid adherence. Furthermore, the PR implements skill name validation in init_skill.py, renames the test-reminder hook for consistency, and adds a robust test suite covering the new evaluation and validation logic. I have no feedback to provide as there were no review comments to assess.
Code Review - fix(core): add skill eval scaffolding and unbreak test-reminder
Overview
This PR delivers two focused changes:
Both are coherent and well-scoped. The overall direction is good. Issues below are ordered by severity.
Bugs
The script shebang is #!/usr/bin/env -S uv run --script
# /// script
# dependencies = ["pyyaml"]
# ///
Results keyed by prompt text, not by ID (silent data loss on prompt edits)
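To illustrate the keying concern, here is a minimal sketch of how a lookup by prompt text can drop history after a prompt edit, while a lookup by a stable ID does not. The field names `id`, `prompt`, and `triggered` are hypothetical and are not taken from the PR's `skill-evals.yaml` format or from `evaluate_skill.py`.

```python
def index_results(results, key):
    """Index a list of result dicts by the given field."""
    return {r[key]: r for r in results}


# Hypothetical stored results from an earlier run (field names are assumptions).
previous = [
    {"id": "case-1", "prompt": "Write a test for foo", "triggered": True},
]

# The same case after its prompt text was edited slightly.
current_cases = [
    {"id": "case-1", "prompt": "Write a unit test for foo"},
]

by_prompt = index_results(previous, "prompt")
by_id = index_results(previous, "id")

case = current_cases[0]
lost = by_prompt.get(case["prompt"])  # no match: earlier result is silently dropped
kept = by_id.get(case["id"])          # stable ID still finds the earlier result
```

Under these assumptions, keying by ID keeps a case's history attached to it even when the wording of the prompt changes.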
Code Quality
exec(eval_script.read_text(), namespace)
summarize_cases = namespace["summarize_cases"]
This works but is fragile: it runs all top-level side effects, couples the test to the function names being at module scope, and will silently break if the
Reminder hook: missing
In
if c.tool_name != "Write":
    c.output.exit_success()  # no return
file_path = c.tool_input.get("file_path", "")
If
Test Coverage
What's covered is solid. Gaps worth noting:
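For the reminder-hook snippet above (the `exit_success()` call with no `return`), the following is a minimal sketch of why an explicit `return` can matter. It assumes a case where `exit_success()` does not terminate the process, for example when it is mocked in a test; whether the real method exits is not stated here. `FakeOutput`, `FakeContext`, and both handler functions are hypothetical stand-ins, not the hook's actual API.

```python
class FakeOutput:
    """Hypothetical stand-in for the hook output; records calls instead of exiting."""

    def __init__(self):
        self.calls = []

    def exit_success(self):
        self.calls.append("exit_success")


class FakeContext:
    """Hypothetical stand-in for the hook context object `c`."""

    def __init__(self, tool_name, tool_input):
        self.tool_name = tool_name
        self.tool_input = tool_input
        self.output = FakeOutput()


def handle_without_return(c):
    if c.tool_name != "Write":
        c.output.exit_success()  # no return: execution continues below
    file_path = c.tool_input.get("file_path", "")
    return f"processed:{file_path}"


def handle_with_return(c):
    if c.tool_name != "Write":
        c.output.exit_success()
        return None  # explicit return stops here even if exit_success() does not exit
    file_path = c.tool_input.get("file_path", "")
    return f"processed:{file_path}"


ctx_a = FakeContext("Read", {"file_path": "a.py"})
fell_through = handle_without_return(ctx_a)

ctx_b = FakeContext("Read", {"file_path": "a.py"})
stopped = handle_with_return(ctx_b)
```

With a non-exiting `exit_success()`, the first handler goes on to process a non-`Write` call, while the second stops as intended.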
Skill Clarity
Templates: significantly improved
The rewritten
One residual ambiguity: the template description frontmatter placeholder
One documentation gap: there is no explanation of how a skill author translates
Minor
Summary
The core goal (give skill authors a repeatable way to evaluate trigger behavior) is sound and the implementation is mostly clean. The critical issue is the missing PEP 723 metadata on