Add skill-eval-authoring skill and harden release-plan eval exemplar #15836
Draft
helen229 wants to merge 1 commit into
Adds a new meta-skill that teaches the agent (and contributors) how to write robust evals for skills under .github/skills. Includes SKILL.md, three reference docs (four-layer pattern, grader catalog, anti-patterns), and self-eval (capability + trigger). Rewrites .github/skills/azsdk-common-prepare-release-plan/evals/eval.yaml as the canonical four-layer-pattern exemplar that skill-eval-authoring points to. Every capability stimulus now asserts skill-invocation + tool-calls, replacing single-substring output-contains graders that previously passed on refusals and wrong-tool calls. Registers the new skill in .github/skills/.vally.yaml paths.evals.
📊 GEPA Skill Quality Scores
0/10 skills at quality ≥ 0.80
How to improve
# Score a specific skill
python .github/skills/sensei/scripts/gepa/auto_evaluator.py score --skill <name> --skills-dir .github/skills --tests-dir tests
# Optimize a skill with GEPA
python .github/skills/sensei/scripts/gepa/auto_evaluator.py optimize --skill <name> --skills-dir .github/skills --tests-dir tests
Summary
Adds a new meta-skill, skill-eval-authoring, that teaches the agent (and contributors) how to write robust evals for skills under .github/skills/. Rewrites the release-plan capability suite as the canonical exemplar the new skill points to.
Motivation
An audit of all 9 skill eval suites found that ~25 capability stimuli are graded only by a single output-contains "<keyword>". These tests pass even on refusals and wrong-tool calls, so CI is green by construction. The fix is structural, not a one-line lint: contributors need a pattern and a worked example.
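The failure mode can be illustrated with a minimal Python sketch. This is not vally's implementation: the Transcript shape, the two grader functions, and both tool names are hypothetical and exist only for illustration. It compares a single-substring check with a stricter check that also requires the skill invocation and the expected tool call, which is the kind of assertion this PR adds to the release-plan evals.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical record of one eval run (not vally's real schema)."""
    output: str
    invoked_skills: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)


def output_contains(t: Transcript, keyword: str) -> bool:
    """Single-substring grader: inspects only the final text."""
    return keyword.lower() in t.output.lower()


def layered_check(t: Transcript, skill: str, tool: str, keyword: str) -> bool:
    """Stricter check: skill invoked, expected tool called, and text matches."""
    return (
        skill in t.invoked_skills
        and tool in t.tool_calls
        and output_contains(t, keyword)
    )


SKILL = "azsdk-common-prepare-release-plan"
TOOL = "create_release_plan"  # hypothetical tool name

# A refusal that happens to mention the keyword.
refusal = Transcript(output="Sorry, I cannot create a release plan right now.")

# The right skill, but the wrong tool was called.
wrong_tool = Transcript(
    output="Here is your release plan.",
    invoked_skills=[SKILL],
    tool_calls=["list_pull_requests"],  # hypothetical wrong tool
)

# The intended behavior.
correct = Transcript(
    output="Created the release plan.",
    invoked_skills=[SKILL],
    tool_calls=[TOOL],
)

for name, t in [("refusal", refusal), ("wrong_tool", wrong_tool), ("correct", correct)]:
    substring_only = output_contains(t, "release plan")
    layered = layered_check(t, SKILL, TOOL, "release plan")
    print(f"{name}: substring_only={substring_only} layered={layered}")
```

In this sketch the substring check passes all three transcripts, while the layered check passes only the last one.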
What's in this PR
New:
.github/skills/skill-eval-authoring/
SKILL.md
references/four-layer-pattern.md
references/grader-catalog.md
dist/graders/. Documents hidden behaviors (scoring.weights not applied, scoring.threshold defaults to 0, regex semantics of tool-calls.name, prefixed tool names work).
references/anti-patterns.md
evals/eval.yaml + evals/trigger.eval.yaml
Rewritten:
.github/skills/azsdk-common-prepare-release-plan/evals/eval.yaml
Replaced the previous 6 stimuli (most of which were graded only by output-contains "release plan") with 6 four-layer-pattern stimuli covering: create-from-spec-pr, update-sdk-details, link-sdk-prs, get-release-plan-status, and two negatives. The file's description: declares it the canonical exemplar.
Registered:
.github/skills/.vally.yaml
Adds skill-eval-authoring/evals/ to paths.evals.
Out of scope (deliberate)
unit/integration/e2e), proposed as P2.
Open questions for the group (also captured in the design doc)
tool-calls arg matching to arbitrary keys (today only command/path)?
Verification
eng/skill-eval/node_modules/@microsoft/vally/dist/pipeline/grading.js and dist/graders/static/
.vally-results/.../results.jsonl.
Draft until reviewers weigh in on the open questions above.