Add skill-eval-authoring skill and harden release-plan eval exemplar by helen229 · Pull Request #15836 · Azure/azure-sdk-tools

helen229 · 2026-06-02T23:22:28Z

Summary

Adds a new meta-skill, skill-eval-authoring, that teaches the agent (and contributors) how to write robust evals for skills under .github/skills/. Rewrites the release-plan capability suite as the canonical exemplar the new skill points to.

Motivation

An audit of all 9 skill eval suites found that ~25 capability stimuli are graded only by a single output-contains "<keyword>" check. Such a test passes on every one of the following responses, although only the first should pass:

correct tool call ✅
polite refusal ❌
wrong-tool call that mentions the keyword ❌
prompt parroting ❌

CI is green by construction. The fix is structural, not a one-line lint: contributors need a pattern and a worked example.
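The failure mode is easy to reproduce outside the eval harness. The sketch below is a minimal illustration, not Vally's implementation: `Response`, `output_contains`, `called_tool`, the tool names, and the four sample responses are all hypothetical stand-ins for the grader and the agent outputs listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical stand-in for one agent run: final text plus tools called."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


def output_contains(response: Response, keyword: str) -> bool:
    """Single-substring grader: passes whenever the keyword appears in the text."""
    return keyword.lower() in response.text.lower()


def called_tool(response: Response, tool: str) -> bool:
    """Behavioral check: passes only if the expected tool was actually called."""
    return tool in response.tool_calls


# Tool names below are illustrative, not real MCP tool names.
cases = {
    "correct tool call": Response("Created the release plan.", ["create_release_plan"]),
    "polite refusal": Response("Sorry, I can't create a release plan."),
    "wrong-tool call": Response("Here is the release plan status.", ["get_pipeline_status"]),
    "prompt parroting": Response("You asked me to create a release plan."),
}

for name, response in cases.items():
    substring_ok = output_contains(response, "release plan")
    behavior_ok = substring_ok and called_tool(response, "create_release_plan")
    print(f"{name:20} substring={substring_ok!s:5} substring+tool={behavior_ok}")
```

All four cases satisfy the substring check, while only the correct tool call satisfies the combined check. That gap is what the structural fix is meant to close.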

What's in this PR

New: `.github/skills/skill-eval-authoring/`

| File | Purpose |
| --- | --- |
| `SKILL.md` | Meta-skill with WHEN / DO NOT USE FOR / INVOKES frontmatter. Hard rules + workflow. |
| `references/four-layer-pattern.md` | Deep dive on the routing → tool-use → output shape → judgment layering. |
| `references/grader-catalog.md` | All Vally 0.5.0 graders, verified against `dist/graders/`. Documents hidden behaviors (`scoring.weights` not applied, `scoring.threshold` defaults to 0, regex semantics of `tool-calls.name`, prefixed tool names work). |
| `references/anti-patterns.md` | 7 smells (A1–A7) with examples and fixes. |
| `evals/eval.yaml` + `evals/trigger.eval.yaml` | Eats its own dog food: 2 capability + 9 trigger/anti-trigger stimuli. |
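Two of the hidden behaviors noted for the grader catalog are easier to appreciate with numbers. The sketch below is an illustration under stated assumptions, not Vally's code: `suite_passes` is a hypothetical helper, and it assumes the suite score is the plain mean of grader scores compared against the threshold with `>=`. The weights are made-up values.

```python
def suite_passes(scores: list[float], threshold: float = 0.0) -> bool:
    """Hypothetical aggregation: unweighted mean compared with the threshold.

    The 0.0 default mirrors the documented `scoring.threshold` default;
    the `>=` comparison is an assumption.
    """
    mean = sum(scores) / len(scores)
    return mean >= threshold


# Every grader failed, yet a threshold of 0 still reports a pass.
all_failed = [0.0, 0.0, 0.0]
print(suite_passes(all_failed))                 # True
print(suite_passes(all_failed, threshold=0.8))  # False

# If weights are not applied, a failing high-priority grader counts the same
# as a trivial one. Weights here are illustrative only.
scores = {"tool-calls": 0.0, "output-contains": 1.0}
weights = {"tool-calls": 3.0, "output-contains": 1.0}

unweighted = sum(scores.values()) / len(scores)
weighted = sum(scores[g] * weights[g] for g in scores) / sum(weights.values())
print(unweighted)  # 0.5
print(weighted)    # 0.25
```

Under these assumptions, an author who relies on weights or on an implicit non-zero threshold gets a more lenient suite than intended, which is presumably why the catalog calls these behaviors out.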

Rewritten: `.github/skills/azsdk-common-prepare-release-plan/evals/eval.yaml`

Replaced the previous 6 stimuli (most of which were graded only by output-contains "release plan") with 6 four-layer-pattern stimuli covering create-from-spec-pr, update-sdk-details, link-sdk-prs, get-release-plan-status, and two negatives. The file's `description:` field declares it the canonical exemplar.
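The four-layer idea (routing → tool-use → output shape → judgment) can be sketched as a chain of checks in which a stimulus passes only if every layer passes. This is a hypothetical Python illustration, not the actual `eval.yaml` schema: the `Trajectory` fields, the tool names, the patterns, and the stubbed judgment layer are all assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trajectory:
    """Hypothetical record of one agent run."""
    skill_invoked: Optional[str]
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


def grade(t: Trajectory, skill: str, tool_pattern: str, output_pattern: str,
          judge: Callable[[Trajectory], bool]) -> Optional[str]:
    """Return the name of the first failing layer, or None if all four pass."""
    layers = [
        ("routing", lambda: t.skill_invoked == skill),
        ("tool-use", lambda: any(re.search(tool_pattern, name) for name in t.tool_calls)),
        ("output shape", lambda: re.search(output_pattern, t.output) is not None),
        ("judgment", lambda: judge(t)),
    ]
    for name, check in layers:
        if not check():
            return name
    return None


SKILL = "azsdk-common-prepare-release-plan"


def check(t: Trajectory) -> Optional[str]:
    # The judgment layer is stubbed as always-true for this sketch.
    return grade(t, SKILL, r"create_release_plan", r"(?i)release plan", judge=lambda _: True)


good = Trajectory(SKILL, ["create_release_plan"], "Release plan created.")
refusal = Trajectory(None, [], "Sorry, I can't create a release plan.")
wrong_tool = Trajectory(SKILL, ["get_pipeline_status"], "Here is the release plan.")

print(check(good))        # None
print(check(refusal))     # routing
print(check(wrong_tool))  # tool-use
```

Reporting the first failing layer, rather than a bare pass/fail, also hints at where a red stimulus went wrong.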

Registered: `.github/skills/.vally.yaml`

Adds skill-eval-authoring/evals/ to paths.evals.

Out of scope (deliberate)

Lint script that fails CI on single-substring capability tests — proposed as P2 in the design doc.
Sweep of the other 8 skill eval suites to apply the four-layer pattern — proposed as P1 follow-ups, one PR per skill, owned by skill author.
Tiered environments (unit / integration / e2e) — proposed as P2.

Open questions for the group (also captured in the design doc)

Where do live-MCP e2e tests run? Owner? Cost?
Multi-model matrix (opus + gpt-5.4) for skill evals, or pin one?
Threshold per tier — 0.8 for capability, 1.0 for anti-trigger?
Extend tool-calls arg matching to arbitrary keys (today only command / path)?
Should we record a known-good trajectory per skill and diff-grade against it?
Who owns the eval suite per skill — author, language team, or engsys?
Add a failure-attribution LLM pass to classify each red stimulus (skill / model / MCP tool)?
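To make the trajectory question concrete, one possible shape for diff-grading is to compare the ordered tool-call names of a run against a recorded known-good sequence. The sketch below uses Python's standard `difflib.SequenceMatcher`. The function names, tool names, and the 0.8 cutoff are placeholders for discussion, not a proposal for what Vally does or should do.

```python
from difflib import SequenceMatcher


def trajectory_similarity(expected: list[str], actual: list[str]) -> float:
    """Similarity in [0, 1] between two ordered tool-call sequences."""
    return SequenceMatcher(None, expected, actual, autojunk=False).ratio()


def diff_grade(expected: list[str], actual: list[str], threshold: float = 0.8) -> bool:
    """Pass when the observed trajectory is close enough to the known-good one."""
    return trajectory_similarity(expected, actual) >= threshold


# Hypothetical known-good trajectory for one stimulus.
known_good = ["get_spec_pr", "create_release_plan", "link_sdk_prs"]

identical_run = list(known_good)
refusal_run: list[str] = []  # agent refused and called no tools

print(trajectory_similarity(known_good, identical_run))  # 1.0
print(diff_grade(known_good, refusal_run))               # False
```

A sequence ratio tolerates small detours (an extra lookup call, say) while still failing runs that skip the core tools. Whether that tolerance is desirable is part of the open question.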

Verification

Manually verified grader registration against eng/skill-eval/node_modules/@microsoft/vally/dist/pipeline/grading.js and dist/graders/static/.
Confirmed prefixed tool names emit through copilot-sdk adapter by inspecting a real trajectory in vally-results/.../results.jsonl.
New skill folder lints clean (frontmatter parses; references resolve).

Draft until reviewers weigh in on the open questions above.

Adds a new meta-skill that teaches the agent (and contributors) how to write robust evals for skills under .github/skills. Includes SKILL.md, three reference docs (four-layer pattern, grader catalog, anti-patterns), and self-eval (capability + trigger). Rewrites .github/skills/azsdk-common-prepare-release-plan/evals/eval.yaml as the canonical four-layer-pattern exemplar that skill-eval-authoring points to. Every capability stimulus now asserts skill-invocation + tool-calls, replacing single-substring output-contains graders that previously passed on refusals and wrong-tool calls. Registers the new skill in .github/skills/.vally.yaml paths.evals.

github-actions · 2026-06-02T23:25:20Z

📊 GEPA Skill Quality Scores

| Skill | Quality | Triggers | Tests |
| --- | --- | --- | --- |
| azsdk-common-generate-sdk-locally | ❌ 0.30 | N/A | --- |
| azsdk-common-pipeline-troubleshooting | ❌ 0.34 | N/A | --- |
| azsdk-common-prepare-release-plan | ❌ 0.37 | N/A | --- |
| azsdk-common-sdk-release | ❌ 0.37 | N/A | --- |
| markdown-token-optimizer | ❌ 0.39 | N/A | --- |
| azure-typespec-author | ❌ 0.40 | N/A | --- |
| azsdk-common-apiview-feedback-resolution | ❌ 0.43 | N/A | --- |
| skill-authoring | ❌ 0.48 | N/A | --- |
| skill-eval-authoring | ⚠️ 0.51 | N/A | --- |
| sensei | ⚠️ 0.57 | N/A | --- |

0/10 skills at quality ≥ 0.80

How to improve

# Score a specific skill
python .github/skills/sensei/scripts/gepa/auto_evaluator.py score --skill <name> --skills-dir .github/skills --tests-dir tests

# Optimize a skill with GEPA
python .github/skills/sensei/scripts/gepa/auto_evaluator.py optimize --skill <name> --skills-dir .github/skills --tests-dir tests

helen229 wants to merge 1 commit into main from feat/skill-eval-authoring
