Add skill-eval-authoring skill and harden release-plan eval exemplar #15836

Draft
helen229 wants to merge 1 commit into main from feat/skill-eval-authoring

helen229 (Member) commented Jun 2, 2026

Summary

Adds a new meta-skill, skill-eval-authoring, that teaches the agent (and contributors) how to write robust evals for skills under .github/skills/. Rewrites the release-plan capability suite as the canonical exemplar the new skill points to.

Motivation

An audit of all 9 skill eval suites found that ~25 capability stimuli are graded only by a single output-contains "<keyword>" check. Such a test passes on every one of the following responses, only the first of which should pass:

  • correct tool call ✅
  • polite refusal ❌
  • wrong-tool call that mentions the keyword ❌
  • prompt parroting ❌

CI is green by construction. The fix is structural, not a one-line lint: contributors need a pattern and a worked example.
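To make the failure mode concrete, here is a minimal Python sketch. The `output_contains` function is a hypothetical stand-in for a single substring grader (it is not Vally's implementation, and the case-insensitive match is an assumption), and the four response strings are invented for illustration.

```python
def output_contains(output: str, keyword: str) -> bool:
    """Hypothetical stand-in for a single substring grader (not Vally's code)."""
    # Assumption: case-insensitive substring match.
    return keyword.lower() in output.lower()


# Invented responses, one per case listed above.
responses = {
    "correct tool call": "Created the release plan via the release-plan tool.",
    "polite refusal": "Sorry, I can't create a release plan right now.",
    "wrong-tool call": "I ran the pipeline tool; no release plan was touched.",
    "prompt parroting": "You asked me to create a release plan.",
}

# Grade every response with the same single keyword check.
results = {
    name: output_contains(text, "release plan")
    for name, text in responses.items()
}

for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

All four responses pass, so the grader cannot tell the correct behavior from the three failures.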

What's in this PR

New: .github/skills/skill-eval-authoring/

| File | Purpose |
| --- | --- |
| SKILL.md | Meta-skill with WHEN / DO NOT USE FOR / INVOKES frontmatter. Hard rules + workflow. |
| references/four-layer-pattern.md | Deep dive on the routing → tool-use → output shape → judgment layering. |
| references/grader-catalog.md | All Vally 0.5.0 graders, verified against dist/graders/. Documents hidden behaviors (scoring.weights not applied, scoring.threshold defaults to 0, regex semantics of tool-calls.name, prefixed tool names work). |
| references/anti-patterns.md | 7 smells (A1–A7) with examples and fixes. |
| evals/eval.yaml + evals/trigger.eval.yaml | Eats its own dog food: 2 capability + 9 trigger/anti-trigger stimuli. |
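The regex semantics of tool-calls.name matter for how patterns are written. The sketch below assumes unanchored search semantics (as with Python's `re.search`); that is an assumption for illustration, and references/grader-catalog.md remains the authority on what Vally actually does. The tool names and the `azure-sdk-mcp-` prefix are hypothetical.

```python
import re


def name_matches(pattern: str, tool_name: str) -> bool:
    # Assumption: the grader behaves like an unanchored regex search.
    return re.search(pattern, tool_name) is not None


# Hypothetical tool names: prefixed, bare, and an unrelated near-miss.
tool_names = [
    "azure-sdk-mcp-azsdk_create_release_plan",
    "azsdk_create_release_plan",
    "azsdk_recreate_release_plan_cache",
]

loose = "create_release_plan"
anchored = r"(^|-)azsdk_create_release_plan$"

# A loose pattern also matches the near-miss, because "recreate_release_plan"
# contains "create_release_plan".
loose_hits = [n for n in tool_names if name_matches(loose, n)]

# An anchored pattern accepts the prefixed and bare names only.
anchored_hits = [n for n in tool_names if name_matches(anchored, n)]

print(len(loose_hits), len(anchored_hits))
```

Under that assumption, anchoring the end of the pattern while allowing an optional prefix keeps prefixed tool names working without accepting look-alike tools.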

Rewritten: .github/skills/azsdk-common-prepare-release-plan/evals/eval.yaml

Replaced the previous 6 stimuli (most of which were graded only by output-contains "release plan") with 6 four-layer-pattern stimuli covering create-from-spec-pr, update-sdk-details, link-sdk-prs, get-release-plan-status, and two negatives. The file's description: field declares it the canonical exemplar.
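As a rough illustration of the layering (routing → tool-use → output shape → judgment), the Python sketch below grades a trajectory layer by layer and reports the first layer that fails. It is not the eval.yaml syntax or Vally's grading code: the `Trajectory` type, the tool names, and the `judge` callable (a stand-in for an LLM-judged rubric) are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trajectory:
    """Hypothetical record of one agent run."""
    skills_invoked: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


def first_failing_layer(
    t: Trajectory,
    skill: str,
    tool_pattern: str,
    shape_pattern: str,
    judge: Callable[[str], bool],
) -> Optional[str]:
    """Return the name of the first failing layer, or None if all pass."""
    if skill not in t.skills_invoked:
        return "routing"
    if not any(re.search(tool_pattern, name) for name in t.tool_calls):
        return "tool-use"
    if re.search(shape_pattern, t.output) is None:
        return "output shape"
    if not judge(t.output):
        return "judgment"
    return None


SKILL = "azsdk-common-prepare-release-plan"
# Stand-in for the judgment layer; a real suite would use an LLM rubric.
judge = lambda out: "sorry" not in out.lower()
args = (SKILL, r"create_release_plan$", r"release plan", judge)

refusal = Trajectory(output="Sorry, I can't create a release plan.")
wrong_tool = Trajectory([SKILL], ["azsdk_run_pipeline"], "release plan done")
correct = Trajectory([SKILL], ["azsdk_create_release_plan"],
                     "Created release plan 123.")

verdicts = [first_failing_layer(t, *args) for t in (refusal, wrong_tool, correct)]
print(verdicts)
```

The refusal and the wrong-tool call both mention the keyword, but they are caught at the routing and tool-use layers before the output text is ever inspected.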

Registered: .github/skills/.vally.yaml

Adds skill-eval-authoring/evals/ to paths.evals.

Out of scope (deliberate)

  • Lint script that fails CI on single-substring capability tests — proposed as P2 in the design doc.
  • Sweep of the other 8 skill eval suites to apply the four-layer pattern — proposed as P1 follow-ups, one PR per skill, owned by skill author.
  • Tiered environments (unit / integration / e2e) — proposed as P2.
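For discussion only, the deferred lint could look roughly like the Python sketch below. It is not part of this PR and not the design doc's proposal: the dict layout (`name`, `graders`, `type`) is a hypothetical in-memory shape, not the real eval.yaml schema, and only the grader type names are taken from this description.

```python
def weak_capability_stimuli(stimuli: list[dict]) -> list[str]:
    """Return names of stimuli graded only by a single output-contains grader."""
    flagged = []
    for stimulus in stimuli:
        graders = stimulus.get("graders", [])
        # A lone substring check is the smell this lint would fail CI on.
        if len(graders) == 1 and graders[0].get("type") == "output-contains":
            flagged.append(stimulus.get("name", "<unnamed>"))
    return flagged


# Hypothetical suite: one layered stimulus and one keyword-only stimulus.
suite = [
    {
        "name": "create-from-spec-pr",
        "graders": [
            {"type": "skill-invocation"},
            {"type": "tool-calls"},
            {"type": "output-contains"},
        ],
    },
    {"name": "legacy-keyword-only", "graders": [{"type": "output-contains"}]},
]

flagged = weak_capability_stimuli(suite)
print(flagged)
```

A real version would load the suites registered under paths.evals and exit non-zero when the flagged list is non-empty.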

Open questions for the group (also captured in the design doc)

  1. Where do live-MCP e2e tests run? Owner? Cost?
  2. Multi-model matrix (opus + gpt-5.4) for skill evals, or pin one?
  3. Threshold per tier — 0.8 for capability, 1.0 for anti-trigger?
  4. Extend tool-calls arg matching to arbitrary keys (today only command / path)?
  5. Should we record a known-good trajectory per skill and diff-grade against it?
  6. Who owns the eval suite per skill — author, language team, or engsys?
  7. Add a failure-attribution LLM pass to classify each red stimulus (skill / model / MCP tool)?
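To ground question 5, one possible reading of "diff-grade against a known-good trajectory" is a similarity score over the ordered tool-call names. The Python sketch below uses the standard library's `difflib.SequenceMatcher`; the tool names are hypothetical, and whether a sequence ratio is the right metric is exactly what is open for discussion.

```python
from difflib import SequenceMatcher


def trajectory_similarity(golden: list[str], actual: list[str]) -> float:
    """Similarity in [0, 1] between two ordered lists of tool-call names."""
    return SequenceMatcher(None, golden, actual).ratio()


# Hypothetical known-good trajectory for a release-plan stimulus.
golden = ["get_spec_pr", "create_release_plan", "link_sdk_prs"]

same = trajectory_similarity(golden, list(golden))
skipped_step = trajectory_similarity(golden, ["get_spec_pr", "link_sdk_prs"])
wrong_tool = trajectory_similarity(golden, ["run_pipeline"])

print(same, skipped_step, wrong_tool)
```

An identical trajectory scores 1.0, skipping the middle step scores 0.8, and an unrelated tool call scores 0.0, so a per-tier threshold (question 3) could be applied to this score as well.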

Verification

  • Manually verified grader registration against eng/skill-eval/node_modules/@microsoft/vally/dist/pipeline/grading.js and dist/graders/static/.
  • Confirmed prefixed tool names emit through copilot-sdk adapter by inspecting a real trajectory in vally-results/.../results.jsonl.
  • New skill folder lints clean (frontmatter parses; references resolve).

Draft until reviewers weigh in on the open questions above.

Adds a new meta-skill that teaches the agent (and contributors) how to write robust evals for skills under .github/skills. Includes SKILL.md, three reference docs (four-layer pattern, grader catalog, anti-patterns), and self-eval (capability + trigger).

Rewrites .github/skills/azsdk-common-prepare-release-plan/evals/eval.yaml as the canonical four-layer-pattern exemplar that skill-eval-authoring points to. Every capability stimulus now asserts skill-invocation + tool-calls, replacing single-substring output-contains graders that previously passed on refusals and wrong-tool calls.

Registers the new skill in .github/skills/.vally.yaml paths.evals.

github-actions Bot commented Jun 2, 2026

📊 GEPA Skill Quality Scores

| Skill | Quality | Triggers | Tests |
| --- | --- | --- | --- |
| azsdk-common-generate-sdk-locally | ❌ 0.30 | N/A | --- |
| azsdk-common-pipeline-troubleshooting | ❌ 0.34 | N/A | --- |
| azsdk-common-prepare-release-plan | ❌ 0.37 | N/A | --- |
| azsdk-common-sdk-release | ❌ 0.37 | N/A | --- |
| markdown-token-optimizer | ❌ 0.39 | N/A | --- |
| azure-typespec-author | ❌ 0.40 | N/A | --- |
| azsdk-common-apiview-feedback-resolution | ❌ 0.43 | N/A | --- |
| skill-authoring | ❌ 0.48 | N/A | --- |
| skill-eval-authoring | ⚠️ 0.51 | N/A | --- |
| sensei | ⚠️ 0.57 | N/A | --- |

0/10 skills at quality ≥ 0.80

How to improve
# Score a specific skill
python .github/skills/sensei/scripts/gepa/auto_evaluator.py score --skill <name> --skills-dir .github/skills --tests-dir tests

# Optimize a skill with GEPA
python .github/skills/sensei/scripts/gepa/auto_evaluator.py optimize --skill <name> --skills-dir .github/skills --tests-dir tests
