Follow-up to #1735: improve aca-sandboxes-evals signal quality
After #1735 landed, the aca-sandboxes-evals Vally suite still reports a low absolute pass rate (~9/30 best, ~7/30 median) despite the skill working correctly in live use. Most failures are eval-infrastructure noise, not skill-quality issues.
Root causes
1. LLM-judge nondeterminism (~12 of 30 stimuli use prompt graders)
The same model response scores 0/1 then 1/1 across identical re-runs. With scoring: scale_1_5 and threshold: 0.6, borderline-correct answers flip verdicts run-to-run.
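One common way to damp this kind of flip is to run the judge several times and take the majority verdict. The sketch below is illustrative only: `judge` is a hypothetical callable standing in for a single prompt-grader run (it is not a Vally API), and the probability helper assumes the runs are independent.

```python
from collections import Counter
from typing import Callable


def majority_verdict(judge: Callable[[str], bool], response: str, runs: int = 3) -> bool:
    """Run a noisy pass/fail judge `runs` times and return the majority verdict.

    `judge` is a hypothetical stand-in for one prompt-grader invocation.
    """
    if runs % 2 == 0:
        raise ValueError("use an odd number of runs so there is never a tie")
    votes = Counter(judge(response) for _ in range(runs))
    return votes[True] > votes[False]


def majority_pass_probability(p: float) -> float:
    """Chance that at least 2 of 3 independent runs pass, given per-run pass chance p."""
    return 3 * p**2 * (1 - p) + p**3
```

Majority voting only helps when a single run is right more often than not: under the independence assumption, a per-run pass chance of 0.7 rises to about 0.78, while a true coin flip (0.5) stays at 0.5. Sharper grading criteria are still needed for the genuinely borderline stimuli.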
2. Brittle output-contains substrings
Several positive-case graders require exact literal substrings (e.g. specific aka.ms URLs, flag spellings). Semantically correct paraphrases fail. Example: pos-install-aca-windows checks output-contains: "aka.ms/aca-cli-install-ps" — the model often emits the full https://aka.ms/... URL inside a code fence and still fails the literal match.
3. output-not-matches regex traps
Several "anti-pattern" graders fire when the skill's own anti-cue text echoes a forbidden literal back to the user. Two were patched in #1735:
- (?<!auth )\baca login\b, tripped by "do not use bare aca login" prose
- --allow-ip|--block-host|--deny-host, tripped by the "these flags don't exist" anti-cue
But the approach remains fragile: any future anti-cue that names a forbidden literal will trip its own grader.
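The trap can be reproduced with the first pattern quoted above. This is a minimal sketch using Python's `re` module; whether the suite's graders use the same regex dialect is an assumption, and the helper names and sample responses are made up for illustration. The second helper shows one possible narrowing: scan only fenced code blocks, so prose warnings no longer count.

```python
import re

# Built from single backticks so this example contains no literal fence of its own.
FENCE = "`" * 3

# Pattern quoted from the issue: "aca login" not immediately preceded by "auth ".
FORBIDDEN = re.compile(r"(?<!auth )\baca login\b")

# Captures the body of each fenced code block in a Markdown response.
FENCED_BLOCK = re.compile(re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL)


def trips_anywhere(response: str) -> bool:
    """Current behaviour: the forbidden literal counts wherever it appears."""
    return FORBIDDEN.search(response) is not None


def trips_in_code_only(response: str) -> bool:
    """Narrowed check: only commands inside fenced code blocks count."""
    return any(FORBIDDEN.search(block) for block in FENCED_BLOCK.findall(response))


# Hypothetical responses: a correct answer with an anti-cue, and a genuinely bad one.
anti_cue = (
    "Do not use bare aca login. Authenticate like this:\n"
    + FENCE + "bash\naca auth login\n" + FENCE + "\n"
)
bad_advice = "Run this:\n" + FENCE + "bash\naca login\n" + FENCE + "\n"

print(trips_anywhere(anti_cue), trips_in_code_only(anti_cue))      # True False
print(trips_anywhere(bad_advice), trips_in_code_only(bad_advice))  # True True
```

The narrowed check is not airtight: a comment inside a fence that names the forbidden command would still trip it, and a bad recommendation given only in prose would slip through.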
4. Grader-name drift
Renaming the skill from sandboxes to aca-sandboxes in #1735 broke 28 skill-invocation graders. A local fix is committed as 30828f0 in the eval repo, but the repo is archived, so the push fails with a 403. Until the eval repo is unarchived, the grader renames live only on this machine, and any contributor running the suite hits the bug.
Proposed remediation
- Unarchive paulyuk/aca-sandboxes-evals (or fork to microsoft/aca-sandboxes-evals) so fixes can land.
- Push the pending grader-rename commit (sandboxes → aca-sandboxes).
- Replace brittle output-contains with output-matches regexes where the answer has acceptable variation (e.g. the URL could be bare or in a fence; aca auth login could appear with or without az login first).
- Tighten prompt graders:
  - Lower variance: include explicit "PASS iff…" criteria and example pass/fail outputs in the prompt.
  - Consider lowering threshold from 0.6 → 0.5 for the most subjective stimuli, OR run each prompt grader N=3 times and take the majority.
- Audit output-not-matches patterns so they only match the model's recommendations, not anti-cue echoes; for example, require the forbidden literal to appear inside a code fence rather than in prose.
- Add a baseline regression-floor check: pin a minimum pass count (currently ~7/30) as a CI gate so future skill edits can't silently regress.
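The regression-floor item could be as small as the following sketch. It assumes the suite's results can be reduced to a mapping from stimulus name to pass/fail; that shape, the function names, and the stimulus names in the usage lines are hypothetical and do not describe Vally's real output format.

```python
def meets_floor(verdicts: dict[str, bool], floor: int) -> tuple[bool, int]:
    """Return (gate_ok, pass_count) for one suite run."""
    passed = sum(1 for ok in verdicts.values() if ok)
    return passed >= floor, passed


def gate_exit_code(verdicts: dict[str, bool], floor: int) -> int:
    """Return 0 if the run meets the floor and 1 otherwise, for use with sys.exit() in CI."""
    ok, passed = meets_floor(verdicts, floor)
    print(f"{passed}/{len(verdicts)} passed (floor: {floor})")
    return 0 if ok else 1


# Hypothetical run: 30 stimuli, the first 7 passing.
run = {f"stimulus-{i:02d}": i < 7 for i in range(30)}
print(gate_exit_code(run, floor=7))
```

While judge variance is still high, a floor pinned at the current median is likely to flake on unlucky runs, so the gate is probably best enabled after the grader fixes above, or set slightly below the observed minimum.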
Acceptance criteria
- [...] CONTRIBUTING.md so future skill authors don't repeat it
Context
cc @paulyuk