Improve aca-sandboxes-evals signal quality (follow-up to #1735) #1737

@paulyuk

Follow-up to #1735: improve aca-sandboxes-evals signal quality

After #1735 landed, the aca-sandboxes-evals Vally suite still reports a low absolute pass rate (~9/30 best, ~7/30 median) despite the skill working correctly in live use. Most failures are eval-infrastructure noise, not skill-quality issues.

Root causes

1. LLM-judge nondeterminism (~12 of 30 stimuli use prompt graders)
The same model response scores 0/1 then 1/1 across identical re-runs. With scoring: scale_1_5 and threshold: 0.6, borderline-correct answers flip verdicts run-to-run.

2. Brittle output-contains substrings
Several positive-case graders require exact literal substrings (e.g. specific aka.ms URLs, flag spellings). Semantically correct paraphrases fail. Example: pos-install-aca-windows checks output-contains: "aka.ms/aca-cli-install-ps" — the model often emits the full https://aka.ms/... URL inside a code fence and still fails the literal match.
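To illustrate the difference between a literal substring check and a tolerant pattern, here is a minimal Python sketch. The pattern and the `output_matches` helper are hypothetical and use Python `re` semantics; the suite's own `output-matches` grader may evaluate regexes differently.

```python
import re

# Hypothetical pattern: accepts the short link with or without a scheme,
# in any letter case, and refuses longer look-alike slugs via the trailing \b.
INSTALL_URL = re.compile(r"(?:https?://)?aka\.ms/aca-cli-install-ps\b", re.IGNORECASE)


def output_matches(output: str, pattern: re.Pattern = INSTALL_URL) -> bool:
    """Return True if the pattern occurs anywhere in the model output."""
    return pattern.search(output) is not None


if __name__ == "__main__":
    print(output_matches("Install from https://aka.ms/aca-cli-install-ps today"))
    print(output_matches("See `AKA.MS/aca-cli-install-ps` for details"))
    print(output_matches("See aka.ms/aca-cli-install-sh for details"))
```

The point of the sketch is that acceptable variation (scheme, case, surrounding punctuation) is written into the pattern once, instead of being rediscovered as flaky failures.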

3. output-not-matches regex traps
Several "anti-pattern" graders fire when the skill's own anti-cue text echoes a forbidden literal back to the user. Two were patched in #1735:

  • (?<!auth )\baca login\b tripped by "do not use bare aca login" prose
  • --allow-ip|--block-host|--deny-host tripped by "these flags don't exist" anti-cue

But the pattern is fragile — any future anti-cue that names a forbidden literal will trip its own grader.
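The trap can be reproduced in a few lines. The sketch below reuses the first pattern quoted above and compares matching against the whole output with matching only against fenced code. The helper names (`fenced_code`, `violates`) and the fence-extraction regex are assumptions for illustration, not part of the eval suite.

```python
import re

FENCE_MARK = "`" * 3  # built programmatically so this example can sit inside a fence

# Pattern quoted in the issue for the bare-login anti-pattern.
FORBIDDEN = re.compile(r"(?<!auth )\baca login\b")

# Hypothetical extractor: captures the body of each fenced code block.
FENCE = re.compile(FENCE_MARK + r"[^\n]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def fenced_code(output: str) -> str:
    """Concatenate the bodies of all fenced code blocks in the output."""
    return "\n".join(FENCE.findall(output))


def violates(output: str, fenced_only: bool = True) -> bool:
    """Return True if the forbidden literal appears in the searched text."""
    haystack = fenced_code(output) if fenced_only else output
    return FORBIDDEN.search(haystack) is not None


if __name__ == "__main__":
    anti_cue = "Do not use bare aca login; use aca auth login instead."
    bad = "Run this:\n" + FENCE_MARK + "bash\naca login\n" + FENCE_MARK
    print(violates(anti_cue, fenced_only=False))  # prose echo trips the grader
    print(violates(anti_cue))                     # prose is ignored
    print(violates(bad))                          # a real recommendation is still caught
```

Restricting the search to fenced code is one possible heuristic. It would miss a forbidden command recommended in inline prose, so it trades some recall for fewer false positives.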

4. Grader-name drift
Renaming the skill sandboxes → aca-sandboxes in #1735 broke 28 skill-invocation graders. A local fix is committed as 30828f0 in the eval repo, but the repo is archived, so the push fails with a 403. Until the eval repo is unarchived, those grader renames live only on this machine, and any contributor running the suite hits the bug.

Proposed remediation

  1. Unarchive paulyuk/aca-sandboxes-evals (or fork to microsoft/aca-sandboxes-evals) so fixes can land.
  2. Push the pending grader-rename commit (sandboxes → aca-sandboxes).
  3. Replace brittle output-contains with output-matches regexes where the answer has acceptable variation (e.g. the URL could be bare or in a fence; aca auth login could appear with or without az login first).
  4. Tighten prompt graders:
    • Lower variance: include explicit "PASS iff…" criteria and example pass/fail outputs in the prompt.
    • Consider lowering threshold from 0.6 → 0.5 for the most subjective stimuli, OR run each prompt grader N=3 times and take the majority.
  5. Audit output-not-matches patterns so they only match the model's recommendations, not anti-cue echoes — e.g. require the forbidden literal to appear inside a code fence rather than in prose.
  6. Add a baseline regression-floor check: pin a minimum pass count (currently ~7/30) as a CI gate so future skill edits can't silently regress.
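Items 4 and 6 above could be sketched roughly as follows. This is a minimal Python illustration, not the suite's implementation: `grade_once` stands in for a single call to a prompt grader, and `check_regression_floor` is a hypothetical CI helper whose default floor of 7 comes from the median pass count quoted in this issue.

```python
from collections import Counter
from typing import Callable


def majority_verdict(grade_once: Callable[[], bool], n: int = 3) -> bool:
    """Run a nondeterministic grader n times and return the majority verdict."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number so a majority always exists")
    votes = Counter(grade_once() for _ in range(n))
    return votes[True] > n // 2


def check_regression_floor(passed: int, floor: int = 7) -> None:
    """Exit with a non-zero status if the suite passes fewer stimuli than the floor."""
    if passed < floor:
        raise SystemExit(f"pass count {passed} is below the regression floor {floor}")


if __name__ == "__main__":
    # Simulated judge that flips on the first run, as described in root cause 1.
    verdicts = iter([False, True, True])
    print(majority_verdict(lambda: next(verdicts)))
    check_regression_floor(passed=9)
```

Majority voting triples the judge cost per stimulus, so it may be worth applying only to the graders already observed to flip between runs.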

Acceptance criteria

Context

cc @paulyuk
