Improve aca-sandboxes-evals signal quality (follow-up to #1735)

## Follow-up to #1735: improve aca-sandboxes-evals signal quality

After #1735 landed, the [`aca-sandboxes-evals`](https://github.com/paulyuk/aca-sandboxes-evals) Vally suite still reports a low absolute pass rate (~9/30 best, ~7/30 median) **despite the skill working correctly in live use**. Most failures are eval-infrastructure noise, not skill-quality issues.

### Root causes

**1. LLM-judge nondeterminism (~12 of 30 stimuli use `prompt` graders)**
The same model response scores 0/1 then 1/1 across identical re-runs. With `scoring: scale_1_5` and `threshold: 0.6`, borderline-correct answers flip verdicts run-to-run.
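
To make the flip concrete, here is a minimal sketch of how a one-point wobble in the judge's score can change the verdict. It assumes the 1-5 score is mapped linearly onto 0.0-1.0 before being compared with `threshold`. That mapping and the function names are assumptions for illustration, not confirmed Vally behaviour.

```python
def normalized(score: int) -> float:
    """Map a 1-5 judge score onto 0.0-1.0 (assumed linear mapping)."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    return (score - 1) / 4


def verdict(score: int, threshold: float = 0.6) -> bool:
    """Return True when the normalized score clears the threshold."""
    return normalized(score) >= threshold


# A borderline answer that the judge rates 3 on one run and 4 on the next
# lands on opposite sides of the 0.6 threshold under this assumed mapping.
for score in (3, 4):
    print(score, normalized(score), verdict(score))
```

Under this assumed mapping a score of 3 normalizes to 0.5 and a score of 4 to 0.75, so only the second clears 0.6.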

**2. Brittle `output-contains` substrings**
Several positive-case graders require *exact* literal substrings (e.g. specific aka.ms URLs, flag spellings). Semantically correct paraphrases fail. Example: `pos-install-aca-windows` checks `output-contains: "aka.ms/aca-cli-install-ps"` — the model often emits the full `https://aka.ms/...` URL inside a code fence and still fails the literal match.
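
As a rough illustration of the difference between a literal check and a tolerant one, the sketch below compares a plain substring test with a regex that accepts an optional scheme and ignores case. The helper names and the sample outputs are hypothetical. The real grader semantics may differ, and case variation is just one example of how a literal can miss.

```python
import re

NEEDLE = "aka.ms/aca-cli-install-ps"

# Hypothetical tolerant pattern: optional scheme, case-insensitive host and path.
INSTALL_URL = re.compile(r"(?:https?://)?aka\.ms/aca-cli-install-ps\b", re.IGNORECASE)


def output_contains(output: str, needle: str) -> bool:
    """Literal, case-sensitive substring check."""
    return needle in output


def output_matches(output: str, pattern: re.Pattern) -> bool:
    """Regex search anywhere in the output."""
    return pattern.search(output) is not None


# Hypothetical model outputs: same link, different casing.
exact = "Run: irm https://aka.ms/aca-cli-install-ps | iex"
recased = "Run: irm https://AKA.ms/aca-cli-install-ps | iex"

for text in (exact, recased):
    print(output_contains(text, NEEDLE), output_matches(text, INSTALL_URL))
```

The literal check accepts only the first output, while the regex accepts both.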

**3. `output-not-matches` regex traps**
Several "anti-pattern" graders fire when the skill's own anti-cue text echoes a forbidden literal back to the user. Two were patched in #1735:
- `(?<!auth )\baca login\b` tripped by "do not use bare `aca login`" prose
- `--allow-ip|--block-host|--deny-host` tripped by "these flags don't exist" anti-cue

But the pattern is fragile — any future anti-cue that names a forbidden literal will trip its own grader.
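
The trap can be reproduced with the first pattern above. The sketch below runs it against a hypothetical anti-cue response, then against only the fenced code blocks of the same response. The fence-extraction regex and the sample strings are assumptions for illustration, not the suite's actual grader logic.

```python
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

# Pattern quoted in this issue: "aca login" not preceded by "auth ".
FORBIDDEN = re.compile(r"(?<!auth )\baca login\b")

# Hypothetical helper: captures the body of each fenced code block.
FENCED_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def trips_anywhere(output: str) -> bool:
    """Grader fires if the forbidden literal appears anywhere in the output."""
    return FORBIDDEN.search(output) is not None


def trips_in_fences_only(output: str) -> bool:
    """Grader fires only if the forbidden literal appears inside a code fence."""
    return any(FORBIDDEN.search(body) for body in FENCED_BLOCK.findall(output))


# Hypothetical responses: a correct answer that warns against the bare command,
# and an incorrect answer that actually recommends it.
anti_cue = ("Do not use bare `aca login`; run this instead:\n"
            + FENCE + "bash\naca auth login\n" + FENCE)
bad_advice = "Sign in first:\n" + FENCE + "bash\naca login\n" + FENCE

for text in (anti_cue, bad_advice):
    print(trips_anywhere(text), trips_in_fences_only(text))
```

In this sketch the whole-output check flags both responses, while the fence-only check flags only the one that recommends the bare command.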

**4. Grader-name drift**
Renaming the skill `sandboxes` → `aca-sandboxes` in #1735 broke 28 skill-invocation graders. A local fix is committed as 30828f0 in the eval repo, but **the repo is archived**, so the push fails with a 403. Until the eval repo is unarchived, the grader renames exist only in that local clone, and any contributor running the suite hits the bug.

### Proposed remediation

1. **Unarchive [`paulyuk/aca-sandboxes-evals`](https://github.com/paulyuk/aca-sandboxes-evals)** (or fork to `microsoft/aca-sandboxes-evals`) so fixes can land.
2. **Push the pending grader-rename commit** (`sandboxes` → `aca-sandboxes`).
3. **Replace brittle `output-contains` with `output-matches` regexes** where the answer has acceptable variation (e.g. the URL could be bare or in a fence; `aca auth login` could appear with or without `az login` first).
4. **Tighten `prompt` graders**:
   - Lower variance: include explicit "PASS iff…" criteria and example pass/fail outputs in the prompt.
   - Consider lowering `threshold` from 0.6 → 0.5 for the most subjective stimuli, OR run each `prompt` grader N=3 times and take the majority.
5. **Audit `output-not-matches` patterns** so they only match the model's *recommendations*, not anti-cue echoes — e.g. require the forbidden literal to appear inside a code fence rather than in prose.
6. **Add a baseline regression-floor check**: pin a minimum pass count (currently ~7/30) as a CI gate so future skill edits can't silently regress.
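
The N=3 majority option in item 4 could look roughly like the sketch below. The `judge` callable stands in for a single `prompt` grader call and is hypothetical. How the harness would actually expose repeated judging is not known here.

```python
from typing import Callable


def majority_verdict(judge: Callable[[str], bool], response: str, n: int = 3) -> bool:
    """Run the judge n times on the same response and return the majority verdict."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number so ties cannot occur")
    votes = [judge(response) for _ in range(n)]
    return sum(votes) > n // 2


# Usage with a stand-in judge that replays a fixed sequence of verdicts,
# imitating a borderline answer that fails once and passes twice.
replayed = iter([False, True, True])
print(majority_verdict(lambda _response: next(replayed), "model output"))
```

An odd N avoids ties. The cost is N judge calls per stimulus, so this is probably best reserved for the most subjective graders.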

### Acceptance criteria

- Eval repo accepts pushes again
- Grader rename commit lands
- After judge cleanup, a fresh full run of the unmodified plugin from #1735 produces a stable pass count of ≥15/30 across 3 consecutive runs
- The regex-trap pattern is documented in `CONTRIBUTING.md` so future skill authors don't repeat it
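
One possible reading of the pass-count criterion, and of the regression floor proposed earlier, is sketched below. It treats "stable" as every one of the last three runs landing at or above the floor. That reading is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def stable_at_or_above(pass_counts: Sequence[int], floor: int, runs: int = 3) -> bool:
    """True when each of the most recent `runs` pass counts is at least `floor`."""
    if len(pass_counts) < runs:
        return False  # not enough consecutive runs to judge stability
    return all(count >= floor for count in pass_counts[-runs:])


# Acceptance target from this issue: at least 15 of 30 on 3 consecutive runs.
print(stable_at_or_above([15, 16, 15], floor=15))
# The same helper could back a CI gate at the current floor of about 7.
print(stable_at_or_above([7, 9, 8], floor=7))
```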

### Context

- PR #1735 (the manifest-registration fix that motivated this): https://github.com/microsoft/azure-container-apps/pull/1735
- Eval repo (archived): https://github.com/paulyuk/aca-sandboxes-evals
- Best run in #1735 testing: 9/30 (+80% vs 5/30 baseline); see the comment thread on #1735

cc @paulyuk

