Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) by helen229 · Pull Request #15811 · Azure/azure-sdk-tools

helen229 · 2026-06-01T17:22:50Z

Stands up Azure.Sdk.Tools.Vally as the home for MCP-tool scenario and trigger evals, ports the legacy Azure.Sdk.Tools.Cli.Evaluations benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR

New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`

- `.vally.yaml`: a single `azsdk-mcp` environment that spawns `Azure.Sdk.Tools.Cli` via `dotnet run`, plus named suites for selective execution (`typespec`, `release-plan`, `github`, `pipeline`, `scenarios`, `triggers`, `all`).
- `.gitignore`: excludes local `vally-results/` and `results/`.
- `README.md`: explains how Vally evals relate to the per-skill evals under `.github/skills/`, lists scenario and trigger coverage, and documents the run loop.

`evals/scenarios/` — 11 multi-step workflow evals (the #15124 port)

Ported from Azure.Sdk.Tools.Cli.Evaluations and reshaped for Vally's tool-calls grader:

| Scenario | Shape |
| --- | --- |
| `check-public-repo` | Single-tool: is a TypeSpec project published in `azure-rest-api-specs`? |
| `check-public-repo-then-validate` | Multi-tool, ordered: check then validate |
| `validate-typespec` | Single-tool: `tsp` linter/validation |
| `typespec-generation-step02` | Step in the spec-PR generation flow |
| `get-modified-typespec-projects` | Git-aware tool against current branch |
| `add-arm-resource` | Calls `azsdk_typespec_generate_authoring_plan` for an ARM resource |
| `create-release-plan` | Single-tool: create a release-plan work item |
| `link-namespace-approval-issue` | Link an existing approval issue to a release plan |
| `get-pr-link-current-branch` | Resolve the PR for the active git branch |
| `check-sdk-generation-status` | Pipeline status lookup |
| `rename-client-property` | Stub: needs `expected-diff` grader (follow-up) |

`evals/triggers/` — 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.

apiview, config, engsys, github, package, pipeline, releaseplan, typespec, verify — covering the bulk of the azsdk_* tool surface.

`scripts/Validate-EvalTools.ps1` (ported from #15183)

Drift detector. Runs azsdk list --output json and cross-checks:

- every tool referenced in `evals/triggers/` exists on the running MCP server (catches renames)
- every server tool has at least one trigger eval (catches new tools landing without coverage)
- known-excluded tools (examples, `hello_world`, `upgrade`, codeowner helpers) are filtered out
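
The cross-check above amounts to two set differences. The following Python sketch illustrates that logic only; it is not the PowerShell script itself. The function name `find_drift`, its parameters, and the sample tool names are hypothetical, and the excluded tools are assumed to be passed in as an explicit list.

```python
from typing import Iterable


def find_drift(
    referenced: Iterable[str],
    server_tools: Iterable[str],
    excluded: Iterable[str] = (),
) -> dict[str, list[str]]:
    """Compare tools named in trigger evals with tools the server reports."""
    referenced_set = set(referenced)
    all_server = set(server_tools)
    # Excluded tools never need coverage, so drop them before the coverage check.
    coverable = all_server - set(excluded)
    return {
        # Referenced in an eval but absent from the server: likely a rename.
        "missing_on_server": sorted(referenced_set - all_server),
        # Reported by the server but never referenced: no trigger coverage.
        "uncovered": sorted(coverable - referenced_set),
    }


# Hypothetical tool names, for illustration only.
report = find_drift(
    referenced=["azsdk_a", "azsdk_old_name"],
    server_tools=["azsdk_a", "azsdk_new_name", "azsdk_hello_world"],
    excluded=["azsdk_hello_world"],
)
print(report)
```

Either list being non-empty would be a reasonable condition for a non-zero exit code, though how the real script reports failures is not stated here.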

What's not in this PR (deliberate)

- `AZSDKTOOLS_AGENT_TESTING` toggle: currently `false`. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second `azsdk-mcp-live` environment or a CI policy. Left for a follow-up.
- `rename-client-property` grader: still a stub awaiting a Vally `expected-diff` grader.
- CI wiring: the project builds and runs locally; a `ci.yml` under `eng/pipelines/templates` is a follow-up.

Acknowledgements

Trigger evals + Validate-EvalTools.ps1 ported from @jeo02's #15183 (migrate-evaluations-to-vally); prefixes stripped from azure-sdk-mcp-azsdk_* → azsdk_* to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.
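
As a rough illustration of the prefix change described above, the sketch below rewrites prefixed tool names to their bare form. It assumes the legacy prefix is the literal string `azure-sdk-mcp-` placed directly in front of a name starting with `azsdk_`; the helper name `strip_mcp_prefix` is hypothetical and this is not the procedure actually used in the port.

```python
import re

# Assumption: the legacy prefix is exactly "azure-sdk-mcp-" and is only removed
# when it directly precedes a bare tool name beginning with "azsdk_".
_LEGACY_PREFIX = re.compile(r"\bazure-sdk-mcp-(?=azsdk_)")


def strip_mcp_prefix(text: str) -> str:
    """Return text with every legacy prefix removed; bare names are unchanged."""
    return _LEGACY_PREFIX.sub("", text)


print(strip_mcp_prefix("azure-sdk-mcp-azsdk_typespec_check_project_in_public_repo"))
```

The lookahead keeps unrelated strings that merely contain `azure-sdk-mcp-` untouched.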

Verification

- `dotnet build` on `Azure.Sdk.Tools.Vally`: green.
- `vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml`: runs end-to-end against the MCP server and grades against `tool-calls` (trajectory captured under `vally-results/`).
- `scripts/Validate-EvalTools.ps1`: runs against a live MCP server and produces the expected coverage report.

Adds a new Vally eval suite under `tools/azsdk-cli/Azure.Sdk.Tools.Vally/` for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.
- `.vally.yaml` wires the `azsdk-mcp` environment (stdio `dotnet run` against Azure.Sdk.Tools.Cli) and defines `typespec` and `all` suites.
- `evals/check-public-repo.eval.yaml` is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes `azsdk_typespec_check_project_in_public_repo` for a public-repo check prompt. Lints clean via `vally lint --eval-spec`.
- `fixtures/.gitkeep` reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate
- validate-typespec
- typespec-generation-step02
- get-modified-typespec-projects (stub: needs git-repo fixture / setup hook)
- add-arm-resource (stub: needs fixtures + `npx tsp compile` post-check)
- create-release-plan
- link-namespace-approval-issue
- get-pr-link-current-branch
- check-sdk-generation-status

Each eval uses the built-in `tool-calls` grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (they require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to `.vally.yaml`. All 10 evals pass `vally lint --eval-spec`.
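
To make the gap between a presence check and an order assertion concrete, here is a small Python sketch over a trajectory represented as a list of tool names. It does not reflect how Vally's `tool-calls` grader is implemented; the function names and the second tool name in the example are hypothetical.

```python
from typing import Sequence


def all_present(calls: Sequence[str], expected: Sequence[str]) -> bool:
    """Presence check: every expected tool appears somewhere in the trajectory."""
    return set(expected) <= set(calls)


def in_order(calls: Sequence[str], expected: Sequence[str]) -> bool:
    """Order check: expected tools appear as a subsequence of the trajectory."""
    remaining = iter(calls)
    # `tool in remaining` consumes the iterator up to the first match, so each
    # later expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


# "azsdk_hypothetical_validate" is a made-up name used only for illustration.
expected = ["azsdk_typespec_check_project_in_public_repo", "azsdk_hypothetical_validate"]
trajectory = ["azsdk_hypothetical_validate", "azsdk_typespec_check_project_in_public_repo"]

print(all_present(trajectory, expected))
print(in_order(trajectory, expected))
```

A trajectory that calls both tools in the wrong order passes the presence check but fails the order check, which is the kind of case a presence-only grader cannot catch.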

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.

#15183

- Move 11 multi-step scenario evals to `evals/scenarios/`
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to `evals/triggers/`, stripped `azure-sdk-mcp-` prefix from graders to match bare MCP tool names
- Port `Validate-EvalTools.ps1` to `scripts/`, retargeted at `evals/triggers/` with bare-name regex
- Update `.vally.yaml` suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add `.gitignore` for `vally-results/` and `results/`

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.
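
The effect of moving feature area from folder to tag can be pictured as filtering on two independent attributes. The Python sketch below is an illustration under assumptions: the `EvalSpec` type, the `select` helper, the tag values, and the e2e path are hypothetical, and nothing here describes Vally's actual suite syntax.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class EvalSpec:
    path: str             # location of the eval YAML
    tier: str             # folder-level tier, e.g. "unit" or "e2e"
    tags: FrozenSet[str]  # feature areas carried as tags


def select(
    evals: Iterable[EvalSpec],
    tier: Optional[str] = None,
    tag: Optional[str] = None,
) -> List[str]:
    """Return paths of evals matching the given tier and/or tag."""
    return [
        e.path
        for e in evals
        if (tier is None or e.tier == tier) and (tag is None or tag in e.tags)
    ]


# Hypothetical inventory, for illustration only.
inventory = [
    EvalSpec("evals/unit/triggers-typespec.eval.yaml", "unit", frozenset({"typespec"})),
    EvalSpec("evals/unit/triggers-github.eval.yaml", "unit", frozenset({"github"})),
    EvalSpec("evals/e2e/rename-client-property.eval.yaml", "e2e", frozenset({"typespec"})),
]

print(select(inventory, tag="typespec"))
print(select(inventory, tier="unit"))
```

With this shape, a composite suite is just a tier filter and an area suite is just a tag filter, so the same eval can belong to both.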

…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.

helen229 added 3 commits June 1, 2026 10:22

Add rename-client-property stub eval to Vally suite (#15124)

26cc6ef

github-actions Bot added the azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli label Jun 1, 2026

This was referenced Jun 2, 2026

Wire vally eval CI job for Azure.Sdk.Tools.Vally tool-scenario evals #15829

Open

Wire workspace setup hooks + mock MCP environment for Vally tool-scenario evals #15831

Open

helen229 added 9 commits June 2, 2026 11:48

Fix tool name prefix in graders, timeout format, expand README

8e4f524

Merge branch 'main' into feat/vally-tool-scenarios-15124

c10063b

update the config and use gpt-5.4 model

02aee34

add disallowed

d1f212f

Merge branch 'feat/vally-tool-scenarios-15124' of https://github.com/…

66216b0

Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)

a88ae11

some docs and test e2e one

bb47139

helen229 wants to merge 12 commits into main from feat/vally-tool-scenarios-15124
