Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) #15811
Draft
helen229 wants to merge 12 commits into
Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).
- README documents project intent, layout, local run instructions, and how to add a new scenario.
- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
- fixtures/.gitkeep reserves the per-scenario fixtures layout.
Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.
Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:
- check-public-repo-then-validate
- validate-typespec
- typespec-generation-step02
- get-modified-typespec-projects (stub; needs git-repo fixture / setup hook)
- add-arm-resource (stub; needs fixtures + npx tsp compile post-check)
- create-release-plan
- link-namespace-approval-issue
- get-pr-link-current-branch
- check-sdk-generation-status
Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
This was referenced Jun 2, 2026
#15183
- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/
Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.
…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124
Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
Closes #15124.
Stands up `Azure.Sdk.Tools.Vally` as the home for MCP-tool scenario and trigger evals, ports the legacy `Azure.Sdk.Tools.Cli.Evaluations` benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR
New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`
- `.vally.yaml`: single `azsdk-mcp` environment that spawns `Azure.Sdk.Tools.Cli` via `dotnet run`; named suites for selective execution (`typespec`, `release-plan`, `github`, `pipeline`, `scenarios`, `triggers`, `all`).
- `.gitignore`: excludes local `vally-results/` and `results/`.
- `README.md`: explains how Vally evals relate to the per-skill evals under `.github/skills/`, lists scenario + trigger coverage, documents the run loop.

`evals/scenarios/`: 11 multi-step workflow evals (the #15124 port)

Ported from `Azure.Sdk.Tools.Cli.Evaluations` and reshaped for Vally's `tool-calls` grader:
- `check-public-repo`: … `azure-rest-api-specs`?
- `check-public-repo-then-validate`
- `validate-typespec`: … `tsp` linter/validation
- `typespec-generation-step02`
- `get-modified-typespec-projects`
- `add-arm-resource`: … `azsdk_typespec_generate_authoring_plan` for an ARM resource
- `create-release-plan`
- `link-namespace-approval-issue`
- `get-pr-link-current-branch`
- `check-sdk-generation-status`
- `rename-client-property`: … `expected-diff` grader (follow-up)

`evals/triggers/`: 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.
`apiview`, `config`, `engsys`, `github`, `package`, `pipeline`, `releaseplan`, `typespec`, `verify`: covering the bulk of the `azsdk_*` tool surface.

`scripts/Validate-EvalTools.ps1` (ported from #15183)

Drift detector. Runs `azsdk list --output json` and cross-checks:
- each tool referenced under `evals/triggers/` exists on the running MCP server (catches renames)
- some tools (`hello_world`, `upgrade`, codeowner helpers) are filtered out

What's not in this PR (deliberate)
- `AZSDKTOOLS_AGENT_TESTING` toggle: currently `false`. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second `azsdk-mcp-live` environment or a CI policy. Left for a follow-up.
- `rename-client-property` grader: still a stub awaiting a Vally `expected-diff` grader.
- `ci.yml` under `eng/pipelines/templates` is a follow-up.

Acknowledgements
Trigger evals + `Validate-EvalTools.ps1` ported from @jeo02's #15183 (`migrate-evaluations-to-vally`); prefixes stripped (`azure-sdk-mcp-azsdk_*` → `azsdk_*`) to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.

Verification
- `dotnet build` on `Azure.Sdk.Tools.Vally`: green.
- `vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml`: runs end-to-end against the MCP server and grades against `tool-calls` (trajectory captured under `vally-results/`).
- `scripts/Validate-EvalTools.ps1`: runs against a live MCP server and produces the expected coverage report.
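
For reviewers who want a mental model of the drift check, here is a minimal Python sketch of the cross-check described above. It is not the ported `Validate-EvalTools.ps1`: the function names, the bare-name regex, the report keys, and the "uncovered" half of the report are all assumptions made for illustration, and `azsdk_example_old_name` / `azsdk_example_helper` are hypothetical tool names. Only the `azure-sdk-mcp-` prefix and the two `azsdk_typespec_*` tool names come from this PR.

```python
import re

# Prefix that the #15183 graders carried; stripped so names match the bare
# MCP tool names emitted in trajectories.
LEGACY_PREFIX = "azure-sdk-mcp-"

# Assumed bare-name pattern; the regex used by the real script is not shown here.
TOOL_NAME = re.compile(r"\bazsdk_[a-z0-9_]+\b")


def strip_prefix(name):
    """Return the bare tool name, dropping the legacy prefix if present."""
    if name.startswith(LEGACY_PREFIX):
        return name[len(LEGACY_PREFIX):]
    return name


def referenced_tools(eval_text):
    """Collect every bare azsdk_* tool name mentioned in one eval file's text."""
    return set(TOOL_NAME.findall(eval_text))


def drift_report(server_tools, eval_tools, excluded=()):
    """Compare tools exposed by the server with tools referenced by evals.

    missing_on_server: referenced by an eval but absent on the server (a rename).
    uncovered: exposed by the server, not excluded, and referenced by no eval.
    """
    server = {strip_prefix(t) for t in server_tools}
    evals = {strip_prefix(t) for t in eval_tools}
    return {
        "missing_on_server": sorted(evals - server),
        "uncovered": sorted(server - evals - set(excluded)),
    }


if __name__ == "__main__":
    report = drift_report(
        server_tools=[
            "azsdk_typespec_check_project_in_public_repo",
            "azsdk_typespec_generate_authoring_plan",
            "azsdk_example_helper",  # hypothetical, intentionally excluded
        ],
        eval_tools=[
            "azure-sdk-mcp-azsdk_typespec_check_project_in_public_repo",
            "azsdk_example_old_name",  # hypothetical stale reference
        ],
        excluded={"azsdk_example_helper"},
    )
    print(report)
```

In this sketch a stale eval reference shows up under `missing_on_server`, while a server tool with no trigger eval shows up under `uncovered` unless it is on the exclusion list.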