Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) #15811

Draft
helen229 wants to merge 12 commits into main from feat/vally-tool-scenarios-15124

helen229 commented Jun 1, 2026

Closes #15124.

Stands up Azure.Sdk.Tools.Vally as the home for MCP-tool scenario and trigger evals, ports the legacy Azure.Sdk.Tools.Cli.Evaluations benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR

New project: tools/azsdk-cli/Azure.Sdk.Tools.Vally/

  • .vally.yaml — single azsdk-mcp environment that spawns Azure.Sdk.Tools.Cli via dotnet run; named suites for selective execution (typespec, release-plan, github, pipeline, scenarios, triggers, all).
  • .gitignore — excludes local vally-results/ and results/.
  • README.md — explains how Vally evals relate to the per-skill evals under .github/skills/, lists scenario + trigger coverage, documents the run loop.

evals/scenarios/ — 11 multi-step workflow evals (the #15124 port)

Ported from Azure.Sdk.Tools.Cli.Evaluations and reshaped for Vally's tool-calls grader:

| Scenario | Shape |
| --- | --- |
| check-public-repo | Single-tool: is a TypeSpec project published in azure-rest-api-specs? |
| check-public-repo-then-validate | Multi-tool, ordered: validate then check |
| validate-typespec | Single-tool: tsp linter/validation |
| typespec-generation-step02 | Step in the spec-PR generation flow |
| get-modified-typespec-projects | Git-aware tool against current branch |
| add-arm-resource | Calls azsdk_typespec_generate_authoring_plan for an ARM resource |
| create-release-plan | Single-tool: create a release-plan work item |
| link-namespace-approval-issue | Link an existing approval issue to a release plan |
| get-pr-link-current-branch | Resolve the PR for the active git branch |
| check-sdk-generation-status | Pipeline status lookup |
| rename-client-property | Stub: needs expected-diff grader (follow-up) |
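
For the "Multi-tool, ordered" shape, the intent is that the expected tools show up in the trajectory in a given order. The Python sketch below illustrates one way such an order check could work. It is not Vally's tool-calls grader (the commit notes in this PR say the built-in grader is used for presence checks only); `called_in_order` and the second tool name are hypothetical and exist only for illustration.

```python
from typing import Iterable, Sequence


def called_in_order(trajectory: Iterable[str], expected: Sequence[str]) -> bool:
    """Return True if `expected` occurs as an ordered subsequence of `trajectory`.

    Unrelated tool calls may be interleaved between the expected ones.
    """
    calls = iter(trajectory)
    # `name in calls` consumes the iterator up to the match, so each
    # lookup resumes where the previous one stopped.
    return all(name in calls for name in expected)


# The first name appears in this PR; the second is a placeholder.
CHECK = "azsdk_typespec_check_project_in_public_repo"
VALIDATE = "azsdk_example_validate"  # hypothetical name, illustration only

trajectory = [CHECK, "some_other_tool", VALIDATE]
print(called_in_order(trajectory, [CHECK, VALIDATE]))  # True
print(called_in_order(trajectory, [VALIDATE, CHECK]))  # False
```

A subsequence match (rather than strict adjacency) tolerates agents that make extra exploratory calls between the required steps.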

evals/triggers/ — 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.

apiview, config, engsys, github, package, pipeline, releaseplan, typespec, verify — covering the bulk of the azsdk_* tool surface.

scripts/Validate-EvalTools.ps1 (ported from #15183)

Drift detector. Runs azsdk list --output json and cross-checks:

  • every tool referenced in evals/triggers/ exists on the running MCP server (catches renames)
  • every server tool has at least one trigger eval (catches new tools landing without coverage)
  • known-excluded tools (examples, hello_world, upgrade, codeowner helpers) are filtered out
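
The three checks above reduce to set differences. The following Python sketch shows that logic under stated assumptions: the real script is PowerShell and reads the tool list from `azsdk list --output json`, whose JSON shape is not shown here, so the sketch takes plain sets as input. The regex, function names, and sample tool names are all hypothetical.

```python
import re
from typing import Iterable, Set, Tuple

# Assumed shape of a bare MCP tool name; the real script's regex may differ.
TOOL_NAME = re.compile(r"\bazsdk_[a-z0-9_]+\b")


def referenced_tools(eval_texts: Iterable[str]) -> Set[str]:
    """Collect every azsdk_* name mentioned in the given eval file contents."""
    found: Set[str] = set()
    for text in eval_texts:
        found.update(TOOL_NAME.findall(text))
    return found


def find_drift(
    server_tools: Set[str], eval_tools: Set[str], excluded: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (stale, uncovered).

    stale:     referenced by an eval but missing on the server (likely renamed)
    uncovered: exposed by the server, not excluded, and has no trigger eval
    """
    stale = eval_tools - server_tools
    uncovered = server_tools - eval_tools - excluded
    return stale, uncovered


# Illustrative data only; none of these are real tool names.
server_tools = {"azsdk_a", "azsdk_b", "azsdk_hello_world"}
eval_texts = ["tool: azsdk_a", "tool: azsdk_old_name"]
excluded = {"azsdk_hello_world"}

stale, uncovered = find_drift(server_tools, referenced_tools(eval_texts), excluded)
print(sorted(stale))      # ['azsdk_old_name']
print(sorted(uncovered))  # ['azsdk_b']
```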

What's not in this PR (deliberate)

  • AZSDKTOOLS_AGENT_TESTING toggle — currently false. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second azsdk-mcp-live environment or a CI policy. Left for a follow-up.
  • rename-client-property grader — still a stub awaiting a Vally expected-diff grader.
  • CI wiring — the project builds and runs locally; a ci.yml under eng/pipelines/templates is a follow-up.

Acknowledgements

Trigger evals + Validate-EvalTools.ps1 ported from @jeo02's #15183 (migrate-evaluations-to-vally); prefixes stripped (azure-sdk-mcp-azsdk_* → azsdk_*) to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.
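
As a rough illustration of the prefix change, the Python sketch below normalizes a prefixed grader name to the bare MCP tool name. The helper `to_bare_tool_name` is hypothetical; the actual port edited the YAML files directly.

```python
PREFIX = "azure-sdk-mcp-"


def to_bare_tool_name(name: str) -> str:
    """Strip the azure-sdk-mcp- prefix if present; leave bare names untouched."""
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


prefixed = "azure-sdk-mcp-azsdk_typespec_check_project_in_public_repo"
print(to_bare_tool_name(prefixed))
# azsdk_typespec_check_project_in_public_repo
```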

Verification

  • dotnet build on Azure.Sdk.Tools.Vally — green.
  • vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml — runs end-to-end against the MCP server and grades against tool-calls (trajectory captured under vally-results/).
  • scripts/Validate-EvalTools.ps1 — runs against a live MCP server and produces the expected coverage report.

helen229 added 3 commits June 1, 2026 10:22
Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.

- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.

- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.

- fixtures/.gitkeep reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate

- validate-typespec

- typespec-generation-step02

- get-modified-typespec-projects (stub — needs git-repo fixture / setup hook)

- add-arm-resource (stub — needs fixtures + npx tsp compile post-check)

- create-release-plan

- link-namespace-approval-issue

- get-pr-link-current-branch

- check-sdk-generation-status

Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
helen229 added 9 commits June 2, 2026 11:48
#15183

- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
Labels

azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli

Development

Successfully merging this pull request may close these issues.

Migrate Benchmarks + Tool invocation from evaluate
