Draft
23 commits
a0f9233
Scaffold Azure.Sdk.Tools.Vally tool-scenario eval suite (#15124)
helen229 May 27, 2026
701b7f8
Port remaining 9 benchmark scenarios to Vally (#15124)
helen229 May 27, 2026
26cc6ef
Add rename-client-property stub eval to Vally suite (#15124)
helen229 Jun 1, 2026
8e4f524
Fix tool name prefix in graders, timeout format, expand README
helen229 Jun 2, 2026
d9ea3e4
Reorganize evals into scenarios/ and triggers/; port trigger evals fr…
helen229 Jun 2, 2026
c10063b
Merge branch 'main' into feat/vally-tool-scenarios-15124
helen229 Jun 2, 2026
02aee34
update the config and use gpt-5.4 model
helen229 Jun 2, 2026
d1f212f
add disallowed
helen229 Jun 2, 2026
fd4eaf8
Vally: restructure evals into unit/integration/e2e test pyramid
helen229 Jun 2, 2026
66216b0
Merge branch 'feat/vally-tool-scenarios-15124' of https://github.com/…
helen229 Jun 2, 2026
a88ae11
Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)
helen229 Jun 2, 2026
bb47139
some docs and test e2e one
helen229 Jun 3, 2026
4d89bac
update docs
helen229 Jun 3, 2026
f6f5c80
udpate design
helen229 Jun 3, 2026
3a8d609
update with skill evals
helen229 Jun 3, 2026
b7005b2
reorg based on the design
helen229 Jun 3, 2026
6db7c5f
remove the duplicates
helen229 Jun 3, 2026
b77dccb
add new scenarios
helen229 Jun 4, 2026
1264e9a
update the doc
helen229 Jun 4, 2026
aa714ab
update doc
helen229 Jun 4, 2026
f26cf1f
Merge remote-tracking branch 'origin/main' into feat/vally-tool-scena…
helen229 Jun 4, 2026
fda9ef9
update names
helen229 Jun 4, 2026
5b4fb6e
Vally: align release-planner mock stimuli with live e2e pattern
helen229 Jun 4, 2026
2 changes: 2 additions & 0 deletions tools/azsdk-cli/Azure.Sdk.Tools.Vally/.gitignore
@@ -0,0 +1,2 @@
vally-results/
results/
99 changes: 99 additions & 0 deletions tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml
@@ -0,0 +1,99 @@
# Vally configuration for Azure SDK Tools MCP tool / scenario evaluations.
# See: https://vally.dev/reference/vally-config
#
# These are scenario evals (does the agent invoke the right MCP tool(s) for a
# given prompt?) and are intentionally separate from the per-skill evals under
# .github/skills/. See README.md for context.

paths:
  evals: [evals/]
  evalFilenames: ["*.eval.yaml"]
  results: results/

environments:
  # Default for unit + mock scenarios. Runs the dedicated Azure.Sdk.Tools.Mock
  # MCP server, a separate process whose tool surface mirrors the real CLI
  # but with deterministic in-memory responses.
  #
  # Relative `--project` paths are resolved by `dotnet` against the cwd of
  # the vally invocation. Always run vally from this directory:
  #   cd tools/azsdk-cli/Azure.Sdk.Tools.Vally && vally eval ...
  # Same convention as .github/skills/.vally.yaml.
  azsdk-mcp-mock:
    mcpServers:
      azure-sdk-mcp:
        type: stdio
        command: dotnet
        args: ["run", "--project", "../Azure.Sdk.Tools.Mock"]
        timeout: "60s"

  # Live MCP: real Azure.Sdk.Tools.Cli against real DevOps (test area path),
  # real GitHub, real pipelines. AZSDKTOOLS_AGENT_TESTING=true keeps the
  # handful of write tools (e.g. create_release_plan) inside the test area.
  # Bound only by scenarios under evals/workflow-scenarios/live/ and selected
  # by the `scenarios-live` / `nightly` suites.
  azsdk-mcp-live:
    mcpServers:
      azure-sdk-mcp:
        type: stdio
        command: dotnet
        args: ["run", "--project", "../Azure.Sdk.Tools.Cli", "--", "start"]
        timeout: "5m"
        env:
          AZSDKTOOLS_AGENT_TESTING: "true"
          AZSDKTOOLS_COLLECT_TELEMETRY: "false"

# Suites group evals for selective execution.
#
# Layout maps directly to suites; there is no tag-based mock/live filtering.
# Vally's suite filter is positive-match only (AND across keys, OR within
# values), so subfolders are the cleanest way to split mock vs live. See the
# suite-filter source in https://github.com/microsoft/vally.
suites:
  # ---- by tier ----
  unit:
    description: |
      Hermetic single-tool / trigger evals. No external I/O. Fast; the
      foundation of the PR gate.
    evals: ["evals/tools/*.eval.yaml"]

  scenarios-mock:
    description: |
      Multi-tool scenarios against the mock MCP environment. Hermetic; safe
      for PR gate.
    evals: ["evals/workflow-scenarios/mock/*.eval.yaml"]

  scenarios-live:
    description: |
      Scenarios against live MCP (real DevOps / GitHub / pipelines). Slow;
      nightly only. Prime any required clones first via
      `evals/setup/ensure-specs-clone.ps1`.
    evals: ["evals/workflow-scenarios/live/*.eval.yaml"]

  # ---- composite suites ----
  pr-gate:
    description: Hermetic tiers only (unit + scenarios-mock). Target for CI PR check.
    evals:
      - "evals/tools/*.eval.yaml"
      - "evals/workflow-scenarios/mock/*.eval.yaml"
  nightly:
    description: All tiers including live scenarios.
    evals: ["evals/**/*.eval.yaml"]

  # ---- by feature area (tag-filtered) ----
  release-plan:
    description: All evals tagged area=release-plan.
    filter: { area: release-plan }
    evals: ["evals/**/*.eval.yaml"]
  typespec:
    description: All evals tagged area=typespec.
    filter: { area: typespec }
    evals: ["evals/**/*.eval.yaml"]
  pipeline:
    description: All evals tagged area=pipeline.
    filter: { area: pipeline }
    evals: ["evals/**/*.eval.yaml"]
  github:
    description: All evals tagged area=github.
    filter: { area: github }
    evals: ["evals/**/*.eval.yaml"]
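The suite-filter rule described in the `suites` comment above (positive-match only, AND across keys, OR within values) can be sketched in Python. This is an illustration of the stated rule, not Vally's actual implementation; `matches_filter` and the list-valued filter in the example are assumptions made for the sketch.

```python
from typing import Mapping, Sequence, Union

TagValue = Union[str, Sequence[str]]


def matches_filter(tags: Mapping[str, str], suite_filter: Mapping[str, TagValue]) -> bool:
    """AND across filter keys, OR within each key's allowed values."""
    for key, allowed in suite_filter.items():
        # Normalize a scalar filter value to a one-element list.
        allowed_values = [allowed] if isinstance(allowed, str) else list(allowed)
        if tags.get(key) not in allowed_values:
            return False  # one failing key rejects the eval (AND)
    return True


# Tags copied from two of the eval files in this change.
evals = {
    "check-public-repo": {"tier": "unit", "area": "typespec"},
    "check-sdk-generation-status": {"tier": "unit", "area": "pipeline"},
}

# Hypothetical filter: area is typespec OR github.
selected = [name for name, tags in evals.items()
            if matches_filter(tags, {"area": ["typespec", "github"]})]
print(selected)  # ['check-public-repo']
```

Under this reading a filter has no way to say "everything except live", which is consistent with the config splitting mock and live scenarios by directory instead of by tag.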
274 changes: 274 additions & 0 deletions tools/azsdk-cli/Azure.Sdk.Tools.Vally/README.md

Large diffs are not rendered by default.

@@ -0,0 +1,70 @@
<#
.SYNOPSIS
    Ensures a per-user shallow+sparse cache clone of Azure/azure-rest-api-specs
    exists and is reasonably fresh.

.DESCRIPTION
    Run this before invoking the live scenario suite
    (vally eval --suite scenarios-live).
    Maintains a cache clone that Vally's `environment.git.source` points at,
    so individual eval YAMLs don't need a pre-existing checkout.

    - First run: shallow + blobless + cone-sparse clone (only
      specification/contosowidgetmanager/ to keep size minimal).
    - Subsequent runs within -MaxAgeHours: no-op.
    - Subsequent runs past -MaxAgeHours: `git fetch --depth 1 origin main` and
      hard-reset the checkout to `origin/main`.

    Cache lives at:
      Windows: $env:USERPROFILE\.vally-cache\azure-rest-api-specs
      *nix:    $HOME/.vally-cache/azure-rest-api-specs

.PARAMETER MaxAgeHours
    Skip the `git fetch` if the cache was last refreshed within this many
    hours. Default: 24.

.PARAMETER SparseCheckoutPaths
    Cone-sparse paths to include. Default: specification/contosowidgetmanager.
    Pass @() to disable sparse-checkout (full tree).
#>
[CmdletBinding()]
param(
    [int] $MaxAgeHours = 24,
    [string[]] $SparseCheckoutPaths = @('specification/contosowidgetmanager')
)

$ErrorActionPreference = 'Stop'
Set-StrictMode -Version 4

# Native commands do not honor $ErrorActionPreference, so check git's exit
# code explicitly; otherwise a failed fetch would still refresh the stamp.
function Assert-GitSucceeded([string] $Step) {
    if ($LASTEXITCODE -ne 0) {
        throw "[ensure-specs-clone] $Step failed (exit code $LASTEXITCODE)"
    }
}

$cacheRoot = if ($env:USERPROFILE) { Join-Path $env:USERPROFILE '.vally-cache' } else { Join-Path $HOME '.vally-cache' }
$cache = Join-Path $cacheRoot 'azure-rest-api-specs'
$stamp = Join-Path $cache '.vally-last-fetch'

if (-not (Test-Path (Join-Path $cache '.git'))) {
    Write-Host "[ensure-specs-clone] Cloning azure-rest-api-specs into cache: $cache"
    New-Item -ItemType Directory -Force -Path $cacheRoot | Out-Null
    git clone --depth 1 --filter=blob:none --no-checkout `
        https://github.com/Azure/azure-rest-api-specs.git $cache | Out-Null
    Assert-GitSucceeded 'git clone'
    if ($SparseCheckoutPaths.Count -gt 0) {
        git -C $cache sparse-checkout init --cone | Out-Null
        Assert-GitSucceeded 'git sparse-checkout init'
        git -C $cache sparse-checkout set @SparseCheckoutPaths | Out-Null
        Assert-GitSucceeded 'git sparse-checkout set'
    }
    git -C $cache checkout main | Out-Null
    Assert-GitSucceeded 'git checkout'
    Set-Content -Path $stamp -Value (Get-Date -Format o)
} else {
    # A missing stamp file is treated as stale.
    $isStale = $true
    if (Test-Path $stamp) {
        $age = (Get-Date) - (Get-Item $stamp).LastWriteTime
        $isStale = $age.TotalHours -gt $MaxAgeHours
    }
    if ($isStale) {
        Write-Host "[ensure-specs-clone] Refreshing cache (>$MaxAgeHours h old): $cache"
        git -C $cache fetch --depth 1 origin main | Out-Null
        Assert-GitSucceeded 'git fetch'
        git -C $cache reset --hard origin/main | Out-Null
        Assert-GitSucceeded 'git reset'
        Set-Content -Path $stamp -Value (Get-Date -Format o)
    } else {
        Write-Host "[ensure-specs-clone] Cache is fresh (<$MaxAgeHours h): $cache"
    }
}

# Echo the cache path so the wrapper can capture it.
Write-Output $cache
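The clone / no-op / refresh decision made by the script above can be restated as a small pure function, which makes the boundary cases easier to see. This is a Python sketch for illustration only; `CacheAction` and `decide_cache_action` are hypothetical names and are not part of the PR.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class CacheAction(Enum):
    CLONE = "clone"      # no .git directory yet: do the initial sparse clone
    REFRESH = "refresh"  # cache exists but is stale: fetch and reset
    NOOP = "noop"        # cache exists and is fresh: nothing to do


def decide_cache_action(has_git_dir: bool,
                        last_fetch: Optional[datetime],
                        now: datetime,
                        max_age_hours: int = 24) -> CacheAction:
    """Mirror the script's branching: clone first, then compare stamp age."""
    if not has_git_dir:
        return CacheAction.CLONE
    if last_fetch is None:
        # A missing stamp file is treated as stale, as in the script.
        return CacheAction.REFRESH
    age = now - last_fetch
    # Strictly greater than, matching `$age.TotalHours -gt $MaxAgeHours`.
    if age > timedelta(hours=max_age_hours):
        return CacheAction.REFRESH
    return CacheAction.NOOP


now = datetime(2026, 6, 4, 12, 0)
print(decide_cache_action(True, now - timedelta(hours=25), now))  # CacheAction.REFRESH
```

A cache whose stamp is exactly `max_age_hours` old still counts as fresh, because the comparison is strict.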
@@ -0,0 +1,44 @@
name: azsdk-mcp-tool-scenarios
description: |
  Add-arm-resource: end-to-end scenario for authoring a new ARM resource
  via TypeSpec. This is a complex, file-producing scenario (not a single
  tool-call check) that needs a real fixture + tsp compile verification.
version: "1.0"
type: capability


tags:
  tier: unit
  area: typespec

environment: azsdk-mcp-mock

config:
  runs: 1
  timeout: 30m
  model: gpt-5.4
  executor: copilot-sdk

stimuli:
  - name: add-arm-resource
    prompt: |
      In the specification/widget/resource-manager/Microsoft.Widget/Widget project,
      add an ARM resource named 'Asset' with CRUD operations.
    constraints:
      max_turns: 20
      max_tokens: 50000
    # TODO: seed a fixture (environment.files or git) for the Microsoft.Widget
    # project, add `file-exists` + `file-contains` graders on the produced
    # asset.tsp, and a `run-command` grader to verify `npx tsp compile`.
    graders:
      - type: tool-calls
        config:
          required:
            - edit
            - azsdk_typespec_generate_authoring_plan

scoring:
  weights:
    tool-calls: 1
  threshold: 1.0
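One plausible reading of the `scoring` block above is a weighted average of per-grader scores compared against `threshold`. The Python sketch below illustrates that reading only; how Vally actually aggregates scores is an assumption here, and `weighted_score`, `passes`, and the `file-exists` entry are hypothetical.

```python
from typing import Mapping


def weighted_score(grader_scores: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Weighted average of grader scores; a grader with no score counts as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(weight * grader_scores.get(name, 0.0)
                       for name, weight in weights.items())
    return weighted_sum / total_weight


def passes(grader_scores: Mapping[str, float],
           weights: Mapping[str, float],
           threshold: float) -> bool:
    """An eval passes when its aggregate score reaches the threshold."""
    return weighted_score(grader_scores, weights) >= threshold


# With the single tool-calls grader from the eval above, only a perfect
# tool-calls score reaches the threshold of 1.0.
print(passes({"tool-calls": 1.0}, {"tool-calls": 1}, 1.0))  # True

# Hypothetical: if a file-exists grader were added with equal weight and
# failed, the aggregate would drop to 0.5 and the eval would fail.
print(passes({"tool-calls": 1.0, "file-exists": 0.0},
             {"tool-calls": 1, "file-exists": 1}, 1.0))  # False
```

Under this reading, a threshold of 1.0 makes every weighted grader mandatory, so the weights only matter once the threshold is lowered.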

@@ -0,0 +1,43 @@
name: azsdk-mcp-tool-scenarios
description: |
  Tool-scenario evaluation suite for the azsdk MCP server. Verifies the
  agent invokes the right MCP tools for given prompts, independent of any
  specific skill.
version: "1.0"
type: capability


tags:
  tier: unit
  area: typespec

environment: azsdk-mcp-mock

config:
  runs: 1
  timeout: 30m
  model: gpt-5.4
  executor: copilot-sdk

stimuli:
  - name: check-public-repo
    prompt: |
      Check if my TypeSpec project is in the public repo.
      My setup has already been verified, do not run azsdk_verify_setup.
      Project root: specification/contosowidgetmanager/Contoso.WidgetManager.
    constraints:
      max_turns: 5
      max_tokens: 5000
    graders:
      - type: tool-calls
        config:
          required:
            - azsdk_typespec_check_project_in_public_repo
          disallowed:
            - azsdk_verify_setup

scoring:
  weights:
    tool-calls: 1
  threshold: 1.0
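The `required` / `disallowed` lists in the `tool-calls` grader above suggest a simple set check over the tools the agent actually called. The following Python sketch shows that check under stated assumptions: the grader's real behavior is not shown in this diff, and `grade_tool_calls` and `ToolCallVerdict` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class ToolCallVerdict(NamedTuple):
    passed: bool
    missing: List[str]    # required tools that were never called
    forbidden: List[str]  # disallowed tools that were called


def grade_tool_calls(called: Iterable[str],
                     required: Iterable[str] = (),
                     disallowed: Iterable[str] = ()) -> ToolCallVerdict:
    """Pass only if every required tool was called and no disallowed tool was."""
    called_set = set(called)
    missing = [tool for tool in required if tool not in called_set]
    forbidden = [tool for tool in disallowed if tool in called_set]
    return ToolCallVerdict(not missing and not forbidden, missing, forbidden)


verdict = grade_tool_calls(
    called=["azsdk_verify_setup", "azsdk_typespec_check_project_in_public_repo"],
    required=["azsdk_typespec_check_project_in_public_repo"],
    disallowed=["azsdk_verify_setup"],
)
print(verdict.passed, verdict.forbidden)  # False ['azsdk_verify_setup']
```

In this sketch the agent calls the right tool but still fails, because it ignored the prompt's instruction not to run `azsdk_verify_setup`.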

@@ -0,0 +1,42 @@
name: azsdk-mcp-tool-scenarios
description: |
  Check-sdk-generation-status: the agent should call azsdk_get_pipeline_status
  to check the SDK generation pipeline status.
version: "1.0"
type: capability


tags:
  tier: unit
  area: pipeline

environment: azsdk-mcp-mock

config:
  runs: 1
  timeout: 30m
  model: gpt-5.4
  executor: copilot-sdk

stimuli:
  - name: check-sdk-generation-status
    prompt: |
      Check the SDK generation pipeline status for build ID 5513110.
      My setup has already been verified, do not run azsdk_verify_setup.
    constraints:
      max_turns: 5
      max_tokens: 5000
    # TODO: assert buildId=5513110; blocked on
    # https://github.com/Azure/azure-sdk-tools/issues/15833 (Vally tool-calls
    # grader needs generic args matcher).
    graders:
      - type: tool-calls
        config:
          required:
            - azsdk_get_pipeline_status
          disallowed:
            - azsdk_verify_setup

scoring:
  weights:
    tool-calls: 1
  threshold: 1.0
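The TODO in the eval above asks for an argument assertion (`buildId=5513110`) that the `tool-calls` grader does not yet support. As a rough illustration of what a generic args matcher could check, here is a Python sketch; the recorded-call shape (`name` / `args` dictionaries) and the function `has_call_with_args` are assumptions, not an existing Vally feature.

```python
from typing import Any, Iterable, Mapping


def has_call_with_args(calls: Iterable[Mapping[str, Any]],
                       tool: str,
                       expected_args: Mapping[str, Any]) -> bool:
    """True if some recorded call to `tool` carries every expected argument."""
    for call in calls:
        if call.get("name") != tool:
            continue
        args = call.get("args", {})
        # Compare as strings so 5513110 and "5513110" are treated alike;
        # extra arguments on the call are ignored (subset match).
        if all(key in args and str(args[key]) == str(value)
               for key, value in expected_args.items()):
            return True
    return False


# Hypothetical transcript of recorded tool calls.
calls = [
    {"name": "azsdk_get_pipeline_status", "args": {"buildId": "5513110"}},
]
print(has_call_with_args(calls, "azsdk_get_pipeline_status", {"buildId": 5513110}))  # True
```

A subset match keeps the assertion stable if the tool later gains optional parameters, at the cost of not catching unexpected extra arguments.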
