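Context
Follow-up to PR #15811 and issue #15829. The 11 ported Vally evals under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ currently:
Run in the Vally project's own cwd, so tools that read spec files or run git (e.g. azsdk_run_typespec_validation, azsdk_typespec_check_project_in_public_repo, azsdk_get_modified_typespec_projects, azsdk_typespec_generate_authoring_plan) operate on a workspace that does not contain azure-rest-api-specs content.
Always target the live MCP server (Azure.Sdk.Tools.Cli). Scenarios with destructive side effects (create-release-plan, link-namespace-approval-issue, and future release-sdk / link-sdk-pr evals) would create real ADO work items and GitHub links on every nightly run if CI (Wire vally eval CI job for Azure.Sdk.Tools.Vally tool-scenario evals #15829) ships as-is.
Proposed approach (Option C)
1. Two MCP environments in .vally.yaml
Per-eval opt-in via environment:. Initial split:
validate-typespec: ... tsp output to guide the agent loop
check-public-repo
check-public-repo-then-validate
typespec-generation-step02
get-modified-typespec-projects: ... git state
add-arm-resource
rename-client-property
get-pr-link-current-branch
check-sdk-generation-status
create-release-plan
link-namespace-approval-issue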
2. Per-eval setup: hook for repo prep
Shared helper script fixtures/setup-specs.ps1 that idempotently sparse-clones Azure/azure-rest-api-specs at a pinned SHA ($env:SPECS_SHA, defaulting to main) into the workspace and runs git sparse-checkout set for the paths the eval references. Each eval invokes the script from its setup: block.
For future SDK-generation scenarios, the same pattern extends to the language repos (azure-sdk-for-net, -python, -js, -java, -go) via a parallel fixtures/setup-sdk-repo.ps1.
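To make the sparse-clone step concrete, here is a minimal sketch of the git invocations such a helper might issue. It is written in Python for illustration only (the proposal names a PowerShell script), and the function name, the target directory, and the example spec path are assumptions, not part of the proposal.

```python
import shlex
from pathlib import Path


def sparse_clone_commands(repo_url: str, target: Path, ref: str, paths: list[str]) -> list[list[str]]:
    """Return the git invocations for a blob-less sparse clone checked out at `ref`."""
    git_in_target = ["git", "-C", str(target)]
    return [
        # Clone without file contents and without populating a working tree yet.
        ["git", "clone", "--filter=blob:none", "--no-checkout", "--sparse", repo_url, str(target)],
        # Restrict the working tree to the directories the eval references.
        git_in_target + ["sparse-checkout", "set", *paths],
        # Materialize the requested commit or branch.
        git_in_target + ["checkout", ref],
    ]


# Hypothetical inputs: the spec path and target directory are placeholders.
plan = sparse_clone_commands(
    "https://github.com/Azure/azure-rest-api-specs.git",
    Path("specs"),
    "main",
    ["specification/contosowidgetmanager"],
)
for command in plan:
    print(shlex.join(command))
```

The sketch only builds and prints the commands, so it can be reviewed without network access; a real helper would execute each one and stop on the first failure.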
3. SHA pinning for reproducibility
Default to a known-good azure-rest-api-specs SHA stored in fixtures/specs.lock.
Nightly CI overrides via $env:SPECS_SHA=main to detect upstream regressions; PR/manual runs use the pinned SHA for stable diffing.
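The override rule above can be sketched as a small resolution function. This is an illustration in Python under stated assumptions: the format of fixtures/specs.lock is not specified here, so the sketch assumes `#` comment lines followed by a single SHA line, and the SHA shown is a placeholder.

```python
from typing import Mapping, Optional


def read_locked_sha(lock_text: str) -> Optional[str]:
    """Return the first non-blank, non-comment line of the lock file text."""
    for raw_line in lock_text.splitlines():
        line = raw_line.strip()
        if line and not line.startswith("#"):
            return line
    return None


def resolve_specs_ref(env: Mapping[str, str], lock_text: str) -> str:
    """Prefer a SPECS_SHA override; otherwise fall back to the pinned SHA."""
    override = env.get("SPECS_SHA", "").strip()
    if override:
        return override
    locked = read_locked_sha(lock_text)
    if locked is None:
        raise ValueError("no pinned SHA in the lock file and SPECS_SHA is not set")
    return locked


# Assumed lock file layout with a placeholder SHA.
EXAMPLE_LOCK = """\
# Refresh: replace the line below with a newer known-good commit.
0123456789abcdef0123456789abcdef01234567
"""

print(resolve_specs_ref({}, EXAMPLE_LOCK))                     # PR/manual run: pinned SHA
print(resolve_specs_ref({"SPECS_SHA": "main"}, EXAMPLE_LOCK))  # nightly run: override
```

Passing the environment in as a mapping keeps the function easy to test; a script would pass its real process environment instead.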
4. Cache the clone within a CI job
One sparse clone per job, reused across all evals (~30s × 11 saved). In GitHub Actions, the clone lives in $RUNNER_TEMP/specs and the setup script no-ops if the target dir already exists.
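The reuse behaviour described above amounts to an existence check in front of the clone. The following Python sketch shows that shape; the function names are hypothetical, and the fallback to the system temp directory when RUNNER_TEMP is unset is an assumption for local runs.

```python
import tempfile
from pathlib import Path
from typing import Callable, Mapping


def cache_dir(env: Mapping[str, str]) -> Path:
    """Place the clone under RUNNER_TEMP when set, else under the system temp dir."""
    return Path(env.get("RUNNER_TEMP") or tempfile.gettempdir()) / "specs"


def ensure_clone(target: Path, clone: Callable[[Path], None]) -> bool:
    """Run `clone` only when `target` is absent; return True if it ran."""
    if target.exists():
        return False  # An earlier eval in this job already prepared the clone.
    clone(target)
    return True


# Demonstration with a stand-in clone function that only creates the directory.
with tempfile.TemporaryDirectory() as runner_temp:
    target = cache_dir({"RUNNER_TEMP": runner_temp})
    first = ensure_clone(target, lambda path: path.mkdir(parents=True))
    second = ensure_clone(target, lambda path: path.mkdir(parents=True))
    print(first, second)
```

An existence check alone does not notice that the requested SHA changed between runs; a stricter variant might also compare the checked-out commit with the requested one, though the text above only calls for the existence check.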
Pros / cons considered
| Aspect | Option A (live only) | Option C (mock + live) |
| --- | --- | --- |
| Config | 1 environment | 2 environments |
| Side effects | Real on every run | Only on live-tagged evals |
| CI secrets | Full (ADO + GH) | Reduced |
| Schema-drift risk | None | Real: mock must track real tool signatures |
| Faithfulness to prod | Highest | Mixed |
| Right when... | Mock unmaintained, no destructive tools | Nightly CI + destructive tools exist (our case) |
C is preferred because (a) nightly CI is in scope (#15829), (b) destructive evals already exist in the suite, and (c) Azure.Sdk.Tools.Mock is already merged and intended for exactly this.
Acceptance criteria
.vally.yaml declares both azsdk-mcp-live and azsdk-mcp-mock environments.
fixtures/setup-specs.ps1 exists, is idempotent, and honors $env:SPECS_SHA.
fixtures/specs.lock is checked in with a pinned SHA and a comment on how to refresh it.
Every eval declares its environment: and (if needed) a setup: block.
create-release-plan and link-namespace-approval-issue run against the mock (verified by inspecting the trajectory: tool calls present, no real ADO work item created).
validate-typespec runs against the live server with a workspace that contains real spec files (verified by the tool returning non-error output).
README updated to document the live-vs-mock decision matrix and the setup-hook contract.
CI (Wire vally eval CI job for Azure.Sdk.Tools.Vally tool-scenario evals #15829) consumes the new structure without regressing scenario count.
Out of scope
Schema-parity tests between Azure.Sdk.Tools.Cli tool responses and Azure.Sdk.Tools.Mock handler responses: a separate concern; if needed, file a follow-up against the mock project.
Upstream vally-cli grader additions (forbidden, argument-matching): captured in the README follow-ups list.
References
Wire vally eval CI job for Azure.Sdk.Tools.Vally tool-scenario evals #15829
tools/azsdk-cli/Azure.Sdk.Tools.Mock/