Wire workspace setup hooks + mock MCP environment for Vally tool-scenario evals #15831

@helen229

Context

Follow-up to PR #15811 and issue #15829. The 11 ported Vally evals under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ currently:

  1. Run in the Vally project's own cwd, so tools that read spec files / run git (e.g. azsdk_run_typespec_validation, azsdk_typespec_check_project_in_public_repo, azsdk_get_modified_typespec_projects, azsdk_typespec_generate_authoring_plan) operate on a workspace that does not contain azure-rest-api-specs content.
  2. Always target the live MCP server (Azure.Sdk.Tools.Cli). Scenarios with destructive side effects (create-release-plan, link-namespace-approval-issue, and future release-sdk / link-sdk-pr evals) would create real ADO work items / GitHub links on every nightly run if the CI job (#15829) ships as-is.

Proposed approach (Option C)

1. Two MCP environments in .vally.yaml

environments:
  azsdk-mcp-live:
    mcpServers:
      azure-sdk-mcp:
        command: dotnet
        args: ["run", "--project", "../Azure.Sdk.Tools.Cli", "--", "start"]
        timeout: "5m"
  azsdk-mcp-mock:
    mcpServers:
      azure-sdk-mcp:
        command: dotnet
        args: ["run", "--project", "../Azure.Sdk.Tools.Mock", "--", "start"]
        timeout: "5m"

Each eval opts in to one of these through its environment: key. Initial split:

| Eval | Env | Reason |
| --- | --- | --- |
| validate-typespec | live | needs real tsp output to guide the agent loop |
| check-public-repo | live | hits real GitHub |
| check-public-repo-then-validate | live | same |
| typespec-generation-step02 | live | reads workspace |
| get-modified-typespec-projects | live | needs real git state |
| add-arm-resource | live | reads spec files |
| rename-client-property | live | reads + writes spec files |
| get-pr-link-current-branch | live | reads real branch |
| check-sdk-generation-status | mock | tool-call grader only; mock is fine |
| create-release-plan | mock | destructive on live |
| link-namespace-approval-issue | mock | destructive on live |

2. Per-eval setup: hook for repo prep

A shared helper script, fixtures/setup-specs.ps1, idempotently sparse-clones Azure/azure-rest-api-specs into the workspace at a pinned SHA ($env:SPECS_SHA when set, otherwise the SHA stored in fixtures/specs.lock, as described in section 3) and runs git sparse-checkout set for the paths the eval references.

Per-eval invocation:

setup:
  - run: pwsh fixtures/setup-specs.ps1 specification/contosowidgetmanager/Contoso.WidgetManager

For future SDK-generation scenarios, the same pattern extends to language repos (azure-sdk-for-net, -python, -js, -java, -go) via a parallel fixtures/setup-sdk-repo.ps1.
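
The clone step described above could be sketched roughly as follows. This is a minimal sketch, not the actual fixtures/setup-specs.ps1: the function name Initialize-SpecsCheckout, its parameters, and the choice of --depth 1 with --filter=blob:none are all assumptions made for illustration.

```powershell
# Hypothetical sketch of the sparse-clone step; names and flags are assumptions.
function Initialize-SpecsCheckout {
    param(
        [Parameter(Mandatory = $true)] [string] $Target,        # directory to clone into
        [Parameter(Mandatory = $true)] [string] $Ref,           # SHA or branch to check out
        [Parameter(Mandatory = $true)] [string[]] $SparsePaths, # paths the eval references
        [string] $RepoUrl = 'https://github.com/Azure/azure-rest-api-specs.git'
    )

    # Idempotence: clone only when the target is not already a git checkout.
    if (-not (Test-Path -LiteralPath (Join-Path $Target '.git'))) {
        git clone --depth 1 --filter=blob:none --no-checkout --sparse $RepoUrl $Target
        if ($LASTEXITCODE -ne 0) { throw "git clone failed for $RepoUrl" }
    }

    # Limit the working tree to the paths this eval needs.
    git -C $Target sparse-checkout set @SparsePaths
    if ($LASTEXITCODE -ne 0) { throw "git sparse-checkout set failed" }

    # Fetch only the requested ref and check it out detached, so that
    # re-running with the same ref leaves the workspace unchanged.
    git -C $Target fetch --depth 1 origin $Ref
    if ($LASTEXITCODE -ne 0) { throw "git fetch failed for ref $Ref" }

    git -C $Target checkout --detach FETCH_HEAD
    if ($LASTEXITCODE -ne 0) { throw "git checkout failed for ref $Ref" }
}
```

A script built around this sketch would pass its positional arguments through as -SparsePaths, matching the per-eval invocation shown above. Fetching an arbitrary SHA relies on the remote allowing it, which GitHub does; other hosts may not.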

3. SHA pinning for reproducibility

  • Default to a known-good azure-rest-api-specs SHA stored in fixtures/specs.lock.
  • Nightly CI overrides via $env:SPECS_SHA=main to detect upstream regressions; PR/manual runs use the pinned SHA for stable diffing.
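
The override-then-lock-file resolution above might look like the sketch below. The helper name Resolve-SpecsSha and the lock-file format (one SHA per file, with '#' comment lines) are assumptions; the issue does not specify either.

```powershell
# Hypothetical helper; the specs.lock format assumed here is illustrative only.
function Resolve-SpecsSha {
    param(
        [Parameter(Mandatory = $true)] [string] $LockFile
    )

    # An explicit override (for example SPECS_SHA=main in nightly CI) wins.
    if (-not [string]::IsNullOrWhiteSpace($env:SPECS_SHA)) {
        return $env:SPECS_SHA.Trim()
    }

    # Otherwise use the first non-blank, non-comment line of the lock file.
    $pinned = Get-Content -LiteralPath $LockFile |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ -and -not $_.StartsWith('#') } |
        Select-Object -First 1

    if (-not $pinned) {
        throw "No pinned SHA found in $LockFile"
    }
    return $pinned
}
```

Keeping the refresh instructions as comment lines inside the lock file would let the pinned value and its documentation live in one place, under the format assumed here.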

4. Cache the clone within a CI job

One sparse clone per job, reused across all 11 evals (saving roughly 30s for each clone avoided). In GitHub Actions, the clone lives in $RUNNER_TEMP/specs, and the setup script no-ops if the target directory already exists.
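
A small sketch of the cache check, under stated assumptions: the helper names Get-SpecsCacheDir and Test-SpecsCacheReady are hypothetical, and the fallback to the system temp directory for local runs is not something the issue specifies.

```powershell
# Hypothetical helpers for locating and reusing the per-job clone.
function Get-SpecsCacheDir {
    # RUNNER_TEMP is set by GitHub Actions; the local fallback is an assumption.
    $root = if ($env:RUNNER_TEMP) { $env:RUNNER_TEMP } else { [System.IO.Path]::GetTempPath() }
    return Join-Path $root 'specs'
}

function Test-SpecsCacheReady {
    param(
        [Parameter(Mandatory = $true)] [string] $CacheDir
    )
    # The clone is treated as reusable as soon as the directory exists,
    # mirroring the "no-op if the target dir already exists" rule.
    return Test-Path -LiteralPath $CacheDir -PathType Container
}
```

A directory-exists check is cheap but would also accept a partially completed clone; checking for a .git subdirectory or a marker file written after checkout is one possible hardening.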

Pros / cons considered

| Aspect | Option A (live only) | Option C (mock + live) |
| --- | --- | --- |
| Config | 1 environment | 2 environments |
| Side effects | Real on every run | Only on live-tagged evals |
| CI secrets | Full (ADO + GH) | Reduced |
| Schema-drift risk | None | Real: mock must track real tool signatures |
| Faithfulness to prod | Highest | Mixed |
| Right when... | Mock unmaintained, no destructive tools | Nightly CI + destructive tools exist (our case) |

C is preferred because (a) nightly CI is in scope (#15829), (b) destructive evals already exist in the suite, and (c) Azure.Sdk.Tools.Mock is already merged and intended for exactly this.

Acceptance criteria

  • .vally.yaml declares both azsdk-mcp-live and azsdk-mcp-mock environments.
  • fixtures/setup-specs.ps1 exists, is idempotent, and honors $env:SPECS_SHA.
  • fixtures/specs.lock checked in with a pinned SHA + comment on how to refresh.
  • Every eval declares its environment: and (if needed) a setup: block.
  • create-release-plan + link-namespace-approval-issue run against the mock (verified by inspecting trajectory — tool calls present, no real ADO work item created).
  • validate-typespec runs against the live server with a workspace that contains real spec files (verified by tool returning non-error output).
  • README updated to document the live-vs-mock decision matrix and the setup-hook contract.
  • CI job (#15829) consumes the new structure without regressing scenario count.

Out of scope

  • Adding new eval scenarios (tracked separately).
  • Schema-parity tests between Azure.Sdk.Tools.Cli tool responses and Azure.Sdk.Tools.Mock handler responses — separate concern; if needed, file a follow-up against the mock project.
  • Upstream vally-cli grader additions (forbidden, argument-matching) — captured in the README follow-ups list.
