Wire workspace setup hooks + mock MCP environment for Vally tool-scenario evals

## Context

Follow-up to PR #15811 and issue #15829. The 11 ported Vally evals under [`tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/`](https://github.com/Azure/azure-sdk-tools/tree/main/tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals) currently:

1. Run in the Vally project's own cwd, so tools that read spec files / run `git` (e.g. `azsdk_run_typespec_validation`, `azsdk_typespec_check_project_in_public_repo`, `azsdk_get_modified_typespec_projects`, `azsdk_typespec_generate_authoring_plan`) operate on a workspace that does not contain `azure-rest-api-specs` content.
2. Always target the **live** MCP server (`Azure.Sdk.Tools.Cli`). Scenarios with destructive side effects — `create-release-plan`, `link-namespace-approval-issue`, future `release-sdk` / `link-sdk-pr` evals — would create real ADO work items / GitHub links on every nightly run if CI (#15829) ships as-is.

## Proposed approach (Option C)

### 1. Two MCP environments in [`.vally.yaml`](https://github.com/Azure/azure-sdk-tools/blob/main/tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml)

```yaml
environments:
  azsdk-mcp-live:
    mcpServers:
      azure-sdk-mcp:
        command: dotnet
        args: ["run", "--project", "../Azure.Sdk.Tools.Cli", "--", "start"]
        timeout: "5m"
  azsdk-mcp-mock:
    mcpServers:
      azure-sdk-mcp:
        command: dotnet
        args: ["run", "--project", "../Azure.Sdk.Tools.Mock", "--", "start"]
        timeout: "5m"
```

Each eval opts in to one of these environments via its `environment:` key. Initial split:

| Eval | Env | Reason |
|---|---|---|
| `validate-typespec` | live | needs real `tsp` output to guide the agent loop |
| `check-public-repo` | live | hits real GitHub |
| `check-public-repo-then-validate` | live | same as `check-public-repo` (hits real GitHub) |
| `typespec-generation-step02` | live | reads workspace |
| `get-modified-typespec-projects` | live | needs real `git` state |
| `add-arm-resource` | live | reads spec files |
| `rename-client-property` | live | reads + writes spec files |
| `get-pr-link-current-branch` | live | reads real branch |
| `check-sdk-generation-status` | mock | tool-call grader only; mock is fine |
| **`create-release-plan`** | **mock** | **destructive on live** |
| **`link-namespace-approval-issue`** | **mock** | **destructive on live** |
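One way to keep this split from drifting is a small guard that fails when a destructive eval is not assigned to the mock. The sketch below is illustrative only: the mapping is copied from the table above, while `EVAL_ENVIRONMENTS`, `DESTRUCTIVE_EVALS`, and `destructive_on_live` are hypothetical names, not part of Vally or the existing repo.

```python
# Eval name -> environment, transcribed from the table above.
EVAL_ENVIRONMENTS = {
    "validate-typespec": "azsdk-mcp-live",
    "check-public-repo": "azsdk-mcp-live",
    "check-public-repo-then-validate": "azsdk-mcp-live",
    "typespec-generation-step02": "azsdk-mcp-live",
    "get-modified-typespec-projects": "azsdk-mcp-live",
    "add-arm-resource": "azsdk-mcp-live",
    "rename-client-property": "azsdk-mcp-live",
    "get-pr-link-current-branch": "azsdk-mcp-live",
    "check-sdk-generation-status": "azsdk-mcp-mock",
    "create-release-plan": "azsdk-mcp-mock",
    "link-namespace-approval-issue": "azsdk-mcp-mock",
}

# Evals the issue identifies as destructive against the live server.
DESTRUCTIVE_EVALS = {"create-release-plan", "link-namespace-approval-issue"}


def destructive_on_live(assignments: dict, destructive: set) -> list:
    """Return destructive evals that are missing or not assigned to the mock."""
    return sorted(
        name for name in destructive
        if assignments.get(name) != "azsdk-mcp-mock"
    )
```

An empty result means every destructive eval targets the mock; a CI step could fail the job on any non-empty result.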

### 2. Per-eval `setup:` hook for repo prep

A shared helper script, `fixtures/setup-specs.ps1`, idempotently sparse-clones `Azure/azure-rest-api-specs` into the workspace at the ref given by `$env:SPECS_SHA` (defaulting to the SHA pinned in `fixtures/specs.lock`) and runs `git sparse-checkout set` for the paths the eval references.

Per-eval invocation:

```yaml
setup:
  - run: pwsh fixtures/setup-specs.ps1 specification/contosowidgetmanager/Contoso.WidgetManager
```
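The idempotent sparse-clone logic the helper script needs could look roughly like the following. This is a Python sketch of the control flow, not the PowerShell script itself. The clone URL, the choice of a blobless (`--filter=blob:none`) clone, and the names `plan_sparse_checkout` and `run_plan` are assumptions made for illustration.

```python
import subprocess
from pathlib import Path

# Assumed clone URL for Azure/azure-rest-api-specs.
SPECS_URL = "https://github.com/Azure/azure-rest-api-specs.git"


def plan_sparse_checkout(target: Path, ref: str, paths: list) -> list:
    """Build the git commands that bring `target` to `ref` with only `paths` checked out."""
    git = ["git", "-C", str(target)]
    commands = []
    if not (target / ".git").is_dir():
        # First run only: blobless clone with no working tree populated yet.
        commands.append(
            ["git", "clone", "--filter=blob:none", "--no-checkout", SPECS_URL, str(target)]
        )
    # These steps are emitted on every run, including when the clone already exists.
    commands.append(git + ["sparse-checkout", "set", *paths])
    commands.append(git + ["fetch", "origin", ref])
    commands.append(git + ["checkout", "--detach", "FETCH_HEAD"])
    return commands


def run_plan(commands: list) -> None:
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Separating the plan from its execution keeps the clone-or-skip decision easy to unit test without network access.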

For future SDK-generation scenarios, the same pattern extends to language repos (`azure-sdk-for-net`, `-python`, `-js`, `-java`, `-go`) via a parallel `fixtures/setup-sdk-repo.ps1`.

### 3. SHA pinning for reproducibility

- Default to a known-good `azure-rest-api-specs` SHA stored in `fixtures/specs.lock`.
- Nightly CI overrides the pin by setting `$env:SPECS_SHA` to `main` to detect upstream regressions; PR and manual runs use the pinned SHA so results can be diffed against a stable baseline.
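The precedence described in the two bullets above (an explicit `SPECS_SHA` wins, otherwise the lock file decides) can be sketched as follows. The lock file format is an assumption: one SHA on its own line, with `#` comment lines for the refresh instructions. `read_locked_sha` and `resolve_specs_ref` are hypothetical helper names.

```python
def read_locked_sha(lock_text: str) -> str:
    """Return the first non-blank, non-comment line of the lock file."""
    for line in lock_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return line
    raise ValueError("specs.lock contains no pinned SHA")


def resolve_specs_ref(env: dict, lock_text: str) -> str:
    """Prefer the SPECS_SHA override; fall back to the pinned SHA."""
    override = env.get("SPECS_SHA", "").strip()
    return override or read_locked_sha(lock_text)
```

With this shape, a nightly job passes an environment containing `SPECS_SHA=main`, while PR runs pass an environment without the variable and receive the pinned value.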

### 4. Cache the clone within a CI job

One sparse clone per job, reused across all evals, saves roughly 30 s for each eval that would otherwise re-clone. In GitHub Actions, the clone lives in `$RUNNER_TEMP/specs`, and the setup script no-ops if the target directory already exists.
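The cache check itself is small. A minimal sketch, assuming a fallback to the system temp directory when `RUNNER_TEMP` is unset (for local runs); the fallback and the names `specs_cache_dir` and `should_clone` are illustrative, not taken from the existing scripts.

```python
import tempfile
from pathlib import Path


def specs_cache_dir(env: dict) -> Path:
    """Locate the shared clone: $RUNNER_TEMP/specs in CI, system temp dir otherwise."""
    base = env.get("RUNNER_TEMP") or tempfile.gettempdir()
    return Path(base) / "specs"


def should_clone(target: Path) -> bool:
    """Clone only when the shared directory does not exist yet."""
    return not target.is_dir()
```

The first eval in a job sees `should_clone(...)` return `True` and pays the clone cost; later evals in the same job find the directory and skip it.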

## Pros / cons considered

| Aspect | Option A (live only) | **Option C (mock + live)** |
|---|---|---|
| Config | 1 environment | 2 environments |
| Side effects | Real on every run | Only on live-tagged evals |
| CI secrets | Full (ADO + GH) | Reduced |
| Schema-drift risk | None | Real — mock must track real tool signatures |
| Faithfulness to prod | Highest | Mixed |
| Right when... | Mock unmaintained, no destructive tools | Nightly CI + destructive tools exist (our case) |

Option C is preferred because (a) nightly CI is in scope (#15829), (b) destructive evals already exist in the suite, and (c) `Azure.Sdk.Tools.Mock` is already merged and intended for exactly this purpose.

## Acceptance criteria

- [ ] `.vally.yaml` declares both `azsdk-mcp-live` and `azsdk-mcp-mock` environments.
- [ ] `fixtures/setup-specs.ps1` exists, is idempotent, and honors `$env:SPECS_SHA`.
- [ ] `fixtures/specs.lock` checked in with a pinned SHA + comment on how to refresh.
- [ ] Every eval declares its `environment:` and (if needed) a `setup:` block.
- [ ] `create-release-plan` + `link-namespace-approval-issue` run against the mock (verified by inspecting trajectory — tool calls present, no real ADO work item created).
- [ ] `validate-typespec` runs against the live server with a workspace that contains real spec files (verified by tool returning non-error output).
- [ ] README updated to document the live-vs-mock decision matrix and the setup-hook contract.
- [ ] CI job (#15829) consumes the new structure without regressing scenario count.

## Out of scope

- Adding new eval scenarios (tracked separately).
- Schema-parity tests between `Azure.Sdk.Tools.Cli` tool responses and `Azure.Sdk.Tools.Mock` handler responses — separate concern; if needed, file a follow-up against the mock project.
- Upstream `vally-cli` grader additions (`forbidden`, argument-matching) — captured in the README follow-ups list.

## References

- PR: #15811
- CI follow-up: #15829
- Migration tracker: #15124
- Mock server project: [`tools/azsdk-cli/Azure.Sdk.Tools.Mock/`](https://github.com/Azure/azure-sdk-tools/tree/main/tools/azsdk-cli/Azure.Sdk.Tools.Mock)

