[Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs)

## Feature Request

Fix two M3 **harness-side** tool-calling problems so CUGA is scored fairly on Vakra:

1. **Registry tool-name prefix** — stop exposing `task_<n>_<domain>_` prefixes that break Vakra gold-to-live tool matching
2. **Undocumented MCP outputs** — guide the agent to probe and handle schema-less tool responses without Python type crashes

## Motivation / Problem

**Prefix mismatch:** The registry named MCP apps `task_2_hockey`, etc., so live tool names did not match Vakra gold (`hockey_get_…`). `_match_live_name` failed and groundedness was often **0** even when the agent called the right API.
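To make the mismatch concrete, here is a minimal sketch of why an exact comparison fails and how a prefix-tolerant comparison would succeed. It is not the real `_match_live_name`, whose signature and internals are not shown in this issue. The helper names, the regex, and the tool name `hockey_get_player` are all hypothetical, and the prefix is assumed to have the shape `task_<n>_`.

```python
import re

# Assumption: the legacy prefix is "task_<n>_" placed in front of the
# domain-qualified tool name, e.g. "task_2_" + "hockey_get_player".
TASK_PREFIX = re.compile(r"^task_\d+_")


def strip_task_prefix(live_name: str) -> str:
    """Drop one leading task_<n>_ prefix, if present (hypothetical helper)."""
    return TASK_PREFIX.sub("", live_name, count=1)


def matches_gold(live_name: str, gold_name: str) -> bool:
    """Accept an exact match, or a match after removing the legacy prefix."""
    return live_name == gold_name or strip_task_prefix(live_name) == gold_name


live = "task_2_hockey_get_player"  # hypothetical prefixed live tool name
gold = "hockey_get_player"         # hypothetical Vakra gold tool name

print(live == gold)                # exact comparison fails
print(matches_gold(live, gold))    # prefix-tolerant comparison succeeds
```

With bare-domain app names the live name already equals the gold name, so the tolerant path would only matter for old result bundles.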

**Output shape crashes:** Many M3 tools have empty response documentation. The model assumed dict-shaped results and used `.get()` or key access on lists/strings, failing tasks independent of reasoning quality.

Both issues produce false failures in the harness metrics; neither is an agent-quality regression.

## Use Case

As someone evaluating CUGA on M3 with Vakra:
- I need groundedness and tool-call scoring to reflect actual API usage, not naming conventions
- I need long runs to complete without dying on `AttributeError` / type errors after valid tool calls
- I want a CI-friendly guard (`check_no_task_prefix.py`) so legacy prefixed names cannot creep back into saved results

## Proposed Solution

### 1) Vakra tool-name matching

- Expand registry config with **bare domain** as the MCP app name (e.g. `hockey`, not `task_2_hockey`)
- Filter each domain agent with `FilteredToolProvider(app_name=<domain>)`
- Keep **backward-compatible** suffix matching in `m3_vakra_score.py` for old result bundles
- Add `capability_filter` / collision guard in `expand_registry_config`
- Add `scripts/check_no_task_prefix.py` to fail if any result JSON still contains `task_<n>_<domain>_` in tool names
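As a rough illustration of the guard in the last bullet, the sketch below scans a parsed result JSON for strings that still carry a `task_<n>_<domain>_` prefix. The actual layout of the result files is not described in this issue, so this version simply walks every string in the document. The function names, the regex, and the sample payload are assumptions, not the contents of `scripts/check_no_task_prefix.py`.

```python
import json
import re
from typing import Any, Iterator

# Assumption: a legacy name looks like task_<n>_<domain>_..., where <domain>
# is a single alphanumeric token.
LEGACY_NAME = re.compile(r"\btask_\d+_[A-Za-z0-9]+_")


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string (keys and values) found in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_legacy_names(result: Any) -> list[str]:
    """Return the sorted, de-duplicated strings that contain a legacy prefix."""
    return sorted({s for s in iter_strings(result) if LEGACY_NAME.search(s)})


# Hypothetical result payload; the real schema may differ.
sample = json.loads(
    '{"calls": [{"tool": "task_2_hockey_get_x"}, {"tool": "hockey_get_x"}]}'
)
print(find_legacy_names(sample))
```

A CI wrapper could load each result file, call `find_legacy_names`, and exit non-zero when the returned list is not empty.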

### 2) Undocumented tool outputs

- Add `M3_SPECIAL_INSTRUCTIONS` (eval-only) and pass via SDK `special_instructions` on `CugaAgent`
- Instruct the agent to run an isolated probe call that prints a compact shape summary the first time it uses a tool, and to write follow-up code defensively, with `isinstance` checks and normalization of the result
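The kind of code these instructions are meant to elicit might look like the sketch below: a compact shape summary for the first probe, then a normalizer that lets follow-up code iterate over records without assuming a dict. Both helpers are hypothetical illustrations; they are not part of `M3_SPECIAL_INSTRUCTIONS` or the CUGA SDK.

```python
from typing import Any


def summarize_shape(value: Any, max_keys: int = 5) -> str:
    """Describe a tool result compactly without printing the full payload."""
    if isinstance(value, dict):
        keys = list(value)[:max_keys]
        return f"dict(keys={keys}, n={len(value)})"
    if isinstance(value, list):
        first = type(value[0]).__name__ if value else "empty"
        return f"list(len={len(value)}, first={first})"
    return type(value).__name__


def as_records(value: Any) -> list[dict]:
    """Normalize a result to a list of dicts so later code can iterate safely."""
    if isinstance(value, dict):
        return [value]
    if isinstance(value, list):
        # Keep only dict items; other item types are dropped in this sketch.
        return [item for item in value if isinstance(item, dict)]
    return []  # strings, numbers, None: nothing record-like to iterate


# Hypothetical undocumented result: a list where a dict might be expected.
result = [{"name": "A", "goals": 3}]
print(summarize_shape(result))
for record in as_records(result):
    print(record.get("goals"))  # .get() is safe here: record is a dict
```

Calling `.get()` directly on `result` would raise `AttributeError`, which is the failure mode described under Motivation / Problem.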

**Acceptance criteria**
- [ ] New runs emit bare-domain tool names only
- [ ] Vakra groundedness > 0 on tool-correct answers in a smoke domain
- [ ] No task failures from type errors on undocumented tool outputs in representative runs

## Alternatives Considered

- **Rewrite Vakra gold to include prefixes** — changes benchmark semantics; judges/GT stay off-limits per harness charter
- **Post-process tool names only at scoring time** — insufficient; agent and trajectories still see wrong names
- **Fix in `cuga-agent` registry only** — still need eval-side expansion, filtering, guards, and instructions in this repo

## Priority

Critical - Blocking my use case

## Additional Context

- **Parent / epic:** Sub-issue of #37 ([Feature]: Improve evaluation harness to improve CUGA score on Vakra (m3))
- **Implementation tracked in:** PR #3 (`fix/m3-harness-bugs`), closes this issue when merged
- **Related:** #38 (policies) — orthogonal; prefix fix must land before policy A/B conclusions are meaningful
- **Full-run result (prefix fix, no policies):** 67/200 (33.5%) on `small_train.zip` — see PR #3 description / bundle `20260603_220001_default`
- **Analysis:** `docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md`

