Feature Request
Fix two M3 harness-side tool-calling problems so CUGA is scored fairly on Vakra:
- Registry tool-name prefix: stop exposing task_<n>_<domain>_ prefixes that break Vakra gold-to-live tool matching
- Undocumented MCP outputs: guide the agent to probe and handle schema-less tool responses without Python type crashes
Motivation / Problem
Prefix mismatch: The registry named MCP apps task_2_hockey, etc., so live tool names did not match Vakra gold (hockey_get_…). _match_live_name failed and groundedness was often 0 even when the agent called the right API.
Output shape crashes: Many M3 tools have empty response documentation. The model assumed dict-shaped results and used .get() or key access on lists/strings, failing tasks independent of reasoning quality.
Both issues inflate false failures in harness metrics; they are not agent-quality regressions.
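The prefix mismatch can be illustrated with a minimal sketch. The tool names and the helper `match_gold_to_live` below are hypothetical and are not the harness's `_match_live_name`; the sketch only shows why an exact comparison fails on prefixed names and how a suffix comparison would recover the intended tool.

```python
from typing import List, Optional


def match_gold_to_live(gold_name: str, live_names: List[str]) -> Optional[str]:
    """Hypothetical matcher: try an exact match, then an unambiguous suffix match."""
    if gold_name in live_names:
        return gold_name
    # A legacy live name looks like task_<n>_<gold_name>, so it ends with "_<gold_name>".
    candidates = [name for name in live_names if name.endswith("_" + gold_name)]
    # Accept the suffix match only when exactly one live tool qualifies.
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical tool names, used only for illustration.
live = ["task_2_hockey_get_player_stats", "task_2_hockey_get_team_roster"]
gold = "hockey_get_player_stats"

exact_hit = gold in live                       # False: the prefix breaks exact matching
suffix_hit = match_gold_to_live(gold, live)    # the prefixed live name is recovered
print(exact_hit, suffix_hit)
```

With bare domain app names the exact branch succeeds directly; the suffix branch would only be needed for result bundles saved before the fix.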
Use Case
As someone evaluating CUGA on M3 with Vakra:
- I need groundedness and tool-call scoring to reflect actual API usage, not naming conventions
- I need long runs to complete without dying on AttributeError / type errors after valid tool calls
- I want a CI-friendly guard (check_no_task_prefix.py) so legacy prefixed names cannot creep back into saved results
Proposed Solution
1) Vakra tool-name matching
- Expand registry config with the bare domain as the MCP app name (e.g. hockey, not task_2_hockey)
- Filter each domain agent with FilteredToolProvider(app_name=<domain>)
- Keep backward-compatible suffix matching in m3_vakra_score.py for old result bundles
- Add capability_filter / collision guard in expand_registry_config
- Add scripts/check_no_task_prefix.py to fail if any result JSON still contains task_<n>_<domain>_ in tool names
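A guard along the lines of the last item could look like the sketch below. It is not the actual scripts/check_no_task_prefix.py: the layout of the result JSON is not described here, so the sketch assumes that scanning every string key and value is acceptable, and the function names are hypothetical.

```python
import json
import re
from pathlib import Path
from typing import Any, Iterator, List

# Assumed pattern for a legacy prefixed name: task_<n>_<domain>_...
PREFIX_RE = re.compile(r"\btask_\d+_[A-Za-z0-9]+_")


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string key and value nested anywhere in a JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_prefixed_names(doc: Any) -> List[str]:
    """Return the sorted, de-duplicated strings that contain a legacy prefix."""
    return sorted({s for s in iter_strings(doc) if PREFIX_RE.search(s)})


def scan_results(root: Path) -> int:
    """Print each offending string under root and return how many were found."""
    hits = 0
    for path in sorted(root.rglob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid UTF-8 JSON
        for name in find_prefixed_names(doc):
            print(f"{path}: legacy prefixed name: {name}")
            hits += 1
    return hits


# Hypothetical result fragment, used only for illustration.
sample = {"steps": [{"tool": "task_2_hockey_get_scores"}, {"tool": "hockey_get_scores"}]}
offenders = find_prefixed_names(sample)
print(offenders)
```

In CI, a wrapper could call `scan_results` on the results directory and exit non-zero when the count is above zero.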
2) Undocumented tool outputs
- Add M3_SPECIAL_INSTRUCTIONS (eval-only) and pass it via the SDK special_instructions parameter on CugaAgent
- Instruct the agent to run an isolated probe that prints a compact shape summary the first time it uses a tool, then to use defensive isinstance checks / normalization in follow-up code
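The kind of code these instructions aim to elicit from the agent can be sketched as follows. Both helpers are hypothetical illustrations, not part of CUGA or of M3_SPECIAL_INSTRUCTIONS: `summarize_shape` stands in for the compact probe, and `as_records` for the defensive normalization that replaces a bare `.get()` on a result of unknown type.

```python
import json
from typing import Any, Dict, List


def summarize_shape(value: Any, depth: int = 2) -> str:
    """Compact description of a tool result's shape, limited to `depth` levels."""
    if isinstance(value, dict):
        if depth == 0:
            return "dict"
        items = list(value.items())[:5]  # cap the summary at five keys
        inner = ", ".join(f"{k}: {summarize_shape(v, depth - 1)}" for k, v in items)
        return "dict{" + inner + "}"
    if isinstance(value, list):
        if not value or depth == 0:
            return f"list[len={len(value)}]"
        return f"list[len={len(value)}] of {summarize_shape(value[0], depth - 1)}"
    return type(value).__name__


def as_records(value: Any) -> List[Dict[str, Any]]:
    """Normalize a result of unknown shape into a list of dicts."""
    if isinstance(value, str):
        try:
            value = json.loads(value)  # some tools return JSON as text
        except json.JSONDecodeError:
            return [{"value": value}]
    if isinstance(value, dict):
        return [value]
    if isinstance(value, list):
        return [item if isinstance(item, dict) else {"value": item} for item in value]
    return [{"value": value}]


# Hypothetical tool result: a list, where the model might have assumed a dict.
result = [{"name": "A", "goals": 3}, {"name": "B"}]
shape = summarize_shape(result)
total_goals = sum(record.get("goals", 0) for record in as_records(result))
print(shape, total_goals)
```

Calling `result.get("goals")` directly on this list would raise AttributeError; normalizing first keeps the follow-up code valid whether the tool returns a dict, a list, or a string.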
Acceptance criteria
Alternatives Considered
- Rewrite Vakra gold to include prefixes: changes benchmark semantics; judges/GT stay off-limits per the harness charter
- Post-process tool names only at scoring time: insufficient; the agent and trajectories still see the wrong names
- Fix in the cuga-agent registry only: still need eval-side expansion, filtering, guards, and instructions in this repo
Priority
Critical - Blocking my use case
Additional Context
- fix/m3-harness-bugs: closes this issue when merged
- small_train.zip: see the description of PR #3, "fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate", and bundle 20260603_220001_default
- docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md