[Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs) #39

@haroldship

Feature Request

Fix two M3 harness-side tool-calling problems so CUGA is scored fairly on Vakra:

  1. Registry tool-name prefix — stop exposing task_<n>_<domain>_ prefixes that break Vakra gold-to-live tool matching
  2. Undocumented MCP outputs — guide the agent to probe and handle schema-less tool responses without Python type crashes

Motivation / Problem

Prefix mismatch: The registry named MCP apps task_2_hockey, etc., so live tool names did not match Vakra gold (hockey_get_…). _match_live_name failed and groundedness was often 0 even when the agent called the right API.
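To make the mismatch concrete, here is a minimal sketch of the matching problem. The function `match_live_name_sketch` and the tool names `hockey_get_player` / `hockey_get_team` are hypothetical and used only for illustration; this is not the actual `_match_live_name` implementation. It assumes live names are formed as `<app name>_<tool>`, so a legacy app name such as `task_2_hockey` yields a live name that only matches gold as a suffix.

```python
from typing import Iterable, Optional


def match_live_name_sketch(gold_name: str, live_names: Iterable[str]) -> Optional[str]:
    """Return the live tool name corresponding to gold_name, or None.

    Hypothetical helper: exact match first, then an unambiguous suffix match
    so that legacy prefixed names can still be resolved.
    """
    live = list(live_names)
    # Exact match: what runs with bare-domain app names should produce.
    if gold_name in live:
        return gold_name
    # Suffix match: tolerate a leading legacy prefix on the live name.
    candidates = [name for name in live if name.endswith("_" + gold_name)]
    # Accept only an unambiguous candidate; otherwise report no match.
    return candidates[0] if len(candidates) == 1 else None


# Legacy registry naming (assumed form): exact comparison fails, suffix match succeeds.
legacy_live = ["task_2_hockey_get_player", "task_2_hockey_get_team"]
print("hockey_get_player" in legacy_live)
print(match_live_name_sketch("hockey_get_player", legacy_live))
```

A strict equality check against the legacy names returns no match, which would explain a groundedness of 0 even when the right API was called.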

Output shape crashes: Many M3 tools have empty response documentation. The model assumed dict-shaped results and called .get() or indexed by key on lists and strings, so tasks failed regardless of reasoning quality.

Both issues produce false failures in the harness metrics; neither is an agent-quality regression.

Use Case

As someone evaluating CUGA on M3 with Vakra:

  • I need groundedness and tool-call scoring to reflect actual API usage, not naming conventions
  • I need long runs to complete without dying on AttributeError / type errors after valid tool calls
  • I want a CI-friendly guard (check_no_task_prefix.py) so legacy prefixed names cannot creep back into saved results

Proposed Solution

1) Vakra tool-name matching

  • Expand registry config with bare domain as the MCP app name (e.g. hockey, not task_2_hockey)
  • Filter each domain agent with FilteredToolProvider(app_name=<domain>)
  • Keep backward-compatible suffix matching in m3_vakra_score.py for old result bundles
  • Add capability_filter / collision guard in expand_registry_config
  • Add scripts/check_no_task_prefix.py to fail if any result JSON still contains task_<n>_<domain>_ in tool names
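The guard in the last bullet could look roughly like the sketch below. The result-bundle layout is an assumption: this sketch supposes tool names are stored under keys called `tool_name` or `name`, and that the domain part of the prefix contains no underscores. The real `scripts/check_no_task_prefix.py` may walk the JSON differently.

```python
import json
import re
from pathlib import Path
from typing import Any, Iterator

# Legacy form task_<n>_<domain>_ at the start of a tool name.
LEGACY_PREFIX = re.compile(r"^task_\d+_[A-Za-z0-9]+_")

# Assumption: keys under which a result bundle stores tool names.
TOOL_NAME_KEYS = {"tool_name", "name"}


def iter_prefixed_names(node: Any) -> Iterator[str]:
    """Yield every tool name in a parsed JSON tree that carries the legacy prefix."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in TOOL_NAME_KEYS and isinstance(value, str):
                if LEGACY_PREFIX.match(value):
                    yield value
            else:
                yield from iter_prefixed_names(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_prefixed_names(item)


def find_prefixed_names(json_text: str) -> list[str]:
    """Return the sorted, de-duplicated legacy names found in one JSON document."""
    return sorted(set(iter_prefixed_names(json.loads(json_text))))


def main(paths: list[str]) -> int:
    """Return 1 if any file contains a legacy prefixed tool name, else 0."""
    failed = False
    for path in paths:
        offenders = find_prefixed_names(Path(path).read_text(encoding="utf-8"))
        for name in offenders:
            print(f"{path}: legacy tool name {name!r}")
        failed = failed or bool(offenders)
    return 1 if failed else 0
```

In CI this would be invoked with something like `sys.exit(main(sys.argv[1:]))`, so a non-zero exit code fails the job whenever a saved result still contains a prefixed name.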

2) Undocumented tool outputs

  • Add M3_SPECIAL_INSTRUCTIONS (eval-only) and pass via SDK special_instructions on CugaAgent
  • Instruct: isolated probe with compact shape summary on first use; defensive isinstance / normalization on follow-up code

Acceptance criteria

  • New runs emit bare-domain tool names only
  • Vakra groundedness > 0 on tool-correct answers in a smoke domain
  • No task failures from type errors on undocumented tool outputs in representative runs

Alternatives Considered

  • Rewrite Vakra gold to include prefixes — changes benchmark semantics; judges/GT stay off-limits per harness charter
  • Post-process tool names only at scoring time — insufficient; agent and trajectories still see wrong names
  • Fix in cuga-agent registry only — still need eval-side expansion, filtering, guards, and instructions in this repo

Priority

Critical - Blocking my use case
