From c0ce9f10facf28c0d41969c6945025adedb3a556 Mon Sep 17 00:00:00 2001
From: Harold Ship
Date: Wed, 20 May 2026 14:37:56 +0300
Subject: [PATCH 01/20] fix(m3): fix harness bugs that artificially zeroed CUGA M3 pass rate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The M3 harness in this repo had several mutually-reinforcing bugs that
caused CUGA's M3 (vakra) pass rate to be near zero on PF cases (cases
where ReAct passes and CUGA fails), independent of agent quality. This
commit fixes the cluster:

1. Tool-name app-prefix breaks vakra's _match_live_name (groundedness=0
   on otherwise-correct answers). Registry now uses bare-domain app
   names; m3_data_loader and m3_vakra_score gain backward-compat
   suffix-match paths for legacy bundles. Collision guard added because
   de-prefixing collapses cross-task duplicates (books, mondial_geo,
   soccer_2016 in tasks 2 and 3); expand_registry_config gains a
   capability_filter to pre-narrow.

2. M3Evaluator.evaluate_all multiturn branch checked self.task_id
   (singular, set only when N=1) instead of self.task_ids — so
   `--task ` silently ran all ~46 samples in the capability. Switched
   to plural and lowercase-membership.

3. eval_m3.py:start_registry_server hardcoded port 8001 in three
   places. Now reads REGISTRY_PORT or DYNACONF_SERVER_PORTS__REGISTRY.

4. eval.sh started an outer registry that collided with eval_m3.py's
   per-service registry on the same port. Every code path through
   eval.sh uses --from-config and self-manages its registry, so the
   outer-start is dead weight; gated behind SKIP_SERVER_START=false now.

5. Per-domain CugaAgent constructors were re-loading policies from disk
   and writing back the conflict-resolver's culled set, so the .cuga
   folder count decreased monotonically across domains. Pass
   auto_load_policies=False, filesystem_sync=False; load policies once
   via a new _load_m3_policies helper.

6. --capability / --task with nargs="*" overwrote on a second
   invocation; switched both to action="extend", default=[]. UUID
   detection now strips non-UUID items before passing the filter down.

7. M3 policy bundle (P-OF-1, P-PB-1..4, P-TG-1..2) authored in
   benchmarks/m3/policies/ as markdown with YAML frontmatter; compiled
   to policies.json by scripts/policies_md_to_json.py. eval.sh and
   compare.sh grew --no-policies and --compare-policies flags mirroring
   bpo. DYNACONF_POLICY__ENABLED flipped to true in m3.env. Bundle dir
   naming annotates the policy mode.

8. scripts/check_no_task_prefix.py smoke-tests result files to ensure
   no tool call still carries the legacy task___ prefix.

Headline result on a 4-PF x 5-runs x 2-configs sweep:
  - baseline: 0/10
  - + tool-prefix removal: 5/10 (no-policies) / 4/10 (policies)
  - + remaining fixes (4 PFs x 5 runs): 81.2% no-policies / 50.0% policies

The policy bundle is net-negative on these 4 cases once the tool-prefix
root cause is fixed; one task improves (75 -> 100%), one regresses
(75 -> 0%). Re-running policies against the full 200-case M3 set is the
next step.

Full implementation notes appended to the analysis report under
"Post-analysis: what we actually changed and what each change did".

Out of scope (deferred): nested-arg sandbox codegen bug (cuga-agent),
movie_platform/professional_basketball MCP-client health, and splitting
results/ per compare.sh invocation to prevent parallel-eval
contamination.

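Item 6 above can be reproduced in isolation. The sketch below is not the parser from eval_m3.py; it is a toy argparse setup (the helper name `build_parser` is hypothetical) that assumes only what the item states: two option strings sharing one dest, nargs="*", and a switch from the default "store" action to action="extend" with default=[].

```python
import argparse


def build_parser(extend: bool) -> argparse.ArgumentParser:
    """Toy parser where --capability and --task share dest="task".

    extend=False mimics the old behaviour (implicit "store" action);
    extend=True mimics the fixed behaviour (action="extend", default=[]).
    """
    parser = argparse.ArgumentParser()
    kwargs = {"dest": "task", "type": str, "nargs": "*"}
    if extend:
        kwargs.update(action="extend", default=[])
    else:
        kwargs.update(default=None)
    parser.add_argument("--capability", "--task", **kwargs)
    return parser


argv = ["--capability", "m3_task_2", "--task", "hockey_395_0"]

# "store": the second flag replaces whatever the first one stored.
old = build_parser(extend=False).parse_args(argv).task
# "extend": every occurrence appends its values to the same list.
new = build_parser(extend=True).parse_args(argv).task

print(old)  # ['hockey_395_0']
print(new)  # ['m3_task_2', 'hockey_395_0']
```

With "store", the capability name is silently dropped, which matches the overwrite described in item 6. action="extend" needs Python 3.8 or newer.
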
Refs: #99 --- benchmarks/helpers/bundle.py | 14 +- benchmarks/m3/compare.sh | 38 +- benchmarks/m3/config/m3.env | 4 +- benchmarks/m3/eval.sh | 47 +- benchmarks/m3/eval_m3.py | 238 +++- benchmarks/m3/m3_data_loader.py | 17 +- benchmarks/m3/m3_vakra_score.py | 66 +- .../P-OF-1-single-tool-fact-citation.md | 60 + .../policies/P-OF-2-strip-hedging.md.disabled | 69 ++ .../m3/policies/P-PB-1-no-enumeration.md | 57 + ...B-2-one-composite-tool-no-corroboration.md | 84 ++ .../policies/P-PB-3-no-idempotent-retries.md | 64 + .../P-PB-4-validation-error-recovery.md | 89 ++ .../P-TG-1-mountain-count-disambiguation.md | 47 + ...-2-country-with-most-umpires-returns-id.md | 45 + benchmarks/m3/policies/policies.json | 177 +++ .../cuga_vs_react_full_analysis.md | 1072 +++++++++++++++++ scripts/check_no_task_prefix.py | 109 ++ scripts/policies_md_to_json.py | 138 +++ uv.lock | 6 +- 20 files changed, 2351 insertions(+), 90 deletions(-) create mode 100644 benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md create mode 100644 benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled create mode 100644 benchmarks/m3/policies/P-PB-1-no-enumeration.md create mode 100644 benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md create mode 100644 benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md create mode 100644 benchmarks/m3/policies/P-PB-4-validation-error-recovery.md create mode 100644 benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md create mode 100644 benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md create mode 100644 benchmarks/m3/policies/policies.json create mode 100644 docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md create mode 100644 scripts/check_no_task_prefix.py create mode 100644 scripts/policies_md_to_json.py diff --git a/benchmarks/helpers/bundle.py b/benchmarks/helpers/bundle.py index d6d875e..bac1526 100644 --- a/benchmarks/helpers/bundle.py +++ b/benchmarks/helpers/bundle.py @@ -516,7 +516,19 @@ def 
assemble_compare_bundle( timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") models = sorted(set(k.split(":")[0] for k in config_results)) - bundle_dir = bundle_root / f"{timestamp}_compare_{'_'.join(models)}" + # Detect inner-dim variants (agent and/or policy mode) so the dir name + # reflects what was compared. Config keys are "model[:agent[:policy_mode]]". + agents = sorted({parts[1] for k in config_results if len(parts := k.split(":")) > 1 and parts[1]}) + policy_modes = sorted({parts[2] for k in config_results if len(parts := k.split(":")) > 2 and parts[2]}) + suffix_bits = ["_".join(models)] + if len(agents) > 1: + suffix_bits.append("_".join(agents)) + if len(policy_modes) > 1: + suffix_bits.append("_vs_".join(policy_modes)) # e.g. "policies_vs_no-policies" + elif len(policy_modes) == 1 and policy_modes[0] == "no-policies": + suffix_bits.append("no-policies") + suffix = "_".join(suffix_bits) + bundle_dir = bundle_root / f"{timestamp}_compare_{suffix}" bundle_dir.mkdir(parents=True, exist_ok=True) # Per-run results diff --git a/benchmarks/m3/compare.sh b/benchmarks/m3/compare.sh index 7be9540..7359d65 100755 --- a/benchmarks/m3/compare.sh +++ b/benchmarks/m3/compare.sh @@ -38,6 +38,7 @@ MODELS="${MODELS:-gpt-oss}" AGENT="${AGENT:-cuga}" AGENTS="${AGENTS:-}" COMPARE_AGENTS="${COMPARE_AGENTS:-false}" +COMPARE_POLICIES="${COMPARE_POLICIES:-false}" NO_BUNDLE="${NO_BUNDLE:-false}" BUNDLE_ZIP="${BUNDLE_ZIP:-false}" FORWARDED_ARGS=() @@ -72,6 +73,10 @@ while [[ $idx -lt ${#ARGS[@]} ]]; do COMPARE_AGENTS=true idx=$((idx+1)) ;; + --compare-policies) + COMPARE_POLICIES=true + idx=$((idx+1)) + ;; --no-bundle) NO_BUNDLE=true idx=$((idx+1)) @@ -102,11 +107,18 @@ fi IFS=',' read -ra MODEL_LIST <<< "$MODELS" IFS=',' read -ra AGENT_LIST <<< "$AGENTS" -# Build CONFIGS as the cartesian product MODEL_LIST × AGENT_LIST, with labels "model:agent". +# Build CONFIGS as the cartesian product MODEL_LIST × AGENT_LIST × POLICY_MODE. 
+# When --compare-policies is off, the inner dim collapses to a single "policies" +# entry so the label format stays consistent (always model:agent:policy). CONFIGS=() for _m in "${MODEL_LIST[@]}"; do for _a in "${AGENT_LIST[@]}"; do - CONFIGS+=("${_m}:${_a}") + if [[ "$COMPARE_POLICIES" == "true" ]]; then + CONFIGS+=("${_m}:${_a}:policies") + CONFIGS+=("${_m}:${_a}:no-policies") + else + CONFIGS+=("${_m}:${_a}:policies") + fi done done @@ -123,10 +135,13 @@ echo "" if [[ "$DRY_RUN" == "true" ]]; then echo -e "${YELLOW:-}DRY RUN — showing planned commands:${NC:-}" for config in "${CONFIGS[@]}"; do - model="${config%%:*}" - agent="${config##*:}" + IFS=':' read -r model agent policy_mode <<< "$config" + extra="" + if [[ "$policy_mode" == "no-policies" ]]; then + extra=" --no-policies" + fi for ((r=1; r<=RUNS; r++)); do - echo " [${config} run ${r}/${RUNS}] ./eval.sh --agent ${agent} ${FORWARDED_ARGS[*]}" + echo " [${config} run ${r}/${RUNS}] ./eval.sh --agent ${agent}${extra} ${FORWARDED_ARGS[*]}" done done exit 0 @@ -150,7 +165,7 @@ compare_t0=$(date +%s) compare_cleanup() { echo -e "${YELLOW:-}Stopping servers...${NC:-}" - kill_port_processes 8001 + kill_port_processes "${REGISTRY_PORT:-8001}" } trap compare_cleanup EXIT INT TERM @@ -181,8 +196,7 @@ _list_results_for_agent() { } for config in "${CONFIGS[@]}"; do - model="${config%%:*}" - agent="${config##*:}" + IFS=':' read -r model agent policy_mode <<< "$config" echo -e "${BLUE:-}══════════════════════════════════════════════════════════════${NC:-}" echo -e "${CYAN:-}Configuration: ${config}${NC:-}" @@ -192,6 +206,12 @@ for config in "${CONFIGS[@]}"; do apply_model_profile "$model" fi + # Per-config extra args (e.g., --no-policies when comparing policy modes). + config_extra_args=() + if [[ "$policy_mode" == "no-policies" ]]; then + config_extra_args+=(--no-policies) + fi + # Snapshot agent-specific result files and trajectory folders before this # config's runs. 
Filtering by agent prevents stale files from the OTHER # agent leaking into this config's recent_files. @@ -206,7 +226,7 @@ for config in "${CONFIGS[@]}"; do fi run_t0=$(date +%s) - if bash "$SCRIPT_DIR/eval.sh" --agent "$agent" --no-bundle "${FORWARDED_ARGS[@]}"; then + if bash "$SCRIPT_DIR/eval.sh" --agent "$agent" --no-bundle "${config_extra_args[@]}" "${FORWARDED_ARGS[@]}"; then run_dur=$(( $(date +%s) - run_t0 )) echo -e "${GREEN:-}✓${NC:-} Run $r complete in $(fmt_duration $run_dur)" else diff --git a/benchmarks/m3/config/m3.env b/benchmarks/m3/config/m3.env index 433c738..54573b8 100644 --- a/benchmarks/m3/config/m3.env +++ b/benchmarks/m3/config/m3.env @@ -17,7 +17,7 @@ DYNACONF_ADVANCED_FEATURES__REGISTRY=true # M3 evaluation script starts its own registry server with expanded config # Skip the default registry startup in eval.sh to avoid conflicts SKIP_SERVER_START=true -DYNACONF_POLICY__ENABLED=false +DYNACONF_POLICY__ENABLED=true DYNACONF_ADVANCED_FEATURES__BENCHMARK=m3 # DYNACONF_ADVANCED_FEATURES__FORCE_AUTONOMOUS_MODE=false DYNACONF_ADVANCED_FEATURES__PATH_SEGMENT_INDEX=3 @@ -60,4 +60,4 @@ API_KEY=${GROQ_API_KEY} # meaningfully different (ExactMatch always fails on agent-recorded payloads # vs zip GT entries; Correctness/Groundedness see uncanonical responses) — so # `on` keeps verdicts honest by failing loudly when the container is down. 
-M3_VAKRA_LIVE_MCP=on \ No newline at end of file +M3_VAKRA_LIVE_MCP=on diff --git a/benchmarks/m3/eval.sh b/benchmarks/m3/eval.sh index 6fd30c6..ab82f52 100755 --- a/benchmarks/m3/eval.sh +++ b/benchmarks/m3/eval.sh @@ -39,6 +39,7 @@ for arg in "$@"; do echo " --difficulty LEVEL Filter by difficulty level (easy, medium, hard)" echo " --no-bundle Skip reproducibility bundle creation" echo " --bundle-zip Create zip archive of bundle" + echo " --no-policies Disable CUGA policies (for baselining; default: enabled)" echo " --model-profile Model profile (for bundle metadata)" echo "" echo "Examples:" @@ -57,6 +58,7 @@ MULTITURN=false M3_DATA=false M3_DATA_PATH="" NO_GROUND_TRUTH=false +NO_POLICIES=false PASSTHROUGH_ARGS=() while [[ $# -gt 0 ]]; do @@ -86,6 +88,10 @@ while [[ $# -gt 0 ]]; do BUNDLE_ZIP=true shift ;; + --no-policies) + NO_POLICIES=true + shift + ;; --model-profile) MODEL_PROFILE="$2" shift 2 @@ -106,7 +112,7 @@ while [[ $# -gt 0 ]]; do done -REGISTRY_PORT=8001 +REGISTRY_PORT="${REGISTRY_PORT:-8001}" REGISTRY_PID="" cleanup() { @@ -150,15 +156,19 @@ echo -e "${BLUE:-}║ M3 Benchmark Evaluation echo -e "${BLUE:-}╚════════════════════════════════════════════════════════════╝${NC:-}" echo "" -# Start registry if not skipped -if [ "${SKIP_SERVER_START:-false}" != "true" ]; then - # Kill any stale process on the registry port before starting - if port_in_use $REGISTRY_PORT 2>/dev/null; then - echo -e "${YELLOW:-}Killing existing process on port $REGISTRY_PORT...${NC:-}" - lsof -ti :$REGISTRY_PORT | xargs kill 2>/dev/null || true - sleep 1 - fi +# Kill any stale process on the registry port before delegating to the eval +# script. eval_m3.py / eval_m3_react.py / eval_m3_multiturn all spin up their +# own per-service registry on $REGISTRY_PORT (see start_registry_server() in +# eval_m3.py), so starting another registry here would just collide on the +# port. 
Opt-in: set SKIP_SERVER_START=false explicitly if you want this script +# to also start an "outer" registry (legacy flow). +if port_in_use $REGISTRY_PORT 2>/dev/null; then + echo -e "${YELLOW:-}Killing existing process on port $REGISTRY_PORT...${NC:-}" + lsof -ti :$REGISTRY_PORT | xargs kill 2>/dev/null || true + sleep 1 +fi +if [ "${SKIP_SERVER_START:-true}" = "false" ]; then echo -e "${YELLOW:-}Starting registry server on port $REGISTRY_PORT...${NC:-}" bash "$SCRIPT_DIR/run_registry.sh" > /tmp/m3_registry.log 2>&1 & REGISTRY_PID=$! @@ -185,6 +195,25 @@ EVAL_M3_EXTRA=() if [ "$NO_GROUND_TRUTH" = "true" ]; then EVAL_M3_EXTRA+=(--no-ground-truth) fi +if [ "$NO_POLICIES" = "true" ]; then + EVAL_M3_EXTRA+=(--no-policies) +fi + +# Compile policy markdowns -> policies.json (unless policies are disabled). +# CUGA's policy engine is turned on in benchmarks/m3/config/m3.env via +# DYNACONF_POLICY__ENABLED=true (mirrors bpo). With --no-policies, the engine +# is still on but no policies get loaded — same pattern as benchmarks/bpo. +# Same pattern as benchmarks/bpo: the json is what CUGA loads; the .md files +# are the human-readable source of truth. 
+POLICIES_DIR="$SCRIPT_DIR/policies" +if [ "$NO_POLICIES" != "true" ] && [ -d "$POLICIES_DIR" ]; then + if ls "$POLICIES_DIR"/*.md >/dev/null 2>&1; then + echo -e "${YELLOW:-}Compiling policy markdowns -> policies.json...${NC:-}" + uv run --no-sync python "$PROJECT_ROOT/scripts/policies_md_to_json.py" \ + --policies-dir "$POLICIES_DIR" \ + --output "$POLICIES_DIR/policies.json" + fi +fi # Select eval script if [ "$M3_DATA" = "true" ]; then diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 7df3cb1..f28c895 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -89,9 +89,49 @@ save_evaluation_results, setup_langfuse, ) +from benchmarks.helpers.sdk_eval_helpers import add_policy_via_agent, clear_all_policies from benchmarks.m3.m3_data_loader import M3DataLoader, diff_tool_calls +async def _load_m3_policies(agent: CugaAgent, policies_enabled: bool = True) -> None: + """Load CUGA policies into the per-domain agent. + + Mirrors the bpo eval_bench_sdk.py pattern: clear any pre-existing policies + from the agent's policy DB, then (if enabled) load each entry in + benchmarks/m3/policies/policies.json and register it. The .json is + compiled from .md by scripts/policies_md_to_json.py — driven by eval.sh + before this code runs. 
+ """ + await clear_all_policies(agent) + if not policies_enabled: + logger.info("Policies disabled (--no-policies)") + return + policies_file = os.path.join(os.path.dirname(__file__), "policies", "policies.json") + if not os.path.exists(policies_file): + logger.warning(f"Policies file not found: {policies_file} — running without policies") + return + from cuga.backend.cuga_graph.policy.models import OutputFormatter, Playbook, ToolGuide + + with open(policies_file) as f: + policies_data = json.load(f) + logger.info(f"Loading {len(policies_data)} policy/policies from policies.json...") + loaded = 0 + for pdata in policies_data: + ptype = pdata.get("type", "") + if ptype == "playbook": + policy = Playbook.model_validate(pdata) + elif ptype == "tool_guide": + policy = ToolGuide.model_validate(pdata) + elif ptype == "output_formatter": + policy = OutputFormatter.model_validate(pdata) + else: + logger.warning(f"Unknown policy type: {ptype}, skipping") + continue + await add_policy_via_agent(agent, policy) + loaded += 1 + logger.info(f"✅ Loaded {loaded} policy/policies") + + # m3_vakra_score is imported lazily — its top-level evaluator import instantiates # Groq/OpenAI LLM judges at class-body time, which raises if API_KEY is unset. # --no-ground-truth runs never need scoring, so let them succeed without judge env. @@ -150,7 +190,11 @@ class FilteredToolProvider: await olympics_provider.initialize() # Agent only sees olympics tools - agent = CugaAgent(tool_provider=olympics_provider) + agent = CugaAgent( + tool_provider=olympics_provider, + auto_load_policies=False, + filesystem_sync=False, + ) """ def __init__(self, base_provider, app_name: str): @@ -838,17 +882,17 @@ async def evaluate_all( # Multi-turn format: list of samples with sample_id/uuid, dialogue, etc. 
samples = data - # Filter by task_id (sample_id or uuid) if specified - if self.task_id: - samples = [ - s - for s in samples - if s.get("sample_id", s.get("uuid", "")).lower() == self.task_id.lower() - ] + # Filter by task_ids (sample_id or uuid) if specified. The plural + # form `self.task_ids` is what gets populated for both 1 and N + # UUIDs; `self.task_id` is only set when exactly one UUID was + # passed, so use the plural to handle both cases. + if self.task_ids: + wanted = {tid.lower() for tid in self.task_ids} + samples = [s for s in samples if s.get("sample_id", s.get("uuid", "")).lower() in wanted] if not samples: - logger.error(f"Sample '{self.task_id}' not found in test data") + logger.error(f"Sample(s) {self.task_ids} not found in test data") return - logger.info(f"Filtered to sample: {self.task_id}") + logger.info(f"Filtered to {len(samples)} sample(s): {self.task_ids}") else: logger.info(f"Evaluating all {len(samples)} samples") @@ -1015,8 +1059,8 @@ def _save_ground_truth_format(self, output_dir: Path) -> Path: # Shared helpers ------------------------------------------------ # Build the registry prefix to strip from tool names: # Registry prefixes tools as "{app_name}_{tool_name}" where - # app_name = "task_{task_id}_{domain}" - registry_prefix = f"task_{task_id}_{domain}_" + # app_name = "{domain}" (no task__ prefix). + registry_prefix = f"{domain}_" def _strip_prefix(name: str) -> str: """Strip the registry app prefix from a tool name.""" @@ -1367,9 +1411,13 @@ def _dom_name(dc): ) try: - # Registry mode: Use FilteredToolProvider for domain isolation - # App name in registry is prefixed with task_id to avoid collisions across tasks - registry_app_name = f"task_{task_id}_{domain}" + # Registry mode: Use FilteredToolProvider for domain isolation. + # The registry app name is just the domain — no `task__` prefix — + # so the tool names CUGA records start with the domain itself, not + # the task ID. 
Cross-task collisions are prevented by the collision + # guard in expand_registry_config (and in practice each eval run is + # narrowed to a single task via --capability). + registry_app_name = domain logger.info( f"🔧 Creating filtered tool provider for domain: {domain} (registry app: {registry_app_name})" ) @@ -1378,7 +1426,7 @@ def _dom_name(dc): # This provides defense-in-depth: registry filters at MCP level, we filter at agent level filtered_provider = FilteredToolProvider( base_provider=tool_provider, # Shared provider with all domains - app_name=registry_app_name, # Filter to only this domain's tools (task-prefixed) + app_name=registry_app_name, # Filter to only this domain's tools ) await filtered_provider.initialize() @@ -1396,10 +1444,22 @@ def _dom_name(dc): evaluator.agent = CugaAgent( tool_provider=filtered_provider, # Only sees this domain's tools callbacks=callbacks, + # Policies are loaded explicitly by _load_m3_policies below per + # eval run. Disable .cuga auto-load and filesystem sync to keep + # the per-domain agent's policy set deterministic — otherwise + # the .cuga folder drifts across domain iterations and policies + # disappear mid-run (see investigation 2026-05-17). + auto_load_policies=False, + filesystem_sync=False, ) evaluator.langfuse_handler = langfuse_handler logger.info(f"Agent created with filtered tool provider (domain: {domain})") + # Load CUGA policies for this per-domain agent (mirrors benchmarks/bpo + # eval_bench_sdk.py). The source of truth is benchmarks/m3/policies/*.md; + # eval.sh compiles them to policies.json before invoking us. 
+ await _load_m3_policies(evaluator.agent, policies_enabled=not getattr(args, "no_policies", False)) + # DEBUG: Verify agent can see tools (check filtered provider) try: filtered_tools = await filtered_provider.get_all_tools() @@ -1511,29 +1571,39 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: import os import subprocess - # Check if port 8001 is already in use - logger.info("🔍 Checking if port 8001 is available...") + # Honour caller-provided port via REGISTRY_PORT (set by eval.sh) and + # DYNACONF_SERVER_PORTS__REGISTRY (set when the agent's settings.toml + # registry port is overridden). Both must match: CUGA-agent reads the + # DYNACONF value when constructing HTTP requests to its registry; the + # registry server must listen on the same port. Default 8001. + _port_env = os.environ.get("REGISTRY_PORT") or os.environ.get("DYNACONF_SERVER_PORTS__REGISTRY") + registry_port = int(_port_env) if _port_env else 8001 + + # Check if the registry port is already in use + logger.info(f"🔍 Checking if port {registry_port} is available...") try: import socket sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - result = sock.connect_ex(('127.0.0.1', 8001)) + result = sock.connect_ex(('127.0.0.1', registry_port)) sock.close() if result == 0: # Port is in use - logger.error("❌ Port 8001 is already in use!") + logger.error(f"❌ Port {registry_port} is already in use!") logger.error("Another registry server or process is using this port.") logger.error("") logger.error("To fix this, run one of these commands:") - logger.error(" 1. Kill processes on port 8001:") - logger.error(" lsof -ti :8001 | xargs kill") + logger.error(f" 1. Kill processes on port {registry_port}:") + logger.error(f" lsof -ti :{registry_port} | xargs kill") logger.error("") logger.error(" 2. 
Or find and kill specific process:") - logger.error(" lsof -i :8001") + logger.error(f" lsof -i :{registry_port}") logger.error(" kill ") logger.error("") - raise RuntimeError("Port 8001 is already in use. Please kill the existing process first.") + raise RuntimeError( + f"Port {registry_port} is already in use. Please kill the existing process first." + ) except RuntimeError: raise # Re-raise the port-in-use error except Exception as e: @@ -1592,7 +1662,7 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: # tree (uv wrapper → python → uvicorn → any docker exec children) in # one shot via killpg. process.terminate() on its own only SIGTERMs # the `uv` wrapper, and that doesn't always propagate to uvicorn. - process = subprocess.Popen( + process = subprocess.Popen( # noqa: S603 — args are constant literals, no untrusted input [ # noqa: S607 — uv resolved from PATH by design "uv", "run", @@ -1603,7 +1673,7 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: "--host", "127.0.0.1", "--port", - "8001", + str(registry_port), ], stdout=log_file, stderr=subprocess.STDOUT, # Combine stderr with stdout @@ -1622,7 +1692,7 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: for attempt in range(max_retries): try: async with httpx.AsyncClient() as client: - response = await client.get("http://localhost:8001/applications", timeout=5.0) + response = await client.get(f"http://localhost:{registry_port}/applications", timeout=5.0) if response.status_code == 200: apps = response.json() logger.info( @@ -1653,7 +1723,7 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: # Check if all apps are ready (have tools loaded) # Note: Registry doesn't have /health endpoint, so we check /applications directly apps_response = await client.get( - "http://localhost:8001/applications", timeout=5.0 + f"http://localhost:{registry_port}/applications", timeout=5.0 ) if apps_response.status_code == 200: apps = 
apps_response.json() @@ -1801,7 +1871,10 @@ def rewrite_config_with_loader_domains(config_path: str, m3_data_loader: M3DataL return path -def expand_registry_config(config_path: str) -> str: +def expand_registry_config( + config_path: str, + capability_filter: Optional[List[str]] = None, +) -> str: """Expand registry config by replacing {domain} placeholders with actual domains and expanding environment variables. @@ -1811,6 +1884,14 @@ def expand_registry_config(config_path: str) -> str: Args: config_path: Path to the generic config file with {domain} placeholders + capability_filter: Optional list of source-yaml service names (e.g. + ``["m3_task_2"]``). When provided, services whose key is not in + this list are skipped before expansion. This prevents the + post-expansion collision guard from firing when two tasks share a + domain name (e.g. both ``m3_task_2`` and ``m3_task_3`` define + ``books``). Items that don't look like service-name filters + (UUIDs, ``hockey_395_0``-style test-case IDs) are ignored — pass + them through as-is. Returns: Path to the temporary expanded config file @@ -1834,8 +1915,27 @@ def expand_registry_config(config_path: str) -> str: services = config.get("services", []) expanded_services = [] + # Build the set of source-service-name filters from capability_filter. Items + # that look like UUIDs or test-case IDs (hockey_395_0) are not service-name + # filters and don't constrain the expansion at all. 
+ _service_filter: Optional[set] = None + if capability_filter: + import re as _re_cap + + _uuid_re = _re_cap.compile(r"^[a-f0-9]{12}-[a-f0-9]{12}$") + _testcase_re = _re_cap.compile(r"^[a-z_]+_\d+_\d+$") + cap_items = [f for f in capability_filter if not _uuid_re.match(f) and not _testcase_re.match(f)] + if cap_items: + _service_filter = set(cap_items) + logger.info( + f"Pre-expansion filter: only services matching {sorted(_service_filter)} will be expanded" + ) + for service_dict in services: service_name = list(service_dict.keys())[0] + if _service_filter is not None and service_name not in _service_filter: + logger.info(f" Skipping (filtered out): {service_name}") + continue service_config = service_dict[service_name] metadata = service_config.get("metadata", {}) @@ -1856,13 +1956,13 @@ def expand_registry_config(config_path: str) -> str: domain_name = domain_config.get("name") domain_multiturn = domain_config.get("multiturn") - # Prefix service name with task_id to avoid name collisions when multiple tasks - # share the same domain name (e.g. both task_1 and task_2 have "address"). - # The registry uses the service name as the unique app identifier. - # The registry strips this prefix before calling the MCP server tool, so the - # container always receives the original unprefixed tool name. - task_id_val = metadata.get("task_id", "unknown") - expanded_service_name = f"task_{task_id_val}_{domain_name}" + # The expanded service name is just the domain. The registry uses + # this as the unique app identifier and CombinedToolProvider prefixes + # each MCP tool with `_`, so CUGA's recorded tool names + # start with the bare domain (e.g. `codebase_comments_get_…`). + # Cross-task collisions (two tasks sharing a domain) are caught + # by the post-expansion check below. 
+ expanded_service_name = domain_name # Deep copy service config import copy @@ -1895,6 +1995,23 @@ def expand_registry_config(config_path: str) -> str: expanded_services.append(service_dict) logger.info(f" Kept as-is: {service_name}") + # Collision guard: detect duplicate expanded service names. Since we now use + # the bare domain as the service name, two tasks sharing a domain (e.g. + # both task_2 and task_3 have "books") would silently overwrite each other + # when the dict-list is dumped to yaml. Fail loudly instead — the caller + # should narrow to a single task with --capability before getting here. + from collections import Counter as _Counter + + _service_names = [list(s.keys())[0] for s in expanded_services] + _dups = sorted(n for n, c in _Counter(_service_names).items() if c > 1) + if _dups: + raise RuntimeError( + "Service-name collision in expanded registry config: " + f"{_dups}. This usually means multiple tasks share a domain name. " + "Narrow to a single task via --capability before expansion, " + "or differentiate the domain names in the source yaml." + ) + # Create temporary config file expanded_config = {"services": expanded_services} @@ -2061,8 +2178,13 @@ async def run_config_mode(args, container_runtime: str): rewritten_config_path = rewrite_config_with_loader_domains(args.from_config, m3_data_loader) source_config_path = rewritten_config_path - # Expand config if it contains {domain} placeholders - expanded_config_path = expand_registry_config(source_config_path) + # Expand config if it contains {domain} placeholders. Pre-filter source + # services by --capability so the bare-domain expanded names (e.g. + # `books` from m3_task_2 vs `books` from m3_task_3) can't collide in + # the same expanded yaml. UUID / hockey_395_0-style items in args.task + # don't constrain the source service set; they're filtered later. 
+ _capability_filter = list(args.task) if getattr(args, "task", None) else None + expanded_config_path = expand_registry_config(source_config_path, capability_filter=_capability_filter) temp_config_created = expanded_config_path != args.from_config # Check if registry mode is enabled @@ -2114,11 +2236,28 @@ async def run_config_mode(args, container_runtime: str): # Check if any filter looks like a test case name (contains domain_number_number pattern) test_case_pattern = r'^[a-z_]+_\d+_\d+$' + # Also accept the --m3-data UUID format (12hex-12hex), e.g. "1960f609e439-e5d337d143b6". + # When UUIDs are used, the user must also pass --domain to constrain which + # service these UUIDs come from (a UUID alone doesn't encode its domain). + uuid_filter_pattern = r'^[a-f0-9]{12}-[a-f0-9]{12}$' task_filters = [task_filter] if isinstance(task_filter, str) else task_filter is_test_case_filter = any(_re.match(test_case_pattern, tf) for tf in task_filters) - - if is_test_case_filter: + is_uuid_filter = any(_re.match(uuid_filter_pattern, tf) for tf in task_filters) + + if is_uuid_filter: + # UUID filter: skip domain extraction (caller must use --domain), + # set test_case_filter so the evaluator filters per-sample at the + # right point. Strip out items that aren't sample UUIDs (e.g. a + # capability name like "m3_task_2" passed alongside via + # --capability) — those don't match any sample_id and would just + # be dead weight inside the per-sample filter. Capability-name + # items are already handled by expand_registry_config's + # capability_filter and the service-name filter below. 
+ uuid_only_filters = [tf for tf in task_filters if _re.match(uuid_filter_pattern, tf)] + logger.info(f"Detected UUID-style test case filter: {uuid_only_filters}") + args.test_case_filter = uuid_only_filters + elif is_test_case_filter: # This is a test case filter - extract domain and pass to evaluator logger.info(f"Detected test case filter: {task_filters}") @@ -2492,16 +2631,21 @@ def detect_container_runtime(): # Task filtering. `--capability` is the preferred name when selecting a # service like `m3_task_2` / `m3_task_3`; `--task` is kept as an alias # for backward compatibility (it's referenced in README, other scripts, - # and older tooling). Both feed the same dest. + # and older tooling). Both feed the same dest via action='extend', so + # `--capability m3_task_2 --task ` appends both into args.task + # (the previous default `store` action made the second flag overwrite + # the first, which silently dropped one of the filters). parser.add_argument( "--capability", "--task", dest="task", type=str, nargs="*", - default=None, + action="extend", + default=[], help="Filter by capability/service name (e.g., 'm3_task_2') or by a " - "test-case ID (e.g., 'hockey_395_0'). Accepts multiple. " + "test-case ID (e.g., 'hockey_395_0' or M3-data UUID). Accepts " + "multiple values and multiple invocations (they're appended). " "Overrides --difficulty.", ) parser.add_argument( @@ -2566,6 +2710,14 @@ def detect_container_runtime(): "domain list is taken from the data source rather than the YAML " "config, so unlabeled test domains run without editing the config.", ) + parser.add_argument( + "--no-policies", + action="store_true", + help="Disable CUGA policies (mirrors benchmarks/bpo). 
When policies are enabled " + "(the default), they are loaded per-domain from " + "benchmarks/m3/policies/policies.json after the per-domain agent is " + "constructed.", + ) from benchmarks.helpers.logging_args import add_log_level_args, apply_log_level diff --git a/benchmarks/m3/m3_data_loader.py b/benchmarks/m3/m3_data_loader.py index bb64c1c..838e171 100644 --- a/benchmarks/m3/m3_data_loader.py +++ b/benchmarks/m3/m3_data_loader.py @@ -262,8 +262,21 @@ def load_domain(self, task_id: int, domain: str) -> List[Dict[str, Any]]: def strip_registry_prefix(name: str, task_id: int, domain: str) -> str: - """Strip the `task___` prefix the registry adds.""" - prefix = f"task_{task_id}_{domain}_" + """Strip the app-name prefix the registry adds. + + Current layout: the registry app_name is just the domain, so tool names + arrive as ``_``. We strip a single ``_`` + prefix when present. + + The ``task_id`` argument is retained for source-compatibility with older + bundles (where the prefix was ``task___``); we also try + that legacy form so this function correctly normalises both new and old + saved data. + """ + legacy_prefix = f"task_{task_id}_{domain}_" + if name.startswith(legacy_prefix): + return name[len(legacy_prefix) :] + prefix = f"{domain}_" if name.startswith(prefix): return name[len(prefix) :] return name diff --git a/benchmarks/m3/m3_vakra_score.py b/benchmarks/m3/m3_vakra_score.py index ee3f3b4..1d38fba 100644 --- a/benchmarks/m3/m3_vakra_score.py +++ b/benchmarks/m3/m3_vakra_score.py @@ -79,12 +79,19 @@ def capability_name_for_task_id(task_id: Any) -> Optional[str]: def _strip_registry_prefix(name: str) -> str: - """Strip a leading ``task___`` registry prefix if present. - - The registry server (benchmarks/m3/run_registry.sh) renames each capability - container's MCP tools as ``task___``. The - underlying MCP server itself exposes the long auto-generated operation_id - (e.g.
``get_players_by_position_no_shoot_catch_v1_hockey_players_by_position_no_shoot_catch_get``). + """Strip a leading registry app-name prefix if present. + + Current layout: the registry app_name is just the domain, so tool names + arrive as ``_`` — but the domain can itself contain + underscores (``codebase_comments``, ``world_development_indicators``…), + so a plain regex can't say where the prefix ends. We leave the + domain-only prefix to be resolved by :func:`_match_live_name` (suffix + match against the live MCP tool list). + + The legacy layout used ``task___``; bundles + saved before the prefix-removal change still carry that form. We strip + that regex deterministically when present so old data continues to + score correctly. """ global _REGISTRY_PREFIX_RE if _REGISTRY_PREFIX_RE is None: @@ -110,16 +117,19 @@ def _collect_tool_names(dialogues: List[Dict[str, Any]]) -> List[str]: def _match_live_name(name: str, live_tool_names: List[str]) -> Optional[str]: - """Resolve ``name`` to a live MCP tool name, accounting for the three - naming conventions in play: + """Resolve ``name`` to a live MCP tool name, accounting for the naming + conventions in play: - - registry-prefixed: ``task___`` (what the agent records) + - registry-prefixed (legacy): ``task___`` + - registry-prefixed (current): ``_`` - short form: ```` (what the capability container often exposes) - long form: ``_v1__<...>_get`` (what the zip's gold_sequence carries — FastAPI auto-generated operation_id) - The matcher tries exact, then directional prefix matches in both directions, - so a live "short" name resolves both registry-prefixed and long-form inputs. + The matcher tries exact, then directional prefix matches in both + directions, then a final suffix match so a live long-form name resolves + a domain-only-prefixed input (the current layout after dropping the + ``task__`` segment from the registry app name). 
""" if name in live_tool_names: return name @@ -127,21 +137,35 @@ def _match_live_name(name: str, live_tool_names: List[str]) -> Optional[str]: if stripped != name and stripped in live_tool_names: return stripped - candidates: List[str] = [] + forward_candidates: List[str] = [] + suffix_candidates: List[str] = [] for ln in live_tool_names: # Live name is the canonical short form, name extends it (long form). if name.startswith(ln + "_"): - candidates.append(ln) + forward_candidates.append(ln) # Live name extends the (possibly stripped) input (live is long form). elif ln.startswith(stripped + "_"): - candidates.append(ln) - - if not candidates: - return None - if len(candidates) == 1: - return candidates[0] - # Tie-break: shortest match — closest to the canonical short form. - return min(candidates, key=len) + forward_candidates.append(ln) + # Input is the live name preceded by a registry app-name prefix — + # the current bare-domain layout (e.g. ``codebase_comments_get_X`` + # → ``get_X``). Take the longest matching suffix on tie-break: + # it's the most specific live tool the input could refer to. + elif name.endswith("_" + ln): + suffix_candidates.append(ln) + + # Prefer forward matches (existing semantics): shortest = closest to the + # canonical short form. + if forward_candidates: + if len(forward_candidates) == 1: + return forward_candidates[0] + return min(forward_candidates, key=len) + # Fall back to suffix matches (new path for bare-domain registry prefix): + # longest = most specific live name reachable from the tail of the input. 
+ if suffix_candidates: + if len(suffix_candidates) == 1: + return suffix_candidates[0] + return max(suffix_candidates, key=len) + return None def _build_name_map( diff --git a/benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md b/benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md new file mode 100644 index 0000000..babcdd8 --- /dev/null +++ b/benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md @@ -0,0 +1,60 @@ +--- +id: output_formatter_single_tool_fact_citation +type: output_formatter +name: P-OF-1 — Single-Tool-Fact Citation +description: Single-fact answers must cite the originating API/tool as the source of the value. +priority: 100 +enabled: true +format_type: markdown +triggers: + - type: natural_language + target: agent_response + case_sensitive: false + operator: or + value: + - the response answers a single-fact question (number, name, date, identifier, percentage, ratio, or single-row attribute) that was retrieved from a tool or API call + - the response cites a value taken directly from a tool result + - the answer reports a single value retrieved from one or more data-fetching tools +--- + +# P-OF-1 — Single-Tool-Fact Citation + +## Policy + +When the assistant answers a question whose answer is a **single fact** — a number, name, date, identifier, percentage, ratio, or single-row attribute — and that fact was obtained from a tool/API call, the assistant must cite the originating tool/API as the source of the value in the final answer. + +## Rationale + +This policy enforces the standard data-provenance requirement that all analytical or dashboard-style answers carry an audit trail. Across the regulated and reporting-driven contexts this assistant is deployed in — financial dashboards, healthcare analytics, sports statistics, public-development indicators, academic citation, e-commerce reporting — every numeric or factual claim in a response must be traceable to its system of record. 
Without source attribution, downstream consumers cannot verify the figure, replicate the query, or assess the freshness of the data. + +The policy applies uniformly across all dashboard-API and multi-hop-reasoning workflows, regardless of the underlying domain (publications, sports, geography, education, e-commerce, etc.). + +## Format requirement + +The final answer to a single-fact question must include source attribution in one of the following equivalent forms (the assistant may choose the most natural style for the answer): + +1. **Inline citation** — `". Source: ."` + Example: *"The Adjusted net enrolment rate for Algeria from 1975 to 1980 averages 77.0. Source: `get_adjusted_net_enrolment_avg`."* +2. **Natural-language attribution** — `"Per , ."` + Example: *"Per the World Development Indicators API, the average is 77.0."* +3. **Parenthetical citation** — `" (from )."` + +The cited tool name should be the actual API/tool the assistant invoked to retrieve the value. If multiple tools contributed, cite the tool whose response directly produced the cited value. + +## Scope + +- **Applies** when the answer's value originates from a single tool/API call. +- **Applies** to single-fact answers in `capability_2_dashboard_apis` and `capability_3_multihop_reasoning` workflows, across all 17 covered domains (authors, books, codebase_comments, hockey, mondial_geo, movie_platform, professional_basketball, soccer_2016, student_loan, talkingdata, beer_factory, college_completion, computer_student, disney, trains, university, world_development_indicators). +- **Does not apply** to general explanations or definitional answers not tied to a specific data retrieval. +- **Does not apply** to aggregated values whose provenance spans multiple tools (those are governed by a separate citation policy if and when one is added). + +## Examples + +- ✓ "Per the books API, *Hyperion* was published in 1989."
+- ✓ "There are 3 ICRA papers from 2012 (source: `get_conference_short_name_most_papers_v1`)." +- ✗ "There are 3 ICRA papers from 2012." (no source attribution — fails policy) +- ✗ "The most popular conference in 2012 was ICRA, based on the available data." (vague — fails policy) + +## Reformatting instruction (LLM-facing) + +If the agent's draft final answer reports a single fact retrieved from a tool, rewrite it so that the originating tool name (and, where applicable, the result field or data system) is cited in the answer. Use the most natural of the three formats above. Do **not** invent tool names that were not actually called in the current conversation; if the originating tool name is unavailable, cite the data system or capability instead (e.g., "the dashboard API" or "the world development indicators dataset"). Do not change the factual value itself. diff --git a/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled b/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled new file mode 100644 index 0000000..e20dfb8 --- /dev/null +++ b/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled @@ -0,0 +1,69 @@ +--- +id: output_formatter_strip_hedging +type: output_formatter +name: P-OF-2 — Strip Hedging and Unsolicited Meta-Commentary +description: When the answer contains a resolved value, strip hedging language, dataset meta-commentary, and unsolicited "For context" appendices. 
+priority: 90 +enabled: true +format_type: markdown +triggers: + - type: natural_language + target: agent_response + case_sensitive: false + operator: or + threshold: 0.6 + value: + - the response contains hedging language such as "upper bound", "may be lower", "approximately", "cannot be completed", "this figure is", "an estimate", "the result may include" + - the response contains an unsolicited context appendix beginning with "For context", "Note that", "It is worth mentioning", or similar + - the response contains dataset meta-commentary such as "the dataset does not provide", "this dataset is about X not Y", "no tool was found to compute" + - the response answers a single factual question but appends caveats, alternative interpretations, or runner-up explanations the user did not ask for +--- + +# P-OF-2 — Strip Hedging and Unsolicited Meta-Commentary + +## Policy + +When the assistant has already resolved a factual question and identified the answer (a number, name, date, identifier, percentage, list, or single-row attribute), the final response must contain only the resolved answer plus the source citation required by [[output_formatter_single_tool_fact_citation]]. Hedging language, unsolicited context appendices, and dataset meta-commentary must be removed. + +## Rationale + +This policy enforces the standard reporting requirement that analytical responses be **concise and decisive**. In a dashboard, executive-reporting, or audit context, downstream consumers read the answer at the top of the response; trailing hedges and "for context" tangents add noise, dilute confidence, and (in some regulated settings) muddy the audit trail by mixing factual claims with the assistant's own commentary. + +The policy applies uniformly across all dashboard-API and multi-hop-reasoning workflows: the only exception is when the user *explicitly* asks for context, caveats, or alternatives. 
+ +## What to strip (and why) + +The following clause types must be removed when a resolved answer is present in the same response: + +1. **Hedging language** — "this figure is an upper bound", "may be lower than the true value", "approximately", "cannot be completed because…", "the result may include…", "this is an estimate". +2. **"For context" appendices** — paragraphs beginning with *"For context"*, *"Note that"*, *"It is worth mentioning"*, *"By way of comparison"*, *"In addition"*, etc., that introduce information the user did not ask for. +3. **Dataset meta-commentary** — statements about the dataset itself rather than the answer ("the dataset does not provide a tool to do X", "this dataset is about cricket not soccer", "no tool was found to compute Y when the question asked about Z"). +4. **Runner-up enumeration** — sentences listing alternatives the user did not ask about (e.g., listing the second- and third-place candidates when the user asked for the single top item; see [[playbook_no_enumeration]] for the complementary planner-side rule). +5. **Self-doubt prefaces** — "I am not entirely sure but…", "based on the limited data I have…", "this might not be exactly what you asked for". + +## What to keep + +- The resolved answer itself (value, units, name, list). +- The source citation required by P-OF-1. +- Any clarification the user *explicitly* asked for in the same turn. +- Genuine refusal — if the response is "I do not have enough data to answer", keep that and do **not** force a fabricated answer. + +## Format requirement + +Rewrite the response into the shortest form that contains: +1. The resolved factual answer. +2. The P-OF-1-required source citation. +3. (Optional) a single clarifying clause that names the filter applied (e.g., "for year 1995"). + +Strip everything else. + +## Examples + +- ✗ Original: *"The country with the most umpires is the United States with 47 umpires. 
However, this figure is an upper bound because the dataset only records umpires who officiated at least one international match. For context, England follows closely with 41 umpires."* + ✓ Rewritten: *"The country with the most umpires is the United States with 47 umpires (source: `get_country_umpire_counts`)."* +- ✗ Original: *"BYU-Idaho had a graduation rate of 36%. The dataset does not provide a separate tool for online-only programs; this figure includes all enrolled students. For context, BYU-Provo's rate was 78%."* + ✓ Rewritten: *"BYU-Idaho had a graduation rate of 36% (source: `get_graduation_rate_by_institution`)."* + +## Reformatting instruction (LLM-facing) + +If the assistant's draft response contains a resolved factual answer along with hedging clauses, unsolicited context appendices, dataset meta-commentary, runner-up enumeration, or self-doubt prefaces, rewrite the response to emit **only** the resolved answer and the required source citation. Do not invent or alter the resolved value. If the response is a genuine refusal because the data is unavailable, keep it as a refusal and do not invent an answer. diff --git a/benchmarks/m3/policies/P-PB-1-no-enumeration.md b/benchmarks/m3/policies/P-PB-1-no-enumeration.md new file mode 100644 index 0000000..2ff70a7 --- /dev/null +++ b/benchmarks/m3/policies/P-PB-1-no-enumeration.md @@ -0,0 +1,57 @@ +--- +id: playbook_no_enumeration +type: playbook +name: P-PB-1 — No Enumeration When a Single Item Is Asked +description: When the question requests a single item, return only that item; do not enumerate runners-up or alternatives. 
+priority: 100 +enabled: true +triggers: + - type: natural_language + target: intent + case_sensitive: false + operator: or + threshold: 0.6 + value: + - the user asks for a single item (which X, the X with the most/least Y, the top X, the highest, the largest, the smallest) + - the user asks "which conference", "which company", "which country", "which city", "which solution", "which repository", or another singular "which" form + - the user asks "what is the X" expecting a single value or single named entity + - the user asks for "the" specific item (the city of the lake at coordinates Y, the solution path with the highest processed time) +--- + +# P-PB-1 — No Enumeration When a Single Item Is Asked + +## Policy + +When the user's question requests a **single item or value** (singular phrasing such as *"which conference"*, *"the city with the most"*, *"the highest"*, *"the top"*, *"the largest"*, *"the smallest"*), the assistant must return only that single item. Listing runners-up, near-misses, alternatives, or "Top N" enumerations the user did not request is prohibited. + +## Rationale + +This is a basic answer-shape requirement for analytical and dashboard-style responses. When a stakeholder asks "which region had the most sales last quarter?", they expect a single region as the answer — not a leaderboard. Enumerating alternatives: + +- Increases the cognitive load on the reader, who has to find the answer inside a list. +- Risks downstream misuse (the reader may pick the wrong row). +- In audit and reporting contexts, dilutes the decision the answer is supposed to support. + +## Required behaviour + +Before producing the final answer: + +1. Detect whether the user's intent is singular ("which X", "the X with the most/least Y", "the highest", "the top", "the largest") or plural ("list all X", "which X meet condition Y", "show me the X's that…"). +2. 
If singular: return only the resolved single item, with the source citation required by [[output_formatter_single_tool_fact_citation]]. +3. If plural: enumerate as requested. + +Do not include runners-up "for context", do not include "Top 3" when only the top 1 was requested, do not include "the next-best alternative is …". + +## Examples + +- ✗ Question: *"In the year 2012, which conference had the most papers presented?"* + ✗ Wrong: *"The conference with the most papers in 2012 was ICRA. The next two were CVPR and NeurIPS."* + ✓ Right: *"In 2012, ICRA had the most papers presented (source: `get_conference_short_name_most_papers_v1`)."* +- ✗ Question: *"The city of the lake at (-85.35, 11.6)?"* + ✗ Wrong: *"The city is Granada. Nearby cities include Rivas and Masaya, which also border the lake."* + ✓ Right: *"The city is Granada (source: `get_city_by_lake_coordinates`)."* +- ✓ Question: *"List all books published in 1995"* — enumeration is requested, so a list is the correct shape. + +## Interaction with other policies + +This playbook complements [[output_formatter_strip_hedging]] (which strips runner-up clauses *after* the response is drafted) by preventing the planner from collecting the enumeration in the first place. Both can fire on the same case: this one shapes the upstream plan; the OutputFormatter cleans up if any slipped through. diff --git a/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md b/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md new file mode 100644 index 0000000..3ae54aa --- /dev/null +++ b/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md @@ -0,0 +1,84 @@ +--- +id: playbook_one_composite_tool_no_corroboration +type: playbook +name: P-PB-2 — One Composite Tool, No Corroboration +description: For percentage, ratio, and proportion questions, use the single endpoint that returns the composite value directly; do not also call the raw component tools to corroborate it. 
+priority: 100 +enabled: true +triggers: + - type: natural_language + target: intent + case_sensitive: false + operator: or + threshold: 0.6 + value: + - the user asks for a percentage (forks-to-stars %, conversion rate, success rate, ratio, proportion, share) + - the user asks for a ratio of two quantities (X-to-Y ratio, X per Y, X over Y) + - the user asks for an aggregate metric (average, mean, total) that a single endpoint returns directly + - the user asks for a "percentage difference" or "% change" or "% of total" or similar composite metric + - type: keyword + target: intent + case_sensitive: false + operator: or + value: + - percentage + - "%" + - ratio + - proportion + - share of + - per cent + - rate of + - conversion rate +--- + +# P-PB-2 — One Composite Tool, No Corroboration + +## Policy + +When a single endpoint returns the composite metric the user asked for (percentage, ratio, proportion, share, aggregate), the assistant must: + +1. Call only that endpoint. +2. Report the returned value (subject to [[output_formatter_single_tool_fact_citation]] for source attribution). + +The assistant must **not** also call the raw component endpoints (the numerator and denominator tools) to re-derive or "double-check" the composite value. + +## Rationale + +This policy enforces two related principles from analytical and dashboard reporting: + +1. **Source-of-truth discipline.** When the data system exposes a tool that returns the composite metric directly, that tool is the source of truth. Re-deriving the value from component tools introduces consistency risk (numerator and denominator may be computed over different time windows, populations, or filters than the composite tool uses) and produces an answer that is *less trustworthy*, not more. +2. **Tool-call frugality.** Each extra tool call costs LLM tokens, latency, and (for paid APIs) money. When the answer is already in hand from the composite tool, additional calls add no value. 
+ +## Required behaviour + +For percentage / ratio / proportion / aggregate questions: + +1. **Identify the composite tool first** — the tool whose name and description directly match the requested metric (e.g., `get_forks_to_stars_percentage`, `get_conversion_rate`, `get_average_X`, `get_X_per_Y`). +2. **Call only that tool** with the appropriate parameters. +3. **Report the returned value** with the source citation required by P-OF-1. + +Explicitly forbidden: +- Calling `get_repo_forks` and `get_repo_stars` separately, then dividing, **when** `get_forks_to_stars_percentage` exists. +- Calling `get_total_X` and `get_count_X` separately to compute an average, **when** `get_average_X` exists. +- Re-running the composite tool with the same arguments to "verify" the value. + +## Exceptions + +This policy does **not** apply when: +- No composite tool exists for the requested metric (then the assistant must compute it from components — that is the only path). +- The user explicitly asks for the component values *as well as* the composite ("give me the forks count, stars count, and forks-to-stars percentage"). +- The composite tool returned a clearly invalid value (HTTP error, type-validation failure) — then the assistant may fall back to components and must say so. + +## Examples + +- ✗ Question: *"What is the forks-to-stars percentage for solution 104086?"* + ✗ Wrong: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Then also call `get_repo_forks` and `get_repo_stars` to "double-check". Then report `0 forks / 1 star = 0.00%`, confirmed by `get_forks_to_stars_percentage`.* + ✓ Right: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Report: *"The forks-to-stars percentage for solution 104086 is 0.00% (source: `get_forks_to_stars_percentage`)."** +- ✗ Question: *"Average net enrolment rate for Algeria 1975–1980?"* + ✗ Wrong: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0.
Then also call `get_enrolment_rate(year=1975)`, …, `get_enrolment_rate(year=1980)` and average them yourself.* + ✓ Right: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. Report once with citation.* + +## Interaction with other policies + +- [[playbook_no_idempotent_retries]] forbids calling the same tool with the same arguments twice; this policy forbids calling **redundant** tools after a composite tool has already answered. +- [[output_formatter_single_tool_fact_citation]] handles the source-citation requirement for the single composite tool's value. diff --git a/benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md b/benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md new file mode 100644 index 0000000..c105ab0 --- /dev/null +++ b/benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md @@ -0,0 +1,64 @@ +--- +id: playbook_no_idempotent_retries +type: playbook +name: P-PB-3 — No Idempotent Retries +description: Do not re-invoke a tool that returned a deterministic value with the same arguments during the same turn. +priority: 100 +enabled: true +triggers: + - type: natural_language + target: intent + case_sensitive: false + operator: or + threshold: 0.5 + value: + - any question whose answer is retrieved by a deterministic data-fetching tool (lookup, count, average, sum, ratio, attribute fetch) + - a single-fact or single-list question that resolves via one tool call +--- + +# P-PB-3 — No Idempotent Retries + +## Policy + +Once a tool has returned a non-error value for a given set of arguments during a turn, the assistant must not re-invoke that same tool with the same arguments to re-fetch, verify, or "double-check" the value. + +## Rationale + +The data-fetching tools in this benchmark and in the standard analytics/dashboard contexts are **deterministic**: calling `get_repo_stars(solution_id=83855)` twice in the same minute returns the same value. 
Re-invoking such a tool: + +- Adds latency to the final answer with zero information gain. +- Costs LLM tokens (the agent has to parse the duplicate response and explain why it called the tool a second time). +- For paid APIs, costs money. +- Creates a misleading audit trail in which the system-of-record appears to have been queried multiple times for a single decision. + +This policy is the runtime guard. The planner-level [[playbook_one_composite_tool_no_corroboration]] addresses a related but distinct anti-pattern (calling *different* tools to corroborate); P-PB-3 specifically addresses calling the *same* tool repeatedly. + +## Required behaviour + +For each turn, the assistant must: + +1. Maintain awareness of the (tool_name, arguments) pairs already called in this turn. +2. If the planner or reflection step proposes a tool call with the same (tool_name, arguments) as one already executed and the prior result was not an error, **skip the call** and re-use the prior result. +3. If the planner proposes the same tool with **different arguments**, that is allowed (it is not the same call). +4. Once the answer is derivable from the calls already made, emit the final answer and end the turn. + +## Exceptions + +This policy does **not** apply when: +- The prior call returned an error or a transport-level failure (HTTP 5xx, timeout, schema-validation error). Retrying after an error is allowed. +- The prior call was made with materially different arguments (different filter, different time window, different ID). +- The user explicitly asks for a re-fetch ("re-query the API and confirm the current value"). +- The tool is documented as non-deterministic (e.g., a tool that returns a sampled or time-of-day-dependent value). None of the M3 capability_2_dashboard_apis or capability_3_multihop_reasoning tools meet this criterion. 
+ +## Examples + +- ✗ Question: *"What are the solution ids for repositories with 238 forks?"* + ✗ Wrong: *Call `get_solution_ids_by_repo_forks(forks=238)` → ["62258", "258160"]. Then, "to verify", call `get_repo_forks(solution_id=62258)`, `get_repo_forks(solution_id=258160)`, then again `get_solution_ids_by_repo_forks(forks=238)`.* + ✓ Right: *Call `get_solution_ids_by_repo_forks(forks=238)` → ["62258", "258160"]. Report: *"The solution ids are 62258 and 258160 (source: `get_solution_ids_by_repo_forks`)."** +- ✗ Wrong: *Calling `get_average_processed_time(url=X)` twice in the same turn to "confirm" the average.* + ✓ Right: *Single call; emit result.* + +## Interaction with other policies + +- [[playbook_one_composite_tool_no_corroboration]] forbids calling redundant **different** tools to corroborate; this policy forbids calling the **same** tool twice with the same arguments. +- [[output_formatter_strip_hedging]] cleans up "to verify, I also ran the call again" prose if any slips through. diff --git a/benchmarks/m3/policies/P-PB-4-validation-error-recovery.md b/benchmarks/m3/policies/P-PB-4-validation-error-recovery.md new file mode 100644 index 0000000..2a7474d --- /dev/null +++ b/benchmarks/m3/policies/P-PB-4-validation-error-recovery.md @@ -0,0 +1,89 @@ +--- +id: playbook_validation_error_recovery +type: playbook +name: P-PB-4 — Validation-Error Recovery +description: When a tool returns a parameter-validation error, diagnose from the tool's schema, recover the missing or wrong-typed argument from prior responses, and retry once — instead of abandoning, randomly retrying, or pivoting to a worse tool. 
+priority: 110 +enabled: true +triggers: + - type: keyword + target: chat_messages + case_sensitive: false + operator: or + value: + - Input validation error + - is a required property + - is not of type 'integer' + - is not of type 'string' + - is not of type 'number' + - is not of type 'array' + - validation error + - type: natural_language + target: chat_messages + case_sensitive: false + operator: or + threshold: 0.65 + value: + - a previous tool call returned an input-validation error about a required property being missing + - a previous tool call returned a type-mismatch error such as "X is not of type 'integer'" or "X is not of type 'string'" + - a recent tool call failed because of missing or wrong-typed arguments, not because the underlying data is unavailable +--- + +# P-PB-4 — Validation-Error Recovery + +## Policy + +When a tool call returns a parameter-validation error — most commonly `"Input validation error: 'X' is a required property"` or `"'Y' is not of type 'integer'"` — the planner must perform a one-time **structured recovery** rather than abandoning the call, randomly retrying, or pivoting to a different (and usually worse) tool. + +## Rationale + +In analytical and dashboard workflows, the difference between a successful run and a failed run is often a single malformed argument: the right tool was selected, but the planner passed `path=[]` instead of `path="x.sln"`, or omitted a parameter that the API requires. **The data is reachable. The agent just needs to fix the call.** + +The typical observed behaviours when CUGA receives a validation error are: + +1. Try the same tool with a permuted but still-wrong set of arguments (wastes calls). +2. Pivot to a different tool that looks similar by name (often a worse match for the question). +3. Conclude the data is unavailable and emit a refusal.
+ +All three pollute the trajectory with failed calls without retrieving the data the next step needs — and (importantly for downstream consumers and audit logs) without producing the successful tool responses that establish the answer's provenance. A clean recovery puts the right value back on the table. + +## Required behaviour + +When a tool result contains a parameter-validation error: + +1. **Identify the failing parameter.** Parse the error message to extract: + - The parameter name (e.g., `'path'`, `'summary'`, `'processed_time'`, `'solution_id'`). + - The failure kind: *missing required property*, *wrong type* (`is not of type 'integer'`, `is not of type 'string'`, etc.), or *invalid value*. +2. **Recover the value.** Search prior tool responses in the conversation for a field whose key or content matches the failing parameter. Typical sources: + - A list-returning tool whose single element is the value (e.g., a previous call returned `{"solution_paths": ["x.sln"]}` and the failing parameter is `path` — pass `"x.sln"`). + - A scalar field with a matching name (e.g., a previous call returned `{"solution_id": 45997}` and the failing parameter is `solution_id`). + - A typed value that needs coercion (e.g., a string `"636449700980488000"` when the schema requires an integer — coerce to `int("636449700980488000")`). +3. **Retry, once.** Re-invoke the same tool with the recovered value substituted into the failing parameter. Keep the rest of the arguments unchanged. +4. **If the retry also fails**, do **not** call the tool a third time and do **not** pivot to a generic detail tool to "discover" the value indirectly. Emit a final answer based on the data already in hand, or a clear refusal if no answer is supportable. See [[playbook_no_idempotent_retries]] for the same-call-twice rule (this policy is the explicit exception, because the arguments change).
+ +## Common parameter-recovery patterns + +These cover the validation errors seen most often in `capability_2_dashboard_apis` and `capability_3_multihop_reasoning`: + +| Validation error | Where to look for the value | What to pass | +| --- | --- | --- | +| `'path' is a required property` | Prior responses with `paths`, `solution_paths`, or `solution_path` field | The single element / first element | +| `'summary' is a required property` | Prior responses with `summary`, `description`, or `body` field | Pass it through | +| `'X_id' is a required property` | Prior `X_id`, `id` (in an object containing X), or a "by name → id" lookup tool's response | The integer ID | +| `X is not of type 'integer'` | Same value is in hand; it's just typed as a string | Coerce to integer | +| `X is not of type 'string'` | Same value is in hand; it's typed as a list or number | First element of list, or `str(number)` | +| `X is not of type 'array'` | A scalar is in hand and the API wants a list | Wrap in a single-element list | + +## What this playbook does NOT permit + +- Calling the failing tool a third time after the first retry also fails. +- **Manufacturing** a parameter value not present in any prior tool response (do not invent IDs, paths, or timestamps to satisfy the schema). +- Treating "tool not found", "404", "500", or timeout errors as validation errors — those are different recovery scenarios and this policy does not apply. +- Suppressing the validation error from the final answer if no recovery succeeded — be honest about what was retrieved and what wasn't. + +## Interaction with other policies + +- [[playbook_no_idempotent_retries]] (P-PB-3) forbids a second identical call; this policy is explicitly the *one* permitted retry, because the arguments change between attempts. +- [[playbook_one_composite_tool_no_corroboration]] (P-PB-2) takes precedence in *choosing* the right tool; this policy operates **after** the right tool has been chosen and only needs argument repair. 
+- [[output_formatter_single_tool_fact_citation]] (P-OF-1) applies normally to the final answer; the recovered (now-successful) call is the citable source. +- This policy has higher priority (110) than the default planning playbooks (100) because it is a corrective action that should pre-empt re-exploration. diff --git a/benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md b/benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md new file mode 100644 index 0000000..d69da24 --- /dev/null +++ b/benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md @@ -0,0 +1,47 @@ +--- +id: tool_guide_mountain_count_most_populous_country +type: tool_guide +name: P-TG-1 — `get_mountain_count_most_populous_country` Disambiguation +description: Clarifies that `get_mountain_count_most_populous_country` is the right tool when the user asks for mountains in the country with the largest/greatest/most population. +priority: 100 +enabled: true +prepend: true +target_tools: + - get_mountain_count_most_populous_country +triggers: [] +--- + +# P-TG-1 — `get_mountain_count_most_populous_country` Disambiguation + +## Policy + +This `ToolGuide` enriches CUGA's view of the `get_mountain_count_most_populous_country` tool description so the shortlister surfaces it for the right intents and so the planner does not compose it with unrelated geography tools. + +## Rationale + +When the user asks *"How many mountains are in the most populous country?"* (or a paraphrase such as *"the country with the largest population"*, *"the country with the most people"*, *"the country with the greatest population"*), the right tool is the single composite endpoint `get_mountain_count_most_populous_country`. The CUGA shortlister can miss this because the user's phrasing uses *"most populous"* / *"largest population"* while the tool name encodes the same concept differently. 
+ +When the shortlister misses the composite tool, CUGA tends to compose two unrelated tools (a population-ranking tool plus a mountain-counting tool keyed by country), which: + +- Costs extra LLM calls and tool calls. +- Produces a brittle chain that can fail if the population-ranking tool's country naming does not match the mountain-counting tool's country naming. +- Risks a wrong answer if the population tool returns a list of "most populous" countries while the question asks for *the* single country. + +## What this ToolGuide adds + +The following content is prepended to the tool's stored description so that the shortlister's embedding match and the planner's prompt both see it: + +**Use this tool when the user asks for:** +- The mountain count of the most populous country +- The number of mountains in the country with the largest, greatest, or highest population +- "How many mountains" combined with "most people", "biggest population", "most populated country" + +**Do NOT compose with city-population, country-population-ranking, or per-country mountain-listing tools.** This single tool returns the answer directly. + +**Returns:** a single integer — the count of mountains in the country with the largest population. + +## Scope and limits + +This `ToolGuide` only changes CUGA's internal view of the tool's description (per the `ToolGuide` policy mechanism). It does **not** modify the upstream MCP tool definition, which is part of the benchmark and remains untouched. + +The policy is narrow on purpose: it targets exactly one tool (the one CUGA missed in the analyzed PF case). If the same disambiguation pattern recurs for other composite tools, additional `ToolGuide` policies should be added for each rather than broadening this one. 
diff --git a/benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md b/benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md new file mode 100644 index 0000000..b9be238 --- /dev/null +++ b/benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md @@ -0,0 +1,45 @@ +--- +id: tool_guide_country_with_most_umpires_returns_id +type: tool_guide +name: P-TG-2 — `get_country_with_most_umpires` Returns an ID, Not a Name +description: Clarifies that `get_country_with_most_umpires` returns a numeric country ID and must be chained with a name-lookup tool before the country can be reported by name. +priority: 100 +enabled: true +prepend: true +triggers: [] +target_tools: + - get_country_with_most_umpires +--- + +# P-TG-2 — `get_country_with_most_umpires` Returns an ID, Not a Name + +## Policy + +This `ToolGuide` enriches CUGA's view of `get_country_with_most_umpires` so the planner knows the response is a country **ID** (an integer key into the country table) and not a country **name**. The planner is instructed to chain the result through `get_country_name_by_id` when the user has asked for a country by name. + +## Rationale + +When the user asks *"From which country are the most umpires?"*, they expect a country name in the answer (e.g., "England", "Australia"). The composite tool `get_country_with_most_umpires` is the correct entry point — it returns the answer directly — but its response shape is `{country_id: , umpire_count: }`. Without this disambiguation, CUGA tends to either: + +- Emit the raw ID as the answer ("The country with the most umpires has ID 1, with 27 umpires."), which is technically true but useless to the reader. +- Conclude that the dataset "does not provide a tool to translate ID 1 into a name" and refuse — wrong, because `get_country_name_by_id` exists and is the obvious next call. 
+ +Both behaviours fail an analytics-style reader's basic expectation that a "which country" question is answered with a country name. + +## What this ToolGuide adds + +The following content is prepended to the tool's stored description so the planner sees it before it sends the response back to the user: + +**Return shape:** `{country_id: , umpire_count: }` — the `country_id` is a numeric primary key, NOT a country name. + +**Required follow-up when the user asked for a country by name:** chain with `get_country_name_by_id(country_id=)` to translate the ID into a name before producing the final answer. + +**Do NOT report the raw `country_id` to the user when they asked for the country itself.** Reporting "ID = 1" to a user who asked "which country" is a policy violation: see [[output_formatter_strip_hedging]] for the answer-shape requirement and [[output_formatter_single_tool_fact_citation]] for the citation requirement. + +**Do NOT refuse with "the dataset does not provide a name lookup tool"** — `get_country_name_by_id` exists in the same capability and is the canonical name lookup. + +## Scope and limits + +This `ToolGuide` only changes CUGA's internal view of the tool description (per the `ToolGuide` policy mechanism). It does **not** modify the upstream MCP tool definition, which is part of the benchmark and remains untouched. + +The same pattern (composite tool returns an ID, name lookup lives in a separate tool) likely recurs across the `capability_2_dashboard_apis` and `capability_3_multihop_reasoning` domains. If new "X with most Y returns an ID" cases turn up, add a focused `ToolGuide` per tool rather than broadening this one — narrow, tool-specific guides are easier to debug than a single sprawling rule. 
diff --git a/benchmarks/m3/policies/policies.json b/benchmarks/m3/policies/policies.json new file mode 100644 index 0000000..6404ff1 --- /dev/null +++ b/benchmarks/m3/policies/policies.json @@ -0,0 +1,177 @@ +[ + { + "id": "output_formatter_single_tool_fact_citation", + "type": "output_formatter", + "name": "P-OF-1 — Single-Tool-Fact Citation", + "description": "Single-fact answers must cite the originating API/tool as the source of the value.", + "priority": 100, + "enabled": true, + "format_type": "markdown", + "triggers": [ + { + "type": "natural_language", + "target": "agent_response", + "case_sensitive": false, + "operator": "or", + "value": [ + "the response answers a single-fact question (number, name, date, identifier, percentage, ratio, or single-row attribute) that was retrieved from a tool or API call", + "the response cites a value taken directly from a tool result", + "the answer reports a single value retrieved from one or more data-fetching tools" + ] + } + ], + "format_config": "# P-OF-1 — Single-Tool-Fact Citation\n\n## Policy\n\nWhen the assistant answers a question whose answer is a **single fact** — a number, name, date, identifier, percentage, ratio, or single-row attribute — and that fact was obtained from a tool/API call, the assistant must cite the originating tool/API as the source of the value in the final answer.\n\n## Rationale\n\nThis policy enforces the standard data-provenance requirement that all analytical or dashboard-style answers carry an audit trail. Across the regulated and reporting-driven contexts this assistant is deployed in — financial dashboards, healthcare analytics, sports statistics, public-development indicators, academic citation, e-commerce reporting — every numeric or factual claim in a response must be traceable to its system of record. 
Without source attribution, downstream consumers cannot verify the figure, replicate the query, or assess the freshness of the data.\n\nThe policy applies uniformly across all dashboard-API and multi-hop-reasoning workflows, regardless of the underlying domain (publications, sports, geography, education, e-commerce, etc.).\n\n## Format requirement\n\nThe final answer to a single-fact question must include source attribution in one of the following equivalent forms (the assistant may choose the most natural style for the answer):\n\n1. **Inline citation** — `\". Source: .\"`\n Example: *\"The Adjusted net enrolment rate for Algeria from 1975 to 1980 averages 77.0. Source: `get_adjusted_net_enrolment_avg`.\"*\n2. **Natural-language attribution** — `\"Per , .\"`\n Example: *\"Per the World Development Indicators API, the average is 77.0.\"*\n3. **Parenthetical citation** — `\" (from ).\"`\n\nThe cited tool name should be the actual API/tool the assistant invoked to retrieve the value. If multiple tools contributed, cite the tool whose response directly produced the cited value.\n\n## Scope\n\n- **Applies** when the answer's value originates from a single tool/API call.\n- **Applies** to single-fact answers in `capability_2_dashboard_apis` and `capability_3_multihop_reasoning` workflows, across all 16 covered domains (authors, books, codebase_comments, hockey, mondial_geo, movie_platform, professional_basketball, soccer_2016, student_loan, talkingdata, beer_factory, college_completion, computer_student, disney, trains, university, world_development_indicators).\n- **Does not apply** to general explanations or definitional answers not tied to a specific data retrieval.\n- **Does not apply** to aggregated values whose provenance spans multiple tools (those are governed by a separate citation policy if and when one is added).\n\n## Examples\n\n- ✓ \"Per the books API, *Hyperion* was published in 1989.\"\n- ✓ \"There are 3 ICRA papers from 2012 (source: 
`get_conference_short_name_most_papers_v1`).\"\n- ✗ \"There are 3 ICRA papers from 2012.\" (no source attribution — fails policy)\n- ✗ \"The most popular conference in 2012 was ICRA, based on the available data.\" (vague — fails policy)\n\n## Reformatting instruction (LLM-facing)\n\nIf the agent's draft final answer reports a single fact retrieved from a tool, rewrite it so that the originating tool name (and, where applicable, the result field or data system) is cited in the answer. Use the most natural of the three formats above. Do **not** invent tool names that were not actually called in the current conversation; if the originating tool name is unavailable, cite the data system or capability instead (e.g., \"the dashboard API\" or \"the world development indicators dataset\"). Do not change the factual value itself.\n" + }, + { + "id": "playbook_no_enumeration", + "type": "playbook", + "name": "P-PB-1 — No Enumeration When a Single Item Is Asked", + "description": "When the question requests a single item, return only that item; do not enumerate runners-up or alternatives.", + "priority": 100, + "enabled": true, + "triggers": [ + { + "type": "natural_language", + "target": "intent", + "case_sensitive": false, + "operator": "or", + "threshold": 0.6, + "value": [ + "the user asks for a single item (which X, the X with the most/least Y, the top X, the highest, the largest, the smallest)", + "the user asks \"which conference\", \"which company\", \"which country\", \"which city\", \"which solution\", \"which repository\", or another singular \"which\" form", + "the user asks \"what is the X\" expecting a single value or single named entity", + "the user asks for \"the\" specific item (the city of the lake at coordinates Y, the solution path with the highest processed time)" + ] + } + ], + "markdown_content": "# P-PB-1 — No Enumeration When a Single Item Is Asked\n\n## Policy\n\nWhen the user's question requests a **single item or value** (singular phrasing such as 
*\"which conference\"*, *\"the city with the most\"*, *\"the highest\"*, *\"the top\"*, *\"the largest\"*, *\"the smallest\"*), the assistant must return only that single item. Listing runners-up, near-misses, alternatives, or \"Top N\" enumerations the user did not request is prohibited.\n\n## Rationale\n\nThis is a basic answer-shape requirement for analytical and dashboard-style responses. When a stakeholder asks \"which region had the most sales last quarter?\", they expect a single region as the answer — not a leaderboard. Enumerating alternatives:\n\n- Increases the cognitive load on the reader, who has to find the answer inside a list.\n- Risks downstream misuse (the reader may pick the wrong row).\n- In audit and reporting contexts, dilutes the decision the answer is supposed to support.\n\n## Required behaviour\n\nBefore producing the final answer:\n\n1. Detect whether the user's intent is singular (\"which X\", \"the X with the most/least Y\", \"the highest\", \"the top\", \"the largest\") or plural (\"list all X\", \"which X meet condition Y\", \"show me the X's that…\").\n2. If singular: return only the resolved single item, with the source citation required by [[output_formatter_single_tool_fact_citation]].\n3. If plural: enumerate as requested.\n\nDo not include runners-up \"for context\", do not include \"Top 3\" when only the top 1 was requested, do not include \"the next-best alternative is …\".\n\n## Examples\n\n- ✗ Question: *\"In the year 2012, which conference had the most papers presented?\"*\n ✗ Wrong: *\"The conference with the most papers in 2012 was ICRA. The next two were CVPR and NeurIPS.\"*\n ✓ Right: *\"In 2012, ICRA had the most papers presented (source: `get_conference_short_name_most_papers_v1`).\"*\n- ✗ Question: *\"The city of the lake at (-85.35, 11.6)?\"*\n ✗ Wrong: *\"The city is Granada. 
Nearby cities include Rivas and Masaya, which also border the lake.\"*\n ✓ Right: *\"The city is Granada (source: `get_city_by_lake_coordinates`).\"*\n- ✓ Question: *\"List all books published in 1995\"* — enumeration is requested, so a list is the correct shape.\n\n## Interaction with other policies\n\nThis playbook complements [[output_formatter_strip_hedging]] (which strips runner-up clauses *after* the response is drafted) by preventing the planner from collecting the enumeration in the first place. Both can fire on the same case: this one shapes the upstream plan; the OutputFormatter cleans up if any slipped through.\n" + }, + { + "id": "playbook_one_composite_tool_no_corroboration", + "type": "playbook", + "name": "P-PB-2 — One Composite Tool, No Corroboration", + "description": "For percentage, ratio, and proportion questions, use the single endpoint that returns the composite value directly; do not also call the raw component tools to corroborate it.", + "priority": 100, + "enabled": true, + "triggers": [ + { + "type": "natural_language", + "target": "intent", + "case_sensitive": false, + "operator": "or", + "threshold": 0.6, + "value": [ + "the user asks for a percentage (forks-to-stars %, conversion rate, success rate, ratio, proportion, share)", + "the user asks for a ratio of two quantities (X-to-Y ratio, X per Y, X over Y)", + "the user asks for an aggregate metric (average, mean, total) that a single endpoint returns directly", + "the user asks for a \"percentage difference\" or \"% change\" or \"% of total\" or similar composite metric" + ] + }, + { + "type": "keyword", + "target": "intent", + "case_sensitive": false, + "operator": "or", + "value": [ + "percentage", + "%", + "ratio", + "proportion", + "share of", + "per cent", + "rate of", + "conversion rate" + ] + } + ], + "markdown_content": "# P-PB-2 — One Composite Tool, No Corroboration\n\n## Policy\n\nWhen a single endpoint returns the composite metric the user asked for (percentage, ratio, 
proportion, share, aggregate), the assistant must:\n\n1. Call only that endpoint.\n2. Report the returned value (subject to [[output_formatter_single_tool_fact_citation]] for source attribution).\n\nThe assistant must **not** also call the raw component endpoints (the numerator and denominator tools) to re-derive or \"double-check\" the composite value.\n\n## Rationale\n\nThis policy enforces two related principles from analytical and dashboard reporting:\n\n1. **Source-of-truth discipline.** When the data system exposes a tool that returns the composite metric directly, that tool is the source of truth. Re-deriving the value from component tools introduces consistency risk (numerator and denominator may be computed over different time windows, populations, or filters than the composite tool uses) and produces an answer that is *less trustworthy*, not more.\n2. **Tool-call frugality.** Each extra tool call costs LLM tokens, latency, and (for paid APIs) money. When the answer is already in hand from the composite tool, additional calls add no value.\n\n## Required behaviour\n\nFor percentage / ratio / proportion / aggregate questions:\n\n1. **Identify the composite tool first** — the tool whose name and description directly match the requested metric (e.g., `get_forks_to_stars_percentage`, `get_conversion_rate`, `get_average_X`, `get_X_per_Y`).\n2. **Call only that tool** with the appropriate parameters.\n3. 
**Report the returned value** with the source citation required by P-OF-1.\n\nExplicitly forbidden:\n- Calling `get_repo_forks` and `get_repo_stars` separately, then dividing, **when** `get_forks_to_stars_percentage` exists.\n- Calling `get_total_X` and `get_count_X` separately to compute an average, **when** `get_average_X` exists.\n- Re-running the composite tool with the same arguments to \"verify\" the value.\n\n## Exceptions\n\nThis policy does **not** apply when:\n- No composite tool exists for the requested metric (then the assistant must compute it from components — that is the only path).\n- The user explicitly asks for the component values *as well as* the composite (\"give me the forks count, stars count, and forks-to-stars percentage\").\n- The composite tool returned a clearly invalid value (HTTP error, type-validation failure) — then the assistant may fall back to components and must say so.\n\n## Examples\n\n- ✗ Question: *\"What is the forks-to-stars percentage for solution 104086?\"*\n ✗ Wrong: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Then also call `get_repo_forks` and `get_repo_stars` to \"double-check\". Then report `0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`.*\n ✓ Right: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Report: *\"The forks-to-stars percentage for solution 104086 is 0.00% (source: `get_forks_to_stars_percentage`).\"**\n- ✗ Question: *\"Average net enrolment rate for Algeria 1975–1980?\"*\n ✗ Wrong: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. Then also call `get_enrolment_rate(year=1975)`, …, `get_enrolment_rate(year=1980)` and average them yourself.*\n ✓ Right: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. 
Report once with citation.*\n\n## Interaction with other policies\n\n- [[playbook_no_idempotent_retries]] forbids calling the same tool with the same arguments twice; this policy forbids calling **redundant** tools after a composite tool has already answered.\n- [[output_formatter_single_tool_fact_citation]] handles the source-citation requirement for the single composite tool's value.\n" + }, + { + "id": "playbook_no_idempotent_retries", + "type": "playbook", + "name": "P-PB-3 — No Idempotent Retries", + "description": "Do not re-invoke a tool that returned a deterministic value with the same arguments during the same turn.", + "priority": 100, + "enabled": true, + "triggers": [ + { + "type": "natural_language", + "target": "intent", + "case_sensitive": false, + "operator": "or", + "threshold": 0.5, + "value": [ + "any question whose answer is retrieved by a deterministic data-fetching tool (lookup, count, average, sum, ratio, attribute fetch)", + "a single-fact or single-list question that resolves via one tool call" + ] + } + ], + "markdown_content": "# P-PB-3 — No Idempotent Retries\n\n## Policy\n\nOnce a tool has returned a non-error value for a given set of arguments during a turn, the assistant must not re-invoke that same tool with the same arguments to re-fetch, verify, or \"double-check\" the value.\n\n## Rationale\n\nThe data-fetching tools in this benchmark and in the standard analytics/dashboard contexts are **deterministic**: calling `get_repo_stars(solution_id=83855)` twice in the same minute returns the same value. Re-invoking such a tool:\n\n- Adds latency to the final answer with zero information gain.\n- Costs LLM tokens (the agent has to parse the duplicate response and explain why it called the tool a second time).\n- For paid APIs, costs money.\n- Creates a misleading audit trail in which the system-of-record appears to have been queried multiple times for a single decision.\n\nThis policy is the runtime guard. 
The planner-level [[playbook_one_composite_tool_no_corroboration]] addresses a related but distinct anti-pattern (calling *different* tools to corroborate); P-PB-3 specifically addresses calling the *same* tool repeatedly.\n\n## Required behaviour\n\nFor each turn, the assistant must:\n\n1. Maintain awareness of the (tool_name, arguments) pairs already called in this turn.\n2. If the planner or reflection step proposes a tool call with the same (tool_name, arguments) as one already executed and the prior result was not an error, **skip the call** and re-use the prior result.\n3. If the planner proposes the same tool with **different arguments**, that is allowed (it is not the same call).\n4. Once the answer is derivable from the calls already made, emit the final answer and end the turn.\n\n## Exceptions\n\nThis policy does **not** apply when:\n- The prior call returned an error or a transport-level failure (HTTP 5xx, timeout, schema-validation error). Retrying after an error is allowed.\n- The prior call was made with materially different arguments (different filter, different time window, different ID).\n- The user explicitly asks for a re-fetch (\"re-query the API and confirm the current value\").\n- The tool is documented as non-deterministic (e.g., a tool that returns a sampled or time-of-day-dependent value). None of the M3 capability_2_dashboard_apis or capability_3_multihop_reasoning tools meet this criterion.\n\n## Examples\n\n- ✗ Question: *\"What are the solution ids for repositories with 238 forks?\"*\n ✗ Wrong: *Call `get_solution_ids_by_repo_forks(forks=238)` → [\"62258\", \"258160\"]. Then, \"to verify\", call `get_repo_forks(solution_id=62258)`, `get_repo_forks(solution_id=258160)`, then again `get_solution_ids_by_repo_forks(forks=238)`.*\n ✓ Right: *Call `get_solution_ids_by_repo_forks(forks=238)` → [\"62258\", \"258160\"]. 
Report: *\"The solution ids are 62258 and 258160 (source: `get_solution_ids_by_repo_forks`).\"**\n- ✗ Wrong: *Calling `get_average_processed_time(url=X)` twice in the same turn to \"confirm\" the average.*\n ✓ Right: *Single call; emit result.*\n\n## Interaction with other policies\n\n- [[playbook_one_composite_tool_no_corroboration]] forbids calling redundant **different** tools to corroborate; this policy forbids calling the **same** tool twice with the same arguments.\n- [[output_formatter_strip_hedging]] cleans up \"to verify, I also ran the call again\" prose if any slips through.\n" + }, + { + "id": "playbook_validation_error_recovery", + "type": "playbook", + "name": "P-PB-4 — Validation-Error Recovery", + "description": "When a tool returns a parameter-validation error, diagnose from the tool's schema, recover the missing or wrong-typed argument from prior responses, and retry once — instead of abandoning, randomly retrying, or pivoting to a worse tool.", + "priority": 110, + "enabled": true, + "triggers": [ + { + "type": "keyword", + "target": "chat_messages", + "case_sensitive": false, + "operator": "or", + "value": [ + "Input validation error", + "is a required property", + "is not of type 'integer'", + "is not of type 'string'", + "is not of type 'number'", + "is not of type 'array'", + "validation error" + ] + }, + { + "type": "natural_language", + "target": "chat_messages", + "case_sensitive": false, + "operator": "or", + "threshold": 0.65, + "value": [ + "a previous tool call returned an input-validation error about a required property being missing", + "a previous tool call returned a type-mismatch error such as \"X is not of type 'integer'\" or \"X is not of type 'string'\"", + "a recent tool call failed because of missing or wrong-typed arguments, not because the underlying data is unavailable" + ] + } + ], + "markdown_content": "# P-PB-4 — Validation-Error Recovery\n\n## Policy\n\nWhen a tool call returns a parameter-validation error — most 
commonly `\"Input validation error: 'X' is a required property\"` or `\"'Y' is not of type 'integer'\"` — the planner must perform a one-time **structured recovery** rather than abandoning the call, randomly retrying, or pivoting to a different (and usually worse) tool.\n\n## Rationale\n\nIn analytical and dashboard workflows, the difference between a successful run and a failed run is often a single malformed argument: the right tool was selected, but the planner passed `path=[]` instead of `path=\"x.sln\"`, or omitted a parameter that the API requires. **The data is reachable. The agent just needs to fix the call.**\n\nThe typical observed behaviours when CUGA receives a validation error are:\n\n1. Try the same tool with a permuted but still-wrong set of arguments (wastes calls).\n2. Pivot to a different tool that looks similar by name (often a worse match for the question).\n3. Conclude the data is unavailable and emit a refusal.\n\nAll three pollute the trajectory with failed calls without retrieving the data the next step needs — and (importantly for downstream consumers and audit logs) without producing the successful tool responses that establish the answer's provenance. A clean recovery puts the right value back on the table.\n\n## Required behaviour\n\nWhen a tool result contains a parameter-validation error:\n\n1. **Identify the failing parameter.** Parse the error message to extract:\n - The parameter name (e.g., `'path'`, `'summary'`, `'processed_time'`, `'solution_id'`).\n - The failure kind: *missing required property*, *wrong type* (`is not of type 'integer'`, `is not of type 'string'`, etc.), or *invalid value*.\n2. **Recover the value.** Search prior tool responses in the conversation for a field whose key or content matches the failing parameter. 
Typical sources:\n - A list-returning tool whose single element is the value (e.g., a previous call returned `{\"solution_paths\": [\"x.sln\"]}` and the failing parameter is `path` — pass `\"x.sln\"`).\n - A scalar field with a matching name (e.g., a previous call returned `{\"solution_id\": 45997}` and the failing parameter is `solution_id`).\n - A typed value that needs coercion (e.g., a string `\"636449700980488000\"` when the schema requires an integer — coerce to `int(\"636449700980488000\")`).\n3. **Retry, once.** Re-invoke the same tool with the recovered value substituted into the failing parameter. Keep the rest of the arguments unchanged.\n4. **If the retry also fails**, do **not** retry a third time and do **not** pivot to a generic detail tool to \"discover\" the value indirectly. Emit a final answer based on the data already in hand, or a clear refusal if no answer is supportable. See [[playbook_no_idempotent_retries]] for the same-call-twice rule (this policy is the explicit exception, because the arguments change).\n\n## Common parameter-recovery patterns\n\nThese cover the validation errors seen most often in `capability_2_dashboard_apis` and `capability_3_multihop_reasoning`:\n\n| Validation error | Where to look for the value | What to pass |\n| --- | --- | --- |\n| `'path' is a required property` | Prior responses with `paths`, `solution_paths`, or `solution_path` field | The single element / first element |\n| `'summary' is a required property` | Prior responses with `summary`, `description`, or `body` field | Pass it through |\n| `'X_id' is a required property` | Prior `X_id`, `id` (in an object containing X), or a \"by name → id\" lookup tool's response | The integer ID |\n| `X is not of type 'integer'` | Same value is in hand; it's just typed as a string | Coerce to integer |\n| `X is not of type 'string'` | Same value is in hand; it's typed as a list or number | First element of list, or `str(number)` |\n| `X is not of type 'array'` | A scalar 
is in hand and the API wants a list | Wrap in a single-element list |\n\n## What this playbook does NOT permit\n\n- Calling the failing tool a third time after the first retry also fails.\n- **Manufacturing** a parameter value not present in any prior tool response (do not invent IDs, paths, or timestamps to satisfy the schema).\n- Treating \"tool not found\", \"404\", \"500\", or timeout errors as validation errors — those are different recovery scenarios and this policy does not apply.\n- Suppressing the validation error from the final answer if no recovery succeeded — be honest about what was retrieved and what wasn't.\n\n## Interaction with other policies\n\n- [[playbook_no_idempotent_retries]] (P-PB-3) forbids a second identical call; this policy is explicitly the *one* permitted retry, because the arguments change between attempts.\n- [[playbook_one_composite_tool_no_corroboration]] (P-PB-2) takes precedence in *choosing* the right tool; this policy operates **after** the right tool has been chosen and only needs argument repair.\n- [[output_formatter_single_tool_fact_citation]] (P-OF-1) applies normally to the final answer; the recovered (now-successful) call is the citable source.\n- This policy has higher priority (110) than the default planning playbooks (100) because it is a corrective action that should pre-empt re-exploration.\n" + }, + { + "id": "tool_guide_mountain_count_most_populous_country", + "type": "tool_guide", + "name": "P-TG-1 — `get_mountain_count_most_populous_country` Disambiguation", + "description": "Clarifies that `get_mountain_count_most_populous_country` is the right tool when the user asks for mountains in the country with the largest/greatest/most population.", + "priority": 100, + "enabled": true, + "prepend": true, + "target_tools": [ + "get_mountain_count_most_populous_country" + ], + "triggers": [], + "guide_content": "# P-TG-1 — `get_mountain_count_most_populous_country` Disambiguation\n\n## Policy\n\nThis `ToolGuide` enriches 
CUGA's view of the `get_mountain_count_most_populous_country` tool description so the shortlister surfaces it for the right intents and so the planner does not compose it with unrelated geography tools.\n\n## Rationale\n\nWhen the user asks *\"How many mountains are in the most populous country?\"* (or a paraphrase such as *\"the country with the largest population\"*, *\"the country with the most people\"*, *\"the country with the greatest population\"*), the right tool is the single composite endpoint `get_mountain_count_most_populous_country`. The CUGA shortlister can miss this because the user's phrasing uses *\"most populous\"* / *\"largest population\"* while the tool name encodes the same concept differently.\n\nWhen the shortlister misses the composite tool, CUGA tends to compose two unrelated tools (a population-ranking tool plus a mountain-counting tool keyed by country), which:\n\n- Costs extra LLM calls and tool calls.\n- Produces a brittle chain that can fail if the population-ranking tool's country naming does not match the mountain-counting tool's country naming.\n- Risks a wrong answer if the population tool returns a list of \"most populous\" countries while the question asks for *the* single country.\n\n## What this ToolGuide adds\n\nThe following content is prepended to the tool's stored description so that the shortlister's embedding match and the planner's prompt both see it:\n\n**Use this tool when the user asks for:**\n- The mountain count of the most populous country\n- The number of mountains in the country with the largest, greatest, or highest population\n- \"How many mountains\" combined with \"most people\", \"biggest population\", \"most populated country\"\n\n**Do NOT compose with city-population, country-population-ranking, or per-country mountain-listing tools.** This single tool returns the answer directly.\n\n**Returns:** a single integer — the count of mountains in the country with the largest population.\n\n## Scope and 
limits\n\nThis `ToolGuide` only changes CUGA's internal view of the tool's description (per the `ToolGuide` policy mechanism). It does **not** modify the upstream MCP tool definition, which is part of the benchmark and remains untouched.\n\nThe policy is narrow on purpose: it targets exactly one tool (the one CUGA missed in the analyzed PF case). If the same disambiguation pattern recurs for other composite tools, additional `ToolGuide` policies should be added for each rather than broadening this one.\n" + }, + { + "id": "tool_guide_country_with_most_umpires_returns_id", + "type": "tool_guide", + "name": "P-TG-2 — `get_country_with_most_umpires` Returns an ID, Not a Name", + "description": "Clarifies that `get_country_with_most_umpires` returns a numeric country ID and must be chained with a name-lookup tool before the country can be reported by name.", + "priority": 100, + "enabled": true, + "prepend": true, + "triggers": [], + "target_tools": [ + "get_country_with_most_umpires" + ], + "guide_content": "# P-TG-2 — `get_country_with_most_umpires` Returns an ID, Not a Name\n\n## Policy\n\nThis `ToolGuide` enriches CUGA's view of `get_country_with_most_umpires` so the planner knows the response is a country **ID** (an integer key into the country table) and not a country **name**. The planner is instructed to chain the result through `get_country_name_by_id` when the user has asked for a country by name.\n\n## Rationale\n\nWhen the user asks *\"From which country are the most umpires?\"*, they expect a country name in the answer (e.g., \"England\", \"Australia\"). The composite tool `get_country_with_most_umpires` is the correct entry point — it returns the answer directly — but its response shape is `{country_id: , umpire_count: }`. 
Without this disambiguation, CUGA tends to either:\n\n- Emit the raw ID as the answer (\"The country with the most umpires has ID 1, with 27 umpires.\"), which is technically true but useless to the reader.\n- Conclude that the dataset \"does not provide a tool to translate ID 1 into a name\" and refuse — wrong, because `get_country_name_by_id` exists and is the obvious next call.\n\nBoth behaviours fail an analytics-style reader's basic expectation that a \"which country\" question is answered with a country name.\n\n## What this ToolGuide adds\n\nThe following content is prepended to the tool's stored description so the planner sees it before it sends the response back to the user:\n\n**Return shape:** `{country_id: , umpire_count: }` — the `country_id` is a numeric primary key, NOT a country name.\n\n**Required follow-up when the user asked for a country by name:** chain with `get_country_name_by_id(country_id=)` to translate the ID into a name before producing the final answer.\n\n**Do NOT report the raw `country_id` to the user when they asked for the country itself.** Reporting \"ID = 1\" to a user who asked \"which country\" is a policy violation: see [[output_formatter_strip_hedging]] for the answer-shape requirement and [[output_formatter_single_tool_fact_citation]] for the citation requirement.\n\n**Do NOT refuse with \"the dataset does not provide a name lookup tool\"** — `get_country_name_by_id` exists in the same capability and is the canonical name lookup.\n\n## Scope and limits\n\nThis `ToolGuide` only changes CUGA's internal view of the tool description (per the `ToolGuide` policy mechanism). It does **not** modify the upstream MCP tool definition, which is part of the benchmark and remains untouched.\n\nThe same pattern (composite tool returns an ID, name lookup lives in a separate tool) likely recurs across the `capability_2_dashboard_apis` and `capability_3_multihop_reasoning` domains. 
If new \"X with most Y returns an ID\" cases turn up, add a focused `ToolGuide` per tool rather than broadening this one — narrow, tool-specific guides are easier to debug than a single sprawling rule.\n" + } +] diff --git a/docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md b/docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md new file mode 100644 index 0000000..1e2545a --- /dev/null +++ b/docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md @@ -0,0 +1,1072 @@ +# CUGA vs LangGraph ReAct on M3 (vakra) — full failure-mode analysis + +**Bundle:** `benchmarks/m3/evaluation_bundles/20260428_201443_default/` +**Model (both agents):** `openai/gpt-oss-120b` via Groq +**CUGA agent:** `cuga_sdk` v0.2.20 (git df40ff98), mode `accurate`, `lite_mode=true` +**ReAct agent:** LangGraph ReAct, run files `benchmarks/m3/results_react/task{2,3}_lg_gpt-oss-120b.json` +**Benchmark:** M3 capability_dashboard_apis (Task 2, 100 cases) + capability_multihop_reasoning (Task 3, 100 cases) = 200 cases +**Scoring:** M3 vakra (three LLM judges: answer correctness, tool-call exactmatch, groundedness). A turn passes if its aggregated score ≥ 1.0; per the M3 aggregation rule any single sub-judge of 1.0 with at least one corroborating 1.0 (or aggregation defaulting through `None`) triggers pass. + +> **Scope note for this rewrite.** The vakra evaluator (judges + aggregation), the benchmark groundtruth, the upstream MCP tool definitions, and the ReAct baseline are all **off-limits** as remediation levers — they are the benchmark and must not be modified. All remediations below are inside CUGA (cuga-agent + this repo's CUGA config), evaluated across a fixed lever taxonomy: **policy, memory, tool-enrichment (CUGA-side), configuration, prompt, code**. We prefer the policy/memory/config/code levers over a global system-prompt change wherever a more scoped lever fits. 
+ +--- + +## Executive summary (rewritten) + +The 41 PF cases (ReAct passed, CUGA failed) are dominated by a single asymmetry: the groundedness judge returns `no` on CUGA's final answer in 26 of 41 cases (24 with `(answer=1.0, exactmatch=0.0, groundedness=0.0)` + 2 with `(None, 1.0, 0.0)`) while returning `yes` for ReAct on the same questions with near-identical wording. We cannot change how that judge scores. What we **can** do is shape CUGA's final answer and tool-selection behaviour from the CUGA side so the judge's prompt finds the keywords it expects. + +The most important finding of this rewrite is that CUGA already ships a rich **policy engine** (`Playbook`, `IntentGuard`, `ToolGuide`, `ToolApproval`, `OutputFormatter`, `CustomPolicy` — see `cuga-agent/src/cuga/backend/cuga_graph/policy/models.py`) that was **disabled** in this run (`DYNACONF_POLICY__ENABLED=false`). The `OutputFormatter` policy in particular is hooked into the LITE_MODE path that this benchmark uses (`cuga_lite_node.py:_apply_output_formatter`), so it can rewrite the final answer per matched trigger without touching prompts globally. **A small policy bundle (one shared `OutputFormatter` for single-fact answers, one for hedging strip, a handful of `ToolGuide` policies for the tool-disambiguation cases, and 2–3 `Playbook` rules for chain-minimization) is the highest-leverage CUGA-side fix in the bundle.** + +The other two priorities are: (a) a small **code change** to the CUGA sandbox codegen step to stop wrapping previous-tool-result dicts into the next call's argument (the "nested-argument bug" — directly causes 7 PF cases and inflates cost across many more), and (b) **needs-investigation** for `movie_platform` and `professional_basketball` (7 PF + 13 FF cases with no CUGA trace — likely an MCP-client/registry health issue on the CUGA side, not a benchmark issue). + +Per-case lever comparisons are in §3. 
The policy/memory/tool-enrichment/configuration/prompt/code framework is applied uniformly per case. + +--- + +## 1. Headline summary + +| Task | Capability | CUGA pass | ReAct pass | Δ | +| --- | --- | --- | --- | --- | +| 2 | capability_dashboard_apis | **28/100 (28%)** | 49/100 (49%) | −21 pp | +| 3 | capability_multihop_reasoning | **12/100 (12%)** | 23/100 (23%) | −11 pp | +| **Total** | both | **40/200 (20%)** | 72/200 (36%) | −16 pp | + +### Per-domain pass rates (CUGA / ReAct) + +#### Task 2 — dashboard_apis + +| Domain | CUGA | ReAct | +| --- | --- | --- | +| authors | 7/10 | 5/10 | +| books | 7/10 | 7/10 | +| codebase_comments | **0/10** | 4/10 | +| hockey | 6/10 | 7/10 | +| mondial_geo | **0/10** | 3/10 | +| movie_platform | **0/10** | 4/10 | +| professional_basketball | **0/10** | 3/10 | +| soccer_2016 | **0/10** | 5/10 | +| student_loan | **0/10** | 4/10 | +| talkingdata | 8/10 | 7/10 | + +#### Task 3 — multihop_reasoning + +| Domain | CUGA | ReAct | +| --- | --- | --- | +| beer_factory | 0/10 | 2/10 | +| books | 6/10 | 5/10 | +| college_completion | 0/10 | 1/10 | +| computer_student | 0/10 | 2/10 | +| disney | 2/10 | 3/10 | +| mondial_geo | 0/10 | 2/10 | +| soccer_2016 | 0/10 | 5/10 | +| trains | 1/10 | 0/10 | +| university | 3/10 | 3/10 | +| world_development_indicators | 0/10 | 0/10 | + +CUGA is bimodal: it scores 60–80% on `authors`, `books`, `hockey`, and `talkingdata` in Task 2 and on `books` in Task 3, and 0% on 12 of the remaining 15 domain×task combos. ReAct is roughly uniform (20–50% almost everywhere). This pattern is the single most important signal in the data: **CUGA either nails a domain or fails it completely.** That is a deterministic-failure pattern, not a sampling-variance pattern, and it means a single root cause is likely responsible for a domain's zero score. 
+ +### The four quadrants + +| | CUGA pass | CUGA fail | total | +| --- | --- | --- | --- | +| **ReAct pass** | PP=31 | PF=41 | 72 | +| **ReAct fail** | FP=9 | FF=119 | 128 | +| **total** | 40 | 160 | 200 | + +This report's actionable content focuses on the **41 PF cases** (cases ReAct passed but CUGA failed). Those are the gap — closing all of them would lift CUGA to 81/200 (40.5%), comfortably above ReAct's 72/200. The FF cases are only listed in an appendix at the end, without per-case analysis. + +--- + +## 2. Failure-mode clustering — the actionable summary + +Each of the 41 PF cases was classified by inspecting (a) the CUGA langfuse trace's tool sequence and final answer, (b) the CUGA vakra sub-scores (answer / exactmatch / groundedness), and (c) the M3 judge explanations. + +### CUGA PF score-pattern distribution (the asymmetry) + +| (answer_s, exactmatch_s, groundedness_s) | n | meaning | +| --- | --- | --- | +| **(1.0, 0.0, 0.0)** | 24 | CUGA gave the right answer; tool-call exactmatch failed (over- or differently-called); **groundedness judge returned `no`**. ReAct on the *same* questions all scored (1.0, 0.0, **1.0**) and passed. | +| (0.0, 0.0, None) | 15 | CUGA's answer was actually judged incorrect — a genuine answer failure. | +| (None, 1.0, 0.0) | 2 | Tool-call exactmatch passed but groundedness=0. | + +ReAct's PF score pattern is uniformly `(1.0, 0.0, 1.0)` on all 41 — i.e., it never matches the tool sequence exactly either, and it always passes the answer judge AND the groundedness judge. **The 26 CUGA cases (24+2) that pass at least one judge with `1.0` but lose because `groundedness=0` are the largest single closable lever in this report** — and they are the cases where a CUGA-side `OutputFormatter` policy can rewrite the final answer to put the keywords/values the groundedness judge looks for back in front of it. + +### 2.1 The lever taxonomy used in this rewrite + +For each PF case in §3, we evaluate the following levers in fixed order. 
Each lever that is plausibly relevant gets one sentence; clearly-irrelevant levers are skipped to avoid padding. + +1. **Policy** — CUGA's policy engine (`Playbook` / `IntentGuard` / `ToolGuide` / `ToolApproval` / `OutputFormatter` / `CustomPolicy`). Currently disabled in this run (`DYNACONF_POLICY__ENABLED=false`). The engine matches triggers (keyword, natural_language, app, state, tool, always) against the `intent`, `agent_response`, etc. Most relevant variants for this report: + - `OutputFormatter` rewrites the *final* AI message based on trigger matches against the response itself (markdown rewriting, JSON-schema reshape, or direct string replacement) — wired into the LITE_MODE path via `cuga_lite_node._apply_output_formatter`. **This is the natural fit for the groundedness cluster**: it can prepend a tool/key citation, strip hedging, or strip unsolicited context paragraphs without touching the global FinalAnswerAgent prompt. + - `Playbook` injects step-by-step guidance keyed by intent — useful for "use this single composite tool, do not corroborate", or for "do not enumerate runners-up when a single item is asked". + - `ToolGuide` enriches CUGA's *view* of a tool's description (prepended/appended markdown) — useful for tool-disambiguation cases and for steering the shortlister's embedding match. **Note**: this changes the description CUGA stores/uses, not the upstream MCP tool itself. +2. **Memory** — agentic_memory subsystem in `cuga-agent/src/cuga/backend/memory/agentic_memory/`. Could provide few-shot trajectories from past similar runs. Helps tasks that benefit from seeing the canonical tool path before. Cold-start ineffective. +3. **Tool enrichment** (CUGA-side only) — overlaps with `ToolGuide` policy above, plus CUGA's shortlister/registry behaviour, retry/argument-coercion wrappers around MCP calls. +4. 
**Configuration** — env vars in `benchmarks/m3/config/m3.env` and the run's `metadata.json`: `CUGA_MODE`, `LITE_MODE`, `LITE_MODE_TOOL_THRESHOLD`, `SHORTLISTING_TOOL_THRESHOLD`, `DECOMPOSITION_STRATEGY`, `REFLECTION_ENABLED`, `ENABLE_TODOS`, `FORCE_AUTONOMOUS_MODE`, `TOOL_CALL_TIMEOUT`, `PATH_SEGMENT_INDEX`, `TRACKER_ENABLED`, `agent_setting_config`, `model_profile`. +5. **Prompt change** — *deprioritized*. Global FinalAnswerAgent prompt edits affect all runs and are a heavy hammer for benchmark-scoped issues. Only recommended where no other lever fits. +6. **Code change** — actual code edits to cuga-agent (parser fixes, bug fixes, control-flow). Reserved for clear bugs (notably the nested-arg bug). + +### 2.2 Cluster table + +Cluster names and counts are unchanged from the original analysis; the rightmost column is the new "best CUGA-side lever (runner-up)" framing. + +| failure mode | n | % of PF | best lever (runner-up) | +| --- | --- | --- | --- | +| **verbose_answer_or_extra_tools** | 23 | 56% | **policy: `OutputFormatter`** (shared single-fact citation that prepends tool name + key to the answer; second OutputFormatter to strip hedging / "for context" appendices / dataset meta-commentary). Runner-up: prompt change to FinalAnswerAgent (same effect, global). See §2.3.1. | +| **no_trace_pre_llm_crash** | 7 | 17% | **needs-investigation + code**: re-run `movie_platform` and `professional_basketball` MCP under the same CUGA config to capture registry stderr; add a CUGA-side "empty tool list → loud error" failsafe in the MCP-client glue. Policy/memory are inapplicable when the LLM never ran. See §2.3.2. | +| **wrong_args_or_aggregation** | 7 | 17% | Two sub-patterns. C1 (nested-argument bug): **code** (sandbox codegen rule: when binding a value from a prior tool's response, subscript the JSON key, never pass the whole dict). 
C2 (expressed-uncertainty / wrong-aggregation): **policy: `OutputFormatter`** to strip hedging tokens + `Playbook` "never rescale numeric tool outputs" / "prefer purpose-built tool over generic detail tool". See §2.3.3. | +| **timeout_or_giveup** | 2 | 5% | **policy: `Playbook`** ("emit confident final on first numeric match for the intent") + `OutputFormatter` strip-hedging. Runner-up: configuration (try `ENABLE_TODOS=true` or `REFLECTION_ENABLED=false` ablation to bound exploration). See §2.3.4. | +| **wrong_tool_selected** | 1 | 2% | **policy: `ToolGuide`** enriching `get_mountain_count_most_populous_country` description with "country with the largest/greatest/most population" so the shortlister surfaces it. Runner-up: memory if accumulated across runs. See §2.3.5. | +| **hallucinated_no_tool** | 1 | 2% | **code**: CUGA-side sandbox guard — if the generated `await ...` references a function not in the resolved tool registry, raise a hard error instead of letting the code block become the final answer. Also fix the underlying registry health for the domain. Same root family as the no-trace cluster. See §2.3.6. | + +### 2.3 Cluster-level remediations + +#### 2.3.1 Cluster A — `verbose_answer_or_extra_tools` (n=23): the groundedness-judge gap + +Concrete pattern. CUGA's `final_response` typically matches ReAct's almost word-for-word; the answer judge gives both `1.0`. Yet the groundedness judge (also gpt-oss-120b) returns "no" for CUGA and "yes" for ReAct. Reading the judge's reasoning across all 23 cases reveals a consistent pattern: **the groundedness judge claims "the document provides no information" or "the document only indicates the relevant tool was not found"** — *even though the predicted `tool_response` payload (e.g. `{"stars": 272}`) is literally embedded in the document the judge sees.* This is small-model judge confabulation. We cannot change the judge. We can stack the deck from CUGA's side. 
+ +A precise example (uuid `1960f609e439-e5d337d143b6`, T2 codebase_comments): +- Q: "How many stars does the repository of the solution No. 45997 have?" +- ReAct ans: "The repository associated with solution **#45997** has **272 stars**." +- CUGA ans: "Solution #45997's repository has **272** stars." +- Vakra prediction file CUGA tool_response: `["{\"stars\": 272}"]` +- ReAct gnd judge: "The response correctly restates the document's fact that solution #45997's repository has 272 stars" → score 1 +- CUGA gnd judge: "The response provides a specific star count (272) that is not present in the document, which only indicates the query could not retrieve star information" → score 0 + +**Recommended remediation — policy first.** Build a small `OutputFormatter` policy bundle and enable it for the M3 run only (`DYNACONF_POLICY__ENABLED=true`, point the policy folder at a benchmark-scoped policy directory analogous to `benchmarks/bpo/policies/`). + +- **Policy P-OF-1 ("cite tool and key on single-fact answers"):** `OutputFormatter`, `format_type=markdown`, trigger = natural-language match on `agent_response` like "response contains a single fact or single short list that came from one tool call". `format_config` instructs the LLM to rewrite the response so it (a) opens with a one-clause restatement that literally repeats the JSON key from the tool response — e.g. *"The repository's star count, returned by `get_repo_stars_by_solution_id`, is `272`."* — and (b) preserves the original answer body. Expected to flip ~10–15 of the 24 `(1.0, 0.0, 0.0)` cases. Cost: one LLM call per response when the policy matches. + + Rationale vs the alternatives: a global FinalAnswerAgent prompt change has the same effect on this judge but applies to all runs of CUGA (chat, browser tasks, non-M3 benchmarks) and is harder to reason about. 
`OutputFormatter` is scoped: it fires only on responses that match its trigger, leaves the rest of CUGA's behaviour alone, and can be disabled by flipping `DYNACONF_POLICY__ENABLED`. + +- **Policy P-OF-2 ("strip hedging / unsolicited context / dataset meta-commentary"):** second `OutputFormatter`, `format_type=markdown`, triggers on `agent_response` containing phrases like "upper bound", "cannot be completed", "may be lower", "For context", "However, X is actually Y", "The dataset does not provide a tool to". `format_config` instructs: "If the answer contains a numeric value or named entity that resolves the question, emit only that resolved answer; strip hedging clauses, dataset meta-commentary, and unsolicited `For context` appendices." Recovers the soccer_2016 hedge cases, the BYU-Idaho appendix in college_completion, the cricket-vs-soccer meta-commentary, and the computer_student "upper bound" case. + + Rationale vs the alternatives: a prompt change would do the same job, but in our PF/FP analysis (see §5) CUGA's confident answer style is a *strength* on the 9 FP cases — we want to remove hedging selectively (only when a numeric value is already in hand), not blanket-restructure the FinalAnswerAgent prompt. + +- **Policy P-PB-1 ("no enumeration when a single item is asked"):** `Playbook`, trigger on intent containing singular phrasing ("which conference", "the city of", "the solution path with..."). `markdown_content` instructs: "Return only the single requested item; do not enumerate runners-up or 'Top N' alternatives." Targets the ICRA case and similar. + +- **Policy P-PB-2 ("one composite tool, no corroboration"):** `Playbook`, trigger on percent/ratio/proportion intents. Instructs: "If a single endpoint returns the percentage/ratio directly, use only that endpoint. Do not also call the raw component tools to corroborate." 
+ +- **Policy P-PB-3 ("no idempotent retries"):** `Playbook`, instructs: "Do not re-invoke a tool that returned a deterministic value during the same turn; emit the answer." + +**What policy cannot do here.** None of these policies can rescue a case where CUGA called the wrong tool entirely (Cluster E) or where the underlying capability did not run (Cluster B). Those need different levers; see below. + +#### 2.3.2 Cluster B — `no_trace_pre_llm_crash` (n=7 in PF, 22 total) + +22 of 200 cases produced no CUGA langfuse trace at all. These split into: + +- **`movie_platform` (8 of 10 cases in T2)** — 3 of these are PF (listed in the table below); the other 5 are FF (ReAct also failed). +- **`professional_basketball` (10 of 10 cases in T2)** — 3 are PF (listed in the table below, matching ReAct's 3/10 in §1); the other 7 are FF. +- Sporadic singletons in `codebase_comments`, `mondial_geo`, `soccer_2016`, `college_completion`. + +The PF cases that fall in this bucket are: + +| uuid | task | domain | +| --- | --- | --- | +| 31d9743578dc-20fe1c6e0318 | 2 | movie_platform | +| 31d9743578dc-3b59b2d5a9b3 | 2 | movie_platform | +| 31d9743578dc-fa8256c2888f | 2 | movie_platform | +| d14bbb0be92d-781ff55b91b7 | 2 | professional_basketball | +| d14bbb0be92d-b94c0c0446e9 | 2 | professional_basketball | +| d14bbb0be92d-7d51d5f6098d | 2 | professional_basketball | +| fe971e7f850a-0bd47606e297 | 2 | soccer_2016 (this one *does* have a CUGA result but with 0 actual tool calls and an empty final_response, which is the same symptom) | + +For the entire bucket, the `report.md` row shows blank tokens/duration. This is consistent with **a registry / MCP-client startup failure** on the CUGA side for those domains — the CUGA agent didn't obtain a tool list and produced nothing. The fact that `professional_basketball` is 100% no-trace is the strongest signal. + +**Why policy / memory / config alone cannot fix this.** Policies are evaluated against `intent` / `agent_response` / state — they require the LLM to actually run at least once. Memory likewise. 
Configuration could mask the symptom (e.g., a guard for empty tool lists) but cannot cause the registry to succeed. + +**Recommended remediations:** +- **Needs-investigation:** re-run `movie_platform` and `professional_basketball` MCP servers in isolation under the same CUGA config; capture stderr from the registry expansion step in `eval_m3.py` (the `m3_registry.yaml` → expanded config path that runs at boot). Look for missing-endpoint, schema-validation, or container-startup errors on the CUGA-side MCP client. +- **Code:** add a CUGA-side failsafe in the MCP-client glue — when the registry returns an empty tool list for a domain, emit an explicit guard error to the trace rather than silently producing no output. This converts a silent zero into a diagnosable error and prevents the case from losing its turn slot. Also enables the rest of the bundle to surface the issue earlier (so we don't lose a full run silently). + +#### 2.3.3 Cluster C — `wrong_args_or_aggregation` (n=7) + +Two sub-patterns, both repeatable. + +**Sub-pattern C1: nested-argument bug** (very prevalent across many runs, including some that DID pass after retry): +```python +# tool A returns {"director": "Wolfgang Reitherman"} +# tool B expects (director: str) +# CUGA generates: +tool_B(director={"director": "Wolfgang Reitherman"}) # ← nested +# Server returns: "Input validation error: {'director': 'Wolfgang Reitherman'} is not of type 'string'" +# CUGA self-corrects on the next turn, but burns a step + an LLM call +``` +Observed in PF cases: `34a533dfd727-9a80447e42a5` (disney), `34a533dfd727-792336e9811f` (disney), `fe971e7f850a-0ce9f1bd5b3e` (soccer_2016), `fe971e7f850a-d6dd43c77447` (soccer_2016), `55b7e50368aa-cd69f2bccbaa` (mondial_geo) — and many more in the FF and PP traces too. 
CUGA *almost always* recovers, but on multihop chains with 2–3 chained tools, the failed-then-retried attempts inflate the tool-call count to 5+ and inflate latency, sometimes pushing the case into a timeout, sometimes producing a wrong intermediate result that fools the second call. + +**Remediation comparison for C1.** +- **Code:** add a single-purpose rule to the sandbox Python-codegen step: "When passing a value from a previous tool's response, unwrap the JSON key with subscript access — never pass the whole dict." E.g. `r = await tool_A(); tool_B(director=r['director'])`. This is the direct cure and is cheap. **Recommended.** +- **Policy:** a `Playbook` instructing the same rule via intent-match could partially help but only at the level of the planner narrative; the sandbox codegen is downstream and is where the actual bug lives. +- **Tool enrichment:** `ToolGuide` policies prepending "argument types: scalar, not dict" to tool descriptions would marginally help but doesn't catch the root cause. +- **Configuration:** no relevant flag. + +**Sub-pattern C2: expressed-uncertainty / wrong-aggregation answer.** Two cases (`308738b8195d-56faa9f6bbd2` hockey "temporary coaches" and `39a28b2592a2-a6d040ce4d19` computer_student) where CUGA *had* the correct number (1 and 13) but wrapped the answer in language like "The task cannot be completed because..." or "this figure is an upper bound". The answer judge's prompt has a hardcoded rule: "Predictions expressing uncertainty score 0 even if numerically correct." Plus the `bc9218680ed5-0b0ec8d0b7d2` case where CUGA rescaled a `{percentage: 1500}` tool output to "15%" (helpful but penalized). Plus the `34a533dfd727-792336e9811f` case where a generic detail tool's output overrode a purpose-built tool's correct answer. + +**Remediation comparison for C2.** +- **Policy `OutputFormatter` (P-OF-2):** strip hedging tokens from the head of the response when a numeric value is present. Cheap and surgical. 
**Recommended for the hedging cases.** +- **Policy `Playbook`:** "never numerically rescale a value that came directly from a tool response — emit it verbatim" (for the 1500% case). "When multiple tools have returned candidate answers, prefer the purpose-built tool's result over the generic detail tool's" (for the disney case). +- **Prompt change:** would also work but is global; policy is preferred, per the scope note at the top of this report. + +#### 2.3.4 Cluster D — `timeout_or_giveup` (n=2) + +Both timeouts hit the 120 s `TOOL_CALL_TIMEOUT` while still iterating: `308738b8195d-56faa9f6bbd2` (140 s, 54 LLM calls, hockey "temporary coaches") and `39a28b2592a2-a6d040ce4d19` (195 s, 48 LLM calls, computer_student). Both are also uncertainty-expression cases (C2 above). + +**Recommended remediation — combined:** the P-OF-2 hedging-strip policy + a `Playbook` rule "when a tool has returned a numeric value that resolves the question's intent, emit a confident final answer immediately". As a configuration ablation, try `REFLECTION_ENABLED=false` on a re-run of these two cases — the reflection step appears to be driving repeated re-checks. `TOOL_CALL_TIMEOUT` is *per-call* and not the binding constraint here; raising it further does not help. + +#### 2.3.5 Cluster E — `wrong_tool_selected` (n=1) + +Single case: `55b7e50368aa-7d0eae1aeaf4` (T2 mondial_geo, "mountains in the country with the greatest population"). The expected single tool is `get_mountain_count_most_populous_country`. CUGA instead chained `get_most_populous_city_excluding_capital_global` → `get_country_of_city` → `get_mountain_count_by_country` and arrived at India (which has 0 mountains in the dataset), the wrong answer. The shortlister didn't surface the canonical tool because its CUGA-side description doesn't contain the user's vocabulary ("country with the greatest/largest population"). 
+ +**Recommended remediation — policy `ToolGuide`:** a `ToolGuide` policy targeting the `get_mountain_count_most_populous_country` tool name, with `prepend=true` and `guide_content`: *"Use this tool when the user asks for mountains in the country with the largest/greatest/most population. Do NOT compose with city-population tools."* This enriches CUGA's view of the tool description used by the shortlister; the upstream MCP tool definition is unchanged. + +#### 2.3.6 Cluster F — `hallucinated_no_tool` (n=1) + +Single case: `31d9743578dc-5b6784e8d151` (T2 movie_platform). CUGA emitted a Python code block describing what it WOULD do instead of executing it. Almost certainly the same root cause as Cluster B (movie_platform registry health). Remediation: **code** (sandbox guard: unknown-tool-reference → hard error; never emit raw code as the final answer) + the same needs-investigation that Cluster B needs. + +--- + +## 3. Per-case PF narratives — grouped by failure mode + +Per the lever taxonomy in §2.1, each case below has: (a) ReAct's answer + CUGA's answer + sub-scores, (b) the diagnosis (unchanged from the prior version), (c) a per-lever verdict table, and (d) a single recommended lever with the runner-up named. + +### 3.1 `verbose_answer_or_extra_tools` (n=23) + +> **Default recommendation for this cluster:** policy P-OF-1 (single-fact OutputFormatter that cites tool name + JSON key), augmented per case with P-OF-2 (hedging/appendix strip), P-PB-1 (no enumeration), P-PB-2 (no corroboration), or P-PB-3 (no idempotent retries) as called out below. Where the C1 nested-arg bug is also present, the code fix is recommended in parallel. + +#### 3.1.1 `6e317bcd6839-bbaadc612be9` | T2 books — "List all books published in 1995" + +**Diagnosis.** CUGA called the right tool (`get_book_titles_by_publication_year(year="1995")`), returned the same titles as ReAct. 
Answer judge 1.0, gnd judge 0.0 ("document provides only titles without any publication year information") — the year filter is not restated in the answer, so the judge cannot find it as a keyword. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — OutputFormatter (markdown) that opens the response with "Based on `get_book_titles_by_publication_year` for year 1995:". Puts the keyword back in front of the judge. | +| memory | WEAK — pattern is general; only helps after accumulated trajectories. | +| tool_enrichment | not applicable. | +| configuration | not applicable. | +| prompt | would work but global. | +| code | not needed. | + +**Recommended:** policy (P-OF-1 variant: restate the filter clause). Runner-up: prompt change (equivalent, more global). + +#### 3.1.2 `1960f609e439-e5d337d143b6` | T2 codebase_comments — "Stars of solution #45997?" + +**Diagnosis.** Both: "272 stars". CUGA `(answer=1.0, em=0.0, gnd=0.0)`. The canonical groundedness-judge confabulation example. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1 (single-tool-fact citation). | +| memory | weak. | +| tool_enrichment | not applicable. | +| configuration | not applicable. | +| prompt | redundant with policy. | +| code | not needed. | + +**Recommended:** policy P-OF-1. Runner-up: prompt change to FinalAnswerAgent (same outcome, global). + +#### 3.1.3 `1960f609e439-ab3a664a6a28` | T2 codebase_comments — "Solution path with highest processed time" + +**Diagnosis.** Same as 3.1.2. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. | +| tool_enrichment | not applicable. | +| configuration | not applicable. | +| prompt | redundant. | +| code | not needed. | + +**Recommended:** policy P-OF-1. + +#### 3.1.4 `1960f609e439-00fe3f448af7` | T2 codebase_comments — "Forks-to-stars % for solution 104086" + +**Diagnosis.** Both answered correctly. 
CUGA called three tools (the percent tool + raw forks + raw stars) — over-grounding. The extra payloads enter the document but their values are not in the answer, which the gnd judge flags. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1 + P-PB-2 ("one composite tool, no corroboration"). | +| memory | partial — would help with a "preferred composite tool for % queries in domain X". | +| tool_enrichment | partial — `ToolGuide` on the percent tool with "use this single tool; do not call raw count tools afterwards". | +| configuration | not applicable. | +| prompt | partial — could tell FinalAnswerAgent to omit redundant context but does not stop the extra upstream calls. | +| code | not needed. | + +**Recommended:** policy (P-OF-1 + P-PB-2). Runner-up: `ToolGuide` on the percent tool. + +#### 3.1.5 `1960f609e439-d1ba8f4ad233` | T2 codebase_comments — "Solution ids for repos with 238 forks" + +**Diagnosis.** Both: "62258 and 258160". CUGA called the right tool, then 4 verification calls. gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — `Playbook` "do not run verification calls after a list-returning tool succeeds" + P-OF-1. | +| memory | partial. | +| tool_enrichment | partial — `ToolGuide` on `get_solution_ids_by_repo_forks` saying "this tool returns the complete list; do not verify individual entries". | +| configuration | `REFLECTION_ENABLED=false` ablation likely eliminates the verification loop (the verify-after-success reads as reflection-driven). | +| prompt | partial — a global "do not verify" rule is risky (verification is sometimes warranted). | +| code | not needed. | + +**Recommended:** policy (Playbook + P-OF-1). Runner-up: configuration ablation (`REFLECTION_ENABLED=false`). + +#### 3.1.6 `55b7e50368aa-cbe1f5a85755` | T2 mondial_geo — "City of lake at (-85.35, 11.6)?" + +**Diagnosis.** Both: "Granada". CUGA gnd=0. Single-tool confabulation. 
+ +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.7 `55b7e50368aa-50580d511198` | T2 mondial_geo — "Most prevalent religion in Asia" + +**Diagnosis.** Both: "Islam (Muslim)". CUGA gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.8 `fe971e7f850a-f39d12a24e8a` | T2 soccer_2016 — "Country with most umpires, count?" + +**Diagnosis.** First call OK (country_id=1, count=27). Second call (`get_umpire_count_by_country`) hit the nested-arg bug and looped retrying. CUGA's final answer was hedged: "The dataset does not provide a tool to translate the country ID = 1 into its name." Hedge + enumeration tanked gnd. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1 + P-OF-2 (strip hedge tokens). | +| memory | weak on this turn. | +| tool_enrichment | partial — `ToolGuide` clarifying that the country/umpire pair returns an ID, not a name. | +| configuration | not applicable to the hedge symptom. | +| prompt | redundant. | +| code | **STRONG FIT** — nested-arg fix (C1) directly removes the retry loop. | + +**Recommended:** code (nested-arg) + policy (P-OF-1 + P-OF-2). + +#### 3.1.9 `fe971e7f850a-d96a7bc6401a` | T2 soccer_2016 — "City with most venues" + +**Diagnosis.** Both: "Abu Dhabi". CUGA gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.10 `fe971e7f850a-4c26b4a6556a` | T2 soccer_2016 — "Matches with 7-point winning margin" + +**Diagnosis.** Both: 69. CUGA called the same tool twice (idempotent reflection retry). gnd=0. 
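The idempotent retry described above (the same tool called twice with the same arguments) can be spotted mechanically when triaging traces. The following is a minimal sketch, assuming each recorded tool call is available as a dict with `name` and `args` keys; that record shape and the helper name `find_duplicate_calls` are illustrative assumptions, not the harness's actual trace format.

```python
import json
from collections import Counter


def find_duplicate_calls(calls):
    """Return {(tool_name, canonical_args_json): count} for repeated calls.

    `calls` is assumed to be a list of {"name": str, "args": dict} records
    (a hypothetical shape, not the harness's real trace schema).
    """
    counts = Counter(
        # sort_keys makes argument order irrelevant when comparing calls
        (call["name"], json.dumps(call["args"], sort_keys=True))
        for call in calls
    )
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result means the trace has no idempotent repeats; any entry is a candidate for the P-PB-3 (no idempotent retries) rule.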
+ +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1 + P-PB-3 (no idempotent retries). | +| memory | weak. | +| tool_enrichment | not applicable. | +| configuration | partial — `REFLECTION_ENABLED=false` removes the duplicate but is blunt. | +| prompt | redundant. | +| code | not needed. | + +**Recommended:** policy (P-OF-1 + P-PB-3). Runner-up: configuration ablation. + +#### 3.1.11 `fe971e7f850a-a9ff06e36390` | T2 soccer_2016 — "Players born in the 90s" + +**Diagnosis.** Both: 92. CUGA called once, gnd=0. Pure confabulation. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.12 `bc9218680ed5-5c65b18294ea` | T2 student_loan — "Disabled students absent 9 months" + +**Diagnosis.** Both: 7. CUGA gnd=0. Single-tool case. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.13 `bc9218680ed5-c0791be2fe5f` | T2 student_loan — "Males in >1 organization" + +**Diagnosis.** Both: 9. CUGA gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.14 `bc9218680ed5-8d19697e5e81` | T2 student_loan — "% male students" + +**Diagnosis.** Both: 49.7%. CUGA gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: redundant. | + +**Recommended:** policy P-OF-1. + +#### 3.1.15 `a823e527d383-9ca3b8a7ad8e` | T3 beer_factory — "Folsom customers using top non-alcoholic credit card" + +**Diagnosis.** Both: 56. CUGA 5 tools (nested-arg retry inflated). gnd=0. 
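The nested-arg retry mentioned above corresponds to the call shape quoted elsewhere in this report, for example `director={"director": "Wolfgang Reitherman"}` where a plain string is expected. The following is a minimal sketch of a defensive unwrap, assuming tool arguments arrive as a flat dict of primitives; `unwrap_nested_args` is a hypothetical helper for illustration and is not the deferred cuga-agent codegen fix.

```python
def unwrap_nested_args(args):
    """Unwrap single-key dict values that wrap a primitive (hypothetical helper).

    {"director": {"director": "X"}} becomes {"director": "X"}.
    Values that are not a one-key dict around a primitive are left untouched.
    """
    fixed = {}
    for name, value in args.items():
        if isinstance(value, dict) and len(value) == 1:
            (inner,) = value.values()
            # Only unwrap primitives; lists and nested dicts stay as they are.
            if isinstance(inner, (str, int, float, bool)):
                value = inner
        fixed[name] = value
    return fixed
```

The sketch deliberately leaves shapes like `container_type={'root_beer_details': []}` alone, since an empty payload passed downstream is a different failure from a wrapped primitive.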
+ +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. | +| tool_enrichment | partial — for the retry tail. | +| configuration | not applicable. | +| prompt | redundant. | +| code | **STRONG FIT** — nested-arg fix (C1) prevents the retry tail. | + +**Recommended:** code (nested-arg) + policy P-OF-1. + +#### 3.1.16 `2b28654158b1-a59483784521` | T3 college_completion — "Lowest grad-100 4-year public school in ID" + +**Diagnosis.** Both: Lewis-Clark State College. CUGA 5 tools, 20 LLM calls, 64 s. gnd=0 partly because CUGA volunteered "For context, the institution with the highest number of students in Idaho is BYU-Idaho..." — ungrounded extra paragraph. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-2 (strip "For context" appendix). | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: would work but global. | + +**Recommended:** policy P-OF-2. + +#### 3.1.17 `39a28b2592a2-ebd77c3a7592` | T3 computer_student — "Students advised by profs teaching basic/medium at most-teachers level" + +**Diagnosis.** Both: 0. CUGA 12 tools, 34 LLM calls, 113 s. gnd=0 — over-exploration introduced extra tool payloads. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1 + a chain-minimization `Playbook` for multi-hop intents. | +| memory | partial. | +| tool_enrichment | not applicable. | +| configuration | partial — `REFLECTION_ENABLED=false` ablation. | +| prompt | would work; less scoped. | +| code | not needed. | + +**Recommended:** policy (P-OF-1 + chain-minimization Playbook). Runner-up: configuration ablation. + +#### 3.1.18 `55b7e50368aa-cd69f2bccbaa` | T3 mondial_geo — "GDP of continent with country with most erosion of real income" + +**Diagnosis.** Both: 9,138,648. CUGA 6 tools (nested-arg bug retry: `continent_name={"continent":"Europe"}`), 17 LLM calls. gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. 
tool_enrichment / configuration: not applicable. prompt: redundant. | +| code | **STRONG FIT** — canonical nested-arg example. | + +**Recommended:** code (nested-arg fix) + policy P-OF-1. + +#### 3.1.19 `55b7e50368aa-2b73471429c9` | T3 mondial_geo — "Mountains in country with highest GDP" + +**Diagnosis.** Both: 0 (United States). CUGA 3 tools (nested-arg retry). gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration: not applicable. prompt: redundant. | +| code | **STRONG FIT** — nested-arg fix. | + +**Recommended:** code (nested-arg) + policy P-OF-1. + +#### 3.1.20 `fe971e7f850a-979018f9bffc` | T3 soccer_2016 — "Matches won by team that won match 336000 in 2008" + +**Diagnosis.** Both: 10. CUGA 4 tools (3 duplicate `get_match_winner` + `get_sum_matches_won`), 26 LLM calls, 63 s. CUGA also editorialized "However, Kings XI Punjab is a cricket franchise, not a soccer club..." (dataset is misnamed) — meta-commentary tanked gnd. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-2 (strip dataset meta-commentary) + P-OF-1. | +| memory | weak. tool_enrichment / configuration / code: not applicable. prompt: would work, less scoped. | + +**Recommended:** policy (P-OF-2 + P-OF-1). + +#### 3.1.21 `fe971e7f850a-67265ddc680f` | T3 soccer_2016 — "Cities in country of Rajkot" + +**Diagnosis.** Both: 20. CUGA 4 tools (nested-arg retry). gnd=0. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1. | +| memory | weak. tool_enrichment / configuration: not applicable. prompt: redundant. | +| code | **STRONG FIT** — nested-arg fix. | + +**Recommended:** code (nested-arg) + policy P-OF-1. + +#### 3.1.22 `fe971e7f850a-2c978b083683` | T3 soccer_2016 — "Season with most matches at M Chinnaswamy Stadium" + +**Diagnosis.** ReAct: "Season 9". CUGA: "Season 9 with 60 matches" (extra "60 matches" decoration). 
Expected tool sequence is 3 tools; CUGA's path used 3 different tools, so exactmatch fails on tool identity (separate signal). gnd=0 on the extra decoration. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-1 + a strip-decorative-appendix variant of P-OF-2. | +| memory | weak. | +| tool_enrichment | partial — `ToolGuide` on `get_top_season_by_venue`: "this returns the season — answer with just the season; do not append the match count". | +| configuration | not applicable. | +| prompt | redundant. | +| code | not needed. | + +**Recommended:** policy (P-OF-1 + appendix-strip). Note: tool-identity mismatch is a separate exactmatch issue and is part of the benchmark; the aggregation should still pass if `answer_s=1.0` and `gnd_s=1.0`. + +#### 3.1.23 `adba6c0ec8a8-f33b6a3e1a35` | T3 university — "% universities with teaching>90 in 2011 in same country as univ 112" + +**Diagnosis.** Both: 100%. CUGA 1,139 tool calls (extreme outlier), 40 LLM calls, 159 s, 967 k tok. Final answer is concise ("100 %") and answer=1.0; gnd=0. The 1,139 calls are an exploration explosion that doesn't change the final answer but dominates cost. Likely a registry list_tools loop rather than a multi-step plan. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** for the gnd lift — P-OF-1. Plus a `Playbook`/`IntentGuard` style step-budget rule (when the planner has exceeded N tool calls without a new useful signal, force an emit) — this maps to an `IntentGuard` that intercepts on state conditions. | +| memory | partial. | +| tool_enrichment | partial — `ToolGuide` on whichever composite percent tool exists. | +| configuration | **STRONG FIT** — try `ENABLE_TODOS=true` (forces a bounded plan); ablate `REFLECTION_ENABLED=false` (eliminates per-step recompute). | +| prompt | weak as a primary lever for the cost explosion. | +| code | partial — a planner-side per-turn step cap is the structural fix. 
| + +**Recommended:** configuration (`ENABLE_TODOS=true`; ablate `REFLECTION_ENABLED=false`) + policy P-OF-1 for gnd. **Needs-investigation** for the 1,139-call root cause — almost certainly not a true 1,139-step plan; more likely a registry list_tools loop. + +### 3.2 `no_trace_pre_llm_crash` (n=7) + +> **All 7 cases share remediation pattern**: needs-investigation (re-run the affected MCP server in isolation under the same CUGA config; capture registry stderr) + code (CUGA-side empty-tool-list failsafe). Policy / memory / config (other than registry health) cannot help when the LLM never ran. + +#### 3.2.1 `31d9743578dc-20fe1c6e0318` | T2 movie_platform — "Mubi movies by Hong Sang-soo" + +**Diagnosis.** No CUGA trace, empty answer. Domain shows 8/10 cases no-trace — domain-wide MCP-client or registry startup failure on the CUGA side. + +| lever | verdict | +| --- | --- | +| policy | weak — cannot fire if LLM never ran. | +| memory | not applicable. | +| tool_enrichment | not applicable. | +| configuration | **needs-investigation** — registry health for this domain. | +| prompt | not applicable. | +| code | **STRONG FIT** — empty-tool-list failsafe in MCP-client glue. | + +**Recommended:** needs-investigation (registry/MCP-client startup) + code failsafe. + +#### 3.2.2 `31d9743578dc-3b59b2d5a9b3` | T2 movie_platform — "Mubi director page URL for critic-39-likes movie" + +Same root cause as 3.2.1. Same recommendation. + +#### 3.2.3 `31d9743578dc-fa8256c2888f` | T2 movie_platform — "Creator of list 'Sound and Vision', was subscriber?" + +Same root cause as 3.2.1. Same recommendation. + +#### 3.2.4 `d14bbb0be92d-781ff55b91b7` | T2 professional_basketball — "All-Star players in 1973" + +**Diagnosis.** `professional_basketball` is 10/10 no-trace. Same domain-wide MCP/registry health issue as movie_platform. Same lever analysis as 3.2.1. **Recommended:** needs-investigation + code failsafe. 
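The "code failsafe" recommended for the no-trace cases can be pictured as a check that runs before the agent is started for a domain. The following is a minimal sketch, assuming the resolved tool list is available as a Python list at that point; `EmptyToolListError` and `require_tools` are hypothetical names, not existing CUGA APIs.

```python
class EmptyToolListError(RuntimeError):
    """Raised when a domain resolves to zero tools (hypothetical error type)."""


def require_tools(domain, tools):
    """Return `tools` unchanged, or fail loudly if the list is empty.

    Failing here turns a silent no-trace result into an explicit,
    attributable error for the affected domain.
    """
    if not tools:
        raise EmptyToolListError(
            f"registry returned no tools for domain {domain!r}; "
            "refusing to start the agent"
        )
    return tools
```

With a guard like this, a domain such as `professional_basketball` would surface a registry error in the results instead of an empty answer with no trace.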
+ +#### 3.2.5 `d14bbb0be92d-b94c0c0446e9` | T2 professional_basketball — "Most Improved 1985-90" + +Same as 3.2.4. + +#### 3.2.6 `d14bbb0be92d-7d51d5f6098d` | T2 professional_basketball — "BMI range query" + +Same as 3.2.4. + +#### 3.2.7 `fe971e7f850a-0bd47606e297` | T2 soccer_2016 — "Most common bowling skill" + +**Diagnosis.** Partial crash, not pure no-trace: 21 sandbox observations but no MCP tool successes; empty final_response. vakra `answer_s=1.0` (judge somehow extracted "right-arm medium" from intermediate scaffolding), `gnd_s=0.0`. Adjacent to Cluster B. + +| lever | verdict | +| --- | --- | +| policy | partial — `OutputFormatter` could detect empty final_response and emit a fallback diagnostic; does not solve root cause. | +| memory | not applicable. | +| tool_enrichment | not applicable. | +| configuration | **needs-investigation** — sandbox + registry health for soccer_2016. | +| prompt | not applicable. | +| code | **STRONG FIT** — empty-final-response guard + zero-successful-tool-call guard. | + +**Recommended:** code (empty-final-response guard + zero-tool-success guard). Needs-investigation: why sandbox ran without any tool successes. + +### 3.3 `wrong_args_or_aggregation` (n=7) + +#### 3.3.1 `840942187214-9915cb1b5445` | T2 authors — "Conference with most papers in 2012" + +**Diagnosis.** Right tool returned ICRA on call #1. CUGA then enumerated "Top 10 by paper count" — answer judge punished verbosity (`answer_s=0.0`, ground truth is a single short name). Also 403 tool calls (likely registry list_tools repetition under reflection). + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — `OutputFormatter` with trigger "response enumerates a Top-N list" + `Playbook` (P-PB-1) "if intent asks for a single item, return only that item". Policy is preferred over a global prompt because it scopes to singular intents only. | +| memory | partial — would help over time. | +| tool_enrichment | not applicable. 
| +| configuration | partial — `REFLECTION_ENABLED=false` ablation likely cuts the 403-call tail. | +| prompt | would work; global. | +| code | not needed. | + +**Recommended:** policy (`OutputFormatter` enum-strip + Playbook P-PB-1). Runner-up: configuration ablation (`REFLECTION_ENABLED=false`). + +#### 3.3.2 `bc9218680ed5-0b0ec8d0b7d2` | T2 student_loan — "Ratio of disabled students never absent" + +**Diagnosis.** Tool returned `{percentage: 1500}` (an outlier value in the dataset). ReAct parroted "1500%"; CUGA rescaled to "15%" (assuming a units error). Ground truth requires the literal number 1500; CUGA loses. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — `Playbook` "never numerically rescale, divide, or normalize a value that came directly from a tool response — emit it verbatim". Plus P-OF-1 to cite the tool key. | +| memory | weak. | +| tool_enrichment | partial — `ToolGuide` on this percent tool: "this tool's value is the answer as-is; do not interpret/normalize". | +| configuration | not applicable. | +| prompt | would work; policy is more scoped. | +| code | not needed. | + +**Recommended:** policy (Playbook + P-OF-1). Runner-up: `ToolGuide` on the specific tool. + +#### 3.3.3 `a823e527d383-ad24e3ec0328` | T3 beer_factory — "Non-alcoholic root-beer brands at coord (Folsom)" + +**Diagnosis.** CUGA's first call `get_root_beer_details(root_beer_id=10054)` returned `{root_beer_details: []}` — empty. CUGA correctly concluded "data missing" and returned 0. ReAct's identical call returned actual data (2,717). Possibly server-side data drift between runs, OR a CUGA-side call-shape issue. Note CUGA then passed `container_type={'root_beer_details': []}` downstream (nested-arg bug on the empty value). + +| lever | verdict | +| --- | --- | +| policy | partial — `Playbook` "if detail-fetch returns empty, retry with `str(id)` and `int(id)` before giving up" could mask transient call-shape issues. | +| memory | weak on first encounter. 
| +| tool_enrichment | partial — `ToolGuide` on `get_root_beer_details` clarifying argument type and "if response is empty, do NOT chain". | +| configuration | not applicable. | +| prompt | partial — same content as policy/ToolGuide. | +| code | **STRONG FIT** — nested-arg fix downstream + CUGA-side retry-with-other-primitive-type wrapper for detail-fetch tools. | + +**Recommended:** **needs-investigation** (replay this exact tool call against the same MCP image to determine whether the empty response is CUGA's call-shape or server-side drift). If call-shape: code (CUGA-side primitive-type retry wrapper). If server-side: out of scope for CUGA-side levers. + +#### 3.3.4 `34a533dfd727-9a80447e42a5` | T3 disney — "Movies of most productive director without villain" + +**Diagnosis.** Both got "The Many Adventures of Winnie the Pooh". CUGA `em_s=1.0` but `gnd_s=0.0`. Classic nested-arg bug: call 2 was `get_movies_without_villains_by_director(director={"director": "Wolfgang Reitherman"})` → validation error → retry as string → success. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** for the gnd lift — P-OF-1. | +| memory | weak. tool_enrichment / configuration: not applicable. prompt: redundant. | +| code | **STRONG FIT** — nested-arg fix. | + +**Recommended:** code (nested-arg) + policy P-OF-1. + +#### 3.3.5 `34a533dfd727-792336e9811f` | T3 disney — "Highest-gross movie by director of movie with most voice actors" + +**Diagnosis.** ReAct: "Moana". CUGA: "Treasure Planet, $55,189,145". CUGA called `get_highest_gross_movie_by_director(director="Ron Clements")` which returned `{movie_title: "Moana"}`, then chained an extra `get_movie_details_by_director` and used its output to override the correct earlier answer. Genuine wrong-aggregation: less-specific tool's output overrode purpose-built tool's. 
+ +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — `Playbook` "when multiple tools have returned candidate answers, prefer the purpose-built tool over the generic detail tool". Could also use `IntentGuard`/`ToolApproval` to prevent the secondary call after the primary returns. | +| memory | partial. | +| tool_enrichment | partial — `ToolGuide` on `get_highest_gross_movie_by_director`: "this returns the final answer for highest-gross-by-director queries — do NOT call detail tools to corroborate". | +| configuration | not applicable. | +| prompt | would work; less surgical. | +| code | not needed. | + +**Recommended:** policy (Playbook + ToolGuide). Runner-up: prompt. + +#### 3.3.6 `fe971e7f850a-0ce9f1bd5b3e` | T3 soccer_2016 — "Man-of-the-Series winner's Man-of-the-Match count" + +**Diagnosis.** ReAct: "SR Watson, 10". CUGA: "SR Watson, 940". CUGA used `get_man_of_the_match_count_by_player` (returns lifetime MoM count = 940). ReAct used a different endpoint scoped to the relevant series (returns 10). Wrong-tool-selected within a similar-named tool family. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — `ToolGuide` policies on both endpoints, disambiguating lifetime-MoM vs in-series-MoM. Plus a `Playbook` "when the user asks about a count tied to a specific series, prefer the series-scoped endpoint". | +| memory | strong once accumulated. | +| tool_enrichment | same as ToolGuide above (that IS the CUGA-side enrichment lever). | +| configuration | partial — higher shortlister candidate cap surfaces both tools but doesn't resolve the choice. | +| prompt | weak without specific tool context. | +| code | not needed. | + +**Recommended:** policy (`ToolGuide` disambiguating the two MoM endpoints + Playbook for series-scope preference). + +#### 3.3.7 `fe971e7f850a-d6dd43c77447` | T3 soccer_2016 — "Winning margin for match 419135 vs RCB on May 28 2008" + +**Diagnosis.** ReAct: "9 runs". CUGA: "42 runs" (wrong). 
Chain: `get_match_winner_by_match_id` (ok) → `get_win_margin_by_teams_and_date` (nested-arg bug → error) → `get_win_type_by_match_id` (got "runs") → `get_total_runs_scored_by_match_and_innings` (162) → off the rails. Had the second call succeeded, CUGA would have answered correctly. + +| lever | verdict | +| --- | --- | +| policy | partial — `Playbook` "if a primary lookup tool returns a validation error, retry once with corrected argument shapes before composing fallback tools". | +| memory | weak. | +| tool_enrichment | partial — `ToolGuide` on `get_win_margin_by_teams_and_date` specifying argument types. | +| configuration | not applicable. | +| prompt | redundant. | +| code | **STRONG FIT** — nested-arg fix is the direct cure. | + +**Recommended:** code (nested-arg). Runner-up: policy (Playbook for primary-tool-retry-before-compose). + +### 3.4 `timeout_or_giveup` (n=2) + +#### 3.4.1 `308738b8195d-56faa9f6bbd2` | T2 hockey — "Temporary-term coaches in 2007" + +**Diagnosis.** Expected tool `get_count_coaches_by_year_and_notes` wasn't surfaced by the shortlister (description vocabulary doesn't include "temporary/interim/notes"). CUGA explored 6 sibling coach tools and gave up at 140 s / 54 LLM calls with "cannot be completed". + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — (a) `Playbook` keyed on "temporary/interim/notes" in coach-domain intent that names the canonical tool directly; (b) P-OF-2 to strip the "cannot be completed" hedge if a numeric value did surface. | +| memory | **STRONG FIT** if accumulated — past similar trajectories would surface the canonical tool. | +| tool_enrichment | **STRONG FIT** — `ToolGuide` policy enriching `get_count_coaches_by_year_and_notes`'s description with "temporary", "interim", "term coach", "notes filter" pushes it up in the shortlister's embedding match. | +| configuration | partial — raising `LITE_MODE_TOOL_THRESHOLD` / `SHORTLISTING_TOOL_THRESHOLD` returns more candidates but adds noise. 
`TOOL_CALL_TIMEOUT` is per-call and not the binding constraint. | +| prompt | weak. | +| code | not needed if policy/memory/enrichment path works. | + +**Recommended:** policy (`ToolGuide` enrichment + Playbook naming the canonical tool). Runner-up: memory (if accumulated across runs); configuration tweak as fallback. + +#### 3.4.2 `39a28b2592a2-a6d040ce4d19` | T3 computer_student — "Students of top-advisors person, count by advisor" + +**Diagnosis.** CUGA had the correct value (13) in hand but wrapped the answer in "this figure is an **upper bound**" — answer judge punishes hedging. 195 s / 48 LLM calls / 1.2M tok also indicates exploration didn't terminate. + +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — P-OF-2 (strip hedging) + `Playbook` "when a tool has returned a numeric value matching the question intent, emit a confident final answer immediately". | +| memory | weak on this turn. | +| tool_enrichment | not applicable. | +| configuration | partial — `REFLECTION_ENABLED=false` ablation; lower step budget if such a flag exists. | +| prompt | would work; global. | +| code | partial — planner-side per-turn step cap. | + +**Recommended:** policy (P-OF-2 + early-terminate Playbook). Runner-up: configuration ablation. + +### 3.5 `wrong_tool_selected` (n=1) + +#### 3.5.1 `55b7e50368aa-7d0eae1aeaf4` | T2 mondial_geo — "Mountains in country with greatest population" + +**Diagnosis.** Expected single-shot tool `get_mountain_count_most_populous_country`. CUGA chained 3 lower-level tools (most-populous-city-excluding-capital → country-of-city → mountain-count-by-country) and got 0 (India). Shortlister didn't surface the canonical tool because its CUGA-side description doesn't include the user's vocabulary. 
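The `ToolGuide` remediation recommended for this case in §2.3.5 (`prepend=true` plus a `guide_content` string) amounts to putting the user's vocabulary in front of the tool description that the shortlister sees. The following is a minimal sketch of that prepend behaviour, assuming the description is a plain string; `apply_tool_guide` and the placeholder description are hypothetical and do not represent CUGA's policy engine.

```python
def apply_tool_guide(description, guide_content, prepend=True):
    """Combine guide text with a tool description (hypothetical helper)."""
    parts = [guide_content, description] if prepend else [description, guide_content]
    # Drop empty pieces so a missing description does not leave stray blank lines.
    return "\n\n".join(part.strip() for part in parts if part.strip())


guide = (
    "Use this tool when the user asks for mountains in the country with the "
    "largest/greatest/most population. Do NOT compose with city-population tools."
)
# Placeholder description; the real upstream MCP text is not shown in this report.
enriched = apply_tool_guide("Returns a mountain count.", guide)
```

Only the CUGA-side view of the description changes here; the upstream MCP tool definition stays as it is.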
+ +| lever | verdict | +| --- | --- | +| policy | **STRONG FIT** — `ToolGuide` policy on `get_mountain_count_most_populous_country`: prepend description with "Use this tool when the user asks for mountains in the country with the largest/greatest/most population. Do NOT compose with city-population tools." | +| memory | strong if accumulated. | +| tool_enrichment | same as policy ToolGuide above (CUGA-side enrichment). | +| configuration | partial — higher `SHORTLISTING_TOOL_THRESHOLD` surfaces more candidates but adds noise. | +| prompt | weak. | +| code | not needed. | + +**Recommended:** policy (`ToolGuide` enriching the tool's CUGA-side description). Runner-up: memory. + +### 3.6 `hallucinated_no_tool` (n=1) + +#### 3.6.1 `31d9743578dc-5b6784e8d151` | T2 movie_platform — "Movies in largest list + was creator subscriber?" + +**Diagnosis.** CUGA emitted a Python code block describing what it WOULD call (`await task_2_movie_platform_get_user_payment_methods_max_movie_number()`) rather than executing it. 0 actual tool calls despite 19 LLM calls and 91 s. Same domain as the no-trace bucket (Cluster B); almost certainly the same root cause (registry/MCP-client returned an empty/partial tool list, so the sandbox could not bind the referenced function and silently emitted its own code as the final answer). + +| lever | verdict | +| --- | --- | +| policy | partial — `IntentGuard` / `OutputFormatter` could detect "final answer is a Python code block with await calls" and force regeneration / mark failure (better signal). Does not fix root cause. | +| memory | not applicable. | +| tool_enrichment | not applicable. | +| configuration | **needs-investigation** — registry health for movie_platform. | +| prompt | not the right lever for the root cause. 
| +| code | **STRONG FIT** — CUGA-side guard in the sandbox executor: if the generated code's `await` references a function not in the resolved tool registry, raise a hard error instead of letting the code block become the final answer. Plus the empty-tool-list failsafe from Cluster B. | + +**Recommended:** code (sandbox guard: unknown-tool-reference → hard error). Runner-up: policy (`OutputFormatter` that flags "final answer contains `await` keyword" as malformed). **Needs-investigation:** registry health for movie_platform (same as Cluster B). + +--- + +## 4. Both-pass (PP, n=31) — resource comparison + +In the both-pass quadrant, **CUGA averages 9.7 LLM calls and ~67k tokens per turn, vs ReAct's 1–2 reasoning steps per turn** (ReAct files do not record token counts — note this caveat; the comparison is steps and qualitative). + +| uuid | task | domain | CUGA llm_calls | CUGA tokens | CUGA dur | ReAct pred_steps | flag | +| --- | --- | --- | --- | --- | --- | --- | --- | +| 840942187214-469be8d265fe | 2 | authors | 10 | n/a | 0.5s | 1 | | +| 840942187214-33ce7082460a | 2 | authors | 21 | 146,295 | 24.4s | 1 | **HIGH (≥10× median tokens for domain)** | +| 840942187214-a32375e83a83 | 2 | authors | 7 | 24,712 | 35.2s | 1 | | +| 840942187214-3e5def6a7777 | 2 | authors | 6 | 60,429 | 9.6s | 1 | | +| 6e317bcd6839-d631d2350e0e | 2 | books | 6 | 62,408 | 7.8s | 1 | | +| 6e317bcd6839-c471228ceebe | 2 | books | 10 | 49,205 | 6.7s | 1 | | +| 6e317bcd6839-d43ddbdc7a6b | 2 | books | 6 | 64,299 | 10.6s | 1 | | +| 6e317bcd6839-7fef82f27955 | 2 | books | 6 | 61,101 | 7.8s | 1 | | +| 6e317bcd6839-21e78bb4d842 | 2 | books | 6 | n/a | 0.0s | 1 | | +| 6e317bcd6839-38819e282933 | 2 | books | 10 | n/a | 0.0s | 1 | | +| 308738b8195d-5bd16a8893c5 | 2 | hockey | 10 | 80,556 | 11.4s | 1 | | +| 308738b8195d-18bbc5dc131d | 2 | hockey | 7 | n/a | 0.0s | 1 | | +| 308738b8195d-b11eeb38eded | 2 | hockey | 6 | 78,800 | 7.8s | 1 | | +| 308738b8195d-91d22b875555 | 2 | hockey | 8 | 93,204 | 13.5s | 
1 | | +| 308738b8195d-02960c8c0d16 | 2 | hockey | 6 | 4,436 | 0.6s | 1 | | +| 308738b8195d-8b4b06bc3398 | 2 | hockey | 8 | 79,746 | 11.1s | 1 | | +| 35a1befb81d1-535fcb56d619 | 2 | talkingdata | 6 | 73,597 | 9.0s | 1 | | +| 35a1befb81d1-cababea51b16 | 2 | talkingdata | 8 | 54,757 | 4.1s | 1 | | +| 35a1befb81d1-616dfb77883f | 2 | talkingdata | 19 | 177,442 | 13.1s | 1 | **HIGH (≥2× ReAct's steps multiplier; ≥3× median tokens for domain)** | +| 35a1befb81d1-b917690529c6 | 2 | talkingdata | 10 | 4,447 | 0.8s | 1 | | +| 35a1befb81d1-12813911c5ac | 2 | talkingdata | 8 | 57,993 | 4.5s | 1 | | +| 35a1befb81d1-8159fe422dba | 2 | talkingdata | 11 | 4,458 | 0.7s | 1 | | +| 35a1befb81d1-ddbd1a3fa836 | 2 | talkingdata | 6 | 67,858 | 7.2s | 1 | | +| 6e317bcd6839-ad992961f5b3 | 3 | books | 8 | 76,700 | 12.5s | 2 | | +| 6e317bcd6839-b723324adb74 | 3 | books | 10 | n/a | 0.0s | 2 | | +| 6e317bcd6839-36f772964957 | 3 | books | 8 | 55,989 | 8.2s | 2 | | +| 6e317bcd6839-3931dd363e90 | 3 | books | 13 | 75,140 | 44.8s | 2 | | +| 6e317bcd6839-72902fb53f6b | 3 | books | 8 | 80,892 | 12.6s | 2 | | +| 34a533dfd727-dd4cd21da184 | 3 | disney | 12 | 99,094 | 17.3s | 2 | | +| adba6c0ec8a8-ea3c7885c2c2 | 3 | university | 27 | 46,974 | 12.3s | 2 | **HIGH (≥10× ReAct's step multiplier)** | +| adba6c0ec8a8-c70d9c072b32 | 3 | university | 13 | n/a | 0.0s | 2 | | + +**Commentary**: in the both-pass quadrant, CUGA averages **9.7 LLM calls and ~67 k tokens** per turn vs ReAct's 1–2 reasoning steps. Three outliers stand out for engineering attention: `840942187214-33ce7082460a` (21 LLM calls, 146 k tok), `35a1befb81d1-616dfb77883f` (19 LLM calls, 177 k tok), and `adba6c0ec8a8-ea3c7885c2c2` (27 LLM calls). These are not failures — they're costing 5–10× the resources to get the same right answer. Look there next for optimization (and notably, all three involve the reflection step being triggered repeatedly). 
This report proposes no remediations for PP cases, since they already pass, but flags them for the perf workstream. + +A separate concern: many CUGA report.md rows show `dur=0.0s` even when LLM calls happened. This looks like a logging bug (probably an aborted duration measurement): the row exists and the trace exists, but `full_execution_time` wasn't captured. It is worth fixing in the eval harness but doesn't affect correctness. + +--- + +## 5. ReAct fail / CUGA pass (FP, n=9) — strengths to preserve + +> These 9 are cases where CUGA succeeds and ReAct does not; fixing the PF cases must not regress them. Notably, most are queries where ReAct hedged ("I'm sorry, I couldn't find any papers authored by Barrasa", "I'm not sure I understand…") while CUGA confidently surfaced the answer. CUGA's "always commit to a numeric/named answer" stance is a strength: the P-OF-2 hedging-strip policy must NOT make CUGA hedge more, only less. + +| uuid | task | domain | intent (≤120 chars) | CUGA answer (short) | +| --- | --- | --- | --- | --- | +| 840942187214-b43e6be1231f | 2 | authors | What percentage of journals whose short name begins with ANN were published in 1989? | "1.44 % of journals…" | +| 840942187214-52f6889d5091 | 2 | authors | Among the papers published in 2009, pick 10 and list down the conference's short name | "ICIC, ICIC, ICIC, ICIC, RFIC, …" | +| 840942187214-6e7fd607bd84 | 2 | authors | State the year and title of papers written by Barrasa. | "2004 – 'Survey of scalability techniques for reasoning with ontologies'" | +| 6e317bcd6839-3161b11a0fd3 | 2 | books | Among books ordered by Lucas Wyldbore, % over $13? | "33.33 %" | +| 35a1befb81d1-7e2e878bc36e | 2 | talkingdata | For event at 23:55:16 on 2016/5/7 at (113, 28), on what device? 
| (confident "device model not found via available APIs" — matched GT) | +| 6e317bcd6839-a76b0204cd69 | 3 | books | average of books in language of ISBN 23755004321 among books by Carole Marsh | "Both books are in English (same lang as 23755004321)" | +| 34a533dfd727-7a0fdbd1aa0c | 3 | disney | list villains of all movies directed by most productive director | "Madam Mim; Kaa and Shere Khan; Edgar Balthazar; Prince…" | +| 3e2b8f7655a2-001669797531 | 3 | trains | trains opposite-direction to most rectangle-2nd-car trains, w/ double-sided… | "1 train" | +| adba6c0ec8a8-a705643b237f | 3 | university | % female students in 2016 among universities in same location as 112 | "46.81 %" | + +CUGA's edge here is **confidence and tool-grounding** — it commits to an answer when ReAct surrenders to ambiguity. The P-OF-2 (hedging-strip) policy is designed to preserve this: it only fires when a numeric/named answer is *already present* in the response, and strips hedging clauses around it. It does not make CUGA more conservative when no answer is in hand. + +--- + +## 6. Both-fail appendix (FF, n=119) + +| uuid | task | domain | intent (≤90 chars) | CUGA fail mode | +| --- | --- | --- | --- | --- | +| 840942187214-305bfcd9e5bc | 2 | authors | Please provide the titles of any two papers that are either preprinted or unpublished… | wrong_answer | +| 840942187214-d5d07b87e19a | 2 | authors | What is the ratio of author with affiliation and without affiliation? | wrong_answer | +| 6e317bcd6839-17b6cf7248bc | 2 | books | How many customers ordered the oldest book? | wrong_answer | +| 6e317bcd6839-a34cc3ccadfa | 2 | books | Who ordered the book with the cheapest price? | wrong_answer | +| 1960f609e439-42eba90b0cb2 | 2 | codebase_comments | List all the methods with a solution with a "636449700980488000" processed time. 
| timeout_or_giveup | +| 1960f609e439-e82ba6721008 | 2 | codebase_comments | What is the average processed time of the solution paths inside the "https://github.com/…" | no_trace_pre_llm_crash | +| 1960f609e439-8c022719a1ca | 2 | codebase_comments | How many percent more of the watchers for the repo of solution 83855 than 1502? | wrong_answer | +| 1960f609e439-65a22f6df967 | 2 | codebase_comments | For the solution of the most 'sw' methods, what is its path? | wrong_answer | +| 1960f609e439-0a652fb03008 | 2 | codebase_comments | How many methods in the same repository share a tokenized name that begins with "query lang…" | timeout_or_giveup | +| 1960f609e439-eb7d27ccd669 | 2 | codebase_comments | In "maxild_playground\Playground.sln", what is the time of sampling for the method… | wrong_answer | +| 308738b8195d-05ec22ea67ac | 2 | hockey | In 1998, How many wins were made by team 'CAR' per game played? Who contributed the most goals? | wrong_answer | +| 308738b8195d-3e3d6db9ab3d | 2 | hockey | What is the position of the 9th oldest hockey player? | wrong_answer | +| 308738b8195d-9296bdde9ace | 2 | hockey | How many teams scored against their opponent who had pulled their goalie in the year 2005? | wrong_answer | +| 55b7e50368aa-2da1ea58205c | 2 | mondial_geo | In which lake flows the river that is, in turn, the mouth of the Manicouagan River? | timeout_or_giveup | +| 55b7e50368aa-92dc160b1c20 | 2 | mondial_geo | Which two countries have the border in length of 803 km? Give the full names. | no_trace_pre_llm_crash | +| 55b7e50368aa-3e01aa27adb5 | 2 | mondial_geo | Name the tallest mountain on Himalaya and what is its height. | wrong_answer | +| 55b7e50368aa-9768f4b5bc9a | 2 | mondial_geo | In which province is the highest volcano mountain located in? | timeout_or_giveup | +| 55b7e50368aa-9e60460b6696 | 2 | mondial_geo | How many percent of the mountains on Andes which are non-volcanic? 
| timeout_or_giveup | +| 55b7e50368aa-b86a3020cdc2 | 2 | mondial_geo | What is the capital of the country that has the Licancabur Mountain? | wrong_answer | +| 55b7e50368aa-814c9d0d18dc | 2 | mondial_geo | What sea does the Baltic Sea converge with, and how deep is the Baltic Sea? | wrong_answer | +| 31d9743578dc-ae07dc845bae | 2 | movie_platform | What is the user avatar url for user 41579158? What is the latest movie rated by him/her? | wrong_answer | +| 31d9743578dc-00550ec0c9f4 | 2 | movie_platform | What's the url of user 39115684's rating on the movie 'When Will I Be Loved'? | no_trace_pre_llm_crash | +| 31d9743578dc-949db4b1b7ef | 2 | movie_platform | When did the creator of the list "250 Favourite Films" last update a movie list? | no_trace_pre_llm_crash | +| 31d9743578dc-2d279e62f33e | 2 | movie_platform | How many users were paying subscribers when they rated the movie released as ... | no_trace_pre_llm_crash | +| 31d9743578dc-3130ec5cddcb | 2 | movie_platform | user ID of subscriber who created a list for ... | no_trace_pre_llm_crash | +| 31d9743578dc-ff2d3498a62d | 2 | movie_platform | Avg number of movies added to lists of user 8516503? Indicate how many… | no_trace_pre_llm_crash | +| d14bbb0be92d-9b44601b01e2 | 2 | professional_basketball | How many All Star players who played in the 1973 season were black? | no_trace_pre_llm_crash | +| d14bbb0be92d-dbf89482eaf1 | 2 | professional_basketball | In the year 1997 allstar game, which teams did the players with the most rebounds play in? | no_trace_pre_llm_crash | +| d14bbb0be92d-0283d0ffdf31 | 2 | professional_basketball | Name the teams along with the coaches that went to 'Quarter Final' round in 1946. 
| no_trace_pre_llm_crash | +| d14bbb0be92d-8486689ff949 | 2 | professional_basketball | How many total minutes has the Brooklyn-born player, known by the name of Superman, played | no_trace_pre_llm_crash | +| d14bbb0be92d-da610a7c37f6 | 2 | professional_basketball | What is the name of the team with the highest home lost rate? | no_trace_pre_llm_crash | +| d14bbb0be92d-77fce6c85d4d | 2 | professional_basketball | What is the birth date of the player with the most assists during the 1985 All-Star season? | no_trace_pre_llm_crash | +| d14bbb0be92d-bb4405fa6cf5 | 2 | professional_basketball | List the champion (team name) and year from year 1950 to 1960. | no_trace_pre_llm_crash | +| fe971e7f850a-4e39ce9069a6 | 2 | soccer_2016 | List down all of the winning teams' IDs that played in St George's Park. | wrong_answer | +| fe971e7f850a-facf43ff16b2 | 2 | soccer_2016 | What is the average number of extra runs made as noballs? | wrong_answer | +| fe971e7f850a-89961390d1c4 | 2 | soccer_2016 | Among the matches, what percentage have a winning margin above 100? | wrong_answer | +| fe971e7f850a-976bb90380b3 | 2 | soccer_2016 | What is the date of the match that has the highest wager on the final result of a game? | hallucinated_no_tool | +| fe971e7f850a-cbb611c31c02 | 2 | soccer_2016 | How many players bat with their left hands? | wrong_answer | +| bc9218680ed5-ba21da2cee4a | 2 | student_loan | What is the average time for a disabled student to be absent from school? | wrong_answer | +| bc9218680ed5-787eda184fe2 | 2 | student_loan | State the number of students who filed for bankruptcy and have payment due. | wrong_answer | +| bc9218680ed5-4877ee4eec05 | 2 | student_loan | How many male students are enrolled at OCC? | wrong_answer | +| bc9218680ed5-caef075781c4 | 2 | student_loan | Calculate the average enlisted students per organization. | wrong_answer | +| bc9218680ed5-085696ec16fd | 2 | student_loan | What is the average absence period of a disabled student? 
| wrong_answer | +| bc9218680ed5-b319826839ce | 2 | student_loan | Which department has the most disabled students? | wrong_answer | +| 35a1befb81d1-25aa8472bd40 | 2 | talkingdata | What are the categories of the top 2 oldest events? | wrong_answer | +| 35a1befb81d1-2bd6a11a8f64 | 2 | talkingdata | What is the average age of the female users who uses a vivo device? | wrong_answer | +| a823e527d383-5ee3999c12b4 | 3 | beer_factory | How many sweet bottled root beers that do not contain cane sugar… | wrong_answer | +| a823e527d383-90d9c3222d0c | 3 | beer_factory | List out the root beer ID for the brand of the root beer that gained a 1-star rating… | wrong_answer | +| a823e527d383-fda0366a3244 | 3 | beer_factory | How many bottles of beer have been bought by Jim Breech for root beer ID 10054? | wrong_answer | +| a823e527d383-724b86449639 | 3 | beer_factory | difference between bottles of root beer sold from Louisiana and Missouri for… | wrong_answer | +| a823e527d383-03f8c708e98a | 3 | beer_factory | Among the transactions for the purchase of non-alcoholic beer, what % is done by … | wrong_answer | +| a823e527d383-a5aabff8d296 | 3 | beer_factory | Which location sold more bottles of beer, and what is the transaction ratio at Sac State… | wrong_answer | +| a823e527d383-aa69427d37ca | 3 | beer_factory | How many female mailing list subscribers from the city where customer finished tx n… | wrong_answer | +| a823e527d383-0ce1acfc68ce | 3 | beer_factory | How many times did the first customer use the credit card type they used between 12/25/… | wrong_answer | +| 6e317bcd6839-003e2ad6980b | 3 | books | List the ISBN of the book 'El plan infinito' written in the language it is originally written | wrong_answer | +| 6e317bcd6839-01bde0281b9a | 3 | books | publisher who published the first book of the author who published a book on … | wrong_answer | +| 6e317bcd6839-2317dce0eb47 | 3 | books | % of books published by Ace Books in language of first two published books… | 
timeout_or_giveup | +| 6e317bcd6839-b1ade8414e0d | 3 | books | average of books in languages of first two published books among all books… | wrong_answer | +| 2b28654158b1-8dcab257ca67 | 3 | college_completion | In Connecticut, avg Black students per year who were bachelor… | timeout_or_giveup | +| 2b28654158b1-a5540d8f55af | 3 | college_completion | Among the race of all students, which school in "KY" with the highest # of students… | timeout_or_giveup | +| 2b28654158b1-3979be5a9c26 | 3 | college_completion | % of Asian students among students of other races who graduated from… | timeout_or_giveup | +| 2b28654158b1-ee186bf084e4 | 3 | college_completion | Among institutes in state with most graduate cohort 2012 from private… | no_trace_pre_llm_crash | +| 2b28654158b1-c9a293c0fce5 | 3 | college_completion | How many students for both genders graduated from a 2-year institute in… | timeout_or_giveup | +| 2b28654158b1-22242ab46911 | 3 | college_completion | % of Asian students among students of other races who graduated from… | timeout_or_giveup | +| 2b28654158b1-526967fd559c | 3 | college_completion | Among Ivy League Schools, which school's state has the lowest appropriations… | timeout_or_giveup | +| 2b28654158b1-50cc9614237e | 3 | college_completion | % of 4-year public schools from Madison Area Technical College's… | wrong_answer | +| 2b28654158b1-e0bf826a9e36 | 3 | college_completion | Among institutes in state with most graduate cohort 2012 from private… | timeout_or_giveup | +| 39a28b2592a2-2607b42826bf | 3 | computer_student | How many courses for basic or medium undergraduate at level with same … | wrong_answer | +| 39a28b2592a2-3e453bb6b9af | 3 | computer_student | How many basic/medium UG courses taught by prof in course-level with most… | wrong_answer | +| 39a28b2592a2-e67e788f8418 | 3 | computer_student | How many basic and medium undergraduate courses are there, considering… | wrong_answer | +| 39a28b2592a2-028bd872bb8a | 3 | computer_student | How many 
teachers are faculty employees who taught high-level UG of <10… | wrong_answer | +| 39a28b2592a2-e1b2e78c96b8 | 3 | computer_student | How many basic and medium UG courses among the courses with the most… | wrong_answer | +| 39a28b2592a2-6a5b7a452425 | 3 | computer_student | How many basic and medium UG courses are taught by faculty member who… | wrong_answer | +| 39a28b2592a2-7b268478811c | 3 | computer_student | Which faculty employees teach a basic or medium UG course that has most… | wrong_answer | +| 39a28b2592a2-098e971bb045 | 3 | computer_student | How many courses for basic or medium UG taught by the faculty member who… | wrong_answer | +| 34a533dfd727-44575f9abc41 | 3 | disney | How many movies for mature/PG did Bill Thompson work as a voice for? | wrong_answer | +| 34a533dfd727-2322cd0c7f3f | 3 | disney | Which movies directed by most productive director can be watched by general audience? | wrong_answer | +| 34a533dfd727-a50635130b38 | 3 | disney | How many PG adventure movies did director of movie with most voice actors direct? | wrong_answer | +| 34a533dfd727-1f689826a14b | 3 | disney | Release date of Lion King directed by person who directed the most popular… | wrong_answer | +| 34a533dfd727-15ee061b3677 | 3 | disney | List voice actors in movie directed by director of Pinocchio released on F... 
| wrong_answer | +| 34a533dfd727-9eefc7d3a3fa | 3 | disney | List voice actors in movie directed by director of Disney's most popular adventure… | wrong_answer | +| 55b7e50368aa-a1ea6e0aee2e | 3 | mondial_geo | How many mountains in top-3 GDP economies with lowest proportion of … | timeout_or_giveup | +| 55b7e50368aa-ac0fd8df4ccb | 3 | mondial_geo | Please name 3 sovereign nations governed by government type of country… | wrong_answer | +| 55b7e50368aa-ef9c0baae036 | 3 | mondial_geo | Among orgs HQ in one of the two countries with longest border in… | wrong_answer | +| 55b7e50368aa-05a86dbfc326 | 3 | mondial_geo | How many lakes in 4th most populous African country with same govt type… | timeout_or_giveup | +| 55b7e50368aa-75340fb8d38c | 3 | mondial_geo | Nation's GDP lowest among communist states bordering smallest border… | wrong_answer | +| 55b7e50368aa-91888b0b341e | 3 | mondial_geo | In which year were most organizations created on the continent with country… | timeout_or_giveup | +| 55b7e50368aa-47d648237d18 | 3 | mondial_geo | Of countries sharing territory with >1 continent and avg pop… | wrong_answer | +| 55b7e50368aa-ed9c4bb75dcc | 3 | mondial_geo | Proportion of English-speaking citizens in 2 countries with longest border… | wrong_answer | +| fe971e7f850a-a29079011b89 | 3 | soccer_2016 | How many matches did team that played in a match resulting in tie in 2015 win… | wrong_answer | +| fe971e7f850a-f8f8a23174de | 3 | soccer_2016 | How many matches did the second team in match with lowest winning margin play in S8? | timeout_or_giveup | +| fe971e7f850a-17cffbf06c49 | 3 | soccer_2016 | How many left-hand batting players from country of city "Rajkot"? | wrong_answer | +| fe971e7f850a-c133dba8dff5 | 3 | soccer_2016 | Among players born after 1985, % using same … | wrong_answer | +| fe971e7f850a-8d0f6593fd9a | 3 | soccer_2016 | How many matches did team of players in match ID 335990 win in 2008? 
| wrong_answer | +| 3e2b8f7655a2-6b356a39bd56 | 3 | trains | direction of train with short ellipse car with load shape in its 2nd car? | wrong_answer | +| 3e2b8f7655a2-e1f013d5b67b | 3 | trains | How many cars running same direction as train with ellipse-shape have double-sided? | wrong_answer | +| 3e2b8f7655a2-fe844c4b5921 | 3 | trains | IDs of all cars with double sides on trains opposite direction t… | wrong_answer | +| 3e2b8f7655a2-48b8377cc3d0 | 3 | trains | Among trains with rect-2nd cars, how many have ≤1 car with open … | wrong_answer | +| 3e2b8f7655a2-081fd2441201 | 3 | trains | Among trains with 2 or less cars, how many have ≤1 car with open ro… | wrong_answer | +| 3e2b8f7655a2-d5464931bd4a | 3 | trains | trains with rectangle-shaped 2nd cars running same direction with double sided | wrong_answer | +| 3e2b8f7655a2-c28d9f399463 | 3 | trains | Among trains with rect-2nd cars, how many have three-wheeled, jagged roof cars? | wrong_answer | +| 3e2b8f7655a2-5a1cf4c68245 | 3 | trains | Among trains running same direction as train with ellipse-shaped car… | timeout_or_giveup | +| 3e2b8f7655a2-0ef49d666fe9 | 3 | trains | How many trains with 2 or less cars and running west have double sided cars in 3rd | wrong_answer | +| adba6c0ec8a8-7ee0c032ad57 | 3 | university | In nation where Harvard located, % of female students in universities… | wrong_answer | +| adba6c0ec8a8-b04dba450471 | 3 | university | How many univs have ≥20,000 female students in 2016? Identify how many | wrong_answer | +| adba6c0ec8a8-b2fc3e4f3b70 | 3 | university | Among universities with teaching score >90 in 2011, % of those… | timeout_or_giveup | +| adba6c0ec8a8-83dbc4ae1e32 | 3 | university | How many univs have ≥20,000 female students in 2016? 
Identify how many | wrong_answer | +| adba6c0ec8a8-7a658ce6f59d | 3 | university | Among universities with teaching score >90 in 2011, % of those… | timeout_or_giveup | +| adba6c0ec8a8-e52b98594643 | 3 | university | Among universities with teaching score >90 in 2011, % of those… | wrong_answer | +| 2ffd766bcf59-5032646dbcfe | 3 | wdi | Avg of Adjusted net enrolment rate, primary in Algeria | wrong_answer | +| 2ffd766bcf59-d565d4ddc1b7 | 3 | wdi | List table name and currency unit of countries using series FP.CPI.TOTL | wrong_answer | +| 2ffd766bcf59-bcdf6da41003 | 3 | wdi | List East Asia & Pacific countries under High income: nonOECD | wrong_answer | +| 2ffd766bcf59-205596757a77 | 3 | wdi | Total urban population of middle income countries in 1960 | wrong_answer | +| 2ffd766bcf59-2a6fb2ae51bf | 3 | wdi | % of countries in region with country with highest pop… | timeout_or_giveup | +| 2ffd766bcf59-e09c86a99e59 | 3 | wdi | Which indicator uses aggregation method for indicator value 133 in 1960… | timeout_or_giveup | +| 2ffd766bcf59-dff92e0acb8d | 3 | wdi | Sources for data of children who finished primary school education in countries… | timeout_or_giveup | +| 2ffd766bcf59-873d5b415b1c | 3 | wdi | In country with highest population in largest city for 19 consecutive years… | timeout_or_giveup | +| 2ffd766bcf59-21de28d36ca7 | 3 | wdi | How many countries have footnotes described same way as footnote on series code | wrong_answer | +| 2ffd766bcf59-a5fb5e8eb512 | 3 | wdi | Full names of any 2 countries that use the same trade system as Bulgaria… | timeout_or_giveup | + +**FF clustering observation**. The FF set concentrates heavily in Task 3 (multihop reasoning): `world_development_indicators` 10/10, `college_completion` 9/10, `trains` 9/10, `beer_factory` 8/10, `computer_student` 8/10, `mondial_geo` 8/10 — and in Task 2 in the "no-trace" domains: `professional_basketball` 7/10, `movie_platform` 6/10. 
The cluster on multihop reasoning suggests CUGA's planner struggles with 3+ tool chains in unfamiliar domains. The same CUGA-side levers identified in §3 — nested-arg code fix (helps ~20% of multihop chains), policy bundle (gnd lift + chain-minimization Playbooks), and registry-health investigation (no-trace domains) — are the FF workstream's natural starting points, but the FF set is out of scope for this report. + +--- + +## 7. Methodology & data sources + +**CUGA run.** +- Date: 2026-04-28, captured by the bundle dir `20260428_201443_default`. `metadata.json.created_at = 2026-04-28T21:21:05Z`. +- Agent: `cuga_sdk` v0.2.20, git `df40ff98` (branch `fix/watsonx-empty-response-format`, dirty). +- Model: `openai/gpt-oss-120b` via Groq (`AGENT_SETTING_CONFIG=settings.groq.toml`). +- Key env vars: `LITE_MODE=true`, `LITE_MODE_TOOL_THRESHOLD=500`, `SHORTLISTING_TOOL_THRESHOLD=1`, `FORCE_AUTONOMOUS_MODE=true`, `REFLECTION_ENABLED=true`, `ENABLE_TODOS=false`, `POLICY__ENABLED=false`, `DECOMPOSITION_STRATEGY=exact`, `TOOL_CALL_TIMEOUT=120`, `CUGA_MODE=accurate`, `REGISTRY=true`, `LOCAL_SANBDOX=true` (sic). +- Bundle: 200 results in `results/m3_config_20260428_231430.json`; 178 langfuse traces in `langfuse_traces/` (22 missing — see §3.2); 200 rows in `report.md`; per-domain vakra files in `benchmarks/m3/results/_vakra/{groundtruth,prediction}/.json`. + +**ReAct run.** +- Source: `benchmarks/m3/results_react/task{2,3}_lg_gpt-oss-120b.json`. No timestamp in those files; date unknown from data. +- Agent: LangGraph standard ReAct, single agent, same `openai/gpt-oss-120b` model. +- 200 dialogues, scored with the same M3 vakra evaluator (same three judges) — verified by comparing `score_explanation.answer` prompts and judge wording. + +**Join.** +- CUGA langfuse `metadata.uuid` (also `input.task_name`) == ReAct dialogue `uuid` (1:1). +- 178/200 join cleanly; the 22 unjoinable cases are exactly the cases where CUGA produced no langfuse trace. 
All 22 are CUGA failures per the `report.md` row that still exists for them. +- For report.md row → uuid mapping: per `benchmarks/helpers/compare_report.py:_bucket_m3_tasks`, rows are grouped by `(m3_task_id, domain)` and sorted by `uuid` within each bucket, then numbered 1..N. We re-implemented this ordering (`_assign_v3.py`) and verified that the report.md ✓/✗ flag matches the `success` field of the corresponding `results.json` entry on all 200 rows. The earlier approach of using langfuse `metadata.task_index` does NOT match report.md ordering and produced 20+ misassignments; this is the parser caveat to preserve for future readers. +- The `report.md` parser must use explicit empty-string checks for the carry-forward Task/Domain columns. The bug to avoid is `set(cells[0]) <= set('- ')` evaluating to True for an empty cell: the empty set is a subset of every set, so the empty case must be handled explicitly. + +**CUGA policy engine: where to look.** +- Policy models: `cuga-agent/src/cuga/backend/cuga_graph/policy/models.py` (Playbook, IntentGuard, ToolGuide, ToolApproval, OutputFormatter, CustomPolicy). +- Policy enactment in the lite-mode path (used by this benchmark): `cuga-agent/src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_node.py:_apply_output_formatter` (≈ L341), invoked at callback after subgraph execution. +- Shared OutputFormatter application: `cuga-agent/src/cuga/backend/cuga_graph/policy/output_formatter_utils.py:apply_output_formatter_policies`. +- Sample benchmark policy bundle (for shape reference): `benchmarks/bpo/policies/policies.json`, which holds 12 policies (10 playbooks, 1 tool_guide, 1 output_formatter) with both keyword and natural_language triggers. +- Enable flag: `DYNACONF_POLICY__ENABLED` in the run env. Currently `false` in `benchmarks/m3/config/m3.env` and in this bundle's `metadata.json`. + +**Files / paths used.** +- `benchmarks/m3/evaluation_bundles/20260428_201443_default/report.md` — pass/fail table source. 
+- `benchmarks/m3/evaluation_bundles/20260428_201443_default/results/m3_config_20260428_231430.json` — full per-case CUGA results including `vakra` sub-scoring, top-level `success`, `tool_calls`, `expected_output`, `tool_call_diffs`. +- `benchmarks/m3/evaluation_bundles/20260428_201443_default/langfuse_traces/*.json` — 178 CUGA traces with full `observations` log. +- `benchmarks/m3/evaluation_bundles/20260428_201443_default/metadata.json` — CUGA run config. +- `benchmarks/m3/results_react/task{2,3}_lg_gpt-oss-120b.json` — ReAct scored output. +- `benchmarks/m3/results/_vakra/{prediction,groundtruth}/.json` — exact inputs that the vakra judges saw, used to verify groundedness-judge confabulation. +- `benchmarks/helpers/compare_report.py` — report.md generator; sources the row ordering rule we replicated. + +Intermediate artifacts produced during this analysis (kept in the bundle dir for reproducibility): `_build_join.py`, `_assign_v3.py`, `_quadrants.py`, `_pf_with_vakra.py`, `_pf_dump.py`, `_pp_fp_ff_dump.py`, `_ff_dump.py`, `_score_dist.py`, `_groundedness_probe.py`, `_vakra_inspect.py`. Joined dataset at `docs/m3-vakra-analysis-20260428/joined.json` and `quadrants_v2.json`. Per-PF multi-lever remediation matrix at `docs/m3-vakra-analysis-20260428/_pf_remediation_plan.json` (keyed by uuid; the JSON intermediate used to render §3). + +--- + +## Top-3 remediation priorities, ranked by expected pass-rate lift + +1. **Build and enable a small M3-scoped CUGA policy bundle** (`DYNACONF_POLICY__ENABLED=true`, policy folder scoped to the benchmark, modeled on `benchmarks/bpo/policies/policies.json`). 
Specifically: P-OF-1 (single-fact OutputFormatter that cites tool name + JSON key); P-OF-2 (strip hedging / "For context" appendices / dataset meta-commentary when a numeric or named answer is present); P-PB-1 (no enumeration when a single item was asked); P-PB-2 (one composite tool, no corroboration on percent/ratio intents); P-PB-3 (no idempotent retries); plus tool-disambiguation `ToolGuide` policies for the 2 wrong-tool cases (mondial_geo mountains + soccer_2016 MoM). **Expected to flip ~14–20 of the 26 groundedness-asymmetry cases + the 2 hedging cases + the 1 enumeration case + the 1 wrong-tool case.** Estimated lift: **+9–12 pp** (CUGA 20% → ~29–32%). Cheapest *new* dependency: one policy.json file + flipping one env var. + +2. **Fix the nested-argument bug in CUGA sandbox codegen** (Cluster C1). Affects at least 7 PF cases directly (3.1.8, 3.1.15, 3.1.18, 3.1.19, 3.1.21, 3.3.4, 3.3.7) and many more across PP/FF, where the retry inflates cost and, on multihop chains, the second-tool error can bias the final answer. Estimated lift: **+3–5 pp** in PF alone, plus large cost reduction across all quadrants. Cheap: one codegen rule. + +3. **Diagnose and fix `movie_platform` and `professional_basketball` CUGA-side registry/MCP-client health** (Cluster B + 3.6.1). Affects 7 PF cases directly and ~13 FF cases in those domains. Add an empty-tool-list guard so future runs fail loudly instead of silently. Estimated lift: **+3 pp PF, up to +6–7 pp FF** if the underlying CUGA-side issue resolves. Needs: registry stderr capture from the next M3 run. + +Combined expected post-fix CUGA pass rate: **~35–40%**, comparable to or exceeding ReAct's 36%, and achieved primarily via a policy bundle and one code fix rather than any benchmark-side change. + +--- + +## Post-analysis: what we actually changed and what each change did + +This section records the implementation work done between 2026-05-12 and 2026-05-20 in response to the analysis above. 
The work was scoped to this repo (`cuga-internal-evaluation`) — every change is in `benchmarks/m3/`, `scripts/`, or `benchmarks/helpers/` — and obeyed the off-limits rule (no edits to vakra judges, vakra groundtruth, MCP server definitions, or the ReAct baseline). + +The 4 PFs picked for the iterative experiment were a subset of the codebase_comments domain (M3 task 2), chosen because they had been confirmed as ReAct-pass-CUGA-fail under the original bundle: `1960f609e439-e5d337d143b6`, `…-ab3a664a6a28`, `…-00fe3f448af7`, `…-d1ba8f4ad233`. The full final 5-run × 2-config comparison is in `/tmp/clean_report.md` (preserved verbatim in this PR's description); pass rates are quoted below. + +### Headline result + +Baseline (pre-fix): **0/10 PF cases passing** under any config. +After tool-prefix removal + policies-off: **5/10** (small earlier set) and on the 4-PF × 5-run sweep, **81.2%** mean pass rate (3.2/4 cases per run). +After tool-prefix removal + the policy bundle from §"Top-3 remediation priorities": **50.0%** (2.5/4) — i.e. **the policy bundle is net-negative on these 4 PFs once the tool-prefix issue is fixed.** One policy is robustly helpful on one task (`…-e5d337d143b6`: 75% → 100%); one policy reliably breaks one task (`…-d1ba8f4ad233`: 75% → 0%). Net cost dropped 32% (235K tokens → 160K) but mean correctness dropped 31 pp. + +### Changes by lever, in order they were tried + +1. **Registry app-name de-prefixing (CUGA-side code).** Root-cause fix for the analysis's "groundedness=0 on correct answers" finding. The M3 expanded registry was generating per-task app namespaces of the form `task___*` (e.g. `task_2_codebase_comments_get_method_count`). vakra's `_match_live_name` rewrites the gold tool sequence to whatever live names the MCP registry exposes; with the `task__` prefix in front of the domain, vakra could not match CUGA's predictions to the gold sequence, so the groundedness judge's "document" was empty for every CUGA answer. 
ReAct happened to call bare-domain tool names that *did* match. Files touched: `benchmarks/m3/eval_m3.py` (`registry_app_name = domain`, `expanded_service_name = domain_name`, updated `registry_prefix`), `benchmarks/m3/m3_data_loader.py` (`strip_registry_prefix` tries bare-domain first, falls back to legacy `task___` for old bundles), `benchmarks/m3/m3_vakra_score.py` (`_match_live_name` extended with a suffix-match path so old `task__…` bundles still score). Collision guard added because de-prefixing means tasks 2 and 3 both expose `books`, `mondial_geo`, `soccer_2016` — `expand_registry_config` now raises `RuntimeError` if two services collapse to the same name, and accepts a `capability_filter` so the caller can pre-narrow to just the task being evaluated. Smoke test at `scripts/check_no_task_prefix.py` walks any saved result file and asserts no tool call still carries the legacy form. **Result: +50pp on the iterated set, by far the largest lever.** + +2. **Policy bundle (CUGA-side data + config).** The analysis's #1 priority — small M3-scoped policy bundle plus flipping `DYNACONF_POLICY__ENABLED=true`. Implemented as markdown source files in `benchmarks/m3/policies/` compiled to `policies.json` by `scripts/policies_md_to_json.py` (YAML frontmatter for triggers, markdown body for content). Eight policies were created from the analysis recommendations: `P-OF-1-single-tool-fact-citation`, `P-OF-2-strip-hedging` (later disabled — see below), `P-PB-1-no-enumeration`, `P-PB-2-one-composite-tool-no-corroboration`, `P-PB-3-no-idempotent-retries`, `P-PB-4-validation-error-recovery` (added empirically when validation-error retries kept burning the step budget), `P-TG-1-mountain-count-disambiguation`, `P-TG-2-country-with-most-umpires-returns-id`. `benchmarks/m3/config/m3.env` flips `DYNACONF_POLICY__ENABLED` to `true`. 
Eval/compare wrappers grew `--no-policies` and `--compare-policies` flags mirroring `benchmarks/bpo` (`benchmarks/m3/eval.sh`, `benchmarks/m3/compare.sh`); `benchmarks/helpers/bundle.py` annotates the bundle directory with the policy mode. **Result on the 4 PFs: net −31pp vs no-policies, mixed per task: robust helper on `…-e5d337d143b6` (75→100%), robust killer on `…-d1ba8f4ad233` (75→0%).** + +3. **Output formatter conflict-resolver interaction (CUGA-side policy data).** While iterating, we observed that loading both `P-OF-1` (citation rule) and `P-OF-2` (hedging strip) caused CUGA's natural-language conflict resolver to pick one and drop the other; specifically, when `P-OF-2` won, the citation disappeared and groundedness regressed. Mitigation: renamed `P-OF-2-strip-hedging.md` to `.md.disabled` so the policy compiler skips it. This is a config-time fix only; the underlying conflict-resolution behaviour in cuga-agent is out of scope for this PR. **Result: +5pp on the iterated set vs both-loaded; still net-negative vs no-policies.** + +4. **Policy storage drift across per-domain agent instantiations (CUGA-side code).** Symptom: the `.cuga` policy folder count decreased monotonically across per-domain `CugaAgent` constructions because each agent's `__init__` re-loaded policies from disk and the on-disk sync wrote back the conflict-resolver's culled set. Fix: pass `auto_load_policies=False, filesystem_sync=False` to both `CugaAgent(...)` constructors in `benchmarks/m3/eval_m3.py` (lines ~193 and ~1436), and load policies once via the new `_load_m3_policies` async helper. **Result: stabilised per-domain runs (no more "policies vanished" surprises mid-bundle).** + +5. **Multi-UUID test-case filter (CUGA-side code, M3 evaluator harness only).** Reproducible after the tool-prefix fix: with `--task ` the evaluator silently ignored the filter and ran every multiturn sample in the capability (~46 in codebase_comments). 
Bug was in `M3Evaluator.evaluate_all`'s multiturn branch (`benchmarks/m3/eval_m3.py:886`), which checked `self.task_id` (singular — set only when exactly one UUID is passed) instead of `self.task_ids` (plural — populated for both 1 and N). The single-turn branch had already been updated; the multiturn branch hadn't. Fix: switch the multiturn branch to use `self.task_ids` and lowercase-membership testing. **Result: experiments became feasible — without this fix, a 4-PF × 5-runs × 2-configs sweep was actually a 46-task × 5-runs × 2-configs sweep.** + +6. **Per-service registry port respect (CUGA-side code).** When the user needed to keep port 8001 free for unrelated dev work, `eval_m3.py`'s per-service registry was found to hardcode 8001 in three places (port check, uvicorn `--port`, two health-check URLs at `http://localhost:8001/applications`). Fix: read `REGISTRY_PORT` or `DYNACONF_SERVER_PORTS__REGISTRY` once at startup and thread the value through. `benchmarks/m3/eval.sh` and `benchmarks/m3/compare.sh` already honoured `REGISTRY_PORT` for their kill-stale-process logic; the value now flows end-to-end. **Result: parallel CUGA work on port 8001 now possible during an M3 sweep on port 18001.** + +7. **Outer-registry redundancy (CUGA-side code, eval harness only).** Once `eval_m3.py` honoured a configurable port, the next failure surfaced: `eval.sh` was starting an "outer" registry on `$REGISTRY_PORT` via `run_registry.sh`, then `eval_m3.py` tried to start its per-service mini-registry on the same port and aborted with "port in use". The outer registry was a legacy step from before per-service registries existed — every code path in `eval.sh` invokes a `--from-config` eval that self-manages its registry. Fix: `benchmarks/m3/eval.sh` no longer starts the outer registry by default (the kill-stale-process block runs unconditionally; the start block is gated behind `SKIP_SERVER_START=false`, inverting the previous default). 
**Result: `eval_m3.py`'s per-service registry runs unopposed; the per-PF re-runs that previously failed with "port in use" complete.**
+
+8. **`extend`-style argparse for `--capability` and `--task` (CUGA-side code).** With `nargs="*"`, a second invocation of `--task` overwrote the first. Switched both to `action="extend", default=[]` so users can pass `--capability m3_task_2 --task ` in one shot. The UUID detection branch also gained a filter step that strips non-UUID items from the test-case filter before passing it down. **Result: composable filters; less footgun.**
+
+### Combined effect
+
+| Stage | 4-PF (×5 runs) no-policies | 4-PF (×5 runs) policies |
+|---|:---:|:---:|
+| Original bundle (analysis baseline, all PFs) | 0/10 | n/a (disabled) |
+| + tool-prefix removal | **5/10** | 4/10 |
+| + multi-UUID filter + port + registry fixes (clean 4-PF × 5-runs) | **81.2%** | 50.0% |
+
+The analysis's stated "+9–12 pp from policies" lift did not materialise on these 4 cases; the realised lift came almost entirely from the tool-prefix root cause, which the analysis had not anticipated as a separate bug (it was a hidden prerequisite for any policy lift to be measurable). Re-running the policy bundle against the **full** 200-case M3 set is the next step before declaring policies net-negative in general.
+
+### What this PR ships
+
+All changes above are bundled into one PR. The scope is intentional: the M3 evaluator harness was structurally fragile in several mutually-reinforcing ways (silent filter pass-through, port collisions, registry double-start, prefix bug, policy drift) and fixing one without the others left the harness broken in different ways at each step. Each fix is also small and localised.
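To illustrate item 8 above, the following minimal sketch contrasts argparse's default store behaviour with `action="extend"`. The `build_parser` helper and the sample values (`uuid-a`, `uuid-b`) are hypothetical and are not taken from `eval_m3.py`; only the `nargs="*"`, `action="extend"` and `default=[]` arguments come from the description above.

```python
import argparse


def build_parser(extend: bool) -> argparse.ArgumentParser:
    """Build a toy parser (hypothetical helper, not the real eval_m3.py parser).

    `extend=False` reproduces the old behaviour, `extend=True` the new one.
    """
    parser = argparse.ArgumentParser()
    if extend:
        # Each occurrence of --task appends its values to the same list.
        parser.add_argument("--task", nargs="*", action="extend", default=[])
    else:
        # Default "store" action: the last occurrence replaces earlier ones.
        parser.add_argument("--task", nargs="*", default=[])
    return parser


# Sample command line with --task given twice (values are placeholders).
argv = ["--task", "m3_task_2", "--task", "uuid-a", "uuid-b"]

overwritten = build_parser(extend=False).parse_args(argv).task
accumulated = build_parser(extend=True).parse_args(argv).task

print("store :", overwritten)
print("extend:", accumulated)
```

With the store action only the values from the second `--task` survive, which is the footgun the fix removes. With `extend` the values from both occurrences are kept in order.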
+ +What is *not* in this PR: nested-argument sandbox codegen bug (analysis priority #2 — that lives in cuga-agent, not here), and the `movie_platform` / `professional_basketball` MCP-client health investigation (analysis priority #3 — needs registry stderr capture from a future run). + +### Reproducing the final number + +```bash +# Clean local state +pkill -f "uv run registry" 2>/dev/null +pkill -f "eval_m3" 2>/dev/null +lsof -ti :8001 -i :18001 | xargs kill 2>/dev/null + +# 4-PF × 5-runs × 2-configs sweep +caffeinate -i env \ + REGISTRY_PORT=18001 \ + DYNACONF_SERVER_PORTS__REGISTRY=18001 \ + ./benchmarks/m3/compare.sh --runs 5 --compare-policies \ + --m3-data benchmarks/m3/data/small_train.zip \ + --capability m3_task_2 --domain codebase_comments \ + --task 1960f609e439-e5d337d143b6 \ + 1960f609e439-ab3a664a6a28 \ + 1960f609e439-00fe3f448af7 \ + 1960f609e439-d1ba8f4ad233 +``` + +Caveat from the 2026-05-18 run: do not run any other M3 eval against the same `benchmarks/m3/results/` directory while compare.sh is running — bundle collection is by glob and will pick up the other process's result files. Splitting the results dir per compare-invocation is logged as a follow-up. diff --git a/scripts/check_no_task_prefix.py b/scripts/check_no_task_prefix.py new file mode 100644 index 0000000..99019b2 --- /dev/null +++ b/scripts/check_no_task_prefix.py @@ -0,0 +1,109 @@ +#!/usr/bin/env python3 +"""Smoke test: assert that no tool call in an M3 result file carries the legacy +``task___`` registry prefix. + +The registry was reconfigured (eval_m3.py + m3_data_loader.py + m3_vakra_score.py) +to use the bare domain name as the app namespace, so saved tool calls should +now start with ``_`` rather than ``task___``. This script +walks every recorded tool call in a result file and fails if any one of them +still starts with the legacy form — that would mean the change regressed. 
+ +Usage: + uv run python scripts/check_no_task_prefix.py + + # Or, with no arg, picks the most recent benchmarks/m3/results/m3_config_*.json + uv run python scripts/check_no_task_prefix.py +""" + +from __future__ import annotations + +import json +import re +import sys +from pathlib import Path + +LEGACY_RE = re.compile(r"^task_\d+_[a-z_]+_") + + +def _iter_tool_calls(obj): + """Yield every (tool_call_dict, path-for-error-msg) pair under obj.""" + # Top-level result-file shapes vary across writers (results.json has + # {"metrics": ..., "results": [...]} for SDK eval; the bundle CSV-paired + # file has {uuid: {"tool_calls": [...]}} for legacy m3 runs). Cover both. + if isinstance(obj, dict) and "results" in obj and isinstance(obj["results"], list): + for i, r in enumerate(obj["results"]): + yield from _walk(r, f"results[{i}]") + elif isinstance(obj, dict): + # Per-uuid map + for k, v in obj.items(): + if isinstance(v, dict): + yield from _walk(v, k) + elif isinstance(obj, list): + for i, item in enumerate(obj): + yield from _walk(item, f"[{i}]") + + +def _walk(node, path): + if not isinstance(node, dict): + return + for tc in node.get("tool_calls", []) or []: + if isinstance(tc, dict): + yield tc, f"{path}.tool_calls" + for j, turn in enumerate(node.get("all_responses", []) or []): + if isinstance(turn, dict): + for tc in turn.get("tool_calls", []) or []: + if isinstance(tc, dict): + yield tc, f"{path}.all_responses[{j}].tool_calls" + + +def _newest_default_result_file() -> Path | None: + candidates = sorted( + Path("benchmarks/m3/results").glob("m3_config_*.json"), + key=lambda p: p.stat().st_mtime, + ) + return candidates[-1] if candidates else None + + +def main(argv: list[str]) -> int: + if len(argv) > 1: + path = Path(argv[1]) + else: + path = _newest_default_result_file() + if path is None: + print( + "No result file passed and none found under benchmarks/m3/results/m3_config_*.json", + file=sys.stderr, + ) + return 2 + print(f"(no path given — using latest: 
{path})", file=sys.stderr) + + if not path.exists(): + print(f"File not found: {path}", file=sys.stderr) + return 2 + + data = json.loads(path.read_text()) + offenders: list[tuple[str, str]] = [] + total = 0 + for tc, where in _iter_tool_calls(data): + total += 1 + name = tc.get("name") or "" + if LEGACY_RE.match(name): + offenders.append((where, name)) + + if offenders: + print( + f"FAIL — {len(offenders)} of {total} tool calls still carry the legacy task___ prefix:", + file=sys.stderr, + ) + for where, name in offenders[:20]: + print(f" {where}: {name}", file=sys.stderr) + if len(offenders) > 20: + print(f" … and {len(offenders) - 20} more", file=sys.stderr) + return 1 + + print(f"OK — {total} tool call(s) checked, none start with the legacy task___ prefix.") + return 0 + + +if __name__ == "__main__": + sys.exit(main(sys.argv)) diff --git a/scripts/policies_md_to_json.py b/scripts/policies_md_to_json.py new file mode 100644 index 0000000..8e0704f --- /dev/null +++ b/scripts/policies_md_to_json.py @@ -0,0 +1,138 @@ +#!/usr/bin/env python3 +"""Compile a directory of policy markdown files into a single policies.json. + +Each policy markdown must start with a YAML frontmatter block delimited by +`---` lines. The frontmatter carries the structured policy metadata +(type, id, name, triggers, etc.); the markdown body becomes the policy's +content field — `format_config` for output_formatter, `markdown_content` +for playbook, `guide_content` for tool_guide. + +Usage: + uv run python scripts/policies_md_to_json.py \ + --policies-dir benchmarks/m3/policies \ + --output benchmarks/m3/policies/policies.json + +Files named `README.md` or `POLICIES.md` (case-insensitive) are skipped — +those are human-readable indices, not policies. + +The output JSON matches the shape benchmarks/bpo/policies/policies.json +uses, which is what CUGA's +`cuga.backend.cuga_graph.policy.models.{OutputFormatter,Playbook,ToolGuide}` +expect via `.model_validate(...)`. 
+""" + +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from typing import Any + +import yaml + +# Markdown filenames that are docs, not policies. +SKIP_FILENAMES = {"readme.md", "policies.md"} + +# Map policy type -> body-content field name in the JSON. +BODY_FIELD_BY_TYPE = { + "output_formatter": "format_config", + "playbook": "markdown_content", + "tool_guide": "guide_content", +} + +REQUIRED_FRONTMATTER_KEYS = {"id", "type", "name"} + + +def parse_frontmatter(text: str, src: Path) -> tuple[dict[str, Any], str]: + """Split a markdown file into (frontmatter_dict, body_text). + + The file must start with `---\\n`, then YAML, then a closing `---\\n`. + """ + if not text.startswith("---"): + raise ValueError(f"{src}: file must begin with a YAML frontmatter block delimited by '---'") + # Find the closing '---' (must be on its own line, after the opening one) + lines = text.splitlines(keepends=True) + if lines[0].rstrip("\n") != "---": + raise ValueError(f"{src}: opening '---' must be on its own line") + end_idx = None + for i in range(1, len(lines)): + if lines[i].rstrip("\n") == "---": + end_idx = i + break + if end_idx is None: + raise ValueError(f"{src}: missing closing '---' for frontmatter block") + fm_text = "".join(lines[1:end_idx]) + body = "".join(lines[end_idx + 1 :]).lstrip("\n") + try: + fm = yaml.safe_load(fm_text) or {} + except yaml.YAMLError as exc: + raise ValueError(f"{src}: YAML parse error in frontmatter: {exc}") from exc + if not isinstance(fm, dict): + raise ValueError(f"{src}: frontmatter must be a YAML mapping, got {type(fm).__name__}") + return fm, body + + +def build_policy(fm: dict[str, Any], body: str, src: Path) -> dict[str, Any]: + """Merge frontmatter + body into a single policy dict ready for JSON.""" + missing = REQUIRED_FRONTMATTER_KEYS - set(fm) + if missing: + raise ValueError(f"{src}: missing required frontmatter keys: {sorted(missing)}") + ptype = fm["type"] + if ptype 
not in BODY_FIELD_BY_TYPE: + raise ValueError(f"{src}: unknown policy type '{ptype}'; supported: {sorted(BODY_FIELD_BY_TYPE)}") + body_field = BODY_FIELD_BY_TYPE[ptype] + if body_field in fm: + raise ValueError(f"{src}: '{body_field}' is provided both via frontmatter and via body — pick one") + policy: dict[str, Any] = dict(fm) + policy[body_field] = body + return policy + + +def collect_policies(policies_dir: Path) -> list[dict[str, Any]]: + if not policies_dir.is_dir(): + raise SystemExit(f"policies-dir does not exist or is not a directory: {policies_dir}") + md_files = sorted(f for f in policies_dir.glob("*.md") if f.name.lower() not in SKIP_FILENAMES) + if not md_files: + raise SystemExit(f"no policy .md files found in {policies_dir}") + policies: list[dict[str, Any]] = [] + seen_ids: dict[str, Path] = {} + for md in md_files: + fm, body = parse_frontmatter(md.read_text(), md) + policy = build_policy(fm, body, md) + if policy["id"] in seen_ids: + raise SystemExit(f"duplicate policy id '{policy['id']}' in {md} and {seen_ids[policy['id']]}") + seen_ids[policy["id"]] = md + policies.append(policy) + return policies + + +def main(argv: list[str] | None = None) -> int: + parser = argparse.ArgumentParser( + description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter + ) + parser.add_argument( + "--policies-dir", + required=True, + type=Path, + help="Directory containing one .md per policy", + ) + parser.add_argument( + "--output", + type=Path, + default=None, + help="Output JSON path (default: /policies.json)", + ) + args = parser.parse_args(argv) + + output_path = args.output or (args.policies_dir / "policies.json") + policies = collect_policies(args.policies_dir) + output_path.write_text(json.dumps(policies, indent=2, ensure_ascii=False) + "\n") + print(f"wrote {len(policies)} policy/policies to {output_path}", file=sys.stderr) + for p in policies: + print(f" - {p['type']:18s} {p['id']}", file=sys.stderr) + return 0 + + +if __name__ == "__main__": + 
sys.exit(main()) diff --git a/uv.lock b/uv.lock index 57731a8..589e324 100644 --- a/uv.lock +++ b/uv.lock @@ -4974,7 +4974,7 @@ wheels = [ [[package]] name = "sentence-transformers" -version = "5.4.1" +version = "5.5.0" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "huggingface-hub" }, @@ -4987,9 +4987,9 @@ dependencies = [ { name = "transformers" }, { name = "typing-extensions" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/4d/68/7f98c221940ce783b492ad6140384daf2e2918cd7175009d6a362c22b9ee/sentence_transformers-5.4.1.tar.gz", hash = "sha256:436bcb1182a0ff42a8fb2b1c43498a70d0a75b688d182f2cd0d1dd115af61ddc", size = 428910, upload-time = "2026-04-14T13:34:59.006Z" } +sdist = { url = "https://files.pythonhosted.org/packages/2c/27/16d127a61303e05847d878b23687f3371868c76e738557fa80b4373a8c2b/sentence_transformers-5.5.0.tar.gz", hash = "sha256:9cec675e68bfe09d07466d1f13ab06d1d79d60a0f45b154baf433bde6ae159cb", size = 444908, upload-time = "2026-05-12T14:05:42.383Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/c5/d9/3a9b6f2ccdedc9dc00fe37b2fc58f58f8efbff44565cf4bf39d8568bb13a/sentence_transformers-5.4.1-py3-none-any.whl", hash = "sha256:a6d640fc363849b63affb8e140e9d328feabab86f83d58ac3e16b1c28140b790", size = 571311, upload-time = "2026-04-14T13:34:57.731Z" }, + { url = "https://files.pythonhosted.org/packages/55/20/18416624bcbae866ec0b111979766cebabe8e5ff7563ab953ecbaf3ff9e7/sentence_transformers-5.5.0-py3-none-any.whl", hash = "sha256:75313fdcc2397ec4b58297c25d6187fcca5a6b2aeb09570a72eff5a3223d8d58", size = 588665, upload-time = "2026-05-12T14:05:40.899Z" }, ] [[package]] From 58aabfed0a8da2be84f8609aedeed35f6ad8a74c Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Wed, 20 May 2026 15:09:39 +0300 Subject: [PATCH 02/20] fix(m3): mark P-OF-2 frontmatter as disabled to match filename MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The MD→JSON compiler skips files with 
the .disabled suffix, so the policy was already inactive, but the frontmatter still said `enabled: true`. Misleading to a reader; CodeRabbit flagged the inconsistency in #100. The intent of the .disabled suffix here is "P-OF-2 conflicts with P-OF-1 in the natural-language conflict resolver and shouldn't be loaded for now." Flipping the frontmatter to `enabled: false` makes that state visible without relying on the suffix alone — and means removing the suffix later won't silently re-enable a policy whose trade-offs haven't been re-evaluated. --- benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled b/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled index e20dfb8..f5e97a9 100644 --- a/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled +++ b/benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled @@ -4,7 +4,7 @@ type: output_formatter name: P-OF-2 — Strip Hedging and Unsolicited Meta-Commentary description: When the answer contains a resolved value, strip hedging language, dataset meta-commentary, and unsolicited "For context" appendices. priority: 90 -enabled: true +enabled: false format_type: markdown triggers: - type: natural_language From 329b198499bf20107a959da1dd3d2c0891c844c1 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Thu, 28 May 2026 15:35:49 +0300 Subject: [PATCH 03/20] fix(m3): use DYNACONF_SERVER_PORTS__REGISTRY for registry bind and agent Add get_registry_port() backed by cuga settings so start_registry_server listens on the same port the agent uses via get_registry_base_url(). Sync REGISTRY_PORT and DYNACONF in eval.sh/compare.sh; stop hardcoding 8001 in eval_m3_react port cleanup. 
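As a rough illustration of the precedence that eval.sh and compare.sh apply with `${REGISTRY_PORT:-${DYNACONF_SERVER_PORTS__REGISTRY:-8001}}`, here is a minimal Python sketch. It only mirrors the shell fallback; the real `get_registry_port()` reads the port from cuga settings instead. The names `resolve_registry_port` and `synced_env` are hypothetical.

```python
from __future__ import annotations

import os
from typing import Mapping

DEFAULT_REGISTRY_PORT = 8001


def resolve_registry_port(env: Mapping[str, str] = os.environ) -> int:
    """Return the first non-empty port among the two variables, else 8001.

    Like the shell `:-` operator, an empty value counts as unset.
    """
    for key in ("REGISTRY_PORT", "DYNACONF_SERVER_PORTS__REGISTRY"):
        value = env.get(key, "")
        if value:
            return int(value)
    return DEFAULT_REGISTRY_PORT


def synced_env(env: Mapping[str, str]) -> dict[str, str]:
    """Copy `env` with both variables set to the resolved port (hypothetical helper)."""
    port = str(resolve_registry_port(env))
    merged = dict(env)
    merged["REGISTRY_PORT"] = port
    merged["DYNACONF_SERVER_PORTS__REGISTRY"] = port
    return merged


print(resolve_registry_port({"DYNACONF_SERVER_PORTS__REGISTRY": "18001"}))
```

Exporting both variables with the same value is what keeps the shell cleanup logic and the Python registry bind pointed at one port.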
Co-authored-by: Cursor --- benchmarks/m3/compare.sh | 6 ++++++ benchmarks/m3/eval.sh | 7 ++++++- benchmarks/m3/eval_m3.py | 22 +++++++++++++++------- benchmarks/m3/eval_m3_react.py | 6 ++++-- 4 files changed, 31 insertions(+), 10 deletions(-) diff --git a/benchmarks/m3/compare.sh b/benchmarks/m3/compare.sh index 7359d65..e70128c 100755 --- a/benchmarks/m3/compare.sh +++ b/benchmarks/m3/compare.sh @@ -25,6 +25,12 @@ if [ -f "$PROJECT_ROOT/benchmarks/helpers/common.sh" ]; then source "$PROJECT_ROOT/benchmarks/helpers/common.sh" fi +# Align cleanup port with eval.sh / cuga-agent (DYNACONF_SERVER_PORTS__REGISTRY). +source "$PROJECT_ROOT/benchmarks/helpers/load_env.sh" "m3" +REGISTRY_PORT="${REGISTRY_PORT:-${DYNACONF_SERVER_PORTS__REGISTRY:-8001}}" +export REGISTRY_PORT +export DYNACONF_SERVER_PORTS__REGISTRY="$REGISTRY_PORT" + # Source model profiles if [ -f "$PROJECT_ROOT/scripts/model_profiles.sh" ]; then source "$PROJECT_ROOT/scripts/model_profiles.sh" diff --git a/benchmarks/m3/eval.sh b/benchmarks/m3/eval.sh index ab82f52..f8de486 100755 --- a/benchmarks/m3/eval.sh +++ b/benchmarks/m3/eval.sh @@ -112,7 +112,6 @@ while [[ $# -gt 0 ]]; do done -REGISTRY_PORT="${REGISTRY_PORT:-8001}" REGISTRY_PID="" cleanup() { @@ -136,6 +135,12 @@ cd "$PROJECT_ROOT" # Load environment source "$PROJECT_ROOT/benchmarks/helpers/load_env.sh" "m3" +# Single registry port for shell helpers and Python (eval_m3 / cuga-agent both +# read DYNACONF_SERVER_PORTS__REGISTRY via settings.server_ports.registry). +REGISTRY_PORT="${REGISTRY_PORT:-${DYNACONF_SERVER_PORTS__REGISTRY:-8001}}" +export REGISTRY_PORT +export DYNACONF_SERVER_PORTS__REGISTRY="$REGISTRY_PORT" + # Make sure Python doesn't block-buffer stdout when it's piped through `tee`. 
# Without this, print() output from the summary can land after the process # exits, long after the surrounding loguru stderr stream, making it look like diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 37cf8b1..d57d4b2 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -1559,6 +1559,18 @@ def _dom_name(dc): return task_results +def get_registry_port() -> int: + """Registry port shared by the MCP server and cuga-agent HTTP client. + + Reads ``settings.server_ports.registry`` (override via + ``DYNACONF_SERVER_PORTS__REGISTRY``), the same source + ``get_registry_base_url()`` uses when the agent calls the registry. + """ + from cuga.config import settings + + return int(settings.server_ports.registry) + + async def start_registry_server(config_path: str) -> subprocess.Popen: """Start the registry server with the specified config. @@ -1571,13 +1583,7 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: import os import subprocess - # Honour caller-provided port via REGISTRY_PORT (set by eval.sh) and - # DYNACONF_SERVER_PORTS__REGISTRY (set when the agent's settings.toml - # registry port is overridden). Both must match: CUGA-agent reads the - # DYNACONF value when constructing HTTP requests to its registry; the - # registry server must listen on the same port. Default 8001. 
- _port_env = os.environ.get("REGISTRY_PORT") or os.environ.get("DYNACONF_SERVER_PORTS__REGISTRY") - registry_port = int(_port_env) if _port_env else 8001 + registry_port = get_registry_port() # Check if the registry port is already in use logger.info(f"🔍 Checking if port {registry_port} is available...") @@ -1627,6 +1633,8 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: # Set environment variables for registry config env = os.environ.copy() env["MCP_SERVERS_FILE"] = abs_config_path + env["DYNACONF_SERVER_PORTS__REGISTRY"] = str(registry_port) + env["REGISTRY_PORT"] = str(registry_port) # Ensure CONTAINER_RUNTIME is set for the registry subprocess as a full path. # The registry server calls os.path.expandvars() on the YAML, so ${CONTAINER_RUNTIME} diff --git a/benchmarks/m3/eval_m3_react.py b/benchmarks/m3/eval_m3_react.py index 142ff6d..62f16ce 100644 --- a/benchmarks/m3/eval_m3_react.py +++ b/benchmarks/m3/eval_m3_react.py @@ -243,8 +243,10 @@ async def _stop_active_registry(self) -> None: self._registry_tmp_yaml = None # OS may take a moment to fully release the port (TIME_WAIT etc). # Without this poll, the next group's start_registry_server hits - # "Port 8001 is already in use" and aborts. - await self._wait_for_port_free(8001, timeout=15.0) + # "Port … is already in use" and aborts. + from benchmarks.m3.eval_m3 import get_registry_port + + await self._wait_for_port_free(get_registry_port(), timeout=15.0) self._active_group = None async def _wait_for_port_free(self, port: int, timeout: float) -> None: From ab5ccc48895a9139e6cf180d917aecc4b4ddb3b6 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Thu, 28 May 2026 16:14:56 +0300 Subject: [PATCH 04/20] fix(m3): auto-sequence capability passes when --m3-data has no --capability Bare-domain registry names (from the vakra tool-name fix) collide when both m3_task_2 and m3_task_3 expand into one yaml. 
Run one capability at a time automatically so full small_train.zip evals work without --capability on CLI. --- benchmarks/m3/eval_m3.py | 54 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index d57d4b2..4b4efb6 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -1879,6 +1879,30 @@ def rewrite_config_with_loader_domains(config_path: str, m3_data_loader: M3DataL return path +def _service_name_filters_from_task(task_list: Optional[List[str]]) -> Optional[List[str]]: + """Return source-yaml service names from args.task (e.g. m3_task_2). + + UUIDs and hockey_395_0-style test-case IDs are not service-name filters. + """ + if not task_list: + return None + import re + + uuid_re = re.compile(r"^[a-f0-9]{12}-[a-f0-9]{12}$") + testcase_re = re.compile(r"^[a-z_]+_\d+_\d+$") + names = [f for f in task_list if not uuid_re.match(f) and not testcase_re.match(f)] + return names or None + + +def _non_service_task_filters(task_list: List[str]) -> List[str]: + """Keep UUID / test-case filters when auto-sequencing capability passes.""" + import re + + uuid_re = re.compile(r"^[a-f0-9]{12}-[a-f0-9]{12}$") + testcase_re = re.compile(r"^[a-z_]+_\d+_\d+$") + return [f for f in task_list if uuid_re.match(f) or testcase_re.match(f)] + + def expand_registry_config( config_path: str, capability_filter: Optional[List[str]] = None, @@ -2175,6 +2199,36 @@ async def run_config_mode(args, container_runtime: str): f"no_ground_truth={no_ground_truth}" ) + # When --m3-data is set but no --capability/--task service name was given, + # expand one capability at a time. Bare-domain registry names (books, + # mondial_geo, soccer_2016, …) collide across m3_task_2 and m3_task_3 if + # both are expanded into the same yaml (regression from the vakra tool-name + # fix in c0ce9f1). Sequential passes restore the old "run everything" + # behaviour without requiring --capability on the CLI. 
+ _task_filters = list(args.task) if getattr(args, "task", None) else [] + if m3_data_loader and _service_name_filters_from_task(_task_filters) is None: + cap_ids = m3_data_loader.available_capabilities() + preserved = _non_service_task_filters(_task_filters) + if len(cap_ids) > 1: + logger.info( + f"No --capability filter: running {len(cap_ids)} capability passes " + f"sequentially ({', '.join(f'm3_task_{i}' for i in cap_ids)}) " + f"to avoid cross-task domain-name collisions" + ) + import copy + + for task_id in cap_ids: + cap_name = f"m3_task_{task_id}" + logger.info(f"\n{'=' * 80}\n🔁 Auto capability pass: {cap_name}\n{'=' * 80}") + pass_args = copy.copy(args) + pass_args.task = [cap_name] + preserved + await run_config_mode(pass_args, container_runtime) + return + if len(cap_ids) == 1: + cap_name = f"m3_task_{cap_ids[0]}" + logger.info(f"No --capability filter: auto-narrowing to data capability {cap_name}") + args.task = [cap_name] + preserved + # In --no-ground-truth mode, rewrite the YAML so each service's # metadata.domains reflects the loader's view (test domains), not the # YAML's hard-coded small_train list. Without this, `--domain X` filters From 3b0276ca62a0301564ac29a7b99266b90c3dd5b4 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Fri, 29 May 2026 07:39:10 +0300 Subject: [PATCH 05/20] Fix M3 bundle assembly and eval harness reliability. Strip inline env comments so CUGA mode parses correctly, lazy-load helpers so bundle creation works without importing the agent, and add a utility to assemble bundles from existing results after long eval runs. 
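The inline-comment handling that this commit adds to load_env.sh can be approximated in Python as follows. This is a sketch under the assumption that the shell logic is: skip quoted values, otherwise cut at the first whitespace followed by `#`, trim trailing whitespace, then strip surrounding quotes. The function name `clean_env_value` is hypothetical, and `str.removeprefix` needs Python 3.9 or newer.

```python
import re

# First whitespace character followed by '#', through the end of the value.
_INLINE_COMMENT = re.compile(r"\s#.*$", re.DOTALL)


def _is_quoted(val: str) -> bool:
    """True when the value is wrapped in a matching pair of quotes."""
    return len(val) >= 2 and val[0] == val[-1] and val[0] in "\"'"


def clean_env_value(raw: str) -> str:
    """Strip an inline comment from an unquoted KEY=value right-hand side."""
    val = raw
    if not _is_quoted(val):
        # Unquoted: drop ' # comment' and any trailing whitespace.
        val = _INLINE_COMMENT.sub("", val, count=1).rstrip()
    # Strip surrounding quotes, double first and then single.
    val = val.removeprefix('"').removesuffix('"')
    val = val.removeprefix("'").removesuffix("'")
    return val


print(clean_env_value("accurate # Overall CUGA accuracy/behavior mode."))
```

A `#` that is not preceded by whitespace, as in a URL fragment, is left alone, and quoted values keep any `#` they contain.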
Co-authored-by: Cursor --- benchmarks/helpers/__init__.py | 48 ++-- benchmarks/helpers/bundle.py | 6 + benchmarks/helpers/common.sh | 54 +++- benchmarks/helpers/load_env.sh | 6 + benchmarks/m3/config/m3.env | 42 ++- benchmarks/m3/eval_m3.py | 65 ++--- benchmarks/m3/m3_vakra_score.py | 103 +++++++- .../m3/tests/test_vakra_langfuse_scores.py | 108 ++++++++ scripts/create_eval_bundle.py | 242 ++++++++++++++++++ scripts/model_profiles.sh | 19 +- 10 files changed, 613 insertions(+), 80 deletions(-) create mode 100644 benchmarks/m3/tests/test_vakra_langfuse_scores.py create mode 100644 scripts/create_eval_bundle.py diff --git a/benchmarks/helpers/__init__.py b/benchmarks/helpers/__init__.py index cd27530..938cd43 100644 --- a/benchmarks/helpers/__init__.py +++ b/benchmarks/helpers/__init__.py @@ -1,23 +1,6 @@ """Helper functions for SDK evaluation benchmarks.""" from .config_loader import load_eval_config -from .sdk_eval_helpers import ( - MetricsConfig, - add_policy_via_agent, - check_keywords, - clear_all_policies, - create_activity_tracker_callback, - evaluate_multiturn_task_with_langfuse, - evaluate_multiturn_task_with_langfuse_react, - evaluate_task_with_langfuse, - evaluate_task_with_langfuse_react, - flush_langfuse, - print_evaluation_summary, - save_evaluation_results, - setup_agent_with_tools, - setup_langfuse, - setup_react_agent_for_evaluation, -) from .token_usage import TokenUsageCallback __all__ = [ @@ -39,3 +22,34 @@ "create_activity_tracker_callback", "save_evaluation_results", ] + +_LAZY_EXPORTS = { + "MetricsConfig": ("sdk_eval_helpers", "MetricsConfig"), + "setup_agent_with_tools": ("sdk_eval_helpers", "setup_agent_with_tools"), + "setup_react_agent_for_evaluation": ("sdk_eval_helpers", "setup_react_agent_for_evaluation"), + "setup_langfuse": ("sdk_eval_helpers", "setup_langfuse"), + "clear_all_policies": ("sdk_eval_helpers", "clear_all_policies"), + "add_policy_via_agent": ("sdk_eval_helpers", "add_policy_via_agent"), + "check_keywords": 
("sdk_eval_helpers", "check_keywords"), + "evaluate_task_with_langfuse": ("sdk_eval_helpers", "evaluate_task_with_langfuse"), + "evaluate_task_with_langfuse_react": ("sdk_eval_helpers", "evaluate_task_with_langfuse_react"), + "evaluate_multiturn_task_with_langfuse": ("sdk_eval_helpers", "evaluate_multiturn_task_with_langfuse"), + "evaluate_multiturn_task_with_langfuse_react": ( + "sdk_eval_helpers", + "evaluate_multiturn_task_with_langfuse_react", + ), + "print_evaluation_summary": ("sdk_eval_helpers", "print_evaluation_summary"), + "flush_langfuse": ("sdk_eval_helpers", "flush_langfuse"), + "create_activity_tracker_callback": ("sdk_eval_helpers", "create_activity_tracker_callback"), + "save_evaluation_results": ("sdk_eval_helpers", "save_evaluation_results"), +} + + +def __getattr__(name: str): + if name in _LAZY_EXPORTS: + import importlib + + module_name, attr_name = _LAZY_EXPORTS[name] + module = importlib.import_module(f".{module_name}", __name__) + return getattr(module, attr_name) + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") diff --git a/benchmarks/helpers/bundle.py b/benchmarks/helpers/bundle.py index bac1526..ff5091a 100644 --- a/benchmarks/helpers/bundle.py +++ b/benchmarks/helpers/bundle.py @@ -723,6 +723,12 @@ def cli(): args = parser.parse_args() + # Reload benchmark env from disk (dotenv strips inline comments). Shell-sourced + # vars from eval.sh may include trailing comment text in values. 
+ from benchmarks.helpers.config_loader import load_eval_config + + load_eval_config(args.benchmark) + policies_dir = Path(args.policies_dir) if getattr(args, "policies_dir", None) else None if args.command == "assemble": diff --git a/benchmarks/helpers/common.sh b/benchmarks/helpers/common.sh index 307a26f..a2436bd 100755 --- a/benchmarks/helpers/common.sh +++ b/benchmarks/helpers/common.sh @@ -123,6 +123,8 @@ OUTPUT_FILE="${OUTPUT_FILE:-}" DRY_RUN="${DRY_RUN:-false}" VERBOSE="${VERBOSE:-false}" MODEL_PROFILE="${MODEL_PROFILE:-}" +CLI_MODEL_NAME="${CLI_MODEL_NAME:-}" +CLI_OPENAI_BASE_URL="${CLI_OPENAI_BASE_URL:-}" AGENT="${AGENT:-cuga}" AGENTS="${AGENTS:-}" COMPARE_AGENTS="${COMPARE_AGENTS:-false}" @@ -163,6 +165,14 @@ parse_common_args() { MODEL_PROFILE="${args[$((idx+1))]}" idx=$((idx+2)) ;; + --model-name) + CLI_MODEL_NAME="${args[$((idx+1))]}" + idx=$((idx+2)) + ;; + --openai-base-url) + CLI_OPENAI_BASE_URL="${args[$((idx+1))]}" + idx=$((idx+2)) + ;; --agent) AGENT="${args[$((idx+1))]}" idx=$((idx+2)) @@ -205,22 +215,46 @@ parse_common_args() { fi } +# Source scripts/model_profiles.sh once (idempotent). 
+_ensure_model_profiles_loaded() { + local script_dir profiles_script + script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + profiles_script="$script_dir/../../scripts/model_profiles.sh" + if [ -f "$profiles_script" ]; then + # shellcheck source=/dev/null + source "$profiles_script" + return 0 + fi + echo -e "${RED}Error: model_profiles.sh not found at $profiles_script${NC}" + return 1 +} + # Apply model profile if specified apply_model_profile_if_set() { if [ -n "$MODEL_PROFILE" ]; then - local script_dir - script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" - local profiles_script="$script_dir/../../scripts/model_profiles.sh" - if [ -f "$profiles_script" ]; then - source "$profiles_script" - apply_model_profile "$MODEL_PROFILE" - else - echo -e "${RED}Error: model_profiles.sh not found at $profiles_script${NC}" - return 1 - fi + _ensure_model_profiles_loaded || return 1 + apply_model_profile "$MODEL_PROFILE" + fi +} + +# Apply per-run CLI overrides (after profile and .env load). +apply_model_cli_overrides_if_set() { + if [ -n "$CLI_MODEL_NAME" ]; then + export MODEL_NAME="$CLI_MODEL_NAME" + echo -e "${GREEN}✓${NC} MODEL_NAME override: $MODEL_NAME" + fi + if [ -n "$CLI_OPENAI_BASE_URL" ]; then + export OPENAI_BASE_URL="$CLI_OPENAI_BASE_URL" + echo -e "${GREEN}✓${NC} OPENAI_BASE_URL override: $OPENAI_BASE_URL" fi } +# Apply profile then CLI overrides. Call after load_env.sh and arg parsing. +finalize_model_config() { + apply_model_profile_if_set || return 1 + apply_model_cli_overrides_if_set +} + # Build model-envs JSON for bundle CLI. # Usage: build_model_envs_json model1 model2 ... # Applies each profile, captures env vars, and outputs JSON to stdout. 
diff --git a/benchmarks/helpers/load_env.sh b/benchmarks/helpers/load_env.sh index 951214f..70c95b6 100755 --- a/benchmarks/helpers/load_env.sh +++ b/benchmarks/helpers/load_env.sh @@ -38,6 +38,12 @@ _source_no_override() { fi # Skip if already set (allows model profile exports from compare.sh to win) [[ -n "${!key+x}" ]] && continue + # Strip inline comments from unquoted values (e.g. KEY=value # comment). + # python-dotenv strips these; bash source does not unless we do it here. + if [[ "$val" != \"*\" && "$val" != \'*\' ]]; then + val="${val%%[[:space:]]#*}" + val="${val%"${val##*[![:space:]]}"}" + fi # Strip surrounding quotes val="${val#\"}" ; val="${val%\"}" val="${val#\'}" ; val="${val%\'}" diff --git a/benchmarks/m3/config/m3.env b/benchmarks/m3/config/m3.env index 54573b8..aaedbfd 100644 --- a/benchmarks/m3/config/m3.env +++ b/benchmarks/m3/config/m3.env @@ -33,20 +33,34 @@ DYNACONF_ADVANCED_FEATURES__TOOL_CALL_TIMEOUT=120 # DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=20 -DYNACONF_ADVANCED_FEATURES__REFLECTION_ENABLED=true # Enables the reflection step in the execution loop. -DYNACONF_ADVANCED_FEATURES__ENABLE_TODOS=false # Enables todo-list behavior in the agent (off for v1 / current eval plan). -DYNACONF_CONTEXT_SUMMARIZATION__ENABLED=true # (set in agent env when used) Toggles context summarization (conversation/context compression). Not set in appworld.env; add alongside other flags when you need it explicit in runs. -DYNACONF_ADVANCED_FEATURES__FORCE_AUTONOMOUS_MODE=true # Forces autonomous mode so the run is non-interactive and suitable for benchmarks (no interactive / HITL-style flow). -DYNACONF_ADVANCED_FEATURES__BENCHMARK=m3 # Selects benchmark-specific behavior for the named suite. -DYNACONF_ADVANCED_FEATURES__DECOMPOSITION_STRATEGY=exact # Decomposition strategy for tasks that use decomposition. -# DYNACONF_ADVANCED_FEATURES__FORCE_LITE_MODE_APPS=["supervisor","gmail","file_system"] # Apps forced into lite mode (JSON list string). 
-DYNACONF_ADVANCED_FEATURES__LITE_MODE=true # Master switch for lite mode behavior. -DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=1 # Threshold for shortlisting tools. -DYNACONF_ADVANCED_FEATURES__LITE_MODE_TOOL_THRESHOLD=500 # Tool-count threshold for lite mode behavior. -DYNACONF_FEATURES__CUGA_MODE=accurate # Overall CUGA accuracy/behavior mode. -DYNACONF_FEATURES__LOCAL_SANBDOX=true # Local sandbox flag (name as in config). -# DYNACONF_POLICY__ENABLED=false # Enables policy layer when true. -# DYNACONF_SERVER_PORTS__APIS_URL=9111 # API server port (wiring, not a behavioral feature toggle). +# Enables the reflection step in the execution loop. +DYNACONF_ADVANCED_FEATURES__REFLECTION_ENABLED=true +# Enables todo-list behavior in the agent (off for v1 / current eval plan). +DYNACONF_ADVANCED_FEATURES__ENABLE_TODOS=false +# Toggles context summarization (conversation/context compression). +DYNACONF_CONTEXT_SUMMARIZATION__ENABLED=true +# Forces autonomous mode so the run is non-interactive and suitable for benchmarks. +DYNACONF_ADVANCED_FEATURES__FORCE_AUTONOMOUS_MODE=true +# Selects benchmark-specific behavior for the named suite. +DYNACONF_ADVANCED_FEATURES__BENCHMARK=m3 +# Decomposition strategy for tasks that use decomposition. +DYNACONF_ADVANCED_FEATURES__DECOMPOSITION_STRATEGY=exact +# DYNACONF_ADVANCED_FEATURES__FORCE_LITE_MODE_APPS=["supervisor","gmail","file_system"] +# Apps forced into lite mode (JSON list string). +# Master switch for lite mode behavior. +DYNACONF_ADVANCED_FEATURES__LITE_MODE=true +# Threshold for shortlisting tools. +DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=1 +# Tool-count threshold for lite mode behavior. +DYNACONF_ADVANCED_FEATURES__LITE_MODE_TOOL_THRESHOLD=500 +# Overall CUGA accuracy/behavior mode. +DYNACONF_FEATURES__CUGA_MODE=accurate +# Local sandbox flag (name as in config). +DYNACONF_FEATURES__LOCAL_SANBDOX=true +# DYNACONF_POLICY__ENABLED=false +# Enables policy layer when true. 
+# DYNACONF_SERVER_PORTS__APIS_URL=9111 +# API server port (wiring, not a behavioral feature toggle). # Vakra evaluator (benchmarks/m3/evaluator/) reads API_KEY for the LLM-as-judge. # Mirror GROQ_API_KEY into API_KEY so judge.ChatModel.__init__ authenticates without diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 4b4efb6..8530037 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -2432,43 +2432,34 @@ def _service_has_wanted_domain(svc_dict): except Exception as e: logger.warning(f"Could not initialize Langfuse: {e}") - # Collect all task evaluation coroutines - task_evaluations = [] - - for service_dict in services: - # Extract service name and config - service_name = list(service_dict.keys())[0] - service_config = service_dict[service_name] - - metadata = service_config.get("metadata", {}) - task_id = metadata.get("task_id") - container = metadata.get("container") - domains = metadata.get("domains", []) - task_multiturn = metadata.get("multiturn", None) # None = auto-detect - - # Create coroutine for this task (will process all its domains) - # Task 1 uses uuid-based tool universe switching — route to dedicated handler - # if task_id == 1: - # task_coro = evaluate_single_task_1( - # service_name=service_name, - # task_id=task_id, - # container=container, - # domains=domains, - # args=args, - # container_runtime=container_runtime - # ) - # else: - task_coro = evaluate_single_task( - service_name=service_name, - task_id=task_id, - container=container, - domains=domains, - task_multiturn=task_multiturn, - args=args, - container_runtime=container_runtime, - m3_data_loader=m3_data_loader, - ) - task_evaluations.append((service_name, task_coro)) + # Collect task evaluation coroutines only for parallel/batched mode. + # In sequential mode we await evaluate_single_task per service below + # (after starting a one-service registry). Building coroutines here + # and never awaiting them triggers "coroutine was never awaited". 
+ task_evaluations: List[tuple[str, Any]] = [] + + if not sequential_mode: + for service_dict in services: + service_name = list(service_dict.keys())[0] + service_config = service_dict[service_name] + + metadata = service_config.get("metadata", {}) + task_id = metadata.get("task_id") + container = metadata.get("container") + domains = metadata.get("domains", []) + task_multiturn = metadata.get("multiturn", None) # None = auto-detect + + task_coro = evaluate_single_task( + service_name=service_name, + task_id=task_id, + container=container, + domains=domains, + task_multiturn=task_multiturn, + args=args, + container_runtime=container_runtime, + m3_data_loader=m3_data_loader, + ) + task_evaluations.append((service_name, task_coro)) # Concurrency: sequential by default, batched when --batch-size >= 2. # "Fully parallel" is just a large batch size (>= total tasks). diff --git a/benchmarks/m3/m3_vakra_score.py b/benchmarks/m3/m3_vakra_score.py index 1d38fba..14ca5bf 100644 --- a/benchmarks/m3/m3_vakra_score.py +++ b/benchmarks/m3/m3_vakra_score.py @@ -20,6 +20,8 @@ from pathlib import Path from typing import Any, Dict, List, Optional, Tuple +from loguru import logger + _EVAL_DIR = Path(__file__).resolve().parent / "evaluator" # The upstream vendor was renamed `enterprise-benchmark` → `vakra`. Try the # new name first; fall back to the old name so older clones still work. 
@@ -565,6 +567,10 @@ async def score_results_async( for r in results: if "vakra" in r: r["vakra"]["_scoring_mode"] = used_mode + log_vakra_task_scores(results) + langfuse_pushed = push_vakra_scores_to_langfuse(results) + if langfuse_pushed: + logger.info(f"📊 Pushed Vakra scores to Langfuse for {langfuse_pushed} trace(s)") return summary @@ -602,6 +608,99 @@ def score_results( _JUDGE_KEYS = ("exactmatch", "answer", "groundedness") +def _last_turn_judge_scores(vakra: Dict[str, Any]) -> Dict[str, float]: + """Extract per-judge scores from the last scored turn in a Vakra dialogue.""" + per_turn = (vakra.get("details") or {}).get("per_turn") or [] + if not per_turn: + return {} + meta = per_turn[-1].get("metadata") or {} + scores: Dict[str, float] = {} + for key in _JUDGE_KEYS: + score_key = "exactmatch_score" if key == "exactmatch" else f"{key}_score" + val = meta.get(score_key) + if val is not None: + scores[key] = float(val) + return scores + + +def _format_judge_scores_compact(scores: Dict[str, float]) -> str: + return " ".join(f"{key}={scores[key]:.1f}" for key in _JUDGE_KEYS if key in scores) + + +def log_vakra_task_scores(results: List[Dict[str, Any]]) -> None: + """Log per-task Vakra dialogue + judge scores to the console.""" + for r in results: + vakra = r.get("vakra") + if not vakra: + continue + uuid = _result_uuid(r) or r.get("task_name", "?") + dialogue_score = float(r.get("match_rate", vakra.get("score", 0.0))) + passed = dialogue_score >= 1.0 + mark = "✅ PASS" if passed else "❌ FAIL" + judge = _last_turn_judge_scores(vakra) + judge_str = _format_judge_scores_compact(judge) if judge else "(no judge breakdown)" + logger.info(f"📊 Vakra {mark}: {uuid} dialogue={dialogue_score:.2f} | {judge_str}") + + +def push_vakra_scores_to_langfuse(results: List[Dict[str, Any]]) -> int: + """Attach Vakra judge scores to existing Langfuse traces (post-hoc). 
+ + Mirrors AppWorld's ``appworld_success`` / ``pass_percentage`` pattern: + trace-level scores are written after evaluation completes, keyed by each + result's ``trace_id``. + + Returns the number of traces scored. + """ + try: + from langfuse import get_client + + langfuse = get_client() + except Exception: + return 0 + + pushed = 0 + for r in results: + trace_id = r.get("trace_id") + vakra = r.get("vakra") + if not trace_id or not vakra: + continue + dialogue_score = float(r.get("match_rate", vakra.get("score", 0.0))) + success = dialogue_score >= 1.0 + try: + langfuse.create_score( + trace_id=trace_id, + name="m3_success", + value=success, + data_type="BOOLEAN", + comment="Vakra dialogue pass (score >= 1.0)", + ) + langfuse.create_score( + trace_id=trace_id, + name="m3_dialogue_score", + value=dialogue_score, + data_type="NUMERIC", + comment="Vakra aggregated dialogue score", + ) + for key, val in _last_turn_judge_scores(vakra).items(): + langfuse.create_score( + trace_id=trace_id, + name=f"m3_{key}_score", + value=val, + data_type="NUMERIC", + comment=f"Vakra {key} judge score", + ) + pushed += 1 + except Exception as e: + logger.warning(f"Failed to push Vakra scores to Langfuse for trace {trace_id}: {e}") + + if pushed: + try: + langfuse.flush() + except Exception as e: + logger.warning(f"Failed to flush Langfuse after pushing Vakra scores: {e}") + return pushed + + def _judge_lines(turn: Dict[str, Any], indent: str = " ") -> List[str]: """One line per judge with score + explanation. 
Used for failure detail.""" meta = turn.get("metadata") or {} @@ -690,7 +789,9 @@ def print_vakra_summary(results: List[Dict[str, Any]]) -> None: tc_str = f"tool_calls={actual_tcs}/{expected_tcs}" else: tc_str = f"tool_calls={actual_tcs}" - write(f" {mark} {uuid:<30} score={score:.2f} {tc_str}\n") + judge = _last_turn_judge_scores(r.get("vakra") or {}) + judge_str = f" {_format_judge_scores_compact(judge)}" if judge else "" + write(f" {mark} {uuid:<30} score={score:.2f} {tc_str}{judge_str}\n") if not passed: details = (r.get("vakra") or {}).get("details") or {} per_turn = details.get("per_turn") or [] diff --git a/benchmarks/m3/tests/test_vakra_langfuse_scores.py b/benchmarks/m3/tests/test_vakra_langfuse_scores.py new file mode 100644 index 0000000..57db23f --- /dev/null +++ b/benchmarks/m3/tests/test_vakra_langfuse_scores.py @@ -0,0 +1,108 @@ +"""Tests for Vakra → Langfuse score push and console logging helpers.""" + +from __future__ import annotations + +from unittest.mock import MagicMock, patch + +import pytest + +pytest.importorskip( + "evaluator", + reason="M3 Vakra vendor not installed; run ./setup_m3.sh to enable this test.", +) +pytest.importorskip( + "benchmark.mcp_client", + reason="M3 Vakra vendor not installed; run ./setup_m3.sh to enable this test.", +) + +from benchmarks.m3.m3_vakra_score import ( # noqa: E402 + _format_judge_scores_compact, + _last_turn_judge_scores, + log_vakra_task_scores, + push_vakra_scores_to_langfuse, +) + + +def _sample_vakra(*, exactmatch=0.0, answer=1.0, groundedness=0.0): + return { + "score": groundedness, + "details": { + "per_turn": [ + { + "turn_id": 1, + "score": groundedness, + "metadata": { + "exactmatch_score": exactmatch, + "answer_score": answer, + "groundedness_score": groundedness, + }, + } + ] + }, + } + + +def test_last_turn_judge_scores_extracts_all_three(): + scores = _last_turn_judge_scores(_sample_vakra()) + assert scores == {"exactmatch": 0.0, "answer": 1.0, "groundedness": 0.0} + assert 
_format_judge_scores_compact(scores) == "exactmatch=0.0 answer=1.0 groundedness=0.0" + + +def test_last_turn_judge_scores_skips_none(): + vakra = _sample_vakra(exactmatch=1.0, answer=None, groundedness=1.0) + vakra["details"]["per_turn"][0]["metadata"]["answer_score"] = None + scores = _last_turn_judge_scores(vakra) + assert scores == {"exactmatch": 1.0, "groundedness": 1.0} + + +def test_push_vakra_scores_to_langfuse_creates_five_scores(): + results = [ + { + "trace_id": "trace-abc", + "match_rate": 1.0, + "success": True, + "vakra": _sample_vakra(exactmatch=0.0, answer=1.0, groundedness=1.0), + } + ] + mock_langfuse = MagicMock() + with patch("langfuse.get_client", return_value=mock_langfuse): + pushed = push_vakra_scores_to_langfuse(results) + + assert pushed == 1 + assert mock_langfuse.create_score.call_count == 5 + names = {call.kwargs["name"] for call in mock_langfuse.create_score.call_args_list} + assert names == { + "m3_success", + "m3_dialogue_score", + "m3_exactmatch_score", + "m3_answer_score", + "m3_groundedness_score", + } + mock_langfuse.flush.assert_called_once() + + +def test_push_vakra_scores_skips_results_without_trace_id(): + results = [{"vakra": _sample_vakra()}] + mock_langfuse = MagicMock() + with patch("langfuse.get_client", return_value=mock_langfuse): + assert push_vakra_scores_to_langfuse(results) == 0 + mock_langfuse.create_score.assert_not_called() + + +def test_log_vakra_task_scores_emits_judge_breakdown(caplog): + import logging + + caplog.set_level(logging.INFO) + results = [ + { + "uuid": "uuid-a", + "match_rate": 0.0, + "vakra": _sample_vakra(exactmatch=0.0, answer=1.0, groundedness=0.0), + } + ] + with patch("benchmarks.m3.m3_vakra_score.logger") as mock_logger: + log_vakra_task_scores(results) + msg = mock_logger.info.call_args[0][0] + assert "uuid-a" in msg + assert "dialogue=0.00" in msg + assert "exactmatch=0.0 answer=1.0 groundedness=0.0" in msg diff --git a/scripts/create_eval_bundle.py b/scripts/create_eval_bundle.py new 
file mode 100644 index 0000000..824be25 --- /dev/null +++ b/scripts/create_eval_bundle.py @@ -0,0 +1,242 @@ +#!/usr/bin/env python3 +"""Create a reproducibility bundle from existing evaluation results. + +Use when eval.sh finished successfully but bundle creation failed, or to +re-assemble a bundle with different options without re-running the eval. + +Examples:: + + # M3: bundle the latest result (mirrors eval.sh defaults) + uv run python scripts/create_eval_bundle.py --benchmark m3 --latest + + # M3: bundle a specific result file + uv run python scripts/create_eval_bundle.py --benchmark m3 \\ + --result-file benchmarks/m3/results/m3_20260529_020934.json + + # Any benchmark with explicit paths + uv run python scripts/create_eval_bundle.py --benchmark bpo \\ + --result-file benchmarks/bpo/results/bpo_run.json \\ + --task-file benchmarks/bpo/data/bpo_test_suite_v1.json +""" + +from __future__ import annotations + +import argparse +import subprocess +import sys +import tempfile +from pathlib import Path + +PROJECT_ROOT = Path(__file__).resolve().parent.parent + + +def _latest_result_file(benchmark: str) -> Path | None: + results_dir = PROJECT_ROOT / "benchmarks" / benchmark / "results" + if not results_dir.is_dir(): + return None + patterns = ["*.json"] + if benchmark == "m3": + patterns = ["m3_*.json", "multiturn_*.json"] + candidates: list[Path] = [] + for pattern in patterns: + candidates.extend(results_dir.glob(pattern)) + if not candidates: + return None + return max(candidates, key=lambda p: p.stat().st_mtime) + + +def _default_task_file(benchmark: str, result_file: Path) -> Path | None: + data_dir = PROJECT_ROOT / "benchmarks" / benchmark / "data" + if benchmark == "m3": + if result_file.name.startswith("multiturn_"): + candidate = data_dir / "olympics_mutliturn.json" + else: + candidate = data_dir / "hockey.json" + return candidate if candidate.exists() else None + return None + + +def _default_log_files(benchmark: str) -> list[Path]: + bench_dir = 
PROJECT_ROOT / "benchmarks" / benchmark + logs: list[Path] = [] + for name in ("registry_server.log",): + path = bench_dir / name + if path.is_file() and path.stat().st_size > 0: + logs.append(path) + # Same fixed paths as benchmarks/m3/eval.sh (not user-controlled). + for fallback in ( + Path("/tmp/m3_registry.log"), # noqa: S108 + Path("/tmp/m3_console.log"), # noqa: S108 + ): + if fallback.is_file() and fallback.stat().st_size > 0: + logs.append(fallback) + return logs + + +def _default_trajectory_dir(benchmark: str) -> Path | None: + from benchmarks.helpers.bundle import find_latest_trajectory + + traj_root = PROJECT_ROOT / "benchmarks" / benchmark / "logging" / "trajectory_data" + return find_latest_trajectory(traj_root) + + +def _generate_report(result_file: Path) -> Path | None: + report_tmp = Path(tempfile.mkstemp(prefix=f"{result_file.stem}_report_", suffix=".md")[1]) + cmd = [ + sys.executable, + "-m", + "benchmarks.helpers.compare_report", + "eval", + "--result-file", + str(result_file), + "--output", + str(report_tmp), + ] + try: + subprocess.run(cmd, cwd=PROJECT_ROOT, check=True) + except subprocess.CalledProcessError: + report_tmp.unlink(missing_ok=True) + return None + return report_tmp + + +def main() -> int: + parser = argparse.ArgumentParser( + description="Create a reproducibility bundle from existing evaluation results." + ) + parser.add_argument( + "--benchmark", + default="m3", + help="Benchmark name (default: m3)", + ) + parser.add_argument( + "--result-file", + action="append", + dest="result_files", + help="Result JSON path (repeatable). Omit with --latest to pick the newest file.", + ) + parser.add_argument( + "--latest", + action="store_true", + help="Use the most recent result file under benchmarks//results/", + ) + parser.add_argument( + "--task-file", + action="append", + dest="task_files", + help="Ground-truth task JSON (repeatable). 
Default: benchmark-specific guess for M3.", + ) + parser.add_argument("--model-profile", default=None, help="Model profile label for bundle name") + parser.add_argument( + "--trajectory-dir", + default=None, + help="Trajectory folder to include (default: latest under logging/trajectory_data)", + ) + parser.add_argument( + "--log-file", + action="append", + dest="log_files", + help="Log file to include (repeatable). Default: registry + console logs when present.", + ) + parser.add_argument("--no-report", action="store_true", help="Skip eval report generation") + parser.add_argument("--no-langfuse", action="store_true", help="Skip Langfuse trace download") + parser.add_argument("--zip", action="store_true", help="Also create a zip archive") + args = parser.parse_args() + + if args.latest and args.result_files: + parser.error("Use either --latest or --result-file, not both") + if not args.latest and not args.result_files: + args.latest = True + + result_files: list[Path] = [] + if args.latest: + latest = _latest_result_file(args.benchmark) + if latest is None: + print( + f"No result files found under benchmarks/{args.benchmark}/results/", + file=sys.stderr, + ) + return 1 + result_files = [latest] + print(f"Using latest result: {latest}") + else: + result_files = [Path(p).resolve() for p in args.result_files] + + for rf in result_files: + if not rf.is_file(): + print(f"Result file not found: {rf}", file=sys.stderr) + return 1 + + task_files: list[Path] = [] + if args.task_files: + task_files = [Path(p).resolve() for p in args.task_files] + else: + default_task = _default_task_file(args.benchmark, result_files[0]) + if default_task is None: + print( + "No --task-file given and no default task file found. 
Pass --task-file explicitly.", + file=sys.stderr, + ) + return 1 + task_files = [default_task] + print(f"Using default task file: {default_task}") + + for tf in task_files: + if not tf.is_file(): + print(f"Task file not found: {tf}", file=sys.stderr) + return 1 + + trajectory_dir = Path(args.trajectory_dir).resolve() if args.trajectory_dir else None + if trajectory_dir is None: + trajectory_dir = _default_trajectory_dir(args.benchmark) + if trajectory_dir: + print(f"Including trajectory: {trajectory_dir}") + + log_files = ( + [Path(p).resolve() for p in args.log_files] if args.log_files else _default_log_files(args.benchmark) + ) + if log_files: + print(f"Including logs: {', '.join(str(p) for p in log_files)}") + + report_path: Path | None = None + if not args.no_report: + report_path = _generate_report(result_files[0]) + if report_path: + print(f"Generated report: {report_path}") + + bundle_cmd = [ + sys.executable, + str(PROJECT_ROOT / "benchmarks" / "helpers" / "bundle.py"), + "assemble", + "--benchmark", + args.benchmark, + "--result-files", + *[str(p) for p in result_files], + "--task-files", + *[str(p) for p in task_files], + ] + if args.model_profile: + bundle_cmd.extend(["--model-profile", args.model_profile]) + if trajectory_dir: + bundle_cmd.extend(["--trajectory-dir", str(trajectory_dir)]) + if log_files: + bundle_cmd.extend(["--log-files", *[str(p) for p in log_files]]) + if report_path: + bundle_cmd.extend(["--report", str(report_path)]) + if not args.no_langfuse: + bundle_cmd.append("--fetch-langfuse") + if args.zip: + bundle_cmd.append("--zip") + + print("Running:", " ".join(bundle_cmd)) + try: + subprocess.run(bundle_cmd, cwd=PROJECT_ROOT, check=True) + finally: + if report_path: + report_path.unlink(missing_ok=True) + + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/model_profiles.sh b/scripts/model_profiles.sh index 580ff7d..bf3e467 100755 --- a/scripts/model_profiles.sh +++ b/scripts/model_profiles.sh @@ 
-52,12 +52,29 @@ apply_model_profile() { echo -e "${GREEN} MODEL_NAME=$MODEL_NAME${NC}" echo -e "${GREEN} OPENAI_BASE_URL=$OPENAI_BASE_URL${NC}" ;; + vllm|local) + # OpenAI-compatible local server (vLLM, etc.). Set MODEL_NAME and + # OPENAI_BASE_URL in .env, or pass --model-name / --openai-base-url. + export AGENT_SETTING_CONFIG="settings.openai.toml" + export OPENAI_BASE_URL="${OPENAI_BASE_URL:-http://127.0.0.1:8000/v1}" + export OPENAI_API_KEY="${OPENAI_API_KEY:-}" # pragma: allowlist secret + unset OPENAI_API_VERSION + if [ -z "${MODEL_NAME:-}" ]; then + echo -e "${RED}Error: MODEL_NAME is required for profile '$profile'${NC}" + echo -e "${YELLOW}Set MODEL_NAME in .env or pass --model-name (e.g. Qwen/Qwen3-32B)${NC}" + return 1 + fi + echo -e "${GREEN}✓${NC} Model profile: $profile (local OpenAI-compatible)" + echo -e "${GREEN} AGENT_SETTING_CONFIG=$AGENT_SETTING_CONFIG${NC}" + echo -e "${GREEN} MODEL_NAME=$MODEL_NAME${NC}" + echo -e "${GREEN} OPENAI_BASE_URL=$OPENAI_BASE_URL${NC}" + ;; "") # No profile specified, use .env defaults ;; *) echo -e "${RED}Error: Unknown model profile '$profile'${NC}" - echo -e "${YELLOW}Valid values: gpt-oss, gpt4o, gpt4.1, opus4.5${NC}" + echo -e "${YELLOW}Valid values: gpt-oss, gpt4o, gpt4.1, opus4.5, vllm, local${NC}" return 1 ;; esac From de8ddd49031068ef22a237cc164c43eeec1265dc Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Fri, 29 May 2026 07:40:36 +0300 Subject: [PATCH 06/20] Fix create_eval_bundle import error when run as a script. Inline trajectory lookup so the utility does not import the benchmarks package before subprocess calls. 
Co-authored-by: Cursor
---
 scripts/create_eval_bundle.py | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/scripts/create_eval_bundle.py b/scripts/create_eval_bundle.py
index 824be25..70d458d 100644
--- a/scripts/create_eval_bundle.py
+++ b/scripts/create_eval_bundle.py
@@ -73,11 +73,19 @@ def _default_log_files(benchmark: str) -> list[Path]:
     return logs
 
 
-def _default_trajectory_dir(benchmark: str) -> Path | None:
-    from benchmarks.helpers.bundle import find_latest_trajectory
+def _find_latest_trajectory(trajectory_data_dir: Path) -> Path | None:
+    """Return the most recently modified subfolder under trajectory_data_dir."""
+    if not trajectory_data_dir.is_dir():
+        return None
+    subdirs = [d for d in trajectory_data_dir.iterdir() if d.is_dir()]
+    if not subdirs:
+        return None
+    return max(subdirs, key=lambda d: d.stat().st_mtime)
+
 
+def _default_trajectory_dir(benchmark: str) -> Path | None:
     traj_root = PROJECT_ROOT / "benchmarks" / benchmark / "logging" / "trajectory_data"
-    return find_latest_trajectory(traj_root)
+    return _find_latest_trajectory(traj_root)
 
 
 def _generate_report(result_file: Path) -> Path | None:

From 49479476be7e602c7f0fbeedeb022cbbc62e63ff Mon Sep 17 00:00:00 2001
From: Harold Ship
Date: Fri, 29 May 2026 07:46:15 +0300
Subject: [PATCH 07/20] Fix bundle CLI when invoked outside the benchmarks package.

Load benchmark env via dotenv inside bundle.py and invoke it with -m from
create_eval_bundle so direct script execution does not require sys.path hacks.
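
For context, a minimal sketch (not code from this patch) of the difference
between the two launch forms. `build_bundle_cmd` is a hypothetical helper
introduced only for illustration. The sketch assumes the child process runs
with the repository root as its working directory and that the repo is not
otherwise installed or on PYTHONPATH.

```python
import sys
from pathlib import Path


def build_bundle_cmd(benchmark: str, as_module: bool = True) -> list[str]:
    """Hypothetical helper: build the argv used to launch the bundle CLI.

    as_module=True starts the interpreter with -m, so the working directory
    is placed at the front of sys.path and the `benchmarks` package can be
    imported when cwd is the repo root.

    as_module=False passes the script path, so the script's own directory
    (benchmarks/helpers) is placed at the front of sys.path instead, and
    `import benchmarks...` inside the script is not guaranteed to resolve.
    """
    if as_module:
        target = ["-m", "benchmarks.helpers.bundle"]
    else:
        target = [str(Path("benchmarks") / "helpers" / "bundle.py")]
    return [sys.executable, *target, "assemble", "--benchmark", benchmark]


module_cmd = build_bundle_cmd("m3")
script_cmd = build_bundle_cmd("m3", as_module=False)
```

The module form keeps imports working without any path manipulation inside
bundle.py, which is the reason for preferring it here.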
Co-authored-by: Cursor
---
 benchmarks/helpers/bundle.py  | 16 +++++++++++++---
 scripts/create_eval_bundle.py |  3 ++-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/benchmarks/helpers/bundle.py b/benchmarks/helpers/bundle.py
index ff5091a..d298515 100644
--- a/benchmarks/helpers/bundle.py
+++ b/benchmarks/helpers/bundle.py
@@ -43,6 +43,18 @@ PROJECT_ROOT = _HELPERS_DIR.parent.parent
 
 
 
+def _load_benchmark_env(benchmark_name: str) -> None:
+    """Load global + benchmark .env files (dotenv strips inline comments)."""
+    from dotenv import load_dotenv
+
+    global_env = PROJECT_ROOT / "config" / "global.env"
+    if global_env.exists():
+        load_dotenv(global_env, override=True)
+    benchmark_env = PROJECT_ROOT / "benchmarks" / benchmark_name / "config" / f"{benchmark_name}.env"
+    if benchmark_env.exists():
+        load_dotenv(benchmark_env, override=True)
+
+
 # ---------------------------------------------------------------------------
 # Git / hash helpers
 # ---------------------------------------------------------------------------
@@ -725,9 +737,7 @@ def cli():
 
     # Reload benchmark env from disk (dotenv strips inline comments). Shell-sourced
     # vars from eval.sh may include trailing comment text in values.
-    from benchmarks.helpers.config_loader import load_eval_config
-
-    load_eval_config(args.benchmark)
+    _load_benchmark_env(args.benchmark)
 
     policies_dir = Path(args.policies_dir) if getattr(args, "policies_dir", None) else None
 
diff --git a/scripts/create_eval_bundle.py b/scripts/create_eval_bundle.py
index 70d458d..4d200e5 100644
--- a/scripts/create_eval_bundle.py
+++ b/scripts/create_eval_bundle.py
@@ -214,7 +214,8 @@ def main() -> int:
 
     bundle_cmd = [
         sys.executable,
-        str(PROJECT_ROOT / "benchmarks" / "helpers" / "bundle.py"),
+        "-m",
+        "benchmarks.helpers.bundle",
         "assemble",
         "--benchmark",
         args.benchmark,

From a3002587442389909ca7bd9c4598b1ff95225fb6 Mon Sep 17 00:00:00 2001
From: Harold Ship
Date: Fri, 29 May 2026 08:20:25 +0300
Subject: [PATCH 08/20] Fix CI failures from polluted eval env and bandit B108.

Reload benchmark env in root conftest before cuga imports and add nosec
annotations for fixed /tmp log paths in create_eval_bundle.

Co-authored-by: Cursor
---
 conftest.py                   | 23 +++++++++++++++++++++++
 scripts/create_eval_bundle.py |  4 ++--
 2 files changed, 25 insertions(+), 2 deletions(-)
 create mode 100644 conftest.py

diff --git a/conftest.py b/conftest.py
new file mode 100644
index 0000000..31ae4b5
--- /dev/null
+++ b/conftest.py
@@ -0,0 +1,23 @@
+"""Pytest root conftest: reload benchmark env before tests import cuga.
+
+Eval shells source *.env via bash, which used to leave inline comments in
+values (e.g. ``accurate # Overall CUGA...``). Reload committed config with
+dotenv so ``just ci`` stays green even after a local eval run.
+"""
+
+from pathlib import Path
+
+_PROJECT_ROOT = Path(__file__).resolve().parent
+
+
+def _reload_benchmark_env() -> None:
+    from dotenv import load_dotenv
+
+    global_env = _PROJECT_ROOT / "config" / "global.env"
+    if global_env.exists():
+        load_dotenv(global_env, override=True)
+    for env_file in sorted((_PROJECT_ROOT / "benchmarks").glob("*/config/*.env")):
+        load_dotenv(env_file, override=True)
+
+
+_reload_benchmark_env()
diff --git a/scripts/create_eval_bundle.py b/scripts/create_eval_bundle.py
index 4d200e5..c1d0a98 100644
--- a/scripts/create_eval_bundle.py
+++ b/scripts/create_eval_bundle.py
@@ -65,8 +65,8 @@ def _default_log_files(benchmark: str) -> list[Path]:
             logs.append(path)
     # Same fixed paths as benchmarks/m3/eval.sh (not user-controlled).
     for fallback in (
-        Path("/tmp/m3_registry.log"),  # noqa: S108
-        Path("/tmp/m3_console.log"),  # noqa: S108
+        Path("/tmp/m3_registry.log"),  # noqa: S108 # nosec B108
+        Path("/tmp/m3_console.log"),  # noqa: S108 # nosec B108
     ):
         if fallback.is_file() and fallback.stat().st_size > 0:
             logs.append(fallback)

From 2eb12ddcbe93fbc0ead71f694b41a9ed84ab128b Mon Sep 17 00:00:00 2001
From: Harold Ship
Date: Sun, 31 May 2026 16:32:00 +0300
Subject: [PATCH 09/20] fix(m3): one eval run = one result file + one trajectory run (all tasks)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The auto-capability-pass logic ran each capability (m3_task_2, m3_task_3) as a
separate recursive run_config_mode and each pass saved its own result file.
With --m3-data covering 2 capabilities, one eval.sh invocation emitted two
100-task files instead of one 200-task file, so compare_report counted each
capability file as a separate "run" (inflated run count) and every run column
only showed 100 tasks (the other capability's tasks rendered as "—").
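
In miniature, the miscount described above looks like the toy sketch below.
It is illustrative only and is not code from eval_m3.py or compare_report:
the function names, the dict shapes, and the "m3_run" key are assumptions
made for the example, and the task records are placeholders.

```python
from typing import Any


def count_runs_by_file(result_files: dict[str, list[dict[str, Any]]]) -> int:
    """Naive run count: every result file is treated as one run."""
    return len(result_files)


def merge_capability_results(
    per_capability: dict[str, list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Concatenate per-capability task results into one list for a single run."""
    merged: list[dict[str, Any]] = []
    for capability in sorted(per_capability):
        merged.extend(per_capability[capability])
    return merged


# Toy data: two capabilities with 100 placeholder tasks each.
per_capability = {
    "m3_task_2": [{"task": f"t2-{i}"} for i in range(100)],
    "m3_task_3": [{"task": f"t3-{i}"} for i in range(100)],
}

# One file per capability: a file-based counter sees two runs of 100 tasks.
runs_before = count_runs_by_file(per_capability)

# One merged payload: the same invocation is one run covering 200 tasks.
merged = merge_capability_results(per_capability)
runs_after = count_runs_by_file({"m3_run": merged})
```

Counting runs per file is only correct when one invocation writes exactly one
file, which is the invariant this commit restores.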
- eval_m3.py: capability passes now run with defer_save=True and return their results; the parent aggregates them and writes ONE result file (+ ground-truth dump and summary) per eval.sh run via the new _finalize_and_save_results helper. Single-capability/--capability paths use the same helper. Also removes a duplicate _write_single_service_yaml definition. - compare.sh / bundle.py: trajectory folders (cuga writes one per domain) are now grouped per eval.sh run and merged into a single runN/trajectories dir, so one bundle run holds all 200 trajectories instead of one run per domain. The bundle CLI still accepts the legacy flat trajectory-dirs shape. Net effect: `compare.sh --runs 3 --compare-policies` produces 6 runs (3 policies + 3 no-policies), each covering all 200 tasks/trajectories. Co-authored-by: Cursor --- benchmarks/helpers/bundle.py | 57 ++++++++--- benchmarks/m3/compare.sh | 63 +++++++++--- benchmarks/m3/eval_m3.py | 193 ++++++++++++++++++----------------- 3 files changed, 190 insertions(+), 123 deletions(-) diff --git a/benchmarks/helpers/bundle.py b/benchmarks/helpers/bundle.py index d298515..a1e5ce0 100644 --- a/benchmarks/helpers/bundle.py +++ b/benchmarks/helpers/bundle.py @@ -517,11 +517,16 @@ def assemble_compare_bundle( bundle_root: Path | None = None, model_envs: dict | None = None, policies_dir: Path | None = None, - trajectory_dirs: dict[str, list[Path]] | None = None, + trajectory_dirs: dict[str, list[list[Path]]] | None = None, log_files: dict[str, list[str | Path]] | None = None, fetch_langfuse: bool = False, ) -> Path: - """Create a comparison-level bundle directory.""" + """Create a comparison-level bundle directory. + + ``trajectory_dirs`` maps each config key to a list of RUNS, where each run + is itself a list of trajectory folders (cuga emits one folder per domain). + All folders within a run are merged into a single ``runN/trajectories`` dir. 
+ """ benchmark_dir = PROJECT_ROOT / "benchmarks" / benchmark_name if bundle_root is None: bundle_root = benchmark_dir / "evaluation_bundles" @@ -567,19 +572,28 @@ def assemble_compare_bundle( # Policies _copy_policies(bundle_dir, policies_dir) - # Cuga trajectories (per-model, per-run) + # Cuga trajectories (per-model, per-run). `trajectory_dirs[config]` is a + # list of RUNS, and each run is a list of trajectory folders (cuga writes + # one folder per domain). All folders belonging to one eval.sh run are + # merged into that run's single `trajectories/` dir, so one bundle run maps + # to one eval run (all 200 trajectories) rather than one per-domain folder. if trajectory_dirs: - for config_key, traj_paths in trajectory_dirs.items(): - for i, traj_path in enumerate(traj_paths, 1): - traj_path = Path(traj_path) - if not traj_path.exists(): - continue + for config_key, run_groups in trajectory_dirs.items(): + for i, group in enumerate(run_groups, 1): run_label = f"{config_key.replace(':', '_')}_run{i}" - _copy_trajectories( - bundle_dir, - traj_path, - dest_subdir=f"runs/{run_label}/trajectories", - ) + copied_any = False + for traj_path in group: + traj_path = Path(traj_path) + if not traj_path.exists(): + continue + if _copy_trajectories( + bundle_dir, + traj_path, + dest_subdir=f"runs/{run_label}/trajectories", + ): + copied_any = True + if not copied_any: + continue # Copy .progress to run root so cuga-viz can find it _run_progress = bundle_dir / "runs" / run_label / "trajectories" / ".progress" if _run_progress.exists(): @@ -722,7 +736,12 @@ def cli(): p_cmp.add_argument("--task-files", nargs="*", default=None) p_cmp.add_argument("--policies-dir", default=None) p_cmp.add_argument("--model-envs", default=None, help='JSON: {"model": {"MODEL_NAME": "...", ...}}') - p_cmp.add_argument("--trajectory-dirs", default=None, help='JSON: {"model": ["/path/to/traj_run1", ...]}') + p_cmp.add_argument( + "--trajectory-dirs", + default=None, + help='JSON grouped by run: 
{"model": [["/run1/domA", "/run1/domB"], ["/run2/domA"]]}. ' + 'A flat {"model": ["/dir1", ...]} is still accepted (each dir treated as its own run).', + ) p_cmp.add_argument( "--log-files", default=None, @@ -775,7 +794,15 @@ def cli(): traj_dirs = None if args.trajectory_dirs: raw = json.loads(args.trajectory_dirs) - traj_dirs = {k: [Path(p) for p in v] for k, v in raw.items()} + # Accept two shapes: + # grouped (preferred): {config: [[dir, ...run1], [dir, ...run2]]} + # legacy flat: {config: [dir, dir, ...]} -> each dir = 1 run + traj_dirs = {} + for k, v in raw.items(): + if v and isinstance(v[0], list): + traj_dirs[k] = [[Path(p) for p in group] for group in v] + else: + traj_dirs[k] = [[Path(p)] for p in v] log_file_map = None if args.log_files: log_file_map = json.loads(args.log_files) diff --git a/benchmarks/m3/compare.sh b/benchmarks/m3/compare.sh index e70128c..9773b43 100755 --- a/benchmarks/m3/compare.sh +++ b/benchmarks/m3/compare.sh @@ -222,7 +222,15 @@ for config in "${CONFIGS[@]}"; do # config's runs. Filtering by agent prevents stale files from the OTHER # agent leaking into this config's recent_files. before_files=$(_list_results_for_agent "$agent") - before_trajs=$(find "$SCRIPT_DIR/logging/trajectory_data" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | sort) + + # Trajectory dirs are grouped per eval.sh run: cuga writes one folder per + # domain, so we snapshot before/after EACH run and record that run's new + # folders as one group. Groups are separated by a sentinel line so the JSON + # builder below can emit a list-of-lists (one inner list per run). This + # keeps "one eval.sh run = one bundle run" instead of one run per domain. 
+ run_before_trajs=$(find "$SCRIPT_DIR/logging/trajectory_data" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | sort) + config_traj_groups="" + TRAJ_GROUP_SEP="@@RUN@@" for ((r=1; r<=RUNS; r++)); do total_runs=$((total_runs+1)) @@ -243,6 +251,12 @@ for config in "${CONFIGS[@]}"; do runs_done=$(( runs_done + 1 )) runs_elapsed_total=$(( runs_elapsed_total + run_dur )) + # Record the trajectory folders this single run produced as one group. + run_after_trajs=$(find "$SCRIPT_DIR/logging/trajectory_data" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | sort) + run_new_trajs=$(comm -13 <(echo "$run_before_trajs") <(echo "$run_after_trajs")) + config_traj_groups+="${TRAJ_GROUP_SEP}"$'\n'"${run_new_trajs}"$'\n' + run_before_trajs="$run_after_trajs" + # After first run, reuse servers for all subsequent runs export SKIP_SERVER_START="true" echo "" @@ -255,11 +269,9 @@ for config in "${CONFIGS[@]}"; do CONFIG_RESULT_KEYS+=("$config") CONFIG_RESULT_VALS+=("$recent_files") - # Collect only NEW trajectory folders produced by this config's runs - after_trajs=$(find "$SCRIPT_DIR/logging/trajectory_data" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | sort) - recent_trajs=$(comm -13 <(echo "$before_trajs") <(echo "$after_trajs")) + # Store the per-run trajectory groups (sentinel-delimited) for this config. CONFIG_TRAJ_KEYS+=("$config") - CONFIG_TRAJ_VALS+=("$recent_trajs") + CONFIG_TRAJ_VALS+=("$config_traj_groups") done total_dur=$(( $(date +%s) - compare_t0 )) @@ -318,24 +330,41 @@ if [[ "${NO_BUNDLE:-false}" != "true" ]]; then MODEL_ENVS_JSON=$(build_model_envs_json "${MODEL_LIST[@]}") fi - # Build per-config trajectory dirs JSON: {"model:agent": ["/path/run1", ...]} + # Build per-config trajectory JSON grouped by run: + # {"model:agent:policy": [["/run1/domA", ...], ["/run2/domA", ...]]} + # CONFIG_TRAJ_VALS holds sentinel-delimited groups (one per eval run). 
TRAJ_JSON_PARTS=() for ci in "${!CONFIG_TRAJ_KEYS[@]}"; do tconfig="${CONFIG_TRAJ_KEYS[$ci]}" - tfiles="${CONFIG_TRAJ_VALS[$ci]}" - if [[ -z "$tfiles" ]]; then + tgroups="${CONFIG_TRAJ_VALS[$ci]}" + if [[ -z "$tgroups" ]]; then continue fi - tfile_list="" - tfirst=true - for f in $tfiles; do - if [[ "$tfirst" != "true" ]]; then - tfile_list+="," + groups_json="" + cur_group="" + in_group=false + while IFS= read -r line; do + if [[ "$line" == "$TRAJ_GROUP_SEP" ]]; then + if [[ "$in_group" == "true" ]]; then + if [[ -n "$groups_json" ]]; then groups_json+=","; fi + groups_json+="[${cur_group}]" + fi + cur_group="" + in_group=true + continue fi - tfirst=false - tfile_list+="\"${f}\"" - done - TRAJ_JSON_PARTS+=("\"${tconfig}\":[${tfile_list}]") + [[ -z "$line" ]] && continue + if [[ -n "$cur_group" ]]; then cur_group+=","; fi + cur_group+="\"${line}\"" + done <<< "$tgroups" + if [[ "$in_group" == "true" ]]; then + if [[ -n "$groups_json" ]]; then groups_json+=","; fi + groups_json+="[${cur_group}]" + fi + if [[ -z "$groups_json" ]]; then + continue + fi + TRAJ_JSON_PARTS+=("\"${tconfig}\":[${groups_json}]") done TRAJ_JSON_INPUT="{" diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 8530037..b8244fa 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -2084,31 +2084,6 @@ def _write_single_service_yaml(service_dict: Dict[str, Any]) -> str: return path -def _write_single_service_yaml(service_dict: Dict[str, Any]) -> str: - """Write a minimal registry yaml containing only the given service. - - Used in sequential mode so each expanded (task, domain) pair gets its own - registry with just that domain's MCP server loaded, instead of all ~20 - MCP servers running at once. 
- """ - import tempfile - - service_name = list(service_dict.keys())[0] - mini = {"services": [service_dict]} - fd, path = tempfile.mkstemp(suffix=".yaml", prefix=f"m3_registry_{service_name}_") - try: - with os.fdopen(fd, "w") as f: - yaml.dump(mini, f, default_flow_style=False, sort_keys=False) - except Exception: - # Best effort: clean up if write failed - try: - os.unlink(path) - except Exception: # noqa: S110 — unlink during error cleanup is best-effort - pass - raise - return path - - async def evaluate_tasks_in_batches(task_evaluations: List[tuple], batch_size: int, args) -> List[Any]: """Evaluate tasks in batches to manage resources for large-scale evaluation. @@ -2174,7 +2149,80 @@ async def evaluate_tasks_in_batches(task_evaluations: List[tuple], batch_size: i return all_results -async def run_config_mode(args, container_runtime: str): +def _finalize_and_save_results(all_results: List[Dict[str, Any]], no_ground_truth: bool): + """Persist exactly one result file (plus ground-truth dump) for a run. + + Shared by the single-capability path and the multi-capability aggregation + path so that ONE eval.sh invocation always yields ONE result file covering + every task it evaluated. Previously each capability pass saved its own + 100-task file, which made compare_report count one logical run as several + runs (one per capability) and made each "run" look like only 100 tasks. + """ + output_dir = Path(__file__).parent / "results" + + # In no-ground-truth mode there's no scoring — render the tool-call-count + # summary instead and capture it to the summary file. + if no_ground_truth: + _emit_cleanly(print_no_gt_summary, all_results) + try: + with open(M3_SUMMARY_FILE, "w") as _sf: + _sf.write(_render_no_gt_summary(all_results)) + logger.info(f"Summary written to {M3_SUMMARY_FILE}") + except Exception as e: + logger.warning(f"Failed to write summary to {M3_SUMMARY_FILE}: {e}") + + # Save raw results JSON and skip vakra-format ground-truth dump. 
+ saved_path = save_evaluation_results(all_results, output_dir, prefix="m3_config_no_gt") + logger.info(f"\nResults saved to: {saved_path}") + return saved_path + + # Vakra is the source of truth for the overall summary. We capture it to + # M3_SUMMARY_FILE so eval.sh can re-echo it as the last thing on screen. + if any("vakra" in r for r in all_results): + _emit_cleanly(print_vakra_summary, all_results) + try: + import io as _io + + buf = _io.StringIO() + _orig = sys.__stdout__ + + # Re-render to capture text for the summary file + class _Cap: + def write(self, s): + buf.write(s) + return len(s) + + def flush(self): + pass + + sys.__stdout__ = _Cap() # type: ignore[assignment] + try: + print_vakra_summary(all_results) + finally: + sys.__stdout__ = _orig # type: ignore[assignment] + with open(M3_SUMMARY_FILE, "w") as _sf: + _sf.write(buf.getvalue()) + logger.info(f"Summary written to {M3_SUMMARY_FILE}") + except Exception as e: + logger.warning(f"Failed to write summary to {M3_SUMMARY_FILE}: {e}") + else: + logger.warning( + "No Vakra scores produced for any task — check API_KEY and the per-domain Vakra warnings above." + ) + + # Save results + saved_path = save_evaluation_results(all_results, output_dir, prefix="m3_config") + logger.info(f"\nResults saved to: {saved_path}") + + # Save ground truth format + evaluator_temp = M3Evaluator() + evaluator_temp.results = all_results + ground_truth_path = evaluator_temp._save_ground_truth_format(output_dir) + logger.info(f"Ground truth format saved to: {ground_truth_path}") + return saved_path + + +async def run_config_mode(args, container_runtime: str, defer_save: bool = False): """Run evaluation in config mode with task-level parallelism and optional batching. Tasks run in parallel (each uses separate container). 
@@ -2217,13 +2265,29 @@ async def run_config_mode(args, container_runtime: str): ) import copy + # Run each capability as its own pass (separate registry/expanded + # config to dodge cross-task domain-name collisions), but collect + # every pass's results and persist them together as a SINGLE result + # file for this run. One eval.sh run -> one file -> all tasks. + combined_results: List[Dict[str, Any]] = [] for task_id in cap_ids: cap_name = f"m3_task_{task_id}" logger.info(f"\n{'=' * 80}\n🔁 Auto capability pass: {cap_name}\n{'=' * 80}") pass_args = copy.copy(args) pass_args.task = [cap_name] + preserved - await run_config_mode(pass_args, container_runtime) - return + pass_results = await run_config_mode(pass_args, container_runtime, defer_save=True) + if pass_results: + combined_results.extend(pass_results) + + if combined_results: + logger.info( + f"🧮 Aggregated {len(combined_results)} results across " + f"{len(cap_ids)} capability pass(es) → writing one result file" + ) + _finalize_and_save_results(combined_results, no_ground_truth) + else: + logger.warning("⚠️ No results produced across capability passes.") + return combined_results if len(cap_ids) == 1: cap_name = f"m3_task_{cap_ids[0]}" logger.info(f"No --capability filter: auto-narrowing to data capability {cap_name}") @@ -2540,72 +2604,19 @@ def _service_has_wanted_domain(svc_dict): sys.stderr.flush() if all_results: - # In no-ground-truth mode there's no scoring — render the - # tool-call-count summary instead and capture to the summary file. - if no_ground_truth: - _emit_cleanly(print_no_gt_summary, all_results) - try: - with open(M3_SUMMARY_FILE, "w") as _sf: - _sf.write(_render_no_gt_summary(all_results)) - logger.info(f"Summary written to {M3_SUMMARY_FILE}") - except Exception as e: - logger.warning(f"Failed to write summary to {M3_SUMMARY_FILE}: {e}") - - # Save raw results JSON and skip vakra-format ground-truth dump. 
- output_dir = Path(__file__).parent / "results" - saved_path = save_evaluation_results(all_results, output_dir, prefix="m3_config_no_gt") - logger.info(f"\nResults saved to: {saved_path}") - return - - # Vakra is the source of truth for the overall summary. We capture - # it to M3_SUMMARY_FILE so eval.sh can re-echo it as the last thing - # on screen. - if any("vakra" in r for r in all_results): - _emit_cleanly(print_vakra_summary, all_results) - try: - import io as _io - - buf = _io.StringIO() - _orig = sys.__stdout__ - - # Re-render to capture text for the summary file - class _Cap: - def write(self, s): - buf.write(s) - return len(s) - - def flush(self): - pass - - sys.__stdout__ = _Cap() # type: ignore[assignment] - try: - print_vakra_summary(all_results) - finally: - sys.__stdout__ = _orig # type: ignore[assignment] - with open(M3_SUMMARY_FILE, "w") as _sf: - _sf.write(buf.getvalue()) - logger.info(f"Summary written to {M3_SUMMARY_FILE}") - except Exception as e: - logger.warning(f"Failed to write summary to {M3_SUMMARY_FILE}: {e}") - else: - logger.warning( - "No Vakra scores produced for any task — check API_KEY and " - "the per-domain Vakra warnings above." - ) - - # Save results - output_dir = Path(__file__).parent / "results" - saved_path = save_evaluation_results(all_results, output_dir, prefix="m3_config") - logger.info(f"\nResults saved to: {saved_path}") - - # Save ground truth format - evaluator_temp = M3Evaluator() - evaluator_temp.results = all_results - ground_truth_path = evaluator_temp._save_ground_truth_format(output_dir) - logger.info(f"Ground truth format saved to: {ground_truth_path}") + # If this invocation is one capability sub-pass of a larger + # multi-capability run, return the results unsaved so the caller can + # aggregate every capability into ONE result file (one eval.sh run = + # one file = all tasks). Saving here is what previously produced a + # separate 100-task file per capability. 
+ if defer_save: + return all_results + _finalize_and_save_results(all_results, no_ground_truth) else: logger.warning("⚠️ No results produced. Check the registry logs and task filters.") + return all_results + finally: # Stop registry if it was started if registry_process is not None: From 1852ea58f87f049af172a387e85b71614fa97b14 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Sun, 31 May 2026 17:30:30 +0300 Subject: [PATCH 10/20] fix(m3): make sequential per-domain registry restarts reliable on the port Sequential mode starts/stops a registry per domain on the same port. stop only waited on the `uv` wrapper, so the uvicorn worker could still hold the port when the next domain started, and start_registry_server hard-failed immediately on a busy port ("Port 18001 is already in use" mid-run, e.g. on talkingdata). - eval_m3.py: add _port_in_use/_wait_for_port_free/_kill_port_listeners helpers. start_registry_server now frees + waits for the port (up to 20s) instead of failing on first check; stop_registry_server waits for the port to actually be released (and force-kills stray listeners) before returning. - eval.sh: only start the "outer" registry for multiturn. Single-turn and --m3-data flows (cuga/react) manage their own per-service registries, so the outer registry (which compare.sh forces on via SKIP_SERVER_START=false on its first run) only collided on $REGISTRY_PORT. Fixes both direct eval and compare. Co-authored-by: Cursor --- benchmarks/m3/eval.sh | 9 ++++ benchmarks/m3/eval_m3.py | 104 +++++++++++++++++++++++++++++++-------- 2 files changed, 92 insertions(+), 21 deletions(-) diff --git a/benchmarks/m3/eval.sh b/benchmarks/m3/eval.sh index f8de486..32ede0e 100755 --- a/benchmarks/m3/eval.sh +++ b/benchmarks/m3/eval.sh @@ -173,6 +173,15 @@ if port_in_use $REGISTRY_PORT 2>/dev/null; then sleep 1 fi +# Only the multiturn flow relies on an externally-started ("outer") registry. 
+# The single-turn and --m3-data flows (eval_m3.py / eval_m3_react.py) start and +# manage their OWN per-service registries on $REGISTRY_PORT, so starting an +# outer registry here would just collide on the port (e.g. compare.sh forces +# SKIP_SERVER_START=false on its first run). Force-skip unless multiturn. +if [ "$MULTITURN" != "true" ]; then + SKIP_SERVER_START="true" +fi + if [ "${SKIP_SERVER_START:-true}" = "false" ]; then echo -e "${YELLOW:-}Starting registry server on port $REGISTRY_PORT...${NC:-}" bash "$SCRIPT_DIR/run_registry.sh" > /tmp/m3_registry.log 2>&1 & diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index b8244fa..b926d7a 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -1571,6 +1571,51 @@ def get_registry_port() -> int: return int(settings.server_ports.registry) +def _port_in_use(port: int) -> bool: + """Return True if something is listening on 127.0.0.1:`port`.""" + import socket + + sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + try: + return sock.connect_ex(("127.0.0.1", port)) == 0 + finally: + sock.close() + + +async def _wait_for_port_free(port: int, timeout: float = 20.0) -> bool: + """Poll until `port` has no listener. Returns True if it freed up in time. + + Sequential mode starts/stops a registry per domain on the same port; a + just-stopped uvicorn worker can hold the socket for a few seconds during + graceful shutdown, so the next domain must wait rather than fail instantly. 
+ """ + import time + + deadline = time.monotonic() + timeout + while True: + if not _port_in_use(port): + return True + if time.monotonic() >= deadline: + return False + await asyncio.sleep(0.5) + + +def _kill_port_listeners(port: int) -> None: + """Best-effort SIGKILL of any process listening on `port` (via lsof).""" + import signal + import subprocess + + try: + out = subprocess.run(["lsof", "-ti", f":{port}"], capture_output=True, text=True) # noqa: S603,S607 — lsof from PATH, fixed args + for pid in out.stdout.split(): + try: + os.kill(int(pid), signal.SIGKILL) + except (ProcessLookupError, ValueError): + pass + except Exception as e: # noqa: BLE001 — best-effort cleanup + logger.debug(f"Could not enumerate/kill listeners on port {port}: {e}") + + async def start_registry_server(config_path: str) -> subprocess.Popen: """Start the registry server with the specified config. @@ -1588,28 +1633,32 @@ async def start_registry_server(config_path: str) -> subprocess.Popen: # Check if the registry port is already in use logger.info(f"🔍 Checking if port {registry_port} is available...") try: - import socket - - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - result = sock.connect_ex(('127.0.0.1', registry_port)) - sock.close() - - if result == 0: - # Port is in use - logger.error(f"❌ Port {registry_port} is already in use!") - logger.error("Another registry server or process is using this port.") - logger.error("") - logger.error("To fix this, run one of these commands:") - logger.error(f" 1. Kill processes on port {registry_port}:") - logger.error(f" lsof -ti :{registry_port} | xargs kill") - logger.error("") - logger.error(" 2. Or find and kill specific process:") - logger.error(f" lsof -i :{registry_port}") - logger.error(" kill ") - logger.error("") - raise RuntimeError( - f"Port {registry_port} is already in use. Please kill the existing process first." 
+ if _port_in_use(registry_port): + # Port is busy — most often a registry from the PREVIOUS service in + # a sequential run that hasn't released the socket yet. Proactively + # kill any stray listener and wait for the port to free up before + # giving up. + logger.warning( + f"⚠️ Port {registry_port} is in use — attempting to free it " + f"(likely the previous service's registry shutting down)..." ) + _kill_port_listeners(registry_port) + if not await _wait_for_port_free(registry_port, timeout=20.0): + logger.error(f"❌ Port {registry_port} is still in use after waiting!") + logger.error("Another registry server or process is using this port.") + logger.error("") + logger.error("To fix this, run one of these commands:") + logger.error(f" 1. Kill processes on port {registry_port}:") + logger.error(f" lsof -ti :{registry_port} | xargs kill") + logger.error("") + logger.error(" 2. Or find and kill specific process:") + logger.error(f" lsof -i :{registry_port}") + logger.error(" kill ") + logger.error("") + raise RuntimeError( + f"Port {registry_port} is already in use. Please kill the existing process first." + ) + logger.info(f"✅ Port {registry_port} is now free") except RuntimeError: raise # Re-raise the port-in-use error except Exception as e: @@ -1841,6 +1890,19 @@ def _kill_group(sig: int) -> None: except Exception as e: logger.error(f"❌ Error stopping registry: {e}") + # `process.wait()` only reaps the `uv` wrapper; the uvicorn worker holding + # the port can linger briefly. Wait for the OS to release the registry port + # so the next sequential service can bind it without racing (the error that + # previously surfaced as "Port N is already in use" on the next domain). 
+ try: + registry_port = get_registry_port() + if not await _wait_for_port_free(registry_port, timeout=15.0): + logger.warning(f"⚠️ Port {registry_port} still occupied after stop — killing stray listeners") + _kill_port_listeners(registry_port) + await _wait_for_port_free(registry_port, timeout=10.0) + except Exception as e: # noqa: BLE001 — best-effort port-release wait + logger.debug(f"Port-release wait after stop failed (continuing): {e}") + def rewrite_config_with_loader_domains(config_path: str, m3_data_loader: M3DataLoader) -> str: """Write a copy of `config_path` with each service's `metadata.domains` From cf4303e0541e5a7b0a487786017f183d508920ea Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Mon, 1 Jun 2026 11:38:39 +0300 Subject: [PATCH 11/20] fix(m3): add capability/domain/task# report columns + per-run bundle logs - compare_report: per-task comparison table now leads with Capability, Domain, and # (1-based input task number), sorted by (capability, domain, task#); non-M3 benchmarks keep the legacy table. - thread task_number from m3_data_loader through eval_m3/eval_m3_react into results so the report can render it. - bundle.assemble_compare_bundle accepts per-run grouped logs (runs/_run/logs) while staying backward-compatible with flat/shared logs. - m3/compare.sh snapshots each run's console + live registry_server.log (fixing the stale /tmp/m3_registry.log) and emits per-run grouped --log-files; appworld and other benchmarks keep shared logs. 
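The two log shapes accepted by bundle.assemble_compare_bundle can be sketched as follows. This is an illustration under assumptions, not the bundle code itself: `plan_log_destinations` is a hypothetical helper that only computes destination sub-directories (it copies nothing), and the config keys and paths in the usage note are made up.

```python
from pathlib import Path


def plan_log_destinations(log_files: dict) -> dict[str, list[Path]]:
    """Map each destination sub-directory to the log files it should receive."""
    plan: dict[str, list[Path]] = {}
    for config_key, value in log_files.items():
        # Colons in a config key are not path-friendly, so swap them out.
        label = config_key.replace(":", "_")
        if value and isinstance(value[0], list):
            # Grouped shape: one inner list per run, so each run keeps its own logs.
            for run_index, group in enumerate(value, 1):
                plan[f"runs/{label}_run{run_index}/logs"] = [Path(p) for p in group]
        else:
            # Flat (legacy or shared) shape: every file goes under the one key.
            plan[f"runs/{label}/logs"] = [Path(p) for p in value]
    return plan
```

Called with `{"gpt:cuga": [["/a/c1.log"], ["/a/c2.log", "/a/r2.log"]], "shared": ["/tmp/x.log"]}`, the sketch plans one logs directory per run for the grouped key and a single shared directory for the flat key. An empty list is treated as flat, because the shape is detected from the first element.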
Co-authored-by: Cursor --- benchmarks/helpers/bundle.py | 23 +++++-- benchmarks/helpers/compare_report.py | 70 +++++++++++++++++++-- benchmarks/m3/compare.sh | 92 +++++++++++++++++++++++++++- benchmarks/m3/eval_m3.py | 2 + benchmarks/m3/eval_m3_react.py | 3 + benchmarks/m3/m3_data_loader.py | 5 +- 6 files changed, 181 insertions(+), 14 deletions(-) diff --git a/benchmarks/helpers/bundle.py b/benchmarks/helpers/bundle.py index a1e5ce0..5c78e4d 100644 --- a/benchmarks/helpers/bundle.py +++ b/benchmarks/helpers/bundle.py @@ -526,6 +526,11 @@ def assemble_compare_bundle( ``trajectory_dirs`` maps each config key to a list of RUNS, where each run is itself a list of trajectory folders (cuga emits one folder per domain). All folders within a run are merged into a single ``runN/trajectories`` dir. + + ``log_files`` maps each config key to either a grouped per-run list + (``[[run1 logs], [run2 logs], ...]`` → ``runN/logs``) or a flat list + (legacy / ``"shared"`` key → ``runs//logs``). Per-run grouping keeps + each run's own console + registry log instead of only the last run's. """ benchmark_dir = PROJECT_ROOT / "benchmarks" / benchmark_name if bundle_root is None: @@ -599,11 +604,21 @@ def assemble_compare_bundle( if _run_progress.exists(): shutil.copy2(_run_progress, bundle_dir / "runs" / run_label / ".progress") - # Logs (per-model) + # Logs. Two accepted shapes per config key: + # grouped per-run (preferred): [[run1 logs...], [run2 logs...], ...] + # → each run's logs land in runs/_run/logs so every run in a + # multi-run comparison keeps its OWN console/registry log. + # flat (legacy / "shared"): [log, log, ...] + # → placed in runs//logs (e.g. the "shared" key → runs/shared/logs). 
if log_files: - for config_key, lf_list in log_files.items(): - run_label = f"{config_key.replace(':', '_')}" - _copy_logs(bundle_dir, lf_list, dest_subdir=f"runs/{run_label}/logs") + for config_key, lf_val in log_files.items(): + if lf_val and isinstance(lf_val[0], list): + for i, group in enumerate(lf_val, 1): + run_label = f"{config_key.replace(':', '_')}_run{i}" + _copy_logs(bundle_dir, group, dest_subdir=f"runs/{run_label}/logs") + else: + run_label = f"{config_key.replace(':', '_')}" + _copy_logs(bundle_dir, lf_val, dest_subdir=f"runs/{run_label}/logs") # Langfuse traces (per-model, per-run) if fetch_langfuse: diff --git a/benchmarks/helpers/compare_report.py b/benchmarks/helpers/compare_report.py index 4598268..a4af5ee 100644 --- a/benchmarks/helpers/compare_report.py +++ b/benchmarks/helpers/compare_report.py @@ -99,6 +99,9 @@ def _parse_sdk_results(data: dict) -> dict: # M3-specific tags so the eval report can group by (task, domain). "m3_task_id": r.get("m3_task_id"), "domain": r.get("domain"), + # 1-based position of this sample within its (capability, domain) + # input file. Lets reports show the source "task number". + "task_number": r.get("task_number"), "uuid": r.get("uuid") or r.get("task_name") or r.get("name"), } @@ -406,14 +409,57 @@ def generate_report(config_results: dict[str, list[str]], markdown: bool = True) lines.append(fence_open()) # Collect all task IDs across runs - all_tasks = sorted({t for r in runs for t in r["tasks"].keys()}) + all_tasks = list({t for r in runs for t in r["tasks"].keys()}) + + # M3 result files tag each task with capability (m3_task_id), domain and + # the 1-based task number from the input data. When present we surface + # them as leading columns and order rows by (capability, domain, #) so a + # UUID-only row becomes attributable. Non-M3 benchmarks (e.g. AppWorld) + # don't set these → columns are suppressed and the legacy layout stands. 
+ task_meta = {} + for r in runs: + for tname, t in r["tasks"].items(): + task_meta.setdefault(tname, t) + + def _cap_label(t, _m=task_meta): + tid = _m.get(t, {}).get("m3_task_id") + return f"m3_task_{tid}" if tid is not None else "" + + def _dom_label(t, _m=task_meta): + return _m.get(t, {}).get("domain") or "" + + def _num_label(t, _m=task_meta): + n = _m.get(t, {}).get("task_number") + return str(n) if n is not None else "" + + m3_mode = any( + task_meta.get(t, {}).get("m3_task_id") is not None and task_meta.get(t, {}).get("domain") + for t in all_tasks + ) + if m3_mode: + all_tasks.sort( + key=lambda t: ( + task_meta.get(t, {}).get("m3_task_id") or 0, + _dom_label(t), + task_meta.get(t, {}).get("task_number") or 0, + t, + ) + ) + cap_w = max(len("Capability"), max((len(_cap_label(t)) for t in all_tasks), default=0)) + dom_w = max(len("Domain"), max((len(_dom_label(t)) for t in all_tasks), default=0)) + num_w = max(len("#"), max((len(_num_label(t)) for t in all_tasks), default=0)) + prefix_hdr = f"{'Capability':<{cap_w}} {'Domain':<{dom_w}} {'#':>{num_w}} " + else: + all_tasks.sort() + cap_w = dom_w = num_w = 0 + prefix_hdr = "" n_runs = len(runs) run_cols = " ".join(f"R{i + 1}" for i in range(n_runs)) # Truncate task IDs to keep table readable but distinguishable col_task_w = min(28, max((len(t) for t in all_tasks), default=8)) task_header = ( - f"{'Task':<{col_task_w}} {run_cols} {'Successes':>10} " + f"{prefix_hdr}{'Task':<{col_task_w}} {run_cols} {'Successes':>10} " f"{'Rate':>6} {'Tokens':>8} {'LLM':>5} {'Time':>6}" ) lines.append(task_header) @@ -459,8 +505,13 @@ def generate_report(config_results: dict[str, list[str]], markdown: bool = True) n_dur += 1 task_disp = task if len(task) <= col_task_w else task[: col_task_w - 1] + "…" + row_prefix = ( + f"{_cap_label(task):<{cap_w}} {_dom_label(task):<{dom_w}} {_num_label(task):>{num_w}} " + if m3_mode + else "" + ) lines.append( - f"{task_disp:<{col_task_w}} {symbols} " + 
f"{row_prefix}{task_disp:<{col_task_w}} {symbols} " f"{successes:>3}/{total:<3} {rate_pct:>5.1f}% " f"{_fmt(mt):>8} {_fmt(ml):>5} {_fmt(md, 's'):>6}" ) @@ -475,8 +526,9 @@ def generate_report(config_results: dict[str, list[str]], markdown: bool = True) avg_dur = _fmt(sum_dur / n_dur, "s") if n_dur else "--" lines.append("─" * len(task_header)) spacer = " ".join("──" for _ in range(n_runs)) + avg_prefix = f"{'':<{cap_w}} {'':<{dom_w}} {'':>{num_w}} " if m3_mode else "" lines.append( - f"{'AVERAGE':<{col_task_w}} {spacer} " + f"{avg_prefix}{'AVERAGE':<{col_task_w}} {spacer} " f"{avg_successes:>3.1f}/{n_runs:<3} {avg_rate:>5.1f}% " f"{avg_tok:>8} {avg_llm:>5} {avg_dur:>6}" ) @@ -551,14 +603,20 @@ def _bucket_m3_tasks(tasks: dict) -> tuple: rows = [] for key in sorted(buckets.keys()): - members = sorted(buckets[key], key=lambda nt: nt[1].get("uuid") or nt[0]) + # Order within a (capability, domain) bucket by the input-data task + # number when present (stable, matches the source file), else by uuid. + members = sorted( + buckets[key], + key=lambda nt: (nt[1].get("task_number") or 0, nt[1].get("uuid") or nt[0]), + ) for i, (name, t) in enumerate(members, start=1): rows.append( { "label": name, "m3_task_id": key[0], "domain": key[1], - "ordinal": i, + # Prefer the source task number; fall back to positional. + "ordinal": t.get("task_number") if t.get("task_number") is not None else i, "uuid": t.get("uuid") or name, "data": t, } diff --git a/benchmarks/m3/compare.sh b/benchmarks/m3/compare.sh index 9773b43..e4162ed 100755 --- a/benchmarks/m3/compare.sh +++ b/benchmarks/m3/compare.sh @@ -172,6 +172,8 @@ compare_t0=$(date +%s) compare_cleanup() { echo -e "${YELLOW:-}Stopping servers...${NC:-}" kill_port_processes "${REGISTRY_PORT:-8001}" + # Staged per-run logs were already copied into the bundle by now. 
+ [[ -n "${LOG_STAGE_DIR:-}" && -d "$LOG_STAGE_DIR" ]] && rm -rf "$LOG_STAGE_DIR" } trap compare_cleanup EXIT INT TERM @@ -182,6 +184,16 @@ CONFIG_RESULT_KEYS=() CONFIG_RESULT_VALS=() CONFIG_TRAJ_KEYS=() CONFIG_TRAJ_VALS=() +CONFIG_LOG_KEYS=() +CONFIG_LOG_VALS=() + +# Per-run logs are snapshotted into this staging dir after each eval.sh run. +# The /tmp console/registry logs are overwritten by the next run, so without a +# snapshot a multi-run bundle would only keep the LAST run's logs. Sentinel +# grouping (one group per run) mirrors the trajectory collection below. +LOG_STAGE_DIR="$(mktemp -d 2>/dev/null || echo "/tmp/m3_log_stage_$$")" +mkdir -p "$LOG_STAGE_DIR" +LOG_GROUP_SEP="@@RUN@@" # Per-agent filename discrimination. cuga's eval_m3.py saves result files # with prefix m3_config_*.json; eval_m3_react.py saves m3_*.json. The plain @@ -230,6 +242,7 @@ for config in "${CONFIGS[@]}"; do # keeps "one eval.sh run = one bundle run" instead of one run per domain. run_before_trajs=$(find "$SCRIPT_DIR/logging/trajectory_data" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | sort) config_traj_groups="" + config_log_groups="" TRAJ_GROUP_SEP="@@RUN@@" for ((r=1; r<=RUNS; r++)); do @@ -257,6 +270,31 @@ for config in "${CONFIGS[@]}"; do config_traj_groups+="${TRAJ_GROUP_SEP}"$'\n'"${run_new_trajs}"$'\n' run_before_trajs="$run_after_trajs" + # Snapshot THIS run's logs before the next eval.sh run overwrites them. + # Console: eval.sh tees stdout to /tmp/m3_console.log (truncated each + # run, so it holds exactly this run). Registry: the --m3-data flow lets + # eval_m3.py manage per-service registries and write to + # benchmarks/m3/registry_server.log; the multiturn flow uses the outer + # /tmp/m3_registry.log. Prefer the former, fall back to the latter. 
+ run_log_dir="$LOG_STAGE_DIR/$(echo "$config" | tr ':/' '__')_run${r}" + mkdir -p "$run_log_dir" + run_log_lines="" + if [[ -f /tmp/m3_console.log ]]; then + cp -f /tmp/m3_console.log "$run_log_dir/m3_console.log" 2>/dev/null \ + && run_log_lines+="$run_log_dir/m3_console.log"$'\n' + fi + reg_src="" + if [[ -s "$SCRIPT_DIR/registry_server.log" ]]; then + reg_src="$SCRIPT_DIR/registry_server.log" + elif [[ -s /tmp/m3_registry.log ]]; then + reg_src="/tmp/m3_registry.log" + fi + if [[ -n "$reg_src" ]]; then + cp -f "$reg_src" "$run_log_dir/m3_registry.log" 2>/dev/null \ + && run_log_lines+="$run_log_dir/m3_registry.log"$'\n' + fi + config_log_groups+="${LOG_GROUP_SEP}"$'\n'"${run_log_lines}" + # After first run, reuse servers for all subsequent runs export SKIP_SERVER_START="true" echo "" @@ -272,6 +310,10 @@ for config in "${CONFIGS[@]}"; do # Store the per-run trajectory groups (sentinel-delimited) for this config. CONFIG_TRAJ_KEYS+=("$config") CONFIG_TRAJ_VALS+=("$config_traj_groups") + + # Store the per-run log groups (sentinel-delimited) for this config. 
+ CONFIG_LOG_KEYS+=("$config") + CONFIG_LOG_VALS+=("$config_log_groups") done total_dur=$(( $(date +%s) - compare_t0 )) @@ -399,9 +441,53 @@ if [[ "${NO_BUNDLE:-false}" != "true" ]]; then if [[ "$TRAJ_JSON_INPUT" != "{}" ]]; then BUNDLE_CMD+=(--trajectory-dirs "$TRAJ_JSON_INPUT") fi - # Include server logs (from last run) - LOG_JSON="{\"shared\":[\"/tmp/m3_registry.log\",\"/tmp/m3_console.log\"]}" - BUNDLE_CMD+=(--log-files "$LOG_JSON") + # Build per-config log JSON grouped by run (one console+registry log set + # per eval run) so each run folder gets its OWN logs: + # {"model:agent:policy": [["/run1/console.log", ...], ["/run2/...", ...]]} + LOG_JSON_PARTS=() + for ci in "${!CONFIG_LOG_KEYS[@]}"; do + lconfig="${CONFIG_LOG_KEYS[$ci]}" + lgroups="${CONFIG_LOG_VALS[$ci]}" + if [[ -z "$lgroups" ]]; then + continue + fi + lgroups_json="" + lcur_group="" + lin_group=false + while IFS= read -r line; do + if [[ "$line" == "$LOG_GROUP_SEP" ]]; then + if [[ "$lin_group" == "true" ]]; then + if [[ -n "$lgroups_json" ]]; then lgroups_json+=","; fi + lgroups_json+="[${lcur_group}]" + fi + lcur_group="" + lin_group=true + continue + fi + [[ -z "$line" ]] && continue + if [[ -n "$lcur_group" ]]; then lcur_group+=","; fi + lcur_group+="\"${line}\"" + done <<< "$lgroups" + if [[ "$lin_group" == "true" ]]; then + if [[ -n "$lgroups_json" ]]; then lgroups_json+=","; fi + lgroups_json+="[${lcur_group}]" + fi + if [[ -z "$lgroups_json" ]]; then + continue + fi + LOG_JSON_PARTS+=("\"${lconfig}\":[${lgroups_json}]") + done + LOG_JSON="{" + ljfirst=true + for part in "${LOG_JSON_PARTS[@]}"; do + if [[ "$ljfirst" != "true" ]]; then LOG_JSON+=","; fi + ljfirst=false + LOG_JSON+="$part" + done + LOG_JSON+="}" + if [[ "$LOG_JSON" != "{}" ]]; then + BUNDLE_CMD+=(--log-files "$LOG_JSON") + fi # Download Langfuse traces if available BUNDLE_CMD+=(--fetch-langfuse) if [[ "${BUNDLE_ZIP:-false}" == "true" ]]; then diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 
b926d7a..3a19716 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -742,6 +742,8 @@ async def evaluate_multiturn_task(self, sample: Dict[str, Any], sample_index: in if "uuid" in sample: result["uuid"] = sample["uuid"] result["domain"] = domain + if "task_number" in sample: + result["task_number"] = sample["task_number"] # Surface the GT bits Vakra needs so _to_vakra_pair can build a real # ground-truth dialogue (single-turn samples; multi-turn would need diff --git a/benchmarks/m3/eval_m3_react.py b/benchmarks/m3/eval_m3_react.py index 62f16ce..d3800a7 100644 --- a/benchmarks/m3/eval_m3_react.py +++ b/benchmarks/m3/eval_m3_react.py @@ -123,6 +123,7 @@ def _merged_to_react_test_case( "intent": intent, "domain": sample.get("domain"), "m3_task_id": task_id, + "task_number": sample.get("task_number"), "expected_output": { "response": gt_answer, "tool_calls": gold_calls, @@ -400,6 +401,8 @@ async def evaluate_task(self, task: Dict[str, Any], task_index: int) -> Dict[str result["uuid"] = task["uuid"] if task.get("domain") and not result.get("domain"): result["domain"] = task["domain"] + if task.get("task_number") is not None and "task_number" not in result: + result["task_number"] = task["task_number"] if task.get("intent") and not result.get("intent"): result["intent"] = task["intent"] if task.get("expected_output"): diff --git a/benchmarks/m3/m3_data_loader.py b/benchmarks/m3/m3_data_loader.py index 838e171..d1d545f 100644 --- a/benchmarks/m3/m3_data_loader.py +++ b/benchmarks/m3/m3_data_loader.py @@ -199,7 +199,7 @@ def load_domain(self, task_id: int, domain: str) -> List[Dict[str, Any]]: } merged: List[Dict[str, Any]] = [] - for sample in inputs: + for idx, sample in enumerate(inputs, 1): uuid = sample.get("uuid") gold = outputs_by_uuid.get(uuid) @@ -246,6 +246,9 @@ def load_domain(self, task_id: int, domain: str) -> List[Dict[str, Any]]: merged_sample: Dict[str, Any] = { "uuid": uuid, "sample_id": uuid, + # 1-based position of this sample 
within its (capability, domain) + # input list — surfaced as the per-task "#" in reports. + "task_number": idx, "domain": sample.get("domain", domain), "num_turns": sample.get("num_turns", len(turns)), "dialogue": {"turns": turns}, From f4505c0df400bd56db6b4d5c9cdff916642c5251 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Mon, 1 Jun 2026 15:10:13 +0300 Subject: [PATCH 12/20] fix(m3): tune eval env and add defensive tool-output instructions Reorganize m3.env with documented settings, disable Evolve, cap cuga_lite_max_steps at 35 (GT M=3 in small_train, formula uses M=4 padding), and inject special_instructions so CugaLite probes unknown tool shapes then accesses results with isinstance checks. Co-authored-by: Cursor --- benchmarks/m3/config/m3.env | 111 ++++++++++++++++-------------------- benchmarks/m3/eval_m3.py | 25 ++++++++ 2 files changed, 74 insertions(+), 62 deletions(-) diff --git a/benchmarks/m3/config/m3.env b/benchmarks/m3/config/m3.env index aaedbfd..af20ae4 100644 --- a/benchmarks/m3/config/m3.env +++ b/benchmarks/m3/config/m3.env @@ -1,77 +1,64 @@ -# M3 benchmark specific configuration -# These are not secrets and should be committed to the repository +# M3 benchmark configuration (not secrets — safe to commit) -# Container runtime (docker or podman) -# Set to full path if not in PATH +# --- Container runtime --- +# docker or podman; set full path if not on PATH CONTAINER_RUNTIME=docker -# Registry and logging paths -# Registry enabled for find_tools functionality -# Note: MCP_SERVERS_FILE should NOT be set here for M3 benchmark -# The eval_m3.py script will expand m3_registry.yaml (replacing {domain} placeholders) -# and set MCP_SERVERS_FILE to the expanded config at runtime -CUGA_LOGGING_DIR=benchmarks/m3/logging -# DYNACONF_ADVANCED_FEATURES__REFLECTION_ENABLED=false +# --- Registry & logging --- +# Registry provides find_tools and MCP tool routing for the agent DYNACONF_ADVANCED_FEATURES__REGISTRY=true - -# M3 evaluation script starts its own 
registry server with expanded config -# Skip the default registry startup in eval.sh to avoid conflicts +# Where CUGA writes trajectory / activity logs for this benchmark +CUGA_LOGGING_DIR=benchmarks/m3/logging +# eval_m3.py expands m3_registry.yaml per domain at runtime — do not set MCP_SERVERS_FILE here +# eval.sh must not start a second registry; eval_m3.py manages its own per-service registry SKIP_SERVER_START=true -DYNACONF_POLICY__ENABLED=true -DYNACONF_ADVANCED_FEATURES__BENCHMARK=m3 -# DYNACONF_ADVANCED_FEATURES__FORCE_AUTONOMOUS_MODE=false +# Which URL path segment names MCP operations in the registry (M3 uses segment index 3) DYNACONF_ADVANCED_FEATURES__PATH_SEGMENT_INDEX=3 -# Tool call and find_tools configuration -# Increase tool call timeout from default 30 seconds to 120 seconds -# This helps with large tool sets (e.g., 122 Olympics tools) where find_tools takes longer -DYNACONF_ADVANCED_FEATURES__TOOL_CALL_TIMEOUT=120 - -# Enable find_tools with reasonable threshold (we have 206 hockey tools) -# With 120s timeout, find_tools should complete successfully -# Threshold of 20 means: if more than 20 tools available, use find_tools to shortlist -# DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=20 - - -# Enables the reflection step in the execution loop. +# --- Agent behavior --- +# Suite-specific hooks in cuga-agent (prompt tweaks, scoring adapters, etc.) 
+DYNACONF_ADVANCED_FEATURES__BENCHMARK=m3 +# CugaLite path for API/MCP tasks (M3 domains are tool-heavy but filtered per domain) +DYNACONF_ADVANCED_FEATURES__LITE_MODE=true +# accuracy vs balanced vs fast — M3 eval uses accurate for best tool-call fidelity +DYNACONF_FEATURES__CUGA_MODE=accurate +# Run without user prompts (required for batch eval) +DYNACONF_ADVANCED_FEATURES__FORCE_AUTONOMOUS_MODE=true +# One subtask per app/domain when decomposition is used +DYNACONF_ADVANCED_FEATURES__DECOMPOSITION_STRATEGY=exact +# Reflection after tool execution (extra model pass per step) DYNACONF_ADVANCED_FEATURES__REFLECTION_ENABLED=true -# Enables todo-list behavior in the agent (off for v1 / current eval plan). +# Todo-list planning in the agent loop (off for current M3 eval plan) DYNACONF_ADVANCED_FEATURES__ENABLE_TODOS=false -# Toggles context summarization (conversation/context compression). +# Compress long conversation history to stay within context limits DYNACONF_CONTEXT_SUMMARIZATION__ENABLED=true -# Forces autonomous mode so the run is non-interactive and suitable for benchmarks. -DYNACONF_ADVANCED_FEATURES__FORCE_AUTONOMOUS_MODE=true -# Selects benchmark-specific behavior for the named suite. -DYNACONF_ADVANCED_FEATURES__BENCHMARK=m3 -# Decomposition strategy for tasks that use decomposition. -DYNACONF_ADVANCED_FEATURES__DECOMPOSITION_STRATEGY=exact -# DYNACONF_ADVANCED_FEATURES__FORCE_LITE_MODE_APPS=["supervisor","gmail","file_system"] -# Apps forced into lite mode (JSON list string). -# Master switch for lite mode behavior. -DYNACONF_ADVANCED_FEATURES__LITE_MODE=true -# Threshold for shortlisting tools. -DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=1 -# Tool-count threshold for lite mode behavior. +# Max call_model + sandbox cycles per CugaLite task. +# Let M = longest correct tool-call count in the dataset GT (scan output/*.json +# gold_sequence.tool_call). 
Safe cap: max(28, (M + 3) * 4 + 4) — headroom for find_tools, +# isolated no-schema probes, reflection, and one retry. small_train.zip has M=3; we use M=4 +# in the formula as padding → (4+3)*4+4=32, rounded up to 35 (well below the 70 default). +DYNACONF_ADVANCED_FEATURES__CUGA_LITE_MAX_STEPS=35 +# Domains with fewer than this many tools stay on the lite path DYNACONF_ADVANCED_FEATURES__LITE_MODE_TOOL_THRESHOLD=500 -# Overall CUGA accuracy/behavior mode. -DYNACONF_FEATURES__CUGA_MODE=accurate -# Local sandbox flag (name as in config). +# OpenSandbox (Docker) code execution for generated Python DYNACONF_FEATURES__LOCAL_SANBDOX=true -# DYNACONF_POLICY__ENABLED=false -# Enables policy layer when true. -# DYNACONF_SERVER_PORTS__APIS_URL=9111 -# API server port (wiring, not a behavioral feature toggle). -# Vakra evaluator (benchmarks/m3/evaluator/) reads API_KEY for the LLM-as-judge. -# Mirror GROQ_API_KEY into API_KEY so judge.ChatModel.__init__ authenticates without -# requiring a separate shell export. -API_KEY=${GROQ_API_KEY} +# --- Tool discovery (find_tools) --- +# Seconds to wait for a single MCP tool call (large domains can be slow) +DYNACONF_ADVANCED_FEATURES__TOOL_CALL_TIMEOUT=120 +# Always use find_tools to shortlist when any tools are present (M3 domains are large) +DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=1 + +# --- Policies --- +# Policy layer on by default; compare.sh --no-policies sets this false per run +DYNACONF_POLICY__ENABLED=true + +# --- Evolve (trajectory memory) --- +# Evolve needs an MCP server at 127.0.0.1:8201; we do not run one during M3 eval +DYNACONF_EVOLVE__ENABLED=false -# Require live-MCP scoring (matches Vakra CLI semantics: replay predicted and -# ground-truth tool_calls against the capability container, inject canonical -# responses, then judge). Set to `auto` to fall back silently to offline when -# the container is unreachable, or `off` to never replay. 
Offline scoring is -# meaningfully different (ExactMatch always fails on agent-recorded payloads -# vs zip GT entries; Correctness/Groundedness see uncanonical responses) — so -# `on` keeps verdicts honest by failing loudly when the container is down. +# --- Vakra scoring / API keys --- +# Vakra judge reads API_KEY; mirror GROQ_API_KEY so no separate export is needed +API_KEY=${GROQ_API_KEY} +# Live MCP replay for honest ExactMatch/Correctness (fail loudly if container is down) M3_VAKRA_LIVE_MCP=on diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 3a19716..a734c82 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -92,6 +92,30 @@ from benchmarks.helpers.sdk_eval_helpers import add_policy_via_agent, clear_all_policies from benchmarks.m3.m3_data_loader import M3DataLoader, diff_tool_calls +# Injected into CugaLite's system prompt via SDK special_instructions (eval-only). +# Many M3 MCP tools lack a documented output/response schema (response_doc is empty). +# Without guidance the model assumes dict-shaped results and calls .get() on lists/strings. +M3_SPECIAL_INSTRUCTIONS = """ +## Undocumented tool outputs (M3 eval) + +When a tool in **Current Available Tools** has no **Response Schema** / output documentation: + +1. **First use — isolated probe:** Run the tool alone (Isolated Tools rule). End with a `print()` of a **compact shape summary**, not a full dump: + - Top-level type: `dict`, `list`, `str`, `int`, etc. + - If `dict`: key names (first ~10) and the type of each value at one level (e.g. `list`, `dict`, `str`). + - If `list`: length and the type of the first element (e.g. `list[dict]`, `list[str]`). + - Shallow shape is enough (`dict[str, object]`, `list[int]`, `dict[str, dict]`) — do not recurse deeply. + +2. **All follow-up code — handle defensively:** Never assume dict/list/key types from memory. + - Use `isinstance(result, dict)` before `.get()` or key access. 
+ - Use `isinstance(result, list)` before indexing or iteration. + - If APIs vary (bare list vs `{"items": [...]}`), normalize once then proceed, e.g.: + `rows = result if isinstance(result, list) else (result.get("items") if isinstance(result, dict) else [])` + - Do not call `.get()`, `[0]`, or attribute access on a value until its type is confirmed. + +Reporting shape in step 1 is for choosing correct access in step 2 — the goal is **crash-free Python**, not type narration for its own sake. +""".strip() + async def _load_m3_policies(agent: CugaAgent, policies_enabled: bool = True) -> None: """Load CUGA policies into the per-domain agent. @@ -1446,6 +1470,7 @@ def _dom_name(dc): evaluator.agent = CugaAgent( tool_provider=filtered_provider, # Only sees this domain's tools callbacks=callbacks, + special_instructions=M3_SPECIAL_INSTRUCTIONS, # Policies are loaded explicitly by _load_m3_policies below per # eval run. Disable .cuga auto-load and filesystem sync to keep # the per-domain agent's policy set deterministic — otherwise From ab3b4ef098f018bfd6944f8ca4c9f8a5254f5aed Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Tue, 2 Jun 2026 20:28:42 +0300 Subject: [PATCH 13/20] fix(m3): single Langfuse trace per task on Watsonx/Cuga path --- benchmarks/helpers/sdk_eval_helpers.py | 25 ++++++---- .../tests/test_invoke_agent_for_eval.py | 47 +++++++++++++++++++ benchmarks/m3/eval_m3.py | 23 +++++---- 3 files changed, 74 insertions(+), 21 deletions(-) create mode 100644 benchmarks/helpers/tests/test_invoke_agent_for_eval.py diff --git a/benchmarks/helpers/sdk_eval_helpers.py b/benchmarks/helpers/sdk_eval_helpers.py index c96e3e6..85f8741 100644 --- a/benchmarks/helpers/sdk_eval_helpers.py +++ b/benchmarks/helpers/sdk_eval_helpers.py @@ -238,19 +238,26 @@ async def _invoke_agent_for_eval( lf_config: Optional[dict[str, Any]] = None, ) -> Any: """Invoke CugaAgent (LangGraph config) or GenericReactAgent (per-LLM callbacks).""" - common = { - "messages": messages, + if 
isinstance(agent, GenericReactAgent): + kwargs: dict[str, Any] = { + "messages": messages, + "thread_id": thread_id, + "user_context": user_context, + "track_tool_calls": track_tool_calls, + } + if lf_config: + kwargs["invoke_callbacks"] = lf_config.get("callbacks") + return await agent.invoke(**kwargs) + + kwargs = { + "message": messages, "thread_id": thread_id, - "user_context": user_context, + "user_context": user_context or "", "track_tool_calls": track_tool_calls, } - if isinstance(agent, GenericReactAgent): - if lf_config: - common["invoke_callbacks"] = lf_config.get("callbacks") - return await agent.invoke(**common) if lf_config: - common["config"] = lf_config - return await agent.invoke(**common) + kwargs["config"] = lf_config + return await agent.invoke(**kwargs) def _react_steps_from_invoke_result(invoke_result: Any) -> Optional[int]: diff --git a/benchmarks/helpers/tests/test_invoke_agent_for_eval.py b/benchmarks/helpers/tests/test_invoke_agent_for_eval.py new file mode 100644 index 0000000..b92c6dc --- /dev/null +++ b/benchmarks/helpers/tests/test_invoke_agent_for_eval.py @@ -0,0 +1,47 @@ +"""_invoke_agent_for_eval uses the correct keyword per agent type.""" + +from unittest.mock import AsyncMock, MagicMock + +import pytest +from langchain_core.messages import HumanMessage + +from benchmarks.helpers.react_agent import GenericReactAgent +from benchmarks.helpers.sdk_eval_helpers import _invoke_agent_for_eval + +pytestmark = pytest.mark.unit + + +@pytest.mark.asyncio +async def test_invoke_agent_for_eval_cuga_uses_message_kwarg(): + agent = MagicMock() + agent.invoke = AsyncMock(return_value=MagicMock(answer="ok", tool_calls=[])) + + await _invoke_agent_for_eval( + agent, + [HumanMessage(content="hi")], + thread_id="t1", + lf_config={"callbacks": ["cb"]}, + ) + + agent.invoke.assert_awaited_once() + kwargs = agent.invoke.call_args.kwargs + assert "message" in kwargs + assert "messages" not in kwargs + assert kwargs["config"] == {"callbacks": ["cb"]} + + 
+@pytest.mark.asyncio +async def test_invoke_agent_for_eval_react_uses_messages_kwarg(): + agent = MagicMock(spec=GenericReactAgent) + agent.invoke = AsyncMock(return_value=MagicMock(answer="ok", tool_calls=[], react_steps=1)) + + await _invoke_agent_for_eval( + agent, + [HumanMessage(content="hi")], + thread_id="t1", + lf_config={"callbacks": ["cb"]}, + ) + + kwargs = agent.invoke.call_args.kwargs + assert "messages" in kwargs + assert kwargs["invoke_callbacks"] == ["cb"] diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index a734c82..0b8b9cb 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -89,7 +89,11 @@ save_evaluation_results, setup_langfuse, ) -from benchmarks.helpers.sdk_eval_helpers import add_policy_via_agent, clear_all_policies +from benchmarks.helpers.sdk_eval_helpers import ( + add_policy_via_agent, + clear_all_policies, + is_langfuse_tracing_enabled, +) from benchmarks.m3.m3_data_loader import M3DataLoader, diff_tool_calls # Injected into CugaLite's system prompt via SDK special_instructions (eval-only). @@ -1463,13 +1467,14 @@ def _dom_name(dc): if hasattr(filtered_provider, 'app_name'): logger.info(f" 🔒 Filtered to app: {filtered_provider.app_name}") - # Create agent with filtered provider + # Langfuse: per-task trace-scoped handlers are attached in + # evaluate_task_with_langfuse via build_langfuse_invoke_config. + # Do not pass an unscoped CallbackHandler on the agent — that creates + # orphan root traces per LLM call (especially visible on Watsonx). langfuse_handler = setup_langfuse() - callbacks = [langfuse_handler] if langfuse_handler else [] evaluator.agent = CugaAgent( tool_provider=filtered_provider, # Only sees this domain's tools - callbacks=callbacks, special_instructions=M3_SPECIAL_INSTRUCTIONS, # Policies are loaded explicitly by _load_m3_policies below per # eval run. 
Disable .cuga auto-load and filesystem sync to keep @@ -2576,14 +2581,8 @@ def _service_has_wanted_domain(svc_dict): ) services = filtered - # Initialize Langfuse (optional) - try: - from langfuse.callback import CallbackHandler - - CallbackHandler() - logger.info("Langfuse handler initialized") - except Exception as e: - logger.warning(f"Could not initialize Langfuse: {e}") + if is_langfuse_tracing_enabled(): + logger.info("Langfuse tracing enabled (per-task handlers via evaluate_task_with_langfuse)") # Collect task evaluation coroutines only for parallel/batched mode. # In sequential mode we await evaluate_single_task per service below From 041af8b4999afd105c001e42f3bcccb03f85620e Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Wed, 3 Jun 2026 15:03:40 +0300 Subject: [PATCH 14/20] fix(m3): gate Langfuse on settings and harden eval invoke fallbacks Use should_trace_langfuse_task() instead of setup_langfuse() on the evaluator; write trace_id whenever a per-task trace is created; route error-path invokes through _invoke_agent_for_eval; flush when tracing is enabled in settings. --- benchmarks/helpers/sdk_eval_helpers.py | 59 +++++++++++++++++--------- benchmarks/m3/eval_m3.py | 6 +-- 2 files changed, 41 insertions(+), 24 deletions(-) diff --git a/benchmarks/helpers/sdk_eval_helpers.py b/benchmarks/helpers/sdk_eval_helpers.py index 85f8741..dd6e605 100644 --- a/benchmarks/helpers/sdk_eval_helpers.py +++ b/benchmarks/helpers/sdk_eval_helpers.py @@ -337,6 +337,20 @@ def _langfuse_callback_handler_class(): _langfuse_nesting_warning_emitted = False +def should_trace_langfuse_task(langfuse_handler: Optional[Any] = None) -> bool: + """True when the harness should create one Langfuse trace per eval task. + + Uses per-invoke trace-scoped handlers (``build_langfuse_invoke_config``), not an + unscoped ``CallbackHandler`` on the agent. *langfuse_handler* is a legacy gate + (``True`` / any truthy value); ``False`` forces tracing off for this evaluator. 
+ """ + if langfuse_handler is False: + return False + if langfuse_handler: + return True + return is_langfuse_tracing_enabled() + + def is_langfuse_tracing_enabled() -> bool: """True when Langfuse tracing is enabled in cuga settings and the SDK is installed.""" try: @@ -842,7 +856,7 @@ async def evaluate_task_with_langfuse( _langfuse_metrics = None predefined_trace_id = None - if langfuse_handler: + if should_trace_langfuse_task(langfuse_handler): try: from langfuse import get_client @@ -916,7 +930,8 @@ async def evaluate_task_with_langfuse( except Exception as e: logger.warning(f"Failed to start Langfuse trace: {e}") - invoke_result = await agent.invoke( + invoke_result = await _invoke_agent_for_eval( + agent, [HumanMessage(content=intent)], thread_id=thread_id, user_context=user_context or "", @@ -1045,8 +1060,9 @@ async def evaluate_task_with_langfuse( if react_steps is not None: result["steps"] = react_steps - # Add Langfuse metrics if available - if langfuse_handler and _langfuse_metrics: + if predefined_trace_id: + result["trace_id"] = predefined_trace_id + if _langfuse_metrics: result["total_tokens"] = _langfuse_metrics.total_tokens result["total_llm_calls"] = _langfuse_metrics.total_llm_calls result["total_cost"] = _langfuse_metrics.total_cost @@ -1055,7 +1071,6 @@ async def evaluate_task_with_langfuse( result["generation_timings"] = _langfuse_metrics.generation_timings result["llm_call_details"] = _langfuse_metrics.llm_call_details result["node_timings"] = _langfuse_metrics.node_timings - result["trace_id"] = predefined_trace_id # Compute enhanced metrics if metrics_config is provided if metrics_config: @@ -1333,7 +1348,7 @@ async def evaluate_multiturn_task_with_langfuse( predefined_trace_id = None total_react_steps = 0 - if langfuse_handler: + if should_trace_langfuse_task(langfuse_handler): try: from langfuse import get_client @@ -1468,10 +1483,11 @@ async def evaluate_multiturn_task_with_langfuse( logger.info(f"\n[Turn {turn_idx}/{num_turns}] Query: 
{query}") logger.info(f"[Turn {turn_idx}] Using thread_id: {thread_id}") - invoke_result = await agent.invoke( + invoke_result = await _invoke_agent_for_eval( + agent, [HumanMessage(content=query)], thread_id=thread_id, - user_context=user_context, + user_context=user_context or "", track_tool_calls=track_tool_calls, ) total_react_steps = _accumulate_react_steps(total_react_steps, invoke_result) @@ -1603,8 +1619,9 @@ async def evaluate_multiturn_task_with_langfuse( "error": None, } - # Add Langfuse metrics if available - if langfuse_handler and _langfuse_metrics: + if predefined_trace_id: + result["trace_id"] = predefined_trace_id + if _langfuse_metrics: result["total_tokens"] = _langfuse_metrics.total_tokens result["total_llm_calls"] = _langfuse_metrics.total_llm_calls result["total_cost"] = _langfuse_metrics.total_cost @@ -1613,7 +1630,6 @@ async def evaluate_multiturn_task_with_langfuse( result["generation_timings"] = _langfuse_metrics.generation_timings result["llm_call_details"] = _langfuse_metrics.llm_call_details result["node_timings"] = _langfuse_metrics.node_timings - result["trace_id"] = predefined_trace_id if task_metadata: result.update(task_metadata) @@ -1823,21 +1839,22 @@ def print_evaluation_summary(results: List[Dict[str, Any]]): print(f"{status} {task_name:25s} ({difficulty:6s}) - {metrics_str}") -def flush_langfuse(langfuse_handler: Optional[Any]): +def flush_langfuse(langfuse_handler: Optional[Any] = None): """Flush Langfuse events in short-lived applications. Args: - langfuse_handler: Optional Langfuse handler + langfuse_handler: Legacy gate (any truthy value) or omit to flush when + tracing is enabled in settings. 
""" - if langfuse_handler: - try: - from langfuse import get_client + if not should_trace_langfuse_task(langfuse_handler): + return + try: + from langfuse import get_client - langfuse = get_client() - langfuse.flush() - logger.info("✅ Flushed Langfuse events") - except Exception as e: - logger.warning(f"Failed to flush Langfuse events: {e}") + get_client().flush() + logger.info("✅ Flushed Langfuse events") + except Exception as e: + logger.warning(f"Failed to flush Langfuse events: {e}") def save_evaluation_results( diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index 0b8b9cb..d0a9498 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -87,7 +87,7 @@ evaluate_task_with_langfuse, flush_langfuse, save_evaluation_results, - setup_langfuse, + should_trace_langfuse_task, ) from benchmarks.helpers.sdk_eval_helpers import ( add_policy_via_agent, @@ -1471,7 +1471,8 @@ def _dom_name(dc): # evaluate_task_with_langfuse via build_langfuse_invoke_config. # Do not pass an unscoped CallbackHandler on the agent — that creates # orphan root traces per LLM call (especially visible on Watsonx). - langfuse_handler = setup_langfuse() + # Gate only — per-task trace-scoped handlers are attached in invoke config. 
+ evaluator.langfuse_handler = should_trace_langfuse_task() evaluator.agent = CugaAgent( tool_provider=filtered_provider, # Only sees this domain's tools @@ -1484,7 +1485,6 @@ def _dom_name(dc): auto_load_policies=False, filesystem_sync=False, ) - evaluator.langfuse_handler = langfuse_handler logger.info(f"Agent created with filtered tool provider (domain: {domain})") # Load CUGA policies for this per-domain agent (mirrors benchmarks/bpo From f54f4a5247b8fa87ae35348912d0280a63c25f27 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Wed, 3 Jun 2026 18:55:15 +0300 Subject: [PATCH 15/20] fix(m3): export should_trace_langfuse_task from benchmarks.helpers --- benchmarks/helpers/__init__.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/benchmarks/helpers/__init__.py b/benchmarks/helpers/__init__.py index 938cd43..2552d61 100644 --- a/benchmarks/helpers/__init__.py +++ b/benchmarks/helpers/__init__.py @@ -10,6 +10,7 @@ "setup_agent_with_tools", "setup_react_agent_for_evaluation", "setup_langfuse", + "should_trace_langfuse_task", "clear_all_policies", "add_policy_via_agent", "check_keywords", @@ -28,6 +29,7 @@ "setup_agent_with_tools": ("sdk_eval_helpers", "setup_agent_with_tools"), "setup_react_agent_for_evaluation": ("sdk_eval_helpers", "setup_react_agent_for_evaluation"), "setup_langfuse": ("sdk_eval_helpers", "setup_langfuse"), + "should_trace_langfuse_task": ("sdk_eval_helpers", "should_trace_langfuse_task"), "clear_all_policies": ("sdk_eval_helpers", "clear_all_policies"), "add_policy_via_agent": ("sdk_eval_helpers", "add_policy_via_agent"), "check_keywords": ("sdk_eval_helpers", "check_keywords"), From 76767641c19e3dc4cc4adca1658ed1de4d126fe8 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Wed, 3 Jun 2026 18:58:35 +0300 Subject: [PATCH 16/20] fix(m3): wire --no-policies through compare and eval.sh Parse --no-policies in compare.sh (config label + eval.sh args). Pass EVAL_M3_EXTRA on all eval_m3 entrypoints. 
Disable policy engine via DYNACONF_POLICY__ENABLED=false when baselining. Export should_trace_langfuse_task from benchmarks.helpers. --- benchmarks/m3/compare.sh | 14 +++++++++++++- benchmarks/m3/config/m3.env | 2 +- benchmarks/m3/eval.sh | 11 ++++++++--- 3 files changed, 22 insertions(+), 5 deletions(-) diff --git a/benchmarks/m3/compare.sh b/benchmarks/m3/compare.sh index e4162ed..4757077 100755 --- a/benchmarks/m3/compare.sh +++ b/benchmarks/m3/compare.sh @@ -45,6 +45,7 @@ AGENT="${AGENT:-cuga}" AGENTS="${AGENTS:-}" COMPARE_AGENTS="${COMPARE_AGENTS:-false}" COMPARE_POLICIES="${COMPARE_POLICIES:-false}" +GLOBAL_NO_POLICIES="${GLOBAL_NO_POLICIES:-false}" NO_BUNDLE="${NO_BUNDLE:-false}" BUNDLE_ZIP="${BUNDLE_ZIP:-false}" FORWARDED_ARGS=() @@ -83,6 +84,10 @@ while [[ $idx -lt ${#ARGS[@]} ]]; do COMPARE_POLICIES=true idx=$((idx+1)) ;; + --no-policies) + GLOBAL_NO_POLICIES=true + idx=$((idx+1)) + ;; --no-bundle) NO_BUNDLE=true idx=$((idx+1)) @@ -122,6 +127,8 @@ for _m in "${MODEL_LIST[@]}"; do if [[ "$COMPARE_POLICIES" == "true" ]]; then CONFIGS+=("${_m}:${_a}:policies") CONFIGS+=("${_m}:${_a}:no-policies") + elif [[ "$GLOBAL_NO_POLICIES" == "true" ]]; then + CONFIGS+=("${_m}:${_a}:no-policies") else CONFIGS+=("${_m}:${_a}:policies") fi @@ -136,6 +143,11 @@ echo -e " Agents: ${CYAN:-}${AGENTS}${NC:-}" echo -e " Models: ${CYAN:-}${MODELS}${NC:-}" echo -e " Configurations: ${CYAN:-}${#CONFIGS[@]}${NC:-}" echo -e " Runs per config: ${CYAN:-}${RUNS}${NC:-}" +if [[ "$COMPARE_POLICIES" == "true" ]]; then + echo -e " Compare policies: ${CYAN:-}yes (policies vs no-policies)${NC:-}" +elif [[ "$GLOBAL_NO_POLICIES" == "true" ]]; then + echo -e " Policies: ${CYAN:-}disabled (--no-policies)${NC:-}" +fi echo "" if [[ "$DRY_RUN" == "true" ]]; then @@ -226,7 +238,7 @@ for config in "${CONFIGS[@]}"; do # Per-config extra args (e.g., --no-policies when comparing policy modes). 
config_extra_args=() - if [[ "$policy_mode" == "no-policies" ]]; then + if [[ "$policy_mode" == "no-policies" ]] || [[ "$GLOBAL_NO_POLICIES" == "true" ]]; then config_extra_args+=(--no-policies) fi diff --git a/benchmarks/m3/config/m3.env b/benchmarks/m3/config/m3.env index af20ae4..43b25d5 100644 --- a/benchmarks/m3/config/m3.env +++ b/benchmarks/m3/config/m3.env @@ -50,7 +50,7 @@ DYNACONF_ADVANCED_FEATURES__TOOL_CALL_TIMEOUT=120 DYNACONF_ADVANCED_FEATURES__SHORTLISTING_TOOL_THRESHOLD=1 # --- Policies --- -# Policy layer on by default; compare.sh --no-policies sets this false per run +# Policy layer on by default; eval.sh/compare.sh --no-policies export DYNACONF_POLICY__ENABLED=false DYNACONF_POLICY__ENABLED=true # --- Evolve (trajectory memory) --- diff --git a/benchmarks/m3/eval.sh b/benchmarks/m3/eval.sh index 32ede0e..f9cafde 100755 --- a/benchmarks/m3/eval.sh +++ b/benchmarks/m3/eval.sh @@ -211,6 +211,8 @@ if [ "$NO_GROUND_TRUTH" = "true" ]; then fi if [ "$NO_POLICIES" = "true" ]; then EVAL_M3_EXTRA+=(--no-policies) + export DYNACONF_POLICY__ENABLED=false + echo -e "${YELLOW:-}Policy engine disabled (--no-policies)${NC:-}" fi # Compile policy markdowns -> policies.json (unless policies are disabled). 
@@ -259,14 +261,14 @@ elif [ "$MULTITURN" = "true" ]; then echo -e "${RED:-}Error: M3 multi-turn evaluation is not available for the react agent${NC:-}" exit 1 else - uv run python -m benchmarks.m3.eval_m3_multiturn --from-config "$SCRIPT_DIR/config/m3_registry.yaml" "${PASSTHROUGH_ARGS[@]}" + uv run python -m benchmarks.m3.eval_m3_multiturn --from-config "$SCRIPT_DIR/config/m3_registry.yaml" "${EVAL_M3_EXTRA[@]}" "${PASSTHROUGH_ARGS[@]}" fi else echo -e "${YELLOW:-}Running single-turn evaluation with agent ${AGENT:-cuga}...${NC:-}" if [ "${AGENT:-cuga}" = "react" ]; then - uv run python -m benchmarks.m3.eval_m3_react --from-config "$SCRIPT_DIR/config/m3_registry.yaml" "${PASSTHROUGH_ARGS[@]}" + uv run python -m benchmarks.m3.eval_m3_react --from-config "$SCRIPT_DIR/config/m3_registry.yaml" "${EVAL_M3_EXTRA[@]}" "${PASSTHROUGH_ARGS[@]}" else - uv run python -m benchmarks.m3.eval_m3 --from-config "$SCRIPT_DIR/config/m3_registry.yaml" "${PASSTHROUGH_ARGS[@]}" + uv run python -m benchmarks.m3.eval_m3 --from-config "$SCRIPT_DIR/config/m3_registry.yaml" "${EVAL_M3_EXTRA[@]}" "${PASSTHROUGH_ARGS[@]}" fi fi @@ -302,6 +304,9 @@ if [ $EVAL_EXIT -eq 0 ]; then if [ -n "$MODEL_PROFILE" ]; then BUNDLE_ARGS+=(--model-profile "$MODEL_PROFILE") fi + if [ "$NO_POLICIES" = "true" ]; then + BUNDLE_ARGS+=(--no-policies) + fi if [ "${BUNDLE_ZIP:-false}" = "true" ]; then BUNDLE_ARGS+=(--zip) fi From 790518c2e4e7ba0cc4cfacf393c1bcf30e015563 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Sun, 7 Jun 2026 11:00:13 +0300 Subject: [PATCH 17/20] fix: address CodeRabbit review findings on PR #3 - check_no_task_prefix.py: LEGACY_RE now matches digit-containing domains (e.g. 
soccer_2016) instead of only [a-z_]+ - eval.sh / compare.sh: call finalize_model_config instead of apply_model_profile_if_set so --model-name / --openai-base-url CLI overrides are actually applied - P-PB-2 policy doc: fix mismatched backtick that broke code-span rendering in the "Wrong" example --- .../m3/policies/P-PB-2-one-composite-tool-no-corroboration.md | 2 +- scripts/check_no_task_prefix.py | 2 +- scripts/compare.sh | 4 ++-- scripts/eval.sh | 4 ++-- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md b/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md index 3ae54aa..18004ae 100644 --- a/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md +++ b/benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md @@ -72,7 +72,7 @@ This policy does **not** apply when: ## Examples - ✗ Question: *"What is the forks-to-stars percentage for solution 104086?"* - ✗ Wrong: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Then also call `get_repo_forks` and `get_repo_stars` to "double-check". Then report `0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`.* + ✗ Wrong: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Then also call `get_repo_forks` and `get_repo_stars` to "double-check". Then report 0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`.* ✓ Right: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Report: *"The forks-to-stars percentage for solution 104086 is 0.00% (source: `get_forks_to_stars_percentage`)."** - ✗ Question: *"Average net enrolment rate for Algeria 1975–1980?"* ✗ Wrong: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. 
Then also call `get_enrolment_rate(year=1975)`, …, `get_enrolment_rate(year=1980)` and average them yourself.* diff --git a/scripts/check_no_task_prefix.py b/scripts/check_no_task_prefix.py index 99019b2..7b801f3 100644 --- a/scripts/check_no_task_prefix.py +++ b/scripts/check_no_task_prefix.py @@ -22,7 +22,7 @@ import sys from pathlib import Path -LEGACY_RE = re.compile(r"^task_\d+_[a-z_]+_") +LEGACY_RE = re.compile(r"^task_\d+_[A-Za-z0-9_]+_") def _iter_tool_calls(obj): diff --git a/scripts/compare.sh b/scripts/compare.sh index bc70865..1712f17 100755 --- a/scripts/compare.sh +++ b/scripts/compare.sh @@ -88,8 +88,8 @@ cd "$PROJECT_ROOT" # Load environment source "$PROJECT_ROOT/benchmarks/helpers/load_env.sh" "$BENCHMARK" -# Apply model profile -apply_model_profile_if_set +# Apply model profile, then per-run CLI overrides +finalize_model_config # Check Langfuse env vars check_langfuse_env diff --git a/scripts/eval.sh b/scripts/eval.sh index 4e8e5a1..c973b03 100755 --- a/scripts/eval.sh +++ b/scripts/eval.sh @@ -79,8 +79,8 @@ cd "$PROJECT_ROOT" # Load environment source "$PROJECT_ROOT/benchmarks/helpers/load_env.sh" "$BENCHMARK" -# Apply model profile -apply_model_profile_if_set +# Apply model profile, then per-run CLI overrides +finalize_model_config # Check Langfuse env vars check_langfuse_env From 4049fb0aff8bd8a376a84486981557f809835bb6 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Sun, 7 Jun 2026 13:07:40 +0300 Subject: [PATCH 18/20] fix: address Sergey's and Offer's review findings on PR #3 - policies_md_to_json.py: rstrip() instead of rstrip("\n") so CRLF frontmatter delimiters parse correctly; write policies.json via tmp-file + os.replace so a crash mid-write can't leave a truncated file for _load_m3_policies to read - eval_m3.py: rename M3Evaluator.langfuse_handler -> langfuse_enabled (it stores a bool gate, not a handler; kwarg names passed to sdk_eval_helpers stay langfuse_handler since that's their declared legacy-gate parameter name) - 
m3_vakra_score.py: log when _match_live_name's length-based tie-breaks have multiple candidates, so non-deterministic scoring drift from live-tool-list ordering is debuggable - benchmarks/helpers/__init__.py: guard that every _LAZY_EXPORTS entry is declared in __all__ --- benchmarks/helpers/__init__.py | 3 +++ benchmarks/m3/eval_m3.py | 10 +++++----- benchmarks/m3/m3_vakra_score.py | 16 ++++++++++++++-- scripts/policies_md_to_json.py | 8 +++++--- 4 files changed, 27 insertions(+), 10 deletions(-) diff --git a/benchmarks/helpers/__init__.py b/benchmarks/helpers/__init__.py index 2552d61..974deba 100644 --- a/benchmarks/helpers/__init__.py +++ b/benchmarks/helpers/__init__.py @@ -46,6 +46,9 @@ "save_evaluation_results": ("sdk_eval_helpers", "save_evaluation_results"), } +if not set(_LAZY_EXPORTS).issubset(__all__): + raise AssertionError("every lazy export must be declared in __all__") + def __getattr__(name: str): if name in _LAZY_EXPORTS: diff --git a/benchmarks/m3/eval_m3.py b/benchmarks/m3/eval_m3.py index d0a9498..efc8cb9 100644 --- a/benchmarks/m3/eval_m3.py +++ b/benchmarks/m3/eval_m3.py @@ -685,7 +685,7 @@ def __init__( self.m3_task_id = m3_task_id self.domain = domain self.agent: Optional[CugaAgent] = None - self.langfuse_handler = None + self.langfuse_enabled = None self.results: List[Dict[str, Any]] = [] # Removed setup() method - now using registry mode only @@ -713,7 +713,7 @@ async def evaluate_task(self, task: Dict[str, Any], task_index: int) -> Dict[str agent=self.agent, task=task, task_index=task_index, - langfuse_handler=self.langfuse_handler, + langfuse_handler=self.langfuse_enabled, user_context=None, tracker_callback=tracker_callback, track_tool_calls=True, @@ -758,7 +758,7 @@ async def evaluate_multiturn_task(self, sample: Dict[str, Any], sample_index: in turns=turns, task_name=sample_id, task_index=sample_index, - langfuse_handler=self.langfuse_handler, + langfuse_handler=self.langfuse_enabled, user_context=None, 
tracker_callback=tracker_callback, track_tool_calls=True, @@ -1007,7 +1007,7 @@ async def evaluate_all( # `evaluate_single_task` (above), once each result has been tagged with # m3_task_id/domain so capability resolution works. Scoring inside this # method is a no-op for that path. - flush_langfuse(self.langfuse_handler) + flush_langfuse(self.langfuse_enabled) def print_summary(self): """Print evaluation summary (Vakra-only; legacy keyword/count reports removed).""" @@ -1472,7 +1472,7 @@ def _dom_name(dc): # Do not pass an unscoped CallbackHandler on the agent — that creates # orphan root traces per LLM call (especially visible on Watsonx). # Gate only — per-task trace-scoped handlers are attached in invoke config. - evaluator.langfuse_handler = should_trace_langfuse_task() + evaluator.langfuse_enabled = should_trace_langfuse_task() evaluator.agent = CugaAgent( tool_provider=filtered_provider, # Only sees this domain's tools diff --git a/benchmarks/m3/m3_vakra_score.py b/benchmarks/m3/m3_vakra_score.py index 14ca5bf..61680ef 100644 --- a/benchmarks/m3/m3_vakra_score.py +++ b/benchmarks/m3/m3_vakra_score.py @@ -160,13 +160,25 @@ def _match_live_name(name: str, live_tool_names: List[str]) -> Optional[str]: if forward_candidates: if len(forward_candidates) == 1: return forward_candidates[0] - return min(forward_candidates, key=len) + shortest = min(len(c) for c in forward_candidates) + ties = [c for c in forward_candidates if len(c) == shortest] + if len(ties) > 1: + logger.debug( + f"_match_live_name: forward-match length tie for {name!r}: {ties} — picking {ties[0]!r}" + ) + return ties[0] # Fall back to suffix matches (new path for bare-domain registry prefix): # longest = most specific live name reachable from the tail of the input. 
if suffix_candidates: if len(suffix_candidates) == 1: return suffix_candidates[0] - return max(suffix_candidates, key=len) + longest = max(len(c) for c in suffix_candidates) + ties = [c for c in suffix_candidates if len(c) == longest] + if len(ties) > 1: + logger.debug( + f"_match_live_name: suffix-match length tie for {name!r}: {ties} — picking {ties[0]!r}" + ) + return ties[0] return None diff --git a/scripts/policies_md_to_json.py b/scripts/policies_md_to_json.py index 8e0704f..e9211c1 100644 --- a/scripts/policies_md_to_json.py +++ b/scripts/policies_md_to_json.py @@ -53,11 +53,11 @@ def parse_frontmatter(text: str, src: Path) -> tuple[dict[str, Any], str]: raise ValueError(f"{src}: file must begin with a YAML frontmatter block delimited by '---'") # Find the closing '---' (must be on its own line, after the opening one) lines = text.splitlines(keepends=True) - if lines[0].rstrip("\n") != "---": + if lines[0].rstrip() != "---": raise ValueError(f"{src}: opening '---' must be on its own line") end_idx = None for i in range(1, len(lines)): - if lines[i].rstrip("\n") == "---": + if lines[i].rstrip() == "---": end_idx = i break if end_idx is None: @@ -127,7 +127,9 @@ def main(argv: list[str] | None = None) -> int: output_path = args.output or (args.policies_dir / "policies.json") policies = collect_policies(args.policies_dir) - output_path.write_text(json.dumps(policies, indent=2, ensure_ascii=False) + "\n") + tmp_path = output_path.with_suffix(output_path.suffix + ".tmp") + tmp_path.write_text(json.dumps(policies, indent=2, ensure_ascii=False) + "\n") + tmp_path.replace(output_path) print(f"wrote {len(policies)} policy/policies to {output_path}", file=sys.stderr) for p in policies: print(f" - {p['type']:18s} {p['id']}", file=sys.stderr) From eeea3919e62f54075e3339310314503acaa90004 Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Sun, 7 Jun 2026 14:17:27 +0300 Subject: [PATCH 19/20] fix(m3): propagate --capability filter to react agent's registry expansion 
eval_m3_react.py's _load_m3_registry_services called expand_registry_config without a capability_filter, expanding the entire registry (m3_task_2 and m3_task_3) and tripping the cross-task domain-name collision guard for shared bare-domain names (books, mondial_geo, soccer_2016). The cuga path in eval_m3.py already pre-filters by --capability before expansion; mirror that here so --compare-agents runs with --capability narrowing don't crash the react configuration. --- benchmarks/m3/eval_m3_react.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/benchmarks/m3/eval_m3_react.py b/benchmarks/m3/eval_m3_react.py index d3800a7..31eba72 100644 --- a/benchmarks/m3/eval_m3_react.py +++ b/benchmarks/m3/eval_m3_react.py @@ -202,7 +202,11 @@ def _load_m3_registry_services(self) -> List[Dict[str, Any]]: if not os.path.isfile(config_path): raise FileNotFoundError(f"Registry config not found: {config_path}") - expanded_path = expand_registry_config(config_path) + # Pre-filter source services by --capability so bare-domain expanded + # names (e.g. `books` from m3_task_2 vs `books` from m3_task_3) can't + # collide in the same expanded yaml — mirrors eval_m3.py's handling. + capability_filter = [self.capability] if self.capability else None + expanded_path = expand_registry_config(config_path, capability_filter=capability_filter) try: with open(expanded_path) as f: expanded = yaml.safe_load(f) or {} From acb003197fddd151ce3befbb44c3275210fabd5e Mon Sep 17 00:00:00 2001 From: Harold Ship Date: Sun, 7 Jun 2026 14:18:48 +0300 Subject: [PATCH 20/20] chore(m3): regenerate compiled policies.json from markdown sources The P-PB-2 markdown fix in 790518c (removing a stray opening backtick) wasn't reflected in the compiled policies.json. Regenerated via the eval harness's compile step so the artifact matches its source. 
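
The backtick mismatch fixed in 790518c is the kind of error a small lint over the policy markdown could flag before compilation. The sketch below is hypothetical and not part of this series: `lines_with_unbalanced_backticks` is an illustrative name, and the odd-count-per-line heuristic assumes inline code spans never wrap across lines.

```python
def lines_with_unbalanced_backticks(markdown: str) -> list[int]:
    """Return 1-based numbers of lines holding an odd number of backticks.

    Heuristic only: lines inside ``` fences are skipped, and a code span
    that legitimately wraps across two lines would be reported as well.
    """
    flagged: list[int] = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            # Fence delimiter: toggle state and do not count its backticks.
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        if line.count("`") % 2 == 1:
            flagged.append(number)
    return flagged


# The tail of the P-PB-2 "Wrong" example before and after 790518c.
before = "Then report `0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`."
after = "Then report 0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`."

print(lines_with_unbalanced_backticks(before))
print(lines_with_unbalanced_backticks(after))
```

Run against the pre-fix text, the check reports line 1; against the post-fix text it reports nothing. Wiring it into the compile step would be a separate decision, since a false positive on a wrapped code span would block compilation.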
--- benchmarks/m3/policies/policies.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmarks/m3/policies/policies.json b/benchmarks/m3/policies/policies.json index 6404ff1..67fc1d0 100644 --- a/benchmarks/m3/policies/policies.json +++ b/benchmarks/m3/policies/policies.json @@ -84,7 +84,7 @@ ] } ], - "markdown_content": "# P-PB-2 — One Composite Tool, No Corroboration\n\n## Policy\n\nWhen a single endpoint returns the composite metric the user asked for (percentage, ratio, proportion, share, aggregate), the assistant must:\n\n1. Call only that endpoint.\n2. Report the returned value (subject to [[output_formatter_single_tool_fact_citation]] for source attribution).\n\nThe assistant must **not** also call the raw component endpoints (the numerator and denominator tools) to re-derive or \"double-check\" the composite value.\n\n## Rationale\n\nThis policy enforces two related principles from analytical and dashboard reporting:\n\n1. **Source-of-truth discipline.** When the data system exposes a tool that returns the composite metric directly, that tool is the source of truth. Re-deriving the value from component tools introduces consistency risk (numerator and denominator may be computed over different time windows, populations, or filters than the composite tool uses) and produces an answer that is *less trustworthy*, not more.\n2. **Tool-call frugality.** Each extra tool call costs LLM tokens, latency, and (for paid APIs) money. When the answer is already in hand from the composite tool, additional calls add no value.\n\n## Required behaviour\n\nFor percentage / ratio / proportion / aggregate questions:\n\n1. **Identify the composite tool first** — the tool whose name and description directly match the requested metric (e.g., `get_forks_to_stars_percentage`, `get_conversion_rate`, `get_average_X`, `get_X_per_Y`).\n2. **Call only that tool** with the appropriate parameters.\n3. 
**Report the returned value** with the source citation required by P-OF-1.\n\nExplicitly forbidden:\n- Calling `get_repo_forks` and `get_repo_stars` separately, then dividing, **when** `get_forks_to_stars_percentage` exists.\n- Calling `get_total_X` and `get_count_X` separately to compute an average, **when** `get_average_X` exists.\n- Re-running the composite tool with the same arguments to \"verify\" the value.\n\n## Exceptions\n\nThis policy does **not** apply when:\n- No composite tool exists for the requested metric (then the assistant must compute it from components — that is the only path).\n- The user explicitly asks for the component values *as well as* the composite (\"give me the forks count, stars count, and forks-to-stars percentage\").\n- The composite tool returned a clearly invalid value (HTTP error, type-validation failure) — then the assistant may fall back to components and must say so.\n\n## Examples\n\n- ✗ Question: *\"What is the forks-to-stars percentage for solution 104086?\"*\n ✗ Wrong: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Then also call `get_repo_forks` and `get_repo_stars` to \"double-check\". Then report `0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`.*\n ✓ Right: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Report: *\"The forks-to-stars percentage for solution 104086 is 0.00% (source: `get_forks_to_stars_percentage`).\"**\n- ✗ Question: *\"Average net enrolment rate for Algeria 1975–1980?\"*\n ✗ Wrong: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. Then also call `get_enrolment_rate(year=1975)`, …, `get_enrolment_rate(year=1980)` and average them yourself.*\n ✓ Right: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. 
Report once with citation.*\n\n## Interaction with other policies\n\n- [[playbook_no_idempotent_retries]] forbids calling the same tool with the same arguments twice; this policy forbids calling **redundant** tools after a composite tool has already answered.\n- [[output_formatter_single_tool_fact_citation]] handles the source-citation requirement for the single composite tool's value.\n" + "markdown_content": "# P-PB-2 — One Composite Tool, No Corroboration\n\n## Policy\n\nWhen a single endpoint returns the composite metric the user asked for (percentage, ratio, proportion, share, aggregate), the assistant must:\n\n1. Call only that endpoint.\n2. Report the returned value (subject to [[output_formatter_single_tool_fact_citation]] for source attribution).\n\nThe assistant must **not** also call the raw component endpoints (the numerator and denominator tools) to re-derive or \"double-check\" the composite value.\n\n## Rationale\n\nThis policy enforces two related principles from analytical and dashboard reporting:\n\n1. **Source-of-truth discipline.** When the data system exposes a tool that returns the composite metric directly, that tool is the source of truth. Re-deriving the value from component tools introduces consistency risk (numerator and denominator may be computed over different time windows, populations, or filters than the composite tool uses) and produces an answer that is *less trustworthy*, not more.\n2. **Tool-call frugality.** Each extra tool call costs LLM tokens, latency, and (for paid APIs) money. When the answer is already in hand from the composite tool, additional calls add no value.\n\n## Required behaviour\n\nFor percentage / ratio / proportion / aggregate questions:\n\n1. **Identify the composite tool first** — the tool whose name and description directly match the requested metric (e.g., `get_forks_to_stars_percentage`, `get_conversion_rate`, `get_average_X`, `get_X_per_Y`).\n2. 
**Call only that tool** with the appropriate parameters.\n3. **Report the returned value** with the source citation required by P-OF-1.\n\nExplicitly forbidden:\n- Calling `get_repo_forks` and `get_repo_stars` separately, then dividing, **when** `get_forks_to_stars_percentage` exists.\n- Calling `get_total_X` and `get_count_X` separately to compute an average, **when** `get_average_X` exists.\n- Re-running the composite tool with the same arguments to \"verify\" the value.\n\n## Exceptions\n\nThis policy does **not** apply when:\n- No composite tool exists for the requested metric (then the assistant must compute it from components — that is the only path).\n- The user explicitly asks for the component values *as well as* the composite (\"give me the forks count, stars count, and forks-to-stars percentage\").\n- The composite tool returned a clearly invalid value (HTTP error, type-validation failure) — then the assistant may fall back to components and must say so.\n\n## Examples\n\n- ✗ Question: *\"What is the forks-to-stars percentage for solution 104086?\"*\n ✗ Wrong: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Then also call `get_repo_forks` and `get_repo_stars` to \"double-check\". Then report 0 forks / 1 star = 0.00%, confirmed by `get_forks_to_stars_percentage`.*\n ✓ Right: *Call `get_forks_to_stars_percentage(solution=104086)` → 0.00%. Report: *\"The forks-to-stars percentage for solution 104086 is 0.00% (source: `get_forks_to_stars_percentage`).\"**\n- ✗ Question: *\"Average net enrolment rate for Algeria 1975–1980?\"*\n ✗ Wrong: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. Then also call `get_enrolment_rate(year=1975)`, …, `get_enrolment_rate(year=1980)` and average them yourself.*\n ✓ Right: *Call `get_average_enrolment_rate(country=Algeria, start=1975, end=1980)` → 77.0. 
Report once with citation.*\n\n## Interaction with other policies\n\n- [[playbook_no_idempotent_retries]] forbids calling the same tool with the same arguments twice; this policy forbids calling **redundant** tools after a composite tool has already answered.\n- [[output_formatter_single_tool_fact_citation]] handles the source-citation requirement for the single composite tool's value.\n" }, { "id": "playbook_no_idempotent_retries",