fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate by sami-marreed · Pull Request #3 · cuga-project/cuga-eval

sami-marreed · 2026-05-26T09:34:58Z

Related issues (sub-issues of #37)

Track	Issue	PR scope
Policies	#38	M3 policy bundle, `--no-policies` / `--compare-policies`, compiler, bundle metadata
Tool calling	#39	(1) bare-domain registry names + Vakra `_match_live_name` compat + `check_no_task_prefix.py`; (2) `M3_SPECIAL_INSTRUCTIONS` for undocumented MCP outputs

Closes #38
Closes #39

Part of #37.

Summary

Harness-only changes so CUGA M3 (Vakra) scores are not artificially zeroed. Vakra judges, ground truth, MCP definitions, and the ReAct baseline are untouched.

This PR delivers three tracks:

Tool calling ([Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs) #39) — fix registry/Vakra tool-name mismatch and harden agent behavior on schema-less tool outputs.
Policies ([Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags) #38) — eight M3 policies behind opt-in compare flags (net effect on a 4-PF slice was mixed once prefix was fixed; see headline table below).
Harness reliability — registry port, multiturn UUID filter, eval.sh registry lifecycle, Langfuse trace wiring, bundle/report UX, and related fixes from the full branch diff.

Tool calling (#39)

1) Vakra tool-name prefix (`task_<n>_<domain>_`)

Area	Change
Registry expansion	Bare domain as MCP app name (e.g. `hockey_get_…` not `task_2_hockey_get_…`)
Agent filter	`FilteredToolProvider` uses `app_name=<domain>`
Scoring	`_match_live_name` suffix fallback for legacy result bundles
Guard	`scripts/check_no_task_prefix.py` — fail if any saved tool call still has legacy prefix

Files: benchmarks/m3/eval_m3.py (expand_registry_config, evaluate_single_task), benchmarks/m3/m3_data_loader.py, benchmarks/m3/m3_vakra_score.py, scripts/check_no_task_prefix.py

2) Undocumented tool outputs

Injected eval-only M3_SPECIAL_INSTRUCTIONS via SDK special_instructions: probe shape on first tool use, then defensive isinstance / normalization before .get() or indexing.

Files: benchmarks/m3/eval_m3.py (M3_SPECIAL_INSTRUCTIONS, passed into CugaAgent)

Policies (#38)

Item	Detail
Policies	P-OF-1, P-PB-1..4, P-TG-1..2 (+ P-OF-2 disabled in frontmatter)
Compiler	`scripts/policies_md_to_json.py` → `benchmarks/m3/policies/policies.json`
Loader	`_load_m3_policies()`; `auto_load_policies=False`, `filesystem_sync=False` on domain agents
CLI	`--no-policies`, `--compare-policies` on `benchmarks/m3/eval.sh`, `benchmarks/m3/compare.sh`
Bundles	Policy-mode suffix in `benchmarks/helpers/bundle.py`

4-PF × 5-runs × 2-configs (codebase_comments, 4 UUIDs) after prefix fix:

Stage	no-policies	policies
baseline (broken prefix)	0/10	disabled
+ tool-prefix removal only	5/10	4/10
+ remaining harness fixes	81.2%	50.0%

Policies are not declared net-positive on that slice; full 200-case re-run is follow-up work.

Additional harness fixes (same PR)

Fix	File(s)
Multi-UUID `--task` filter ignored in multiturn path	`benchmarks/m3/eval_m3.py` — `self.task_ids` vs singular `self.task_id`
Registry port hardcoded 8001	`eval_m3.py` — `REGISTRY_PORT` / `DYNACONF_SERVER_PORTS__REGISTRY`
Outer `eval.sh` registry collides with per-service registry	`benchmarks/m3/eval.sh` — default `SKIP_SERVER_START=true`
`--capability` / `--task` overwritten on second CLI token	`eval_m3.py` — `action="extend"`
Langfuse: one trace per task; gated on settings	`eval_m3.py`, `sdk_eval_helpers.py`, `should_trace_langfuse_task`
Report: capability / domain / task `#` columns	`benchmarks/helpers/compare_report.py`
Per-run bundle logs; sequential registry restart reliability	`benchmarks/m3/compare.sh`, `eval_m3.py`
One eval run → one result file + one trajectory run	`eval_m3.py`, `compare.sh`
Auto-sequence capabilities when `--m3-data` without `--capability`	`eval_m3.py`
Bundle assembly fixes	`benchmarks/helpers/bundle.py`, `scripts/create_eval_bundle.py`
Tests	`benchmarks/m3/tests/test_vakra_langfuse_scores.py`, `benchmarks/helpers/tests/test_invoke_agent_for_eval.py`
Analysis doc	`docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md`

Test results — full `small_train.zip` run (no policies)

Bundle: benchmarks/m3/evaluation_bundles/20260603_220001_default/
Branch @ run: fix/m3-harness-bugs (7676764) · policies_enabled: false

Summary (from `report.md`)

Metric	Value
Pass rate	67 / 200 (33.5%)
m3_task_2 (dashboard APIs)	45 / 100 (45.0%)
m3_task_3 (multihop)	22 / 100 (22.0%)
Total tokens	65,497,046
Total LLM calls	2,701
Total duration	14,076 s (~3.9 h)

Per-domain pass rate (Vakra)

m3_task_2 — 45/100

Domain	Pass
authors	6/10
books	5/10
codebase_comments	6/10
hockey	5/10
mondial_geo	4/10
movie_platform	4/10
professional_basketball	2/10
soccer_2016	2/10
student_loan	5/10
talkingdata	6/10

m3_task_3 — 22/100

Domain	Pass
beer_factory	1/10
books	3/10
college_completion	0/10
computer_student	1/10
disney	4/10
mondial_geo	2/10
soccer_2016	5/10
trains	0/10
university	2/10
world_development_indicators	4/10

Note: Several rows show -- for Steps in report.md (metrics not propagated into SDK results for this run). Tokens / LLM calls / duration are populated for most completed tasks.

Changed files (vs `main`)

31 files changed, 3720 insertions(+), 403 deletions(-)

Key paths: benchmarks/m3/eval_m3.py, eval.sh, compare.sh, m3_vakra_score.py, m3_data_loader.py, benchmarks/m3/policies/*, benchmarks/helpers/{bundle,compare_report,sdk_eval_helpers}.py, scripts/{check_no_task_prefix,policies_md_to_json,create_eval_bundle}.py, docs/m3-vakra-analysis-20260428/…, tests under benchmarks/m3/tests/ and benchmarks/helpers/tests/.

Verification

just ci / GitHub CI green on branch
scripts/check_no_task_prefix.py on latest result JSON
4-PF policy compare sweep (see table above)
Full 200-case small_train.zip eval → bundle above
Re-run 200-case with --compare-policies (follow-up)
Manual Ctrl-C / partial-save scenarios covered by PR fix(m3): bundle on interrupt and save partial results on crash #4

Out of scope

Nested-arg sandbox codegen (cuga-agent)
movie_platform / professional_basketball MCP health (infra)
Parallel compare.sh runs sharing benchmarks/m3/results/ (glob contamination)

References

Analysis: docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md (post-implementation section)

Summary by CodeRabbit

New Features
- Per-run compare bundles with nested trajectory/log grouping; new CLI to build reproducibility bundles.
- New model-profile options (vllm, local) and global --no-policies mode for M3 runs.
Improvements
- More robust registry/port management and deterministic startup/shutdown for multi-run M3 evaluations.
- Lazier and safer env loading (inline-comment handling), refined Langfuse tracing/flush behavior, per-task numbering in outputs, and richer Vakra scoring with optional Langfuse export.
Tests
- Added tests for agent invocation paths and Vakra→Langfuse score integration.
Documentation
- New and updated policy/playbook/ToolGuide docs and an extensive CUGA vs ReAct analysis.

The M3 harness in this repo had several mutually-reinforcing bugs that caused CUGA's M3 (vakra) pass rate to be near zero on PF cases (cases where ReAct passes and CUGA fails), independent of agent quality. This commit fixes the cluster: 1. Tool-name app-prefix breaks vakra's _match_live_name (groundedness=0 on otherwise-correct answers). Registry now uses bare-domain app names; m3_data_loader and m3_vakra_score gain backward-compat suffix- match paths for legacy bundles. Collision guard added because de-prefixing collapses cross-task duplicates (books, mondial_geo, soccer_2016 in tasks 2 and 3); expand_registry_config gains a capability_filter to pre-narrow. 2. M3Evaluator.evaluate_all multiturn branch checked self.task_id (singular, set only when N=1) instead of self.task_ids — so `--task <uuid> <uuid> <uuid> <uuid>` silently ran all ~46 samples in the capability. Switched to plural and lowercase-membership. 3. eval_m3.py:start_registry_server hardcoded port 8001 in three places. Now reads REGISTRY_PORT or DYNACONF_SERVER_PORTS__REGISTRY. 4. eval.sh started an outer registry that collided with eval_m3.py's per-service registry on the same port. Every code path through eval.sh uses --from-config and self-manages its registry, so the outer-start is dead weight; gated behind SKIP_SERVER_START=false now. 5. Per-domain CugaAgent constructors were re-loading policies from disk and writing back the conflict-resolver's culled set, so the .cuga folder count decreased monotonically across domains. Pass auto_load_policies=False, filesystem_sync=False; load policies once via a new _load_m3_policies helper. 6. --capability / --task with nargs="*" overwrote on a second invocation; switched both to action="extend", default=[]. UUID detection now strips non-UUID items before passing the filter down. 7. M3 policy bundle (P-OF-1, P-PB-1..4, P-TG-1..2) authored in benchmarks/m3/policies/ as markdown with YAML frontmatter; compiled to policies.json by scripts/policies_md_to_json.py. eval.sh and compare.sh grew --no-policies and --compare-policies flags mirroring bpo. DYNACONF_POLICY__ENABLED flipped to true in m3.env. Bundle dir naming annotates the policy mode. 8. scripts/check_no_task_prefix.py smoke-tests result files to ensure no tool call still carries the legacy task_<n>_<domain>_ prefix. Headline result on a 4-PF x 5-runs x 2-configs sweep: - baseline: 0/10 - + tool-prefix removal: 5/10 (no-policies) / 4/10 (policies) - + remaining fixes (4 PFs x 5 runs): 81.2% no-policies / 50.0% policies The policy bundle is net-negative on these 4 cases once the tool-prefix root cause is fixed; one task improves (75 -> 100%), one regresses (75 -> 0%). Re-running policies against the full 200-case M3 set is the next step. Full implementation notes appended to the analysis report under "Post-analysis: what we actually changed and what each change did". Out of scope (deferred): nested-arg sandbox codegen bug (cuga-agent), movie_platform/professional_basketball MCP-client health, and splitting results/ per compare.sh invocation to prevent parallel-eval contamination. Refs: #99

The MD→JSON compiler skips files with the .disabled suffix, so the policy was already inactive, but the frontmatter still said `enabled: true`. Misleading to a reader; CodeRabbit flagged the inconsistency in #100. The intent of the .disabled suffix here is "P-OF-2 conflicts with P-OF-1 in the natural-language conflict resolver and shouldn't be loaded for now." Flipping the frontmatter to `enabled: false` makes that state visible without relying on the suffix alone — and means removing the suffix later won't silently re-enable a policy whose trade-offs haven't been re-evaluated.

Add get_registry_port() backed by cuga settings so start_registry_server listens on the same port the agent uses via get_registry_base_url(). Sync REGISTRY_PORT and DYNACONF in eval.sh/compare.sh; stop hardcoding 8001 in eval_m3_react port cleanup. Co-authored-by: Cursor <cursoragent@cursor.com>

…bility Bare-domain registry names (from the vakra tool-name fix) collide when both m3_task_2 and m3_task_3 expand into one yaml. Run one capability at a time automatically so full small_train.zip evals work without --capability on CLI.

Strip inline env comments so CUGA mode parses correctly, lazy-load helpers so bundle creation works without importing the agent, and add a utility to assemble bundles from existing results after long eval runs. Co-authored-by: Cursor <cursoragent@cursor.com>

Inline trajectory lookup so the utility does not import the benchmarks package before subprocess calls. Co-authored-by: Cursor <cursoragent@cursor.com>

Load benchmark env via dotenv inside bundle.py and invoke it with -m from create_eval_bundle so direct script execution does not require sys.path hacks. Co-authored-by: Cursor <cursoragent@cursor.com>

Reload benchmark env in root conftest before cuga imports and add nosec annotations for fixed /tmp log paths in create_eval_bundle. Co-authored-by: Cursor <cursoragent@cursor.com>

The auto-capability-pass logic ran each capability (m3_task_2, m3_task_3) as a separate recursive run_config_mode and each pass saved its own result file. With --m3-data covering 2 capabilities, one eval.sh invocation emitted two 100-task files instead of one 200-task file, so compare_report counted each capability file as a separate "run" (inflated run count) and every run column only showed 100 tasks (the other capability's tasks rendered as "—"). - eval_m3.py: capability passes now run with defer_save=True and return their results; the parent aggregates them and writes ONE result file (+ ground-truth dump and summary) per eval.sh run via the new _finalize_and_save_results helper. Single-capability/--capability paths use the same helper. Also removes a duplicate _write_single_service_yaml definition. - compare.sh / bundle.py: trajectory folders (cuga writes one per domain) are now grouped per eval.sh run and merged into a single runN/trajectories dir, so one bundle run holds all 200 trajectories instead of one run per domain. The bundle CLI still accepts the legacy flat trajectory-dirs shape. Net effect: `compare.sh --runs 3 --compare-policies` produces 6 runs (3 policies + 3 no-policies), each covering all 200 tasks/trajectories. Co-authored-by: Cursor <cursoragent@cursor.com>

… port Sequential mode starts/stops a registry per domain on the same port. stop only waited on the `uv` wrapper, so the uvicorn worker could still hold the port when the next domain started, and start_registry_server hard-failed immediately on a busy port ("Port 18001 is already in use" mid-run, e.g. on talkingdata). - eval_m3.py: add _port_in_use/_wait_for_port_free/_kill_port_listeners helpers. start_registry_server now frees + waits for the port (up to 20s) instead of failing on first check; stop_registry_server waits for the port to actually be released (and force-kills stray listeners) before returning. - eval.sh: only start the "outer" registry for multiturn. Single-turn and --m3-data flows (cuga/react) manage their own per-service registries, so the outer registry (which compare.sh forces on via SKIP_SERVER_START=false on its first run) only collided on $REGISTRY_PORT. Fixes both direct eval and compare. Co-authored-by: Cursor <cursoragent@cursor.com>

…logs - compare_report: per-task comparison table now leads with Capability, Domain, and # (1-based input task number), sorted by (capability, domain, task#); non-M3 benchmarks keep the legacy table. - thread task_number from m3_data_loader through eval_m3/eval_m3_react into results so the report can render it. - bundle.assemble_compare_bundle accepts per-run grouped logs (runs/<config>_run<i>/logs) while staying backward-compatible with flat/shared logs. - m3/compare.sh snapshots each run's console + live registry_server.log (fixing the stale /tmp/m3_registry.log) and emits per-run grouped --log-files; appworld and other benchmarks keep shared logs. Co-authored-by: Cursor <cursoragent@cursor.com>

Reorganize m3.env with documented settings, disable Evolve, cap cuga_lite_max_steps at 35 (GT M=3 in small_train, formula uses M=4 padding), and inject special_instructions so CugaLite probes unknown tool shapes then accesses results with isinstance checks. Co-authored-by: Cursor <cursoragent@cursor.com>

Use should_trace_langfuse_task() instead of setup_langfuse() on the evaluator; write trace_id whenever a per-task trace is created; route error-path invokes through _invoke_agent_for_eval; flush when tracing is enabled in settings.

Parse --no-policies in compare.sh (config label + eval.sh args). Pass EVAL_M3_EXTRA on all eval_m3 entrypoints. Disable policy engine via DYNACONF_POLICY__ENABLED=false when baselining. Export should_trace_langfuse_task from benchmarks.helpers.

haroldship · 2026-06-04T09:31:35Z

Ready for review.

PR description updated (policies → [Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags) #38, tool calling → [Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs) #39, both sub-issues of [Feature]: Improve evaluation harness to improve CUGA score on Vakra (m3) #37).
Full 200-case small_train.zip results from bundle 20260603_220001_default included in the description.
CI: all checks passing on latest branch push.

Please focus review on harness-only scope (no Vakra judge / MCP / React changes).

coderabbitai · 2026-06-04T13:08:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fbe6965c-4f56-44fd-a30e-970964fa84f7

📥 Commits

Reviewing files that changed from the base of the PR and between 4049fb0 and acb0031.

📒 Files selected for processing (2)

benchmarks/m3/eval_m3_react.py
benchmarks/m3/policies/policies.json

🚧 Files skipped from review as they are similar to previous changes (2)

benchmarks/m3/policies/policies.json
benchmarks/m3/eval_m3_react.py

📝 Walkthrough

Walkthrough

Adds M3 policy markdowns and a compiler, centralizes Langfuse tracing/invocation, normalizes registry naming/port handling, propagates task_number, stages per-run trajectory/log groups for compare bundles, and extends Vakra scoring with judge extraction and Langfuse pushes.

Changes

M3 Evaluation Harness and Vakra Scoring

Layer / File(s)	Summary
Environment loading & model configuration `conftest.py`, `benchmarks/helpers/load_env.sh`, `benchmarks/helpers/common.sh`, `scripts/model_profiles.sh`	Reloads committed env files for tests; strips inline `#` from unquoted env values; captures CLI model overrides and finalizes model profile + CLI overrides; adds `vllm`/`local` profiles.
Langfuse tracing gate & agent invocation refactor `benchmarks/helpers/sdk_eval_helpers.py`, `benchmarks/helpers/tests/test_invoke_agent_for_eval.py`, `benchmarks/helpers/__init__.py`	Adds `should_trace_langfuse_task()` gating; refactors `_invoke_agent_for_eval()` to shape kwargs per agent type and map lf callbacks/config; makes helper exports lazy and adds invocation unit tests.
Policy markdowns & compiler `benchmarks/m3/policies/*.md`, `scripts/policies_md_to_json.py`, `benchmarks/m3/policies/policies.json`	Adds policy/playbook/tool-guide markdowns and `policies_md_to_json.py` to parse frontmatter, validate fields, map types to JSON body fields, detect duplicate IDs, and write `policies.json`.
M3 shell runners: policy modes & dynamic ports `benchmarks/m3/eval.sh`, `benchmarks/m3/compare.sh`, `benchmarks/m3/config/m3.env`	Adds `--no-policies` and `--compare-policies`, derives/export `REGISTRY_PORT` from dynaconf, stages per-run trajectories/logs, and encodes policy_mode into compare bundle inputs.
Bundle assembly & packaging `benchmarks/helpers/bundle.py`, `scripts/create_eval_bundle.py`	Adds `_load_benchmark_env`, accepts nested per-run `trajectory_dirs` and grouped log inputs, copies per-config/per-run trajectories and logs into bundle `runs/<label>_runN/{trajectories,logs}`, and provides a CLI to assemble bundles from latest artifacts.
Task numbering & compare report layout `benchmarks/m3/m3_data_loader.py`, `benchmarks/helpers/compare_report.py`, `benchmarks/m3/eval_m3_react.py`, `benchmarks/m3/eval_m3.py`	Sets 1-based `task_number` in loader; propagates into ReAct/multiturn results; compare report detects M3 metadata and prepends Capability/Domain/# columns and sorts by (m3_task_id,domain,task_number,label).
eval_m3 orchestration & registry lifecycle `benchmarks/m3/eval_m3.py`, `benchmarks/m3/eval_m3_react.py`	Adds `M3_SPECIAL_INSTRUCTIONS`, per-domain `_load_m3_policies`, `--no-policies` handling, capability multi-pass sequencing with deferred finalization, expands registry config to bare-domain app names, adds registry port helpers, and hardens startup/shutdown waits.
Vakra scoring: judge extraction & Langfuse pushes `benchmarks/m3/m3_vakra_score.py`, `benchmarks/m3/tests/test_vakra_langfuse_scores.py`	Adds per-judge extraction and compact formatting helpers, `log_vakra_task_scores()` and `push_vakra_scores_to_langfuse()` (best-effort, failures logged), augments console summary, and adds tests validating Langfuse interactions and logging.
Validation & helper scripts `scripts/check_no_task_prefix.py`, `scripts/policies_md_to_json.py`	Adds a smoke-check for legacy `task_<n>_<domain>_` prefixes in result tool names and a policy markdown-to-JSON compiler CLI.
Analysis & docs `docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md`	Adds comprehensive CUGA vs ReAct Vakra analysis with failure clusters, remediation levers, and a post-change reproduction summary.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/m3-harness-bugs

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

benchmarks/m3/m3_vakra_score.py (1)

102-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Legacy prefix stripping misses underscore domains.

The regex only strips task_<n>_<domain>_ when <domain> is a single alnum token. Legacy names like task_2_codebase_comments_... won’t be stripped, which can break the legacy-bundle compatibility path.

Suggested fix

-        _REGISTRY_PREFIX_RE = _re.compile(r"^task_\d+_[A-Za-z0-9]+_(.+)$")
+        # Domain boundaries are ambiguous because both domain and operation ids
+        # can contain underscores; strip only the deterministic task prefix.
+        _REGISTRY_PREFIX_RE = _re.compile(r"^task_\d+_(.+)$")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/m3_vakra_score.py` around lines 102 - 104, The current regex
_REGISTRY_PREFIX_RE only allows a single alphanumeric token for the domain and
thus fails to strip legacy names with underscores (e.g.
task_2_codebase_comments_...). Update the pattern used to compile
_REGISTRY_PREFIX_RE to accept domains containing underscores (for example
include "_" inside the character class), then keep the same match/group logic
(use m.group(1) if m else name) so names like task_2_codebase_comments_...
correctly return the suffix; ensure you only modify the regex expression and not
the subsequent matching logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/helpers/common.sh`:
- Around line 233-256: parse_common_args reads CLI overrides
(--model-name/--openai-base-url) but some call sites call
apply_model_profile_if_set directly which skips the CLI overrides; update those
call sites to call finalize_model_config (which runs apply_model_profile_if_set
then apply_model_cli_overrides_if_set) instead of apply_model_profile_if_set,
and ensure no early returns in apply_model_profile_if_set prevent later CLI
overrides (leave finalize_model_config's sequence as apply_model_profile_if_set
|| return 1; apply_model_cli_overrides_if_set). Also confirm parse_common_args
sets CLI_MODEL_NAME/CLI_OPENAI_BASE_URL so apply_model_cli_overrides_if_set can
export them to MODEL_NAME/OPENAI_BASE_URL.

In `@benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md`:
- Around line 75-76: Inline code spans contain extra spaces inside backticks;
remove the inner padding so the symbols render as
`get_forks_to_stars_percentage`, `get_repo_forks`, `get_repo_stars` and the
inline fragment "0 forks / 1 star = 0.00%, confirmed by
`get_forks_to_stars_percentage`" use backticks without leading/trailing spaces
(e.g. change ` get_forks_to_stars_percentage ` to
`get_forks_to_stars_percentage`). Update those inline code spans in the markdown
so there are no spaces immediately inside the backticks.

In `@scripts/check_no_task_prefix.py`:
- Line 25: The LEGACY_RE pattern only allows lowercase letters and underscores
in the domain segment, causing false negatives for domain names with digits;
update the regex used in LEGACY_RE (and the other occurrence around lines 90-91)
to permit digits as well (e.g., change the domain class from [a-z_]+ to
[a-z0-9_]+ or \w+ depending on desired constraints) so names like
"task_3_soccer_2016_..." are matched; ensure you update both occurrences (the
LEGACY_RE definition and the similar pattern at 90-91) and run tests/linters
after the change.

---

Outside diff comments:
In `@benchmarks/m3/m3_vakra_score.py`:
- Around line 102-104: The current regex _REGISTRY_PREFIX_RE only allows a
single alphanumeric token for the domain and thus fails to strip legacy names
with underscores (e.g. task_2_codebase_comments_...). Update the pattern used to
compile _REGISTRY_PREFIX_RE to accept domains containing underscores (for
example include "_" inside the character class), then keep the same match/group
logic (use m.group(1) if m else name) so names like task_2_codebase_comments_...
correctly return the suffix; ensure you only modify the regex expression and not
the subsequent matching logic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a1ddb713-f58a-47f2-ac1b-66052ff6a080

📥 Commits

Reviewing files that changed from the base of the PR and between 167291d and 7676764.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock, !uv.lock

📒 Files selected for processing (30)

benchmarks/helpers/__init__.py
benchmarks/helpers/bundle.py
benchmarks/helpers/common.sh
benchmarks/helpers/compare_report.py
benchmarks/helpers/load_env.sh
benchmarks/helpers/sdk_eval_helpers.py
benchmarks/helpers/tests/test_invoke_agent_for_eval.py
benchmarks/m3/compare.sh
benchmarks/m3/config/m3.env
benchmarks/m3/eval.sh
benchmarks/m3/eval_m3.py
benchmarks/m3/eval_m3_react.py
benchmarks/m3/m3_data_loader.py
benchmarks/m3/m3_vakra_score.py
benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md
benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled
benchmarks/m3/policies/P-PB-1-no-enumeration.md
benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md
benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md
benchmarks/m3/policies/P-PB-4-validation-error-recovery.md
benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md
benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md
benchmarks/m3/policies/policies.json
benchmarks/m3/tests/test_vakra_langfuse_scores.py
conftest.py
docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
scripts/check_no_task_prefix.py
scripts/create_eval_bundle.py
scripts/model_profiles.sh
scripts/policies_md_to_json.py

👮 Files not reviewed due to content moderation or server errors (8)

benchmarks/m3/policies/policies.json
benchmarks/m3/config/m3.env
benchmarks/m3/eval.sh
benchmarks/m3/compare.sh
benchmarks/helpers/bundle.py
scripts/create_eval_bundle.py
benchmarks/m3/m3_data_loader.py
benchmarks/m3/eval_m3.py

- check_no_task_prefix.py: LEGACY_RE now matches digit-containing domains (e.g. soccer_2016) instead of only [a-z_]+ - eval.sh / compare.sh: call finalize_model_config instead of apply_model_profile_if_set so --model-name / --openai-base-url CLI overrides are actually applied - P-PB-2 policy doc: fix mismatched backtick that broke code-span rendering in the "Wrong" example

haroldship · 2026-06-07T08:00:49Z

Reply to CodeRabbit's "Legacy prefix stripping misses underscore domains" finding on benchmarks/m3/m3_vakra_score.py:102-104 (outside-diff, couldn't reply inline):

Pushing back on this one — the current regex is intentional, not a bug. The docstring on _strip_registry_prefix explains why a single-token domain class is used: "the domain can itself contain underscores (codebase_comments, world_development_indicators…), so a plain regex can't say where the prefix ends. We leave the domain-only prefix to be resolved by _match_live_name (suffix match against the live MCP tool list)."

Traced through _match_live_name for task_2_codebase_comments_get_books:

_strip_registry_prefix doesn't match (multi-underscore domain), returns the name unchanged
the suffix-match branch (name.endswith("_" + ln)) then finds get_books in the live tool list and resolves correctly

So multi-underscore domains are already handled — just via the suffix matcher rather than the prefix regex, exactly as the docstring describes. Adopting the suggested ^task_\d+_(.+)$ would actually regress single-underscore domains (e.g. hockey): today they get stripped and matched directly; with the broader regex they'd capture hockey_get_books (domain still attached), miss the direct lookup, and fall through to suffix matching too — extra work for a case that already worked cleanly. Leaving as-is.

(Findings on common.sh/eval.sh/compare.sh, check_no_task_prefix.py, and the P-PB-2 markdown spacing were valid and are fixed in 790518c — replied inline on each.)

Sergey-Zeltyn · 2026-06-07T08:48:11Z

Likely bugs

scripts/policies_md_to_json.py:56 -lines[0].rstrip("\n") != "---" leaves \r on CRLF files; a policy authored on Windows silently fails to compile. Use bare rstrip().
scripts/policies_md_to_json.py:130 — output_path.write_text(...) is non-atomic. If compilation crashes mid-write, _load_m3_policies() then reads a truncated policies.json. Write to .tmp + os.replace.

Sergey-Zeltyn · 2026-06-07T08:54:10Z

Concurrency

eval_m3.py — between _kill_port_listeners (SIGKILL) and the next uvicorn bind, TIME_WAIT can hold the port up to 60s without SO_REUSEADDR; _wait_for_port_free's 20s timeout can lose the race during sequential per-domain restarts. Add --reuse-port / SO_REUSEADDR.
eval_m3.py - auto capability-pass loop — combined_results.extend(pass_results) accumulates partial pass results when a later pass raises. The saved file silently mixes complete and partial passes. Tag partial passes or fail the bundle.

Sergey-Zeltyn · 2026-06-07T08:56:16Z

Cross-test leakage
conftest.py — _reload_benchmark_env() calls load_dotenv(env_file, override=True) over every benchmarks/*/config/*.env, in sorted order. Later-sorted files clobber earlier ones in os.environ. The docstring says this is a workaround for inline-comment values left behind by shell sourcing, but the blast radius is global: an appworld test now sees whatever m3.env defines last. Consider scoping per-test (autouse fixture with monkeypatch) or only setting unset keys.

Sergey-Zeltyn · 2026-06-07T09:04:46Z

Maintainability / naming

eval_m3.py:1475 — evaluator.langfuse_handler = should_trace_langfuse_task() stores a bool in a slot named langfuse_handler. It works today only because every consumer treats it as a truthy gate. Either rename the field (langfuse_enabled) or store the actual handler.
benchmarks/helpers/init.py — __all__ and _LAZY_EXPORTS are hand-maintained twin lists. They're in sync right now, but nothing enforces it. One-line guard: assert set(_LAZY_EXPORTS).issubset(__all__).
benchmarks/helpers/bundle.py:46-55 — _load_benchmark_env uses override=True. If bundle.py is ever invoked from a parent shell that explicitly set MODEL_NAME via CLI, the benchmark default wins. Confirm intent; otherwise override=False.
m3_vakra_score.py - _match_live_name suffix fallback — max(candidates, key=len) is non-deterministic when two live tools share the longest suffix (e.g. get_X vs get_extra_X). Log when len-ties occur so scoring drift is debuggable.
eval_m3.py — argparse(action="extend", default=[]) shares a mutable default list. Use default=None with args.task or [].
eval_m3.py — M3_SPECIAL_INSTRUCTIONS (~1.5 KB) injected unconditionally into every per-domain agent prompt. Token waste on domains that don't return schema-less outputs.

Sergey-Zeltyn · 2026-06-07T09:06:24Z

benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled — file already has enabled: false in frontmatter; the .md.disabled extension is a second mechanism that hides the file from the *.md glob in the compiler. Pick one. The current setup means anyone reading the policies dir will assume the file is "off but loaded" when it's actually invisible.

Sergey-Zeltyn

See comments above.

No remaining issue is a merge blocker. The two correctness items worth fixing in this PR are the CRLF frontmatter and non-atomic policies.json write; the rest can land as follow-ups.

offerakrabi · 2026-06-07T09:23:07Z

-            run_label = f"{config_key.replace(':', '_')}"
-            _copy_logs(bundle_dir, lf_list, dest_subdir=f"runs/{run_label}/logs")
+        for config_key, lf_val in log_files.items():
+            if lf_val and isinstance(lf_val[0], list):


The code tries to detect whether log files are organized as "one list per run" or "one flat list" by checking if the first element is itself a list. If someone accidentally passes an empty nested structure like [[]], the code won't crash immediately but might behave unexpectedly—it'll try to copy logs from an empty group, which could silently fail or do nothing.

Checked this — [[]] doesn't actually misbehave: _copy_logs early-returns False on a falsy/empty log_files list (line 250: if not log_files: return False), so the inner empty group is just a no-op — no silent partial copy, no crash. The shape-detection branch correctly routes it to the grouped path and then does nothing for the empty group. Leaving as-is; the existing guard already covers this.

offerakrabi · 2026-06-07T09:24:52Z

+    data_dir = PROJECT_ROOT / "benchmarks" / benchmark / "data"
+    if benchmark == "m3":
+        if result_file.name.startswith("multiturn_"):
+            candidate = data_dir / "olympics_mutliturn.json"


there's a typo here mutliturn instead of multiturn

actually looking again i see the error is in the file name itself, your choice whether you would like to fix this..not very important

Right — that's not a typo introduced here, it's the actual filename of the existing data file (benchmarks/m3/data/olympics_mutliturn.json, pre-existing in the repo). Renaming the reference without renaming the data file would break loading. Renaming the data file itself is a separate, slightly-bigger cleanup (touches whatever else references it) — leaving it out of this PR's scope, agreed it's not important.

- policies_md_to_json.py: rstrip() instead of rstrip("\n") so CRLF frontmatter delimiters parse correctly; write policies.json via tmp-file + os.replace so a crash mid-write can't leave a truncated file for _load_m3_policies to read - eval_m3.py: rename M3Evaluator.langfuse_handler -> langfuse_enabled (it stores a bool gate, not a handler; kwarg names passed to sdk_eval_helpers stay langfuse_handler since that's their declared legacy-gate parameter name) - m3_vakra_score.py: log when _match_live_name's length-based tie-breaks have multiple candidates, so non-deterministic scoring drift from live-tool-list ordering is debuggable - benchmarks/helpers/__init__.py: guard that every _LAZY_EXPORTS entry is declared in __all__

haroldship · 2026-06-07T10:09:23Z

@Sergey-Zeltyn — addressed the items below in 4049fb0 (and 790518c for the CodeRabbit batch). Replying to each of your threads:

Likely bugs (both fixed, as you flagged as the priority pair):

policies_md_to_json.py:56 — confirmed: rstrip("\n") leaves \r on CRLF files. Changed both occurrences to bare rstrip().
policies_md_to_json.py:130 — confirmed non-atomic write. Now writes to <output>.tmp then Path.replace() (atomic rename) so a mid-write crash can't leave _load_m3_policies reading a truncated policies.json.

Maintainability / naming:

eval_m3.py:1475 — agreed, langfuse_handler stored a bool. Renamed the M3Evaluator attribute to langfuse_enabled (kept the langfuse_handler= kwarg names when calling into sdk_eval_helpers — that's the declared parameter name there, documented as a "legacy gate").
benchmarks/helpers/__init__.py — added the one-line guard you suggested (_LAZY_EXPORTS ⊆ __all__), raising at import time rather than via assert (ruff S101 flags bare assert in non-test code, since -O strips it).
m3_vakra_score.py _match_live_name — added debug logging when a length-based tie has multiple candidates, so scoring drift from live-tool-list ordering is traceable.

Pushing back / leaving as follow-up (matches your own triage that these aren't blockers):

eval_m3.py argparse(..., action="extend", default=[]) — checked this empirically; modern argparse's _ExtendAction copies the default list before extending (_copy_items), so the [] default is never mutated in place. Verified with a quick repro — two parse_args() calls get distinct list objects. Not a bug.
P-OF-2-strip-hedging.md.disabled — careful here: the .md.disabled extension is the only mechanism with teeth (the compiler's *.md glob skips it entirely; _load_m3_policies doesn't check the enabled field at all). So renaming to .md to "pick the frontmatter mechanism" would actually flip it from disabled to loaded — the opposite of intent. The real fix would be teaching the compiler/loader to respect enabled: false and dropping the filename convention, which is a slightly bigger change I'd rather not bundle into this PR's scope. Left as-is for now; will file a follow-up issue for it.
TIME_WAIT/SO_REUSEADDR, partial-pass accumulation in combined_results, conftest.py cross-suite env clobbering, bundle.py override=True, and M3_SPECIAL_INSTRUCTIONS token cost — all valid observations, agreed they're not blockers per your note. Tracking as follow-ups rather than scope-creeping this PR further.

The conftest.py overlapping-keys point in particular (MCP_SERVERS_FILE/CUGA_LOGGING_DIR/DYNACONF_POLICY__ENABLED clobbering across appworld/bpo/m3/oak_health_insurance envs) is worth its own issue — will file one.

…nsion eval_m3_react.py's _load_m3_registry_services called expand_registry_config without a capability_filter, expanding the entire registry (m3_task_2 and m3_task_3) and tripping the cross-task domain-name collision guard for shared bare-domain names (books, mondial_geo, soccer_2016). The cuga path in eval_m3.py already pre-filters by --capability before expansion; mirror that here so --compare-agents runs with --capability narrowing don't crash the react configuration.

The P-PB-2 markdown fix in 790518c (closing an unmatched backtick) wasn't reflected in the compiled policies.json. Regenerated via the eval harness's compile step so the artifact matches its source.

haroldship · 2026-06-07T11:19:36Z

M3 eval/compare validation (small_train.zip, m3_task_2/hockey, max-samples=2)

Ran the README's documented recipes against this branch to validate the harness fixes end-to-end.

Eval (`cuga`, `--max-samples 2`)

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

Completed cleanly (exit 0, no exceptions). Pass 1/2 (50%) — both failures are model-accuracy judging outcomes (wrong answer / groundedness), not harness errors.

Compare (`--compare-agents --runs 2`)

caffeinate -i bash benchmarks/m3/compare.sh --compare-agents --runs 2 \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

First attempt surfaced a real bug: the react agent configuration crashed both of its runs with

RuntimeError: Service-name collision in expanded registry config: ['books', 'mondial_geo', 'soccer_2016']

eval_m3_react.py's _load_m3_registry_services was calling expand_registry_config() without a capability_filter, so it expanded the entire registry (m3_task_2 + m3_task_3) instead of narrowing to the requested --capability, tripping the cross-task domain-name collision guard. The cuga path in eval_m3.py already pre-filters by --capability before expansion — eval_m3_react.py didn't mirror that.

Fixed in eeea391 (propagate capability_filter=[self.capability] to expand_registry_config, mirroring eval_m3.py). Re-ran the full compare after the fix:

Configuration                Runs  Pass Rate  pass@2  pass^2  maj@2  Cons   Tokens     LLM   Time
cuga — policies (GPT-OSS-120B)    2   75.0%   100.0%   50.0%   50.0%  1.00  291,569.0  16.5  31.8s
react — policies (GPT-OSS-120B)   2   50.0%   100.0%    0.0%    0.0%    --   71,776.0   5.5   6.9s

0 failed out of 4 runs, no collisions, no exceptions, exit 0.

Also verified:

Without --compare-agents/--agent, eval.sh runs cuga only — ${AGENT:-cuga} defaults to cuga and the react branch is gated on [ "${AGENT:-cuga}" = "react" ]. No AGENT= override exists in any .env/global.env/m3.env. So agent-comparison stays opt-in as intended, not a default.
Also regenerated policies.json (in acb0031) — the P-PB-2 markdown backtick fix from 790518c hadn't been recompiled into the artifact.

No errors or failures remain. Merging.

haroldship and others added 19 commits May 26, 2026 12:30

Merge branch 'main' into fix/m3-harness-bugs

455315e

Fix create_eval_bundle import error when run as a script.

de8ddd4

Inline trajectory lookup so the utility does not import the benchmarks package before subprocess calls. Co-authored-by: Cursor <cursoragent@cursor.com>

Fix bundle CLI when invoked outside the benchmarks package.

4947947

Load benchmark env via dotenv inside bundle.py and invoke it with -m from create_eval_bundle so direct script execution does not require sys.path hacks. Co-authored-by: Cursor <cursoragent@cursor.com>

Fix CI failures from polluted eval env and bandit B108.

a300258

Reload benchmark env in root conftest before cuga imports and add nosec annotations for fixed /tmp log paths in create_eval_bundle. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'origin/main' into fix/m3-harness-bugs

2ce23f5

fix(m3): single Langfuse trace per task on Watsonx/Cuga path

ab3b4ef

Merge remote-tracking branch 'origin/main' into fix/m3-harness-bugs

671046c

fix(m3): export should_trace_langfuse_task from benchmarks.helpers

f54f4a5

This was referenced Jun 4, 2026

[Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags) #38

Closed

[Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs) #39

Closed

haroldship marked this pull request as ready for review June 4, 2026 13:08

haroldship requested a review from offerakrabi June 4, 2026 13:08

haroldship requested a review from Sergey-Zeltyn June 4, 2026 13:08

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread benchmarks/helpers/common.sh

Comment thread benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md Outdated

Comment thread scripts/check_no_task_prefix.py Outdated

Sergey-Zeltyn reviewed Jun 7, 2026

View reviewed changes

offerakrabi reviewed Jun 7, 2026

View reviewed changes

haroldship added 2 commits June 7, 2026 14:17

chore(m3): regenerate compiled policies.json from markdown sources

acb0031

The P-PB-2 markdown fix in 790518c (closing an unmatched backtick) wasn't reflected in the compiled policies.json. Regenerated via the eval harness's compile step so the artifact matches its source.

haroldship merged commit 99a7326 into main Jun 7, 2026
3 of 4 checks passed

haroldship deleted the fix/m3-harness-bugs branch June 7, 2026 11:39

haroldship mentioned this pull request Jun 7, 2026

fix(m3): bundle on interrupt and save partial results on crash #4

Merged

7 tasks

Conversation

sami-marreed commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related issues (sub-issues of #37)

Summary

Tool calling (#39)

1) Vakra tool-name prefix (task_<n>_<domain>_)

2) Undocumented tool outputs

Policies (#38)

Additional harness fixes (same PR)

Test results — full small_train.zip run (no policies)

Summary (from report.md)

Per-domain pass rate (Vakra)

Changed files (vs main)

Verification

Out of scope

References

Summary by CodeRabbit

Uh oh!

haroldship commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

haroldship commented Jun 7, 2026

Uh oh!

Sergey-Zeltyn commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sergey-Zeltyn commented Jun 7, 2026

Uh oh!

Sergey-Zeltyn commented Jun 7, 2026

Uh oh!

Sergey-Zeltyn commented Jun 7, 2026

Uh oh!

Sergey-Zeltyn commented Jun 7, 2026

Uh oh!

Sergey-Zeltyn left a comment

Choose a reason for hiding this comment

Uh oh!

offerakrabi Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

haroldship Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

offerakrabi Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

haroldship Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

haroldship commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

haroldship commented Jun 7, 2026

M3 eval/compare validation (small_train.zip, m3_task_2/hockey, max-samples=2)

Eval (cuga, --max-samples 2)

Compare (--compare-agents --runs 2)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sami-marreed commented May 26, 2026 •

edited by coderabbitai Bot

Loading

1) Vakra tool-name prefix (`task_<n>_<domain>_`)

Test results — full `small_train.zip` run (no policies)

Summary (from `report.md`)

Changed files (vs `main`)

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Sergey-Zeltyn commented Jun 7, 2026 •

edited

Loading

haroldship commented Jun 7, 2026 •

edited

Loading

Eval (`cuga`, `--max-samples 2`)

Compare (`--compare-agents --runs 2`)