Skip to content

fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate#3

Merged
haroldship merged 23 commits into
mainfrom
fix/m3-harness-bugs
Jun 7, 2026
Merged

fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate#3
haroldship merged 23 commits into
mainfrom
fix/m3-harness-bugs

Conversation

@sami-marreed

@sami-marreed sami-marreed commented May 26, 2026

Copy link
Copy Markdown
Contributor

Related issues (sub-issues of #37)

Track Issue PR scope
Policies #38 M3 policy bundle, --no-policies / --compare-policies, compiler, bundle metadata
Tool calling #39 (1) bare-domain registry names + Vakra _match_live_name compat + check_no_task_prefix.py; (2) M3_SPECIAL_INSTRUCTIONS for undocumented MCP outputs

Closes #38
Closes #39

Part of #37.


Summary

Harness-only changes so CUGA M3 (Vakra) scores are not artificially zeroed. Vakra judges, ground truth, MCP definitions, and the ReAct baseline are untouched.

This PR delivers three tracks:

  1. Tool calling ([Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs) #39) — fix registry/Vakra tool-name mismatch and harden agent behavior on schema-less tool outputs.
  2. Policies ([Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags) #38) — eight M3 policies behind opt-in compare flags (net effect on a 4-PF slice was mixed once prefix was fixed; see headline table below).
  3. Harness reliability — registry port, multiturn UUID filter, eval.sh registry lifecycle, Langfuse trace wiring, bundle/report UX, and related fixes from the full branch diff.

Tool calling (#39)

1) Vakra tool-name prefix (task_<n>_<domain>_)

Area Change
Registry expansion Bare domain as MCP app name (e.g. hockey_get_… not task_2_hockey_get_…)
Agent filter FilteredToolProvider uses app_name=<domain>
Scoring _match_live_name suffix fallback for legacy result bundles
Guard scripts/check_no_task_prefix.py — fail if any saved tool call still has legacy prefix

Files: benchmarks/m3/eval_m3.py (expand_registry_config, evaluate_single_task), benchmarks/m3/m3_data_loader.py, benchmarks/m3/m3_vakra_score.py, scripts/check_no_task_prefix.py

2) Undocumented tool outputs

Injected eval-only M3_SPECIAL_INSTRUCTIONS via SDK special_instructions: probe shape on first tool use, then defensive isinstance / normalization before .get() or indexing.

Files: benchmarks/m3/eval_m3.py (M3_SPECIAL_INSTRUCTIONS, passed into CugaAgent)


Policies (#38)

Item Detail
Policies P-OF-1, P-PB-1..4, P-TG-1..2 (+ P-OF-2 disabled in frontmatter)
Compiler scripts/policies_md_to_json.pybenchmarks/m3/policies/policies.json
Loader _load_m3_policies(); auto_load_policies=False, filesystem_sync=False on domain agents
CLI --no-policies, --compare-policies on benchmarks/m3/eval.sh, benchmarks/m3/compare.sh
Bundles Policy-mode suffix in benchmarks/helpers/bundle.py

4-PF × 5-runs × 2-configs (codebase_comments, 4 UUIDs) after prefix fix:

Stage no-policies policies
baseline (broken prefix) 0/10 disabled
+ tool-prefix removal only 5/10 4/10
+ remaining harness fixes 81.2% 50.0%

Policies are not declared net-positive on that slice; full 200-case re-run is follow-up work.


Additional harness fixes (same PR)

Fix File(s)
Multi-UUID --task filter ignored in multiturn path benchmarks/m3/eval_m3.pyself.task_ids vs singular self.task_id
Registry port hardcoded 8001 eval_m3.pyREGISTRY_PORT / DYNACONF_SERVER_PORTS__REGISTRY
Outer eval.sh registry collides with per-service registry benchmarks/m3/eval.sh — default SKIP_SERVER_START=true
--capability / --task overwritten on second CLI token eval_m3.pyaction="extend"
Langfuse: one trace per task; gated on settings eval_m3.py, sdk_eval_helpers.py, should_trace_langfuse_task
Report: capability / domain / task # columns benchmarks/helpers/compare_report.py
Per-run bundle logs; sequential registry restart reliability benchmarks/m3/compare.sh, eval_m3.py
One eval run → one result file + one trajectory run eval_m3.py, compare.sh
Auto-sequence capabilities when --m3-data without --capability eval_m3.py
Bundle assembly fixes benchmarks/helpers/bundle.py, scripts/create_eval_bundle.py
Tests benchmarks/m3/tests/test_vakra_langfuse_scores.py, benchmarks/helpers/tests/test_invoke_agent_for_eval.py
Analysis doc docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md

Test results — full small_train.zip run (no policies)

Bundle: benchmarks/m3/evaluation_bundles/20260603_220001_default/
Branch @ run: fix/m3-harness-bugs (7676764) · policies_enabled: false

Summary (from report.md)

Metric Value
Pass rate 67 / 200 (33.5%)
m3_task_2 (dashboard APIs) 45 / 100 (45.0%)
m3_task_3 (multihop) 22 / 100 (22.0%)
Total tokens 65,497,046
Total LLM calls 2,701
Total duration 14,076 s (~3.9 h)

Per-domain pass rate (Vakra)

m3_task_2 — 45/100
Domain Pass
authors 6/10
books 5/10
codebase_comments 6/10
hockey 5/10
mondial_geo 4/10
movie_platform 4/10
professional_basketball 2/10
soccer_2016 2/10
student_loan 5/10
talkingdata 6/10
m3_task_3 — 22/100
Domain Pass
beer_factory 1/10
books 3/10
college_completion 0/10
computer_student 1/10
disney 4/10
mondial_geo 2/10
soccer_2016 5/10
trains 0/10
university 2/10
world_development_indicators 4/10

Note: Several rows show -- for Steps in report.md (metrics not propagated into SDK results for this run). Tokens / LLM calls / duration are populated for most completed tasks.


Changed files (vs main)

31 files changed, 3720 insertions(+), 403 deletions(-)

Key paths: benchmarks/m3/eval_m3.py, eval.sh, compare.sh, m3_vakra_score.py, m3_data_loader.py, benchmarks/m3/policies/*, benchmarks/helpers/{bundle,compare_report,sdk_eval_helpers}.py, scripts/{check_no_task_prefix,policies_md_to_json,create_eval_bundle}.py, docs/m3-vakra-analysis-20260428/…, tests under benchmarks/m3/tests/ and benchmarks/helpers/tests/.


Verification

  • just ci / GitHub CI green on branch
  • scripts/check_no_task_prefix.py on latest result JSON
  • 4-PF policy compare sweep (see table above)
  • Full 200-case small_train.zip eval → bundle above
  • Re-run 200-case with --compare-policies (follow-up)
  • Manual Ctrl-C / partial-save scenarios covered by PR fix(m3): bundle on interrupt and save partial results on crash #4

Out of scope

  • Nested-arg sandbox codegen (cuga-agent)
  • movie_platform / professional_basketball MCP health (infra)
  • Parallel compare.sh runs sharing benchmarks/m3/results/ (glob contamination)

References

  • Analysis: docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md (post-implementation section)

Summary by CodeRabbit

  • New Features

    • Per-run compare bundles with nested trajectory/log grouping; new CLI to build reproducibility bundles.
    • New model-profile options (vllm, local) and global --no-policies mode for M3 runs.
  • Improvements

    • More robust registry/port management and deterministic startup/shutdown for multi-run M3 evaluations.
    • Lazier and safer env loading (inline-comment handling), refined Langfuse tracing/flush behavior, per-task numbering in outputs, and richer Vakra scoring with optional Langfuse export.
  • Tests

    • Added tests for agent invocation paths and Vakra→Langfuse score integration.
  • Documentation

    • New and updated policy/playbook/ToolGuide docs and an extensive CUGA vs ReAct analysis.

haroldship and others added 19 commits May 26, 2026 12:30
The M3 harness in this repo had several mutually-reinforcing bugs that
caused CUGA's M3 (vakra) pass rate to be near zero on PF cases (cases
where ReAct passes and CUGA fails), independent of agent quality. This
commit fixes the cluster:

1. Tool-name app-prefix breaks vakra's _match_live_name (groundedness=0
   on otherwise-correct answers). Registry now uses bare-domain app
   names; m3_data_loader and m3_vakra_score gain backward-compat suffix-
   match paths for legacy bundles. Collision guard added because
   de-prefixing collapses cross-task duplicates (books, mondial_geo,
   soccer_2016 in tasks 2 and 3); expand_registry_config gains a
   capability_filter to pre-narrow.
2. M3Evaluator.evaluate_all multiturn branch checked self.task_id
   (singular, set only when N=1) instead of self.task_ids — so
   `--task <uuid> <uuid> <uuid> <uuid>` silently ran all ~46 samples
   in the capability. Switched to plural and lowercase-membership.
3. eval_m3.py:start_registry_server hardcoded port 8001 in three places.
   Now reads REGISTRY_PORT or DYNACONF_SERVER_PORTS__REGISTRY.
4. eval.sh started an outer registry that collided with eval_m3.py's
   per-service registry on the same port. Every code path through
   eval.sh uses --from-config and self-manages its registry, so the
   outer-start is dead weight; gated behind SKIP_SERVER_START=false now.
5. Per-domain CugaAgent constructors were re-loading policies from disk
   and writing back the conflict-resolver's culled set, so the .cuga
   folder count decreased monotonically across domains. Pass
   auto_load_policies=False, filesystem_sync=False; load policies once
   via a new _load_m3_policies helper.
6. --capability / --task with nargs="*" overwrote on a second
   invocation; switched both to action="extend", default=[]. UUID
   detection now strips non-UUID items before passing the filter down.
7. M3 policy bundle (P-OF-1, P-PB-1..4, P-TG-1..2) authored in
   benchmarks/m3/policies/ as markdown with YAML frontmatter; compiled
   to policies.json by scripts/policies_md_to_json.py. eval.sh and
   compare.sh grew --no-policies and --compare-policies flags mirroring
   bpo. DYNACONF_POLICY__ENABLED flipped to true in m3.env. Bundle dir
   naming annotates the policy mode.
8. scripts/check_no_task_prefix.py smoke-tests result files to ensure
   no tool call still carries the legacy task_<n>_<domain>_ prefix.

Headline result on a 4-PF x 5-runs x 2-configs sweep:
- baseline: 0/10
- + tool-prefix removal: 5/10 (no-policies) / 4/10 (policies)
- + remaining fixes (4 PFs x 5 runs): 81.2% no-policies / 50.0% policies

The policy bundle is net-negative on these 4 cases once the tool-prefix
root cause is fixed; one task improves (75 -> 100%), one regresses
(75 -> 0%). Re-running policies against the full 200-case M3 set is the
next step.

Full implementation notes appended to the analysis report under
"Post-analysis: what we actually changed and what each change did".

Out of scope (deferred): nested-arg sandbox codegen bug (cuga-agent),
movie_platform/professional_basketball MCP-client health, and
splitting results/ per compare.sh invocation to prevent parallel-eval
contamination.

Refs: #99
The MD→JSON compiler skips files with the .disabled suffix, so the
policy was already inactive, but the frontmatter still said
`enabled: true`. Misleading to a reader; CodeRabbit flagged the
inconsistency in #100.

The intent of the .disabled suffix here is "P-OF-2 conflicts with
P-OF-1 in the natural-language conflict resolver and shouldn't be
loaded for now." Flipping the frontmatter to `enabled: false` makes
that state visible without relying on the suffix alone — and means
removing the suffix later won't silently re-enable a policy whose
trade-offs haven't been re-evaluated.
Add get_registry_port() backed by cuga settings so start_registry_server
listens on the same port the agent uses via get_registry_base_url(). Sync
REGISTRY_PORT and DYNACONF in eval.sh/compare.sh; stop hardcoding 8001 in
eval_m3_react port cleanup.

Co-authored-by: Cursor <cursoragent@cursor.com>
…bility

Bare-domain registry names (from the vakra tool-name fix) collide when both
m3_task_2 and m3_task_3 expand into one yaml. Run one capability at a time
automatically so full small_train.zip evals work without --capability on CLI.
Strip inline env comments so CUGA mode parses correctly, lazy-load helpers
so bundle creation works without importing the agent, and add a utility to
assemble bundles from existing results after long eval runs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Inline trajectory lookup so the utility does not import the benchmarks
package before subprocess calls.

Co-authored-by: Cursor <cursoragent@cursor.com>
Load benchmark env via dotenv inside bundle.py and invoke it with -m from
create_eval_bundle so direct script execution does not require sys.path hacks.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reload benchmark env in root conftest before cuga imports and add nosec
annotations for fixed /tmp log paths in create_eval_bundle.

Co-authored-by: Cursor <cursoragent@cursor.com>
The auto-capability-pass logic ran each capability (m3_task_2, m3_task_3) as a
separate recursive run_config_mode and each pass saved its own result file. With
--m3-data covering 2 capabilities, one eval.sh invocation emitted two 100-task
files instead of one 200-task file, so compare_report counted each capability
file as a separate "run" (inflated run count) and every run column only showed
100 tasks (the other capability's tasks rendered as "—").

- eval_m3.py: capability passes now run with defer_save=True and return their
  results; the parent aggregates them and writes ONE result file (+ ground-truth
  dump and summary) per eval.sh run via the new _finalize_and_save_results
  helper. Single-capability/--capability paths use the same helper. Also removes
  a duplicate _write_single_service_yaml definition.
- compare.sh / bundle.py: trajectory folders (cuga writes one per domain) are now
  grouped per eval.sh run and merged into a single runN/trajectories dir, so one
  bundle run holds all 200 trajectories instead of one run per domain. The bundle
  CLI still accepts the legacy flat trajectory-dirs shape.

Net effect: `compare.sh --runs 3 --compare-policies` produces 6 runs (3 policies
+ 3 no-policies), each covering all 200 tasks/trajectories.

Co-authored-by: Cursor <cursoragent@cursor.com>
… port

Sequential mode starts/stops a registry per domain on the same port. stop only
waited on the `uv` wrapper, so the uvicorn worker could still hold the port when
the next domain started, and start_registry_server hard-failed immediately on a
busy port ("Port 18001 is already in use" mid-run, e.g. on talkingdata).

- eval_m3.py: add _port_in_use/_wait_for_port_free/_kill_port_listeners helpers.
  start_registry_server now frees + waits for the port (up to 20s) instead of
  failing on first check; stop_registry_server waits for the port to actually be
  released (and force-kills stray listeners) before returning.
- eval.sh: only start the "outer" registry for multiturn. Single-turn and
  --m3-data flows (cuga/react) manage their own per-service registries, so the
  outer registry (which compare.sh forces on via SKIP_SERVER_START=false on its
  first run) only collided on $REGISTRY_PORT. Fixes both direct eval and compare.

Co-authored-by: Cursor <cursoragent@cursor.com>
…logs

- compare_report: per-task comparison table now leads with Capability,
  Domain, and # (1-based input task number), sorted by (capability,
  domain, task#); non-M3 benchmarks keep the legacy table.
- thread task_number from m3_data_loader through eval_m3/eval_m3_react
  into results so the report can render it.
- bundle.assemble_compare_bundle accepts per-run grouped logs
  (runs/<config>_run<i>/logs) while staying backward-compatible with
  flat/shared logs.
- m3/compare.sh snapshots each run's console + live registry_server.log
  (fixing the stale /tmp/m3_registry.log) and emits per-run grouped
  --log-files; appworld and other benchmarks keep shared logs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reorganize m3.env with documented settings, disable Evolve, cap
cuga_lite_max_steps at 35 (GT M=3 in small_train, formula uses M=4
padding), and inject special_instructions so CugaLite probes unknown
tool shapes then accesses results with isinstance checks.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use should_trace_langfuse_task() instead of setup_langfuse() on the evaluator;
write trace_id whenever a per-task trace is created; route error-path invokes
through _invoke_agent_for_eval; flush when tracing is enabled in settings.
Parse --no-policies in compare.sh (config label + eval.sh args). Pass
EVAL_M3_EXTRA on all eval_m3 entrypoints. Disable policy engine via
DYNACONF_POLICY__ENABLED=false when baselining. Export
should_trace_langfuse_task from benchmarks.helpers.
@haroldship

Copy link
Copy Markdown
Collaborator

Ready for review.

Please focus review on harness-only scope (no Vakra judge / MCP / React changes).

@haroldship haroldship marked this pull request as ready for review June 4, 2026 13:08
@haroldship haroldship requested a review from offerakrabi June 4, 2026 13:08
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fbe6965c-4f56-44fd-a30e-970964fa84f7

📥 Commits

Reviewing files that changed from the base of the PR and between 4049fb0 and acb0031.

📒 Files selected for processing (2)
  • benchmarks/m3/eval_m3_react.py
  • benchmarks/m3/policies/policies.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • benchmarks/m3/policies/policies.json
  • benchmarks/m3/eval_m3_react.py

📝 Walkthrough

Walkthrough

Adds M3 policy markdowns and a compiler, centralizes Langfuse tracing/invocation, normalizes registry naming/port handling, propagates task_number, stages per-run trajectory/log groups for compare bundles, and extends Vakra scoring with judge extraction and Langfuse pushes.

Changes

M3 Evaluation Harness and Vakra Scoring

Layer / File(s) Summary
Environment loading & model configuration
conftest.py, benchmarks/helpers/load_env.sh, benchmarks/helpers/common.sh, scripts/model_profiles.sh
Reloads committed env files for tests; strips inline # from unquoted env values; captures CLI model overrides and finalizes model profile + CLI overrides; adds vllm/local profiles.
Langfuse tracing gate & agent invocation refactor
benchmarks/helpers/sdk_eval_helpers.py, benchmarks/helpers/tests/test_invoke_agent_for_eval.py, benchmarks/helpers/__init__.py
Adds should_trace_langfuse_task() gating; refactors _invoke_agent_for_eval() to shape kwargs per agent type and map lf callbacks/config; makes helper exports lazy and adds invocation unit tests.
Policy markdowns & compiler
benchmarks/m3/policies/*.md, scripts/policies_md_to_json.py, benchmarks/m3/policies/policies.json
Adds policy/playbook/tool-guide markdowns and policies_md_to_json.py to parse frontmatter, validate fields, map types to JSON body fields, detect duplicate IDs, and write policies.json.
M3 shell runners: policy modes & dynamic ports
benchmarks/m3/eval.sh, benchmarks/m3/compare.sh, benchmarks/m3/config/m3.env
Adds --no-policies and --compare-policies, derives/export REGISTRY_PORT from dynaconf, stages per-run trajectories/logs, and encodes policy_mode into compare bundle inputs.
Bundle assembly & packaging
benchmarks/helpers/bundle.py, scripts/create_eval_bundle.py
Adds _load_benchmark_env, accepts nested per-run trajectory_dirs and grouped log inputs, copies per-config/per-run trajectories and logs into bundle runs/<label>_runN/{trajectories,logs}, and provides a CLI to assemble bundles from latest artifacts.
Task numbering & compare report layout
benchmarks/m3/m3_data_loader.py, benchmarks/helpers/compare_report.py, benchmarks/m3/eval_m3_react.py, benchmarks/m3/eval_m3.py
Sets 1-based task_number in loader; propagates into ReAct/multiturn results; compare report detects M3 metadata and prepends Capability/Domain/# columns and sorts by (m3_task_id,domain,task_number,label).
eval_m3 orchestration & registry lifecycle
benchmarks/m3/eval_m3.py, benchmarks/m3/eval_m3_react.py
Adds M3_SPECIAL_INSTRUCTIONS, per-domain _load_m3_policies, --no-policies handling, capability multi-pass sequencing with deferred finalization, expands registry config to bare-domain app names, adds registry port helpers, and hardens startup/shutdown waits.
Vakra scoring: judge extraction & Langfuse pushes
benchmarks/m3/m3_vakra_score.py, benchmarks/m3/tests/test_vakra_langfuse_scores.py
Adds per-judge extraction and compact formatting helpers, log_vakra_task_scores() and push_vakra_scores_to_langfuse() (best-effort, failures logged), augments console summary, and adds tests validating Langfuse interactions and logging.
Validation & helper scripts
scripts/check_no_task_prefix.py, scripts/policies_md_to_json.py
Adds a smoke-check for legacy task_<n>_<domain>_ prefixes in result tool names and a policy markdown-to-JSON compiler CLI.
Analysis & docs
docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
Adds comprehensive CUGA vs ReAct Vakra analysis with failure clusters, remediation levers, and a post-change reproduction summary.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/m3-harness-bugs

@haroldship haroldship requested a review from Sergey-Zeltyn June 4, 2026 13:08

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/m3/m3_vakra_score.py (1)

102-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Legacy prefix stripping misses underscore domains.

The regex only strips task_<n>_<domain>_ when <domain> is a single alnum token. Legacy names like task_2_codebase_comments_... won’t be stripped, which can break the legacy-bundle compatibility path.

Suggested fix
-        _REGISTRY_PREFIX_RE = _re.compile(r"^task_\d+_[A-Za-z0-9]+_(.+)$")
+        # Domain boundaries are ambiguous because both domain and operation ids
+        # can contain underscores; strip only the deterministic task prefix.
+        _REGISTRY_PREFIX_RE = _re.compile(r"^task_\d+_(.+)$")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/m3_vakra_score.py` around lines 102 - 104, The current regex
_REGISTRY_PREFIX_RE only allows a single alphanumeric token for the domain and
thus fails to strip legacy names with underscores (e.g.
task_2_codebase_comments_...). Update the pattern used to compile
_REGISTRY_PREFIX_RE to accept domains containing underscores (for example
include "_" inside the character class), then keep the same match/group logic
(use m.group(1) if m else name) so names like task_2_codebase_comments_...
correctly return the suffix; ensure you only modify the regex expression and not
the subsequent matching logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/helpers/common.sh`:
- Around line 233-256: parse_common_args reads CLI overrides
(--model-name/--openai-base-url) but some call sites call
apply_model_profile_if_set directly which skips the CLI overrides; update those
call sites to call finalize_model_config (which runs apply_model_profile_if_set
then apply_model_cli_overrides_if_set) instead of apply_model_profile_if_set,
and ensure no early returns in apply_model_profile_if_set prevent later CLI
overrides (leave finalize_model_config's sequence as apply_model_profile_if_set
|| return 1; apply_model_cli_overrides_if_set). Also confirm parse_common_args
sets CLI_MODEL_NAME/CLI_OPENAI_BASE_URL so apply_model_cli_overrides_if_set can
export them to MODEL_NAME/OPENAI_BASE_URL.

In `@benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md`:
- Around line 75-76: Inline code spans contain extra spaces inside backticks;
remove the inner padding so the symbols render as
`get_forks_to_stars_percentage`, `get_repo_forks`, `get_repo_stars` and the
inline fragment "0 forks / 1 star = 0.00%, confirmed by
`get_forks_to_stars_percentage`" use backticks without leading/trailing spaces
(e.g. change ` get_forks_to_stars_percentage ` to
`get_forks_to_stars_percentage`). Update those inline code spans in the markdown
so there are no spaces immediately inside the backticks.

In `@scripts/check_no_task_prefix.py`:
- Line 25: The LEGACY_RE pattern only allows lowercase letters and underscores
in the domain segment, causing false negatives for domain names with digits;
update the regex used in LEGACY_RE (and the other occurrence around lines 90-91)
to permit digits as well (e.g., change the domain class from [a-z_]+ to
[a-z0-9_]+ or \w+ depending on desired constraints) so names like
"task_3_soccer_2016_..." are matched; ensure you update both occurrences (the
LEGACY_RE definition and the similar pattern at 90-91) and run tests/linters
after the change.

---

Outside diff comments:
In `@benchmarks/m3/m3_vakra_score.py`:
- Around line 102-104: The current regex _REGISTRY_PREFIX_RE only allows a
single alphanumeric token for the domain and thus fails to strip legacy names
with underscores (e.g. task_2_codebase_comments_...). Update the pattern used to
compile _REGISTRY_PREFIX_RE to accept domains containing underscores (for
example include "_" inside the character class), then keep the same match/group
logic (use m.group(1) if m else name) so names like task_2_codebase_comments_...
correctly return the suffix; ensure you only modify the regex expression and not
the subsequent matching logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a1ddb713-f58a-47f2-ac1b-66052ff6a080

📥 Commits

Reviewing files that changed from the base of the PR and between 167291d and 7676764.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock, !uv.lock
📒 Files selected for processing (30)
  • benchmarks/helpers/__init__.py
  • benchmarks/helpers/bundle.py
  • benchmarks/helpers/common.sh
  • benchmarks/helpers/compare_report.py
  • benchmarks/helpers/load_env.sh
  • benchmarks/helpers/sdk_eval_helpers.py
  • benchmarks/helpers/tests/test_invoke_agent_for_eval.py
  • benchmarks/m3/compare.sh
  • benchmarks/m3/config/m3.env
  • benchmarks/m3/eval.sh
  • benchmarks/m3/eval_m3.py
  • benchmarks/m3/eval_m3_react.py
  • benchmarks/m3/m3_data_loader.py
  • benchmarks/m3/m3_vakra_score.py
  • benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md
  • benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled
  • benchmarks/m3/policies/P-PB-1-no-enumeration.md
  • benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md
  • benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md
  • benchmarks/m3/policies/P-PB-4-validation-error-recovery.md
  • benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md
  • benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md
  • benchmarks/m3/policies/policies.json
  • benchmarks/m3/tests/test_vakra_langfuse_scores.py
  • conftest.py
  • docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
  • scripts/check_no_task_prefix.py
  • scripts/create_eval_bundle.py
  • scripts/model_profiles.sh
  • scripts/policies_md_to_json.py
👮 Files not reviewed due to content moderation or server errors (8)
  • benchmarks/m3/policies/policies.json
  • benchmarks/m3/config/m3.env
  • benchmarks/m3/eval.sh
  • benchmarks/m3/compare.sh
  • benchmarks/helpers/bundle.py
  • scripts/create_eval_bundle.py
  • benchmarks/m3/m3_data_loader.py
  • benchmarks/m3/eval_m3.py

Comment thread benchmarks/helpers/common.sh
Comment thread benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md Outdated
Comment thread scripts/check_no_task_prefix.py Outdated
- check_no_task_prefix.py: LEGACY_RE now matches digit-containing
  domains (e.g. soccer_2016) instead of only [a-z_]+
- eval.sh / compare.sh: call finalize_model_config instead of
  apply_model_profile_if_set so --model-name / --openai-base-url CLI
  overrides are actually applied
- P-PB-2 policy doc: fix mismatched backtick that broke code-span
  rendering in the "Wrong" example
@haroldship

Copy link
Copy Markdown
Collaborator

Reply to CodeRabbit's "Legacy prefix stripping misses underscore domains" finding on benchmarks/m3/m3_vakra_score.py:102-104 (outside-diff, couldn't reply inline):

Pushing back on this one — the current regex is intentional, not a bug. The docstring on _strip_registry_prefix explains why a single-token domain class is used: "the domain can itself contain underscores (codebase_comments, world_development_indicators…), so a plain regex can't say where the prefix ends. We leave the domain-only prefix to be resolved by _match_live_name (suffix match against the live MCP tool list)."

Traced through _match_live_name for task_2_codebase_comments_get_books:

  • _strip_registry_prefix doesn't match (multi-underscore domain), returns the name unchanged
  • the suffix-match branch (name.endswith("_" + ln)) then finds get_books in the live tool list and resolves correctly

So multi-underscore domains are already handled — just via the suffix matcher rather than the prefix regex, exactly as the docstring describes. Adopting the suggested ^task_\d+_(.+)$ would actually regress single-underscore domains (e.g. hockey): today they get stripped and matched directly; with the broader regex they'd capture hockey_get_books (domain still attached), miss the direct lookup, and fall through to suffix matching too — extra work for a case that already worked cleanly. Leaving as-is.

(Findings on common.sh/eval.sh/compare.sh, check_no_task_prefix.py, and the P-PB-2 markdown spacing were valid and are fixed in 790518c — replied inline on each.)

@Sergey-Zeltyn

Sergey-Zeltyn commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

Likely bugs

  • scripts/policies_md_to_json.py:56 -lines[0].rstrip("\n") != "---" leaves \r on CRLF files; a policy authored on Windows silently fails to compile. Use bare rstrip().
  • scripts/policies_md_to_json.py:130output_path.write_text(...) is non-atomic. If compilation crashes mid-write, _load_m3_policies() then reads a truncated policies.json. Write to .tmp + os.replace.

@Sergey-Zeltyn

Copy link
Copy Markdown
Collaborator

Concurrency

  • eval_m3.py — between _kill_port_listeners (SIGKILL) and the next uvicorn bind, TIME_WAIT can hold the port up to 60s without SO_REUSEADDR; _wait_for_port_free's 20s timeout can lose the race during sequential per-domain restarts. Add --reuse-port / SO_REUSEADDR.

  • eval_m3.py - auto capability-pass loop — combined_results.extend(pass_results) accumulates partial pass results when a later pass raises. The saved file silently mixes complete and partial passes. Tag partial passes or fail the bundle.

@Sergey-Zeltyn

Copy link
Copy Markdown
Collaborator

Cross-test leakage
conftest.py_reload_benchmark_env() calls load_dotenv(env_file, override=True) over every benchmarks/*/config/*.env, in sorted order. Later-sorted files clobber earlier ones in os.environ. The docstring says this is a workaround for inline-comment values left behind by shell sourcing, but the blast radius is global: an appworld test now sees whatever m3.env defines last. Consider scoping per-test (autouse fixture with monkeypatch) or only setting unset keys.

@Sergey-Zeltyn

Copy link
Copy Markdown
Collaborator

Maintainability / naming

  • eval_m3.py:1475evaluator.langfuse_handler = should_trace_langfuse_task() stores a bool in a slot named langfuse_handler. It works today only because every consumer treats it as a truthy gate. Either rename the field (langfuse_enabled) or store the actual handler.

  • benchmarks/helpers/init.py__all__ and _LAZY_EXPORTS are hand-maintained twin lists. They're in sync right now, but nothing enforces it. One-line guard: assert set(_LAZY_EXPORTS).issubset(__all__).

  • benchmarks/helpers/bundle.py:46-55_load_benchmark_env uses override=True. If bundle.py is ever invoked from a parent shell that explicitly set MODEL_NAME via CLI, the benchmark default wins. Confirm intent; otherwise override=False.

  • m3_vakra_score.py - _match_live_name suffix fallback — max(candidates, key=len) is non-deterministic when two live tools share the longest suffix (e.g. get_X vs get_extra_X). Log when len-ties occur so scoring drift is debuggable.

  • eval_m3.pyargparse(action="extend", default=[]) shares a mutable default list. Use default=None with args.task or [].

  • eval_m3.pyM3_SPECIAL_INSTRUCTIONS (~1.5 KB) injected unconditionally into every per-domain agent prompt. Token waste on domains that don't return schema-less outputs.

@Sergey-Zeltyn

Copy link
Copy Markdown
Collaborator

benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled — file already has enabled: false in frontmatter; the .md.disabled extension is a second mechanism that hides the file from the *.md glob in the compiler. Pick one. The current setup means anyone reading the policies dir will assume the file is "off but loaded" when it's actually invisible.

@Sergey-Zeltyn Sergey-Zeltyn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above.

No remaining issue is a merge blocker. The two correctness items worth fixing in this PR are the CRLF frontmatter and non-atomic policies.json write; the rest can land as follow-ups.

run_label = f"{config_key.replace(':', '_')}"
_copy_logs(bundle_dir, lf_list, dest_subdir=f"runs/{run_label}/logs")
for config_key, lf_val in log_files.items():
if lf_val and isinstance(lf_val[0], list):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code tries to detect whether log files are organized as "one list per run" or "one flat list" by checking if the first element is itself a list. If someone accidentally passes an empty nested structure like [[]], the code won't crash immediately but might behave unexpectedly—it'll try to copy logs from an empty group, which could silently fail or do nothing.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked this — [[]] doesn't actually misbehave: _copy_logs early-returns False on a falsy/empty log_files list (line 250: if not log_files: return False), so the inner empty group is just a no-op — no silent partial copy, no crash. The shape-detection branch correctly routes it to the grouped path and then does nothing for the empty group. Leaving as-is; the existing guard already covers this.

data_dir = PROJECT_ROOT / "benchmarks" / benchmark / "data"
if benchmark == "m3":
if result_file.name.startswith("multiturn_"):
candidate = data_dir / "olympics_mutliturn.json"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a typo here mutliturn instead of multiturn

actually looking again i see the error is in the file name itself, your choice whether you would like to fix this..not very important

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — that's not a typo introduced here, it's the actual filename of the existing data file (benchmarks/m3/data/olympics_mutliturn.json, pre-existing in the repo). Renaming the reference without renaming the data file would break loading. Renaming the data file itself is a separate, slightly-bigger cleanup (touches whatever else references it) — leaving it out of this PR's scope, agreed it's not important.

- policies_md_to_json.py: rstrip() instead of rstrip("\n") so CRLF
  frontmatter delimiters parse correctly; write policies.json via
  tmp-file + os.replace so a crash mid-write can't leave a truncated
  file for _load_m3_policies to read
- eval_m3.py: rename M3Evaluator.langfuse_handler -> langfuse_enabled
  (it stores a bool gate, not a handler; kwarg names passed to
  sdk_eval_helpers stay langfuse_handler since that's their declared
  legacy-gate parameter name)
- m3_vakra_score.py: log when _match_live_name's length-based
  tie-breaks have multiple candidates, so non-deterministic scoring
  drift from live-tool-list ordering is debuggable
- benchmarks/helpers/__init__.py: guard that every _LAZY_EXPORTS entry
  is declared in __all__
@haroldship

haroldship commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

@Sergey-Zeltyn — addressed the items below in 4049fb0 (and 790518c for the CodeRabbit batch). Replying to each of your threads:

Likely bugs (both fixed, as you flagged as the priority pair):

  • policies_md_to_json.py:56 — confirmed: rstrip("\n") leaves \r on CRLF files. Changed both occurrences to bare rstrip().
  • policies_md_to_json.py:130 — confirmed non-atomic write. Now writes to <output>.tmp then Path.replace() (atomic rename) so a mid-write crash can't leave _load_m3_policies reading a truncated policies.json.

Maintainability / naming:

  • eval_m3.py:1475 — agreed, langfuse_handler stored a bool. Renamed the M3Evaluator attribute to langfuse_enabled (kept the langfuse_handler= kwarg names when calling into sdk_eval_helpers — that's the declared parameter name there, documented as a "legacy gate").
  • benchmarks/helpers/__init__.py — added the one-line guard you suggested (_LAZY_EXPORTS__all__), raising at import time rather than via assert (ruff S101 flags bare assert in non-test code, since -O strips it).
  • m3_vakra_score.py _match_live_name — added debug logging when a length-based tie has multiple candidates, so scoring drift from live-tool-list ordering is traceable.

Pushing back / leaving as follow-up (matches your own triage that these aren't blockers):

  • eval_m3.py argparse(..., action="extend", default=[]) — checked this empirically; modern argparse's _ExtendAction copies the default list before extending (_copy_items), so the [] default is never mutated in place. Verified with a quick repro — two parse_args() calls get distinct list objects. Not a bug.
  • P-OF-2-strip-hedging.md.disabled — careful here: the .md.disabled extension is the only mechanism with teeth (the compiler's *.md glob skips it entirely; _load_m3_policies doesn't check the enabled field at all). So renaming to .md to "pick the frontmatter mechanism" would actually flip it from disabled to loaded — the opposite of intent. The real fix would be teaching the compiler/loader to respect enabled: false and dropping the filename convention, which is a slightly bigger change I'd rather not bundle into this PR's scope. Left as-is for now; will file a follow-up issue for it.
  • TIME_WAIT/SO_REUSEADDR, partial-pass accumulation in combined_results, conftest.py cross-suite env clobbering, bundle.py override=True, and M3_SPECIAL_INSTRUCTIONS token cost — all valid observations, agreed they're not blockers per your note. Tracking as follow-ups rather than scope-creeping this PR further.

The conftest.py overlapping-keys point in particular (MCP_SERVERS_FILE/CUGA_LOGGING_DIR/DYNACONF_POLICY__ENABLED clobbering across appworld/bpo/m3/oak_health_insurance envs) is worth its own issue — will file one.

…nsion

eval_m3_react.py's _load_m3_registry_services called expand_registry_config
without a capability_filter, expanding the entire registry (m3_task_2 and
m3_task_3) and tripping the cross-task domain-name collision guard for
shared bare-domain names (books, mondial_geo, soccer_2016). The cuga path
in eval_m3.py already pre-filters by --capability before expansion; mirror
that here so --compare-agents runs with --capability narrowing don't crash
the react configuration.
The P-PB-2 markdown fix in 790518c (closing an unmatched backtick) wasn't
reflected in the compiled policies.json. Regenerated via the eval harness's
compile step so the artifact matches its source.
@haroldship

Copy link
Copy Markdown
Collaborator

M3 eval/compare validation (small_train.zip, m3_task_2/hockey, max-samples=2)

Ran the README's documented recipes against this branch to validate the harness fixes end-to-end.

Eval (cuga, --max-samples 2)

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

Completed cleanly (exit 0, no exceptions). Pass 1/2 (50%) — both failures are model-accuracy judging outcomes (wrong answer / groundedness), not harness errors.

Compare (--compare-agents --runs 2)

caffeinate -i bash benchmarks/m3/compare.sh --compare-agents --runs 2 \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

First attempt surfaced a real bug: the react agent configuration crashed both of its runs with

RuntimeError: Service-name collision in expanded registry config: ['books', 'mondial_geo', 'soccer_2016']

eval_m3_react.py's _load_m3_registry_services was calling expand_registry_config() without a capability_filter, so it expanded the entire registry (m3_task_2 + m3_task_3) instead of narrowing to the requested --capability, tripping the cross-task domain-name collision guard. The cuga path in eval_m3.py already pre-filters by --capability before expansion — eval_m3_react.py didn't mirror that.

Fixed in eeea391 (propagate capability_filter=[self.capability] to expand_registry_config, mirroring eval_m3.py). Re-ran the full compare after the fix:

Configuration                Runs  Pass Rate  pass@2  pass^2  maj@2  Cons   Tokens     LLM   Time
cuga — policies (GPT-OSS-120B)    2   75.0%   100.0%   50.0%   50.0%  1.00  291,569.0  16.5  31.8s
react — policies (GPT-OSS-120B)   2   50.0%   100.0%    0.0%    0.0%    --   71,776.0   5.5   6.9s

0 failed out of 4 runs, no collisions, no exceptions, exit 0.

Also verified:

  • Without --compare-agents/--agent, eval.sh runs cuga only — ${AGENT:-cuga} defaults to cuga and the react branch is gated on [ "${AGENT:-cuga}" = "react" ]. No AGENT= override exists in any .env/global.env/m3.env. So agent-comparison stays opt-in as intended, not a default.
  • Also regenerated policies.json (in acb0031) — the P-PB-2 markdown backtick fix from 790518c hadn't been recompiled into the artifact.

No errors or failures remain. Merging.

@haroldship haroldship merged commit 99a7326 into main Jun 7, 2026
3 of 4 checks passed
@haroldship haroldship deleted the fix/m3-harness-bugs branch June 7, 2026 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants