Add an M3-specific CUGA policy bundle (playbooks, tool guides, output formatters) with harness flags to run evals and compare sweeps with policies, without policies, or A/B across both modes, mirroring the BPO benchmark workflow.
Deliverables:
Eight policies under benchmarks/m3/policies/ (P-OF-1, P-PB-1..4, P-TG-1..2; P-OF-2 disabled in frontmatter)
scripts/policies_md_to_json.py to compile MD → policies.json
_load_m3_policies() with per-domain agents using auto_load_policies=False and filesystem_sync=False
--no-policies and --compare-policies on benchmarks/m3/eval.sh and compare.sh
Bundle directory suffix / metadata for policy mode in benchmarks/helpers/bundle.py
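A minimal sketch of what the MD → policies.json compile step could look like. This is not the actual scripts/policies_md_to_json.py: the frontmatter layout (simple `key: value` lines between `---` fences), the `enabled` field name, the output schema, and the function names are all assumptions made for illustration.

```python
import json
from pathlib import Path


def parse_policy_markdown(text: str) -> dict:
    """Split one policy file into frontmatter fields and a markdown body.

    Assumes flat `key: value` frontmatter between `---` fences (hypothetical layout).
    """
    meta: dict = {}
    body = text
    if text.startswith("---\n"):
        header, sep, rest = text[4:].partition("\n---\n")
        if sep:  # only treat it as frontmatter when the closing fence exists
            body = rest
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon and key.strip():
                    meta[key.strip()] = value.strip()
    # Assumed convention: a policy is enabled unless frontmatter says otherwise.
    enabled = meta.pop("enabled", "true").lower() not in {"false", "no", "0"}
    return {**meta, "enabled": enabled, "content": body.strip()}


def compile_policies(policy_dir: Path, out_path: Path) -> list[dict]:
    """Compile every *.md file in policy_dir into a single JSON file."""
    policies = [
        parse_policy_markdown(path.read_text(encoding="utf-8"))
        for path in sorted(policy_dir.glob("*.md"))
    ]
    out_path.write_text(json.dumps({"policies": policies}, indent=2), encoding="utf-8")
    return policies
```

In this sketch a disabled policy such as P-OF-2 stays in the JSON with `enabled` set to false, so the loader decides whether to skip it. The real script may drop disabled entries instead.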
Motivation / Problem
M3 Vakra scoring should reflect agent behavior, but we also need a controlled way to test whether harness-level policies improve pass rate without changing judges or ground truth. Today there is no first-class policy path for M3 comparable to BPO, so policy experiments are ad hoc and hard to reproduce in evaluation bundles.
Use Case
As someone running M3 benchmarks and compare sweeps, I want to:
Run the same task slice with policies on vs off and record which bundle used which mode
Iterate on policy markdown under benchmarks/m3/policies/ and recompile to JSON before eval
Measure whether policies help or hurt on a domain slice before rolling them into default evals
Proposed Solution
Author policies as markdown under benchmarks/m3/policies/; compile to policies.json via scripts/policies_md_to_json.py (invoked from eval.sh before eval).
Load policies once per domain via _load_m3_policies(agent) after clearing stray policy DB state on the agent.
Expose --no-policies (skip load) and --compare-policies (run both modes in compare.sh).
Encode policy mode in bundle metadata / directory naming for reproducibility.
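The flag handling and bundle naming above can be sketched as follows. The real flags live on the shell scripts (eval.sh, compare.sh) and the real naming lives in benchmarks/helpers/bundle.py; this Python sketch only illustrates the logic. The mode labels, the `__` suffix format, the `policy_mode` metadata key, and all function names are hypothetical.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the two policy flags named in this issue."""
    parser = argparse.ArgumentParser(description="M3 policy-mode sketch")
    parser.add_argument("--no-policies", action="store_true",
                        help="skip loading policies")
    parser.add_argument("--compare-policies", action="store_true",
                        help="run both policy modes")
    return parser.parse_args(argv)


def policy_modes(no_policies: bool, compare_policies: bool) -> list[str]:
    """Return the policy modes to run.

    Assumption: --compare-policies takes precedence and runs both modes.
    """
    if compare_policies:
        return ["policies", "no-policies"]
    return ["no-policies"] if no_policies else ["policies"]


def bundle_dir_name(base: str, mode: str) -> str:
    """Append the policy mode to the bundle directory name (hypothetical format)."""
    return f"{base}__{mode}"


def bundle_metadata(base: str, mode: str) -> dict:
    """Record the policy mode next to the bundle name for reproducibility."""
    return {"bundle": bundle_dir_name(base, mode), "policy_mode": mode}


if __name__ == "__main__":
    args = parse_args(["--compare-policies"])
    for mode in policy_modes(args.no_policies, args.compare_policies):
        print(bundle_metadata("m3-slice", mode))
```

Giving each mode its own directory and metadata entry keeps a compare sweep from overwriting one mode's results with the other's.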
Acceptance criteria
Policies load from benchmarks/m3/policies/policies.json when enabled
Compare sweep produces distinct bundles for with/without policies
Net policy effect is documented on a representative slice before policies become default-on for full 200-case runs
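One way to document the net policy effect is to compare per-task pass/fail between the two bundles. The sketch below assumes each bundle can be reduced to a mapping from task id to a pass boolean. That shape, and the function names, are hypothetical and not the actual bundle format.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


def net_policy_effect(with_policies: dict[str, bool],
                      without_policies: dict[str, bool]) -> dict:
    """Compare the two modes on the tasks that appear in both runs."""
    shared = sorted(with_policies.keys() & without_policies.keys())
    on = {task: with_policies[task] for task in shared}
    off = {task: without_policies[task] for task in shared}
    return {
        "n_tasks": len(shared),
        "pass_rate_with": pass_rate(on),
        "pass_rate_without": pass_rate(off),
        # Positive delta means policies helped on this slice.
        "delta": pass_rate(on) - pass_rate(off),
        "helped": [task for task in shared if on[task] and not off[task]],
        "hurt": [task for task in shared if off[task] and not on[task]],
    }
```

Listing the helped and hurt tasks alongside the delta shows which policies to revise, since a near-zero net effect can hide offsetting gains and regressions.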
Alternatives Considered
Hard-code prompts only in special_instructions: rejected because it does not reuse CUGA policy types (playbook / tool_guide) or BPO conventions
Enable CUGA default policy auto-load from CWD: rejected because it caused drift across per-domain agents; explicit _load_m3_policies is used instead
Priority
High - Important for my workflow
Additional Context
fix/m3-harness-bugs), closes this issue when merged
codebase_comments): After fixing tool-name prefix ([Feature]: M3 tool-calling harness fixes (Vakra matching + undocumented outputs) #39), no-policies reached ~81% on that slice; policies ~50%. Full 200-case policy compare is follow-up work.
docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md