[Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags) #38

@haroldship

Feature Request

Add an M3-specific CUGA policy bundle (playbooks, tool guides, output formatters) with harness flags to run evals and compare sweeps with policies, without policies, or as an A/B comparison of the two, mirroring the BPO benchmark workflow.

Deliverables:

  • Eight policies under benchmarks/m3/policies/: seven enabled (P-OF-1, P-PB-1..4, P-TG-1..2) plus P-OF-2, which is disabled in frontmatter
  • scripts/policies_md_to_json.py to compile MD → policies.json
  • _load_m3_policies() with per-domain agents using auto_load_policies=False and filesystem_sync=False
  • --no-policies and --compare-policies on benchmarks/m3/eval.sh and compare.sh
  • Bundle directory suffix / metadata for policy mode in benchmarks/helpers/bundle.py
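
To make the MD → JSON deliverable concrete, here is a minimal sketch of what a compile step could look like. It is not the actual scripts/policies_md_to_json.py: the frontmatter format (a leading block between `---` lines with simple `key: value` pairs), the `enabled` key, and the function names are all assumptions for illustration.

```python
import json
from pathlib import Path


def parse_policy_markdown(text: str) -> dict:
    """Split a policy file into frontmatter fields and a markdown body.

    Assumption: frontmatter is a leading block delimited by '---' lines and
    holds flat 'key: value' pairs. Real policy files may use richer YAML.
    """
    meta: dict = {}
    body = text
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        # Find the closing delimiter; ignore frontmatter that is never closed.
        end = next(
            (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
            None,
        )
        if end is not None:
            for line in lines[1:end]:
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip()
            body = "\n".join(lines[end + 1:])
    meta["body"] = body.strip()
    return meta


def compile_policies(policy_dir: Path) -> list[dict]:
    """Collect every enabled policy under policy_dir, sorted by filename."""
    policies = []
    for path in sorted(policy_dir.glob("*.md")):
        policy = parse_policy_markdown(path.read_text(encoding="utf-8"))
        policy.setdefault("id", path.stem)  # fall back to the filename as the id
        # Hypothetical 'enabled' key: skip policies switched off in frontmatter.
        if str(policy.get("enabled", "true")).lower() == "false":
            continue
        policies.append(policy)
    return policies


def write_policies_json(policy_dir: Path, out_path: Path) -> None:
    """Write the compiled policies as a JSON array."""
    out_path.write_text(
        json.dumps(compile_policies(policy_dir), indent=2), encoding="utf-8"
    )
```

Under these assumptions, calling `write_policies_json(Path("benchmarks/m3/policies"), Path("benchmarks/m3/policies/policies.json"))` before an eval would regenerate the JSON, with a disabled policy such as P-OF-2 left out of the output.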

Motivation / Problem

M3 Vakra scoring should reflect agent behavior, but we also need a controlled way to test whether harness-level policies improve pass rate without changing judges or ground truth. Today there is no first-class policy path for M3 comparable to BPO, so policy experiments are ad hoc and hard to reproduce in evaluation bundles.

Use Case

As someone running M3 benchmarks and compare sweeps, I want to:

  • Run the same task slice with policies on vs off and record which bundle used which mode
  • Iterate on policy markdown under benchmarks/m3/policies/ and recompile to JSON before eval
  • Measure whether policies help or hurt on a domain slice before rolling them into default evals

Proposed Solution

  1. Author policies as markdown under benchmarks/m3/policies/; compile to policies.json via scripts/policies_md_to_json.py (invoked from eval.sh before eval).
  2. Load policies once per domain via _load_m3_policies(agent) after clearing stray policy DB state on the agent.
  3. Expose --no-policies (skip load) and --compare-policies (run both modes in compare.sh).
  4. Encode policy mode in bundle metadata / directory naming for reproducibility.
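
Steps 3 and 4 can be sketched roughly as follows. The real flags live on eval.sh and compare.sh, so this Python version only illustrates the intended logic; the mode names, the `__<mode>` directory suffix, the metadata keys, and the choice to make the two flags mutually exclusive are all assumptions rather than what benchmarks/helpers/bundle.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the two policy flags (assumed mutually exclusive here)."""
    parser = argparse.ArgumentParser(description="M3 eval policy mode (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--no-policies", action="store_true",
                       help="skip loading policies")
    group.add_argument("--compare-policies", action="store_true",
                       help="run once with policies and once without")
    return parser


def policy_modes(args: argparse.Namespace) -> list[str]:
    """Return the policy modes a run should execute, in order."""
    if args.compare_policies:
        return ["policies", "no-policies"]
    return ["no-policies"] if args.no_policies else ["policies"]


def bundle_dir_name(base: str, mode: str) -> str:
    """Append a hypothetical policy-mode suffix so bundles never collide."""
    return f"{base}__{mode}"


def bundle_metadata(mode: str) -> dict:
    """Hypothetical metadata fields recording which mode produced a bundle."""
    return {"policy_mode": mode, "policies_loaded": mode == "policies"}
```

With this shape, a compare sweep would iterate over `policy_modes(args)` and write each run to `bundle_dir_name(base, mode)`, so the with-policies and without-policies bundles stay distinct and self-describing.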

Acceptance criteria

  • Policies load from benchmarks/m3/policies/policies.json when enabled
  • Compare sweep produces distinct bundles for with/without policies
  • Net policy effect (pass rate with vs. without policies) documented on a representative slice before policies are enabled by default for full 200-case runs
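
One way the net policy effect in the last criterion could be computed is sketched below. It assumes each run can be reduced to a mapping from task id to pass/fail, which is a simplification of whatever the evaluation bundles actually store; the function and key names are illustrative only.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


def net_policy_effect(with_policies: dict[str, bool],
                      without_policies: dict[str, bool]) -> dict:
    """Compare two runs on the tasks they share.

    'helped' lists tasks that pass only with policies, 'hurt' lists tasks
    that pass only without them, and 'net' is the difference in counts.
    """
    shared = sorted(set(with_policies) & set(without_policies))
    helped = [t for t in shared if with_policies[t] and not without_policies[t]]
    hurt = [t for t in shared if without_policies[t] and not with_policies[t]]
    return {
        "tasks": len(shared),
        "pass_rate_with": pass_rate({t: with_policies[t] for t in shared}),
        "pass_rate_without": pass_rate({t: without_policies[t] for t in shared}),
        "helped": helped,
        "hurt": hurt,
        "net": len(helped) - len(hurt),
    }
```

Restricting the comparison to shared task ids keeps the two pass rates comparable even if one run covered a slightly different slice, and listing the helped and hurt tasks separately shows whether a flat net effect hides offsetting changes.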

Alternatives Considered

  • Hard-code prompts only in special_instructions — does not reuse CUGA policy types (playbook / tool_guide) or BPO conventions
  • Enable CUGA default policy auto-load from CWD — caused drift across per-domain agents; rejected in favor of explicit _load_m3_policies
  • Change Vakra judges or gold — out of scope for harness work tracked under [Feature]: Improve evaluation harness to improve CUGA score on Vakra (m3) #37

Priority

High - Important for my workflow

Additional Context

Metadata

    Projects

    Status
    Done
