[Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags)

## Feature Request

Add an M3-specific CUGA policy bundle (playbooks, tool guides, output formatters) with harness flags to run evals and compare sweeps **with** policies, **without** policies, or as an **A/B comparison** of both modes, mirroring the BPO benchmark workflow.

Deliverables:
- Eight policies under `benchmarks/m3/policies/`: seven enabled (P-OF-1, P-PB-1..4, P-TG-1..2) plus P-OF-2, which is disabled in frontmatter
- `scripts/policies_md_to_json.py` to compile MD → `policies.json`
- `_load_m3_policies()` with per-domain agents using `auto_load_policies=False` and `filesystem_sync=False`
- `--no-policies` and `--compare-policies` on `benchmarks/m3/eval.sh` and `compare.sh`
- Bundle directory suffix / metadata for policy mode in `benchmarks/helpers/bundle.py`
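
To make the MD → JSON deliverable concrete, here is a minimal sketch of what a compile step could look like. It is not the actual `scripts/policies_md_to_json.py`. It assumes each policy file starts with a `---` delimited frontmatter block of flat `key: value` pairs, that a disabled policy is marked `enabled: false`, and that the output is a JSON list of `id` / `type` / `body` objects. The function names and the output schema are hypothetical; the real CUGA policy schema may differ.

```python
import json
from pathlib import Path


def parse_policy_md(text: str) -> dict:
    """Split a policy file into frontmatter fields and a markdown body.

    Assumption: frontmatter is a leading block between two '---' lines
    containing flat 'key: value' pairs (no nesting, no lists).
    """
    lines = text.splitlines()
    meta = {}
    body_start = 0
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                body_start = i + 1
                break
            key, sep, value = lines[i].partition(":")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            # No closing delimiter found: treat the whole file as body.
            meta, body_start = {}, 0
    body = "\n".join(lines[body_start:]).strip()
    return {"meta": meta, "body": body}


def compile_policies(policy_dir: Path, out_path: Path) -> list:
    """Compile every enabled *.md policy in policy_dir into one JSON file."""
    policies = []
    for md_file in sorted(policy_dir.glob("*.md")):
        parsed = parse_policy_md(md_file.read_text(encoding="utf-8"))
        meta = parsed["meta"]
        # Hypothetical convention: 'enabled: false' skips the policy.
        if meta.get("enabled", "true").lower() == "false":
            continue
        policies.append(
            {
                "id": meta.get("id", md_file.stem),
                "type": meta.get("type", ""),
                "body": parsed["body"],
            }
        )
    out_path.write_text(json.dumps(policies, indent=2), encoding="utf-8")
    return policies
```

Under these assumptions, `compile_policies(Path("benchmarks/m3/policies"), Path("benchmarks/m3/policies/policies.json"))` would write only the enabled policies, so a file such as P-OF-2 stays on disk but never reaches the agent.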

## Motivation / Problem

M3 Vakra scoring should reflect agent behavior, but we also need a controlled way to test whether harness-level policies improve pass rate without changing judges or ground truth. Today there is no first-class policy path for M3 comparable to BPO, so policy experiments are ad hoc and hard to reproduce in evaluation bundles.

## Use Case

As someone running M3 benchmarks and compare sweeps, I want to:
- Run the same task slice with policies on vs off and record which bundle used which mode
- Iterate on policy markdown under `benchmarks/m3/policies/` and recompile to JSON before eval
- Measure whether policies help or hurt on a domain slice before rolling them into default evals

## Proposed Solution

1. Author policies as markdown under `benchmarks/m3/policies/`; compile to `policies.json` via `scripts/policies_md_to_json.py` (invoked from `eval.sh` before eval).
2. Load policies once per domain via `_load_m3_policies(agent)` after clearing stray policy DB state on the agent.
3. Expose `--no-policies` (skip load) and `--compare-policies` (run both modes in `compare.sh`).
4. Encode policy mode in bundle metadata / directory naming for reproducibility.
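
Steps 3 and 4 can be sketched as follows. This is an illustration only: the real flags live on the shell scripts and the real naming lives in `benchmarks/helpers/bundle.py`. The Python `argparse` mirror of the flags, the mode labels, the `__` suffix separator, and the metadata keys are all assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical Python mirror of the two shell flags."""
    parser = argparse.ArgumentParser(description="M3 eval policy mode (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--no-policies", action="store_true",
                       help="skip loading the M3 policy bundle")
    group.add_argument("--compare-policies", action="store_true",
                       help="run once with policies and once without")
    return parser


def policy_modes(no_policies: bool, compare_policies: bool) -> list:
    """Return the list of policy modes a sweep should run."""
    if compare_policies:
        return ["policies", "no-policies"]
    return ["no-policies"] if no_policies else ["policies"]


def bundle_dir_name(base: str, mode: str) -> str:
    """Append the policy mode to the bundle directory name (assumed format)."""
    return f"{base}__{mode}"


def bundle_metadata(mode: str) -> dict:
    """Metadata fragment recording which mode produced the bundle."""
    return {"policy_mode": mode, "policies_loaded": mode == "policies"}


if __name__ == "__main__":
    args = build_parser().parse_args()
    for mode in policy_modes(args.no_policies, args.compare_policies):
        print(bundle_dir_name("m3_eval", mode), bundle_metadata(mode))
```

Writing the mode into both the directory name and the metadata is redundant by design in this sketch: the name keeps the two bundles of a compare sweep from colliding, while the metadata survives if a bundle is renamed or moved.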

**Acceptance criteria**
- [ ] Policies load from `benchmarks/m3/policies/policies.json` when enabled
- [x] Compare sweep produces distinct bundles for with/without policies
- [ ] Net policy effect documented on a representative slice before default-on for full 200-case runs
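
For the last criterion, one simple way to express "net policy effect" is the difference in pass rate between the two modes on the same slice. The sketch below assumes each run is reduced to a pass/fail boolean. The function names are hypothetical, and the outcome lists are illustrative placeholders, not measured results.

```python
def pass_rate(outcomes: list) -> float:
    """Fraction of runs that passed; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def net_policy_effect(with_policies: list, without_policies: list) -> float:
    """Pass-rate delta; positive means policies helped on this slice."""
    return pass_rate(with_policies) - pass_rate(without_policies)


# Illustrative placeholder outcomes only (not real benchmark data).
with_policies = [True, False, True, False]
without_policies = [True, True, True, False]
effect = net_policy_effect(with_policies, without_policies)
print(f"net policy effect: {effect:+.2f}")
```

A negative value on a representative slice would argue against enabling policies by default for the full 200-case runs.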

## Alternatives Considered

- **Hard-code prompts only in `special_instructions`** — does not reuse CUGA policy types (playbook / tool_guide) or BPO conventions
- **Enable CUGA default policy auto-load from CWD** — caused drift across per-domain agents; rejected in favor of explicit `_load_m3_policies`
- **Change Vakra judges or gold** — out of scope for harness work tracked under #37

## Priority

High - Important for my workflow

## Additional Context

- **Parent / epic:** Sub-issue of #37 ([Feature]: Improve evaluation harness to improve CUGA score on Vakra (m3))
- **Implementation tracked in:** PR #3 (`fix/m3-harness-bugs`), closes this issue when merged
- **Early signal (4 PF × 5 runs, `codebase_comments`):** After fixing the tool-name prefix (#39), the no-policies run reached ~81% on that slice versus ~50% with policies. A full 200-case policy comparison is follow-up work
- **Analysis:** `docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md`


[Feature]: M3 evaluation policy bundle (playbooks, tool guides, compare flags) #38
