[Feature]: Evaluation bundle must include full Vakra judge scores in all artifacts (cuga + react)

## Feature Request

**Hard requirement: every item below must be included in the evaluation bundle** (`benchmarks/m3/evaluation_bundles/<timestamp>_<profile>/`). The bundle is the deliverable. Scores must not exist only on disk outside the bundle, in the Langfuse UI, or in ephemeral `/tmp` paths — they must be copied or generated into the bundle at assembly time.

For both the **cuga SDK** and **react** agent paths, the bundle must contain:

| Bundle path | What must be included |
|-------------|----------------------|
| `logs/console.log` | Per-task dialogue score, each individual judge score (`exactmatch`, `answer`, `groundedness`), turn-level partial scores, pass/fail, and judge explanations |
| `logs/registry.log` | Registry/server output for the run |
| `langfuse_traces/<task>_<trace_id>.json` | Full trace **with** post-hoc Vakra scores attached: `m3_success`, `m3_dialogue_score`, `m3_exactmatch_score`, `m3_answer_score`, `m3_groundedness_score` (not `"scores": []`) |
| `results/m3_*.json` | Each task entry with full `vakra` object (per-turn judge metadata + explanations), not just `success` / `match_rate` |
| `report.md` | Per-task table with dialogue score and per-judge breakdown, not only ✓/✗ |
| `trajectories/` (incl. `results.json`) | Vakra-corrected scores after `patch_tracker_scores` (issue #71) |

**React agent parity:** `--agent react` must produce a bundle with the same score content as the default cuga SDK path. Both agents call `vakra_score_results_async` and `print_vakra_summary`, but react runs through a separate flow (`eval_m3_react.py`), so its bundle must be checked explicitly and must not ship with missing or thinner score artifacts.

Bundles already exist — this is not about creating the bundle mechanism. It is about ensuring the bundle **includes** complete evaluator scores in every artifact channel, and fixing holes where scores are computed during the run but absent or incomplete inside the assembled bundle.

## Motivation / Problem

Vakra scoring produces rich judge output (per-turn `exactmatch_score`, `answer_score`, `groundedness_score`, `score_explanation`, dialogue-level `match_rate`). But when I open a bundle, the scores are often missing or incomplete **inside the bundle**:

| Bundle artifact | What we have today | Gap |
|-----------------|-------------------|-----|
| `logs/console.log` | Tee'd stdout/stderr; `print_vakra_summary` and `log_vakra_task_scores` run during eval | Judge detail may not land in the file copied into the bundle; passing tasks get only compact one-liners; react vs cuga formatting may differ |
| `langfuse_traces/` | `--fetch-langfuse` downloads traces at bundle assembly | Traces are often fetched **before** `push_vakra_scores_to_langfuse` completes — bundle JSON shows `"scores": []` |
| `report.md` | Generated by `compare_report` at bundle time | No per-judge scores, no dialogue score — only ✓/✗ pass-fail |
| `results/m3_*.json` | Result file copied into bundle | `vakra` dict may be present, but other bundle artifacts don't reflect it; partial runs may only have scores for completed domains |
| `trajectories/results.json` | Trajectory dir copied into bundle | Must carry Vakra-corrected scores (issue #71) |
| React agent bundle | Same `create_bundle` path via `eval.sh` | Must be verified end-to-end: all bundle paths above populated equivalently to cuga SDK |

As an evaluator, I need to open **one bundle directory** and find — for every scored task — what each judge said and what the final verdict was, regardless of which agent ran the eval.

## Use Case

As an evaluator of cuga-agent on the M3 benchmark:

- Run `eval.sh` (cuga SDK) or `eval.sh --agent react` and receive a bundle that is self-contained — no need to look outside it for scores
- Open `logs/console.log` and `langfuse_traces/` inside the bundle for a failing task and see `exactmatch=0.0 answer=1.0 groundedness=0.0` with explanations
- Use `analytics/trace_comparison_rules` and other downstream tools that consume bundles, knowing every required score field is present in the bundle
- On interrupted runs ([PR #4](https://github.com/cuga-project/cuga-eval/pull/4)), receive a partial bundle that still includes judge scores in all bundle artifacts for tasks that completed scoring before the interrupt

## Proposed Solution

All changes below target **bundle inclusion** — what `create_bundle` / `assemble_bundle` / `assemble_compare_bundle` must place inside the bundle directory.

### 1. `langfuse_traces/` in bundle must include judge scores

- Ensure `push_vakra_scores_to_langfuse` + `flush()` complete **before** `_download_langfuse_traces` runs in `create_bundle`.
- Retry/poll in `_download_langfuse_traces` until each trace's `scores` array contains the Vakra score names (not just until the trace exists).
- Verify react results include `trace_id` so scores are pushed and traces in the bundle are not scoreless.
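The polling condition above could be sketched roughly as follows. This is a minimal illustration, not the existing `_download_langfuse_traces` code: `fetch` is a hypothetical callable standing in for whatever downloads one trace, and the trace shape (a dict with a `scores` list of `{"name": ...}` entries) is an assumption about the downloaded JSON.

```python
import time
from typing import Callable, Set

# Score names listed in this issue; a bundled trace should carry all five.
VAKRA_SCORE_NAMES = frozenset({
    "m3_success",
    "m3_dialogue_score",
    "m3_exactmatch_score",
    "m3_answer_score",
    "m3_groundedness_score",
})


def missing_vakra_scores(trace: dict) -> Set[str]:
    """Return the Vakra score names absent from the trace's `scores` list."""
    present = {score.get("name") for score in trace.get("scores") or []}
    return set(VAKRA_SCORE_NAMES) - present


def fetch_trace_with_scores(
    fetch: Callable[[str], dict],  # hypothetical: downloads one trace by id
    trace_id: str,
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the trace carries every Vakra score, not merely until it exists."""
    trace: dict = {}
    for attempt in range(attempts):
        trace = fetch(trace_id)
        if not missing_vakra_scores(trace):
            return trace
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError(
        f"trace {trace_id} still missing scores: "
        f"{sorted(missing_vakra_scores(trace))}"
    )
```

Raising on timeout, rather than silently writing a scoreless trace, keeps an incomplete bundle from looking complete; whether assembly should fail hard or mark the trace as unscored is a design choice left open here.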

### 2. `logs/console.log` in bundle must include full score detail

- Ensure `print_vakra_summary` output is captured in `$CONSOLE_LOG` and that file is copied into `logs/` in the bundle for both cuga and react paths.
- Include per-judge scores (and explanations where available) for **all** scored tasks, not only failures.
- React `print_summary()` / cuga `_finalize_and_save_results` must emit the same summary into the tee'd stream before bundle assembly.
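One way to keep the cuga and react console output identical is to route both through a single formatter. The sketch below is illustrative only: `format_task_scores` is a hypothetical helper, the `[vakra]` line prefix is invented for the example, and the `vakra` dict layout (`score`, `details.per_turn[].metadata`) is assumed from the field names mentioned in this issue.

```python
from typing import Optional

JUDGES = ("exactmatch", "answer", "groundedness")


def _fmt(value: Optional[float]) -> str:
    """Render a score with one decimal, or 'n/a' when it is absent."""
    return "n/a" if value is None else f"{value:.1f}"


def format_task_scores(task_id: str, vakra: dict) -> str:
    """Build the console block for one task: verdict, dialogue score, per-turn judges."""
    score = vakra.get("score")
    # Pass threshold (>= 1.0) follows the per-dialogue rule stated in this issue.
    verdict = "PASS" if score is not None and score >= 1.0 else "FAIL"
    lines = [f"[vakra] {task_id} {verdict} dialogue={_fmt(score)}"]
    per_turn = (vakra.get("details") or {}).get("per_turn") or []
    for index, turn in enumerate(per_turn, start=1):
        meta = turn.get("metadata") or {}
        parts = [f"{judge}={_fmt(meta.get(judge + '_score'))}" for judge in JUDGES]
        lines.append(f"  turn {index}: " + " ".join(parts))
        explanations = meta.get("score_explanation") or {}
        for judge in JUDGES:
            if judge in explanations:
                lines.append(f"    {judge}: {explanations[judge]}")
    return "\n".join(lines)
```

Because the same block is emitted for passing and failing tasks, a passing task no longer collapses to a compact one-liner in `logs/console.log`.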

### 3. `report.md` in bundle must include evaluator scores

- Extend `compare_report.generate_eval_report` (and comparison variant) with per-task columns: dialogue score + `exactmatch` / `answer` / `groundedness`.
- Report is written into the bundle at assembly time — this is a bundle artifact, not terminal-only output.
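As an illustration of the extra columns, the following sketch renders a per-task Markdown table from result entries. It is not the current `generate_eval_report` implementation: `score_table` is a hypothetical helper, the `task_id` key is assumed, and averaging each judge over turns is only one possible way to get a task-level number.

```python
from statistics import mean
from typing import Dict, List

JUDGES = ("exactmatch", "answer", "groundedness")


def judge_means(vakra: dict) -> Dict[str, str]:
    """Average each judge's per-turn score (assumed aggregation), formatted for the table."""
    turns = (vakra.get("details") or {}).get("per_turn") or []
    cells = {}
    for judge in JUDGES:
        key = f"{judge}_score"
        values = [t["metadata"][key] for t in turns if key in (t.get("metadata") or {})]
        cells[judge] = f"{mean(values):.2f}" if values else "n/a"
    return cells


def score_table(results: List[dict]) -> str:
    """Render one Markdown row per task: verdict, dialogue score, per-judge breakdown."""
    rows = [
        "| Task | Result | Dialogue | exactmatch | answer | groundedness |",
        "|------|--------|----------|------------|--------|--------------|",
    ]
    for entry in results:
        vakra = entry.get("vakra") or {}
        score = vakra.get("score")
        if score is None:
            mark, dialogue = "n/a", "n/a"  # task was never scored
        else:
            mark, dialogue = ("✓" if score >= 1.0 else "✗"), f"{score:.2f}"
        cells = judge_means(vakra)
        rows.append(
            f"| {entry.get('task_id', '<unknown>')} | {mark} | {dialogue} | "
            + " | ".join(cells[j] for j in JUDGES)
            + " |"
        )
    return "\n".join(rows)
```

Showing `n/a` for unscored tasks, instead of ✗, also gives partial bundles a way to distinguish "failed" from "not scored before the interrupt".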

### 4. `results/m3_*.json` and `trajectories/` in bundle must be complete

- Confirm full `vakra.details.per_turn[].metadata` is serialized in the result JSON that gets copied into `results/` in the bundle.
- Confirm `patch_tracker_scores` runs in both `eval_m3.py` and `eval_m3_react.py` before trajectory dir is copied into the bundle.
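The completeness check on the result JSON might look like the sketch below. It assumes each result entry has a `task_id` and a `vakra.details.per_turn[].metadata` dict with the per-turn fields named in this issue; `find_incomplete_tasks` is a hypothetical helper, not an existing function.

```python
from typing import Dict, List

# Per-turn fields this issue requires in the serialized result JSON.
REQUIRED_TURN_FIELDS = (
    "exactmatch_score",
    "answer_score",
    "groundedness_score",
    "score_explanation",
)


def find_incomplete_tasks(results: List[dict]) -> Dict[str, List[str]]:
    """Map task id -> list of missing fields; an empty dict means every task is complete."""
    problems: Dict[str, List[str]] = {}
    for entry in results:
        task_id = entry.get("task_id", "<unknown>")
        details = (entry.get("vakra") or {}).get("details") or {}
        turns = details.get("per_turn")
        if not turns:
            problems[task_id] = ["vakra.details.per_turn"]
            continue
        missing: List[str] = []
        for index, turn in enumerate(turns):
            meta = turn.get("metadata") or {}
            missing.extend(
                f"per_turn[{index}].metadata.{field}"
                for field in REQUIRED_TURN_FIELDS
                if field not in meta
            )
        if missing:
            problems[task_id] = missing
    return problems
```

Running such a check on the file that is about to be copied into `results/` would catch entries that carry only `success` / `match_rate`.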

### 5. React agent bundle parity

- React and cuga SDK bundles must include the same score fields in the same bundle paths.
- Add regression/smoke tests that assemble a bundle and assert score fields exist in `logs/`, `langfuse_traces/`, `results/`, `report.md`, and `trajectories/`.
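A bundle-level smoke check could start from something like the following. This is a sketch under assumptions: `check_bundle` is a hypothetical helper, each trace file is assumed to be a JSON object with a top-level `scores` list, and the check only covers presence of the artifacts, not the detailed score content of each one.

```python
import json
from pathlib import Path
from typing import List


def check_bundle(bundle: Path) -> List[str]:
    """Return a list of problems found in an assembled bundle; empty means it looks complete."""
    problems: List[str] = []
    # Fixed-name artifacts that every bundle is expected to contain.
    for rel in ("logs/console.log", "report.md", "trajectories/results.json"):
        if not (bundle / rel).is_file():
            problems.append(f"missing {rel}")
    if not list(bundle.glob("results/m3_*.json")):
        problems.append("missing results/m3_*.json")
    traces = sorted(bundle.glob("langfuse_traces/*.json"))
    if not traces:
        problems.append("missing langfuse_traces/*.json")
    for path in traces:
        trace = json.loads(path.read_text(encoding="utf-8"))
        if not trace.get("scores"):  # catches both a missing key and "scores": []
            problems.append(f"{path.name}: empty scores")
    return problems
```

The same function can be run against a cuga SDK bundle and a `--agent react` bundle, with a parity test asserting that both return an empty list.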

### Acceptance criteria

Every criterion is about content **inside the evaluation bundle**:

- [ ] Bundle `langfuse_traces/` contains trace JSON with non-empty `scores` (all five Vakra score names) for every scored task with `trace_id`
- [ ] Bundle `logs/console.log` contains per-task dialogue score + per-judge scores for all scored tasks (cuga and react)
- [ ] Bundle `report.md` contains per-judge scores and dialogue score per task (not just ✓/✗)
- [ ] Bundle `results/m3_*.json` contains full `vakra` per-turn judge metadata for both agents
- [ ] Bundle `trajectories/results.json` contains Vakra-corrected scores
- [ ] React (`--agent react`) and cuga SDK bundles include equivalent score content
- [ ] Partial bundles (PR #4): completed tasks have scores in all bundle paths above; incomplete tasks are clearly marked as such
- [ ] Tests validate bundle contents (not just runtime logging)

## Alternatives Considered

- **Scores only in `_vakra/results.json` on disk**: that file is not included in the bundle today, so bundle consumers can't see the scores.
- **Langfuse UI only**: scores may exist remotely, but the traces downloaded into the bundle don't include them.
- **Console-only during eval**: unless the output is copied into `logs/` in the bundle, it's not part of the deliverable.

## Priority

High - Important for my workflow

## Additional Context

**Evaluator scores that must appear in the bundle** (from `benchmarks/m3/evaluator/scorer.py` and `m3_vakra_score.py`):

- Per-turn: `exactmatch_score`, `answer_score`, `groundedness_score`, `score_explanation.{exactmatch,answer,groundedness}`
- Per-dialogue: `match_rate` / `vakra.score` (aggregated; a value >= 1.0 counts as a pass)
- Langfuse names (must be in bundle `langfuse_traces/`): `m3_success`, `m3_dialogue_score`, `m3_exactmatch_score`, `m3_answer_score`, `m3_groundedness_score`

**Relevant code:**

- `benchmarks/m3/eval.sh` — `create_bundle()` assembles bundle, copies logs, fetches Langfuse
- `benchmarks/helpers/bundle.py` — `assemble_bundle`, `_download_langfuse_traces`, `_copy_logs`
- `benchmarks/m3/m3_vakra_score.py` — scoring, Langfuse push, console summary
- `benchmarks/m3/eval_m3.py` / `eval_m3_react.py` — cuga vs react eval paths
- `benchmarks/helpers/compare_report.py` — `report.md` generation
- `benchmarks/m3/tests/test_vakra_langfuse_scores.py`

**Related work:**

- [PR #4](https://github.com/cuga-project/cuga-eval/pull/4) — partial bundles on interrupt; scores for completed tasks must still be in all bundle paths
- Issue #71 — trajectory `results.json` vs Vakra scores alignment
