Skip to content

[Feature]: Evaluation bundle must include full Vakra judge scores in all artifacts (cuga + react) #56

@haroldship

Description

@haroldship

Feature Request

Hard requirement: every item below must be included in the evaluation bundle (benchmarks/m3/evaluation_bundles/<timestamp>_<profile>/). The bundle is the deliverable. Scores must not exist only on disk outside the bundle, in the Langfuse UI, or in ephemeral /tmp paths — they must be copied or generated into the bundle at assembly time.

For both the cuga SDK and react agent paths, the bundle must contain:

Bundle path What must be included
logs/console.log Per-task dialogue score, each individual judge score (exactmatch, answer, groundedness), turn-level partial scores, pass/fail, and judge explanations
logs/registry.log Registry/server output for the run
langfuse_traces/<task>_<trace_id>.json Full trace with post-hoc Vakra scores attached: m3_success, m3_dialogue_score, m3_exactmatch_score, m3_answer_score, m3_groundedness_score (not "scores": [])
results/m3_*.json Each task entry with full vakra object (per-turn judge metadata + explanations), not just success / match_rate
report.md Per-task table with dialogue score and per-judge breakdown, not only ✓/✗
trajectories/ (incl. results.json) Vakra-corrected scores after patch_tracker_scores (issue #71)

React agent parity: --agent react must produce a bundle with the same score content as the default cuga SDK path. Both agents call vakra_score_results_async and print_vakra_summary, but react's separate flow (eval_m3_react.py) must not ship a bundle with missing or thinner score artifacts.

Bundles already exist — this is not about creating the bundle mechanism. It is about ensuring the bundle includes complete evaluator scores in every artifact channel, and fixing holes where scores are computed during the run but absent or incomplete inside the assembled bundle.

Motivation / Problem

Vakra scoring produces rich judge output (per-turn exactmatch_score, answer_score, groundedness_score, score_explanation, dialogue-level match_rate). But when I open a bundle, the scores are often missing or incomplete inside the bundle:

Bundle artifact What we have today Gap
logs/console.log Tee'd stdout/stderr; print_vakra_summary and log_vakra_task_scores run during eval Judge detail may not land in the file copied into the bundle; passing tasks get only compact one-liners; react vs cuga formatting may differ
langfuse_traces/ --fetch-langfuse downloads traces at bundle assembly Traces are often fetched before push_vakra_scores_to_langfuse completes — bundle JSON shows "scores": []
report.md Generated by compare_report at bundle time No per-judge scores, no dialogue score — only ✓/✗ pass-fail
results/m3_*.json Result file copied into bundle vakra dict may be present, but other bundle artifacts don't reflect it; partial runs may only have scores for completed domains
trajectories/results.json Trajectory dir copied into bundle Must carry Vakra-corrected scores (issue #71)
React agent bundle Same create_bundle path via eval.sh Must be verified end-to-end: all bundle paths above populated equivalently to cuga SDK

As an evaluator, I need to open one bundle directory and find — for every scored task — what each judge said and what the final verdict was, regardless of which agent ran the eval.

Use Case

As an evaluator of cuga-agent on the M3 benchmark:

  • Run eval.sh (cuga SDK) or eval.sh --agent react and receive a bundle that is self-contained — no need to look outside it for scores
  • Open logs/console.log and langfuse_traces/ inside the bundle for a failing task and see exactmatch=0.0 answer=1.0 groundedness=0.0 with explanations
  • Use analytics/trace_comparison_rules and other downstream tools that consume bundles, knowing every required score field is present in the bundle
  • On interrupted runs (PR #4), the partial bundle still includes judge scores in all bundle artifacts for tasks that completed scoring before the interrupt

Proposed Solution

All changes below target bundle inclusion — what create_bundle / assemble_bundle / assemble_compare_bundle must place inside the bundle directory.

1. langfuse_traces/ in bundle must include judge scores

  • Ensure push_vakra_scores_to_langfuse + flush() complete before _download_langfuse_traces runs in create_bundle.
  • Retry/poll in _download_langfuse_traces until each trace's scores array contains the Vakra score names (not just until the trace exists).
  • Verify react results include trace_id so scores are pushed and traces in the bundle are not scoreless.

2. logs/console.log in bundle must include full score detail

  • Ensure print_vakra_summary output is captured in $CONSOLE_LOG and that file is copied into logs/ in the bundle for both cuga and react paths.
  • Include per-judge scores (and explanations where available) for all scored tasks, not only failures.
  • React print_summary() / cuga _finalize_and_save_results must emit the same summary into the tee'd stream before bundle assembly.

3. report.md in bundle must include evaluator scores

  • Extend compare_report.generate_eval_report (and comparison variant) with per-task columns: dialogue score + exactmatch / answer / groundedness.
  • Report is written into the bundle at assembly time — this is a bundle artifact, not terminal-only output.

4. results/m3_*.json and trajectories/ in bundle must be complete

  • Confirm full vakra.details.per_turn[].metadata is serialized in the result JSON that gets copied into results/ in the bundle.
  • Confirm patch_tracker_scores runs in both eval_m3.py and eval_m3_react.py before trajectory dir is copied into the bundle.

5. React agent bundle parity

  • React and cuga SDK bundles must include the same score fields in the same bundle paths.
  • Add regression/smoke tests that assemble a bundle and assert score fields exist in logs/, langfuse_traces/, results/, report.md, and trajectories/.

Acceptance criteria

Every criterion is about content inside the evaluation bundle:

  • Bundle langfuse_traces/ contains trace JSON with non-empty scores (all five Vakra score names) for every scored task with trace_id
  • Bundle logs/console.log contains per-task dialogue score + per-judge scores for all scored tasks (cuga and react)
  • Bundle report.md contains per-judge scores and dialogue score per task (not just ✓/✗)
  • Bundle results/m3_*.json contains full vakra per-turn judge metadata for both agents
  • Bundle trajectories/results.json contains Vakra-corrected scores
  • React (--agent react) and cuga SDK bundles include equivalent score content
  • Partial bundles (PR fix(m3): bundle on interrupt and save partial results on crash #4): completed tasks have scores in all bundle paths above; incomplete tasks clearly marked
  • Tests validate bundle contents (not just runtime logging)

Alternatives Considered

  • Scores only in _vakra/results.json on disk — not included in bundle today; bundle consumers can't see it.
  • Langfuse UI only — scores may exist remotely but bundle-downloaded traces don't include them.
  • Console-only during eval — unless copied into logs/ in the bundle, it's not part of the deliverable.

Priority

High - Important for my workflow

Additional Context

Evaluator scores that must appear in the bundle (from benchmarks/m3/evaluator/scorer.py and m3_vakra_score.py):

  • Per-turn: exactmatch_score, answer_score, groundedness_score, score_explanation.{exactmatch,answer,groundedness}
  • Per-dialogue: match_rate / vakra.score (aggregated; >= 1.0 = pass)
  • Langfuse names (must be in bundle langfuse_traces/): m3_success, m3_dialogue_score, m3_exactmatch_score, m3_answer_score, m3_groundedness_score

Relevant code:

  • benchmarks/m3/eval.shcreate_bundle() assembles bundle, copies logs, fetches Langfuse
  • benchmarks/helpers/bundle.pyassemble_bundle, _download_langfuse_traces, _copy_logs
  • benchmarks/m3/m3_vakra_score.py — scoring, Langfuse push, console summary
  • benchmarks/m3/eval_m3.py / eval_m3_react.py — cuga vs react eval paths
  • benchmarks/helpers/compare_report.pyreport.md generation
  • benchmarks/m3/tests/test_vakra_langfuse_scores.py

Related work:

  • PR #4 — partial bundles on interrupt; scores for completed tasks must still be in all bundle paths
  • Issue #71 — trajectory results.json vs Vakra scores alignment

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

Status
Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions