Feature Request
Hard requirement: every item below must be included in the evaluation bundle (benchmarks/m3/evaluation_bundles/<timestamp>_<profile>/). The bundle is the deliverable. Scores must not exist only on disk outside the bundle, in the Langfuse UI, or in ephemeral /tmp paths — they must be copied or generated into the bundle at assembly time.
For both the cuga SDK and react agent paths, the bundle must contain:
| Bundle path |
What must be included |
logs/console.log |
Per-task dialogue score, each individual judge score (exactmatch, answer, groundedness), turn-level partial scores, pass/fail, and judge explanations |
logs/registry.log |
Registry/server output for the run |
langfuse_traces/<task>_<trace_id>.json |
Full trace with post-hoc Vakra scores attached: m3_success, m3_dialogue_score, m3_exactmatch_score, m3_answer_score, m3_groundedness_score (not "scores": []) |
results/m3_*.json |
Each task entry with full vakra object (per-turn judge metadata + explanations), not just success / match_rate |
report.md |
Per-task table with dialogue score and per-judge breakdown, not only ✓/✗ |
trajectories/ (incl. results.json) |
Vakra-corrected scores after patch_tracker_scores (issue #71) |
React agent parity: --agent react must produce a bundle with the same score content as the default cuga SDK path. Both agents call vakra_score_results_async and print_vakra_summary, but react's separate flow (eval_m3_react.py) must not ship a bundle with missing or thinner score artifacts.
Bundles already exist — this is not about creating the bundle mechanism. It is about ensuring the bundle includes complete evaluator scores in every artifact channel, and fixing holes where scores are computed during the run but absent or incomplete inside the assembled bundle.
Motivation / Problem
Vakra scoring produces rich judge output (per-turn exactmatch_score, answer_score, groundedness_score, score_explanation, dialogue-level match_rate). But when I open a bundle, the scores are often missing or incomplete inside the bundle:
| Bundle artifact |
What we have today |
Gap |
logs/console.log |
Tee'd stdout/stderr; print_vakra_summary and log_vakra_task_scores run during eval |
Judge detail may not land in the file copied into the bundle; passing tasks get only compact one-liners; react vs cuga formatting may differ |
langfuse_traces/ |
--fetch-langfuse downloads traces at bundle assembly |
Traces are often fetched before push_vakra_scores_to_langfuse completes — bundle JSON shows "scores": [] |
report.md |
Generated by compare_report at bundle time |
No per-judge scores, no dialogue score — only ✓/✗ pass-fail |
results/m3_*.json |
Result file copied into bundle |
vakra dict may be present, but other bundle artifacts don't reflect it; partial runs may only have scores for completed domains |
trajectories/results.json |
Trajectory dir copied into bundle |
Must carry Vakra-corrected scores (issue #71) |
| React agent bundle |
Same create_bundle path via eval.sh |
Must be verified end-to-end: all bundle paths above populated equivalently to cuga SDK |
As an evaluator, I need to open one bundle directory and find — for every scored task — what each judge said and what the final verdict was, regardless of which agent ran the eval.
Use Case
As an evaluator of cuga-agent on the M3 benchmark:
- Run
eval.sh (cuga SDK) or eval.sh --agent react and receive a bundle that is self-contained — no need to look outside it for scores
- Open
logs/console.log and langfuse_traces/ inside the bundle for a failing task and see exactmatch=0.0 answer=1.0 groundedness=0.0 with explanations
- Use
analytics/trace_comparison_rules and other downstream tools that consume bundles, knowing every required score field is present in the bundle
- On interrupted runs (PR #4), the partial bundle still includes judge scores in all bundle artifacts for tasks that completed scoring before the interrupt
Proposed Solution
All changes below target bundle inclusion — what create_bundle / assemble_bundle / assemble_compare_bundle must place inside the bundle directory.
1. langfuse_traces/ in bundle must include judge scores
- Ensure
push_vakra_scores_to_langfuse + flush() complete before _download_langfuse_traces runs in create_bundle.
- Retry/poll in
_download_langfuse_traces until each trace's scores array contains the Vakra score names (not just until the trace exists).
- Verify react results include
trace_id so scores are pushed and traces in the bundle are not scoreless.
2. logs/console.log in bundle must include full score detail
- Ensure
print_vakra_summary output is captured in $CONSOLE_LOG and that file is copied into logs/ in the bundle for both cuga and react paths.
- Include per-judge scores (and explanations where available) for all scored tasks, not only failures.
- React
print_summary() / cuga _finalize_and_save_results must emit the same summary into the tee'd stream before bundle assembly.
3. report.md in bundle must include evaluator scores
- Extend
compare_report.generate_eval_report (and comparison variant) with per-task columns: dialogue score + exactmatch / answer / groundedness.
- Report is written into the bundle at assembly time — this is a bundle artifact, not terminal-only output.
4. results/m3_*.json and trajectories/ in bundle must be complete
- Confirm full
vakra.details.per_turn[].metadata is serialized in the result JSON that gets copied into results/ in the bundle.
- Confirm
patch_tracker_scores runs in both eval_m3.py and eval_m3_react.py before trajectory dir is copied into the bundle.
5. React agent bundle parity
- React and cuga SDK bundles must include the same score fields in the same bundle paths.
- Add regression/smoke tests that assemble a bundle and assert score fields exist in
logs/, langfuse_traces/, results/, report.md, and trajectories/.
Acceptance criteria
Every criterion is about content inside the evaluation bundle:
Alternatives Considered
- Scores only in
_vakra/results.json on disk — not included in bundle today; bundle consumers can't see it.
- Langfuse UI only — scores may exist remotely but bundle-downloaded traces don't include them.
- Console-only during eval — unless copied into
logs/ in the bundle, it's not part of the deliverable.
Priority
High - Important for my workflow
Additional Context
Evaluator scores that must appear in the bundle (from benchmarks/m3/evaluator/scorer.py and m3_vakra_score.py):
- Per-turn:
exactmatch_score, answer_score, groundedness_score, score_explanation.{exactmatch,answer,groundedness}
- Per-dialogue:
match_rate / vakra.score (aggregated; >= 1.0 = pass)
- Langfuse names (must be in bundle
langfuse_traces/): m3_success, m3_dialogue_score, m3_exactmatch_score, m3_answer_score, m3_groundedness_score
Relevant code:
benchmarks/m3/eval.sh — create_bundle() assembles bundle, copies logs, fetches Langfuse
benchmarks/helpers/bundle.py — assemble_bundle, _download_langfuse_traces, _copy_logs
benchmarks/m3/m3_vakra_score.py — scoring, Langfuse push, console summary
benchmarks/m3/eval_m3.py / eval_m3_react.py — cuga vs react eval paths
benchmarks/helpers/compare_report.py — report.md generation
benchmarks/m3/tests/test_vakra_langfuse_scores.py
Related work:
- PR #4 — partial bundles on interrupt; scores for completed tasks must still be in all bundle paths
- Issue #71 — trajectory
results.json vs Vakra scores alignment
Feature Request
Hard requirement: every item below must be included in the evaluation bundle (
benchmarks/m3/evaluation_bundles/<timestamp>_<profile>/). The bundle is the deliverable. Scores must not exist only on disk outside the bundle, in the Langfuse UI, or in ephemeral/tmppaths — they must be copied or generated into the bundle at assembly time.For both the cuga SDK and react agent paths, the bundle must contain:
logs/console.logexactmatch,answer,groundedness), turn-level partial scores, pass/fail, and judge explanationslogs/registry.loglangfuse_traces/<task>_<trace_id>.jsonm3_success,m3_dialogue_score,m3_exactmatch_score,m3_answer_score,m3_groundedness_score(not"scores": [])results/m3_*.jsonvakraobject (per-turn judge metadata + explanations), not justsuccess/match_ratereport.mdtrajectories/(incl.results.json)patch_tracker_scores(issue #71)React agent parity:
--agent reactmust produce a bundle with the same score content as the default cuga SDK path. Both agents callvakra_score_results_asyncandprint_vakra_summary, but react's separate flow (eval_m3_react.py) must not ship a bundle with missing or thinner score artifacts.Bundles already exist — this is not about creating the bundle mechanism. It is about ensuring the bundle includes complete evaluator scores in every artifact channel, and fixing holes where scores are computed during the run but absent or incomplete inside the assembled bundle.
Motivation / Problem
Vakra scoring produces rich judge output (per-turn
exactmatch_score,answer_score,groundedness_score,score_explanation, dialogue-levelmatch_rate). But when I open a bundle, the scores are often missing or incomplete inside the bundle:logs/console.logprint_vakra_summaryandlog_vakra_task_scoresrun during evallangfuse_traces/--fetch-langfusedownloads traces at bundle assemblypush_vakra_scores_to_langfusecompletes — bundle JSON shows"scores": []report.mdcompare_reportat bundle timeresults/m3_*.jsonvakradict may be present, but other bundle artifacts don't reflect it; partial runs may only have scores for completed domainstrajectories/results.jsoncreate_bundlepath viaeval.shAs an evaluator, I need to open one bundle directory and find — for every scored task — what each judge said and what the final verdict was, regardless of which agent ran the eval.
Use Case
As an evaluator of cuga-agent on the M3 benchmark:
eval.sh(cuga SDK) oreval.sh --agent reactand receive a bundle that is self-contained — no need to look outside it for scoreslogs/console.logandlangfuse_traces/inside the bundle for a failing task and seeexactmatch=0.0 answer=1.0 groundedness=0.0with explanationsanalytics/trace_comparison_rulesand other downstream tools that consume bundles, knowing every required score field is present in the bundleProposed Solution
All changes below target bundle inclusion — what
create_bundle/assemble_bundle/assemble_compare_bundlemust place inside the bundle directory.1.
langfuse_traces/in bundle must include judge scorespush_vakra_scores_to_langfuse+flush()complete before_download_langfuse_tracesruns increate_bundle._download_langfuse_tracesuntil each trace'sscoresarray contains the Vakra score names (not just until the trace exists).trace_idso scores are pushed and traces in the bundle are not scoreless.2.
logs/console.login bundle must include full score detailprint_vakra_summaryoutput is captured in$CONSOLE_LOGand that file is copied intologs/in the bundle for both cuga and react paths.print_summary()/ cuga_finalize_and_save_resultsmust emit the same summary into the tee'd stream before bundle assembly.3.
report.mdin bundle must include evaluator scorescompare_report.generate_eval_report(and comparison variant) with per-task columns: dialogue score +exactmatch/answer/groundedness.4.
results/m3_*.jsonandtrajectories/in bundle must be completevakra.details.per_turn[].metadatais serialized in the result JSON that gets copied intoresults/in the bundle.patch_tracker_scoresruns in botheval_m3.pyandeval_m3_react.pybefore trajectory dir is copied into the bundle.5. React agent bundle parity
logs/,langfuse_traces/,results/,report.md, andtrajectories/.Acceptance criteria
Every criterion is about content inside the evaluation bundle:
langfuse_traces/contains trace JSON with non-emptyscores(all five Vakra score names) for every scored task withtrace_idlogs/console.logcontains per-task dialogue score + per-judge scores for all scored tasks (cuga and react)report.mdcontains per-judge scores and dialogue score per task (not just ✓/✗)results/m3_*.jsoncontains fullvakraper-turn judge metadata for both agentstrajectories/results.jsoncontains Vakra-corrected scores--agent react) and cuga SDK bundles include equivalent score contentAlternatives Considered
_vakra/results.jsonon disk — not included in bundle today; bundle consumers can't see it.logs/in the bundle, it's not part of the deliverable.Priority
High - Important for my workflow
Additional Context
Evaluator scores that must appear in the bundle (from
benchmarks/m3/evaluator/scorer.pyandm3_vakra_score.py):exactmatch_score,answer_score,groundedness_score,score_explanation.{exactmatch,answer,groundedness}match_rate/vakra.score(aggregated; >= 1.0 = pass)langfuse_traces/):m3_success,m3_dialogue_score,m3_exactmatch_score,m3_answer_score,m3_groundedness_scoreRelevant code:
benchmarks/m3/eval.sh—create_bundle()assembles bundle, copies logs, fetches Langfusebenchmarks/helpers/bundle.py—assemble_bundle,_download_langfuse_traces,_copy_logsbenchmarks/m3/m3_vakra_score.py— scoring, Langfuse push, console summarybenchmarks/m3/eval_m3.py/eval_m3_react.py— cuga vs react eval pathsbenchmarks/helpers/compare_report.py—report.mdgenerationbenchmarks/m3/tests/test_vakra_langfuse_scores.pyRelated work:
results.jsonvs Vakra scores alignment