Background
PR #4 (fix/issue-91-92-eval-resilience) hoists all_results before the main try block in run_config_mode and saves partial results in the KeyboardInterrupt/Exception handlers. This reduces the blast radius from "lose the entire run" to "lose at most the in-flight task/batch".
However, all_results is only updated after evaluate_single_task or evaluate_tasks_in_batches returns in full. If a Ctrl-C or exception lands mid-task (e.g. while a later domain within the same task is still running), the outer handlers persist an empty or incomplete file — the completed domains' results exist only in inner locals.
Discussed in: #4 (comment)
Goal
Achieve domain-level checkpoint granularity so that any domain that completes before an interrupt is preserved, regardless of whether its parent task finished.
Proposed approach
Thread a checkpoint_callback parameter (or, alternatively, a shared mutable shared_results list) into evaluate_single_task and evaluate_tasks_in_batches. After each domain (or batch) result is appended in the inner helpers, invoke the callback to flush those results into the outer all_results before the next await. The outer interrupt handlers then persist whatever has accumulated.
Rough sketch:
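# run_config_mode (needs: from typing import Any, Dict, List)
all_results: List[Dict[str, Any]] = []

def _checkpoint(results):
    # Flush newly completed results into the outer accumulator.
    all_results.extend(results)

# pass _checkpoint into evaluate_single_task / evaluate_tasks_in_batches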
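Within evaluate_single_task, call checkpoint_callback(evaluator.results) after each domain loop iteration completes successfully. If evaluator.results accumulates across domains, pass only the entries added since the previous call, so that all_results.extend does not duplicate earlier results.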
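A slightly fuller sketch of the same idea, to make the control flow concrete. It is not the real eval_m3.py code: run_domain, the domains argument, and the simplified signatures are hypothetical stand-ins, and the actual helpers almost certainly take different parameters. It assumes the callback receives only the newly completed result, and that the outer loop does not also extend all_results with the helper's return value.

```python
import asyncio
from typing import Any, Callable, Dict, List, Optional

Result = Dict[str, Any]
CheckpointFn = Callable[[List[Result]], None]


async def run_domain(task: str, domain: str) -> Result:
    """Hypothetical stand-in for evaluating one domain of a task."""
    await asyncio.sleep(0)
    if domain == "boom":
        raise RuntimeError(f"{task}/{domain} failed")
    return {"task": task, "domain": domain, "score": 1.0}


async def evaluate_single_task(
    task: str,
    domains: List[str],
    checkpoint_callback: Optional[CheckpointFn] = None,
) -> List[Result]:
    """Simplified stand-in: checkpoints after every completed domain."""
    results: List[Result] = []
    for domain in domains:
        result = await run_domain(task, domain)
        results.append(result)
        if checkpoint_callback is not None:
            # Pass only the new result so the outer list never sees duplicates.
            checkpoint_callback([result])
    return results


def run_config_mode(tasks: Dict[str, List[str]]) -> List[Result]:
    """Simplified stand-in for the outer loop and its interrupt handlers."""
    all_results: List[Result] = []

    def _checkpoint(results: List[Result]) -> None:
        all_results.extend(results)

    async def _main() -> None:
        for task, domains in tasks.items():
            # The return value is ignored here: every result has already
            # been flushed through _checkpoint.
            await evaluate_single_task(task, domains, _checkpoint)

    try:
        asyncio.run(_main())
    except (KeyboardInterrupt, Exception) as exc:
        # The real handlers would write all_results to disk at this point.
        print(f"interrupted ({exc!r}); keeping {len(all_results)} results")
    return all_results
```

With a task whose third domain fails, the first two domains are still present in the returned list, which is the behaviour the Goal section asks for.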
Scope
- benchmarks/m3/eval_m3.py: run_config_mode, evaluate_single_task, evaluate_tasks_in_batches
- May also want to consider the same pattern for compare.sh sequential domain loops, if applicable
- Add or extend regression tests in benchmarks/m3/tests/test_partial_save_on_interrupt.py
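One possible shape for such a regression test, as a minimal sketch. fake_evaluate_single_task and run_and_save are hypothetical miniatures written only for this example. A real test would exercise the actual helpers in eval_m3.py and whatever save path run_config_mode uses. The Ctrl-C is simulated by raising KeyboardInterrupt inside the coroutine.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]


async def fake_evaluate_single_task(
    domains: List[str],
    checkpoint_callback: Callable[[List[Result]], None],
) -> List[Result]:
    """Hypothetical stand-in: checkpoints each domain, then is interrupted."""
    done: List[Result] = []
    for domain in domains:
        if domain == "interrupt-here":
            raise KeyboardInterrupt  # simulates Ctrl-C landing mid-task
        await asyncio.sleep(0)
        done.append({"domain": domain})
        checkpoint_callback(done[-1:])  # only the newly completed domain
    return done


def run_and_save(domains: List[str], out_path: Path) -> None:
    """Hypothetical miniature of the save-on-interrupt path."""
    all_results: List[Result] = []
    try:
        asyncio.run(fake_evaluate_single_task(domains, all_results.extend))
    except KeyboardInterrupt:
        pass  # fall through and persist whatever was checkpointed
    out_path.write_text(json.dumps(all_results))


def test_completed_domains_survive_interrupt() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "partial.json"
        run_and_save(["d1", "d2", "interrupt-here", "d3"], out_path)
        saved = json.loads(out_path.read_text())
    # Domains finished before the interrupt are kept; later ones never ran.
    assert [r["domain"] for r in saved] == ["d1", "d2"]
```

The key assertion is that the saved file contains the domains completed before the interrupt even though the task as a whole never returned.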