feat(m3): domain-level checkpoint callback for partial saves on interrupt #54

@coderabbitai

Background

PR #4 (fix/issue-91-92-eval-resilience) hoists all_results before the main try block in run_config_mode and saves partial results in the KeyboardInterrupt/Exception handlers. This reduces the blast radius from "lose the entire run" to "lose at most the in-flight task/batch".

However, all_results is only updated after evaluate_single_task or evaluate_tasks_in_batches returns in full. If a Ctrl-C or exception lands mid-task (e.g. while a later domain within the same task is still running), the outer handlers persist an empty or incomplete file, because the completed domains' results exist only in inner locals.

Discussed in: #4 (comment)

Goal

Achieve domain-level checkpoint granularity so that any domain that completes before an interrupt is preserved, regardless of whether its parent task finished.

Proposed approach

Thread a checkpoint_callback parameter (or a shared mutable shared_results list) into evaluate_single_task and evaluate_tasks_in_batches. After each domain (or batch) result is appended in the inner helpers, invoke the callback to flush that result into the outer all_results before the next await. The outer interrupt handlers then persist whatever has accumulated.

Rough sketch:

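from typing import Any, Dict, List

# run_config_mode
all_results: List[Dict[str, Any]] = []

def _checkpoint(results: List[Dict[str, Any]]) -> None:
    # Append newly completed results to the outer list.
    all_results.extend(results)

# pass _checkpoint into evaluate_single_task / evaluate_tasks_in_batches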

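Within evaluate_single_task, call checkpoint_callback(evaluator.results) after each domain loop iteration completes successfully. If evaluator.results accumulates across domains, pass only the entries added since the previous call; _checkpoint uses extend, so passing the full list each time would duplicate earlier domains in all_results.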
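
A minimal end-to-end sketch of the idea, under stated assumptions: run_domain, the domain names, the fail_on parameter, and the simplified evaluate_single_task and run_config_mode signatures are hypothetical stand-ins, not the actual code in eval_m3.py. It passes only the newly completed domain's result to the callback, so extend never duplicates entries.

```python
import asyncio
from typing import Any, Callable, Dict, List, Optional

Result = Dict[str, Any]
CheckpointCallback = Callable[[List[Result]], None]


async def run_domain(task: str, domain: str, fail_on: Optional[str]) -> Result:
    """Hypothetical stand-in for evaluating one domain of a task."""
    await asyncio.sleep(0)  # yield control, as a real evaluation would
    if domain == fail_on:
        raise RuntimeError(f"simulated failure in {domain}")
    return {"task": task, "domain": domain, "score": 1.0}


async def evaluate_single_task(
    task: str,
    domains: List[str],
    checkpoint_callback: Optional[CheckpointCallback] = None,
    fail_on: Optional[str] = None,
) -> List[Result]:
    """Simplified, assumed signature; the real function takes more arguments."""
    results: List[Result] = []
    for domain in domains:
        result = await run_domain(task, domain, fail_on)
        results.append(result)
        if checkpoint_callback is not None:
            # Flush only the new result before the next await.
            checkpoint_callback([result])
    return results


def run_config_mode(fail_on: Optional[str] = None) -> List[Result]:
    all_results: List[Result] = []  # hoisted before the try block

    def _checkpoint(results: List[Result]) -> None:
        all_results.extend(results)

    try:
        asyncio.run(
            evaluate_single_task("task_a", ["d1", "d2", "d3"], _checkpoint, fail_on)
        )
    except (KeyboardInterrupt, Exception):
        # The real handlers would persist all_results to disk here.
        pass
    return all_results
```

In this sketch, a failure in the third domain still leaves the first two domains' results in all_results, which is the granularity the Goal section asks for. Making the callback optional keeps existing call sites working unchanged.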

Scope

  • benchmarks/m3/eval_m3.py: run_config_mode, evaluate_single_task, evaluate_tasks_in_batches
  • Consider applying the same pattern to the sequential domain loops in compare.sh, if applicable
  • Add or extend regression tests in benchmarks/m3/tests/test_partial_save_on_interrupt.py
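
A regression test for this could take roughly the following shape. This is a sketch only: fake_evaluate_single_task, run_and_save, the interrupt_before parameter, and the output file name are hypothetical, and the real test would exercise the functions in eval_m3.py rather than these stand-ins.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]


def fake_evaluate_single_task(
    domains: List[str],
    checkpoint_callback: Callable[[List[Result]], None],
    interrupt_before: str,
) -> List[Result]:
    """Hypothetical synchronous stand-in that raises KeyboardInterrupt mid-task."""
    results: List[Result] = []
    for domain in domains:
        if domain == interrupt_before:
            raise KeyboardInterrupt  # simulate Ctrl-C while this domain runs
        result = {"domain": domain, "passed": True}
        results.append(result)
        checkpoint_callback([result])
    return results


def run_and_save(output_path: Path, interrupt_before: str) -> None:
    """Mirrors the outer handler pattern: save whatever has accumulated."""
    all_results: List[Result] = []
    try:
        fake_evaluate_single_task(
            ["d1", "d2", "d3"], all_results.extend, interrupt_before
        )
    except KeyboardInterrupt:
        pass
    finally:
        output_path.write_text(json.dumps(all_results))


def test_completed_domains_survive_interrupt() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "partial.json"
        run_and_save(out, interrupt_before="d3")
        saved = json.loads(out.read_text())
    # d1 and d2 finished before the interrupt, so both must be on disk.
    assert [r["domain"] for r in saved] == ["d1", "d2"]
```

The key assertion is on the file contents rather than on an in-memory list, since the failure mode described in Background is an empty or incomplete saved file.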

Metadata

Status
Backlog
