feat(m3): domain-level checkpoint callback for partial saves on interrupt

## Background

PR #4 ([fix/issue-91-92-eval-resilience](https://github.com/cuga-project/cuga-eval/pull/4)) hoists `all_results` before the main `try` block in `run_config_mode` and saves partial results in the `KeyboardInterrupt`/`Exception` handlers. This reduces the blast radius from *"lose the entire run"* to *"lose at most the in-flight task/batch"*.

However, `all_results` is only updated **after** `evaluate_single_task` or `evaluate_tasks_in_batches` returns in full. If a Ctrl-C or exception lands mid-task (e.g. while a later domain within the same task is still running), the outer handlers persist an empty or incomplete file, because the completed domains' results exist only in inner locals at that point.
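
A minimal sketch of that failure mode, using simplified stand-ins rather than the real code in `eval_m3.py`. The function bodies, the `fail_at` parameter, and the domain names are assumptions made purely for illustration:

```python
import asyncio
from typing import Any, Dict, List


async def evaluate_single_task(domains: List[str], fail_at: str) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: results live in a local until the task returns."""
    results: List[Dict[str, Any]] = []
    for domain in domains:
        await asyncio.sleep(0)  # stands in for the real per-domain evaluation
        if domain == fail_at:
            raise RuntimeError(f"interrupted during {domain}")
        results.append({"domain": domain, "status": "ok"})
    return results


async def run_config_mode() -> List[Dict[str, Any]]:
    all_results: List[Dict[str, Any]] = []  # hoisted before the try, as in PR #4
    try:
        # all_results is only updated once the whole task has returned.
        all_results.extend(await evaluate_single_task(["a", "b", "c"], fail_at="c"))
    except Exception:
        pass  # the real handler would persist all_results here
    return all_results


saved = asyncio.run(run_config_mode())
print(saved)  # [] even though domains "a" and "b" completed
```

The two completed domains are lost because the exception unwinds `evaluate_single_task` before its local `results` list is ever returned.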

**Discussed in:** https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504325

## Goal

Achieve **domain-level** checkpoint granularity so that any domain that completes before an interrupt is preserved, regardless of whether its parent task finished.

## Proposed approach

Thread a `checkpoint_callback` parameter (or a shared mutable `shared_results` list) into `evaluate_single_task` and `evaluate_tasks_in_batches`. After each domain (or batch) result is appended in the inner helpers, invoke the callback to flush those results into the outer `all_results` before the next `await`. The outer interrupt handlers then persist whatever has accumulated.

Rough sketch:

```python
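from typing import Any, Dict, List

# run_config_mode
all_results: List[Dict[str, Any]] = []

def _checkpoint(results: List[Dict[str, Any]]) -> None:
    # Append to the outer list so the interrupt handlers can persist it.
    all_results.extend(results)

# pass _checkpoint into evaluate_single_task / evaluate_tasks_in_batches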
```

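Within `evaluate_single_task`, call `checkpoint_callback(evaluator.results)` after each domain loop iteration completes successfully. If `evaluator.results` accumulates across domains, pass only the entries added since the previous call, since `_checkpoint` uses `extend` and would otherwise record earlier domains more than once.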
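
One way the callback could be wired end to end is sketched below. This is not the actual `eval_m3.py` code: the simplified signatures, the `fail_at` parameter, the `flushed` counter, and the domain names are all assumptions for illustration, and a `RuntimeError` stands in for the interrupt.

```python
import asyncio
from typing import Any, Callable, Dict, List, Optional

Result = Dict[str, Any]
CheckpointCallback = Callable[[List[Result]], None]


async def evaluate_single_task(
    domains: List[str],
    fail_at: Optional[str] = None,
    checkpoint_callback: Optional[CheckpointCallback] = None,
) -> List[Result]:
    """Hypothetical stand-in that checkpoints after every completed domain."""
    results: List[Result] = []
    flushed = 0  # number of results already handed to the callback
    for domain in domains:
        await asyncio.sleep(0)  # stands in for the real per-domain evaluation
        if domain == fail_at:
            raise RuntimeError(f"interrupted during {domain}")
        results.append({"domain": domain, "status": "ok"})
        if checkpoint_callback is not None:
            # Hand over only the new entries so extend() does not duplicate.
            checkpoint_callback(results[flushed:])
            flushed = len(results)
    return results


async def run_config_mode(fail_at: Optional[str] = None) -> List[Result]:
    all_results: List[Result] = []

    def _checkpoint(results: List[Result]) -> None:
        all_results.extend(results)

    try:
        # The return value is ignored here: the callback already collected it.
        await evaluate_single_task(
            ["a", "b", "c"], fail_at=fail_at, checkpoint_callback=_checkpoint
        )
    except Exception:
        pass  # the real handler would persist all_results here
    return all_results


partial = asyncio.run(run_config_mode(fail_at="c"))
full = asyncio.run(run_config_mode())
print([r["domain"] for r in partial])  # ['a', 'b']
print([r["domain"] for r in full])     # ['a', 'b', 'c']
```

In this sketch the callback runs synchronously right after the append, with no `await` in between, so a cancellation cannot land between a domain completing and its result reaching `all_results`. The caller also has to choose one source of truth: either collect through the callback or extend from the return value, not both.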

## Scope

- `benchmarks/m3/eval_m3.py`: `run_config_mode`, `evaluate_single_task`, `evaluate_tasks_in_batches`
- Consider applying the same pattern to the sequential domain loops in `compare.sh`, if applicable
- Add or extend regression tests in `benchmarks/m3/tests/test_partial_save_on_interrupt.py`

