[PR 2/7] Add local checkpointing and progress tracking #70
Merged
gkamradt merged 3 commits on Jan 23, 2026
Implements two-level checkpointing system:
- BatchProgressManager: tracks task status across the batch (pending, in_progress, completed, failed) with worker assignment and stale task recovery
- TaskCheckpointManager: tracks within-task progress (attempts per test pair) for resume capability after interruption

Key features:
- Persists to JSON files via storage abstraction
- Schema versioning for future compatibility
- Decimal-based cost tracking (not float)
- retry_failed_tasks() to reset failed tasks for retry
- Comprehensive tests (32 tests) including edge cases

Includes demo script (scripts/demo_checkpoint.py) demonstrating checkpointing with simulated task failures and retries.
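The Decimal-based cost tracking and schema versioning listed above can be illustrated with a minimal sketch. This is not the PR's implementation: `SCHEMA_VERSION`, `dump_progress`, `load_progress`, and the payload layout are hypothetical names chosen for illustration. The sketch assumes costs are serialized as strings so that a JSON round trip does not pass through float.

```python
import json
from decimal import Decimal

SCHEMA_VERSION = 1  # hypothetical version number, not taken from the PR


def dump_progress(tasks: dict, total_cost: Decimal) -> str:
    """Serialize batch progress; the Decimal is stored as a string to stay exact."""
    payload = {
        "schema_version": SCHEMA_VERSION,
        "total_cost": str(total_cost),
        "tasks": tasks,
    }
    return json.dumps(payload, sort_keys=True)


def load_progress(raw: str):
    """Parse a saved payload, rejecting schema versions this code does not know."""
    data = json.loads(raw)
    version = data.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return data["tasks"], Decimal(data["total_cost"])


# Ten charges of 0.1 add up exactly with Decimal; the same sum in float drifts.
total = sum((Decimal("0.1") for _ in range(10)), Decimal("0"))
tasks = {"task_a": {"status": "completed"}, "task_b": {"status": "pending"}}
restored_tasks, restored_total = load_progress(dump_progress(tasks, total))
```

Storing the version alongside the data lets a later reader decide whether to migrate or reject an old file, which is presumably what "future compatibility" refers to here.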
Bug fixes:
- Fix race condition in claim_next_task() with retry loop
- Fix cost aggregation on task failure: mark_failed() now accepts cost/token params and accumulates to batch total
- Fix S3 exists() to raise StorageReadError on non-404 errors instead of silently returning False
- Add run_id validation on load: mismatched run_id starts fresh

Code quality:
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc) throughout checkpoint module

Tests:
- Add test_mark_failed_accumulates_costs
- Add test_run_id_mismatch_starts_fresh
- Fix test_reset_stale_tasks to use timezone-aware datetime

Updates demo script to pass costs to mark_failed().
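Three of the fixes above (cost accumulation on failure, run_id validation on load, and timezone-aware stale detection) can be sketched as follows. The function names echo those in the commit message, but the signatures, the dict-based state, and the helpers `new_state` and `load_or_fresh` are assumptions made for illustration; the actual code in the PR is not shown here and may differ.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from typing import Optional


def new_state(run_id: str, task_ids: list) -> dict:
    """Fresh batch state with every task pending (hypothetical layout)."""
    return {
        "run_id": run_id,
        "total_cost": Decimal("0"),
        "tasks": {t: {"status": "pending", "started_at": None} for t in task_ids},
    }


def load_or_fresh(saved: Optional[dict], run_id: str, task_ids: list) -> dict:
    """Reuse saved state only when its run_id matches; otherwise start fresh."""
    if saved is not None and saved.get("run_id") == run_id:
        return saved
    return new_state(run_id, task_ids)


def mark_failed(state: dict, task_id: str, cost: Decimal = Decimal("0")) -> None:
    """Record the failure and still add the spend to the batch total."""
    state["tasks"][task_id]["status"] = "failed"
    state["total_cost"] += cost


def reset_stale_tasks(state: dict, max_age: timedelta,
                      now: Optional[datetime] = None) -> list:
    """Return in_progress tasks older than max_age to pending."""
    now = now or datetime.now(timezone.utc)  # timezone-aware, not utcnow()
    reset = []
    for task_id, task in state["tasks"].items():
        started = task["started_at"]
        if task["status"] == "in_progress" and started is not None \
                and now - started > max_age:
            task["status"] = "pending"
            task["started_at"] = None
            reset.append(task_id)
    return reset


state = new_state("run-1", ["a", "b"])
two_hours_ago = datetime.now(timezone.utc) - timedelta(hours=2)
state["tasks"]["a"].update(status="in_progress", started_at=two_hours_ago)
mark_failed(state, "b", cost=Decimal("0.25"))
stale = reset_stale_tasks(state, max_age=timedelta(hours=1))
fresh = load_or_fresh(state, "run-2", ["a", "b"])
```

Keeping every timestamp timezone-aware matters because Python raises a TypeError when a naive datetime is subtracted from an aware one, which is likely why the stale-task test needed the same change.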
4902f2d to 3158bfa
gkamradt approved these changes on Jan 23, 2026
Summary

Implements two-level checkpointing system.

Key features:
- retry_failed_tasks() to reset failed tasks for retry

Dependencies

Test plan
- pytest src/arc_agi_benchmarking/tests/test_checkpoint.py (32 tests)
- python scripts/demo_checkpoint.py