
[PR 4/7] CLI checkpoint integration#72

Merged
gkamradt merged 13 commits into main from checkpoint/pr4-checkpoint-integration
Jan 23, 2026

@ericc59 ericc59 commented Jan 22, 2026

Summary

  • Add --resume/--no-resume CLI flags to run_all.py
  • Create BatchProgressManager to track task status (pending/in_progress/completed/failed)
  • Filter to only pending tasks on resume, skip already-completed work
  • Mark tasks completed/failed after execution with timing and error info
  • Auto-recover stale tasks (in_progress > 1 hour reset to pending)
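
The resume behaviour in the summary above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual `BatchProgressManager`: the dictionary layout and the helper names `reset_stale_tasks` and `pending_task_ids` are assumptions made for the example, and only the one-hour threshold and the status names come from the description.

```python
from datetime import datetime, timedelta, timezone

# One-hour threshold taken from the summary; everything else is illustrative.
STALE_AFTER = timedelta(hours=1)


def reset_stale_tasks(tasks, now=None):
    """Reset tasks stuck in 'in_progress' for longer than STALE_AFTER.

    `tasks` maps task_id -> {"status": str, "started_at": datetime or None}.
    Returns the ids that were moved back to 'pending'.
    """
    now = now or datetime.now(timezone.utc)
    reset = []
    for task_id, task in tasks.items():
        started = task.get("started_at")
        if task["status"] == "in_progress" and started and now - started > STALE_AFTER:
            task["status"] = "pending"
            task["started_at"] = None
            reset.append(task_id)
    return reset


def pending_task_ids(tasks):
    """Return the ids that still need to run on resume."""
    return [task_id for task_id, task in tasks.items() if task["status"] == "pending"]
```

Calling `reset_stale_tasks` before `pending_task_ids` means work abandoned by a crashed worker is picked up again, while recently claimed tasks are left alone.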

Test plan

  • Run benchmark, verify checkpoint file created in .checkpoints/progress.json
  • Interrupt and resume, verify completed tasks skipped
  • Verify --no-resume re-runs all tasks
  • Run pytest - all 389 tests pass

Introduces a unified storage interface with implementations for local
filesystem and S3, enabling the same checkpoint logic to work in both
local development and AWS production environments.

- StorageBackend ABC with read/write/exists/delete/list_keys interface
- LocalStorageBackend with atomic writes and path traversal protection
- S3StorageBackend with prefix support (optional boto3 dependency)
- Comprehensive test suite (28 tests)
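
A storage interface of the shape listed above might look like the sketch below. The method names follow the commit message; the signatures, the `_path` helper, and the choice of `ValueError` for rejected keys are assumptions for illustration, and the S3 backend is omitted.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Interface named in the commit message; signatures are assumed."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def exists(self, key: str) -> bool: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list_keys(self, prefix: str = "") -> list: ...


class LocalStorageBackend(StorageBackend):
    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Path traversal protection: the resolved path must stay under root.
        path = (self.root / key).resolve()
        if not path.is_relative_to(self.root):
            raise ValueError(f"key escapes storage root: {key!r}")
        return path

    def read(self, key):
        return self._path(key).read_bytes()

    def write(self, key, data):
        # Atomic write: write to a temp file in the same directory, then rename.
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        try:
            with os.fdopen(fd, "wb") as fh:
                fh.write(data)
            os.replace(tmp, path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise

    def exists(self, key):
        return self._path(key).is_file()

    def delete(self, key):
        self._path(key).unlink(missing_ok=True)

    def list_keys(self, prefix=""):
        keys = (p.relative_to(self.root).as_posix() for p in self.root.rglob("*") if p.is_file())
        return sorted(k for k in keys if k.startswith(prefix))
```

Because `os.replace` swaps the file in a single step, a reader should see either the old checkpoint or the new one, never a partially written file.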

The *local* gitignore pattern was preventing local.py from being committed.
@ericc59 ericc59 force-pushed the checkpoint/pr4-checkpoint-integration branch 2 times, most recently from a179efb to aefd666 Compare January 22, 2026 20:44
Implements two-level checkpointing system:
- BatchProgressManager: tracks task status across the batch (pending,
  in_progress, completed, failed) with worker assignment and stale
  task recovery
- TaskCheckpointManager: tracks within-task progress (attempts per
  test pair) for resume capability after interruption

Key features:
- Persists to JSON files via storage abstraction
- Schema versioning for future compatibility
- Decimal-based cost tracking (not float)
- retry_failed_tasks() to reset failed tasks for retry
- Comprehensive tests (32 tests) including edge cases

Includes demo script (scripts/demo_checkpoint.py) demonstrating
checkpointing with simulated task failures and retries.
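
To illustrate why cost tracking is Decimal-based rather than float-based, here is a small sketch. The field names `schema_version` and `total_cost` are hypothetical and not taken from the PR; serialising the Decimal as a string is one common way to keep it exact in JSON.

```python
import json
from decimal import Decimal

costs = ["0.1", "0.2"]  # per-task costs as they might arrive from a provider

# Binary floats accumulate rounding error when summed.
float_total = sum(float(c) for c in costs)

# Decimal keeps the exact decimal values.
decimal_total = sum((Decimal(c) for c in costs), Decimal("0"))

# Persist as a string so the value survives a JSON round trip unchanged.
payload = json.dumps({"schema_version": 1, "total_cost": str(decimal_total)})
restored = Decimal(json.loads(payload)["total_cost"])
```

Here `float_total` is not exactly `0.3`, while `decimal_total` and `restored` both equal `Decimal("0.3")`.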

Bug fixes:
- Fix race condition in claim_next_task() with retry loop
- Fix cost aggregation on task failure - mark_failed() now accepts
  cost/token params and accumulates to batch total
- Fix S3 exists() to raise StorageReadError on non-404 errors
  instead of silently returning False
- Add run_id validation on load - mismatched run_id starts fresh

Code quality:
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)
  throughout checkpoint module

Tests:
- Add test_mark_failed_accumulates_costs
- Add test_run_id_mismatch_starts_fresh
- Fix test_reset_stale_tasks to use timezone-aware datetime

Updates demo script to pass costs to mark_failed().
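
The run_id validation mentioned in the bug fixes above could work along these lines. This is a sketch under assumptions: the function name `load_progress` and the `{"run_id": ..., "tasks": {...}}` layout are invented for the example.

```python
import json


def load_progress(raw, run_id):
    """Return saved progress for `run_id`, or a fresh state if it does not match.

    `raw` is the checkpoint file content as a JSON string, or None if no
    checkpoint exists yet.
    """
    fresh = {"run_id": run_id, "tasks": {}}
    if raw is None:
        return fresh
    state = json.loads(raw)
    if state.get("run_id") != run_id:
        # A checkpoint from a different run must not leak into this one.
        return fresh
    return state
```

Starting fresh on a mismatch avoids skipping tasks that were completed under a different run.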

- Add resilience module with TaskTimeoutError, request_timeout, task_timeout
- Add CircuitBreaker with CLOSED/OPEN/HALF_OPEN states and configurable thresholds
- Add CircuitBreakerRegistry for managing per-provider circuit breakers
- Integrate timeout and circuit breaker into cli/run_all.py
- Add --max-task-timeout and --circuit-breaker-threshold CLI flags
- Add timeout and circuit breaker config to provider_config.yml
- Add 41 tests for resilience module
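
The three circuit breaker states named above can be sketched as a small state machine. Only the state names come from the commit message; the constructor parameters, method names, and the injectable `clock` are assumptions, and this version is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests flow normally
    OPEN = "open"            # requests are rejected
    HALF_OPEN = "half_open"  # probe requests are allowed after a cool-down


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = State.CLOSED

    def allow_request(self):
        # After the cool-down, move from OPEN to HALF_OPEN to probe the provider.
        if self.state is State.OPEN and self._clock() - self._opened_at >= self.reset_timeout:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self):
        self._failures = 0
        self.state = State.CLOSED

    def record_failure(self):
        self._failures += 1
        # A failed probe, or too many failures in total, opens the circuit.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A per-provider registry could then be as simple as a dictionary mapping provider names to `CircuitBreaker` instances.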

Timeouts now only apply when --max-task-timeout is explicitly passed.
This prevents accidentally failing long-running reasoning model tasks.
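
One way to get this opt-in behaviour is to default the flag to `None` and treat `None` as "no limit". The flag name is from the PR; `build_parser` and `run_task` are hypothetical helpers, and the thread-based timeout is a stand-in for whatever mechanism the resilience module actually uses.

```python
import argparse
import concurrent.futures


def build_parser():
    parser = argparse.ArgumentParser()
    # Default of None means no timeout unless the flag is passed explicitly.
    parser.add_argument("--max-task-timeout", type=float, default=None,
                        help="per-task timeout in seconds (default: no limit)")
    return parser


def run_task(fn, timeout=None):
    """Run fn() in a worker thread, waiting at most `timeout` seconds.

    With timeout=None the call waits indefinitely. On timeout,
    concurrent.futures.TimeoutError is raised; note that the worker
    thread itself is not cancelled.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)
```

A caller would pass `args.max_task_timeout` straight through to `run_task`, so long-running tasks are only interrupted when the user asked for a limit.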

- Add BatchProgressManager to track task progress
- Add --resume/--no-resume CLI flags (resume enabled by default)
- Claim tasks before execution, mark completed/failed after
- Skip already-completed tasks on resume
- Reset stale in-progress tasks from crashed workers
- Add checkpoint progress summary to output
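
The flag pair and the claim/mark lifecycle described above might fit together as in this sketch. The flag names are from the PR; `build_parser`, `run_batch`, and the plain dictionary used for status are assumptions standing in for the real `BatchProgressManager`. `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction generates both --resume and --no-resume.
    parser.add_argument("--resume", action=argparse.BooleanOptionalAction, default=True)
    return parser


def run_batch(task_ids, status, run_one, resume=True):
    """Run tasks, skipping completed ones when resuming.

    `status` maps task_id -> "in_progress" | "completed" | "failed".
    """
    if not resume:
        status.clear()  # forget earlier progress and re-run everything
    for task_id in task_ids:
        if status.get(task_id) == "completed":
            continue
        status[task_id] = "in_progress"  # claim before execution
        try:
            run_one(task_id)
        except Exception:
            status[task_id] = "failed"
        else:
            status[task_id] = "completed"
    return status
```

With this shape, `--no-resume` simply discards the saved status before the loop starts, so every task is claimed and executed again.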
@ericc59 ericc59 force-pushed the checkpoint/pr4-checkpoint-integration branch from aefd666 to cbbb68a Compare January 22, 2026 20:51
@gkamradt gkamradt merged commit 59542aa into main Jan 23, 2026
7 checks passed
2 participants