feat(checkpoint): validate checkpoint entries before resume #314

Open
hershalb wants to merge 2 commits into main from hb/validate-checkpoint-entry
hershalb commented Apr 9, 2026

Description

Reject malformed checkpoints.jsonl entries early with descriptive errors instead of letting them surface as cryptic KeyErrors or silent bad defaults during training resume. Previously, hand-edited or corrupted checkpoint entries would either hang during load_state_with_optimizer or silently resume with step=0, data_consumed=0, forcing operators to debug by trial and error.
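To make the failure mode concrete, here is a minimal sketch of how lenient parsing could produce the silent defaults described above. The `lenient_parse` helper and its use of `dict.get` are assumptions for illustration only; the pre-existing resume code is not shown in this description.

```python
import json


def lenient_parse(line: str) -> dict:
    """Hypothetical lenient parsing: missing keys fall back to defaults."""
    entry = json.loads(line)
    return {
        "name": entry["name"],  # a missing key surfaces as a bare KeyError
        "step": entry.get("step", 0),  # a missing key silently becomes 0
        "data_consumed": entry.get("data_consumed", 0),
    }


# A hand-edited entry with a typo in the "step" key.
line = '{"name": "ckpt-100", "stpe": 100, "data_consumed": 6400}'
parsed = lenient_parse(line)
print(parsed["step"])  # prints 0: training would restart from step 0
```

Nothing in this path tells the operator that `stpe` was ignored, which is the gap the validation in this PR is meant to close.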

Architecture / Code Overview Diagram

flowchart TD
    A[resolve_resume] --> B{init_from_checkpoint?}
    B -- yes --> C[client.resolve_checkpoint_path + load]
    B -- no --> D[get_last_checkpoint]
    D --> E[read_jsonl]
    E --> F["validate_checkpoint_entry ⚡ NEW"]
    F -- valid --> G[client.load_state_with_optimizer]
    F -- invalid --> H["raise InvalidCheckpointError<br/>(descriptive message)"]
    G --> I[return ResumeInfo]

    style F fill:#2d6,stroke:#1a4,color:#fff
    style H fill:#d33,stroke:#a11,color:#fff
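
The flow in the diagram could look roughly like the sketch below. Only the function and class names that appear in the diagram come from this PR; the signatures, the `ResumeInfo` fields, the `client.load` method name, and the placeholder validator body are assumptions made to keep the example self-contained.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class InvalidCheckpointError(ValueError):
    """Raised when a checkpoints.jsonl entry is malformed."""


@dataclass
class ResumeInfo:
    # Field names are assumed from the step/data_consumed keys in the description.
    step: int
    data_consumed: int


def validate_checkpoint_entry(entry: dict) -> None:
    # Placeholder: only checks key presence. The PR also checks types and a path prefix.
    for key in ("name", "state_path", "step", "data_consumed"):
        if key not in entry:
            raise InvalidCheckpointError(f"checkpoint entry is missing '{key}'")


def get_last_checkpoint(jsonl_path: Path) -> Optional[dict]:
    """Return the last JSONL entry after validating it, or None if there is none."""
    if not jsonl_path.exists():
        return None
    lines = [ln for ln in jsonl_path.read_text().splitlines() if ln.strip()]
    if not lines:
        return None
    entry = json.loads(lines[-1])
    validate_checkpoint_entry(entry)  # fail here, before any state is loaded
    return entry


def resolve_resume(
    client, jsonl_path: Path, init_from_checkpoint: Optional[str] = None
) -> Optional[ResumeInfo]:
    if init_from_checkpoint is not None:
        # Explicit checkpoint: checkpoints.jsonl is never read on this branch.
        client.load(client.resolve_checkpoint_path(init_from_checkpoint))
        return ResumeInfo(step=0, data_consumed=0)  # assumed: counters start fresh
    entry = get_last_checkpoint(jsonl_path)
    if entry is None:
        return None
    client.load_state_with_optimizer(entry["state_path"])
    return ResumeInfo(step=entry["step"], data_consumed=entry["data_consumed"])
```

The point of the ordering is that `validate_checkpoint_entry` runs before `client.load_state_with_optimizer`, so a bad entry raises immediately instead of reaching the potentially slow load call.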

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring
  • Documentation
  • Infrastructure/DevOps

Testing

  • Added/updated tests
  • Tested manually
  • No testing needed

32 new unit tests covering:

  • TestValidateCheckpointEntry (20 tests): exhaustive validation of individual fields (name, state_path, step, data_consumed), type checks, edge cases (bool, float, whitespace)
  • TestGetLastCheckpointValidation (4 tests): bad JSONL entries caught at read time
  • TestResolveResumeValidation (5 tests): end-to-end checks that bad entries prevent the load and that --init-from-checkpoint bypasses the JSONL file
  • All 34 existing tests continue to pass
python3.12 -m pytest training/tests/unit/test_checkpoint_utils.py -v --timeout=30
# 66 passed in 3.51s
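
The bool, float, and whitespace edge cases listed above are worth a brief illustration, since Python's type rules make them easy to miss. The helper names below are hypothetical, and the non-negative requirement is an assumption; the sketch only shows the kind of check such tests would exercise.

```python
def is_strict_non_negative_int(value: object) -> bool:
    """True only for real ints >= 0; rejects bool, float, and str."""
    # bool is a subclass of int, so isinstance(True, int) is True.
    # Without the explicit bool check, True would pass as step=1.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return value >= 0


def is_non_blank_str(value: object) -> bool:
    """True only for strings with at least one non-whitespace character."""
    return isinstance(value, str) and value.strip() != ""


# Table of (value, expected) pairs in the spirit of the edge-case tests.
int_cases = [(100, True), (0, True), (True, False), (100.0, False), ("100", False), (-1, False)]
for value, expected in int_cases:
    assert is_strict_non_negative_int(value) is expected, value

str_cases = [("ckpt-100", True), ("", False), ("   ", False), (None, False)]
for value, expected in str_cases:
    assert is_non_blank_str(value) is expected, value
```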

Surface Consistency

  • No customer-facing surface impact

Deployment Notes

  • No special deployment considerations

Change Size

  • Small (< 200 LOC)
  • Medium (200–999 LOC)
  • Large (≥ 1,000 LOC)

(~90 LOC of production code, ~280 LOC of tests)

Checklist

  • Agent-reviewed the diff before committing
  • Self-reviewed my code
  • Change is the minimum necessary diff
  • Added tests for my changes
  • Updated relevant documentation
  • No new linter warnings/errors
  • No secrets or credentials in the diff
  • Checked surface consistency for customer-facing changes
  • Visual diagram included (or change is cosmetic-only)

Additional Context

Motivated by an operator who had to manually create a checkpoints.jsonl for cross-job resume and got stuck debugging a hanging load because the entry format was wrong. The new validation catches these issues at parse time with actionable error messages.
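
For readers writing a checkpoints.jsonl entry by hand, the following sketch shows what parse-time validation with actionable messages might look like. It is not the code from this PR: the signature, the `state_path_prefix` parameter, the error wording, and the non-negative requirement are all assumptions. Only the names `InvalidCheckpointError` and `validate_checkpoint_entry` and the four field names come from the description.

```python
class InvalidCheckpointError(ValueError):
    """Raised when a checkpoints.jsonl entry fails validation."""


def validate_checkpoint_entry(entry: object, state_path_prefix: str) -> dict:
    """Return the entry unchanged if it is well formed, else raise.

    state_path_prefix is a hypothetical parameter; the real prefix rule
    is not spelled out in this description.
    """
    if not isinstance(entry, dict):
        raise InvalidCheckpointError(
            f"entry must be a JSON object, got {type(entry).__name__}"
        )
    for key in ("name", "state_path"):
        value = entry.get(key)
        if not isinstance(value, str) or not value.strip():
            raise InvalidCheckpointError(
                f"'{key}' must be a non-empty string, got {value!r}"
            )
    if not entry["state_path"].startswith(state_path_prefix):
        raise InvalidCheckpointError(
            f"'state_path' must start with {state_path_prefix!r}, "
            f"got {entry['state_path']!r}"
        )
    for key in ("step", "data_consumed"):
        value = entry.get(key)
        # bool is excluded explicitly because it is a subclass of int.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise InvalidCheckpointError(
                f"'{key}' must be a non-negative integer, got {value!r}"
            )
    return entry


# Usage with made-up values: a float step is reported by field name and value.
bad_entry = {
    "name": "ckpt-100",
    "state_path": "example-prefix/ckpt-100",
    "step": 100.0,
    "data_consumed": 6400,
}
try:
    validate_checkpoint_entry(bad_entry, "example-prefix/")
except InvalidCheckpointError as exc:
    print(f"rejected: {exc}")
```

Naming the offending field and echoing the bad value in the message is what lets an operator fix the entry without stepping through the resume path.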

Made with Cursor

hershalb added 2 commits April 9, 2026 00:18
Reject malformed checkpoints.jsonl entries early with descriptive errors
instead of letting them surface as cryptic KeyErrors or silent bad defaults
during training resume.

- Add InvalidCheckpointError and validate_checkpoint_entry()
- Validate name, state_path (required + prefix check), step, data_consumed
- Wire validation into get_last_checkpoint() and resolve_resume()
- Add 32 new unit tests covering all validation paths

Made-with: Cursor
Replace real customer job ID and checkpoint hash with generic test values.

Made-with: Cursor