feat(checkpoint): validate checkpoint entries before resume#314
Open
feat(checkpoint): validate checkpoint entries before resume#314
Conversation
Reject malformed checkpoints.jsonl entries early with descriptive errors instead of letting them surface as cryptic KeyErrors or silent bad defaults during training resume. - Add InvalidCheckpointError and validate_checkpoint_entry() - Validate name, state_path (required + prefix check), step, data_consumed - Wire validation into get_last_checkpoint() and resolve_resume() - Add 32 new unit tests covering all validation paths Made-with: Cursor
Replace real customer job ID and checkpoint hash with generic test values. Made-with: Cursor
benjibc
approved these changes
Apr 9, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Reject malformed
checkpoints.jsonlentries early with descriptive errors instead of letting them surface as crypticKeyErrors or silent bad defaults during training resume. Previously, hand-edited or corrupted checkpoint entries would either hang duringload_state_with_optimizeror silently resume withstep=0, data_consumed=0, forcing operators to debug by trial and error.Architecture / Code Overview Diagram
flowchart TD A[resolve_resume] --> B{init_from_checkpoint?} B -- yes --> C[client.resolve_checkpoint_path + load] B -- no --> D[get_last_checkpoint] D --> E[read_jsonl] E --> F["validate_checkpoint_entry ⚡ NEW"] F -- valid --> G[client.load_state_with_optimizer] F -- invalid --> H["raise InvalidCheckpointError<br/>(descriptive message)"] G --> I[return ResumeInfo] style F fill:#2d6,stroke:#1a4,color:#fff style H fill:#d33,stroke:#a11,color:#fffType of Change
Testing
32 new unit tests covering:
TestValidateCheckpointEntry(20 tests): exhaustive validation of individual fields (name, state_path, step, data_consumed), type checks, edge cases (bool, float, whitespace)TestGetLastCheckpointValidation(4 tests): bad JSONL entries caught at read timeTestResolveResumeValidation(5 tests): end-to-end: bad entries prevent load,--init-from-checkpointbypasses JSONLSurface Consistency
Deployment Notes
Change Size
(~90 LOC of production code, ~280 LOC of tests)
Checklist
Additional Context
Motivated by an operator who had to manually create a
checkpoints.jsonlfor cross-job resume and got stuck debugging a hanging load because the entry format was wrong. The new validation catches these issues at parse time with actionable error messages.Made with Cursor