[Checkpointer] Remove the dependencies on PyTorch distributed state_dict APIs #3623

Open: fegin wants to merge 4 commits into gh/fegin/137/base from gh/fegin/137/head
fegin (Contributor) commented Jun 10, 2026

Stack from ghstack (oldest at bottom):

Summary:
Same as #2441, but with a new implementation that is compatible with the latest OptimizerContainer.

Verification:

| Check | Result |
|---|---|
| Checkpoint compat, FSDP (old saves -> new loads, resume to step 100) | PASS - bit-identical all 100 steps |
| Checkpoint compat, PP=2 (old saves -> new loads) | PASS - bit-identical |
| loss_compare, FSDP (old vs new, 100 steps, seed checkpoint) | PASS - identical loss |
| CPU unit tests (lr_scheduler, optimizer_param_groups, state_dict_keys, checkpoint) | 51 passed |
| Full unit suite | 411 passed (2 unrelated tokenizer-download failures) |
| Lint (ufmt, flake8, pydoclint) | pass |
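
For readers unfamiliar with what "bit-identical" means in the checks above, here is a minimal sketch of such a comparison on per-step loss values. It is not the script used for this PR; the helper name `first_bit_mismatch` and the sample values are hypothetical. The point it illustrates is that a bit-level check is stricter than a tolerance check: two floats that print almost the same still fail it.

```python
import struct


def first_bit_mismatch(old_losses, new_losses):
    """Return the 0-based index of the first step whose float64 bit pattern
    differs between the two runs, or None if every step matches exactly.

    Hypothetical helper for illustration only.
    """
    if len(old_losses) != len(new_losses):
        raise ValueError("runs have different numbers of steps")
    for step, (old, new) in enumerate(zip(old_losses, new_losses)):
        # Compare raw IEEE-754 bytes rather than using == or a tolerance,
        # so that 0.0 vs -0.0 counts as a difference and NaN can match NaN.
        if struct.pack("<d", old) != struct.pack("<d", new):
            return step
    return None


# Toy values, not results from this PR.
same = first_bit_mismatch([2.5, 1.25], [2.5, 1.25])
drifted = first_bit_mismatch([0.1 + 0.2, 1.0], [0.3, 1.0])
```

A tolerance-based comparison (for example `math.isclose`) would accept the second pair; the byte comparison does not, which is what makes it a strong regression check for a refactor that should not change numerics.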

[ghstack-poisoned]
meta-cla bot added the CLA Signed label Jun 10, 2026
[ghstack-poisoned]
fegin added a commit that referenced this pull request Jun 10, 2026
…ict APIs

Summary:
Same as #2441, but with a new implementation that is compatible with the latest OptimizerContainer.

ghstack-source-id: 2a34d94
Pull-Request: #3623
[ghstack-poisoned]
fegin added a commit that referenced this pull request Jun 10, 2026
…ict APIs

Summary:
Same as #2441, but with a new implementation that is compatible with the latest OptimizerContainer.

ghstack-source-id: c6be0ff
Pull-Request: #3623
[ghstack-poisoned]
fegin added a commit that referenced this pull request Jun 10, 2026
…ict APIs

Summary:
Same as #2441, but with a new implementation that is compatible with the latest OptimizerContainer.

ghstack-source-id: 07dee57
Pull-Request: #3623
fegin marked this pull request as ready for review June 10, 2026 21:12

Contributor

Now that we are not using DCP APIs, could you prompt Claude to add the `_save_to_state_dict` / `_load_from_state_dict` hooks and see if that just solves #3569?
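
For context on the hooks mentioned above: `_save_to_state_dict` and `_load_from_state_dict` are per-module methods on `torch.nn.Module` that `state_dict()` and `load_state_dict()` call for each module. The sketch below only shows the mechanism on a toy module; the class `RenamedKeyModule` and the key name `scale` are hypothetical, and nothing here is taken from this PR or from #3569, whose requirements are not stated in this thread.

```python
import torch
from torch import nn


class RenamedKeyModule(nn.Module):
    """Toy module (hypothetical) that stores `weight` under the checkpoint key `scale`."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))

    def _save_to_state_dict(self, destination, prefix, keep_vars):
        # Let nn.Module write its default entries first.
        super()._save_to_state_dict(destination, prefix, keep_vars)
        # Then expose the parameter under a different checkpoint key.
        destination[prefix + "scale"] = destination.pop(prefix + "weight")

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        # Map the checkpoint key back to the attribute name before the
        # default loader looks for `weight`.
        if prefix + "scale" in state_dict:
            state_dict[prefix + "weight"] = state_dict.pop(prefix + "scale")
        super()._load_from_state_dict(state_dict, prefix, local_metadata, strict,
                                      missing_keys, unexpected_keys, error_msgs)


src = RenamedKeyModule(4)
with torch.no_grad():
    src.weight.mul_(3.0)
saved = src.state_dict()  # contains "scale", not "weight"

dst = RenamedKeyModule(4)
dst.load_state_dict(saved)  # strict load succeeds because the hook remaps the key
```

Because the remapping lives in the module itself, callers keep using plain `state_dict()` / `load_state_dict()` with no wrapper API in between.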

# Per-optimizer regex patterns (aligned with self.schedulers), used as lr
# metric labels. Sourced from the container so patterns stay off the
# optimizer param groups and out of the saved state dict.
self._param_group_patterns = optimizers._param_group_patterns

Contributor

Can we avoid adding this just for logging purposes? I think it's available in `optimizers.config`.

# the list of patterns for that optimizer's param groups. Kept here, off the
# optimizer param groups, so they feed logging and lr metrics without leaking
# into the saved optimizer state dict.
_param_group_patterns: list[list[str]]

Contributor

It seems this is also only for logging. Let's remove it for now, or we can move the logging into `_build_param_groups`, where we still have access to the patterns.
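
A rough sketch of the second option, logging at build time so the patterns never need to be stored anywhere. This is not the PR's `_build_param_groups`; the function `build_param_groups`, its signature, and the `(pattern, lr)` spec format are assumptions made for illustration, and plain strings stand in for parameters.

```python
import logging
import re

logger = logging.getLogger(__name__)


def build_param_groups(named_params, group_specs, default_lr):
    """Hypothetical stand-in for `_build_param_groups`.

    named_params: iterable of (name, param) pairs.
    group_specs: list of (regex pattern, lr) pairs, checked in order.
    Returns optimizer-style param groups that carry no pattern field.
    """
    groups = [{"params": [], "lr": lr} for _, lr in group_specs]
    default_group = {"params": [], "lr": default_lr}

    for name, param in named_params:
        for group, (pattern, _) in zip(groups, group_specs):
            if re.search(pattern, name):
                group["params"].append(param)
                break
        else:
            # No pattern matched: fall back to the default group.
            default_group["params"].append(param)

    # Log here, while the patterns are still in scope, instead of
    # attaching them to the groups or to the container.
    for group, (pattern, lr) in zip(groups, group_specs):
        logger.info("param group %r: %d params, lr=%g",
                    pattern, len(group["params"]), lr)
    logger.info("default param group: %d params, lr=%g",
                len(default_group["params"]), default_lr)

    return [g for g in groups + [default_group] if g["params"]]


named = [("layers.0.attn.wq", "p0"), ("layers.0.mlp.w1", "p1"), ("norm.weight", "p2")]
groups = build_param_groups(named, [(r"\.attn\.", 1e-4), (r"\.mlp\.", 3e-4)], default_lr=1e-3)
```

The returned groups hold only `params` and `lr`, so nothing pattern-related can leak into a saved optimizer state dict. The trade-off is that later code, such as lr metric labels, cannot recover the patterns unless it reads them from the config.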

Labels: ciflow/8gpu, CLA Signed
