Fix: fix hang in FSDP UBR with symmetric registration #3565
youngeunkwon0405 wants to merge 1 commit into
…p group Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
/ok to test f5ff88b
This specific hang should be fixed by using UBR on every communication buffer (in HSDP/HFSDP in particular, which had unregistered communication buffers), which is a corollary of my data-type customization work in this PR: #3067. I believe it is directly caused by deallocation of temporary buffers, which hangs registration after the first pass (or something related to that). Properly using @youngeunkwon0405 @dingqingy-nv to confirm if
I am wrong. This is NOT the same hang I experienced. It hangs post-UBR, so it is not caused by UBR directly, but indirectly!
Clearly UBR is successful:
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.param_and_grad_buffer:[MCORE][FSDP][Manual REG] Registered mem pool to group <torch.distributed.distributed_c10d.ProcessGroup object at 0xfffc02cee470>,group.group_desc:INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP, group.size(): 32
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.param_and_grad_buffer:[MCORE][FSDP][Manual REG] Registered mem pool to group <torch.distributed.distributed_c10d.ProcessGroup object at 0xfffc02c4eff0>,group.group_desc:HIERARCHICAL_EXPERT_DATA_PARALLEL_GROUP_L0, group.size(): 32
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.param_and_grad_buffer:[MCORE][FSDP][Manual REG] Registered mem pool to group <torch.distributed.distributed_c10d.ProcessGroup object at 0xfffc02c4ebf0>,group.group_desc:HIERARCHICAL_EXPERT_DATA_PARALLEL_GROUP_L1, group.size(): 2
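To double check which groups were actually registered in a given run, the `[Manual REG]` lines can be pulled out of the log. This is a minimal sketch that assumes the log format shown above; the function name `registered_groups` and the regex are illustrative only and are not part of Megatron-FSDP.

```python
import re

# Matches the tail of a "[Manual REG] Registered mem pool to group ..." line
# and captures the group description and the group size.
_REG_RE = re.compile(
    r"Registered mem pool to group .*?"
    r"group\.group_desc:(\w+),\s*group\.size\(\):\s*(\d+)"
)


def registered_groups(log_text):
    """Return (group_desc, group_size) pairs for every registration line, in log order."""
    found = []
    for line in log_text.splitlines():
        match = _REG_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found
```

Running this over the logs of each rank before and after the WAR should make it easy to see whether the set of registered groups changed as intended.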
Aligned to merge this as a WAR to main branch since the DP-Outer group does not have significant performance ramifications for HSDP/HFSDP. We can reduce gradients in native precision and not use symmetric kernels for these groups.
@youngeunkwon0405 to add a FIXME(@youngeunkwon0405, @cspades) note explaining why we need the WAR (though it looks 👌🏻 already, maybe just to put our names onto the bug as a TODO), and also if you could double check if all 3 of those groups cause the hang, and not just one or two of them. (That will be the first thing I check when I investigate this.)
Currently, NCCL UBR with symmetric registration causes a hang when we have multiple optimizer instances.
This is a WAR fix to avoid that issue, targeting the release branch.
With this change, user buffer registration is done only for the DP_CP group, which is the most important group to register the buffers for.
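For illustration only, the idea of the WAR can be sketched as an allow-list over process groups. This is not the code in this PR: `GroupInfo`, `register_selected_groups`, the `register_fn` callback, and the group labels are all hypothetical stand-ins, and no real Megatron-FSDP or NCCL call is made.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class GroupInfo:
    """Hypothetical stand-in for a process group: a description and a size."""
    desc: str
    size: int


def register_selected_groups(
    groups: Iterable[GroupInfo],
    register_fn: Callable[[GroupInfo], None],
    allowed_descs: Set[str],
) -> List[str]:
    """Call register_fn only for groups whose description is allow-listed.

    Groups that are skipped keep using unregistered buffers.
    Returns the descriptions of the groups that were registered.
    """
    registered = []
    for group in groups:
        if group.desc in allowed_descs:
            register_fn(group)  # assumed to perform the actual registration
            registered.append(group.desc)
    return registered
```

With an allow-list of `{"DP_CP"}` (a hypothetical label), only that group would be passed to `register_fn`; the other groups would simply skip registration.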
What does this PR do?
Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
Pre-checks
Core 0.8)
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into `main` branch
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label Expert Review
(Step 2): Collect the expert reviewers' reviews
Expert Review label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Review label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.