Describe the bug
This test occasionally hangs. The last log output before the hang is:
9-task-1-0/0 [default0]:PASSED
9-task-1-0/0 [default0]:tests/unit_tests/pipeline_parallel/test_fine_grained_activation_offloading.py::test_fine_grained_activation_offload_with_ep_a2a_overlap_compatibility[alltoall-False-offload_modules8] [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
[previous line repeated 62 more times]
9-task-1-0/0 [default0]:
9-task-1-0/0 [default0]:================================================================================
9-task-1-0/0 [default0]: Activation Offload Summary (MB)
9-task-1-0/0 [default0]:================================================================================
9-task-1-0/0 [default0]:Rank      attn_proj  core_attn expert_fc1   mlp_norm    moe_act      Total
9-task-1-0/0 [default0]:--------------------------------------------------------------------------------
9-task-1-0/0 [default0]:Rank 0        28.00    1204.00      51.90      28.00     207.80    1519.69
9-task-1-0/0 [default0]:Rank 1        28.00    1204.00      52.66      28.00     210.83    1523.49
9-task-1-0/0 [default0]:Rank 2        28.00    1204.00      57.17      28.00     228.91    1546.08
9-task-1-0/0 [default0]:Rank 3        28.00    1204.00      62.27      28.00     249.34    1571.61
9-task-1-0/0 [default0]:Rank 4        28.00    1204.00      51.90      28.00     207.80    1519.69
9-task-1-0/0 [default0]:Rank 5        28.00    1204.00      52.66      28.00     210.83    1523.49
9-task-1-0/0 [default0]:Rank 6        28.00    1204.00      57.17      28.00     228.91    1546.08
9-task-1-0/0 [default0]:Rank 7        28.00    1204.00      62.27      28.00     249.34    1571.61
9-task-1-0/0 [default0]:--------------------------------------------------------------------------------
9-task-1-0/0 [default0]:Total        224.00    9632.00     448.00     224.00    1793.75   12321.75
9-task-1-0/0 [default0]:================================================================================
9-task-1-0/0 [default0]:
Steps/Code to reproduce bug
Run the following test repeatedly on 8 ranks. The hang is intermittent, so several runs may be needed before it shows up:
tests/unit_tests/pipeline_parallel/test_fine_grained_activation_offloading.py::test_fine_grained_activation_offload_with_ep_a2a_overlap_compatibility[alltoall-False-offload_modules8]
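Because the hang is intermittent, one way to chase it is to rerun the single parametrization from the log under a wall-clock timeout and stop at the first run that does not finish. The sketch below is only an illustration: the helper names (`build_command`, `run_once`, `repeat`) are hypothetical, and the `torchrun` launch with 8 processes is an assumption based on the `[default0]` log prefix and the 8 ranks in the summary table, not a confirmed description of how CI runs this test.

```python
import subprocess

# Test ID copied verbatim from the log in this report.
TEST_ID = (
    "tests/unit_tests/pipeline_parallel/test_fine_grained_activation_offloading.py"
    "::test_fine_grained_activation_offload_with_ep_a2a_overlap_compatibility"
    "[alltoall-False-offload_modules8]"
)


def build_command(nproc=8):
    """Build the launch command.

    Assumption: the test is started with torchrun, one process per GPU.
    Adjust this to match the actual CI launcher.
    """
    return ["torchrun", f"--nproc_per_node={nproc}", "-m", "pytest", "-x", "-s", TEST_ID]


def run_once(cmd, timeout_s):
    """Run cmd once and classify the result as 'passed', 'failed', or 'hung'."""
    try:
        # On timeout, subprocess.run kills the direct child and raises
        # TimeoutExpired. Worker processes spawned by the launcher may
        # still need manual cleanup.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hung"
    return "passed" if result.returncode == 0 else "failed"


def repeat(cmd, attempts, timeout_s):
    """Rerun cmd up to `attempts` times, stopping at the first hang."""
    outcomes = []
    for _ in range(attempts):
        outcome = run_once(cmd, timeout_s)
        outcomes.append(outcome)
        if outcome == "hung":
            break
    return outcomes
```

For example, `repeat(build_command(), attempts=20, timeout_s=600)` would return a list of outcomes ending in `"hung"` if a run exceeds ten minutes. Both the attempt count and the timeout are placeholder values and should be tuned to the normal runtime of the test.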
Expected behavior
The test runs to completion on every run instead of hanging.