Flaky test test_fine_grained_activation_offloading.py::test_fine_grained_activation_offload_with_ep_a2a_overlap_compatibility #3952

@ko3n1g

Describe the bug

This test occasionally hangs. The log output leading up to the hang:

9-task-1-0/0 [default0]:PASSED
9-task-1-0/0 [default0]:tests/unit_tests/pipeline_parallel/test_fine_grained_activation_offloading.py::test_fine_grained_activation_offload_with_ep_a2a_overlap_compatibility[alltoall-False-offload_modules8] [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:WARNING:megatron.core.tensor_parallel.random:CPU RNG state changed within GPU RNG context
9-task-1-0/0 [default0]:
9-task-1-0/0 [default0]:================================================================================
9-task-1-0/0 [default0]:                        Activation Offload Summary (MB)                         
9-task-1-0/0 [default0]:================================================================================
9-task-1-0/0 [default0]:Rank       attn_proj   core_attn  expert_fc1    mlp_norm     moe_act       Total
9-task-1-0/0 [default0]:--------------------------------------------------------------------------------
9-task-1-0/0 [default0]:Rank 0         28.00     1204.00       51.90       28.00      207.80     1519.69
9-task-1-0/0 [default0]:Rank 1         28.00     1204.00       52.66       28.00      210.83     1523.49
9-task-1-0/0 [default0]:Rank 2         28.00     1204.00       57.17       28.00      228.91     1546.08
9-task-1-0/0 [default0]:Rank 3         28.00     1204.00       62.27       28.00      249.34     1571.61
9-task-1-0/0 [default0]:Rank 4         28.00     1204.00       51.90       28.00      207.80     1519.69
9-task-1-0/0 [default0]:Rank 5         28.00     1204.00       52.66       28.00      210.83     1523.49
9-task-1-0/0 [default0]:Rank 6         28.00     1204.00       57.17       28.00      228.91     1546.08
9-task-1-0/0 [default0]:Rank 7         28.00     1204.00       62.27       28.00      249.34     1571.61
9-task-1-0/0 [default0]:--------------------------------------------------------------------------------
9-task-1-0/0 [default0]:Total         224.00     9632.00      448.00      224.00     1793.75    12321.75
9-task-1-0/0 [default0]:================================================================================
9-task-1-0/0 [default0]:
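Because the hang is intermittent, one way to surface it is to rerun the affected parametrization under a timeout and record the attempt that stalls. The following is a minimal sketch, not the project's CI tooling: `run_until_hang`, the attempt count, and the timeout are hypothetical, and the `torchrun` launch line with 8 processes is an assumption based on the 8 ranks in the summary table above.

```python
import subprocess
import sys

# Test ID and parametrization copied from the log above.
TEST_ID = (
    "tests/unit_tests/pipeline_parallel/test_fine_grained_activation_offloading.py"
    "::test_fine_grained_activation_offload_with_ep_a2a_overlap_compatibility"
    "[alltoall-False-offload_modules8]"
)

# Assumed launch command (not taken from the issue): 8 ranks via torchrun.
ASSUMED_CMD = ["torchrun", "--nproc-per-node=8", "-m", "pytest", "-x", TEST_ID]


def run_until_hang(cmd, attempts=20, timeout_s=600):
    """Rerun cmd up to `attempts` times.

    Returns the 1-based attempt number that exceeded timeout_s,
    or None if every attempt finished in time.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the direct child when the timeout expires.
            subprocess.run(cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: no exit after {timeout_s}s", file=sys.stderr)
            return attempt
    return None
```

Calling `run_until_hang(ASSUMED_CMD)` on a multi-GPU node would then report the first attempt that stalls. Only the direct child process is killed on timeout, so worker processes spawned by the launcher may need separate cleanup.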

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

Labels

bug
