
Tracking excessive cpu memory usage in z2 cpu offload #7693


Description

@stas00

I have observed about numel * 28 bytes of CPU memory being required when running ZeRO stage 2 with optimizer CPU offload.
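Concretely, the setup in question looks roughly like this - a minimal config sketch with illustrative values (the field names follow the standard DeepSpeed config schema):

    # minimal sketch of the relevant DeepSpeed config (values are illustrative)
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},   # half precision (2 bytes/param)
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,  # see #7689 for the non-pinned path
            },
        },
    }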

Here is a breakdown of the current CPU memory allocations when offload_optimizer.device: cpu is used with zero stage 2.

Ideally it should be numel * 16 bytes (4 bytes each for master weights, grads, and the 2 optimizer states).

The CPU memory allocation happens in 2 parts - deepspeed.initialize and the first step call - so I'm going to break down each per-param allocation.

part 1: deepspeed.initialize

  1. master weights 4 bytes - kosher
  2. intermediate H2D copy buffer (for speed) - 2 bytes (half precision) - it seems we can't avoid this one
  3. related to the above - a 2-byte pinned memory overhead - fixed here: zero stage 1-2: don't pin memory if not configured #7689
  4. initialize_optimizer_states - creates 4 bytes for grads - kosher

total: 10 bytes per param (the 2 bytes from item 3 are no longer allocated after that fix)

We confirmed empirically that this is the case.
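For reference, here is a sketch of the kind of check behind that number - a process-RSS delta around deepspeed.initialize (psutil-based; model and ds_config are placeholders, it assumes the usual deepspeed launch environment, and RSS is only a rough proxy for the virtual-memory figures that see_memory_usage prints):

    import deepspeed
    import psutil

    def rss_bytes():
        # resident set size of the current process
        return psutil.Process().memory_info().rss

    before = rss_bytes()
    engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                   model_parameters=model.parameters(),
                                                   config=ds_config)
    numel = sum(p.numel() for p in model.parameters())
    print(f"deepspeed.initialize: {(rss_bytes() - before) / numel:.1f} bytes/param")  # ~10 here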

part 2: first step

  1. 8 bytes for optim states - kosher
  2. unscale_and_clip_grads - adds 4 bytes - at first this looks like peak CPU memory only, since grad.data.mul_(1. / combined_scale) shouldn't allocate anything beyond a temporary buffer in pytorch - but it doesn't behave like a temp buffer that linux simply hasn't released, because the next allocation doesn't re-use that memory and instead allocates its own full-sized tensor
  3. self.ds_opt_adam.adam_update for some reason allocates 4 bytes more

total: 16 bytes per param

grand total: 26 bytes per param

plus an additional ~2 bytes per param that I still can't account for
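Putting the accounting together and scaling it to an 80B-param model (same numbers as above, just multiplied out):

    numel = 80e9                  # e.g. Qwen3-Next-80B
    part1 = 4 + 2 + 4             # master weights + half-precision H2D buffer + grads
    part2 = 8 + 4 + 4             # optim states + unscale_and_clip_grads + adam_update
    accounted = part1 + part2     # 26 bytes/param
    print(f"accounted for: {accounted} of ~28 observed bytes/param")
    print(f"ideal:    {numel * 16 / 1e9:.0f} GB")         # 1280 GB
    print(f"observed: {numel * 28 / 1e9:.0f} GB")         # 2240 GB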

Details:

  • this leaks 4 bytes per param (inside the first step call):

        for grad in grad_groups_flat:
            # checking cpu memory before and after the next call shows an additional fp32
            # allocation on the first call of unscale_and_clip_grads; on subsequent calls it's stable
            grad.data.mul_(1. / combined_scale)

I even tried moving the tensor to the GPU, doing the mul there, and moving it back to the CPU - I still get the leak:

    rank = torch.distributed.get_rank()
    device = torch.device(f"cuda:{rank}")

    t = 1. / combined_scale
    see_memory_usage(f"before", force=True)

    grad = grad.to(device)
    t = t.to(device)
    see_memory_usage(f"after to cuda", force=True)

    grad.mul_(t)
    see_memory_usage(f"after mul_", force=True)

    grad = grad.cpu()
    see_memory_usage(f"after to cpu", force=True)

and the memory readings are (in each block the first line is CUDA, the second is CPU):

before
MA 9.36 GB         Max_MA 9.36 GB         CA 12.05 GB         Max_CA 12 GB
CPU Virtual Memory:  used = 135.32 GB, percent = 6.8%

after to cuda
MA 21.36 GB         Max_MA 21.36 GB         CA 24.05 GB         Max_CA 24 GB
CPU Virtual Memory:  used = 135.33 GB, percent = 6.8%

after mul_
MA 21.36 GB         Max_MA 21.36 GB         CA 24.05 GB         Max_CA 24 GB
CPU Virtual Memory:  used = 135.33 GB, percent = 6.8%

after to cpu
MA 9.36 GB         Max_MA 21.36 GB         CA 24.05 GB         Max_CA 24 GB
CPU Virtual Memory:  used = 147.38 GB, percent = 7.4%

so you can see ~12GB of CPU memory lost (this was with 3 layers out of 48)

the leak happens only in the first step

and it's not reclaimable - the next allocation doesn't re-use it
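For reference, the before/after CPU check above is just a process-RSS delta around the call - a sketch of that instrumentation (the extra allocation only shows up on the first pass through the real unscale_and_clip_grads, so this is a measuring aid, not a standalone repro):

    import psutil

    def rss_gb():
        # resident set size of the current process, in GB
        return psutil.Process().memory_info().rss / 2**30

    def check(tag, fn):
        before = rss_gb()
        fn()
        print(f"{tag}: RSS grew by {rss_gb() - before:.2f} GB")

    # hypothetical instrumentation inside unscale_and_clip_grads:
    # check("mul_", lambda: grad.data.mul_(1. / combined_scale))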

  • cpu adam has a weird issue where it allocates 4 bytes per param the first time self.ds_opt_adam.adam_update is called - even though, looking at the cpp code, there is no such allocation there - and if it's a temp buffer, it isn't released, it sticks around. Validated by hitting cpu-oom with Qwen3-Next-80B - if I disable the update, the cpu oom goes away (see the sketch below for one way to isolate this)
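One way to try to isolate that outside the full engine - a sketch that drives DeepSpeedCPUAdam directly (assumes the cpu_adam op builds; sizes are illustrative, and only growth beyond the expected 8 bytes/param of fp32 optimizer states would point at the extra allocation):

    import psutil
    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    def rss_gb():
        return psutil.Process().memory_info().rss / 2**30

    n = 100_000_000                                      # 100M params -> 0.4 GB in fp32
    p = torch.zeros(n, dtype=torch.float32, requires_grad=True)
    p.grad = torch.zeros_like(p)
    opt = DeepSpeedCPUAdam([p])

    before = rss_gb()
    opt.step()    # first call goes through self.ds_opt_adam.adam_update
    print(f"first step:  +{rss_gb() - before:.2f} GB")   # ~0.8 GB expected for exp_avg/exp_avg_sq
    before = rss_gb()
    opt.step()
    print(f"second step: +{rss_gb() - before:.2f} GB")   # should be ~0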

cc: @tjruwase - I've added the summary above, but I have to move on - perhaps someone else will get a chance to solve at least one of the two 4-bytes-per-param leaks. For an 80B-param model that's 640GB of CPU memory wasted.
