Description
I have observed about `numel * 28` bytes of CPU memory required for ZeRO stage 2 + optimizer CPU offload.
Here is a breakdown of the current CPU memory allocations when `offload_optimizer.device: cpu` is used with `zero: stage 2`.
Ideally it should be `numel * 16` bytes (4 bytes each for master weights and grads, plus 8 bytes for the 2 optimizer states).
The CPU memory allocations happen in 2 parts - `deepspeed.initialize` plus the first `step` call - and I'm going to break each part down into per-param allocations.
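For reference, a minimal sketch of the setup being discussed (ZeRO stage 2 with the optimizer offloaded to CPU). The tiny model, batch size and lr are placeholders; the `zero_optimization` keys are the standard DeepSpeed config fields:

```python
# Minimal sketch of the setup under discussion: ZeRO stage 2 + optimizer CPU
# offload. Model, batch size and lr are placeholders; run with the deepspeed
# launcher so distributed init happens.
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# part 1 of the CPU allocations happens here ...
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# ... and part 2 on the first model_engine.step() after backward.
```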
part 1: `deepspeed.initialize`
- master weights - 4 bytes - kosher
- intermediary H2D copy buffer (for speed) - 2 bytes (half precision) - seems we can't avoid this one
- related to the above - 2 bytes of pinned memory overhead - fixed here: zero stage 1-2: don't pin memory if not configured #7689
- `initialize_optimizer_states` - creates 4 bytes for grads - kosher

total: 10 bytes per param

we confirmed empirically that this is the case (measurement sketch below).
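A sketch of the kind of check used to confirm it, assuming the `model` and `ds_config` from the sketch above: wrap the `deepspeed.initialize` call with a process-RSS measurement and divide the delta by the total number of params (DeepSpeed's `see_memory_usage` prints similar numbers).

```python
# How the ~10 bytes/param for part 1 can be confirmed empirically: wrap the
# deepspeed.initialize() call from the sketch above with an RSS check.
import psutil
import deepspeed

def cpu_rss():
    return psutil.Process().memory_info().rss

numel_total = sum(p.numel() for p in model.parameters())

rss_before = cpu_rss()
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
rss_after = cpu_rss()

print(f"deepspeed.initialize: {(rss_after - rss_before) / numel_total:.1f} bytes/param")
```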
part 2: first `step`
- 8 bytes for optim states - kosher
- `unscale_and_clip_grads` - adds 4 bytes - we would expect this to be only peak cpu memory, since `grad.data.mul_(1. / combined_scale)` shouldn't allocate anything beyond a temp buffer in pytorch - yet it doesn't behave like a temp buffer that Linux simply hasn't released, because the next allocation doesn't re-use it and allocates its own full-sized tensor
- `self.ds_opt_adam.adam_update` for some reason allocates 4 bytes more

total: 16 bytes per param

grand total: 26 bytes per param, plus an additional ~2 bytes I still don't have an account for (spelled out below).
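Spelled out as plain arithmetic (decimal GB, to match the numbers used in this issue; the 80B figure refers to the Qwen3-Next-80B case mentioned below):

```python
# Per-param CPU byte accounting from the breakdown above, and what the two
# suspected 4-byte-per-param leaks cost for an 80B-param model (decimal GB).
part1 = 4 + 2 + 4       # master weights + half-precision H2D copy buffer + grads -> 10 B/param
part2 = 8 + 4 + 4       # optim states + unscale_and_clip_grads + adam_update     -> 16 B/param
ideal = 4 + 4 + 8       # master weights + grads + 2x optim states                -> 16 B/param

print(part1 + part2)    # 26 B/param accounted for (~28 observed)

numel = 80e9            # e.g. Qwen3-Next-80B
print(f"{(4 + 4) * numel / 1e9:.0f} GB")  # 640 GB for the two 4 B/param leaks
```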
Details:
- this leaks 4 bytes per param (inside the first `step` call):

```python
for grad in grad_groups_flat:
    # checking cpu memory before and after the next call shows an additional fp32
    # allocation on the first call of unscale_and_clip_grads, on subsequent calls it's stable
    grad.data.mul_(1. / combined_scale)
```
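For reference, a minimal standalone sketch (plain PyTorch, no DeepSpeed) of the same before/after-RSS check on a large CPU tensor - whether this stripped-down case reproduces the extra allocation is exactly the open question; the tensor size and scale value are arbitrary:

```python
# Standalone before/after-RSS check for an in-place mul_ on a large CPU fp32
# tensor (plain PyTorch, no DeepSpeed). Size and scale are arbitrary.
import torch
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

grad = torch.ones(512 * 1024 * 1024, dtype=torch.float32)  # ~2 GiB, pages committed
combined_scale = 65536.0

print(f"before:      {rss_gb():.2f} GB")
grad.mul_(1. / combined_scale)
print(f"first mul_:  {rss_gb():.2f} GB")
grad.mul_(1. / combined_scale)
print(f"second mul_: {rss_gb():.2f} GB")
```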
I even tried to move it to GPU, do the mul there and move it back to CPU - I still get the leak:
```python
rank = torch.distributed.get_rank()
device = torch.device(f"cuda:{rank}")
t = 1. / combined_scale
see_memory_usage(f"before", force=True)
grad = grad.to(device)
t = t.to(device)
see_memory_usage(f"after to cuda", force=True)
grad.mul_(t)
see_memory_usage(f"after mul_", force=True)
grad = grad.cpu()
see_memory_usage(f"after to cpu", force=True)
```
and the memory is (for each stage, the first row is CUDA, the second is CPU):

```
before
MA 9.36 GB Max_MA 9.36 GB CA 12.05 GB Max_CA 12 GB
CPU Virtual Memory: used = 135.32 GB, percent = 6.8%
after to cuda
MA 21.36 GB Max_MA 21.36 GB CA 24.05 GB Max_CA 24 GB
CPU Virtual Memory: used = 135.33 GB, percent = 6.8%
after mul_
MA 21.36 GB Max_MA 21.36 GB CA 24.05 GB Max_CA 24 GB
CPU Virtual Memory: used = 135.33 GB, percent = 6.8%
after to cpu
MA 9.36 GB Max_MA 21.36 GB CA 24.05 GB Max_CA 24 GB
CPU Virtual Memory: used = 147.38 GB, percent = 7.4%
```
so you can see ~12GB of CPU memory lost (and that's just 3 layers out of 48).

The leak happens only in the first step, and it's not reclaimable - the next allocation doesn't re-use it.
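One untested variant that might be worth trying for the GPU round-trip above: copy the result back into the existing CPU storage with `copy_()` instead of `grad = grad.cpu()`, which always allocates a brand-new CPU tensor. A sketch only, assuming the same `grad` / `combined_scale` variables as in the snippet above:

```python
# Sketch only (untested): do the mul on the GPU, but write the result back into
# the existing CPU tensor with copy_() rather than allocating a new one via
# grad.cpu(). Assumes the same grad / combined_scale as in the snippet above.
rank = torch.distributed.get_rank()
device = torch.device(f"cuda:{rank}")

gpu_grad = grad.data.to(device, non_blocking=True)
gpu_grad.mul_(1. / combined_scale)
grad.data.copy_(gpu_grad)  # in-place write-back, no new CPU allocation
del gpu_grad
```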
- cpu adam has some weird issue of allocating 4 bytes per param the first time `self.ds_opt_adam.adam_update` is called - even though, if one looks at the cpp code, there is no 4-bytes-per-param allocation there, and if it's a temp buffer it isn't released - it sticks around. Validated by hitting a cpu OOM with Qwen3-Next-80B - if I disable the update, the cpu OOM goes away.
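A small isolated check one could run for this side: measure RSS around the first and second `step()` of a bare `DeepSpeedCPUAdam`, which is what calls `ds_opt_adam.adam_update` under the hood. Whether this minimal case shows the extra 4 bytes/param is exactly what needs verifying; the parameter size is arbitrary:

```python
# Check CPU RSS growth around the first vs second step of DeepSpeedCPUAdam,
# which is what ends up calling ds_opt_adam.adam_update under the hood.
import torch
import psutil
from deepspeed.ops.adam import DeepSpeedCPUAdam

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

p = torch.nn.Parameter(torch.zeros(10**8, dtype=torch.float32))  # ~0.4 GB of fp32 params
p.grad = torch.ones_like(p)
opt = DeepSpeedCPUAdam([p], lr=1e-3)

print(f"before step 1: {rss_gb():.2f} GB")
opt.step()
print(f"after step 1:  {rss_gb():.2f} GB")  # exp_avg + exp_avg_sq alone should add ~0.8 GB
opt.step()
print(f"after step 2:  {rss_gb():.2f} GB")  # any further growth is the suspect allocation
```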
cc: @tjruwase - I added the summary above, but I have to move on - perhaps someone else will get a chance to solve at least one of the two 4-bytes-per-param leaks. For an 80B-param model that's 640GB of CPU memory wasted.