Description
I have observed about `numel * 28` bytes of CPU memory required for ZeRO stage 2 + optimizer CPU offload.
Here is a breakdown of the current CPU memory allocations when `offload_optimizer.device: cpu` is used with `zero: stage 2`.
Ideally it should be `numel * 16` bytes (4 bytes each for master weights and grads, plus 8 bytes for the 2 optimizer states).
The CPU memory allocations happen in 2 parts - `deepspeed.initialize` plus the first `step` call - and I'm going to break each part down into per-param allocations.
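For reference, a minimal sketch of the setup being discussed (ZeRO stage 2 with the optimizer offloaded to CPU). The tiny model, batch size and lr are placeholders; the `zero_optimization` keys are the standard DeepSpeed config fields:

```python
# Minimal sketch of the setup under discussion: ZeRO stage 2 + optimizer CPU
# offload. Model, batch size and lr are placeholders; run with the deepspeed
# launcher so distributed init happens.
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# part 1 of the CPU allocations happens here ...
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# ... and part 2 on the first model_engine.step() after backward.
```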
part 1: `deepspeed.initialize`
- master weights - 4 bytes - kosher
- intermediary H2D copy buffer (for speed) - 2 bytes (half precision) - seems we can't avoid this one
- related to the above - 2 bytes of pinned memory overhead - fixed here: zero stage 1-2: don't pin memory if not configured #7689
- `initialize_optimizer_states` - creates 4 bytes for grads - kosher

total: 10 bytes per param

we confirmed empirically that this is the case (measurement sketch below).
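A sketch of the kind of check used to confirm it, assuming the `model` and `ds_config` from the sketch above: wrap the `deepspeed.initialize` call with a process-RSS measurement and divide the delta by the total number of params (DeepSpeed's `see_memory_usage` prints similar numbers).

```python
# How the ~10 bytes/param for part 1 can be confirmed empirically: wrap the
# deepspeed.initialize() call from the sketch above with an RSS check.
import psutil
import deepspeed

def cpu_rss():
    return psutil.Process().memory_info().rss

numel_total = sum(p.numel() for p in model.parameters())

rss_before = cpu_rss()
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
rss_after = cpu_rss()

print(f"deepspeed.initialize: {(rss_after - rss_before) / numel_total:.1f} bytes/param")
```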
part 2: first `step`
- 8 bytes for optim states - kosher
- `unscale_and_clip_grads` - adds 4 bytes - we would expect this to be only peak cpu memory, since `grad.data.mul_(1. / combined_scale)` shouldn't allocate anything beyond a temp buffer in pytorch - yet it doesn't behave like a temp buffer that Linux simply hasn't released, because the next allocation doesn't re-use it and allocates its own full-sized tensor
- `self.ds_opt_adam.adam_update` for some reason allocates 4 bytes more

total: 16 bytes per param

grand total: 26 bytes per param, plus an additional ~2 bytes I still don't have an account for (spelled out below).
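Spelled out as plain arithmetic (decimal GB, to match the numbers used in this issue; the 80B figure refers to the Qwen3-Next-80B case mentioned below):

```python
# Per-param CPU byte accounting from the breakdown above, and what the two
# suspected 4-byte-per-param leaks cost for an 80B-param model (decimal GB).
part1 = 4 + 2 + 4       # master weights + half-precision H2D copy buffer + grads -> 10 B/param
part2 = 8 + 4 + 4       # optim states + unscale_and_clip_grads + adam_update     -> 16 B/param
ideal = 4 + 4 + 8       # master weights + grads + 2x optim states                -> 16 B/param

print(part1 + part2)    # 26 B/param accounted for (~28 observed)

numel = 80e9            # e.g. Qwen3-Next-80B
print(f"{(4 + 4) * numel / 1e9:.0f} GB")  # 640 GB for the two 4 B/param leaks
```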
Details:
- this leaks 4 bytes per param (inside the first `step` call):

```python
for grad in grad_groups_flat:
    # checking cpu memory before and after the next call shows an additional fp32
    # allocation on the first call of unscale_and_clip_grads, on subsequent calls it's stable
    grad.data.mul_(1. / combined_scale)
```
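For reference, a minimal standalone sketch (plain PyTorch, no DeepSpeed) of the same before/after-RSS check on a large CPU tensor - whether this stripped-down case reproduces the extra allocation is exactly the open question; the tensor size and scale value are arbitrary:

```python
# Standalone before/after-RSS check for an in-place mul_ on a large CPU fp32
# tensor (plain PyTorch, no DeepSpeed). Size and scale are arbitrary.
import torch
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

grad = torch.ones(512 * 1024 * 1024, dtype=torch.float32)  # ~2 GiB, pages committed
combined_scale = 65536.0

print(f"before:      {rss_gb():.2f} GB")
grad.mul_(1. / combined_scale)
print(f"first mul_:  {rss_gb():.2f} GB")
grad.mul_(1. / combined_scale)
print(f"second mul_: {rss_gb():.2f} GB")
```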
I even tried to move it to GPU, do the mul there and move it back to CPU - I still get the leak:
```python
rank = torch.distributed.get_rank()
device = torch.device(f"cuda:{rank}")
t = 1. / combined_scale
see_memory_usage(f"before", force=True)
grad = grad.to(device)
t = t.to(device)
see_memory_usage(f"after to cuda", force=True)
grad.mul_(t)
see_memory_usage(f"after mul_", force=True)
grad = grad.cpu()
see_memory_usage(f"after to cpu", force=True)
```
and the memory is (for each stage, the first row is CUDA, the second is CPU):

```
before
MA 9.36 GB Max_MA 9.36 GB CA 12.05 GB Max_CA 12 GB
CPU Virtual Memory: used = 135.32 GB, percent = 6.8%
after to cuda
MA 21.36 GB Max_MA 21.36 GB CA 24.05 GB Max_CA 24 GB
CPU Virtual Memory: used = 135.33 GB, percent = 6.8%
after mul_
MA 21.36 GB Max_MA 21.36 GB CA 24.05 GB Max_CA 24 GB
CPU Virtual Memory: used = 135.33 GB, percent = 6.8%
after to cpu
MA 9.36 GB Max_MA 21.36 GB CA 24.05 GB Max_CA 24 GB
CPU Virtual Memory: used = 147.38 GB, percent = 7.4%
```
so you can see ~12GB of CPU memory lost (and that's just 3 layers out of 48).

The leak happens only in the first step, and it's not reclaimable - the next allocation doesn't re-use it.
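One untested variant that might be worth trying for the GPU round-trip above: copy the result back into the existing CPU storage with `copy_()` instead of `grad = grad.cpu()`, which always allocates a brand-new CPU tensor. A sketch only, assuming the same `grad` / `combined_scale` variables as in the snippet above:

```python
# Sketch only (untested): do the mul on the GPU, but write the result back into
# the existing CPU tensor with copy_() rather than allocating a new one via
# grad.cpu(). Assumes the same grad / combined_scale as in the snippet above.
rank = torch.distributed.get_rank()
device = torch.device(f"cuda:{rank}")

gpu_grad = grad.data.to(device, non_blocking=True)
gpu_grad.mul_(1. / combined_scale)
grad.data.copy_(gpu_grad)  # in-place write-back, no new CPU allocation
del gpu_grad
```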
- cpu adam has some weird issue of allocating 4 bytes per param the first time `self.ds_opt_adam.adam_update` is called - even though, if one looks at the cpp code, there is no 4-bytes-per-param allocation there, and if it's a temp buffer it isn't released - it sticks around. Validated by hitting a cpu OOM with Qwen3-Next-80B - if I disable the update, the cpu OOM goes away.
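A small isolated check one could run for this side: measure RSS around the first and second `step()` of a bare `DeepSpeedCPUAdam`, which is what calls `ds_opt_adam.adam_update` under the hood. Whether this minimal case shows the extra 4 bytes/param is exactly what needs verifying; the parameter size is arbitrary:

```python
# Check CPU RSS growth around the first vs second step of DeepSpeedCPUAdam,
# which is what ends up calling ds_opt_adam.adam_update under the hood.
import torch
import psutil
from deepspeed.ops.adam import DeepSpeedCPUAdam

def rss_gb():
    return psutil.Process().memory_info().rss / 2**30

p = torch.nn.Parameter(torch.zeros(10**8, dtype=torch.float32))  # ~0.4 GB of fp32 params
p.grad = torch.ones_like(p)
opt = DeepSpeedCPUAdam([p], lr=1e-3)

print(f"before step 1: {rss_gb():.2f} GB")
opt.step()
print(f"after step 1:  {rss_gb():.2f} GB")  # exp_avg + exp_avg_sq alone should add ~0.8 GB
opt.step()
print(f"after step 2:  {rss_gb():.2f} GB")  # any further growth is the suspect allocation
```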
cc: @tjruwase - I added the summary above, but I have to move on - perhaps someone else will get a chance to solve at least one of the two 4-bytes-per-param leaks. For an 80B-param model that's 640GB of CPU memory wasted.