Reserve memory headroom and cap MAX_JOBS in source builds (#2493) #2548

Open
shivam2199 wants to merge 1 commit into Dao-AILab:main from shivam2199:fix/2493-build-oom-safety-margin

@shivam2199
Contributor

Summary

Closes #2493.

The auto-MAX_JOBS heuristic in NinjaBuildExtension.__init__ budgets ~100% of free memory, leaving zero safety margin. On large hosts this consistently OOMs the box whenever any of the following happens: per-thread nvcc peaks land above the 5 GB average, other processes consume memory, or CUDA 13.x adds gencode targets that push per-process memory usage above the 5 GB estimate from #2079 (which was measured under CUDA 12.8).

Reporter's failure (from #2493): 880 GB / 96 CPU host, default NVCC_THREADS=4 → MAX_JOBS = min(48, 42) = 42 → 168 concurrent nvcc threads → ~840 GB peak → 10 GB headroom (1.2%) → hard reboot, no clean OOM-killer.
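The chain above can be replayed as plain arithmetic. This is only an illustrative sketch: the inputs are the figures quoted from #2493, the memory-based limit of 42 is taken as reported rather than recomputed, and the `cores // 2` CPU limit is inferred from the `min(48, 42)` term.

```python
# Replays the reported failure using only numbers quoted in the report.
NVCC_THREADS = 4      # default threads per compile job
GB_PER_THREAD = 5     # per-thread average estimate from #2079
CPU_COUNT = 96

cpu_limit = CPU_COUNT // 2   # 48; inferred from the min(48, 42) term
memory_limit = 42            # memory-based limit as reported in #2493
max_jobs = min(cpu_limit, memory_limit)

threads = max_jobs * NVCC_THREADS    # concurrent nvcc threads
peak_gb = threads * GB_PER_THREAD    # estimated peak if every thread hits the average

print(max_jobs, threads, peak_gb)    # 42 168 840
```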

Fix

Two changes in NinjaBuildExtension.__init__:

  1. Memory safety margin. Reserve max(16 GB, 15% of free RAM) before computing the memory-based job limit. Floors handle small machines; the percentage handles big ones.
  2. Hardcap. Cap MAX_JOBS at 32 by default to prevent runaway parallelism on very large hosts. Overridable via MAX_JOBS_HARDCAP=N so build environments that can sustain more aren't bottlenecked.

The user-facing MAX_JOBS=N env var still wins over both — same behavior as before for anyone setting it explicitly.
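A minimal sketch of how the two changes and the override order might fit together. `auto_max_jobs` is a hypothetical helper written for illustration, not the code in setup.py; the 5 GB/thread figure and the `cpu_count // 2` CPU limit are assumptions based on the numbers in this description.

```python
import os

GB_PER_THREAD = 5  # assumption: per-nvcc-thread estimate cited from #2079


def auto_max_jobs(free_gb, cpu_count, env=None):
    """Hypothetical helper mirroring the heuristic described above."""
    env = os.environ if env is None else env

    # An explicit MAX_JOBS wins and skips the heuristic entirely.
    if env.get("MAX_JOBS"):
        return int(env["MAX_JOBS"])

    nvcc_threads = int(env.get("NVCC_THREADS") or 4)
    hardcap = int(env.get("MAX_JOBS_HARDCAP") or 32)

    # 1. Memory safety margin: reserve max(16 GB, 15% of free RAM).
    reserve_gb = max(16.0, 0.15 * free_gb)
    usable_gb = max(0.0, free_gb - reserve_gb)
    memory_limit = int(usable_gb // (GB_PER_THREAD * nvcc_threads))

    # Assumption: the CPU-based limit is half the core count.
    cpu_limit = max(1, cpu_count // 2)

    # 2. Hardcap: never exceed MAX_JOBS_HARDCAP (default 32).
    return max(1, min(cpu_limit, memory_limit, hardcap))


print(auto_max_jobs(880, 96, env={}))                          # 32
print(auto_max_jobs(880, 96, env={"MAX_JOBS_HARDCAP": "64"}))  # 37
print(auto_max_jobs(880, 96, env={"MAX_JOBS": "50"}))          # 50
```

With the hardcap raised to 64, the memory-based limit (37) becomes the binding constraint on the 880 GB host, so the safety margin still applies.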

Effect across host sizes

| RAM | CPUs | Prev MAX_JOBS | New MAX_JOBS | New headroom |
|---|---|---|---|---|
| 32 GB | 8 | 1 | 1 | ~12 GB |
| 64 GB | 16 | 3 | 2 | ~24 GB |
| 256 GB | 32 | 12 | 10 | ~56 GB |
| 512 GB | 64 | 25 | 21 | ~92 GB |
| 880 GB | 96 | 42 | 32 (capped) | ~240 GB |

Compilation slows roughly 25% on the largest hosts (42 → 32 jobs), but the formula no longer drives them into OOM. Small hosts are largely unaffected: the 32 GB host stays at 1 job and the 64 GB host drops from 3 to 2. The floor of 16 GB matches what the prior 9 GB/job heuristic effectively reserved.
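The "New MAX_JOBS" and "New headroom" columns can be reproduced with a short script. This is a sketch under stated assumptions: the RAM column is treated as free memory, NVCC_THREADS is left at its default of 4, each thread is assumed to cost 5 GB, and the CPU limit is assumed to be half the core count. `new_max_jobs` is an illustrative name, not a function in setup.py.

```python
NVCC_THREADS, GB_PER_THREAD, HARDCAP = 4, 5, 32
GB_PER_JOB = NVCC_THREADS * GB_PER_THREAD  # 20 GB budgeted per compile job


def new_max_jobs(free_gb, cpus):
    """Illustrative version of the new formula (not the setup.py code)."""
    reserve_gb = max(16, 0.15 * free_gb)
    by_memory = int((free_gb - reserve_gb) // GB_PER_JOB)
    return max(1, min(cpus // 2, by_memory, HARDCAP))


hosts = [(32, 8), (64, 16), (256, 32), (512, 64), (880, 96)]
table = []
for ram_gb, cpus in hosts:
    jobs = new_max_jobs(ram_gb, cpus)
    headroom_gb = ram_gb - jobs * GB_PER_JOB  # memory left if every job hits its budget
    table.append((ram_gb, cpus, jobs, headroom_gb))
    print(f"{ram_gb:>4} GB  {cpus:>3} CPUs  MAX_JOBS={jobs:<3} headroom ~{headroom_gb} GB")
```

Under these assumptions the script yields 1, 2, 10, 21, and 32 jobs with ~12, ~24, ~56, ~92, and ~240 GB of headroom, matching the table.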

Validation

Verified the new formula numerically against the table from #2493 on the host configurations above. A full source build was not exercised: the change is confined to setup.py arithmetic that runs once during pip install, before any CUDA code is compiled, and the repo has no test harness for build-system logic today.

Notes for reviewers

  • MAX_JOBS_HARDCAP is a new env var; if Dao-AILab CI builds wheels on a box that previously ran with >32 jobs and benefits from it, set MAX_JOBS_HARDCAP=64 (or whatever is appropriate) in the CI config to keep the prior throughput. The default is conservative on purpose.
  • MAX_JOBS=N (existing) still takes precedence and skips the heuristic entirely.

Test plan

  • Numerical validation of the new formula across the table from the issue
  • setup.py parses cleanly (python -c "import ast; ast.parse(open('setup.py').read())")
  • Maintainer source-build smoke test on a CUDA host (optional — change is in pre-build math)

…2493)

The auto-MAX_JOBS heuristic targeted ~100% of free memory, leaving zero
margin for nvcc spikes above the 5GB/thread average, OS overhead, or
co-running processes. On an 880GB / 96-CPU host this picked MAX_JOBS=42
(168 nvcc threads, ~840GB peak) and triggered hard reboots when peaks
landed simultaneously.

Reserve max(16GB, 15% of free RAM) before computing the memory-based
limit and add a hardcap of 32 jobs (overridable via MAX_JOBS_HARDCAP)
to prevent runaway parallelism on very large hosts.
@shivam2199
Contributor Author

@Johnsonms Please review this as well. Thanks!

@shivam2199
Contributor Author

@drisspg @jayhshah @Johnsonms Can you please review this?

@shivam2199
Contributor Author

@Johnsonms Please review/approve this as well.

@shivam2199
Contributor Author

@Johnsonms @drisspg @jayhshah can you please take a look here?


Development

Successfully merging this pull request may close these issues.

Source build OOM on high-memory machines due to zero safety margin in MAX_JOBS calculation
