Reserve memory headroom and cap MAX_JOBS in source builds (#2493) #2548
Open
shivam2199 wants to merge 1 commit into
…2493) The auto-MAX_JOBS heuristic targeted ~100% of free memory, leaving zero margin for nvcc spikes above the 5GB/thread average, OS overhead, or co-running processes. On an 880GB / 96-CPU host this picked MAX_JOBS=42 (168 nvcc threads, ~840GB peak) and triggered hard reboots when peaks landed simultaneously. Reserve max(16GB, 15% of free RAM) before computing the memory-based limit and add a hardcap of 32 jobs (overridable via MAX_JOBS_HARDCAP) to prevent runaway parallelism on very large hosts.
[Contributor, Author] @Johnsonms Please review this as well. Thanks!
[Contributor, Author] @drisspg @jayhshah @Johnsonms Can you please review this?
[Contributor, Author] @Johnsonms Please review/approve this as well.
[Contributor, Author] @Johnsonms @drisspg @jayhshah can you please take a look here?
Summary
Closes #2493.
The auto-MAX_JOBS heuristic in NinjaBuildExtension.__init__ targets ~100% of free memory, leaving zero safety margin. On large hosts this consistently OOMs the box when nvcc per-thread peaks land above the 5 GB average, when other processes consume memory, or when CUDA 13.x adds gencode targets (which raise per-process memory usage above the 5 GB estimate from #2079, which was measured under CUDA 12.8).
Reporter's failure (from #2493): 880 GB / 96 CPU host, default NVCC_THREADS=4 → MAX_JOBS = min(48, 42) = 42 → 168 concurrent nvcc threads → ~840 GB peak → 10 GB headroom (1.2%) → hard reboot, no clean OOM-killer.
Fix
Two changes in NinjaBuildExtension.__init__:
1. Reserve max(16 GB, 15% of free RAM) before computing the memory-based job limit. The floor handles small machines; the percentage handles big ones.
2. Cap MAX_JOBS at 32 by default to prevent runaway parallelism on very large hosts. Overridable via MAX_JOBS_HARDCAP=N so build environments that can sustain more aren't bottlenecked.
The user-facing MAX_JOBS=N env var still wins over both, so behavior is unchanged for anyone setting it explicitly.
Effect across host sizes
[Table: MAX_JOBS before and after this change, per host size]
Compilation slows roughly 25% on the largest hosts but the formula no longer drives them into OOM. Small hosts are unaffected: the floor of 16 GB matches what the prior 9 GB/job heuristic effectively reserved.
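As a rough illustration of the arithmetic described above, the sketch below compares the old and new job limits for the reporter's host. It is not code taken from setup.py: the function names are hypothetical, the `cpus // 2` core limit is inferred from `min(48, 42)` on a 96-CPU host, and the 850 GB free-memory figure is inferred from the ~840 GB peak plus 10 GB headroom.

```python
# Hypothetical sketch of the old and new MAX_JOBS arithmetic.
# Assumptions: core limit is cpus // 2; memory cost is 5 GB per nvcc thread.

GB_PER_NVCC_THREAD = 5  # average per-thread estimate cited from #2079


def old_max_jobs(free_gb: float, cpus: int, nvcc_threads: int = 4) -> int:
    """Old heuristic: budget all free memory, no reserve, no cap."""
    by_cores = max(1, cpus // 2)
    by_memory = max(1, int(free_gb / (GB_PER_NVCC_THREAD * nvcc_threads)))
    return min(by_cores, by_memory)


def new_max_jobs(free_gb: float, cpus: int, nvcc_threads: int = 4,
                 hardcap: int = 32) -> int:
    """New heuristic: reserve headroom first, then apply the hard cap."""
    reserve_gb = max(16.0, 0.15 * free_gb)      # max(16 GB, 15% of free RAM)
    usable_gb = max(0.0, free_gb - reserve_gb)  # never go negative on tiny hosts
    by_cores = max(1, cpus // 2)
    by_memory = max(1, int(usable_gb / (GB_PER_NVCC_THREAD * nvcc_threads)))
    return max(1, min(by_cores, by_memory, hardcap))


if __name__ == "__main__":
    # Reporter's host: 96 CPUs, roughly 850 GB free (assumed), NVCC_THREADS=4.
    print(old_max_jobs(850, 96))              # 42
    print(new_max_jobs(850, 96))              # 32
    print(new_max_jobs(850, 96, hardcap=64))  # 36
```

With the cap raised to 64, the memory reserve alone limits this host to 36 jobs, so the default cap of 32 is the binding constraint only by a small margin here.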
Validation
Verified the new formula numerically against the table from #2493 on the host configurations above. Full source build was not exercised: the change is contained to setup.py arithmetic that runs once during pip install, before any CUDA code, and there's no test harness for build-system logic in the repo today.
Notes for reviewers
- MAX_JOBS_HARDCAP is a new env var; if Dao-AILab CI builds wheels on a box that previously ran with >32 jobs and benefits from it, set MAX_JOBS_HARDCAP=64 (or whatever is appropriate) in the CI config to keep the prior throughput. The default is conservative on purpose.
- MAX_JOBS=N (existing) still takes precedence and skips the heuristic entirely.
Test plan
- setup.py parses cleanly (python -c "import ast; ast.parse(open('setup.py').read())")
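Beyond the parse check, the env var precedence described in the notes for reviewers could be exercised with a small check like the one below. This is only a sketch: `resolve_max_jobs` is a hypothetical helper, not a function in setup.py, and it takes the heuristic result as an input rather than measuring memory.

```python
# Hypothetical helper showing the intended precedence:
# explicit MAX_JOBS > heuristic clamped by MAX_JOBS_HARDCAP (default 32).
from typing import Mapping


def resolve_max_jobs(env: Mapping[str, str], heuristic_jobs: int) -> int:
    explicit = env.get("MAX_JOBS")
    if explicit:
        # An explicit MAX_JOBS skips the heuristic and the cap entirely.
        return int(explicit)
    hardcap = int(env.get("MAX_JOBS_HARDCAP", "32"))
    return max(1, min(heuristic_jobs, hardcap))


if __name__ == "__main__":
    assert resolve_max_jobs({}, 36) == 32                          # default cap applies
    assert resolve_max_jobs({"MAX_JOBS_HARDCAP": "64"}, 36) == 36  # cap raised
    assert resolve_max_jobs({"MAX_JOBS": "48"}, 36) == 48          # explicit value wins
    print("precedence checks passed")
```

Passing the environment as a mapping keeps the check independent of the real process environment, which makes it easy to run without touching a build.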