Reserve memory headroom and cap MAX_JOBS in source builds (#2493) by shivam2199 · Pull Request #2548 · Dao-AILab/flash-attention

shivam2199 · 2026-05-08T03:48:23Z

Summary

Closes #2493.

The auto-MAX_JOBS heuristic in NinjaBuildExtension.__init__ budgets ~100% of free memory and leaves no safety margin. On large hosts this reliably runs the box out of memory whenever any of the following holds: nvcc per-thread peaks land above the 5 GB average, other processes consume memory during the build, or CUDA 13.x adds gencode targets that push per-thread usage above the 5 GB estimate from #2079, which was measured under CUDA 12.8.

Reporter's failure (from #2493): 880 GB / 96 CPU host, default NVCC_THREADS=4 → MAX_JOBS = min(48, 42) = 42 → 168 concurrent nvcc threads → ~840 GB peak → 10 GB headroom (1.2%) → hard reboot, with no clean OOM-killer intervention.
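The chain above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the code in setup.py: the function name is hypothetical, the core limit of `cpus // 2` is inferred from the `min(48, 42)` figure, and the 850 GB of free memory is an assumption derived from the ~840 GB peak plus 10 GB headroom.

```python
# Sketch of the pre-fix arithmetic. Names and the 850 GB figure are assumptions.
GB_PER_NVCC_THREAD = 5  # per-thread estimate attributed to #2079


def old_max_jobs(free_gb: float, cpus: int, nvcc_threads: int = 4) -> int:
    """Pick the smaller of the core-based and memory-based job limits."""
    by_cores = max(1, cpus // 2)  # inferred: 96 CPUs -> 48
    by_memory = int(free_gb / (GB_PER_NVCC_THREAD * nvcc_threads))
    return max(1, min(by_cores, by_memory))


free_gb = 850  # assumed free memory on the 880 GB host
nvcc_threads = 4
jobs = old_max_jobs(free_gb, cpus=96, nvcc_threads=nvcc_threads)
threads = jobs * nvcc_threads
peak_gb = threads * GB_PER_NVCC_THREAD
headroom_gb = free_gb - peak_gb

print(f"jobs={jobs} threads={threads} peak={peak_gb} GB headroom={headroom_gb} GB")
```

Under these assumptions the memory limit (42) is the binding one, and the budget leaves only 10 GB spare out of 850 GB.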

Fix

Two changes in NinjaBuildExtension.__init__:

Memory safety margin. Reserve max(16 GB, 15% of free RAM) before computing the memory-based job limit. The 16 GB floor protects small machines; the 15% term scales the reserve on large ones.
Hardcap. Cap MAX_JOBS at 32 by default to prevent runaway parallelism on very large hosts. The cap can be overridden with MAX_JOBS_HARDCAP=N, so build environments that can sustain more jobs aren't bottlenecked.

The user-facing MAX_JOBS=N env var still takes precedence over both, so behavior is unchanged for anyone who sets it explicitly.
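The two changes can be sketched as a single pure function. This is an illustration under stated assumptions rather than the actual patch: the function and constant names are hypothetical, and the `cpus // 2` core limit and the 5 GB per nvcc thread figure are taken from the numbers quoted in this description.

```python
# Sketch of the post-fix formula. All names here are hypothetical.
GB_PER_NVCC_THREAD = 5    # per-thread estimate attributed to #2079
RESERVE_FLOOR_GB = 16     # minimum memory held back
RESERVE_FRACTION = 0.15   # share of free RAM held back on large hosts
DEFAULT_HARDCAP = 32      # default ceiling on parallel jobs


def new_max_jobs(free_gb: float, cpus: int, nvcc_threads: int = 4,
                 hardcap: int = DEFAULT_HARDCAP) -> int:
    """Job count after reserving headroom and applying the hardcap."""
    reserve_gb = max(RESERVE_FLOOR_GB, RESERVE_FRACTION * free_gb)
    usable_gb = max(0.0, free_gb - reserve_gb)
    by_cores = max(1, cpus // 2)  # assumed core-based limit
    by_memory = int(usable_gb / (GB_PER_NVCC_THREAD * nvcc_threads))
    # Never drop below one job, even when the reserve eats most of the memory.
    return max(1, min(by_cores, by_memory, hardcap))


print(new_max_jobs(880, 96))              # memory limit 37, capped at 32
print(new_max_jobs(880, 96, hardcap=64))  # cap raised, memory limit binds
print(new_max_jobs(32, 8))                # small host falls back to 1 job
```

Keeping the calculation free of environment and psutil lookups makes it easy to check against the table of host sizes.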

Effect across host sizes

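| RAM | CPU | Prev `MAX_JOBS` | New `MAX_JOBS` | New headroom |
| --- | --- | --- | --- | --- |
| 32 GB | 8 | 1 | 1 | ~12 GB |
| 64 GB | 16 | 3 | 2 | ~24 GB |
| 256 GB | 32 | 12 | 10 | ~56 GB |
| 512 GB | 64 | 25 | 21 | ~92 GB |
| 880 GB | 96 | 42 | 32 (capped) | ~240 GB |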
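The "New `MAX_JOBS`" and "New headroom" columns can be recomputed with a short script. This sketch assumes the RAM column is treated as free memory, NVCC_THREADS is left at 4, and headroom means RAM minus the 5 GB per thread budget for all running threads; the helper name is hypothetical.

```python
# Recompute the new job count and headroom for each table row.
# Assumptions: RAM column == free memory, NVCC_THREADS=4, core limit = cpus // 2.
GB_PER_THREAD = 5
NVCC_THREADS = 4


def new_max_jobs(free_gb: float, cpus: int, hardcap: int = 32) -> int:
    reserve_gb = max(16, 0.15 * free_gb)
    by_memory = int(max(0.0, free_gb - reserve_gb) / (GB_PER_THREAD * NVCC_THREADS))
    return max(1, min(max(1, cpus // 2), by_memory, hardcap))


rows = [(32, 8), (64, 16), (256, 32), (512, 64), (880, 96)]  # (RAM GB, CPUs)

results = []
for ram_gb, cpus in rows:
    jobs = new_max_jobs(ram_gb, cpus)
    # Memory left over once every nvcc thread uses its 5 GB budget.
    headroom_gb = ram_gb - jobs * NVCC_THREADS * GB_PER_THREAD
    results.append((ram_gb, jobs, headroom_gb))
    print(f"{ram_gb:>4} GB  {cpus:>3} CPU  jobs={jobs:>2}  headroom={headroom_gb} GB")
```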

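Parallelism drops by roughly 25% on the largest hosts (42 → 32 jobs in the 880 GB row), so compilation is expected to slow by a similar amount, but the formula no longer drives them into OOM. The smallest hosts are unaffected: the 32 GB row stays at 1 job, and the 16 GB floor matches what the prior 9 GB/job heuristic effectively reserved.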

Validation

Verified the new formula numerically against the table from #2493 on the host configurations above. A full source build was not exercised: the change is contained to setup.py arithmetic that runs once during pip install, before any CUDA code is compiled, and the repo currently has no test harness for build-system logic.

Notes for reviewers

MAX_JOBS_HARDCAP is a new env var; if Dao-AILab CI builds wheels on a box that previously ran with >32 jobs and benefits from it, set MAX_JOBS_HARDCAP=64 (or whatever is appropriate) in the CI config to keep the prior throughput. The default is conservative on purpose.
MAX_JOBS=N (existing) still takes precedence and skips the heuristic entirely.
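The precedence between the two environment variables can be sketched as follows. This is a hedged illustration, not the patched setup.py: `resolve_max_jobs` and its `heuristic_jobs` argument (the uncapped result of the core and memory limits) are hypothetical names, and passing the environment as a plain mapping is a choice made here so the logic can be exercised without touching `os.environ`.

```python
from typing import Mapping

DEFAULT_HARDCAP = 32  # default described in this PR


def resolve_max_jobs(environ: Mapping[str, str], heuristic_jobs: int) -> int:
    """Apply the precedence: explicit MAX_JOBS, then the capped heuristic."""
    explicit = environ.get("MAX_JOBS")
    if explicit:
        # An explicit MAX_JOBS skips the heuristic and the hardcap entirely.
        return int(explicit)
    hardcap = int(environ.get("MAX_JOBS_HARDCAP", str(DEFAULT_HARDCAP)))
    return max(1, min(heuristic_jobs, hardcap))


# heuristic_jobs=37 corresponds to the 880 GB / 96 CPU row before capping.
print(resolve_max_jobs({}, 37))                          # default cap applies
print(resolve_max_jobs({"MAX_JOBS_HARDCAP": "64"}, 37))  # raised cap, heuristic binds
print(resolve_max_jobs({"MAX_JOBS": "48"}, 37))          # explicit value wins
```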

Test plan

Numerical validation of the new formula across the table from the issue
setup.py parses cleanly (python -c "import ast; ast.parse(open('setup.py').read())")
Maintainer source-build smoke test on a CUDA host (optional; the change is in pre-build math)

…2493)

The auto-MAX_JOBS heuristic targeted ~100% of free memory, leaving zero margin for nvcc spikes above the 5GB/thread average, OS overhead, or co-running processes. On an 880GB / 96-CPU host this picked MAX_JOBS=42 (168 nvcc threads, ~840GB peak) and triggered hard reboots when peaks landed simultaneously.

Reserve max(16GB, 15% of free RAM) before computing the memory-based limit and add a hardcap of 32 jobs (overridable via MAX_JOBS_HARDCAP) to prevent runaway parallelism on very large hosts.

shivam2199 · 2026-05-08T12:53:13Z

@Johnsonms Please review this as well. Thanks!

shivam2199 · 2026-05-10T15:23:36Z

@drisspg @jayhshah @Johnsonms Can you please review this?

shivam2199 · 2026-05-14T07:39:24Z

@Johnsonms Please review/approve this as well.

shivam2199 · 2026-05-19T12:29:30Z

@Johnsonms @drisspg @jayhshah can you please take a look here?

shivam2199 wants to merge 1 commit into Dao-AILab:main from shivam2199:fix/2493-build-oom-safety-margin
