Change the cudagraph distribution from linearly to exponentially-decreasing + grid for mixed prefill #3509
…thub.com/mathemakitten/Megatron-LM into helenn-exponential-decay-cudagraph-sizes
Hey, are there empirics available to support the change? Should the old setting still be supported for cases where it may be better? Also, do any tests need to be updated because of this?
The empirics are the reinforcement learning runs. I can provide internal pointers if you need. I don't think anyone can presently make a strong case for the old setting. I will update the values for
Awesome, thank you!
/ok to test 3c718e9
return "%d bytes" % mem_bytes
def _cuda_graph_mempool_bytes():
Nit: can you make return type -> Tuple[int, int]?
controller = self.controller
time_start = time.time()
torch.cuda.reset_peak_memory_stats()
Is it safe to reset the peak memory stats here for every request entry? Would this disrupt any existing memory recording?
santhnm2 left a comment
LGTM pending functional tests passing
/ok to test 60e7929
What does this PR do?
This changes the distribution of cudagraphs from linearly spaced in the top end to exponentially decreasing. In standalone inference, this gives a >2x decrease in the number of graphs, a 15GB memory decrease, and slightly better throughput than before.
These savings are possible because many of the existing graphs were either redundant (at near-zero-padding token counts in the upper range, where the next-largest graph would have done the job anyway) or not useful sizes. At max_tokens=16384 (RL use case) we were producing an enormous number of graphs but only replaying at the max batch size, since rollouts are sustained and the mixed/prefill graphs were all captured at a fixed request count of 16.
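To make the redundancy argument concrete, here is a minimal sketch that compares a linearly spaced size list with a halving one and shows how a batch would be padded up to the nearest captured size. The function names, the linear step of 512, and the round-up selection rule are assumptions made for illustration; they are not taken from the Megatron-LM code.

```python
from bisect import bisect_left


def linear_sizes(max_tokens: int, step: int) -> list[int]:
    """Hypothetical old-style list: one graph every `step` tokens."""
    return list(range(step, max_tokens + 1, step))


def exponential_sizes(max_tokens: int, min_tokens: int = 1) -> list[int]:
    """Hypothetical new-style list: halve from max_tokens down to min_tokens."""
    sizes = []
    size = max_tokens
    while size > min_tokens:
        sizes.append(size)
        size //= 2
    sizes.append(min_tokens)
    return sorted(set(sizes))


def pick_graph(sizes: list[int], n_tokens: int) -> int:
    """Assumed selection rule: smallest captured size that fits n_tokens."""
    i = bisect_left(sizes, n_tokens)
    if i == len(sizes):
        raise ValueError("batch is larger than every captured graph")
    return sizes[i]


if __name__ == "__main__":
    lin = linear_sizes(16384, step=512)   # step chosen only for illustration
    exp = exponential_sizes(16384)
    print(len(lin), len(exp))
    print(pick_graph(exp, 5000))
```

With halving, the padded size stays below twice the real token count, while the number of graphs grows only logarithmically with max_tokens.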
We also include a geometrically spaced grid in the mixed loop. Previously, every mixed CG used prefill request count (P) = cuda_graph_mixed_prefill_request_count (default 16). A real batch with P != 16 would slot-count-match a captured P=16 graph, but the captured graph's prefill metadata laid out tokens assuming 16 prefill slots, which doesn't replay cleanly when the real P differs. As a result, mixed CGs were captured but were mostly unusable. They are replaced with a grid, e.g. {1, 2, 4, 8, …, max_requests}, so real batches find a captured CG within a 2x factor of their actual P value.

breaking: --inference-dynamic-batching-cuda-graph-mixed-prefill-count (mapped to cuda_graph_mixed_prefill_request_count) is now an on/off toggle rather than a numeric specifier. > 0 enables mixed CGs across the full P-grid; <= 0 disables mixed CGs (decode-only path, as before).
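The P-grid idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `prefill_grid` and `match_prefill_count` are hypothetical names, and rounding the real P up to the next grid point is an assumed matching rule.

```python
from typing import Optional


def prefill_grid(max_requests: int) -> list[int]:
    """Powers of two below max_requests, plus max_requests itself."""
    grid = []
    p = 1
    while p < max_requests:
        grid.append(p)
        p *= 2
    grid.append(max_requests)
    return grid


def match_prefill_count(grid: list[int], actual_p: int) -> Optional[int]:
    """Assumed rule: smallest captured P that is >= the real P, else None."""
    for p in grid:
        if p >= actual_p:
            return p
    return None


if __name__ == "__main__":
    grid = prefill_grid(16)
    print(grid)
    print(match_prefill_count(grid, 5))
```

Under this rule, any real P in the range 1 to max_requests maps to a captured P that is less than twice its value, which is the "within a 2x factor" property described above.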
Enable with --inference-dynamic-batching-cuda-graph-sizing-distribution.
main:
now:
We also include bonus logging when cudagraphs are created to understand pool reuse efficiency. Logs now look like this, which tells us that while we allocated an extra 256kb for this graph, it did not increase the actual reserved mempool space:
INFO:root: [graph 65/65] [1]: 0 P + 1 D | pool reserved=5.5 gb (Δiter=0 bytes) pool allocated=1.8 gb (Δiter=256.0 kb)

Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks
Core 0.8)

Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review
(Step 2): Collect the expert reviewers reviews
Expert Review label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Review label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.