Inference: Cudagraph-aware admission gating in prefill scheduler#4870
mathemakitten wants to merge 4 commits into
The code changes LGTM but can we add some tests which exercise the code path that defers a request due to lack of a CUDA graph match? Basically we should confirm that even if a request is deferred:
    for _ in range(20):
        assert engine._cg_admission_check(req, candidate) is False
Can you add a comment that this is happening 20 times to increment the wait iterations?
        token_count=8, prefill_req_count=0, decode_req_count=1
    )

    # Iterate the "scheduler". Each iteration: try both, record outcomes.
Can you clarify that this loop is increasing the waiting step count?
What does this PR do?
Presently, `schedule_chunked_prefill` and `schedule_non_chunked_prefill` admit requests based only on the available token/request budget, without considering whether the resulting batch shape will match any captured CUDA graph (CG). When a request pushes the batch into a shape the captured set doesn't cover, the engine silently falls back to eager mode. While Transformer models can absorb extra decodes into prefill slots, for hybrid models the matcher requires `captured_decode_req_count >= real_decode_req_count` (strict mode), so we fall back to eager more often.

Now, when `cuda_graph_all_prefills` is on, `_find_cg_chunk_size(max_chunk_tokens)` traverses the list of cudagraphs to find the best-fit cudagraph for `(active_token_count + chunk_size, num_prefill_requests + 1, num_decode_requests)`. `schedule_chunked_prefill` snaps `prefill_chunk_length` to the largest CG-aligned boundary in the token budget, and if no CG covers the resulting shape, the request is deferred.

This is designed without an eager fallback: if there is no cudagraph, the deferred request waits until the present decode finishes, at which point the next prefill is admitted automatically. The worst-case wait is `max_sequence_length`. This avoids the one-off approach of forcing an eager step, which then keeps triggering eager steps because of the imbalance, and ensures that every step runs cudagraphed.

#3509 needs to be merged first to avoid unnecessary starvation for hybrid inference. These two PRs together ensure that no steps will run eager.
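The admission rule described above can be illustrated with a minimal, standalone sketch. It is not the PR's implementation: `GraphShape`, `requests_fit`, and `find_cg_chunk_size` are hypothetical names, the field names only mirror the ones visible in the review snippet, and the non-strict rule (decodes that overflow the captured decode slots spill into spare prefill slots) is an assumption about what "absorb extra decodes into prefill slots" means. A chunk is treated as "CG-aligned" when `active_token_count + chunk` lands exactly on a captured token count, which is also a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GraphShape:
    """Hypothetical stand-in for the batch dimensions of one captured CUDA graph."""
    token_count: int
    prefill_req_count: int
    decode_req_count: int


def requests_fit(cg: GraphShape, prefill_reqs: int, decode_reqs: int, strict: bool) -> bool:
    """Check whether the captured request slots can hold the real request counts."""
    if strict:
        # Strict mode (hybrid models): captured decode count must cover real decode count.
        return cg.decode_req_count >= decode_reqs and cg.prefill_req_count >= prefill_reqs
    # Assumed non-strict rule: decodes beyond the captured decode slots use prefill slots.
    overflow = max(0, decode_reqs - cg.decode_req_count)
    return cg.prefill_req_count >= prefill_reqs + overflow


def find_cg_chunk_size(
    graphs: List[GraphShape],
    active_token_count: int,
    num_prefill_requests: int,
    num_decode_requests: int,
    max_chunk_tokens: int,
    strict: bool,
) -> Optional[int]:
    """Return the largest CG-aligned chunk within the budget, or None to defer."""
    best: Optional[int] = None
    for cg in graphs:
        # Admitting the candidate adds one prefill request to the batch.
        if not requests_fit(cg, num_prefill_requests + 1, num_decode_requests, strict):
            continue
        chunk = cg.token_count - active_token_count
        if 0 < chunk <= max_chunk_tokens and (best is None or chunk > best):
            best = chunk
    return best


captured = [GraphShape(16, 1, 2), GraphShape(32, 1, 2), GraphShape(64, 2, 4)]

# Two decodes in flight: the 32-token graph is the largest one inside the 40-token budget.
admitted = find_cg_chunk_size(captured, 2, 0, 2, max_chunk_tokens=40, strict=True)

# Three decodes in flight: only the 64-token graph has enough decode slots, but it
# exceeds the budget, so the request is deferred instead of running eager.
deferred = find_cg_chunk_size(captured, 3, 0, 3, max_chunk_tokens=40, strict=True)
```

In this sketch a `None` result is the deferral signal: the caller would leave the request in the waiting queue and retry on a later scheduler step rather than falling back to eager mode.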
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.