Draft: merge Q2 submission branch back into main by jasonlizhengjian · Pull Request #190 · NVIDIA/srt-slurm

jasonlizhengjian · 2026-05-29T17:55:41Z

Summary

Draft merge-back of sa-submission-q2-2026 into main for review.
Brings Q2 submission-specific benchmark/runtime work from the submission branch, including GLM5/Kimi/Minimax recipes, eval/post-eval setup flow, and spread-worker/vLLM colocation support.
Includes the current submission branch state only; follow-up Q2 cherry-pick PRs are included only after they land in sa-submission-q2-2026.

Important caveats

This is a direct branch merge PR, not a curated backport.
main has moved substantially past Q2. The raw compare currently shows hundreds of files changed and many apparent deletions of main-only files/features, so this should not be merged as-is without review or a curated merge branch.
PR #184 (Cherry-pick Dynamo wheel install support to Q2) is not included here unless it lands in sa-submission-q2-2026 first.
The median interactivity CSV rollup from main (cfe10922, plus likely rollup hardening from 7858d309) is not in Q2 unless separately ported.

Suggested review focus

Decide whether we actually want a direct branch merge, or a curated branch that preserves current main behavior while bringing only the Q2 benchmark/recipe/runtime deltas.
Verify benchmark reporting expectations, especially mean JSON rollups versus median CSV interactivity output.
Check recipes and submission-only assets for anything that should stay out of main.
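For the reporting check, a small sketch of why a mean rollup and a median rollup over the same samples can disagree. The sample values and the `rollup` helper are hypothetical and are not taken from either branch's rollup code.

```python
import statistics


def rollup(samples: list[float]) -> dict[str, float]:
    """Summarize per-run interactivity samples both ways (hypothetical helper)."""
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
    }


# Hypothetical interactivity samples with one outlier run.
samples = [20.0, 21.0, 22.0, 23.0, 90.0]
summary = rollup(samples)
print(summary)
```

A single outlier pulls the mean well above the median, so numbers reported from the two rollups are not directly comparable.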

Validation

Not run. This draft PR is for review/triage of the merge-back scope.

* Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host

  - Add kimi-k2.5 1k1k and 8k1k disagg GB200 recipes (from #7)
  - Fix vLLM NIXL handshake failures: set VLLM_NIXL_SIDE_CHANNEL_HOST to node's routable IP in get_process_environment() instead of leaving it as 0.0.0.0/localhost which caused transfer handshake failures
  - Update test_vllm_get_process_environment to cover NIXL host env var

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: run checks on PRs targeting sa-submission-q2-2026

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
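A minimal sketch of the NIXL side-channel fix described in the commit message above, for reviewers unfamiliar with it. This is not the repository's get_process_environment(); the helpers `routable_ip` and `nixl_env` are hypothetical, and the UDP-connect trick for finding the outbound address is an assumption, since the commit does not say how the routable IP is obtained.

```python
import socket


def routable_ip(probe_addr: str = "10.255.255.255") -> str:
    """Best-effort guess at this node's outbound IPv4 address (hypothetical helper).

    Connecting a UDP socket sends no packets; it only asks the kernel which
    local address it would route from.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.connect((probe_addr, 1))
            return sock.getsockname()[0]
        except OSError:
            # No route available (e.g. an isolated sandbox): fall back to loopback.
            return "127.0.0.1"


def nixl_env(base_env: dict[str, str], host_ip: str) -> dict[str, str]:
    """Return a copy of base_env with the NIXL side channel pinned to host_ip."""
    env = dict(base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = host_ip
    return env


env = nixl_env({"EXAMPLE_VAR": "1"}, routable_ip())
```

Passing the IP in explicitly keeps `nixl_env` easy to unit test, which mirrors the intent of extending test_vllm_get_process_environment.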

#24)

* Add Kimi K2.5 disagg STP and MTP recipes for GB200 NVfp4 (ISL8K_OSL1K and ISL1K_OSL1K)

  Add optimized disaggregated inference recipes for Kimi K2.5 model with NVfp4 precision on GB200 GPUs. Includes both STP and MTP configurations for ISL8K_OSL1K and ISL1K_OSL1K workloads covering concurrency points from 5 to 2253, with Eagle speculative decoding for MTP variants.

* Update Kimi K2.5 recipes: container, model path, concurrency format, and env cleanup

  - Update container to tensorrtllm-runtime-1.1.0-dev.2.sqsh
  - Point model path to shared /mnt/lustre01/models/kimi-k2.5-nvfp4
  - Update Eagle model mount path for MTP configs
  - Remove HF_HOME (defaults to ~/.cache/huggingface)
  - Fix concurrency separator from space to 'x' for sa-bench compatibility
  - Enable multiple frontends for ctx1dep4_gen1dep32_batch64

* Use generic model path and container aliases for cluster portability

  Replace cluster-specific paths with generic alias names that are resolved via srtslurm.yaml model_paths and containers mappings, as per upstream convention.

* Add extra_mount alias resolution and use generic Eagle model path

  Add model_paths alias resolution for extra_mount host paths in config.py, enabling MTP recipes to use generic name "kimi-k2.5-eagle3" instead of cluster-specific path for the Eagle speculative decoding model.

* Use HuggingFace model names and full NVCR container paths

  Per review feedback, update model paths to HuggingFace format (nvidia/Kimi-K2.5-NVFP4) and container to full NVCR registry path (nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2) so recipes are portable and work without pre-built sqsh files.

---------

Co-authored-by: nlevin-ui <nlevin@nvidia.com>
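The extra_mount alias resolution described above could look roughly like the sketch below. It is an illustration only, not the code in config.py: the function names, the `host:container` mount format, and the example paths are assumptions.

```python
def resolve_alias(name_or_path: str, model_paths: dict[str, str]) -> str:
    """Map a generic alias to its cluster path; pass unknown values through."""
    return model_paths.get(name_or_path, name_or_path)


def resolve_extra_mount(mount: str, model_paths: dict[str, str]) -> str:
    """Resolve only the host side of an assumed 'host:container' mount spec."""
    host, sep, container = mount.partition(":")
    return resolve_alias(host, model_paths) + sep + container


# Hypothetical model_paths mapping, as it might appear in srtslurm.yaml.
model_paths = {"kimi-k2.5-eagle3": "/models/kimi-k2.5-eagle3"}
print(resolve_extra_mount("kimi-k2.5-eagle3:/eagle", model_paths))
```

Passing unknown names through unchanged means recipes that still use absolute host paths keep working alongside recipes that use aliases.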

* Add GLM5 GB200 NVFP4 Apr-09 disagg recipes.

  Include the updated 1K/1K and 8K/1K STP and MTP TensorRT-LLM Dynamo configs so submission testing can run on the latest GB200 parameter set.

* Keep only Apr-09 GB200 configs and align YAML quoting.

  Remove legacy GB200 trtllm_dynamo recipes inherited from the submission base branch, and normalize concurrencies/custom_tokenizer fields to double-quoted style for consistency with existing GB300 recipes.

* fix: enable chat template and 16x rounds for GB200 GLM5 configs

  Update GB200 GLM5 trtllm_dynamo recipes to set use_chat_template=true and num_prompts_mult=16 so sa-bench runs align with current submission benchmarking methodology.

* Add spread_workers option to ResourceConfig

  Allow placing each partial-node worker on its own node instead of packing multiple onto the same node. Useful when colocating workers on a single node causes resource contention (port collisions, etc.). Caller must reserve enough nodes (e.g. set decode_nodes=decode_workers when gpus_per_decode<gpus_per_node).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* try fix
* allow multiple DEP2 workers per node
* multi worker fix
* Allow vLLM one-node prefill decode colocation
* Avoid same-node worker port collisions
* Fix spread workers tests and lint
* Cover vLLM colocation guard

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: hjjq <50634613+hjjq@users.noreply.github.com>
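To make the spread_workers behavior concrete, here is a small placement sketch. It only illustrates packing versus spreading for partial-node workers; `assign_workers` and `nodes_required` are hypothetical names, and the real ResourceConfig logic may differ.

```python
def assign_workers(
    num_workers: int,
    gpus_per_worker: int,
    gpus_per_node: int,
    spread_workers: bool = False,
) -> list[int]:
    """Return the node index assigned to each partial-node worker."""
    if not 0 < gpus_per_worker <= gpus_per_node:
        raise ValueError("sketch covers partial-node (or exactly one-node) workers only")
    # Packing fits as many workers per node as the GPUs allow; spreading uses one.
    workers_per_node = 1 if spread_workers else gpus_per_node // gpus_per_worker
    return [i // workers_per_node for i in range(num_workers)]


def nodes_required(assignment: list[int]) -> int:
    """Number of nodes the caller must reserve for this assignment."""
    return max(assignment) + 1 if assignment else 0


packed = assign_workers(4, gpus_per_worker=2, gpus_per_node=4)
spread = assign_workers(4, gpus_per_worker=2, gpus_per_node=4, spread_workers=True)
print(packed, spread)
```

With spreading enabled the node count equals the worker count, which matches the commit's note that the caller should set decode_nodes=decode_workers when gpus_per_decode<gpus_per_node.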

jasonlizhengjian · 2026-05-29T18:08:44Z

Closing in favor of #191, which keeps the Q2 merge-back scoped to non-recipe changes and leaves the recipe tree out of the PR diff.

Albert Cheng (Engrg-Hardware 1) and others added 14 commits April 2, 2026 14:17

Make Dynamo source install container-agnostic (vLLM, SGLang, etc.)

9cc6d50

Auto-detect container type at runtime: if /sgl-workspace exists (SGLang), use original install path unchanged; otherwise use portable /tmp build path with conditional dependency installation for non-SGLang containers.
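The runtime detection described above can be sketched as follows. This is an illustration in Python rather than the actual install script: `InstallPlan`, `plan_install`, and both build directory names are hypothetical, and the commit does not state what the original SGLang install path is.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class InstallPlan:
    container: str              # "sglang" or "generic"
    build_dir: Path             # where the Dynamo source build would run
    install_missing_deps: bool  # only the portable path installs dependencies


def plan_install(
    sgl_marker: Path = Path("/sgl-workspace"),
    tmp_root: Path = Path("/tmp"),
) -> InstallPlan:
    """Pick an install plan based on whether the SGLang marker directory exists."""
    if sgl_marker.exists():
        # SGLang container: keep the original path (placeholder directory name).
        return InstallPlan("sglang", sgl_marker / "dynamo", False)
    # Any other container (vLLM, etc.): portable build under /tmp.
    return InstallPlan("generic", tmp_root / "dynamo-build", True)


print(plan_install())
```

Keeping the SGLang branch untouched limits the blast radius of the change: only containers without /sgl-workspace see new behavior.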

Add Minimax M2.5 NVFP4 agg B200 single-node configs (#36)

b0f5b83

* recipes for minimax m2.5 fp4 b200 agg vllm
* commit for signature

Add lm-eval benchmark runner for InferenceX evals (#12)

f61dbba

* Add lm-eval benchmark runner for InferenceX evals

  Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX benchmark_lib.sh harness.

fix: add glm5 dynamo trtllm benchmark support to sa submission branch (#47)

10f4ac9

* fix tokenizer for glm5 (#20)

  fix

* add nvidia pre-release url (#22)

Add GLM5 disaggregated recipes for SA submission (#48)

a10acd3

Add 66 GLM5 NVFP4 disaggregated recipe configs for GB200 and GB300 on the sa-submission branch; standardize model path and container values across the recipe set for consistency.

fix: add chat template to the glm5 tokenizer

b85ec1d

fix: align glm5 gb300 sa-bench rounds with submission baselines (#113)

c88c68e

Set GLM5 GB300 trtllm_dynamo recipes to use chat template and num_prompts_mult=16 so throughput runs match TRTLLM multi-round methodology, while keeping warmup fixed at 2x.
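A rough sketch of what num_prompts_mult=16 with a 2x warmup implies per concurrency point. It assumes sa-bench scales both counts linearly with concurrency and that concurrencies are given as an 'x'-separated string; neither is confirmed by this PR, and `prompt_counts` is a hypothetical helper.

```python
def prompt_counts(
    concurrencies: str,
    num_prompts_mult: int = 16,
    warmup_mult: int = 2,
) -> dict[int, dict[str, int]]:
    """Compute assumed prompt and warmup counts for each concurrency point."""
    levels = [int(part) for part in concurrencies.split("x")]
    return {
        c: {"num_prompts": c * num_prompts_mult, "warmup": c * warmup_mult}
        for c in levels
    }


counts = prompt_counts("5x2253")
print(counts)
```

Under these assumptions the multiplier raises the measured request count at each concurrency point while the warmup volume stays at twice the concurrency.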

fix: using a setup script to install pip in trtllm venv # (#117)

95b0a33

fix: add trtllm venv pip bootstrap to GB300 GLM5 recipes (#120)

0f0aa60

Add setup_script install-trtllm-pip.sh to all GB300 GLM5 trtllm_dynamo recipes so eval-only jobs can install lm-eval even when pip is missing in the runtime container venv.

run setup script before post eval (#123)

fdb1be7

jasonlizhengjian mentioned this pull request May 29, 2026

Draft: merge Q2 submission support into main without recipes #191

Draft

jasonlizhengjian closed this May 29, 2026

jasonlizhengjian wants to merge 14 commits into main from sa-submission-q2-2026

6 participants
