Draft: merge Q2 submission branch back into main#190

Closed
jasonlizhengjian wants to merge 14 commits into main from sa-submission-q2-2026

@jasonlizhengjian
Contributor

Summary

  • Draft merge-back of sa-submission-q2-2026 into main for review.
  • Brings Q2 submission-specific benchmark/runtime work from the submission branch, including GLM5/Kimi/Minimax recipes, eval/post-eval setup flow, and spread-worker/vLLM colocation support.
  • Includes the current submission branch state only; follow-up Q2 cherry-pick PRs are included only after they land in sa-submission-q2-2026.

Important caveats

  • This is a direct branch merge PR, not a curated backport.
  • main has moved substantially past Q2. The raw compare currently shows hundreds of files changed and many apparent deletions of main-only files/features, so this should not be merged as-is without review or a curated merge branch.
  • PR #184 ("Cherry-pick Dynamo wheel install support to Q2") is not included here unless it lands in sa-submission-q2-2026 first.
  • The median interactivity CSV rollup from main (cfe10922, plus likely rollup hardening from 7858d309) is not in Q2 unless separately ported.

Suggested review focus

  • Decide whether we actually want a direct branch merge, or a curated branch that preserves current main behavior while bringing only the Q2 benchmark/recipe/runtime deltas.
  • Verify benchmark reporting expectations, especially mean JSON rollups versus median CSV interactivity output.
  • Check recipes and submission-only assets for anything that should stay out of main.

Validation

  • Not run. This draft PR is for review/triage of the merge-back scope.

Albert Cheng (Engrg-Hardware 1) and others added 14 commits April 2, 2026 14:17
Auto-detect container type at runtime: if /sgl-workspace exists (SGLang),
use original install path unchanged; otherwise use portable /tmp build path
with conditional dependency installation for non-SGLang containers.
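
A minimal sketch of the runtime detection described in this commit message, assuming the check is a plain directory test. Only the /sgl-workspace marker and the /tmp location come from the commit message; the function name, the returned dictionary keys, and the /tmp/build subdirectory are hypothetical.

```python
from pathlib import Path

# Marker directory named in the commit message.
SGLANG_MARKER = Path("/sgl-workspace")


def choose_install_plan(
    original_dir: str,
    marker: Path = SGLANG_MARKER,
    portable_dir: str = "/tmp/build",  # hypothetical subdirectory under /tmp
) -> dict:
    """Pick an install plan based on whether the SGLang marker directory exists."""
    if marker.is_dir():
        # SGLang container: keep the original install path unchanged.
        return {
            "container": "sglang",
            "build_dir": original_dir,
            "install_extra_deps": False,
        }
    # Any other container: build in a portable location and install
    # dependencies conditionally.
    return {
        "container": "other",
        "build_dir": portable_dir,
        "install_extra_deps": True,
    }
```

Passing the marker as a parameter keeps the branch testable outside a container.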
* Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host

- Add kimi-k2.5 1k1k and 8k1k disagg GB200 recipes (from #7)
- Fix vLLM NIXL handshake failures: set VLLM_NIXL_SIDE_CHANNEL_HOST to
  node's routable IP in get_process_environment() instead of leaving it
  as 0.0.0.0/localhost which caused transfer handshake failures
- Update test_vllm_get_process_environment to cover NIXL host env var
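
The NIXL fix above can be illustrated roughly as follows. This is a sketch, not the code in get_process_environment(): the helper name is hypothetical, and the guard against wildcard and loopback addresses is an added assumption about what "routable" should exclude. Only the VLLM_NIXL_SIDE_CHANNEL_HOST variable name comes from the commit message.

```python
import ipaddress


def nixl_side_channel_env(node_ip: str) -> dict:
    """Build the env entry that points the NIXL side channel at a routable IP."""
    # ip_address() raises ValueError for hostnames such as "localhost".
    addr = ipaddress.ip_address(node_ip)
    if addr.is_loopback or addr.is_unspecified:
        # 127.0.0.1 and 0.0.0.0 cannot be reached by a peer on another node.
        raise ValueError(f"{node_ip} is not routable from other nodes")
    return {"VLLM_NIXL_SIDE_CHANNEL_HOST": str(addr)}
```

A caller would merge the returned mapping into the worker's environment before launching vLLM.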

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: run checks on PRs targeting sa-submission-q2-2026

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
#24)

* Add Kimi K2.5 disagg STP and MTP recipes for GB200 NVFP4 (ISL8K_OSL1K and ISL1K_OSL1K)

Add optimized disaggregated inference recipes for the Kimi K2.5 model with NVFP4
precision on GB200 GPUs. Includes both STP and MTP configurations for the
ISL8K_OSL1K and ISL1K_OSL1K workloads, covering concurrency points from 5
to 2253, with Eagle speculative decoding for the MTP variants.

* Update Kimi K2.5 recipes: container, model path, concurrency format, and env cleanup

- Update container to tensorrtllm-runtime-1.1.0-dev.2.sqsh
- Point model path to shared /mnt/lustre01/models/kimi-k2.5-nvfp4
- Update Eagle model mount path for MTP configs
- Remove HF_HOME (defaults to ~/.cache/huggingface)
- Fix concurrency separator from space to 'x' for sa-bench compatibility
- Enable multiple frontends for ctx1dep4_gen1dep32_batch64
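
As an illustration of the separator change in the list above, a small sketch that converts a space-separated concurrency list into the 'x'-separated form. The function name is hypothetical, and the assumption that each entry is a plain non-negative integer is mine, not the recipe schema's.

```python
def to_x_separated(concurrencies: str) -> str:
    """Normalize a concurrency list to the 'x'-separated form."""
    # Accept either separator on input so the conversion is idempotent.
    parts = concurrencies.replace("x", " ").split()
    if not parts or not all(p.isdigit() for p in parts):
        raise ValueError(f"invalid concurrency list: {concurrencies!r}")
    return "x".join(parts)
```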

* Use generic model path and container aliases for cluster portability

Replace cluster-specific paths with generic alias names that are resolved
via srtslurm.yaml model_paths and containers mappings, as per upstream convention.

* Add extra_mount alias resolution and use generic Eagle model path

Add model_paths alias resolution for extra_mount host paths in config.py,
enabling MTP recipes to use generic name "kimi-k2.5-eagle3" instead of
cluster-specific path for the Eagle speculative decoding model.
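
The alias resolution described in the last two commit messages might look roughly like this. It is a sketch under assumptions: the function names, the "host:container" mount string format, and the example host path are hypothetical; only the model_paths mapping idea and the "kimi-k2.5-eagle3" alias come from the commit messages.

```python
def resolve_alias(value: str, aliases: dict) -> str:
    """Return the mapped path for a known alias, else the value unchanged."""
    return aliases.get(value, value)


def resolve_extra_mounts(mounts: list, model_paths: dict) -> list:
    """Resolve the host side of each 'host:container' mount via model_paths."""
    resolved = []
    for mount in mounts:
        # Split on the first colon only; the container path is kept as-is.
        host, sep, container = mount.partition(":")
        resolved.append(resolve_alias(host, model_paths) + sep + container)
    return resolved


# Hypothetical cluster mapping, standing in for the srtslurm.yaml model_paths section.
model_paths = {"kimi-k2.5-eagle3": "/shared/models/kimi-k2.5-eagle3"}
mounts = resolve_extra_mounts(["kimi-k2.5-eagle3:/eagle"], model_paths)
```

Unknown names pass through unchanged, so recipes that already use absolute host paths keep working.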

* Use HuggingFace model names and full NVCR container paths

Per review feedback, update model paths to HuggingFace format
(nvidia/Kimi-K2.5-NVFP4) and container to full NVCR registry path
(nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2) so recipes
are portable and work without pre-built sqsh files.

---------

Co-authored-by: nlevin-ui <nlevin@nvidia.com>
* recipes for minimax m2.5 fp4 b200 agg vllm

* commit for signature
* Add lm-eval benchmark runner for InferenceX evals

Adds support for running lm-eval accuracy evaluations as a post-benchmark
step, leveraging the InferenceX benchmark_lib.sh harness.
…#47)

* fix tokenizer for glm5 (#20)

fix

* add nvidia pre-release url (#22)
Add 66 GLM5 NVFP4 disaggregated recipe configs for GB200 and GB300 on the sa-submission branch; standardize model path and container values across the recipe set for consistency.
* Add GLM5 GB200 NVFP4 Apr-09 disagg recipes.

Include the updated 1K/1K and 8K/1K STP and MTP TensorRT-LLM Dynamo configs so submission testing can run on the latest GB200 parameter set.

* Keep only Apr-09 GB200 configs and align YAML quoting.

Remove legacy GB200 trtllm_dynamo recipes inherited from the submission base branch, and normalize concurrencies/custom_tokenizer fields to double-quoted style for consistency with existing GB300 recipes.

* fix: enable chat template and 16x rounds for GB200 GLM5 configs

Update GB200 GLM5 trtllm_dynamo recipes to set use_chat_template=true and num_prompts_mult=16 so sa-bench runs align with current submission benchmarking methodology.
Set GLM5 GB300 trtllm_dynamo recipes to use chat template and num_prompts_mult=16 so throughput runs match TRTLLM multi-round methodology, while keeping warmup fixed at 2x.
Add setup_script install-trtllm-pip.sh to all GB300 GLM5 trtllm_dynamo recipes so eval-only jobs can install lm-eval even when pip is missing in the runtime container venv.
* Add spread_workers option to ResourceConfig

Allow placing each partial-node worker on its own node instead of
packing multiple onto the same node. Useful when colocating workers
on a single node causes resource contention (port collisions, etc.).

Caller must reserve enough nodes (e.g. set decode_nodes=decode_workers
when gpus_per_decode<gpus_per_node).
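
To make the placement rule concrete, here is a minimal sketch of packing versus spreading partial-node workers. It is not the ResourceConfig implementation; the function and parameter names are hypothetical, and it assumes every worker needs the same number of GPUs and fits within one node.

```python
def assign_workers(
    num_workers: int,
    gpus_per_worker: int,
    gpus_per_node: int,
    num_nodes: int,
    spread_workers: bool = False,
) -> list:
    """Return the node index for each worker that uses at most one node."""
    if not 1 <= gpus_per_worker <= gpus_per_node:
        raise ValueError("worker must fit within a single node")
    # Packed: as many workers per node as fit. Spread: one worker per node.
    per_node = 1 if spread_workers else gpus_per_node // gpus_per_worker
    needed = -(-num_workers // per_node)  # ceiling division
    if needed > num_nodes:
        # Mirrors the caveat above: the caller must reserve enough nodes.
        raise ValueError(f"need {needed} nodes, only {num_nodes} reserved")
    return [i // per_node for i in range(num_workers)]
```

With four 2-GPU workers on 4-GPU nodes, packing needs two nodes while spreading needs four, which is why the commit message asks callers to set the node count equal to the worker count.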

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* try fix

* allow multiple DEP2 workers per node

* multi worker fix

* Allow vLLM one-node prefill decode colocation

* Avoid same-node worker port collisions

* Fix spread workers tests and lint

* Cover vLLM colocation guard

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: hjjq <50634613+hjjq@users.noreply.github.com>
@jasonlizhengjian
Contributor Author

Closing in favor of #191, which keeps the Q2 merge-back scoped to non-recipe changes and leaves the recipe tree out of the PR diff.
