34 changes: 11 additions & 23 deletions .github/configs/nvidia-master.yaml
@@ -1999,44 +1999,32 @@ dsr1-fp8-b300-sglang:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 4 }
- { tp: 4, ep: 1, conc-start: 1, conc-end: 32 }

# NOTE: https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
# lists B200 (not B300) as the Blackwell target. This config reuses the
# B200 Pro FP4 Max-Throughput recipe (DP=8 + DeepEP, no MTP) on B300
# until a B300-specific recipe ships. Prefix caching is disabled.
# Parallelisms and concurrency ranges mirror dsv4-fp4-b200-vllm.
# DeepSeek-V4-Pro on B300 with SGLang (non-MTP). This follows the 8k/1k
# submission frontier from the 2026-05-19 Pareto HTML:
# TP-only low-latency line: TP8/EP1, no DP attention, c1-c64
# DP-attention throughput line: DEP8, DP attention, c512-c2048
dsv4-fp4-b300-sglang:
image: lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15
image: lmsysorg/sglang:nightly-dev-cu13-20260522-7cf193fe
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: b300
precision: fp4
framework: sglang
multinode: false
# Three recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
# are selected inside benchmarks/single_node/dsv4_fp4_b300_sglang.sh by CONC:
# low-latency (CONC <= 32): TP-only
# balanced (32 < CONC <= 128): + DP-attn
# max-throughput (CONC > 128): + DP-attn
# Split so result filenames (ep=, dpa=) accurately reflect the recipe.
# ep is implicit in sglang: --moe-a2a-backend deepep forces ep_size=tp_size,
# while low-latency leaves ep_size at the default of 1.
# The benchmark script maps dp-attn=false to the TP-only recipe and
# dp-attn=true to the mixed-chunk DEP8 throughput recipe.
scenarios:
fixed-seq-len:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1 }
- { tp: 4, ep: 1, conc-start: 32, conc-end: 32 }
- { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 8192, conc-end: 8192 }
- { tp: 8, ep: 1, dp-attn: false, conc-list: [1, 2, 4, 8, 16, 32, 64] }
- { tp: 8, ep: 8, dp-attn: true, conc-list: [512, 768, 1024, 1536, 2048] }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1 }
- { tp: 4, ep: 1, conc-start: 32, conc-end: 32 }
- { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 2048 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 4096, conc-end: 4096 }
- { tp: 8, ep: 1, dp-attn: false, conc-list: [1, 2, 4, 8, 16, 32, 64] }
- { tp: 8, ep: 8, dp-attn: true, conc-list: [512, 768, 1024, 1536, 2048] }

Single-node full-sweep crashes on conc-list configs

Medium Severity

The new dsv4-fp4-b300-sglang config uses conc-list in single-node fixed-seq-len search-space entries, but the generate_full_sweep() function's single-node code path unconditionally accesses conc-start and conc-end without first checking for conc-list. Running full-sweep over the nvidia master config will crash with a KeyError. The test-config path (used by the PR's CI and process_changelog.py) handles conc-list correctly, which is why the PR's own tests pass — but the full-sweep command (documented in the README and available via e2e-tests.yml) is broken for this config.

Reviewed by Cursor Bugbot for commit deee4cc.
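
One possible guard for the single-node path, as a minimal sketch. `expand_concurrencies` is a hypothetical helper, not a function in the repository, and the doubling step between `conc-start` and `conc-end` is an assumption about how ranges are expanded. Only the key names `conc-list`, `conc-start`, and `conc-end` come from the config above.

```python
def expand_concurrencies(entry: dict) -> list[int]:
    """Return the concurrency values for one search-space entry.

    Hypothetical helper: prefers an explicit conc-list and falls back to
    the conc-start/conc-end range instead of indexing the range keys
    unconditionally. The doubling step is an assumption.
    """
    if "conc-list" in entry:
        return list(entry["conc-list"])
    if "conc-start" not in entry or "conc-end" not in entry:
        raise ValueError(
            f"entry needs conc-list or conc-start/conc-end: {entry}"
        )
    start, end = entry["conc-start"], entry["conc-end"]
    if start < 1 or end < start:
        raise ValueError(f"invalid concurrency range: {start}..{end}")
    values = []
    conc = start
    while conc <= end:
        values.append(conc)
        conc *= 2  # assumed step; the real sweep may use a different one
    return values


# Entry shapes taken from the old and new dsv4-fp4-b300-sglang search spaces.
new_style = {"tp": 8, "ep": 1, "dp-attn": False, "conc-list": [1, 2, 4, 8, 16, 32, 64]}
old_style = {"tp": 4, "ep": 1, "conc-start": 32, "conc-end": 32}
print(expand_concurrencies(new_style))
print(expand_concurrencies(old_style))
```

Checking for `conc-list` first means an entry of either shape is accepted, so a config that mixes both styles would not raise a KeyError in this sketch.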


# DeepSeek-V4-Pro on B300 with EAGLE/MTP speculative decoding. Recipe is
# selected inside benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh by
132 changes: 42 additions & 90 deletions benchmarks/single_node/dsv4_fp4_b300_sglang.sh
@@ -25,7 +25,6 @@ fi
nvidia-smi

# Common SGLANG env vars (apply to every config).
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
export SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT=1
export SGLANG_OPT_USE_JIT_NORM=1
export SGLANG_OPT_USE_JIT_INDEXER_METADATA=1
@@ -48,6 +47,8 @@ EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else
EVAL_CONTEXT_ARGS="--context-length 16384"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"
@@ -60,105 +61,56 @@ else
SWA_FULL_TOKENS_RATIO=0.1
fi

# Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the vllm
# script's pattern). DP-attention runs the empirically-tuned high-concurrency
# recipe (flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer);
# single-instance uses flashinfer_mxfp4 with the cookbook defaults.
DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'

# Default; the DP-attn branch below overrides to 0.94.
MEM_FRACTION_STATIC=0.90
# Pick the launch recipe based on the two-line submission frontier:
# TP8/no-DP-attn for low latency and DEP8/DP-attn for throughput.

if [ "${DP_ATTENTION}" = "true" ]; then
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
export SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION=8
export SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1
export SGLANG_OPT_USE_FAST_MASK_EP=1
export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
export SGLANG_OPT_FIX_NEXTN_MEGA_MOE=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0
# ep=8 in the yaml signals the mega_moe deepep backend; check high-conc
# recipes first (they also have ep=8) so they aren't shadowed by the
# medium-conc EP_SIZE=8 branch below.
if [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ] || [ "$CONC" = "8192" ]; then
export NVSHMEM_DISABLE_IB=1
export SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW=1
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
if [ "$CONC" = "2048" ]; then
export SGLANG_LOG_FORWARD_ITERS=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8320
CUDA_GRAPH_MAX_BS=288
MAX_RUNNING_REQUESTS=2560
MEM_FRACTION_STATIC=0.87
SWA_FULL_TOKENS_RATIO=0.06
TOKENIZER_WORKER_NUM=4
elif [ "$CONC" = "4096" ]; then
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8320
CUDA_GRAPH_MAX_BS=544
MAX_RUNNING_REQUESTS=4352
MEM_FRACTION_STATIC=0.835
SWA_FULL_TOKENS_RATIO=0.075
TOKENIZER_WORKER_NUM=8
else
export SGLANG_OPT_USE_ONLINE_COMPRESS=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8256
CUDA_GRAPH_MAX_BS=1088
MAX_RUNNING_REQUESTS=8192
MEM_FRACTION_STATIC=0.80
SWA_FULL_TOKENS_RATIO=0.3
TOKENIZER_WORKER_NUM=16
fi
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--cuda-graph-max-bs "$CUDA_GRAPH_MAX_BS"
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 65536
--tokenizer-worker-num "$TOKENIZER_WORKER_NUM"
--enable-prefill-delayer
)
if [ "$CONC" = "4096" ]; then
PARALLEL_ARGS+=(--decode-log-interval 5)
fi
if [ "$CONC" = "8192" ]; then
PARALLEL_ARGS+=(--stream-interval 30)
fi
elif [ "${EP_SIZE}" = "8" ]; then
export NVSHMEM_DISABLE_IB=1
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=550
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--cuda-graph-max-bs 550
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 16384
--enable-prefill-delayer
)
MAX_RUNNING_REQUESTS=768
MEM_FRACTION_STATIC=0.94
else
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=0
export SGLANG_OPT_FIX_HASH_MEGA_MOE=0
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-runner-backend flashinfer_mxfp4
--disable-flashinfer-autotune
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 16384
--enable-prefill-delayer
)
MEM_FRACTION_STATIC=0.94
fi
export NVSHMEM_DISABLE_IB=1
export SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW=1
export SGLANG_OPT_USE_ONLINE_COMPRESS=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=2048
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1
export SGLANG_EXPERIMENTAL_ENABLE_PIECEWISE_CUDA_GRAPH_MOE_A2A=1
export NCCL_MNNVL_ENABLE=1
export NCCL_CUMEM_ENABLE=1
export MC_FORCE_MNNVL=1
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True

MEM_FRACTION_STATIC=0.835
MAX_RUNNING_REQUESTS=4352
SWA_FULL_TOKENS_RATIO=0.075
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend megamoe
--cuda-graph-max-bs 544
--enable-mixed-chunk
--chunked-prefill-size 16384
--max-prefill-tokens 16384
--tokenizer-worker-num 8
--decode-log-interval 5
--stream-interval 30
)
else
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=1
MEM_FRACTION_STATIC=0.90
MAX_RUNNING_REQUESTS=512
PARALLEL_ARGS=(
--moe-runner-backend flashinfer_mxfp4
--chunked-prefill-size 8192
--disable-flashinfer-autotune
--cuda-graph-max-bs 512
--tokenizer-worker-num 8
--decode-log-interval 60
--stream-interval 30
--scheduler-recv-interval 30
)
fi

@@ -177,7 +129,7 @@ PYTHONNOUSERSITE=1 sglang serve \
--port $PORT \
--trust-remote-code \
--tp $TP \
--max-running-requests "${MAX_RUNNING_REQUESTS:-$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))}" \
--max-running-requests "$MAX_RUNNING_REQUESTS" \
--mem-fraction-static "$MEM_FRACTION_STATIC" \
--swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO" \
"${PARALLEL_ARGS[@]}" $EVAL_CONTEXT_ARGS >> $SERVER_LOG 2>&1 &
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -3171,3 +3171,11 @@
description:
- "Validates measured-power aggregation pipeline (PR #1558) on both NVIDIA (H200) and AMD (MI355X) hardware — different SMI tools (nvidia-smi vs amd-smi), different CSV schemas (power.draw [W] vs socket_power), same aggregator. No config change. Entry intentionally kept past merge so run-sweep produces canonical agg JSONs with avg_power_w + joules_per_output_token on main for both vendors, seeding the dashboard's day-zero data."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1558

- config-keys:
- dsv4-fp4-b300-sglang
description:
- "Update DeepSeek-V4-Pro FP4 B300 SGLang non-MTP sweep to the 2026-05-19 8k/1k submission frontier: TP8 no-DP-attention c1-c64 and DEP8 DP-attention c512/c768/c1024/c1536/c2048"
- "Use lmsysorg/sglang:nightly-dev-cu13-20260522-7cf193fe to pick up the merged SGLang warmup path"
- "Map dp-attn=false to TP8 flashinfer_mxfp4 with chunked-prefill 8192; map dp-attn=true to DEP8 mixed-chunk MegaMoE throughput settings"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1552

Check failure on line 3181 in perf-changelog.yaml

Claude / Claude Code Review

perf-changelog pr-link references superseded PR #1552 instead of merging PR #1575

The new `perf-changelog.yaml` entry's `pr-link` points to PR #1552 (the closed/superseded original), but the merging PR is #1575. This breaks the file's convention (every neighboring entry links to its own merging PR) and will trip `utils/merge_with_reuse.sh`: its conflict-resolution helper skips entries whose link doesn't end with `/pull/<current-pr>`, and the post-merge `assert last["pr-link"].endswith("/$PR")` would fail. Change the link to `/pull/1575` (or use the `XXX` placeholder).
Comment on lines +3175 to +3181

Extended reasoning...

What the bug is

The new entry appended at perf-changelog.yaml:3175-3181 sets:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1552

…but the PR being merged is #1575 (the rebased successor — the PR description itself notes "Rebased copy of #1552 with origin/main merged in"). PR #1552 is closed and will not contain the merge commit for these changes.

Why this is a real issue, not just cosmetic

Every other recent entry in perf-changelog.yaml links to its own merging PR (e.g. the immediately preceding dsv4-fp4-mi355x-sglang entry → /pull/1568, others → /pull/1555, /pull/1558, /pull/1516, /pull/1354). AGENTS.md documents this convention. More importantly, the repo's canonical merge tool utils/merge_with_reuse.sh (referenced from .claude/commands/merge-prs.md) enforces it programmatically:

  • At ~line 136, when resolving perf-changelog.yaml conflicts it filters incoming entries with: if "XXX" not in link and not link.endswith(f"/pull/{pr}"): continue. With pr=1575 and link=…/pull/1552, this entry would be silently skipped during automated conflict resolution, producing "No PR contributions found".
  • At ~line 172, after the merge it runs assert last["pr-link"].endswith("/$PR"). With $PR=1575 and the last entry pointing at /pull/1552, this assertion would fail.

Step-by-step proof

  1. Maintainer runs utils/merge_with_reuse.sh 1575.
  2. The script fetches origin/main, attempts the merge, and hits a conflict in perf-changelog.yaml (likely, given how frequently this file changes).
  3. The embedded Python helper walks the incoming entries from the PR branch. For each new entry it checks:
    if "XXX" not in link and not link.endswith(f"/pull/{pr}"): continue
    With pr = "1575" and link = "https://github.com/SemiAnalysisAI/InferenceX/pull/1552", the condition is true → the entry is dropped from the merged result.
  4. Even if no conflict arises (so the helper isn't invoked), the post-merge sanity check runs:
    assert last["pr-link"].endswith(f"/{pr}")
    Since the last entry is the new one and ends with /1552, this assertion raises and the merge tool aborts.
  5. Downstream readers using pr-link to find the diff/merge commit land on a closed, superseded PR instead of the one that actually merged.

Fix

Update line 3181 to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1575

or, equivalently, use the XXX placeholder that merge_with_reuse.sh rewrites at merge time:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The historical reference to #1552 (which the PR description already provides) can stay in the PR description; the changelog entry should point at the PR that actually lands the change, both for convention and for tooling correctness.
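
The two checks described in this comment can be reproduced in isolation. This is a minimal sketch, not the actual contents of `utils/merge_with_reuse.sh`: the function names `keeps_entry` and `last_entry_ok` are hypothetical, and only the two quoted conditions are taken from the review.

```python
def keeps_entry(link: str, pr: str) -> bool:
    """Inverse of the quoted filter: an entry survives conflict resolution
    if its link carries the XXX placeholder or points at the merging PR."""
    return "XXX" in link or link.endswith(f"/pull/{pr}")


def last_entry_ok(entries: list[dict], pr: str) -> bool:
    """Mirror of the quoted post-merge assertion on the last changelog entry."""
    return entries[-1]["pr-link"].endswith(f"/{pr}")


BASE = "https://github.com/SemiAnalysisAI/InferenceX/pull"
MERGING_PR = "1575"

print(keeps_entry(f"{BASE}/1552", MERGING_PR))  # False: entry would be skipped
print(keeps_entry(f"{BASE}/1575", MERGING_PR))  # True
print(keeps_entry(f"{BASE}/XXX", MERGING_PR))   # True: placeholder is accepted

print(last_entry_ok([{"pr-link": f"{BASE}/1552"}], MERGING_PR))  # False
print(last_entry_ok([{"pr-link": f"{BASE}/1575"}], MERGING_PR))  # True
```

Under these assumptions, the `/pull/1552` link fails both checks, while either suggested fix passes the filter.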
