[Klaud Cold] KLAUD_DEBUG: B300 is sm_103 (not sm_120) + cross-link sglang#25563 #1479
Two corrections to §4 (B300 sglang v0.5.12 regressions):

1. **Arch fix.** B300 (Blackwell Ultra datacenter) is compute capability 10.3 / `sm_103`, NOT `sm_120`. `sm_120` is for consumer Blackwell (RTX 50 series / GB20x dies). This had propagated through agent diagnoses and into upstream issue sgl-project/sglang#25563 (already corrected there).
2. **§4c reframe.** `sm_103` is *nominally inside* the asserted range `sm_100 <= arch <= sm_110f` (since 100 <= 103 <= 110), so the assertion failure is more interesting than "outside the range". The best guess is that the cute kernel's `Arch.sm_110f` set only matches the architecture-specific feature-flag variants it was compiled for (`sm_100`, `sm_100f`, `sm_110`, `sm_110f`), and `sm_103`/`sm_103a` isn't in that list.

Also cross-linked sgl-project/sglang#25563 under §4b (filed earlier this session for the EAGLE draft graph capture crash on GLM-5-NVFP4 at bs=128; same B300 v0.5.12 regression family).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LGTM, simple doc correction; nit about a leftover sm_120 reference in a related slash-command template is flagged inline.
Overview
PR #1479 is a 4-line documentation-only edit to KLAUD_DEBUG.md §4, correcting the B300 architecture string from sm_120 (which is consumer Blackwell / RTX 50) to sm_103 (Blackwell Ultra, compute capability 10.3), and reframing §4c so the assertion failure is explained as a feature-flag set mismatch rather than an out-of-range arch. Also adds a cross-link to upstream issue sglang#25563 under §4b. No code paths touched.
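As a quick sanity check on the naming, the `sm_XY` string is the compute capability's major and minor numbers written together. A minimal sketch follows; the helper name `sm_name` is illustrative and does not come from the repo, and the two capability values are the ones stated in this PR.

```python
def sm_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability (major, minor) as an sm_XY string."""
    return f"sm_{major}{minor}"


# Capability values as stated in the PR text.
B300_CAPABILITY = (10, 3)               # Blackwell Ultra datacenter
CONSUMER_BLACKWELL_CAPABILITY = (12, 0)  # RTX 50 series

assert sm_name(*B300_CAPABILITY) == "sm_103"
assert sm_name(*CONSUMER_BLACKWELL_CAPABILITY) == "sm_120"
```

On a live node, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed to this helper; per this PR, a B300 should report `(10, 3)`.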
Security risks
None. This is an operator-facing internal debug playbook; the file is not loaded by any runtime, build, or CI process, and the changes are purely textual descriptions of architecture identifiers.
Level of scrutiny
Low. The PR explicitly scopes itself to a doc-only 4-line edit and that is exactly what the diff contains. The factual correction (sm_103 vs sm_120 for B300 / Blackwell Ultra GB300) is consistent with NVIDIA's published compute capabilities, and the §4c reframe is internally consistent (100 ≤ 103 ≤ 110 makes the original "outside the range" explanation incorrect on its face).
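To make the §4c point concrete, here is a hypothetical sketch contrasting a numeric range check with explicit set membership. None of these names come from the sglang or cute sources, and the real assertion may be implemented differently; the variant list is simply the one quoted in the PR's best guess.

```python
def parse_sm(arch: str) -> int:
    """Extract the numeric part of an arch string, e.g. 'sm_110f' -> 110."""
    digits = "".join(ch for ch in arch.removeprefix("sm_") if ch.isdigit())
    return int(digits)


def in_numeric_range(arch: str, lo: str = "sm_100", hi: str = "sm_110f") -> bool:
    """What 'sm_100 <= arch <= sm_110f' suggests if read as a numeric range."""
    return parse_sm(lo) <= parse_sm(arch) <= parse_sm(hi)


# Assumption: the variants named in the PR's best-guess explanation.
COMPILED_VARIANTS = frozenset({"sm_100", "sm_100f", "sm_110", "sm_110f"})


def in_compiled_set(arch: str) -> bool:
    """Membership in an explicit list of compiled variants."""
    return arch in COMPILED_VARIANTS


# sm_103 passes the numeric reading but is absent from the explicit list.
assert in_numeric_range("sm_103")
assert not in_compiled_set("sm_103")
assert not in_compiled_set("sm_103a")
```

If the kernel's check behaves like `in_compiled_set` rather than `in_numeric_range`, the assertion would fail on B300 even though 103 sits between 100 and 110, which is the reading the reframed §4c proposes.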
Other factors
The bug hunter flagged a related stale sm_120 reference in .claude/commands/klaud-pr-status-html.md:171 that this cleanup pass missed — it is a legitimate finding (the template is a seed for fresh dashboard JSON) but is a nit relative to the PR's stated 4-line scope, and the inline comment will surface it for follow-up without blocking this correction.
## 4. Upstream sglang v0.5.12 B300 regressions

- Two distinct upstream regressions on NVIDIA B300 (Blackwell, `sm_120`) shipped in `lmsysorg/sglang:v0.5.12-cu130`:
+ Three distinct upstream regressions on NVIDIA B300 (Blackwell Ultra, `sm_103` — compute capability 10.3) shipped in `lmsysorg/sglang:v0.5.12-cu130`. (sm_120 is for *consumer* Blackwell / RTX 50 series, not B300 — don't propagate that.)
🟡 Pre-existing stale reference missed by this PR's cleanup pass: .claude/commands/klaud-pr-status-html.md:171 still contains a dashboard Reason-cell example keyed to PR #1422 reading "Upstream sglang v0.5.12 flash_attn SM-arch regression on B300 (sm_120)." The PR description explicitly identifies "dashboard Reason cells" as one of the propagation targets for the bad sm_120 assumption — this template, which agents copy into /tmp/klaud_pr_diag.json each /klaud-pr-status-html invocation, was missed. Suggest updating it to sm_103 in the same PR so the correction is not re-seeded on every run.
What the bug is
The PR corrects KLAUD_DEBUG.md §4 so that B300 (Blackwell Ultra) is described as sm_103 (compute capability 10.3) rather than the consumer-Blackwell sm_120. The PR description states explicitly that this bad assumption "had propagated through agent diagnoses, through several dashboard Reason cells, and into the upstream issue body" and that those have been corrected. However, one such propagated dashboard Reason cell — directly keyed to the same PR #1422 fixed in §4c — was not updated.
Where it lives
.claude/commands/klaud-pr-status-html.md:171 contains the example JSON template that agents copy verbatim into /tmp/klaud_pr_diag.json when generating dashboard HTML (per the surrounding instructions on lines 161–176). That template's Reason cell for PR #1422 still reads:
"Upstream sglang v0.5.12 flash_attn SM-arch regression on B300 (sm_120)."
A repo-wide grep for sm_120 returns exactly two hits: the corrective line in KLAUD_DEBUG.md:69 (introduced by this PR, and which deliberately disambiguates sm_120 as consumer Blackwell) and this stale line in the slash-command template.
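The grep is easy to reproduce. Below is a minimal Python sketch, assuming only Markdown files matter; the function name `find_stale_refs` and the `*.md` restriction are assumptions made for illustration, and a plain `grep -rn sm_120 .` would do the same job.

```python
from pathlib import Path


def find_stale_refs(root, needle="sm_120"):
    """Return (relative_path, line_number, line_text) for each line containing needle."""
    root = Path(root)
    hits = []
    # rglob also descends into dot-directories such as .claude/
    for path in sorted(root.rglob("*.md")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for rel_path, lineno, line in find_stale_refs("."):
        print(f"{rel_path}:{lineno}: {line}")
```

Each hit still needs a human read: the KLAUD_DEBUG.md mention is intentional, since it names `sm_120` only to say it is not B300.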
Why this matters for this PR specifically
The PR is explicitly doc-only and scoped to scrubbing the bad sm_120 → B300 association from agent-facing materials. The template at klaud-pr-status-html.md:171 is exactly the class of artifact the PR description calls out ("dashboard Reason cells"). Because the template is the seed agents copy when generating fresh status JSON, leaving it as sm_120 causes each future /klaud-pr-status-html run to re-inject the wrong arch string into newly produced dashboards — defeating the purpose of §4c's correction.
Step-by-step propagation proof
- A user (or cron) invokes `/klaud-pr-status-html`.
- Per the playbook in `.claude/commands/klaud-pr-status-html.md` (lines 161–176), the agent copies the example JSON block, including the Reason for PR #1422, into `/tmp/klaud_pr_diag.json` as its starting scaffold.
- The agent renders the dashboard HTML from that JSON; the Reason cell for PR #1422 (and structurally similar new B300 flash_attn rows) carries the literal `sm_120` string.
- A future debug session reads the dashboard, sees `sm_120` attributed to B300, and re-seeds the same wrong assumption that this PR is specifically trying to eradicate.
Impact
Doc/template only — no runtime effect — but it directly undercuts the stated cleanup goal of the PR and will silently reintroduce the wrong arch on each dashboard regeneration until corrected.
Fix
Single-line edit in .claude/commands/klaud-pr-status-html.md:171: change (<code>sm_120</code>) to (<code>sm_103</code>) so the template matches the corrected KLAUD_DEBUG.md §4c. Severity is nit since it is a doc-template fix, but it is in scope for this PR (which is explicitly a doc-only sm_120 → sm_103 cleanup pass).
Summary
Two corrections to `KLAUD_DEBUG.md` §4 (the B300 sglang v0.5.12 regressions playbook):

1. **Arch fix.** B300 (Blackwell Ultra datacenter) is compute capability 10.3 / `sm_103`. `sm_120` is consumer Blackwell (RTX 50 series / GB20x dies), not B300. This bad assumption had propagated through agent diagnoses, through several dashboard `Reason` cells, and into the upstream issue body for sgl-project/sglang#25563 (already corrected there in the same pass).
2. **§4c reframe.** `sm_103` is nominally inside the asserted range `sm_100 <= arch <= sm_110f` (since 100 ≤ 103 ≤ 110), so the assertion failure isn't really "B300 is outside the range". The best guess is that the cute kernel's `Arch.sm_110f` set only matches the architecture-specific feature-flag variants it was compiled for (`sm_100`, `sm_100f`, `sm_110`, `sm_110f`), and `sm_103`/`sm_103a` isn't in that explicit list.

Also cross-linked sgl-project/sglang#25563 under §4b (filed earlier this session for the EAGLE draft graph capture crash on GLM-5-NVFP4 at bs=128; same B300 v0.5.12 regression family).
Test plan
`KLAUD_DEBUG.md` 4-line edit).

🤖 Generated with Claude Code