
[Klaud Cold] KLAUD_DEBUG: B300 is sm_103 (not sm_120) + cross-link sglang#25563 #1479

Merged
functionstackx merged 1 commit into main from fix-klaud-debug-b300-sm103
May 18, 2026


@functionstackx
Collaborator

Summary

Two corrections to KLAUD_DEBUG.md §4 (the B300 sglang v0.5.12 regressions playbook):

  1. Arch fix. B300 (Blackwell Ultra datacenter) is compute capability 10.3 / sm_103. sm_120 is consumer Blackwell (RTX 50 series / GB20x dies), not B300. This bad assumption had propagated through agent diagnoses, through several dashboard Reason cells, and into the body of upstream issue sgl-project/sglang#25563, "[Bug] GLM-5-NVFP4 + EAGLE on B300 (sm_103): trtllm_batched_gemm_runner.cu:276 dispatches sm100f kernel — crashes at bs=128 draft graph capture (v0.5.12-cu130; v0.5.11 works)" (already corrected there in the same pass).

  2. §4c reframe. sm_103 is nominally inside the asserted range sm_100 <= arch <= sm_110f (since 100 ≤ 103 ≤ 110), so the assertion failure isn't really "B300 is outside the range" — best guess is the cute kernel's Arch.sm_110f set only matches the architecture-specific feature-flag variants it was compiled for (sm_100, sm_100f, sm_110, sm_110f) and sm_103 / sm_103a isn't in that explicit list.

Also cross-linked sgl-project/sglang#25563 under §4b (filed earlier this session for the EAGLE draft graph capture crash on GLM-5-NVFP4 at bs=128 — same B300 v0.5.12 regression family).
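To make the §4c reframe concrete, here is a minimal Python sketch contrasting the two readings of `sm_100 <= arch <= sm_110f`. It is not the actual cute kernel check. The names `COMPILED_VARIANTS`, `in_numeric_range`, and `in_compiled_set` are hypothetical, and the variant list is only the best-guess list from item 2 above.

```python
import re

# Hypothetical stand-in for the variants the kernel was compiled for,
# copied from the best-guess list in the PR description.
COMPILED_VARIANTS = frozenset({"sm_100", "sm_100f", "sm_110", "sm_110f"})


def arch_number(tag: str) -> int:
    """Extract the numeric part of a tag, e.g. 'sm_103a' -> 103."""
    match = re.fullmatch(r"sm_(\d+)[a-z]?", tag)
    if match is None:
        raise ValueError(f"unrecognized arch tag: {tag!r}")
    return int(match.group(1))


def in_numeric_range(tag: str, lo: int = 100, hi: int = 110) -> bool:
    """Range reading: compare only the numeric part against the bounds."""
    return lo <= arch_number(tag) <= hi


def in_compiled_set(tag: str) -> bool:
    """Set reading: only exact, explicitly listed variants match."""
    return tag in COMPILED_VARIANTS


for tag in ("sm_100", "sm_103", "sm_103a", "sm_120"):
    print(tag, in_numeric_range(tag), in_compiled_set(tag))
```

Under these assumptions, `sm_103` and `sm_103a` pass the range reading but fail the set reading, which is the shape of failure the reframe proposes.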

Test plan

  • Diff is doc-only (KLAUD_DEBUG.md 4-line edit).

🤖 Generated with Claude Code

Two corrections to §4 (B300 sglang v0.5.12 regressions):

1. **Arch fix.** B300 (Blackwell Ultra datacenter) is compute capability
   10.3 / `sm_103`, NOT `sm_120`. sm_120 is for consumer Blackwell
   (RTX 50 series / GB20x dies). This had propagated through agent
   diagnoses and into upstream issue sgl-project/sglang#25563 (already
   corrected there).

2. **§4c reframe.** sm_103 is *nominally inside* the asserted range
   `sm_100 <= arch <= sm_110f` (since 100 <= 103 <= 110), so the
   assertion failure is more interesting than "outside the range" —
   best guess is the cute kernel's `Arch.sm_110f` set only matches the
   architecture-specific feature-flag variants it was compiled for
   (sm_100, sm_100f, sm_110, sm_110f) and sm_103/sm_103a isn't in
   that list.

Also cross-linked sgl-project/sglang#25563 under §4b (filed earlier
this session for the EAGLE draft graph capture crash on GLM-5-NVFP4
at bs=128 — same B300 v0.5.12 regression family).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx requested a review from a team May 18, 2026 04:16
@functionstackx functionstackx merged commit 165d625 into main May 18, 2026
@functionstackx functionstackx deleted the fix-klaud-debug-b300-sm103 branch May 18, 2026 04:16
Contributor

@claude claude Bot left a comment


LGTM, simple doc correction; nit about a leftover sm_120 reference in a related slash-command template is flagged inline.

Extended reasoning...

Overview

PR #1479 is a 4-line documentation-only edit to KLAUD_DEBUG.md §4, correcting the B300 architecture string from sm_120 (which is consumer Blackwell / RTX 50) to sm_103 (Blackwell Ultra, compute capability 10.3), and reframing §4c so the assertion failure is explained as a feature-flag set mismatch rather than an out-of-range arch. Also adds a cross-link to upstream issue sglang#25563 under §4b. No code paths touched.

Security risks

None. This is an operator-facing internal debug playbook; the file is not loaded by any runtime, build, or CI process, and the changes are purely textual descriptions of architecture identifiers.

Level of scrutiny

Low. The PR explicitly scopes itself to a doc-only 4-line edit and that is exactly what the diff contains. The factual correction (sm_103 vs sm_120 for B300 / Blackwell Ultra GB300) is consistent with NVIDIA's published compute capabilities, and the §4c reframe is internally consistent (100 ≤ 103 ≤ 110 makes the original "outside the range" explanation incorrect on its face).
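For reference, the correspondence between a compute capability and its `sm_` tag can be sketched in a few lines of Python. The helper name `sm_tag` is hypothetical. On a machine with PyTorch, the `(major, minor)` pair could come from `torch.cuda.get_device_capability()`, but the sketch below takes plain integers so that it runs anywhere.

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a (major, minor) compute capability as an sm_XY tag."""
    if major < 0 or not 0 <= minor <= 9:
        raise ValueError(f"unexpected compute capability: {major}.{minor}")
    return f"sm_{major}{minor}"


# Compute capability 10.3 is the value this PR states for B300.
print(sm_tag(10, 3))
# Compute capability 12.0 corresponds to the consumer Blackwell tag.
print(sm_tag(12, 0))
```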

Other factors

The bug hunter flagged a related stale sm_120 reference in .claude/commands/klaud-pr-status-html.md:171 that this cleanup pass missed — it is a legitimate finding (the template is a seed for fresh dashboard JSON) but is a nit relative to the PR's stated 4-line scope, and the inline comment will surface it for follow-up without blocking this correction.

Comment thread KLAUD_DEBUG.md
## 4. Upstream sglang v0.5.12 B300 regressions

-Two distinct upstream regressions on NVIDIA B300 (Blackwell, `sm_120`) shipped in `lmsysorg/sglang:v0.5.12-cu130`:
+Three distinct upstream regressions on NVIDIA B300 (Blackwell Ultra, `sm_103` — compute capability 10.3) shipped in `lmsysorg/sglang:v0.5.12-cu130`. (sm_120 is for *consumer* Blackwell / RTX 50 series, not B300 — don't propagate that.)
Contributor


🟡 Pre-existing stale reference missed by this PR's cleanup pass: .claude/commands/klaud-pr-status-html.md:171 still contains a dashboard Reason-cell example keyed to PR #1422 reading "Upstream sglang v0.5.12 flash_attn SM-arch regression on B300 (sm_120)." The PR description explicitly identifies "dashboard Reason cells" as one of the propagation targets for the bad sm_120 assumption. This template, which agents copy into /tmp/klaud_pr_diag.json on each /klaud-pr-status-html invocation, was missed. Suggest updating it to sm_103 in the same PR so the stale sm_120 string is not re-seeded on every run.

Extended reasoning...

What the bug is

The PR corrects KLAUD_DEBUG.md §4 so that B300 (Blackwell Ultra) is described as sm_103 (compute capability 10.3) rather than the consumer-Blackwell sm_120. The PR description states explicitly that this bad assumption "had propagated through agent diagnoses, through several dashboard Reason cells, and into the upstream issue body" and that those have been corrected. However, one such propagated dashboard Reason cell — directly keyed to the same PR #1422 fixed in §4c — was not updated.

Where it lives

.claude/commands/klaud-pr-status-html.md:171 contains the example JSON template that agents copy verbatim into /tmp/klaud_pr_diag.json when generating dashboard HTML (per the surrounding instructions on lines 161–176). That template's Reason cell for PR #1422 still reads:

"Upstream sglang v0.5.12 flash_attn SM-arch regression on B300 (sm_120)."

A repo-wide grep for sm_120 returns exactly two hits: the corrective line in KLAUD_DEBUG.md:69 (introduced by this PR, and which deliberately disambiguates sm_120 as consumer Blackwell) and this stale line in the slash-command template.
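A scan of this kind can be approximated in Python. The sketch below is not the command the reviewer ran: `find_stale_refs` is a hypothetical helper, the suffix filter is an assumption, and the demo builds a throwaway directory with made-up file contents rather than reading the real repository.

```python
import tempfile
from pathlib import Path


def find_stale_refs(root, needle="sm_120", suffixes=(".md", ".json")):
    """Return (relative_path, line_number, line) for every line containing needle."""
    root = Path(root)
    hits = []
    # pathlib's rglob also descends into dot-directories such as .claude/.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


# Demo on a temporary tree; file names and contents are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / ".claude" / "commands").mkdir(parents=True)
    (demo_root / "KLAUD_DEBUG.md").write_text(
        "B300 is sm_103; sm_120 is consumer Blackwell\n", encoding="utf-8"
    )
    (demo_root / ".claude" / "commands" / "template.md").write_text(
        "first line\nregression on B300 (sm_120)\n", encoding="utf-8"
    )
    demo_hits = find_stale_refs(demo_root)

for hit in demo_hits:
    print(hit)
```

A plain substring match also reports the deliberate disambiguation line, so each hit still needs a human (or a stricter pattern) to decide whether it is stale.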

Why this matters for this PR specifically

The PR is explicitly doc-only and scoped to scrubbing the bad sm_120 → B300 association from agent-facing materials. The template at klaud-pr-status-html.md:171 is exactly the class of artifact the PR description calls out ("dashboard Reason cells"). Because the template is the seed agents copy when generating fresh status JSON, leaving it as sm_120 causes each future /klaud-pr-status-html run to re-inject the wrong arch string into newly produced dashboards — defeating the purpose of §4c's correction.

Step-by-step propagation proof

  1. A user (or cron) invokes /klaud-pr-status-html.
  2. Per the playbook in .claude/commands/klaud-pr-status-html.md (lines 161–176), the agent copies the example JSON block — including the Reason for PR 1422 — into /tmp/klaud_pr_diag.json as its starting scaffold.
  3. The agent renders the dashboard HTML from that JSON; the Reason cell for PR #1422, "Update qwen3.5-bf16-b300-sglang and -mtp SGLang image to v0.5.12-cu130" (and structurally similar new B300 flash_attn rows), carries the literal sm_120 string.
  4. A future debug session reads the dashboard, sees sm_120 attributed to B300, and re-seeds the same wrong assumption that this PR is specifically trying to eradicate.
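One way to interrupt the chain above at step 2 or 3 would be a small guard that inspects the scaffold JSON before rendering. The following Python sketch assumes nothing about the real schema of /tmp/klaud_pr_diag.json: the `rows`, `pr`, and `reason` fields and the second sample row are hypothetical, and `stale_b300_strings` is an illustrative helper, not part of the repository.

```python
import json
import re

# Matches the attribution form "B300 (sm_120)", with or without <code> tags.
STALE_ATTRIBUTION = re.compile(r"B300\s*\(\s*(?:<code>)?sm_120(?:</code>)?\s*\)")


def stale_b300_strings(node, path="$"):
    """Yield (json_path, text) for every string that attributes sm_120 to B300."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from stale_b300_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from stale_b300_strings(value, f"{path}[{index}]")
    elif isinstance(node, str) and STALE_ATTRIBUTION.search(node):
        yield path, node


# Hypothetical scaffold; only the first reason string is quoted from the review.
sample = json.loads("""
{"rows": [
  {"pr": 1422,
   "reason": "Upstream sglang v0.5.12 flash_attn SM-arch regression on B300 (sm_120)."},
  {"pr": 0,
   "reason": "Placeholder row: B300 is sm_103; sm_120 is consumer Blackwell."}
]}
""")

findings = list(stale_b300_strings(sample))
for json_path, text in findings:
    print(json_path, "->", text)
```

Matching the specific `B300 (sm_120)` form, rather than any mention of `sm_120`, keeps the deliberate disambiguation wording from being flagged.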

Impact

Doc/template only — no runtime effect — but it directly undercuts the stated cleanup goal of the PR and will silently reintroduce the wrong arch on each dashboard regeneration until corrected.

Fix

Single-line edit in .claude/commands/klaud-pr-status-html.md:171: change (<code>sm_120</code>) to (<code>sm_103</code>) so the template matches the corrected KLAUD_DEBUG.md §4c. Severity is nit since it is a doc-template fix, but it is in scope for this PR (which is explicitly a doc-only sm_120 → sm_103 cleanup pass).
