handing off to @Oseltamivir to try to debug this; my /loop failed on these and I didn't get the chance to look at them manually

below is [AI Generated]

Handoff to @Oseltamivir: 13 stuck Klaud Cold PRs

These 13 PRs all have a real diagnosis and (where applicable) an applied workaround, but they're blocked on upstream fixes, infra outages, or judgment calls outside what /loop should keep retrying. Handing them off so they don't keep churning sweep capacity.

PRs grouped by category:
DSV4, latest stable image didn't work (4 PRs): #1461, #1460, #1455, #1450
Upstream bugs with ticket filed (4 PRs): #1494, #1441 → sgl#25742; #1451 → sgl#25863; #1420 → sgl#25563
MI300 cluster down, waiting for firmware upgrades (3 PRs): #1499, #1482, #1403
Other (2 PRs): #1512, #1521

And the same set grouped by vendor (NVIDIA vs AMD):
NVIDIA (H200, B200, B300): #1461, #1460, #1455, #1450, #1451, #1420, #1512
AMD (MI300X, MI355X): #1494, #1441, #1499, #1482, #1403, #1521

For each PR below: links to the PR + most recent failing sweep run, what's wrong, what's been tried, and the upstream tickets (if filed).

DSV4: latest stable image didn't work (4 PRs)
Category summary: Every DSV4 PR is blocked on the generic upstream image lacking what DSV4-Pro needs. For example, the generic image's transformers has no model_type registration for deepseek_v4 (KeyError: 'deepseek_v4'), which affects #1455 and #1450.
Recommendation: keep DSV4 pinned to its SHA-pinned custom images (deepseek-v4-blackwell@sha256:..., deepseek-v4-b300@sha256:..., deepseek-v4-hopper); close the generic-bump PRs (#1460, #1455, #1450). #1461 is worth one more env-var attempt before closing.

#1461: dsv4-fp8-h200-vllm (+mtp) → v0.21.0
Diagnosis: vLLM v0.21.0 CUDA-graph memory profiler still over-reserves VRAM even after dropping --gpu-memory-utilization to 0.90 (already pushed). Sweep finished with FAILURE=77 SKIPPED=7 SUCCESS=15, so most jobs still fail.
Tried (didn't fix): Lowered --gpu-memory-utilization from default → 0.90. Did not unblock.
Proposed next: Add export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve in dsv4_fp8_h200{,_mtp}.sh. Worth one more attempt before escalating to vLLM upstream.
Upstream: None filed yet.
Escalation: If you try the VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 workaround (or any other debug attempt) and it still fails, please ping @ywang96; DSV4 + vLLM v0.21.0 on H200 is in their wheelhouse.

#1460: dsv4-fp8-h200-sglang (+mtp) → v0.5.12-cu130
Diagnosis: DSV4-Pro FP8 + MTP weights take ~125 GB / 141 GB per H200 on generic v0.5.12-cu130. The custom deepseek-v4-hopper image uses a different EAGLE / weight layout that fits.
Tried (didn't fix): Image bump alone; no flag toggle would close a ~16 GB/H200 gap.
Recommendation: Keep DSV4 pinned to the SHA-pinned custom image; the generic bump is not viable on H200. Likely close this PR.
Upstream: None. This is an sglang-image packaging decision, not a bug.

#1455: dsv4-fp4-b300-sglang (+mtp) → v0.5.12-cu130
Diagnosis: Not OOM, not a B300 kernel regression. Server starts cleanly ("fired up and ready to roll!"); the bench client crashes in benchmark_serving.py calling AutoTokenizer.from_pretrained("/data/models/dsv4-pro") with KeyError: 'deepseek_v4', because the generic v0.5.12-cu130 image's transformers doesn't know that model_type. The custom deepseek-v4-b300@sha256:... image bundles a patched transformers.
Tried (didn't fix): Image bump only. Post-cluster-fix rebase (head 7e3166ec) also fails (FAILURE=2 IN_PROGRESS=2 QUEUED=25 SKIPPED=23 SUCCESS=7), confirming that the failure isn't about cluster availability.
Recommendation: Keep the recipe on the custom deepseek-v4-b300@sha256:... image until sglang ships an image that bundles transformers with deepseek_v4 support. Likely close. Same conclusion as #1460 / #1450.

#1450: dsv4-fp4-b200-sglang → v0.5.12-cu130
Diagnosis: The generic v0.5.12-cu130 image bundles a transformers that doesn't recognize model_type: "deepseek_v4", so the bench client crashes in AutoTokenizer.from_pretrained with KeyError: 'deepseek_v4'. The custom deepseek-v4-blackwell@sha256 image presumably bundles a patched transformers.
Recommendation: Keep the recipe on lmsysorg/sglang:deepseek-v4-blackwell@sha256:.... Same handling as #1460 / #1455; likely close.

Upstream bugs with ticket filed (4 PRs)
Category summary: Each of these has a known upstream issue blocking the recipe. AMD (sgl#25742) acknowledged the GLM-5.1-MXFP4 GSM8K regression. The two B300/sglang-v0.5.12 ones (sgl#25563, sgl#25863) are filed and awaiting triage. Don't keep retrying these; wait for upstream.

#1494: Add glm5.1-fp4-mi355x-sglang-mtp recipe (sgl#25742, AMD)
Diagnosis: Quality regression, not a crash. 1 eval-only job failed: glm5.1 fp4 mi355x sglang tp=2 spec-mtp conc-256 eval-only. Server warmed up, lm-eval ran gsm8k to completion, but accuracy was exact_match = 0.1774 / 0.1782 against the 0.85 threshold. EAGLE+MTP draft on GLM-5.1 MXFP4 is producing degenerate output on math reasoning; likely the draft model isn't aligned for chain-of-thought, or the new recipe's speculative knobs need tuning (--speculative-num-steps=3, --speculative-num-draft-tokens=4). Perf-bench jobs were still in flight at handoff time.
Tried (didn't fix): None. The fix needs a judgment call on which option below to take.
Options:
(a) Drop the eval-only entry from the recipe (let perf bench validate; skip the gsm8k accuracy gate for the MTP variant).
(b) Tune the EAGLE knobs: turn --speculative-num-steps / --speculative-eagle-topk down.
(c) Lower the gsm8k threshold for this recipe in utils/evals/thresholds.json.
(d) Wait for the perf-bench jobs to finish; if those pass, merge with the eval gate removed.
Recommendation: (a) or (d). Tuning EAGLE knobs (b) without a real perf-quality study is just guessing, and dropping the threshold (c) silently hides the regression.

#1441: Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 (sgl#25742, AMD)
Diagnosis: GSM8K exact_match = 0.3177 (< 0.85 threshold). Better than #1494's 0.18 (no draft model degrading output) but still ~3x below the gate. The GLM-5.1-MXFP4 model itself doesn't pass GSM8K at fp4 on mi355x, so this is not an image-bump regression. 27 perf-bench jobs succeeded; only eval-only is failing.
Option: Lower the gsm8k threshold in utils/evals/thresholds.json for these recipes (e.g. to 0.30 for off, 0.15 for mtp).

#1451: qwen3.5-fp8-b300-sglang (+mtp) → v0.5.12-cu130 (sgl#25863)
Diagnosis: Post-cluster-fix rebase clears the vision-encoder cute crash via the --mm-attention-backend triton_attn workaround, but now exposes a silent GSM8K quality regression: exact_match=0.0000 (strict-match) / 0.0015 (flexible-extract) against a 0.85 threshold on the tp=4 8k1k spec-none conc-256 eval-only matrix entry. Server starts cleanly; requests succeed; the model just isn't producing GSM8K-formatted answers. Likely an interaction between --quantization fp8, --moe-runner-backend flashinfer_trtllm, --attention-backend trtllm_mha, and chat-template handling in v0.5.12-cu130.
Tried (didn't fix): Rebased onto current main (head e1d3a181). --mm-attention-backend triton_attn workaround in place (for the unrelated sgl#25564 cute crash); doesn't fix the quality regression.

#1420: glm5-fp4-b300-sglang (+mtp) → v0.5.12-cu130 (sgl#25563)
Diagnosis: Upstream sglang v0.5.12 trtllm-batched-gemm bug: EAGLE draft CUDA-graph capture crashes at bs=128 (numBatches=256, GemmMNK 128x1024x6144, kernel ...sm100f) on B300 for GLM-5-NVFP4. Per @trevor-m on sgl#25563, likely a flashinfer regression (flashinfer_python 0.6.8.post1 → 0.6.11.post1 bump between v0.5.11 and v0.5.12).
Tried (currently running): Pushed cfaf3bdd pinning flashinfer_python==0.6.8.post1 + flashinfer_cubin==0.6.8.post1 in glm5_fp4_b300{,_mtp}.sh; rebased onto current main (head 006a3908). Post-rebase sweep shows FAILURE=1 IN_PROGRESS=8 QUEUED=1 SKIPPED=6 SUCCESS=33. Most jobs pass with the pin, but one 8k1k MTP job still fails. Worth checking whether the remaining FAILURE is the same trtllm-batched-gemm site or something new.

MI300 cluster down: waiting for firmware upgrades (3 PRs)
Category summary: MI300 cluster is in a firmware-upgrade window; sweep retries that hit mi300x-amds_* nodes get cancelled or can't allocate. No code change needed; these PRs just need a rerun once the upgrade window ends. The high cancellation counts (e.g. CANCELLED=10–13) are infra, not recipe regressions.

#1499: Add dsr1-fp8-mi300x-sglang-mtp recipe
Diagnosis: Sweep breakdown CANCELLED=10 FAILURE=3 SKIPPED=6 SUCCESS=12 on the mi300x pool, mostly cancellations consistent with the firmware-upgrade outage. The 12 successful jobs show the recipe itself works on the available nodes.
Tried (didn't fix): Nothing. Reruns will keep getting cancelled until the cluster is back.
Proposed fix: Wait for the firmware-upgrade window to end, then gh run rerun --failed on the latest sweep run.

#1482: Add qwen3.5-fp8-mi300x-sglang-mtp recipe
Diagnosis: Sweep breakdown CANCELLED=13 SKIPPED=6 SUCCESS=12. All 13 non-skipped failures are cancellations on mi300x nodes; zero hard FAILUREs. Same firmware-upgrade outage as #1499.
Proposed fix: As for #1499, gh run rerun --failed once the upgrade window ends.

#1403: Update gptoss-fp4-mi300x-vllm vLLM ROCm image to v0.21.0
Diagnosis: Originally diagnosed as a transient SLURM controller flake; now re-categorized as MI300 cluster downtime for firmware upgrades. Single matrix job (single-node 8k1k spec-none conc-X) timed out after ~5h waiting for an allocation on mi300x-amds_01. The salloc log shows _accept_msg_connection[167.94.146.58:63632]: Connection reset by peer; Job submit/allocate failed. The other 42 successes show the image bump itself is fine.
Tried (didn't fix): Nothing. Reruns will keep hitting the same infra gap until the upgrade completes.
Proposed fix: Wait for the firmware-upgrade window to end, then gh run rerun 26008643806 --failed; the cluster should then allocate and the PR should go green.
Upstream: None. This is an infra schedule, not a bug.

Other (2 PRs)

#1512: Test sgl-deep-gemm==0.0.1 pin for sgl#25551 (glm5-fp8-b300 DeepGemm regression)
What this PR is: A debug-only test of @trevor-m's suggestion in sgl-project/sglang#25551 (comment): pin sgl-deep-gemm==0.0.1 inside the v0.5.12 container (re-enable JIT DeepGemm) to check whether the deep-gemm 0.0.1 → 0.1.0 upgrade is what triggers the B300 CUDA_ERROR_ILLEGAL_ADDRESS TMA-descriptor regression. Not meant to merge.
Diagnosis: First attempt was invalid because the pin never applied. Both pip install --no-deps lines in glm5_fp8_b300.sh got blocked by Debian PEP 668 (error: externally-managed-environment) inside lmsysorg/sglang:v0.5.12-cu130. Pushed f24746e5 adding --break-system-packages so the pin actually takes effect; awaiting fresh sweep result.
Tried (didn't fix): Initial 0.0.1 pin via pip install --no-deps, blocked by PEP 668. Now fixed with --break-system-packages.

#1521: Add dsr1-fp8-mi355x-sglang-mtp single-node MTP recipe
Diagnosis: Sweep breakdown CANCELLED=1 FAILURE=1 SKIPPED=24 SUCCESS=23. A single eval-only job died early (srun: error: mia1-p01-g31: task 0: Exited with exit code 1, no results*.json produced). Server fully started (cuda-graph capture completed, mem ~88 GB free); the failure is in the eval shell stage, not the model. Likely a flake or a missing eval-tool dep on that one node; needs investigation before judging the recipe.
Tried (didn't fix): Nothing. Needs a closer look at the failing eval job's pre-run shell output.
Proposed next: Rerun the failed job (gh run rerun 26134317173 --failed) and, if it reproduces, dig into the eval shell stage for that node.
Upstream: None. Likely a flake/env issue, not a bug.

The /loop skill keeps refreshing the dashboard and applying surface-level workarounds, but none of these are productive to keep retrying without either (a) an upstream fix landing, (b) the MI300 cluster coming back, or (c) a human deciding scope (close vs keep open, change strategy).
Each affected PR's title has been prefixed with [Handoff to @Oseltamivir Claude /loop] so they're easy to find in the PR list and the dashboard won't keep re-diagnosing them.
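
The sweep breakdowns quoted in this handoff (for example FAILURE=77 SKIPPED=7 SUCCESS=15, or CANCELLED=13 SKIPPED=6 SUCCESS=12) are read the same way each time: cancellation-heavy sweeps point at infra, failure-heavy sweeps need a real diagnosis. A minimal Python sketch of that reading follows. The function names parse_sweep and classify are hypothetical, and the rule "more cancellations than hard failures means likely infra" is an assumption made for illustration, not something /loop or the dashboard is known to implement.

```python
import re


def parse_sweep(summary: str) -> dict[str, int]:
    """Turn a breakdown like 'FAILURE=77 SKIPPED=7 SUCCESS=15' into counts."""
    # Status names are upper-case and may contain underscores (IN_PROGRESS).
    return {name: int(count) for name, count in re.findall(r"([A-Z_]+)=(\d+)", summary)}


def classify(summary: str) -> str:
    """Rough triage label for one sweep breakdown (assumed rule, see lead-in)."""
    counts = parse_sweep(summary)
    failed = counts.get("FAILURE", 0)
    cancelled = counts.get("CANCELLED", 0)
    if failed == 0 and cancelled == 0:
        return "green"
    # Assumption: cancellations outnumbering hard failures suggests an infra outage.
    if cancelled > failed:
        return "likely-infra"
    return "needs-diagnosis"


if __name__ == "__main__":
    for line in (
        "FAILURE=77 SKIPPED=7 SUCCESS=15",
        "CANCELLED=13 SKIPPED=6 SUCCESS=12",
        "CANCELLED=1 FAILURE=1 SKIPPED=24 SUCCESS=23",
    ):
        print(line, "->", classify(line))
```

A tie between cancellations and failures (as in the #1521 breakdown) falls through to "needs-diagnosis", which matches the handoff's view that a lone failure should be looked at before judging the recipe.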