[WIP] Chore/agentx v0.3 by cquil11 · Pull Request #1571 · SemiAnalysisAI/InferenceX

cquil11 · 2026-05-27T13:58:31Z

Note

Medium Risk
Large CI matrix and benchmark-script changes affect production sweep behavior; LMCache/ROCm runtime patches and multinode GB300 agentic recipes add operational complexity but are confined to benchmark infrastructure.

Overview
This PR advances AgentX v0.3: agentic-coding benchmarks move from the legacy trace-replay submodule to aiperf (cjq/agentx-v0.3-subagents), with artifacts under aiperf_artifacts/ and shared replay via run_agentic_replay_and_write_outputs in benchmark_lib.sh. Workflows route non-agentic runs to fixed_seq_len/ and expand offload modes (lmcache, hicache, etc.).

Sweep configs add many agentic (and some fixed-seq) matrix entries across AMD/NVIDIA (Qwen3.5 HiCache, DSv4, Kimi, MiniMax, GB300 dynamo-vLLM disagg agentic on NV/CW). Several sweeps drop CPU-offload points for this iteration in favor of no-offload curves on a newer trace corpus; Kimi agentic on MI355X/B200/B300 shifts toward LMCache (with substantial ROCm-specific LMCache/vLLM patches on MI355X). Runner labels for mi355x are normalized (mi355x-amds_00–_08).

Benchmark scripts gain DSv4 MI355X SGLang agentic, Qwen HiCache launchers (B300/H100/MI355X), DSv4 vLLM native CPU offload tuning, and GB300 srt-slurm agentic recipes (NATS payload, Slurm mem/CPU, agentic_srt.sh / keepalive). Prefix/radix cache is enabled where agentic replay depends on it; MiniMax uses a 256k-capped Weka loader when context is limited.

Reviewed by Cursor Bugbot for commit 6a77acb.

…loadingConnector

vLLM's --kv_offloading_backend native resolves to two different connectors based on the VLLM_USE_SIMPLE_KV_OFFLOAD env var (see vllm/config/vllm.py:662):

- VLLM_USE_SIMPLE_KV_OFFLOAD=1 -> SimpleCPUOffloadConnector (the path we were using; carries the popleft_n + context-overflow + completion-barrier bugs we hit on B200/B300/H200)
- unset (default) -> OffloadingConnector (the regular native path)

This commit drops the env var and the JSON form, switching MI355X to the shortcut form which now routes to OffloadingConnector. We're trying the regular path here to see if it sidesteps the SimpleCPUOffloadConnector-specific issues that have been forcing lazy_offload + workarounds.

Also drops the --kv-transfer-config JSON since the shortcut form constructs the KVTransferConfig itself at engine startup. Keeps --disable-hybrid-kv-cache-manager since MI355X uses --block-size=1 + AITER which doesn't play with the hybrid manager.
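The selection rule described in this commit message can be sketched as a small function. This is only a model of the behaviour as stated here, not vLLM's code: the function name is hypothetical, and treating only the literal value "1" as "set" is an assumption.

```python
import os


def resolve_native_connector(env=None):
    """Model of the connector choice described in the commit message.

    Hypothetical helper, not part of vLLM. Assumption: only the literal
    string "1" selects the simple path; anything else is treated as unset.
    """
    env = os.environ if env is None else env
    if env.get("VLLM_USE_SIMPLE_KV_OFFLOAD") == "1":
        return "SimpleCPUOffloadConnector"
    return "OffloadingConnector"


# Dropping the env var (the change made in this commit) selects the regular path.
print(resolve_native_connector({}))
print(resolve_native_connector({"VLLM_USE_SIMPLE_KV_OFFLOAD": "1"}))
```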

Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM than the default v0.20.0-cu130. Image: cquil/vllm-openai:v0.21.0-8813c92. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirrors the dsv4-fp4-b200-vllm-agentic CONC sweep (tp8 [16,32,64] + tp8 dp-attn [64,128,256]) so the two SKUs can be compared on the same trace load. Uses the same SGLang image as the fixed-seq-len sibling (rocm/sgl-dev:rocm720-mi35x-0363e6c-20260509-DSv4). Offload sweep is none-only (SGLang has no equivalent of vLLM's SimpleCPUOffloadConnector that we exercise on b200). Launcher swaps the fixed-seq-len harness (run_benchmark_serving) for the agentic harness (build_replay_cmd / write_agentic_result_json / analyze_benchmark_distributions) but keeps all SGLang server flags and SGLANG_* env vars identical to the fixed-seq-len sibling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R2 dispatch failed on all 6 b200 shards with the same enroot error during manifest fetch:

    [INFO] Fetching image manifest list
    [INFO] Fetching image manifest
    [ERROR] Could not process JSON input
    curl: (23) Failure writing output to destination

Docker Hub confirms the image exists with a clean Docker v2 manifest, but enroot import was being invoked as `docker://docker.io/cquil/vllm-openai:...` because the image field had the docker.io/ prefix. Every other image entry in the repo uses the bare `org/repo:tag` form (no docker.io/ prefix), so this entry was the outlier. Dropping the prefix matches convention and should let enroot resolve the registry host normally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

First multi-node agentic config with the recipe local to this repo. Adds:

- Two new agentic recipes under benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/, adapted from the corresponding 8k1k fixed-seq-len siblings:
  * disagg-gb300-1p6d-dep4-tp4-agentic.yaml (low-lat conc=32, mid conc=192)
  * disagg-gb300-4p1d-dep4-dep8-24-c4096-agentic.yaml (high-tput conc=4096)

  Both drop max-model-len, drop no-enable-prefix-caching, add DSv4 tool/reasoning parsers, switch benchmark.type sa-bench -> custom (hands off to benchmarks/multi_node/agentic_srt.sh which builds the aiperf inferencex-agentx-mvp invocation).
- New IS_AGENTIC=1 branch at the top of runners/launch_gb300-nv.sh's framework conditional. Clones the cquil11/srt-slurm-nv fork (the only srt-slurm build that supports benchmark.type=custom) on the cam/sa-submission-q2-2026 branch and overlays the local agentic recipes into recipes/vllm/deepseek-v4/agentic/ so iteration stays in this repo.
- New dsv4-fp4-gb300-dynamo-vllm-agentic config entry in nvidia-master.yaml as a sibling of the byte-identical-to-origin/main dsv4-fp4-gb300-dynamo-vllm base. Three-tier sweep:
  * low-latency (conc=32, 1p6d shape, 28 GPUs / 8 nodes)
  * mid (conc=192, 1p6d shape, same alloc as low-lat)
  * high-tput (conc=4096, 4p1d shape, 24 GPUs / 7 nodes)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R1 of dsv4-fp4-gb300-dynamo-vllm-agentic failed at `srtctl apply` with two schema errors against the cquil11/srt-slurm-nv fork:

    Invalid config: {'dynamo': {'wheel': ['Unknown field.']}, 'benchmark': {'env': {'PORT': {'value': ['Not a valid string.']}}}}

The first (dynamo.wheel) is fixed by cherry-picking commit 0060f857 from NVIDIA upstream onto cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (adds wheel field + install scripts; pushed separately).

The second (PORT) is fixed here: env values must be strings, so `PORT: 8000` -> `PORT: "8000"`. INFMAX_CONTAINER_WORKSPACE / RESULT_DIR parse as strings due to their / chars, and IS_MULTINODE was already quoted; PORT was the only bare int.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
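A minimal sketch of the "env values must be strings" rule, useful as a pre-submit check on a parsed recipe. The helper names and the example values are hypothetical; the dictionary mimics what a YAML parser would yield for a bare `PORT: 8000` next to quoted or path-like values.

```python
def non_string_env_values(env):
    """Return the keys whose values are not strings (hypothetical helper)."""
    return sorted(key for key, value in env.items() if not isinstance(value, str))


def stringify_env(env):
    """Coerce every value to a string, as quoting it in the recipe would."""
    return {key: str(value) for key, value in env.items()}


# Illustrative values only: a bare integer PORT alongside string-typed entries.
benchmark_env = {
    "PORT": 8000,                      # bare int, rejected by the schema
    "IS_MULTINODE": "1",               # already quoted (value is illustrative)
    "RESULT_DIR": "/example/results",  # path-like, parsed as a string
}

print(non_string_env_values(benchmark_env))   # keys that need quoting
print(stringify_env(benchmark_env)["PORT"])   # the quoted form
```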

R2 of dsv4-fp4-gb300-dynamo-vllm-agentic landed all 3 shards on gb300-cw_N runners (CoreWeave self-hosted runners advertise both gb300-cw AND gb300-nv labels). RUNNER_NAME%%_* resolves to gb300-cw, which routes to runners/launch_gb300-cw.sh, but that launcher had no IS_AGENTIC handling, so it cloned upstream NVIDIA/srt-slurm (which lacks benchmark.type=custom) instead of the cquil11 fork. srtctl apply then failed:

    Invalid config: {'benchmark': {'command': ['Unknown field.'], 'env': ['Unknown field.']}}

Mirrors the IS_AGENTIC=1 branch I added earlier to launch_gb300-nv.sh: use cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (now patched with dynamo.wheel support via cherry-picked NVIDIA commit 0060f857) and overlay our local agentic recipes from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/.

Both gb300-nv and gb300-cw launchers now handle IS_AGENTIC identically, so the workload runs correctly regardless of which runner picks it up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
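The routing step can be sketched in Python as follows. The bash expansion `${RUNNER_NAME%%_*}` removes the longest suffix matching `_*`, that is, everything from the first underscore onward. The function names are hypothetical, and the `runners/launch_<prefix>.sh` template is an assumption inferred from the two launcher files named in this PR.

```python
def runner_prefix(runner_name):
    """Python equivalent of bash ${RUNNER_NAME%%_*}.

    Drops everything from the first underscore; a name without an
    underscore is returned unchanged, as in bash.
    """
    return runner_name.split("_", 1)[0]


def launcher_for(runner_name):
    """Assumed launcher path template (hypothetical helper)."""
    return f"runners/launch_{runner_prefix(runner_name)}.sh"


# A shard that lands on a CoreWeave runner is routed to the cw launcher,
# regardless of which label the workflow asked for.
print(launcher_for("gb300-cw_3"))
print(launcher_for("gb300-nv_0"))
```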

Upstream NVIDIA/srt-slurm@main has caught up on every schema feature the agentic path needs:

- BenchmarkType.CUSTOM + benchmark.command + benchmark.env (the hook that hands off to benchmarks/multi_node/agentic_srt.sh)
- DynamoConfig.wheel (so our vllm recipes can pin the same ai-dynamo wheel as the fixed-seq-len path)
- default_bash_preamble (no more "Unknown field" warning)

So we don't need the cquil11/srt-slurm-nv fork anymore. Pin to upstream commit 127597c0e6d3 (current HEAD) for reproducibility; bump as upstream evolves.

Also fix: `uv venv` defaults to no-pip. The upstream prefetch-ai-dynamo-wheel.sh script (called by srtctl when a recipe has `dynamo.wheel` set) does `python3 -m pip download`, which fails with "No module named pip" without a seeded venv. Adding --seed installs pip+setuptools+wheel into the venv so the prefetch path works. R4 of dsv4-fp4-gb300-dynamo-vllm-agentic showed this error on the gb300-cw runner immediately after the lockfile cleanup unblocked the import_squash step.

Both gb300-cw and gb300-nv launchers updated identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R5 first-shard failure on gb300-nv runner:

    fatal: reference is not a tree: 127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74

I extrapolated the 40-char SHA from a 7-char short `127597c` shown in git log output instead of resolving it. The real SHA is 127597c2926467db06e6707e0aa9227261c6c02a (NVIDIA/srt-slurm@main, "Update GB300 FP8 GLM-5 recipe (#160)").

R5's gb300-cw shards didn't immediately fail on the same error: either they hadn't reached the checkout step yet when I noticed, or their git was more lenient about the prefix-then-garbage SHA. Either way, the fixed SHA works for both.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
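A short sketch of why a format check could not have caught this: the fabricated and the real SHA are both well-formed 40-character hex strings sharing the same 7-character prefix. Only resolving the prefix against the repository's actual objects (in practice `git rev-parse 127597c` inside a clone) yields the right value. The helper names are hypothetical, and `known_shas` stands in for the object IDs a real repository would supply.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")

FABRICATED = "127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74"
REAL = "127597c2926467db06e6707e0aa9227261c6c02a"


def is_full_sha(value):
    """True when value looks like a full SHA-1 object ID (format only)."""
    return FULL_SHA.fullmatch(value) is not None


def resolve_short_sha(short, known_shas):
    """Expand a short SHA against a list of known object IDs.

    Hypothetical stand-in for `git rev-parse`; raises when the prefix
    matches zero or several objects instead of guessing.
    """
    matches = [sha for sha in known_shas if sha.startswith(short)]
    if len(matches) != 1:
        raise ValueError(f"{short!r} matched {len(matches)} objects")
    return matches[0]


# Both strings pass the format check, so the check alone proves nothing.
print(is_full_sha(FABRICATED), is_full_sha(REAL))
# Resolution against real object IDs returns the one that exists.
print(resolve_short_sha("127597c", [REAL]))
```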

… launcher

Two issues caught in R5:

1) dynamo-vllm worker rejects chat parser flags

   The worker entrypoint (different argparser than `vllm serve`) errors:

       __main__.py: error: unrecognized arguments: --enable-auto-tool-choice --tool-call-parser deepseek_v4

   These belong on the dynamo frontend, not the worker. In disagg, chat parsing happens at the frontend; workers just take tokens. The 8k1k sibling recipes (which work) don't set these either. I mistakenly ported them from the single-node launchers, which run `vllm serve` directly (the chat-serving entrypoint).

   Drop --tool-call-parser, --enable-auto-tool-choice, --reasoning-parser from both prefill and decode blocks in both agentic recipes. Keep --tokenizer-mode deepseek_v4 (worker DOES accept that one).

2) launch_gb300-cw.sh was missing set -e

   The fabricated SHA bug from the prior commit only surfaced on the nv launcher (which has set -exo pipefail). The cw launcher silently swallowed the failed `git checkout` and proceeded on origin/HEAD, which happened to be the right commit, masking the bug. Add `set -exo pipefail` to match the nv launcher; loud failures are safer than silent ones.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R6 surfaced via srtctl preflight that /scratch/models/DeepSeek-V4-Pro is not staged on the gb300-nv cluster:

    Error: Preflight failed for ...disagg-gb300-1p6d-dep4-tp4-agentic.yaml:
      - model.path: Model alias 'deepseek-v4-pro' resolved to '/scratch/models/DeepSeek-V4-Pro', but that path is unavailable.

DSR1 weights ARE staged on /scratch (node-local SSD), but DSv4-Pro was never staged there. The 806 GB DSv4-Pro checkpoint lives at /home/sa-shared/models/DeepSeek-V4-Pro (NFS, shared across nodes).

This silently broke the existing 8k1k fixed-seq-len path for dsv4-vllm on gb300-nv too (just hadn't been exercised against the stricter upstream srtctl preflight).

Fix is single-file: re-point the DSv4 leg of the per-model conditional to the NFS path. NFS is slower than /scratch but that's where the model actually lives. Stage to /scratch and switch back if model load I/O becomes a bottleneck.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…S ELOOP

R7 of dsv4-fp4-gb300-dynamo-vllm-agentic:

    Fatal error: Symlink loop from '/home/sa-shared/models/DeepSeek-V4-Pro'
    OSError: [Errno 40] Too many levels of symbolic links

Same Vast NFS ELOOP bug we hit on the squash lockfiles in R3/R4: the /home/sa-shared/ NFS mount returns ELOOP to workflow worker processes (specifically those spawned through GHA runner pod -> sbatch -> pyxis/enroot), even though the same path is a regular directory from interactive sessions (verified via gb300-slurm + srun on c001: both Path.resolve() and ls succeed cleanly).

Workaround: /data/ and /home/sa-shared/ are SEPARATE mount points backed by the SAME storage (storage-vip.vast.p03.globalai.run, with /scratch and /scratch/home/sa-shared as the server-side paths). Switching MODEL_PATH to /data/home/sa-shared/models/DeepSeek-V4-Pro gives us identical files with a separate NFS client cache, which isn't poisoned in the workflow context.

Doesn't fix the underlying Vast NFS bug, just routes around it. Long-term: stage DSv4-Pro to /scratch/models/ (node-local SSD) like DSR1, both for performance and to bypass this whole mount class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R7 of dsv4-fp4-gb300-dynamo-vllm-agentic had 6/8 worker srun steps OOM-killed within 30s, with `torch.AcceleratorError: CUDA-capable device(s) is/are busy or unavailable` (CUDA init aborts when SIGKILL races it). sacct showed each worker step got AllocTRES mem=4G (empirically verified on CW: default sbatch w/ --gres=gpu:4 -> AllocTRES mem=4G; same sbatch w/ --mem=0 -> AllocTRES mem=868G).

Root cause: srt-slurm's start_srun_process doesn't pass --mem on the container srun, so it gets cpus_per_task × DefMemPerCPU = 4 GB by default on clusters with positive DefMemPerCPU (CW gb300 has 4096). 4 GB is wildly insufficient for a vLLM worker mmap'ing multi-GB model weights and pinning CUDA buffers.

Fix: re-point both gb300 launchers' IS_AGENTIC clone from upstream NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/agentic-mem-0 (96c443a), which is the same upstream commit + a single patch adding `--mem 0` to start_srun_process when container_image is set.

Long-term: PR the --mem=0 change upstream so we can drop the fork indirection for this feature class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R9 hit the same Vast NFS ELOOP we fixed for the model path in R8, but this time on the squash lockfile:

    /usr/bin/bash: line 2: /home/sa-shared/gharunners/squash/<image>.sqsh.lock: Too many levels of symbolic links

The /home/sa-shared/ NFS mount poisons lockfiles AND data files alike under the workflow worker NFS session. We applied the /data/ workaround for MODEL_PATH; now do the same for SQUASH_FILE + NGINX_SQUASH_FILE which were still pointing at the bad mount.

Both /home/sa-shared/ and /data/ are mounted from the same Vast backing storage; same files, separate NFS client cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Earlier I patched srt-slurm's start_srun_process to default --mem=0 on container srun. That's the wrong layer: srtctl has a documented top-level recipe field `srun_options:` (see docs/config-reference.md#srun_options) that gets threaded straight through to the worker srun via mixins/worker_stage.py:235 (`srun_options=self.runtime.srun_options`) and start_srun_process line 248 (`for key, value in srun_options.items()`).

Switch to that mechanism:

- Add `srun_options: {mem: "0"}` to both agentic recipes
- Revert both launchers from the cquil11 fork pin back to upstream NVIDIA/srt-slurm@127597c (the fork patch in cam/agentic-mem-0 is now redundant; leaving the branch around as a fallback but not pinned in the launcher)

R9/R10 confirmed sacct still showed mem=4G per worker step despite the launcher cloning the patched fork, likely because srtctl's uv-sync inside the sbatch rebuilds the venv from pyproject.toml and the editable install from src/ doesn't include code modifications the way uv pip install -e . would. The recipe-level mechanism doesn't depend on patching srtctl at all so this whole class of "is the patch loaded?" question goes away.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R11 verified that srun_options.mem=0 IS now in the worker srun cmdline (confirmed via /proc/<pid>/cmdline on the head node). BUT sacct still showed AllocTRES mem=4G per step.

Why: the sbatch only requested `--ntasks=8` with no `--mem`, so the JOB allocation per node is bound to cpus_per_task × DefMemPerCPU = 1 × 4 GB = 4 GB. `--mem=0` on srun means "use ALL of what the JOB has on this node", and the job has 4 GB. There's nothing to grow into.

The other half of the fix is `sbatch_directives.mem=0` which emits `#SBATCH --mem=0` in the generated sbatch script (per src/srtctl/templates/job_script_minimal.j2:26), making SLURM allocate all available node memory (~868 GB on CW gb300) up front.

Both layers needed:

- sbatch_directives.mem=0 → JOB gets full node memory
- srun_options.mem=0 → each container srun step uses it (without this, srun defaults back to cpus_per_task × DefMemPerCPU = 4 GB)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
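The two-layer interaction can be written down as a small arithmetic model. It encodes only what this commit message states (job default = cpus_per_task × DefMemPerCPU, `--mem=0` meaning "everything available at that layer") and is not a model of SLURM's scheduler; the function name is hypothetical, and the 868 GB node size is the approximate figure reported from sacct.

```python
def step_mem_mb(sbatch_mem0, srun_mem0, cpus_per_task=1,
                def_mem_per_cpu_mb=4096, node_mem_mb=868 * 1024):
    """Memory a worker step ends up with, per the commit's description.

    Hypothetical helper. sbatch_mem0 / srun_mem0 say whether --mem=0 was
    passed at the job layer and at the step layer respectively.
    """
    default_mb = cpus_per_task * def_mem_per_cpu_mb
    # Job layer: --mem=0 claims the whole node, otherwise the per-CPU default.
    job_mb = node_mem_mb if sbatch_mem0 else default_mb
    # Step layer: --mem=0 asks for all of the job's memory, otherwise the default.
    step_request_mb = job_mb if srun_mem0 else default_mb
    # A step can never exceed what the job was allocated.
    return min(job_mb, step_request_mb)


for sbatch_flag in (False, True):
    for srun_flag in (False, True):
        print(sbatch_flag, srun_flag, step_mem_mb(sbatch_flag, srun_flag))
```

Only the combination with both flags set yields the full node; each flag on its own still leaves the step at 4096 MB, which matches the R11 observation.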

…ation)

R12 progressed past the memory layer (sbatch_directives.mem=0 from prior commit worked; sacct showed AllocTRES mem=868G per worker), but failed ~10 min in with etcd lease-keepalive `deadline exceeded` errors followed by every worker SIGKILL'd at 16:36:03.

Root cause from infra.out: etcd reported `max-cpu-set: 1` at startup. SLURM's default cpus_per_task=1 starved single-CPU etcd under load from 24 concurrent dynamo DP rank lease keep-alives (16 prefill + 8 decode). etcd's gRPC handler couldn't process RPCs fast enough → cascading lease deadline exceeded → workers crashed → orchestrator cancelled job → infra step itself SIGKILL'd at 16:35:49 ("STEP 4572.2 ON slurm-gb300-138-249 CANCELLED ... DUE to SIGNAL Killed").

Fix: sbatch_directives.cpus-per-task=72 grants every task (including the GPU-less infra step) one CW gb300 NUMA socket. etcd now has plenty of compute; vLLM workers also get more aux CPU for tokenizer threads etc.

Why cw needs this and nv doesn't: nv cluster's JobDefaults includes DefCpuPerGPU=35 → any task with --gres=gpu:N auto-gets 35*N CPUs (= 140 on a 4-GPU task). cw has no per-GPU default → tasks get cpus_per_task=1 by default. The infra step has no --gres flag at all so it's the worst case on cw.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
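The per-cluster CPU defaults described above reduce to a few lines of arithmetic. This is a simplified sketch of the commit's explanation, not of SLURM itself: the function name is hypothetical, and letting an explicit cpus-per-task take priority over DefCpuPerGPU is an assumption.

```python
def default_cpus(gpus, def_cpu_per_gpu=None, cpus_per_task=None):
    """CPUs a task receives, per the commit's description (hypothetical helper)."""
    if cpus_per_task is not None:
        # Explicit sbatch_directives.cpus-per-task (assumed to take priority).
        return cpus_per_task
    if def_cpu_per_gpu and gpus > 0:
        # Cluster-level DefCpuPerGPU only helps tasks that request GPUs.
        return def_cpu_per_gpu * gpus
    # No per-GPU default, or no GPUs requested: one CPU.
    return 1


print(default_cpus(4, def_cpu_per_gpu=35))   # nv worker task with 4 GPUs
print(default_cpus(4))                       # cw worker task, no per-GPU default
print(default_cpus(0))                       # cw infra step (etcd), no --gres
print(default_cpus(0, cpus_per_task=72))     # cw infra step after the fix
```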

Two changes:

1) Pin to NVIDIA cluster (drop CW)

   The dsv4-fp4-gb300-dynamo-vllm-agentic runner field was `gb300`, which is the generic label both NV and CW runner pools advertise (per gh api runners). So shards landed on either cluster, which meant we kept debugging the same recipe path against two different cluster configs (NV's DefCpuPerGPU=35 vs CW's DefMemPerCPU=4096 with no per-GPU defaults).

   Switch to `runner: gb300-nv`, a label only the NV pool advertises. This matches just gb300-nv_0/1/2 going forward.

2) MODEL_PATH switched to /scratch/models/DeepSeek-V4-Pro

   The node-local SSD on NV compute nodes. Faster than the /data/home/sa-shared NFS path (where DSv4-Pro currently lives).

   Caveat: /scratch doesn't exist on the GHA runner pod, so srtctl preflight may fail with "Model alias resolved to ..., but that path is unavailable." We're trying this anyway to see whether the runner pod has /scratch mounted; if it errors, next step is to either (a) patch srt-slurm to add a `skip_model_preflight` recipe field or (b) stub a symlink on the runner pod.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The agentic recipe pins MODEL_PATH=/scratch/models/DeepSeek-V4-Pro (node-local NVMe on compute nodes). srtctl's _preflight_model runs in-process on whatever node invokes srtctl — the GHA runner pod, which doesn't have /scratch mounted — so it bails before sbatch with "Model alias 'deepseek-v4-pro' resolved to '/scratch/...', but that path is unavailable" (R14 hit this). Switch the IS_AGENTIC=1 clone target from NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/no-preflight-flag (854b3fd), which adds one CLI flag — `srtctl apply --no-preflight` — that skips just the optional Python-level FS precheck. vLLM still fails loudly at runtime if the path is genuinely missing on the compute node. The flag is only passed when IS_AGENTIC=1. Fixed-seq-len recipes resolve model.path to an NFS path visible from the runner pod, where the precheck is a useful sanity guard, so leave enforcement on for them. Fork commit: cquil11/srt-slurm-nv@854b3fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Aiperf's content-addressed mmap dataset cache (~65 GB per dataset) needs to be persisted across runs so the first run of the day doesn't re-tokenize + re-write it on every shard. Same pattern as launch_h200-dgxc-slurm.sh, launch_b200-dgxc.sh, launch_mi355x-amds.sh.

Three layers wired:

1) Host paths (cluster-specific, created with 0777 so all gharunner_X SLURM users can write):

       gb300-nv  /data/home/sa-shared/gharunners/ai-perf-cache
       gb300-cw  /mnt/vast/ai-perf-cache

2) Both launchers export AIPERF_MMAP_CACHE_HOST_PATH and add a line to the generated srtslurm.yaml's default_mounts block. srt-slurm's runtime.py reads default_mounts via get_srtslurm_setting() and bind-mounts each entry into every worker container. cw already had a default_mounts block (for dynamo-wheels-cache); nv had none.

3) Both agentic recipes set AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache in benchmark.env so the aiperf process inside the container reads from the persistent mount instead of ~/.cache/aiperf/dataset_mmap.

Single-node launchers don't need updating: they have their own srun --container-mounts line that already bind-mounts the cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Brings in 45 commits from upstream/ajc/inferencex-agentx-mvp (PR #875):

- InferenceX AgentX-MVP scenario (default corpus switched to 051226 no-subagents 949-trace variant)
- semianalysis_cc_traces_weka_no_subagents HF loader
- Wrap-fill trajectory recycling + correlation-id double-recycle guard
- DAG benchmarks, reproducible payload replay, agentic_replay E2E test
- assorted dataset/timing fixes

Local commits preserved (no rebase). One docstring-only conflict in src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py resolved by taking upstream's text (more comprehensive: documents both 042026 and 051226 variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vllm/vllm-openai:v0.21.0-ubuntu2404 ships without git, but pip's editable install (-e) of utils/aiperf invokes `git version` to record direct_url.json provenance. Without git, every R16 shard on both gb300-nv and gb300-cw failed at:

    + python3 -m pip install --break-system-packages -q --ignore-installed -e /infmax-workspace/utils/aiperf
    ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git version
    ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

This happens AFTER server boot is healthy and "Server is healthy - starting benchmark" has fired, so all the upstream cluster/recipe work (preflight, mem=0 x2 layers, etcd cpus-per-task=72, --no-preflight, /scratch model path, NixlConnector P<->D, model load) is working end-to-end. Only the pip install step is blocked.

Fix: prepend a `command -v git || apt-get update && apt-get install -y git` to install_agentic_deps. Cheap no-op on images that already ship git (AMD images, custom containers). The vLLM image's apt is functional from inside the container so this works without container rebuild.

The -e install was introduced yesterday in e92a9bf (aiperf v0.2 migration); previously the agentic flow used kv-cache-tester which didn't need git.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
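One detail of the quoted one-liner is worth checking: in POSIX shells `||` and `&&` have equal precedence and associate left to right, so `a || b && c` groups as `(a || b) && c`. The sketch below models that evaluation in Python to show which commands run. It assumes the launcher uses the one-liner exactly as quoted in this message; the function name and the success table are hypothetical.

```python
def run_and_or_list(steps, succeeds):
    """Evaluate a shell AND-OR list left to right.

    steps:    list of (operator, command); the first operator is None.
    succeeds: dict mapping each command to True (exit 0) or False.
    Returns the commands that would actually execute.
    """
    executed = []
    status = True
    for operator, command in steps:
        run = (
            operator is None
            or (operator == "&&" and status)
            or (operator == "||" and not status)
        )
        if run:
            executed.append(command)
            status = succeeds[command]
        # A skipped command leaves the running status unchanged.
    return executed


ONE_LINER = [
    (None, "command -v git"),
    ("||", "apt-get update"),
    ("&&", "apt-get install -y git"),
]

# Image that already ships git: the update is skipped, the install still runs.
print(run_and_or_list(ONE_LINER, {
    "command -v git": True,
    "apt-get update": True,
    "apt-get install -y git": True,
}))
```

Under this grouping the install step is attempted even when git is present. Wrapping the right-hand side, as in `command -v git || { apt-get update && apt-get install -y git; }`, would make the whole install conditional on git being absent.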

…t containers

R17 surfaced two distinct failures, one per cluster:

1) gb300-cw (all 3 shards): aiperf rejected --public-dataset semianalysis_cc_traces_weka with "Scenario invariants violated ... required loader=any of ['semianalysis_cc_traces_weka_no_subagents', 'weka_trace']". Yesterday's aiperf merge (PR #875 commit fef78a96) switched the inferencex-agentx-mvp scenario's default corpus to the 051226 no-subagents 949-trace variant and tightened the loader contract. The old name is no longer accepted.

   Fix: resolve_trace_source emits --public-dataset semianalysis_cc_traces_weka_no_subagents.

2) gb300-nv (all 3 shards): "dpkg: error: requested operation requires superuser privilege" from yesterday's install_agentic_deps git install path. The gb300-nv pyxis/enroot setup maps the calling user (sa-shared) into the container as non-root, while gb300-cw runs as root. The git install needs sudo on nv; cw is fine without.

   Fix: branch on `id -u`: apt-get directly when root, sudo apt-get otherwise. The vllm-base layer installs `sudo` so the binary is available, and the typical enroot config grants the calling user passwordless sudo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R17/R18 made it clear that there's no clean way to install git into the vllm/vllm-openai container at run-time on gb300-nv:

- R16/R17: container ships without git -> pip's editable install of aiperf fails with "Cannot find command 'git'"
- R18: tried `sudo apt-get install git`. gb300-nv pyxis/enroot remaps the calling user to uid=345200007 inside the container, and sudo refuses to run with "/usr/bin/sudo must be owned by uid 0 and have the setuid bit set" -- the setuid bit can't carry across user namespaces. cw container runs as root so sudo wasn't tripped there, but the right answer is one that works on both clusters.

The actual fix is upstream from this entirely: drop `-e`. pip's editable install needs git only to record direct_url.json provenance; the non-editable install just builds a wheel via hatchling and copies into site-packages. aiperf's pyproject.toml pins version="0.8.0" rather than deriving it from git tags, so non-editable install works without git in any environment. We don't edit aiperf source mid-benchmark anyway -- loss of -e ergonomics is zero.

`--ignore-installed` is still needed (handles the apt-managed-blinker distutils-uninstall pile-up) and is orthogonal to -e.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drop the sudo/root-detection complexity from R18 and restore -e on the aiperf pip install. Per user direction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The vllm/vllm-openai container ships without git; agentic_srt.sh needs to apt-get install it because pip's install of utils/aiperf calls `git version`. R17/R18/R19/R20 chased this on gb300-nv with various combinations of sudo / no-sudo / drop-e / etc., all failing because pyxis maps the calling user to uid 345200007 inside the container and dpkg's hardcoded geteuid()!=0 check rejects every attempt regardless of filesystem permissions. The cleanest fix is to ask pyxis to remap us to uid 0 inside the container, matching the gb300-cw behavior (where the container already runs as root and apt-get install works directly). pyxis exposes this as a per-srun flag: --container-remap-root. srt-slurm renders empty-string srun_options as flag-only srun args (see core/slurm.py:250 in NVIDIA/srt-slurm@127597c). No-op on gb300-cw (cw is already remapped to root by default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
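The "empty string means flag-only" convention can be sketched as follows. This illustrates the rendering rule as described in the commit message and is not srt-slurm's code: the function name is hypothetical, and emitting valued options as separate `--key value` tokens (rather than `--key=value`) is an assumption.

```python
def render_srun_options(options):
    """Turn a recipe-level srun_options mapping into srun arguments.

    Hypothetical helper. An empty-string value yields a bare flag;
    any other value yields the flag followed by its value.
    """
    args = []
    for key, value in options.items():
        if value == "":
            args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args


# The two recipe-level options discussed in this PR.
print(render_srun_options({"mem": "0", "container-remap-root": ""}))
```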

Picks up cquil11/srt-slurm-nv@6e34b8b which propagates srun_options through the benchmark_stage srun (previously only worker/frontend/telemetry stages honored them). Required for the recipe-level srun_options.container-remap-root: "" to apply to the benchmark.command container, the one that runs agentic_srt.sh + apt install git. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Picks up cquil11/aiperf@9b858ae which fixes PhaseRunner.cancel() to set all_credits_sent_event / all_credits_returned_event so the outer runner awaits wake immediately. Previously cancelled runs (e.g. via --failed-request-threshold) blocked for the full phase timeout (~1800s default) before reaching the graceful exit path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ncel)

When a workflow run is cancelled mid-flight (gh run cancel, or UI cancel button), the launcher gets SIGTERM during its `tail -F` wait and exits before reaching the `tar czf .../multinode_server_logs.tar.gz` line in the main flow. The Upload server logs workflow step runs (it has if: always()) but finds no file (if-no-files-found: ignore silently skips), so the artifact never gets uploaded.

Fix: install an EXIT trap right after JOB_ID extraction that produces the tarball on any exit path: normal completion, error, SIGTERM, SIGKILL of our parent. The main-flow tar block is now an idempotent no-op (kept for log narrative).

Applied identically to both gb300-nv and gb300-cw launchers. The b200-dgxc launcher has the same pattern but its multi-node flow is currently only used by other configs; leaving it alone for now to avoid mixing unrelated changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gb300-nv 1p6d agentic runs hit ~15% errors at conc=32 from Dynamo NATS RPC deadline timeouts when the single prefill worker is saturated by 32 concurrent 50-100k token prefills. Each timeout returns HTTP 500 "Failed to generate completions: Prefill execution failed: ... NATS request to dynamo_prefill.generate-... failed: ... deadline has elapsed" — a real failure but driven by the single-prefill-worker capacity limit, not a regression. At the previous 0.05 threshold the run tripped its ProfileCancel mechanism early and produced no usable numbers. At 0.20 the run completes and we get steady-state metrics for the ~85% of requests that succeed; the underlying NATS saturation is a separate work item (Dynamo deadline tuning, or more prefill workers in the recipe, or both). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
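The effect of the threshold change is simple arithmetic. The sketch below assumes the cancel rule is a strict "failed fraction exceeds threshold" comparison; how and when aiperf actually evaluates `--failed-request-threshold` is not stated in this PR, and the function name is hypothetical.

```python
def should_cancel(failed, total, threshold):
    """True when the failed fraction exceeds the threshold (assumed rule)."""
    if total == 0:
        return False
    return failed / total > threshold


# Roughly 15% of requests fail at conc=32 on the 1p6d shape.
print(should_cancel(15, 100, 0.05))   # previous threshold
print(should_cancel(15, 100, 0.20))   # new threshold
```

With about 15% errors the old 0.05 threshold cancels the run, while 0.20 lets it complete, leaving only about five percentage points of headroom if the NATS saturation worsens.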

claude · 2026-05-27T14:16:03Z

+            LOGS/agentic/aiperf_artifacts/detailed_results.csv
+            LOGS/agentic/aiperf_artifacts/debug_trace.jsonl


🔴 The 'Upload agentic raw results' step in benchmark-multinode-tmpl.yml (lines 294-295) lists LOGS/agentic/aiperf_artifacts/detailed_results.csv and LOGS/agentic/aiperf_artifacts/debug_trace.jsonl — those filenames were produced by the removed utils/trace-replay submodule and are never written by the new aiperf pipeline. Combined with if-no-files-found: ignore, multinode agentic runs will silently upload an empty agentic_<RESULT_FILENAME> artifact, losing all per-request profile and server-metrics data. Mirror benchmark-tmpl.yml's full aiperf file list (profile_export*, server_metrics_export*, gpu_telemetry_export.jsonl, aiperf logs), translating the results/ prefix to LOGS/agentic/.

Extended reasoning...

What the bug is

The multinode template's Upload agentic raw results step was only half-migrated in this PR. The directory rename trace_replay/ → aiperf_artifacts/ was applied, but the filenames underneath it were left as the legacy trace-replay outputs:

LOGS/agentic/aiperf_artifacts/detailed_results.csv
LOGS/agentic/aiperf_artifacts/debug_trace.jsonl

Those two filenames were specific to the now-removed utils/trace-replay submodule (the kv-cache-tester scripts). This PR deletes that submodule entry from .gitmodules and removes load_trace_replay_records (which read detailed_results.csv) from utils/agentic-benchmark/scripts/analyze_benchmark_distributions.py. Grep confirms detailed_results.csv and debug_trace.jsonl appear nowhere else in the repo — nothing writes them anymore.

How it manifests

The new aiperf pipeline (wired up via benchmarks/benchmark_lib.sh:build_replay_cmd with --output-artifact-dir /aiperf_artifacts) writes an entirely different set of files: profile_export.jsonl, profile_export_aiperf.{json,csv}, profile_export_aiperf_{timeslices,aggregate,collated}.*, server_metrics_export.{json,jsonl,csv,parquet}, gpu_telemetry_export.jsonl, and logs/aiperf.log. The sibling single-node template benchmark-tmpl.yml was correctly updated in this same PR to enumerate all of those.

Why existing code doesn't prevent it

actions/upload-artifact@v7.0.1 is invoked with if-no-files-found: ignore, so a glob/path that matches zero files produces an empty artifact upload without warning. There is no schema check that the listed paths exist.

Impact

Every multinode agentic run (the new dsv4-fp4-gb300-dynamo-vllm-agentic and dsv4-fp4-gb300-cw-dynamo-vllm-agentic configs introduced by this PR, plus future multinode agentic configs) silently produces an empty agentic_<RESULT_FILENAME> artifact. The entire per-request profile (profile_export.jsonl), aiperf aggregate exports, server scrape time series, GPU telemetry, and aiperf logs from multinode jobs are lost. Downstream consumers like utils/process_agentic_result.py (which reads profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json) cannot reanalyze multinode runs after the fact.

Step-by-step proof

A multinode agentic job runs and benchmarks/multi_node/agentic_srt.sh calls build_replay_cmd → run_agentic_replay_and_write_outputs.

benchmark_lib.sh:1003 invokes aiperf with --output-artifact-dir /aiperf_artifacts (where = /logs/agentic from the recipe's benchmark.env).

aiperf populates the directory with profile_export.jsonl, profile_export_aiperf.csv, server_metrics_export.{json,jsonl,csv,parquet}, etc. — but NOT detailed_results.csv or debug_trace.jsonl (those were trace-replay outputs).

The launcher copies/tars the logs back to /LOGS/agentic/ on the runner.

The Upload agentic raw results step runs with the two listed paths:

LOGS/agentic/aiperf_artifacts/detailed_results.csv → does not exist.

LOGS/agentic/aiperf_artifacts/debug_trace.jsonl → does not exist.

if-no-files-found: ignore causes upload-artifact to emit an empty bundle without warning.

The agentic_<RESULT_FILENAME> artifact appears in the workflow run UI but contains zero files; gh run download returns nothing.

How to fix

Mirror the file list from the correctly-updated sibling benchmark-tmpl.yml (which lists every aiperf export name plus the new lmcache_server.log and *_command.txt files), translating the results/ prefix to LOGS/agentic/. At minimum the list must include profile_export.jsonl, profile_export_aiperf.{json,csv}, profile_export_aiperf_{timeslices,aggregate,collated}.*, server_metrics_export.{json,jsonl,csv,parquet}, gpu_telemetry_export.jsonl, and aiperf_artifacts/logs/*.log.
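A pre-upload guard could catch this kind of drift before the artifact step silently uploads nothing. The following is a minimal sketch, not part of the PR: `unmatched_patterns` is a hypothetical helper, and the patterns are assumed to be relative glob patterns (brace expansion such as `{json,csv}` is not supported by `pathlib` and would need to be expanded by the caller).

```python
from pathlib import Path


def unmatched_patterns(root: Path, patterns: list[str]) -> list[str]:
    """Return the relative glob patterns that match no file under root.

    Hypothetical helper: a CI step could fail when this list is non-empty
    instead of relying on if-no-files-found: ignore.
    """
    return [pattern for pattern in patterns if not any(root.glob(pattern))]
```

Run against the runner's LOGS/agentic directory with the paths listed in the upload step, a non-empty result would flag the two legacy filenames immediately.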

claude · 2026-05-27T14:16:03Z

 # ``hash_ids`` and ``output_length``. Built lazily from the HF dataset cache.
 _TRACE_METADATA_CACHE: dict[str, list[dict]] | None = None
-_HF_DATASET = "semianalysisai/cc-traces-weka-042026"
+_HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"


🔴 The test fixture in utils/test_process_agentic_result.py (test_processor_loads_traces_jsonl_for_theoretical_cache) still hard-codes the old dataset directory name datasets--semianalysisai--cc-traces-weka-042026, but this PR renamed _HF_DATASET in process_agentic_result.py:40 to semianalysisai/cc-traces-weka-with-subagents-051926. The processor's _hf_traces_dir() now looks under the new directory name, so the fixture is never found, theoretical_cache_hit_rate stays None, and the assertions at lines 461 and 463 (== pytest.approx(0.5) and mean_output_tokens_expected == ...) will fail every CI run. Fix: update the fixture path to datasets--semianalysisai--cc-traces-weka-with-subagents-051926.

Extended reasoning...

Bug

test_processor_loads_traces_jsonl_for_theoretical_cache writes a synthetic Hugging Face snapshot to validate that process_agentic_result.py correctly walks per-trace hash_ids arrays and computes theoretical_cache_hit_rate. After this PR, the test will deterministically fail on first execution.

Root Cause

This PR changed utils/process_agentic_result.py:40 from:

_HF_DATASET = "semianalysisai/cc-traces-weka-042026"

to:

_HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"

_hf_traces_dir() (around line 133-134) derives the on-disk cache directory from this constant via the HF naming convention datasets--{org}--{name}. So after the rename the processor looks for:

$HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-with-subagents-051926/snapshots/<rev>/traces.jsonl

But the test fixture at utils/test_process_agentic_result.py:408 still hard-codes the old name:

snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"

The other call sites in the same test file (_write_fixture, the per-run subdir test, etc.) were updated from trace_replay → aiperf_artifacts in this PR, but this particular hard-coded HF dataset directory was missed.
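The naming convention described above can be sketched as follows. This is an illustration under assumptions, not the processor's actual code: `hf_dataset_cache_dirname` and `fixture_snapshot_dir` are hypothetical names, and `HF_DATASET` stands in for the `_HF_DATASET` constant. Deriving the fixture path from the constant, rather than hard-coding it, would keep the test and the processor from drifting apart on the next dataset rename.

```python
from pathlib import Path

# Stand-in for the processor's _HF_DATASET constant (value taken from the diff).
HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"


def hf_dataset_cache_dirname(repo_id: str) -> str:
    """Map 'org/name' to the on-disk directory name 'datasets--org--name'."""
    org, name = repo_id.split("/", 1)
    return f"datasets--{org}--{name}"


def fixture_snapshot_dir(hf_cache: Path, repo_id: str, rev: str = "abc") -> Path:
    """Build the snapshot directory a test fixture would write traces.jsonl into."""
    return hf_cache / hf_dataset_cache_dirname(repo_id) / "snapshots" / rev
```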

Step-by-Step Proof

Test calls _write_fixture, then writes traces.jsonl to <tmp>/_hf/datasets--semianalysisai--cc-traces-weka-042026/snapshots/abc/traces.jsonl.

Test sets HF_HUB_CACHE=<tmp>/_hf and invokes the processor.

Inside _hf_traces_dir(), the code builds: Path($HF_HUB_CACHE) / f"datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" — using the new _HF_DATASET constant.

That directory does not exist in the fixture (only the old-name directory does), so _hf_traces_dir() returns None.

_iter_trace_blobs is never called; _TRACE_METADATA_CACHE remains empty.

Without trace metadata, theoretical_cache_hit_rate is computed as None and mean_output_tokens_expected is None (or missing) in the emitted agg JSON.

The assertion at line 461 (agg["theoretical_cache_hit_rate"] == pytest.approx(0.5)) compares None == 0.5 → fails.

The assertion at line 463 (agg["mean_output_tokens_expected"] == pytest.approx((50+60+55+40+70)/5)) compares None to a float → fails.

Independent verifier confirmation: one verifier reproduced this by running the processor against both paths and observed that the old path produces theoretical_cache_hit_rate=None, while only the new path populates it as expected.

Fix

Rename the fixture directory in utils/test_process_agentic_result.py (around line 408) from:

snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"

to:

snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" / "abc"

No other test fixture changes are needed; the processor will then find the synthetic snapshot at the new path and the assertions will pass.

…n/ subdir

Match the existing benchmarks/single_node/agentic/ split: all 111 non-agentic per-cluster launch scripts move into benchmarks/single_node/fixed_seq_len/. chat_templates/ stays at single_node/chat_templates/ as a shared resource (referenced by both agentic and fixed_seq_len scripts).

Plumbing:
- .github/workflows/benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: SCENARIO_SUBDIR default flips from '' to 'fixed_seq_len/'.
- runners/launch_mi355x-amds.sh: parameter-expansion fallback also defaults to fixed_seq_len/ so direct invocations (without the workflow setting SCENARIO_SUBDIR) still resolve.
- Each moved script's `source "$(dirname \"$0\")/../benchmark_lib.sh"` becomes `../../benchmark_lib.sh`.
- dsv4_fp4_mi355x_sglang.sh's --chat-template path becomes `../chat_templates/...` (matches the agentic copy's pattern).
- .github/configs/{nvidia,amd}-master.yaml: forward-looking comments repath to fixed_seq_len/.

perf-changelog.yaml historical entries left untouched (they describe paths at the time of the change).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

…xups

Resolutions:
- perf-changelog.yaml: took main verbatim.
- runners/launch_b300-nv.sh: took main (drops --nodelist pin entirely; supersedes our narrower 017-019 fix).
- benchmarks/single_node/fixed_seq_len/dsv4_fp8_mi355x{,_vllm}.sh: accepted main's deletes (orphan recipes removed in #1374, #1501).
- .github/configs/amd-master.yaml: took main as the base, then re-applied our agentic-only additions on top:
  * qwen3.5-fp8-mi355x-sglang-agentic-hicache (new entry)
  * dsv4-fp4-mi355x-vllm-agentic (new entry)
  * dsv4-fp4-mi355x-sglang-agentic (new entry)
  * kimik2.5-fp4-mi355x-vllm-agentic (cpu -> lmcache)
  Dropped our comment-path edit for dsv4_fp8_mi355x_vllm.sh since main deleted that entry.

Fixed_seq_len reorg fixups for files added on main during our branch's lifetime:
- git mv 14 stranded scripts from benchmarks/single_node/*.sh into benchmarks/single_node/fixed_seq_len/ (dsr1_fp4_b200_mtp, dsr1_fp4_mi355x_mtp, dsr1_fp8_h200_mtp, dsr1_fp8_mi325x_mtp, dsr1_fp8_mi355x_mtp, dsv4_fp4_mi355x_vllm, glm5_fp8_h200_mtp, glm5_fp8_mi325x, glm5_fp8_mi325x_mtp, qwen3.5_bf16_mi325x_mtp, qwen3.5_fp4_mi355x_mtp, qwen3.5_fp8_h100, qwen3.5_fp8_h100_mtp, qwen3.5_fp8_mi325x_mtp). Patched their source paths from ../benchmark_lib.sh to ../../benchmark_lib.sh.
- runners/launch_mi355x-amds.sh: multinode-non-disagg BENCHMARK_SUBDIR bumped from `single_node` to `single_node/fixed_seq_len`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per-recipe scripts had stale `VAR=${VAR:-default}` lines for variables that are either reliably plumbed by the workflow template or completely unused. The defaults masked missing-env bugs (the workflow could forget to plumb a var and the script would silently fall back to a stale local default instead of failing loudly) and left dead lines hanging around from the pre-aiperf-v0.2 era.

benchmarks/benchmark_lib.sh:
- PORT: new `export PORT="${PORT:-8888}"` near the top so a single source of truth governs the server port. Launchers that need a non-default value (launch_mi355x-amds.sh derives PORT from RUNNER_NAME to avoid collisions across concurrent gh-runners) set PORT themselves; the `:-` fallback only kicks in if nothing upstream set it.
- build_replay_cmd: `local duration="${DURATION:-1800}"` -> `"$DURATION"` (DURATION is now a check_env_vars-enforced requirement in callers).

benchmarks/single_node/agentic/*.sh (32 scripts) and benchmarks/multi_node/agentic_srt.sh:
- Removed: PORT=${PORT:-8888} (benchmark_lib owns it now).
- Removed: DURATION/EP_SIZE/DP_ATTENTION defaults; added each to check_env_vars in the scripts that consume them. DURATION is consumed by build_replay_cmd in benchmark_lib, so every agentic script now requires it explicitly.
- Removed: MAX_DELAY/ADVANCE_MIN/ADVANCE_MAX. These were CLI args to the old trace_replay_tester.py (commit b7ae440); the aiperf v0.2 migration (commit e92a9bf) dropped all consumption but left the top-of-script var-definitions behind. Pure dead code.
- Kept: SCHEDULER_RECV_INTERVAL (per-model sglang server tuning, not workflow-plumbed; values vary 5/10/30 per recipe).

benchmarks/single_node/fixed_seq_len/*.sh (120 scripts):
- Removed: PORT=${PORT:-8888} only. fixed_seq_len's check_env_vars block already requires what it uses (DP_ATTENTION/EP_SIZE/ISL/OSL/RANDOM_RANGE_RATIO/RESULT_FILENAME) per the existing convention; no further changes needed.

Net: 343 deletions, 46 insertions across 154 files; no behavior change on any green CI path (workflow input defaults match the removed local defaults). Behavior change only when an upstream caller fails to set DURATION/EP_SIZE/DP_ATTENTION on an agentic recipe -- which now fails loudly via check_env_vars instead of silently inheriting a stale value.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Matches the existing pattern from launch_{b200-dgxc,h200-dgxc-slurm,gb300-{nv,cw},mi355x-amds}.sh: define AIPERF_MMAP_CACHE_HOST_PATH on the host, mount it to /aiperf_mmap_cache inside the container, and expose AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache via --export so aiperf's DatasetLoaderManager finds it. Lets agentic benchmarks reuse the pre-built mmap dataset cache instead of re-mmaping every run.

- h200-nb: /mnt/data/gharunners/ai-perf-cache (sibling of hf-hub-cache)
- h200-cw: /mnt/vast/gharunner/ai-perf-cache (sibling of hf-hub-cache)

Host-side directories will be created out-of-band before next run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

cursor · 2026-05-27T19:35:07Z

+        OFFLOAD_ARGS=(
+            --kv-transfer-config
+            "{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$LMCACHE_HOST\",\"lmcache.mp.port\":$LMCACHE_PORT}}"
+        )


Dead code after explicit exit 1 in disabled branch

Low Severity

The lmcache-mp case in the OFFLOADING switch immediately calls exit 1 (line 140) to disable the path, but ~47 lines of uncommented server-startup code follow that exit 1, including agentic_pip_install, LMCache server launch, wait_for_lmcache_ready, and OFFLOAD_ARGS construction. All of it is permanently unreachable. The comment says to "re-enable after PR #3261 merges", but the implementation was left as dead statements rather than being commented out, which gives the misleading impression that the code runs.

^{Reviewed by Cursor Bugbot for commit a98fcaa. Configure here.}

… corpus Adds a per-recipe override hook in benchmark_lib.sh's resolve_trace_source: recipes set WEKA_LOADER_OVERRIDE to one of the aiperf public-dataset loader names allowed by the inferencex-agentx-mvp scenario, and resolve_trace_source swaps both the --public-dataset flag and the HF dataset pre-download to match. Default remains semianalysis_cc_traces_weka_with_subagents (052726, 472 traces). Unknown overrides fail loudly with the allowed-values hint. Wires the new override into all 8 minimaxm2.5 agentic recipes (minimaxm2.5_fp{4,8}_{b200,b300,h100,h200,mi300x,mi325x,mi355x}.sh) to use semianalysis_cc_traces_weka_with_subagents_256k -- the 256k-capped variant (051926-256k, 217 traces, max in+out <= 256k by construction). MiniMax-M2.5 servers run at max_model_len ~256k, so the unfiltered 052726 corpus would have its longest requests rejected. Submodule bump: utils/aiperf -> 6fc5f5d6 registers the new loader name in plugins.yaml and adds it to inferencex_agentx_mvp's require_loader tuple. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Mirrors aiperf 519580fb: the semianalysis_cc_traces_weka_with_subagents_256k loader now points at semianalysisai/cc-traces-weka-with-subagents-052726-256k (470 traces) instead of the earlier 051926-256k (217 traces). Loader name and override env var (WEKA_LOADER_OVERRIDE) unchanged.

- benchmark_lib.sh resolve_trace_source: case-statement HF repo path bumped to ...052726-256k for the _256k loader.
- All 8 minimaxm2.5_*.sh agentic recipe comments: trace count 217 -> 470.
- utils/aiperf submodule pointer -> 519580fb.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

The proxy occasionally records the same logical request twice. On the 472-session par<=5 sample, 2,339 of 115,593 rows (2.0%) are byte-identical duplicates of a prior row in the same session: 1,923 are main-agent turns and 416 are subagent inner requests. 275 of 472 sessions (58%) have at least one duplicate. Worst session has 165 dup rows.

Without deduping, the weka conversion silently inflates token counts, request counts, and throughput by ~2%, and the converter misclassifies duplicate-pair rows as "two requests started at the same nanosecond" when grouping subagents.

Fingerprint: (timestamp, model, input_tokens, output_tokens, duration_ms, agent_id). On the 2,339 detected pairs, 100% are also byte-identical when full JSON is serialized, so the fingerprint produces zero false positives.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
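The fingerprint-based dedup described in this commit could look roughly like the sketch below. It is an illustration only: `dedupe_session` is a hypothetical name, rows are assumed to be dicts with scalar values for the six fingerprint fields, and the converter's real implementation may differ.

```python
from typing import Iterable, Iterator

# Fields named in the commit message as the duplicate fingerprint.
FINGERPRINT_FIELDS = (
    "timestamp", "model", "input_tokens",
    "output_tokens", "duration_ms", "agent_id",
)


def dedupe_session(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield each row of one session once, dropping later fingerprint matches."""
    seen = set()
    for row in rows:
        key = tuple(row.get(field) for field in FINGERPRINT_FIELDS)
        if key in seen:
            continue  # same logical request recorded twice by the proxy
        seen.add(key)
        yield row
```

The seen-set is scoped to a single call, so the function should be invoked once per session; identical fingerprints in different sessions are then never collapsed.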

…0.21.0 v0.20.2's bundled huggingface_hub==1.14.0 silently fetches Git-LFS pointer files instead of LFS content for `hf download --repo-type dataset`. Every kimik2.5-fp4-b200-vllm-agentic job in run 26536606210 hit "pyarrow.lib.ArrowInvalid: JSON parse error: Missing a name for object member. in row 0" -- the signature of pyarrow trying to parse the literal `version https://git-lfs.github.com/spec/v1` line of an LFS pointer file as JSON. b200-dgxc has no persistent /mnt/hf_hub_cache mount (per launcher diff), so every container re-downloads the dataset and re-hits the bug. v0.21.0 ships a newer huggingface_hub that resolves LFS correctly. v0.20.x's flashinfer fix for the max_model_len=131072 + prefix-caching warmup crash is included in v0.21.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>
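A cheap sanity check after download would surface this failure before pyarrow does. The sketch below is an assumption-laden illustration, not code from the PR: `is_lfs_pointer` is a hypothetical helper that relies only on the pointer-file signature quoted above.

```python
from pathlib import Path

# First line of every Git-LFS pointer file, as quoted in the commit message.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts with the Git-LFS pointer signature."""
    with path.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

A launcher could run this over the downloaded dataset files and abort with a clear message when any of them is still a pointer.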

New agentic-coding recipe targeting H100 (runner: h100-dgxc) running Qwen3.5-397B-A17B FP8 via SGLang v0.5.12-cu130. Mirrors the b300 SGLang agentic shape with H100-appropriate kernel flags:
- attention-backend: flashinfer (sm_90; trtllm_mha is Blackwell-only).
- mem-fraction-static 0.75 (vs 0.80 on B300) and chunked-prefill-size 8192 (vs 16384) to fit Qwen-397B FP8 weights + KV in H100's 80 GB HBM3 at TP=8.
- conc-list capped at 16 across both arms; agentic ISLs hit ~80k-200k on the 256k corpus and Qwen at conc=32 OOM'd in the fixed_seq_len sweep at lower ISL too.

Recipe wires WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_256k so the 256k-capped variant (470 traces, max in+out <= 256k) is used instead of the unfiltered 052726 corpus (which has up to ~1M-token requests the H100 max_model_len=131k server would reject).

Two sweep arms:
- none: --disable-radix-cache, conc-list [1, 2, 4, 8, 16]
- hicache: --enable-hierarchical-cache + sized from TOTAL_CPU_DRAM_GB, conc-list [4, 8, 16] (capped where hicache stabilizes)

Yaml key is qwen3.5-fp8-h100-sglang-agentic; script filename is the bare `qwen3.5_fp8_h100.sh` under benchmarks/single_node/agentic/. The h100 launchers don't support framework-tagged script names, and this matches the precedent set by qwen3.5_fp8_b200.sh (which is the sglang-agentic recipe under the same bare name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Matches the same pattern as launch_b200-dgxc, launch_h200-dgxc-slurm, launch_gb300-{nv,cw}, launch_mi355x-amds, launch_h200-{nb,cw}: define AIPERF_MMAP_CACHE_HOST_PATH on the host, bind-mount it to /aiperf_mmap_cache in the container, and expose AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache via --export. Host path: /mnt/nfs/sa-shared/gharunners/ai-perf-cache (sibling of the existing hf-hub-cache mount on the same NFS volume). Needed for the new qwen3.5-fp8-h100-sglang-agentic recipe to reuse the pre-built mmap dataset cache across runs rather than re-mmaping every job. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Pulls in cjq/agentx-v0.3-subagents @ baa95d73, which adds SGLang metric-name fallbacks to ServerMetricsAccumulator.realtime_snapshot so the realtime `srv prefix_cache_hit=... kv_usage=... queue=...` log row populates for sglang servers instead of being suppressed (every field was vLLM-only before). Signed-off-by: Cam Quilici <cjquilici@gmail.com>

cursor · 2026-05-27T22:44:31Z

+    wait "$tail_pid" 2>/dev/null || true
+    cat "$LMCACHE_LOG" >&2 || true
+    exit 1
+}


LMCache helper functions duplicated across three scripts

Medium Severity

cleanup_lmcache_server and wait_for_lmcache_ready are identically copy-pasted across three scripts (dsv4_fp4_b200_vllm.sh, kimik2.5_fp4_b200.sh, kimik2.5_fp4_mi355x.sh). Other shared helpers like resolve_trace_source, install_agentic_deps, and the new run_agentic_replay_and_write_outputs already live in benchmark_lib.sh. These LMCache helpers belong there too, reducing the risk of inconsistent bug fixes across the three copies.

Additional Locations (2)

benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh#L39-L86

benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh#L585-L632

^{Reviewed by Cursor Bugbot for commit 4933cf3. Configure here.}

Pulls in cjq/agentx-v0.3-subagents @ 006417a8, which fixes a silent regression in the realtime srv-row: counter lookups that included `_total` (e.g. `vllm:prompt_tokens_total`, `sglang:prompt_tokens_total`) never matched because `prometheus_client.parser` strips that suffix before the data collector stores the family. Server-side throughput rows were missing on every backend, not just SGLang — masked by unit tests that bypassed the parser. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
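The mismatch can be illustrated with a small lookup sketch. Everything here is hypothetical: `family_key` and `lookup_counter` are made-up names, and the plain dict stands in for whatever structure the data collector actually stores; the only behavior taken from the commit message is that the stored family name lacks the `_total` suffix.

```python
from typing import Optional


def family_key(metric_name: str) -> str:
    """Strip a trailing '_total' so the name matches the stored family name."""
    suffix = "_total"
    if metric_name.endswith(suffix):
        return metric_name[: -len(suffix)]
    return metric_name


def lookup_counter(families: dict, metric_name: str) -> Optional[float]:
    """Look up a counter by its exposition name, tolerating the stripped suffix."""
    return families.get(family_key(metric_name))
```

A direct `families.get("vllm:prompt_tokens_total")` on such a store returns nothing, which is the silent miss the commit describes; normalizing the key first finds the value.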

Agentic replay traces have a theoretical prefix-cache hit rate above 95% on every workload we benchmark; the realtime srv row only reads 0.0% because the launch script turns the SGLang RadixAttention cache off. Every server recipe in this directory set the cache-disabling flag (--disable-radix-cache), either as the only branch of an OFFLOADING=none case or as an unconditional launch-line flag, so the hit-rate number was never meaningful and the run was paying full prefill cost on every turn.

Removed unconditionally from: dsv4_fp4_mi355x_sglang, glm5.1_fp4_mi355x, glm5_fp8_b200, qwen3.5_bf16_b200, qwen3.5_fp8_b200, qwen3.5_fp8_mi355x.

Removed from the OFFLOADING=none branch of: qwen3.5_fp8_h100, qwen3.5_fp8_b300_sglang, qwen3.5_fp8_mi355x_sglang. Replaced with a short comment so the next person editing the `case` doesn't put it back.

OFFLOADING=none still means "no CPU/host offload"; the GPU RadixAttention cache stays on, which is the only sensible default for an agentic workload.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Pulls in cjq/agentx-v0.3-subagents @ b2d047dd, which switches the realtime srv-row prefix_cache_hit_rate fallback from SGLang's per-batch `cache_hit_rate` gauge (reads 0 between requests) to the cumulative `cached_tokens_total` / `prompt_tokens_total` counter pair, matching vLLM's `hits/queries` shape. Also unlocks unique_input_tokens_srv on SGLang. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
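The counter-pair ratio is simple arithmetic; a minimal sketch follows, with `cumulative_hit_rate` as a hypothetical name and the zero-denominator handling as an assumption rather than aiperf's documented behavior.

```python
from typing import Optional


def cumulative_hit_rate(cached_tokens_total: float,
                        prompt_tokens_total: float) -> Optional[float]:
    """Prefix-cache hit rate from cumulative counters: cached / prompt.

    Returns None before any prompt tokens have been counted, so the caller
    can suppress the field instead of reporting a misleading 0.
    """
    if prompt_tokens_total <= 0:
        return None
    return cached_tokens_total / prompt_tokens_total
```

Because both inputs only grow over the run, the ratio stays meaningful between requests, unlike a per-batch gauge that drops back to 0 when the server is idle.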

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


^{Reviewed by Cursor Bugbot for commit 6a77acb. Configure here.}

cursor · 2026-05-28T04:33:36Z

+    --tokenizer-path "$MODEL"
+    --enable-metrics
+    "${CACHE_ARGS[@]}"
+)


Missing context-length limits in H100 SGLang launcher

High Severity

The H100 SGLang launcher for Qwen3.5 omits both MAX_MODEL_LEN initialization and the --context-length flag. The sibling B300 script (qwen3.5_fp8_b300_sglang.sh) and MI355X script both default MAX_MODEL_LEN to 131072 and pass --context-length "$MAX_MODEL_LEN". Without this, SGLang will allocate KV cache for the model's full context window (potentially 512k+), which on H100's 80 GB HBM3 severely reduces usable KV blocks or causes OOM. Additionally, build_replay_cmd won't pass --max-context-length to aiperf since MAX_MODEL_LEN is unset, so over-length traces from the corpus won't be filtered client-side either.

^{Reviewed by Cursor Bugbot for commit 6a77acb. Configure here.}

cquil11 and others added 30 commits May 17, 2026 15:50

dsv4-fp4-b200-vllm-agentic: bump image to cquil v0.21.0 custom build

9996180

Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM than the default v0.20.0-cu130. Image: cquil/vllm-openai:v0.21.0-8813c92. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

agentic: simplify git install to bare apt-get update && install; keep -e

ea13e41

Drop the sudo/root-detection complexity from R18 and restore -e on the aiperf pip install. Per user direction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude Bot reviewed May 27, 2026
