[WIP] Chore/agentx v0.3 #1571
…loadingConnector
vLLM's --kv_offloading_backend native resolves to two different connectors
based on the VLLM_USE_SIMPLE_KV_OFFLOAD env var (see vllm/config/vllm.py:662):
  VLLM_USE_SIMPLE_KV_OFFLOAD=1 -> SimpleCPUOffloadConnector (the path we were using;
                                  carries the popleft_n + context-overflow +
                                  completion-barrier bugs we hit on B200/B300/H200)
  unset (default)              -> OffloadingConnector (the regular native path)
This commit drops the env var and the JSON form, switching MI355X to the
shortcut form which now routes to OffloadingConnector. We're trying the
regular path here to see if it sidesteps the SimpleCPUOffloadConnector-
specific issues that have been forcing lazy_offload + workarounds.
Also drops the --kv-transfer-config JSON since the shortcut form constructs
the KVTransferConfig itself at engine startup. Keeps
--disable-hybrid-kv-cache-manager since MI355X uses --block-size=1 + AITER,
which doesn't work with the hybrid manager.
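The resolution rule described in this commit message can be sketched in a few lines. This is only a restatement of the message, not vLLM's actual code in vllm/config/vllm.py; the function name is hypothetical, and treating the flag as enabled only when it equals "1" is an assumption.

```python
import os


def resolve_native_offload_connector(env=None):
    """Return the connector the `native` shortcut resolves to, per the commit text.

    Hypothetical helper: it mirrors the description above, not vLLM internals.
    """
    env = os.environ if env is None else env
    # Assumption: the env var counts as "set" only when its value is exactly "1".
    if env.get("VLLM_USE_SIMPLE_KV_OFFLOAD") == "1":
        return "SimpleCPUOffloadConnector"
    return "OffloadingConnector"


print(resolve_native_offload_connector({}))
print(resolve_native_offload_connector({"VLLM_USE_SIMPLE_KV_OFFLOAD": "1"}))
```

Passing the environment as a dict keeps the sketch testable without mutating the real process environment.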
Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM than the default v0.20.0-cu130.
Image: cquil/vllm-openai:v0.21.0-8813c92.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the dsv4-fp4-b200-vllm-agentic CONC sweep (tp8 [16,32,64] + tp8 dp-attn [64,128,256]) so the two SKUs can be compared on the same trace load.
Uses the same SGLang image as the fixed-seq-len sibling (rocm/sgl-dev:rocm720-mi35x-0363e6c-20260509-DSv4).
Offload sweep is none-only (SGLang has no equivalent of vLLM's SimpleCPUOffloadConnector that we exercise on b200).
Launcher swaps the fixed-seq-len harness (run_benchmark_serving) for the agentic harness (build_replay_cmd / write_agentic_result_json / analyze_benchmark_distributions) but keeps all SGLang server flags and SGLANG_* env vars identical to the fixed-seq-len sibling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R2 dispatch failed on all 6 b200 shards with the same enroot error during manifest fetch:
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] Could not process JSON input
curl: (23) Failure writing output to destination
Docker Hub confirms the image exists with a clean Docker v2 manifest, but enroot import was being invoked as `docker://docker.io/cquil/vllm-openai:...` because the image field had the docker.io/ prefix.
Every other image entry in the repo uses the bare `org/repo:tag` form (no docker.io/ prefix), so this entry was the outlier. Dropping the prefix matches convention and should let enroot resolve the registry host normally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
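A config lint along these lines could catch the outlier before dispatch. This is a minimal sketch, assuming image entries are plain strings; the helper name and the `org/repo:tag` value are illustrative and not taken from the repo.

```python
DOCKER_IO_PREFIX = "docker.io/"


def strip_docker_io_prefix(image: str) -> str:
    """Normalize an image reference to the bare `org/repo:tag` form.

    Hypothetical helper: only the explicit `docker.io/` prefix is removed;
    any other registry host is left untouched.
    """
    if image.startswith(DOCKER_IO_PREFIX):
        return image[len(DOCKER_IO_PREFIX):]
    return image


print(strip_docker_io_prefix("docker.io/org/repo:tag"))
print(strip_docker_io_prefix("org/repo:tag"))
```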
First multi-node agentic config with the recipe local to this repo. Adds:
- Two new agentic recipes under benchmarks/multi_node/srt-slurm-recipes/
vllm/deepseek-v4/agentic/, adapted from the corresponding 8k1k fixed-
seq-len siblings:
* disagg-gb300-1p6d-dep4-tp4-agentic.yaml (low-lat conc=32, mid conc=192)
* disagg-gb300-4p1d-dep4-dep8-24-c4096-agentic.yaml (high-tput conc=4096)
Both drop max-model-len, drop no-enable-prefix-caching, add DSv4
tool/reasoning parsers, switch benchmark.type sa-bench -> custom (hands
off to benchmarks/multi_node/agentic_srt.sh which builds the aiperf
inferencex-agentx-mvp invocation).
- New IS_AGENTIC=1 branch at the top of runners/launch_gb300-nv.sh's
framework conditional. Clones the cquil11/srt-slurm-nv fork (the only
srt-slurm build that supports benchmark.type=custom) on the
cam/sa-submission-q2-2026 branch and overlays the local agentic
recipes into recipes/vllm/deepseek-v4/agentic/ so iteration stays in
this repo.
- New dsv4-fp4-gb300-dynamo-vllm-agentic config entry in
nvidia-master.yaml as a sibling of the byte-identical-to-origin/main
dsv4-fp4-gb300-dynamo-vllm base. Three-tier sweep:
* low-latency (conc=32, 1p6d shape, 28 GPUs / 8 nodes)
* mid (conc=192, 1p6d shape, same alloc as low-lat)
* high-tput (conc=4096, 4p1d shape, 24 GPUs / 7 nodes)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R1 of dsv4-fp4-gb300-dynamo-vllm-agentic failed at `srtctl apply` with
two schema errors against the cquil11/srt-slurm-nv fork:
Invalid config: {'dynamo': {'wheel': ['Unknown field.']},
'benchmark': {'env': {'PORT': {'value': ['Not a valid string.']}}}}
The first (dynamo.wheel) is fixed by cherry-picking commit 0060f857 from
NVIDIA upstream onto cquil11/srt-slurm-nv@cam/sa-submission-q2-2026
(adds wheel field + install scripts; pushed separately).
The second (PORT) is fixed here: env values must be strings, so
`PORT: 8000` -> `PORT: "8000"`. INFMAX_CONTAINER_WORKSPACE / RESULT_DIR
parse as strings due to their / chars, and IS_MULTINODE was already
quoted; PORT was the only bare int.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
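The PORT failure follows from how YAML scalars are typed: a bare `8000` loads as an integer, while a quoted value or one containing `/` loads as a string. The check below is a sketch of that rule applied to an already-parsed mapping; the function name and the sample values are illustrative assumptions, and it is not srtctl's schema validator.

```python
def non_string_env_values(env: dict) -> dict:
    """Return the env entries whose values are not strings."""
    return {key: value for key, value in env.items() if not isinstance(value, str)}


# What a YAML loader would hand back for an env block with one bare integer.
# Values here are placeholders chosen for illustration.
parsed_env = {
    "PORT": 8000,                   # bare int: rejected by a strings-only schema
    "IS_MULTINODE": "1",            # quoted: already a string
    "RESULT_DIR": "/logs/agentic",  # contains '/', so it parses as a string
}

print(non_string_env_values(parsed_env))
```

Running such a check locally before `srtctl apply` would surface the offending key without a cluster round trip.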
R2 of dsv4-fp4-gb300-dynamo-vllm-agentic landed all 3 shards on
gb300-cw_N runners (CoreWeave self-hosted runners advertise both
gb300-cw AND gb300-nv labels). RUNNER_NAME%%_* resolves to gb300-cw,
which routes to runners/launch_gb300-cw.sh — but that launcher had
no IS_AGENTIC handling, so it cloned upstream NVIDIA/srt-slurm
(which lacks benchmark.type=custom) instead of the cquil11 fork.
srtctl apply then failed:
Invalid config: {'benchmark': {'command': ['Unknown field.'],
'env': ['Unknown field.']}}
Mirrors the IS_AGENTIC=1 branch I added earlier to launch_gb300-nv.sh:
use cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (now patched with
dynamo.wheel support via cherry-picked NVIDIA commit 0060f857) and
overlay our local agentic recipes from
benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/.
Both gb300-nv and gb300-cw launchers now handle IS_AGENTIC identically,
so the workload runs correctly regardless of which runner picks it up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
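The routing that caused this can be mimicked in a few lines. The shell expansion `${RUNNER_NAME%%_*}` drops everything from the first underscore onward; the path template below is an assumption inferred from the two launcher names mentioned in the message.

```python
def launcher_for(runner_name: str) -> str:
    """Map a runner name such as `gb300-cw_3` to its launcher script path.

    Assumption: launchers live at runners/launch_<prefix>.sh, as the two
    launcher names in the commit message suggest.
    """
    # Equivalent to ${RUNNER_NAME%%_*}: keep the text before the first underscore.
    prefix = runner_name.split("_", 1)[0]
    return f"runners/launch_{prefix}.sh"


print(launcher_for("gb300-cw_0"))
print(launcher_for("gb300-nv_2"))
```

Because the launcher is chosen from the runner's name rather than from the label a job requested, any behavior that matters must exist in every launcher a shared label can reach.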
Upstream NVIDIA/srt-slurm@main has caught up on every schema feature
the agentic path needs:
- BenchmarkType.CUSTOM + benchmark.command + benchmark.env (the
hook that hands off to benchmarks/multi_node/agentic_srt.sh)
- DynamoConfig.wheel (so our vllm recipes can pin the same
ai-dynamo wheel as the fixed-seq-len path)
- default_bash_preamble (no more "Unknown field" warning)
So we don't need the cquil11/srt-slurm-nv fork anymore. Pin to
upstream commit 127597c0e6d3 (current HEAD) for reproducibility;
bump as upstream evolves.
Also fix: `uv venv` defaults to no-pip. The upstream
prefetch-ai-dynamo-wheel.sh script (called by srtctl when a recipe
has `dynamo.wheel` set) does `python3 -m pip download`, which fails
with "No module named pip" without a seeded venv. Adding --seed
installs pip+setuptools+wheel into the venv so the prefetch path
works. R4 of dsv4-fp4-gb300-dynamo-vllm-agentic showed this error
on the gb300-cw runner immediately after the lockfile cleanup
unblocked the import_squash step.
Both gb300-cw and gb300-nv launchers updated identically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R5 first-shard failure on gb300-nv runner:
fatal: reference is not a tree: 127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74
I extrapolated the 40-char SHA from a 7-char short `127597c` shown in git log output instead of resolving it. The real SHA is 127597c2926467db06e6707e0aa9227261c6c02a (NVIDIA/srt-slurm@main, "Update GB300 FP8 GLM-5 recipe (#160)").
R5's gb300-cw shards didn't immediately fail on the same error: either they hadn't reached the checkout step yet when I noticed, or their git was more lenient about the prefix-then-garbage SHA. Either way, the fixed SHA works for both.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
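A format or prefix check cannot tell the fabricated SHA from the real one, since both are 40 hex characters starting with `127597c`. The sketch below shows that, plus one way to resolve a short SHA inside a local clone with `git rev-parse --verify`. Both function names are hypothetical, and `resolve_full_sha` needs git and a clone that contains the commit.

```python
import subprocess


def shares_prefix(short: str, full: str) -> bool:
    """True when `full` is 40 characters long and begins with `short`."""
    return len(full) == 40 and full.startswith(short)


def resolve_full_sha(repo_dir: str, rev: str) -> str:
    """Ask git for the full commit SHA behind `rev`.

    Raises CalledProcessError if the revision does not name a commit.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--verify", f"{rev}^{{commit}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


fabricated = "127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74"
real = "127597c2926467db06e6707e0aa9227261c6c02a"
# Both pass the prefix check, so only resolving against the repo is decisive.
print(shares_prefix("127597c", fabricated), shares_prefix("127597c", real))
```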
… launcher
Two issues caught in R5:
1) dynamo-vllm worker rejects chat parser flags
The worker entrypoint (different argparser than `vllm serve`) errors:
__main__.py: error: unrecognized arguments: --enable-auto-tool-choice
--tool-call-parser deepseek_v4
These belong on the dynamo frontend, not the worker. In disagg, chat
parsing happens at the frontend; workers just take tokens. The 8k1k
sibling recipes (which work) don't set these either. I mistakenly
ported them from the single-node launchers, which run `vllm serve`
directly (the chat-serving entrypoint).
Drop --tool-call-parser, --enable-auto-tool-choice, --reasoning-parser
from both prefill and decode blocks in both agentic recipes. Keep
--tokenizer-mode deepseek_v4 (worker DOES accept that one).
2) launch_gb300-cw.sh was missing set -e
The fabricated SHA bug from the prior commit only surfaced on the nv
launcher (which has set -exo pipefail). The cw launcher silently
swallowed the failed `git checkout` and proceeded on origin/HEAD —
which happened to be the right commit, masking the bug. Add
`set -exo pipefail` to match the nv launcher; loud failures are
safer than silent ones.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R6 surfaced via srtctl preflight that /scratch/models/DeepSeek-V4-Pro is
not staged on the gb300-nv cluster:
Error: Preflight failed for ...disagg-gb300-1p6d-dep4-tp4-agentic.yaml:
- model.path: Model alias 'deepseek-v4-pro' resolved to
'/scratch/models/DeepSeek-V4-Pro', but that path is unavailable.
DSR1 weights ARE staged on /scratch (node-local SSD), but DSv4-Pro was
never staged there. The 806 GB DSv4-Pro checkpoint lives at
/home/sa-shared/models/DeepSeek-V4-Pro (NFS, shared across nodes).
This silently broke the existing 8k1k fixed-seq-len path for dsv4-vllm
on gb300-nv too (just hadn't been exercised against the stricter
upstream srtctl preflight). Fix is single-file: re-point the DSv4 leg
of the per-model conditional to the NFS path.
NFS is slower than /scratch but that's where the model actually lives.
Stage to /scratch and switch back if model load I/O becomes a bottleneck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…S ELOOP
R7 of dsv4-fp4-gb300-dynamo-vllm-agentic:
Fatal error: Symlink loop from '/home/sa-shared/models/DeepSeek-V4-Pro'
OSError: [Errno 40] Too many levels of symbolic links
Same Vast NFS ELOOP bug we hit on the squash lockfiles in R3/R4: the /home/sa-shared/ NFS mount returns ELOOP to workflow worker processes (specifically those spawned through GHA runner pod -> sbatch -> pyxis/enroot), even though the same path is a regular directory from interactive sessions (verified via gb300-slurm + srun on c001; both Path.resolve() and ls succeed cleanly).
Workaround: /data/ and /home/sa-shared/ are SEPARATE mount points backed by the SAME storage (storage-vip.vast.p03.globalai.run, with /scratch and /scratch/home/sa-shared as the server-side paths). Switching MODEL_PATH to /data/home/sa-shared/models/DeepSeek-V4-Pro gives us identical files with a separate NFS client cache, which isn't poisoned in the workflow context.
Doesn't fix the underlying Vast NFS bug; it just routes around it.
Long-term: stage DSv4-Pro to /scratch/models/ (node-local SSD) like DSR1, both for performance and to bypass this whole mount class.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R7 of dsv4-fp4-gb300-dynamo-vllm-agentic had 6/8 worker srun steps OOM-killed within 30s, with `torch.AcceleratorError: CUDA-capable device(s) is/are busy or unavailable` (CUDA init aborts when SIGKILL races it).
sacct showed each worker step got AllocTRES mem=4G (empirically verified on CW: default sbatch w/ --gres=gpu:4 -> AllocTRES mem=4G; same sbatch w/ --mem=0 -> AllocTRES mem=868G).
Root cause: srt-slurm's start_srun_process doesn't pass --mem on the container srun, so it gets cpus_per_task × DefMemPerCPU = 4 GB by default on clusters with positive DefMemPerCPU (CW gb300 has 4096). 4 GB is wildly insufficient for a vLLM worker mmap'ing multi-GB model weights and pinning CUDA buffers.
Fix: re-point both gb300 launchers' IS_AGENTIC clone from upstream NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/agentic-mem-0 (96c443a), which is the same upstream commit + a single patch adding `--mem 0` to start_srun_process when container_image is set.
Long-term: PR the --mem=0 change upstream so we can drop the fork indirection for this feature class.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R9 hit the same Vast NFS ELOOP we fixed for the model path in R8, but
this time on the squash lockfile:
/usr/bin/bash: line 2: /home/sa-shared/gharunners/squash/<image>.sqsh.lock:
Too many levels of symbolic links
The /home/sa-shared/ NFS mount poisons lockfiles AND data files alike
under the workflow worker NFS session. We applied the /data/ workaround
for MODEL_PATH; now do the same for SQUASH_FILE + NGINX_SQUASH_FILE
which were still pointing at the bad mount. Both /home/sa-shared/
and /data/ are mounted from the same Vast backing storage; same files,
separate NFS client cache.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier I patched srt-slurm's start_srun_process to default --mem=0 on
container srun. That's the wrong layer — srtctl has a documented
top-level recipe field `srun_options:` (see docs/config-reference.md#srun_options)
that gets threaded straight through to the worker srun via
mixins/worker_stage.py:235 (`srun_options=self.runtime.srun_options`)
and start_srun_process line 248 (`for key, value in srun_options.items()`).
Switch to that mechanism:
- Add `srun_options: {mem: "0"}` to both agentic recipes
- Revert both launchers from the cquil11 fork pin back to upstream
NVIDIA/srt-slurm@127597c (the fork patch in cam/agentic-mem-0 is
now redundant; leaving the branch around as a fallback but not
pinned in the launcher)
R9/R10 confirmed sacct still showed mem=4G per worker step despite the
launcher cloning the patched fork — likely because srtctl's uv-sync
inside the sbatch rebuilds the venv from pyproject.toml and the
editable install from src/ doesn't include code modifications the way
uv pip install -e . would. The recipe-level mechanism doesn't depend
on patching srtctl at all so this whole class of "is the patch
loaded?" question goes away.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R11 verified that srun_options.mem=0 IS now in the worker srun
cmdline (confirmed via /proc/<pid>/cmdline on the head node).
BUT sacct still showed AllocTRES mem=4G per step.
Why: the sbatch only requested `--ntasks=8` with no `--mem`, so the
JOB allocation per node is bound to cpus_per_task × DefMemPerCPU =
1 × 4 GB = 4 GB. `--mem=0` on srun means "use ALL of what the JOB
has on this node" — and the job has 4 GB. There's nothing to grow
into.
The other half of the fix is `sbatch_directives.mem=0` which emits
`#SBATCH --mem=0` in the generated sbatch script (per
src/srtctl/templates/job_script_minimal.j2:26), making SLURM
allocate all available node memory (~868 GB on CW gb300) up front.
Both layers needed:
- sbatch_directives.mem=0 → JOB gets full node memory
- srun_options.mem=0 → each container srun step uses it
(without this, srun defaults back to
cpus_per_task × DefMemPerCPU = 4 GB)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
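The two-layer behavior described in this message can be written down as a small model. It only encodes the account given above (job allocation first, then the step's share of it) and is not a reimplementation of SLURM's allocator; the function names are hypothetical, and the numbers are the CW gb300 values quoted in the message.

```python
def job_mem_gb(sbatch_mem, cpus_per_task, def_mem_per_cpu_gb, node_mem_gb):
    """Per-node job memory. sbatch_mem: None = no --mem, 0 = --mem=0."""
    if sbatch_mem is None:
        return cpus_per_task * def_mem_per_cpu_gb
    if sbatch_mem == 0:
        return node_mem_gb  # --mem=0 requests all memory on the node
    return sbatch_mem


def step_mem_gb(srun_mem, job_mem, cpus_per_task, def_mem_per_cpu_gb):
    """Step memory. srun_mem: None = no --mem, 0 = --mem=0 (all of the job's)."""
    if srun_mem is None:
        return min(job_mem, cpus_per_task * def_mem_per_cpu_gb)
    if srun_mem == 0:
        return job_mem
    return min(job_mem, srun_mem)


def worker_mem(sbatch_mem, srun_mem, cpus=1, per_cpu=4, node=868):
    job = job_mem_gb(sbatch_mem, cpus, per_cpu, node)
    return step_mem_gb(srun_mem, job, cpus, per_cpu)


print(worker_mem(None, None))  # neither layer set
print(worker_mem(None, 0))     # srun_options.mem=0 only (the R11 case)
print(worker_mem(0, None))     # sbatch_directives.mem=0 only
print(worker_mem(0, 0))        # both layers set
```

Under this model only the last combination gives the worker the full node memory, which matches the conclusion that both settings are required.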
…ation)
R12 progressed past the memory layer (sbatch_directives.mem=0 from prior
commit worked; sacct showed AllocTRES mem=868G per worker), but failed
~10 min in with etcd lease-keepalive `deadline exceeded` errors followed
by every worker SIGKILL'd at 16:36:03.
Root cause from infra.out: etcd reported `max-cpu-set: 1` at startup.
SLURM's default cpus_per_task=1 starved single-CPU etcd under load from
24 concurrent dynamo DP rank lease keep-alives (16 prefill + 8 decode).
etcd's gRPC handler couldn't process RPCs fast enough → cascading lease
deadline exceeded → workers crashed → orchestrator cancelled job →
infra step itself SIGKILL'd at 16:35:49 ("STEP 4572.2 ON
slurm-gb300-138-249 CANCELLED ... DUE to SIGNAL Killed").
Fix: sbatch_directives.cpus-per-task=72 grants every task (including
the GPU-less infra step) one CW gb300 NUMA socket. etcd now has
plenty of compute; vLLM workers also get more aux CPU for tokenizer
threads etc.
Why cw needs this and nv doesn't: nv cluster's JobDefaults includes
DefCpuPerGPU=35 → any task with --gres=gpu:N auto-gets 35*N CPUs (=
140 on a 4-GPU task). cw has no per-GPU default → tasks get
cpus_per_task=1 by default. The infra step has no --gres flag at all
so it's the worst case on cw.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two changes:
1) Pin to NVIDIA cluster (drop CW)
The dsv4-fp4-gb300-dynamo-vllm-agentic runner field was `gb300`, which is the generic label both NV and CW runner pools advertise (per gh api runners). So shards landed on either cluster, which meant we kept debugging the same recipe path against two different cluster configs (NV's DefCpuPerGPU=35 vs CW's DefMemPerCPU=4096 with no per-GPU defaults).
Switch to `runner: gb300-nv`, a label only the NV pool advertises. This matches just gb300-nv_0/1/2 going forward.
2) MODEL_PATH switched to /scratch/models/DeepSeek-V4-Pro
The node-local SSD on NV compute nodes. Faster than the /data/home/sa-shared NFS path (where DSv4-Pro currently lives).
Caveat: /scratch doesn't exist on the GHA runner pod, so srtctl preflight may fail with "Model alias resolved to ..., but that path is unavailable." We're trying this anyway to see whether the runner pod has /scratch mounted; if it errors, next step is to either (a) patch srt-slurm to add a `skip_model_preflight` recipe field or (b) stub a symlink on the runner pod.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agentic recipe pins MODEL_PATH=/scratch/models/DeepSeek-V4-Pro (node-local NVMe on compute nodes). srtctl's _preflight_model runs in-process on whatever node invokes srtctl (the GHA runner pod, which doesn't have /scratch mounted), so it bails before sbatch with "Model alias 'deepseek-v4-pro' resolved to '/scratch/...', but that path is unavailable" (R14 hit this).
Switch the IS_AGENTIC=1 clone target from NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/no-preflight-flag (854b3fd), which adds one CLI flag, `srtctl apply --no-preflight`, that skips just the optional Python-level FS precheck. vLLM still fails loudly at runtime if the path is genuinely missing on the compute node.
The flag is only passed when IS_AGENTIC=1. Fixed-seq-len recipes resolve model.path to an NFS path visible from the runner pod, where the precheck is a useful sanity guard, so leave enforcement on for them.
Fork commit: cquil11/srt-slurm-nv@854b3fd
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Aiperf's content-addressed mmap dataset cache (~65 GB per dataset)
needs to be persisted across runs so the first run of the day doesn't
re-tokenize + re-write it on every shard. Same pattern as
launch_h200-dgxc-slurm.sh, launch_b200-dgxc.sh, launch_mi355x-amds.sh.
Three layers wired:
1) Host paths (cluster-specific, created with 0777 so all gharunner_X
SLURM users can write):
gb300-nv /data/home/sa-shared/gharunners/ai-perf-cache
gb300-cw /mnt/vast/ai-perf-cache
2) Both launchers export AIPERF_MMAP_CACHE_HOST_PATH and add a line to
the generated srtslurm.yaml's default_mounts block — srt-slurm's
runtime.py reads default_mounts via get_srtslurm_setting() and
bind-mounts each entry into every worker container. cw already had
a default_mounts block (for dynamo-wheels-cache); nv had none.
3) Both agentic recipes set AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache
in benchmark.env so the aiperf process inside the container reads
from the persistent mount instead of ~/.cache/aiperf/dataset_mmap.
Single-node launchers don't need updating — they have their own srun
--container-mounts line that already bind-mounts the cache.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in 45 commits from upstream/ajc/inferencex-agentx-mvp (PR #875):
- InferenceX AgentX-MVP scenario (default corpus switched to 051226 no-subagents 949-trace variant)
- semianalysis_cc_traces_weka_no_subagents HF loader
- Wrap-fill trajectory recycling + correlation-id double-recycle guard
- DAG benchmarks, reproducible payload replay, agentic_replay E2E test
- assorted dataset/timing fixes
Local commits preserved (no rebase). One docstring-only conflict in src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py resolved by taking upstream's text (more comprehensive: documents both 042026 and 051226 variants).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vllm/vllm-openai:v0.21.0-ubuntu2404 ships without git, but pip's
editable install (-e) of utils/aiperf invokes `git version` to record
direct_url.json provenance. Without git, every R16 shard on both
gb300-nv and gb300-cw failed at:
+ python3 -m pip install --break-system-packages -q --ignore-installed -e /infmax-workspace/utils/aiperf
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
This happens AFTER server boot is healthy and "Server is healthy - starting
benchmark" has fired, so all the upstream cluster/recipe work (preflight,
mem=0 x2 layers, etcd cpus-per-task=72, --no-preflight, /scratch model
path, NixlConnector P<->D, model load) is working end-to-end. Only the
pip install step is blocked.
Fix: prepend a `command -v git || apt-get update && apt-get install -y git`
to install_agentic_deps. Cheap no-op on images that already ship git
(AMD images, custom containers). The vLLM image's apt is functional from
inside the container so this works without container rebuild.
The -e install was introduced yesterday in e92a9bf (aiperf v0.2
migration); previously the agentic flow used kv-cache-tester which
didn't need git.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
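One caveat about the one-liner as quoted: in POSIX shell, `||` and `&&` have equal precedence and associate left to right, so `command -v git || apt-get update && apt-get install -y git` groups as `(command -v git || apt-get update) && apt-get install -y git`, and the install step would still run when git is already present. The intended "skip everything if git exists" logic is sketched below as an explicit branch; the function name is hypothetical and the sketch only builds the command list rather than running it.

```python
import shutil


def git_install_plan(which=shutil.which):
    """Return the commands needed to make git available, or [] if it already is.

    `which` is injectable so the decision can be exercised without touching
    the real PATH.
    """
    if which("git"):
        return []  # git already on PATH: genuinely nothing to do
    return [
        ["apt-get", "update"],
        ["apt-get", "install", "-y", "git"],
    ]


print(git_install_plan(which=lambda name: "/usr/bin/git"))
print(git_install_plan(which=lambda name: None))
```

In shell, the same intent would need explicit grouping, for example braces around the two apt-get commands on the right-hand side of `||`.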
…t containers
R17 surfaced two distinct failures, one per cluster:
1) gb300-cw (all 3 shards): aiperf rejected --public-dataset semianalysis_cc_traces_weka with "Scenario invariants violated ... required loader=any of ['semianalysis_cc_traces_weka_no_subagents', 'weka_trace']". Yesterday's aiperf merge (PR #875 commit fef78a96) switched the inferencex-agentx-mvp scenario's default corpus to the 051226 no-subagents 949-trace variant and tightened the loader contract. The old name is no longer accepted.
Fix: resolve_trace_source emits --public-dataset semianalysis_cc_traces_weka_no_subagents.
2) gb300-nv (all 3 shards): "dpkg: error: requested operation requires superuser privilege" from yesterday's install_agentic_deps git install path. The gb300-nv pyxis/enroot setup maps the calling user (sa-shared) into the container as non-root, while gb300-cw runs as root. The git install needs sudo on nv; cw is fine without.
Fix: branch on `id -u`: apt-get directly when root, sudo apt-get otherwise. The vllm-base layer installs `sudo` so the binary is available, and the typical enroot config grants the calling user passwordless sudo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R17/R18 made it clear that there's no clean way to install git into the
vllm/vllm-openai container at run-time on gb300-nv:
- R16/R17: container ships without git -> pip's editable install of
aiperf fails with "Cannot find command 'git'"
- R18: tried `sudo apt-get install git`. gb300-nv pyxis/enroot remaps
the calling user to uid=345200007 inside the container, and sudo
refuses to run with "/usr/bin/sudo must be owned by uid 0 and have
the setuid bit set" -- the setuid bit can't carry across user
namespaces. cw container runs as root so sudo wasn't tripped there,
but the right answer is one that works on both clusters.
The actual fix is upstream from this entirely: drop `-e`. pip's editable
install needs git only to record direct_url.json provenance; the
non-editable install just builds a wheel via hatchling and copies into
site-packages. aiperf's pyproject.toml pins version="0.8.0" rather than
deriving it from git tags, so non-editable install works without git in
any environment. We don't edit aiperf source mid-benchmark anyway --
loss of -e ergonomics is zero.
`--ignore-installed` is still needed (handles the apt-managed-blinker
distutils-uninstall pile-up) and is orthogonal to -e.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the sudo/root-detection complexity from R18 and restore -e on the aiperf pip install. Per user direction.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The vllm/vllm-openai container ships without git; agentic_srt.sh needs to apt-get install it because pip's install of utils/aiperf calls `git version`.
R17/R18/R19/R20 chased this on gb300-nv with various combinations of sudo / no-sudo / drop-e / etc., all failing because pyxis maps the calling user to uid 345200007 inside the container and dpkg's hardcoded geteuid()!=0 check rejects every attempt regardless of filesystem permissions.
The cleanest fix is to ask pyxis to remap us to uid 0 inside the container, matching the gb300-cw behavior (where the container already runs as root and apt-get install works directly). pyxis exposes this as a per-srun flag: --container-remap-root. srt-slurm renders empty-string srun_options as flag-only srun args (see core/slurm.py:250 in NVIDIA/srt-slurm@127597c).
No-op on gb300-cw (cw is already remapped to root by default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
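The "empty string means flag-only" rendering can be illustrated as follows. This is a sketch of the behavior the message describes, not a copy of srt-slurm's core/slurm.py; the function name is hypothetical, and emitting key and value as separate arguments (rather than `--key=value`) is an assumption.

```python
def render_srun_options(options: dict) -> list:
    """Turn a recipe-level srun_options mapping into srun command-line args.

    An empty-string value yields a bare flag; anything else yields a
    flag followed by its value.
    """
    args = []
    for key, value in options.items():
        args.append(f"--{key}")
        if value != "":
            args.append(str(value))
    return args


print(render_srun_options({"mem": "0", "container-remap-root": ""}))
```

With this convention a recipe can request boolean srun flags without the schema needing a separate list-of-flags field.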
Picks up cquil11/srt-slurm-nv@6e34b8b which propagates srun_options through the benchmark_stage srun (previously only worker/frontend/telemetry stages honored them). Required for the recipe-level srun_options.container-remap-root: "" to apply to the benchmark.command container, the one that runs agentic_srt.sh + apt install git.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Picks up cquil11/aiperf@9b858ae which fixes PhaseRunner.cancel() to set all_credits_sent_event / all_credits_returned_event so the outer runner's awaits wake immediately. Previously, cancelled runs (e.g. via --failed-request-threshold) blocked for the full phase timeout (~1800s default) before reaching the graceful exit path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ncel)
When a workflow run is cancelled mid-flight (gh run cancel, or UI cancel button), the launcher gets SIGTERM during its `tail -F` wait and exits before reaching the `tar czf .../multinode_server_logs.tar.gz` line in the main flow. The Upload server logs workflow step runs (it has if: always()) but finds no file (if-no-files-found: ignore silently skips), so the artifact never gets uploaded.
Fix: install an EXIT trap right after JOB_ID extraction that produces the tarball on any exit path: normal completion, error, SIGTERM, SIGKILL of our parent. The main-flow tar block is now an idempotent no-op (kept for log narrative).
Applied identically to both gb300-nv and gb300-cw launchers. The b200-dgxc launcher has the same pattern but its multi-node flow is currently only used by other configs; leaving it alone for now to avoid mixing unrelated changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
gb300-nv 1p6d agentic runs hit ~15% errors at conc=32 from Dynamo NATS RPC deadline timeouts when the single prefill worker is saturated by 32 concurrent 50-100k token prefills. Each timeout returns HTTP 500 "Failed to generate completions: Prefill execution failed: ... NATS request to dynamo_prefill.generate-... failed: ... deadline has elapsed". This is a real failure, but it is driven by the single-prefill-worker capacity limit, not a regression.
At the previous 0.05 threshold the run tripped its ProfileCancel mechanism early and produced no usable numbers. At 0.20 the run completes and we get steady-state metrics for the ~85% of requests that succeed; the underlying NATS saturation is a separate work item (Dynamo deadline tuning, or more prefill workers in the recipe, or both).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
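The effect of the threshold change is simple arithmetic, sketched below. The function is a hypothetical stand-in for whatever comparison the benchmark tool performs; whether the real check is strict, or evaluated over a window rather than cumulatively, is not stated in the message.

```python
def exceeds_failure_threshold(failed: int, total: int, threshold: float) -> bool:
    """True when the observed failure fraction is above the threshold.

    Assumption: a cumulative, strict comparison; the real tool may differ.
    """
    if total == 0:
        return False
    return failed / total > threshold


# Roughly 15% of requests failing, as reported for the conc=32 run.
print(exceeds_failure_threshold(15, 100, 0.05))  # old threshold: run is cancelled
print(exceeds_failure_threshold(15, 100, 0.20))  # new threshold: run continues
```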
LOGS/agentic/aiperf_artifacts/detailed_results.csv
LOGS/agentic/aiperf_artifacts/debug_trace.jsonl
🔴 The 'Upload agentic raw results' step in benchmark-multinode-tmpl.yml (lines 294-295) lists LOGS/agentic/aiperf_artifacts/detailed_results.csv and LOGS/agentic/aiperf_artifacts/debug_trace.jsonl — those filenames were produced by the removed utils/trace-replay submodule and are never written by the new aiperf pipeline. Combined with if-no-files-found: ignore, multinode agentic runs will silently upload an empty agentic_<RESULT_FILENAME> artifact, losing all per-request profile and server-metrics data. Mirror benchmark-tmpl.yml's full aiperf file list (profile_export*, server_metrics_export*, gpu_telemetry_export.jsonl, aiperf logs), translating the results/ prefix to LOGS/agentic/.
Extended reasoning...
What the bug is
The multinode template's Upload agentic raw results step was only half-migrated in this PR. The directory rename trace_replay/ → aiperf_artifacts/ was applied, but the filenames underneath it were left as the legacy trace-replay outputs:
LOGS/agentic/aiperf_artifacts/detailed_results.csv
LOGS/agentic/aiperf_artifacts/debug_trace.jsonl
Those two filenames were specific to the now-removed utils/trace-replay submodule (the kv-cache-tester scripts). This PR deletes that submodule entry from .gitmodules and removes load_trace_replay_records (which read detailed_results.csv) from utils/agentic-benchmark/scripts/analyze_benchmark_distributions.py. Grep confirms detailed_results.csv and debug_trace.jsonl appear nowhere else in the repo; nothing writes them anymore.
How it manifests
The new aiperf pipeline (wired up via benchmarks/benchmark_lib.sh:build_replay_cmd with --output-artifact-dir /aiperf_artifacts) writes an entirely different set of files: profile_export.jsonl, profile_export_aiperf.{json,csv}, profile_export_aiperf_{timeslices,aggregate,collated}.*, server_metrics_export.{json,jsonl,csv,parquet}, gpu_telemetry_export.jsonl, and logs/aiperf.log. The sibling single-node template benchmark-tmpl.yml was correctly updated in this same PR to enumerate all of those.
Why existing code doesn't prevent it
actions/upload-artifact@v7.0.1 is invoked with if-no-files-found: ignore, so a glob/path that matches zero files produces an empty artifact upload without warning. There is no schema check that the listed paths exist.
Impact
Every multinode agentic run (the new dsv4-fp4-gb300-dynamo-vllm-agentic and dsv4-fp4-gb300-cw-dynamo-vllm-agentic configs introduced by this PR, plus future multinode agentic configs) silently produces an empty agentic_<RESULT_FILENAME> artifact. The entire per-request profile (profile_export.jsonl), aiperf aggregate exports, server scrape time series, GPU telemetry, and aiperf logs from multinode jobs are lost. Downstream consumers like utils/process_agentic_result.py (which reads profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json) cannot reanalyze multinode runs after the fact.
Step-by-step proof
1. A multinode agentic job runs and benchmarks/multi_node/agentic_srt.sh calls build_replay_cmd → run_agentic_replay_and_write_outputs.
2. benchmark_lib.sh:1003 invokes aiperf with --output-artifact-dir /aiperf_artifacts (where= /logs/agentic from the recipe's benchmark.env).
3. aiperf populates the directory with profile_export.jsonl, profile_export_aiperf.csv, server_metrics_export.{json,jsonl,csv,parquet}, etc., but NOT detailed_results.csv or debug_trace.jsonl (those were trace-replay outputs).
4. The launcher copies/tars the logs back to /LOGS/agentic/ on the runner.
5. The Upload agentic raw results step runs with the two listed paths:
   LOGS/agentic/aiperf_artifacts/detailed_results.csv → does not exist.
   LOGS/agentic/aiperf_artifacts/debug_trace.jsonl → does not exist.
6. if-no-files-found: ignore causes upload-artifact to emit an empty bundle without warning.
7. The agentic_<RESULT_FILENAME> artifact appears in the workflow run UI but contains zero files; gh run download returns nothing.
How to fix
Mirror the file list from the correctly-updated sibling benchmark-tmpl.yml (which lists every aiperf export name plus the new lmcache_server.log and *_command.txt files), translating the results/ prefix to LOGS/agentic/. At minimum the list must include profile_export.jsonl, profile_export_aiperf.{json,csv}, profile_export_aiperf_{timeslices,aggregate,collated}.*, server_metrics_export.{json,jsonl,csv,parquet}, gpu_telemetry_export.jsonl, and aiperf_artifacts/logs/*.log.
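Since `if-no-files-found: ignore` hides this class of mistake, a pre-upload check that reports patterns matching nothing would make it visible. The sketch below is one way to do that with the standard library; the helper name is hypothetical, and the demo directory only imitates the layout described in this comment.

```python
import glob
import os
import tempfile


def unmatched_patterns(root, patterns):
    """Return the patterns that match no file under `root`."""
    return [p for p in patterns if not glob.glob(os.path.join(root, p))]


# Demo on a throwaway directory holding one file the new pipeline writes.
with tempfile.TemporaryDirectory() as root:
    artifacts = os.path.join(root, "LOGS", "agentic", "aiperf_artifacts")
    os.makedirs(artifacts)
    open(os.path.join(artifacts, "profile_export.jsonl"), "w").close()

    missing = unmatched_patterns(root, [
        "LOGS/agentic/aiperf_artifacts/detailed_results.csv",
        "LOGS/agentic/aiperf_artifacts/debug_trace.jsonl",
        "LOGS/agentic/aiperf_artifacts/profile_export*",
    ])

print(missing)  # the two legacy names; the profile_export* glob matched
```

A step like this could run before the upload and fail the job when every listed pattern is unmatched, while still tolerating optional files.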
  # ``hash_ids`` and ``output_length``. Built lazily from the HF dataset cache.
  _TRACE_METADATA_CACHE: dict[str, list[dict]] | None = None
- _HF_DATASET = "semianalysisai/cc-traces-weka-042026"
+ _HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"
🔴 The test fixture in utils/test_process_agentic_result.py (test_processor_loads_traces_jsonl_for_theoretical_cache) still hard-codes the old dataset directory name datasets--semianalysisai--cc-traces-weka-042026, but this PR renamed _HF_DATASET in process_agentic_result.py:40 to semianalysisai/cc-traces-weka-with-subagents-051926. The processor's _hf_traces_dir() now looks under the new directory name, so the fixture is never found, theoretical_cache_hit_rate stays None, and the assertions at lines 461 and 463 (== pytest.approx(0.5) and mean_output_tokens_expected == ...) will fail every CI run. Fix: update the fixture path to datasets--semianalysisai--cc-traces-weka-with-subagents-051926.
Bug
test_processor_loads_traces_jsonl_for_theoretical_cache writes a synthetic Hugging Face snapshot to validate that process_agentic_result.py correctly walks per-trace hash_ids arrays and computes theoretical_cache_hit_rate. After this PR, the test will deterministically fail on first execution.
Root Cause
This PR changed utils/process_agentic_result.py:40 from:
_HF_DATASET = "semianalysisai/cc-traces-weka-042026"
to:
_HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"
_hf_traces_dir() (around line 133-134) derives the on-disk cache directory from this constant via the HF naming convention datasets--{org}--{name}. So after the rename the processor looks for:
$HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-with-subagents-051926/snapshots/<rev>/traces.jsonl
But the test fixture at utils/test_process_agentic_result.py:408 still hard-codes the old name:
snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"
The other call sites in the same test file (_write_fixture, the per-run subdir test, etc.) were updated from trace_replay → aiperf_artifacts in this PR, but this particular hard-coded HF dataset directory was missed.
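The naming convention above can be sketched as follows. This is not the body of _hf_traces_dir(), which is not shown in this thread; the helper names are hypothetical and only illustrate how a repo id maps to the datasets--{org}--{name} folder.

```python
from pathlib import Path


def hf_dataset_cache_dirname(repo_id: str) -> str:
    """Map an 'org/name' dataset repo id to its hub-cache folder name."""
    org, name = repo_id.split("/", 1)
    return f"datasets--{org}--{name}"


def snapshots_dir(hub_cache: Path, repo_id: str) -> Path:
    """Directory under which per-revision snapshots would be looked up."""
    return hub_cache / hf_dataset_cache_dirname(repo_id) / "snapshots"


OLD = "semianalysisai/cc-traces-weka-042026"
NEW = "semianalysisai/cc-traces-weka-with-subagents-051926"

# The fixture writes under the OLD folder while the processor derives the
# NEW one, so the two paths never meet.
assert snapshots_dir(Path("/tmp/_hf"), OLD) != snapshots_dir(Path("/tmp/_hf"), NEW)
```

Deriving the fixture path from the same constant the processor uses, rather than hard-coding the folder name, would keep the test in step with future dataset renames.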
Step-by-Step Proof
- Test calls _write_fixture, then writes traces.jsonl to <tmp>/_hf/datasets--semianalysisai--cc-traces-weka-042026/snapshots/abc/traces.jsonl.
- Test sets HF_HUB_CACHE=<tmp>/_hf and invokes the processor.
- Inside _hf_traces_dir(), the code builds: Path($HF_HUB_CACHE) / f"datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots", using the new _HF_DATASET constant.
- That directory does not exist in the fixture (only the old-name directory does), so _hf_traces_dir() returns None.
- _iter_trace_blobs is never called; _TRACE_METADATA_CACHE remains empty.
- Without trace metadata, theoretical_cache_hit_rate is computed as None and mean_output_tokens_expected is None (or missing) in the emitted agg JSON.
- The assertion at line 461 (agg["theoretical_cache_hit_rate"] == pytest.approx(0.5)) compares None == 0.5 → fails.
- The assertion at line 463 (agg["mean_output_tokens_expected"] == pytest.approx((50+60+55+40+70)/5)) compares None to a float → fails.
Independent verifier confirmation: one verifier reproduced this by running the processor against both paths and observed that the old path produces theoretical_cache_hit_rate=None, while only the new path populates it as expected.
Fix
Rename the fixture directory in utils/test_process_agentic_result.py (around line 408) from:
snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"
to:
snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" / "abc"
No other test fixture changes are needed; the processor will then find the synthetic snapshot at the new path and the assertions will pass.
…n/ subdir
Match the existing benchmarks/single_node/agentic/ split: all 111 non-
agentic per-cluster launch scripts move into benchmarks/single_node/
fixed_seq_len/. chat_templates/ stays at single_node/chat_templates/ as
a shared resource (referenced by both agentic and fixed_seq_len scripts).
Plumbing:
- .github/workflows/benchmark-tmpl.yml + benchmark-multinode-tmpl.yml:
SCENARIO_SUBDIR default flips from '' to 'fixed_seq_len/'.
- runners/launch_mi355x-amds.sh: parameter-expansion fallback also
defaults to fixed_seq_len/ so direct invocations (without the
workflow setting SCENARIO_SUBDIR) still resolve.
- Each moved script's `source "$(dirname \"$0\")/../benchmark_lib.sh"`
becomes `../../benchmark_lib.sh`.
- dsv4_fp4_mi355x_sglang.sh's --chat-template path becomes
`../chat_templates/...` (matches the agentic copy's pattern).
- .github/configs/{nvidia,amd}-master.yaml: forward-looking comments
repath to fixed_seq_len/. perf-changelog.yaml historical entries
left untouched (they describe paths at the time of the change).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
…xups
Resolutions:
- perf-changelog.yaml: took main verbatim.
- runners/launch_b300-nv.sh: took main (drops --nodelist pin entirely;
supersedes our narrower 017-019 fix).
- benchmarks/single_node/fixed_seq_len/dsv4_fp8_mi355x{,_vllm}.sh:
accepted main's deletes (orphan recipes removed in #1374, #1501).
- .github/configs/amd-master.yaml: took main as the base, then re-applied
our agentic-only additions on top:
* qwen3.5-fp8-mi355x-sglang-agentic-hicache (new entry)
* dsv4-fp4-mi355x-vllm-agentic (new entry)
* dsv4-fp4-mi355x-sglang-agentic (new entry)
* kimik2.5-fp4-mi355x-vllm-agentic (cpu -> lmcache)
Dropped our comment-path edit for dsv4_fp8_mi355x_vllm.sh since main
deleted that entry.
Fixed_seq_len reorg fixups for files added on main during our branch's
lifetime:
- git mv 14 stranded scripts from benchmarks/single_node/*.sh into
benchmarks/single_node/fixed_seq_len/ (dsr1_fp4_b200_mtp,
dsr1_fp4_mi355x_mtp, dsr1_fp8_h200_mtp, dsr1_fp8_mi325x_mtp,
dsr1_fp8_mi355x_mtp, dsv4_fp4_mi355x_vllm, glm5_fp8_h200_mtp,
glm5_fp8_mi325x, glm5_fp8_mi325x_mtp, qwen3.5_bf16_mi325x_mtp,
qwen3.5_fp4_mi355x_mtp, qwen3.5_fp8_h100, qwen3.5_fp8_h100_mtp,
qwen3.5_fp8_mi325x_mtp). Patched their source paths from
../benchmark_lib.sh to ../../benchmark_lib.sh.
- runners/launch_mi355x-amds.sh: multinode-non-disagg BENCHMARK_SUBDIR
bumped from `single_node` to `single_node/fixed_seq_len`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-recipe scripts had stale `VAR=${VAR:-default}` lines for variables
that are either reliably plumbed by the workflow template or completely
unused. The defaults masked missing-env bugs (the workflow could forget
to plumb a var and the script would silently fall back to a stale local
default instead of failing loudly) and left dead lines hanging around
from the pre-aiperf-v0.2 era.
benchmarks/benchmark_lib.sh:
- PORT: new `export PORT="${PORT:-8888}"` near the top so a single
source of truth governs the server port. Launchers that need a
non-default value (launch_mi355x-amds.sh derives PORT from
RUNNER_NAME to avoid collisions across concurrent gh-runners) set
PORT themselves; the `:-` fallback only kicks in if nothing
upstream set it.
- build_replay_cmd: `local duration="${DURATION:-1800}"` -> `"$DURATION"`
(DURATION is now a check_env_vars-enforced requirement in callers).
benchmarks/single_node/agentic/*.sh (32 scripts) and
benchmarks/multi_node/agentic_srt.sh:
- Removed: PORT=${PORT:-8888} (benchmark_lib owns it now).
- Removed: DURATION/EP_SIZE/DP_ATTENTION defaults; added each to
check_env_vars in the scripts that consume them. DURATION is
consumed by build_replay_cmd in benchmark_lib, so every agentic
script now requires it explicitly.
- Removed: MAX_DELAY/ADVANCE_MIN/ADVANCE_MAX. These were CLI args to
the old trace_replay_tester.py (commit b7ae440); the aiperf v0.2
migration (commit e92a9bf) dropped all consumption but left the
top-of-script var-definitions behind. Pure dead code.
- Kept: SCHEDULER_RECV_INTERVAL (per-model sglang server tuning,
not workflow-plumbed; values vary 5/10/30 per recipe).
benchmarks/single_node/fixed_seq_len/*.sh (120 scripts):
- Removed: PORT=${PORT:-8888} only. fixed_seq_len's check_env_vars
block already requires what it uses (DP_ATTENTION/EP_SIZE/ISL/OSL/
RANDOM_RANGE_RATIO/RESULT_FILENAME) per the existing convention;
no further changes needed.
Net: 343 deletions, 46 insertions across 154 files; no behavior change
on any green CI path (workflow input defaults match the removed local
defaults). Behavior change only when an upstream caller fails to set
DURATION/EP_SIZE/DP_ATTENTION on an agentic recipe -- which now fails
loudly via check_env_vars instead of silently inheriting a stale value.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Matches the existing pattern from launch_{b200-dgxc,h200-dgxc-slurm,
gb300-{nv,cw},mi355x-amds}.sh: define AIPERF_MMAP_CACHE_HOST_PATH on the
host, mount it to /aiperf_mmap_cache inside the container, and expose
AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache via --export so aiperf's
DatasetLoaderManager finds it. Lets agentic benchmarks reuse the
pre-built mmap dataset cache instead of re-mmaping every run.
- h200-nb: /mnt/data/gharunners/ai-perf-cache (sibling of hf-hub-cache)
- h200-cw: /mnt/vast/gharunner/ai-perf-cache (sibling of hf-hub-cache)
Host-side directories will be created out-of-band before next run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
| OFFLOAD_ARGS=( | ||
| --kv-transfer-config | ||
| "{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$LMCACHE_HOST\",\"lmcache.mp.port\":$LMCACHE_PORT}}" | ||
| ) |
Dead code after explicit exit 1 in disabled branch
Low Severity
The lmcache-mp case in the OFFLOADING switch immediately calls exit 1 (line 140) to disable the path, but ~47 lines of uncommented server-startup code follow that exit 1, including agentic_pip_install, LMCache server launch, wait_for_lmcache_ready, and OFFLOAD_ARGS construction. All of it is permanently unreachable. The comment says to "re-enable after PR #3261 merges", but the implementation was left as dead statements rather than being commented out, which gives the misleading impression that the code runs.
Reviewed by Cursor Bugbot for commit a98fcaa.
… corpus
Adds a per-recipe override hook in benchmark_lib.sh's resolve_trace_source:
recipes set WEKA_LOADER_OVERRIDE to one of the aiperf public-dataset loader
names allowed by the inferencex-agentx-mvp scenario, and resolve_trace_source
swaps both the --public-dataset flag and the HF dataset pre-download to match.
Default remains semianalysis_cc_traces_weka_with_subagents (052726, 472
traces). Unknown overrides fail loudly with the allowed-values hint.
Wires the new override into all 8 minimaxm2.5 agentic recipes
(minimaxm2.5_fp{4,8}_{b200,b300,h100,h200,mi300x,mi325x,mi355x}.sh) to
use semianalysis_cc_traces_weka_with_subagents_256k -- the 256k-capped
variant (051926-256k, 217 traces, max in+out <= 256k by construction).
MiniMax-M2.5 servers run at max_model_len ~256k, so the unfiltered 052726
corpus would have its longest requests rejected.
Submodule bump: utils/aiperf -> 6fc5f5d6 registers the new loader name in
plugins.yaml and adds it to inferencex_agentx_mvp's require_loader tuple.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Mirrors aiperf 519580fb: the semianalysis_cc_traces_weka_with_subagents_256k loader now points at semianalysisai/cc-traces-weka-with-subagents-052726-256k (470 traces) instead of the earlier 051926-256k (217 traces). Loader name and override env var (WEKA_LOADER_OVERRIDE) unchanged.
- benchmark_lib.sh resolve_trace_source: case-statement HF repo path bumped to ...052726-256k for the _256k loader.
- All 8 minimaxm2.5_*.sh agentic recipe comments: trace count 217 -> 470.
- utils/aiperf submodule pointer -> 519580fb.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The proxy occasionally records the same logical request twice. On the 472-session par<=5 sample, 2,339 of 115,593 rows (2.0%) are byte-identical duplicates of a prior row in the same session: 1,923 are main-agent turns and 416 are subagent inner requests. 275 of 472 sessions (58%) have at least one duplicate. Worst session has 165 dup rows.
Without deduping, the weka conversion silently inflates token counts, request counts, and throughput by ~2%, and the converter misclassifies duplicate-pair rows as "two requests started at the same nanosecond" when grouping subagents.
Fingerprint: (timestamp, model, input_tokens, output_tokens, duration_ms, agent_id). On the 2,339 detected pairs, 100% are also byte-identical when full JSON is serialized, so the fingerprint produces zero false positives.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
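The fingerprint-based dedupe described above might look roughly like this. It is a sketch, not the converter's actual code: it assumes each row is a flat dict whose keys match the six fingerprint fields and whose values are hashable scalars, and the function names are hypothetical.

```python
from typing import Iterable, Iterator

# Field order follows the fingerprint tuple in the commit message.
FINGERPRINT_FIELDS = (
    "timestamp",
    "model",
    "input_tokens",
    "output_tokens",
    "duration_ms",
    "agent_id",
)


def fingerprint(row: dict) -> tuple:
    """Build the dedupe key; fields absent from the row become None."""
    return tuple(row.get(field) for field in FINGERPRINT_FIELDS)


def dedupe_session(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows of one session in order, skipping repeats of an earlier row."""
    seen = set()
    for row in rows:
        key = fingerprint(row)
        if key in seen:
            continue  # same fingerprint as a prior row in this session
        seen.add(key)
        yield row
```

The seen set is scoped to a single call, which matches the "prior row in the same session" wording: identical fingerprints in different sessions are not treated as duplicates.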
…0.21.0
v0.20.2's bundled huggingface_hub==1.14.0 silently fetches Git-LFS pointer files instead of LFS content for `hf download --repo-type dataset`. Every kimik2.5-fp4-b200-vllm-agentic job in run 26536606210 hit "pyarrow.lib.ArrowInvalid: JSON parse error: Missing a name for object member. in row 0" -- the signature of pyarrow trying to parse the literal `version https://git-lfs.github.com/spec/v1` line of an LFS pointer file as JSON.
b200-dgxc has no persistent /mnt/hf_hub_cache mount (per launcher diff), so every container re-downloads the dataset and re-hits the bug.
v0.21.0 ships a newer huggingface_hub that resolves LFS correctly. v0.20.x's flashinfer fix for the max_model_len=131072 + prefix-caching warmup crash is included in v0.21.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
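A downloaded file can be checked for this failure mode before pyarrow ever sees it. The sketch below assumes only what the commit message states, namely that a pointer file starts with the literal `version https://git-lfs.github.com/spec/v1` line; the helper name is_lfs_pointer is hypothetical and is not part of huggingface_hub or aiperf.

```python
from pathlib import Path

# First bytes of a Git-LFS pointer file, as quoted in the commit message.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file begins with the Git-LFS pointer header."""
    with path.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Calling this on traces.jsonl right after the download step would turn the opaque ArrowInvalid error into an explicit "got an LFS pointer, not the dataset" failure.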
New agentic-coding recipe targeting H100 (runner: h100-dgxc) running
Qwen3.5-397B-A17B FP8 via SGLang v0.5.12-cu130. Mirrors the b300 SGLang
agentic shape with H100-appropriate kernel flags:
- attention-backend: flashinfer (sm_90; trtllm_mha is Blackwell-only).
- mem-fraction-static 0.75 (vs 0.80 on B300) and chunked-prefill-size
8192 (vs 16384) to fit Qwen-397B FP8 weights + KV in H100's 80 GB
HBM3 at TP=8.
- conc-list capped at 16 across both arms; agentic ISLs hit ~80k-200k
on the 256k corpus and Qwen at conc=32 OOM'd in the fixed_seq_len
sweep at lower ISL too.
Recipe wires WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_256k
so the 256k-capped variant (470 traces, max in+out <= 256k) is used
instead of the unfiltered 052726 corpus (which has up to ~1M-token
requests the H100 max_model_len=131k server would reject).
Two sweep arms:
- none: --disable-radix-cache, conc-list [1, 2, 4, 8, 16]
- hicache: --enable-hierarchical-cache + sized from TOTAL_CPU_DRAM_GB,
conc-list [4, 8, 16] (capped where hicache stabilizes)
Yaml key is qwen3.5-fp8-h100-sglang-agentic; script filename is the
bare `qwen3.5_fp8_h100.sh` under benchmarks/single_node/agentic/ —
the h100 launchers don't support framework-tagged script names, and
this matches the precedent set by qwen3.5_fp8_b200.sh (which is the
sglang-agentic recipe under the same bare name).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Matches the same pattern as launch_b200-dgxc, launch_h200-dgxc-slurm,
launch_gb300-{nv,cw}, launch_mi355x-amds, launch_h200-{nb,cw}: define
AIPERF_MMAP_CACHE_HOST_PATH on the host, bind-mount it to
/aiperf_mmap_cache in the container, and expose
AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache via --export.
Host path: /mnt/nfs/sa-shared/gharunners/ai-perf-cache (sibling of
the existing hf-hub-cache mount on the same NFS volume).
Needed for the new qwen3.5-fp8-h100-sglang-agentic recipe to reuse
the pre-built mmap dataset cache across runs rather than re-mmaping
every job.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Pulls in cjq/agentx-v0.3-subagents @ baa95d73, which adds SGLang metric-name fallbacks to ServerMetricsAccumulator.realtime_snapshot so the realtime `srv prefix_cache_hit=... kv_usage=... queue=...` log row populates for sglang servers instead of being suppressed (every field was vLLM-only before).
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
| wait "$tail_pid" 2>/dev/null || true | ||
| cat "$LMCACHE_LOG" >&2 || true | ||
| exit 1 | ||
| } |
LMCache helper functions duplicated across three scripts
Medium Severity
cleanup_lmcache_server and wait_for_lmcache_ready are identically copy-pasted across three scripts (dsv4_fp4_b200_vllm.sh, kimik2.5_fp4_b200.sh, kimik2.5_fp4_mi355x.sh). Other shared helpers like resolve_trace_source, install_agentic_deps, and the new run_agentic_replay_and_write_outputs already live in benchmark_lib.sh. These LMCache helpers belong there too, reducing the risk of inconsistent bug fixes across the three copies.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 4933cf3.
Pulls in cjq/agentx-v0.3-subagents @ 006417a8, which fixes a silent regression in the realtime srv-row: counter lookups that included `_total` (e.g. `vllm:prompt_tokens_total`, `sglang:prompt_tokens_total`) never matched because `prometheus_client.parser` strips that suffix before the data collector stores the family. Server-side throughput rows were missing on every backend, not just SGLang — masked by unit tests that bypassed the parser. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
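The mismatch can be illustrated without the real collector. In the sketch below, `families` stands in for whatever mapping the data collector stores after parsing (an assumption; the actual structure is not shown in this thread), and family_name / lookup_counter are hypothetical helpers that normalize the `_total` suffix before looking a counter up.

```python
from typing import Dict, Optional

_TOTAL_SUFFIX = "_total"


def family_name(metric_name: str) -> str:
    """Drop a trailing '_total' so the key matches a parsed counter family."""
    if metric_name.endswith(_TOTAL_SUFFIX):
        return metric_name[: -len(_TOTAL_SUFFIX)]
    return metric_name


def lookup_counter(families: Dict[str, float], *candidates: str) -> Optional[float]:
    """Return the first candidate present in families, ignoring '_total'."""
    for name in candidates:
        value = families.get(family_name(name))
        if value is not None:
            return value
    return None


# Stored under the suffix-less family name, as the commit describes.
families = {"vllm:prompt_tokens": 1200.0}
assert families.get("vllm:prompt_tokens_total") is None  # the silent miss
assert lookup_counter(families, "vllm:prompt_tokens_total") == 1200.0
```

A unit test that feeds raw exposition text through the parser, rather than constructing the stored mapping by hand, is what would have caught the original regression.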
Agentic replay traces have a theoretical prefix-cache hit rate above 95% on every workload we benchmark; the realtime srv row only reads 0.0% because the launch script turns the SGLang RadixAttention cache off. Every server recipe in this directory set the cache-disabling flag (--disable-radix-cache), either as the only branch of an OFFLOADING=none case or as an unconditional launch-line flag, so the hit-rate number was never meaningful and the run was paying full prefill cost on every turn.
Removed unconditionally from: dsv4_fp4_mi355x_sglang, glm5.1_fp4_mi355x, glm5_fp8_b200, qwen3.5_bf16_b200, qwen3.5_fp8_b200, qwen3.5_fp8_mi355x.
Removed from the OFFLOADING=none branch of: qwen3.5_fp8_h100, qwen3.5_fp8_b300_sglang, qwen3.5_fp8_mi355x_sglang. Replaced with a short comment so the next person editing the `case` doesn't put it back.
OFFLOADING=none still means "no CPU/host offload"; the GPU RadixAttention cache stays on, which is the only sensible default for an agentic workload.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Pulls in cjq/agentx-v0.3-subagents @ b2d047dd, which switches the realtime srv-row prefix_cache_hit_rate fallback from SGLang's per-batch `cache_hit_rate` gauge (reads 0 between requests) to the cumulative `cached_tokens_total` / `prompt_tokens_total` counter pair, matching vLLM's `hits/queries` shape. Also unlocks unique_input_tokens_srv on SGLang. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
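The counter-pair ratio described in this commit reduces to a guarded division. This is a sketch of the arithmetic only; the function name and the choice to return None before any prompt tokens are counted are assumptions, not aiperf's implementation.

```python
from typing import Optional


def prefix_cache_hit_rate(
    cached_tokens_total: float, prompt_tokens_total: float
) -> Optional[float]:
    """Cumulative hit rate from the two counters; None until prompts arrive."""
    if prompt_tokens_total <= 0:
        return None
    return cached_tokens_total / prompt_tokens_total
```

Because both inputs are cumulative counters, the ratio does not drop to zero between requests the way a per-batch gauge does.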
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 6a77acb.
| --tokenizer-path "$MODEL" | ||
| --enable-metrics | ||
| "${CACHE_ARGS[@]}" | ||
| ) |
Missing context-length limits in H100 SGLang launcher
High Severity
The H100 SGLang launcher for Qwen3.5 omits both MAX_MODEL_LEN initialization and the --context-length flag. The sibling B300 script (qwen3.5_fp8_b300_sglang.sh) and MI355X script both default MAX_MODEL_LEN to 131072 and pass --context-length "$MAX_MODEL_LEN". Without this, SGLang will allocate KV cache for the model's full context window (potentially 512k+), which on H100's 80 GB HBM3 severely reduces usable KV blocks or causes OOM. Additionally, build_replay_cmd won't pass --max-context-length to aiperf since MAX_MODEL_LEN is unset, so over-length traces from the corpus won't be filtered client-side either.
Reviewed by Cursor Bugbot for commit 6a77acb.


Note
Medium Risk
Large CI matrix and benchmark-script changes affect production sweep behavior; LMCache/ROCm runtime patches and multinode GB300 agentic recipes add operational complexity but are confined to benchmark infrastructure.
Overview
This PR advances AgentX v0.3: agentic-coding benchmarks move from the legacy trace-replay submodule to aiperf (cjq/agentx-v0.3-subagents), with artifacts under aiperf_artifacts/ and shared replay via run_agentic_replay_and_write_outputs in benchmark_lib.sh. Workflows route non-agentic runs to fixed_seq_len/ and expand offload modes (lmcache, hicache, etc.).
Sweep configs add many agentic (and some fixed-seq) matrix entries across AMD/NVIDIA (Qwen3.5 HiCache, DSv4, Kimi, MiniMax, GB300 dynamo-vLLM disagg agentic on NV/CW). Several sweeps drop CPU-offload points for this iteration in favor of no-offload curves on a newer trace corpus; Kimi agentic on MI355X/B200/B300 shifts toward LMCache (with substantial ROCm-specific LMCache/vLLM patches on MI355X). Runner labels for mi355x are normalized (mi355x-amds_00–_08).
Benchmark scripts gain DSv4 MI355X SGLang agentic, Qwen HiCache launchers (B300/H100/MI355X), DSv4 vLLM native CPU offload tuning, and GB300 srt-slurm agentic recipes (NATS payload, Slurm mem/CPU, agentic_srt.sh / keepalive). Prefix/radix cache is enabled where agentic replay depends on it; MiniMax uses a 256k-capped Weka loader when context is limited.
Reviewed by Cursor Bugbot for commit 6a77acb. Bugbot is set up for automated code reviews on this repo.