
[WIP] Chore/agentx v0.3 #1571

Open
cquil11 wants to merge 143 commits into main from chore/agentx-v0.3

cquil11 commented May 27, 2026

Note

Medium Risk
Large CI matrix and benchmark-script changes affect production sweep behavior; LMCache/ROCm runtime patches and multinode GB300 agentic recipes add operational complexity but are confined to benchmark infrastructure.

Overview
This PR advances AgentX v0.3: agentic-coding benchmarks move from the legacy trace-replay submodule to aiperf (cjq/agentx-v0.3-subagents), with artifacts under aiperf_artifacts/ and shared replay via run_agentic_replay_and_write_outputs in benchmark_lib.sh. Workflows route non-agentic runs to fixed_seq_len/ and expand offload modes (lmcache, hicache, etc.).

Sweep configs add many agentic (and some fixed-seq) matrix entries across AMD/NVIDIA (Qwen3.5 HiCache, DSv4, Kimi, MiniMax, GB300 dynamo-vLLM disagg agentic on NV/CW). Several sweeps drop CPU-offload points for this iteration in favor of no-offload curves on a newer trace corpus; Kimi agentic on MI355X/B200/B300 shifts toward LMCache (with substantial ROCm-specific LMCache/vLLM patches on MI355X). Runner labels for mi355x are normalized (mi355x-amds_00_08).

Benchmark scripts gain DSv4 MI355X SGLang agentic, Qwen HiCache launchers (B300/H100/MI355X), DSv4 vLLM native CPU offload tuning, and GB300 srt-slurm agentic recipes (NATS payload, Slurm mem/CPU, agentic_srt.sh / keepalive). Prefix/radix cache is enabled where agentic replay depends on it; MiniMax uses a 256k-capped Weka loader when context is limited.

Reviewed by Cursor Bugbot for commit 6a77acb. Bugbot is set up for automated code reviews on this repo.

cquil11 and others added 30 commits May 17, 2026 15:50
…loadingConnector

vLLM's --kv_offloading_backend native resolves to two different connectors
based on the VLLM_USE_SIMPLE_KV_OFFLOAD env var (see vllm/config/vllm.py:662):

  VLLM_USE_SIMPLE_KV_OFFLOAD=1  -> SimpleCPUOffloadConnector  (the path
                                    we were using; carries the popleft_n
                                    + context-overflow + completion-barrier
                                    bugs we hit on B200/B300/H200)
  unset (default)               -> OffloadingConnector        (the regular
                                    native path)
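
A minimal Python sketch of the mapping above, for illustration only. It is not vLLM's code; the function name is hypothetical, and treating any value other than the literal "1" as "unset" is an assumption.

```python
import os


def resolve_native_offload_connector(env=None):
    """Return the connector name that `native` resolves to per the table above."""
    env = os.environ if env is None else env
    # Assumption: only the literal value "1" selects the simple connector.
    if env.get("VLLM_USE_SIMPLE_KV_OFFLOAD") == "1":
        return "SimpleCPUOffloadConnector"
    # Unset (default) falls through to the regular native path.
    return "OffloadingConnector"
```

Passing an explicit dict instead of reading os.environ keeps the sketch easy to check without touching the real environment.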

This commit drops the env var and the JSON form, switching MI355X to the
shortcut form which now routes to OffloadingConnector. We're trying the
regular path here to see if it sidesteps the SimpleCPUOffloadConnector-
specific issues that have been forcing lazy_offload + workarounds.

Also drops the --kv-transfer-config JSON since the shortcut form constructs
the KVTransferConfig itself at engine startup. Keeps
--disable-hybrid-kv-cache-manager since MI355X uses --block-size=1 + AITER
which doesn't play well with the hybrid manager.
Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM
than the default v0.20.0-cu130. Image: cquil/vllm-openai:v0.21.0-8813c92.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the dsv4-fp4-b200-vllm-agentic CONC sweep (tp8 [16,32,64] + tp8
dp-attn [64,128,256]) so the two SKUs can be compared on the same trace
load. Uses the same SGLang image as the fixed-seq-len sibling
(rocm/sgl-dev:rocm720-mi35x-0363e6c-20260509-DSv4). Offload sweep is
none-only (SGLang has no equivalent of vLLM's SimpleCPUOffloadConnector
that we exercise on b200).

Launcher swaps the fixed-seq-len harness (run_benchmark_serving) for the
agentic harness (build_replay_cmd / write_agentic_result_json /
analyze_benchmark_distributions) but keeps all SGLang server flags and
SGLANG_* env vars identical to the fixed-seq-len sibling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R2 dispatch failed on all 6 b200 shards with the same enroot error during
manifest fetch:

  [INFO] Fetching image manifest list
  [INFO] Fetching image manifest
  [ERROR] Could not process JSON input
  curl: (23) Failure writing output to destination

Docker Hub confirms the image exists with a clean Docker v2 manifest, but
enroot import was being invoked as `docker://docker.io/cquil/vllm-openai:...`
because the image field had the docker.io/ prefix. Every other image entry
in the repo uses the bare `org/repo:tag` form (no docker.io/ prefix), so
this entry was the outlier. Dropping the prefix matches convention and
should let enroot resolve the registry host normally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First multi-node agentic config with the recipe local to this repo. Adds:

- Two new agentic recipes under benchmarks/multi_node/srt-slurm-recipes/
  vllm/deepseek-v4/agentic/, adapted from the corresponding 8k1k fixed-
  seq-len siblings:
    * disagg-gb300-1p6d-dep4-tp4-agentic.yaml  (low-lat conc=32, mid conc=192)
    * disagg-gb300-4p1d-dep4-dep8-24-c4096-agentic.yaml  (high-tput conc=4096)
  Both drop max-model-len, drop no-enable-prefix-caching, add DSv4
  tool/reasoning parsers, switch benchmark.type sa-bench -> custom (hands
  off to benchmarks/multi_node/agentic_srt.sh which builds the aiperf
  inferencex-agentx-mvp invocation).

- New IS_AGENTIC=1 branch at the top of runners/launch_gb300-nv.sh's
  framework conditional. Clones the cquil11/srt-slurm-nv fork (the only
  srt-slurm build that supports benchmark.type=custom) on the
  cam/sa-submission-q2-2026 branch and overlays the local agentic
  recipes into recipes/vllm/deepseek-v4/agentic/ so iteration stays in
  this repo.

- New dsv4-fp4-gb300-dynamo-vllm-agentic config entry in
  nvidia-master.yaml as a sibling of the byte-identical-to-origin/main
  dsv4-fp4-gb300-dynamo-vllm base. Three-tier sweep:
    * low-latency  (conc=32, 1p6d shape, 28 GPUs / 8 nodes)
    * mid          (conc=192, 1p6d shape, same alloc as low-lat)
    * high-tput    (conc=4096, 4p1d shape, 24 GPUs / 7 nodes)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R1 of dsv4-fp4-gb300-dynamo-vllm-agentic failed at `srtctl apply` with
two schema errors against the cquil11/srt-slurm-nv fork:

  Invalid config: {'dynamo': {'wheel': ['Unknown field.']},
                   'benchmark': {'env': {'PORT': {'value': ['Not a valid string.']}}}}

The first (dynamo.wheel) is fixed by cherry-picking commit 0060f857 from
NVIDIA upstream onto cquil11/srt-slurm-nv@cam/sa-submission-q2-2026
(adds wheel field + install scripts; pushed separately).

The second (PORT) is fixed here: env values must be strings, so
`PORT: 8000` -> `PORT: "8000"`. INFMAX_CONTAINER_WORKSPACE / RESULT_DIR
parse as strings due to their / chars, and IS_MULTINODE was already
quoted; PORT was the only bare int.
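
The string-only rule can be checked before a recipe is submitted. The sketch below is a hypothetical helper, not part of srtctl; it assumes benchmark.env has already been parsed into a Python dict (for example by a YAML loader, which turns a bare 8000 into an int).

```python
def non_string_env_values(env):
    """Return the entries whose values would fail a strings-only schema."""
    return {key: value for key, value in env.items() if not isinstance(value, str)}


def stringify_env(env):
    """Coerce every value to a string, mirroring the PORT: 8000 -> "8000" fix."""
    return {key: str(value) for key, value in env.items()}
```

With `{"PORT": 8000, "IS_MULTINODE": "1"}` as input, only PORT is flagged, which matches the single bare int described above.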

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R2 of dsv4-fp4-gb300-dynamo-vllm-agentic landed all 3 shards on
gb300-cw_N runners (CoreWeave self-hosted runners advertise both
gb300-cw AND gb300-nv labels). RUNNER_NAME%%_* resolves to gb300-cw,
which routes to runners/launch_gb300-cw.sh — but that launcher had
no IS_AGENTIC handling, so it cloned upstream NVIDIA/srt-slurm
(which lacks benchmark.type=custom) instead of the cquil11 fork.
srtctl apply then failed:

  Invalid config: {'benchmark': {'command': ['Unknown field.'],
                                  'env': ['Unknown field.']}}
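
For reference, the routing step described above (`RUNNER_NAME%%_*`) behaves like the Python sketch below. The function name is hypothetical, and the launcher path pattern is inferred from the two launcher filenames named in this commit.

```python
def launcher_for(runner_name):
    """Map a runner name such as gb300-cw_3 to its launcher script path."""
    # Bash's ${RUNNER_NAME%%_*} strips the longest suffix matching "_*",
    # which is everything from the first underscore onward.
    label = runner_name.split("_", 1)[0]
    return f"runners/launch_{label}.sh"
```

Because the label comes from the runner's name rather than from the labels it advertises, a runner named gb300-cw_N always routes to the cw launcher even when it also advertises gb300-nv.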

Mirrors the IS_AGENTIC=1 branch I added earlier to launch_gb300-nv.sh:
use cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (now patched with
dynamo.wheel support via cherry-picked NVIDIA commit 0060f857) and
overlay our local agentic recipes from
benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/.

Both gb300-nv and gb300-cw launchers now handle IS_AGENTIC identically,
so the workload runs correctly regardless of which runner picks it up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Upstream NVIDIA/srt-slurm@main has caught up on every schema feature
the agentic path needs:
  - BenchmarkType.CUSTOM + benchmark.command + benchmark.env (the
    hook that hands off to benchmarks/multi_node/agentic_srt.sh)
  - DynamoConfig.wheel (so our vllm recipes can pin the same
    ai-dynamo wheel as the fixed-seq-len path)
  - default_bash_preamble (no more "Unknown field" warning)

So we don't need the cquil11/srt-slurm-nv fork anymore. Pin to
upstream commit 127597c0e6d3 (current HEAD) for reproducibility;
bump as upstream evolves.

Also fix: `uv venv` defaults to no-pip. The upstream
prefetch-ai-dynamo-wheel.sh script (called by srtctl when a recipe
has `dynamo.wheel` set) does `python3 -m pip download`, which fails
with "No module named pip" without a seeded venv. Adding --seed
installs pip+setuptools+wheel into the venv so the prefetch path
works. R4 of dsv4-fp4-gb300-dynamo-vllm-agentic showed this error
on the gb300-cw runner immediately after the lockfile cleanup
unblocked the import_squash step.

Both gb300-cw and gb300-nv launchers updated identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R5 first-shard failure on gb300-nv runner:

  fatal: reference is not a tree: 127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74

I extrapolated the 40-char SHA from a 7-char short `127597c` shown in
git log output instead of resolving it. The real SHA is
127597c2926467db06e6707e0aa9227261c6c02a (NVIDIA/srt-slurm@main,
"Update GB300 FP8 GLM-5 recipe (#160)").

R5's gb300-cw shards didn't immediately fail on the same error —
either they hadn't reached the checkout step yet when I noticed, or
their git was more lenient about the prefix-then-garbage SHA. Either
way, the fixed SHA works for both.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… launcher

Two issues caught in R5:

1) dynamo-vllm worker rejects chat parser flags
   The worker entrypoint (different argparser than `vllm serve`) errors:
     __main__.py: error: unrecognized arguments: --enable-auto-tool-choice
     --tool-call-parser deepseek_v4
   These belong on the dynamo frontend, not the worker. In disagg, chat
   parsing happens at the frontend; workers just take tokens. The 8k1k
   sibling recipes (which work) don't set these either. I mistakenly
   ported them from the single-node launchers, which run `vllm serve`
   directly (the chat-serving entrypoint).
   Drop --tool-call-parser, --enable-auto-tool-choice, --reasoning-parser
   from both prefill and decode blocks in both agentic recipes. Keep
   --tokenizer-mode deepseek_v4 (worker DOES accept that one).

2) launch_gb300-cw.sh was missing set -e
   The fabricated SHA bug from the prior commit only surfaced on the nv
   launcher (which has set -exo pipefail). The cw launcher silently
   swallowed the failed `git checkout` and proceeded on origin/HEAD —
   which happened to be the right commit, masking the bug. Add
   `set -exo pipefail` to match the nv launcher; loud failures are
   safer than silent ones.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R6 surfaced via srtctl preflight that /scratch/models/DeepSeek-V4-Pro is
not staged on the gb300-nv cluster:

  Error: Preflight failed for ...disagg-gb300-1p6d-dep4-tp4-agentic.yaml:
  - model.path: Model alias 'deepseek-v4-pro' resolved to
    '/scratch/models/DeepSeek-V4-Pro', but that path is unavailable.

DSR1 weights ARE staged on /scratch (node-local SSD), but DSv4-Pro was
never staged there. The 806 GB DSv4-Pro checkpoint lives at
/home/sa-shared/models/DeepSeek-V4-Pro (NFS, shared across nodes).

This silently broke the existing 8k1k fixed-seq-len path for dsv4-vllm
on gb300-nv too (just hadn't been exercised against the stricter
upstream srtctl preflight). Fix is single-file: re-point the DSv4 leg
of the per-model conditional to the NFS path.

NFS is slower than /scratch but that's where the model actually lives.
Stage to /scratch and switch back if model load I/O becomes a bottleneck.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…S ELOOP

R7 of dsv4-fp4-gb300-dynamo-vllm-agentic:
  Fatal error: Symlink loop from '/home/sa-shared/models/DeepSeek-V4-Pro'
  OSError: [Errno 40] Too many levels of symbolic links

Same Vast NFS ELOOP bug we hit on the squash lockfiles in R3/R4:
the /home/sa-shared/ NFS mount returns ELOOP to workflow worker
processes (specifically those spawned through GHA runner pod ->
sbatch -> pyxis/enroot), even though the same path is a regular
directory from interactive sessions (verified via gb300-slurm +
srun on c001 — both Path.resolve() and ls succeed cleanly).

Workaround: /data/ and /home/sa-shared/ are SEPARATE mount points
backed by the SAME storage (storage-vip.vast.p03.globalai.run, with
/scratch and /scratch/home/sa-shared as the server-side paths).
Switching MODEL_PATH to /data/home/sa-shared/models/DeepSeek-V4-Pro
gives us identical files with a separate NFS client cache, which
isn't poisoned in the workflow context.
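
The path switch amounts to a prefix rewrite. A small sketch, assuming the /home/sa-shared/ to /data/home/sa-shared/ mapping stated above holds for every file under the mount; the helper name is hypothetical.

```python
OLD_PREFIX = "/home/sa-shared/"
NEW_PREFIX = "/data/home/sa-shared/"


def remap_shared_path(path):
    """Rewrite a path on the affected mount to its /data/ equivalent."""
    if path.startswith(OLD_PREFIX):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    # Paths outside the affected mount are returned unchanged.
    return path
```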

Doesn't fix the underlying Vast NFS bug — just routes around it.
Long-term: stage DSv4-Pro to /scratch/models/ (node-local SSD) like
DSR1, both for performance and to bypass this whole mount class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R7 of dsv4-fp4-gb300-dynamo-vllm-agentic had 6/8 worker srun steps
OOM-killed within 30s, with `torch.AcceleratorError: CUDA-capable
device(s) is/are busy or unavailable` (CUDA init aborts when SIGKILL
races it). sacct showed each worker step got AllocTRES mem=4G
(empirically verified on CW: default sbatch w/ --gres=gpu:4 ->
AllocTRES mem=4G; same sbatch w/ --mem=0 -> AllocTRES mem=868G).

Root cause: srt-slurm's start_srun_process doesn't pass --mem on the
container srun, so it gets cpus_per_task × DefMemPerCPU = 4 GB by
default on clusters with positive DefMemPerCPU (CW gb300 has 4096).
4 GB is wildly insufficient for a vLLM worker mmap'ing multi-GB model
weights and pinning CUDA buffers.

Fix: re-point both gb300 launchers' IS_AGENTIC clone from upstream
NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/agentic-mem-0
(96c443a), which is the same upstream commit + a single patch adding
`--mem 0` to start_srun_process when container_image is set.

Long-term: PR the --mem=0 change upstream so we can drop the fork
indirection for this feature class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R9 hit the same Vast NFS ELOOP we fixed for the model path in R8, but
this time on the squash lockfile:

  /usr/bin/bash: line 2: /home/sa-shared/gharunners/squash/<image>.sqsh.lock:
                          Too many levels of symbolic links

The /home/sa-shared/ NFS mount poisons lockfiles AND data files alike
under the workflow worker NFS session. We applied the /data/ workaround
for MODEL_PATH; now do the same for SQUASH_FILE + NGINX_SQUASH_FILE
which were still pointing at the bad mount. Both /home/sa-shared/
and /data/ are mounted from the same Vast backing storage; same files,
separate NFS client cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier I patched srt-slurm's start_srun_process to default --mem=0 on
container srun. That's the wrong layer — srtctl has a documented
top-level recipe field `srun_options:` (see docs/config-reference.md#srun_options)
that gets threaded straight through to the worker srun via
mixins/worker_stage.py:235 (`srun_options=self.runtime.srun_options`)
and start_srun_process line 248 (`for key, value in srun_options.items()`).

Switch to that mechanism:
  - Add `srun_options: {mem: "0"}` to both agentic recipes
  - Revert both launchers from the cquil11 fork pin back to upstream
    NVIDIA/srt-slurm@127597c (the fork patch in cam/agentic-mem-0 is
    now redundant; leaving the branch around as a fallback but not
    pinned in the launcher)

R9/R10 confirmed sacct still showed mem=4G per worker step despite the
launcher cloning the patched fork — likely because srtctl's uv-sync
inside the sbatch rebuilds the venv from pyproject.toml and the
editable install from src/ doesn't include code modifications the way
uv pip install -e . would. The recipe-level mechanism doesn't depend
on patching srtctl at all so this whole class of "is the patch
loaded?" question goes away.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R11 verified that srun_options.mem=0 IS now in the worker srun
cmdline (confirmed via /proc/<pid>/cmdline on the head node).
BUT sacct still showed AllocTRES mem=4G per step.

Why: the sbatch only requested `--ntasks=8` with no `--mem`, so the
JOB allocation per node is bound to cpus_per_task × DefMemPerCPU =
1 × 4 GB = 4 GB. `--mem=0` on srun means "use ALL of what the JOB
has on this node" — and the job has 4 GB. There's nothing to grow
into.

The other half of the fix is `sbatch_directives.mem=0` which emits
`#SBATCH --mem=0` in the generated sbatch script (per
src/srtctl/templates/job_script_minimal.j2:26), making SLURM
allocate all available node memory (~868 GB on CW gb300) up front.

Both layers needed:
  - sbatch_directives.mem=0 → JOB gets full node memory
  - srun_options.mem=0       → each container srun step uses it
                                (without this, srun defaults back to
                                 cpus_per_task × DefMemPerCPU = 4 GB)
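
The two-layer interaction can be written as a toy model. This is a simplification of the behavior described in this commit message, not Slurm's actual allocation logic; the function name is hypothetical and the defaults (1 CPU per task, 4 GB per CPU, 868 GB per node) are the CW gb300 figures quoted above.

```python
def step_mem_gb(sbatch_mem0, srun_mem0, cpus_per_task=1,
                def_mem_per_cpu_gb=4, node_mem_gb=868):
    """Estimate the memory a container srun step ends up with, in GB."""
    default_gb = cpus_per_task * def_mem_per_cpu_gb
    # Job level: --mem=0 on sbatch requests all node memory, otherwise the default.
    job_gb = node_mem_gb if sbatch_mem0 else default_gb
    # Step level: --mem=0 on srun uses everything the job has on the node,
    # otherwise the step falls back to the per-CPU default.
    return job_gb if srun_mem0 else min(default_gb, job_gb)
```

Only the combination with both flags set yields the full 868 GB; every other combination stays at 4 GB, which is why R11 still showed mem=4G with srun_options.mem=0 alone.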

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ation)

R12 progressed past the memory layer (sbatch_directives.mem=0 from prior
commit worked; sacct showed AllocTRES mem=868G per worker), but failed
~10 min in with etcd lease-keepalive `deadline exceeded` errors followed
by every worker SIGKILL'd at 16:36:03.

Root cause from infra.out: etcd reported `max-cpu-set: 1` at startup.
SLURM's default cpus_per_task=1 starved single-CPU etcd under load from
24 concurrent dynamo DP rank lease keep-alives (16 prefill + 8 decode).
etcd's gRPC handler couldn't process RPCs fast enough → cascading lease
deadline exceeded → workers crashed → orchestrator cancelled job →
infra step itself SIGKILL'd at 16:35:49 ("STEP 4572.2 ON
slurm-gb300-138-249 CANCELLED ... DUE to SIGNAL Killed").

Fix: sbatch_directives.cpus-per-task=72 grants every task (including
the GPU-less infra step) one CW gb300 NUMA socket. etcd now has
plenty of compute; vLLM workers also get more aux CPU for tokenizer
threads etc.

Why cw needs this and nv doesn't: nv cluster's JobDefaults includes
DefCpuPerGPU=35 → any task with --gres=gpu:N auto-gets 35*N CPUs (=
140 on a 4-GPU task). cw has no per-GPU default → tasks get
cpus_per_task=1 by default. The infra step has no --gres flag at all
so it's the worst case on cw.
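
The CPU arithmetic above, as a short sketch. It models only the defaults described in this commit message (a hypothetical helper, not Slurm's scheduler): a per-GPU default multiplies by the GPU count, and without one a task gets cpus_per_task.

```python
def default_task_cpus(gpus, def_cpu_per_gpu=None, cpus_per_task=1):
    """Estimate the CPUs a task receives under the cluster defaults described above."""
    if gpus and def_cpu_per_gpu:
        # nv-style default: DefCpuPerGPU CPUs for every requested GPU.
        return def_cpu_per_gpu * gpus
    # cw-style: no per-GPU default, so the task gets cpus_per_task.
    return cpus_per_task
```

A 4-GPU task on nv gets 35 * 4 = 140 CPUs; the same task on cw gets 1 unless cpus-per-task is raised, and the GPU-less infra step on cw gets 72 once the directive is set.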

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two changes:

1) Pin to NVIDIA cluster (drop CW)
   The dsv4-fp4-gb300-dynamo-vllm-agentic runner field was `gb300`,
   which is the generic label both NV and CW runner pools advertise
   (per gh api runners). So shards landed on either cluster, which
   meant we kept debugging the same recipe path against two different
   cluster configs (NV's DefCpuPerGPU=35 vs CW's DefMemPerCPU=4096
   with no per-GPU defaults).

   Switch to `runner: gb300-nv`, a label only the NV pool advertises.
   This matches just gb300-nv_0/1/2 going forward.

2) MODEL_PATH switched to /scratch/models/DeepSeek-V4-Pro
   The node-local SSD on NV compute nodes. Faster than the
   /data/home/sa-shared NFS path (where DSv4-Pro currently lives).
   Caveat: /scratch doesn't exist on the GHA runner pod, so srtctl
   preflight may fail with "Model alias resolved to ..., but that
   path is unavailable." We're trying this anyway to see whether
   the runner pod has /scratch mounted; if it errors, next step is
   to either (a) patch srt-slurm to add a `skip_model_preflight`
   recipe field or (b) stub a symlink on the runner pod.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agentic recipe pins MODEL_PATH=/scratch/models/DeepSeek-V4-Pro
(node-local NVMe on compute nodes). srtctl's _preflight_model
runs in-process on whatever node invokes srtctl — the GHA runner
pod, which doesn't have /scratch mounted — so it bails before
sbatch with "Model alias 'deepseek-v4-pro' resolved to
'/scratch/...', but that path is unavailable" (R14 hit this).

Switch the IS_AGENTIC=1 clone target from NVIDIA/srt-slurm@127597c
to cquil11/srt-slurm-nv@cam/no-preflight-flag (854b3fd), which
adds one CLI flag — `srtctl apply --no-preflight` — that skips
just the optional Python-level FS precheck. vLLM still fails
loudly at runtime if the path is genuinely missing on the
compute node.

The flag is only passed when IS_AGENTIC=1. Fixed-seq-len recipes
resolve model.path to an NFS path visible from the runner pod,
where the precheck is a useful sanity guard, so leave enforcement
on for them.

Fork commit:
  cquil11/srt-slurm-nv@854b3fd

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Aiperf's content-addressed mmap dataset cache (~65 GB per dataset)
needs to be persisted across runs so the first run of the day doesn't
re-tokenize + re-write it on every shard. Same pattern as
launch_h200-dgxc-slurm.sh, launch_b200-dgxc.sh, launch_mi355x-amds.sh.

Three layers wired:

1) Host paths (cluster-specific, created with 0777 so all gharunner_X
   SLURM users can write):
     gb300-nv  /data/home/sa-shared/gharunners/ai-perf-cache
     gb300-cw  /mnt/vast/ai-perf-cache

2) Both launchers export AIPERF_MMAP_CACHE_HOST_PATH and add a line to
   the generated srtslurm.yaml's default_mounts block — srt-slurm's
   runtime.py reads default_mounts via get_srtslurm_setting() and
   bind-mounts each entry into every worker container. cw already had
   a default_mounts block (for dynamo-wheels-cache); nv had none.

3) Both agentic recipes set AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache
   in benchmark.env so the aiperf process inside the container reads
   from the persistent mount instead of ~/.cache/aiperf/dataset_mmap.

Single-node launchers don't need updating — they have their own srun
--container-mounts line that already bind-mounts the cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in 45 commits from upstream/ajc/inferencex-agentx-mvp (PR #875):
  - InferenceX AgentX-MVP scenario (default corpus switched to 051226
    no-subagents 949-trace variant)
  - semianalysis_cc_traces_weka_no_subagents HF loader
  - Wrap-fill trajectory recycling + correlation-id double-recycle guard
  - DAG benchmarks, reproducible payload replay, agentic_replay E2E test
  - assorted dataset/timing fixes

Local commits preserved (no rebase). One docstring-only conflict in
src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py resolved by
taking upstream's text (more comprehensive — documents both 042026 and
051226 variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vllm/vllm-openai:v0.21.0-ubuntu2404 ships without git, but pip's
editable install (-e) of utils/aiperf invokes `git version` to record
direct_url.json provenance. Without git, every R16 shard on both
gb300-nv and gb300-cw failed at:

    + python3 -m pip install --break-system-packages -q --ignore-installed -e /infmax-workspace/utils/aiperf
      ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git version
      ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

This happens AFTER server boot is healthy and "Server is healthy - starting
benchmark" has fired, so all the upstream cluster/recipe work (preflight,
mem=0 x2 layers, etcd cpus-per-task=72, --no-preflight, /scratch model
path, NixlConnector P<->D, model load) is working end-to-end. Only the
pip install step is blocked.

Fix: prepend a `command -v git || apt-get update && apt-get install -y git`
to install_agentic_deps. Cheap no-op on images that already ship git
(AMD images, custom containers). The vLLM image's apt is functional from
inside the container so this works without container rebuild.

The -e install was introduced yesterday in e92a9bf (aiperf v0.2
migration); previously the agentic flow used kv-cache-tester which
didn't need git.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t containers

R17 surfaced two distinct failures, one per cluster:

1) gb300-cw (all 3 shards):
   aiperf rejected --public-dataset semianalysis_cc_traces_weka with
   "Scenario invariants violated ... required loader=any of
   ['semianalysis_cc_traces_weka_no_subagents', 'weka_trace']".

   Yesterday's aiperf merge (PR #875 commit fef78a96) switched the
   inferencex-agentx-mvp scenario's default corpus to the 051226
   no-subagents 949-trace variant and tightened the loader contract.
   The old name is no longer accepted.

   Fix: resolve_trace_source emits --public-dataset
   semianalysis_cc_traces_weka_no_subagents.

2) gb300-nv (all 3 shards):
   "dpkg: error: requested operation requires superuser privilege"
   from yesterday's install_agentic_deps git install path.

   The gb300-nv pyxis/enroot setup maps the calling user (sa-shared)
   into the container as non-root, while gb300-cw runs as root. The
   git install needs sudo on nv; cw is fine without.

   Fix: branch on `id -u` — apt-get directly when root, sudo apt-get
   otherwise. The vllm-base layer installs `sudo` so the binary is
   available, and the typical enroot config grants the calling user
   passwordless sudo.
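
The `id -u` branch in fix 2 can be sketched as below. This only illustrates the branching, with a hypothetical function name; it builds the command lists without running them, and a later commit in this PR reports that sudo itself fails inside the gb300-nv container.

```python
import os


def git_install_commands(euid=None):
    """Build the apt-get commands, prefixed with sudo when not running as root."""
    euid = os.geteuid() if euid is None else euid
    prefix = [] if euid == 0 else ["sudo"]
    return [
        prefix + ["apt-get", "update"],
        prefix + ["apt-get", "install", "-y", "git"],
    ]
```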

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R17/R18 made it clear that there's no clean way to install git into the
vllm/vllm-openai container at run-time on gb300-nv:

  - R16/R17: container ships without git -> pip's editable install of
    aiperf fails with "Cannot find command 'git'"
  - R18: tried `sudo apt-get install git`. gb300-nv pyxis/enroot remaps
    the calling user to uid=345200007 inside the container, and sudo
    refuses to run with "/usr/bin/sudo must be owned by uid 0 and have
    the setuid bit set" -- the setuid bit can't carry across user
    namespaces. cw container runs as root so sudo wasn't tripped there,
    but the right answer is one that works on both clusters.

The actual fix is upstream from this entirely: drop `-e`. pip's editable
install needs git only to record direct_url.json provenance; the
non-editable install just builds a wheel via hatchling and copies into
site-packages. aiperf's pyproject.toml pins version="0.8.0" rather than
deriving it from git tags, so non-editable install works without git in
any environment. We don't edit aiperf source mid-benchmark anyway --
loss of -e ergonomics is zero.

`--ignore-installed` is still needed (handles the apt-managed-blinker
distutils-uninstall pile-up) and is orthogonal to -e.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the sudo/root-detection complexity from R18 and restore -e on the
aiperf pip install. Per user direction.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The vllm/vllm-openai container ships without git; agentic_srt.sh
needs to apt-get install it because pip's install of utils/aiperf
calls `git version`. R17/R18/R19/R20 chased this on gb300-nv with
various combinations of sudo / no-sudo / drop-e / etc., all failing
because pyxis maps the calling user to uid 345200007 inside the
container and dpkg's hardcoded geteuid()!=0 check rejects every
attempt regardless of filesystem permissions.

The cleanest fix is to ask pyxis to remap us to uid 0 inside the
container, matching the gb300-cw behavior (where the container
already runs as root and apt-get install works directly). pyxis
exposes this as a per-srun flag: --container-remap-root. srt-slurm
renders empty-string srun_options as flag-only srun args (see
core/slurm.py:250 in NVIDIA/srt-slurm@127597c).

No-op on gb300-cw (cw is already remapped to root by default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Picks up cquil11/srt-slurm-nv@6e34b8b which propagates srun_options
through the benchmark_stage srun (previously only worker/frontend/
telemetry stages honored them). Required for the recipe-level
srun_options.container-remap-root: "" to apply to the benchmark.command
container — the one that runs agentic_srt.sh + apt install git.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Picks up cquil11/aiperf@9b858ae which fixes PhaseRunner.cancel()
to set all_credits_sent_event / all_credits_returned_event so the
outer runner's awaits wake immediately. Previously cancelled runs
(e.g. via --failed-request-threshold) blocked for the full phase
timeout (~1800s default) before reaching the graceful exit path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ncel)

When a workflow run is cancelled mid-flight (gh run cancel, or UI
cancel button), the launcher gets SIGTERM during its `tail -F`
wait and exits before reaching the `tar czf .../multinode_server_logs.tar.gz`
line in the main flow. The Upload server logs workflow step runs
(it has if: always()) but finds no file (if-no-files-found: ignore
silently skips), so the artifact never gets uploaded.

Fix: install an EXIT trap right after JOB_ID extraction that produces
the tarball on any exit path — normal completion, error, SIGTERM,
SIGKILL of our parent. The main-flow tar block is now an idempotent
no-op (kept for log narrative).

Applied identically to both gb300-nv and gb300-cw launchers.

The b200-dgxc launcher has the same pattern but its multi-node flow
is currently only used by other configs; leaving it alone for now
to avoid mixing unrelated changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
gb300-nv 1p6d agentic runs hit ~15% errors at conc=32 from Dynamo
NATS RPC deadline timeouts when the single prefill worker is
saturated by 32 concurrent 50-100k token prefills. Each timeout
returns HTTP 500 "Failed to generate completions: Prefill execution
failed: ... NATS request to dynamo_prefill.generate-... failed:
... deadline has elapsed" — a real failure but driven by the
single-prefill-worker capacity limit, not a regression.

At the previous 0.05 threshold the run tripped its ProfileCancel
mechanism early and produced no usable numbers. At 0.20 the run
completes and we get steady-state metrics for the ~85% of requests
that succeed; the underlying NATS saturation is a separate work
item (Dynamo deadline tuning, or more prefill workers in the
recipe, or both).
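
To make the threshold change concrete, here is a sketch of a failure-rate check. It is not aiperf's implementation: the function name is hypothetical, and whether aiperf compares with > or >= (or when it evaluates the ratio) is an assumption.

```python
def should_cancel(failed, total, threshold):
    """Return True when the observed failure rate exceeds the threshold."""
    if total == 0:
        return False
    # Assumption: strict comparison against the configured threshold.
    return failed / total > threshold
```

With roughly 15 failures per 100 requests, the check trips at 0.05 and passes at 0.20, which matches the behavior described above.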

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment on lines +294 to +295
LOGS/agentic/aiperf_artifacts/detailed_results.csv
LOGS/agentic/aiperf_artifacts/debug_trace.jsonl

🔴 The 'Upload agentic raw results' step in benchmark-multinode-tmpl.yml (lines 294-295) lists LOGS/agentic/aiperf_artifacts/detailed_results.csv and LOGS/agentic/aiperf_artifacts/debug_trace.jsonl — those filenames were produced by the removed utils/trace-replay submodule and are never written by the new aiperf pipeline. Combined with if-no-files-found: ignore, multinode agentic runs will silently upload an empty agentic_<RESULT_FILENAME> artifact, losing all per-request profile and server-metrics data. Mirror benchmark-tmpl.yml's full aiperf file list (profile_export*, server_metrics_export*, gpu_telemetry_export.jsonl, aiperf logs), translating the results/ prefix to LOGS/agentic/.

Extended reasoning...

What the bug is

The multinode template's Upload agentic raw results step was only half-migrated in this PR. The directory rename trace_replay/ → aiperf_artifacts/ was applied, but the filenames underneath it were left as the legacy trace-replay outputs:

LOGS/agentic/aiperf_artifacts/detailed_results.csv
LOGS/agentic/aiperf_artifacts/debug_trace.jsonl

Those two filenames were specific to the now-removed utils/trace-replay submodule (the kv-cache-tester scripts). This PR deletes that submodule entry from .gitmodules and removes load_trace_replay_records (which read detailed_results.csv) from utils/agentic-benchmark/scripts/analyze_benchmark_distributions.py. Grep confirms detailed_results.csv and debug_trace.jsonl appear nowhere else in the repo — nothing writes them anymore.

How it manifests

The new aiperf pipeline (wired up via benchmarks/benchmark_lib.sh:build_replay_cmd with --output-artifact-dir /aiperf_artifacts) writes an entirely different set of files: profile_export.jsonl, profile_export_aiperf.{json,csv}, profile_export_aiperf_{timeslices,aggregate,collated}.*, server_metrics_export.{json,jsonl,csv,parquet}, gpu_telemetry_export.jsonl, and logs/aiperf.log. The sibling single-node template benchmark-tmpl.yml was correctly updated in this same PR to enumerate all of those.

Why existing code doesn't prevent it

actions/upload-artifact@v7.0.1 is invoked with if-no-files-found: ignore, so a glob/path that matches zero files produces an empty artifact upload without warning. There is no schema check that the listed paths exist.

Impact

Every multinode agentic run (the new dsv4-fp4-gb300-dynamo-vllm-agentic and dsv4-fp4-gb300-cw-dynamo-vllm-agentic configs introduced by this PR, plus future multinode agentic configs) silently produces an empty agentic_<RESULT_FILENAME> artifact. The entire per-request profile (profile_export.jsonl), aiperf aggregate exports, server scrape time series, GPU telemetry, and aiperf logs from multinode jobs are lost. Downstream consumers like utils/process_agentic_result.py (which reads profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json) cannot reanalyze multinode runs after the fact.

Step-by-step proof

  1. A multinode agentic job runs and benchmarks/multi_node/agentic_srt.sh calls build_replay_cmd and run_agentic_replay_and_write_outputs.
  2. benchmark_lib.sh:1003 invokes aiperf with --output-artifact-dir /aiperf_artifacts (where = /logs/agentic from the recipe's benchmark.env).
  3. aiperf populates the directory with profile_export.jsonl, profile_export_aiperf.csv, server_metrics_export.{json,jsonl,csv,parquet}, etc. — but NOT detailed_results.csv or debug_trace.jsonl (those were trace-replay outputs).
  4. The launcher copies/tars the logs back to /LOGS/agentic/ on the runner.
  5. The Upload agentic raw results step runs with the two listed paths:
    • LOGS/agentic/aiperf_artifacts/detailed_results.csv → does not exist.
    • LOGS/agentic/aiperf_artifacts/debug_trace.jsonl → does not exist.
  6. if-no-files-found: ignore causes upload-artifact to emit an empty bundle without warning.
  7. The agentic_<RESULT_FILENAME> artifact appears in the workflow run UI but contains zero files; gh run download returns nothing.

How to fix

Mirror the file list from the correctly-updated sibling benchmark-tmpl.yml (which lists every aiperf export name plus the new lmcache_server.log and *_command.txt files), translating the results/ prefix to LOGS/agentic/. At minimum the list must include profile_export.jsonl, profile_export_aiperf.{json,csv}, profile_export_aiperf_{timeslices,aggregate,collated}.*, server_metrics_export.{json,jsonl,csv,parquet}, gpu_telemetry_export.jsonl, and aiperf_artifacts/logs/*.log.
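
Because if-no-files-found: ignore hides a stale path list, a small pre-upload check could make the mismatch visible. The following is a minimal sketch, not part of this PR: the helper name `missing_artifacts` and the shortened file tuple are assumptions, with file names taken from the list above.

```python
from pathlib import Path

# Subset of the aiperf output names listed in the comment above.
# The tuple itself is illustrative, not the workflow's authoritative list.
EXPECTED_AIPERF_FILES = (
    "profile_export.jsonl",
    "profile_export_aiperf.json",
    "profile_export_aiperf.csv",
    "server_metrics_export.json",
    "gpu_telemetry_export.jsonl",
)


def missing_artifacts(artifact_dir, expected=EXPECTED_AIPERF_FILES):
    """Return the expected file names that are absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in expected if not (root / name).is_file()]
```

A CI step could call this on LOGS/agentic/aiperf_artifacts and fail when the returned list is non-empty, so that an empty upload is reported rather than silently accepted.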

# ``hash_ids`` and ``output_length``. Built lazily from the HF dataset cache.
_TRACE_METADATA_CACHE: dict[str, list[dict]] | None = None
-_HF_DATASET = "semianalysisai/cc-traces-weka-042026"
+_HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"

🔴 The test fixture in utils/test_process_agentic_result.py (test_processor_loads_traces_jsonl_for_theoretical_cache) still hard-codes the old dataset directory name datasets--semianalysisai--cc-traces-weka-042026, but this PR renamed _HF_DATASET in process_agentic_result.py:40 to semianalysisai/cc-traces-weka-with-subagents-051926. The processor's _hf_traces_dir() now looks under the new directory name, so the fixture is never found, theoretical_cache_hit_rate stays None, and the assertions at lines 461 and 463 (== pytest.approx(0.5) and mean_output_tokens_expected == ...) will fail every CI run. Fix: update the fixture path to datasets--semianalysisai--cc-traces-weka-with-subagents-051926.

Extended reasoning...

Bug

test_processor_loads_traces_jsonl_for_theoretical_cache writes a synthetic Hugging Face snapshot to validate that process_agentic_result.py correctly walks per-trace hash_ids arrays and computes theoretical_cache_hit_rate. After this PR, the test will deterministically fail on first execution.

Root Cause

This PR changed utils/process_agentic_result.py:40 from:

_HF_DATASET = "semianalysisai/cc-traces-weka-042026"

to:

_HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926"

_hf_traces_dir() (around line 133-134) derives the on-disk cache directory from this constant via the HF naming convention datasets--{org}--{name}. So after the rename the processor looks for:

$HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-with-subagents-051926/snapshots/<rev>/traces.jsonl

But the test fixture at utils/test_process_agentic_result.py:408 still hard-codes the old name:

snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"

The other call sites in the same test file (_write_fixture, the per-run subdir test, etc.) were updated from trace_replay → aiperf_artifacts in this PR, but this particular hard-coded HF dataset directory was missed.

Step-by-Step Proof

  1. Test calls _write_fixture, then writes traces.jsonl to <tmp>/_hf/datasets--semianalysisai--cc-traces-weka-042026/snapshots/abc/traces.jsonl.
  2. Test sets HF_HUB_CACHE=<tmp>/_hf and invokes the processor.
  3. Inside _hf_traces_dir(), the code builds: Path($HF_HUB_CACHE) / f"datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" — using the new _HF_DATASET constant.
  4. That directory does not exist in the fixture (only the old-name directory does), so _hf_traces_dir() returns None.
  5. _iter_trace_blobs is never called; _TRACE_METADATA_CACHE remains empty.
  6. Without trace metadata, theoretical_cache_hit_rate is computed as None and mean_output_tokens_expected is None (or missing) in the emitted agg JSON.
  7. The assertion at line 461 (agg["theoretical_cache_hit_rate"] == pytest.approx(0.5)) compares None == 0.5 → fails.
  8. The assertion at line 463 (agg["mean_output_tokens_expected"] == pytest.approx((50+60+55+40+70)/5)) compares None to a float → fails.

Independent verifier confirmation: one verifier reproduced this by running the processor against both paths and observed that the old path produces theoretical_cache_hit_rate=None, while only the new path populates it as expected.

Fix

Rename the fixture directory in utils/test_process_agentic_result.py (around line 408) from:

snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"

to:

snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" / "abc"

No other test fixture changes are needed; the processor will then find the synthetic snapshot at the new path and the assertions will pass.
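
One way to keep the fixture from drifting again is to derive the directory name from the dataset id instead of hard-coding it. This is a sketch under stated assumptions: `hf_dataset_cache_dir` is a hypothetical helper, and it relies only on the datasets--{org}--{name} convention described above, not on the actual body of _hf_traces_dir().

```python
from pathlib import Path


def hf_dataset_cache_dir(hub_cache, repo_id):
    """Map an 'org/name' dataset id to its snapshots directory in the HF hub cache.

    Follows the datasets--{org}--{name} naming convention quoted in the review.
    """
    org, name = repo_id.split("/", 1)
    return Path(hub_cache) / f"datasets--{org}--{name}" / "snapshots"
```

A test that builds its fixture path from the processor's _HF_DATASET constant through a helper like this would follow future renames automatically.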

…n/ subdir

Match the existing benchmarks/single_node/agentic/ split: all 111 non-
agentic per-cluster launch scripts move into benchmarks/single_node/
fixed_seq_len/. chat_templates/ stays at single_node/chat_templates/ as
a shared resource (referenced by both agentic and fixed_seq_len scripts).

Plumbing:
- .github/workflows/benchmark-tmpl.yml + benchmark-multinode-tmpl.yml:
  SCENARIO_SUBDIR default flips from '' to 'fixed_seq_len/'.
- runners/launch_mi355x-amds.sh: parameter-expansion fallback also
  defaults to fixed_seq_len/ so direct invocations (without the
  workflow setting SCENARIO_SUBDIR) still resolve.
- Each moved script's `source "$(dirname \"$0\")/../benchmark_lib.sh"`
  becomes `../../benchmark_lib.sh`.
- dsv4_fp4_mi355x_sglang.sh's --chat-template path becomes
  `../chat_templates/...` (matches the agentic copy's pattern).
- .github/configs/{nvidia,amd}-master.yaml: forward-looking comments
  repath to fixed_seq_len/. perf-changelog.yaml historical entries
  left untouched (they describe paths at the time of the change).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Comment thread .github/configs/amd-master.yaml
cquil11 and others added 4 commits May 27, 2026 11:51
…xups

Resolutions:
- perf-changelog.yaml: took main verbatim.
- runners/launch_b300-nv.sh: took main (drops --nodelist pin entirely;
  supersedes our narrower 017-019 fix).
- benchmarks/single_node/fixed_seq_len/dsv4_fp8_mi355x{,_vllm}.sh:
  accepted main's deletes (orphan recipes removed in #1374, #1501).
- .github/configs/amd-master.yaml: took main as the base, then re-applied
  our agentic-only additions on top:
    * qwen3.5-fp8-mi355x-sglang-agentic-hicache  (new entry)
    * dsv4-fp4-mi355x-vllm-agentic               (new entry)
    * dsv4-fp4-mi355x-sglang-agentic             (new entry)
    * kimik2.5-fp4-mi355x-vllm-agentic           (cpu -> lmcache)
  Dropped our comment-path edit for dsv4_fp8_mi355x_vllm.sh since main
  deleted that entry.

Fixed_seq_len reorg fixups for files added on main during our branch's
lifetime:
- git mv 14 stranded scripts from benchmarks/single_node/*.sh into
  benchmarks/single_node/fixed_seq_len/ (dsr1_fp4_b200_mtp,
  dsr1_fp4_mi355x_mtp, dsr1_fp8_h200_mtp, dsr1_fp8_mi325x_mtp,
  dsr1_fp8_mi355x_mtp, dsv4_fp4_mi355x_vllm, glm5_fp8_h200_mtp,
  glm5_fp8_mi325x, glm5_fp8_mi325x_mtp, qwen3.5_bf16_mi325x_mtp,
  qwen3.5_fp4_mi355x_mtp, qwen3.5_fp8_h100, qwen3.5_fp8_h100_mtp,
  qwen3.5_fp8_mi325x_mtp). Patched their source paths from
  ../benchmark_lib.sh to ../../benchmark_lib.sh.
- runners/launch_mi355x-amds.sh: multinode-non-disagg BENCHMARK_SUBDIR
  bumped from `single_node` to `single_node/fixed_seq_len`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh
cquil11 and others added 2 commits May 27, 2026 13:10
Per-recipe scripts had stale `VAR=${VAR:-default}` lines for variables
that are either reliably plumbed by the workflow template or completely
unused. The defaults masked missing-env bugs (the workflow could forget
to plumb a var and the script would silently fall back to a stale local
default instead of failing loudly) and left dead lines hanging around
from the pre-aiperf-v0.2 era.

benchmarks/benchmark_lib.sh:
  - PORT: new `export PORT="${PORT:-8888}"` near the top so a single
    source of truth governs the server port. Launchers that need a
    non-default value (launch_mi355x-amds.sh derives PORT from
    RUNNER_NAME to avoid collisions across concurrent gh-runners) set
    PORT themselves; the `:-` fallback only kicks in if nothing
    upstream set it.
  - build_replay_cmd: `local duration="${DURATION:-1800}"` -> `"$DURATION"`
    (DURATION is now a check_env_vars-enforced requirement in callers).

benchmarks/single_node/agentic/*.sh (32 scripts) and
benchmarks/multi_node/agentic_srt.sh:
  - Removed: PORT=${PORT:-8888} (benchmark_lib owns it now).
  - Removed: DURATION/EP_SIZE/DP_ATTENTION defaults; added each to
    check_env_vars in the scripts that consume them. DURATION is
    consumed by build_replay_cmd in benchmark_lib, so every agentic
    script now requires it explicitly.
  - Removed: MAX_DELAY/ADVANCE_MIN/ADVANCE_MAX. These were CLI args to
    the old trace_replay_tester.py (commit b7ae440); the aiperf v0.2
    migration (commit e92a9bf) dropped all consumption but left the
    top-of-script var-definitions behind. Pure dead code.
  - Kept: SCHEDULER_RECV_INTERVAL (per-model sglang server tuning,
    not workflow-plumbed; values vary 5/10/30 per recipe).

benchmarks/single_node/fixed_seq_len/*.sh (120 scripts):
  - Removed: PORT=${PORT:-8888} only. fixed_seq_len's check_env_vars
    block already requires what it uses (DP_ATTENTION/EP_SIZE/ISL/OSL/
    RANDOM_RANGE_RATIO/RESULT_FILENAME) per the existing convention;
    no further changes needed.

Net: 343 deletions, 46 insertions across 154 files; no behavior change
on any green CI path (workflow input defaults match the removed local
defaults). Behavior change only when an upstream caller fails to set
DURATION/EP_SIZE/DP_ATTENTION on an agentic recipe -- which now fails
loudly via check_env_vars instead of silently inheriting a stale value.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
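
The fail-loudly behavior described in this commit can be illustrated with a Python analogue. This is a sketch only: the real check_env_vars is a shell helper in benchmark_lib.sh whose exact behavior is not shown here, and `require_env` is a hypothetical name.

```python
import os


def require_env(names, environ=None):
    """Return the named variables, raising if any is unset or empty.

    Unlike a VAR=${VAR:-default} fallback, nothing is substituted silently.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

With this shape, a caller that forgets DURATION gets an immediate error naming the variable instead of a run that quietly uses a stale default.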
Matches the existing pattern from launch_{b200-dgxc,h200-dgxc-slurm,
gb300-{nv,cw},mi355x-amds}.sh: define AIPERF_MMAP_CACHE_HOST_PATH on the
host, mount it to /aiperf_mmap_cache inside the container, and expose
AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache via --export so aiperf's
DatasetLoaderManager finds it. Lets agentic benchmarks reuse the
pre-built mmap dataset cache instead of re-mmaping every run.

- h200-nb: /mnt/data/gharunners/ai-perf-cache (sibling of hf-hub-cache)
- h200-cw: /mnt/vast/gharunner/ai-perf-cache (sibling of hf-hub-cache)

Host-side directories will be created out-of-band before next run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
OFFLOAD_ARGS=(
--kv-transfer-config
"{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$LMCACHE_HOST\",\"lmcache.mp.port\":$LMCACHE_PORT}}"
)

Dead code after explicit exit 1 in disabled branch

Low Severity

The lmcache-mp case in the OFFLOADING switch immediately calls exit 1 (line 140) to disable the path, but ~47 lines of live server-startup code follow after that exit 1 — including agentic_pip_install, LMCache server launch, wait_for_lmcache_ready, and OFFLOAD_ARGS construction. All of it is permanently unreachable. The comment says to "re-enable after PR #3261 merges", but the implementation was left as dead statements rather than being commented out, which gives the misleading impression that the code runs.

Reviewed by Cursor Bugbot for commit a98fcaa.

cquil11 and others added 2 commits May 27, 2026 14:43
… corpus

Adds a per-recipe override hook in benchmark_lib.sh's resolve_trace_source:
recipes set WEKA_LOADER_OVERRIDE to one of the aiperf public-dataset loader
names allowed by the inferencex-agentx-mvp scenario, and resolve_trace_source
swaps both the --public-dataset flag and the HF dataset pre-download to match.
Default remains semianalysis_cc_traces_weka_with_subagents (052726, 472
traces). Unknown overrides fail loudly with the allowed-values hint.

Wires the new override into all 8 minimaxm2.5 agentic recipes
(minimaxm2.5_fp{4,8}_{b200,b300,h100,h200,mi300x,mi325x,mi355x}.sh) to
use semianalysis_cc_traces_weka_with_subagents_256k -- the 256k-capped
variant (051926-256k, 217 traces, max in+out <= 256k by construction).
MiniMax-M2.5 servers run at max_model_len ~256k, so the unfiltered 052726
corpus would have its longest requests rejected.

Submodule bump: utils/aiperf -> 6fc5f5d6 registers the new loader name in
plugins.yaml and adds it to inferencex_agentx_mvp's require_loader tuple.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Mirrors aiperf 519580fb: the semianalysis_cc_traces_weka_with_subagents_256k
loader now points at semianalysisai/cc-traces-weka-with-subagents-052726-256k
(470 traces) instead of the earlier 051926-256k (217 traces). Loader name
and override env var (WEKA_LOADER_OVERRIDE) unchanged.

- benchmark_lib.sh resolve_trace_source: case-statement HF repo path
  bumped to ...052726-256k for the _256k loader.
- All 8 minimaxm2.5_*.sh agentic recipe comments: trace count 217 -> 470.
- utils/aiperf submodule pointer -> 519580fb.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Comment thread benchmarks/benchmark_lib.sh
The proxy occasionally records the same logical request twice. On the
472-session par<=5 sample, 2,339 of 115,593 rows (2.0%) are byte-
identical duplicates of a prior row in the same session — 1,923 are
main-agent turns and 416 are subagent inner requests. 275 of 472
sessions (58%) have at least one duplicate. Worst session has 165
dup rows.

Without deduping, the weka conversion silently inflates token counts,
request counts, and throughput by ~2%, and the converter misclassifies
duplicate-pair rows as "two requests started at the same nanosecond"
when grouping subagents.

Fingerprint: (timestamp, model, input_tokens, output_tokens,
duration_ms, agent_id). On the 2,339 detected pairs, 100% are also
byte-identical when full JSON is serialized, so the fingerprint
produces zero false positives.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
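
The fingerprint-based dedup described in this commit could look roughly like the following. It is a minimal sketch assuming each row is a dict carrying the six fingerprint fields; the function name `dedupe_session` is hypothetical and the converter's real code is not shown in this thread.

```python
# Fields named in the commit message above.
FINGERPRINT_FIELDS = (
    "timestamp",
    "model",
    "input_tokens",
    "output_tokens",
    "duration_ms",
    "agent_id",
)


def dedupe_session(rows):
    """Drop rows whose fingerprint repeats an earlier row in the same session."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in FINGERPRINT_FIELDS)
        if key in seen:
            continue  # duplicate of a prior row; keep the first occurrence only
        seen.add(key)
        unique.append(row)
    return unique
```

The function is meant to be applied per session, since the commit describes duplicates relative to prior rows in the same session.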
Comment thread benchmarks/multi_node/agentic_srt.sh
cquil11 and others added 4 commits May 27, 2026 16:28
…0.21.0

v0.20.2's bundled huggingface_hub==1.14.0 silently fetches Git-LFS
pointer files instead of LFS content for `hf download --repo-type
dataset`. Every kimik2.5-fp4-b200-vllm-agentic job in run 26536606210
hit "pyarrow.lib.ArrowInvalid: JSON parse error: Missing a name for
object member. in row 0" -- the signature of pyarrow trying to parse
the literal `version https://git-lfs.github.com/spec/v1` line of an
LFS pointer file as JSON.

b200-dgxc has no persistent /mnt/hf_hub_cache mount (per launcher
diff), so every container re-downloads the dataset and re-hits the
bug. v0.21.0 ships a newer huggingface_hub that resolves LFS
correctly. v0.20.x's flashinfer fix for the max_model_len=131072 +
prefix-caching warmup crash is included in v0.21.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
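
A cheap guard against the failure mode in this commit is to check whether a downloaded file is still a Git-LFS pointer before handing it to a parser. The sketch below assumes only the pointer header line quoted above; `is_lfs_pointer` is a hypothetical helper and is not part of the PR.

```python
# First line of a Git-LFS pointer file, as quoted in the commit message.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git-LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Running this on the downloaded dataset files would turn the opaque pyarrow JSON parse error into a direct message that the LFS content was never fetched.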
New agentic-coding recipe targeting H100 (runner: h100-dgxc) running
Qwen3.5-397B-A17B FP8 via SGLang v0.5.12-cu130. Mirrors the b300 SGLang
agentic shape with H100-appropriate kernel flags:

- attention-backend: flashinfer (sm_90; trtllm_mha is Blackwell-only).
- mem-fraction-static 0.75 (vs 0.80 on B300) and chunked-prefill-size
  8192 (vs 16384) to fit Qwen-397B FP8 weights + KV in H100's 80 GB
  HBM3 at TP=8.
- conc-list capped at 16 across both arms; agentic ISLs hit ~80k-200k
  on the 256k corpus and Qwen at conc=32 OOM'd in the fixed_seq_len
  sweep at lower ISL too.

Recipe wires WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_256k
so the 256k-capped variant (470 traces, max in+out <= 256k) is used
instead of the unfiltered 052726 corpus (which has up to ~1M-token
requests the H100 max_model_len=131k server would reject).

Two sweep arms:
  - none:    --disable-radix-cache, conc-list [1, 2, 4, 8, 16]
  - hicache: --enable-hierarchical-cache + sized from TOTAL_CPU_DRAM_GB,
             conc-list [4, 8, 16] (capped where hicache stabilizes)

Yaml key is qwen3.5-fp8-h100-sglang-agentic; script filename is the
bare `qwen3.5_fp8_h100.sh` under benchmarks/single_node/agentic/ —
the h100 launchers don't support framework-tagged script names, and
this matches the precedent set by qwen3.5_fp8_b200.sh (which is the
sglang-agentic recipe under the same bare name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Matches the same pattern as launch_b200-dgxc, launch_h200-dgxc-slurm,
launch_gb300-{nv,cw}, launch_mi355x-amds, launch_h200-{nb,cw}: define
AIPERF_MMAP_CACHE_HOST_PATH on the host, bind-mount it to
/aiperf_mmap_cache in the container, and expose
AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache via --export.

Host path: /mnt/nfs/sa-shared/gharunners/ai-perf-cache (sibling of
the existing hf-hub-cache mount on the same NFS volume).

Needed for the new qwen3.5-fp8-h100-sglang-agentic recipe to reuse
the pre-built mmap dataset cache across runs rather than re-mmaping
every job.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Pulls in cjq/agentx-v0.3-subagents @ baa95d73, which adds SGLang
metric-name fallbacks to ServerMetricsAccumulator.realtime_snapshot
so the realtime `srv prefix_cache_hit=... kv_usage=... queue=...`
log row populates for sglang servers instead of being suppressed
(every field was vLLM-only before).

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
wait "$tail_pid" 2>/dev/null || true
cat "$LMCACHE_LOG" >&2 || true
exit 1
}

LMCache helper functions duplicated across three scripts

Medium Severity

cleanup_lmcache_server and wait_for_lmcache_ready are identically copy-pasted across three scripts (dsv4_fp4_b200_vllm.sh, kimik2.5_fp4_b200.sh, kimik2.5_fp4_mi355x.sh). Other shared helpers like resolve_trace_source, install_agentic_deps, and the new run_agentic_replay_and_write_outputs already live in benchmark_lib.sh. These LMCache helpers belong there too, reducing the risk of inconsistent bug fixes across the three copies.

Additional Locations (2)

Reviewed by Cursor Bugbot for commit 4933cf3.

cquil11 added 2 commits May 27, 2026 17:55
Pulls in cjq/agentx-v0.3-subagents @ 006417a8, which fixes a silent
regression in the realtime srv-row: counter lookups that included
`_total` (e.g. `vllm:prompt_tokens_total`, `sglang:prompt_tokens_total`)
never matched because `prometheus_client.parser` strips that suffix
before the data collector stores the family. Server-side throughput
rows were missing on every backend, not just SGLang — masked by unit
tests that bypassed the parser.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
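
The lookup mismatch described in this commit can be sketched as a name normalization step. This is an illustration under assumptions, not the aiperf fix itself: the store is modeled as a plain dict keyed by family name, and both helper names are hypothetical.

```python
def family_name(metric_name):
    """Strip a trailing '_total' so the name matches the stored counter family."""
    suffix = "_total"
    if metric_name.endswith(suffix):
        return metric_name[: -len(suffix)]
    return metric_name


def lookup_counter(families, metric_name):
    """Look up a counter by either its sample name or its family name."""
    return families.get(family_name(metric_name))
```

With this normalization, a query for `vllm:prompt_tokens_total` and a query for `vllm:prompt_tokens` resolve to the same stored family.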
Agentic replay traces have a theoretical prefix-cache hit rate above
95% on every workload we benchmark; the realtime srv row only reads
0.0% because the launch script turns the SGLang RadixAttention cache
off via --disable-radix-cache. Every server recipe in this directory
passed that flag, either as the only branch of an OFFLOADING=none case
or as an unconditional launch-line flag, so the hit-rate number was
never meaningful and the run was paying full prefill cost on every turn.

Removed unconditionally from: dsv4_fp4_mi355x_sglang,
glm5.1_fp4_mi355x, glm5_fp8_b200, qwen3.5_bf16_b200, qwen3.5_fp8_b200,
qwen3.5_fp8_mi355x.

Removed from the OFFLOADING=none branch of: qwen3.5_fp8_h100,
qwen3.5_fp8_b300_sglang, qwen3.5_fp8_mi355x_sglang. Replaced with a
short comment so the next person editing the `case` doesn't put it
back. OFFLOADING=none still means "no CPU/host offload"; the GPU
RadixAttention cache stays on, which is the only sensible default
for an agentic workload.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Comment thread .github/configs/nvidia-master.yaml
cquil11 added 5 commits May 27, 2026 18:11
Pulls in cjq/agentx-v0.3-subagents @ b2d047dd, which switches the
realtime srv-row prefix_cache_hit_rate fallback from SGLang's
per-batch `cache_hit_rate` gauge (reads 0 between requests) to the
cumulative `cached_tokens_total` / `prompt_tokens_total` counter
pair, matching vLLM's `hits/queries` shape. Also unlocks
unique_input_tokens_srv on SGLang.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
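
The counter-pair fallback described in this commit amounts to a ratio of two cumulative counters. The sketch below assumes the two values have already been scraped; the function name and the choice to return None when no prompt tokens have been seen are assumptions, not aiperf's actual behavior.

```python
def cumulative_hit_rate(cached_tokens_total, prompt_tokens_total):
    """Prefix-cache hit rate from cumulative counters.

    Returns None until at least one prompt token has been counted.
    """
    if prompt_tokens_total <= 0:
        return None
    return cached_tokens_total / prompt_tokens_total
```

Because both inputs only grow, the ratio stays meaningful between requests, which is the property the per-batch gauge lacks.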
cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 6a77acb.

--tokenizer-path "$MODEL"
--enable-metrics
"${CACHE_ARGS[@]}"
)

Missing context-length limits in H100 SGLang launcher

High Severity

The H100 SGLang launcher for Qwen3.5 omits both MAX_MODEL_LEN initialization and the --context-length flag. The sibling B300 script (qwen3.5_fp8_b300_sglang.sh) and MI355X script both default MAX_MODEL_LEN to 131072 and pass --context-length "$MAX_MODEL_LEN". Without this, SGLang will allocate KV cache for the model's full context window (potentially 512k+), which on H100's 80 GB HBM3 severely reduces usable KV blocks or causes OOM. Additionally, build_replay_cmd won't pass --max-context-length to aiperf since MAX_MODEL_LEN is unset, so over-length traces from the corpus won't be filtered client-side either.
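
The client-side half of this issue can be sketched as follows. It is a Python illustration of the conditional described above, not the shell code in build_replay_cmd: the function name is hypothetical, and only the --max-context-length flag name is taken from this comment.

```python
def context_length_args(max_model_len):
    """Build the replay CLI fragment that caps context length.

    Returns an empty list when no limit is configured, which is the
    situation the H100 launcher is in when MAX_MODEL_LEN is unset.
    """
    if not max_model_len:
        return []
    return ["--max-context-length", str(max_model_len)]
```

Defaulting MAX_MODEL_LEN to 131072 in the launcher, as the sibling scripts do, would make this fragment non-empty and let over-length traces be filtered before they reach the server.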

Reviewed by Cursor Bugbot for commit 6a77acb.
