[AMD] feat(agentic): AgentX v0.3 — Kimi MI355X LMCache MP benchmark by seungrokj · Pull Request #1565 · SemiAnalysisAI/InferenceX

seungrokj · 2026-05-26T15:26:00Z

Summary

AgentX v0.3 agentic benchmark for Kimi-K2.5-MXFP4 on MI355X using LMCache MP (multi-process) CPU offloading. Key changes on top of main:

feat(agentic): add LMCache MP for Kimi MI355X — wires up the external LMCache server path for ROCm
fix(agentic): avoid CUDA NIXL import on MI355X LMCache — ROCm-safe import guard
fix(agentic): lazily patch ROCm LMCache allocator / defer ROCm LMCache pinned expansion — avoids 10-min startup stall for 2.5 TB pinned pool
fix(agentic): add ROCm LMCache MP block fallback — Python fallback for multi_layer_block_kv_transfer (no CUDA kernel on ROCm)
fix(agentic): avoid partial LMCache import patching — ensures all ROCm patches apply atomically
fix(agentic): extend Kimi MI355X LMCache read lease / use final LMCache capacity on ROCm
fix(agentic): normalize/filter Kimi MI355X replay context — caps contexts to server window
fix(agentic): reduce Kimi FP4 B200 CPU DRAM limit to 1500 GB
AIPerf metadata and schema fixes (carry AIPerf prefix metadata, refresh AIPerf mmap cache schema, update AIPerf replay metadata)
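
The "avoid partial LMCache import patching" item describes an all-or-nothing patching behavior. A minimal sketch of that idea follows, assuming a two-phase approach (validate every target first, then apply with rollback). The helper `apply_patches_atomically` and the `fake_allocator` / `fake_transfer` objects are hypothetical stand-ins for illustration, not LMCache APIs and not the code in this PR.

```python
from types import SimpleNamespace


def apply_patches_atomically(patches):
    """Apply every (target, attr, replacement) patch, or none of them.

    Phase 1 validates all targets before anything is modified; phase 2
    applies the patches and restores the originals if one of them fails.
    """
    # Phase 1: fail early if any attribute to be patched is missing.
    for target, attr, _ in patches:
        if not hasattr(target, attr):
            raise AttributeError(f"cannot patch missing attribute {attr!r}")

    # Phase 2: apply, remembering originals so a failure can be undone.
    applied = []
    try:
        for target, attr, replacement in patches:
            applied.append((target, attr, getattr(target, attr)))
            setattr(target, attr, replacement)
    except Exception:
        for target, attr, original in reversed(applied):
            setattr(target, attr, original)
        raise


# Hypothetical stand-ins; these are NOT real LMCache modules.
fake_allocator = SimpleNamespace(allocate=lambda n: f"cuda:{n}")
fake_transfer = SimpleNamespace(copy_blocks=lambda: "cuda-kernel")

try:
    apply_patches_atomically([
        (fake_allocator, "allocate", lambda n: f"rocm:{n}"),
        (fake_transfer, "missing_hook", lambda: "python-fallback"),
    ])
except AttributeError:
    pass

# The first patch was not applied because the second one could not be.
print(fake_allocator.allocate(1))  # cuda:1
```

Validating before mutating keeps the common failure (a renamed or missing symbol) from leaving a half-patched module behind; the rollback covers the rarer case where `setattr` itself fails.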

Benchmark Results — Run 26448139220

Environment: vLLM v0.21.0, ROCm 7.2.x, 8× MI355X (288 GB VRAM each)
Model: amd/Kimi-K2.5-MXFP4
Benchmark: AgentX v0.3 agentic trace replay, 1800s duration
Offloading: LMCache MP (2.5 TB CPU DRAM pool)

offloadlmcache — all cases (FAILED/zero-request excluded)

TP	Conc	#Reqs	RPS	Out tok/s	Prefix cache %	Ext cache %	GPU KV avg %
4	16	566	0.310	156.36	47.6	65.7	68.1
4	24	352	0.190	95.61	0.0	84.9	87.1
4	32	304	0.170	87.53	0.0	79.7	85.2
4	40	372	0.210	96.14	0.0	82.1	88.3
8	32	480	0.270	133.04	0.0	83.9	86.9
8	40	514	0.280	124.33	0.0	83.3	86.5
8	48	537	0.300	141.13	0.0	83.3	88.3
8	56	531	0.290	143.60	0.0	82.4	88.3

Notes:

Prefix cache % = GPU-local prefix cache hit rate (cumulative at end of run from server logs)
Ext cache % = External (LMCache CPU offload) cache hit rate
GPU KV avg % = Mean GPU KV cache utilization across non-idle periods
Excluded (0 requests): TP8/conc32 (amds_00, amds_04) and TP8/conc40 (amds_07)
TP4/conc16 is the only case showing GPU-local prefix cache hits (47.6%) — higher concurrencies saturate GPU KV cache quickly, evicting local blocks and relying almost entirely on LMCache external offload (~80–85%)
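TP8 delivers higher throughput than TP4 at matching concurrency (133.04 vs 87.53 tok/s at conc32, 124.33 vs 96.14 tok/s at conc40). Among the TP8 runs, conc56 peaks at 143.60 tok/s; the highest throughput overall is TP4/conc16 at 156.36 tok/s.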
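
To make the TP4 vs TP8 comparison concrete, the short sketch below recomputes throughput ratios from the Out tok/s column of the results table. The numbers are copied from the table; the variable names are illustrative only.

```python
# (TP, concurrency) -> output tokens/s, copied from the results table.
out_tok_s = {
    (4, 16): 156.36, (4, 24): 95.61, (4, 32): 87.53, (4, 40): 96.14,
    (8, 32): 133.04, (8, 40): 124.33, (8, 48): 141.13, (8, 56): 143.60,
}

# Concurrency levels that were measured at both TP4 and TP8.
tp4_conc = {c for tp, c in out_tok_s if tp == 4}
tp8_conc = {c for tp, c in out_tok_s if tp == 8}
shared = sorted(tp4_conc & tp8_conc)

# Throughput ratio TP8 / TP4 at each shared concurrency.
speedup = {c: out_tok_s[(8, c)] / out_tok_s[(4, c)] for c in shared}
for c in shared:
    print(f"conc={c}: TP8/TP4 = {speedup[c]:.2f}x")
# conc=32: TP8/TP4 = 1.52x
# conc=40: TP8/TP4 = 1.29x

# Best configuration overall and best among TP8 runs.
best = max(out_tok_s, key=out_tok_s.get)
best_tp8 = max((k for k in out_tok_s if k[0] == 8), key=out_tok_s.get)
print(best, best_tp8)  # (4, 16) (8, 56)
```

Only conc32 and conc40 were run at both tensor-parallel sizes, so the comparison rests on two data points.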

🤖 Generated with Claude Code

Note

Medium Risk
Touches agentic replay, LMCache offloading, and large pinned-memory behavior on ROCm; mis-patches could affect startup time or KV correctness, but scope is benchmark/inference paths rather than core auth or payments.

Overview
Adds AgentX v0.3 agentic benchmarking for Kimi-K2.5-MXFP4 on MI355X with LMCache multi-process CPU offloading, including the external LMCache server path on ROCm.

ROCm-specific fixes avoid CUDA/NIXL imports, lazily patch the LMCache allocator and defer expansion of the large pinned CPU pool, apply LMCache import patches atomically, add a Python fallback for multi_layer_block_kv_transfer, tune read leases and final cache capacity, and normalize/filter replay contexts to the server window. The PR also lowers the Kimi FP4 B200 CPU DRAM cap to 1500 GB and updates AIPerf prefix/replay metadata and the mmap cache schema.

Reviewed by Cursor Bugbot for commit 0a5d493.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: seungrokj <seungrok.jung@amd.com>


cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 3fa8c2b.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude

Additional findings (outside current diff — PR may have been updated during review):

🟡 utils/generate_aiperf_plots.py:691-709 — The two parallel arrays built at utils/generate_aiperf_plots.py:691-709 use mismatched filters: starts_ns skips records where request_start_ns is missing or falsy, while the per-record loop for ttfts_ms/e2es_ms/interactivities iterates all records unfiltered. If any record is dropped from the start-time filter, request_times_s ends up shorter than the metric arrays and ax.scatter(request_times_s, values) in panel_per_record_metric raises ValueError: x and y must be the same size. The caller (write_agentic_result_json) masks failures with || true, so metrics_plots.png silently fails to render. Either drop the filter or apply the same filter in the per-record loop.
Extended reasoning...

What the bug is

utils/generate_aiperf_plots.py (new file added by this PR) builds two parallel arrays in main() from the same records iterable but applies different filters:
```
# Lines 691-697: filtered on request_start_ns
starts_ns = [
    int(r["metadata"]["request_start_ns"])
    for r in records
    if r.get("metadata", {}).get("request_start_ns")
]
first_record_start = min(starts_ns) if starts_ns else 0
request_times_s = [(s - first_record_start) / 1e9 for s in starts_ns]

# Lines 700-709: NOT filtered
ttfts_ms: list[float] = []
e2es_ms: list[float] = []
interactivities: list[float] = []
for r in records:
    ttft = metric_value(r, "time_to_first_token")
    ...
    ttfts_ms.append(ttft if ttft is not None else 0.0)
    ...
```

How it manifests

If any record in records has a missing/falsy metadata.request_start_ns (truthy .get(...) check also drops 0/None), it gets excluded from starts_ns/request_times_s but still appended to ttfts_ms/e2es_ms/interactivities (with a 0.0 placeholder). The lists then have different lengths.

Later, main() passes them paired to panel_per_record_metric (lines 749, 758, 767), which calls ax.scatter(request_times_s, values, ...) at line 582. Matplotlib's scatter enforces equal-length sequences and raises ValueError: x and y must be the same size.

Step-by-step proof

Consider 3 records, one missing request_start_ns:
1. records = [R0 (start=1e9), R1 (no start), R2 (start=3e9)]
2. starts_ns loop filters out R1: [1_000_000_000, 3_000_000_000] (len=2)
3. request_times_s = [0.0, 2.0] (len=2)
4. The for r in records loop visits all 3 records and appends placeholders for R1: ttfts_ms = [ttft0, 0.0, ttft2] (len=3)
5. panel_per_record_metric(axes[4,0], request_times_s, ttfts_ms, ...) calls ax.scatter([0.0, 2.0], [ttft0, 0.0, ttft2]) → ValueError: x and y must be the same size.
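
The length mismatch can be reproduced without matplotlib. The sketch below uses made-up records and a plain dictionary lookup in place of `metric_value` (both are stand-ins for illustration, not code from the PR) and only compares the lengths of the two lists that would later be passed to `ax.scatter`:

```python
# Three stand-in records; the middle one has no request_start_ns.
# The "ttft" key is a hypothetical shortcut for metric_value(r, ...).
records = [
    {"metadata": {"request_start_ns": 1_000_000_000}, "ttft": 12.0},
    {"metadata": {}, "ttft": None},
    {"metadata": {"request_start_ns": 3_000_000_000}, "ttft": 15.0},
]

# Filtered comprehension, same shape as the excerpt from main().
starts_ns = [
    int(r["metadata"]["request_start_ns"])
    for r in records
    if r.get("metadata", {}).get("request_start_ns")
]
first_record_start = min(starts_ns) if starts_ns else 0
request_times_s = [(s - first_record_start) / 1e9 for s in starts_ns]

# Unfiltered loop: every record contributes a value or a 0.0 placeholder.
ttfts_ms = []
for r in records:
    ttft = r["ttft"]
    ttfts_ms.append(ttft if ttft is not None else 0.0)

print(request_times_s)                       # [0.0, 2.0]
print(ttfts_ms)                              # [12.0, 0.0, 15.0]
print(len(request_times_s) == len(ttfts_ms))  # False
```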

What the impact would be

The exception propagates out of main() and the process exits non-zero. benchmark_lib.sh's write_agentic_result_json invokes the script with || true:
```
python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true
```
so the launcher does NOT fail — metrics_plots.png just silently disappears from the artifact tarball with no warning in CI logs. This is precisely the silent-failure mode the bug finder flagged.

Why this might be dead code in practice

load_jsonl_records at line 67 already drops records with obj['error'] set, and successful aiperf records always populate metadata.request_start_ns. So under healthy runs the filter is dead code and the mismatch can't actually trigger. The defensive .get() chain itself suggests the author thought the field could be missing — but if it really can't, the filter should be dropped.

How to fix

Either drop the filter on starts_ns (treating request_start_ns as a hard-required key — raises loudly if missing instead of silently dropping):
```
starts_ns = [int(r["metadata"]["request_start_ns"]) for r in records]
```
OR mirror the same filter into the per-record loop:
```
for r in records:
    if not r.get("metadata", {}).get("request_start_ns"):
        continue
    ...
```
Both fixes are small, mechanical changes. As written, the filtered comprehension and the unfiltered loop cannot both be correct.
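
A variation on the second option is to select the records once and derive both series from that single selection, so the two lists cannot drift apart. The sketch below is one possible shape under that assumption; `build_aligned_series` and the `fake_metric_value` stand-in are hypothetical names, not functions from the PR.

```python
def build_aligned_series(records, metric_value):
    """Return (request_times_s, ttfts_ms) with one entry per kept record."""
    # Select records once; both series are derived from this same list.
    kept = [r for r in records if r.get("metadata", {}).get("request_start_ns")]

    starts_ns = [int(r["metadata"]["request_start_ns"]) for r in kept]
    first = min(starts_ns) if starts_ns else 0
    request_times_s = [(s - first) / 1e9 for s in starts_ns]

    ttfts_ms = []
    for r in kept:
        ttft = metric_value(r, "time_to_first_token")
        ttfts_ms.append(ttft if ttft is not None else 0.0)
    return request_times_s, ttfts_ms


def fake_metric_value(record, name):
    """Hypothetical stand-in for the script's metric_value helper."""
    return record.get("metrics", {}).get(name)


sample = [
    {"metadata": {"request_start_ns": 1_000_000_000},
     "metrics": {"time_to_first_token": 12.0}},
    {"metadata": {}, "metrics": {"time_to_first_token": 9.0}},
    {"metadata": {"request_start_ns": 3_000_000_000}, "metrics": {}},
]
times, ttfts = build_aligned_series(sample, fake_metric_value)
print(times)  # [0.0, 2.0]
print(ttfts)  # [12.0, 0.0]
```

The record without a start time is dropped from both series, and the record without a TTFT keeps its 0.0 placeholder, matching the placeholder behavior in the excerpt.
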
🔴 utils/process_agentic_result.py:40 — The test fixture in utils/test_process_agentic_result.py still builds the fake HF cache directory using the old dataset name (cc-traces-weka-042026), but this PR updated _HF_DATASET in utils/process_agentic_result.py:40 to semianalysisai/cc-traces-weka-with-subagents-051926. As a result, test_processor_loads_traces_jsonl_for_theoretical_cache will fail because _hf_traces_dir() cannot find the snapshots directory under the new name, leaving theoretical_cache_hit_rate as None instead of the expected 0.5. Fix: update the fixture path in test_process_agentic_result.py (around line 408) to use datasets--semianalysisai--cc-traces-weka-with-subagents-051926.
Extended reasoning...

What the bug is

This PR renamed the agentic-replay dataset constant in utils/process_agentic_result.py:40 from semianalysisai/cc-traces-weka-042026 to semianalysisai/cc-traces-weka-with-subagents-051926. The dataset name flows through to _hf_traces_dir() (utils/process_agentic_result.py:133-134) which splits _HF_DATASET on / and looks for:
```
$HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-with-subagents-051926/snapshots
```
But test_processor_loads_traces_jsonl_for_theoretical_cache in utils/test_process_agentic_result.py still constructs the fake HF cache at the old path:
```
snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"
snapshot.mkdir(parents=True)
...
with open(snapshot / "traces.jsonl", "w") as f:
    for t in traces:
        f.write(json.dumps(t) + "\n")
```

Step-by-step proof the test will fail
1. The test sets HF_HUB_CACHE to a directory containing only datasets--semianalysisai--cc-traces-weka-042026/snapshots/abc/traces.jsonl.
2. The subprocess runs process_agentic_result.py, which sees the new _HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926".
3. _hf_traces_dir() computes org, name = _HF_DATASET.split("/") → ("semianalysisai", "cc-traces-weka-with-subagents-051926") and globs cache_root/"datasets--semianalysisai--cc-traces-weka-with-subagents-051926"/"snapshots".
4. That directory does not exist in the fixture cache (only the -042026 directory does), so _hf_traces_dir() returns None.
5. _TRACE_METADATA_CACHE is never populated, so per-request hash_ids and trace output_lengths are never attached.
6. theoretical_cache_hit_rate is computed as None instead of the expected 0.5.
7. The assertion assert agg["theoretical_cache_hit_rate"] == pytest.approx(0.5) fails. The companion assertion assert agg["mean_output_tokens_expected"] == pytest.approx((50 + 60 + 55 + 40 + 70) / 5) will also fail for the same root cause (no metadata loaded → no output_tokens_expected aggregated).

Why existing code doesn't prevent it

The two strings are not deduplicated — the production constant lives in the module under test, and the fixture rebuilds the cache directory name by hand. There's no shared constant or fixture builder that would force them to stay in sync. The PR author touched this test module (changing docstrings/paths from trace_replay/ to aiperf_artifacts/) but missed updating the dataset directory name.

Impact

test_processor_loads_traces_jsonl_for_theoretical_cache will fail in CI on every run until fixed. The processor's theoretical_cache_hit_rate / mean_output_tokens_expected codepath is no longer exercised, hiding any future regressions in trace-metadata loading.

Fix

One-line change: update the fixture path string in utils/test_process_agentic_result.py to use the new directory name (and ideally extract _HF_DATASET into a shared constant or reference process_agentic_result._HF_DATASET from the test so this can't drift again). The minimal patch:
```
snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" / "abc"
```

seungrokj and others added 9 commits May 26, 2026 12:53

chore(agentic): annotate CPU DRAM limit comments for Kimi FP4 B200

2441f1f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(agentic): reduce Kimi FP4 B200 CPU DRAM limit to 1500 GB

461bbe7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

manual

806b3c9

Signed-off-by: seungrokj <seungrok.jung@amd.com>

manual

18eb2d5

Signed-off-by: seungrokj <seungrok.jung@amd.com>

manual

c050d08

Signed-off-by: seungrokj <seungrok.jung@amd.com>

manual

2af7377

Signed-off-by: seungrokj <seungrok.jung@amd.com>

fix(agentic): add CUDA LMCache MP patch for Kimi FP4 B200

b089e28

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

manual

2912288

Signed-off-by: seungrokj <seungrok.jung@amd.com>

manual

3fa8c2b

Signed-off-by: seungrokj <seungrok.jung@amd.com>

seungrokj requested a review from a team May 26, 2026 15:26

seungrokj requested review from 1am9trash, billishyahao, chunfangamd, jgangani, kedarpotdar-nv and yctseng0211 as code owners May 26, 2026 15:26

github-project-automation Bot added this to InferenceMAX Board May 26, 2026

cursor Bot reviewed May 26, 2026

Comment thread benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh

Comment thread benchmarks/benchmark_lib.sh

seungrokj changed the base branch from main to chore/agentx-v0.3 May 26, 2026 15:27

manual

0323ccb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

seungrokj changed the title ~~feat(agentic): AgentX v0.3 — Kimi MI355X LMCache MP benchmark~~ [AMD] feat(agentic): AgentX v0.3 — Kimi MI355X LMCache MP benchmark May 26, 2026

Merge branch 'chore/agentx-v0.3' into srok/chore_agentx-v0.3

0a5d493

claude Bot reviewed May 26, 2026

Comment thread benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh

seungrokj wants to merge 11 commits into chore/agentx-v0.3 from srok/chore_agentx-v0.3
