[AMD] feat(agentic): AgentX v0.3 — Kimi MI355X LMCache MP benchmark #1565

Open
seungrokj wants to merge 11 commits into chore/agentx-v0.3 from srok/chore_agentx-v0.3

Collaborator

@seungrokj seungrokj commented May 26, 2026

Summary

AgentX v0.3 agentic benchmark for Kimi-K2.5-MXFP4 on MI355X using LMCache MP (multi-process) CPU offloading. Key changes on top of main:

  • feat(agentic): add LMCache MP for Kimi MI355X — wires up the external LMCache server path for ROCm
  • fix(agentic): avoid CUDA NIXL import on MI355X LMCache — ROCm-safe import guard
  • fix(agentic): lazily patch ROCm LMCache allocator / defer ROCm LMCache pinned expansion — avoids 10-min startup stall for 2.5 TB pinned pool
  • fix(agentic): add ROCm LMCache MP block fallback — Python fallback for multi_layer_block_kv_transfer (no CUDA kernel on ROCm)
  • fix(agentic): avoid partial LMCache import patching — ensures all ROCm patches apply atomically
  • fix(agentic): extend Kimi MI355X LMCache read lease / use final LMCache capacity on ROCm
  • fix(agentic): normalize/filter Kimi MI355X replay context — caps contexts to server window
  • fix(agentic): reduce Kimi FP4 B200 CPU DRAM limit to 1500 GB
  • AIPerf metadata and schema fixes (carry AIPerf prefix metadata, refresh AIPerf mmap cache schema, update AIPerf replay metadata)
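The import-guard and Python-fallback bullets above describe a common pattern: try the compiled path, and fall back to plain Python when it is unavailable. A minimal sketch of that pattern follows. Every name in it (`optional_import`, `copy_blocks_fallback`, `copy_blocks`, and the module `fast_kv_kernels_hypothetical`) is an assumption made for illustration and is not taken from this PR or from LMCache.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it imports cleanly, else None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def copy_blocks_fallback(src: Sequence[list], dst: list, block_ids: Sequence[int]) -> None:
    """Pure-Python stand-in for a compiled block-copy kernel (illustrative only)."""
    for i in block_ids:
        dst[i] = list(src[i])


# Hypothetical compiled extension; expected to be absent on this machine.
_kernels = optional_import("fast_kv_kernels_hypothetical")


def copy_blocks(src: Sequence[list], dst: list, block_ids: Sequence[int]) -> None:
    """Use the compiled kernel when it exists, otherwise the Python fallback."""
    if _kernels is not None and hasattr(_kernels, "copy_blocks"):
        _kernels.copy_blocks(src, dst, block_ids)
    else:
        copy_blocks_fallback(src, dst, block_ids)
```

Resolving the optional module once at import time keeps the per-call dispatch to a single attribute check.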

Benchmark Results — Run 26448139220

Environment: vLLM v0.21.0, ROCm 7.2.x, 8× MI355X (288 GB VRAM each)
Model: amd/Kimi-K2.5-MXFP4
Benchmark: AgentX v0.3 agentic trace replay, 1800s duration
Offloading: LMCache MP (2.5 TB CPU DRAM pool)

offloadlmcache — all cases (FAILED/zero-request excluded)

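| TP | Conc | #Reqs | RPS | Out tok/s | Prefix cache % | Ext cache % | GPU KV avg % |
|---|---|---|---|---|---|---|---|
| 4 | 16 | 566 | 0.310 | 156.36 | 47.6 | 65.7 | 68.1 |
| 4 | 24 | 352 | 0.190 | 95.61 | 0.0 | 84.9 | 87.1 |
| 4 | 32 | 304 | 0.170 | 87.53 | 0.0 | 79.7 | 85.2 |
| 4 | 40 | 372 | 0.210 | 96.14 | 0.0 | 82.1 | 88.3 |
| 8 | 32 | 480 | 0.270 | 133.04 | 0.0 | 83.9 | 86.9 |
| 8 | 40 | 514 | 0.280 | 124.33 | 0.0 | 83.3 | 86.5 |
| 8 | 48 | 537 | 0.300 | 141.13 | 0.0 | 83.3 | 88.3 |
| 8 | 56 | 531 | 0.290 | 143.60 | 0.0 | 82.4 | 88.3 |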

Notes:

  • Prefix cache % = GPU-local prefix cache hit rate (cumulative at end of run from server logs)
  • Ext cache % = External (LMCache CPU offload) cache hit rate
  • GPU KV avg % = Mean GPU KV cache utilization across non-idle periods
  • Excluded (0 requests): TP8/conc32 (amds_00, amds_04) and TP8/conc40 (amds_07)
  • TP4/conc16 is the only case showing GPU-local prefix cache hits (47.6%) — higher concurrencies saturate GPU KV cache quickly, evicting local blocks and relying almost entirely on LMCache external offload (~80–85%)
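  • TP8 delivers higher throughput than TP4 at the two concurrencies measured for both (conc32: 133.04 vs 87.53 tok/s; conc40: 124.33 vs 96.14 tok/s). TP8 peaks at 143.60 tok/s at conc56, while the highest throughput overall is TP4/conc16 at 156.36 tok/s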
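The TP4-versus-TP8 comparison can be checked directly from the table. The sketch below copies the "Out tok/s" column by hand and computes the TP8/TP4 ratio at each concurrency measured for both TP sizes, plus the best configuration per TP size. The variable names are illustrative only.

```python
# (TP, concurrency) -> output tokens/s, copied from the results table.
out_tok_s = {
    (4, 16): 156.36, (4, 24): 95.61, (4, 32): 87.53, (4, 40): 96.14,
    (8, 32): 133.04, (8, 40): 124.33, (8, 48): 141.13, (8, 56): 143.60,
}

# Concurrency levels that were measured at both TP sizes.
conc_tp4 = {c for tp, c in out_tok_s if tp == 4}
conc_tp8 = {c for tp, c in out_tok_s if tp == 8}
shared = sorted(conc_tp4 & conc_tp8)

# TP8 / TP4 throughput ratio at each shared concurrency.
ratios = {c: out_tok_s[(8, c)] / out_tok_s[(4, c)] for c in shared}
for c in shared:
    print(f"conc{c}: TP8/TP4 = {ratios[c]:.2f}x")

# Best (tok/s, concurrency) pair for each TP size.
best = {
    tp: max((v, c) for (t, c), v in out_tok_s.items() if t == tp)
    for tp in (4, 8)
}
print(best)
```

Only two concurrency levels overlap, so the ratio describes those two points rather than a scaling trend.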

🤖 Generated with Claude Code


Note

Medium Risk
Touches agentic replay, LMCache offloading, and large pinned-memory behavior on ROCm; mis-patches could affect startup time or KV correctness, but scope is benchmark/inference paths rather than core auth or payments.

Overview
Adds AgentX v0.3 agentic benchmarking for Kimi-K2.5-MXFP4 on MI355X with LMCache multi-process CPU offloading, including the external LMCache server path on ROCm.

ROCm-specific fixes avoid CUDA/NIXL imports, lazily patch the LMCache allocator and defer expansion of the large pinned CPU pool, apply LMCache import patches atomically, add a Python fallback for multi_layer_block_kv_transfer, tune read leases and final cache capacity, and normalize/filter replay contexts to the server window. Also lowers the Kimi FP4 B200 CPU DRAM cap to 1500 GB and updates AIPerf prefix/replay metadata and mmap cache schema.

Reviewed by Cursor Bugbot for commit 0a5d493.

seungrokj and others added 9 commits May 26, 2026 12:53
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 3fa8c2b.

Comment thread benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh
Comment thread benchmarks/benchmark_lib.sh
@seungrokj seungrokj changed the base branch from main to chore/agentx-v0.3 May 26, 2026 15:27
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@seungrokj seungrokj changed the title feat(agentic): AgentX v0.3 — Kimi MI355X LMCache MP benchmark [AMD] feat(agentic): AgentX v0.3 — Kimi MI355X LMCache MP benchmark May 26, 2026
Contributor

@claude claude Bot left a comment

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 utils/generate_aiperf_plots.py:691-709 — The two parallel arrays built at utils/generate_aiperf_plots.py:691-709 use mismatched filters: starts_ns skips records where request_start_ns is missing or falsy, while the per-record loop for ttfts_ms/e2es_ms/interactivities iterates all records unfiltered. If any record is dropped from the start-time filter, request_times_s ends up shorter than the metric arrays and ax.scatter(request_times_s, values) in panel_per_record_metric raises ValueError: x and y must be the same size. The caller (write_agentic_result_json) masks failures with || true, so metrics_plots.png silently fails to render. Either drop the filter or apply the same filter in the per-record loop.

    Extended reasoning...

    What the bug is

    utils/generate_aiperf_plots.py (new file added by this PR) builds two parallel arrays in main() from the same records iterable but applies different filters:

    # Lines 691-697: filtered on request_start_ns
    starts_ns = [
        int(r["metadata"]["request_start_ns"])
        for r in records
        if r.get("metadata", {}).get("request_start_ns")
    ]
    first_record_start = min(starts_ns) if starts_ns else 0
    request_times_s = [(s - first_record_start) / 1e9 for s in starts_ns]
    
    # Lines 700-709: NOT filtered
    ttfts_ms: list[float] = []
    e2es_ms: list[float] = []
    interactivities: list[float] = []
    for r in records:
        ttft = metric_value(r, "time_to_first_token")
        ...
        ttfts_ms.append(ttft if ttft is not None else 0.0)
        ...

    How it manifests

    If any record in records has a missing/falsy metadata.request_start_ns (truthy .get(...) check also drops 0/None), it gets excluded from starts_ns/request_times_s but still appended to ttfts_ms/e2es_ms/interactivities (with a 0.0 placeholder). The lists then have different lengths.

    Later, main() passes them paired to panel_per_record_metric (lines 749, 758, 767), which calls ax.scatter(request_times_s, values, ...) at line 582. Matplotlib's scatter enforces equal-length sequences and raises ValueError: x and y must be the same size.

    Step-by-step proof

    Consider 3 records, one missing request_start_ns:

    1. records = [R0 (start=1e9), R1 (no start), R2 (start=3e9)]
    2. starts_ns loop filters out R1: [1_000_000_000, 3_000_000_000] (len=2)
    3. request_times_s = [0.0, 2.0] (len=2)
    4. The for r in records loop visits all 3 records and appends placeholders for R1: ttfts_ms = [ttft0, 0.0, ttft2] (len=3)
    5. panel_per_record_metric(axes[4,0], request_times_s, ttfts_ms, ...) calls ax.scatter([0.0, 2.0], [ttft0, 0.0, ttft2]) → ValueError: x and y must be the same size.
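The length mismatch in the steps above can be reproduced without matplotlib. This is a sketch under stated assumptions: the record layout (a `metrics` dict next to `metadata`) and this `metric_value` body are hypothetical stand-ins, since the review quotes only the call site of the real helper.

```python
def metric_value(record, name):
    """Hypothetical stand-in for the script's helper: return a metric or None."""
    return record.get("metrics", {}).get(name)


records = [
    {"metadata": {"request_start_ns": 1_000_000_000},
     "metrics": {"time_to_first_token": 120.0}},   # R0
    {"metadata": {}, "metrics": {}},                # R1: no start time
    {"metadata": {"request_start_ns": 3_000_000_000},
     "metrics": {"time_to_first_token": 110.0}},   # R2
]

# Filtered on request_start_ns, as in the snippet quoted in the review.
starts_ns = [
    int(r["metadata"]["request_start_ns"])
    for r in records
    if r.get("metadata", {}).get("request_start_ns")
]
first_record_start = min(starts_ns) if starts_ns else 0
request_times_s = [(s - first_record_start) / 1e9 for s in starts_ns]

# Not filtered: every record contributes a value (0.0 placeholder if missing).
ttfts_ms = []
for r in records:
    ttft = metric_value(r, "time_to_first_token")
    ttfts_ms.append(ttft if ttft is not None else 0.0)

# The x and y sequences now differ in length, which scatter() rejects.
print(len(request_times_s), len(ttfts_ms))
```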

    What the impact would be

    The exception propagates out of main() and the process exits non-zero. benchmark_lib.sh's write_agentic_result_json invokes the script with || true:

    python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true

    so the launcher does NOT fail — metrics_plots.png just silently disappears from the artifact tarball with no warning in CI logs. This is precisely the silent-failure mode the bug finder flagged.

    Why this might be dead code in practice

    load_jsonl_records at line 67 already drops records with obj['error'] set, and successful aiperf records always populate metadata.request_start_ns. So under healthy runs the filter is dead code and the mismatch can't actually trigger. The defensive .get() chain itself suggests the author thought the field could be missing — but if it really can't, the filter should be dropped.

    How to fix

    Either drop the filter on starts_ns (treating request_start_ns as a hard-required key — raises loudly if missing instead of silently dropping):

    starts_ns = [int(r["metadata"]["request_start_ns"]) for r in records]

    OR mirror the same filter into the per-record loop:

    for r in records:
        if not r.get("metadata", {}).get("request_start_ns"):
            continue
        ...

    Both fixes are mechanical one-liners. The two pieces of code currently cannot both be correct simultaneously.
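One way to keep the two arrays aligned is to apply the filter once and derive both from the same filtered list. The sketch below assumes a hypothetical `aligned_series` helper and the same illustrative record layout and `metric_value` stand-in as an assumption; it is not the script's actual code.

```python
from typing import Optional


def metric_value(record: dict, name: str) -> Optional[float]:
    """Hypothetical stand-in for the script's helper of the same name."""
    return record.get("metrics", {}).get(name)


def aligned_series(records: list[dict], metric: str) -> tuple[list[float], list[float]]:
    """Return (times_s, values) built from one filtered list, so lengths always match."""
    # Apply the start-time filter once; both arrays are derived from `kept`.
    kept = [r for r in records if r.get("metadata", {}).get("request_start_ns")]
    if not kept:
        return [], []
    starts_ns = [int(r["metadata"]["request_start_ns"]) for r in kept]
    first = min(starts_ns)
    times_s = [(s - first) / 1e9 for s in starts_ns]
    values = []
    for r in kept:
        v = metric_value(r, metric)
        values.append(v if v is not None else 0.0)
    return times_s, values
```

Calling `aligned_series(records, "time_to_first_token")` once per plotted metric returns x and y sequences of equal length by construction, at the cost of recomputing the time axis for each metric.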

  • 🔴 utils/process_agentic_result.py:40 — The test fixture in utils/test_process_agentic_result.py still builds the fake HF cache directory using the old dataset name (cc-traces-weka-042026), but this PR updated _HF_DATASET in utils/process_agentic_result.py:40 to semianalysisai/cc-traces-weka-with-subagents-051926. As a result, test_processor_loads_traces_jsonl_for_theoretical_cache will fail because _hf_traces_dir() cannot find the snapshots directory under the new name, leaving theoretical_cache_hit_rate as None instead of the expected 0.5. Fix: update the fixture path in test_process_agentic_result.py (around line 408) to use datasets--semianalysisai--cc-traces-weka-with-subagents-051926.

    Extended reasoning...

    What the bug is

    This PR renamed the agentic-replay dataset constant in utils/process_agentic_result.py:40 from semianalysisai/cc-traces-weka-042026 to semianalysisai/cc-traces-weka-with-subagents-051926. The dataset name flows through to _hf_traces_dir() (utils/process_agentic_result.py:133-134) which splits _HF_DATASET on / and looks for:

    $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-with-subagents-051926/snapshots
    

    But test_processor_loads_traces_jsonl_for_theoretical_cache in utils/test_process_agentic_result.py still constructs the fake HF cache at the old path:

    snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc"
    snapshot.mkdir(parents=True)
    ...
    with open(snapshot / "traces.jsonl", "w") as f:
        for t in traces:
            f.write(json.dumps(t) + "\n")

    Step-by-step proof the test will fail

    1. The test sets HF_HUB_CACHE to a directory containing only datasets--semianalysisai--cc-traces-weka-042026/snapshots/abc/traces.jsonl.
    2. The subprocess runs process_agentic_result.py, which sees the new _HF_DATASET = "semianalysisai/cc-traces-weka-with-subagents-051926".
    3. _hf_traces_dir() computes org, name = _HF_DATASET.split("/") → ("semianalysisai", "cc-traces-weka-with-subagents-051926") and globs cache_root/"datasets--semianalysisai--cc-traces-weka-with-subagents-051926"/"snapshots".
    4. That directory does not exist in the fixture cache (only the -042026 directory does), so _hf_traces_dir() returns None.
    5. _TRACE_METADATA_CACHE is never populated, so per-request hash_ids and trace output_lengths are never attached.
    6. theoretical_cache_hit_rate is computed as None instead of the expected 0.5.
    7. The assertion assert agg["theoretical_cache_hit_rate"] == pytest.approx(0.5) fails. The companion assertion assert agg["mean_output_tokens_expected"] == pytest.approx((50 + 60 + 55 + 40 + 70) / 5) will also fail for the same root cause (no metadata loaded → no output_tokens_expected aggregated).

    Why existing code doesn't prevent it

    The two strings are not deduplicated — the production constant lives in the module under test, and the fixture rebuilds the cache directory name by hand. There's no shared constant or fixture builder that would force them to stay in sync. The PR author touched this test module (changing docstrings/paths from trace_replay/ to aiperf_artifacts/) but missed updating the dataset directory name.

    Impact

    test_processor_loads_traces_jsonl_for_theoretical_cache will fail in CI on every run until fixed. The processor's theoretical_cache_hit_rate / mean_output_tokens_expected codepath is no longer exercised, hiding any future regressions in trace-metadata loading.

    Fix

    One-line change: update the fixture path string in utils/test_process_agentic_result.py to use the new directory name (and ideally extract _HF_DATASET into a shared constant or reference process_agentic_result._HF_DATASET from the test so this can't drift again). The minimal patch:

    snapshot = hf_cache / "datasets--semianalysisai--cc-traces-weka-with-subagents-051926" / "snapshots" / "abc"

Comment thread benchmarks/single_node/agentic/kimik2.5_fp4_mi355x.sh