[Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency by simondanielsson · Pull Request #40344 · vllm-project/vllm

simondanielsson · 2026-04-20T10:11:27Z

Purpose

Several parts of the MoRI-IO connector code can cause the connector to hang indefinitely. This PR resolves them so that we can run at least 512 concurrent requests in both READ and WRITE modes.

Co-developed with @ichbinblau @chunfangamd

Implementation details


Disable MoRI's in-band notifications (set enable_notification=False in RdmaBackendConfig), since we use ZMQ for completion notifications anyway. Under high concurrency these notifications exhaust the QP send queue and poison the transfer statuses, leaving requests stuck in WAITING_FOR_REMOTE_KVS.
Handle Failed() transfers in _pop_done_transfers so that prefill frees its blocks. Previously these transfers stayed in _recving_transfers forever.
Keep the transfer_id<->request_id mapping on the producer side until the transfer has been marked as completed (finished_sending or notification from decode). Otherwise no requests are marked as done sending, so the scheduler never frees them via get_finished (except through the backup freeing path in update_connector_output).
Reap KV blocks after a configurable deadline for finished requests that were never notified to free their blocks, for instance due to ibv_post_send failures.
(WRITE mode only) Fix a race condition caused by a mismatch between when requests are added to transfer_id_to_request_id and when they are popped using pop_finished_write_req_ids.
(READ mode only) Don't report READ requests on the D side as part of done_recving, since these requests are scheduled as RUNNING immediately (and never as WAITING_FOR_REMOTE_KVS). Reporting them triggers an assertion error.
Replace the infinite status.Wait() busy-spin with polling bounded by a deadline.
Add a 1 ms sleep in busy-spinning while True loops.
Fix inconsistent use of request ids versus transfer ids throughout the code.
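
To illustrate the deadline-based items above, here is a minimal sketch of (a) polling a transfer status with a deadline and a short sleep instead of blocking on Wait(), and (b) reaping requests whose blocks were never freed. It is not the code in this PR: the Succeeded() predicate, the TransferOutcome enum, and the names poll_transfer, reap_expired, and finished_at are hypothetical and exist only for illustration.

```python
import time
from enum import Enum, auto


class TransferOutcome(Enum):
    SUCCEEDED = auto()
    FAILED = auto()
    TIMED_OUT = auto()


def poll_transfer(status, timeout_s=30.0, poll_interval_s=0.001,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll a transfer status until it finishes, fails, or the deadline passes.

    Assumption: `status` exposes Succeeded() and Failed() predicates.
    This mirrors the description above, not a confirmed MoRI API.
    """
    deadline = clock() + timeout_s
    while True:
        if status.Failed():
            # The caller should free the blocks instead of waiting forever.
            return TransferOutcome.FAILED
        if status.Succeeded():
            return TransferOutcome.SUCCEEDED
        if clock() >= deadline:
            return TransferOutcome.TIMED_OUT
        sleep(poll_interval_s)  # yield the CPU instead of busy-spinning


def reap_expired(finished_at, deadline_s, now):
    """Return request ids that finished at least `deadline_s` seconds ago.

    `finished_at` maps request id -> time the request finished sending.
    Expired entries are removed from the mapping so they are reaped once.
    """
    expired = [rid for rid, t in finished_at.items() if now - t >= deadline_s]
    for rid in expired:
        del finished_at[rid]
    return expired
```

Returning an explicit outcome rather than raising lets the caller treat FAILED and TIMED_OUT the same way (release the blocks) while still logging them differently.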

Test Plan

Build an image from this branch or, if you are on MI300X with Thor2 NICs, pull the image I built:

docker pull ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes

Or we can build it from source for MI300X with Thor2:


docker build \
    -f docker/Dockerfile.rocm_base \
    --build-arg MORI_GPU_ARCHS="gfx942" \
    --build-arg PYTORCH_ROCM_ARCH="gfx942" \
    -t rocm/vllm-dev:base \
    .

docker build \
  -f docker/Dockerfile.rocm \
  --build-arg BASE_IMAGE=rocm/vllm-dev:base \
  -t vllm/vllm-openai-rocm:local \
  .

If your host is equipped with slightly older bnxt-re kernel module versions you might need to uninstall the RDMA userspace libraries shipped with the official image and install a version that works for you:

# docker/Dockerfile.rocm_dev
ARG BASE_IMAGE=vllm/vllm-openai-rocm:local
FROM ${BASE_IMAGE}

# RDMA userspace libraries required by MoRI-IO.
# libibverbs/librdmacm may already be present in the base image;
# apt-get is idempotent so this is safe either way.
RUN apt-get update -q -y && apt-get install -q -y \
        librdmacm1 \
        libibverbs1 \
        ibverbs-providers \
        ibverbs-utils \
        libibverbs-dev \
        autoconf \
        libtool \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*

# Remove the pre-installed bnxt-rocelib 235.x userspace libraries and apt
# package so they don't conflict with the older 230.x build below.
RUN apt-get update -q -y \
    && apt-get remove -y bnxt-rocelib 2>/dev/null || true \
    && rm -rf /var/lib/apt/lists/* \
    && find /usr/local/lib /usr/local/lib/x86_64-linux-gnu \
         /usr/lib /usr/lib64 /usr/lib/x86_64-linux-gnu \
         -name "libbnxt_re*" -delete 2>/dev/null || true \
    && rm -f /etc/libibverbs.d/bnxt_re.driver 2>/dev/null || true \
    && ldconfig


# Thor2 (Broadcom BCM5760x) RDMA user-space driver (libbnxt_re).
# The inbox libbnxt_re-rdmav*.so files were deleted in the previous step, so
# the vendor build is the one found by libibverbs provider discovery.
RUN wget -q \
        https://docs.broadcom.com/docs-and-downloads/ethernet-network-adapters/NXE/Thor2/GCA1/bcm5760x_230.2.52.0a.zip \
    && unzip -q bcm5760x_230.2.52.0a.zip \
    && cd bcm5760x_230.2.52.0a/drivers_linux/bnxt_rocelib/ \
    && tar -xf "$(find . -name 'libbnxt*.tar.gz' | head -n 1)" \
    && cd "$(find . -maxdepth 1 -type d -name 'libbnxt*' ! -name '*.tar.gz' | head -n 1)" \
    && sh autogen.sh \
    && ./configure \
    && make \
    && make install all \
    && echo /usr/local/lib >> /etc/ld.so.conf \
    && ldconfig \
    && cp -f bnxt_re.driver /etc/libibverbs.d/ \
    && cd / \
    && rm -rf /bcm5760x_230.2.52.0a /bcm5760x_230.2.52.0a.zip

Then build:

docker build \
    -f docker/Dockerfile.rocm_dev \
    -t ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
    .

Tested on a 1P1D deployment (one prefill node, one decode node) with (a) vllm bench serve at 256 and 512 concurrency, and (b) GSM8k:

# Set on both nodes before running any command
export PREFILL_IP=<node1-ip>
export DECODE_IP=<node2-ip>

# Node 1 (prefill node) — command 1: start toy proxy
docker run -d \
  --name moriio-toy-proxy \
  --network host \
  --rm \
  --entrypoint bash \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  -c "pip install --quiet --ignore-installed quart aiohttp msgpack && \
           python3 -u /app/vllm/examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py"

# Node 1 (prefill node) — command 2: start prefill instance
docker run \
  --rm \
  --name moriio-prefill \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_USE_V1=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
  -e VLLM_USE_AITER_TRITON_SILU_MUL=0 \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  deepseek-ai/DeepSeek-R1-0528 \
    --port 8100 \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --max-num-batched-tokens 32768 \
    --max-model-len 16384 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --block-size 1 \
    --enforce-eager \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_producer",
      "kv_connector_extra_config": {
        "proxy_ip": "'"${PREFILL_IP}"'",
        "proxy_ping_port": "36367",
        "http_port": "8100",
        "handshake_port": "6301",
        "notify_port": "61005"
      }
    }'

# Node 2 (decode node) — command 3: start decode instance
docker run \
  --rm \
  --name moriio-decode \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_USE_V1=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
  -e VLLM_USE_AITER_TRITON_SILU_MUL=0 \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  deepseek-ai/DeepSeek-R1-0528 \
    --port 8200 \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --max-num-batched-tokens 32768 \
    --max-model-len 16384 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --block-size 1 \
    --enable-expert-parallel \
    --all2all-backend mori \
    --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_consumer",
      "kv_connector_extra_config": {
        "proxy_ip": "'"${PREFILL_IP}"'",
        "proxy_ping_port": "36367",
        "http_port": "8200",
        "handshake_port": "6301",
        "notify_port": "61005"
      }
    }'

# Node 1 (prefill node) — command 4: verify both instances registered with toy proxy
docker logs moriio-toy-proxy 2>&1 | grep -E "Registered (Prefill|Decode)"

# Node 1 (prefill node) — command 5: run vllm bench serve
# 1st benchmark - 256 conc
docker exec moriio-prefill \
  vllm bench serve \
    --base-url http://localhost:10001 \
    --backend vllm \
    --model deepseek-ai/DeepSeek-R1-0528 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --max-concurrency 256 \
    --num-warmups 512 \
    --num-prompts 2560 \
    --seed 1234

# 2nd benchmark - 512 conc
docker exec moriio-prefill \
  vllm bench serve \
    --base-url http://localhost:10001 \
    --backend vllm \
    --model deepseek-ai/DeepSeek-R1-0528 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --max-concurrency 512 \
    --num-warmups 1024 \
    --num-prompts 5120 \
    --seed 1234

# GSM8k
docker exec moriio-prefill bash -c \
  "pip install --quiet 'lm_eval[api]' && \
   lm_eval \
     --model local-completions \
     --model_args model=deepseek-ai/DeepSeek-R1-0528,base_url=http://localhost:10001/v1/completions,tokenized_requests=False,trust_remote_code=True \
     --tasks gsm8k \
     --num_fewshot 5 \
     --output_path /tmp/lm_eval_gsm8k"
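
The two --kv-transfer-config payloads in the commands above differ only in kv_role and http_port, and the shell quoting around ${PREFILL_IP} is easy to get wrong. As a convenience, the following sketch generates the same JSON from Python. The helper name kv_transfer_config is hypothetical; the keys and port values are copied from the commands above.

```python
import json


def kv_transfer_config(role, proxy_ip, http_port):
    """Build the --kv-transfer-config JSON string used in the commands above."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {role!r}")
    return json.dumps({
        "kv_connector": "MoRIIOConnector",
        "kv_role": role,
        "kv_connector_extra_config": {
            # Both instances point at the proxy running on the prefill node.
            "proxy_ip": proxy_ip,
            "proxy_ping_port": "36367",
            # Must match the --port the instance is served on.
            "http_port": str(http_port),
            "handshake_port": "6301",
            "notify_port": "61005",
        },
    })
```

For example, kv_transfer_config("kv_producer", prefill_ip, 8100) yields the prefill payload and kv_transfer_config("kv_consumer", prefill_ip, 8200) the decode payload, where prefill_ip holds the value of PREFILL_IP.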

Test Result

Both benchmark runs complete without hanging at up to 512 concurrency (2 of 2560 and 6 of 5120 requests failed):

INFO 04-21 09:29:11 [utils.py:90] Sampling input_len from [1000, 1000] and output_len from [1000, 1000]
Maximum request concurrency: 256
100%|██████████| 2560/2560 [06:51<00:00,  6.22it/s]
============ Serving Benchmark Result ============
Successful requests:                     2558
Failed requests:                         2
Maximum request concurrency:             256
Benchmark duration (s):                  411.36
Total input tokens:                      2558000
Total generated tokens:                  2555442
Request throughput (req/s):              6.22
Output token throughput (tok/s):         6212.17
Peak output token throughput (tok/s):    14689.00
Peak concurrent requests:                335.00
Total token throughput (tok/s):          12430.56
---------------Time to First Token----------------
Mean TTFT (ms):                          13215.72
Median TTFT (ms):                        2018.86
P99 TTFT (ms):                           61736.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.81
Median TPOT (ms):                        27.95
P99 TPOT (ms):                           29.05
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.81
Median ITL (ms):                         24.46
P99 ITL (ms):                            277.11
==================================================

INFO 04-21 09:46:19 [utils.py:90] Sampling input_len from [1000, 1000] and output_len from [1000, 1000]
Maximum request concurrency: 512
100%|██████████| 5120/5120 [11:24<00:00,  7.48it/s]
============ Serving Benchmark Result ============
Successful requests:                     5114
Failed requests:                         6
Maximum request concurrency:             512
Benchmark duration (s):                  684.65
Total input tokens:                      5114000
Total generated tokens:                  5108886
Request throughput (req/s):              7.47
Output token throughput (tok/s):         7462.09
Peak output token throughput (tok/s):    22272.00
Peak concurrent requests:                634.00
Total token throughput (tok/s):          14931.65
---------------Time to First Token----------------
Mean TTFT (ms):                          24764.53
Median TTFT (ms):                        4507.56
P99 TTFT (ms):                           79403.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.95
Median TPOT (ms):                        45.41
P99 TPOT (ms):                           49.87
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.95
Median ITL (ms):                         29.05
P99 ITL (ms):                            670.78
==================================================
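
As a quick sanity check on the two reports above, the sketch below recomputes the failure rate and the output token throughput from the reported counts and durations. The RUNS dictionary and summarize helper are illustrative only; the numbers are copied from the logs.

```python
# Figures copied from the two "Serving Benchmark Result" blocks above,
# keyed by maximum request concurrency.
RUNS = {
    256: {"ok": 2558, "failed": 2, "duration_s": 411.36,
          "generated_tokens": 2555442},
    512: {"ok": 5114, "failed": 6, "duration_s": 684.65,
          "generated_tokens": 5108886},
}


def summarize(run):
    """Derive failure rate and output token throughput for one run."""
    total = run["ok"] + run["failed"]
    return {
        "failure_rate": run["failed"] / total,
        "output_tok_per_s": run["generated_tokens"] / run["duration_s"],
    }


for concurrency, run in RUNS.items():
    s = summarize(run)
    print(f"{concurrency}: {s['failure_rate']:.3%} failed, "
          f"{s['output_tok_per_s']:.1f} output tok/s")
```

The recomputed throughput agrees with the reported values to within the rounding of the reported duration, and the failure rate stays near 0.1% at both concurrency levels.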

Accuracy

# This branch
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9378|±  |0.0067|

# Main
# Crashes with this error:
# /app/mori/src/io/rdma/backend_impl.cpp:291: void mori::io::NotifManager::ProcessOneCqe(int, const EpPair &): Assertion `msg.totalNum > 0' failed.
