[ROCm][MoRI] WRITE mode support (layerwise xfer) by simondanielsson · Pull Request #157 · vllm-project/router

simondanielsson · 2026-04-29T07:47:35Z

Purpose

Fixes #160.

This PR allows us to use WRITE mode of MoRI with vllm-router, enabling layer-wise KV transfer. It works by implementing concurrent dispatch of requests to P and D.

Dependent PR:

[Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency vllm#40344 - the MoRI connector can easily deadlock without these changes, particularly in WRITE mode. That PR resolves the issue and allows running MoRI (both READ and WRITE) at high concurrency.

Results

vllm-router performance is similar to the toy proxy reference implementation (green == orange). However, the toy proxy starts failing requests at high concurrency (see the Failed Reqs column in the raw results).
Compared to READ mode, WRITE mode provides
- up to 1.8x lower TTFT
- significantly better TPOT
[Fix][MoRI] Align MoRI-IO message format with P2pNcclConnector and vllm-router vllm#39565 already showed that READ mode performance of vllm-router is similar to that of the toy proxy, so WRITE mode vllm-router should definitely outperform READ mode toy proxy.

1P1D with DSR1. See details below for reproducer.

Usage

Usage is identical to that introduced in #138: simply use --kv-connector moriio with service discovery enabled. The transfer mode is negotiated automatically during the router<->worker handshake. On the vLLM side, run with VLLM_MORIIO_CONNECTOR_READ_MODE=0 (or omit this env var entirely, as WRITE mode is the default).

Implementation details

Layerwise transfer requires dispatching the request to the P and D instances concurrently, rather than sequentially as in READ mode.

The current PD dispatching logic (i.e. READ mode) has four steps: (1) send a max_tokens=1 request to P, (2) await the response from P, (3) send the request to D, (4) stream the response from D.
The dispatching logic for WRITE mode is instead: (1) prepare the P and D requests, with the P request having max_tokens=1, (2) send the requests to both P and D concurrently, (3) stream the response from D.
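The two dispatch orders can be sketched as follows. This is only an illustration using standard-library threads so that it stays self-contained; the PR itself uses tokio::join! on async futures. The `Request` type and the `send_prefill` / `send_decode` functions are hypothetical stand-ins for the router's HTTP calls.

```rust
use std::thread;

/// Hypothetical request payload; the real router forwards JSON bodies.
#[derive(Clone, Debug, PartialEq)]
pub struct Request {
    pub prompt: String,
    pub max_tokens: u32,
}

/// Stand-in for the HTTP call to the prefill (P) instance.
fn send_prefill(req: &Request) -> Result<String, String> {
    Ok(format!("prefill done ({} token)", req.max_tokens))
}

/// Stand-in for the HTTP call to the decode (D) instance.
fn send_decode(req: &Request) -> Result<String, String> {
    Ok(format!("decoded: {}", req.prompt))
}

/// READ-style dispatch: send to P, wait for it, then send to D.
pub fn dispatch_sequential(req: &Request) -> Result<String, String> {
    let prefill_req = Request { max_tokens: 1, ..req.clone() };
    send_prefill(&prefill_req)?;
    send_decode(req)
}

/// WRITE-style dispatch: prepare both requests, then keep both in flight at once.
pub fn dispatch_concurrent(req: &Request) -> Result<String, String> {
    let prefill_req = Request { max_tokens: 1, ..req.clone() };
    let (prefill_result, decode_result) = thread::scope(|s| {
        let p = s.spawn(|| send_prefill(&prefill_req));
        let d = s.spawn(|| send_decode(req));
        (
            p.join().expect("prefill thread panicked"),
            d.join().expect("decode thread panicked"),
        )
    });
    // Both results are available only after both calls have returned.
    prefill_result?;
    decode_result
}
```

Both functions return the decode response; the only difference is whether the decode call has to wait for the prefill call to return before it is issued.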

How WRITE mode works in vLLM: D allocates blocks, notifies P about these blocks, and then waits. P receives this notification, and after every layer it writes its produced KV into these blocks asynchronously. Hence the transfer of layer N is overlapped with the computation of layer N+1. After all layers have been written, P awaits the last write and validates that it succeeded. P then sends a notification to D that all blocks have been written. Upon this notification, D wakes up and the request is scheduled on the D side, iteratively performing decode steps using the KV.
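To see why the overlap helps, here is a toy timing model. It is an assumption made for illustration, not a measurement from this PR: every layer takes `compute` time units to produce its KV and `xfer` units to write it, and at most one write is in flight at a time. The function names are hypothetical.

```rust
/// Total time when all layers are computed first and the KV is moved afterwards.
pub fn transfer_after_compute(layers: u64, compute: u64, xfer: u64) -> u64 {
    layers * compute + layers * xfer
}

/// Total time when the write of layer N overlaps with the computation of layer N+1.
pub fn layerwise_overlap(layers: u64, compute: u64, xfer: u64) -> u64 {
    let mut compute_done: u64 = 0; // time at which the current layer's KV is ready
    let mut xfer_done: u64 = 0; // time at which the previous layer's write has landed
    for _ in 0..layers {
        compute_done += compute;
        // A write starts once its KV exists and the previous write has finished.
        xfer_done = compute_done.max(xfer_done) + xfer;
    }
    xfer_done
}
```

With 4 layers, `compute = 3` and `xfer = 1`, the first function gives 16 and the second gives 13: under this model only the last layer's write is left exposed when transfers are faster than compute.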

Test Plan

Bench: Compare READ vs WRITE mode, vllm-router vs toy-proxy
GSM8k for correctness validation

Reproducer

Build router on this branch

docker build \
  -f Dockerfile.router \
  -t ghcr.io/simondanielsson/vllm-router:write-mode \
  .
# or alternatively docker pull ghcr.io/simondanielsson/vllm-router:write-mode

vLLM must be built from this branch to avoid hanging issues (vllm-project/vllm#40344):

docker pull ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes
# or nightly vllm docker once https://github.com/vllm-project/vllm/pull/40344 merged

Run 1P1D with WRITE mode (VLLM_MORIIO_CONNECTOR_READ_MODE=0):

# Set on both nodes before running any command
export PREFILL_IP=<set this>
export DECODE_IP=<set this>

# Node 1 (prefill node) — command 1: start vllm router
docker run \
  --name vllm-router \
  --network host \
  --rm \
  ghcr.io/simondanielsson/vllm-router:write-mode \
  vllm-router \
  --vllm-pd-disaggregation \
  --kv-connector moriio \
  --vllm-discovery-address "0.0.0.0:36367" \
  --policy consistent_hash \
  --prefill-policy consistent_hash \
  --decode-policy consistent_hash \
  --log-level info


# Node 1 (prefill node) — command 2: start prefill instance
docker run \
  --rm \
  --name moriio-prefill \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=0 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_USE_V1=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
  -e VLLM_USE_AITER_TRITON_SILU_MUL=0 \
  -e MORI_IO_ENABLE_NOTIFICATION=0 \
  -e NCCL_SOCKET_IFNAME=ens51np0 \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  deepseek-ai/DeepSeek-R1-0528 \
    --load-format dummy \
    --port 8100 \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --max-num-batched-tokens 32768 \
    --max-model-len 16384 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --block-size 1 \
    --enforce-eager \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_producer",
      "kv_connector_extra_config": {
        "proxy_ip": "'"${PREFILL_IP}"'",
        "proxy_ping_port": "36367",
        "http_port": "8100",
        "handshake_port": "6301",
        "notify_port": "61005"
      }
    }'

# Node 2 (decode node) — command 3: start decode instance
docker run \
  --rm \
  --name moriio-decode \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=0 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_USE_V1=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
  -e VLLM_USE_AITER_TRITON_SILU_MUL=0 \
  -e MORI_IO_ENABLE_NOTIFICATION=0 \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  deepseek-ai/DeepSeek-R1-0528 \
    --port 8200 \
    --load-format dummy \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --max-num-batched-tokens 32768 \
    --max-model-len 16384 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --block-size 1 \
    --enable-expert-parallel \
    --all2all-backend mori \
    --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_consumer",
      "kv_connector_extra_config": {
        "proxy_ip": "'"${PREFILL_IP}"'",
        "proxy_ping_port": "36367",
        "http_port": "8200",
        "handshake_port": "6301",
        "notify_port": "61005"
      }
    }'

# Node 1 (prefill node) — command 5: run vllm bench serve
for input_len in 1000 8000; do
  for concurrency in 16 32 64 128 256; do
    docker exec moriio-prefill \
      vllm bench serve \
        --base-url http://localhost:30000 \
        --backend vllm \
        --model deepseek-ai/DeepSeek-R1-0528 \
        --dataset-name random \
        --random-input-len $input_len \
        --random-output-len 1000 \
        --max-concurrency $concurrency \
        --num-warmups $((concurrency * 2)) \
        --num-prompts $((concurrency * 10)) \
        --goodput ttft:1000 \
        --seed 1234 \
    | tee bench_router_input${input_len}_concurrency${concurrency}.txt
  done
done

#  Node 1 (prefill node) - GSM8k.
# Remember to remove --load-format dummy on the vllm instances though
docker exec moriio-prefill bash -c \
  "pip install --quiet 'lm_eval[api]' && \
   lm_eval \
     --model local-completions \
     --model_args model=deepseek-ai/DeepSeek-R1-0528,base_url=http://localhost:30000/v1/completions,tokenized_requests=False,trust_remote_code=True \
     --tasks gsm8k \
     --num_fewshot 5 \
     --output_path /tmp/lm_eval_gsm8k"

Note: you can also test the above with the toy proxy:

# Swap out the vllm-router container for the toy proxy
docker run \
  --rm \
  --name moriio-toy-proxy \
  --network host \
  --entrypoint bash \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  -c "pip install --quiet --ignore-installed quart aiohttp msgpack && \
           python3 -u /app/vllm/examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py"

# and re-run the benchmark but with the default 10001 port
for input_len in 1000 8000; do
  for concurrency in 16 32 64 128 256; do
    docker exec moriio-prefill \
      vllm bench serve \
        --base-url http://localhost:10001 \
        --backend vllm \
        --model deepseek-ai/DeepSeek-R1-0528 \
        --dataset-name random \
        --random-input-len $input_len \
        --random-output-len 1000 \
        --max-concurrency $concurrency \
        --num-warmups $((concurrency * 2)) \
        --num-prompts $((concurrency * 10)) \
        --goodput ttft:1000 \
        --seed 1234 \
    | tee bench_toy_proxy_input${input_len}_concurrency${concurrency}.txt
  done
done

Test Result

Main results summarized above. Raw results below.

WRITE mode: vllm-router vs toy proxy

Also includes vllm-router READ.

1k/1k

| Router | Mode | Concurrency | Failed Reqs | Req/s | TTFT (P50/P99) ms | TPOT (P50/P99) ms | ITL (P50/P99) ms |
|--------|------|------------:|------------:|------:|-------------------|-------------------|------------------|
| vllm-router | READ | 16 | 0 | 1.02 | 261.96 / 997.72 | 15.27 / 15.46 | 15.20 / 16.34 |
| toy-proxy | WRITE | 16 | 0 | 1.03 | 170.03 / 708.39 | 15.14 / 15.29 | 15.13 / 16.25 |
| vllm-router | WRITE | 16 | 0 | 1.03 | 238.92 / 678.49 | 15.20 / 15.75 | 15.18 / 16.54 |
| vllm-router | READ | 32 | 0 | 2.02 | 243.10 / 1335.86 | 15.19 / 15.66 | 15.15 / 17.64 |
| toy-proxy | WRITE | 32 | 0 | 2.04 | 242.21 / 1321.08 | 15.14 / 15.27 | 15.11 / 16.60 |
| vllm-router | WRITE | 32 | 0 | 2.02 | 202.82 / 1268.53 | 15.28 / 15.70 | 15.23 / 16.68 |
| vllm-router | READ | 64 | 0 | 3.60 | 791.34 / 3103.74 | 16.51 / 17.21 | 15.54 / 25.94 |
| toy-proxy | WRITE | 64 | 0 | 3.90 | 605.63 / 2528.67 | 15.48 / 16.06 | 15.40 / 18.52 |
| vllm-router | WRITE | 64 | 0 | 2.85 | 887.21 / 60563.53 | 15.54 / 15.78 | 15.49 / 18.05 |
| vllm-router | READ | 128 | 0 | 4.71 | 2286.87 / 57920.53 | 19.82 / 20.98 | 18.21 / 35.46 |
| toy-proxy | WRITE | 128 | 4 | 5.03 | 671.74 / 53810.47 | 18.23 / 18.53 | 18.19 / 26.61 |
| vllm-router | WRITE | 128 | 0 | 5.05 | 949.18 / 61034.77 | 18.24 / 18.37 | 18.21 / 23.91 |
| vllm-router | READ | 256 | 0 | 6.08 | 2857.77 / 64422.03 | 29.75 / 31.56 | 24.90 / 117.52 |
| toy-proxy | WRITE | 256 | 33 | 6.81 | 1354.28 / 61913.80 | 24.99 / 25.78 | 24.46 / 65.69 |
| vllm-router | WRITE | 256 | 0 | 6.65 | 1299.83 / 61447.22 | 25.40 / 25.63 | 25.10 / 56.03 |

8k/1k

| Router | Mode | Concurrency | Failed Reqs | Req/s | TTFT (P50/P99) ms | TPOT (P50/P99) ms | ITL (P50/P99) ms |
|--------|------|------------:|------------:|------:|-------------------|-------------------|------------------|
| vllm-router | READ | 16 | 0 | 0.63 | 2176.66 / 59528.47 | 17.55 / 18.55 | 15.40 / 22.41 |
| toy-proxy | WRITE | 16 | 0 | 0.74 | 767.93 / 51578.94 | 15.19 / 15.35 | 15.15 / 21.64 |
| vllm-router | WRITE | 16 | 1 | 0.71 | 766.99 / 61397.36 | 15.12 / 15.47 | 15.10 / 21.03 |
| vllm-router | READ | 32 | 0 | 0.93 | 2930.00 / 63933.75 | 20.35 / 21.91 | 15.79 / 203.46 |
| toy-proxy | WRITE | 32 | 0 | 1.09 | 1067.96 / 70475.83 | 15.87 / 16.70 | 15.90 / 24.46 |
| vllm-router | WRITE | 32 | 0 | 1.08 | 790.78 / 70608.24 | 15.84 / 16.13 | 15.85 / 23.41 |
| vllm-router | READ | 64 | 0 | 1.16 | 3098.27 / 66645.72 | 29.95 / 31.91 | 19.28 / 424.22 |
| toy-proxy | WRITE | 64 | 0 | 1.37 | 5756.09 / 65180.50 | 17.99 / 18.95 | 18.01 / 31.69 |
| vllm-router | WRITE | 64 | 0 | 1.36 | 6136.02 / 66064.27 | 17.86 / 18.94 | 17.96 / 30.50 |
| vllm-router | READ | 128 | 0 | 1.30 | 68350.20 / 99389.23 | 42.07 / 50.25 | 25.19 / 455.45 |
| toy-proxy | WRITE | 128 | 4 | 1.38 | 83536.72 / 99228.10 | 18.21 / 18.86 | 18.13 / 32.20 |
| vllm-router | WRITE | 128 | 0 | 1.39 | 83549.72 / 99688.20 | 18.16 / 18.78 | 18.12 / 31.72 |

Accuracy (GSM8k)

WRITE mode, vllm-router:

local-completions ({'model': 'deepseek-ai/DeepSeek-R1-0528', 'base_url': 'http://localhost:30000/v1/completions', 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9515|±  |0.0059|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson · 2026-04-29T18:21:39Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3d623322a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

simondanielsson · 2026-04-30T08:36:20Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 1a820dda17

chatgpt-codex-connector · 2026-04-30T08:50:33Z

+        if moriio_transfer_mode.set(mode).is_err() {
+            let stored = moriio_transfer_mode.get().copied().unwrap();
+            if stored != mode {
+                warn!(


Allow transfer mode to change after instance turnover

The transfer mode is stored in a OnceLock and any later registration with a different mode is rejected forever. Since the value is never reset when old instances expire, a normal READ→WRITE (or WRITE→READ) worker rollout under a long-running router causes all new workers to be skipped; once old workers age out, service discovery can be left with no usable instances until the router is restarted.

Don't think there's much of a reason to switch modes live. WRITE is the default and should perform better. Suggest we ignore this for now until someone needs this functionality as that would add a lot of complexity (e.g. handling the rollover period where instances of both modes are live)
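The behaviour discussed in this thread follows from `std::sync::OnceLock` semantics: the first `set` wins and every later `set` returns `Err`, so the stored mode can never change for the lifetime of the value. A minimal sketch of that check, where `TransferMode` and `register` are hypothetical names and not the router's actual types:

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TransferMode {
    Read,
    Write,
}

/// Records the mode on first registration. Later registrations succeed only
/// if they report the same mode; otherwise the stored mode is returned as the error.
pub fn register(slot: &OnceLock<TransferMode>, mode: TransferMode) -> Result<(), TransferMode> {
    if slot.set(mode).is_err() {
        // `set` fails only when a value is already stored, so `get` returns Some here.
        let stored = slot.get().copied().expect("a value is already stored");
        if stored != mode {
            return Err(stored);
        }
    }
    Ok(())
}
```

Registering `Write`, then `Write` again, then `Read` against the same slot yields `Ok(())`, `Ok(())`, and `Err(TransferMode::Write)`; nothing short of creating a new `OnceLock` clears the stored value, which is the rollout scenario the review comment describes.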

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: b04a049743

chatgpt-codex-connector · 2026-04-30T09:16:17Z

+            let (prefill_result, decode_result) = tokio::join!(prefill_fut, decode_fut);
+            let write_prefill_response_json: Option<Value> = match prefill_result {


Fail fast on WRITE prefill errors instead of waiting on decode

In the WRITE path, tokio::join! waits for both futures before prefill_result is inspected, so a fast prefill failure is still delayed until the decode request returns. When decode is blocked waiting for KV transfer, this turns a clear prefill error into a long timeout path and ties up request capacity. Even with the new error propagation, this join! ordering still prevents prompt failure handling.

Technically we could use tokio::select! or abort to cancel decode on prefill failure but that adds complexity. I would suggest ignoring now but happy to be challenged if disagreed upon.
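For reference, the fail-fast alternative could look roughly like the sketch below. It uses standard-library threads and a channel so that it is self-contained; in the router the same idea would be expressed with tokio::select! or by aborting the decode task. `Outcome` and `dispatch_fail_fast` are hypothetical names, and unlike an aborted async task, the detached decode thread here keeps running after the early return.

```rust
use std::sync::mpsc;
use std::thread;

/// Tagged result so the receiver knows which side finished.
#[derive(Debug)]
pub enum Outcome {
    Prefill(Result<(), String>),
    Decode(Result<String, String>),
}

/// Runs prefill and decode concurrently and returns as soon as prefill
/// reports an error, without waiting for decode to finish.
pub fn dispatch_fail_fast<P, D>(prefill: P, decode: D) -> Result<String, String>
where
    P: FnOnce() -> Result<(), String> + Send + 'static,
    D: FnOnce() -> Result<String, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let tx_decode = tx.clone();
    // Send errors are ignored: they only mean the receiver already returned.
    thread::spawn(move || {
        let _ = tx.send(Outcome::Prefill(prefill()));
    });
    thread::spawn(move || {
        let _ = tx_decode.send(Outcome::Decode(decode()));
    });

    // Outcomes arrive in completion order, so a fast prefill error is seen first.
    for outcome in rx {
        match outcome {
            Outcome::Prefill(Err(e)) => return Err(format!("prefill failed: {e}")),
            Outcome::Prefill(Ok(())) => continue,
            Outcome::Decode(result) => return result,
        }
    }
    Err("both workers exited without a decode result".to_string())
}
```

The trade-off mentioned in the reply is visible even in this small version: the dispatcher now has to reason about completion order and about what happens to the still-running decode side.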

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 0d08cda7bb

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 3dbd46e8f8

simondanielsson · 2026-05-05T10:01:14Z

CI is failing due to unrelated changes, even after rebasing on main. Suggest triggering a re-run.

Locally it's happy:

$ cargo clippy --all-targets --all-features -- -D warnings
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.77s

simondanielsson · 2026-05-05T15:23:29Z

CI fixed by #169.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 8a5a448af9

BugenZhao

Rest LGTM

simondanielsson · 2026-05-06T11:21:05Z

@BugenZhao Thanks for the review! Comments fixed, PTAL 🙏

BugenZhao

LGTM. Thanks!

simondanielsson added 5 commits April 23, 2026 14:39

Initial commit

7eac3b7

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

feat: first impl of async (WRITE) mode and raise if mori used without…

a6374a4

… service discovery Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: add do_remote_* to WRITE path creation of kv transfer params

1584560

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

refactor: use tokio::join! rather than spawn to avoid race condition

19a72b4

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Merge remote-tracking branch 'upstream/main' into feature/mori-write-…

462f6b1

…mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson changed the title ~~[ROCm][MoRI] WRITE mode support~~ [ROCm][MoRI] WRITE mode support (layerwise xfer) Apr 29, 2026

simondanielsson added 15 commits April 29, 2026 07:52

chore: move docstring to generate_transfer_id

da39023

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chroe: rename build_decode_kv_params

7d2f7e2

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

docs: mention mori in module docstring in vllm_service_discovery

0846eed

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chore: cargo fmt

79d7efd

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chore: shorten docstring for MoriIOServiceRegistratoin

12f7fba

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chore: add trackking issue in comment for why remote_tp_size is 1

6ddf9af

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

feat: error if dispatching before registration

d605ce2

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

refactor: make more DRY with extract_base_http_and_dp_ran

4528365

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Move prefill stop profiling into big else block

0f1d3fd

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chore: cargo fmt

988d995

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

feat: support prefill profiling in WRITE mode

ec42d8a

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

refactor: make more DRY by adding handle_decode_response

7a42f50

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: revert all changes in process_vllm_two_stage_request as mori doe…

e39f17a

…s not support it anyways Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

refactor: make more DRY!

2b497bd

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chore: fix clippy

d3d6233

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson marked this pull request as ready for review April 29, 2026 18:21

chatgpt-codex-connector Bot reviewed Apr 29, 2026

Comment thread src/routers/http/vllm_pd_router.rs Outdated

Comment thread src/routers/http/vllm_pd_router.rs

tjtanaa requested a review from BugenZhao April 30, 2026 01:33

simondanielsson added 2 commits April 30, 2026 10:08

fix: propagate prefill errors

898650b

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: support logsprobs from prefill also in WRITE mode

1a820dd

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chatgpt-codex-connector Bot reviewed Apr 30, 2026

fix: enforce mori runs with zmq svc discovery

b04a049

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chatgpt-codex-connector Bot reviewed Apr 30, 2026

fix: commit xfer mode only after registration validation succeeds

0d08cda

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chatgpt-codex-connector Bot reviewed Apr 30, 2026

Comment thread src/routers/http/vllm_pd_router.rs Outdated

Comment thread src/routers/http/vllm_pd_router.rs

simondanielsson added 2 commits April 30, 2026 16:45

fix: use send_client_request for prefill request in WRITE mode

6657952

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: record failed prefills in WRITE mode with record_pd_prefill_error

3dbd46e

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chatgpt-codex-connector Bot reviewed Apr 30, 2026

Comment thread src/routers/http/vllm_pd_router.rs

fix: stop profile decode instance upon prefill error in WRITE mode

45ee5b4

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

functionstackx mentioned this pull request May 1, 2026

[Bug]: parity with CUDA & parity with rocm sglang: vLLM router doesn't current support MoRI kvcache connector vllm-project/vllm#38692

Closed

1 task

simondanielsson added 2 commits May 5, 2026 08:36

chore: update comments and rename variables to concurrent dispatch is…

11dba53

…ntead of WRITRE mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Merge remote-tracking branch 'upstream/main' into feature/mori-write-…

14d2d2f

…mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson mentioned this pull request May 5, 2026

[Feature] P2pNccl concurrent dispatch mode #167

Closed

3 tasks

Merge remote-tracking branch 'upstream/main' into feature/mori-write-…

8a5a448

…mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

chatgpt-codex-connector Bot reviewed May 6, 2026

Comment thread src/routers/http/vllm_pd_router.rs

fix: stop profiling decode upon decode error

a800f2d

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

BugenZhao reviewed May 6, 2026

Comment thread src/routers/http/vllm_service_discovery.rs

Comment thread src/routers/http/vllm_service_discovery.rs Outdated

fix: pass in stored transfer mode to parse_registration and improve t…

ffbc9ca

…ests Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

BugenZhao approved these changes May 6, 2026

BugenZhao merged commit e667ebb into vllm-project:main May 6, 2026
6 checks passed

simondanielsson deleted the feature/mori-write-mode branch May 6, 2026 13:25

This was referenced May 7, 2026

[Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency vllm-project/vllm#40344

Merged

[P/D][AMD]: Support AMD MoRI-IO llm-d/llm-d#1469

Open

raviguptaamd mentioned this pull request Jun 2, 2026

fix: round-robin DP rank assignment for service discovery mode #181

Open

3 tasks
