[ROCm][MoRI] WRITE mode support (layerwise xfer)#157
Conversation
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
… service discovery Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…s not support it anyways Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3d623322a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a820dda17
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if moriio_transfer_mode.set(mode).is_err() { | ||
| let stored = moriio_transfer_mode.get().copied().unwrap(); | ||
| if stored != mode { | ||
| warn!( |
There was a problem hiding this comment.
Allow transfer mode to change after instance turnover
The transfer mode is stored in a OnceLock and any later registration with a different mode is rejected forever. Since the value is never reset when old instances expire, a normal READ→WRITE (or WRITE→READ) worker rollout under a long-running router causes all new workers to be skipped; once old workers age out, service discovery can be left with no usable instances until the router is restarted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Don't think there's much of a reason to switch modes live. WRITE is the default and should perform better. Suggest we ignore this for now until someone needs this functionality as that would add a lot of complexity (e.g. handling the rollover period where instances of both modes are live)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b04a049743
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let (prefill_result, decode_result) = tokio::join!(prefill_fut, decode_fut); | ||
| let write_prefill_response_json: Option<Value> = match prefill_result { |
There was a problem hiding this comment.
Fail fast on WRITE prefill errors instead of waiting on decode
In the WRITE path, tokio::join! waits for both futures before prefill_result is inspected, so a fast prefill failure is still delayed until the decode request returns. When decode is blocked waiting for KV transfer, this turns a clear prefill error into a long timeout path and ties up request capacity. Even with the new error propagation, this join! ordering still prevents prompt failure handling.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Technically we could use tokio::select! or abort to cancel decode on prefill failure but that adds complexity. I would suggest ignoring now but happy to be challenged if disagreed upon.
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d08cda7bb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3dbd46e8f8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…ntead of WRITRE mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
|
CI failing due to unrelated changes, even after rebasing on main. Suggest triggering re-run. Locally it's happy: $ cargo clippy --all-targets --all-features -- -D warnings
Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.77s |
|
Ci fixed by this: #169 |
…mode Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a5a448af9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…ests Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
|
@BugenZhao Thanks for the review! Comments fixed, PTAL 🙏 |
Purpose
Fixes #160.
This PR allows us to use WRITE mode of MoRI with vllm-router, enabling layer-wise KV transfer. It works by implementing concurrent dispatch of requests to P and D.
Dependent PR:
Results
1P1D with DSR1. See details below for reproducer.
Usage
Usage is identical to as introduced in #138, i.e. simply use
--kv-connector moriiowith service discovery enabled. The transfer mode is automatically negotiated during router<->worker handshake. On vLLM side, run withVLLM_MORIIO_CONNECTOR_READ_MODE=0(or omit this env var entirely as it's the default).Implementation details
Click to expand
Layerwise transfer requires dispatching request to the P and D instances concurrently, rather than sequentially as during READ mode.
max_tokens=1request to P, (2) awaiting response from P (3) sending request to D (4) stream response from D.max_tokens=1(2) send requests to both P and D concurrently (3) stream response from D.How WRITE mode works in vLLM: D allocates blocks and notifies P about these blocks and then waits. P receives this notification, and after every layer it writes its produced KV into these blocks asynchronously. Hence transfer or layer N is overlapped with computation of layer N+1. After all layers have been written, it awaits the last write and validates it succeeded. P then sends a notification to D that all blocks have been written. Upon this notification, D wakes up and and the request is scheduled on the D side, iteratively performing decode steps using the KV.
Test Plan
Reproducer
Build router on this branch
Need vllm built on this branch to avoid hanging issues: vllm-project/vllm#40344:
docker pull ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes # or nightly vllm docker once https://github.com/vllm-project/vllm/pull/40344 mergedRun 1P1D with
WRITEmode (VLLM_MORIIO_CONNECTOR_READ_MODE=0):Note: you can also test the above with the toy proxy:
Test Result
Main results summarized above. Raw results below.
WRITE mode: vllm-router vs toy proxy
Also includes vllm-router READ.
1k/1k
8k/1k
Accuracy (GSM8k)
WRITE mode, vllm-router:
Essential Elements of an Effective PR Description Checklist