feat: add v1 parallel serving strategy support by ndleslx · Pull Request #37 · hw-native-sys/pypto-serving

ndleslx · 2026-06-17T09:43:05Z

Summary

Add a ParallelConfig contract for DP x TP serving topology and validation.
Add DataParallelAsyncLLMEngine to route requests across independent DP replicas by pending token load.
Thread device groups through CLI, serving worker, core executor, and Qwen3-14B L3 DistributedConfig.
Add offline TP CLI support with a fast failure for unsupported offline DP.
Add unit coverage for parallel topology parsing, DP routing, and executor device groups.
Document v1 DP/TP usage and the validated DP=2 runtime settings.
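
For readers who have not opened the diff, the DP x TP contract in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the field names mirror identifiers quoted elsewhere in this thread (`devices`, `data_parallel_size`, `replica_device_groups`), but the exact validation rules, in particular the requirement that the device count equals DP x TP, are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical DP x TP topology contract (validation rules are assumptions)."""

    devices: tuple[int, ...]
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

    def __post_init__(self) -> None:
        if self.data_parallel_size < 1 or self.tensor_parallel_size < 1:
            raise ValueError("data_parallel_size and tensor_parallel_size must be >= 1")
        if self.pipeline_parallel_size != 1:
            # The PR notes say v1 rejects pipeline parallelism.
            raise ValueError("pipeline_parallel_size > 1 is not supported in v1")
        if len(set(self.devices)) != len(self.devices):
            raise ValueError("device ids must be unique")
        expected = self.data_parallel_size * self.tensor_parallel_size
        if len(self.devices) != expected:
            raise ValueError(f"expected {expected} devices (DP x TP), got {len(self.devices)}")

    @property
    def replica_device_groups(self) -> tuple[tuple[int, ...], ...]:
        """Split the flat device list into one contiguous TP group per DP replica."""
        tp = self.tensor_parallel_size
        return tuple(
            self.devices[rank * tp:(rank + 1) * tp]
            for rank in range(self.data_parallel_size)
        )


config = ParallelConfig(devices=(12, 13), data_parallel_size=2, tensor_parallel_size=1)
print(config.replica_device_groups)  # ((12,), (13,))
```

With DP=2 and TP=1 each replica owns exactly one device, which matches the `Device groups: [[12], [13]]` line reported in the validation comment further down.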

Validation

python -m pytest tests/test_cli.py tests/test_batching.py tests/test_parallel.py -q
python tests/lint/check_headers.py
python tests/lint/check_english_only.py
ruff check --config ruff.toml python examples/model/qwen3_14b tests

Runtime check:

DP=2 HTTP serving with --devices "$TASK_DEVICE" --data-parallel-size 2 --tensor-parallel-size 1
Runtime env: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072
Two concurrent streaming requests passed: Successful: 2/2, Tokens total: 16, throughput about 6.1 tok/s
Sample output: Huawei is a Chinese multinational technology company headquartered in Sh

Notes

V1 rejects pipeline_parallel_size > 1 and expert parallelism until kernels support those modes.
Offline generation supports one logical TP replica; offline DP should be launched as separate jobs.
For DP=2, setting both PTO2_RING_TASK_WINDOW and PTO2_RING_DEP_POOL to 1048576 with a 4 GiB heap can reserve about 19 GiB of runtime arena per replica and fail with rtMalloc failed: 207001.

Related Issue

Refs #36

gemini-code-assist

Code Review

This pull request introduces a Parallel Strategy V1 supporting a DP x TP device topology, adding ParallelConfig and DataParallelAsyncLLMEngine to route requests across independent data-parallel replicas using a least-pending-tokens policy. It also updates CLI arguments, configuration schemas, and the Qwen3-14B runner to support multi-device groups. The review feedback highlights two key improvement opportunities: resolving a double-counting issue in the load estimation of DataParallelAsyncLLMEngine by decrementing the extra load as soon as a request is registered in the scheduler, and starting DP replica engines in parallel using asyncio.gather to optimize server startup time.

gemini-code-assist · 2026-06-17T09:45:16Z

+    async def add_request(
+        self,
+        request_id: str,
+        prompt: str,
+        config,
+    ) -> AsyncGenerator[TokenOutput, None]:
+        replica_idx = self._select_replica()
+        request_load = self._estimate_request_load(prompt, config)
+        self._route_extra_load[replica_idx] += request_load
+        self._request_to_replica[request_id] = replica_idx
+        try:
+            engine = self._engines[replica_idx]
+            async for output in engine.add_request(request_id, prompt, config):
+                yield output
+        finally:
+            self._request_to_replica.pop(request_id, None)
+            self._route_extra_load[replica_idx] = max(
+                0,
+                self._route_extra_load[replica_idx] - request_load,
+            )


Active requests are double-counted in the load estimation. Once engine.add_request is called, the request is synchronously added to the scheduler, meaning engine.pending_token_load() immediately includes its remaining tokens. However, self._route_extra_load is not decremented until the request completely finishes. This causes the request's load to be counted twice for its entire duration, leading to suboptimal routing decisions.

We should decrement the extra load as soon as the request is registered in the scheduler (i.e., on the first yield of the generator).

    async def add_request(
        self,
        request_id: str,
        prompt: str,
        config,
    ) -> AsyncGenerator[TokenOutput, None]:
        replica_idx = self._select_replica()
        request_load = self._estimate_request_load(prompt, config)
        self._route_extra_load[replica_idx] += request_load
        self._request_to_replica[request_id] = replica_idx
        decremented = False
        try:
            engine = self._engines[replica_idx]
            async for output in engine.add_request(request_id, prompt, config):
                if not decremented:
                    self._route_extra_load[replica_idx] = max(
                        0,
                        self._route_extra_load[replica_idx] - request_load,
                    )
                    decremented = True
                yield output
        finally:
            self._request_to_replica.pop(request_id, None)
            if not decremented:
                self._route_extra_load[replica_idx] = max(
                    0,
                    self._route_extra_load[replica_idx] - request_load,
                )

gemini-code-assist · 2026-06-17T09:45:16Z

+    async def start(self) -> None:
+        """Start all DP replica engines."""
+        started: list[Any] = []
+        try:
+            for engine in self._engines:
+                await engine.start()
+                started.append(engine)
+        except Exception:
+            for engine in reversed(started):
+                await engine.stop()
+            raise


Starting DP replica engines sequentially can significantly increase the server startup time, especially since model loading and kernel compilation are time-consuming operations. Starting them in parallel using asyncio.gather allows overlapping these operations and speeds up startup.

Suggested change

-    async def start(self) -> None:
-        """Start all DP replica engines."""
-        started: list[Any] = []
-        try:
-            for engine in self._engines:
-                await engine.start()
-                started.append(engine)
-        except Exception:
-            for engine in reversed(started):
-                await engine.stop()
-            raise
+    async def start(self) -> None:
+        """Start all DP replica engines in parallel."""
+        try:
+            await asyncio.gather(*(engine.start() for engine in self._engines))
+        except Exception:
+            await self.stop()
+            raise
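
One detail worth keeping in mind with the suggestion above: by default `asyncio.gather` propagates the first exception while the other awaitables keep running, so cleanup can overlap with a replica that is still initializing. The sketch below is one possible variant, not the PR's implementation. It uses `return_exceptions=True` so that every `start()` settles before any `stop()` runs. `FakeEngine` and `start_all` are hypothetical stand-ins introduced only for this illustration.

```python
import asyncio


class FakeEngine:
    """Stand-in for a DP replica engine; not the PR's real class."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.running = False

    async def start(self) -> None:
        await asyncio.sleep(0)  # placeholder for model loading
        if self.fail:
            raise RuntimeError(f"{self.name} failed to start")
        self.running = True

    async def stop(self) -> None:
        self.running = False


async def start_all(engines: list) -> None:
    """Start every engine concurrently; stop all of them if any start fails."""
    # return_exceptions=True waits for every start() to settle, so cleanup
    # never overlaps with a replica that is still initializing.
    results = await asyncio.gather(
        *(engine.start() for engine in engines), return_exceptions=True
    )
    errors = [result for result in results if isinstance(result, BaseException)]
    if errors:
        await asyncio.gather(
            *(engine.stop() for engine in engines), return_exceptions=True
        )
        raise errors[0]


engines = [FakeEngine("replica-0"), FakeEngine("replica-1", fail=True)]
try:
    asyncio.run(start_all(engines))
except RuntimeError as exc:
    print(exc)  # replica-1 failed to start
print([engine.running for engine in engines])  # [False, False]
```

The trade-off is that a failing replica is only reported after the slowest replica has finished starting; whether that is acceptable depends on how long model loading takes.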

coderabbitai · 2026-06-18T03:04:41Z

Walkthrough

Introduces Parallel Strategy V1 supporting DP x TP topology. A new ParallelConfig dataclass describes parallelism layout and device groupings. AsyncLLMEngine is split into DPEngineCore (per-replica owner) and AsyncLLMEngineClient (load-balancing router). PyptoExecutor and executors throughout the stack gain device_ids support. CLI, serving worker, server type hints, and public exports are updated accordingly.

Changes

Parallel Strategy V1: DP x TP Serving Topology

| Layer / File(s) | Summary |
| --- | --- |
| ParallelConfig contract and device parsing `python/core/parallel.py` | New `ParallelConfig` frozen dataclass with `__post_init__` validation, `replica_device_groups` partitioning, `for_replica` helper, and `parse_device_ids` utility. |
| EngineConfig extension and DPEngineCore / AsyncLLMEngineClient `python/core/async_engine.py` | `EngineConfig` gains `device_ids`, `parallel_config`, `dp_rank`, and `worker_device_ids()`. `AsyncLLMEngine` renamed to `DPEngineCore`; new `AsyncLLMEngineClient` constructs one core per DP replica and routes `add_request()` to the least-loaded core via `_select_replica()`. |
| PyptoExecutor and NPU executor multi-device wiring `python/core/pypto_executor.py`, `examples/model/qwen3_14b/runner/npu_executor.py`, `python/core/serving_worker.py` | `PyptoExecutor.__init__` accepts optional `device_ids` and normalizes to a tuple; `Qwen314BPyptoExecutor` forwards it to base and sets `DistributedConfig.device_ids`; serving worker derives `device_ids` from config and passes them at executor construction. |
| CLI parallel flags, engine selection, and server type updates `python/cli/main.py`, `python/core/server.py` | Adds `--devices`, `--data-parallel-size/--dp`, `--tensor-parallel-size/--tp`, `--pipeline-parallel-size/--pp` and routing flags; `build_serving_engine_config` builds `ParallelConfig` and injects it into `EngineConfig`; `run_serve` selects `DPEngineCore` or `AsyncLLMEngineClient` based on DP size; server signatures updated to accept `AsyncLLMEngineClient \| DPEngineCore`. |
| Public API re-exports `python/core/__init__.py`, `python/core/api.py` | `__all__` expanded to include `AsyncLLMEngineClient`, `DPEngineCore`, `EngineConfig`, and `ParallelConfig`. |
| Example script and config updates `examples/model/qwen3_14b/npu_generate.py`, `examples/model/qwen3_14b/npu_serving.json` | `npu_generate.py` adds `--devices`/`--data-parallel-size` args, builds `ParallelConfig`, and enforces DP=1; `npu_serving.json` gains a `parallel` block with full topology fields. |
| Tests and documentation `tests/test_parallel.py`, `tests/test_batching.py`, `tests/test_cli.py`, `tests/test_npu_prefix_chunk.py`, `README.md` | New `test_parallel.py` covers `ParallelConfig` grouping/validation, `parse_device_ids`, and `AsyncLLMEngineClient` least-load routing; existing tests extended for multi-device and topology assertions; `DPEngineCore` adopted in NPU prefix test; README adds Parallel Strategy V1 section and DP=2 tuning note. |
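
As a rough illustration of the device-parsing utility listed in the table, here is a sketch of what a `parse_device_ids` helper might look like. The call shape `parse_device_ids(args.devices, default_device=args.device_id)` appears later in this thread; everything else (whitespace handling, rejection of duplicates and negative ids) is an assumption made for this example.

```python
from __future__ import annotations


def parse_device_ids(spec: str | None, default_device: int = 0) -> tuple[int, ...]:
    """Parse a string such as "12,13" into (12, 13).

    Falls back to a single default device when no spec is given.
    The validation rules here are assumptions, not the PR's behavior.
    """
    if spec is None or not spec.strip():
        return (default_device,)
    ids = tuple(int(part) for part in spec.split(",") if part.strip())
    if not ids:
        return (default_device,)
    if any(device < 0 for device in ids):
        raise ValueError(f"device ids must be non-negative: {spec!r}")
    if len(set(ids)) != len(ids):
        raise ValueError(f"duplicate device ids in {spec!r}")
    return ids


print(parse_device_ids("12,13"))                 # (12, 13)
print(parse_device_ids(None, default_device=4))  # (4,)
```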

Sequence Diagram(s)

sequenceDiagram
  participant Client as HTTP Client
  participant AsyncLLMEngineClient
  participant DPEngineCore_0 as DPEngineCore (replica 0)
  participant DPEngineCore_1 as DPEngineCore (replica 1)
  participant WorkerProcess

  Client->>AsyncLLMEngineClient: add_request(prompt, generate_config)
  AsyncLLMEngineClient->>DPEngineCore_0: pending_token_load()
  AsyncLLMEngineClient->>DPEngineCore_1: pending_token_load()
  note over AsyncLLMEngineClient: select replica with min load + _route_extra_load
  AsyncLLMEngineClient->>DPEngineCore_1: add_request(prompt, generate_config)
  DPEngineCore_1->>WorkerProcess: schedule & execute on TP device group
  WorkerProcess-->>DPEngineCore_1: TokenOutput chunks
  DPEngineCore_1-->>AsyncLLMEngineClient: AsyncGenerator[TokenOutput]
  AsyncLLMEngineClient-->>Client: stream TokenOutput
  AsyncLLMEngineClient->>AsyncLLMEngineClient: finally: remove _request_to_replica entry
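
The selection step noted in the diagram ("min load + _route_extra_load") can be sketched as a small pure function. This is an assumed shape for illustration only; the PR's `_select_replica()` is a method on the client and may break ties or weigh load differently.

```python
def select_replica(pending_loads: list, route_extra_load: list) -> int:
    """Return the index of the replica with the smallest estimated token load.

    pending_loads[i]    : tokens the replica's scheduler already tracks
    route_extra_load[i] : tokens routed to the replica but not yet queued
    Ties go to the lowest index because min() keeps the first minimum.
    """
    if not pending_loads or len(pending_loads) != len(route_extra_load):
        raise ValueError("need one pending load and one extra load per replica")
    totals = [pending + extra for pending, extra in zip(pending_loads, route_extra_load)]
    return min(range(len(totals)), key=totals.__getitem__)


# Replica 1 looks idle by scheduler load alone, but 100 tokens are already in flight to it.
print(select_replica([120, 40], [0, 100]))  # 0
```

This also shows why the double-counting issue raised by the first review matters: if a request's tokens stay in `route_extra_load` after the scheduler has counted them, the replica's total is inflated and later requests are steered away from it.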

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

[Feature] Parallel Strategies Support #36: This PR directly implements the parallel strategy feature requested in #36, delivering ParallelConfig, replica_device_groups, DPEngineCore, AsyncLLMEngineClient, and device_ids propagation through executors.

Possibly related PRs

hw-native-sys/pypto-serving#21: Touches the same python/cli/main.py CLI wiring and python/core/async_engine.py EngineConfig paths that this PR extends with DP/TP multi-device support.

Poem

🐇 Hop, hop, devices in line,
Replicas splitting the work so fine!
DP routes to the core with least load,
TP groups share the parallel road.
With ParallelConfig the topology's set —
The fastest bunny serving yet! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.81% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: add v1 parallel serving strategy support' clearly and accurately summarizes the main objective of the PR, which is to introduce v1 parallel serving strategy support for DP x TP topology. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is directly related to the changeset, detailing the addition of ParallelConfig, DP routing, device group threading, and offline TP support across multiple files. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/core/async_engine.py (1)

132-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ensure spawned worker is cleaned up on startup failure.

If ready_event.wait(...) times out or raises (Line 141), DPEngineCore.start() exits without tearing down the already spawned worker process. This can leave orphan processes and block subsequent starts.

💡 Proposed fix

@@
 import asyncio
+import contextlib
 import logging
@@
     async def start(self) -> None:
         """Start worker process and engine loop."""
         with profile_span("DPEngineCore.start", cat="serving"):
             process, input_q, output_q, ready_event = spawn_worker(self.config)
             self._worker_process = process
             self._input_queue = input_q
             self._output_queue = output_q
 
             logger.info("Waiting for worker to initialize model...")
-            await asyncio.to_thread(ready_event.wait, timeout=600)
-            if not ready_event.is_set():
-                raise RuntimeError("Worker failed to initialize within timeout")
+            try:
+                ready = await asyncio.to_thread(ready_event.wait, timeout=600)
+                if not ready:
+                    raise RuntimeError("Worker failed to initialize within timeout")
+            except Exception:
+                with contextlib.suppress(Exception):
+                    input_q.put(WorkerCommand(type="shutdown"))
+                with contextlib.suppress(Exception):
+                    process.join(timeout=5)
+                if process.is_alive():
+                    process.terminate()
+                self._worker_process = None
+                self._input_queue = None
+                self._output_queue = None
+                raise
             logger.info("Worker ready")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/async_engine.py` around lines 132 - 145, The start method in
DPEngineCore spawns a worker process but does not clean it up if the ready_event
timeout occurs or an exception is raised. Wrap the asyncio.to_thread call in a
try-except block (or similar error handling) to ensure that if
ready_event.wait() times out or raises an exception, the worker process stored
in self._worker_process is properly terminated before the RuntimeError is
raised. This prevents orphan processes from being left behind when startup
fails.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Around line 170-173: The current documentation in the README.md section
(around lines 170-173) presents conflicting ring-size values for
`PTO2_RING_TASK_WINDOW` and `PTO2_RING_DEP_POOL` without clearly distinguishing
which setting applies to which topology. Restructure this section to explicitly
separate guidance by topology: create distinct subsections for single-replica
configurations and DP=2+ (distributed/multi-replica) configurations, clearly
stating that single-replica uses `1048576` while DP=2+ uses `131072`. This will
eliminate ambiguity and provide operators with one clear setting path for their
specific topology.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0d8a0 and 5f38e42.

📒 Files selected for processing (16)

README.md
examples/model/qwen3_14b/npu_generate.py
examples/model/qwen3_14b/npu_serving.json
examples/model/qwen3_14b/runner/npu_executor.py
python/cli/main.py
python/core/__init__.py
python/core/api.py
python/core/async_engine.py
python/core/parallel.py
python/core/pypto_executor.py
python/core/server.py
python/core/serving_worker.py
tests/test_batching.py
tests/test_cli.py
tests/test_npu_prefix_chunk.py
tests/test_parallel.py

coderabbitai · 2026-06-18T03:12:28Z

+For the DP=2 validation above, keep `PTO2_RING_TASK_WINDOW` and
+`PTO2_RING_DEP_POOL` at `131072`. Setting both to `1048576` with a 4 GiB heap can
+reserve about 19 GiB of runtime arena per replica and fail with
+`rtMalloc failed: 207001`.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the concurrent-serving ring-size guidance by topology.

The new DP=2 note (131072) conflicts with the earlier concurrent-serving default (1048576) in the same section, which can lead to wrong runtime tuning choices. Please split guidance explicitly by topology (single-replica vs DP=2+) so operators have one unambiguous setting path.

Suggested doc adjustment

-Single-request HTTP serving does not require the larger PTO2 ring settings. For
-concurrent NPU serving, start the server with the larger PTO2 ring settings:
+Single-request HTTP serving does not require the larger PTO2 ring settings.
+For concurrent NPU serving, use topology-specific ring settings:
+
+- Single-replica serving (no DP): `PTO2_RING_TASK_WINDOW=1048576` and `PTO2_RING_DEP_POOL=1048576`.
+- DP=2 serving: keep `PTO2_RING_TASK_WINDOW=131072` and `PTO2_RING_DEP_POOL=131072`.

ndleslx · 2026-06-18T03:15:01Z

Validation after renaming the serving architecture classes:

AsyncLLMEngine -> DPEngineCore
DataParallelAsyncLLMEngine -> AsyncLLMEngineClient

Local checks:

python -m pytest tests/test_cli.py tests/test_batching.py tests/test_parallel.py -q
python -m compileall -q python/core python/cli tests/test_parallel.py tests/test_npu_prefix_chunk.py
ruff check --config ruff.toml python/core python/cli tests/test_parallel.py tests/test_npu_prefix_chunk.py
python tests/lint/check_headers.py
python tests/lint/check_english_only.py

NPU serving checks:

DP=1 original single-core path passed.

Command shape:

task-submit --device auto --max-time 1200 --run \
  "bash -lc 'python /tmp/pypto-serving-dp-check.py dp1 19221'"

Result:

Device: 2
Exit: 0
Generated text:

Huawei is a Chinese multinational technology company. It is the world's largest telecommunications equipment manufacturer and

DP=2 client/router path passed on devices 12,13.

Command shape:

task-submit --device 12,13 --max-time 1800 --run \
  "bash -lc 'export PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072; python /tmp/pypto-serving-dp-check.py dp2 19225'"

Result:

Device groups: [[12], [13]]
Exit: 0
Request 1 generated text:

Huawei is a Chinese multinational technology company that designs and sells consumer electronics, telecommunications equipment, and

Request 2 generated text:

Huawei is a Chinese multinational technology company that designs and sells consumer electronics, telecommunications equipment, and

Note: the same DP=2 check on auto-selected devices 2,3 did not pass in this run. With inherited PTO2_RING_HEAP=4294967296, device 3 failed arena setup with rtMalloc failed: 207001 (size=17179870207). With PTO2_RING_HEAP=536870912, device 2 failed prefill with aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018. The class rename itself is behavior-neutral; DP=1 and DP=2 both passed on the device settings above.
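
To put the byte counts in this note into more familiar units, a quick conversion helps. The snippet below only converts the figures quoted above to GiB; it makes no claim about how the runtime derives the arena size from the ring settings.

```python
GIB = 2**30  # bytes per GiB


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


figures = {
    "PTO2_RING_HEAP=4294967296": 4294967296,
    "PTO2_RING_HEAP=536870912": 536870912,
    "rtMalloc failed size=17179870207": 17179870207,
}
for label, value in figures.items():
    print(f"{label}: {to_gib(value):.2f} GiB")
# PTO2_RING_HEAP=4294967296: 4.00 GiB
# PTO2_RING_HEAP=536870912: 0.50 GiB
# rtMalloc failed size=17179870207: 16.00 GiB
```

The failed allocation is therefore a single request of roughly 16 GiB (2^34 bytes plus 1023), which is consistent with the later observation that devices with high existing HBM usage failed while clean devices passed.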

ndleslx · 2026-06-18T04:54:13Z

Direct python/cli/main.py validation after the CLI import fix:

DP=1: PASS on device 4, port 19340. Generated text: " a Chinese multinational technology company that designs and sells consumer electronics, telecommunications equipment, and"
DP=2: PASS on devices 4,5, port 19339. Two sequential requests both generated: " a Chinese multinational technology company that designs and sells consumer electronics, telecommunications equipment, and"

Notes:

The direct script entrypoint is python python/cli/main.py; no helper script was used.
Runtime env for both passing checks: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072.
Failed attempts on devices with high existing HBM usage produced either rtMalloc 207001 or 507018; devices 4/5 were clean and passed.

ndleslx · 2026-06-18T07:15:49Z

Addressed the latest review feedback:

Removed DP route-load double counting by clearing temporary route load as soon as the target replica scheduler accepts the request.
Made DP replica startup parallel and added cleanup so all cores are stopped if any startup task fails.
Clarified README runtime ring settings by topology: single-replica concurrent serving uses the larger window/pool, while DP=2+ keeps the smaller window/pool.
Added unit coverage for route-load clearing and partial startup cleanup.

Validation:

python -m pytest tests/test_cli.py tests/test_batching.py tests/test_parallel.py -q: PASS, 24 passed.
python tests/lint/check_headers.py: PASS.
python tests/lint/check_english_only.py: PASS.
ruff check --config ruff.toml python/core/async_engine.py tests/test_parallel.py: PASS.
DP=1 direct serving check on device 6: PASS. Generated text: " a Chinese multinational technology company that designs and sells consumer electronics, telecommunications equipment, and"
DP=2 direct serving check on devices 4,5: PASS for two sequential requests. Generated text: " a Chinese multinational technology company that designs and sells consumer electronics, telecommunications equipment, and"

Note: full ruff check --config ruff.toml . still reports pre-existing issues under pypto-lib/.claude/skills/...; those files are outside this PR's changed serving code.

bumble0918 · 2026-06-24T10:11:02Z

+        type=int,
+        default=1,
+        help="Offline generation does not launch DP replicas; values > 1 fail fast.",
+    )


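The argument types and line-wrapping styles of these args are rather inconsistent; please unify them where possible (adding dest is also somewhat redundant when it matches the full option name).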

bumble0918 · 2026-06-25T01:27:14Z

+        devices=parse_device_ids(args.devices, default_device=args.device_id),
+    )
+    if parallel_config.data_parallel_size != 1:
+        raise ValueError("offline npu_generate.py supports tensor parallelism only; data_parallel_size must be 1")


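Why can't offline mode support DP?

The offline engine and the online engine are currently not the same one (the reason is that Qwen currently has a CPU version that does not use the serving engine), so for now DP is only implemented for online serving.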

bumble0918 · 2026-06-25T01:39:23Z

        kv_cache_manager,
        platform=args.platform,
-        device_id=args.device_id,
+        device_id=device_ids[0],


Can device_ids cover what device_id does?

This is also for compatibility with the offline CPU version.

bumble0918 · 2026-06-25T01:47:20Z

+    )
+    parser.add_argument("--dp", "--data-parallel-size", dest="data_parallel_size", type=int, default=1, help="Data-parallel replica count.")
+    parser.add_argument("--tp", "--tensor-parallel-size", dest="tensor_parallel_size", type=int, default=1, help="Tensor-parallel group size.")
+    parser.add_argument("--pp", "--pipeline-parallel-size", dest="pipeline_parallel_size", type=int, default=1, help="Pipeline parallel size.")


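Let's add the public options from pp through all2all-backend only once the flow actually supports them; anything not provided should fall back to the ParallelConfig defaults for now: 1. Enabling them at run time is either a no-op or an error, so they serve no purpose; 2. The option design is incomplete: none of the options that need restricted choices define them.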
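
Two of the argparse points raised in this review (redundant `dest` and missing `choices`) can be illustrated with a short sketch. It relies on documented argparse behavior: `dest` is inferred from the first long option string, so listing the full name before the alias makes an explicit `dest` unnecessary. The `--all2all-backend` values below are placeholders invented for this example, not backends the project defines.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical serving CLI fragment, for illustration only."""
    parser = argparse.ArgumentParser(description="Parallel flags sketch")
    # Full name first: argparse infers dest="data_parallel_size" on its own.
    parser.add_argument("--data-parallel-size", "--dp", type=int, default=1,
                        help="Data-parallel replica count.")
    parser.add_argument("--tensor-parallel-size", "--tp", type=int, default=1,
                        help="Tensor-parallel group size.")
    # Placeholder choices: invalid values are rejected at parse time.
    parser.add_argument("--all2all-backend", choices=("backend-a", "backend-b"),
                        default="backend-a", help="Placeholder backend selector.")
    return parser


args = build_parser().parse_args(["--dp", "2", "--tp", "1"])
print(args.data_parallel_size, args.tensor_parallel_size, args.all2all_backend)
# 2 1 backend-a
```

In the quoted diff the short alias comes first (`"--dp", "--data-parallel-size"`), so argparse would infer `dest="dp"`, which is why `dest` was needed there; swapping the order removes that need.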

bumble0918 · 2026-06-25T04:16:11Z

 # -----------------------------------------------------------------------------------------------------------

+from .async_engine import AsyncLLMEngineClient, DPEngineCore, EngineConfig
 from .engine import LLMEngine


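Is api.py exactly the same as __init__? I don't see api.py being used anywhere.

The idea was to expose serving as a package for other software to use; I'm not sure yet whether to remove it.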

bumble0918 · 2026-06-25T04:24:46Z

+    )
+    device_groups = parallel_config.replica_device_groups
+    first_group = device_groups[0]
+    worker_device_ids = first_group if parallel_config.data_parallel_size == 1 else devices


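How does the computation here relate to EngineConfig.worker_device_ids()?

The CLI manually computes first_group, worker_device_ids (first_group if DP==1 else devices), and device_id, and then EngineConfig.worker_device_ids() re-derives the same thing through a three-branch resolver. As a result, the top-level EngineConfig.device_ids means "this replica's TP group" when DP=1 but "all devices of all replicas" when DP>1 (a value that the client immediately overwrites with replace() and that must never be read). The deeper fix is a single source of truth: derive device ownership entirely from parallel_config and stop setting device_ids at the top level, which also lets us delete the branch at line 141.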
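
A minimal sketch of the "single source of truth" idea, under the assumption that each replica's devices are a contiguous slice of the flat device list. The class and method names echo identifiers mentioned in this thread, but `devices_for_rank` is a hypothetical helper and the bodies are illustrative, not the PR's code.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParallelConfig:
    devices: tuple[int, ...]
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1

    def devices_for_rank(self, dp_rank: int) -> tuple[int, ...]:
        """Hypothetical helper: the TP device group owned by one DP replica."""
        if not 0 <= dp_rank < self.data_parallel_size:
            raise IndexError(f"dp_rank {dp_rank} out of range")
        tp = self.tensor_parallel_size
        return self.devices[dp_rank * tp:(dp_rank + 1) * tp]


@dataclass(frozen=True)
class EngineConfig:
    parallel_config: ParallelConfig
    dp_rank: int = 0

    def worker_device_ids(self) -> tuple[int, ...]:
        # No top-level device_ids field: ownership is always derived.
        return self.parallel_config.devices_for_rank(self.dp_rank)


base = EngineConfig(ParallelConfig(devices=(4, 5), data_parallel_size=2))
per_replica = [replace(base, dp_rank=rank) for rank in range(2)]
print([cfg.worker_device_ids() for cfg in per_replica])  # [(4,), (5,)]
```

With this shape the router only has to vary `dp_rank` per replica, and there is no field whose meaning changes between DP=1 and DP>1.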

bumble0918 · 2026-06-25T04:29:52Z

-task-submit --device auto --run \
-  "python -m python.cli.main \
-    --model /path/to/Qwen3-14B \
+task-submit --device auto --max-time 1200 --run \


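task-submit is environment-specific information and should not go into the README; changing the port number also seems pointless?

This is meant for developers; how about moving it into docs/dev?

Yes, put it in a usage guide or something similar (strictly speaking, this belongs in the rules on the machine...). The PTO2_xxx variable settings are also not really suitable for the README.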

bumble0918 · 2026-06-25T11:47:01Z

+    def _estimate_request_load(self, prompt: str, config) -> int:
+        prompt_tokens = 0
+        if self.tokenizer is not None:
+            prompt_tokens = len(self.tokenizer.encode(prompt))


encode is executed twice, once here and once in add_request; this needs to be simplified.
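
One way to address this is to tokenize once in the router and pass the token ids on, so that both the load estimate and the engine consume the same result. The sketch below assumes the load estimate is "prompt tokens plus maximum new tokens", which is a guess at the policy; `CountingTokenizer` is a toy stand-in used only to show that `encode` runs once.

```python
class CountingTokenizer:
    """Toy whitespace tokenizer that records how often encode() is called."""

    def __init__(self) -> None:
        self.calls = 0

    def encode(self, text: str) -> list:
        self.calls += 1
        return text.split()


def estimate_request_load(prompt_token_ids: list, max_new_tokens: int) -> int:
    """Assumed policy: prompt tokens plus the generation budget."""
    return len(prompt_token_ids) + max_new_tokens


tokenizer = CountingTokenizer()
token_ids = tokenizer.encode("Huawei is a")  # encode exactly once
load = estimate_request_load(token_ids, max_new_tokens=8)
# The same token_ids would then be handed to the engine instead of the raw prompt.
print(load, tokenizer.calls)  # 11 1
```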

bumble0918 · 2026-06-25T11:57:39Z

            self.scheduler.abort_request(sr.request.request_id)
+
+
+class AsyncLLMEngineClient:


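The following parts of the DPEngineCore and AsyncLLMEngine design are unsound or carry debt; I suggest fixing them before adding a new routing policy or a third parallel dimension:

1. Routing hooks leak backwards from the orchestration layer into the leaf core. This is the biggest design problem.
DPEngineCore.pending_token_load() (async_engine.py:175) and add_request(*, on_queued=...) (async_engine.py:186-192) exist solely to serve the client's least-pending-tokens routing. With DP=1 (the default path) the core runs standalone and both are pure dead weight. The dependency direction is inverted: the leaf is serving the orchestrator's bookkeeping needs. The consequence is that as soon as the routing policy changes to round-robin or queue depth, DPEngineCore has to change with it.

The root cause is that DPEngineCore plays a dual role: with DP=1 it is used directly as a standalone engine (main.py:206-208), while with DP>1 the client composes it as a child core. The two usages have opposite needs for those two hooks, so the leaf is forced to accommodate both.

2. There is no common interface or Protocol; consistency rests on a duck-typed union.
DPEngineCore and AsyncLLMEngineClient merely happen to share method names, with no common base class or Protocol. AsyncLLMEngineClient also hard-codes core_factory: Callable[..., DPEngineCore] = DPEngineCore (async_engine.py:370) and reaches directly into the core's internals (pending_token_load, maintaining the mapping via _request_to_replica) instead of programming against an abstraction. The consequence: once the two signatures drift (for example, the core's add_request gains on_queued but the client's does not), the type checker will not flag it and the mismatch only surfaces at runtime.

3. Misleading names.

DPEngineCore: this leaf is the single-replica, TP-only core and does no data parallelism itself; DP is handled by the client layer above it. The "DP" in the name conflates the layer (core vs. router) with the topology dimension that the routing layer manages. EngineCore or ReplicaEngineCore would be more accurate.

AsyncLLMEngineClient: it is not a client of any remote service but an in-process routing/fan-out layer. "Client" implies an external dependency, when in fact it just composes a few local DPEngineCore instances. The original name AsyncLLMEngine actually fits this layer better (an "engine" to the outside, composing cores on the inside).

4. (Minor) Double stop on the failure path. When DPEngineCore.start() fails, it first calls await self.stop() itself (async_engine.py:150-152), and then AsyncLLMEngineClient.start() calls stop() on every core again (async_engine.py:416). The second stop is essentially idempotent (_loop_task is None and is skipped, and joining an already terminated process is safe), but it is sloppy.

Suggested tightening, ordered by cost-effectiveness:

Always go through the client and eliminate DPEngineCore's dual role: have run_serve use AsyncLLMEngineClient even when DP=1 (it already supports a single core). DPEngineCore then only ever exists as a composed child core, so pending_token_load/on_queued become legitimate, the dual-role and leakage problems disappear together, and the default DP=1 path also exercises the routing logic (currently covered only by fake-based tests).

Extract an EngineCore Protocol (start/stop/add_request/abort_request/pending_token_load/generate_request_id), have DPEngineCore implement it, and have the client program against the Protocol rather than the concrete DPEngineCore. That corrects the dependency direction and leaves room to swap the core implementation later.

Rename: DPEngineCore → EngineCore/ReplicaEngineCore, AsyncLLMEngineClient → AsyncLLMEngine (or DPRouter), so that the names reflect the real division of labor between the routing layer and the single-replica core.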
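
For concreteness, the Protocol suggestion could look something like the sketch below. The method list comes from the comment above; the signatures (sync vs. async, argument and return types) are assumptions, and `FakeCore` and `Router` are hypothetical names used only to show a router programming against the abstraction.

```python
from __future__ import annotations

from typing import Any, AsyncIterator, Protocol, Sequence, runtime_checkable


@runtime_checkable
class EngineCore(Protocol):
    """Assumed interface that a per-replica core would expose to the router."""

    async def start(self) -> None: ...
    async def stop(self) -> None: ...
    def add_request(self, request_id: str, prompt: str, config: Any) -> AsyncIterator[Any]: ...
    def abort_request(self, request_id: str) -> None: ...
    def pending_token_load(self) -> int: ...
    def generate_request_id(self) -> str: ...


class FakeCore:
    """Test double that satisfies the protocol structurally (no inheritance)."""

    def __init__(self, name: str, load: int) -> None:
        self.name = name
        self._load = load

    async def start(self) -> None:
        return None

    async def stop(self) -> None:
        return None

    async def add_request(self, request_id: str, prompt: str, config: Any):
        yield f"{self.name}:{prompt}"

    def abort_request(self, request_id: str) -> None:
        return None

    def pending_token_load(self) -> int:
        return self._load

    def generate_request_id(self) -> str:
        return f"{self.name}-req"


class Router:
    """Depends only on the EngineCore protocol, never on a concrete core class."""

    def __init__(self, cores: Sequence[EngineCore]) -> None:
        if not cores:
            raise ValueError("at least one core is required")
        self._cores = list(cores)

    def least_loaded(self) -> EngineCore:
        return min(self._cores, key=lambda core: core.pending_token_load())


cores = [FakeCore("core-0", 120), FakeCore("core-1", 40)]
router = Router(cores)
print(isinstance(cores[0], EngineCore))          # True
print(router.least_loaded().generate_request_id())  # core-1-req
```

Note that `runtime_checkable` only verifies that the methods exist; signature drift of the kind described in point 2 would still need a static type checker to catch.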

gemini-code-assist Bot reviewed Jun 17, 2026

ndleslx force-pushed the dist-serving branch from b13247c to 5f38e42 on June 18, 2026 03:04

coderabbitai Bot reviewed Jun 18, 2026

ndleslx force-pushed the dist-serving branch 2 times, most recently from ecb8e9a to 6904dc6, on June 18, 2026 04:53

ndleslx force-pushed the dist-serving branch 2 times, most recently from d6415e8 to 2ceb391, on June 18, 2026 06:34

Commit 454d1ca: feat: add v1 parallel serving strategy support

ndleslx force-pushed the dist-serving branch from 2ceb391 to 454d1ca on June 18, 2026 07:14

This was referenced Jun 18, 2026:

[Feature] Parallel Strategies Support #36 (Open)
Add DeepSeek V4 serving integration #40 (Draft)

bumble0918 reviewed Jun 25, 2026

bumble0918 mentioned this pull request Jun 25, 2026:

[Feature] Qwen3-14B single-node multi-card serving with DP and TP co-agent-serving/meta-sprint#12 (Open)
