feat: Add DeepSeek V4 cache support by superxf · Pull Request #42 · hw-native-sys/pypto-serving

superxf · 2026-06-25T11:48:02Z

Overview

This PR adds cache support for the DeepSeek V4 model to improve inference performance.

1. Core type definitions (`python/core/types.py`)

- KVCacheSpec: defines the spec of a single cache family (block_size, page_size, compress_ratio)
- KVCacheGroupSpec: defines a named cache group (name, layer_indices, spec, max_blocks_per_seq)
- Adds a block_ids_by_group field to PrefillBatch and DecodeBatch
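
For illustration only, the two spec types could be sketched as frozen dataclasses. The field names come from the description above; the field types, the default for compress_ratio, and the shape of block_ids_by_group (group name mapped to per-sequence block ID lists) are assumptions, not the PR's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KVCacheSpec:
    """Spec for one cache family. Field types and the default are assumed."""
    block_size: int
    page_size: int
    compress_ratio: int = 1


@dataclass(frozen=True)
class KVCacheGroupSpec:
    """A named cache group that covers a set of layers."""
    name: str
    layer_indices: tuple[int, ...]
    spec: KVCacheSpec
    max_blocks_per_seq: int


@dataclass
class DecodeBatch:
    """Only the new field is shown; the real class carries more fields."""
    # Assumed shape: group name -> one list of block IDs per sequence.
    block_ids_by_group: dict[str, list[list[int]]] = field(default_factory=dict)


cmp_group = KVCacheGroupSpec(
    name="cmp",
    layer_indices=(0, 1),
    spec=KVCacheSpec(block_size=128, page_size=4096, compress_ratio=4),
    max_blocks_per_seq=8,
)
batch = DecodeBatch()
batch.block_ids_by_group[cmp_group.name] = [[0, 1]]
```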

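2. KV cache manager enhancements (`python/core/kv_cache.py`)

- Adds a _GroupBlockPool class that maintains a separate block pool and free queue for each cache group
- Adds an init_groups() method to KvCacheManager that initializes the per-group block pools from specs
- Each group manages its block allocation and reference counts independently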
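
A minimal sketch of the per-group pool idea, assuming a FIFO free queue and one reference count per block. The class name GroupBlockPool and the allocate/release methods are hypothetical stand-ins; the PR's _GroupBlockPool may differ.

```python
from collections import deque


class GroupBlockPool:
    """Hypothetical stand-in for _GroupBlockPool: one free queue per group."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self, n: int) -> list[int]:
        """Take n blocks from the free queue, or raise if too few remain."""
        if n > len(self.free):
            raise RuntimeError(f"need {n} blocks, only {len(self.free)} free")
        ids = [self.free.popleft() for _ in range(n)]
        for block_id in ids:
            self.ref_counts[block_id] = 1
        return ids

    def release(self, ids: list[int]) -> None:
        """Drop one reference per block; return unreferenced blocks to the queue."""
        for block_id in ids:
            self.ref_counts[block_id] -= 1
            if self.ref_counts[block_id] == 0:
                self.free.append(block_id)


# Each group owns its pool, so exhausting one group does not affect another.
pools = {"ori": GroupBlockPool(4), "cmp": GroupBlockPool(2)}
ori_ids = pools["ori"].allocate(3)
cmp_ids = pools["cmp"].allocate(1)
pools["ori"].release(ori_ids)
```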

3. DeepSeek V4 support (`examples/model/deepseek_v4/runner/npu_runner.py`)

- Adds a build_deepseek_v4_cache_group_specs() function that defines 6 cache families:
  - ori (Original KV cache)
  - cmp (Compressed KV cache)
  - idx (Index cache)
  - hca_state (HCA State cache)
  - csa_state (CSA State cache)
  - csa_inner_state (CSA Inner State cache)
- Adds new methods to DeepSeekV4CacheManager:
  - block_table_from_ids(): builds the block table from scheduler block IDs
  - slot_mapping_from_ids(): slot mapping based on block IDs
  - sliding_window_slot_mapping_from_ids(): sliding-window slot mapping
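
The description does not spell out how these methods compute their results. As a sketch under the usual paged-cache convention (slot = block ID × block_size + offset within the block), the first two could look like the free functions below. The signatures, the -1 padding value, and the slot formula are all assumptions, not the PR's implementation.

```python
def block_table_from_ids(block_ids, max_blocks_per_seq, pad=-1):
    """Pad one sequence's block IDs to a fixed-width block table row."""
    if len(block_ids) > max_blocks_per_seq:
        raise ValueError("sequence uses more blocks than max_blocks_per_seq")
    return list(block_ids) + [pad] * (max_blocks_per_seq - len(block_ids))


def slot_mapping_from_ids(block_ids, positions, block_size):
    """Map token positions to flat cache slots through the block IDs."""
    return [
        block_ids[pos // block_size] * block_size + pos % block_size
        for pos in positions
    ]


# Two blocks of 4 tokens each; tokens 2..5 straddle the block boundary.
table = block_table_from_ids([7, 2], max_blocks_per_seq=4)
slots = slot_mapping_from_ids([7, 2], positions=range(2, 6), block_size=4)
```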

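4. Scheduler and engine integration

- Adds a kv_cache_groups field to RuntimeConfig
- The scheduler supports allocating and managing multiple cache groups
- The worker passes the per-group block IDs to the runner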

coderabbitai · 2026-06-25T11:48:15Z

Caution

Review failed

An error occurred during the review process. Please try again later.

gemini-code-assist

Code Review

This pull request introduces support for multi-group KV caching and parallel serving topologies (DP x TP), specifically integrating the DeepSeek-V4 model with lazy weight loading for W8A8 compressed-tensors checkpoints. It refactors the async engine into a client-core architecture to route requests across data-parallel replicas and updates the CLI to support parallel configuration arguments. The review feedback highlights a potential ZeroDivisionError in the KV cache group allocation, a critical error-handling issue in the scheduler where group block allocation failures are silently ignored, and redundant I/O operations when parsing config.json in the CLI.

gemini-code-assist · 2026-06-25T11:49:18Z

+                num_blocks_needed = math.ceil(
+                    token_count / (spec.spec.block_size // max(1, spec.spec.compress_ratio))
+                )


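When computing num_blocks_needed, if spec.spec.block_size // max(1, spec.spec.compress_ratio) evaluates to 0 (for example, when block_size is smaller than compress_ratio), the division raises a ZeroDivisionError.

Consider wrapping the denominator in max(1, ...) so that it is always at least 1, which makes the code more robust.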

Suggested change

-                num_blocks_needed = math.ceil(
-                    token_count / (spec.spec.block_size // max(1, spec.spec.compress_ratio))
-                )
+                num_blocks_needed = math.ceil(
+                    token_count / max(1, spec.spec.block_size // max(1, spec.spec.compress_ratio))
+                )
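
The failure mode and the guard can be reproduced in isolation. This is a standalone sketch: the helper name num_blocks_needed and its flat parameters are hypothetical, standing in for the spec.spec.* attributes in the diff.

```python
import math


def num_blocks_needed(token_count: int, block_size: int, compress_ratio: int) -> int:
    """Blocks needed for token_count tokens, with the denominator clamped to >= 1."""
    tokens_per_block = max(1, block_size // max(1, compress_ratio))
    return math.ceil(token_count / tokens_per_block)


# Without the outer max(1, ...), block_size < compress_ratio gives 2 // 4 == 0.
try:
    math.ceil(10 / (2 // max(1, 4)))
    unguarded_raised = False
except ZeroDivisionError:
    unguarded_raised = True

normal = num_blocks_needed(100, block_size=128, compress_ratio=4)
clamped = num_blocks_needed(10, block_size=2, compress_ratio=4)
```

Note that clamping treats each block as holding one token in the degenerate case; whether that is the right capacity or the spec should instead be rejected at validation time is a design decision for the PR author.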

gemini-code-assist · 2026-06-25T11:49:18Z

+    def _try_allocate_group_blocks(self, request: Request, total_tokens: int) -> None:
+        """Allocate multi-group blocks if groups are configured."""
+        if not self.kv_cache_manager.has_groups:
+            return
+        if request.allocated_group_block_ids:
+            return
+        try:
+            request.allocated_group_block_ids = self.kv_cache_manager.allocate_for_groups(
+                request.request_id, total_tokens
+            )
+        except RuntimeError:
+            request.allocated_group_block_ids = {}


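In _try_allocate_group_blocks, if allocating the multi-group KV cache blocks fails (raising RuntimeError), the current implementation silently sets request.allocated_group_block_ids to an empty dict {} and lets scheduling continue.

When the request is later sent to the NPU for execution, it will crash at runtime or hit missing-key errors because the required group cache blocks (ori, cmp, idx, and so on) are absent.

Consider having _try_allocate_group_blocks return a boolean that indicates whether allocation succeeded, and have the scheduler skip or preempt requests whose allocation failed, just as it does when _try_allocate_blocks fails.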
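
One way the boolean-returning variant might look, as a self-contained sketch. Request and StubKvCacheManager here are simplified hypothetical stand-ins, and the skip-on-failure loop only illustrates the idea; the real scheduler's skip or preempt path is not shown in this PR excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Simplified stand-in for the scheduler's Request type."""
    request_id: str
    allocated_group_block_ids: dict[str, list[int]] = field(default_factory=dict)


class StubKvCacheManager:
    """Hypothetical manager that fails once a token budget is exceeded."""
    has_groups = True

    def __init__(self, capacity_tokens: int) -> None:
        self.capacity_tokens = capacity_tokens

    def allocate_for_groups(self, request_id: str, total_tokens: int) -> dict[str, list[int]]:
        if total_tokens > self.capacity_tokens:
            raise RuntimeError("out of group blocks")
        return {"ori": [0], "cmp": [0]}


def try_allocate_group_blocks(manager, request: Request, total_tokens: int) -> bool:
    """Return True if the request has (or does not need) group blocks."""
    if not manager.has_groups or request.allocated_group_block_ids:
        return True
    try:
        request.allocated_group_block_ids = manager.allocate_for_groups(
            request.request_id, total_tokens
        )
    except RuntimeError:
        return False  # caller decides whether to skip or preempt
    return True


manager = StubKvCacheManager(capacity_tokens=16)
waiting = [(Request("a"), 8), (Request("b"), 64)]
scheduled = [req for req, n in waiting if try_allocate_group_blocks(manager, req, n)]
```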

gemini-code-assist · 2026-06-25T11:49:19Z

+    """Validate model-specific serving topology constraints."""
+    if model_family != "deepseek_v4":
+        return
+    config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())


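_validate_model_topology reads and parses config.json again to obtain the configuration, even though _detect_model_family has already read and parsed the same file before this function is called. Reading the same file twice adds unnecessary I/O overhead.

Consider reading config.json once in build_serving_engine_config and passing the parsed configuration data to both functions, which avoids the repeated disk I/O.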
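
A sketch of the read-once structure. The function bodies are assumptions: how _detect_model_family actually decides the family is not shown in the diff (the "model_type" key is a guess), the topology checks are left as a placeholder, and the returned dict is purely illustrative.

```python
import json
import tempfile
from pathlib import Path


def load_model_config(model_dir: str) -> dict:
    """Read and parse config.json exactly once."""
    return json.loads((Path(model_dir).resolve() / "config.json").read_text())


def detect_model_family(config_data: dict) -> str:
    # Assumption: the family is derived from the "model_type" key.
    return config_data.get("model_type", "unknown")


def validate_model_topology(config_data: dict, model_family: str) -> None:
    """Validate model-specific serving topology constraints."""
    if model_family != "deepseek_v4":
        return
    # Model-specific checks against config_data would go here.


def build_serving_engine_config(model_dir: str) -> dict:
    config_data = load_model_config(model_dir)  # the only disk read
    family = detect_model_family(config_data)
    validate_model_topology(config_data, family)
    return {"model_family": family}


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(json.dumps({"model_type": "deepseek_v4"}))
    result = build_serving_engine_config(tmp)
```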

coderabbitai · 2026-06-25T11:56:48Z

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

ndleslx and others added 3 commits June 18, 2026 15:14

feat: add v1 parallel serving strategy support

454d1ca

Add DeepSeek V4 serving integration

7b706a3

support deeepseek v4 cache

218ef5d

gemini-code-assist Bot reviewed Jun 25, 2026

superxf closed this Jun 25, 2026

superxf reopened this Jun 25, 2026

bumble0918 mentioned this pull request Jun 25, 2026

[Feature] DeepSeek-V4 single-node 16-card serving inference co-agent-serving/meta-sprint#13

Open

superxf marked this pull request as draft June 29, 2026 02:14

superxf wants to merge 3 commits into hw-native-sys:main from superxf:dpskv4_cache
