feat: Add DeepSeek V4 cache support #42

Draft
superxf wants to merge 3 commits into hw-native-sys:main from superxf:dpskv4_cache

@superxf superxf commented Jun 25, 2026

Contributor

Overview

This PR adds cache support for the DeepSeek V4 model to improve inference performance.

1. Core type definitions (python/core/types.py)

  • KVCacheSpec: defines the spec of a single cache family (block_size, page_size, compress_ratio)
  • KVCacheGroupSpec: defines a named cache group (name, layer_indices, spec, max_blocks_per_seq)
  • PrefillBatch and DecodeBatch gain a block_ids_by_group field
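
To make the shape of these types concrete, here is a minimal sketch. The field names come from the list above; the field types, the default for compress_ratio, and the example values are assumptions, not the PR's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheSpec:
    """Spec of one cache family (field types are assumed)."""
    block_size: int
    page_size: int
    compress_ratio: int = 1  # assumed default: no compression


@dataclass(frozen=True)
class KVCacheGroupSpec:
    """A named cache group covering a set of layers."""
    name: str
    layer_indices: tuple[int, ...]
    spec: KVCacheSpec
    max_blocks_per_seq: int


# Illustrative values only; the real sizes live in the PR's runner code.
ori = KVCacheGroupSpec(
    name="ori",
    layer_indices=(0, 1, 2),
    spec=KVCacheSpec(block_size=128, page_size=128),
    max_blocks_per_seq=64,
)

# A batch would then carry block IDs keyed by group name.
block_ids_by_group = {ori.name: [3, 7]}
```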

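2. KV cache manager enhancements (python/core/kv_cache.py)

  • Adds a _GroupBlockPool class that maintains an independent block pool and free queue for each cache group
  • KvCacheManager gains an init_groups() method that initializes multiple group block pools from specs
  • Each group manages block allocation and reference counts independently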
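
The idea of one pool per group, each with its own free queue and reference counts, could look roughly like the sketch below. The class name GroupBlockPool and its methods are hypothetical stand-ins and are not taken from the PR's _GroupBlockPool.

```python
from collections import deque


class GroupBlockPool:
    """Hypothetical per-group pool: a free queue plus reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self, n: int) -> list[int]:
        """Pop n blocks from the free queue, or fail without side effects."""
        if n > len(self.free):
            raise RuntimeError("not enough free blocks")
        ids = [self.free.popleft() for _ in range(n)]
        for block_id in ids:
            self.ref_counts[block_id] = 1
        return ids

    def release(self, ids: list[int]) -> None:
        """Drop one reference per block; return unreferenced blocks to the queue."""
        for block_id in ids:
            self.ref_counts[block_id] -= 1
            if self.ref_counts[block_id] == 0:
                self.free.append(block_id)


# Each group owns its own pool, so exhausting one does not affect another.
pools = {"ori": GroupBlockPool(4), "cmp": GroupBlockPool(2)}
ori_ids = pools["ori"].allocate(3)
cmp_ids = pools["cmp"].allocate(1)
pools["ori"].release(ori_ids[:1])
```

Keeping the pools separate means block IDs are only meaningful within their own group, which is why the batches carry IDs keyed by group name.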

3. DeepSeek V4 support (examples/model/deepseek_v4/runner/npu_runner.py)

  • Adds build_deepseek_v4_cache_group_specs(), which defines 6 cache families:
    • ori (Original KV cache)
    • cmp (Compressed KV cache)
    • idx (Index cache)
    • hca_state (HCA State cache)
    • csa_state (CSA State cache)
    • csa_inner_state (CSA Inner State cache)
  • DeepSeekV4CacheManager gains new methods:
    • block_table_from_ids(): builds the block table from scheduler block IDs
    • slot_mapping_from_ids(): slot mapping based on block IDs
    • sliding_window_slot_mapping_from_ids(): sliding-window slot mapping
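
For readers unfamiliar with paged caches, the following sketch shows one common way such mappings are computed. The function names are hypothetical, and the padding value, the flat slot formula, and the ring-buffer treatment of the sliding window are assumptions; the PR's methods may behave differently.

```python
def sketch_block_table(block_ids, max_blocks_per_seq, pad=-1):
    """Pad a sequence's block IDs to a fixed-width row (pad value is assumed)."""
    if len(block_ids) > max_blocks_per_seq:
        raise ValueError("sequence uses more blocks than max_blocks_per_seq")
    return list(block_ids) + [pad] * (max_blocks_per_seq - len(block_ids))


def sketch_slot_mapping(block_ids, positions, block_size):
    """Map token positions to flat slots: block_id * block_size + offset."""
    return [
        block_ids[pos // block_size] * block_size + pos % block_size
        for pos in positions
    ]


def sketch_sliding_window_slot_mapping(block_ids, positions, block_size, window):
    """Assumed ring-buffer variant: positions wrap modulo the window size."""
    wrapped = [pos % window for pos in positions]
    return sketch_slot_mapping(block_ids, wrapped, block_size)


block_ids = [5, 2]  # blocks handed out by the scheduler for one sequence
table = sketch_block_table(block_ids, max_blocks_per_seq=4)
slots = sketch_slot_mapping(block_ids, [0, 3, 4, 6], block_size=4)
window_slots = sketch_sliding_window_slot_mapping(
    block_ids, [7, 8, 9], block_size=4, window=8
)
```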

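4. Scheduler and engine integration

  • RuntimeConfig gains a kv_cache_groups field
  • The scheduler supports allocating and managing multiple cache groups
  • The worker passes per-group block IDs to the runner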

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Caution

Review failed

An error occurred during the review process. Please try again later.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for multi-group KV caching and parallel serving topologies (DP x TP), specifically integrating the DeepSeek-V4 model with lazy weight loading for W8A8 compressed-tensors checkpoints. It refactors the async engine into a client-core architecture to route requests across data-parallel replicas and updates the CLI to support parallel configuration arguments. The review feedback highlights a potential ZeroDivisionError in the KV cache group allocation, a critical error-handling issue in the scheduler where group block allocation failures are silently ignored, and redundant I/O operations when parsing config.json in the CLI.

Comment thread python/core/kv_cache.py
Comment on lines +518 to +520
num_blocks_needed = math.ceil(
    token_count / (spec.spec.block_size // max(1, spec.spec.compress_ratio))
)

high

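When computing num_blocks_needed, if spec.spec.block_size // max(1, spec.spec.compress_ratio) evaluates to 0 (for example, when block_size is smaller than compress_ratio), a ZeroDivisionError is raised.

Consider wrapping the denominator in max(1, ...) so that it is at least 1, which makes the code more robust.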

Suggested change
- num_blocks_needed = math.ceil(
-     token_count / (spec.spec.block_size // max(1, spec.spec.compress_ratio))
- )
+ num_blocks_needed = math.ceil(
+     token_count / max(1, spec.spec.block_size // max(1, spec.spec.compress_ratio))
+ )
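
The effect of the suggested guard can be checked in isolation. This is a standalone sketch: the helper name blocks_needed and its flat parameters are hypothetical and stand in for the spec.spec attributes used in the PR.

```python
import math


def blocks_needed(token_count: int, block_size: int, compress_ratio: int) -> int:
    """Guarded form of the calculation; the divisor is clamped to at least 1."""
    divisor = max(1, block_size // max(1, compress_ratio))
    return math.ceil(token_count / divisor)


typical = blocks_needed(1000, block_size=128, compress_ratio=4)
# block_size < compress_ratio: the unguarded form would divide by zero here.
degenerate = blocks_needed(10, block_size=4, compress_ratio=8)
```

Clamping avoids the crash, but a block_size smaller than compress_ratio may also indicate a misconfigured spec, so validating the spec at construction time is a possible alternative.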

Comment thread python/core/scheduler.py
Comment on lines +411 to +422
def _try_allocate_group_blocks(self, request: Request, total_tokens: int) -> None:
    """Allocate multi-group blocks if groups are configured."""
    if not self.kv_cache_manager.has_groups:
        return
    if request.allocated_group_block_ids:
        return
    try:
        request.allocated_group_block_ids = self.kv_cache_manager.allocate_for_groups(
            request.request_id, total_tokens
        )
    except RuntimeError:
        request.allocated_group_block_ids = {}

high

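In _try_allocate_group_blocks, if allocating the multi-group KV cache blocks fails (a RuntimeError is raised), the current implementation silently sets request.allocated_group_block_ids to an empty dict {} and lets scheduling continue.

When the request is later sent to the NPU for execution, it will be missing the required group cache blocks (such as ori, cmp, and idx), which leads to a runtime crash or a missing-key error.

Consider having _try_allocate_group_blocks return a boolean that indicates whether allocation succeeded, and have the scheduler skip or preempt requests whose allocation failed, the same way it handles a _try_allocate_blocks failure.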
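
A minimal sketch of the boolean-returning variant is shown below. Request and FakeKvCacheManager are hypothetical stand-ins written for this example (the fake hands out one block per token); only has_groups, allocate_for_groups, and allocated_group_block_ids are names taken from the snippet under review.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Stand-in for the scheduler's request type (hypothetical)."""
    request_id: str
    allocated_group_block_ids: dict = field(default_factory=dict)


class FakeKvCacheManager:
    """Stand-in manager that hands out one block per token."""
    has_groups = True

    def __init__(self, free_blocks: int) -> None:
        self.free_blocks = free_blocks

    def allocate_for_groups(self, request_id: str, total_tokens: int) -> dict:
        if total_tokens > self.free_blocks:
            raise RuntimeError("out of group blocks")
        self.free_blocks -= total_tokens
        return {"ori": list(range(total_tokens)), "cmp": [0]}


def try_allocate_group_blocks(manager, request, total_tokens) -> bool:
    """Return True when the request may be scheduled, False on failure."""
    if not manager.has_groups:
        return True  # nothing to allocate
    if request.allocated_group_block_ids:
        return True  # already allocated earlier
    try:
        request.allocated_group_block_ids = manager.allocate_for_groups(
            request.request_id, total_tokens
        )
    except RuntimeError:
        return False  # caller should skip or preempt this request
    return True


manager = FakeKvCacheManager(free_blocks=4)
first, second = Request("req-1"), Request("req-2")
first_ok = try_allocate_group_blocks(manager, first, 3)
second_ok = try_allocate_group_blocks(manager, second, 3)
```

With the failure surfaced as a return value, the scheduler can treat it like any other allocation miss instead of dispatching a request that has no group blocks.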

Comment thread python/cli/main.py
    """Validate model-specific serving topology constraints."""
    if model_family != "deepseek_v4":
        return
    config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())

medium

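In _validate_model_topology, the code reads and parses config.json again to obtain the configuration, even though _detect_model_family has already read and parsed the same file before this function is called. Reading the file twice adds unnecessary I/O overhead.

Consider reading config.json once in build_serving_engine_config and passing the parsed configuration to both functions, which avoids the repeated disk I/O.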
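
The read-once pattern can be sketched as follows. All function names here are hypothetical simplifications of the ones named in the review, and both the model_type key and the head-count check are placeholder assumptions, not the PR's actual detection or topology rules.

```python
import json
import tempfile
from pathlib import Path


def load_model_config(model_dir: str) -> dict:
    """Read and parse config.json exactly once."""
    return json.loads((Path(model_dir).resolve() / "config.json").read_text())


def detect_model_family(config_data: dict) -> str:
    # Assumption: the family is derived from a "model_type" key.
    return config_data.get("model_type", "unknown")


def validate_model_topology(config_data: dict, model_family: str, tp_size: int) -> None:
    """Placeholder check; the real constraints are defined in the PR."""
    if model_family != "deepseek_v4":
        return
    heads = config_data.get("num_attention_heads")
    if heads is not None and heads % tp_size != 0:
        raise ValueError("tp_size must divide num_attention_heads")


# Write a throwaway config so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"model_type": "deepseek_v4", "num_attention_heads": 8}
    (Path(tmp) / "config.json").write_text(json.dumps(sample))
    config_data = load_model_config(tmp)  # single disk read

family = detect_model_family(config_data)
validate_model_topology(config_data, family, tp_size=4)
```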

@superxf superxf closed this Jun 25, 2026
@superxf superxf reopened this Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details
HttpError (status 500) on PATCH https://api.github.com/repos/hw-native-sys/pypto-serving/issues/comments/4798876500 after 3 retries. The request body was the auto-generated CodeRabbit "review in progress" comment, which listed 27 files selected for processing.
