feat: Add DeepSeek V4 cache support #42
Code Review
This pull request introduces support for multi-group KV caching and parallel serving topologies (DP x TP), specifically integrating the DeepSeek-V4 model with lazy weight loading for W8A8 compressed-tensors checkpoints. It refactors the async engine into a client-core architecture to route requests across data-parallel replicas and updates the CLI to support parallel configuration arguments. The review feedback highlights a potential ZeroDivisionError in the KV cache group allocation, a critical error-handling issue in the scheduler where group block allocation failures are silently ignored, and redundant I/O operations when parsing config.json in the CLI.
num_blocks_needed = math.ceil(
    token_count / (spec.spec.block_size // max(1, spec.spec.compress_ratio))
)
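When computing num_blocks_needed, if spec.spec.block_size // max(1, spec.spec.compress_ratio) evaluates to 0 (for example, when block_size is smaller than compress_ratio), the division raises a ZeroDivisionError.
Consider wrapping the divisor in max(1, ...) so that it is always at least 1, which makes the code more robust.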
Suggested change:
- num_blocks_needed = math.ceil(
-     token_count / (spec.spec.block_size // max(1, spec.spec.compress_ratio))
- )
+ num_blocks_needed = math.ceil(
+     token_count / max(1, spec.spec.block_size // max(1, spec.spec.compress_ratio))
+ )
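To see why the extra max(1, ...) matters, the following standalone sketch reproduces the arithmetic outside the PR. The helper names tokens_per_block and num_blocks_needed, and the sample sizes, are made up for illustration; only the formula comes from the snippet above.

```python
import math


def tokens_per_block(block_size: int, compress_ratio: int) -> int:
    """Tokens that fit in one block, clamped so the result is never 0."""
    return max(1, block_size // max(1, compress_ratio))


def num_blocks_needed(token_count: int, block_size: int, compress_ratio: int) -> int:
    """Guarded form of the calculation quoted in the review."""
    return math.ceil(token_count / tokens_per_block(block_size, compress_ratio))


# Unguarded form: 4 // 8 == 0, so the division raises ZeroDivisionError.
try:
    math.ceil(100 / (4 // max(1, 8)))
    unguarded_failed = False
except ZeroDivisionError:
    unguarded_failed = True

print(unguarded_failed)                # True
print(num_blocks_needed(100, 4, 8))    # 100 (divisor clamped to 1)
print(num_blocks_needed(100, 128, 4))  # 4   (32 tokens per block)
```

Note that clamping hides a configuration in which block_size is smaller than compress_ratio; whether such a configuration should instead be rejected at startup is a design decision for the PR author.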
def _try_allocate_group_blocks(self, request: Request, total_tokens: int) -> None:
    """Allocate multi-group blocks if groups are configured."""
    if not self.kv_cache_manager.has_groups:
        return
    if request.allocated_group_block_ids:
        return
    try:
        request.allocated_group_block_ids = self.kv_cache_manager.allocate_for_groups(
            request.request_id, total_tokens
        )
    except RuntimeError:
        request.allocated_group_block_ids = {}
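The review summary flags this handler because a failed group allocation is swallowed and the request carries on with an empty mapping. One possible alternative, sketched below, is to report the failure to the caller so the scheduler can keep the request queued. FakeGroupManager, the simplified Request, the "ori"-only allocation, and the boolean return value are all hypothetical stand-ins written for this example, not the PR's classes or its intended fix.

```python
from dataclasses import dataclass, field


class FakeGroupManager:
    """Hypothetical stand-in for the KV cache manager (requires Python 3.9+)."""

    has_groups = True

    def __init__(self, free_blocks: int, block_size: int = 16) -> None:
        self.free_blocks = free_blocks
        self.block_size = block_size

    def allocate_for_groups(self, request_id: str, total_tokens: int) -> dict[str, list[int]]:
        needed = -(-total_tokens // self.block_size)  # ceiling division
        if needed > self.free_blocks:
            raise RuntimeError(f"out of group blocks for {request_id}")
        self.free_blocks -= needed
        return {"ori": list(range(needed))}


@dataclass
class Request:
    request_id: str
    allocated_group_block_ids: dict[str, list[int]] = field(default_factory=dict)


def try_allocate_group_blocks(manager: FakeGroupManager, request: Request, total_tokens: int) -> bool:
    """Return True if the request may be scheduled, False if allocation failed."""
    if not manager.has_groups or request.allocated_group_block_ids:
        return True
    try:
        request.allocated_group_block_ids = manager.allocate_for_groups(
            request.request_id, total_tokens
        )
    except RuntimeError:
        return False  # surface the failure instead of continuing with {}
    return True


manager = FakeGroupManager(free_blocks=2)
first_ok = try_allocate_group_blocks(manager, Request("a"), total_tokens=32)
second_ok = try_allocate_group_blocks(manager, Request("b"), total_tokens=16)
print(first_ok, second_ok)  # True False
```

With a return value, the caller can distinguish "no groups configured" from "allocation failed", which the original handler cannot express.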
    """Validate model-specific serving topology constraints."""
    if model_family != "deepseek_v4":
        return
    config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())
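The review summary mentions redundant I/O when config.json is parsed in the CLI. If the file is read in more than one place, one way to avoid repeated reads is to memoize the loader, as in this sketch. load_model_config is a hypothetical helper name, and the temporary directory and its example_key content exist only to make the example runnable.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_model_config(model_dir: str) -> dict:
    """Read and parse <model_dir>/config.json; later calls reuse the result."""
    return json.loads((Path(model_dir).resolve() / "config.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder content, not a real DeepSeek V4 config.
    (Path(tmp) / "config.json").write_text(json.dumps({"example_key": 1}))
    first = load_model_config(tmp)
    second = load_model_config(tmp)  # served from the cache, no file read
    cache_hits = load_model_config.cache_info().hits

print(first is second, cache_hits)  # True 1
```

The cached dict is shared between callers, so it should be treated as read-only. The cache is keyed on the string that is passed in, so two different spellings of the same directory would each trigger one read.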
Overview
This PR adds cache support for the DeepSeek V4 model to improve inference performance.
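1. Core type definitions (python/core/types.py)
- KVCacheSpec: defines the spec of a single cache family (block_size, page_size, compress_ratio)
- KVCacheGroupSpec: defines a named cache group (name, layer_indices, spec, max_blocks_per_seq)
- PrefillBatch and DecodeBatch gain a block_ids_by_group field

2. KV cache manager enhancements (python/core/kv_cache.py)
- _GroupBlockPool class: maintains an independent block pool and free queue for each cache group
- KvCacheManager gains an init_groups() method that initializes the multi-group block pools from specs

3. DeepSeek V4 support (examples/model/deepseek_v4/runner/npu_runner.py)
- build_deepseek_v4_cache_group_specs() function: defines 6 cache families:
  - ori (Original KV cache)
  - cmp (Compressed KV cache)
  - idx (Index cache)
  - hca_state (HCA State cache)
  - csa_state (CSA State cache)
  - csa_inner_state (CSA Inner State cache)
- DeepSeekV4CacheManager gains new methods:
  - block_table_from_ids(): builds the block table from the scheduler's block IDs
  - slot_mapping_from_ids(): builds the slot mapping from block IDs
  - sliding_window_slot_mapping_from_ids(): builds the sliding-window slot mapping

4. Scheduler and engine integration
- RuntimeConfig gains a kv_cache_groups field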
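
The pieces listed above can be pictured with a small standalone sketch. The field names of KVCacheSpec and KVCacheGroupSpec follow the summary; everything else is an assumption made for illustration: the GroupBlockPool class, the free-function init_groups, the FIFO free queue, the sample sizes, and the slot formula (block_id * block_size + offset, a common paged-cache convention) are not taken from the PR's code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheSpec:
    block_size: int
    page_size: int
    compress_ratio: int


@dataclass(frozen=True)
class KVCacheGroupSpec:
    name: str
    layer_indices: tuple[int, ...]  # requires Python 3.9+
    spec: KVCacheSpec
    max_blocks_per_seq: int


class GroupBlockPool:
    """One independent pool of block IDs with a FIFO free queue."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))

    def allocate(self, count: int) -> list[int]:
        if count > len(self.free):
            raise RuntimeError("not enough free blocks")
        return [self.free.popleft() for _ in range(count)]

    def release(self, block_ids: list[int]) -> None:
        self.free.extend(block_ids)


def init_groups(specs: list[KVCacheGroupSpec], num_blocks: int) -> dict[str, GroupBlockPool]:
    """Create one pool per named group (assumed: same block count for each)."""
    return {group.name: GroupBlockPool(num_blocks) for group in specs}


def slot_mapping_from_ids(block_ids: list[int], block_size: int, num_tokens: int) -> list[int]:
    """Token i lands in block block_ids[i // block_size] at offset i % block_size."""
    return [block_ids[i // block_size] * block_size + i % block_size for i in range(num_tokens)]


specs = [
    KVCacheGroupSpec("ori", (0, 1), KVCacheSpec(16, 4096, 1), 8),  # sample values
    KVCacheGroupSpec("cmp", (0, 1), KVCacheSpec(16, 4096, 4), 8),  # sample values
]
pools = init_groups(specs, num_blocks=4)
ori_ids = pools["ori"].allocate(2)
cmp_ids = pools["cmp"].allocate(1)
slots = slot_mapping_from_ids([3, 1], block_size=4, num_tokens=6)

print(ori_ids, cmp_ids)  # [0, 1] [0]
print(slots)             # [12, 13, 14, 15, 4, 5]
```

Because each group has its own pool, block ID 0 in "ori" and block ID 0 in "cmp" refer to different storage, which is why batches need a per-group mapping such as block_ids_by_group rather than a single block list.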