Add DeepSeek V4 serving integration#40
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request integrates DeepSeekV4 serving into the PyPTO framework by adding a dedicated executor, weight loader, and model loader for W8A8 checkpoints. The serving engine is refactored to support data parallelism through the introduction of DPEngineCore and AsyncLLMEngineClient, along with new CLI arguments for parallel configuration. The reviewer identified several improvements, including removing hardcoded paths to improve portability, enhancing the AST parser to handle unary operations, and ensuring consistent UTF-8 encoding across all file read operations to prevent potential decoding errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| from python.core.types import RuntimeModel | ||
|
|
||
|
|
||
| _FALLBACK_PYPTO_LIB_ROOT = Path("/data/liuxu/pypto-lib") |
There was a problem hiding this comment.
The fallback path _FALLBACK_PYPTO_LIB_ROOT is hardcoded to a user-specific directory (/data/liuxu/pypto-lib). Hardcoding personal user paths in a shared codebase reduces portability and can lead to unexpected failures on other environments. Consider removing this fallback or replacing it with a more generic path or an environment variable check.
| def _int_constant_from_file(path: Path, name: str) -> int | None: | ||
| """Read a simple integer module constant without importing kernel code.""" | ||
| tree = ast.parse(path.read_text(), filename=str(path)) |
There was a problem hiding this comment.
Calling path.read_text() without specifying an explicit encoding uses the platform's default encoding (e.g., CP1252 on Windows), which can lead to UnicodeDecodeError when parsing files containing non-ASCII characters. It is highly recommended to always specify encoding="utf-8" when reading text files.
| def _int_constant_from_file(path: Path, name: str) -> int | None: | |
| """Read a simple integer module constant without importing kernel code.""" | |
| tree = ast.parse(path.read_text(), filename=str(path)) | |
| def _int_constant_from_file(path: Path, name: str) -> int | None: | |
| """Read a simple integer module constant without importing kernel code.""" | |
| tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path)) |
| def _eval_int(node: ast.AST) -> int | None: | ||
| nonlocal config_assignments | ||
| if isinstance(node, ast.Constant) and isinstance(node.value, int): | ||
| return int(node.value) | ||
| if isinstance(node, ast.Name): | ||
| if node.id in assignments: | ||
| return _eval_int(assignments[node.id]) | ||
| if config_assignments is None: | ||
| config_path = path.parent / "config.py" | ||
| if config_path == path or not config_path.exists(): | ||
| config_assignments = {} | ||
| else: | ||
| config_tree = ast.parse(config_path.read_text(), filename=str(config_path)) | ||
| config_assignments = { | ||
| target.id: cfg_node.value | ||
| for cfg_node in config_tree.body | ||
| if isinstance(cfg_node, ast.Assign) | ||
| for target in cfg_node.targets | ||
| if isinstance(target, ast.Name) | ||
| } | ||
| config_node = config_assignments.get(node.id) | ||
| return _eval_int(config_node) if config_node is not None else None | ||
| if isinstance(node, ast.BinOp): | ||
| left = _eval_int(node.left) | ||
| right = _eval_int(node.right) | ||
| op = _AST_INT_OPERATORS.get(type(node.op)) | ||
| if left is None or right is None or op is None: | ||
| return None | ||
| return int(op(left, right)) | ||
| return None |
There was a problem hiding this comment.
The _eval_int helper function does not handle unary operations (ast.UnaryOp). In Python's AST, negative integer constants (e.g., -1) are represented as a UnaryOp with a USub operator and a Constant operand. Without handling ast.UnaryOp, any negative integer constants in the configuration files will fail to parse and return None. Supporting ast.UnaryOp (specifically USub and UAdd) makes the constant parser much more robust.
def _eval_int(node: ast.AST) -> int | None:
nonlocal config_assignments
if isinstance(node, ast.Constant) and isinstance(node.value, int):
return int(node.value)
if isinstance(node, ast.Name):
if node.id in assignments:
return _eval_int(assignments[node.id])
if config_assignments is None:
config_path = path.parent / "config.py"
if config_path == path or not config_path.exists():
config_assignments = {}
else:
config_tree = ast.parse(config_path.read_text(encoding="utf-8"), filename=str(config_path))
config_assignments = {
target.id: cfg_node.value
for cfg_node in config_tree.body
if isinstance(cfg_node, ast.Assign)
for target in cfg_node.targets
if isinstance(target, ast.Name)
}
config_node = config_assignments.get(node.id)
return _eval_int(config_node) if config_node is not None else None
if isinstance(node, ast.UnaryOp):
operand = _eval_int(node.operand)
if operand is None:
return None
if isinstance(node.op, ast.USub):
return -operand
if isinstance(node.op, ast.UAdd):
return operand
if isinstance(node, ast.BinOp):
left = _eval_int(node.left)
right = _eval_int(node.right)
op = _AST_INT_OPERATORS.get(type(node.op))
if left is None or right is None or op is None:
return None
return int(op(left, right))
return None| config_path = model_path / "tokenizer_config.json" | ||
| tokenizer_config = json.loads(config_path.read_text()) if config_path.exists() else {} |
There was a problem hiding this comment.
Specify encoding="utf-8" when calling read_text() to prevent potential UnicodeDecodeError on platforms where the default system encoding is not UTF-8.
| config_path = model_path / "tokenizer_config.json" | |
| tokenizer_config = json.loads(config_path.read_text()) if config_path.exists() else {} | |
| config_path = model_path / "tokenizer_config.json" | |
| tokenizer_config = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {} |
| try: | ||
| config_data = json.loads(config_path.read_text()) |
There was a problem hiding this comment.
Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError on systems with non-UTF-8 default encodings.
| try: | |
| config_data = json.loads(config_path.read_text()) | |
| try: | |
| config_data = json.loads(config_path.read_text(encoding="utf-8")) |
| if not index_path.exists(): | ||
| raise FileNotFoundError(f"Missing model.safetensors.index.json in {model_path}") | ||
|
|
||
| config_data = json.loads(config_path.read_text()) |
| config = _build_deepseek_v4_model_config(request.model_id, config_data, tokenizer) | ||
| runtime = request.runtime_config or RuntimeConfig(max_seq_len=min(config.max_position_embeddings, 8192)) | ||
| layer_specs = _build_layer_specs(config) | ||
| index_data = json.loads(index_path.read_text()) |
| try: | ||
| config_data = json.loads(config_path.read_text()) |
There was a problem hiding this comment.
| """Validate model-specific serving topology constraints.""" | ||
| if model_family != "deepseek_v4": | ||
| return | ||
| config_data = json.loads((Path(args.model).resolve() / "config.json").read_text()) |
There was a problem hiding this comment.
Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.
| config_data = json.loads((Path(args.model).resolve() / "config.json").read_text()) | |
| config_data = json.loads((Path(args.model).resolve() / "config.json").read_text(encoding="utf-8")) |
0a6d110 to
7b706a3
Compare
Rewrite the DeepSeek V4 runner to drive the packed all-layer kernels in single calls, replacing the per-layer dispatch path: - run_prefill / run_decode each issue one packed kernel call (l3_prefill_fwd, l3_decode_fwd) via the L3 worker. - Add stacked-weight loading/staging (load_stacked_layer_weights, _stage_stacked_weights) and share one resident weight copy between prefill and decode (staged once under an is-None guard). - weight_loader: stack_deepseek_v4_layer_weights plus CSA/HCA stacked name groups. - executor: compile l3_prefill_fwd + l3_decode_fwd with layer-stacked dummy args; extend kernel-module loading and contract validation. - tests: update arg orders and stacked tensor shapes for the packed kernels. - Right-size DEEPSEEK_V4_CMP_MAX_BLOCKS to the kernel's CMP_MAX_BLOCKS; fix _final_norm fp32 round-trip overflow; pre-allocate stacked weights before the L3 worker forks (shared-memory visibility). - Bump pypto-lib submodule to the EP8 l3_prefill_fwd generalization. - gitignore CLAUDE.local.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
7b706a3 to
03b029e
Compare
Bump the pypto-lib submodule to the merge of origin PR #626 (num_tokens) + #611 (LM-head fusion), and adapt the serving runner to the new packed l3_decode_fwd contract: - Decode now passes final_norm_w + lm_head_weight (reusing the tensors the runner already staged for its separate lm_head step) and a trailing num_tokens INT32 scalar (= actual_batch * decode_seq), and receives a [N_RANKS, T, VOCAB] logits buffer directly. The separate post-decode _logits_for_hidden / _final_norm steps are dropped from the decode path (prefill path is unchanged). - _DECODE_FWD_TENSOR_ORDER: x_out -> final_norm_w, lm_head_weight, logits. - Logits buffer + final-norm/lm-head static tensors are staged before the L3 worker forks (shared-memory visibility). - npu_executor _decode_dummy_args matches the new signature. - tests/test_deepseek_v4.py: decode contract updated (81 args, logits shape, trailing num_tokens scalar). Validated on-device (TP1xEP8): "Huawei is" generates coherent multi-token output end-to-end. Two env-gated, off-by-default debug toggles remain in npu_runner.py from the investigation (PYPTO_DSV4_SKIP_PREFILL_KERNEL, PYPTO_DSV4_DIVERSE_DECODE_PAD); they have no effect unless explicitly set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
dist-serving)Stacked dependency
Submodule dependency
Verification
Current e2e status
Draft: no locked 8-device DeepSeek V4 serving e2e rerun has been completed after stacking on #37. The earlier branch still needed e2e follow-up for runtime 507018/non-finite logits.