Add DeepSeek V4 serving integration by ndleslx · Pull Request #40 · hw-native-sys/pypto-serving

ndleslx · 2026-06-23T08:46:30Z

Summary

- stack the DeepSeek V4 serving integration on top of the DP/TP serving support (dist-serving) from PR #37 (feat: add v1 parallel serving strategy support)
- add the DeepSeek V4 W8A8 serving model loader, executor, runner, and CLI topology selection
- update the pypto-lib submodule pointer to pick up the DeepSeek V4 kernel shape/signature changes
- add focused DeepSeek V4 unit coverage

Stacked dependency

Based on PR #37 (feat: add v1 parallel serving strategy support).
Relates to #36 ([Feature] Parallel Strategies Support).

Submodule dependency

pypto-lib commit: ndleslx/pypto-lib@2626e4a
I could not push this commit directly to hw-native-sys/pypto-lib due to permissions, so the upstream pypto-lib commit/PR still needs to land before this can be undrafted/merged cleanly.

Verification

python -m py_compile python/cli/main.py python/core/async_engine.py python/core/server.py python/core/serving_worker.py examples/model/deepseek_v4/runner/npu_executor.py examples/model/deepseek_v4/runner/npu_runner.py examples/model/deepseek_v4/runner/weight_loader.py
ruff check --config ruff.toml python/cli/main.py python/core/async_engine.py python/core/serving_worker.py examples/model/deepseek_v4 tests/test_deepseek_v4.py
python -m pytest tests/test_deepseek_v4.py -q
python -m pytest tests/test_parallel.py tests/test_cli.py tests/test_npu_prefix_chunk.py -q

Current e2e status

Draft: no locked 8-device DeepSeek V4 serving e2e rerun has been completed after stacking on #37. The earlier branch still needed e2e follow-up for runtime 507018/non-finite logits.

coderabbitai · 2026-06-23T08:46:39Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

gemini-code-assist

Code Review

This pull request integrates DeepSeekV4 serving into the PyPTO framework by adding a dedicated executor, weight loader, and model loader for W8A8 checkpoints. The serving engine is refactored to support data parallelism through the introduction of DPEngineCore and AsyncLLMEngineClient, along with new CLI arguments for parallel configuration. The reviewer identified several improvements, including removing hardcoded paths to improve portability, enhancing the AST parser to handle unary operations, and ensuring consistent UTF-8 encoding across all file read operations to prevent potential decoding errors.

gemini-code-assist · 2026-06-23T08:49:28Z

+from python.core.types import RuntimeModel
+
+
+_FALLBACK_PYPTO_LIB_ROOT = Path("/data/liuxu/pypto-lib")


The fallback path _FALLBACK_PYPTO_LIB_ROOT is hardcoded to a user-specific directory (/data/liuxu/pypto-lib). Hardcoding personal user paths in a shared codebase reduces portability and can lead to unexpected failures on other environments. Consider removing this fallback or replacing it with a more generic path or an environment variable check.

gemini-code-assist · 2026-06-23T08:49:28Z

+def _int_constant_from_file(path: Path, name: str) -> int | None:
+    """Read a simple integer module constant without importing kernel code."""
+    tree = ast.parse(path.read_text(), filename=str(path))


Calling path.read_text() without specifying an explicit encoding uses the platform's default encoding (e.g., CP1252 on Windows), which can lead to UnicodeDecodeError when parsing files containing non-ASCII characters. It is highly recommended to always specify encoding="utf-8" when reading text files.

Suggested change

 def _int_constant_from_file(path: Path, name: str) -> int | None:
     """Read a simple integer module constant without importing kernel code."""
-    tree = ast.parse(path.read_text(), filename=str(path))
+    tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
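
The point of this comment can be shown with a small self-contained sketch: a source file containing non-ASCII text is written as UTF-8 and then parsed with an explicit encoding, so the result does not depend on the platform default. The function below is a simplified stand-in for the PR's helper (it only handles literal assignments), and the file name and constant name are made up for the example.

```python
import ast
import tempfile
from pathlib import Path
from typing import Optional


def int_constant_from_file(path: Path, name: str) -> Optional[int]:
    """Return the value of a top-level `name = <int literal>` assignment, if any."""
    # Explicit UTF-8 keeps the parse independent of the platform default encoding.
    tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == name:
                value = node.value
                # bool is a subclass of int, so exclude it explicitly.
                if (
                    isinstance(value, ast.Constant)
                    and isinstance(value.value, int)
                    and not isinstance(value.value, bool)
                ):
                    return value.value
    return None


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "kernel_config.py"
    # The comment holds non-ASCII characters, as a kernel source file might.
    source.write_text("# block size (块大小)\nBLOCK = 128\n", encoding="utf-8")
    block = int_constant_from_file(source, "BLOCK")
    print(block)
```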

gemini-code-assist · 2026-06-23T08:49:28Z

+    def _eval_int(node: ast.AST) -> int | None:
+        nonlocal config_assignments
+        if isinstance(node, ast.Constant) and isinstance(node.value, int):
+            return int(node.value)
+        if isinstance(node, ast.Name):
+            if node.id in assignments:
+                return _eval_int(assignments[node.id])
+            if config_assignments is None:
+                config_path = path.parent / "config.py"
+                if config_path == path or not config_path.exists():
+                    config_assignments = {}
+                else:
+                    config_tree = ast.parse(config_path.read_text(), filename=str(config_path))
+                    config_assignments = {
+                        target.id: cfg_node.value
+                        for cfg_node in config_tree.body
+                        if isinstance(cfg_node, ast.Assign)
+                        for target in cfg_node.targets
+                        if isinstance(target, ast.Name)
+                    }
+            config_node = config_assignments.get(node.id)
+            return _eval_int(config_node) if config_node is not None else None
+        if isinstance(node, ast.BinOp):
+            left = _eval_int(node.left)
+            right = _eval_int(node.right)
+            op = _AST_INT_OPERATORS.get(type(node.op))
+            if left is None or right is None or op is None:
+                return None
+            return int(op(left, right))
+        return None


The _eval_int helper function does not handle unary operations (ast.UnaryOp). In Python's AST, negative integer constants (e.g., -1) are represented as a UnaryOp with a USub operator and a Constant operand. Without handling ast.UnaryOp, any negative integer constants in the configuration files will fail to parse and return None. Supporting ast.UnaryOp (specifically USub and UAdd) makes the constant parser much more robust.

Suggested change

    def _eval_int(node: ast.AST) -> int | None:
        nonlocal config_assignments
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return int(node.value)
        if isinstance(node, ast.Name):
            if node.id in assignments:
                return _eval_int(assignments[node.id])
            if config_assignments is None:
                config_path = path.parent / "config.py"
                if config_path == path or not config_path.exists():
                    config_assignments = {}
                else:
                    config_tree = ast.parse(config_path.read_text(encoding="utf-8"), filename=str(config_path))
                    config_assignments = {
                        target.id: cfg_node.value
                        for cfg_node in config_tree.body
                        if isinstance(cfg_node, ast.Assign)
                        for target in cfg_node.targets
                        if isinstance(target, ast.Name)
                    }
            config_node = config_assignments.get(node.id)
            return _eval_int(config_node) if config_node is not None else None
        if isinstance(node, ast.UnaryOp):
            operand = _eval_int(node.operand)
            if operand is None:
                return None
            if isinstance(node.op, ast.USub):
                return -operand
            if isinstance(node.op, ast.UAdd):
                return operand
        if isinstance(node, ast.BinOp):
            left = _eval_int(node.left)
            right = _eval_int(node.right)
            op = _AST_INT_OPERATORS.get(type(node.op))
            if left is None or right is None or op is None:
                return None
            return int(op(left, right))
        return None
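
To see why the UnaryOp branch matters, the following standalone sketch parses a negative literal and evaluates it with a reduced version of the evaluator (no name lookup). The operator table _BIN_OPS is an assumed stand-in for the PR's _AST_INT_OPERATORS, whose contents are not shown in this thread.

```python
import ast
import operator
from typing import Optional

# Assumed stand-in for _AST_INT_OPERATORS; the real table is not shown in the PR thread.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def eval_int(node: ast.AST) -> Optional[int]:
    """Evaluate integer literals combined with unary +/- and a few binary operators."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int) and not isinstance(node.value, bool):
        return node.value
    if isinstance(node, ast.UnaryOp):
        operand = eval_int(node.operand)
        if operand is None:
            return None
        if isinstance(node.op, ast.USub):
            return -operand
        if isinstance(node.op, ast.UAdd):
            return operand
        return None  # other unary operators (e.g. ~) are not supported here
    if isinstance(node, ast.BinOp):
        left, right = eval_int(node.left), eval_int(node.right)
        op = _BIN_OPS.get(type(node.op))
        if left is None or right is None or op is None:
            return None
        return op(left, right)
    return None


expr = ast.parse("-1", mode="eval").body
print(type(expr).__name__)  # UnaryOp: the parser does not fold the sign into the constant
print(eval_int(expr))
```

Without the UnaryOp branch, the same expression would fall through to the final return None.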

gemini-code-assist · 2026-06-23T08:49:28Z

+    config_path = model_path / "tokenizer_config.json"
+    tokenizer_config = json.loads(config_path.read_text()) if config_path.exists() else {}


Specify encoding="utf-8" when calling read_text() to prevent potential UnicodeDecodeError on platforms where the default system encoding is not UTF-8.

Suggested change

     config_path = model_path / "tokenizer_config.json"
-    tokenizer_config = json.loads(config_path.read_text()) if config_path.exists() else {}
+    tokenizer_config = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}

gemini-code-assist · 2026-06-23T08:49:28Z

+        try:
+            config_data = json.loads(config_path.read_text())


Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError on systems with non-UTF-8 default encodings.

Suggested change

         try:
-            config_data = json.loads(config_path.read_text())
+            config_data = json.loads(config_path.read_text(encoding="utf-8"))

gemini-code-assist · 2026-06-23T08:49:28Z

+        if not index_path.exists():
+            raise FileNotFoundError(f"Missing model.safetensors.index.json in {model_path}")
+
+        config_data = json.loads(config_path.read_text())


Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change

-        config_data = json.loads(config_path.read_text())
+        config_data = json.loads(config_path.read_text(encoding="utf-8"))

gemini-code-assist · 2026-06-23T08:49:28Z

+        config = _build_deepseek_v4_model_config(request.model_id, config_data, tokenizer)
+        runtime = request.runtime_config or RuntimeConfig(max_seq_len=min(config.max_position_embeddings, 8192))
+        layer_specs = _build_layer_specs(config)
+        index_data = json.loads(index_path.read_text())


Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change

-        index_data = json.loads(index_path.read_text())
+        index_data = json.loads(index_path.read_text(encoding="utf-8"))

gemini-code-assist · 2026-06-23T08:49:28Z

+    try:
+        config_data = json.loads(config_path.read_text())


Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change

     try:
-        config_data = json.loads(config_path.read_text())
+        config_data = json.loads(config_path.read_text(encoding="utf-8"))

gemini-code-assist · 2026-06-23T08:49:28Z

+    """Validate model-specific serving topology constraints."""
+    if model_family != "deepseek_v4":
+        return
+    config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())


Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change

-    config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())
+    config_data = json.loads((Path(args.model).resolve() / "config.json").read_text(encoding="utf-8"))
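
Since the same encoding fix recurs across several JSON reads, one option is to route them through a single helper so the encoding is specified in one place. This is a minimal sketch, not part of the PR; the helper names read_json and read_json_or_default are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def read_json(path: Path) -> Any:
    """Parse a JSON file, always decoding it as UTF-8."""
    return json.loads(path.read_text(encoding="utf-8"))


def read_json_or_default(path: Path, default: Any) -> Any:
    """Like read_json, but return `default` when the file does not exist."""
    return read_json(path) if path.exists() else default


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp)
    # Write a config containing non-ASCII text to exercise the decoding path.
    (model_path / "config.json").write_text(
        json.dumps({"note": "模型", "vocab_size": 8}, ensure_ascii=False),
        encoding="utf-8",
    )
    config_data = read_json(model_path / "config.json")
    # tokenizer_config.json is absent here, so the default is returned.
    tokenizer_config = read_json_or_default(model_path / "tokenizer_config.json", {})
```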

Rewrite the DeepSeek V4 runner to drive the packed all-layer kernels in single calls, replacing the per-layer dispatch path:

- run_prefill / run_decode each issue one packed kernel call (l3_prefill_fwd, l3_decode_fwd) via the L3 worker.
- Add stacked-weight loading/staging (load_stacked_layer_weights, _stage_stacked_weights) and share one resident weight copy between prefill and decode (staged once under an is-None guard).
- weight_loader: stack_deepseek_v4_layer_weights plus CSA/HCA stacked name groups.
- executor: compile l3_prefill_fwd + l3_decode_fwd with layer-stacked dummy args; extend kernel-module loading and contract validation.
- tests: update arg orders and stacked tensor shapes for the packed kernels.
- Right-size DEEPSEEK_V4_CMP_MAX_BLOCKS to the kernel's CMP_MAX_BLOCKS; fix _final_norm fp32 round-trip overflow; pre-allocate stacked weights before the L3 worker forks (shared-memory visibility).
- Bump pypto-lib submodule to the EP8 l3_prefill_fwd generalization.
- gitignore CLAUDE.local.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
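
The "staged once under an is-None guard" idea from this commit message can be illustrated with plain Python lists standing in for device tensors. Everything named here (StackedWeightCache, stack_layers, the attn_norm key) is hypothetical and only sketches the pattern; the real runner stages tensors for the L3 worker and its code is not shown in this thread.

```python
from typing import Callable, Dict, List, Optional

Stacked = Dict[str, List[List[float]]]


def stack_layers(per_layer: List[Dict[str, List[float]]]) -> Stacked:
    """Group per-layer tensors by name, adding a leading layer axis."""
    names = per_layer[0].keys()
    return {name: [layer[name] for layer in per_layer] for name in names}


class StackedWeightCache:
    """Stage stacked weights once and hand the same copy to prefill and decode."""

    def __init__(self, loader: Callable[[], Stacked]) -> None:
        self._loader = loader
        self._staged: Optional[Stacked] = None
        self.load_count = 0  # exposed only so the example can show a single load

    def staged(self) -> Stacked:
        # The is-None guard: the loader runs on first use only.
        if self._staged is None:
            self._staged = self._loader()
            self.load_count += 1
        return self._staged


per_layer = [{"attn_norm": [1.0, 2.0]}, {"attn_norm": [3.0, 4.0]}]
cache = StackedWeightCache(lambda: stack_layers(per_layer))
prefill_weights = cache.staged()
decode_weights = cache.staged()
print(prefill_weights is decode_weights, cache.load_count)
```

Both calls return the same object, so prefill and decode share one resident copy instead of staging the weights twice.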

Bump the pypto-lib submodule to the merge of origin PR #626 (num_tokens) + #611 (LM-head fusion), and adapt the serving runner to the new packed l3_decode_fwd contract:

- Decode now passes final_norm_w + lm_head_weight (reusing the tensors the runner already staged for its separate lm_head step) and a trailing num_tokens INT32 scalar (= actual_batch * decode_seq), and receives a [N_RANKS, T, VOCAB] logits buffer directly. The separate post-decode _logits_for_hidden / _final_norm steps are dropped from the decode path (prefill path is unchanged).
- _DECODE_FWD_TENSOR_ORDER: x_out -> final_norm_w, lm_head_weight, logits.
- Logits buffer + final-norm/lm-head static tensors are staged before the L3 worker forks (shared-memory visibility).
- npu_executor _decode_dummy_args matches the new signature.
- tests/test_deepseek_v4.py: decode contract updated (81 args, logits shape, trailing num_tokens scalar).

Validated on-device (TP1xEP8): "Huawei is" generates coherent multi-token output end-to-end.

Two env-gated, off-by-default debug toggles remain in npu_runner.py from the investigation (PYPTO_DSV4_SKIP_PREFILL_KERNEL, PYPTO_DSV4_DIVERSE_DECODE_PAD); they have no effect unless explicitly set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
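
Two details from this commit message lend themselves to a short sketch: the trailing scalar num_tokens = actual_batch * decode_seq, and debug toggles that stay off unless an environment variable is set. The code below is illustrative only. It assumes that T in the [N_RANKS, T, VOCAB] logits buffer is an upper bound on num_tokens and that the toggles accept the values "1", "true", "yes", or "on"; neither assumption is confirmed by the PR, and the helper names are hypothetical.

```python
import os
from typing import Mapping, Optional

INT32_MAX = 2**31 - 1


def decode_num_tokens(actual_batch: int, decode_seq: int, capacity_t: int) -> int:
    """Compute the trailing num_tokens scalar and check it against the buffer size.

    capacity_t is the T dimension of the logits buffer; treating it as an upper
    bound on num_tokens is an assumption made for this sketch.
    """
    num_tokens = actual_batch * decode_seq
    if not 0 < num_tokens <= min(capacity_t, INT32_MAX):
        raise ValueError(f"num_tokens={num_tokens} outside (0, {capacity_t}]")
    return num_tokens


def env_flag(name: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the variable is explicitly set to a truthy value."""
    source = os.environ if environ is None else environ
    return source.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


print(decode_num_tokens(actual_batch=3, decode_seq=2, capacity_t=8))
print(env_flag("PYPTO_DSV4_SKIP_PREFILL_KERNEL", environ={}))
```

Passing the environment as an argument keeps the toggle logic testable without mutating os.environ.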

feat: add v1 parallel serving strategy support

454d1ca

gemini-code-assist Bot reviewed Jun 23, 2026

ndleslx force-pushed the dsv4-serving-generation branch from 0a6d110 to 7b706a3 Compare June 23, 2026 09:24

Add DeepSeek V4 serving integration

5de431f

bumble0918 mentioned this pull request Jun 25, 2026

[Feature] DeepSeek-V4 single-node 16-card serving inference co-agent-serving/meta-sprint#13

Open

ndleslx force-pushed the dsv4-serving-generation branch from 7b706a3 to 03b029e Compare June 26, 2026 08:19

ndleslx wants to merge 4 commits into hw-native-sys:main from ndleslx:dsv4-serving-generation
