Skip to content

Add DeepSeek V4 serving integration#40

Draft
ndleslx wants to merge 4 commits into
hw-native-sys:mainfrom
ndleslx:dsv4-serving-generation
Draft

Add DeepSeek V4 serving integration#40
ndleslx wants to merge 4 commits into
hw-native-sys:mainfrom
ndleslx:dsv4-serving-generation

Conversation

@ndleslx

@ndleslx ndleslx commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • stack the DeepSeek V4 serving integration on top of PR feat: add v1 parallel serving strategy support #37's DP/TP serving support (dist-serving)
  • add DeepSeek V4 W8A8 serving model loader, executor, runner, and CLI topology selection
  • update the pypto-lib submodule pointer to DeepSeek V4 kernel shape/signature changes
  • add focused DeepSeek V4 unit coverage

Stacked dependency

Submodule dependency

  • pypto-lib commit: ndleslx/pypto-lib@2626e4a
  • I could not push this commit directly to hw-native-sys/pypto-lib due to permissions, so the upstream pypto-lib commit/PR still needs to land before this can be undrafted/merged cleanly.

Verification

  • python -m py_compile python/cli/main.py python/core/async_engine.py python/core/server.py python/core/serving_worker.py examples/model/deepseek_v4/runner/npu_executor.py examples/model/deepseek_v4/runner/npu_runner.py examples/model/deepseek_v4/runner/weight_loader.py
  • ruff check --config ruff.toml python/cli/main.py python/core/async_engine.py python/core/serving_worker.py examples/model/deepseek_v4 tests/test_deepseek_v4.py
  • python -m pytest tests/test_deepseek_v4.py -q
  • python -m pytest tests/test_parallel.py tests/test_cli.py tests/test_npu_prefix_chunk.py -q

Current e2e status

Draft: no locked 8-device DeepSeek V4 serving e2e rerun has been completed after stacking on #37. The earlier branch still needed e2e follow-up for runtime 507018/non-finite logits.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ddf348b9-a727-4729-860f-143653d4fbef

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates DeepSeekV4 serving into the PyPTO framework by adding a dedicated executor, weight loader, and model loader for W8A8 checkpoints. The serving engine is refactored to support data parallelism through the introduction of DPEngineCore and AsyncLLMEngineClient, along with new CLI arguments for parallel configuration. The reviewer identified several improvements, including removing hardcoded paths to improve portability, enhancing the AST parser to handle unary operations, and ensuring consistent UTF-8 encoding across all file read operations to prevent potential decoding errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

from python.core.types import RuntimeModel


_FALLBACK_PYPTO_LIB_ROOT = Path("/data/liuxu/pypto-lib")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback path _FALLBACK_PYPTO_LIB_ROOT is hardcoded to a user-specific directory (/data/liuxu/pypto-lib). Hardcoding personal user paths in a shared codebase reduces portability and can lead to unexpected failures on other environments. Consider removing this fallback or replacing it with a more generic path or an environment variable check.

Comment on lines +113 to +115
def _int_constant_from_file(path: Path, name: str) -> int | None:
"""Read a simple integer module constant without importing kernel code."""
tree = ast.parse(path.read_text(), filename=str(path))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling path.read_text() without specifying an explicit encoding uses the platform's default encoding (e.g., CP1252 on Windows), which can lead to UnicodeDecodeError when parsing files containing non-ASCII characters. It is highly recommended to always specify encoding="utf-8" when reading text files.

Suggested change
def _int_constant_from_file(path: Path, name: str) -> int | None:
"""Read a simple integer module constant without importing kernel code."""
tree = ast.parse(path.read_text(), filename=str(path))
def _int_constant_from_file(path: Path, name: str) -> int | None:
"""Read a simple integer module constant without importing kernel code."""
tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))

Comment on lines +125 to +154
def _eval_int(node: ast.AST) -> int | None:
nonlocal config_assignments
if isinstance(node, ast.Constant) and isinstance(node.value, int):
return int(node.value)
if isinstance(node, ast.Name):
if node.id in assignments:
return _eval_int(assignments[node.id])
if config_assignments is None:
config_path = path.parent / "config.py"
if config_path == path or not config_path.exists():
config_assignments = {}
else:
config_tree = ast.parse(config_path.read_text(), filename=str(config_path))
config_assignments = {
target.id: cfg_node.value
for cfg_node in config_tree.body
if isinstance(cfg_node, ast.Assign)
for target in cfg_node.targets
if isinstance(target, ast.Name)
}
config_node = config_assignments.get(node.id)
return _eval_int(config_node) if config_node is not None else None
if isinstance(node, ast.BinOp):
left = _eval_int(node.left)
right = _eval_int(node.right)
op = _AST_INT_OPERATORS.get(type(node.op))
if left is None or right is None or op is None:
return None
return int(op(left, right))
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _eval_int helper function does not handle unary operations (ast.UnaryOp). In Python's AST, negative integer constants (e.g., -1) are represented as a UnaryOp with a USub operator and a Constant operand. Without handling ast.UnaryOp, any negative integer constants in the configuration files will fail to parse and return None. Supporting ast.UnaryOp (specifically USub and UAdd) makes the constant parser much more robust.

    def _eval_int(node: ast.AST) -> int | None:
        nonlocal config_assignments
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return int(node.value)
        if isinstance(node, ast.Name):
            if node.id in assignments:
                return _eval_int(assignments[node.id])
            if config_assignments is None:
                config_path = path.parent / "config.py"
                if config_path == path or not config_path.exists():
                    config_assignments = {}
                else:
                    config_tree = ast.parse(config_path.read_text(encoding="utf-8"), filename=str(config_path))
                    config_assignments = {
                        target.id: cfg_node.value
                        for cfg_node in config_tree.body
                        if isinstance(cfg_node, ast.Assign)
                        for target in cfg_node.targets
                        if isinstance(target, ast.Name)
                    }
            config_node = config_assignments.get(node.id)
            return _eval_int(config_node) if config_node is not None else None
        if isinstance(node, ast.UnaryOp):
            operand = _eval_int(node.operand)
            if operand is None:
                return None
            if isinstance(node.op, ast.USub):
                return -operand
            if isinstance(node.op, ast.UAdd):
                return operand
        if isinstance(node, ast.BinOp):
            left = _eval_int(node.left)
            right = _eval_int(node.right)
            op = _AST_INT_OPERATORS.get(type(node.op))
            if left is None or right is None or op is None:
                return None
            return int(op(left, right))
        return None

Comment thread python/core/tokenizer.py
Comment on lines +121 to +122
config_path = model_path / "tokenizer_config.json"
tokenizer_config = json.loads(config_path.read_text()) if config_path.exists() else {}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when calling read_text() to prevent potential UnicodeDecodeError on platforms where the default system encoding is not UTF-8.

Suggested change
config_path = model_path / "tokenizer_config.json"
tokenizer_config = json.loads(config_path.read_text()) if config_path.exists() else {}
config_path = model_path / "tokenizer_config.json"
tokenizer_config = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}

Comment on lines +294 to +295
try:
config_data = json.loads(config_path.read_text())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError on systems with non-UTF-8 default encodings.

Suggested change
try:
config_data = json.loads(config_path.read_text())
try:
config_data = json.loads(config_path.read_text(encoding="utf-8"))

if not index_path.exists():
raise FileNotFoundError(f"Missing model.safetensors.index.json in {model_path}")

config_data = json.loads(config_path.read_text())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change
config_data = json.loads(config_path.read_text())
config_data = json.loads(config_path.read_text(encoding="utf-8"))

config = _build_deepseek_v4_model_config(request.model_id, config_data, tokenizer)
runtime = request.runtime_config or RuntimeConfig(max_seq_len=min(config.max_position_embeddings, 8192))
layer_specs = _build_layer_specs(config)
index_data = json.loads(index_path.read_text())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change
index_data = json.loads(index_path.read_text())
index_data = json.loads(index_path.read_text(encoding="utf-8"))

Comment thread python/cli/main.py
Comment on lines +201 to +202
try:
config_data = json.loads(config_path.read_text())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change
try:
config_data = json.loads(config_path.read_text())
try:
config_data = json.loads(config_path.read_text(encoding="utf-8"))

Comment thread python/cli/main.py
"""Validate model-specific serving topology constraints."""
if model_family != "deepseek_v4":
return
config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when calling read_text() to ensure cross-platform compatibility and avoid UnicodeDecodeError.

Suggested change
config_data = json.loads((Path(args.model).resolve() / "config.json").read_text())
config_data = json.loads((Path(args.model).resolve() / "config.json").read_text(encoding="utf-8"))

@ndleslx ndleslx force-pushed the dsv4-serving-generation branch from 0a6d110 to 7b706a3 Compare June 23, 2026 09:24
Rewrite the DeepSeek V4 runner to drive the packed all-layer kernels in
single calls, replacing the per-layer dispatch path:

- run_prefill / run_decode each issue one packed kernel call
  (l3_prefill_fwd, l3_decode_fwd) via the L3 worker.
- Add stacked-weight loading/staging (load_stacked_layer_weights,
  _stage_stacked_weights) and share one resident weight copy between
  prefill and decode (staged once under an is-None guard).
- weight_loader: stack_deepseek_v4_layer_weights plus CSA/HCA stacked
  name groups.
- executor: compile l3_prefill_fwd + l3_decode_fwd with layer-stacked
  dummy args; extend kernel-module loading and contract validation.
- tests: update arg orders and stacked tensor shapes for the packed
  kernels.
- Right-size DEEPSEEK_V4_CMP_MAX_BLOCKS to the kernel's CMP_MAX_BLOCKS;
  fix _final_norm fp32 round-trip overflow; pre-allocate stacked
  weights before the L3 worker forks (shared-memory visibility).
- Bump pypto-lib submodule to the EP8 l3_prefill_fwd generalization.
- gitignore CLAUDE.local.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ndleslx ndleslx force-pushed the dsv4-serving-generation branch from 7b706a3 to 03b029e Compare June 26, 2026 08:19
Bump the pypto-lib submodule to the merge of origin PR #626 (num_tokens)
+ #611 (LM-head fusion), and adapt the serving runner to the new packed
l3_decode_fwd contract:

- Decode now passes final_norm_w + lm_head_weight (reusing the tensors
  the runner already staged for its separate lm_head step) and a trailing
  num_tokens INT32 scalar (= actual_batch * decode_seq), and receives a
  [N_RANKS, T, VOCAB] logits buffer directly. The separate post-decode
  _logits_for_hidden / _final_norm steps are dropped from the decode path
  (prefill path is unchanged).
- _DECODE_FWD_TENSOR_ORDER: x_out -> final_norm_w, lm_head_weight, logits.
- Logits buffer + final-norm/lm-head static tensors are staged before the
  L3 worker forks (shared-memory visibility).
- npu_executor _decode_dummy_args matches the new signature.
- tests/test_deepseek_v4.py: decode contract updated (81 args, logits
  shape, trailing num_tokens scalar).

Validated on-device (TP1xEP8): "Huawei is" generates coherent multi-token
output end-to-end. Two env-gated, off-by-default debug toggles remain in
npu_runner.py from the investigation (PYPTO_DSV4_SKIP_PREFILL_KERNEL,
PYPTO_DSV4_DIVERSE_DECODE_PAD); they have no effect unless explicitly set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant