Qwen3-14B serving: device-side embedding and sampling #47

Open
sunkaixuan2018 wants to merge 1 commit into hw-native-sys:main from sunkaixuan2018:emb-and-samping

@sunkaixuan2018 sunkaixuan2018 commented Jun 29, 2026

Summary

This PR adds the pypto-serving side of moving Qwen3 greedy sampling and token embedding from host to device for non-streaming generation.

This branch was rebuilt cleanly from upstream/main and no longer includes the earlier luohuan stream/base commit. It is paired with the pypto-lib PR below. The two PRs should be reviewed together: pypto-lib defines the device kernels and padded-vocab behavior, and this serving PR pads weights, calls the new kernels, consumes device-sampled ids, and verifies the serving path.

Paired pypto-lib PR: hw-native-sys/pypto-lib#639

Performance Impact

Measured on myserver with full 40-layer, 128-token NPU generation:

  • Latest main baseline: total 8.583s, prefill 0.410s, decode 6.211s over 127 decode steps, 48.9 ms/token.
  • Device emb/sample branch: total 6.450s, prefill 0.373s, decode 6.033s over 127 decode steps, 47.5 ms/token.

In this run, end-to-end generation time drops by about 25% (8.583s to 6.450s), while per-token decode time improves by only about 3% (48.9 to 47.5 ms/token). Pure device kernel work is close overall, and the new path adds a small amount of greedy_sample / token_embed work. The main win is that serving no longer performs greedy sampling, embedding lookup, and the related host/device synchronization on the host once per generated token.
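The percentages above can be re-derived from the two bullet points. The sketch below only redoes that arithmetic; the "time outside the reported phases" value is a derived quantity (total minus prefill minus decode), not a figure reported in this PR, and the helper names are illustrative.

```python
# Figures copied from the two measurement bullets above (seconds).
baseline = {"total": 8.583, "prefill": 0.410, "decode": 6.211}
device = {"total": 6.450, "prefill": 0.373, "decode": 6.033}
DECODE_STEPS = 127


def ms_per_token(decode_seconds, steps=DECODE_STEPS):
    """Average decode latency per generated token, in milliseconds."""
    return decode_seconds / steps * 1000.0


def relative_drop(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def outside_phases(run):
    """Derived: time not attributed to the reported prefill or decode phases."""
    return run["total"] - run["prefill"] - run["decode"]


e2e_drop = relative_drop(baseline["total"], device["total"])
tpot_drop = relative_drop(
    ms_per_token(baseline["decode"]), ms_per_token(device["decode"])
)

print(f"end-to-end drop: {e2e_drop:.1%}")
print(f"decode ms/token drop: {tpot_drop:.1%}")
print(f"outside prefill/decode: {outside_phases(baseline):.3f}s -> "
      f"{outside_phases(device):.3f}s")
```

Read this way, most of the saved wall-clock time sits outside the timed prefill and decode phases, which is consistent with the explanation that the gain comes from removing per-token host work rather than from faster kernels.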

What Changed

  • Wire Qwen3 prefill/decode to device-side greedy sampling and embedding.
  • Pad lm_head rows by duplicating token 0 so padded-vocab logits are deterministic.
  • Keep padded embed_weight rows as zero padding.
  • Remove the runtime valid_vocab_size tensor path and rely on the lib-side REAL_VOCAB clamp.
  • Add numeric regression coverage for padded-vocab argmax semantics, real-token tie-break, and REAL_VOCAB clamp.
  • Simplify decode hidden-state flow: device-embedding decode now passes a zero placeholder instead of caching pending_next_hidden on host.
  • Document PR scope as non-streaming generation.
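To illustrate the two padding rules in the list above, here is a minimal pure-Python sketch with toy sizes. It assumes plain nested lists instead of device tensors, and the helper names (`pad_lm_head`, `pad_embed`, `project`, `argmax_first`) are hypothetical, not functions from this PR.

```python
REAL_VOCAB = 4    # toy stand-in; the tests in this PR use 151936
PADDED_VOCAB = 6  # toy stand-in; the tests in this PR use 152064


def pad_lm_head(rows, padded_vocab):
    """Pad lm_head by appending copies of row 0 (token 0)."""
    extra = padded_vocab - len(rows)
    return [list(r) for r in rows] + [list(rows[0]) for _ in range(extra)]


def pad_embed(rows, padded_vocab):
    """Pad the embedding table with all-zero rows."""
    hidden = len(rows[0])
    extra = padded_vocab - len(rows)
    return [list(r) for r in rows] + [[0.0] * hidden for _ in range(extra)]


def project(hidden_state, lm_head):
    """Logits as dot products of the hidden state with each lm_head row."""
    return [sum(h * w for h, w in zip(hidden_state, row)) for row in lm_head]


def argmax_first(values):
    """Argmax that keeps the smallest index among equal maxima."""
    best = 0
    for idx in range(1, len(values)):
        if values[idx] > values[best]:
            best = idx
    return best


lm_head = pad_lm_head(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]], PADDED_VOCAB
)
embed = pad_embed(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]], PADDED_VOCAB
)

logits_a = project([2.0, 0.0], lm_head)  # token 0 has the best real logit
logits_b = project([0.0, 2.0], lm_head)  # token 1 has the best real logit
print(argmax_first(logits_a), argmax_first(logits_b))
```

Because every padded lm_head row is a copy of row 0, each padded logit equals the token-0 logit. A padded id can therefore only tie with token 0, and a smallest-id tie-break resolves that tie to the real token.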

Validation

Local:

  • python -m py_compile examples/model/qwen3_14b/runner/npu_executor.py examples/model/qwen3_14b/runner/npu_runner.py examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py python/core/engine.py python/core/executor.py python/core/serving_worker.py python/core/types.py tests/test_batching.py tests/test_device_sampling_submission.py
  • python -m pytest tests/test_device_sampling_submission.py -q -> 3 passed, 1 skipped locally because torch is unavailable

myserver:

  • pytest tests/test_device_sampling_submission.py -q -> 4 passed
  • pytest tests/test_batching.py -q -> 16 passed
  • pytest tests/test_batching.py tests/test_device_sampling_submission.py -q -> 20 passed
  • greedy sample a2a3sim -> PASS
  • NPU generation -> exit 0, token ids [594, 220, 20, 38, 5440], text 's 5G technology
  • Full 40-layer, 128-token comparison against latest main baseline -> baseline 8.583s total / 48.9 ms/token decode, device branch 6.450s total / 47.5 ms/token decode
  • L2 named swimlane/perf -> PASS, includes device greedy_sample and token_embed kernels

coderabbitai Bot commented Jun 29, 2026

📝 Walkthrough

Adds device-side greedy sampling and token embedding to the Qwen3-14B NPU executor. New capability flags (supports_device_sampling, supports_device_embedding) propagate through ModelExecutor, batch/result types, LLMEngine, and WorkerProcess. The NPU executor compiles greedy_sample and token_embed kernels, allocates shared sampling buffers, and the runner returns sampled_token_ids/next_hidden_states alongside logits.

Changes

Device-side sampling and embedding integration

Layer / File(s) and summary:

  • Batch/result types and executor capability interface (python/core/types.py, python/core/executor.py): Adds allow_device_greedy_sampling flag to PrefillBatch/DecodeBatch, optional sampled_token_ids/next_hidden_states to PrefillResult/DecodeResult, and supports_device_sampling/supports_device_embedding properties (defaulting to False) to ModelExecutor.
  • L3 dispatch host wrapper signatures and new kernel placeholders (examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py): Adds greedy_sample_fwd/token_embed_fwd module-level placeholders, reorders qwen3_prefill_host args, extends qwen3_decode_host to return (logits, sampled_ids, next_hidden), and adds qwen3_greedy_sample_host/qwen3_token_embed_host wrappers.
  • NPU executor: compilation, validation, and buffer allocation (examples/model/qwen3_14b/runner/npu_executor.py): Adds supports_device_sampling property, loads and registers greedy_sample/token_embed kernel modules, validates REAL_VOCAB/batch/vocab/hidden constants, computes sampled_ids_width, compiles new host wrappers, pads embed_tokens into padded_embed_weight, and allocates prefill/decode sampling buffers in _CompiledKernels.
  • NPU runner: data structures, dispatch args, and result extraction (examples/model/qwen3_14b/runner/npu_runner.py): Extends _CompiledKernels, _DecodeInputs, _DecodeKernelInputs, and _StaticKernelArgs with new sampling buffers/fields; wires them into static tensor sharing and kernel dispatch; adds _integrated_sample_result; updates run_prefill/run_decode to return sampled ids and next hidden states; pads token_ids into fixed-batch decode buffer.
  • Engine and serving worker integration (python/core/engine.py, python/core/serving_worker.py): Engine computes allow_device_greedy_sampling from temperature and executor flags, adds _sample_batch_rows and _decode_embeddings_from_cache_or_lookup helpers. Worker gates device sampling per-request, allocates zero decode embeddings when supports_device_embedding, and selects tokens via _sample_result_row.
  • Tests, test doubles, submodule, and scope note (tests/test_batching.py, tests/test_device_sampling_submission.py, pypto-lib, docs/pr-scope-note.md): Updates _compiled_kernels fixture, adds engine/worker integration tests and source-inspection tests for kernel inlining, introduces _DeviceSamplingExecutor/_FailingSampler/_FixedSampler test doubles, adds numeric argmax validation, bumps pypto-lib submodule, and adds PR scope note.
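The capability-flag flow described in the walkthrough can be sketched as follows. This is a simplified stand-in, not the classes from this PR: `ModelExecutor` and the two `supports_*` property names come from the walkthrough, while `DeviceExecutor`, `GenerateConfig`, and the free function `allow_device_greedy_sampling` are hypothetical names used only for illustration. The gate mirrors the condition quoted in the engine review comment (temperature plus both flags).

```python
from dataclasses import dataclass


class ModelExecutor:
    """Base executor: device-side features are opt-in and default to off."""

    @property
    def supports_device_sampling(self) -> bool:
        return False

    @property
    def supports_device_embedding(self) -> bool:
        return False


class DeviceExecutor(ModelExecutor):
    """Hypothetical executor that provides both device-side features."""

    @property
    def supports_device_sampling(self) -> bool:
        return True

    @property
    def supports_device_embedding(self) -> bool:
        return True


@dataclass
class GenerateConfig:
    """Hypothetical request config; only temperature matters here."""

    temperature: float = 0.0


def allow_device_greedy_sampling(config: GenerateConfig, executor: ModelExecutor) -> bool:
    # Greedy decoding (temperature <= 0) is the only mode the device sampler
    # covers in this sketch; anything else falls back to host sampling.
    return (
        config.temperature <= 0.0
        and executor.supports_device_sampling
        and executor.supports_device_embedding
    )


print(allow_device_greedy_sampling(GenerateConfig(0.0), DeviceExecutor()))
print(allow_device_greedy_sampling(GenerateConfig(0.7), DeviceExecutor()))
print(allow_device_greedy_sampling(GenerateConfig(0.0), ModelExecutor()))
```

Defaulting both properties to False in the base class means existing executors keep the host path without any changes.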

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/pypto-serving#19: Modifies Qwen314BPyptoExecutor._compile_prefill_fwd_callable in the same file and method, adjusting the int32 argument binding in the prefill kernel compilation path this PR extends.
  • hw-native-sys/pypto-serving#29: Rewrote the L3 runner dispatch in qwen3_l3_dispatch.py/npu_runner.py that this PR directly builds on to add greedy_sample_fwd/token_embed_fwd wrappers and the new qwen3_decode_host tuple return.

Poem

🐇 Hop hop, the tokens now flow,
From device kernels, sampled below!
No host argmax needed this day,
The NPU picks the next token's way.
Embed and sample, all in one go—
A speedy little rabbit says: "Bravo!" 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title clearly summarizes the main change: Qwen3-14B serving moves embedding and sampling onto the device.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. The description clearly matches the changeset, covering device-side sampling, token embedding, padding, and the non-streaming scope note.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces device-side embedding and greedy sampling support for non-streaming generation in the Qwen3-14B model, alongside a host-side CPU prefill path and chunked prefill support to mitigate out-of-memory errors on long contexts. It also adds stress testing scripts, performance documentation, and corresponding unit tests. The review feedback highlights a potential runtime error in Qwen314BPyptoExecutor where the environment variable PYPTO_QWEN3_PREFILL_ON_CPU is not checked during compilation, which would prevent the allocation of the required staging buffer. Additionally, a simplification is suggested in python/core/engine.py to eliminate a redundant conditional check and improve code readability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread examples/model/qwen3_14b/runner/npu_executor.py Outdated
Comment thread python/core/engine.py Outdated
sunkaixuan2018 added a commit to sunkaixuan2018/pypto-serving that referenced this pull request Jun 29, 2026
- Align CPU prefill env handling between executor and runner
- Simplify chunked prefill sampled-token handling
- Fix pre-commit and submodule-dependent tests
sunkaixuan2018 added a commit to sunkaixuan2018/pypto-serving that referenced this pull request Jun 29, 2026
- Remove BOM from npu runner source header
- Add missing test header
- Skip submodule source assertions when pypto-lib is not checked out
@sunkaixuan2018 sunkaixuan2018 changed the title from "[codex] wire qwen3 device sampling and embedding in serving" to "Qwen3-14B serving: device-side embedding and sampling" Jun 29, 2026
@sunkaixuan2018 sunkaixuan2018 marked this pull request as ready for review June 29, 2026 08:43
@sunkaixuan2018 sunkaixuan2018 force-pushed the emb-and-samping branch 9 times, most recently from 03f24d0 to 75c63dc Compare June 30, 2026 06:03

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
python/core/serving_worker.py (1)

338-345: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Preserve sampled-token row boundaries before indexing.

Flattening a 2-D sampled-id tensor can select a padded column instead of the requested batch row. Match the engine-side extraction by indexing the first column per row.

Proposed fix
 sampled = getattr(result, "sampled_token_ids", None)
 if allow_device_sampled and sampled is not None:
-    flat = sampled.view(-1)
-    if flat.numel() <= row_idx:
+    sampled_rows = sampled.reshape(sampled.shape[0], -1)
+    if sampled_rows.shape[0] <= row_idx:
         raise ValueError(
-            f"sampled_token_ids has {flat.numel()} rows, expected row {row_idx}"
+            f"sampled_token_ids has {sampled_rows.shape[0]} rows, expected row {row_idx}"
         )
-    return int(flat[row_idx].item())
+    return int(sampled_rows[row_idx, 0].item())
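A small pure-Python sketch of why flattening goes wrong. It assumes, for illustration only, that the sampled-id buffer has shape (batch, padded_width) with the real id in column 0 and zero padding elsewhere; nested lists stand in for the tensor and both helper names are hypothetical.

```python
def pick_flattened(rows, row_idx):
    """Mimics view(-1)[row_idx]: indexes the row-major flattened buffer."""
    flat = [value for row in rows for value in row]
    return flat[row_idx]


def pick_first_column(rows, row_idx):
    """Mimics the proposed fix: index the row first, then take column 0."""
    if len(rows) <= row_idx:
        raise ValueError(
            f"sampled_token_ids has {len(rows)} rows, expected row {row_idx}"
        )
    return rows[row_idx][0]


# batch = 3, padded_width = 4; only column 0 carries a sampled id.
sampled = [
    [594, 0, 0, 0],
    [220, 0, 0, 0],
    [5440, 0, 0, 0],
]

print(pick_flattened(sampled, 1))     # reads a padding column of row 0
print(pick_first_column(sampled, 1))  # reads the id sampled for row 1
```

The two helpers agree only when the buffer width is 1, which is why the flattened version can pass single-column tests and still misbehave on a padded buffer.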
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/serving_worker.py` around lines 338 - 345, In serving_worker.py,
the sampled-token path in the code that reads result.sampled_token_ids is
flattening the tensor with view(-1), which can pick the wrong element across row
boundaries. Update this logic so the extraction matches the engine-side behavior
by selecting the first token from the requested row (or otherwise indexing by
row first) before converting to int, and keep the existing bounds check in the
same sampled-token handling branch.
python/core/engine.py (2)

190-194: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Decouple device sampling from device embedding.

_sample_batch_rows only needs executor-provided token IDs, and decode can still use host embedding lookup when supports_device_embedding is false. Requiring both flags disables sampling-only executors in LLMEngine, while the worker path treats these capabilities independently.

Proposed adjustment
 allow_device_greedy_sampling = (
     generate_config.temperature <= 0.0
     and self._executor.supports_device_sampling
-    and self._executor.supports_device_embedding
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/engine.py` around lines 190 - 194, The greedy sampling gate in
_sample_batch_rows is coupling device sampling to device embedding
unnecessarily. Update the allow_device_greedy_sampling condition in LLMEngine to
depend only on temperature and self._executor.supports_device_sampling, and keep
decode-side embedding lookup separate so executors that only provide token IDs
can still use sampling. Verify the worker capability handling remains
independent by checking the related executor flags used around
_sample_batch_rows and decode.

444-450: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Preserve sampled-token row boundaries before indexing.

Flattening a (batch, padded_width) sampled-id tensor would read padding columns as later rows. Normalize to rows and take column 0.

Proposed fix
 if sampled_token_ids is not None:
-    flat_ids = sampled_token_ids.view(-1)
-    if flat_ids.numel() < row_count:
+    sampled_rows = sampled_token_ids.reshape(sampled_token_ids.shape[0], -1)
+    if sampled_rows.shape[0] < row_count:
         raise ValueError(
-            f"sampled_token_ids has {flat_ids.numel()} rows, expected at least {row_count}"
+            f"sampled_token_ids has {sampled_rows.shape[0]} rows, expected at least {row_count}"
         )
-    return [int(flat_ids[idx].item()) for idx in range(row_count)]
+    return [int(sampled_rows[idx, 0].item()) for idx in range(row_count)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/engine.py` around lines 444 - 450, In the sampled_token_ids
handling logic, flattening with view(-1) is collapsing the row structure and can
treat padding columns as subsequent rows. Update the row extraction in the
sampled-token path to preserve the original 2D boundaries in the code around
sampled_token_ids, flat_ids, and the return list, then index by row and take
column 0 for each row instead of reading from the flattened tensor. Keep the
existing row-count validation, but ensure the implementation works from
normalized rows rather than a flattened buffer.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a90bce51-9837-4107-a951-ebdd022d772c

📥 Commits

Reviewing files that changed from the base of the PR and between d37496a and 03f24d0.

📒 Files selected for processing (11)
  • docs/pr-scope-note.md
  • examples/model/qwen3_14b/runner/npu_executor.py
  • examples/model/qwen3_14b/runner/npu_runner.py
  • examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py
  • pypto-lib
  • python/core/engine.py
  • python/core/executor.py
  • python/core/serving_worker.py
  • python/core/types.py
  • tests/test_batching.py
  • tests/test_device_sampling_submission.py

Comment on lines +528 to +536
        token_ids = compiled.decode_token_ids_buffer
        token_ids.zero_()
        active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
        width = min(active_token_ids.shape[1], token_ids.shape[1])
        token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
        if actual_batch < kernel_batch:
            token_ids[actual_batch:, :width].copy_(
                active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
            )

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject multi-token decode rows instead of truncating them.

Line 531 uses min(...), so a malformed DecodeBatch.token_ids row with more than one token is silently truncated before device embedding. Decode is a single-token step; fail fast on shape drift.

Proposed fix
         token_ids = compiled.decode_token_ids_buffer
         token_ids.zero_()
         active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
-        width = min(active_token_ids.shape[1], token_ids.shape[1])
+        if active_token_ids.shape[1] != 1:
+            raise ValueError(
+                "decode token_ids must contain exactly one token per row, "
+                f"got shape {tuple(inputs.token_ids.shape)}"
+            )
+        width = 1
         token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
         if actual_batch < kernel_batch:
             token_ids[actual_batch:, :width].copy_(
                 active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
             )
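The intended behavior (reject malformed rows, then pad the fixed kernel batch with a copy of row 0) can be sketched without tensors. The function name `fill_decode_token_ids` and the list-of-lists layout are assumptions made for this illustration; the real code copies into a preallocated buffer.

```python
def fill_decode_token_ids(active_rows, kernel_batch):
    """Return `kernel_batch` single-token rows built from `active_rows`.

    Fails fast if any row does not hold exactly one token, instead of
    silently slicing extra tokens away.
    """
    if not active_rows:
        raise ValueError("decode batch must contain at least one row")
    if len(active_rows) > kernel_batch:
        raise ValueError(
            f"decode batch has {len(active_rows)} rows, "
            f"kernel batch is {kernel_batch}"
        )
    for row in active_rows:
        if len(row) != 1:
            raise ValueError(
                "decode token_ids must contain exactly one token per row, "
                f"got row of length {len(row)}"
            )
    # Pad unused batch slots with a copy of row 0 so the kernel always
    # sees valid token ids.
    padding = [list(active_rows[0]) for _ in range(kernel_batch - len(active_rows))]
    return [list(row) for row in active_rows] + padding


print(fill_decode_token_ids([[594], [220]], kernel_batch=4))
```

With the `min(...)` version, a row such as `[594, 220]` would be cut to `[594]` and decoding would continue on corrupted input; here it raises before anything is copied.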
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 528 - 536, The
decode path in npu_runner.py is silently truncating multi-token rows because
`width` is derived with `min(...)` inside the `decode_token_ids_buffer` copy
logic. Update the decode handling in the `compiled.decode_token_ids_buffer`
block to validate `inputs.token_ids`/`active_token_ids` row width before
copying, and raise a fail-fast error when any decode row contains more than one
token. Keep the normal single-token copy and the `actual_batch` padding behavior
intact, but do not slice away extra tokens.

Comment thread tests/test_batching.py
Comment on lines +588 to +615
    qwen3_l3_dispatch = importlib.import_module(
        "examples.model.qwen3_14b.runner.qwen3_l3_dispatch"
    )
    module_source = Path(qwen3_l3_dispatch.__file__).read_text(encoding="utf-8")
    start = module_source.index("def qwen3_decode_host")
    end = module_source.index("def qwen3_greedy_sample_host")
    source = module_source[start:end]

    assert source.count("decode_fwd(") == 1
    assert "token_embed_fwd(" not in source
    assert "greedy_sample_fwd(" not in source

    kernel_dir = Path(qwen3_l3_dispatch.__file__).parents[4] / "pypto-lib" / "models" / "qwen3" / "14b"
    if not kernel_dir.is_dir():
        pytest.skip("pypto-lib submodule is not checked out")
    decode_source = (kernel_dir / "decode_layer.py").read_text(encoding="utf-8")
    assert 'name_hint="token_embed"' in decode_source
    assert 'name_hint="greedy_sample"' in decode_source


def test_prefill_host_keeps_sampling_as_standalone_kernel():
    qwen3_l3_dispatch = importlib.import_module(
        "examples.model.qwen3_14b.runner.qwen3_l3_dispatch"
    )
    module_source = Path(qwen3_l3_dispatch.__file__).read_text(encoding="utf-8")
    start = module_source.index("def qwen3_prefill_host")
    end = module_source.index("def qwen3_decode_host")
    source = module_source[start:end]

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant test and the target module layout.
git ls-files tests/test_batching.py examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py
echo '--- outline tests/test_batching.py ---'
ast-grep outline tests/test_batching.py --view expanded || true
echo '--- outline examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py ---'
ast-grep outline examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py --view expanded || true
echo '--- relevant lines in tests/test_batching.py ---'
sed -n '560,640p' tests/test_batching.py
echo '--- top of qwen3_l3_dispatch.py ---'
sed -n '1,220p' examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py

Repository: hw-native-sys/pypto-serving

Length of output: 10738


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Determine whether the test file already imports importlib and whether there are
# similar source-only inspections elsewhere.
rg -n "importlib\.import_module|Path\(__file__\).*read_text|__file__" tests/test_batching.py tests -g '*.py'

Repository: hw-native-sys/pypto-serving

Length of output: 1762


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether pypto.language is part of the repo or only an external dependency,
# and whether qwen3_l3_dispatch is otherwise imported in tests.
git ls-files | rg '(^|/)pypto(/|$)|(^|/)language\.py$|(^|/)language/__init__\.py$|(^|/)qwen3_l3_dispatch\.py$'
echo '--- pypto.language references ---'
rg -n "import pypto\.language|from pypto\.language" -g '*.py'

Repository: hw-native-sys/pypto-serving

Length of output: 252


Read examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py from disk here. These checks only inspect source text, so importing the module adds an unnecessary dependency on pypto.language and module-level @pl.jit.host setup. Loading the file directly keeps the test independent of the runtime import path.
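A minimal sketch of the source-only approach, using a temporary stand-in file so the example runs anywhere. The helper `function_source` and the fake module body are hypothetical; only the three `def qwen3_*_host` marker names are taken from the test under review.

```python
import tempfile
from pathlib import Path


def function_source(module_path, start_marker, end_marker):
    """Return the text between two markers without importing the module."""
    text = Path(module_path).read_text(encoding="utf-8")
    start = text.index(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


# Stand-in for qwen3_l3_dispatch.py. The import on the first line would fail
# in an environment without pypto, but it is never executed: the file is
# only read as text.
FAKE_DISPATCH = '''import pypto.language as pl

def qwen3_prefill_host():
    prefill_fwd()

def qwen3_decode_host():
    decode_fwd()

def qwen3_greedy_sample_host():
    greedy_sample_fwd()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "qwen3_l3_dispatch.py"
    path.write_text(FAKE_DISPATCH, encoding="utf-8")
    decode_src = function_source(
        path, "def qwen3_decode_host", "def qwen3_greedy_sample_host"
    )

print(decode_src)
```

In the real test the path would be resolved relative to the test file (for example from `Path(__file__)`), so no `importlib.import_module` call is needed for assertions that only inspect text.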

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_batching.py` around lines 588 - 615, The batching tests are
importing qwen3_l3_dispatch via importlib, which unnecessarily pulls in runtime
dependencies like pypto.language and module-level JIT setup. Update the test
cases in test_batching.py to read the qwen3_l3_dispatch.py source directly from
disk with Path, then slice the qwen3_prefill_host and qwen3_decode_host sections
from that text for assertions. Keep the existing symbol-based checks on
qwen3_prefill_host, qwen3_decode_host, and qwen3_greedy_sample_host, but remove
the importlib-based module import path.

Comment on lines +16 to +31
ROOT = Path(__file__).resolve().parents[1]
QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
REAL_VOCAB = 151936
PADDED_VOCAB = 152064


def _source(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def test_device_sampling_is_limited_by_runtime_vocab_size() -> None:
    dispatch = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "qwen3_l3_dispatch.py")
    executor = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_executor.py")
    runner = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_runner.py")
    config = _source(QWEN / "config.py")
    greedy = _source(QWEN / "greedy_sample.py")

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Skip this module when pypto-lib is not checked out.

Every test below reads pypto-lib/models/qwen3/14b/* unconditionally, so a checkout without submodules will fail with FileNotFoundError instead of skipping. tests/test_batching.py already guards the same optional tree that way.

Suggested change
 ROOT = Path(__file__).resolve().parents[1]
 QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
 REAL_VOCAB = 151936
 PADDED_VOCAB = 152064
+
+pytestmark = pytest.mark.skipif(
+    not QWEN.is_dir(),
+    reason="pypto-lib submodule is not checked out",
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_device_sampling_submission.py` around lines 16 - 31, Guard this
test module against missing optional pypto-lib contents by checking for the QWEN
tree before any _source() calls; if the model checkout is absent, skip the
module/tests instead of reading files unconditionally. Use the existing
ROOT/QWEN setup in tests/test_device_sampling_submission.py and mirror the
skip-if-missing pattern already used in tests/test_batching.py so helpers like
_source and test_device_sampling_is_limited_by_runtime_vocab_size only run when
the files exist.

zhangqi-chen pushed a commit to hw-native-sys/pypto-lib that referenced this pull request Jul 1, 2026
## Summary

This PR adds the Qwen3 pypto-lib side of moving greedy sampling and
token embedding onto device.

It is paired with the pypto-serving PR below. The two PRs should be
reviewed together: pypto-lib provides the device kernels and
padded-vocab semantics, while pypto-serving wires those kernels into
generation and validates the serving behavior.

Paired pypto-serving PR:
hw-native-sys/pypto-serving#47

## Performance Impact

Measured with the paired serving PR on myserver using full 40-layer,
128-token NPU generation:

- Latest main baseline: total 8.583s, prefill 0.410s, decode 6.211s over
127 decode steps, 48.9 ms/token.
- Device emb/sample branch: total 6.450s, prefill 0.373s, decode 6.033s
over 127 decode steps, 47.5 ms/token.

This is about 25% lower end-to-end generation time in that run, with
decode TPOT slightly lower as well. The main reason is not a large
kernel-level speedup: the device path adds small `greedy_sample` /
`token_embed` kernels, but removes the per-token host sampling, host
embedding lookup, and host/device synchronization work from the
generation loop.

## What Changed

- Add `REAL_VOCAB` for Qwen3-14B so kernels can distinguish the real
vocabulary from the padded device vocabulary.
- Add standalone greedy sampling and token embedding kernels.
- Integrate previous-token embedding and next-token greedy sampling into
the fused decode kernel.
- Keep prefill free of inline sample/embed logic; serving runs
standalone device greedy sampling after prefill when greedy generation
is enabled.
- Make greedy tie-break match host `torch.argmax` by reverse-scanning
equal best logits so the smallest token id wins.
- Clamp any sampled padded-vocab id back to token 0 as a defensive
guard.
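
The last two bullets can be illustrated with a host-side sketch. This is not the device kernel: it is a plain-Python reading of "reverse-scan equal best logits so the smallest token id wins" plus the defensive clamp, and the function name `greedy_sample_reverse_scan` is hypothetical.

```python
def greedy_sample_reverse_scan(logits, real_vocab):
    """Greedy pick over padded logits, scanning from the last id to the first.

    Using `>=` while scanning backwards means that, among equal maxima,
    the smallest token id is the one kept at the end of the scan.
    """
    if not logits:
        raise ValueError("logits must not be empty")
    best = len(logits) - 1
    for idx in range(len(logits) - 2, -1, -1):
        if logits[idx] >= logits[best]:
            best = idx
    # Defensive guard: an id in the padded range is mapped back to token 0.
    return best if best < real_vocab else 0


# Tie between ids 1 and 2: the smaller id wins.
print(greedy_sample_reverse_scan([1.0, 3.0, 3.0, 2.0], real_vocab=4))
# Best logit sits in the padded range (id 2 with real_vocab=2): clamped to 0.
print(greedy_sample_reverse_scan([0.0, 1.0, 5.0], real_vocab=2))
```

When the serving side pads `lm_head` by duplicating token 0, a padded id should at most tie with token 0, so the clamp is expected to be a no-op in normal operation.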

## Validation

Validated together with the paired serving branch on myserver:

- `pytest tests/test_batching.py
tests/test_device_sampling_submission.py -q` -> 20 passed
- `python -m pytest tests/golden/test_qwen3_greedy_source.py -q` -> 3
passed
- greedy sample `a2a3sim` -> PASS
- NPU generation -> exit 0, token ids `[594, 220, 20, 38, 5440]`, text
`'s 5G technology`