Qwen3-14B serving: device-side embedding and sampling by sunkaixuan2018 · Pull Request #47 · hw-native-sys/pypto-serving

sunkaixuan2018 · 2026-06-29T07:09:02Z

Summary

This PR adds the pypto-serving side of moving Qwen3 greedy sampling and token embedding from host to device for non-streaming generation.

This branch was rebuilt cleanly from upstream/main and no longer includes the earlier luohuan stream/base commit. It is paired with the pypto-lib PR below. The two PRs should be reviewed together: pypto-lib defines the device kernels and padded-vocab behavior, and this serving PR pads weights, calls the new kernels, consumes device-sampled ids, and verifies the serving path.

Paired pypto-lib PR: hw-native-sys/pypto-lib#639

Performance Impact

Measured on myserver with full 40-layer, 128-token NPU generation:

Latest main baseline: total 8.583s, prefill 0.410s, decode 6.211s over 127 decode steps, 48.9 ms/token.
Device emb/sample branch: total 6.450s, prefill 0.373s, decode 6.033s over 127 decode steps, 47.5 ms/token.

This run shows about 25% lower end-to-end generation time (8.583s to 6.450s). Device kernel time is roughly unchanged overall, and the new path adds small greedy_sample / token_embed kernels; the main win is that serving no longer performs greedy sampling, embedding lookup, and the associated host/device synchronization on the host once per generated token.
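The headline figures can be re-derived from the reported timings. The check below is only arithmetic on the numbers quoted above. The "outside prefill+decode" value (total minus prefill minus decode) is a derived quantity that the PR does not report directly, and reading it as host-side overhead is an assumption.

```python
# Reported timings in seconds, copied from the two runs above.
runs = {
    "main baseline": {"total": 8.583, "prefill": 0.410, "decode": 6.211},
    "device emb/sample": {"total": 6.450, "prefill": 0.373, "decode": 6.033},
}
DECODE_STEPS = 127


def summarize(run):
    """Return per-token decode latency (ms) and time outside prefill/decode (s)."""
    ms_per_token = run["decode"] / DECODE_STEPS * 1000.0
    unattributed = run["total"] - run["prefill"] - run["decode"]
    return ms_per_token, unattributed


base = runs["main baseline"]
new = runs["device emb/sample"]
# Relative reduction in end-to-end generation time.
reduction = (base["total"] - new["total"]) / base["total"]

for name, run in runs.items():
    ms_per_token, unattributed = summarize(run)
    print(f"{name}: {ms_per_token:.1f} ms/token, {unattributed:.3f}s outside prefill+decode")
print(f"end-to-end reduction: {reduction:.1%}")
```

On these numbers the per-token decode latency moves only from 48.9 to 47.5 ms, while the time not attributed to prefill or decode drops from roughly 1.96s to roughly 0.04s, which is consistent with the explanation that the win comes from removing per-token host work.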

What Changed

Wire Qwen3 prefill/decode to device-side greedy sampling and embedding.
Pad lm_head rows by duplicating token 0 so padded-vocab logits are deterministic.
Keep padded embed_weight rows as zero padding.
Remove the runtime valid_vocab_size tensor path and rely on the lib-side REAL_VOCAB clamp.
Add numeric regression coverage for padded-vocab argmax semantics, real-token tie-break, and REAL_VOCAB clamp.
Simplify decode hidden-state flow: device-embedding decode now passes a zero placeholder instead of caching pending_next_hidden on host.
Document PR scope as non-streaming generation.
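The padding and clamp rules in the list above can be illustrated with a small host-side reference. This is a minimal sketch using plain Python lists and toy sizes, not the device kernel from pypto-lib; the helper names (`pad_lm_head`, `pad_embed`, `compute_logits`, `greedy_sample`) are hypothetical and chosen only for illustration.

```python
REAL_VOCAB = 5    # toy size; the PR's tests use 151936
PADDED_VOCAB = 8  # toy size; the PR's tests use 152064


def pad_lm_head(rows, padded_vocab):
    """Append copies of row 0 so every padded logit equals token 0's logit."""
    return rows + [list(rows[0]) for _ in range(padded_vocab - len(rows))]


def pad_embed(rows, padded_vocab):
    """Append all-zero rows, matching the zero padding described above."""
    hidden = len(rows[0])
    return rows + [[0.0] * hidden for _ in range(padded_vocab - len(rows))]


def compute_logits(lm_head, hidden_state):
    """Dot product of each lm_head row with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in lm_head]


def greedy_sample(logits, real_vocab):
    """Argmax with smallest-id tie-break, then clamp padded ids to token 0."""
    token_id = logits.index(max(logits))  # list.index returns the first maximum
    return token_id if token_id < real_vocab else 0


# Toy lm_head: identity matrix, so hidden_state directly selects a token.
lm_head = [[float(i == j) for j in range(REAL_VOCAB)] for i in range(REAL_VOCAB)]
padded_lm_head = pad_lm_head(lm_head, PADDED_VOCAB)

# Token 0 is best: padded rows tie with it, and the tie-break keeps id 0.
tie_case = greedy_sample(compute_logits(padded_lm_head, [1.0, 0.0, 0.0, 0.0, 0.0]), REAL_VOCAB)
# A real token wins outright: padded rows only ever match token 0's logit.
real_case = greedy_sample(compute_logits(padded_lm_head, [0.0, 0.0, 0.0, 1.0, 0.0]), REAL_VOCAB)
print(tie_case, real_case)
```

Because a padded row can only tie with token 0 and never exceed it, a smallest-id tie-break already keeps the result inside the real vocabulary; the clamp acts as a second guard.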

Validation

Local:

python -m py_compile examples/model/qwen3_14b/runner/npu_executor.py examples/model/qwen3_14b/runner/npu_runner.py examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py python/core/engine.py python/core/executor.py python/core/serving_worker.py python/core/types.py tests/test_batching.py tests/test_device_sampling_submission.py
python -m pytest tests/test_device_sampling_submission.py -q -> 3 passed, 1 skipped locally because torch is unavailable

myserver:

pytest tests/test_device_sampling_submission.py -q -> 4 passed
pytest tests/test_batching.py -q -> 16 passed
pytest tests/test_batching.py tests/test_device_sampling_submission.py -q -> 20 passed
greedy sample a2a3sim -> PASS
NPU generation -> exit 0, token ids [594, 220, 20, 38, 5440], text 's 5G technology
Full 40-layer, 128-token comparison against latest main baseline -> baseline 8.583s total / 48.9 ms/token decode, device branch 6.450s total / 47.5 ms/token decode
L2 named swimlane/perf -> PASS, includes device greedy_sample and token_embed kernels

coderabbitai · 2026-06-29T07:09:10Z

Walkthrough

Adds device-side greedy sampling and token embedding to the Qwen3-14B NPU executor. New capability flags (supports_device_sampling, supports_device_embedding) propagate through ModelExecutor, batch/result types, LLMEngine, and WorkerProcess. The NPU executor compiles greedy_sample and token_embed kernels, allocates shared sampling buffers, and the runner returns sampled_token_ids/next_hidden_states alongside logits.

Changes

Device-side sampling and embedding integration

Layer / File(s)	Summary
Batch/result types and executor capability interface `python/core/types.py`, `python/core/executor.py`	Adds `allow_device_greedy_sampling` flag to `PrefillBatch`/`DecodeBatch`, optional `sampled_token_ids`/`next_hidden_states` to `PrefillResult`/`DecodeResult`, and `supports_device_sampling`/`supports_device_embedding` properties (defaulting to `False`) to `ModelExecutor`.
L3 dispatch host wrapper signatures and new kernel placeholders `examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py`	Adds `greedy_sample_fwd`/`token_embed_fwd` module-level placeholders, reorders `qwen3_prefill_host` args, extends `qwen3_decode_host` to return `(logits, sampled_ids, next_hidden)`, and adds `qwen3_greedy_sample_host`/`qwen3_token_embed_host` wrappers.
NPU executor: compilation, validation, and buffer allocation `examples/model/qwen3_14b/runner/npu_executor.py`	Adds `supports_device_sampling` property, loads and registers `greedy_sample`/`token_embed` kernel modules, validates `REAL_VOCAB`/batch/vocab/hidden constants, computes `sampled_ids_width`, compiles new host wrappers, pads `embed_tokens` into `padded_embed_weight`, and allocates prefill/decode sampling buffers in `_CompiledKernels`.
NPU runner: data structures, dispatch args, and result extraction `examples/model/qwen3_14b/runner/npu_runner.py`	Extends `_CompiledKernels`, `_DecodeInputs`, `_DecodeKernelInputs`, and `_StaticKernelArgs` with new sampling buffers/fields; wires them into static tensor sharing and kernel dispatch; adds `_integrated_sample_result`; updates `run_prefill`/`run_decode` to return sampled ids and next hidden states; pads `token_ids` into fixed-batch decode buffer.
Engine and serving worker integration `python/core/engine.py`, `python/core/serving_worker.py`	Engine computes `allow_device_greedy_sampling` from temperature and executor flags, adds `_sample_batch_rows` and `_decode_embeddings_from_cache_or_lookup` helpers. Worker gates device sampling per-request, allocates zero decode embeddings when `supports_device_embedding`, and selects tokens via `_sample_result_row`.
Tests, test doubles, submodule, and scope note `tests/test_batching.py`, `tests/test_device_sampling_submission.py`, `pypto-lib`, `docs/pr-scope-note.md`	Updates `_compiled_kernels` fixture, adds engine/worker integration tests and source-inspection tests for kernel inlining, introduces `_DeviceSamplingExecutor`/`_FailingSampler`/`_FixedSampler` test doubles, adds numeric argmax validation, bumps `pypto-lib` submodule, and adds PR scope note.
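The capability interface summarized in the table can be pictured roughly as follows. This is a simplified sketch under stated assumptions: the real types carry tensors and more fields, plain lists stand in for them here, and `DeviceSamplingExecutor` and `select_token` are hypothetical names rather than code from the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


class ModelExecutor:
    """Base executor: device-side capabilities are opt-in and default to False."""

    @property
    def supports_device_sampling(self) -> bool:
        return False

    @property
    def supports_device_embedding(self) -> bool:
        return False


class DeviceSamplingExecutor(ModelExecutor):
    """Hypothetical executor that returns device-sampled token ids."""

    @property
    def supports_device_sampling(self) -> bool:
        return True


@dataclass
class DecodeResult:
    logits: List[List[float]]
    # Optional: populated only when the executor sampled on device.
    sampled_token_ids: Optional[List[int]] = None


def select_token(result: DecodeResult, row_idx: int, allow_device_sampled: bool) -> int:
    """Prefer the device-sampled id; otherwise fall back to a host-side argmax."""
    if allow_device_sampled and result.sampled_token_ids is not None:
        return result.sampled_token_ids[row_idx]
    row = result.logits[row_idx]
    return row.index(max(row))


host_only = DecodeResult(logits=[[0.1, 0.9, 0.0]])
with_device_ids = DecodeResult(logits=[[0.1, 0.9, 0.0]], sampled_token_ids=[2])
print(select_token(host_only, 0, True), select_token(with_device_ids, 0, True))
```

Keeping the new result fields optional means an executor that ignores the flags continues to work through the host sampling path.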

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/pypto-serving#19: Modifies Qwen314BPyptoExecutor._compile_prefill_fwd_callable in the same file and method, adjusting the int32 argument binding in the prefill kernel compilation path this PR extends.
hw-native-sys/pypto-serving#29: Rewrote the L3 runner dispatch in qwen3_l3_dispatch.py/npu_runner.py that this PR directly builds on to add greedy_sample_fwd/token_embed_fwd wrappers and the new qwen3_decode_host tuple return.

Poem

🐇 Hop hop, the tokens now flow,
From device kernels, sampled below!
No host argmax needed this day,
The NPU picks the next token's way.
Embed and sample, all in one go—
A speedy little rabbit says: "Bravo!" 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: Qwen3-14B serving moves embedding and sampling onto the device.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description clearly matches the changeset, covering device-side sampling, token embedding, padding, and the non-streaming scope note.


gemini-code-assist

Code Review

This pull request introduces device-side embedding and greedy sampling support for non-streaming generation in the Qwen3-14B model, alongside a host-side CPU prefill path and chunked prefill support to mitigate out-of-memory errors on long contexts. It also adds stress testing scripts, performance documentation, and corresponding unit tests. The review feedback highlights a potential runtime error in Qwen314BPyptoExecutor where the environment variable PYPTO_QWEN3_PREFILL_ON_CPU is not checked during compilation, which would prevent the allocation of the required staging buffer. Additionally, a simplification is suggested in python/core/engine.py to eliminate a redundant conditional check and improve code readability.

- Align CPU prefill env handling between executor and runner
- Simplify chunked prefill sampled-token handling
- Fix pre-commit and submodule-dependent tests

- Remove BOM from npu runner source header
- Add missing test header
- Skip submodule source assertions when pypto-lib is not checked out

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (3)

python/core/serving_worker.py (1)

338-345: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Preserve sampled-token row boundaries before indexing.

Flattening a 2-D sampled-id tensor can select a padded column instead of the requested batch row. Match the engine-side extraction by indexing the first column per row.

Proposed fix

 sampled = getattr(result, "sampled_token_ids", None)
 if allow_device_sampled and sampled is not None:
-    flat = sampled.view(-1)
-    if flat.numel() <= row_idx:
+    sampled_rows = sampled.reshape(sampled.shape[0], -1)
+    if sampled_rows.shape[0] <= row_idx:
         raise ValueError(
-            f"sampled_token_ids has {flat.numel()} rows, expected row {row_idx}"
+            f"sampled_token_ids has {sampled_rows.shape[0]} rows, expected row {row_idx}"
         )
-    return int(flat[row_idx].item())
+    return int(sampled_rows[row_idx, 0].item())

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/serving_worker.py` around lines 338 - 345, In serving_worker.py,
the sampled-token path in the code that reads result.sampled_token_ids is
flattening the tensor with view(-1), which can pick the wrong element across row
boundaries. Update this logic so the extraction matches the engine-side behavior
by selecting the first token from the requested row (or otherwise indexing by
row first) before converting to int, and keep the existing bounds check in the
same sampled-token handling branch.
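The row-boundary problem described in this comment is easy to reproduce with a toy buffer. The sketch below uses nested lists in place of a tensor; the `(batch, padded_width)` shape and the zero padding value are assumptions made for illustration.

```python
def flatten(rows):
    """Mimic a row-major flatten of a 2-D buffer, as view(-1) does on a contiguous tensor."""
    return [value for row in rows for value in row]


# Toy sampled-id buffer: batch of 2, each row padded to width 4.
# Only column 0 of each row holds a real sampled token id.
sampled = [
    [594, 0, 0, 0],
    [220, 0, 0, 0],
]
row_idx = 1

flat_pick = flatten(sampled)[row_idx]  # reads row 0, column 1: a padding slot
row_pick = sampled[row_idx][0]         # reads row 1, column 0: the sampled id

print(flat_pick, row_pick)
```

With a width-1 buffer both reads agree, which is why the flattened form can pass tests until the buffer is padded.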

python/core/engine.py (2)

190-194: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Decouple device sampling from device embedding.

_sample_batch_rows only needs executor-provided token IDs, and decode can still use host embedding lookup when supports_device_embedding is false. Requiring both flags disables sampling-only executors in LLMEngine, while the worker path treats these capabilities independently.

Proposed adjustment

 allow_device_greedy_sampling = (
     generate_config.temperature <= 0.0
     and self._executor.supports_device_sampling
-    and self._executor.supports_device_embedding
 )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/engine.py` around lines 190 - 194, The greedy sampling gate in
_sample_batch_rows is coupling device sampling to device embedding
unnecessarily. Update the allow_device_greedy_sampling condition in LLMEngine to
depend only on temperature and self._executor.supports_device_sampling, and keep
decode-side embedding lookup separate so executors that only provide token IDs
can still use sampling. Verify the worker capability handling remains
independent by checking the related executor flags used around
_sample_batch_rows and decode.
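The decoupling this comment proposes can be sketched as two independent decisions. `ExecutorCaps` and `embedding_source` are hypothetical names used only for this illustration; the gate itself mirrors the proposed condition, not necessarily what the PR finally merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorCaps:
    """Hypothetical stand-in for the two executor capability flags."""
    supports_device_sampling: bool = False
    supports_device_embedding: bool = False


def allow_device_greedy_sampling(temperature: float, caps: ExecutorCaps) -> bool:
    """Greedy requests (temperature <= 0) may consume device-sampled ids."""
    return temperature <= 0.0 and caps.supports_device_sampling


def embedding_source(caps: ExecutorCaps) -> str:
    """Decode embeddings come from the device only when the executor offers it."""
    return "device" if caps.supports_device_embedding else "host_lookup"


sampling_only = ExecutorCaps(supports_device_sampling=True)
print(allow_device_greedy_sampling(0.0, sampling_only))  # greedy request
print(embedding_source(sampling_only))                   # embedding stays on host
print(allow_device_greedy_sampling(0.7, sampling_only))  # non-greedy request
```

Under this split, a sampling-only executor still gets device-sampled ids for greedy requests while decode falls back to the host embedding lookup.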

444-450: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Preserve sampled-token row boundaries before indexing.

Flattening a (batch, padded_width) sampled-id tensor would read padding columns as later rows. Normalize to rows and take column 0.

Proposed fix

 if sampled_token_ids is not None:
-    flat_ids = sampled_token_ids.view(-1)
-    if flat_ids.numel() < row_count:
+    sampled_rows = sampled_token_ids.reshape(sampled_token_ids.shape[0], -1)
+    if sampled_rows.shape[0] < row_count:
         raise ValueError(
-            f"sampled_token_ids has {flat_ids.numel()} rows, expected at least {row_count}"
+            f"sampled_token_ids has {sampled_rows.shape[0]} rows, expected at least {row_count}"
         )
-    return [int(flat_ids[idx].item()) for idx in range(row_count)]
+    return [int(sampled_rows[idx, 0].item()) for idx in range(row_count)]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/engine.py` around lines 444 - 450, In the sampled_token_ids
handling logic, flattening with view(-1) is collapsing the row structure and can
treat padding columns as subsequent rows. Update the row extraction in the
sampled-token path to preserve the original 2D boundaries in the code around
sampled_token_ids, flat_ids, and the return list, then index by row and take
column 0 for each row instead of reading from the flattened tensor. Keep the
existing row-count validation, but ensure the implementation works from
normalized rows rather than a flattened buffer.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a90bce51-9837-4107-a951-ebdd022d772c

📥 Commits

Reviewing files that changed from the base of the PR and between d37496a and 03f24d0.

📒 Files selected for processing (11)

docs/pr-scope-note.md
examples/model/qwen3_14b/runner/npu_executor.py
examples/model/qwen3_14b/runner/npu_runner.py
examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py
pypto-lib
python/core/engine.py
python/core/executor.py
python/core/serving_worker.py
python/core/types.py
tests/test_batching.py
tests/test_device_sampling_submission.py

coderabbitai · 2026-06-30T06:08:21Z

+        token_ids = compiled.decode_token_ids_buffer
+        token_ids.zero_()
+        active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
+        width = min(active_token_ids.shape[1], token_ids.shape[1])
+        token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
+        if actual_batch < kernel_batch:
+            token_ids[actual_batch:, :width].copy_(
+                active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
+            )


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject multi-token decode rows instead of truncating them.

Line 531 uses min(...), so a malformed DecodeBatch.token_ids row with more than one token is silently truncated before device embedding. Decode is a single-token step; fail fast on shape drift.

Proposed fix

 token_ids = compiled.decode_token_ids_buffer
 token_ids.zero_()
 active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
-width = min(active_token_ids.shape[1], token_ids.shape[1])
+if active_token_ids.shape[1] != 1:
+    raise ValueError(
+        "decode token_ids must contain exactly one token per row, "
+        f"got shape {tuple(inputs.token_ids.shape)}"
+    )
+width = 1
 token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
 if actual_batch < kernel_batch:
     token_ids[actual_batch:, :width].copy_(
         active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
     )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-token_ids = compiled.decode_token_ids_buffer
-token_ids.zero_()
-active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
-width = min(active_token_ids.shape[1], token_ids.shape[1])
-token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
-if actual_batch < kernel_batch:
-    token_ids[actual_batch:, :width].copy_(
-        active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
-    )
+token_ids = compiled.decode_token_ids_buffer
+token_ids.zero_()
+active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
+if active_token_ids.shape[1] != 1:
+    raise ValueError(
+        "decode token_ids must contain exactly one token per row, "
+        f"got shape {tuple(inputs.token_ids.shape)}"
+    )
+width = 1
+token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
+if actual_batch < kernel_batch:
+    token_ids[actual_batch:, :width].copy_(
+        active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
+    )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 528 - 536, The decode path in npu_runner.py is silently truncating multi-token rows because `width` is derived with `min(...)` inside the `decode_token_ids_buffer` copy logic. Update the decode handling in the `compiled.decode_token_ids_buffer` block to validate `inputs.token_ids`/`active_token_ids` row width before copying, and raise a fail-fast error when any decode row contains more than one token. Keep the normal single-token copy and the `actual_batch` padding behavior intact, but do not slice away extra tokens.
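The fail-fast behaviour requested here can be shown without a tensor library. The sketch below is a list-based analogue of the buffer fill: `fill_decode_token_buffer` is a hypothetical helper, and the extra checks on empty or oversized batches are assumptions added for the example rather than part of the proposed fix.

```python
def fill_decode_token_buffer(token_rows, kernel_batch):
    """Build a kernel_batch x 1 token buffer, rejecting rows that are not single-token."""
    if not token_rows or len(token_rows) > kernel_batch:
        raise ValueError(
            f"expected between 1 and {kernel_batch} decode rows, got {len(token_rows)}"
        )
    for row in token_rows:
        if len(row) != 1:
            # Fail fast instead of silently slicing away extra tokens.
            raise ValueError(
                "decode token_ids must contain exactly one token per row, "
                f"got a row of length {len(row)}"
            )
    # Padding rows repeat the first active row, as in the copy logic shown above.
    padding = [list(token_rows[0]) for _ in range(kernel_batch - len(token_rows))]
    return [list(row) for row in token_rows] + padding


print(fill_decode_token_buffer([[594], [220]], kernel_batch=4))
```

A row such as `[594, 220]` now raises `ValueError` rather than being truncated to its first token before device embedding.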

coderabbitai · 2026-06-30T06:08:21Z

+    qwen3_l3_dispatch = importlib.import_module(
+        "examples.model.qwen3_14b.runner.qwen3_l3_dispatch"
+    )
+    module_source = Path(qwen3_l3_dispatch.__file__).read_text(encoding="utf-8")
+    start = module_source.index("def qwen3_decode_host")
+    end = module_source.index("def qwen3_greedy_sample_host")
+    source = module_source[start:end]
+
+    assert source.count("decode_fwd(") == 1
+    assert "token_embed_fwd(" not in source
+    assert "greedy_sample_fwd(" not in source
+
+    kernel_dir = Path(qwen3_l3_dispatch.__file__).parents[4] / "pypto-lib" / "models" / "qwen3" / "14b"
+    if not kernel_dir.is_dir():
+        pytest.skip("pypto-lib submodule is not checked out")
+    decode_source = (kernel_dir / "decode_layer.py").read_text(encoding="utf-8")
+    assert 'name_hint="token_embed"' in decode_source
+    assert 'name_hint="greedy_sample"' in decode_source
+
+
+def test_prefill_host_keeps_sampling_as_standalone_kernel():
+    qwen3_l3_dispatch = importlib.import_module(
+        "examples.model.qwen3_14b.runner.qwen3_l3_dispatch"
+    )
+    module_source = Path(qwen3_l3_dispatch.__file__).read_text(encoding="utf-8")
+    start = module_source.index("def qwen3_prefill_host")
+    end = module_source.index("def qwen3_decode_host")
+    source = module_source[start:end]


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant test and the target module layout.
git ls-files tests/test_batching.py examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py

echo '--- outline tests/test_batching.py ---'
ast-grep outline tests/test_batching.py --view expanded || true

echo '--- outline examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py ---'
ast-grep outline examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py --view expanded || true

echo '--- relevant lines in tests/test_batching.py ---'
sed -n '560,640p' tests/test_batching.py

echo '--- top of qwen3_l3_dispatch.py ---'
sed -n '1,220p' examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py

Repository: hw-native-sys/pypto-serving

Length of output: 10738

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Determine whether the test file already imports importlib and whether there are
# similar source-only inspections elsewhere.
rg -n "importlib\.import_module|Path\(__file__\).*read_text|__file__" tests/test_batching.py tests -g '*.py'

Repository: hw-native-sys/pypto-serving

Length of output: 1762

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether pypto.language is part of the repo or only an external dependency,
# and whether qwen3_l3_dispatch is otherwise imported in tests.
git ls-files | rg '(^|/)pypto(/|$)|(^|/)language\.py$|(^|/)language/__init__\.py$|(^|/)qwen3_l3_dispatch\.py$'

echo '--- pypto.language references ---'
rg -n "import pypto\.language|from pypto\.language" -g '*.py'

Repository: hw-native-sys/pypto-serving

Length of output: 252

Read examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py from disk here. These checks only inspect source text, so importing the module adds an unnecessary dependency on pypto.language and module-level @pl.jit.host setup. Loading the file directly keeps the test independent of the runtime import path.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_batching.py` around lines 588 - 615, The batching tests are importing qwen3_l3_dispatch via importlib, which unnecessarily pulls in runtime dependencies like pypto.language and module-level JIT setup. Update the test cases in test_batching.py to read the qwen3_l3_dispatch.py source directly from disk with Path, then slice the qwen3_prefill_host and qwen3_decode_host sections from that text for assertions. Keep the existing symbol-based checks on qwen3_prefill_host, qwen3_decode_host, and qwen3_greedy_sample_host, but remove the importlib-based module import path.
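A source-only inspection of this kind might look like the sketch below. `slice_function_source` is a hypothetical helper, and the demo writes a tiny stand-in file to a temporary directory so the example runs without the repository; in the real test the path would point at `qwen3_l3_dispatch.py`.

```python
import tempfile
from pathlib import Path


def slice_function_source(path, start_marker, end_marker):
    """Read a file as text and return the span between two `def` markers."""
    text = Path(path).read_text(encoding="utf-8")
    start = text.index(start_marker)
    end = text.index(end_marker, start)  # search only after the start marker
    return text[start:end]


# Stand-in module text; no import of the module (or its dependencies) is needed.
DEMO_SOURCE = (
    "def qwen3_decode_host():\n"
    "    return decode_fwd()\n"
    "\n"
    "def qwen3_greedy_sample_host():\n"
    "    return greedy_sample_fwd()\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "dispatch_demo.py"
    demo_path.write_text(DEMO_SOURCE, encoding="utf-8")
    decode_source = slice_function_source(
        demo_path, "def qwen3_decode_host", "def qwen3_greedy_sample_host"
    )

print(decode_source.count("decode_fwd("))
```

Since the file is never imported, the assertions do not depend on `pypto.language` being installed or on module-level JIT setup running.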

coderabbitai · 2026-06-30T06:08:21Z

+ROOT = Path(__file__).resolve().parents[1]
+QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
+REAL_VOCAB = 151936
+PADDED_VOCAB = 152064
+
+
+def _source(path: Path) -> str:
+    return path.read_text(encoding="utf-8")
+
+
+def test_device_sampling_is_limited_by_runtime_vocab_size() -> None:
+    dispatch = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "qwen3_l3_dispatch.py")
+    executor = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_executor.py")
+    runner = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_runner.py")
+    config = _source(QWEN / "config.py")
+    greedy = _source(QWEN / "greedy_sample.py")


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Skip this module when pypto-lib is not checked out.

Every test below reads pypto-lib/models/qwen3/14b/* unconditionally, so a checkout without submodules will fail with FileNotFoundError instead of skipping. tests/test_batching.py already guards the same optional tree that way.

Suggested change

 ROOT = Path(__file__).resolve().parents[1]
 QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
 REAL_VOCAB = 151936
 PADDED_VOCAB = 152064
+
+pytestmark = pytest.mark.skipif(
+    not QWEN.is_dir(),
+    reason="pypto-lib submodule is not checked out",
+)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-ROOT = Path(__file__).resolve().parents[1]
-QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
-REAL_VOCAB = 151936
-PADDED_VOCAB = 152064
-def _source(path: Path) -> str:
-    return path.read_text(encoding="utf-8")
-def test_device_sampling_is_limited_by_runtime_vocab_size() -> None:
-    dispatch = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "qwen3_l3_dispatch.py")
-    executor = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_executor.py")
-    runner = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_runner.py")
-    config = _source(QWEN / "config.py")
-    greedy = _source(QWEN / "greedy_sample.py")
+ROOT = Path(__file__).resolve().parents[1]
+QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
+REAL_VOCAB = 151936
+PADDED_VOCAB = 152064
+pytestmark = pytest.mark.skipif(
+    not QWEN.is_dir(),
+    reason="pypto-lib submodule is not checked out",
+)
+def _source(path: Path) -> str:
+    return path.read_text(encoding="utf-8")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_device_sampling_submission.py` around lines 16 - 31, Guard this test module against missing optional pypto-lib contents by checking for the QWEN tree before any _source() calls; if the model checkout is absent, skip the module/tests instead of reading files unconditionally. Use the existing ROOT/QWEN setup in tests/test_device_sampling_submission.py and mirror the skip-if-missing pattern already used in tests/test_batching.py so helpers like _source and test_device_sampling_is_limited_by_runtime_vocab_size only run when the files exist.

## Summary

This PR adds the Qwen3 pypto-lib side of moving greedy sampling and token embedding onto device.

It is paired with the pypto-serving PR below. The two PRs should be reviewed together: pypto-lib provides the device kernels and padded-vocab semantics, while pypto-serving wires those kernels into generation and validates the serving behavior.

Paired pypto-serving PR: hw-native-sys/pypto-serving#47

## Performance Impact

Measured with the paired serving PR on myserver using full 40-layer, 128-token NPU generation:

- Latest main baseline: total 8.583s, prefill 0.410s, decode 6.211s over 127 decode steps, 48.9 ms/token.
- Device emb/sample branch: total 6.450s, prefill 0.373s, decode 6.033s over 127 decode steps, 47.5 ms/token.

This is about 25% lower end-to-end generation time in that run, with decode TPOT slightly lower as well. The main reason is not a large kernel-level speedup: the device path adds small `greedy_sample` / `token_embed` kernels, but removes the per-token host sampling, host embedding lookup, and host/device synchronization work from the generation loop.

## What Changed

- Add `REAL_VOCAB` for Qwen3-14B so kernels can distinguish the real vocabulary from the padded device vocabulary.
- Add standalone greedy sampling and token embedding kernels.
- Integrate previous-token embedding and next-token greedy sampling into the fused decode kernel.
- Keep prefill free of inline sample/embed logic; serving runs standalone device greedy sampling after prefill when greedy generation is enabled.
- Make greedy tie-break match host `torch.argmax` by reverse-scanning equal best logits so the smallest token id wins.
- Clamp any sampled padded-vocab id back to token 0 as a defensive guard.

## Validation

Validated together with the paired serving branch on myserver:

- `pytest tests/test_batching.py tests/test_device_sampling_submission.py -q` -> 20 passed
- `python -m pytest tests/golden/test_qwen3_greedy_source.py -q` -> 3 passed
- greedy sample `a2a3sim` -> PASS
- NPU generation -> exit 0, token ids `[594, 220, 20, 38, 5440]`, text `'s 5G technology`
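The tie-break rule mentioned in the paired description (reverse-scanning equal best logits so the smallest token id wins) can be sketched in scalar form as follows. This is a host-side illustration only; the device kernel's tiling and data layout are not shown in this page, and `greedy_argmax_reverse_scan` is a hypothetical name.

```python
def greedy_argmax_reverse_scan(logits):
    """Scan from the highest id down; `>=` lets a lower id replace an equal best."""
    best_id = len(logits) - 1
    best_value = logits[best_id]
    for token_id in range(len(logits) - 2, -1, -1):
        if logits[token_id] >= best_value:
            best_id = token_id
            best_value = logits[token_id]
    return best_id


tied = [1.0, 3.0, 3.0, 2.0]
# Both ids 1 and 2 hold the best logit; the reverse scan ends on the smaller id.
print(greedy_argmax_reverse_scan(tied))
# Matches a first-maximum lookup on the same data.
print(tied.index(max(tied)))
```

A forward scan with a strict `>` comparison gives the same answer; the reverse form is shown only because it is the one the description names.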

sunkaixuan2018 mentioned this pull request Jun 29, 2026

feat(qwen3): device-side greedy sampling and embedding hw-native-sys/pypto-lib#639

Merged

gemini-code-assist Bot reviewed Jun 29, 2026


Comment thread examples/model/qwen3_14b/runner/npu_executor.py Outdated

Comment thread python/core/engine.py Outdated

sunkaixuan2018 force-pushed the emb-and-samping branch from bccaf3c to 62edf7c Compare June 29, 2026 07:41

sunkaixuan2018 changed the title ~~[codex] wire qwen3 device sampling and embedding in serving~~ Qwen3-14B serving: device-side embedding and sampling Jun 29, 2026

sunkaixuan2018 marked this pull request as ready for review June 29, 2026 08:43

sunkaixuan2018 force-pushed the emb-and-samping branch 9 times, most recently from 03f24d0 to 75c63dc Compare June 30, 2026 06:03

coderabbitai Bot reviewed Jun 30, 2026


Qwen3-14B serving: device-side embedding and sampling

3c0a4e6

sunkaixuan2018 force-pushed the emb-and-samping branch from 75c63dc to 3c0a4e6 Compare June 30, 2026 06:11

Qwen3-14B serving: device-side embedding and sampling #47
sunkaixuan2018 wants to merge 1 commit into hw-native-sys:main from sunkaixuan2018:emb-and-samping
