Qwen3-14B serving: device-side embedding and sampling #47
📝 Walkthrough
Adds device-side greedy sampling and token embedding to the Qwen3-14B NPU executor. New capability flags (
Changes
Device-side sampling and embedding integration
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Code Review
This pull request introduces device-side embedding and greedy sampling support for non-streaming generation in the Qwen3-14B model, alongside a host-side CPU prefill path and chunked prefill support to mitigate out-of-memory errors on long contexts. It also adds stress testing scripts, performance documentation, and corresponding unit tests. The review feedback highlights a potential runtime error in Qwen314BPyptoExecutor where the environment variable PYPTO_QWEN3_PREFILL_ON_CPU is not checked during compilation, which would prevent the allocation of the required staging buffer. Additionally, a simplification is suggested in python/core/engine.py to eliminate a redundant conditional check and improve code readability.
- Align CPU prefill env handling between executor and runner
- Simplify chunked prefill sampled-token handling
- Fix pre-commit and submodule-dependent tests
bccaf3c to
62edf7c
Compare
- Remove BOM from npu runner source header
- Add missing test header
- Skip submodule source assertions when pypto-lib is not checked out
03f24d0 to
75c63dc
Compare
Actionable comments posted: 3
🧹 Nitpick comments (3)
python/core/serving_worker.py (1)
338-345: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
Preserve sampled-token row boundaries before indexing.
Flattening a 2-D sampled-id tensor can select a padded column instead of the requested batch row. Match the engine-side extraction by indexing the first column per row.
Proposed fix
  sampled = getattr(result, "sampled_token_ids", None)
  if allow_device_sampled and sampled is not None:
-     flat = sampled.view(-1)
-     if flat.numel() <= row_idx:
+     sampled_rows = sampled.reshape(sampled.shape[0], -1)
+     if sampled_rows.shape[0] <= row_idx:
          raise ValueError(
-             f"sampled_token_ids has {flat.numel()} rows, expected row {row_idx}"
+             f"sampled_token_ids has {sampled_rows.shape[0]} rows, expected row {row_idx}"
          )
-     return int(flat[row_idx].item())
+     return int(sampled_rows[row_idx, 0].item())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/serving_worker.py` around lines 338 - 345, In serving_worker.py, the sampled-token path in the code that reads result.sampled_token_ids is flattening the tensor with view(-1), which can pick the wrong element across row boundaries. Update this logic so the extraction matches the engine-side behavior by selecting the first token from the requested row (or otherwise indexing by row first) before converting to int, and keep the existing bounds check in the same sampled-token handling branch.
python/core/engine.py (2)
190-194: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win
Decouple device sampling from device embedding.
_sample_batch_rows only needs executor-provided token IDs, and decode can still use host embedding lookup when supports_device_embedding is false. Requiring both flags disables sampling-only executors in LLMEngine, while the worker path treats these capabilities independently.
Proposed adjustment
  allow_device_greedy_sampling = (
      generate_config.temperature <= 0.0
      and self._executor.supports_device_sampling
-     and self._executor.supports_device_embedding
  )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/engine.py` around lines 190 - 194, The greedy sampling gate in _sample_batch_rows is coupling device sampling to device embedding unnecessarily. Update the allow_device_greedy_sampling condition in LLMEngine to depend only on temperature and self._executor.supports_device_sampling, and keep decode-side embedding lookup separate so executors that only provide token IDs can still use sampling. Verify the worker capability handling remains independent by checking the related executor flags used around _sample_batch_rows and decode.
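The decoupling suggested in this comment can be illustrated with a minimal, self-contained sketch. `ExecutorCaps` and both helper functions are hypothetical stand-ins rather than the actual `LLMEngine` code; only the two flag names and the `temperature <= 0.0` condition come from the review.

```python
from dataclasses import dataclass


@dataclass
class ExecutorCaps:
    """Hypothetical stand-in for the executor's capability flags."""

    supports_device_sampling: bool = False
    supports_device_embedding: bool = False


def allow_device_greedy_sampling(temperature: float, caps: ExecutorCaps) -> bool:
    # Greedy decoding is signalled by temperature <= 0.0. Only the sampling
    # capability is consulted here; embedding support is decided separately.
    return temperature <= 0.0 and caps.supports_device_sampling


def use_device_embedding(caps: ExecutorCaps) -> bool:
    # Decode falls back to the host embedding lookup when this is False.
    return caps.supports_device_embedding


# A sampling-only executor: device sampling is allowed, embedding stays on host.
sampling_only = ExecutorCaps(supports_device_sampling=True)
print(allow_device_greedy_sampling(0.0, sampling_only), use_device_embedding(sampling_only))
```

Keeping the two checks in separate functions mirrors the reviewer's point that the worker path already treats the capabilities independently.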
444-450: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
Preserve sampled-token row boundaries before indexing.
Flattening a (batch, padded_width) sampled-id tensor would read padding columns as later rows. Normalize to rows and take column 0.
Proposed fix
  if sampled_token_ids is not None:
-     flat_ids = sampled_token_ids.view(-1)
-     if flat_ids.numel() < row_count:
+     sampled_rows = sampled_token_ids.reshape(sampled_token_ids.shape[0], -1)
+     if sampled_rows.shape[0] < row_count:
          raise ValueError(
-             f"sampled_token_ids has {flat_ids.numel()} rows, expected at least {row_count}"
+             f"sampled_token_ids has {sampled_rows.shape[0]} rows, expected at least {row_count}"
          )
-     return [int(flat_ids[idx].item()) for idx in range(row_count)]
+     return [int(sampled_rows[idx, 0].item()) for idx in range(row_count)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/engine.py` around lines 444 - 450, In the sampled_token_ids handling logic, flattening with view(-1) is collapsing the row structure and can treat padding columns as subsequent rows. Update the row extraction in the sampled-token path to preserve the original 2D boundaries in the code around sampled_token_ids, flat_ids, and the return list, then index by row and take column 0 for each row instead of reading from the flattened tensor. Keep the existing row-count validation, but ensure the implementation works from normalized rows rather than a flattened buffer.
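To see why flattening is risky, here is a torch-free sketch that uses nested lists in place of a `(batch, padded_width)` tensor. The padded layout with zero-filled extra columns is an assumption made for illustration; the real buffer shape is not shown in this thread, and both function names are hypothetical.

```python
def flattened_prefix(sampled_rows, row_count):
    # Mimics view(-1) followed by indexing 0..row_count-1: padding columns of
    # row 0 are read as if they were the ids of later rows.
    flat = [value for row in sampled_rows for value in row]
    return flat[:row_count]


def first_column_per_row(sampled_rows, row_count):
    # Mimics the proposed fix: validate the row count, then take column 0.
    if len(sampled_rows) < row_count:
        raise ValueError(
            f"sampled_token_ids has {len(sampled_rows)} rows, "
            f"expected at least {row_count}"
        )
    return [int(sampled_rows[idx][0]) for idx in range(row_count)]


# Assumed layout: batch of 3 rows, each padded to width 4 with zeros.
sampled = [[594, 0, 0, 0], [220, 0, 0, 0], [20, 0, 0, 0]]
print(flattened_prefix(sampled, 3))      # [594, 0, 0]
print(first_column_per_row(sampled, 3))  # [594, 220, 20]
```

When the buffer has width 1 the two approaches agree, which is why the flattened version only misbehaves once a padded column appears.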
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 528-536: The decode path in npu_runner.py is silently truncating
multi-token rows because `width` is derived with `min(...)` inside the
`decode_token_ids_buffer` copy logic. Update the decode handling in the
`compiled.decode_token_ids_buffer` block to validate
`inputs.token_ids`/`active_token_ids` row width before copying, and raise a
fail-fast error when any decode row contains more than one token. Keep the
normal single-token copy and the `actual_batch` padding behavior intact, but do
not slice away extra tokens.
In `@tests/test_batching.py`:
- Around line 588-615: The batching tests are importing qwen3_l3_dispatch via
importlib, which unnecessarily pulls in runtime dependencies like pypto.language
and module-level JIT setup. Update the test cases in test_batching.py to read
the qwen3_l3_dispatch.py source directly from disk with Path, then slice the
qwen3_prefill_host and qwen3_decode_host sections from that text for assertions.
Keep the existing symbol-based checks on qwen3_prefill_host, qwen3_decode_host,
and qwen3_greedy_sample_host, but remove the importlib-based module import path.
In `@tests/test_device_sampling_submission.py`:
- Around line 16-31: Guard this test module against missing optional pypto-lib
contents by checking for the QWEN tree before any _source() calls; if the model
checkout is absent, skip the module/tests instead of reading files
unconditionally. Use the existing ROOT/QWEN setup in
tests/test_device_sampling_submission.py and mirror the skip-if-missing pattern
already used in tests/test_batching.py so helpers like _source and
test_device_sampling_is_limited_by_runtime_vocab_size only run when the files
exist.
---
Nitpick comments:
In `@python/core/engine.py`:
- Around line 190-194: The greedy sampling gate in _sample_batch_rows is
coupling device sampling to device embedding unnecessarily. Update the
allow_device_greedy_sampling condition in LLMEngine to depend only on
temperature and self._executor.supports_device_sampling, and keep decode-side
embedding lookup separate so executors that only provide token IDs can still use
sampling. Verify the worker capability handling remains independent by checking
the related executor flags used around _sample_batch_rows and decode.
- Around line 444-450: In the sampled_token_ids handling logic, flattening with
view(-1) is collapsing the row structure and can treat padding columns as
subsequent rows. Update the row extraction in the sampled-token path to preserve
the original 2D boundaries in the code around sampled_token_ids, flat_ids, and
the return list, then index by row and take column 0 for each row instead of
reading from the flattened tensor. Keep the existing row-count validation, but
ensure the implementation works from normalized rows rather than a flattened
buffer.
In `@python/core/serving_worker.py`:
- Around line 338-345: In serving_worker.py, the sampled-token path in the code
that reads result.sampled_token_ids is flattening the tensor with view(-1),
which can pick the wrong element across row boundaries. Update this logic so the
extraction matches the engine-side behavior by selecting the first token from
the requested row (or otherwise indexing by row first) before converting to int,
and keep the existing bounds check in the same sampled-token handling branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a90bce51-9837-4107-a951-ebdd022d772c
📒 Files selected for processing (11)
- docs/pr-scope-note.md
- examples/model/qwen3_14b/runner/npu_executor.py
- examples/model/qwen3_14b/runner/npu_runner.py
- examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py
- pypto-lib
- python/core/engine.py
- python/core/executor.py
- python/core/serving_worker.py
- python/core/types.py
- tests/test_batching.py
- tests/test_device_sampling_submission.py
token_ids = compiled.decode_token_ids_buffer
token_ids.zero_()
active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
width = min(active_token_ids.shape[1], token_ids.shape[1])
token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
if actual_batch < kernel_batch:
    token_ids[actual_batch:, :width].copy_(
        active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
    )
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Reject multi-token decode rows instead of truncating them.
Line 531 uses min(...), so a malformed DecodeBatch.token_ids row with more than one token is silently truncated before device embedding. Decode is a single-token step; fail fast on shape drift.
Proposed fix
  token_ids = compiled.decode_token_ids_buffer
  token_ids.zero_()
  active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
- width = min(active_token_ids.shape[1], token_ids.shape[1])
+ if active_token_ids.shape[1] != 1:
+     raise ValueError(
+         "decode token_ids must contain exactly one token per row, "
+         f"got shape {tuple(inputs.token_ids.shape)}"
+     )
+ width = 1
  token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
  if actual_batch < kernel_batch:
      token_ids[actual_batch:, :width].copy_(
          active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
      )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
token_ids = compiled.decode_token_ids_buffer
token_ids.zero_()
active_token_ids = inputs.token_ids.reshape(actual_batch, -1)
if active_token_ids.shape[1] != 1:
    raise ValueError(
        "decode token_ids must contain exactly one token per row, "
        f"got shape {tuple(inputs.token_ids.shape)}"
    )
width = 1
token_ids[:actual_batch, :width].copy_(active_token_ids[:, :width])
if actual_batch < kernel_batch:
    token_ids[actual_batch:, :width].copy_(
        active_token_ids[0:1, :width].expand(kernel_batch - actual_batch, width)
    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 528 - 536, The
decode path in npu_runner.py is silently truncating multi-token rows because
`width` is derived with `min(...)` inside the `decode_token_ids_buffer` copy
logic. Update the decode handling in the `compiled.decode_token_ids_buffer`
block to validate `inputs.token_ids`/`active_token_ids` row width before
copying, and raise a fail-fast error when any decode row contains more than one
token. Keep the normal single-token copy and the `actual_batch` padding behavior
intact, but do not slice away extra tokens.
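The intended behavior (reject rows wider than one token, then pad the kernel batch by repeating row 0) can be sketched without torch. `stage_decode_token_ids` is a hypothetical helper operating on nested lists, not the runner's actual buffer code; the extra batch-size check is an assumption added so the sketch is self-contained.

```python
def stage_decode_token_ids(rows, kernel_batch):
    """Return a kernel_batch x 1 staging list built from the active decode rows."""
    actual_batch = len(rows)
    if actual_batch == 0 or actual_batch > kernel_batch:
        # Assumption for this sketch: the active batch must fit the kernel batch.
        raise ValueError(f"cannot stage {actual_batch} rows into batch {kernel_batch}")
    for row in rows:
        if len(row) != 1:
            # Fail fast instead of silently keeping only the first token.
            raise ValueError(
                "decode token_ids must contain exactly one token per row, "
                f"got width {len(row)}"
            )
    staged = [[row[0]] for row in rows]
    # Pad the unused kernel rows with a copy of row 0, as the original code does.
    staged.extend([[rows[0][0]] for _ in range(kernel_batch - actual_batch)])
    return staged


print(stage_decode_token_ids([[594], [220]], kernel_batch=4))
# [[594], [220], [594], [594]]
```

A malformed input such as `[[594, 220]]` raises `ValueError` here, whereas the `min(...)`-based width would have kept only the first token of the row.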
    qwen3_l3_dispatch = importlib.import_module(
        "examples.model.qwen3_14b.runner.qwen3_l3_dispatch"
    )
    module_source = Path(qwen3_l3_dispatch.__file__).read_text(encoding="utf-8")
    start = module_source.index("def qwen3_decode_host")
    end = module_source.index("def qwen3_greedy_sample_host")
    source = module_source[start:end]

    assert source.count("decode_fwd(") == 1
    assert "token_embed_fwd(" not in source
    assert "greedy_sample_fwd(" not in source

    kernel_dir = Path(qwen3_l3_dispatch.__file__).parents[4] / "pypto-lib" / "models" / "qwen3" / "14b"
    if not kernel_dir.is_dir():
        pytest.skip("pypto-lib submodule is not checked out")
    decode_source = (kernel_dir / "decode_layer.py").read_text(encoding="utf-8")
    assert 'name_hint="token_embed"' in decode_source
    assert 'name_hint="greedy_sample"' in decode_source


def test_prefill_host_keeps_sampling_as_standalone_kernel():
    qwen3_l3_dispatch = importlib.import_module(
        "examples.model.qwen3_14b.runner.qwen3_l3_dispatch"
    )
    module_source = Path(qwen3_l3_dispatch.__file__).read_text(encoding="utf-8")
    start = module_source.index("def qwen3_prefill_host")
    end = module_source.index("def qwen3_decode_host")
    source = module_source[start:end]
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant test and the target module layout.
git ls-files tests/test_batching.py examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py
echo '--- outline tests/test_batching.py ---'
ast-grep outline tests/test_batching.py --view expanded || true
echo '--- outline examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py ---'
ast-grep outline examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py --view expanded || true
echo '--- relevant lines in tests/test_batching.py ---'
sed -n '560,640p' tests/test_batching.py
echo '--- top of qwen3_l3_dispatch.py ---'
sed -n '1,220p' examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py
Repository: hw-native-sys/pypto-serving
Length of output: 10738
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Determine whether the test file already imports importlib and whether there are
# similar source-only inspections elsewhere.
rg -n "importlib\.import_module|Path\(__file__\).*read_text|__file__" tests/test_batching.py tests -g '*.py'
Repository: hw-native-sys/pypto-serving
Length of output: 1762
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check whether pypto.language is part of the repo or only an external dependency,
# and whether qwen3_l3_dispatch is otherwise imported in tests.
git ls-files | rg '(^|/)pypto(/|$)|(^|/)language\.py$|(^|/)language/__init__\.py$|(^|/)qwen3_l3_dispatch\.py$'
echo '--- pypto.language references ---'
rg -n "import pypto\.language|from pypto\.language" -g '*.py'
Repository: hw-native-sys/pypto-serving
Length of output: 252
Read examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py from disk here. These checks only inspect source text, so importing the module adds an unnecessary dependency on pypto.language and module-level @pl.jit.host setup. Loading the file directly keeps the test independent of the runtime import path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_batching.py` around lines 588 - 615, The batching tests are
importing qwen3_l3_dispatch via importlib, which unnecessarily pulls in runtime
dependencies like pypto.language and module-level JIT setup. Update the test
cases in test_batching.py to read the qwen3_l3_dispatch.py source directly from
disk with Path, then slice the qwen3_prefill_host and qwen3_decode_host sections
from that text for assertions. Keep the existing symbol-based checks on
qwen3_prefill_host, qwen3_decode_host, and qwen3_greedy_sample_host, but remove
the importlib-based module import path.
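One way to do the source-only inspection without importing the module is sketched below. `slice_between` is a hypothetical helper, and the stand-in file contents are invented for the demonstration; in the real test the path would point at the checked-in `qwen3_l3_dispatch.py`.

```python
import tempfile
from pathlib import Path


def slice_between(source: str, start_marker: str, end_marker: str) -> str:
    """Return the text from start_marker up to (not including) end_marker."""
    start = source.index(start_marker)
    end = source.index(end_marker, start)
    return source[start:end]


# Invented stand-in for the dispatch module; only the function names come from the review.
STAND_IN = '''
def qwen3_prefill_host():
    prefill_fwd()

def qwen3_decode_host():
    decode_fwd()

def qwen3_greedy_sample_host():
    greedy_sample_fwd()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "qwen3_l3_dispatch.py"
    path.write_text(STAND_IN, encoding="utf-8")
    # Read the file as text; nothing in it is executed or imported.
    module_source = path.read_text(encoding="utf-8")

decode_section = slice_between(
    module_source, "def qwen3_decode_host", "def qwen3_greedy_sample_host"
)
print(decode_section.count("decode_fwd("))
```

Because the file is only read as text, the check no longer depends on `pypto.language` being importable in the test environment.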
ROOT = Path(__file__).resolve().parents[1]
QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
REAL_VOCAB = 151936
PADDED_VOCAB = 152064


def _source(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def test_device_sampling_is_limited_by_runtime_vocab_size() -> None:
    dispatch = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "qwen3_l3_dispatch.py")
    executor = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_executor.py")
    runner = _source(ROOT / "examples" / "model" / "qwen3_14b" / "runner" / "npu_runner.py")
    config = _source(QWEN / "config.py")
    greedy = _source(QWEN / "greedy_sample.py")
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Skip this module when pypto-lib is not checked out.
Every test below reads pypto-lib/models/qwen3/14b/* unconditionally, so a checkout without submodules will fail with FileNotFoundError instead of skipping. tests/test_batching.py already guards the same optional tree that way.
Suggested change
  ROOT = Path(__file__).resolve().parents[1]
  QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
  REAL_VOCAB = 151936
  PADDED_VOCAB = 152064
+
+ pytestmark = pytest.mark.skipif(
+     not QWEN.is_dir(),
+     reason="pypto-lib submodule is not checked out",
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
ROOT = Path(__file__).resolve().parents[1]
QWEN = ROOT / "pypto-lib" / "models" / "qwen3" / "14b"
REAL_VOCAB = 151936
PADDED_VOCAB = 152064

pytestmark = pytest.mark.skipif(
    not QWEN.is_dir(),
    reason="pypto-lib submodule is not checked out",
)


def _source(path: Path) -> str:
    return path.read_text(encoding="utf-8")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_device_sampling_submission.py` around lines 16 - 31, Guard this
test module against missing optional pypto-lib contents by checking for the QWEN
tree before any _source() calls; if the model checkout is absent, skip the
module/tests instead of reading files unconditionally. Use the existing
ROOT/QWEN setup in tests/test_device_sampling_submission.py and mirror the
skip-if-missing pattern already used in tests/test_batching.py so helpers like
_source and test_device_sampling_is_limited_by_runtime_vocab_size only run when
the files exist.
75c63dc to
3c0a4e6
Compare
## Summary

This PR adds the Qwen3 pypto-lib side of moving greedy sampling and token embedding onto device. It is paired with the pypto-serving PR below. The two PRs should be reviewed together: pypto-lib provides the device kernels and padded-vocab semantics, while pypto-serving wires those kernels into generation and validates the serving behavior.

Paired pypto-serving PR: hw-native-sys/pypto-serving#47

## Performance Impact

Measured with the paired serving PR on myserver using full 40-layer, 128-token NPU generation:

- Latest main baseline: total 8.583s, prefill 0.410s, decode 6.211s over 127 decode steps, 48.9 ms/token.
- Device emb/sample branch: total 6.450s, prefill 0.373s, decode 6.033s over 127 decode steps, 47.5 ms/token.

This is about 25% lower end-to-end generation time in that run, with decode TPOT slightly lower as well. The main reason is not a large kernel-level speedup: the device path adds small `greedy_sample` / `token_embed` kernels, but removes the per-token host sampling, host embedding lookup, and host/device synchronization work from the generation loop.

## What Changed

- Add `REAL_VOCAB` for Qwen3-14B so kernels can distinguish the real vocabulary from the padded device vocabulary.
- Add standalone greedy sampling and token embedding kernels.
- Integrate previous-token embedding and next-token greedy sampling into the fused decode kernel.
- Keep prefill free of inline sample/embed logic; serving runs standalone device greedy sampling after prefill when greedy generation is enabled.
- Make greedy tie-break match host `torch.argmax` by reverse-scanning equal best logits so the smallest token id wins.
- Clamp any sampled padded-vocab id back to token 0 as a defensive guard.
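The tie-break and clamp described in the last two bullets can be sketched in plain Python. This is only an illustration of the stated rule (reverse scan so the smallest id wins, then clamp padded ids to token 0); `greedy_sample` here is a hypothetical host-side function, not the device kernel, and the tiny vocabularies are chosen so the result is easy to check by hand.

```python
REAL_VOCAB = 151936    # real Qwen3-14B vocabulary size quoted in the tests
PADDED_VOCAB = 152064  # padded device vocabulary size quoted in the tests


def greedy_sample(logits, real_vocab):
    """Pick the best logit, preferring the smallest id on ties."""
    best_id = len(logits) - 1
    best = logits[best_id]
    # Scan from the end; ">=" lets an equal logit at a smaller id replace
    # the current winner, so the smallest id among ties is returned.
    for token_id in range(len(logits) - 2, -1, -1):
        if logits[token_id] >= best:
            best = logits[token_id]
            best_id = token_id
    # Defensive guard: ids in the padded tail are mapped back to token 0.
    if best_id >= real_vocab:
        best_id = 0
    return best_id


print(greedy_sample([1.0, 3.0, 3.0, 0.5], real_vocab=4))  # 1 (tie between ids 1 and 2)
print(greedy_sample([0.1, 0.2, 9.0], real_vocab=2))       # 0 (id 2 is padding, clamped)
```

With the real sizes, the padded tail would be ids `REAL_VOCAB` through `PADDED_VOCAB - 1`, which is 128 entries.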
## Validation

Validated together with the paired serving branch on myserver:

- `pytest tests/test_batching.py tests/test_device_sampling_submission.py -q` -> 20 passed
- `python -m pytest tests/golden/test_qwen3_greedy_source.py -q` -> 3 passed
- greedy sample `a2a3sim` -> PASS
- NPU generation -> exit 0, token ids `[594, 220, 20, 38, 5440]`, text `'s 5G technology`
Summary
This PR adds the pypto-serving side of moving Qwen3 greedy sampling and token embedding from host to device for non-streaming generation.
This branch was rebuilt cleanly from upstream/main and no longer includes the earlier luohuan stream/base commit. It is paired with the pypto-lib PR below. The two PRs should be reviewed together: pypto-lib defines the device kernels and padded-vocab behavior, and this serving PR pads weights, calls the new kernels, consumes device-sampled ids, and verifies the serving path.
Paired pypto-lib PR: hw-native-sys/pypto-lib#639
Performance Impact
Measured on myserver with full 40-layer, 128-token NPU generation:
This run shows about 25% lower end-to-end generation time. Pure device kernel work is close overall and the new path does add small `greedy_sample`/`token_embed` work; the main win is that serving no longer does greedy sampling, embedding lookup, and related host/device synchronization once per generated token on the host side.
What Changed
- … `lm_head` rows by duplicating token 0 so padded-vocab logits are deterministic.
- … `embed_weight` rows as zero padding.
- … `valid_vocab_size` tensor path and rely on the lib-side `REAL_VOCAB` clamp.
- … `REAL_VOCAB` clamp.
- … `pending_next_hidden` on host.
Validation
Local:
- `python -m py_compile examples/model/qwen3_14b/runner/npu_executor.py examples/model/qwen3_14b/runner/npu_runner.py examples/model/qwen3_14b/runner/qwen3_l3_dispatch.py python/core/engine.py python/core/executor.py python/core/serving_worker.py python/core/types.py tests/test_batching.py tests/test_device_sampling_submission.py`
- `python -m pytest tests/test_device_sampling_submission.py -q` -> 3 passed, 1 skipped locally because torch is unavailable
myserver:
- `pytest tests/test_device_sampling_submission.py -q` -> 4 passed
- `pytest tests/test_batching.py -q` -> 16 passed
- `pytest tests/test_batching.py tests/test_device_sampling_submission.py -q` -> 20 passed
- … `a2a3sim` -> PASS
- … `[594, 220, 20, 38, 5440]`, text `'s 5G technology`
- … `greedy_sample` and `token_embed` kernels