Qwen3-14B serving: dynamic KV sizing, vLLM-style admission, chunked p…#45
Qwen3-14B serving: dynamic KV sizing, vLLM-style admission, chunked p…#45sunghajung6688 wants to merge 6 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe PR wires dynamic KV-cache sizing from NPU free memory through a shared ChangesServing Runtime and NPU Cache Flow
pypto-lib Revision Bump
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
082c79f to
deb90c4
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces dynamic KV cache allocation, a profile warmup phase, and batch sampling optimizations for Qwen3-14B on NPU. It enables lazy initialization of host-side block metadata based on actual device-side KV cache capacity, adds robust prompt length validation and capping in the scheduler, and updates default data types to bfloat16. Feedback on these changes highlights three key areas for improvement: refining the KV cache allocation retry loop to prevent skipping the minimum page floor, correcting the validation check in npu_executor.py to use the computed min_pages (or kernel_batch) instead of a weak check against 1, and updating the weight parameter estimation formula in npu_runner.py to accurately reflect Grouped Query Attention (GQA) instead of Multi-Head Attention (MHA).
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| except (RuntimeError, MemoryError) as e: | ||
| prev = num_pages | ||
| num_pages //= 2 | ||
| print( | ||
| f"[init_kv_cache] alloc failed ({e}); retrying {prev} -> {num_pages}", | ||
| flush=True, | ||
| ) |
There was a problem hiding this comment.
In the retry loop, halving num_pages can cause it to drop below floor immediately, skipping the floor value entirely. For example, if floor is 16 and num_pages is 30, halving it results in 15, which is less than floor, causing the loop to terminate without ever trying to allocate exactly floor (16) pages. If num_pages drops below floor but was previously strictly greater than floor, it should be clamped to floor for one final attempt.
except (RuntimeError, MemoryError) as e:
prev = num_pages
num_pages //= 2
if num_pages < floor and prev > floor:
num_pages = floor
print(
f"[init_kv_cache] alloc failed ({e}); retrying {prev} -> {num_pages}",
flush=True,
)| min_pages = kernel_batch * (model.runtime.max_seq_len + model.runtime.page_size - 1) // model.runtime.page_size | ||
| if model.runtime.total_kv_pages < 1: | ||
| raise ValueError( | ||
| "PyPTO Qwen3-14B kernels require total_kv_pages to match the runtime batch capacity: " | ||
| f"{model.runtime.total_kv_pages} provided, {expected_pages} required." | ||
| f"total_kv_pages must be at least 1, got {model.runtime.total_kv_pages}" | ||
| ) |
There was a problem hiding this comment.
The variable min_pages is computed but never used in the validation check. Furthermore, checking total_kv_pages < 1 is too weak for a batch of size kernel_batch, as each batch slot requires at least 1 page to avoid runtime errors. It should be validated against kernel_batch instead.
if model.runtime.total_kv_pages < kernel_batch:
raise ValueError(
f"total_kv_pages must be at least kernel_batch ({kernel_batch}), got {model.runtime.total_kv_pages}"
)| # Weights — same estimate as _compute_kv_cache_pages. | ||
| hidden = config.hidden_size | ||
| wt_params = ( | ||
| config.num_hidden_layers * ( | ||
| hidden * hidden * 4 | ||
| + hidden * config.intermediate_size * 3 | ||
| + hidden * 4 | ||
| ) | ||
| + config.vocab_size * hidden | ||
| ) | ||
| weight_bytes = int(wt_params * dtype_bytes) |
There was a problem hiding this comment.
The weight parameter estimation formula assumes Multi-Head Attention (MHA) where the Q, K, V, and O projections are all of size hidden * hidden (hence hidden * hidden * 4). However, Qwen3-14B uses Grouped Query Attention (GQA), where the K and V projections are significantly smaller (hidden * kv_hidden). This leads to a large overestimate of the model weight size in the memory breakdown log. We should use the actual GQA projection sizes for a more accurate estimate.
| # Weights — same estimate as _compute_kv_cache_pages. | |
| hidden = config.hidden_size | |
| wt_params = ( | |
| config.num_hidden_layers * ( | |
| hidden * hidden * 4 | |
| + hidden * config.intermediate_size * 3 | |
| + hidden * 4 | |
| ) | |
| + config.vocab_size * hidden | |
| ) | |
| weight_bytes = int(wt_params * dtype_bytes) | |
| # Weights — same estimate as _compute_kv_cache_pages. | |
| hidden = config.hidden_size | |
| kv_hidden = config.num_key_value_heads * config.head_dim | |
| wt_params = ( | |
| config.num_hidden_layers * ( | |
| hidden * hidden * 2 | |
| + hidden * kv_hidden * 2 | |
| + hidden * config.intermediate_size * 3 | |
| + hidden * 4 | |
| ) | |
| + config.vocab_size * hidden | |
| ) | |
| weight_bytes = int(wt_params * dtype_bytes) |
deb90c4 to
e9936d0
Compare
There was a problem hiding this comment.
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/model/qwen3_14b/npu_generate.py`:
- Around line 438-440: Replace the ambiguous Unicode multiplication sign in the
nearby comment with an ASCII-safe equivalent so Ruff no longer flags RUF003.
Update the comment in the area around the decode kernel capacity note in
npu_generate.py, keeping the wording the same but using plain text characters
only.
In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 382-387: The warmup decode path is allowing `seq_len` to exceed
`runtime.max_seq_len` when `max_num_batched_tokens` is large, so adjust the
warmup token calculation in `npu_runner.py` to cap decode length at
`max_seq_len` before dispatching the warmup request. Update the logic around
`batch`, `max_seq`, `mnb`, `step_tokens`, `per_req`, and `total_tokens` so the
decode step sent by the warmup code never becomes `max_seq_len + 1`, and apply
the same bound wherever the warmup decode request is built.
- Line 267: The new diagnostics path has Ruff issues in NpuRunner-related code:
replace any ambiguous Unicode multiplication symbols with plain ASCII x in the
affected text, and update the logic that reads the cache so it does not build a
full list just to access the first item. Make the fix in the NpuRunner
methods/docstrings around the diagnostics path and any helper that currently
materializes cache entries before using only the first value.
In `@python/cli/main.py`:
- Around line 52-57: The `--kv-cache-memory-fraction` argument in `main` has
misleading help text: it shows the wrong default and describes the reservation
basis incorrectly. Update the `parser.add_argument` help string to reflect the
actual default value of 0.90 and state that the fraction is applied to measured
free HBM after warmup, not total HBM. Keep the change localized to the CLI
argument definition so the documentation matches runtime behavior.
In `@python/core/kv_cache.py`:
- Around line 345-356: The lazy host-pool setup in the pool allocation block
leaves pool.key_pages assigned before pool.value_pages is created, so a failure
can permanently half-initialize the pool. Update the allocation flow in the
host-pool initialization path (the code that sets pool.key_pages and
pool.value_pages) so both tensors are created atomically: allocate into local
temporaries first, then assign both fields only after both allocations succeed,
and leave the pool unchanged on error.
In `@python/core/pypto_executor.py`:
- Around line 61-69: `register_model` in `pypto_executor.py` leaves partially
initialized state if `runner.init_kv_cache(...)` throws, so add rollback around
the runner setup path. Wrap the `self._create_runner(...)` and
`runner.init_kv_cache(...)` sequence in `register_model` with cleanup that
removes any `self._compiled[model_id]` entry, avoids storing
`self._runners[model_id]`, and closes/destroys the created runner on failure.
Make sure the successful path still assigns `self._runners[model_id]` only after
KV-cache init completes.
In `@python/core/scheduler.py`:
- Around line 147-155: The scheduler’s max-new-token capping in scheduler.py can
reduce `request.max_new_tokens` to 0 when `prompt_len` already equals
`max_seq_len`, which still allows one extra sampled token later in the
prefill/finish path. Update the enqueue-time validation in the scheduler logic
around `remaining`, `request.max_new_tokens`, and `request.request_id` so
zero-token generation capacity is rejected or finished before the request is
queued, rather than merely capped to 0.
In `@python/core/server.py`:
- Around line 276-279: The chat-template argument handling currently lets
chat_template_kwargs override the fixed settings used by apply_chat_template,
which can change tokenize/add_generation_prompt in a way that breaks
engine.add_request(). Update the logic around kwargs in the tokenizer
apply_chat_template call so the required defaults stay enforced and only
non-fixed request-supplied options are merged in; make sure tokenize remains
false and add_generation_prompt remains true regardless of chat_template_kwargs.
In `@python/core/serving_worker.py`:
- Around line 98-105: The readiness path in serving_worker.py is returning an
invalid KV page count from the register_model fallback, which causes
AsyncLLMEngine.start() to reject startup after the worker has already signaled
ready. Update the register_model handling in the model-load flow so the code
never reports readiness with 0 pages; either ensure a valid positive page count
is produced by register_model or fail before setting ready in
_worker_entry()/the model load method. Use the existing register_model and
_worker_entry logic to locate and adjust the startup contract.
In `@python/core/types.py`:
- Around line 67-70: Add construction-time validation for the new runtime knobs
in the config type so invalid values fail fast before reaching NPU sizing/warmup
paths. Update the class in types.py that defines kv_cache_memory_fraction and
max_num_batched_tokens to enforce a valid range for kv_cache_memory_fraction (>
0 and <= 1) and a positive value for max_num_batched_tokens. Make the check
centralized at the config level so every caller gets the same bounds
enforcement, not only the CLI entry points.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c7191972-3946-4576-9627-a5e47606a4d9
📒 Files selected for processing (18)
examples/model/qwen3_14b/cpu_generate.pyexamples/model/qwen3_14b/npu_generate.pyexamples/model/qwen3_14b/runner/npu_executor.pyexamples/model/qwen3_14b/runner/npu_runner.pypypto-libpython/cli/main.pypython/core/async_engine.pypython/core/kv_cache.pypython/core/model_runner.pypython/core/pypto_executor.pypython/core/sampler.pypython/core/scheduler.pypython/core/server.pypython/core/serving_worker.pypython/core/tokenizer.pypython/core/types.pytests/test_batching.pytests/test_npu_prefix_chunk.py
| # Conservative default — the decode kernel is compiled | ||
| # with this baked-in shape and cannot be resized later. | ||
| # 200 pages × 128 tokens = 25 600 tokens total capacity. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Replace the Unicode multiplication sign in the comment.
Line 440 uses ×, which Ruff flags as ambiguous (RUF003) and can keep lint red for a comment-only issue.
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 440-440: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/npu_generate.py` around lines 438 - 440, Replace the
ambiguous Unicode multiplication sign in the nearby comment with an ASCII-safe
equivalent so Ruff no longer flags RUF003. Update the comment in the area around
the decode kernel capacity note in npu_generate.py, keeping the wording the same
but using plain text characters only.
Source: Linters/SAST tools
| Called AFTER the profile warm-up, so weights, the simpler ring-heap | ||
| arena, compiled buffers and any persistent scratch are already | ||
| allocated — i.e. already reflected in the measured ``free``. The KV | ||
| budget is therefore just ``free × fraction``; no separate |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Address the Ruff warnings in the new diagnostics path.
Replace ambiguous × characters with ASCII x, and avoid materializing the full cache list just to read the first value.
Proposed lint cleanup
- budget is therefore just ``free × fraction``; no separate
+ budget is therefore just ``free x fraction``; no separate
@@
- model config), KV cache (exact = num_pages × bytes_per_page), simpler
- ring-heap arena (from the ``PTO2_RING_HEAP`` env × 4), and the
+ model config), KV cache (exact = num_pages x bytes_per_page), simpler
+ ring-heap arena (from the ``PTO2_RING_HEAP`` env x 4), and the
@@
- f"≈ {max_len_reqs} × full-len({max_seq_len}) reqs; "
- f"worst-case need {runtime.max_batch_size}×{max_seq_len}="
+ f"≈ {max_len_reqs} x full-len({max_seq_len}) reqs; "
+ f"worst-case need {runtime.max_batch_size}x{max_seq_len}="
@@
- print(f" ├─ simpler arena (env × 4): {arena_bytes / 1e9:7.2f} GB", flush=True)
+ print(f" ├─ simpler arena (env x 4): {arena_bytes / 1e9:7.2f} GB", flush=True)
@@
- kv_cache = list(self._kv_caches.values())[0]
+ kv_cache = next(iter(self._kv_caches.values()))Also applies to: 298-299, 349-350, 355-355, 395-395
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 267-267: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/runner/npu_runner.py` at line 267, The new
diagnostics path has Ruff issues in NpuRunner-related code: replace any
ambiguous Unicode multiplication symbols with plain ASCII x in the affected
text, and update the logic that reads the cache so it does not build a full list
just to access the first item. Make the fix in the NpuRunner methods/docstrings
around the diagnostics path and any helper that currently materializes cache
entries before using only the first value.
Source: Linters/SAST tools
| batch = runtime.max_batch_size | ||
| max_seq = runtime.max_seq_len | ||
| mnb = getattr(runtime, "max_num_batched_tokens", 4096) | ||
| step_tokens = min(mnb, batch * max_seq) | ||
| per_req = max(step_tokens // batch, 1) | ||
| total_tokens = per_req * batch |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Keep warmup decode seq_len within max_seq_len.
When max_num_batched_tokens >= max_batch_size * max_seq_len, per_req becomes max_seq_len, so Line 440 dispatches decode with max_seq_len + 1. That can fail startup or overrun max-sequence kernel/RoPE assumptions.
Proposed fix
batch = runtime.max_batch_size
max_seq = runtime.max_seq_len
mnb = getattr(runtime, "max_num_batched_tokens", 4096)
- step_tokens = min(mnb, batch * max_seq)
+ max_prefill_per_req = max(max_seq - 1, 1)
+ step_tokens = min(mnb, batch * max_prefill_per_req)
per_req = max(step_tokens // batch, 1)
total_tokens = per_req * batch
+ decode_seq_len = min(per_req + 1, max_seq)
@@
for b in range(batch):
- compiled.decode_seq_lens_buffer[b] = per_req + 1
+ compiled.decode_seq_lens_buffer[b] = decode_seq_len
@@
- print(f"[warmup] decode dispatch … (batch={batch}, seq_len={per_req + 1})", flush=True)
+ print(f"[warmup] decode dispatch … (batch={batch}, seq_len={decode_seq_len})", flush=True)Also applies to: 439-440, 451-455
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 382 - 387, The
warmup decode path is allowing `seq_len` to exceed `runtime.max_seq_len` when
`max_num_batched_tokens` is large, so adjust the warmup token calculation in
`npu_runner.py` to cap decode length at `max_seq_len` before dispatching the
warmup request. Update the logic around `batch`, `max_seq`, `mnb`,
`step_tokens`, `per_req`, and `total_tokens` so the decode step sent by the
warmup code never becomes `max_seq_len + 1`, and apply the same bound wherever
the warmup decode request is built.
| parser.add_argument( | ||
| "--kv-cache-memory-fraction", | ||
| type=float, | ||
| default=0.90, | ||
| help="Fraction of total NPU HBM to reserve for KV cache (default: 0.10).", | ||
| ) |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix the --kv-cache-memory-fraction help text.
The parser default is 0.90, but the help still says 0.10, and the runtime actually applies the fraction to measured free HBM after warmup rather than total HBM. As written, the flag documents the wrong default and the wrong sizing basis.
Suggested edit
parser.add_argument(
"--kv-cache-memory-fraction",
type=float,
default=0.90,
- help="Fraction of total NPU HBM to reserve for KV cache (default: 0.10).",
+ help="Fraction of free NPU HBM after warmup to reserve for KV cache (default: 0.90).",
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| parser.add_argument( | |
| "--kv-cache-memory-fraction", | |
| type=float, | |
| default=0.90, | |
| help="Fraction of total NPU HBM to reserve for KV cache (default: 0.10).", | |
| ) | |
| parser.add_argument( | |
| "--kv-cache-memory-fraction", | |
| type=float, | |
| default=0.90, | |
| help="Fraction of free NPU HBM after warmup to reserve for KV cache (default: 0.90).", | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cli/main.py` around lines 52 - 57, The `--kv-cache-memory-fraction`
argument in `main` has misleading help text: it shows the wrong default and
describes the reservation basis incorrectly. Update the `parser.add_argument`
help string to reflect the actual default value of 0.90 and state that the
fraction is applied to measured free HBM after warmup, not total HBM. Keep the
change localized to the CLI argument definition so the documentation matches
runtime behavior.
| pool = self._pool(model_id) | ||
| if pool.key_pages is None: | ||
| pool.key_pages = torch.zeros( | ||
| pool.num_layers, | ||
| pool.num_pages, | ||
| pool.num_kv_heads, | ||
| pool.page_size, | ||
| pool.head_dim, | ||
| dtype=pool.kv_dtype, | ||
| device="cpu", | ||
| ) | ||
| pool.value_pages = torch.zeros_like(pool.key_pages) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Make lazy host-pool allocation atomic.
If value_pages allocation fails here, key_pages stays set and the pool is permanently half-initialized. The next write_tokens/read_context call skips allocation and crashes on pool.value_pages[...].
Proposed fix
pool = self._pool(model_id)
- if pool.key_pages is None:
- pool.key_pages = torch.zeros(
+ if pool.key_pages is None or pool.value_pages is None:
+ key_pages = torch.zeros(
pool.num_layers,
pool.num_pages,
pool.num_kv_heads,
pool.page_size,
pool.head_dim,
dtype=pool.kv_dtype,
device="cpu",
)
- pool.value_pages = torch.zeros_like(pool.key_pages)
+ value_pages = torch.zeros_like(key_pages)
+ pool.key_pages = key_pages
+ pool.value_pages = value_pages
return pool📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pool = self._pool(model_id) | |
| if pool.key_pages is None: | |
| pool.key_pages = torch.zeros( | |
| pool.num_layers, | |
| pool.num_pages, | |
| pool.num_kv_heads, | |
| pool.page_size, | |
| pool.head_dim, | |
| dtype=pool.kv_dtype, | |
| device="cpu", | |
| ) | |
| pool.value_pages = torch.zeros_like(pool.key_pages) | |
| pool = self._pool(model_id) | |
| if pool.key_pages is None or pool.value_pages is None: | |
| key_pages = torch.zeros( | |
| pool.num_layers, | |
| pool.num_pages, | |
| pool.num_kv_heads, | |
| pool.page_size, | |
| pool.head_dim, | |
| dtype=pool.kv_dtype, | |
| device="cpu", | |
| ) | |
| value_pages = torch.zeros_like(key_pages) | |
| pool.key_pages = key_pages | |
| pool.value_pages = value_pages |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/kv_cache.py` around lines 345 - 356, The lazy host-pool setup in
the pool allocation block leaves pool.key_pages assigned before pool.value_pages
is created, so a failure can permanently half-initialize the pool. Update the
allocation flow in the host-pool initialization path (the code that sets
pool.key_pages and pool.value_pages) so both tensors are created atomically:
allocate into local temporaries first, then assign both fields only after both
allocations succeed, and leave the pool unchanged on error.
| compiled = self._compile_model(record.runtime_model) | ||
| self._compiled[model_id] = compiled | ||
| print("[register_model] kernel compiled, creating runner …", flush=True) | ||
| runner = self._create_runner(model_id, compiled) | ||
| runner.init_kv_cache(model_id, record.config, record.runtime) | ||
| print("[register_model] runner created, init kv cache …", flush=True) | ||
| num_pages = runner.init_kv_cache(model_id, record.config, record.runtime) | ||
| # init_kv_cache runs profile warmup internally (phase 1) | ||
| self._runners[model_id] = runner | ||
| return num_pages |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Rollback runner state if KV-cache init fails.
runner.init_kv_cache() now does the expensive warmup/allocation path. If it raises, this method leaves self._compiled[model_id] populated and never closes the runner, so a retry can inherit leaked worker/device state.
Proposed fix
with profile_span("PyptoExecutor.register_model", cat="executor", args={"model_id": model_id}):
compiled = self._compile_model(record.runtime_model)
- self._compiled[model_id] = compiled
print("[register_model] kernel compiled, creating runner …", flush=True)
runner = self._create_runner(model_id, compiled)
- print("[register_model] runner created, init kv cache …", flush=True)
- num_pages = runner.init_kv_cache(model_id, record.config, record.runtime)
- # init_kv_cache runs profile warmup internally (phase 1)
- self._runners[model_id] = runner
+ try:
+ self._compiled[model_id] = compiled
+ print("[register_model] runner created, init kv cache …", flush=True)
+ num_pages = runner.init_kv_cache(model_id, record.config, record.runtime)
+ # init_kv_cache runs profile warmup internally (phase 1)
+ self._runners[model_id] = runner
+ except Exception:
+ self._compiled.pop(model_id, None)
+ close = getattr(runner, "close", None)
+ if callable(close):
+ close()
+ raise
return num_pages📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| compiled = self._compile_model(record.runtime_model) | |
| self._compiled[model_id] = compiled | |
| print("[register_model] kernel compiled, creating runner …", flush=True) | |
| runner = self._create_runner(model_id, compiled) | |
| runner.init_kv_cache(model_id, record.config, record.runtime) | |
| print("[register_model] runner created, init kv cache …", flush=True) | |
| num_pages = runner.init_kv_cache(model_id, record.config, record.runtime) | |
| # init_kv_cache runs profile warmup internally (phase 1) | |
| self._runners[model_id] = runner | |
| return num_pages | |
| compiled = self._compile_model(record.runtime_model) | |
| print("[register_model] kernel compiled, creating runner …", flush=True) | |
| runner = self._create_runner(model_id, compiled) | |
| try: | |
| self._compiled[model_id] = compiled | |
| print("[register_model] runner created, init kv cache …", flush=True) | |
| num_pages = runner.init_kv_cache(model_id, record.config, record.runtime) | |
| # init_kv_cache runs profile warmup internally (phase 1) | |
| self._runners[model_id] = runner | |
| except Exception: | |
| self._compiled.pop(model_id, None) | |
| close = getattr(runner, "close", None) | |
| if callable(close): | |
| close() | |
| raise | |
| return num_pages |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/pypto_executor.py` around lines 61 - 69, `register_model` in
`pypto_executor.py` leaves partially initialized state if
`runner.init_kv_cache(...)` throws, so add rollback around the runner setup
path. Wrap the `self._create_runner(...)` and `runner.init_kv_cache(...)`
sequence in `register_model` with cleanup that removes any
`self._compiled[model_id]` entry, avoids storing `self._runners[model_id]`, and
closes/destroys the created runner on failure. Make sure the successful path
still assigns `self._runners[model_id]` only after KV-cache init completes.
| remaining = max_seq_len - prompt_len | ||
| if request.max_new_tokens > remaining: | ||
| logger.warning( | ||
| "Request %s: capping max_new_tokens %d -> %d to fit max_seq_len %d " | ||
| "(prompt_len=%d).", | ||
| request.request_id, request.max_new_tokens, remaining, | ||
| max_seq_len, prompt_len, | ||
| ) | ||
| request.max_new_tokens = remaining |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Reject zero-token generation capacity before enqueueing.
When prompt_len == max_seq_len, this caps max_new_tokens to 0, but the prefill completion path still samples one token before _check_finish() can finish the request. That lets responses exceed max_seq_len by one token.
Suggested fix
# keeps every request within the KV-cache capacity budgeted per request
# and avoids overflow-driven preemption.
+ if request.max_new_tokens <= 0:
+ raise ValueError(
+ f"Request {request.request_id} max_new_tokens must be positive; "
+ f"got {request.max_new_tokens}."
+ )
remaining = max_seq_len - prompt_len
+ if remaining <= 0:
+ raise ValueError(
+ f"Request {request.request_id} prompt length {prompt_len} "
+ f"leaves no room for generation within max_seq_len {max_seq_len}."
+ )
if request.max_new_tokens > remaining:
logger.warning(📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| remaining = max_seq_len - prompt_len | |
| if request.max_new_tokens > remaining: | |
| logger.warning( | |
| "Request %s: capping max_new_tokens %d -> %d to fit max_seq_len %d " | |
| "(prompt_len=%d).", | |
| request.request_id, request.max_new_tokens, remaining, | |
| max_seq_len, prompt_len, | |
| ) | |
| request.max_new_tokens = remaining | |
| # keeps every request within the KV-cache capacity budgeted per request | |
| # and avoids overflow-driven preemption. | |
| if request.max_new_tokens <= 0: | |
| raise ValueError( | |
| f"Request {request.request_id} max_new_tokens must be positive; " | |
| f"got {request.max_new_tokens}." | |
| ) | |
| remaining = max_seq_len - prompt_len | |
| if remaining <= 0: | |
| raise ValueError( | |
| f"Request {request.request_id} prompt length {prompt_len} " | |
| f"leaves no room for generation within max_seq_len {max_seq_len}." | |
| ) | |
| if request.max_new_tokens > remaining: | |
| logger.warning( | |
| "Request %s: capping max_new_tokens %d -> %d to fit max_seq_len %d " | |
| "(prompt_len=%d).", | |
| request.request_id, request.max_new_tokens, remaining, | |
| max_seq_len, prompt_len, | |
| ) | |
| request.max_new_tokens = remaining |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/scheduler.py` around lines 147 - 155, The scheduler’s
max-new-token capping in scheduler.py can reduce `request.max_new_tokens` to 0
when `prompt_len` already equals `max_seq_len`, which still allows one extra
sampled token later in the prefill/finish path. Update the enqueue-time
validation in the scheduler logic around `remaining`, `request.max_new_tokens`,
and `request.request_id` so zero-token generation capacity is rejected or
finished before the request is queued, rather than merely capped to 0.
| kwargs: dict = {"tokenize": False, "add_generation_prompt": True} | ||
| if chat_template_kwargs: | ||
| kwargs.update(chat_template_kwargs) | ||
| return self.engine.tokenizer.tokenizer.apply_chat_template(hf_messages, **kwargs) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Protect fixed chat-template arguments from request overrides.
chat_template_kwargs can currently override tokenize=False or add_generation_prompt=True; tokenize=True makes this method return token IDs instead of the str that engine.add_request() expects.
Suggested fix
hf_messages = [{"role": m.role, "content": m.content} for m in messages]
kwargs: dict = {"tokenize": False, "add_generation_prompt": True}
if chat_template_kwargs:
+ reserved = {"tokenize", "add_generation_prompt", "return_tensors", "chat_template"}
+ blocked = reserved.intersection(chat_template_kwargs)
+ if blocked:
+ raise ValueError(
+ "chat_template_kwargs may not override reserved keys: "
+ + ", ".join(sorted(blocked))
+ )
kwargs.update(chat_template_kwargs)
return self.engine.tokenizer.tokenizer.apply_chat_template(hf_messages, **kwargs)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| kwargs: dict = {"tokenize": False, "add_generation_prompt": True} | |
| if chat_template_kwargs: | |
| kwargs.update(chat_template_kwargs) | |
| return self.engine.tokenizer.tokenizer.apply_chat_template(hf_messages, **kwargs) | |
| kwargs: dict = {"tokenize": False, "add_generation_prompt": True} | |
| if chat_template_kwargs: | |
| reserved = {"tokenize", "add_generation_prompt", "return_tensors", "chat_template"} | |
| blocked = reserved.intersection(chat_template_kwargs) | |
| if blocked: | |
| raise ValueError( | |
| "chat_template_kwargs may not override reserved keys: " | |
| ", ".join(sorted(blocked)) | |
| ) | |
| kwargs.update(chat_template_kwargs) | |
| return self.engine.tokenizer.tokenizer.apply_chat_template(hf_messages, **kwargs) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/server.py` around lines 276 - 279, The chat-template argument
handling currently lets chat_template_kwargs override the fixed settings used by
apply_chat_template, which can change tokenize/add_generation_prompt in a way
that breaks engine.add_request(). Update the logic around kwargs in the
tokenizer apply_chat_template call so the required defaults stay enforced and
only non-fixed request-supplied options are merged in; make sure tokenize
remains false and add_generation_prompt remains true regardless of
chat_template_kwargs.
| register_model = getattr(self.executor, "register_model", None) | ||
| if callable(register_model): | ||
| register_model(self.config.model_id, self.model_record) | ||
| num_pages = register_model(self.config.model_id, self.model_record) | ||
| else: | ||
| num_pages = 0 | ||
|
|
||
| logger.info("Worker model loaded and ready") | ||
| return num_pages |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Do not report readiness with an invalid KV page count.
The else path returns 0, but AsyncLLMEngine.start() rejects <= 0; meanwhile _worker_entry() has already set ready_event and proceeds into busy_loop().
Suggested fix
register_model = getattr(self.executor, "register_model", None)
- if callable(register_model):
- num_pages = register_model(self.config.model_id, self.model_record)
- else:
- num_pages = 0
+ if not callable(register_model):
+ raise RuntimeError("Executor must expose register_model() and return KV cache pages")
+ num_pages = register_model(self.config.model_id, self.model_record)
+ if num_pages <= 0:
+ raise RuntimeError(f"Executor reported invalid KV cache page count: {num_pages}")
logger.info("Worker model loaded and ready")
return num_pages📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| register_model = getattr(self.executor, "register_model", None) | |
| if callable(register_model): | |
| register_model(self.config.model_id, self.model_record) | |
| num_pages = register_model(self.config.model_id, self.model_record) | |
| else: | |
| num_pages = 0 | |
| logger.info("Worker model loaded and ready") | |
| return num_pages | |
| register_model = getattr(self.executor, "register_model", None) | |
| if not callable(register_model): | |
| raise RuntimeError("Executor must expose register_model() and return KV cache pages") | |
| num_pages = register_model(self.config.model_id, self.model_record) | |
| if num_pages <= 0: | |
| raise RuntimeError(f"Executor reported invalid KV cache page count: {num_pages}") | |
| logger.info("Worker model loaded and ready") | |
| return num_pages |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/serving_worker.py` around lines 98 - 105, The readiness path in
serving_worker.py is returning an invalid KV page count from the register_model
fallback, which causes AsyncLLMEngine.start() to reject startup after the worker
has already signaled ready. Update the register_model handling in the model-load
flow so the code never reports readiness with 0 pages; either ensure a valid
positive page count is produced by register_model or fail before setting ready
in _worker_entry()/the model load method. Use the existing register_model and
_worker_entry logic to locate and adjust the startup contract.
| # Fraction of remaining free HBM (after weights + arena) for KV cache. | ||
| kv_cache_memory_fraction: float = 0.90 | ||
| # Max tokens processed per scheduling step (chunked-prefill granularity). | ||
| max_num_batched_tokens: int = 4096 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Validate the new runtime knobs at construction.
kv_cache_memory_fraction and max_num_batched_tokens are consumed directly by the NPU sizing/warmup paths, so values like <= 0, > 1, or 0 currently fall through into a 1-page KV cache or nonsensical warmup sizing instead of failing fast. Please add a config-level validator here so every caller gets the same bounds check, not just the CLIs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/types.py` around lines 67 - 70, Add construction-time validation
for the new runtime knobs in the config type so invalid values fail fast before
reaching NPU sizing/warmup paths. Update the class in types.py that defines
kv_cache_memory_fraction and max_num_batched_tokens to enforce a valid range for
kv_cache_memory_fraction (> 0 and <= 1) and a positive value for
max_num_batched_tokens. Make the check centralized at the config level so every
caller gets the same bounds enforcement, not only the CLI entry points.
70c1fee to
98cefb0
Compare
| print("[register_model] kernel compiled, creating runner …", flush=True) | ||
| runner = self._create_runner(model_id, compiled) | ||
| runner.init_kv_cache(model_id, record.config, record.runtime) | ||
| print("[register_model] runner created, init kv cache …", flush=True) |
| # Defence in depth: stop once the full sequence (prompt + generated) | ||
| # reaches max_seq_len, regardless of max_new_tokens. | ||
| if request.num_tokens >= self.config.max_seq_len: | ||
| return RequestStatus.FINISHED_LENGTH |
| def sample_batch( | ||
| self, logits: torch.Tensor, params_list: Sequence[SamplingParams] | ||
| ) -> list[int]: | ||
| """Sample one token per row from a ``[N, vocab]`` logits block. |
There was a problem hiding this comment.
sampler等npu版本实现再接入吧,目前cpu版本改成batch性能提升也不明显
| * runtime.page_size * config.head_dim * dtype_bytes | ||
| ) | ||
| fraction = getattr(runtime, "kv_cache_memory_fraction", 0.90) | ||
| kv_budget = int(free_bytes * fraction) |
There was a problem hiding this comment.
如果对标vllm的’gpu_memory_utilization',这里应该是总的显存大小* fraction - free_bytes
There was a problem hiding this comment.
改为提供--npu-memory-utilization参数,与vllm对齐
e5da157 to
13d1d75
Compare
…fault, remove --max-new-tokens from serving
…ory breakdown, decode bind_dynamic
… cap, prefix cache full-hit fix
…ler, skip_special_tokens
| for i, sr in enumerate(scheduled) | ||
| if sr.num_computed_tokens + sr.num_new_tokens >= sr.request.num_prompt_tokens | ||
| ] | ||
| if finishing: |
Qwen3-14B serving: vLLM-style KV sizing, admission control, chunked prefill, API hardening