Qwen3-14B serving: dynamic KV sizing, vLLM-style admission, chunked p… by sunghajung6688 · Pull Request #45 · hw-native-sys/pypto-serving

sunghajung6688 · 2026-06-26T09:01:51Z

Qwen3-14B serving: vLLM-style KV sizing, admission control, chunked prefill, API hardening

KV cache (vLLM-style):
- Size by total × npu_memory_utilization − peak_non_kv (measured after warmup);
  halve-retry on rtMalloc OOM, floor-clamped to max_batch
- Profile warmup before KV alloc (slot=-1 1-page scratch)
- Per-component memory breakdown via logger (weights / KV / arena / residual)
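
The sizing rule in the first bullet can be written out as a short calculation. This is a minimal sketch, not the PR's code: `bytes_per_page`, `kv_page_budget`, their parameters, and the assumed page layout (K and V for every layer) are hypothetical names chosen for illustration.

```python
def bytes_per_page(num_layers: int, num_kv_heads: int, head_dim: int,
                   page_size: int, dtype_bytes: int = 2) -> int:
    """Bytes for one KV page across all layers (assumed layout: K plus V)."""
    return 2 * num_layers * num_kv_heads * page_size * head_dim * dtype_bytes


def kv_page_budget(total_bytes: int, utilization: float,
                   peak_non_kv_bytes: int, page_bytes: int,
                   floor_pages: int) -> int:
    """Pages that fit in total * utilization - peak_non_kv, clamped to a floor."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    budget = int(total_bytes * utilization) - peak_non_kv_bytes
    pages = max(budget, 0) // page_bytes
    # The floor mirrors "floor-clamped to max_batch" in the description.
    return max(pages, floor_pages)
```

Because `peak_non_kv_bytes` is measured after warmup, it would already include weights, arena, and scratch buffers, which is why nothing else is subtracted here.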

Scheduler (vLLM-style):
- Reject prompts exceeding max_seq_len (HTTP 400, not silent truncate)
- Cap generation so prompt + output <= max_seq_len
- Chunked prefill support (long_prefill_token_threshold)
- Prefix cache full-hit fix: num_new=0 requests transition to decode
  instead of infinite-looping in waiting queue
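
One way to picture chunked prefill together with the full-hit fix is the sketch below. It assumes the threshold simply caps how many prompt tokens a request contributes per scheduling step; the scheduler's real logic is not shown in this description, and `prefill_chunks` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def prefill_chunks(num_prompt_tokens: int, num_cached_tokens: int,
                   threshold: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token ranges, each at most `threshold` tokens long.

    A full prefix-cache hit (no new tokens to compute) yields nothing, so the
    caller can move the request straight to decode instead of re-queueing it.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    start = num_cached_tokens
    while start < num_prompt_tokens:
        end = min(start + threshold, num_prompt_tokens)
        yield start, end
        start = end
```

An empty result is the `num_new=0` case from the last bullet: there is no prefill work left, so waiting for a prefill slot would never make progress.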

Serving worker:
- torch-based _batch_prefill / _batch_decode (embeddings, positions, block_table)
- KV page count synced to main process via mp.Value
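
The page-count handoff through `mp.Value` might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and the check for a non-positive count is an illustrative choice rather than something confirmed by the PR.

```python
import multiprocessing as mp


def publish_page_count(shared, num_pages: int) -> None:
    """Worker side: store the allocated KV page count in shared memory."""
    with shared.get_lock():
        shared.value = num_pages


def read_page_count(shared) -> int:
    """Main-process side: read the count and refuse a non-positive value."""
    with shared.get_lock():
        num_pages = shared.value
    if num_pages <= 0:
        raise RuntimeError("worker did not report a valid KV page count")
    return num_pages


kv_pages = mp.Value("i", 0)  # shared signed int, created with a lock by default
publish_page_count(kv_pages, 200)
print(read_page_count(kv_pages))
```

In a real setup the `Value` would be created in the main process and passed to the worker through `mp.Process(target=..., args=(kv_pages, ...))`; both sides are called in one process here only to keep the example short.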

API server:
- Official Qwen3 chat template + chat_template_kwargs passthrough (enable_thinking)
- ValueError -> HTTP 400 handler

CLI:
- max_model_len default 1024; removed --max-new-tokens (serving, no effect)
- Renamed --kv-cache-memory-fraction -> --npu-memory-utilization
- Added --max-num-batched-tokens, --long-prefill-token-threshold, --max-num-seqs

Tokenizer:
- skip_special_tokens=True (strip EOS / im_end from output)

Logging:
- Request received/finished lifecycle logs (vLLM-style)
- All runner logs converted from print() to logger.info()
- Quiet noisy library loggers (simpler_setup / pypto -> WARNING)

Tests:
- Updated test_cli.py (removed --max-new-tokens)
- Updated test_batching.py (_FakeWorker.run for warmup dispatch)

coderabbitai · 2026-06-26T09:02:05Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

📝 Walkthrough

Walkthrough

The PR wires dynamic KV-cache sizing from NPU free memory through a shared multiprocessing.Value to the engine's block pool, adds batched token sampling, updates RuntimeConfig defaults to bfloat16 with new kv_cache_memory_fraction and max_num_batched_tokens fields, adds an NPU warmup flow, makes host KV pages lazily allocated, updates the scheduler and server, and bumps the pypto-lib submodule.

Changes

Serving Runtime and NPU Cache Flow

| Layer | File(s) | Summary |
| --- | --- | --- |
| RuntimeConfig fields and CLI defaults | `python/core/types.py`, `python/cli/main.py`, `examples/model/qwen3_14b/cpu_generate.py`, `examples/model/qwen3_14b/npu_generate.py`, `tests/test_batching.py`, `tests/test_npu_prefix_chunk.py` | `RuntimeConfig` gains `kv_cache_memory_fraction` and `max_num_batched_tokens` fields with bfloat16 dtype defaults; the serving and example CLIs wire the new fields, remove `--max-new-tokens`, and configure logging; tests update to bfloat16. |
| Chat templating, scheduler validation, and server error handling | `python/core/server.py`, `python/core/tokenizer.py`, `python/core/scheduler.py` | `ChatCompletionRequest` adds `chat_template_kwargs`; `_apply_chat_template` delegates to the tokenizer API; the server maps `ValueError` to HTTP 400; the tokenizer strips special tokens; the scheduler rejects overlong prompts, caps `max_new_tokens`, handles fully-prefix-cached requests, and enforces `max_seq_len` at finish. |
| KV page count propagation: runner → worker → engine | `python/core/model_runner.py`, `python/core/pypto_executor.py`, `python/core/serving_worker.py`, `python/core/async_engine.py` | `init_kv_cache` returns the allocated page count; `register_model` and `init_device_and_model` propagate it through a shared `multiprocessing.Value`; `AsyncLLMEngine.start()` reads the value and calls `kv_cache_manager._init_blocks`; `ModelRunner.warmup` becomes a no-op default. |
| Lazy host KV page allocation | `python/core/kv_cache.py` | `_CachePool` key/value tensors become optional; `register_model` skips eager allocation; `_ensure_host_pool` allocates on first CPU-side access; `write_tokens`, `read_context`, and `materialize_single_layer_cache` route through the new helper. |
| Batched sampling and request logging | `python/core/sampler.py`, `python/core/serving_worker.py`, `python/core/async_engine.py` | `Sampler.sample_batch` performs a single device→host transfer for all logits rows; `_batch_prefill` and `_batch_decode` use batched sampling; `add_request` logs arrival and completion with latency and tokens/sec. |
| NPU warmup, dynamic KV sizing, and memory observability | `examples/model/qwen3_14b/runner/npu_runner.py`, `examples/model/qwen3_14b/runner/npu_executor.py` | `init_kv_cache` is replaced with a two-phase flow: scratch-cache warmup dispatch then dynamic page sizing from free NPU memory with OOM-retry halving; `_alloc_kv_cache_with_retry`, `_compute_kv_cache_pages`, `_print_memory_breakdown`, `warmup`, and `_warmup_dispatch` are added; the executor validator relaxes to a minimum-pages check. |

pypto-lib Revision Bump

| Layer | File(s) | Summary |
| --- | --- | --- |
| Submodule reference update | `pypto-lib` | The `pypto-lib` subproject pointer was advanced to a new commit hash. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/pypto-serving#21: Modifies python/cli/main.py and AsyncLLMEngine initialization, the same wiring path extended in this PR for KV-cache memory fraction and batched token limits.
hw-native-sys/pypto-serving#24: Refactors init_kv_cache and KV-cache materialization in model_runner.py, kv_cache.py, and npu_runner.py, directly overlapping with this PR's lazy host pool and page-count propagation changes.

Poem

🐇 Hoppity-hop through the memory lane,
Pages computed, then counted again!
Warmup runs first with a scratch-page disguise,
Then bf16 floats fill the cache to the skies.
The scheduler now says "no overlong chat!"
A rabbit approves — and that's final, full stop, flat. 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.31%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title matches the main themes of the PR: dynamic KV sizing, admission changes, and chunked prefill support. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description matches the serving, scheduler, API, CLI, tokenizer, logging, and test updates in the changeset. |

gemini-code-assist

Code Review

This pull request introduces dynamic KV cache allocation, a profile warmup phase, and batch sampling optimizations for Qwen3-14B on NPU. It enables lazy initialization of host-side block metadata based on actual device-side KV cache capacity, adds robust prompt length validation and capping in the scheduler, and updates default data types to bfloat16. Feedback on these changes highlights three key areas for improvement: refining the KV cache allocation retry loop to prevent skipping the minimum page floor, correcting the validation check in npu_executor.py to use the computed min_pages (or kernel_batch) instead of a weak check against 1, and updating the weight parameter estimation formula in npu_runner.py to accurately reflect Grouped Query Attention (GQA) instead of Multi-Head Attention (MHA).

gemini-code-assist · 2026-06-26T09:04:37Z

+            except (RuntimeError, MemoryError) as e:
+                prev = num_pages
+                num_pages //= 2
+                print(
+                    f"[init_kv_cache] alloc failed ({e}); retrying {prev} -> {num_pages}",
+                    flush=True,
+                )


In the retry loop, halving num_pages can cause it to drop below floor immediately, skipping the floor value entirely. For example, if floor is 16 and num_pages is 30, halving it results in 15, which is less than floor, causing the loop to terminate without ever trying to allocate exactly floor (16) pages. If num_pages drops below floor but was previously strictly greater than floor, it should be clamped to floor for one final attempt.

            except (RuntimeError, MemoryError) as e:
                prev = num_pages
                num_pages //= 2
                if num_pages < floor and prev > floor:
                    num_pages = floor
                print(
                    f"[init_kv_cache] alloc failed ({e}); retrying {prev} -> {num_pages}",
                    flush=True,
                )
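
The floor-clamping behaviour can be checked in isolation with a fake allocator. This is a sketch only: `alloc_with_retry` and `try_alloc` are hypothetical names, and the real loop in `npu_runner.py` may be structured differently.

```python
from typing import Callable


def alloc_with_retry(try_alloc: Callable[[int], None], num_pages: int, floor: int) -> int:
    """Halve on failure, but always give exactly `floor` pages one attempt."""
    if num_pages < floor:
        raise ValueError("initial num_pages is below the floor")
    while True:
        try:
            try_alloc(num_pages)
            return num_pages
        except (RuntimeError, MemoryError):
            if num_pages == floor:
                raise  # the floor itself failed; nothing smaller is allowed
            # Clamp so that halving 30 with floor 16 retries at 16, not 15.
            num_pages = max(num_pages // 2, floor)
```

Using `max(num_pages // 2, floor)` folds the clamp into the halving step, so the sequence always ends with an attempt at the floor before giving up.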

gemini-code-assist · 2026-06-26T09:04:37Z

+        min_pages = kernel_batch * (model.runtime.max_seq_len + model.runtime.page_size - 1) // model.runtime.page_size
+        if model.runtime.total_kv_pages < 1:
            raise ValueError(
-                "PyPTO Qwen3-14B kernels require total_kv_pages to match the runtime batch capacity: "
-                f"{model.runtime.total_kv_pages} provided, {expected_pages} required."
+                f"total_kv_pages must be at least 1, got {model.runtime.total_kv_pages}"
            )


The variable min_pages is computed but never used in the validation check. Furthermore, checking total_kv_pages < 1 is too weak for a batch of size kernel_batch, as each batch slot requires at least 1 page to avoid runtime errors. It should be validated against kernel_batch instead.

        if model.runtime.total_kv_pages < kernel_batch:
            raise ValueError(
                f"total_kv_pages must be at least kernel_batch ({kernel_batch}), got {model.runtime.total_kv_pages}"
            )

gemini-code-assist · 2026-06-26T09:04:37Z

+        # Weights — same estimate as _compute_kv_cache_pages.
+        hidden = config.hidden_size
+        wt_params = (
+            config.num_hidden_layers * (
+                hidden * hidden * 4
+                + hidden * config.intermediate_size * 3
+                + hidden * 4
+            )
+            + config.vocab_size * hidden
+        )
+        weight_bytes = int(wt_params * dtype_bytes)


The weight parameter estimation formula assumes Multi-Head Attention (MHA) where the Q, K, V, and O projections are all of size hidden * hidden (hence hidden * hidden * 4). However, Qwen3-14B uses Grouped Query Attention (GQA), where the K and V projections are significantly smaller (hidden * kv_hidden). This leads to a large overestimate of the model weight size in the memory breakdown log. We should use the actual GQA projection sizes for a more accurate estimate.

Suggested change

         # Weights — same estimate as _compute_kv_cache_pages.
         hidden = config.hidden_size
+        kv_hidden = config.num_key_value_heads * config.head_dim
         wt_params = (
             config.num_hidden_layers * (
-                hidden * hidden * 4
+                hidden * hidden * 2
+                + hidden * kv_hidden * 2
                 + hidden * config.intermediate_size * 3
                 + hidden * 4
             )
             + config.vocab_size * hidden
         )
         weight_bytes = int(wt_params * dtype_bytes)
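
To get a feel for the size of the overestimate, the two formulas can be compared side by side. The sketch below is illustrative: the config values are assumed, not read from the PR, and like the suggestion it assumes `num_attention_heads * head_dim == hidden` so that Q and O stay `hidden x hidden`.

```python
def layer_attention_params(hidden: int, kv_hidden: int) -> int:
    """Q and O are hidden x hidden; K and V are hidden x kv_hidden."""
    return 2 * hidden * hidden + 2 * hidden * kv_hidden


def weight_params(hidden: int, kv_hidden: int, intermediate: int,
                  layers: int, vocab: int) -> int:
    per_layer = (
        layer_attention_params(hidden, kv_hidden)
        + 3 * hidden * intermediate  # gate, up, and down projections
        + 4 * hidden                 # small per-layer terms, as in the original estimate
    )
    return layers * per_layer + vocab * hidden


# Illustrative values only (assumed); use the real model config in practice.
HIDDEN, KV_HEADS, HEAD_DIM = 5120, 8, 128
INTERMEDIATE, LAYERS, VOCAB = 17408, 40, 151936

# Passing kv_hidden == hidden reproduces the MHA-style "hidden * hidden * 4" term.
mha = weight_params(HIDDEN, HIDDEN, INTERMEDIATE, LAYERS, VOCAB)
gqa = weight_params(HIDDEN, KV_HEADS * HEAD_DIM, INTERMEDIATE, LAYERS, VOCAB)
overestimate_gb = (mha - gqa) * 2 / 1e9  # bf16: 2 bytes per parameter
print(f"MHA-style estimate exceeds the GQA estimate by {overestimate_gb:.2f} GB")
```

The gap only affects the logged breakdown if the KV budget itself comes from measured free memory, but a wrong weights line makes the residual line harder to interpret.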

coderabbitai

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/model/qwen3_14b/npu_generate.py`:
- Around line 438-440: Replace the ambiguous Unicode multiplication sign in the
nearby comment with an ASCII-safe equivalent so Ruff no longer flags RUF003.
Update the comment in the area around the decode kernel capacity note in
npu_generate.py, keeping the wording the same but using plain text characters
only.

In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 382-387: The warmup decode path is allowing `seq_len` to exceed
`runtime.max_seq_len` when `max_num_batched_tokens` is large, so adjust the
warmup token calculation in `npu_runner.py` to cap decode length at
`max_seq_len` before dispatching the warmup request. Update the logic around
`batch`, `max_seq`, `mnb`, `step_tokens`, `per_req`, and `total_tokens` so the
decode step sent by the warmup code never becomes `max_seq_len + 1`, and apply
the same bound wherever the warmup decode request is built.
- Line 267: The new diagnostics path has Ruff issues in NpuRunner-related code:
replace any ambiguous Unicode multiplication symbols with plain ASCII x in the
affected text, and update the logic that reads the cache so it does not build a
full list just to access the first item. Make the fix in the NpuRunner
methods/docstrings around the diagnostics path and any helper that currently
materializes cache entries before using only the first value.

In `@python/cli/main.py`:
- Around line 52-57: The `--kv-cache-memory-fraction` argument in `main` has
misleading help text: it shows the wrong default and describes the reservation
basis incorrectly. Update the `parser.add_argument` help string to reflect the
actual default value of 0.90 and state that the fraction is applied to measured
free HBM after warmup, not total HBM. Keep the change localized to the CLI
argument definition so the documentation matches runtime behavior.

In `@python/core/kv_cache.py`:
- Around line 345-356: The lazy host-pool setup in the pool allocation block
leaves pool.key_pages assigned before pool.value_pages is created, so a failure
can permanently half-initialize the pool. Update the allocation flow in the
host-pool initialization path (the code that sets pool.key_pages and
pool.value_pages) so both tensors are created atomically: allocate into local
temporaries first, then assign both fields only after both allocations succeed,
and leave the pool unchanged on error.

In `@python/core/pypto_executor.py`:
- Around line 61-69: `register_model` in `pypto_executor.py` leaves partially
initialized state if `runner.init_kv_cache(...)` throws, so add rollback around
the runner setup path. Wrap the `self._create_runner(...)` and
`runner.init_kv_cache(...)` sequence in `register_model` with cleanup that
removes any `self._compiled[model_id]` entry, avoids storing
`self._runners[model_id]`, and closes/destroys the created runner on failure.
Make sure the successful path still assigns `self._runners[model_id]` only after
KV-cache init completes.

In `@python/core/scheduler.py`:
- Around line 147-155: The scheduler’s max-new-token capping in scheduler.py can
reduce `request.max_new_tokens` to 0 when `prompt_len` already equals
`max_seq_len`, which still allows one extra sampled token later in the
prefill/finish path. Update the enqueue-time validation in the scheduler logic
around `remaining`, `request.max_new_tokens`, and `request.request_id` so
zero-token generation capacity is rejected or finished before the request is
queued, rather than merely capped to 0.

In `@python/core/server.py`:
- Around line 276-279: The chat-template argument handling currently lets
chat_template_kwargs override the fixed settings used by apply_chat_template,
which can change tokenize/add_generation_prompt in a way that breaks
engine.add_request(). Update the logic around kwargs in the tokenizer
apply_chat_template call so the required defaults stay enforced and only
non-fixed request-supplied options are merged in; make sure tokenize remains
false and add_generation_prompt remains true regardless of chat_template_kwargs.

In `@python/core/serving_worker.py`:
- Around line 98-105: The readiness path in serving_worker.py is returning an
invalid KV page count from the register_model fallback, which causes
AsyncLLMEngine.start() to reject startup after the worker has already signaled
ready. Update the register_model handling in the model-load flow so the code
never reports readiness with 0 pages; either ensure a valid positive page count
is produced by register_model or fail before setting ready in
_worker_entry()/the model load method. Use the existing register_model and
_worker_entry logic to locate and adjust the startup contract.

In `@python/core/types.py`:
- Around line 67-70: Add construction-time validation for the new runtime knobs
in the config type so invalid values fail fast before reaching NPU sizing/warmup
paths. Update the class in types.py that defines kv_cache_memory_fraction and
max_num_batched_tokens to enforce a valid range for kv_cache_memory_fraction (>
0 and <= 1) and a positive value for max_num_batched_tokens. Make the check
centralized at the config level so every caller gets the same bounds
enforcement, not only the CLI entry points.
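
For the last item, construction-time validation could be done in a dataclass `__post_init__`. The sketch below uses a hypothetical stand-in class rather than the real `RuntimeConfig`, and the default of 4096 for `max_num_batched_tokens` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RuntimeKnobs:  # hypothetical stand-in for the real config type
    kv_cache_memory_fraction: float = 0.90
    max_num_batched_tokens: int = 4096  # assumed default

    def __post_init__(self) -> None:
        # Fail fast so bad values never reach NPU sizing or warmup.
        if not 0.0 < self.kv_cache_memory_fraction <= 1.0:
            raise ValueError(
                "kv_cache_memory_fraction must be in (0, 1], "
                f"got {self.kv_cache_memory_fraction}"
            )
        if self.max_num_batched_tokens <= 0:
            raise ValueError(
                f"max_num_batched_tokens must be positive, got {self.max_num_batched_tokens}"
            )
```

Raising `ValueError` here means every caller, including the CLI and any programmatic user, hits the same bounds check.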

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between d37496a and e9936d0.

📒 Files selected for processing (18)

examples/model/qwen3_14b/cpu_generate.py
examples/model/qwen3_14b/npu_generate.py
examples/model/qwen3_14b/runner/npu_executor.py
examples/model/qwen3_14b/runner/npu_runner.py
pypto-lib
python/cli/main.py
python/core/async_engine.py
python/core/kv_cache.py
python/core/model_runner.py
python/core/pypto_executor.py
python/core/sampler.py
python/core/scheduler.py
python/core/server.py
python/core/serving_worker.py
python/core/tokenizer.py
python/core/types.py
tests/test_batching.py
tests/test_npu_prefix_chunk.py

coderabbitai · 2026-06-29T02:40:35Z

+                # Conservative default — the decode kernel is compiled
+                # with this baked-in shape and cannot be resized later.
+                # 200 pages × 128 tokens = 25 600 tokens total capacity.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Replace the Unicode multiplication sign in the comment.

Line 440 uses ×, which Ruff flags as ambiguous (RUF003) and can keep lint red for a comment-only issue.

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 440-440: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)

Source: Linters/SAST tools

coderabbitai · 2026-06-29T02:40:35Z

+        Called AFTER the profile warm-up, so weights, the simpler ring-heap
+        arena, compiled buffers and any persistent scratch are already
+        allocated — i.e. already reflected in the measured ``free``. The KV
+        budget is therefore just ``free × fraction``; no separate


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Address the Ruff warnings in the new diagnostics path.

Replace ambiguous × characters with ASCII x, and avoid materializing the full cache list just to read the first value.

Proposed lint cleanup

-        budget is therefore just ``free × fraction``; no separate
+        budget is therefore just ``free x fraction``; no separate
@@
-        model config), KV cache (exact = num_pages × bytes_per_page), simpler
-        ring-heap arena (from the ``PTO2_RING_HEAP`` env × 4), and the
+        model config), KV cache (exact = num_pages x bytes_per_page), simpler
+        ring-heap arena (from the ``PTO2_RING_HEAP`` env x 4), and the
@@
-            f"≈ {max_len_reqs} × full-len({max_seq_len}) reqs; "
-            f"worst-case need {runtime.max_batch_size}×{max_seq_len}="
+            f"≈ {max_len_reqs} x full-len({max_seq_len}) reqs; "
+            f"worst-case need {runtime.max_batch_size}x{max_seq_len}="
@@
-        print(f" ├─ simpler arena (env × 4): {arena_bytes / 1e9:7.2f} GB", flush=True)
+        print(f" ├─ simpler arena (env x 4): {arena_bytes / 1e9:7.2f} GB", flush=True)
@@
-        kv_cache = list(self._kv_caches.values())[0]
+        kv_cache = next(iter(self._kv_caches.values()))

Also applies to: 298-299, 349-350, 355-355, 395-395

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 267-267: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF002)

Source: Linters/SAST tools

coderabbitai · 2026-06-29T02:40:35Z

+        batch = runtime.max_batch_size
+        max_seq = runtime.max_seq_len
+        mnb = getattr(runtime, "max_num_batched_tokens", 4096)
+        step_tokens = min(mnb, batch * max_seq)
+        per_req = max(step_tokens // batch, 1)
+        total_tokens = per_req * batch


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep warmup decode seq_len within max_seq_len.

When max_num_batched_tokens >= max_batch_size * max_seq_len, per_req becomes max_seq_len, so Line 440 dispatches decode with max_seq_len + 1. That can fail startup or overrun max-sequence kernel/RoPE assumptions.

Proposed fix

         batch = runtime.max_batch_size
         max_seq = runtime.max_seq_len
         mnb = getattr(runtime, "max_num_batched_tokens", 4096)
-        step_tokens = min(mnb, batch * max_seq)
+        max_prefill_per_req = max(max_seq - 1, 1)
+        step_tokens = min(mnb, batch * max_prefill_per_req)
         per_req = max(step_tokens // batch, 1)
         total_tokens = per_req * batch
+        decode_seq_len = min(per_req + 1, max_seq)
@@
         for b in range(batch):
-            compiled.decode_seq_lens_buffer[b] = per_req + 1
+            compiled.decode_seq_lens_buffer[b] = decode_seq_len
@@
-        print(f"[warmup] decode dispatch … (batch={batch}, seq_len={per_req + 1})", flush=True)
+        print(f"[warmup] decode dispatch … (batch={batch}, seq_len={decode_seq_len})", flush=True)

Also applies to: 439-440, 451-455
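
The bound in the proposed fix is easy to verify as a pure function. This sketch lifts the arithmetic out of the runner; `warmup_shapes` is a hypothetical helper name, not part of the PR.

```python
from typing import Tuple


def warmup_shapes(max_batch_size: int, max_seq_len: int,
                  max_num_batched_tokens: int) -> Tuple[int, int]:
    """Return (prefill tokens per request, decode seq_len), with decode <= max_seq_len."""
    # Leave room for the one token that the decode step appends.
    max_prefill_per_req = max(max_seq_len - 1, 1)
    step_tokens = min(max_num_batched_tokens, max_batch_size * max_prefill_per_req)
    per_req = max(step_tokens // max_batch_size, 1)
    decode_seq_len = min(per_req + 1, max_seq_len)
    return per_req, decode_seq_len
```

With a batch of 4, `max_seq_len` of 1024, and a 4096-token budget, the unpatched arithmetic would give `per_req = 1024` and a decode length of 1025; the capped version returns 1023 and 1024.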

coderabbitai · 2026-06-29T02:40:35Z

+    parser.add_argument(
+        "--kv-cache-memory-fraction",
+        type=float,
+        default=0.90,
+        help="Fraction of total NPU HBM to reserve for KV cache (default: 0.10).",
+    )


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the --kv-cache-memory-fraction help text.

The parser default is 0.90, but the help still says 0.10, and the runtime actually applies the fraction to measured free HBM after warmup rather than total HBM. As written, the flag documents the wrong default and the wrong sizing basis.

Suggested edit

     parser.add_argument(
         "--kv-cache-memory-fraction",
         type=float,
         default=0.90,
-        help="Fraction of total NPU HBM to reserve for KV cache (default: 0.10).",
+        help="Fraction of free NPU HBM after warmup to reserve for KV cache (default: 0.90).",
     )
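
A related option, not part of the suggestion above, is to let `argparse` interpolate the default into the help string so the two cannot drift apart again. A minimal sketch, with `build_parser` and the `prog` name as hypothetical choices:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument(
        "--kv-cache-memory-fraction",
        type=float,
        default=0.90,
        # %(default)s is expanded by argparse when the help text is rendered.
        help="Fraction of free NPU HBM after warmup to reserve for KV cache "
             "(default: %(default)s).",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```

The rendered default follows Python's float formatting, so `0.90` appears as `0.9` in the help output.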

coderabbitai · 2026-06-29T02:40:35Z

+        pool = self._pool(model_id)
+        if pool.key_pages is None:
+            pool.key_pages = torch.zeros(
+                pool.num_layers,
+                pool.num_pages,
+                pool.num_kv_heads,
+                pool.page_size,
+                pool.head_dim,
+                dtype=pool.kv_dtype,
+                device="cpu",
+            )
+            pool.value_pages = torch.zeros_like(pool.key_pages)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Make lazy host-pool allocation atomic.

If value_pages allocation fails here, key_pages stays set and the pool is permanently half-initialized. The next write_tokens/read_context call skips allocation and crashes on pool.value_pages[...].

Proposed fix

         pool = self._pool(model_id)
-        if pool.key_pages is None:
-            pool.key_pages = torch.zeros(
+        if pool.key_pages is None or pool.value_pages is None:
+            key_pages = torch.zeros(
                 pool.num_layers,
                 pool.num_pages,
                 pool.num_kv_heads,
                 pool.page_size,
                 pool.head_dim,
                 dtype=pool.kv_dtype,
                 device="cpu",
             )
-            pool.value_pages = torch.zeros_like(pool.key_pages)
+            value_pages = torch.zeros_like(key_pages)
+            pool.key_pages = key_pages
+            pool.value_pages = value_pages
         return pool
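
The allocate-then-assign pattern can be exercised without torch by injecting an allocator that fails on demand. This is a sketch under assumptions: `HostPool` and `ensure_host_pool` are hypothetical stand-ins for `_CachePool` and `_ensure_host_pool`, and plain lists replace tensors.

```python
from typing import Callable, List, Optional


class HostPool:  # hypothetical stand-in for _CachePool
    def __init__(self) -> None:
        self.key_pages: Optional[List[float]] = None
        self.value_pages: Optional[List[float]] = None


def ensure_host_pool(pool: HostPool, alloc: Callable[[], List[float]]) -> HostPool:
    """Allocate both buffers into locals first, then publish them together."""
    if pool.key_pages is None or pool.value_pages is None:
        key_pages = alloc()
        value_pages = alloc()  # if this raises, the pool is left untouched
        pool.key_pages = key_pages
        pool.value_pages = value_pages
    return pool
```

If the second allocation fails, both fields stay `None`, so the next access retries the whole allocation instead of indexing into a missing `value_pages`.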

coderabbitai · 2026-06-29T02:40:35Z

            compiled = self._compile_model(record.runtime_model)
            self._compiled[model_id] = compiled
+            print("[register_model] kernel compiled, creating runner …", flush=True)
            runner = self._create_runner(model_id, compiled)
-            runner.init_kv_cache(model_id, record.config, record.runtime)
+            print("[register_model] runner created, init kv cache …", flush=True)
+            num_pages = runner.init_kv_cache(model_id, record.config, record.runtime)
+            # init_kv_cache runs profile warmup internally (phase 1)
            self._runners[model_id] = runner
+        return num_pages


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Rollback runner state if KV-cache init fails.

runner.init_kv_cache() now does the expensive warmup/allocation path. If it raises, this method leaves self._compiled[model_id] populated and never closes the runner, so a retry can inherit leaked worker/device state.

Proposed fix

         with profile_span("PyptoExecutor.register_model", cat="executor", args={"model_id": model_id}):
             compiled = self._compile_model(record.runtime_model)
-            self._compiled[model_id] = compiled
             print("[register_model] kernel compiled, creating runner …", flush=True)
             runner = self._create_runner(model_id, compiled)
-            print("[register_model] runner created, init kv cache …", flush=True)
-            num_pages = runner.init_kv_cache(model_id, record.config, record.runtime)
-            # init_kv_cache runs profile warmup internally (phase 1)
-            self._runners[model_id] = runner
+            try:
+                self._compiled[model_id] = compiled
+                print("[register_model] runner created, init kv cache …", flush=True)
+                num_pages = runner.init_kv_cache(model_id, record.config, record.runtime)
+                # init_kv_cache runs profile warmup internally (phase 1)
+                self._runners[model_id] = runner
+            except Exception:
+                self._compiled.pop(model_id, None)
+                close = getattr(runner, "close", None)
+                if callable(close):
+                    close()
+                raise
         return num_pages
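
The rollback can be tested with a fake runner whose `init_kv_cache` fails. The classes below are hypothetical, simplified stand-ins for `PyptoExecutor` and its runner, written only to show that a failed registration leaves no state behind.

```python
class SketchExecutor:  # hypothetical, simplified stand-in for PyptoExecutor
    def __init__(self, create_runner):
        self._create_runner = create_runner
        self._compiled = {}
        self._runners = {}

    def register_model(self, model_id, compiled):
        runner = self._create_runner(model_id, compiled)
        try:
            self._compiled[model_id] = compiled
            num_pages = runner.init_kv_cache(model_id)
            self._runners[model_id] = runner  # stored only after success
        except Exception:
            # Undo partial state and release the runner before re-raising.
            self._compiled.pop(model_id, None)
            close = getattr(runner, "close", None)
            if callable(close):
                close()
            raise
        return num_pages


class FailingRunner:
    def __init__(self):
        self.closed = False

    def init_kv_cache(self, model_id):
        raise RuntimeError("KV allocation failed")

    def close(self):
        self.closed = True
```

After a failed call, both `_compiled` and `_runners` are empty and the runner has been closed, so a retry starts from a clean slate.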

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/pypto_executor.py` around lines 61 - 69, `register_model` in `pypto_executor.py` leaves partially initialized state if `runner.init_kv_cache(...)` throws, so add rollback around the runner setup path. Wrap the `self._create_runner(...)` and `runner.init_kv_cache(...)` sequence in `register_model` with cleanup that removes any `self._compiled[model_id]` entry, avoids storing `self._runners[model_id]`, and closes/destroys the created runner on failure. Make sure the successful path still assigns `self._runners[model_id]` only after KV-cache init completes.
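
The rollback described above can be exercised in isolation. The following is a minimal sketch, not the project's code: `FakeRunner` and `Executor` are hypothetical stand-ins, and the only behaviour borrowed from the review is "drop the compiled entry, close the runner, re-raise".

```python
class FakeRunner:
    """Hypothetical stand-in for the real runner (assumption, not pypto-serving code)."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail
        self.closed = False

    def init_kv_cache(self) -> int:
        if self.fail:
            raise MemoryError("simulated KV allocation failure")
        return 128  # arbitrary page count for the sketch

    def close(self) -> None:
        self.closed = True


class Executor:
    """Hypothetical executor holding the two registries mentioned in the review."""

    def __init__(self) -> None:
        self._compiled = {}
        self._runners = {}

    def register_model(self, model_id: str, runner: FakeRunner) -> int:
        self._compiled[model_id] = object()  # placeholder for the compiled kernel
        try:
            num_pages = runner.init_kv_cache()
            # Publish the runner only after KV-cache init has succeeded.
            self._runners[model_id] = runner
        except Exception:
            # Roll back partial state so a retry starts from a clean slate.
            self._compiled.pop(model_id, None)
            runner.close()
            raise
        return num_pages
```

After a failed registration both registries are empty and the runner has been closed, so a second attempt with a working runner registers normally.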

coderabbitai · 2026-06-29T02:40:35Z

+        remaining = max_seq_len - prompt_len
+        if request.max_new_tokens > remaining:
+            logger.warning(
+                "Request %s: capping max_new_tokens %d -> %d to fit max_seq_len %d "
+                "(prompt_len=%d).",
+                request.request_id, request.max_new_tokens, remaining,
+                max_seq_len, prompt_len,
+            )
+            request.max_new_tokens = remaining


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject zero-token generation capacity before enqueueing.

When prompt_len == max_seq_len, this caps max_new_tokens to 0, but the prefill completion path still samples one token before _check_finish() can finish the request. That lets responses exceed max_seq_len by one token.

Suggested fix

         # keeps every request within the KV-cache capacity budgeted per request
         # and avoids overflow-driven preemption.
+        if request.max_new_tokens <= 0:
+            raise ValueError(
+                f"Request {request.request_id} max_new_tokens must be positive; "
+                f"got {request.max_new_tokens}."
+            )
         remaining = max_seq_len - prompt_len
+        if remaining <= 0:
+            raise ValueError(
+                f"Request {request.request_id} prompt length {prompt_len} "
+                f"leaves no room for generation within max_seq_len {max_seq_len}."
+            )
         if request.max_new_tokens > remaining:
             logger.warning(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/scheduler.py` around lines 147 - 155, The scheduler’s max-new-token capping in scheduler.py can reduce `request.max_new_tokens` to 0 when `prompt_len` already equals `max_seq_len`, which still allows one extra sampled token later in the prefill/finish path. Update the enqueue-time validation in the scheduler logic around `remaining`, `request.max_new_tokens`, and `request.request_id` so zero-token generation capacity is rejected or finished before the request is queued, rather than merely capped to 0.
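
The enqueue-time check can be reduced to a pure function for testing. This is a sketch under assumptions: `cap_max_new_tokens` is a hypothetical helper name, and it takes plain integers rather than the scheduler's request object.

```python
def cap_max_new_tokens(prompt_len: int, max_new_tokens: int, max_seq_len: int) -> int:
    """Return the generation budget, rejecting requests that cannot generate at all.

    Hypothetical helper: mirrors the suggested validation order (positive
    max_new_tokens first, then room left after the prompt, then the cap).
    """
    if max_new_tokens <= 0:
        raise ValueError(f"max_new_tokens must be positive; got {max_new_tokens}.")
    remaining = max_seq_len - prompt_len
    if remaining <= 0:
        raise ValueError(
            f"prompt length {prompt_len} leaves no room for generation "
            f"within max_seq_len {max_seq_len}."
        )
    # Cap so that prompt + output never exceeds max_seq_len.
    return min(max_new_tokens, remaining)
```

With `max_seq_len=16`, a 10-token prompt asking for 100 tokens is capped to 6, while a 16-token prompt is rejected instead of being capped to 0.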

coderabbitai · 2026-06-29T02:40:36Z

+        kwargs: dict = {"tokenize": False, "add_generation_prompt": True}
+        if chat_template_kwargs:
+            kwargs.update(chat_template_kwargs)
+        return self.engine.tokenizer.tokenizer.apply_chat_template(hf_messages, **kwargs)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Protect fixed chat-template arguments from request overrides.

chat_template_kwargs can currently override tokenize=False or add_generation_prompt=True; tokenize=True makes this method return token IDs instead of the str that engine.add_request() expects.

Suggested fix

         hf_messages = [{"role": m.role, "content": m.content} for m in messages]
         kwargs: dict = {"tokenize": False, "add_generation_prompt": True}
         if chat_template_kwargs:
+            reserved = {"tokenize", "add_generation_prompt", "return_tensors", "chat_template"}
+            blocked = reserved.intersection(chat_template_kwargs)
+            if blocked:
+                raise ValueError(
+                    "chat_template_kwargs may not override reserved keys: "
+                    + ", ".join(sorted(blocked))
+                )
             kwargs.update(chat_template_kwargs)
         return self.engine.tokenizer.tokenizer.apply_chat_template(hf_messages, **kwargs)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/server.py` around lines 276 - 279, The chat-template argument handling currently lets chat_template_kwargs override the fixed settings used by apply_chat_template, which can change tokenize/add_generation_prompt in a way that breaks engine.add_request(). Update the logic around kwargs in the tokenizer apply_chat_template call so the required defaults stay enforced and only non-fixed request-supplied options are merged in; make sure tokenize remains false and add_generation_prompt remains true regardless of chat_template_kwargs.
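
The merge rule can be checked without a tokenizer. A minimal sketch, assuming a hypothetical helper `build_template_kwargs` that only builds the keyword dictionary and leaves the `apply_chat_template` call to the caller:

```python
from typing import Optional

# Keys the server fixes itself; the set matches the one in the suggested fix.
RESERVED_KEYS = frozenset(
    {"tokenize", "add_generation_prompt", "return_tensors", "chat_template"}
)


def build_template_kwargs(chat_template_kwargs: Optional[dict]) -> dict:
    """Merge request-supplied options while keeping the fixed defaults intact."""
    kwargs = {"tokenize": False, "add_generation_prompt": True}
    if chat_template_kwargs:
        blocked = RESERVED_KEYS.intersection(chat_template_kwargs)
        if blocked:
            # The PR maps ValueError to HTTP 400, so the client sees a clear error.
            raise ValueError(
                "chat_template_kwargs may not override reserved keys: "
                + ", ".join(sorted(blocked))
            )
        kwargs.update(chat_template_kwargs)
    return kwargs
```

A request passing `{"enable_thinking": False}` goes through unchanged, while `{"tokenize": True}` is rejected before it can change the return type.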

coderabbitai · 2026-06-29T02:40:36Z

            register_model = getattr(self.executor, "register_model", None)
            if callable(register_model):
-                register_model(self.config.model_id, self.model_record)
+                num_pages = register_model(self.config.model_id, self.model_record)
+            else:
+                num_pages = 0

            logger.info("Worker model loaded and ready")
+            return num_pages


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not report readiness with an invalid KV page count.

The else path returns 0, but AsyncLLMEngine.start() rejects <= 0; meanwhile _worker_entry() has already set ready_event and proceeds into busy_loop().

Suggested fix

             register_model = getattr(self.executor, "register_model", None)
-            if callable(register_model):
-                num_pages = register_model(self.config.model_id, self.model_record)
-            else:
-                num_pages = 0
+            if not callable(register_model):
+                raise RuntimeError("Executor must expose register_model() and return KV cache pages")
+            num_pages = register_model(self.config.model_id, self.model_record)
+            if num_pages <= 0:
+                raise RuntimeError(f"Executor reported invalid KV cache page count: {num_pages}")

             logger.info("Worker model loaded and ready")
             return num_pages

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/serving_worker.py` around lines 98 - 105, The readiness path in serving_worker.py is returning an invalid KV page count from the register_model fallback, which causes AsyncLLMEngine.start() to reject startup after the worker has already signaled ready. Update the register_model handling in the model-load flow so the code never reports readiness with 0 pages; either ensure a valid positive page count is produced by register_model or fail before setting ready in _worker_entry()/the model load method. Use the existing register_model and _worker_entry logic to locate and adjust the startup contract.

coderabbitai · 2026-06-29T02:40:36Z

+    # Fraction of remaining free HBM (after weights + arena) for KV cache.
+    kv_cache_memory_fraction: float = 0.90
+    # Max tokens processed per scheduling step (chunked-prefill granularity).
+    max_num_batched_tokens: int = 4096


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Validate the new runtime knobs at construction.

kv_cache_memory_fraction and max_num_batched_tokens are consumed directly by the NPU sizing/warmup paths, so a kv_cache_memory_fraction <= 0 or > 1, or a max_num_batched_tokens of 0, currently falls through into a 1-page KV cache or nonsensical warmup sizing instead of failing fast. Please add a config-level validator here so every caller gets the same bounds check, not just the CLIs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/types.py` around lines 67 - 70, Add construction-time validation for the new runtime knobs in the config type so invalid values fail fast before reaching NPU sizing/warmup paths. Update the class in types.py that defines kv_cache_memory_fraction and max_num_batched_tokens to enforce a valid range for kv_cache_memory_fraction (> 0 and <= 1) and a positive value for max_num_batched_tokens. Make the check centralized at the config level so every caller gets the same bounds enforcement, not only the CLI entry points.
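
One way to centralize the bounds check is a dataclass `__post_init__`. This is a sketch only: `RuntimeKnobs` is a hypothetical class name, and it assumes the real config type in types.py is (or can be) a dataclass; the two field names and defaults are taken from the diff above.

```python
from dataclasses import dataclass


@dataclass
class RuntimeKnobs:
    """Hypothetical config carrying only the two knobs under review."""

    # Fraction of remaining free HBM (after weights + arena) for KV cache.
    kv_cache_memory_fraction: float = 0.90
    # Max tokens processed per scheduling step (chunked-prefill granularity).
    max_num_batched_tokens: int = 4096

    def __post_init__(self) -> None:
        # Fail at construction so every caller, not just the CLI, gets the check.
        if not 0.0 < self.kv_cache_memory_fraction <= 1.0:
            raise ValueError(
                "kv_cache_memory_fraction must be in (0, 1]; "
                f"got {self.kv_cache_memory_fraction}"
            )
        if self.max_num_batched_tokens <= 0:
            raise ValueError(
                "max_num_batched_tokens must be positive; "
                f"got {self.max_num_batched_tokens}"
            )
```

Constructing `RuntimeKnobs(kv_cache_memory_fraction=1.5)` or `RuntimeKnobs(max_num_batched_tokens=0)` raises `ValueError` before any sizing or warmup code runs.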

superxf · 2026-06-29T12:26:21Z

+            print("[register_model] kernel compiled, creating runner …", flush=True)
            runner = self._create_runner(model_id, compiled)
-            runner.init_kv_cache(model_id, record.config, record.runtime)
+            print("[register_model] runner created, init kv cache …", flush=True)


Please use the logger consistently for these messages and avoid print.

superxf · 2026-06-29T12:42:12Z

+        # Defence in depth: stop once the full sequence (prompt + generated)
+        # reaches max_seq_len, regardless of max_new_tokens.
+        if request.num_tokens >= self.config.max_seq_len:
+            return RequestStatus.FINISHED_LENGTH


Should the user get a notice when this happens?

Added a log message.

superxf · 2026-06-29T12:46:33Z

+    def sample_batch(
+        self, logits: torch.Tensor, params_list: Sequence[SamplingParams]
+    ) -> list[int]:
+        """Sample one token per row from a ``[N, vocab]`` logits block.


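Let's hold off on wiring in the sampler until the NPU version is implemented; switching the current CPU version to batch mode doesn't bring a noticeable performance gain.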

superxf · 2026-06-29T12:58:20Z

+            * runtime.page_size * config.head_dim * dtype_bytes
+        )
+        fraction = getattr(runtime, "kv_cache_memory_fraction", 0.90)
+        kv_budget = int(free_bytes * fraction)


If this is meant to match vLLM's `gpu_memory_utilization`, it should be total device memory * fraction - free_bytes.

Changed to provide a --npu-memory-utilization parameter, aligned with vLLM.
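
For reference, the sizing rule stated in the PR description (total × npu_memory_utilization − peak_non_kv) differs from the earlier `free_bytes * fraction` in that the utilization applies to total device memory rather than to whatever is free. A small worked sketch, with hypothetical helper names and made-up numbers; the halve-retry on OOM and the max_batch floor are deliberately left out:

```python
def kv_budget_bytes(
    total_bytes: int, npu_memory_utilization: float, peak_non_kv_bytes: int
) -> int:
    """Share of *total* device memory, minus the non-KV peak measured after warmup."""
    return int(total_bytes * npu_memory_utilization) - peak_non_kv_bytes


def num_kv_pages(budget_bytes: int, bytes_per_page: int) -> int:
    """Whole pages that fit in the budget (0 if the budget is not positive)."""
    return max(budget_bytes, 0) // bytes_per_page


# Illustrative numbers only: 1000 bytes of device memory, 50% utilization,
# 100 bytes of weights/arena/activations at peak, 40 bytes per KV page.
budget = kv_budget_bytes(total_bytes=1000, npu_memory_utilization=0.5, peak_non_kv_bytes=100)
pages = num_kv_pages(budget, bytes_per_page=40)
print(budget, pages)  # 400 10
```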

bumble0918 · 2026-07-01T09:20:25Z

+                for i, sr in enumerate(scheduled)
+                if sr.num_computed_tokens + sr.num_new_tokens >= sr.request.num_prompt_tokens
+            ]
+            if finishing:


Use an early return to reduce nesting depth.

sunghajung6688 force-pushed the kv_malloc branch from 082c79f to deb90c4 Compare June 26, 2026 09:03

gemini-code-assist Bot reviewed Jun 26, 2026

sunghajung6688 force-pushed the kv_malloc branch from deb90c4 to e9936d0 Compare June 29, 2026 02:30

coderabbitai Bot reviewed Jun 29, 2026

sunghajung6688 force-pushed the kv_malloc branch 4 times, most recently from 70c1fee to 98cefb0 Compare June 30, 2026 01:40

superxf reviewed Jun 30, 2026

sunghajung6688 force-pushed the kv_malloc branch 6 times, most recently from e5da157 to 13d1d75 Compare June 30, 2026 09:05

sunghajung6688 added 6 commits June 30, 2026 20:50

Config + CLI: npu_memory_utilization, max_num_batched_tokens, bf16 default, remove --max-new-tokens from serving (b8d4cfc)

KV cache: vLLM-style dynamic sizing, profile warmup, halve-retry, memory breakdown, decode bind_dynamic (bbfd7d6)

Scheduler: vLLM-style admission (reject over max_seq_len), generation cap, prefix cache full-hit fix (e6a699d)

Serving worker: torch-based prefill/decode, num_pages sync, logging config (dd60e9a)

Server/engine: request lifecycle logs, chat_template_kwargs, 400 handler, skip_special_tokens (66791a2)

Tests: update for removed --max-new-tokens, _FakeWorker.run, warmup mock (d38b822)

sunghajung6688 force-pushed the kv_malloc branch from 13d1d75 to d38b822 Compare July 1, 2026 02:03

bumble0918 reviewed Jul 1, 2026
