feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress) by dusterbloom · Pull Request #184 · panbanda/higgs

dusterbloom · 2026-06-10T11:23:01Z

What

Streams prefill progress to the client during long prompt processing, using a llama.cpp-compatible prompt_progress field on streaming chunks. Before this, a long prefill looked like a silent hang until the first decoded token; now the client sees incremental progress as the prompt is consumed.

A progress sink in higgs-models reports tokens-processed / total during (chunked) prefill.
The engine forwards those updates; the OpenAI-compatible SSE layer emits them as prompt_progress on the stream.

Testing

cargo build -p higgs — clean.
Independent of any other in-flight PR; applies and builds directly on main.

This was extracted as a standalone change from a local integration branch; it touches only the prefill-progress path.

Summary by CodeRabbit

New Features
- Added return_progress parameter to streaming chat completion requests, enabling real-time progress updates during prompt processing via prompt_progress chunks.
- Progress chunks include processed token count, cached tokens, total tokens, and processing time, providing visibility into prefill execution phases.

…ogress Long prefills (a 14k-token context takes ~60s at ~250 tok/s) were a blind wait for clients: nothing streams until the first token. Emit progress events from the chunked-prefill loops so clients can render a true percentage, including how much the prefix cache covered. - higgs-models: new progress module — a thread-scoped sink the chunked prefill loops report (processed, total) into after each ~1024-token chunk. Thread-local because threading a callback through every forward_chunked signature (AnyModel zoo + per-model overrides + their test callers) would churn a dozen call sites for one optional observer; engines run generation on a dedicated blocking thread, so the scoping is exact. Hooked in both loops (generic KV + Qwen3Next hybrid). - higgs-engine: StreamingOutput gains prefill_progress: Option<PrefillProgress {processed, cached, total}>. SimpleEngine installs the sink around run_prefill (RAII guard, dropped before decode), maps suffix-relative chunk completions to absolute prompt position via the prefix-cache hit length, and emits an initial event so clients learn total + cache split before the first chunk lands. try_send throughout — a slow consumer can never stall prefill. - higgs (server): requests opt in with "return_progress": true (llama.cpp-compatible). Progress outputs become {"choices":[],"prompt_progress":{"total","cache","processed", "time_ms"}} SSE chunks and never reach the delta/tool trackers. Verified live against Qwen3.6-35B-A3B-4bit: a 2.5k-token prompt streams 0 -> 1024 -> 2048 of 2514 with timings; cache-hit requests start at the cached fraction. cargo test: higgs 458, higgs-engine 277, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-10T11:23:08Z

Warning

Review limit reached

@dusterbloom, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 3 minutes and 52 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b98e425-3e94-4435-b393-9cfc173ab9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 6f01c64 and 5bf1add.

📒 Files selected for processing (8)

crates/higgs-engine/src/batch_engine.rs
crates/higgs-engine/src/simple.rs
crates/higgs-models/src/progress.rs
crates/higgs-models/src/qwen3_next.rs
crates/higgs/src/routes/anthropic.rs
crates/higgs/src/routes/chat.rs
crates/higgs/src/sse.rs
crates/higgs/src/state.rs

📝 Walkthrough

Walkthrough

Adds opt-in prefill progress reporting to streaming requests. Defines PrefillProgress token-count types, implements a thread-local sink mechanism at the model level, integrates reporting into chunked prefill loops, updates engine output contracts, installs sinks in SimpleEngine streaming paths, and exposes progress chunks to HTTP clients via a return_progress request flag.

Changes

Prefill Progress Reporting for Streaming Requests

Layer / File(s)	Summary
Progress type definitions and StreamingOutput contract `crates/higgs-engine/src/engine.rs`, `crates/higgs/src/types/openai.rs`	`PrefillProgress` struct with `processed`, `cached`, `total` token counts. `StreamingOutput` extended with optional `prefill_progress` field. `ChatCompletionRequest` gains `return_progress` boolean flag to control progress chunk emission.
Thread-local prefill progress sink mechanism `crates/higgs-models/src/lib.rs`, `crates/higgs-models/src/progress.rs`	Public `install_prefill_progress_sink` function returns RAII `PrefillSinkGuard`. Crate-visible `report_prefill_progress` invokes the installed sink or no-ops. Unit tests verify scoped behavior and silent no-op after guard drop.
Model integration: calling report_prefill_progress at chunk boundaries `crates/higgs-models/src/lib.rs`, `crates/higgs-models/src/qwen3_next.rs`	`AnyModel::forward_chunked` and `qwen3_next::forward_chunked` call `report_prefill_progress(offset, total)` after each chunk advances the offset, emitting progress during prefill.
Batch engine StreamingOutput updates `crates/higgs-engine/src/batch_engine.rs`, `crates/higgs-engine/src/engine.rs`	All `StreamingOutput` emissions throughout batch decode paths (early returns, error cases, prefill completions, decode steps) set `prefill_progress: None`. Unit tests updated to include the field.
SimpleEngine streaming implementation with progress sink `crates/higgs-engine/src/simple.rs`	`generate_streaming_inner` installs a prefill progress sink before `run_prefill`, forwards chunk-complete updates to the streaming channel with elapsed-time tracking, and explicitly drops the guard after prefill. All existing streaming outputs include `prefill_progress: None`.
HTTP request/response layer for progress chunks `crates/higgs/src/sse.rs`, `crates/higgs/src/routes/chat.rs`	`ChatChunkWriter::write_prompt_progress` serializes progress chunks to SSE format. `chat_completions_stream` reads `return_progress` flag and conditionally sends `prompt_progress` events when progress is present and enabled.
API documentation `README.md`	Documents the streaming `return_progress` option and `prompt_progress` chunk fields (`total`, `cache`, `processed`, `time_ms`) during chunked prefill.

Sequence Diagram

sequenceDiagram
  participant Client
  participant ChatHandler
  participant SimpleEngine
  participant AnyModel
  participant ProgressSink
  Client->>ChatHandler: POST return_progress=true
  ChatHandler->>SimpleEngine: generate_streaming
  SimpleEngine->>ProgressSink: install_prefill_progress_sink
  ProgressSink->>SimpleEngine: PrefillSinkGuard
  SimpleEngine->>AnyModel: run_prefill
  loop Each chunk
    AnyModel->>ProgressSink: report_prefill_progress(offset,total)
    ProgressSink->>SimpleEngine: forward progress to channel
    SimpleEngine->>ChatHandler: StreamingOutput with prefill_progress
  end
  SimpleEngine->>ProgressSink: drop guard
  ChatHandler->>Client: SSE prompt_progress chunks

🎯 3 (Moderate) | ⏱️ ~25 minutes

risk: medium

🐰 Progress tracked, chunk by chunk we go,
Thread-local sinks now watch the prefill flow.
Models report, engines relay with care,
Streaming clients peek—prompt progress laid bare! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)' directly and specifically describes the main change: streaming prefill progress as a llama.cpp-compatible feature.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

The prefill-progress sink closure tripped cargo fmt --check in CI. No behavior change — pure rustfmt reflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (3)

crates/higgs-models/src/progress.rs (2)
42-48: 💤 Low value

Document the double-borrow hazard in the sink contract.

If a user-provided sink recursively calls report_prefill_progress or tries to install a new sink, the borrow_mut() at line 44 will panic due to RefCell's double-borrow check. While unlikely in practice, adding a brief note in the doc comment for install_prefill_progress_sink (e.g., "The sink must not recursively report progress or reinstall sinks") would prevent confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs-models/src/progress.rs` around lines 42 - 48, The current sink
contract can panic due to RefCell's double-borrow when a user-provided sink
recursively calls report_prefill_progress or attempts to install a new sink
while the sink is already being invoked; update the documentation for
install_prefill_progress_sink to explicitly warn that sinks must not recursively
call report_prefill_progress or reinstall/alter PREFILL_SINK during invocation
(mention PREFILL_SINK, report_prefill_progress, install_prefill_progress_sink
and the RefCell borrow_mut double-borrow hazard) so callers know to avoid
reentrancy that would trigger a panic.
29-37: ⚡ Quick win

Clarify "suffix-relative tokens" in the doc comment.

The phrase "suffix-relative tokens" in line 31 is ambiguous. Based on the call site in lib.rs line 362, processed is the cumulative offset (tokens processed so far), not a chunk-relative count. Consider rephrasing to "cumulative tokens processed" or "absolute prompt position" for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs-models/src/progress.rs` around lines 29 - 37, The doc comment
for install_prefill_progress_sink is ambiguous: replace "suffix-relative tokens"
with clearer wording indicating that the sink receives the cumulative offset
(tokens processed so far) and the total absolute prompt length (e.g., "the sink
receives (processed, total) where processed is the cumulative number of tokens
processed so far and total is the absolute prompt length"), and keep references
to PREFILL_SINK and PrefillSinkGuard unchanged.
crates/higgs/src/sse.rs (1)
100-118: ⚡ Quick win

Add regression coverage for manual prompt_progress serialization (Line 100).

This path hand-builds JSON but currently has no test equivalent to the existing serde parity tests, so contract drift can slip in unnoticed. Please add a focused test asserting exact emitted JSON shape for prompt_progress.
Proposed test addition
+    #[test]
+    fn chat_prompt_progress_chunk_shape() {
+        let mut w = ChatChunkWriter::new("chatcmpl-p", 123, "qwen3");
+        let got = w.write_prompt_progress(100, 20, 60, 42).to_owned();
+        let expected = r#"{"id":"chatcmpl-p","object":"chat.completion.chunk","created":123,"model":"qwen3","choices":[],"prompt_progress":{"total":100,"cache":20,"processed":60,"time_ms":42}}"#;
+        assert_eq!(got, expected);
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs/src/sse.rs` around lines 100 - 118, Add a unit test that covers
the manual JSON construction in write_prompt_progress: instantiate the same
writer type used in sse.rs (so the test can access pub(crate) API or be placed
in the same module), call write_prompt_progress with fixed sample values (e.g.
total=10, cache=2, processed=5, time_ms=123) and assert the returned &str
exactly equals the expected JSON string built from self.head +
r#"","choices":[],"prompt_progress":{"total":10,"cache":2,"processed":5,"time_ms":123}}}"#;
this will lock the emitted shape and prevent regressive drift. Ensure the test
lives in the same crate/module (so it can access write_prompt_progress) and uses
a deterministic head value or sets head before calling.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-engine/src/simple.rs`:
- Around line 2216-2244: The code unconditionally installs a prefill-progress
sink and emits a progress-only StreamingOutput (via make_progress_output,
install_prefill_progress_sink, and the initial sender.try_send) for every
SimpleEngine streaming call; instead, add and thread a boolean opt-in flag from
the request down into the SimpleEngine streaming path and guard both the sink
installation (higgs_models::progress::install_prefill_progress_sink) and the
initial try_send(make_progress_output(0)) behind that flag so prefill_progress
is only emitted when the caller explicitly requests it (leave BatchEngine
behavior unchanged).

In `@crates/higgs-models/src/qwen3_next.rs`:
- Line 3803: The progress reporting currently calls
crate::progress::report_prefill_progress(offset, T) only for intermediate chunks
and omits a final update after the last chunk, so clients never see 100%; update
the code in the same function surrounding the prefill loop (the place that calls
crate::progress::report_prefill_progress(offset, T)) to invoke a final
crate::progress::report_prefill_progress(total_offset, T) (or
report_prefill_progress(T, T)) after processing the last chunk so the terminal
progress reaches 100%—ensure you use the same offset/T variables used in the
loop to compute completion.

In `@README.md`:
- Around line 195-197: Add a concrete example request/config showing how to
enable the new streaming flag (use the "return_progress" field set to true in a
sample JSON request or client call) and append a short reference table
describing the `prompt_progress` chunk fields (`total`, `cache`, `processed`,
`time_ms`) with brief meanings and types; update the README section that
mentions "return_progress" and `prompt_progress` so readers can copy a runnable
example and understand each field.

---

Nitpick comments:
In `@crates/higgs-models/src/progress.rs`:
- Around line 42-48: The current sink contract can panic due to RefCell's
double-borrow when a user-provided sink recursively calls
report_prefill_progress or attempts to install a new sink while the sink is
already being invoked; update the documentation for
install_prefill_progress_sink to explicitly warn that sinks must not recursively
call report_prefill_progress or reinstall/alter PREFILL_SINK during invocation
(mention PREFILL_SINK, report_prefill_progress, install_prefill_progress_sink
and the RefCell borrow_mut double-borrow hazard) so callers know to avoid
reentrancy that would trigger a panic.
- Around line 29-37: The doc comment for install_prefill_progress_sink is
ambiguous: replace "suffix-relative tokens" with clearer wording indicating that
the sink receives the cumulative offset (tokens processed so far) and the total
absolute prompt length (e.g., "the sink receives (processed, total) where
processed is the cumulative number of tokens processed so far and total is the
absolute prompt length"), and keep references to PREFILL_SINK and
PrefillSinkGuard unchanged.

In `@crates/higgs/src/sse.rs`:
- Around line 100-118: Add a unit test that covers the manual JSON construction
in write_prompt_progress: instantiate the same writer type used in sse.rs (so
the test can access pub(crate) API or be placed in the same module), call
write_prompt_progress with fixed sample values (e.g. total=10, cache=2,
processed=5, time_ms=123) and assert the returned &str exactly equals the
expected JSON string built from self.head +
r#"","choices":[],"prompt_progress":{"total":10,"cache":2,"processed":5,"time_ms":123}}}"#;
this will lock the emitted shape and prevent regressive drift. Ensure the test
lives in the same crate/module (so it can access write_prompt_progress) and uses
a deterministic head value or sets head before calling.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8feb9f4d-0f20-4f53-90dd-9f6361c750d0

📥 Commits

Reviewing files that changed from the base of the PR and between f6e3c2f and 6f01c64.

📒 Files selected for processing (10)

README.md
crates/higgs-engine/src/batch_engine.rs
crates/higgs-engine/src/engine.rs
crates/higgs-engine/src/simple.rs
crates/higgs-models/src/lib.rs
crates/higgs-models/src/progress.rs
crates/higgs-models/src/qwen3_next.rs
crates/higgs/src/routes/chat.rs
crates/higgs/src/sse.rs
crates/higgs/src/types/openai.rs

coderabbitai · 2026-06-11T13:12:18Z

+  - Streaming requests may set `"return_progress": true` to receive
+    llama.cpp-compatible `prompt_progress` chunks (`{total, cache, processed,
+    time_ms}`) during chunked prefill.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

README update is incomplete for a new user-facing stream option (Line 195).

The new flag is documented, but this section still needs a concrete config/request example and a small field reference table for prompt_progress to satisfy the docs requirement for user-facing surface changes.

As per coding guidelines, “README.md: When changing user-facing behavior (config fields, CLI flags, API surface), update README.md with config examples and reference tables”.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 195 - 197, Add a concrete example request/config showing how to enable the new streaming flag (use the "return_progress" field set to true in a sample JSON request or client call) and append a short reference table describing the `prompt_progress` chunk fields (`total`, `cache`, `processed`, `time_ms`) with brief meanings and types; update the README section that mentions "return_progress" and `prompt_progress` so readers can copy a runnable example and understand each field.

Source: Coding guidelines

cargo fmt unmasked clippy as_conversions/cast_* (deny) in the prefill progress code — the Lint job had been bailing at the fmt step. Replace the two `as u32` casts with u32::try_from().unwrap_or(...) matching the codebase convention (simple.rs:2578). Token counts always fit u32; the fallbacks are unreachable saturation guards. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses CodeRabbit panbanda#184: SimpleEngine installed the prefill-progress sink and pushed progress-only StreamingOutputs into the channel for *every* streaming call, changing the engine-level streaming contract for all direct SimpleEngine consumers and diverging from BatchEngine (which emits none). The HTTP layer already gated client-visible chunks on return_progress, but the engine did the work regardless. Thread the existing return_progress request flag through generate_streaming_with_thinking into generate_streaming_inner, and wrap the sink install + initial event in return_progress.then(...). When off (the default; all non-chat routes pass false): no sink, no channel clone, no events — the original progress-free contract is preserved. BatchEngine accepts and ignores the flag to keep the shared streaming interface. Verified: cargo fmt, clippy --all-targets --all-features -Dwarnings, higgs-engine (244) + higgs (458) tests all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remaining CodeRabbit panbanda#184 review items: - qwen3_next chunked prefill reported progress only up to the final chunk boundary; emit a terminal report_prefill_progress(T, T) after the last chunk so clients reach 100% instead of stalling just short of total. - progress.rs docs: clarify the sink's processed value is cumulative within the forwarded suffix (not a per-chunk delta, not an absolute prompt offset), and document the RefCell reentrancy hazard (sinks must not re-enter the progress machinery or reinstall the sink). - sse.rs: add a write_prompt_progress unit test asserting the hand-built chunk parses as JSON and matches the exact llama.cpp-compatible wire shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

style(serve): cargo fmt to clear Lint check

6f01c64

The prefill-progress sink closure tripped cargo fmt --check in CI. No behavior change — pure rustfmt reflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot requested changes Jun 11, 2026

View reviewed changes

dusterbloom and others added 3 commits June 11, 2026 15:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)#184

feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)#184
dusterbloom wants to merge 5 commits into
panbanda:mainfrom
dusterbloom:dusterbloom/serve-prefill-progress

dusterbloom commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Sequence Diagram

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dusterbloom commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dusterbloom commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading