Skip to content

feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)#184

Open
dusterbloom wants to merge 5 commits into
panbanda:mainfrom
dusterbloom:dusterbloom/serve-prefill-progress
Open

feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)#184
dusterbloom wants to merge 5 commits into
panbanda:mainfrom
dusterbloom:dusterbloom/serve-prefill-progress

Conversation

@dusterbloom

@dusterbloom dusterbloom commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What

Streams prefill progress to the client during long prompt processing, using a llama.cpp-compatible prompt_progress field on streaming chunks. Before this, a long prefill looked like a silent hang until the first decoded token; now the client sees incremental progress as the prompt is consumed.

  • A progress sink in higgs-models reports tokens-processed / total during (chunked) prefill.
  • The engine forwards those updates; the OpenAI-compatible SSE layer emits them as prompt_progress on the stream.

Testing

  • cargo build -p higgs — clean.
  • Independent of any other in-flight PR; applies and builds directly on main.

This was extracted as a standalone change from a local integration branch; it touches only the prefill-progress path.

Summary by CodeRabbit

  • New Features
    • Added return_progress parameter to streaming chat completion requests, enabling real-time progress updates during prompt processing via prompt_progress chunks.
    • Progress chunks include processed token count, cached tokens, total tokens, and processing time, providing visibility into prefill execution phases.

…ogress

Long prefills (a 14k-token context takes ~60s at ~250 tok/s) were a
blind wait for clients: nothing streams until the first token. Emit
progress events from the chunked-prefill loops so clients can render a
true percentage, including how much the prefix cache covered.

- higgs-models: new progress module — a thread-scoped sink the chunked
  prefill loops report (processed, total) into after each ~1024-token
  chunk. Thread-local because threading a callback through every
  forward_chunked signature (AnyModel zoo + per-model overrides + their
  test callers) would churn a dozen call sites for one optional
  observer; engines run generation on a dedicated blocking thread, so
  the scoping is exact. Hooked in both loops (generic KV + Qwen3Next
  hybrid).
- higgs-engine: StreamingOutput gains prefill_progress:
  Option<PrefillProgress {processed, cached, total}>. SimpleEngine
  installs the sink around run_prefill (RAII guard, dropped before
  decode), maps suffix-relative chunk completions to absolute prompt
  position via the prefix-cache hit length, and emits an initial event
  so clients learn total + cache split before the first chunk lands.
  try_send throughout — a slow consumer can never stall prefill.
- higgs (server): requests opt in with "return_progress": true
  (llama.cpp-compatible). Progress outputs become
  {"choices":[],"prompt_progress":{"total","cache","processed",
  "time_ms"}} SSE chunks and never reach the delta/tool trackers.

Verified live against Qwen3.6-35B-A3B-4bit: a 2.5k-token prompt streams
0 -> 1024 -> 2048 of 2514 with timings; cache-hit requests start at the
cached fraction. cargo test: higgs 458, higgs-engine 277, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@dusterbloom, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 3 minutes and 52 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b98e425-3e94-4435-b393-9cfc173ab9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 6f01c64 and 5bf1add.

📒 Files selected for processing (8)
  • crates/higgs-engine/src/batch_engine.rs
  • crates/higgs-engine/src/simple.rs
  • crates/higgs-models/src/progress.rs
  • crates/higgs-models/src/qwen3_next.rs
  • crates/higgs/src/routes/anthropic.rs
  • crates/higgs/src/routes/chat.rs
  • crates/higgs/src/sse.rs
  • crates/higgs/src/state.rs
📝 Walkthrough

Walkthrough

Adds opt-in prefill progress reporting to streaming requests. Defines PrefillProgress token-count types, implements a thread-local sink mechanism at the model level, integrates reporting into chunked prefill loops, updates engine output contracts, installs sinks in SimpleEngine streaming paths, and exposes progress chunks to HTTP clients via a return_progress request flag.

Changes

Prefill Progress Reporting for Streaming Requests

Layer / File(s) Summary
Progress type definitions and StreamingOutput contract
crates/higgs-engine/src/engine.rs, crates/higgs/src/types/openai.rs
PrefillProgress struct with processed, cached, total token counts. StreamingOutput extended with optional prefill_progress field. ChatCompletionRequest gains return_progress boolean flag to control progress chunk emission.
Thread-local prefill progress sink mechanism
crates/higgs-models/src/lib.rs, crates/higgs-models/src/progress.rs
Public install_prefill_progress_sink function returns RAII PrefillSinkGuard. Crate-visible report_prefill_progress invokes the installed sink or no-ops. Unit tests verify scoped behavior and silent no-op after guard drop.
Model integration: calling report_prefill_progress at chunk boundaries
crates/higgs-models/src/lib.rs, crates/higgs-models/src/qwen3_next.rs
AnyModel::forward_chunked and qwen3_next::forward_chunked call report_prefill_progress(offset, total) after each chunk advances the offset, emitting progress during prefill.
Batch engine StreamingOutput updates
crates/higgs-engine/src/batch_engine.rs, crates/higgs-engine/src/engine.rs
All StreamingOutput emissions throughout batch decode paths (early returns, error cases, prefill completions, decode steps) set prefill_progress: None. Unit tests updated to include the field.
SimpleEngine streaming implementation with progress sink
crates/higgs-engine/src/simple.rs
generate_streaming_inner installs a prefill progress sink before run_prefill, forwards chunk-complete updates to the streaming channel with elapsed-time tracking, and explicitly drops the guard after prefill. All existing streaming outputs include prefill_progress: None.
HTTP request/response layer for progress chunks
crates/higgs/src/sse.rs, crates/higgs/src/routes/chat.rs
ChatChunkWriter::write_prompt_progress serializes progress chunks to SSE format. chat_completions_stream reads return_progress flag and conditionally sends prompt_progress events when progress is present and enabled.
API documentation
README.md
Documents the streaming return_progress option and prompt_progress chunk fields (total, cache, processed, time_ms) during chunked prefill.

Sequence Diagram

sequenceDiagram
  participant Client
  participant ChatHandler
  participant SimpleEngine
  participant AnyModel
  participant ProgressSink
  Client->>ChatHandler: POST return_progress=true
  ChatHandler->>SimpleEngine: generate_streaming
  SimpleEngine->>ProgressSink: install_prefill_progress_sink
  ProgressSink->>SimpleEngine: PrefillSinkGuard
  SimpleEngine->>AnyModel: run_prefill
  loop Each chunk
    AnyModel->>ProgressSink: report_prefill_progress(offset,total)
    ProgressSink->>SimpleEngine: forward progress to channel
    SimpleEngine->>ChatHandler: StreamingOutput with prefill_progress
  end
  SimpleEngine->>ProgressSink: drop guard
  ChatHandler->>Client: SSE prompt_progress chunks
Loading

🎯 3 (Moderate) | ⏱️ ~25 minutes

risk: medium

🐰 Progress tracked, chunk by chunk we go,
Thread-local sinks now watch the prefill flow.
Models report, engines relay with care,
Streaming clients peek—prompt progress laid bare! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)' directly and specifically describes the main change: streaming prefill progress as a llama.cpp-compatible feature.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

The prefill-progress sink closure tripped cargo fmt --check in CI.
No behavior change — pure rustfmt reflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
crates/higgs-models/src/progress.rs (2)

42-48: 💤 Low value

Document the double-borrow hazard in the sink contract.

If a user-provided sink recursively calls report_prefill_progress or tries to install a new sink, the borrow_mut() at line 44 will panic due to RefCell's double-borrow check. While unlikely in practice, adding a brief note in the doc comment for install_prefill_progress_sink (e.g., "The sink must not recursively report progress or reinstall sinks") would prevent confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs-models/src/progress.rs` around lines 42 - 48, The current sink
contract can panic due to RefCell's double-borrow when a user-provided sink
recursively calls report_prefill_progress or attempts to install a new sink
while the sink is already being invoked; update the documentation for
install_prefill_progress_sink to explicitly warn that sinks must not recursively
call report_prefill_progress or reinstall/alter PREFILL_SINK during invocation
(mention PREFILL_SINK, report_prefill_progress, install_prefill_progress_sink
and the RefCell borrow_mut double-borrow hazard) so callers know to avoid
reentrancy that would trigger a panic.

29-37: ⚡ Quick win

Clarify "suffix-relative tokens" in the doc comment.

The phrase "suffix-relative tokens" in line 31 is ambiguous. Based on the call site in lib.rs line 362, processed is the cumulative offset (tokens processed so far), not a chunk-relative count. Consider rephrasing to "cumulative tokens processed" or "absolute prompt position" for clarity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs-models/src/progress.rs` around lines 29 - 37, The doc comment
for install_prefill_progress_sink is ambiguous: replace "suffix-relative tokens"
with clearer wording indicating that the sink receives the cumulative offset
(tokens processed so far) and the total absolute prompt length (e.g., "the sink
receives (processed, total) where processed is the cumulative number of tokens
processed so far and total is the absolute prompt length"), and keep references
to PREFILL_SINK and PrefillSinkGuard unchanged.
crates/higgs/src/sse.rs (1)

100-118: ⚡ Quick win

Add regression coverage for manual prompt_progress serialization (Line 100).

This path hand-builds JSON but currently has no test equivalent to the existing serde parity tests, so contract drift can slip in unnoticed. Please add a focused test asserting exact emitted JSON shape for prompt_progress.

Proposed test addition
+    #[test]
+    fn chat_prompt_progress_chunk_shape() {
+        let mut w = ChatChunkWriter::new("chatcmpl-p", 123, "qwen3");
+        let got = w.write_prompt_progress(100, 20, 60, 42).to_owned();
+        let expected = r#"{"id":"chatcmpl-p","object":"chat.completion.chunk","created":123,"model":"qwen3","choices":[],"prompt_progress":{"total":100,"cache":20,"processed":60,"time_ms":42}}"#;
+        assert_eq!(got, expected);
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/higgs/src/sse.rs` around lines 100 - 118, Add a unit test that covers
the manual JSON construction in write_prompt_progress: instantiate the same
writer type used in sse.rs (so the test can access pub(crate) API or be placed
in the same module), call write_prompt_progress with fixed sample values (e.g.
total=10, cache=2, processed=5, time_ms=123) and assert the returned &str
exactly equals the expected JSON string built from self.head +
r#"","choices":[],"prompt_progress":{"total":10,"cache":2,"processed":5,"time_ms":123}}}"#;
this will lock the emitted shape and prevent regressive drift. Ensure the test
lives in the same crate/module (so it can access write_prompt_progress) and uses
a deterministic head value or sets head before calling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/higgs-engine/src/simple.rs`:
- Around line 2216-2244: The code unconditionally installs a prefill-progress
sink and emits a progress-only StreamingOutput (via make_progress_output,
install_prefill_progress_sink, and the initial sender.try_send) for every
SimpleEngine streaming call; instead, add and thread a boolean opt-in flag from
the request down into the SimpleEngine streaming path and guard both the sink
installation (higgs_models::progress::install_prefill_progress_sink) and the
initial try_send(make_progress_output(0)) behind that flag so prefill_progress
is only emitted when the caller explicitly requests it (leave BatchEngine
behavior unchanged).

In `@crates/higgs-models/src/qwen3_next.rs`:
- Line 3803: The progress reporting currently calls
crate::progress::report_prefill_progress(offset, T) only for intermediate chunks
and omits a final update after the last chunk, so clients never see 100%; update
the code in the same function surrounding the prefill loop (the place that calls
crate::progress::report_prefill_progress(offset, T)) to invoke a final
crate::progress::report_prefill_progress(total_offset, T) (or
report_prefill_progress(T, T)) after processing the last chunk so the terminal
progress reaches 100%—ensure you use the same offset/T variables used in the
loop to compute completion.

In `@README.md`:
- Around line 195-197: Add a concrete example request/config showing how to
enable the new streaming flag (use the "return_progress" field set to true in a
sample JSON request or client call) and append a short reference table
describing the `prompt_progress` chunk fields (`total`, `cache`, `processed`,
`time_ms`) with brief meanings and types; update the README section that
mentions "return_progress" and `prompt_progress` so readers can copy a runnable
example and understand each field.

---

Nitpick comments:
In `@crates/higgs-models/src/progress.rs`:
- Around line 42-48: The current sink contract can panic due to RefCell's
double-borrow when a user-provided sink recursively calls
report_prefill_progress or attempts to install a new sink while the sink is
already being invoked; update the documentation for
install_prefill_progress_sink to explicitly warn that sinks must not recursively
call report_prefill_progress or reinstall/alter PREFILL_SINK during invocation
(mention PREFILL_SINK, report_prefill_progress, install_prefill_progress_sink
and the RefCell borrow_mut double-borrow hazard) so callers know to avoid
reentrancy that would trigger a panic.
- Around line 29-37: The doc comment for install_prefill_progress_sink is
ambiguous: replace "suffix-relative tokens" with clearer wording indicating that
the sink receives the cumulative offset (tokens processed so far) and the total
absolute prompt length (e.g., "the sink receives (processed, total) where
processed is the cumulative number of tokens processed so far and total is the
absolute prompt length"), and keep references to PREFILL_SINK and
PrefillSinkGuard unchanged.

In `@crates/higgs/src/sse.rs`:
- Around line 100-118: Add a unit test that covers the manual JSON construction
in write_prompt_progress: instantiate the same writer type used in sse.rs (so
the test can access pub(crate) API or be placed in the same module), call
write_prompt_progress with fixed sample values (e.g. total=10, cache=2,
processed=5, time_ms=123) and assert the returned &str exactly equals the
expected JSON string built from self.head +
r#"","choices":[],"prompt_progress":{"total":10,"cache":2,"processed":5,"time_ms":123}}}"#;
this will lock the emitted shape and prevent regressive drift. Ensure the test
lives in the same crate/module (so it can access write_prompt_progress) and uses
a deterministic head value or sets head before calling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8feb9f4d-0f20-4f53-90dd-9f6361c750d0

📥 Commits

Reviewing files that changed from the base of the PR and between f6e3c2f and 6f01c64.

📒 Files selected for processing (10)
  • README.md
  • crates/higgs-engine/src/batch_engine.rs
  • crates/higgs-engine/src/engine.rs
  • crates/higgs-engine/src/simple.rs
  • crates/higgs-models/src/lib.rs
  • crates/higgs-models/src/progress.rs
  • crates/higgs-models/src/qwen3_next.rs
  • crates/higgs/src/routes/chat.rs
  • crates/higgs/src/sse.rs
  • crates/higgs/src/types/openai.rs

Comment thread crates/higgs-engine/src/simple.rs Outdated
Comment thread crates/higgs-models/src/qwen3_next.rs
Comment thread README.md
Comment on lines +195 to +197
- Streaming requests may set `"return_progress": true` to receive
llama.cpp-compatible `prompt_progress` chunks (`{total, cache, processed,
time_ms}`) during chunked prefill.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

README update is incomplete for a new user-facing stream option (Line 195).

The new flag is documented, but this section still needs a concrete config/request example and a small field reference table for prompt_progress to satisfy the docs requirement for user-facing surface changes.

As per coding guidelines, “README.md: When changing user-facing behavior (config fields, CLI flags, API surface), update README.md with config examples and reference tables”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 195 - 197, Add a concrete example request/config
showing how to enable the new streaming flag (use the "return_progress" field
set to true in a sample JSON request or client call) and append a short
reference table describing the `prompt_progress` chunk fields (`total`, `cache`,
`processed`, `time_ms`) with brief meanings and types; update the README section
that mentions "return_progress" and `prompt_progress` so readers can copy a
runnable example and understand each field.

Source: Coding guidelines

dusterbloom and others added 3 commits June 11, 2026 15:18
cargo fmt unmasked clippy as_conversions/cast_* (deny) in the prefill
progress code — the Lint job had been bailing at the fmt step. Replace the
two `as u32` casts with u32::try_from().unwrap_or(...) matching the
codebase convention (simple.rs:2578). Token counts always fit u32; the
fallbacks are unreachable saturation guards. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses CodeRabbit panbanda#184: SimpleEngine installed the prefill-progress sink
and pushed progress-only StreamingOutputs into the channel for *every*
streaming call, changing the engine-level streaming contract for all direct
SimpleEngine consumers and diverging from BatchEngine (which emits none). The
HTTP layer already gated client-visible chunks on return_progress, but the
engine did the work regardless.

Thread the existing return_progress request flag through
generate_streaming_with_thinking into generate_streaming_inner, and wrap the
sink install + initial event in return_progress.then(...). When off (the
default; all non-chat routes pass false): no sink, no channel clone, no events
— the original progress-free contract is preserved. BatchEngine accepts and
ignores the flag to keep the shared streaming interface.

Verified: cargo fmt, clippy --all-targets --all-features -Dwarnings,
higgs-engine (244) + higgs (458) tests all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remaining CodeRabbit panbanda#184 review items:
- qwen3_next chunked prefill reported progress only up to the final chunk
  boundary; emit a terminal report_prefill_progress(T, T) after the last chunk
  so clients reach 100% instead of stalling just short of total.
- progress.rs docs: clarify the sink's processed value is cumulative within the
  forwarded suffix (not a per-chunk delta, not an absolute prompt offset), and
  document the RefCell reentrancy hazard (sinks must not re-enter the progress
  machinery or reinstall the sink).
- sse.rs: add a write_prompt_progress unit test asserting the hand-built chunk
  parses as JSON and matches the exact llama.cpp-compatible wire shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant