feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)#184
feat(serve): stream prefill progress (llama.cpp-compatible prompt_progress)#184dusterbloom wants to merge 5 commits into
Conversation
…ogress
Long prefills (a 14k-token context takes ~60s at ~250 tok/s) were a
blind wait for clients: nothing streams until the first token. Emit
progress events from the chunked-prefill loops so clients can render a
true percentage, including how much the prefix cache covered.
- higgs-models: new progress module — a thread-scoped sink the chunked
prefill loops report (processed, total) into after each ~1024-token
chunk. Thread-local because threading a callback through every
forward_chunked signature (AnyModel zoo + per-model overrides + their
test callers) would churn a dozen call sites for one optional
observer; engines run generation on a dedicated blocking thread, so
the scoping is exact. Hooked in both loops (generic KV + Qwen3Next
hybrid).
- higgs-engine: StreamingOutput gains prefill_progress:
Option<PrefillProgress {processed, cached, total}>. SimpleEngine
installs the sink around run_prefill (RAII guard, dropped before
decode), maps suffix-relative chunk completions to absolute prompt
position via the prefix-cache hit length, and emits an initial event
so clients learn total + cache split before the first chunk lands.
try_send throughout — a slow consumer can never stall prefill.
- higgs (server): requests opt in with "return_progress": true
(llama.cpp-compatible). Progress outputs become
{"choices":[],"prompt_progress":{"total","cache","processed",
"time_ms"}} SSE chunks and never reach the delta/tool trackers.
Verified live against Qwen3.6-35B-A3B-4bit: a 2.5k-token prompt streams
0 -> 1024 -> 2048 of 2514 with timings; cache-hit requests start at the
cached fraction. cargo test: higgs 458, higgs-engine 277, all green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 3 minutes and 52 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughAdds opt-in prefill progress reporting to streaming requests. Defines ChangesPrefill Progress Reporting for Streaming Requests
Sequence DiagramsequenceDiagram
participant Client
participant ChatHandler
participant SimpleEngine
participant AnyModel
participant ProgressSink
Client->>ChatHandler: POST return_progress=true
ChatHandler->>SimpleEngine: generate_streaming
SimpleEngine->>ProgressSink: install_prefill_progress_sink
ProgressSink->>SimpleEngine: PrefillSinkGuard
SimpleEngine->>AnyModel: run_prefill
loop Each chunk
AnyModel->>ProgressSink: report_prefill_progress(offset,total)
ProgressSink->>SimpleEngine: forward progress to channel
SimpleEngine->>ChatHandler: StreamingOutput with prefill_progress
end
SimpleEngine->>ProgressSink: drop guard
ChatHandler->>Client: SSE prompt_progress chunks
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The prefill-progress sink closure tripped cargo fmt --check in CI. No behavior change — pure rustfmt reflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
crates/higgs-models/src/progress.rs (2)
42-48: 💤 Low valueDocument the double-borrow hazard in the sink contract.
If a user-provided sink recursively calls
report_prefill_progressor tries to install a new sink, theborrow_mut()at line 44 will panic due to RefCell's double-borrow check. While unlikely in practice, adding a brief note in the doc comment forinstall_prefill_progress_sink(e.g., "The sink must not recursively report progress or reinstall sinks") would prevent confusion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/higgs-models/src/progress.rs` around lines 42 - 48, The current sink contract can panic due to RefCell's double-borrow when a user-provided sink recursively calls report_prefill_progress or attempts to install a new sink while the sink is already being invoked; update the documentation for install_prefill_progress_sink to explicitly warn that sinks must not recursively call report_prefill_progress or reinstall/alter PREFILL_SINK during invocation (mention PREFILL_SINK, report_prefill_progress, install_prefill_progress_sink and the RefCell borrow_mut double-borrow hazard) so callers know to avoid reentrancy that would trigger a panic.
29-37: ⚡ Quick winClarify "suffix-relative tokens" in the doc comment.
The phrase "suffix-relative tokens" in line 31 is ambiguous. Based on the call site in
lib.rsline 362,processedis the cumulative offset (tokens processed so far), not a chunk-relative count. Consider rephrasing to "cumulative tokens processed" or "absolute prompt position" for clarity.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/higgs-models/src/progress.rs` around lines 29 - 37, The doc comment for install_prefill_progress_sink is ambiguous: replace "suffix-relative tokens" with clearer wording indicating that the sink receives the cumulative offset (tokens processed so far) and the total absolute prompt length (e.g., "the sink receives (processed, total) where processed is the cumulative number of tokens processed so far and total is the absolute prompt length"), and keep references to PREFILL_SINK and PrefillSinkGuard unchanged.crates/higgs/src/sse.rs (1)
100-118: ⚡ Quick winAdd regression coverage for manual
prompt_progressserialization (Line 100).This path hand-builds JSON but currently has no test equivalent to the existing serde parity tests, so contract drift can slip in unnoticed. Please add a focused test asserting exact emitted JSON shape for
prompt_progress.Proposed test addition
+ #[test] + fn chat_prompt_progress_chunk_shape() { + let mut w = ChatChunkWriter::new("chatcmpl-p", 123, "qwen3"); + let got = w.write_prompt_progress(100, 20, 60, 42).to_owned(); + let expected = r#"{"id":"chatcmpl-p","object":"chat.completion.chunk","created":123,"model":"qwen3","choices":[],"prompt_progress":{"total":100,"cache":20,"processed":60,"time_ms":42}}"#; + assert_eq!(got, expected); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/higgs/src/sse.rs` around lines 100 - 118, Add a unit test that covers the manual JSON construction in write_prompt_progress: instantiate the same writer type used in sse.rs (so the test can access pub(crate) API or be placed in the same module), call write_prompt_progress with fixed sample values (e.g. total=10, cache=2, processed=5, time_ms=123) and assert the returned &str exactly equals the expected JSON string built from self.head + r#"","choices":[],"prompt_progress":{"total":10,"cache":2,"processed":5,"time_ms":123}}}"#; this will lock the emitted shape and prevent regressive drift. Ensure the test lives in the same crate/module (so it can access write_prompt_progress) and uses a deterministic head value or sets head before calling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/higgs-engine/src/simple.rs`:
- Around line 2216-2244: The code unconditionally installs a prefill-progress
sink and emits a progress-only StreamingOutput (via make_progress_output,
install_prefill_progress_sink, and the initial sender.try_send) for every
SimpleEngine streaming call; instead, add and thread a boolean opt-in flag from
the request down into the SimpleEngine streaming path and guard both the sink
installation (higgs_models::progress::install_prefill_progress_sink) and the
initial try_send(make_progress_output(0)) behind that flag so prefill_progress
is only emitted when the caller explicitly requests it (leave BatchEngine
behavior unchanged).
In `@crates/higgs-models/src/qwen3_next.rs`:
- Line 3803: The progress reporting currently calls
crate::progress::report_prefill_progress(offset, T) only for intermediate chunks
and omits a final update after the last chunk, so clients never see 100%; update
the code in the same function surrounding the prefill loop (the place that calls
crate::progress::report_prefill_progress(offset, T)) to invoke a final
crate::progress::report_prefill_progress(total_offset, T) (or
report_prefill_progress(T, T)) after processing the last chunk so the terminal
progress reaches 100%—ensure you use the same offset/T variables used in the
loop to compute completion.
In `@README.md`:
- Around line 195-197: Add a concrete example request/config showing how to
enable the new streaming flag (use the "return_progress" field set to true in a
sample JSON request or client call) and append a short reference table
describing the `prompt_progress` chunk fields (`total`, `cache`, `processed`,
`time_ms`) with brief meanings and types; update the README section that
mentions "return_progress" and `prompt_progress` so readers can copy a runnable
example and understand each field.
---
Nitpick comments:
In `@crates/higgs-models/src/progress.rs`:
- Around line 42-48: The current sink contract can panic due to RefCell's
double-borrow when a user-provided sink recursively calls
report_prefill_progress or attempts to install a new sink while the sink is
already being invoked; update the documentation for
install_prefill_progress_sink to explicitly warn that sinks must not recursively
call report_prefill_progress or reinstall/alter PREFILL_SINK during invocation
(mention PREFILL_SINK, report_prefill_progress, install_prefill_progress_sink
and the RefCell borrow_mut double-borrow hazard) so callers know to avoid
reentrancy that would trigger a panic.
- Around line 29-37: The doc comment for install_prefill_progress_sink is
ambiguous: replace "suffix-relative tokens" with clearer wording indicating that
the sink receives the cumulative offset (tokens processed so far) and the total
absolute prompt length (e.g., "the sink receives (processed, total) where
processed is the cumulative number of tokens processed so far and total is the
absolute prompt length"), and keep references to PREFILL_SINK and
PrefillSinkGuard unchanged.
In `@crates/higgs/src/sse.rs`:
- Around line 100-118: Add a unit test that covers the manual JSON construction
in write_prompt_progress: instantiate the same writer type used in sse.rs (so
the test can access pub(crate) API or be placed in the same module), call
write_prompt_progress with fixed sample values (e.g. total=10, cache=2,
processed=5, time_ms=123) and assert the returned &str exactly equals the
expected JSON string built from self.head +
r#"","choices":[],"prompt_progress":{"total":10,"cache":2,"processed":5,"time_ms":123}}}"#;
this will lock the emitted shape and prevent regressive drift. Ensure the test
lives in the same crate/module (so it can access write_prompt_progress) and uses
a deterministic head value or sets head before calling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8feb9f4d-0f20-4f53-90dd-9f6361c750d0
📒 Files selected for processing (10)
README.mdcrates/higgs-engine/src/batch_engine.rscrates/higgs-engine/src/engine.rscrates/higgs-engine/src/simple.rscrates/higgs-models/src/lib.rscrates/higgs-models/src/progress.rscrates/higgs-models/src/qwen3_next.rscrates/higgs/src/routes/chat.rscrates/higgs/src/sse.rscrates/higgs/src/types/openai.rs
| - Streaming requests may set `"return_progress": true` to receive | ||
| llama.cpp-compatible `prompt_progress` chunks (`{total, cache, processed, | ||
| time_ms}`) during chunked prefill. |
There was a problem hiding this comment.
README update is incomplete for a new user-facing stream option (Line 195).
The new flag is documented, but this section still needs a concrete config/request example and a small field reference table for prompt_progress to satisfy the docs requirement for user-facing surface changes.
As per coding guidelines, “README.md: When changing user-facing behavior (config fields, CLI flags, API surface), update README.md with config examples and reference tables”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 195 - 197, Add a concrete example request/config
showing how to enable the new streaming flag (use the "return_progress" field
set to true in a sample JSON request or client call) and append a short
reference table describing the `prompt_progress` chunk fields (`total`, `cache`,
`processed`, `time_ms`) with brief meanings and types; update the README section
that mentions "return_progress" and `prompt_progress` so readers can copy a
runnable example and understand each field.
Source: Coding guidelines
cargo fmt unmasked clippy as_conversions/cast_* (deny) in the prefill progress code — the Lint job had been bailing at the fmt step. Replace the two `as u32` casts with u32::try_from().unwrap_or(...) matching the codebase convention (simple.rs:2578). Token counts always fit u32; the fallbacks are unreachable saturation guards. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses CodeRabbit panbanda#184: SimpleEngine installed the prefill-progress sink and pushed progress-only StreamingOutputs into the channel for *every* streaming call, changing the engine-level streaming contract for all direct SimpleEngine consumers and diverging from BatchEngine (which emits none). The HTTP layer already gated client-visible chunks on return_progress, but the engine did the work regardless. Thread the existing return_progress request flag through generate_streaming_with_thinking into generate_streaming_inner, and wrap the sink install + initial event in return_progress.then(...). When off (the default; all non-chat routes pass false): no sink, no channel clone, no events — the original progress-free contract is preserved. BatchEngine accepts and ignores the flag to keep the shared streaming interface. Verified: cargo fmt, clippy --all-targets --all-features -Dwarnings, higgs-engine (244) + higgs (458) tests all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remaining CodeRabbit panbanda#184 review items: - qwen3_next chunked prefill reported progress only up to the final chunk boundary; emit a terminal report_prefill_progress(T, T) after the last chunk so clients reach 100% instead of stalling just short of total. - progress.rs docs: clarify the sink's processed value is cumulative within the forwarded suffix (not a per-chunk delta, not an absolute prompt offset), and document the RefCell reentrancy hazard (sinks must not re-enter the progress machinery or reinstall the sink). - sse.rs: add a write_prompt_progress unit test asserting the hand-built chunk parses as JSON and matches the exact llama.cpp-compatible wire shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Streams prefill progress to the client during long prompt processing, using a llama.cpp-compatible
prompt_progressfield on streaming chunks. Before this, a long prefill looked like a silent hang until the first decoded token; now the client sees incremental progress as the prompt is consumed.progresssink inhiggs-modelsreports tokens-processed / total during (chunked) prefill.prompt_progresson the stream.Testing
cargo build -p higgs— clean.main.This was extracted as a standalone change from a local integration branch; it touches only the prefill-progress path.
Summary by CodeRabbit
return_progressparameter to streaming chat completion requests, enabling real-time progress updates during prompt processing viaprompt_progresschunks.