fix: add per-call LLM timeout and activity-based branch tracking #556
EZotoff wants to merge 2 commits into spacedriveapp:main
Conversation
Prevents hung API connections from blocking branches, channels, and other LLM processes indefinitely. Two changes:

1. Wrap `prompt_once` and `prompt_once_streaming` in `tokio::time::timeout` (300s default). On timeout, returns `PromptError::CompletionError`, which the existing retry/error paths handle naturally.
2. Add `last_activity_at` to `BranchTracker` (matching `WorkerTracker`) so the cortex supervisor can detect stalled branches by activity age rather than just wall-clock time since spawn.

Also adds debug-level health-tick logging with active/overdue counts for observability.
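The wrap-and-map pattern described above can be sketched without the project's dependencies. This is a std-only simulation (the real code uses `tokio::time::timeout` around an async call); `PromptError` and `call_with_timeout` here are illustrative stand-ins, not the project's API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the project's PromptError::CompletionError.
#[derive(Debug)]
enum PromptError {
    CompletionError(String),
}

// Simulate an LLM call that never returns within the budget, then map the
// timeout to the same error type the existing retry path already understands.
fn call_with_timeout(budget: Duration) -> Result<String, PromptError> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A hung API connection: sleeps far past the budget.
        thread::sleep(Duration::from_secs(3600));
        let _ = tx.send("response".to_string());
    });
    rx.recv_timeout(budget).map_err(|_| {
        PromptError::CompletionError(format!("LLM call timed out after {budget:?}"))
    })
}

fn main() {
    match call_with_timeout(Duration::from_millis(50)) {
        Ok(r) => println!("ok: {r}"),
        Err(PromptError::CompletionError(msg)) => println!("error: {msg}"),
    }
}
```

The key property, as in the PR, is that a timeout produces an ordinary error value rather than a hung task, so no special handling is needed downstream.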
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/cortex.rs (1)
510-515: ⚠️ Potential issue | 🟠 Major

Inconsistent use of `started_at` for branch kill ordering.

The timeout detection at line 1017 now uses `tracker.last_activity_at`, but this function still returns `started_at` for branches. This means branch timeout detection uses one timestamp while kill ordering uses another.

🐛 Proposed fix

```diff
 fn kill_target_last_activity(target: &KillTarget) -> Instant {
     match target {
         KillTarget::Worker(tracker) => tracker.last_activity_at,
-        KillTarget::Branch(tracker) => tracker.started_at,
+        KillTarget::Branch(tracker) => tracker.last_activity_at,
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/cortex.rs` around lines 510 - 515, The kill ordering function kill_target_last_activity is inconsistent with the timeout detection logic (which uses tracker.last_activity_at); update the Branch arm to return tracker.last_activity_at instead of tracker.started_at so both timeout detection and kill ordering use the same timestamp (ensure the BranchTracker struct exposes last_activity_at and that its type is Instant to match the function signature).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 36829fc0-a446-4695-a0f6-6ba181dbb769
📒 Files selected for processing (2)
src/agent/cortex.rs, src/hooks/spacebot.rs
```rust
let request = tokio::time::timeout(
    std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
    agent.stream_completion(
        current_prompt.clone(),
        chat_history[..chat_history.len() - 1].to_vec(),
    ),
)
.await
.map_err(|_| PromptError::CompletionError(
    rig::completion::CompletionError::from(
        Box::new(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("LLM stream_completion request timed out after {}s", Self::LLM_CALL_TIMEOUT_SECS)
        )) as Box<dyn std::error::Error + Send + Sync>
    )
))?
.map_err(PromptError::CompletionError)?;
```
🧩 Analysis chain

🏁 Script executed (repository: spacedriveapp/spacebot):

```bash
rg -n -C3 'prompt_once_streaming|tokio::time::timeout|request\.stream\(\)\.await|stream\.next\(\)\.await' src/hooks/spacebot.rs
```

Length of output: 1232

🏁 Script executed:

```bash
sed -n '516,560p' src/hooks/spacebot.rs
```

Length of output: 2103
Streaming timeout only covers request creation; stream establishment and reads can still hang indefinitely.
Lines 516–535 wrap `agent.stream_completion(...)` with a timeout, but lines 536–541 (`request.stream().await`) and line 543 (`stream.next().await`) have no timeout protection. A stalled stream after request creation or during reads blocks the entire channel processing loop indefinitely, defeating the hung-connection mitigation.
Proposed fix (apply timeout to stream establishment + each stream read)

```diff
-let mut stream = request
-    .stream()
-    .await
-    .map_err(PromptError::CompletionError)?;
+let mut stream = tokio::time::timeout(
+    std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
+    request.stream(),
+)
+.await
+.map_err(|_| PromptError::CompletionError(
+    rig::completion::CompletionError::from(
+        Box::new(std::io::Error::new(
+            std::io::ErrorKind::TimedOut,
+            format!(
+                "LLM stream start timed out after {}s",
+                Self::LLM_CALL_TIMEOUT_SECS
+            ),
+        )) as Box<dyn std::error::Error + Send + Sync>
+    )
+))?
+.map_err(PromptError::CompletionError)?;
@@
-while let Some(content) = stream.next().await {
+while let Some(content) = tokio::time::timeout(
+    std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
+    stream.next(),
+)
+.await
+.map_err(|_| PromptError::CompletionError(
+    rig::completion::CompletionError::from(
+        Box::new(std::io::Error::new(
+            std::io::ErrorKind::TimedOut,
+            format!(
+                "LLM stream read timed out after {}s",
+                Self::LLM_CALL_TIMEOUT_SECS
+            ),
+        )) as Box<dyn std::error::Error + Send + Sync>
+    )
+))? {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/hooks/spacebot.rs` around lines 516 - 533, The current timeout only wraps
agent.stream_completion(...) so a stalled stream or blocked reads still hang;
update the code to apply tokio::time::timeout using Self::LLM_CALL_TIMEOUT_SECS
around both the stream establishment (the await on request.stream().await) and
each read from the stream (the await on stream.next().await), mapping timeout
errors to PromptError::CompletionError just like the initial request timeout;
ensure you cancel/close the underlying request on timeout and reuse the same
CompletionError creation pattern (rig::completion::CompletionError via Box<dyn
std::error::Error + Send + Sync>) so all timeouts are handled consistently for
agent.stream_completion, request.stream().await, and stream.next().await.
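The per-read idle-timeout pattern the review asks for can be sketched with std primitives alone (a tokio-free simulation; `drain_with_idle_timeout` and the channel-of-chunks model are illustrative assumptions, not the project's API):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Per-read idle timeout: each chunk must arrive within the budget, so a
// stream that stalls mid-response fails instead of blocking forever.
fn drain_with_idle_timeout(
    rx: mpsc::Receiver<String>,
    idle: Duration,
) -> Result<String, String> {
    let mut out = String::new();
    loop {
        match rx.recv_timeout(idle) {
            Ok(chunk) => out.push_str(&chunk),
            // Sender dropped: the stream ended normally.
            Err(mpsc::RecvTimeoutError::Disconnected) => return Ok(out),
            // No chunk within the budget: fail the read rather than hang.
            Err(mpsc::RecvTimeoutError::Timeout) => {
                return Err(format!("stream read timed out after {idle:?}"))
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        tx.send("hello ".to_string()).unwrap();
        tx.send("world".to_string()).unwrap();
        // tx dropped here -> clean end of stream
    });
    println!("{:?}", drain_with_idle_timeout(rx, Duration::from_millis(200)));
}
```

Note this bounds idle time between chunks, not total stream duration; a slow but steadily producing stream never trips it, which matches the review's intent of catching stalls rather than long responses.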
Pull request overview
This PR aims to prevent agent hangs by adding a per-call timeout around LLM operations and by changing cortex supervision from wall-clock branch timeouts to activity-based tracking.
Changes:
- Add a 5-minute `tokio::time::timeout` wrapper around non-streaming LLM calls and the initial streaming request creation.
- Add `last_activity_at` to `BranchTracker` and switch branch timeout evaluation to use activity time rather than start time.
- Add health-tick debug logging and include active branch/worker counts when timeout cancellation is skipped due to lagged control.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/hooks/spacebot.rs` | Wraps LLM prompt paths with a timeout to avoid indefinitely hung calls. |
| `src/agent/cortex.rs` | Introduces branch activity timestamp field and updates supervisor timeout checks + logging. |
Comments suppressed due to low confidence (1)
src/hooks/spacebot.rs:544
- The streaming timeout only wraps the `agent.stream_completion(...)` request creation. If the HTTP connection stalls after the request is created (e.g., `request.stream().await` hangs, or `stream.next().await` never yields/ends), this loop can still block a channel indefinitely. To fully prevent hung streaming calls, apply a timeout to stream creation and/or enforce an idle/per-turn timeout while awaiting `next()` items.
```rust
let request = tokio::time::timeout(
    std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
    agent.stream_completion(
        current_prompt.clone(),
        chat_history[..chat_history.len() - 1].to_vec(),
    ),
)
.await
.map_err(|_| PromptError::CompletionError(
    rig::completion::CompletionError::from(
        Box::new(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("LLM stream_completion request timed out after {}s", Self::LLM_CALL_TIMEOUT_SECS)
        )) as Box<dyn std::error::Error + Send + Sync>
    )
))?
.map_err(PromptError::CompletionError)?;
let mut stream = request
    .stream()
    .await
    .map_err(PromptError::CompletionError)?;
let mut tool_calls = vec![];
let mut tool_results = vec![];
let mut is_text_response = false;
while let Some(content) = stream.next().await {
    match content.map_err(PromptError::CompletionError)? {
```
```rust
/// Timeout for a single LLM completion call (non-streaming).
///
/// Prevents a hung API connection from blocking a branch, compactor, or
/// ingestion process indefinitely. Set to 5 minutes — generous for complex
/// completions but catches genuine connection stalls.
const LLM_CALL_TIMEOUT_SECS: u64 = 300;
```
```diff
 state
     .branch_trackers
     .values()
-    .filter(|tracker| now.duration_since(tracker.started_at) >= branch_timeout)
+    .filter(|tracker| now.duration_since(tracker.last_activity_at) >= branch_timeout)
     .cloned()
     .collect()
```
```diff
 #[derive(Debug, Clone)]
 struct BranchTracker {
     branch_id: BranchId,
     channel_id: ChannelId,
     started_at: Instant,
+    last_activity_at: Instant,
 }
```
… `kill_target_last_activity` was still using `started_at` for branches while timeout detection had been switched to `last_activity_at`. Also note that `last_activity_at` is only set at branch spawn — updating it mid-execution requires `ProcessEvent` plumbing from the hook.
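A minimal std-only sketch of that shape: the real `BranchTracker` lives in `src/agent/cortex.rs` and carries `branch_id`/`channel_id`; `touch` here is a hypothetical helper the `ProcessEvent` plumbing would call on each observed event.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone)]
struct BranchTracker {
    #[allow(dead_code)]
    started_at: Instant,
    last_activity_at: Instant,
}

impl BranchTracker {
    fn new() -> Self {
        let now = Instant::now();
        Self { started_at: now, last_activity_at: now }
    }

    // Hypothetical hook: refresh on each event so staleness measures idle
    // time, not total age.
    fn touch(&mut self) {
        self.last_activity_at = Instant::now();
    }

    // Mirrors the supervisor's overdue filter: compare against last activity.
    fn is_overdue(&self, now: Instant, timeout: Duration) -> bool {
        now.duration_since(self.last_activity_at) >= timeout
    }
}

fn main() {
    let mut tracker = BranchTracker::new();
    tracker.touch();
    // A long-running but recently active branch is not overdue.
    println!("{}", tracker.is_overdue(Instant::now(), Duration::from_secs(300)));
}
```

Until the event plumbing lands, `last_activity_at` equals `started_at` for the tracker's whole lifetime, so the behavior is unchanged from wall-clock timeouts.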
Summary
Prevents hung LLM API connections from blocking agent processes indefinitely. We hit this in production — a GLM-5.1 API call never returned (no error, no response), causing a branch to hang forever with the channel waiting on it.
Two targeted fixes:
1. Per-call LLM timeout (`src/hooks/spacebot.rs`)

- `LLM_CALL_TIMEOUT_SECS = 300` (5 minutes)
- Wraps `agent.prompt()` in `prompt_once` with `tokio::time::timeout`
- Wraps `agent.stream_completion()` in `prompt_once_streaming` with `tokio::time::timeout`
- Timeouts surface as `PromptError::CompletionError` → handled by existing retry/error paths

2. Activity-based branch tracking (`src/agent/cortex.rs`)

- Adds `last_activity_at: Instant` to `BranchTracker` (matching existing `WorkerTracker`)
- Switches branch timeout detection from `started_at` (wall-clock) to `last_activity_at` (activity-based)
- Adds `tracing::debug!` on each health tick with active/overdue counts
- Includes `active_branches`/`active_workers` in the `lagged_control` skip log

Why not just config?

The cortex supervisor has a `branch_timeout_secs` config, but it only runs on a health tick interval (30-120s), uses wall-clock time, and the tick can be skipped entirely when `lagged_control` is set. The per-call timeout is the defense in depth — it fails the specific stuck call rather than waiting for the supervisor to notice.

Testing

- Built against `origin/main` (`cargo build --release`)
- `cortex.rs` tests updated for the new `last_activity_at` field

Changes

- `src/hooks/spacebot.rs`
- `src/agent/cortex.rs`