
fix: add per-call LLM timeout and activity-based branch tracking #556

Open

EZotoff wants to merge 2 commits into spacedriveapp:main from EZotoff:fix/llm-call-timeout-and-branch-tracking

Conversation

@EZotoff

@EZotoff EZotoff commented Apr 10, 2026

Summary

Prevents hung LLM API connections from blocking agent processes indefinitely. We hit this in production — a GLM-5.1 API call never returned (no error, no response), causing a branch to hang forever with the channel waiting on it.

Two targeted fixes:

1. Per-call LLM timeout (src/hooks/spacebot.rs)

  • Adds LLM_CALL_TIMEOUT_SECS = 300 (5 minutes)
  • Wraps agent.prompt() in prompt_once with tokio::time::timeout
  • Wraps agent.stream_completion() in prompt_once_streaming with tokio::time::timeout
  • On timeout, returns PromptError::CompletionError → handled by existing retry/error paths
  • Covers branches, compactors, ingestion (non-streaming) and channels (streaming)
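
The timeout-wrapper pattern above can be sketched in plain std Rust. This is a stand-in, not the PR's code: it uses a worker thread plus `recv_timeout` in place of `tokio::time::timeout`, and `prompt_with_timeout` with a closure argument is an illustrative signature, not the real `prompt_once` from `src/hooks/spacebot.rs`:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run a blocking "LLM call" on a worker thread and fail if it does not
/// answer within the deadline. This mirrors the shape of the PR's change:
/// the call either completes in time or is converted into an error that
/// the existing retry/error paths can handle.
fn prompt_with_timeout<F>(call: F, timeout: Duration) -> Result<String, String>
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver timed out and was dropped, this send just fails.
        let _ = tx.send(call());
    });
    rx.recv_timeout(timeout)
        .map_err(|_| format!("LLM call timed out after {}s", timeout.as_secs()))
}

fn main() {
    // A fast call succeeds.
    let ok = prompt_with_timeout(|| "hello".to_string(), Duration::from_millis(200));
    assert_eq!(ok.unwrap(), "hello");

    // A hung call becomes an error instead of blocking forever.
    let hung = prompt_with_timeout(
        || {
            thread::sleep(Duration::from_secs(5));
            "too late".to_string()
        },
        Duration::from_millis(50),
    );
    assert!(hung.is_err());
}
```

The key property is the same as in the PR: the caller always gets a result within the deadline, and a timeout surfaces as an ordinary error rather than a hang.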

2. Activity-based branch tracking (src/agent/cortex.rs)

  • Adds last_activity_at: Instant to BranchTracker (matching existing WorkerTracker)
  • Changes the supervisor timeout check from started_at (wall-clock) to last_activity_at (activity-based)
  • Adds tracing::debug! on each health tick with active/overdue counts
  • Includes active_branches/active_workers in the lagged_control skip log
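
A minimal stand-in for the tracker change (the field names `started_at`/`last_activity_at` match the PR; everything else, including `touch` and `is_overdue`, is illustrative):

```rust
use std::time::{Duration, Instant};

/// Simplified stand-in for BranchTracker: timeout detection keys off
/// last_activity_at rather than started_at.
#[derive(Debug, Clone)]
struct BranchTracker {
    started_at: Instant,
    last_activity_at: Instant,
}

impl BranchTracker {
    fn new() -> Self {
        let now = Instant::now();
        Self { started_at: now, last_activity_at: now }
    }

    /// Called whenever the branch makes progress (tool call, LLM response, ...).
    fn touch(&mut self) {
        self.last_activity_at = Instant::now();
    }

    /// Activity-based check: a long-running but still-active branch is healthy.
    fn is_overdue(&self, now: Instant, timeout: Duration) -> bool {
        now.duration_since(self.last_activity_at) >= timeout
    }
}

fn main() {
    let mut t = BranchTracker::new();
    let timeout = Duration::from_secs(60);

    // Pretend the branch was spawned two minutes ago but reported activity
    // ten seconds ago: the old wall-clock check would flag it, the
    // activity-based check does not.
    let now = Instant::now() + Duration::from_secs(120);
    t.last_activity_at = now - Duration::from_secs(10);
    assert!(now.duration_since(t.started_at) >= timeout); // old check: overdue
    assert!(!t.is_overdue(now, timeout));                 // new check: healthy
}
```

The difference matters for legitimately long branches: a branch that keeps making tool calls stays alive indefinitely, while one that stalls is flagged one timeout interval after its last sign of life.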

Why not just config?

The cortex supervisor has a branch_timeout_secs config, but it only runs on a health tick interval (30-120s), uses wall-clock time, and the tick can be skipped entirely when lagged_control is set. The per-call timeout is the defense in depth — it fails the specific stuck call rather than waiting for the supervisor to notice.

Testing

  • Built and deployed to our production instance (uni-pc, RTX 4080)
  • Verified both files compile clean against origin/main (cargo build --release)
  • Existing cortex.rs tests updated for new last_activity_at field

Changes

File Change
src/hooks/spacebot.rs +46/-9 — timeout wrapper on both LLM call paths
src/agent/cortex.rs +28/-3 — BranchTracker activity tracking + health tick logging

Prevents hung API connections from blocking branches, channels, and
other LLM processes indefinitely. Two changes:

1. Wrap prompt_once and prompt_once_streaming in tokio::time::timeout
   (300s default). On timeout, returns PromptError::CompletionError
   which the existing retry/error paths handle naturally.

2. Add last_activity_at to BranchTracker (matching WorkerTracker) so
   the cortex supervisor can detect stalled branches by activity age
   rather than just wall-clock time since spawn. Also adds debug-level
   health tick logging with active/overdue counts for observability.
Copilot AI review requested due to automatic review settings April 10, 2026 18:07
@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cdd1f5f4-4976-4b9c-ac4a-88f4f1a79b3e

📥 Commits

Reviewing files that changed from the base of the PR and between c75f93c and cb4cbfc.

📒 Files selected for processing (1)
  • src/agent/cortex.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/agent/cortex.rs

Walkthrough

Adds branch activity tracking to cortex (new last_activity_at, uses it for inactivity/kill timing, exposes active counts in health ticks) and enforces a 300s timeout around LLM calls in spacebot, converting timeouts into PromptError::CompletionError.

Changes

Cohort / File(s) Summary
Branch activity & health tick
src/agent/cortex.rs
Added last_activity_at: Instant to BranchTracker; switched branch inactivity/kill logic to use last_activity_at; updated kill_target_last_activity and run_health_tick to include active_branches/active_workers and emit a health-tick log; updated tests/fixtures to set the new field.
LLM call timeout enforcement
src/hooks/spacebot.rs
Added LLM_CALL_TIMEOUT_SECS = 300; wrapped non-streaming agent.prompt(...) and streaming agent.stream_completion(...) calls with tokio::time::timeout(...); convert timeouts into PromptError::CompletionError (with contextual messages).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects both main changes: per-call LLM timeout and activity-based branch tracking, matching the core objectives of the pull request.
Description check ✅ Passed The description is well-detailed and directly related to the changeset, explaining both fixes, their rationale, production context, and testing approach.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/agent/cortex.rs (1)

510-515: ⚠️ Potential issue | 🟠 Major

Inconsistent use of started_at for branch kill ordering.

The timeout detection at line 1017 now uses tracker.last_activity_at, but this function still returns started_at for branches. This means branch timeout detection uses one timestamp while kill ordering uses another.

🐛 Proposed fix
 fn kill_target_last_activity(target: &KillTarget) -> Instant {
     match target {
         KillTarget::Worker(tracker) => tracker.last_activity_at,
-        KillTarget::Branch(tracker) => tracker.started_at,
+        KillTarget::Branch(tracker) => tracker.last_activity_at,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/cortex.rs` around lines 510 - 515, The kill ordering function
kill_target_last_activity is inconsistent with the timeout detection logic
(which uses tracker.last_activity_at); update the Branch arm to return
tracker.last_activity_at instead of tracker.started_at so both timeout detection
and kill ordering use the same timestamp (ensure the BranchTracker struct
exposes last_activity_at and that its type is Instant to match the function
signature).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36829fc0-a446-4695-a0f6-6ba181dbb769

📥 Commits

Reviewing files that changed from the base of the PR and between ac01c5b and c75f93c.

📒 Files selected for processing (2)
  • src/agent/cortex.rs
  • src/hooks/spacebot.rs

Comment on lines +516 to 533

let request = tokio::time::timeout(
    std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
    agent.stream_completion(
        current_prompt.clone(),
        chat_history[..chat_history.len() - 1].to_vec(),
    ),
)
.await
.map_err(|_| PromptError::CompletionError(
    rig::completion::CompletionError::from(
        Box::new(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("LLM stream_completion request timed out after {}s", Self::LLM_CALL_TIMEOUT_SECS)
        )) as Box<dyn std::error::Error + Send + Sync>
    )
))?
.map_err(PromptError::CompletionError)?;

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C3 'prompt_once_streaming|tokio::time::timeout|request\.stream\(\)\.await|stream\.next\(\)\.await' src/hooks/spacebot.rs

Repository: spacedriveapp/spacebot

Length of output: 1232


🏁 Script executed:

sed -n '516,560p' src/hooks/spacebot.rs

Repository: spacedriveapp/spacebot

Length of output: 2103


Streaming timeout only covers request creation; stream establishment and reads can still hang indefinitely.

Lines 516–535 wrap agent.stream_completion(...) with a timeout, but lines 536–541 (request.stream().await) and line 543 (stream.next().await) have no timeout protection. A stalled stream after request creation or during reads blocks the entire channel processing loop indefinitely, defeating hung-connection mitigation.

Proposed fix (apply timeout to stream establishment + each stream read)
-            let mut stream = request
-                .stream()
-                .await
-                .map_err(PromptError::CompletionError)?;
+            let mut stream = tokio::time::timeout(
+                std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
+                request.stream(),
+            )
+            .await
+            .map_err(|_| PromptError::CompletionError(
+                rig::completion::CompletionError::from(
+                    Box::new(std::io::Error::new(
+                        std::io::ErrorKind::TimedOut,
+                        format!(
+                            "LLM stream start timed out after {}s",
+                            Self::LLM_CALL_TIMEOUT_SECS
+                        ),
+                    )) as Box<dyn std::error::Error + Send + Sync>
+                )
+            ))?
+            .map_err(PromptError::CompletionError)?;
@@
-            while let Some(content) = stream.next().await {
+            while let Some(content) = tokio::time::timeout(
+                std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
+                stream.next(),
+            )
+            .await
+            .map_err(|_| PromptError::CompletionError(
+                rig::completion::CompletionError::from(
+                    Box::new(std::io::Error::new(
+                        std::io::ErrorKind::TimedOut,
+                        format!(
+                            "LLM stream read timed out after {}s",
+                            Self::LLM_CALL_TIMEOUT_SECS
+                        ),
+                    )) as Box<dyn std::error::Error + Send + Sync>
+                )
+            ))? {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/hooks/spacebot.rs` around lines 516 - 533, The current timeout only wraps
agent.stream_completion(...) so a stalled stream or blocked reads still hang;
update the code to apply tokio::time::timeout using Self::LLM_CALL_TIMEOUT_SECS
around both the stream establishment (the await on request.stream().await) and
each read from the stream (the await on stream.next().await), mapping timeout
errors to PromptError::CompletionError just like the initial request timeout;
ensure you cancel/close the underlying request on timeout and reuse the same
CompletionError creation pattern (rig::completion::CompletionError via Box<dyn
std::error::Error + Send + Sync>) so all timeouts are handled consistently for
agent.stream_completion, request.stream().await, and stream.next().await.

Copilot AI left a comment

Pull request overview

This PR aims to prevent agent hangs by adding a per-call timeout around LLM operations and by changing cortex supervision from wall-clock branch timeouts to activity-based tracking.

Changes:

  • Add a 5-minute tokio::time::timeout wrapper around non-streaming LLM calls and the initial streaming request creation.
  • Add last_activity_at to BranchTracker and switch branch timeout evaluation to use activity time rather than start time.
  • Add health-tick debug logging and include active branch/worker counts when timeout cancellation is skipped due to lagged control.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/hooks/spacebot.rs Wraps LLM prompt paths with a timeout to avoid indefinitely hung calls.
src/agent/cortex.rs Introduces branch activity timestamp field and updates supervisor timeout checks + logging.
Comments suppressed due to low confidence (1)

src/hooks/spacebot.rs:544

  • The streaming timeout only wraps the agent.stream_completion(...) request creation. If the HTTP connection stalls after the request is created (e.g., request.stream().await hangs, or stream.next().await never yields/ends), this loop can still block a channel indefinitely. To fully prevent hung streaming calls, apply a timeout to stream creation and/or enforce an idle/per-turn timeout while awaiting next() items.
            let request = tokio::time::timeout(
                std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
                agent.stream_completion(
                    current_prompt.clone(),
                    chat_history[..chat_history.len() - 1].to_vec(),
                ),
            )
            .await
            .map_err(|_| PromptError::CompletionError(
                rig::completion::CompletionError::from(
                    Box::new(std::io::Error::new(
                        std::io::ErrorKind::TimedOut,
                        format!("LLM stream_completion request timed out after {}s", Self::LLM_CALL_TIMEOUT_SECS)
                    )) as Box<dyn std::error::Error + Send + Sync>
                )
            ))?
            .map_err(PromptError::CompletionError)?;

            let mut stream = request
                .stream()
                .await
                .map_err(PromptError::CompletionError)?;

            let mut tool_calls = vec![];
            let mut tool_results = vec![];
            let mut is_text_response = false;

            while let Some(content) = stream.next().await {
                match content.map_err(PromptError::CompletionError)? {


Comment on lines +426 to +431
/// Timeout for a single LLM completion call (non-streaming).
///
/// Prevents a hung API connection from blocking a branch, compactor, or
/// ingestion process indefinitely. Set to 5 minutes — generous for complex
/// completions but catches genuine connection stalls.
const LLM_CALL_TIMEOUT_SECS: u64 = 300;
Comment on lines 1014 to 1019
 state
     .branch_trackers
     .values()
-    .filter(|tracker| now.duration_since(tracker.started_at) >= branch_timeout)
+    .filter(|tracker| now.duration_since(tracker.last_activity_at) >= branch_timeout)
     .cloned()
     .collect()
Comment on lines 352 to 358
#[derive(Debug, Clone)]
struct BranchTracker {
    branch_id: BranchId,
    channel_id: ChannelId,
    started_at: Instant,
    last_activity_at: Instant,
}
…dates

kill_target_last_activity was still using started_at for branches while
timeout detection had been switched to last_activity_at. Also note that
last_activity_at is only set at branch spawn — updating it mid-execution
requires ProcessEvent plumbing from the hook.
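
A hypothetical sketch of what that plumbing might look like (the names `ProcessEvent::BranchActivity`, `handle_event`, and the simplified `Tracker` are invented for illustration; the real event types and cortex loop are not shown in this PR):

```rust
use std::collections::HashMap;
use std::time::Instant;

type BranchId = u64;

// Invented event type: the hook would emit an activity event per tool call
// or LLM response, and the cortex event loop would refresh the matching
// tracker's last_activity_at.
enum ProcessEvent {
    BranchActivity(BranchId),
}

struct Tracker {
    last_activity_at: Instant,
}

fn handle_event(trackers: &mut HashMap<BranchId, Tracker>, event: ProcessEvent) {
    match event {
        ProcessEvent::BranchActivity(id) => {
            if let Some(t) = trackers.get_mut(&id) {
                t.last_activity_at = Instant::now();
            }
        }
    }
}

fn main() {
    let mut trackers = HashMap::new();
    let spawn_time = Instant::now();
    trackers.insert(7u64, Tracker { last_activity_at: spawn_time });

    // On each activity event, the supervisor's inactivity clock resets.
    handle_event(&mut trackers, ProcessEvent::BranchActivity(7));
    assert!(trackers[&7].last_activity_at >= spawn_time);
}
```

Without some event path like this, the activity-based check degrades to the old wall-clock behavior, since `last_activity_at` never moves after spawn.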
