
fix: add per-call LLM timeout and activity-based branch tracking #556

Open

EZotoff wants to merge 2 commits into spacedriveapp:main from EZotoff:fix/llm-call-timeout-and-branch-tracking

Conversation

@EZotoff

@EZotoff EZotoff commented Apr 10, 2026

Summary

Prevents hung LLM API connections from blocking agent processes indefinitely. We hit this in production — a GLM-5.1 API call never returned (no error, no response), causing a branch to hang forever with the channel waiting on it.

Two targeted fixes:

1. Per-call LLM timeout (src/hooks/spacebot.rs)

  • Adds LLM_CALL_TIMEOUT_SECS = 300 (5 minutes)
  • Wraps agent.prompt() in prompt_once with tokio::time::timeout
  • Wraps agent.stream_completion() in prompt_once_streaming with tokio::time::timeout
  • On timeout, returns PromptError::CompletionError → handled by existing retry/error paths
  • Covers branches, compactors, ingestion (non-streaming) and channels (streaming)
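
The timeout-wrapper pattern above can be sketched in plain std Rust. This is a stand-in, not the PR's code: it uses a worker thread plus `recv_timeout` in place of `tokio::time::timeout`, and `prompt_with_timeout` with a closure argument is an illustrative signature, not the real `prompt_once` from `src/hooks/spacebot.rs`:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run a blocking "LLM call" on a worker thread and fail if it does not
/// answer within the deadline. This mirrors the shape of the PR's change:
/// the call either completes in time or is converted into an error that
/// the existing retry/error paths can handle.
fn prompt_with_timeout<F>(call: F, timeout: Duration) -> Result<String, String>
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver timed out and was dropped, this send just fails.
        let _ = tx.send(call());
    });
    rx.recv_timeout(timeout)
        .map_err(|_| format!("LLM call timed out after {}s", timeout.as_secs()))
}

fn main() {
    // A fast call succeeds.
    let ok = prompt_with_timeout(|| "hello".to_string(), Duration::from_millis(200));
    assert_eq!(ok.unwrap(), "hello");

    // A hung call becomes an error instead of blocking forever.
    let hung = prompt_with_timeout(
        || {
            thread::sleep(Duration::from_secs(5));
            "too late".to_string()
        },
        Duration::from_millis(50),
    );
    assert!(hung.is_err());
}
```

The key property is the same as in the PR: the caller always gets a result within the deadline, and a timeout surfaces as an ordinary error rather than a hang.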

2. Activity-based branch tracking (src/agent/cortex.rs)

  • Adds last_activity_at: Instant to BranchTracker (matching existing WorkerTracker)
  • Changes the supervisor timeout check from started_at (wall-clock) to last_activity_at (activity-based)
  • Adds tracing::debug! on each health tick with active/overdue counts
  • Includes active_branches/active_workers in the lagged_control skip log
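
A minimal stand-in for the tracker change (the field names `started_at`/`last_activity_at` match the PR; everything else, including `touch` and `is_overdue`, is illustrative):

```rust
use std::time::{Duration, Instant};

/// Simplified stand-in for BranchTracker: timeout detection keys off
/// last_activity_at rather than started_at.
#[derive(Debug, Clone)]
struct BranchTracker {
    started_at: Instant,
    last_activity_at: Instant,
}

impl BranchTracker {
    fn new() -> Self {
        let now = Instant::now();
        Self { started_at: now, last_activity_at: now }
    }

    /// Called whenever the branch makes progress (tool call, LLM response, ...).
    fn touch(&mut self) {
        self.last_activity_at = Instant::now();
    }

    /// Activity-based check: a long-running but still-active branch is healthy.
    fn is_overdue(&self, now: Instant, timeout: Duration) -> bool {
        now.duration_since(self.last_activity_at) >= timeout
    }
}

fn main() {
    let mut t = BranchTracker::new();
    let timeout = Duration::from_secs(60);

    // Pretend the branch was spawned two minutes ago but reported activity
    // ten seconds ago: the old wall-clock check would flag it, the
    // activity-based check does not.
    let now = Instant::now() + Duration::from_secs(120);
    t.last_activity_at = now - Duration::from_secs(10);
    assert!(now.duration_since(t.started_at) >= timeout); // old check: overdue
    assert!(!t.is_overdue(now, timeout));                 // new check: healthy
}
```

The difference matters for legitimately long branches: a branch that keeps making tool calls stays alive indefinitely, while one that stalls is flagged one timeout interval after its last sign of life.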

Why not just config?

The cortex supervisor has a branch_timeout_secs config, but it only runs on a health tick interval (30-120s), uses wall-clock time, and the tick can be skipped entirely when lagged_control is set. The per-call timeout is the defense in depth — it fails the specific stuck call rather than waiting for the supervisor to notice.

Testing

  • Built and deployed to our production instance (uni-pc, RTX 4080)
  • Verified both files compile clean against origin/main (cargo build --release)
  • Existing cortex.rs tests updated for new last_activity_at field

Changes

File Change
src/hooks/spacebot.rs +46/-9 — timeout wrapper on both LLM call paths
src/agent/cortex.rs +28/-3 — BranchTracker activity tracking + health tick logging

Prevents hung API connections from blocking branches, channels, and
other LLM processes indefinitely. Two changes:

1. Wrap prompt_once and prompt_once_streaming in tokio::time::timeout
   (300s default). On timeout, returns PromptError::CompletionError
   which the existing retry/error paths handle naturally.

2. Add last_activity_at to BranchTracker (matching WorkerTracker) so
   the cortex supervisor can detect stalled branches by activity age
   rather than just wall-clock time since spawn. Also adds debug-level
   health tick logging with active/overdue counts for observability.
Copilot AI review requested due to automatic review settings April 10, 2026 18:07
@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cdd1f5f4-4976-4b9c-ac4a-88f4f1a79b3e

📥 Commits

Reviewing files that changed from the base of the PR and between c75f93c and cb4cbfc.

📒 Files selected for processing (1)
  • src/agent/cortex.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/agent/cortex.rs

Walkthrough

Adds branch activity tracking to cortex (new last_activity_at, uses it for inactivity/kill timing, exposes active counts in health ticks) and enforces a 300s timeout around LLM calls in spacebot, converting timeouts into PromptError::CompletionError.

Changes

Cohort / File(s) Summary
Branch activity & health tick
src/agent/cortex.rs
Added last_activity_at: Instant to BranchTracker; switched branch inactivity/kill logic to use last_activity_at; updated kill_target_last_activity and run_health_tick to include active_branches/active_workers and emit a health-tick log; updated tests/fixtures to set the new field.
LLM call timeout enforcement
src/hooks/spacebot.rs
Added LLM_CALL_TIMEOUT_SECS = 300; wrapped non-streaming agent.prompt(...) and streaming agent.stream_completion(...) calls with tokio::time::timeout(...); convert timeouts into PromptError::CompletionError (with contextual messages).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects both main changes: per-call LLM timeout and activity-based branch tracking, matching the core objectives of the pull request.
Description check ✅ Passed The description is well-detailed and directly related to the changeset, explaining both fixes, their rationale, production context, and testing approach.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/agent/cortex.rs (1)

510-515: ⚠️ Potential issue | 🟠 Major

Inconsistent use of started_at for branch kill ordering.

The timeout detection at line 1017 now uses tracker.last_activity_at, but this function still returns started_at for branches. This means branch timeout detection uses one timestamp while kill ordering uses another.

🐛 Proposed fix
 fn kill_target_last_activity(target: &KillTarget) -> Instant {
     match target {
         KillTarget::Worker(tracker) => tracker.last_activity_at,
-        KillTarget::Branch(tracker) => tracker.started_at,
+        KillTarget::Branch(tracker) => tracker.last_activity_at,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/cortex.rs` around lines 510 - 515, The kill ordering function
kill_target_last_activity is inconsistent with the timeout detection logic
(which uses tracker.last_activity_at); update the Branch arm to return
tracker.last_activity_at instead of tracker.started_at so both timeout detection
and kill ordering use the same timestamp (ensure the BranchTracker struct
exposes last_activity_at and that its type is Instant to match the function
signature).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36829fc0-a446-4695-a0f6-6ba181dbb769

📥 Commits

Reviewing files that changed from the base of the PR and between ac01c5b and c75f93c.

📒 Files selected for processing (2)
  • src/agent/cortex.rs
  • src/hooks/spacebot.rs

Comment on lines +516 to 533

let request = tokio::time::timeout(
    std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
    agent.stream_completion(
        current_prompt.clone(),
        chat_history[..chat_history.len() - 1].to_vec(),
    ),
)
.await
.map_err(|_| PromptError::CompletionError(
    rig::completion::CompletionError::from(
        Box::new(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("LLM stream_completion request timed out after {}s", Self::LLM_CALL_TIMEOUT_SECS)
        )) as Box<dyn std::error::Error + Send + Sync>
    )
))?
.map_err(PromptError::CompletionError)?;

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C3 'prompt_once_streaming|tokio::time::timeout|request\.stream\(\)\.await|stream\.next\(\)\.await' src/hooks/spacebot.rs

Repository: spacedriveapp/spacebot

Length of output: 1232


🏁 Script executed:

sed -n '516,560p' src/hooks/spacebot.rs

Repository: spacedriveapp/spacebot

Length of output: 2103


Streaming timeout only covers request creation; stream establishment and reads can still hang indefinitely.

Lines 516–535 wrap agent.stream_completion(...) with a timeout, but lines 536–541 (request.stream().await) and line 543 (stream.next().await) have no timeout protection. A stalled stream after request creation or during reads blocks the entire channel processing loop indefinitely, defeating hung-connection mitigation.

Proposed fix (apply timeout to stream establishment + each stream read)
-            let mut stream = request
-                .stream()
-                .await
-                .map_err(PromptError::CompletionError)?;
+            let mut stream = tokio::time::timeout(
+                std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
+                request.stream(),
+            )
+            .await
+            .map_err(|_| PromptError::CompletionError(
+                rig::completion::CompletionError::from(
+                    Box::new(std::io::Error::new(
+                        std::io::ErrorKind::TimedOut,
+                        format!(
+                            "LLM stream start timed out after {}s",
+                            Self::LLM_CALL_TIMEOUT_SECS
+                        ),
+                    )) as Box<dyn std::error::Error + Send + Sync>
+                )
+            ))?
+            .map_err(PromptError::CompletionError)?;
@@
-            while let Some(content) = stream.next().await {
+            while let Some(content) = tokio::time::timeout(
+                std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
+                stream.next(),
+            )
+            .await
+            .map_err(|_| PromptError::CompletionError(
+                rig::completion::CompletionError::from(
+                    Box::new(std::io::Error::new(
+                        std::io::ErrorKind::TimedOut,
+                        format!(
+                            "LLM stream read timed out after {}s",
+                            Self::LLM_CALL_TIMEOUT_SECS
+                        ),
+                    )) as Box<dyn std::error::Error + Send + Sync>
+                )
+            ))? {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/hooks/spacebot.rs` around lines 516 - 533, The current timeout only wraps
agent.stream_completion(...) so a stalled stream or blocked reads still hang;
update the code to apply tokio::time::timeout using Self::LLM_CALL_TIMEOUT_SECS
around both the stream establishment (the await on request.stream().await) and
each read from the stream (the await on stream.next().await), mapping timeout
errors to PromptError::CompletionError just like the initial request timeout;
ensure you cancel/close the underlying request on timeout and reuse the same
CompletionError creation pattern (rig::completion::CompletionError via Box<dyn
std::error::Error + Send + Sync>) so all timeouts are handled consistently for
agent.stream_completion, request.stream().await, and stream.next().await.

Copilot AI left a comment

Pull request overview

This PR aims to prevent agent hangs by adding a per-call timeout around LLM operations and by changing cortex supervision from wall-clock branch timeouts to activity-based tracking.

Changes:

  • Add a 5-minute tokio::time::timeout wrapper around non-streaming LLM calls and the initial streaming request creation.
  • Add last_activity_at to BranchTracker and switch branch timeout evaluation to use activity time rather than start time.
  • Add health-tick debug logging and include active branch/worker counts when timeout cancellation is skipped due to lagged control.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/hooks/spacebot.rs Wraps LLM prompt paths with a timeout to avoid indefinitely hung calls.
src/agent/cortex.rs Introduces branch activity timestamp field and updates supervisor timeout checks + logging.
Comments suppressed due to low confidence (1)

src/hooks/spacebot.rs:544

  • The streaming timeout only wraps the agent.stream_completion(...) request creation. If the HTTP connection stalls after the request is created (e.g., request.stream().await hangs, or stream.next().await never yields/ends), this loop can still block a channel indefinitely. To fully prevent hung streaming calls, apply a timeout to stream creation and/or enforce an idle/per-turn timeout while awaiting next() items.
            let request = tokio::time::timeout(
                std::time::Duration::from_secs(Self::LLM_CALL_TIMEOUT_SECS),
                agent.stream_completion(
                    current_prompt.clone(),
                    chat_history[..chat_history.len() - 1].to_vec(),
                ),
            )
            .await
            .map_err(|_| PromptError::CompletionError(
                rig::completion::CompletionError::from(
                    Box::new(std::io::Error::new(
                        std::io::ErrorKind::TimedOut,
                        format!("LLM stream_completion request timed out after {}s", Self::LLM_CALL_TIMEOUT_SECS)
                    )) as Box<dyn std::error::Error + Send + Sync>
                )
            ))?
            .map_err(PromptError::CompletionError)?;

            let mut stream = request
                .stream()
                .await
                .map_err(PromptError::CompletionError)?;

            let mut tool_calls = vec![];
            let mut tool_results = vec![];
            let mut is_text_response = false;

            while let Some(content) = stream.next().await {
                match content.map_err(PromptError::CompletionError)? {


Comment on lines +426 to +431
/// Timeout for a single LLM completion call (non-streaming).
///
/// Prevents a hung API connection from blocking a branch, compactor, or
/// ingestion process indefinitely. Set to 5 minutes — generous for complex
/// completions but catches genuine connection stalls.
const LLM_CALL_TIMEOUT_SECS: u64 = 300;
Comment on lines 1014 to 1019
 state
     .branch_trackers
     .values()
-    .filter(|tracker| now.duration_since(tracker.started_at) >= branch_timeout)
+    .filter(|tracker| now.duration_since(tracker.last_activity_at) >= branch_timeout)
     .cloned()
     .collect()
Comment on lines 352 to 358
#[derive(Debug, Clone)]
struct BranchTracker {
    branch_id: BranchId,
    channel_id: ChannelId,
    started_at: Instant,
    last_activity_at: Instant,
}
…dates

kill_target_last_activity was still using started_at for branches while
timeout detection had been switched to last_activity_at. Also note that
last_activity_at is only set at branch spawn — updating it mid-execution
requires ProcessEvent plumbing from the hook.
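
A hypothetical sketch of what that plumbing might look like (the names `ProcessEvent::BranchActivity`, `handle_event`, and the simplified `Tracker` are invented for illustration; the real event types and cortex loop are not shown in this PR):

```rust
use std::collections::HashMap;
use std::time::Instant;

type BranchId = u64;

// Invented event type: the hook would emit an activity event per tool call
// or LLM response, and the cortex event loop would refresh the matching
// tracker's last_activity_at.
enum ProcessEvent {
    BranchActivity(BranchId),
}

struct Tracker {
    last_activity_at: Instant,
}

fn handle_event(trackers: &mut HashMap<BranchId, Tracker>, event: ProcessEvent) {
    match event {
        ProcessEvent::BranchActivity(id) => {
            if let Some(t) = trackers.get_mut(&id) {
                t.last_activity_at = Instant::now();
            }
        }
    }
}

fn main() {
    let mut trackers = HashMap::new();
    let spawn_time = Instant::now();
    trackers.insert(7u64, Tracker { last_activity_at: spawn_time });

    // On each activity event, the supervisor's inactivity clock resets.
    handle_event(&mut trackers, ProcessEvent::BranchActivity(7));
    assert!(trackers[&7].last_activity_at >= spawn_time);
}
```

Without some event path like this, the activity-based check degrades to the old wall-clock behavior, since `last_activity_at` never moves after spawn.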
