CRITICAL fix: use async save_session to avoid blocking tokio runtime #534

Open
skymoore wants to merge 2 commits into RightNow-AI:main from skymoore:main
@skymoore
Summary

Prevents single-CPU openfang deployments from hanging during session saves and becoming unresponsive to all requests.

Changes

save_session() was synchronous and held a Mutex<Connection> on the tokio worker thread for the duration of each SQLite write. On pods with 1 CPU core (and therefore 1 tokio worker thread), this starved the entire runtime, including the health check endpoints, so K8s marked the pod not-ready and all subsequent requests returned 504.

Add save_session_async() that wraps the SQLite write in spawn_blocking, matching the pattern already used by other memory operations (recall, remember, etc.). Update all 12 call sites in the agent loop.
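
The idea behind the change can be sketched without tokio. The block below is an illustration only, not the code in this PR: SessionStore, save_session_offloaded, and demo are hypothetical names, a sleep stands in for the SQLite write, and std::thread::spawn stands in for tokio's spawn_blocking so the sketch compiles with the standard library alone.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the SQLite-backed session store.
#[derive(Default)]
struct SessionStore {
    saved: Vec<String>,
}

impl SessionStore {
    /// Simulates a slow synchronous write (the real code writes to SQLite).
    fn write(&mut self, session: &str) {
        thread::sleep(Duration::from_millis(10));
        self.saved.push(session.to_string());
    }
}

/// Blocking variant: takes the lock and writes on the calling thread.
fn save_session(store: &Mutex<SessionStore>, session: &str) {
    store.lock().unwrap().write(session);
}

/// Offloaded variant: the lock and the write both happen on another thread.
/// std::thread::spawn is used here as an analogue of spawn_blocking.
fn save_session_offloaded(
    store: Arc<Mutex<SessionStore>>,
    session: String,
) -> thread::JoinHandle<()> {
    thread::spawn(move || save_session(&store, &session))
}

/// Small usage example: save one session and return what was stored.
pub fn demo() -> Vec<String> {
    let store = Arc::new(Mutex::new(SessionStore::default()));
    let handle = save_session_offloaded(Arc::clone(&store), "session-1".to_string());
    // The calling thread is free for other work here, e.g. answering probes.
    handle.join().expect("writer thread panicked");
    let saved = store.lock().unwrap().saved.clone();
    saved
}
```

In an async runtime the caller would await the handle instead of joining it, so the single worker thread can keep polling other tasks while the write is in flight.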

Testing

  • cargo clippy --workspace --all-targets -- -D warnings passes
  • cargo test --workspace passes
  • Live integration tested (if applicable)

Security

  • No new unsafe code
  • No secrets or API keys in diff
  • User input validated at boundaries

The health endpoint called structured_get() synchronously on the tokio
async runtime, acquiring the shared std::sync::Mutex<Connection> on a
worker thread. When the agent loop held this mutex during session saves,
the health check blocked the tokio thread, starving the SSE stream and
causing Kubernetes probe timeouts.

- Health and health_detail now run the DB check via spawn_blocking
- SSE message/stream endpoint now includes keep_alive to flush periodic
  heartbeats even during contention
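
A rough, std-only sketch of why offloading the health check helps under contention follows. It is not the PR's implementation: Db, spawn_health_check, and demo are hypothetical names, a thread plus a channel stands in for spawn_blocking, and the ": keep-alive" string simply mimics an SSE comment line used as a heartbeat.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the shared SQLite connection.
struct Db {
    healthy: bool,
}

/// Starts the DB check on a separate thread; the result arrives on a channel.
fn spawn_health_check(db: Arc<Mutex<Db>>) -> Receiver<bool> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // This lock may wait behind a long session save,
        // but only this helper thread waits for it.
        let healthy = db.lock().map(|guard| guard.healthy).unwrap_or(false);
        let _ = tx.send(healthy);
    });
    rx
}

/// Simulates contention: the caller holds the lock while the check is pending.
/// Returns the events emitted while waiting and the final health result.
pub fn demo() -> (Vec<&'static str>, bool) {
    let db = Arc::new(Mutex::new(Db { healthy: true }));
    let mut events = Vec::new();

    // The "agent loop" holds the connection, as during a session save.
    let guard = db.lock().unwrap();
    let rx = spawn_health_check(Arc::clone(&db));

    // The check cannot finish while the guard is held, yet this thread is
    // not blocked and can still emit a heartbeat.
    if let Err(TryRecvError::Empty) = rx.try_recv() {
        events.push(": keep-alive");
    }

    // Once the save finishes, the health result becomes available.
    drop(guard);
    let healthy = rx.recv().unwrap_or(false);
    (events, healthy)
}
```

The point of the sketch is only that the thread serving requests never waits on the mutex itself, so it stays free to send heartbeats until the check completes.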