pi.ruv.io: cognitive cycle blocks HTTP handler causing 504 upstream timeouts #305

@ruvnet

Summary

pi.ruv.io (Cloud Run service ruvbrain) was returning 504 upstream request timeouts on all endpoints. Every HTTP request hung until it hit the service's 300s Cloud Run request timeout, while background cognitive cycles continued running normally.

Root Cause

The mcp_brain_server cognitive cycle (runs every ~3 min) processes heavy workloads on the same async runtime as HTTP handlers:

  • 200 auto-votes per cycle
  • LoRA auto-submissions from SONA patterns
  • Inference processing (11-21 inferences per cycle)
  • Strange loop calculations

This appears to starve the HTTP handler, preventing it from serving any requests. The process stays alive (cognitive logs continue) but all HTTP responses hang until the 300s Cloud Run timeout.

Evidence

  • All requests returned 504 with latency: 299.97s (just under the 300s Cloud Run request timeout)
  • App logs showed healthy cognitive cycles (Cognitive cycle #22-24) running every ~3 min
  • SIGTERM graceful shutdown observed at 2026-03-27T13:58:50Z, followed by restart
  • After restart, cognitive cycles resumed but HTTP remained unresponsive
  • Two instance IDs visible in logs — both instances affected
  • curl confirmed TCP connect succeeds (0.003s) but app never sends response

Resolution (Temporary)

Redeployed fresh revision (ruvbrain-00159-cjl) — service immediately responsive (77ms).

Permanent Fix Needed

  1. Move the cognitive cycle to a separate tokio task with a work budget and explicit yield points, so HTTP handlers are never starved by background processing
  2. Add a /healthz liveness probe that Cloud Run can use to detect and restart unresponsive instances
  3. Consider tokio::task::spawn_blocking for heavy compute (auto-votes, LoRA) to avoid blocking the async executor
  4. Add request timeout middleware (e.g., 30s) so individual requests fail fast instead of hanging to 300s

Environment

  • Service: ruvbrain / us-central1
  • Image: gcr.io/ruv-dev/ruvbrain:latest
  • Revision: ruvbrain-00158-2lb (affected), ruvbrain-00159-cjl (redeployed)
  • Config: 2 CPU, 2Gi memory, minScale=1, maxScale=10, session affinity enabled

Labels

  • priority: high
  • component: mcp-brain-server
  • type: bug
