[Bug] Synchronous blocking in _wait_for_sandbox_ready crashes single-worker uvicorn server #620

@Yuyz0112

Describe the bug

When creating a sandbox via POST /v1/sandboxes, the server synchronously blocks the asyncio event loop in _wait_for_sandbox_ready, causing liveness probe failures and pod restarts.

Root Cause

  1. time.sleep() instead of await asyncio.sleep() in kubernetes_service.py line 185 — when the workload is not yet visible in the K8s API, the code hits time.sleep(poll_interval_seconds) which blocks the entire event loop. Line 212 in the same method correctly uses await asyncio.sleep().

  2. Synchronous K8s client calls: get_workload(), get_status(), and create_workload() all use the synchronous kubernetes Python client. Each API call blocks the event loop for the duration of the network round-trip.

  3. Single uvicorn worker: cli.py calls uvicorn.run() without a workers parameter, defaulting to 1 process with 1 event loop.

Combined, a single POST /v1/sandboxes request can block the event loop for up to 60 seconds (sandbox_create_timeout_seconds). During this time, all other requests — including /health liveness probes — are unserviceable. Kubernetes kills the pod after enough missed probes.
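
The starvation described above can be reproduced outside the server with a minimal sketch. Everything here is a hypothetical stand-in, not OpenSandbox code: heartbeat plays the role of the /health handler, and poll_blocking / poll_async play the role of the polling loop with time.sleep() and await asyncio.sleep() respectively.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    # Stand-in for /health: records a timestamp whenever the loop lets it run.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def poll_blocking(duration):
    # Mirrors the bug: time.sleep() never yields, so the event loop is frozen.
    time.sleep(duration)


async def poll_async(duration):
    # Mirrors the fix: asyncio.sleep() yields so other tasks keep running.
    await asyncio.sleep(duration)


async def count_ticks_during(poller, duration=0.2):
    # Returns how many heartbeats ran while the poller was "waiting".
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat task start once
    before = len(ticks)
    await poller(duration)
    served = len(ticks) - before
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return served


if __name__ == "__main__":
    print("heartbeats with time.sleep:   ", asyncio.run(count_ticks_during(poll_blocking)))
    print("heartbeats with asyncio.sleep:", asyncio.run(count_ticks_during(poll_async)))
```

With the blocking variant no heartbeat runs during the wait, which is the same condition that makes the liveness probe time out; with the async variant the heartbeat keeps ticking.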

To Reproduce

  1. Deploy OpenSandbox server on Kubernetes with default Helm values (single replica, default liveness probe)
  2. Create a sandbox with an image that hasn't been pulled yet on the target node
  3. Observe server logs: sandbox stays Pending for the full 60s timeout
  4. Observe pod restarts due to liveness probe failure

Suggested Fix

  • Immediate: Replace time.sleep() on line 185 with await asyncio.sleep().
  • Proper: Wrap synchronous K8s client calls in loop.run_in_executor() (or asyncio.to_thread()), or switch to an async K8s client.
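
A rough sketch of how the two fixes could combine, assuming Python 3.9+ for asyncio.to_thread(). The names get_status_sync and wait_for_ready, the "Running"/"Pending" strings, and the state dict are all hypothetical; the real _wait_for_sandbox_ready and K8s client calls will look different.

```python
import asyncio
import time


def get_status_sync(state):
    # Hypothetical stand-in for a synchronous Kubernetes client call.
    # The short time.sleep() simulates the network round-trip.
    time.sleep(0.02)
    state["calls"] += 1
    return "Running" if state["calls"] >= 3 else "Pending"


async def wait_for_ready(state, timeout_seconds=2.0, poll_interval_seconds=0.01):
    # Polls until the (simulated) sandbox is ready without blocking the loop.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        # Run the blocking call in a worker thread; the event loop stays free.
        status = await asyncio.to_thread(get_status_sync, state)
        if status == "Running":
            return status
        # Yield to the event loop between polls instead of time.sleep().
        await asyncio.sleep(poll_interval_seconds)
    raise TimeoutError("sandbox not ready before timeout")


if __name__ == "__main__":
    print(asyncio.run(wait_for_ready({"calls": 0})))
```

asyncio.to_thread() uses the loop's default thread pool, so many concurrent sandbox creations could still queue behind each other; a dedicated executor passed to loop.run_in_executor() or an async client would avoid that.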

Environment

  • OpenSandbox Server: v0.1.4 (Helm chart 0.1.0)
  • Kubernetes: v1.28.15
  • Runtime: containerd 1.6.36
