[Bug] Synchronous blocking in _wait_for_sandbox_ready crashes single-worker uvicorn server #620
Describe the bug
When creating a sandbox via `POST /v1/sandboxes`, the server synchronously blocks the asyncio event loop in `_wait_for_sandbox_ready`, causing liveness probe failures and pod restarts.
Root Cause
- `time.sleep()` instead of `await asyncio.sleep()` in `kubernetes_service.py` line 185 — when the workload is not yet visible in the K8s API, the code hits `time.sleep(poll_interval_seconds)`, which blocks the entire event loop. Line 212 in the same method correctly uses `await asyncio.sleep()`.
- Synchronous K8s client calls — `get_workload()`, `get_status()`, and `create_workload()` all use the synchronous `kubernetes` Python client. Each API call blocks the event loop for the duration of the network round-trip.
- Single uvicorn worker — `cli.py` calls `uvicorn.run()` without a `workers` parameter, defaulting to 1 process with 1 event loop.
Combined, a single `POST /v1/sandboxes` request can block the event loop for up to 60 seconds (`sandbox_create_timeout_seconds`). During this time, all other requests, including `/health` liveness probes, go unanswered. Kubernetes kills the pod after enough missed probes.
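The failure mode above can be reproduced without any K8s dependency. The sketch below (all names are illustrative, not from the OpenSandbox codebase) schedules a "health check" task behind a waiter coroutine and measures how long the health task is starved by `time.sleep()` versus `await asyncio.sleep()`:

```python
import asyncio
import time

POLL_INTERVAL_SECONDS = 0.5  # stand-in for poll_interval_seconds

async def blocking_wait() -> None:
    # Mirrors the bug: time.sleep() stalls the whole event loop.
    time.sleep(POLL_INTERVAL_SECONDS)

async def cooperative_wait() -> None:
    # The fix: await asyncio.sleep() yields control back to the loop.
    await asyncio.sleep(POLL_INTERVAL_SECONDS)

async def health_check() -> str:
    # Stand-in for the /health handler; should answer immediately.
    return "ok"

async def probe_latency(wait_factory) -> float:
    waiter = asyncio.create_task(wait_factory())
    start = time.monotonic()
    # The health "request" cannot run until the loop regains control.
    await asyncio.create_task(health_check())
    elapsed = time.monotonic() - start
    await waiter
    return elapsed

blocked = asyncio.run(probe_latency(blocking_wait))
cooperative = asyncio.run(probe_latency(cooperative_wait))
print(f"health latency with time.sleep():    {blocked:.2f}s")
print(f"health latency with asyncio.sleep(): {cooperative:.2f}s")
```

With the blocking variant, the health task is delayed for the full poll interval; with the cooperative variant it completes almost instantly, which is why only the `time.sleep()` path trips the liveness probe.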
To Reproduce
- Deploy OpenSandbox server on Kubernetes with default Helm values (single replica, default liveness probe)
- Create a sandbox with an image that hasn't been pulled yet on the target node
- Observe server logs: sandbox stays `Pending` for the full 60s timeout
- Observe pod restarts due to liveness probe failures
Suggested Fix
- Immediate: Replace `time.sleep()` on line 185 with `await asyncio.sleep()`.
- Proper: Wrap synchronous K8s client calls in `loop.run_in_executor()`, or switch to an async K8s client.
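The "proper" fix could look like the following minimal sketch. It assumes a small `run_sync` helper (not an existing OpenSandbox function) that offloads any blocking call to a dedicated thread pool so the event loop stays responsive:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Dedicated pool so slow K8s round-trips never occupy the event loop
# or starve the loop's default executor. Sizing is an assumption; tune
# max_workers to the expected number of concurrent sandbox operations.
_k8s_executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="k8s")

async def run_sync(fn, *args, **kwargs):
    """Run a blocking call (e.g. a synchronous kubernetes-client method)
    in a worker thread and await the result without stalling the loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_k8s_executor, partial(fn, *args, **kwargs))
```

A call site such as `get_status()` would then become something like `status = await run_sync(self._client.get_status, sandbox_id)` (names hypothetical). On Python 3.9+, `asyncio.to_thread()` offers the same effect with less ceremony when the default executor is acceptable.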
Environment
- OpenSandbox Server: v0.1.4 (Helm chart 0.1.0)
- Kubernetes: v1.28.15
- Runtime: containerd 1.6.36