Fix evaluation workers hanging due to missing exec_run() timeouts #93
Open
sarvanithin wants to merge 1 commit into withmartian:main from
Conversation
Several `exec_run()` call sites were missing `timeout_s`, causing evaluations to hang indefinitely when the container runtime does not properly propagate asyncio cancellation. Reported symptom: 18 agent steps observed, but wall-clock time far exceeding the expected 18 × 2 min = 36 min.

Root causes and fixes:

1. `mini_swe_agent.py` — Agent setup runs 4 `exec_run()` calls (`uname -a/r/v/m`) with no timeout. If Daytona is unresponsive during container setup, the `code_agent_task` hangs before making any LLM request, blocking `_get_time_step()` indefinitely. Fix: add `setup_timeout_s=30`.
2. `code_env.py` (`_compute_reward`) — Test evaluation and reward file reads use `exec_run()` with no timeout. If a test script contains an infinite loop or the container hangs during scoring, the entire episode hangs after the agent completes. Fix: add `_REWARD_EXEC_TIMEOUT_S = 300` (5 min) to all exec calls in `_compute_reward()` and `_parse_reward_file()`.
3. `code_env.py` (`_get_time_step`) — `asyncio.wait()` had no timeout argument. Even with per-operation timeouts, if the underlying Daytona SDK catches and delays `CancelledError` (e.g. during server-side cleanup), the effective timeout can far exceed the configured value. Fix: add `timeout=_GET_TIME_STEP_TIMEOUT_S` (10 min) and raise `TimeoutError` with a descriptive message if neither the code agent nor the LLM queue responds in time.
4. `examples/03_parallel_eval_with_api.py` — No per-task wall-clock limit. A hung task holds a semaphore slot indefinitely, reducing effective parallelism. Fix: wrap each `evaluate_task()` with `asyncio.wait_for(timeout=args.task_timeout_s)` (default 30 min). Timed-out tasks propagate `TimeoutError` through gather's `return_exceptions=True` and are counted as errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
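The per-call timeouts in fixes 1 and 2 can be sketched as a thin wrapper around the exec call. This is a minimal sketch, not the repository's actual code: `sandbox.exec()` is a hypothetical stand-in for the real Daytona `exec_run()` (whose signature may differ), and `exec_with_timeout` is an illustrative helper name.

```python
import asyncio

SETUP_TIMEOUT_S = 30          # agent-setup (uname) execs, value from this PR
_REWARD_EXEC_TIMEOUT_S = 300  # reward/test execs (5 min), value from this PR

async def exec_with_timeout(sandbox, cmd: str, timeout_s: float) -> str:
    """Bound a container exec with a wall-clock limit.

    sandbox.exec() stands in for the real exec_run(); the actual Daytona
    signature may differ.
    """
    try:
        return await asyncio.wait_for(sandbox.exec(cmd), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface a descriptive error instead of hanging the episode
        raise TimeoutError(f"exec timed out after {timeout_s}s: {cmd!r}") from None
```

`asyncio.wait_for` cancels the inner awaitable on timeout, so even an exec against an unresponsive container releases control back to the caller.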
User description
Problem
Evaluation workers hang indefinitely when running example 3 (`03_parallel_eval_with_api.py`). Symptom: 18 agent steps observed, but wall-clock time far exceeding the expected 18 × 2 min = 36 min. Reported by Josh.

Root Cause
Four `exec_run()` call sites were missing `timeout_s`, so a single unresponsive container exec could block the entire evaluation indefinitely — even though per-operation timeouts appear to exist elsewhere.

1. `mini_swe_agent.py` — agent setup has no timeout. The 4 `uname` commands run with no `timeout_s`. If Daytona is slow during container startup, the `code_agent_task` hangs before making any LLM request, blocking `_get_time_step()` forever via `asyncio.wait()`.
2. `code_env.py` (`_compute_reward`) — test execution has no timeout. `bash {test_path}` and reward file `cat` calls use no timeout. If a test script contains an infinite loop or the container hangs during scoring, the episode hangs indefinitely after the agent finishes all its steps.
3. `code_env.py` (`_get_time_step`) — `asyncio.wait()` has no timeout. No backstop at the step level. Even when per-operation timeouts are set, if the underlying Daytona SDK delays propagating `CancelledError` (e.g. during server-side cancel cleanup), the effective timeout can far exceed the configured value.
4. `examples/03_parallel_eval_with_api.py` — no per-task wall-clock limit. A hung task holds a semaphore slot indefinitely, reducing effective parallelism from `num_parallel_workers` down as workers pile up.

Fix

- `mini_swe_agent.py`: add `timeout_s=30` to the 4 setup `exec_run()` calls
- `code_env.py`: add `_REWARD_EXEC_TIMEOUT_S = 300` (5 min) to all exec calls in `_compute_reward()` and `_parse_reward_file()`
- `code_env.py`: add `timeout=_GET_TIME_STEP_TIMEOUT_S` (10 min) to `asyncio.wait()` in `_get_time_step()` with a clear `TimeoutError`
- `examples/03_parallel_eval_with_api.py`: wrap `evaluate_task()` with `asyncio.wait_for(timeout=args.task_timeout_s)` (default 30 min); timed-out tasks surface as errors via `gather(return_exceptions=True)`

Testing
All 163 existing tests pass. End-to-end testing with Daytona in progress.
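The step-level backstop for `_get_time_step` (fix 3) can be sketched as follows. This is a minimal sketch, not the repository's code: `wait_for_first` is a hypothetical helper, and `_GET_TIME_STEP_TIMEOUT_S = 600` reflects the 10-minute value stated in this PR.

```python
import asyncio

_GET_TIME_STEP_TIMEOUT_S = 600  # 10 min step-level backstop, value from this PR

async def wait_for_first(code_agent_task, llm_queue_task,
                         timeout_s=_GET_TIME_STEP_TIMEOUT_S):
    """Wait for either task, but never longer than the backstop."""
    # asyncio.wait() does NOT raise on timeout; it simply returns with an
    # empty `done` set, so the caller must detect that and raise explicitly.
    done, pending = await asyncio.wait(
        {code_agent_task, llm_queue_task},
        return_when=asyncio.FIRST_COMPLETED,
        timeout=timeout_s,
    )
    if not done:
        raise TimeoutError(
            f"_get_time_step exceeded {timeout_s}s: neither the code agent "
            "nor the LLM queue responded in time"
        )
    return done, pending
```

The explicit check matters because, unlike `asyncio.wait_for`, `asyncio.wait` neither raises nor cancels anything on timeout; without it, a delayed `CancelledError` inside the SDK would stall the step indefinitely.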
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Implements comprehensive timeouts across container execution calls and task management to prevent evaluation workers from hanging indefinitely. These changes ensure that unresponsive container operations or delayed cancellations do not block system resources or parallel execution slots.
Introduces `timeout_s` into `exec_run` calls within `MiniSWECodeAgent` and `CodeEnvironment` to prevent blocking during agent setup, test execution, and reward parsing.
Adds a step-level timeout to `_get_time_step` using `asyncio.wait`, and introduces a configurable `task_timeout_s` in the parallel evaluation script to safeguard against hung tasks.
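The per-task wall-clock limit in the parallel evaluation script (fix 4) follows the semaphore-plus-`wait_for` pattern the PR describes. This is a minimal sketch, not the example script itself: `run_all` and the factory-based interface are hypothetical, and the 1800 s default mirrors the PR's 30-minute `task_timeout_s`.

```python
import asyncio

async def run_all(task_factories, num_parallel_workers=4, task_timeout_s=1800):
    """Run evaluate_task-style coroutines with a per-task wall-clock limit."""
    sem = asyncio.Semaphore(num_parallel_workers)

    async def bounded(factory):
        async with sem:  # a hung task would otherwise hold this slot forever
            return await asyncio.wait_for(factory(), timeout=task_timeout_s)

    # return_exceptions=True turns a TimeoutError into a result entry
    # instead of cancelling the whole gather
    results = await asyncio.gather(
        *(bounded(f) for f in task_factories), return_exceptions=True
    )
    errors = [r for r in results if isinstance(r, Exception)]
    return results, errors
```

Because `wait_for` cancels the hung coroutine, its semaphore slot is released on timeout and the remaining tasks keep the full `num_parallel_workers` worth of parallelism.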