
Fix evaluation workers hanging due to missing exec_run() timeouts#93

Open
sarvanithin wants to merge 1 commit into withmartian:main from sarvanithin:fix/eval-worker-timeouts

Conversation

Contributor

@sarvanithin sarvanithin commented Feb 27, 2026

User description

Problem

Evaluation workers hang indefinitely when running example 3 (03_parallel_eval_with_api.py). Symptom: 18 agent steps observed but wall-clock time far exceeding the expected 18 × 2 min = 36 min. Reported by Josh.

Root Cause

Four exec_run() call sites were missing timeout_s, so a single unresponsive container exec could block the entire evaluation indefinitely — even though per-operation timeouts appear to exist elsewhere.

1. mini_swe_agent.py — agent setup has no timeout

The 4 uname commands run with no timeout_s. If Daytona is slow during container startup, the code_agent_task hangs before making any LLM request, blocking _get_time_step() forever via asyncio.wait().

2. code_env.py (_compute_reward) — test execution has no timeout

bash {test_path} and reward file cat calls use no timeout. If a test script contains an infinite loop or the container hangs during scoring, the episode hangs indefinitely after the agent finishes all its steps.
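A minimal demonstration of why the bound matters during scoring (names and the shortened timeout are illustrative; the PR uses `_REWARD_EXEC_TIMEOUT_S = 300`):

```python
import asyncio

_REWARD_EXEC_TIMEOUT_S = 300  # 5 min in the PR; shortened below so the demo runs fast


async def exec_with_timeout(coro, timeout_s: float):
    """Bound a container exec so a looping test script cannot stall scoring."""
    return await asyncio.wait_for(coro, timeout=timeout_s)


async def demo() -> str:
    async def infinite_test_script():
        # Stands in for `bash {test_path}` containing an infinite loop.
        await asyncio.sleep(3600)

    try:
        await exec_with_timeout(infinite_test_script(), timeout_s=0.05)
    except asyncio.TimeoutError:
        return "timed out"
    return "finished"


print(asyncio.run(demo()))  # → timed out
```

Without the bound, the episode would sit in `_compute_reward()` forever even though the agent had already finished all of its steps.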

3. code_env.py (_get_time_step) — asyncio.wait() has no timeout

No backstop at the step level. Even when per-operation timeouts are set, if the underlying Daytona SDK delays propagating CancelledError (e.g. during server-side cancel cleanup), the effective timeout can far exceed the configured value.
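One subtlety worth showing: unlike `asyncio.wait_for()`, `asyncio.wait()` does not raise on timeout — it just returns an empty `done` set — so the backstop must check and raise explicitly. A sketch under that assumption (the constant's value comes from the PR; the function name and demo are hypothetical):

```python
import asyncio

_GET_TIME_STEP_TIMEOUT_S = 600  # 10 min in the PR; shortened below for the demo


async def wait_with_backstop(tasks: set[asyncio.Task], timeout_s: float):
    """asyncio.wait() returns (done, pending) on timeout instead of raising,
    so an empty `done` set is what signals the backstop should fire."""
    done, pending = await asyncio.wait(
        tasks, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    if not done:
        for t in pending:
            t.cancel()
        raise TimeoutError(
            f"Neither the code agent nor the LLM queue responded within {timeout_s}s"
        )
    return done, pending


async def demo() -> str:
    # Simulates a task whose cancellation the SDK might otherwise delay.
    stuck = asyncio.ensure_future(asyncio.sleep(3600))
    try:
        await wait_with_backstop({stuck}, timeout_s=0.05)
    except TimeoutError:
        return "backstop fired"
    return "completed"


print(asyncio.run(demo()))
```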

4. examples/03_parallel_eval_with_api.py — no per-task wall-clock limit

A hung task holds its semaphore slot indefinitely, so effective parallelism degrades from num_parallel_workers toward zero as hung workers accumulate.
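The semaphore-starvation pattern and the `wait_for` + `gather(return_exceptions=True)` fix can be sketched like this (a toy model, not the example script itself; the PR's default timeout is 30 min, shrunk here so the demo completes quickly):

```python
import asyncio

TASK_TIMEOUT_S = 0.05  # args.task_timeout_s defaults to 30 min in the PR


async def evaluate_task(task_id: int) -> str:
    # Task 0 stands in for a hung evaluation that would otherwise
    # hold its semaphore slot forever.
    await asyncio.sleep(3600 if task_id == 0 else 0)
    return f"task {task_id} ok"


async def run_all(num_tasks: int, num_parallel_workers: int):
    sem = asyncio.Semaphore(num_parallel_workers)

    async def bounded(task_id: int):
        async with sem:
            # wait_for releases the slot even when the task hangs.
            return await asyncio.wait_for(
                evaluate_task(task_id), timeout=TASK_TIMEOUT_S
            )

    results = await asyncio.gather(
        *(bounded(i) for i in range(num_tasks)), return_exceptions=True
    )
    errors = sum(isinstance(r, Exception) for r in results)
    return results, errors


results, errors = asyncio.run(run_all(num_tasks=3, num_parallel_workers=2))
print(errors)  # → 1: the hung task surfaces as a TimeoutError instead of blocking
```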

Fix

| File | Change |
| --- | --- |
| `mini_swe_agent.py` | Add `timeout_s=30` to the 4 setup `exec_run()` calls |
| `code_env.py` | Add `_REWARD_EXEC_TIMEOUT_S = 300` (5 min) to all exec calls in `_compute_reward()` and `_parse_reward_file()` |
| `code_env.py` | Add `timeout=_GET_TIME_STEP_TIMEOUT_S` (10 min) to `asyncio.wait()` in `_get_time_step()`, raising a clear `TimeoutError` |
| `examples/03_parallel_eval_with_api.py` | Wrap each `evaluate_task()` with `asyncio.wait_for(timeout=args.task_timeout_s)` (default 30 min); timed-out tasks surface as errors via `gather(return_exceptions=True)` |

Testing

All 163 existing tests pass. End-to-end testing with Daytona in progress.


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Implements comprehensive timeouts across container execution calls and task management to prevent evaluation workers from hanging indefinitely. These changes ensure that unresponsive container operations or delayed cancellations do not block system resources or parallel execution slots.

Execution Timeouts — Integrates timeout_s into exec_run calls within MiniSWECodeAgent and CodeEnvironment to prevent blocking during agent setup, test execution, and reward parsing.
Modified files (2)
  • src/ares/code_agents/mini_swe_agent.py
  • src/ares/environments/code_env.py

Task Safeguards — Adds a 10-minute backstop to _get_time_step using asyncio.wait and introduces a configurable task_timeout_s in the parallel evaluation script to safeguard against hung tasks.
Modified files (2)
  • examples/03_parallel_eval_with_api.py
  • src/ares/environments/code_env.py

Several exec_run() call sites were missing timeout_s, causing evaluations
to hang indefinitely when the container runtime does not properly propagate
asyncio cancellation. Reported symptom: 18 agent steps observed but wall-clock
time far exceeding 18 * 2 min = 36 min expected.

Root causes and fixes:

1. mini_swe_agent.py - Agent setup runs 4 exec_run() calls (uname -a/r/v/m)
   with no timeout. If Daytona is unresponsive during container setup, the
   code_agent_task hangs before making any LLM request, blocking _get_time_step()
   indefinitely. Fix: add setup_timeout_s=30.

2. code_env.py (_compute_reward) - Test evaluation and reward file reads use
   exec_run() with no timeout. If a test script contains an infinite loop or
   the container hangs during scoring, the entire episode hangs after the agent
   completes. Fix: add _REWARD_EXEC_TIMEOUT_S = 300 (5 min) to all exec calls
   in _compute_reward() and _parse_reward_file().

3. code_env.py (_get_time_step) - asyncio.wait() had no timeout argument.
   Even with per-operation timeouts, if the underlying Daytona SDK catches and
   delays CancelledError (e.g. during server-side cleanup), the effective timeout
   can far exceed the configured value. Fix: add timeout=_GET_TIME_STEP_TIMEOUT_S
   (10 min) and raise TimeoutError with a descriptive message if neither the
   code agent nor the LLM queue responds in time.

4. examples/03_parallel_eval_with_api.py - No per-task wall-clock limit. A hung
   task holds a semaphore slot indefinitely, reducing effective parallelism. Fix:
   wrap each evaluate_task() with asyncio.wait_for(timeout=args.task_timeout_s)
   (default 30 min). Timed-out tasks propagate TimeoutError through gather's
   return_exceptions=True and are counted as errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
