[gui] feat: add GUI agent loop (sandbox-as-environment VLM training)#49
aoshen02 wants to merge 1 commit into
Code Review
This pull request introduces the GUI Agent components for VLM GUI agent training, including the agent loop, sandbox tool integration, and PyAutoGUI action conversion utilities. The review feedback highlights several critical issues: potential hanging in the sandbox client due to misplaced timeouts on response parsing instead of connection establishment, inefficient VM recreations caused by raising unrecoverable errors on format validation failures, potential runtime crashes in action conversion when returning empty lists for capslock or user call actions, and a strict assertion in the sliding window utility that could crash the training process if an image block ends exactly at the end of the prompt.
async with session.post(f"{self.sandbox_url}/api/v1/sandbox/create", json=request_body) as response:
    if response.status != 200:
        text = await response.text()
        logger.error(f"[SandboxClient] Create failed: status={response.status}, body={text[:500]}")
        response.raise_for_status()

    return await asyncio.wait_for(response.json(), timeout=self.create_timeout)
asyncio.wait_for wraps only response.json(), which runs after the connection has been established and the headers received. If the sandbox server hangs while connecting or before sending headers, session.post is not bounded by create_timeout: it waits for whatever timeout the session itself carries (aiohttp's default is a 300-second total, and a session created without a timeout waits indefinitely). This can stall the entire training rollout worker. Pass the timeout directly to session.post using aiohttp.ClientTimeout.
Suggested change:
+ import aiohttp
+ timeout = aiohttp.ClientTimeout(total=self.create_timeout)
- async with session.post(f"{self.sandbox_url}/api/v1/sandbox/create", json=request_body) as response:
+ async with session.post(f"{self.sandbox_url}/api/v1/sandbox/create", json=request_body, timeout=timeout) as response:
      if response.status != 200:
          text = await response.text()
          logger.error(f"[SandboxClient] Create failed: status={response.status}, body={text[:500]}")
          response.raise_for_status()
-     return await asyncio.wait_for(response.json(), timeout=self.create_timeout)
+     return await response.json()
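The difference in timeout scope can be reproduced without aiohttp. The following is a minimal stdlib-only sketch, not the PR's code: `hung_post`, `parse_json`, and `create_bounded` are hypothetical stand-ins, and `asyncio.wait_for` around the whole request plays the role that `aiohttp.ClientTimeout(total=...)` plays in the suggestion.

```python
import asyncio


async def hung_post():
    """Hypothetical stand-in for a request whose server never sends headers."""
    await asyncio.sleep(3600)
    return {"ok": True}


async def parse_json(response):
    """Hypothetical stand-in for response.json(); returns immediately."""
    return response


async def create_bounded(timeout: float):
    """Apply one deadline to the whole request, not just to body parsing."""

    async def whole_request():
        response = await hung_post()  # the stage that can hang
        return await parse_json(response)

    return await asyncio.wait_for(whole_request(), timeout=timeout)


def demo() -> str:
    """Run the bounded request against the hung server and report the outcome."""
    try:
        asyncio.run(create_bounded(timeout=0.05))
    except asyncio.TimeoutError:
        return "timed out"
    return "completed"
```

Wrapping only `parse_json` in `wait_for` would never fire here, because the coroutine is still stuck inside `hung_post` when the deadline would have applied.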
    session = await self._get_session()
    async with session.post(
        f"{self.sandbox_url}/api/v1/sandbox/execute", json={"task_id": task_id, "action": action}
    ) as response:
        response.raise_for_status()
        return await asyncio.wait_for(response.json(), timeout=self.execute_timeout)
except asyncio.TimeoutError as e:
As in the create method, asyncio.wait_for wraps only response.json(). If the sandbox execution hangs before the response headers arrive, session.post is not bounded by execute_timeout and can block far longer than intended. Pass the timeout directly to session.post instead.
Suggested change:
      session = await self._get_session()
-     async with session.post(
-         f"{self.sandbox_url}/api/v1/sandbox/execute", json={"task_id": task_id, "action": action}
-     ) as response:
-         response.raise_for_status()
-         return await asyncio.wait_for(response.json(), timeout=self.execute_timeout)
+     import aiohttp
+     timeout = aiohttp.ClientTimeout(total=self.execute_timeout)
+     async with session.post(
+         f"{self.sandbox_url}/api/v1/sandbox/execute",
+         json={"task_id": task_id, "action": action},
+         timeout=timeout,
+     ) as response:
+         response.raise_for_status()
+         return await response.json()
  except asyncio.TimeoutError as e:
if gen_attempt == self.MAX_GENERATION_RETRIES - 1:
    raise TaskUnrecoverableError(
        task_id=agent_data.request_id,
        error=f"Validation failed after {self.MAX_GENERATION_RETRIES} attempts: {error_msg}",
        step_count=agent_data.step_count,
    ) from e
continue
Raising TaskUnrecoverableError when model generation format validation fails is highly inefficient. TaskUnrecoverableError triggers a full sandbox VM teardown and recreation (which can take up to 5 minutes). Format validation is a model generation issue, not a system/VM corruption issue. It should be treated as a non-retryable trajectory failure (similar to TruncatedError) to avoid wasting massive VM/GPU resources on retries.
Suggested change:
  if gen_attempt == self.MAX_GENERATION_RETRIES - 1:
-     raise TaskUnrecoverableError(
-         task_id=agent_data.request_id,
-         error=f"Validation failed after {self.MAX_GENERATION_RETRIES} attempts: {error_msg}",
-         step_count=agent_data.step_count,
-     ) from e
+     agent_data.truncated = True
+     raise TruncatedError(
+         agent_data=agent_data,
+         message=f"Validation failed after {self.MAX_GENERATION_RETRIES} attempts: {error_msg}",
+     ) from e
  continue
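The distinction the comment draws, between a failure that ends the trajectory and one that forces a VM rebuild, can be sketched as follows. This is an illustration under stated assumptions: the two exception classes are simplified stand-ins with the same names as in the PR (their real constructors differ), and `generate_valid_action`, `run_step`, and `strict` are hypothetical helpers.

```python
class TruncatedError(Exception):
    """Stand-in: the trajectory ends, but the sandbox VM is still healthy."""


class TaskUnrecoverableError(Exception):
    """Stand-in: sandbox state is unusable, so the VM must be recreated."""


MAX_GENERATION_RETRIES = 3


def generate_valid_action(generate, validate):
    """Retry generation; exhausted format failures end the trajectory only."""
    last_error = None
    for attempt in range(MAX_GENERATION_RETRIES):
        raw = generate(attempt)
        try:
            return validate(raw)
        except ValueError as e:  # format validation failed, try again
            last_error = e
    raise TruncatedError(
        f"Validation failed after {MAX_GENERATION_RETRIES} attempts: {last_error}"
    ) from last_error


def run_step(generate, validate):
    """Map each error class to the recovery action it should trigger."""
    try:
        return ("ok", generate_valid_action(generate, validate))
    except TruncatedError:
        return ("truncated", None)  # keep the VM, drop the trajectory
    except TaskUnrecoverableError:
        return ("recreate_vm", None)  # expensive path, reserved for VM faults


def strict(raw: str) -> str:
    """Hypothetical validator: accept only click actions."""
    if not raw.startswith("click"):
        raise ValueError("bad format")
    return raw
```

With this split, a model that never produces a valid action costs one trajectory rather than a sandbox teardown.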
if len(valid_keys) == 1 and valid_keys[0] == "capslock":
    if self.caps_manager.should_use_system_capslock():
        # System mode: use OS-level caps lock
        hotkey_interval = self.pyautogui_config.hotkey_interval
        return [f"pyautogui.hotkey('capslock', interval={hotkey_interval})"]
    else:
        # Session mode: toggle internal state (no actual key press needed in conversion)
        self.caps_manager.toggle()
        return []  # No pyautogui command needed for session mode
else:
Returning an empty list [] for capslock toggle in session mode will cause __call__ to raise a RuntimeError ("All action conversions failed") if it is the only action in the sequence, because converted remains empty. To prevent this, return a no-op command like ["WAIT(0.0)"] which is safely parsed as a 0-second sleep.
Suggested change:
  if len(valid_keys) == 1 and valid_keys[0] == "capslock":
      if self.caps_manager.should_use_system_capslock():
          # System mode: use OS-level caps lock
          hotkey_interval = self.pyautogui_config.hotkey_interval
          return [f"pyautogui.hotkey('capslock', interval={hotkey_interval})"]
      else:
          # Session mode: toggle internal state (no actual key press needed in conversion)
          self.caps_manager.toggle()
-         return []  # No pyautogui command needed for session mode
+         return ["WAIT(0.0)"]
  else:
if action_type == ActionType.CALL_USER.value:
    # User intervention requested - not an error, just no-op
    self.logger.info("User intervention requested")
    return []
Returning [] for CALL_USER will cause __call__ to raise a RuntimeError if it is the only action in the sequence. Return ["WAIT(0.0)"] instead to safely represent a no-op.
Suggested change:
  if action_type == ActionType.CALL_USER.value:
      # User intervention requested - not an error, just no-op
      self.logger.info("User intervention requested")
-     return []
+     return ["WAIT(0.0)"]
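Both no-op cases (session-mode capslock and CALL_USER) fail for the same reason, which a small model of the converter makes visible. This is a sketch, not `PyautoguiActionConvertor` itself: `convert_one`, `convert_all`, the action names, and the `emit_no_op` flag are hypothetical, and only the `"WAIT(0.0)"` command and the error message come from the review.

```python
from typing import List

NO_OP = "WAIT(0.0)"  # no-op command proposed in the review


def convert_one(action: str, emit_no_op: bool = True) -> List[str]:
    """Hypothetical converter for a single action name."""
    if action in ("capslock_session", "call_user"):
        # Nothing needs to run in the sandbox for these actions.
        return [NO_OP] if emit_no_op else []
    return [f"pyautogui.press({action!r})"]


def convert_all(actions: List[str], emit_no_op: bool = True) -> List[str]:
    """Model of the failure mode: an empty overall result is an error."""
    converted: List[str] = []
    for action in actions:
        converted.extend(convert_one(action, emit_no_op))
    if not converted:
        raise RuntimeError("All action conversions failed")
    return converted


def fails_when_empty() -> bool:
    """Return True if a lone no-op action raises when it converts to []."""
    try:
        convert_all(["call_user"], emit_no_op=False)
    except RuntimeError:
        return True
    return False
```

Emitting an explicit no-op keeps "nothing to execute" distinguishable from "every conversion failed", which the empty list cannot express.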
    last_end = end

# Add remaining tokens after last block
assert last_end < len(prompt_ids)
If the last image block ends exactly at the end of prompt_ids, last_end will be equal to len(prompt_ids). In this case, assert last_end < len(prompt_ids) will fail and crash the training process. Change the assertion to assert last_end <= len(prompt_ids).
Suggested change:
- assert last_end < len(prompt_ids)
+ assert last_end <= len(prompt_ids)
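The boundary case is easy to check on a toy input. The function below is a hypothetical reduction of the tail-handling logic, not `apply_sliding_window_to_images` itself; it assumes image blocks are given as sorted, non-overlapping half-open `(start, end)` spans.

```python
def split_around_blocks(prompt_ids, blocks):
    """Return the text segments that surround the given image-token spans."""
    text_segments = []
    last_end = 0
    for start, end in blocks:
        text_segments.append(prompt_ids[last_end:start])
        last_end = end

    # A block may end exactly at the end of the prompt, so equality is valid.
    assert last_end <= len(prompt_ids)
    # Add remaining tokens after the last block (an empty list at the boundary).
    text_segments.append(prompt_ids[last_end:])
    return text_segments
```

With the strict `<` comparison, the second test case below (a prompt that ends in an image block) would raise `AssertionError` even though slicing handles it correctly.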
Adds a new agent loop for VLM GUI agent RL training that treats a desktop
sandbox as the environment: the model emits raw CoT + pyautogui-style
actions, the sandbox executes them, and returns the next screenshot. The
loop runs until DONE/FAIL/max_steps with retry on recoverable sandbox
errors, optional trajectory splitting, and a FlexAttention heterogeneous-
context mode that keeps all text tokens while sliding-windowing images.
Layout (one file per agent loop variant, sibling to existing UniAgentLoop;
the sandbox tool sits with the other tools):
- uni_agent/gui_agent_loop.py — `@register("gui_agent")` GUIAgentLoop,
on top of verl.experimental.agent_loop.AgentLoopBase. Parallels the
existing uni_agent/agent_loop.py / UniAgentLoop.
- uni_agent/gui_utils.py — apply_sliding_window_to_images,
PyautoguiActionConvertor (OAGI action -> pyautogui command),
CapsLockManager, key validation tables.
- uni_agent/tools/os_sandbox_tool.py — OSSandboxTool (verl BaseTool) +
DummySandboxTool + SandboxClient (aiohttp) + TaskUnrecoverableError /
SystemUnavailableError.
Imports follow uni-agent convention: AgentLoopBase / AgentLoopMetrics /
AgentLoopOutput / register / simple_timer / rollout_trace_op / TokenOutput
still come from the verl submodule; only the GUI-specific helpers and the
sandbox tool live under uni_agent.
Lints: ruff + ruff-format + mypy + compileall pre-commit hooks pass on
the new files. AI-assisted (Claude Code).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3601b8b to 5cd5a02
Summary
Adds a new agent loop for VLM GUI agent RL training that treats a desktop sandbox as the environment: the model emits raw CoT + pyautogui-style actions, the sandbox executes them, and returns the next screenshot. The loop runs until DONE / FAIL / max_steps with retry on recoverable sandbox errors, optional trajectory splitting, and a FlexAttention heterogeneous-context mode that keeps all text tokens while sliding-windowing images.
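The control flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not `GUIAgentLoop`: `run_episode`, `FakeSandbox`, `RecoverableSandboxError`, and the retry count are hypothetical, and trajectory splitting and the image sliding window are left out.

```python
from dataclasses import dataclass


class RecoverableSandboxError(Exception):
    """Hypothetical transient sandbox failure that is worth retrying."""


@dataclass
class FakeSandbox:
    """Hypothetical environment: fails the first `fail_first` calls."""

    fail_first: int = 1
    calls: int = 0

    def execute(self, action: str) -> str:
        self.calls += 1
        if self.calls <= self.fail_first:
            raise RecoverableSandboxError("transient")
        return f"screenshot-{self.calls}"


def run_episode(policy, sandbox, max_steps: int = 5, max_retries: int = 2):
    """Act until the policy emits DONE/FAIL or max_steps is reached."""
    screenshot = "screenshot-0"
    history = []
    for step in range(max_steps):
        action = policy(screenshot, step)
        history.append(action)
        if action in ("DONE", "FAIL"):
            return action, history
        for attempt in range(max_retries + 1):
            try:
                screenshot = sandbox.execute(action)  # next observation
                break
            except RecoverableSandboxError:
                if attempt == max_retries:
                    raise  # retries exhausted, let the caller decide
    return "MAX_STEPS", history
```

A policy that clicks twice and then stops, run against a sandbox that fails once, ends with `DONE` after three sandbox calls (one failed attempt plus two successful ones).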
Everything new lives under uni_agent/gui/:
- gui_agent_loop.py: @register("gui_agent") GUIAgentLoop, subclass of verl.experimental.agent_loop.AgentLoopBase.
- os_sandbox_tool.py: OSSandboxTool (verl.tools.BaseTool interface) + DummySandboxTool + SandboxClient (aiohttp) + TaskUnrecoverableError / SystemUnavailableError.
- gui_utils.py: apply_sliding_window_to_images, PyautoguiActionConvertor (OAGI action → pyautogui command), CapsLockManager, key validation tables.
Following the existing uni-agent convention, the verl base classes (AgentLoopBase / AgentLoopMetrics / AgentLoopOutput / register / simple_timer / rollout_trace_op / TokenOutput) are still imported from the verl submodule; only the GUI-specific helpers and the sandbox tool live under uni_agent.gui.
Test plan
- ruff check uni_agent/gui/ → clean
- ruff format --check uni_agent/gui/ → clean
- pre-commit run --files uni_agent/gui/* (ruff + ruff-format + mypy + compileall) → all pass
- oagi, aiohttp, and a reachable sandbox URL)
Notes
- oagi.utils.output_parser.parse_raw_output and oagi.types.ActionType; these are expected to be installed in the runtime environment alongside verl (same as the upstream working version).
- verl submodule to expose verl.experimental.agent_loop.{agent_loop, utils}, verl.tools.{base_tool, schemas}, verl.utils.{profiler, rollout_trace}, verl.workers.rollout.replica.TokenOutput. All present at the currently-pinned submodule commit.
🤖 Generated with Claude Code