[gui] feat: add GUI agent loop (sandbox-as-environment VLM training) #49

Open
aoshen02 wants to merge 1 commit into
verl-project:mainfrom
aoshen02:aoshen/gui-agent-loop

@aoshen02

Collaborator

Summary

Adds a new agent loop for VLM GUI agent RL training that treats a desktop sandbox as the environment: the model emits raw CoT plus pyautogui-style actions, and the sandbox executes them and returns the next screenshot. The loop runs until DONE, FAIL, or max_steps, retrying on recoverable sandbox errors. It also supports optional trajectory splitting and a FlexAttention heterogeneous-context mode that keeps all text tokens while applying a sliding window to images.

Everything new lives under uni_agent/gui/:

  • gui_agent_loop.py: @register("gui_agent") GUIAgentLoop, a subclass of verl.experimental.agent_loop.AgentLoopBase.
  • os_sandbox_tool.py: OSSandboxTool (verl.tools.BaseTool interface) + DummySandboxTool + SandboxClient (aiohttp) + TaskUnrecoverableError / SystemUnavailableError.
  • gui_utils.py: apply_sliding_window_to_images, PyautoguiActionConvertor (OAGI action → pyautogui command), CapsLockManager, key validation tables.

Following the existing uni-agent convention, the verl base classes (AgentLoopBase / AgentLoopMetrics / AgentLoopOutput / register / simple_timer / rollout_trace_op / TokenOutput) are still imported from the verl submodule; only the GUI-specific helpers and the sandbox tool live under uni_agent.gui.
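
The control flow described in the summary can be sketched as follows. This is a minimal illustration only, not the code in this PR: `FlakySandbox`, `RecoverableSandboxError`, `StepResult`, and `run_episode` are hypothetical names, and the real loop additionally handles tokenization, images, and trajectory splitting.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StepResult:
    screenshot: bytes
    status: str  # "CONTINUE", "DONE", or "FAIL"


class RecoverableSandboxError(Exception):
    """Hypothetical stand-in for a transient sandbox failure."""


@dataclass
class FlakySandbox:
    """Toy environment: fails a few times, then finishes after three actions."""

    failures_left: int = 1
    steps: int = 0

    async def execute(self, action: str) -> StepResult:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RecoverableSandboxError("transient error")
        self.steps += 1
        status = "DONE" if self.steps >= 3 else "CONTINUE"
        return StepResult(screenshot=b"png-bytes", status=status)


async def run_episode(sandbox, policy, max_steps=10, max_retries=2):
    """Alternate policy and sandbox until DONE, FAIL, or max_steps."""
    screenshot = b"initial"
    trajectory = []
    for _ in range(max_steps):
        action = policy(screenshot)  # the model proposes an action
        for attempt in range(max_retries + 1):
            try:
                result = await sandbox.execute(action)
                break
            except RecoverableSandboxError:
                if attempt == max_retries:
                    return trajectory, "FAIL"  # retries exhausted
        trajectory.append(action)
        screenshot = result.screenshot  # next observation
        if result.status in ("DONE", "FAIL"):
            return trajectory, result.status
    return trajectory, "MAX_STEPS"
```

Calling `asyncio.run(run_episode(FlakySandbox(), lambda shot: "pyautogui.click(1, 2)"))` exercises one retry followed by three successful steps.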

Test plan

  • ruff check uni_agent/gui/ → clean
  • ruff format --check uni_agent/gui/ → clean
  • pre-commit run --files uni_agent/gui/* (ruff + ruff-format + mypy + compileall) → all pass
  • Smoke-train against a small GUI sandbox dataset (run on a follow-up branch — needs oagi, aiohttp, and a reachable sandbox URL)

Notes

  • Imports oagi.utils.output_parser.parse_raw_output and oagi.types.ActionType; these are expected to be installed in the runtime environment alongside verl (same as the upstream working version).
  • Requires verl submodule to expose verl.experimental.agent_loop.{agent_loop, utils}, verl.tools.{base_tool, schemas}, verl.utils.{profiler, rollout_trace}, verl.workers.rollout.replica.TokenOutput. All present at the currently-pinned submodule commit.
  • AI-assisted (Claude Code, Opus 4.7).

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the GUI Agent components for VLM GUI agent training, including the agent loop, sandbox tool integration, and PyAutoGUI action conversion utilities. The review feedback highlights several critical issues: potential hanging in the sandbox client due to misplaced timeouts on response parsing instead of connection establishment, inefficient VM recreations caused by raising unrecoverable errors on format validation failures, potential runtime crashes in action conversion when returning empty lists for capslock or user call actions, and a strict assertion in the sliding window utility that could crash the training process if an image block ends exactly at the end of the prompt.

Comment on lines +309 to +315
    async with session.post(f"{self.sandbox_url}/api/v1/sandbox/create", json=request_body) as response:
        if response.status != 200:
            text = await response.text()
            logger.error(f"[SandboxClient] Create failed: status={response.status}, body={text[:500]}")
            response.raise_for_status()

        return await asyncio.wait_for(response.json(), timeout=self.create_timeout)

security-critical critical

The asyncio.wait_for is only wrapping response.json(), which is executed after the connection has already been established and headers received. If the sandbox server hangs during the connection or before sending headers, session.post will block indefinitely because there is no timeout passed to it. This can cause the entire training rollout worker to hang. Pass the timeout directly to session.post using aiohttp.ClientTimeout.

Suggested change
- async with session.post(f"{self.sandbox_url}/api/v1/sandbox/create", json=request_body) as response:
-     if response.status != 200:
-         text = await response.text()
-         logger.error(f"[SandboxClient] Create failed: status={response.status}, body={text[:500]}")
-         response.raise_for_status()
-     return await asyncio.wait_for(response.json(), timeout=self.create_timeout)
+ import aiohttp
+ timeout = aiohttp.ClientTimeout(total=self.create_timeout)
+ async with session.post(f"{self.sandbox_url}/api/v1/sandbox/create", json=request_body, timeout=timeout) as response:
+     if response.status != 200:
+         text = await response.text()
+         logger.error(f"[SandboxClient] Create failed: status={response.status}, body={text[:500]}")
+         response.raise_for_status()
+     return await response.json()
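
The difference between bounding only the body read and bounding the whole request can be reproduced with the standard library alone. The sketch below is an analogue, not the PR's code: `fake_post` and `fake_read_json` are hypothetical stand-ins for `session.post` and `response.json()`, and `asyncio.wait_for` around the full request plays the role that `aiohttp.ClientTimeout(total=...)` plays in the suggestion.

```python
import asyncio


async def fake_post(delay_before_headers: float) -> dict:
    """Hypothetical stand-in for session.post: stalls before 'headers' arrive."""
    await asyncio.sleep(delay_before_headers)
    return {"status": 200}


async def fake_read_json() -> dict:
    """Hypothetical stand-in for response.json(); returns immediately."""
    return {"task_id": "demo"}


async def timeout_on_body_only(delay: float, timeout: float) -> dict:
    # Mirrors the original pattern: the wait for headers is unbounded.
    await fake_post(delay)
    return await asyncio.wait_for(fake_read_json(), timeout=timeout)


async def timeout_on_whole_request(delay: float, timeout: float) -> dict:
    async def request() -> dict:
        await fake_post(delay)
        return await fake_read_json()

    # The deadline now covers the stall before headers and the body read.
    return await asyncio.wait_for(request(), timeout=timeout)


async def demo() -> tuple:
    """Return (body_only_timed_out, whole_request_timed_out)."""
    body_only_timed_out = False
    try:
        await timeout_on_body_only(delay=0.2, timeout=0.05)
    except asyncio.TimeoutError:
        body_only_timed_out = True

    whole_timed_out = False
    try:
        await timeout_on_whole_request(delay=0.2, timeout=0.05)
    except asyncio.TimeoutError:
        whole_timed_out = True
    return body_only_timed_out, whole_timed_out
```

With a 0.2 s stall and a 0.05 s deadline, only the second variant raises a timeout; the first simply waits out the stall, which is the hang the review describes when the stall never ends.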

Comment on lines +347 to +353
        session = await self._get_session()
        async with session.post(
            f"{self.sandbox_url}/api/v1/sandbox/execute", json={"task_id": task_id, "action": action}
        ) as response:
            response.raise_for_status()
            return await asyncio.wait_for(response.json(), timeout=self.execute_timeout)
    except asyncio.TimeoutError as e:

security-critical critical

Similar to the create method, asyncio.wait_for only wraps response.json(). If the sandbox execution hangs, session.post will block indefinitely. Pass the timeout directly to session.post instead.

Suggested change
-     session = await self._get_session()
-     async with session.post(
-         f"{self.sandbox_url}/api/v1/sandbox/execute", json={"task_id": task_id, "action": action}
-     ) as response:
-         response.raise_for_status()
-         return await asyncio.wait_for(response.json(), timeout=self.execute_timeout)
- except asyncio.TimeoutError as e:
+     import aiohttp
+     timeout = aiohttp.ClientTimeout(total=self.execute_timeout)
+     session = await self._get_session()
+     async with session.post(
+         f"{self.sandbox_url}/api/v1/sandbox/execute",
+         json={"task_id": task_id, "action": action},
+         timeout=timeout,
+     ) as response:
+         response.raise_for_status()
+         return await response.json()
+ except asyncio.TimeoutError as e:

Comment on lines +535 to +541
    if gen_attempt == self.MAX_GENERATION_RETRIES - 1:
        raise TaskUnrecoverableError(
            task_id=agent_data.request_id,
            error=f"Validation failed after {self.MAX_GENERATION_RETRIES} attempts: {error_msg}",
            step_count=agent_data.step_count,
        ) from e
    continue

high

Raising TaskUnrecoverableError when model generation format validation fails is highly inefficient. TaskUnrecoverableError triggers a full sandbox VM teardown and recreation (which can take up to 5 minutes). Format validation is a model generation issue, not a system/VM corruption issue. It should be treated as a non-retryable trajectory failure (similar to TruncatedError) to avoid wasting massive VM/GPU resources on retries.

Suggested change
- if gen_attempt == self.MAX_GENERATION_RETRIES - 1:
-     raise TaskUnrecoverableError(
-         task_id=agent_data.request_id,
-         error=f"Validation failed after {self.MAX_GENERATION_RETRIES} attempts: {error_msg}",
-         step_count=agent_data.step_count,
-     ) from e
- continue
+ if gen_attempt == self.MAX_GENERATION_RETRIES - 1:
+     agent_data.truncated = True
+     raise TruncatedError(
+         agent_data=agent_data,
+         message=f"Validation failed after {self.MAX_GENERATION_RETRIES} attempts: {error_msg}",
+     ) from e
+ continue
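
To make the cost argument concrete, here is a rough sketch of how the choice of exception type could change the number of VM rebuilds. It assumes, as the comment states, that `TaskUnrecoverableError` triggers a sandbox rebuild while `TruncatedError` just ends the trajectory. The class definitions and `run_with_recovery` are simplified, hypothetical stand-ins with different signatures from the PR's.

```python
class TaskUnrecoverableError(Exception):
    """Simplified stand-in: the sandbox is broken, so the caller rebuilds it."""


class TruncatedError(Exception):
    """Simplified stand-in: the trajectory ends early; the sandbox is healthy."""


def run_with_recovery(episode, max_vm_rebuilds=2):
    """Run `episode`, counting (imaginary) VM rebuilds along the way."""
    rebuilds = 0
    while True:
        try:
            return episode(), rebuilds
        except TruncatedError:
            # Model-side failure: stop here, no rebuild needed.
            return "truncated", rebuilds
        except TaskUnrecoverableError:
            if rebuilds == max_vm_rebuilds:
                return "failed", rebuilds
            rebuilds += 1  # in a real system this is the expensive step


def bad_format_as_unrecoverable():
    raise TaskUnrecoverableError("validation failed")


def bad_format_as_truncated():
    raise TruncatedError("validation failed")
```

In this toy setup, a persistent format failure reported as `TaskUnrecoverableError` consumes every allowed rebuild before giving up, whereas the same failure reported as `TruncatedError` ends with zero rebuilds.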

Comment thread uni_agent/gui_utils.py
Comment on lines +791 to +800
    if len(valid_keys) == 1 and valid_keys[0] == "capslock":
        if self.caps_manager.should_use_system_capslock():
            # System mode: use OS-level caps lock
            hotkey_interval = self.pyautogui_config.hotkey_interval
            return [f"pyautogui.hotkey('capslock', interval={hotkey_interval})"]
        else:
            # Session mode: toggle internal state (no actual key press needed in conversion)
            self.caps_manager.toggle()
            return []  # No pyautogui command needed for session mode
    else:

high

Returning an empty list [] for capslock toggle in session mode will cause __call__ to raise a RuntimeError ("All action conversions failed") if it is the only action in the sequence, because converted remains empty. To prevent this, return a no-op command like ["WAIT(0.0)"] which is safely parsed as a 0-second sleep.

Suggested change
- if len(valid_keys) == 1 and valid_keys[0] == "capslock":
-     if self.caps_manager.should_use_system_capslock():
-         # System mode: use OS-level caps lock
-         hotkey_interval = self.pyautogui_config.hotkey_interval
-         return [f"pyautogui.hotkey('capslock', interval={hotkey_interval})"]
-     else:
-         # Session mode: toggle internal state (no actual key press needed in conversion)
-         self.caps_manager.toggle()
-         return []  # No pyautogui command needed for session mode
- else:
+ if len(valid_keys) == 1 and valid_keys[0] == "capslock":
+     if self.caps_manager.should_use_system_capslock():
+         # System mode: use OS-level caps lock
+         hotkey_interval = self.pyautogui_config.hotkey_interval
+         return [f"pyautogui.hotkey('capslock', interval={hotkey_interval})"]
+     else:
+         # Session mode: toggle internal state (no actual key press needed in conversion)
+         self.caps_manager.toggle()
+         return ["WAIT(0.0)"]
+ else:

Comment thread uni_agent/gui_utils.py
Comment on lines +860 to +863
    if action_type == ActionType.CALL_USER.value:
        # User intervention requested - not an error, just no-op
        self.logger.info("User intervention requested")
        return []

high

Returning [] for CALL_USER will cause __call__ to raise a RuntimeError if it is the only action in the sequence. Return ["WAIT(0.0)"] instead to safely represent a no-op.

Suggested change
- if action_type == ActionType.CALL_USER.value:
-     # User intervention requested - not an error, just no-op
-     self.logger.info("User intervention requested")
-     return []
+ if action_type == ActionType.CALL_USER.value:
+     # User intervention requested - not an error, just no-op
+     self.logger.info("User intervention requested")
+     return ["WAIT(0.0)"]
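
The failure mode raised in this comment and the capslock comment before it can be illustrated with a small sketch. It assumes, based only on the review text, that the converter's `__call__` raises `RuntimeError` when no commands were produced. `convert_all` and the two per-action converters are hypothetical and much simpler than `PyautoguiActionConvertor`.

```python
def convert_all(actions, convert_one):
    """Hypothetical __call__-like wrapper: fails when nothing was converted."""
    converted = []
    for action in actions:
        converted.extend(convert_one(action))
    if not converted:
        raise RuntimeError("All action conversions failed")
    return converted


def convert_returning_empty(action):
    # An intentional no-op is expressed as an empty list.
    if action == "call_user":
        return []
    return [f"pyautogui.press('{action}')"]


def convert_returning_noop(action):
    # An intentional no-op is expressed as an explicit zero-length wait.
    if action == "call_user":
        return ["WAIT(0.0)"]
    return [f"pyautogui.press('{action}')"]
```

With the empty-list variant, `convert_all(["call_user"], convert_returning_empty)` raises because an intentional no-op is indistinguishable from a total conversion failure. The explicit `WAIT(0.0)` variant keeps the two cases apart.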

Comment thread uni_agent/gui_utils.py
        last_end = end

    # Add remaining tokens after last block
    assert last_end < len(prompt_ids)

medium

If the last image block ends exactly at the end of prompt_ids, last_end will be equal to len(prompt_ids). In this case, assert last_end < len(prompt_ids) will fail and crash the training process. Change the assertion to assert last_end <= len(prompt_ids).

Suggested change
- assert last_end < len(prompt_ids)
+ assert last_end <= len(prompt_ids)
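
A small sketch shows why the relaxed bound is safe. `keep_last_blocks` is a hypothetical, simplified analogue of a sliding window over image blocks, not the PR's `apply_sliding_window_to_images`. It assumes image blocks are given as sorted half-open `(start, end)` index ranges into `prompt_ids`.

```python
def keep_last_blocks(prompt_ids, blocks, window):
    """Drop all but the last `window` image blocks; keep every other token."""
    cutoff = max(len(blocks) - window, 0)  # blocks before this index are dropped
    kept, last_end = [], 0
    for i, (start, end) in enumerate(blocks):
        kept.extend(prompt_ids[last_end:start])  # text before this block
        if i >= cutoff:
            kept.extend(prompt_ids[start:end])  # image block inside the window
        last_end = end

    # A block may end exactly at the end of the prompt, hence <= rather than <.
    assert last_end <= len(prompt_ids)
    kept.extend(prompt_ids[last_end:])  # empty slice when last_end == len(prompt_ids)
    return kept
```

When the final block ends at `len(prompt_ids)`, the trailing slice is simply empty, so nothing after the assertion depends on the strict inequality.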

Adds a new agent loop for VLM GUI agent RL training that treats a desktop
sandbox as the environment: the model emits raw CoT + pyautogui-style
actions, the sandbox executes them, and returns the next screenshot. The
loop runs until DONE/FAIL/max_steps with retry on recoverable sandbox
errors, optional trajectory splitting, and a FlexAttention heterogeneous-
context mode that keeps all text tokens while sliding-windowing images.

Layout (one file per agent loop variant, sibling to existing UniAgentLoop;
the sandbox tool sits with the other tools):

- uni_agent/gui_agent_loop.py — `@register("gui_agent")` GUIAgentLoop,
  on top of verl.experimental.agent_loop.AgentLoopBase. Parallels the
  existing uni_agent/agent_loop.py / UniAgentLoop.
- uni_agent/gui_utils.py — apply_sliding_window_to_images,
  PyautoguiActionConvertor (OAGI action -> pyautogui command),
  CapsLockManager, key validation tables.
- uni_agent/tools/os_sandbox_tool.py — OSSandboxTool (verl BaseTool) +
  DummySandboxTool + SandboxClient (aiohttp) + TaskUnrecoverableError /
  SystemUnavailableError.

Imports follow uni-agent convention: AgentLoopBase / AgentLoopMetrics /
AgentLoopOutput / register / simple_timer / rollout_trace_op / TokenOutput
still come from the verl submodule; only the GUI-specific helpers and the
sandbox tool live under uni_agent.

Lints: ruff + ruff-format + mypy + compileall pre-commit hooks pass on
the new files. AI-assisted (Claude Code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aoshen02 aoshen02 force-pushed the aoshen/gui-agent-loop branch from 3601b8b to 5cd5a02 Compare May 28, 2026 15:27