AndroidStateProvider should apply screenshot resize + coordinate_scale when vision is enabled #350

@joao-neto-feroot

Problem

With AndroidStateProvider (vision_only=False), the LLM receives the raw native screenshot (for example 1080x2400) while coordinate_scale_x/y stays at 1.0. The vision model downscales the image internally and outputs coordinates in that downscaled space. Those coordinates are then passed to click_at without correction.

Example: the LLM estimates an X dismiss button at (651, 230), but the real device position is about (1020, 170). The X ratio 651/1020 = 0.638 matches the model's apparent downscale factor. A retry at (976, 347) hit the button, and 976 is about 651 * 1.5, close to the inverse of that factor (1/0.638 is about 1.57).
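The arithmetic behind that example can be checked with a few lines of Python. This is only a worked calculation on the numbers quoted above; the variable names are illustrative and not taken from the codebase.

```python
# Numbers quoted in the example above (X axis only).
llm_x = 651      # X coordinate the LLM produced
device_x = 1020  # approximate true X position on the device
retry_x = 976    # X coordinate of the retry that worked

downscale = llm_x / device_x    # apparent downscale factor, about 0.638
inverse = 1 / downscale         # correction factor, about 1.567
retry_ratio = retry_x / llm_x   # factor the retry effectively applied, about 1.499

print(round(downscale, 3), round(inverse, 3), round(retry_ratio, 3))
```

The retry factor (about 1.5) and the inverse of the observed ratio (about 1.57) are close but not identical, which is consistent with the device position being an approximation.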

ScreenshotOnlyStateProvider already handles this. It resizes the screenshot with fit_dimensions_to_max_side, reports the resized dimensions in formatted_text, and sets coordinate_scale_x = input_width / screen_width so UIState.convert_point maps LLM coordinates back to device pixels.
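A minimal sketch of that resize-and-scale flow, under stated assumptions: the body of fit_dimensions_to_max_side, its max_side parameter and the 1024 default are hypothetical stand-ins (the issue does not show the real signature), and UIStateSketch is an illustrative class, not the project's UIState.

```python
from dataclasses import dataclass
from typing import Tuple


def fit_dimensions_to_max_side(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Hypothetical stand-in: shrink so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, keep native size
    ratio = max_side / longest
    return max(1, round(width * ratio)), max(1, round(height * ratio))


@dataclass
class UIStateSketch:
    """Illustrative holder for the scale factors (not the real UIState)."""
    coordinate_scale_x: float = 1.0
    coordinate_scale_y: float = 1.0

    def convert_point(self, x: float, y: float) -> Tuple[int, int]:
        # Map a point in resized-screenshot space back to device pixels.
        return round(x * self.coordinate_scale_x), round(y * self.coordinate_scale_y)


native_w, native_h = 1080, 2400
screen_w, screen_h = fit_dimensions_to_max_side(native_w, native_h)
state = UIStateSketch(native_w / screen_w, native_h / screen_h)

# The bottom-right corner of the resized image maps to the device's corner.
corner = state.convert_point(screen_w, screen_h)
```

With the assumed max_side of 1024, a 1080x2400 screen is reported to the model as 461x1024, and the scale factors are the ratios of native to resized size.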

AndroidStateProvider skips that logic. The a11y click path works when elements exist in the tree because coordinates come from element bounds. Some Compose overlays, tooltips, and popups are missing from the a11y tree, so the agent falls back to screenshot-based click_at. That path still uses convert_point with scale 1.0, so LLM coordinates land in the wrong device-pixel location.

Reproduction

Use an Android device whose native resolution exceeds the model's internal processing resolution, which is true for most devices. Run with vision_only=False and either fast_agent.vision=True or manager.vision=True.

Steps:

  • Start from a fresh Reddit installation.
  • Open Reddit and navigate to the profile screen.
  • Try to dismiss the tooltip.

Expected behavior: click on a11y-indexed elements works because those coordinates come from element bounds. Observed behavior: the tooltip is missing from the a11y tree, so the agent falls back to click_at, and those coordinate clicks are systematically off by the downscale factor.

Suggested fix

When vision is enabled on any sub-agent, apply the same fit_dimensions_to_max_side and coordinate_scale logic that ScreenshotOnlyStateProvider already uses:

# In AndroidStateProvider.get_state(), when vision is active:
screen_width, screen_height = fit_dimensions_to_max_side(native_width, native_height)
# ...
coordinate_scale_x = native_width / screen_width
coordinate_scale_y = native_height / screen_height

The click path resolves coordinates from element bounds, so it bypasses convert_point and is not affected. Only click_at and click_area go through convert_point, where the scale correction is needed.
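The difference between the two paths can be sketched as follows. All function names and the element bounds here are hypothetical and chosen for illustration; only the idea (bounds-based clicks skip scaling, coordinate clicks depend on it) comes from the description above.

```python
from typing import Tuple

Point = Tuple[int, int]
Bounds = Tuple[int, int, int, int]  # left, top, right, bottom in device pixels


def convert_point(x: float, y: float, scale_x: float, scale_y: float) -> Point:
    # Scale a model-space point into device pixels.
    return round(x * scale_x), round(y * scale_y)


def resolve_click(bounds: Bounds) -> Point:
    # a11y path: the target is the center of the element bounds, no scaling involved.
    left, top, right, bottom = bounds
    return (left + right) // 2, (top + bottom) // 2


def resolve_click_at(x: float, y: float, scale_x: float, scale_y: float) -> Point:
    # Coordinate path: the result is only correct if the scale factors are correct.
    return convert_point(x, y, scale_x, scale_y)


# Hypothetical bounds for a dismiss button centered near (1020, 170).
dismiss_bounds = (980, 130, 1060, 210)
a11y_target = resolve_click(dismiss_bounds)

# With scale 1.0 the model's coordinates pass through unchanged.
unscaled = resolve_click_at(651, 230, 1.0, 1.0)
```

In this sketch the bounds-based click lands on the button regardless of screenshot size, while the coordinate click with scale 1.0 simply returns the model's own (651, 230).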

The screenshot sent to the LLM should match the resized dimensions so the model's coordinate output aligns with the scale factor. The grid overlay from resize_image_to_max_side_with_grid could also be applied here. It is currently gated on requires_coordinate_tools, but resize plus scale is the critical fix.

Workarounds attempted (from the scanner side)

  • Added a coordinate grid overlay at the driver level. The grid labels appeared on the screenshot and the LLM read them, but it mapped the button to the wrong grid cell and still output the same wrong coordinates.
  • Added explicit screen dimensions to the prompt ("device screen is 1080x2400 pixels"). The LLM acknowledged the dimensions and noted the scaling factor, but still output coordinates in its internal resolution.
  • Reverted both changes because neither helps without coordinate scale correction.