feat: add computer use tools and agent for desktop GUI automation #523

Open
joshkotrous wants to merge 5 commits into canary from cursor/computer-use-tools-agent-5102

@joshkotrous
Contributor

What does this PR do?

Adds a complete computer use toolset and a specialized Computer Use agent for desktop GUI automation across all three major platforms.

Problem: Penetration testing scenarios involving thick-client apps, VNC/RDP sessions, native OS dialogs, and other graphical interfaces cannot be automated through browser tools or CLI alone.

Solution: Platform-aware desktop automation tools that provide mouse, keyboard, screenshot, and scroll capabilities:

Tools (9 new tools)

| Tool | Description |
| --- | --- |
| computer_screenshot | Capture screen as base64 PNG + save to evidence dir |
| computer_mouse_click | Click at (x,y) with the left, right, or middle button |
| computer_mouse_double_click | Double-click at position |
| computer_mouse_move | Move cursor without clicking (hover) |
| computer_mouse_drag | Click-and-drag between two points |
| computer_type_text | Type text via keyboard |
| computer_key_press | Press keys/combos (ctrl+c, alt+Tab, Return, etc.) |
| computer_scroll | Scroll mouse wheel up/down |
| computer_screen_info | Get screen size, mouse position, active window title |

Platform backends

| Platform | Dependencies | Method |
| --- | --- | --- |
| Linux | xdotool + scrot/ImageMagick | X11 automation |
| macOS | cliclick + screencapture | Native Quartz |
| Windows | PowerShell (built-in) | .NET System.Windows.Forms + user32.dll P/Invoke |

Computer Use Agent

  • ComputerUseAgent — specialized subagent following an observe-plan-act loop (screenshot → identify elements → interact → verify)
  • delegate_to_computer_use_agent — orchestration tool so parent agents can delegate GUI tasks
  • Registered in ALL_TOOL_NAMES, PLAN_MODE_TOOL_NAMES, and the toolset UI definitions
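
The observe-plan-act loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's ComputerUseAgent: the `Desktop`, `Planner`, `Action`, and `runObservePlanAct` names are hypothetical, and in the real agent the planning step would be performed by the model rather than by a plain function.

```typescript
// Hypothetical types for illustration only; none of these names come from the PR.
type Observation = { screenshotPath: string; windowTitle: string };

type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done"; summary: string };

interface Desktop {
  // Capture the current screen state (screenshot + active window).
  observe(): Observation;
  // Carry out one GUI interaction.
  perform(action: Exclude<Action, { kind: "done" }>): void;
}

// Chooses the next action from the goal, the latest observation, and past actions.
type Planner = (goal: string, obs: Observation, history: Action[]) => Action;

function runObservePlanAct(
  goal: string,
  desktop: Desktop,
  plan: Planner,
  maxSteps = 10,
): { finished: boolean; history: Action[] } {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const obs = desktop.observe(); // observe
    const action = plan(goal, obs, history); // plan
    history.push(action);
    if (action.kind === "done") {
      return { finished: true, history };
    }
    desktop.perform(action); // act; the next iteration's observe() verifies the result
  }
  // Step budget exhausted without the planner declaring completion.
  return { finished: false, history };
}
```

The step cap is an assumption of this sketch; it keeps a confused planner from clicking indefinitely.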

Architecture

  • src/core/agents/offSecAgent/tools/computerUse/ — platform backend + 9 tool files
  • src/core/agents/specialized/computerUseAgent/ — agent, prompts, types
  • src/core/agents/offSecAgent/tools/delegateComputerUse.ts — orchestration tool
  • All tools follow existing patterns: tool() from ai + Zod schemas + ToolContext
  • Subagent follows the same pattern as spawnCodingAgent / delegateAuth (dynamic import, event bus forwarding, subagent lifecycle)

How did you verify your code works?

  • bun run tsc — TypeScript type checking passes
  • bun run lint — ESLint passes
  • bun run format:check — Prettier passes
  • bun run test — All 30 test files pass (583 tests), 7 skipped (integration tests needing live services)
  • New platform detection and backend construction tests added (platform.test.ts) — verifies all three backends are constructable and implement the full DesktopBackend interface

cursoragent and others added 2 commits April 1, 2026 18:11
Adds a complete computer use toolset for desktop automation via:
- Linux: xdotool + scrot/ImageMagick
- macOS: cliclick + screencapture

Tools:
- computer_screenshot: capture screen as base64 PNG
- computer_mouse_click: click at (x,y) coordinates
- computer_mouse_double_click: double-click at position
- computer_mouse_move: move cursor without clicking
- computer_mouse_drag: drag from one point to another
- computer_type_text: type text via keyboard
- computer_key_press: press keys/combos (ctrl+c, alt+Tab, etc.)
- computer_scroll: scroll mouse wheel
- computer_screen_info: get screen size, mouse pos, active window

Agent:
- ComputerUseAgent specialized subagent follows observe-plan-act loop
- delegate_to_computer_use_agent orchestration tool for parent agents
- Registered in ALL_TOOL_NAMES and PLAN_MODE_TOOL_NAMES
- Added to toolset UI definitions

Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
Adds WindowsBackend using PowerShell + .NET System.Windows.Forms
and user32.dll P/Invoke for desktop automation on Windows:

- Screenshot via System.Drawing.Graphics.CopyFromScreen
- Mouse control via user32.dll mouse_event / SetCursorPos
- Keyboard input via System.Windows.Forms.SendKeys
- Window title via user32.dll GetWindowText
- Screen size via System.Windows.Forms.Screen.PrimaryScreen

No external dependencies required — uses built-in PowerShell/.NET.
Includes mapKeysToSendKeys() to translate xdotool-style key names
(ctrl+c, Return, alt+Tab) to .NET SendKeys format.

Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
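
For readers unfamiliar with the SendKeys format, the translation this commit describes might look roughly like the sketch below. The body of mapKeysToSendKeys() is not visible in the truncated diff, so `sketchMapKeys` is a hypothetical stand-in that covers only a handful of keys; the modifier codes (`^`, `%`, `+`) and brace names are the documented SendKeys ones.

```typescript
// Hypothetical sketch; NOT the PR's mapKeysToSendKeys() implementation.
const MODIFIER_CODES = new Map<string, string>([
  ["ctrl", "^"],
  ["control", "^"],
  ["alt", "%"],
  ["shift", "+"],
]);

const NAMED_KEY_CODES = new Map<string, string>([
  ["return", "{ENTER}"],
  ["enter", "{ENTER}"],
  ["escape", "{ESC}"],
  ["tab", "{TAB}"],
  ["backspace", "{BACKSPACE}"],
  ["delete", "{DELETE}"],
  ["up", "{UP}"],
  ["down", "{DOWN}"],
  ["left", "{LEFT}"],
  ["right", "{RIGHT}"],
  ["home", "{HOME}"],
  ["end", "{END}"],
  ["page_up", "{PGUP}"],
  ["page_down", "{PGDN}"],
]);

function sketchMapKeys(keys: string): string {
  const parts = keys.split("+");
  const key = parts.pop() ?? "";
  let prefix = "";
  for (const mod of parts) {
    const code = MODIFIER_CODES.get(mod.toLowerCase());
    if (code === undefined) {
      // e.g. "super": this sketch assumes SendKeys has no code for the Windows key.
      throw new Error(`Unsupported modifier for SendKeys: ${mod}`);
    }
    prefix += code;
  }
  const lower = key.toLowerCase();
  const named = NAMED_KEY_CODES.get(lower);
  if (named !== undefined) return prefix + named;
  // Function keys F1 through F12 become {F1} ... {F12}.
  if (/^f([1-9]|1[0-2])$/.test(lower)) return `${prefix}{${lower.toUpperCase()}}`;
  // Single letters and digits pass through unchanged.
  if (/^[a-z0-9]$/i.test(key)) return prefix + key;
  throw new Error(`Unsupported key for SendKeys: ${key}`);
}
```

Throwing on unknown names, rather than passing them through, is a design choice of this sketch: an unrecognized string handed to SendKeys would be typed as literal characters.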
@joshkotrous joshkotrous marked this pull request as ready for review April 1, 2026 19:59
@github-actions github-actions Bot requested a review from Yuvanesh-ux April 1, 2026 19:59

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 5 potential issues.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: keyPress passes unquoted string directly into shell command
    • Replaced exec() (which uses shell interpolation) with execFileSync() using argument arrays in both LinuxBackend.keyPress and DarwinBackend.keyPress, bypassing the shell entirely and preventing injection.
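
A rough sketch of the argument-array approach, with one extra guard. The helper name `xdotoolKeyArgs` and the `KEY_SPEC` pattern are assumptions for illustration and do not appear in the PR; the pattern is deliberately conservative and may reject some valid key names.

```typescript
// Hypothetical helper; not taken from the PR.
// Accepts names such as "Return", "Page_Down", "ctrl+shift+t".
const KEY_SPEC = /^[A-Za-z0-9_]+(\+[A-Za-z0-9_]+)*$/;

function xdotoolKeyArgs(keys: string): string[] {
  // Passing an argv array to execFileSync already avoids shell interpretation.
  // The pattern check additionally rejects values that start with "-", which
  // the target program could otherwise treat as an option.
  if (!KEY_SPEC.test(keys)) {
    throw new Error(`Rejected key spec: ${JSON.stringify(keys)}`);
  }
  return ["key", "--clearmodifiers", keys];
}
```

The returned array would be passed as the second argument of `execFileSync("xdotool", ...)`, so the key string reaches xdotool as a single argument and is never parsed by a shell.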
Preview (276aa82ab0)
diff --git a/src/core/agents/offSecAgent/tools/computerUse/index.ts b/src/core/agents/offSecAgent/tools/computerUse/index.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/index.ts
@@ -1,0 +1,67 @@
+/**
+ * Computer Use tools — desktop automation via xdotool (Linux), cliclick (macOS),
+ * and PowerShell/.NET (Windows).
+ *
+ * Provides low-level desktop interaction primitives: screenshot, mouse clicks,
+ * keyboard input, scrolling, and drag operations. These tools enable agents
+ * to interact with graphical applications for penetration testing scenarios
+ * that require GUI interaction (e.g. thick-client apps, VNC sessions, RDP).
+ *
+ * The tools are platform-aware:
+ * - Linux: xdotool + scrot/ImageMagick
+ * - macOS: cliclick + screencapture
+ * - Windows: PowerShell + .NET System.Windows.Forms / user32.dll
+ */
+
+import type { ToolContext } from "../types";
+import { computerScreenshot } from "./screenshot";
+import { computerMouseClick } from "./mouseClick";
+import { computerMouseDoubleClick } from "./mouseDoubleClick";
+import { computerMouseMove } from "./mouseMove";
+import { computerMouseDrag } from "./mouseDrag";
+import { computerTypeText } from "./typeText";
+import { computerKeyPress } from "./keyPress";
+import { computerScroll } from "./scroll";
+import { computerScreenInfo } from "./screenInfo";
+
+export const COMPUTER_USE_TOOL_NAMES = [
+  "computer_screenshot",
+  "computer_mouse_click",
+  "computer_mouse_double_click",
+  "computer_mouse_move",
+  "computer_mouse_drag",
+  "computer_type_text",
+  "computer_key_press",
+  "computer_scroll",
+  "computer_screen_info",
+] as const;
+
+export type ComputerUseToolName = (typeof COMPUTER_USE_TOOL_NAMES)[number];
+
+export function createComputerUseToolset(ctx: ToolContext) {
+  return {
+    computer_screenshot: computerScreenshot(ctx),
+    computer_mouse_click: computerMouseClick(ctx),
+    computer_mouse_double_click: computerMouseDoubleClick(ctx),
+    computer_mouse_move: computerMouseMove(ctx),
+    computer_mouse_drag: computerMouseDrag(ctx),
+    computer_type_text: computerTypeText(ctx),
+    computer_key_press: computerKeyPress(ctx),
+    computer_scroll: computerScroll(ctx),
+    computer_screen_info: computerScreenInfo(ctx),
+  } as const;
+}
+
+export {
+  computerScreenshot,
+  computerMouseClick,
+  computerMouseDoubleClick,
+  computerMouseMove,
+  computerMouseDrag,
+  computerTypeText,
+  computerKeyPress,
+  computerScroll,
+  computerScreenInfo,
+};
+
+export { type DesktopBackend, type Platform, detectPlatform } from "./platform";

diff --git a/src/core/agents/offSecAgent/tools/computerUse/keyPress.ts b/src/core/agents/offSecAgent/tools/computerUse/keyPress.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/keyPress.ts
@@ -1,0 +1,56 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerKeyPress(_ctx: ToolContext) {
+  return tool({
+    description: `Press a key or key combination.
+
+Sends a key press event. Supports single keys and modifier combinations.
+
+Key names follow xdotool conventions on Linux, cliclick on macOS,
+and SendKeys on Windows:
+
+Single keys: Return, Escape, Tab, BackSpace, Delete, space, Up, Down, Left, Right,
+  Home, End, Page_Up, Page_Down, F1-F12
+
+Modifier combos (use + separator): ctrl+c, ctrl+v, ctrl+a, alt+Tab, alt+F4,
+  ctrl+shift+t, super+l (Windows/Super key)
+
+Common examples:
+- "Return"        — press Enter
+- "Escape"        — press Escape
+- "ctrl+c"        — copy
+- "ctrl+v"        — paste
+- "ctrl+a"        — select all
+- "alt+Tab"       — switch windows
+- "ctrl+shift+t"  — reopen closed tab
+- "super+l"       — lock screen`,
+    inputSchema: z.object({
+      keys: z
+        .string()
+        .describe(
+          'Key or key combination to press (e.g. "Return", "ctrl+c", "alt+Tab")',
+        ),
+      toolCallDescription: z
+        .string()
+        .describe("Why you are pressing this key combination"),
+    }),
+    execute: async ({
+      keys,
+    }): Promise<{ success: boolean; message: string }> => {
+      try {
+        const backend = getDesktopBackend();
+        backend.keyPress(keys);
+        return {
+          success: true,
+          message: `Pressed key(s): ${keys}`,
+        };
+      } catch (error) {
+        const msg = error instanceof Error ? error.message : String(error);
+        return { success: false, message: `Key press failed: ${msg}` };
+      }
+    },
+  });
+}

diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseClick.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseClick.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseClick.ts
@@ -1,0 +1,45 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseClick(_ctx: ToolContext) {
+  return tool({
+    description: `Click the mouse at a specific position on screen.
+
+Performs a mouse click at the given (x, y) coordinates. Supports left, right,
+and middle button clicks. If coordinates are omitted, clicks at the current
+mouse position.
+
+Use after taking a screenshot to identify where to click.`,
+    inputSchema: z.object({
+      x: z.number().describe("X coordinate to click at"),
+      y: z.number().describe("Y coordinate to click at"),
+      button: z
+        .enum(["left", "right", "middle"])
+        .optional()
+        .default("left")
+        .describe("Mouse button to click"),
+      toolCallDescription: z
+        .string()
+        .describe("Why you are clicking at this position"),
+    }),
+    execute: async ({
+      x,
+      y,
+      button,
+    }): Promise<{ success: boolean; message: string }> => {
+      try {
+        const backend = getDesktopBackend();
+        backend.mouseClick(button, x, y);
+        return {
+          success: true,
+          message: `Clicked ${button ?? "left"} button at (${x}, ${y})`,
+        };
+      } catch (error) {
+        const msg = error instanceof Error ? error.message : String(error);
+        return { success: false, message: `Mouse click failed: ${msg}` };
+      }
+    },
+  });
+}

diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseDoubleClick.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseDoubleClick.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseDoubleClick.ts
@@ -1,0 +1,41 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseDoubleClick(_ctx: ToolContext) {
+  return tool({
+    description: `Double-click the mouse at a specific position on screen.
+
+Performs a double-click at the given (x, y) coordinates.
+If coordinates are omitted, double-clicks at the current mouse position.`,
+    inputSchema: z.object({
+      x: z.number().optional().describe("X coordinate to double-click at"),
+      y: z.number().optional().describe("Y coordinate to double-click at"),
+      toolCallDescription: z
+        .string()
+        .describe("Why you are double-clicking at this position"),
+    }),
+    execute: async ({
+      x,
+      y,
+    }): Promise<{ success: boolean; message: string }> => {
+      try {
+        const backend = getDesktopBackend();
+        backend.mouseDoubleClick(x, y);
+        const pos =
+          x != null && y != null ? `(${x}, ${y})` : "current position";
+        return {
+          success: true,
+          message: `Double-clicked at ${pos}`,
+        };
+      } catch (error) {
+        const msg = error instanceof Error ? error.message : String(error);
+        return {
+          success: false,
+          message: `Double-click failed: ${msg}`,
+        };
+      }
+    },
+  });
+}

diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseDrag.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseDrag.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseDrag.ts
@@ -1,0 +1,40 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseDrag(_ctx: ToolContext) {
+  return tool({
+    description: `Drag the mouse from one position to another.
+
+Performs a click-and-drag from (startX, startY) to (endX, endY).
+Useful for selecting text, moving elements, or interacting with sliders.`,
+    inputSchema: z.object({
+      startX: z.number().describe("Starting X coordinate"),
+      startY: z.number().describe("Starting Y coordinate"),
+      endX: z.number().describe("Ending X coordinate"),
+      endY: z.number().describe("Ending Y coordinate"),
+      toolCallDescription: z
+        .string()
+        .describe("Why you are performing this drag operation"),
+    }),
+    execute: async ({
+      startX,
+      startY,
+      endX,
+      endY,
+    }): Promise<{ success: boolean; message: string }> => {
+      try {
+        const backend = getDesktopBackend();
+        backend.mouseDrag(startX, startY, endX, endY);
+        return {
+          success: true,
+          message: `Dragged from (${startX}, ${startY}) to (${endX}, ${endY})`,
+        };
+      } catch (error) {
+        const msg = error instanceof Error ? error.message : String(error);
+        return { success: false, message: `Mouse drag failed: ${msg}` };
+      }
+    },
+  });
+}

diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseMove.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseMove.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseMove.ts
@@ -1,0 +1,36 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseMove(_ctx: ToolContext) {
+  return tool({
+    description: `Move the mouse cursor to specific coordinates on screen.
+
+Moves the mouse to the given (x, y) coordinates without clicking.
+Useful for hovering over elements to trigger tooltips or menus.`,
+    inputSchema: z.object({
+      x: z.number().describe("X coordinate to move to"),
+      y: z.number().describe("Y coordinate to move to"),
+      toolCallDescription: z
+        .string()
+        .describe("Why you are moving the mouse to this position"),
+    }),
+    execute: async ({
+      x,
+      y,
+    }): Promise<{ success: boolean; message: string }> => {
+      try {
+        const backend = getDesktopBackend();
+        backend.mouseMove(x, y);
+        return {
+          success: true,
+          message: `Mouse moved to (${x}, ${y})`,
+        };
+      } catch (error) {
+        const msg = error instanceof Error ? error.message : String(error);
+        return { success: false, message: `Mouse move failed: ${msg}` };
+      }
+    },
+  });
+}

diff --git a/src/core/agents/offSecAgent/tools/computerUse/platform.test.ts b/src/core/agents/offSecAgent/tools/computerUse/platform.test.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/platform.test.ts
@@ -1,0 +1,73 @@
+import { describe, it, expect } from "vitest";
+import {
+  detectPlatform,
+  LinuxBackend,
+  DarwinBackend,
+  WindowsBackend,
+} from "./platform";
+
+const BACKEND_METHODS = [
+  "screenshot",
+  "mouseMove",
+  "mouseClick",
+  "mouseDoubleClick",
+  "typeText",
+  "keyPress",
+  "getMousePosition",
+  "getScreenSize",
+  "mouseDrag",
+  "scroll",
+  "getActiveWindowTitle",
+] as const;
+
+describe("Computer Use platform detection", () => {
+  it("should detect the current platform", () => {
+    const platform = detectPlatform();
+    expect(["linux", "darwin", "win32", "unsupported"]).toContain(platform);
+  });
+
+  it("should return linux on Linux", () => {
+    if (process.platform !== "linux") return;
+    expect(detectPlatform()).toBe("linux");
+  });
+
+  it("should return darwin on macOS", () => {
+    if (process.platform !== "darwin") return;
+    expect(detectPlatform()).toBe("darwin");
+  });
+
+  it("should return win32 on Windows", () => {
+    if (process.platform !== "win32") return;
+    expect(detectPlatform()).toBe("win32");
+  });
+});
+
+describe("LinuxBackend", () => {
+  it("should be constructable and implement DesktopBackend", () => {
+    const backend = new LinuxBackend();
+    expect(backend).toBeDefined();
+    for (const method of BACKEND_METHODS) {
+      expect(typeof backend[method]).toBe("function");
+    }
+  });
+});
+
+describe("DarwinBackend", () => {
+  it("should be constructable and implement DesktopBackend", () => {
+    const backend = new DarwinBackend();
+    expect(backend).toBeDefined();
+    for (const method of BACKEND_METHODS) {
+      expect(typeof backend[method]).toBe("function");
+    }
+  });
+});
+
+describe("WindowsBackend", () => {
+  it("should be constructable and implement DesktopBackend", () => {
+    const backend = new WindowsBackend();
+    expect(backend).toBeDefined();
+    for (const method of BACKEND_METHODS) {
+      expect(typeof backend[method]).toBe("function");
+    }
+  });
+});

diff --git a/src/core/agents/offSecAgent/tools/computerUse/platform.ts b/src/core/agents/offSecAgent/tools/computerUse/platform.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/platform.ts
@@ -1,0 +1,533 @@
+/**
+ * Platform detection and desktop automation backend.
+ *
+ * Linux   → xdotool + scrot/import (ImageMagick)
+ * macOS   → cliclick + screencapture
+ * Windows → PowerShell + .NET System.Windows.Forms / user32.dll
+ *
+ * Each backend implements the same {@link DesktopBackend} interface so
+ * individual tools stay platform-agnostic.
+ */
+
+import {
+  execFileSync,
+  execSync,
+  type ExecSyncOptionsWithStringEncoding,
+} from "child_process";
+
+const EXEC_OPTS: ExecSyncOptionsWithStringEncoding = {
+  encoding: "utf-8",
+  timeout: 15_000,
+  stdio: ["pipe", "pipe", "pipe"],
+};
+
+export type Platform = "linux" | "darwin" | "win32" | "unsupported";
+
+export function detectPlatform(): Platform {
+  const p = process.platform;
+  if (p === "linux") return "linux";
+  if (p === "darwin") return "darwin";
+  if (p === "win32") return "win32";
+  return "unsupported";
+}
+
+export interface ScreenSize {
+  width: number;
+  height: number;
+}
+
+export interface MousePosition {
+  x: number;
+  y: number;
+}
+
+// ---------------------------------------------------------------------------
+// Desktop backend interface
+// ---------------------------------------------------------------------------
+
+export interface DesktopBackend {
+  /** Take a screenshot and return the file path to the PNG. */
+  screenshot(outputPath: string): string;
+
+  /** Move the mouse to absolute coordinates. */
+  mouseMove(x: number, y: number): void;
+
+  /** Click at the current (or specified) position. */
+  mouseClick(
+    button?: "left" | "right" | "middle",
+    x?: number,
+    y?: number,
+  ): void;
+
+  /** Double-click at the current (or specified) position. */
+  mouseDoubleClick(x?: number, y?: number): void;
+
+  /** Type text via keyboard (handles special characters). */
+  typeText(text: string): void;
+
+  /** Press a key or key combination (e.g. "Return", "ctrl+c", "alt+Tab"). */
+  keyPress(keys: string): void;
+
+  /** Get current mouse position. */
+  getMousePosition(): MousePosition;
+
+  /** Get screen dimensions. */
+  getScreenSize(): ScreenSize;
+
+  /** Drag from one point to another. */
+  mouseDrag(startX: number, startY: number, endX: number, endY: number): void;
+
+  /** Scroll the mouse wheel. Positive = down, negative = up. */
+  scroll(amount: number, x?: number, y?: number): void;
+
+  /** Get the title of the currently active window. */
+  getActiveWindowTitle(): string;
+}
+
+// ---------------------------------------------------------------------------
+// Linux backend (xdotool + scrot)
+// ---------------------------------------------------------------------------
+
+function exec(cmd: string): string {
+  return execSync(cmd, EXEC_OPTS).trim();
+}
+
+export class LinuxBackend implements DesktopBackend {
+  screenshot(outputPath: string): string {
+    exec(
+      `scrot -o "${outputPath}" 2>/dev/null || import -window root "${outputPath}"`,
+    );
+    return outputPath;
+  }
+
+  mouseMove(x: number, y: number): void {
+    exec(`xdotool mousemove ${x} ${y}`);
+  }
+
+  mouseClick(
+    button: "left" | "right" | "middle" = "left",
+    x?: number,
+    y?: number,
+  ): void {
+    const btnMap = { left: 1, middle: 2, right: 3 } as const;
+    if (x != null && y != null) {
+      exec(`xdotool mousemove ${x} ${y} click ${btnMap[button]}`);
+    } else {
+      exec(`xdotool click ${btnMap[button]}`);
+    }
+  }
+
+  mouseDoubleClick(x?: number, y?: number): void {
+    if (x != null && y != null) {
+      exec(`xdotool mousemove ${x} ${y} click --repeat 2 --delay 80 1`);
+    } else {
+      exec(`xdotool click --repeat 2 --delay 80 1`);
+    }
+  }
+
+  typeText(text: string): void {
+    exec(`xdotool type --clearmodifiers -- ${JSON.stringify(text)}`);
+  }
+
+  keyPress(keys: string): void {
+    execFileSync("xdotool", ["key", "--clearmodifiers", keys], EXEC_OPTS);
+  }
+
+  getMousePosition(): MousePosition {
+    const raw = exec("xdotool getmouselocation --shell");
+    const xMatch = raw.match(/X=(\d+)/);
+    const yMatch = raw.match(/Y=(\d+)/);
+    return {
+      x: xMatch ? parseInt(xMatch[1]!, 10) : 0,
+      y: yMatch ? parseInt(yMatch[1]!, 10) : 0,
+    };
+  }
+
+  getScreenSize(): ScreenSize {
+    const raw = exec("xdotool getdisplaygeometry");
+    const [w, h] = raw.split(" ").map(Number);
+    return { width: w ?? 0, height: h ?? 0 };
+  }
+
+  mouseDrag(startX: number, startY: number, endX: number, endY: number): void {
+    exec(
+      `xdotool mousemove ${startX} ${startY} mousedown 1 mousemove ${endX} ${endY} mouseup 1`,
+    );
+  }
+
+  scroll(amount: number, x?: number, y?: number): void {
+    if (x != null && y != null) {
+      exec(`xdotool mousemove ${x} ${y}`);
+    }
+    const button = amount > 0 ? 5 : 4;
+    const clicks = Math.abs(amount);
+    exec(`xdotool click --repeat ${clicks} --delay 50 ${button}`);
+  }
+
+  getActiveWindowTitle(): string {
+    try {
+      return exec("xdotool getactivewindow getwindowname");
+    } catch {
+      return "(unknown)";
+    }
+  }
+}
+
+// ---------------------------------------------------------------------------
+// macOS backend (cliclick + screencapture)
+// ---------------------------------------------------------------------------
+
+export class DarwinBackend implements DesktopBackend {
+  screenshot(outputPath: string): string {
+    exec(`screencapture -x "${outputPath}"`);
+    return outputPath;
+  }
+
+  mouseMove(x: number, y: number): void {
+    exec(`cliclick m:${x},${y}`);
+  }
+
+  mouseClick(
+    button: "left" | "right" | "middle" = "left",
+    x?: number,
+    y?: number,
+  ): void {
+    const prefix = button === "right" ? "rc" : "c";
+    if (x != null && y != null) {
+      exec(`cliclick ${prefix}:${x},${y}`);
+    } else {
+      const pos = this.getMousePosition();
+      exec(`cliclick ${prefix}:${pos.x},${pos.y}`);
+    }
+  }
+
+  mouseDoubleClick(x?: number, y?: number): void {
+    if (x != null && y != null) {
+      exec(`cliclick dc:${x},${y}`);
+    } else {
+      const pos = this.getMousePosition();
+      exec(`cliclick dc:${pos.x},${pos.y}`);
+    }
+  }
+
+  typeText(text: string): void {
+    exec(`cliclick t:${JSON.stringify(text)}`);
+  }
+
+  keyPress(keys: string): void {
+    execFileSync("cliclick", [`kp:${keys}`], EXEC_OPTS);
+  }
+
+  getMousePosition(): MousePosition {
+    const raw = exec("cliclick p");
+    const match = raw.match(/(\d+),(\d+)/);
+    return {
+      x: match ? parseInt(match[1]!, 10) : 0,
+      y: match ? parseInt(match[2]!, 10) : 0,
+    };
+  }
+
+  getScreenSize(): ScreenSize {
+    const raw = exec(
+      `system_profiler SPDisplaysDataType | grep Resolution | head -1`,
+    );
+    const match = raw.match(/(\d+)\s*x\s*(\d+)/);
+    return {
+      width: match ? parseInt(match[1]!, 10) : 0,
+      height: match ? parseInt(match[2]!, 10) : 0,
+    };
+  }
+
+  mouseDrag(startX: number, startY: number, endX: number, endY: number): void {
+    exec(`cliclick dd:${startX},${startY} du:${endX},${endY}`);
+  }
+
+  scroll(amount: number, x?: number, y?: number): void {
+    if (x != null && y != null) {
+      exec(`cliclick m:${x},${y}`);
+    }
+    const direction = amount > 0 ? "down" : "up";
+    const clicks = Math.abs(amount);
+    for (let i = 0; i < clicks; i++) {
+      exec(`cliclick "kp:${direction === "down" ? "arrow-down" : "arrow-up"}"`);
+    }
+  }
+
+  getActiveWindowTitle(): string {
+    try {
+      return exec(
+        `osascript -e 'tell application "System Events" to get name of first application process whose frontmost is true'`,
+      );
+    } catch {
+      return "(unknown)";
+    }
+  }
+}
+
+// ---------------------------------------------------------------------------
+// Windows backend (PowerShell + .NET System.Windows.Forms / user32.dll)
+// ---------------------------------------------------------------------------
+
+/**
+ * Run a PowerShell snippet and return trimmed stdout.
+ *
+ * Uses `-NoProfile -NonInteractive -Command` so startup is fast and
+ * there is no profile pollution. The snippet can use any .NET class
+ * available in the default PowerShell/.NET runtime.
+ */
+function ps(script: string): string {
+  return execSync(
+    `powershell -NoProfile -NonInteractive -Command ${JSON.stringify(script)}`,
+    EXEC_OPTS,
+  ).trim();
+}
+
+/**
+ * Shared C# helper that is injected once per PowerShell call when we need
+ * mouse or keyboard simulation via `user32.dll` P/Invoke.
+ */
+const WIN32_INPUT_TYPE = `
+Add-Type -TypeDefinition @"
+using System;
+using System.Runtime.InteropServices;
+public class Win32Input {
+    [DllImport("user32.dll")] public static extern bool SetCursorPos(int X, int Y);
+    [DllImport("user32.dll")] public static extern bool GetCursorPos(out POINT lpPoint);
+    [DllImport("user32.dll")] public static extern void mouse_event(uint dwFlags, int dx, int dy, int dwData, int dwExtraInfo);
+    [DllImport("user32.dll")] public static extern IntPtr GetForegroundWindow();
+    [DllImport("user32.dll", CharSet=CharSet.Auto)] public static extern int GetWindowText(IntPtr hWnd, System.Text.StringBuilder lpString, int nMaxCount);
+    [StructLayout(LayoutKind.Sequential)] public struct POINT { public int X; public int Y; }
+
+    public const uint MOUSEEVENTF_LEFTDOWN   = 0x0002;
+    public const uint MOUSEEVENTF_LEFTUP     = 0x0004;
+    public const uint MOUSEEVENTF_RIGHTDOWN  = 0x0008;
+    public const uint MOUSEEVENTF_RIGHTUP    = 0x0010;
+    public const uint MOUSEEVENTF_MIDDLEDOWN = 0x0020;
+    public const uint MOUSEEVENTF_MIDDLEUP   = 0x0040;
+    public const uint MOUSEEVENTF_WHEEL      = 0x0800;
+}
+"@
+`;
+
+export class WindowsBackend implements DesktopBackend {
+  screenshot(outputPath: string): string {
+    ps(
+      `Add-Type -AssemblyName System.Windows.Forms; ` +
+        `$bmp = New-Object System.Drawing.Bitmap([System.Windows.Forms.Screen]::PrimaryScreen.Bounds.Width, [System.Windows.Forms.Screen]::PrimaryScreen.Bounds.Height); ` +
+        `$g = [System.Drawing.Graphics]::FromImage($bmp); ` +
+        `$g.CopyFromScreen(0, 0, 0, 0, $bmp.Size); ` +
+        `$bmp.Save('${outputPath.replace(/'/g, "''")}'); ` +
+        `$g.Dispose(); $bmp.Dispose()`,
+    );
+    return outputPath;
+  }
+
+  mouseMove(x: number, y: number): void {
+    ps(`${WIN32_INPUT_TYPE} [Win32Input]::SetCursorPos(${x}, ${y})`);
+  }
+
+  mouseClick(
+    button: "left" | "right" | "middle" = "left",
+    x?: number,
+    y?: number,
+  ): void {
+    const downUp =
+      button === "right"
+        ? "[Win32Input]::MOUSEEVENTF_RIGHTDOWN, [Win32Input]::MOUSEEVENTF_RIGHTUP"
+        : button === "middle"
+          ? "[Win32Input]::MOUSEEVENTF_MIDDLEDOWN, [Win32Input]::MOUSEEVENTF_MIDDLEUP"
+          : "[Win32Input]::MOUSEEVENTF_LEFTDOWN, [Win32Input]::MOUSEEVENTF_LEFTUP";
+
+    const movePrefix =
+      x != null && y != null ? `[Win32Input]::SetCursorPos(${x}, ${y}); ` : "";
+
+    ps(
+      `${WIN32_INPUT_TYPE} ${movePrefix}` +
+        `$d, $u = ${downUp}; ` +
+        `[Win32Input]::mouse_event($d, 0, 0, 0, 0); ` +
+        `Start-Sleep -Milliseconds 30; ` +
+        `[Win32Input]::mouse_event($u, 0, 0, 0, 0)`,
+    );
+  }
+
+  mouseDoubleClick(x?: number, y?: number): void {
+    const movePrefix =
+      x != null && y != null ? `[Win32Input]::SetCursorPos(${x}, ${y}); ` : "";
+
+    ps(
+      `${WIN32_INPUT_TYPE} ${movePrefix}` +
+        `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0); ` +
+        `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTUP, 0, 0, 0, 0); ` +
+        `Start-Sleep -Milliseconds 80; ` +
+        `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0); ` +
+        `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTUP, 0, 0, 0, 0)`,
+    );
+  }
+
+  typeText(text: string): void {
+    ps(
+      `Add-Type -AssemblyName System.Windows.Forms; ` +
+        `[System.Windows.Forms.SendKeys]::SendWait(${JSON.stringify(text)})`,
+    );
+  }
+
+  keyPress(keys: string): void {
+    const mapped = mapKeysToSendKeys(keys);
+    ps(
+      `Add-Type -AssemblyName System.Windows.Forms; ` +
+        `[System.Windows.Forms.SendKeys]::SendWait('${mapped}')`,
+    );
+  }
+
+  getMousePosition(): MousePosition {
+    const raw = ps(
+      `${WIN32_INPUT_TYPE} $p = New-Object Win32Input+POINT; ` +
+        `[Win32Input]::GetCursorPos([ref]$p); "$($p.X),$($p.Y)"`,
+    );
+    const [xStr, yStr] = raw.split(",");
+    return {
+      x: parseInt(xStr ?? "0", 10),
+      y: parseInt(yStr ?? "0", 10),
+    };
+  }
+
+  getScreenSize(): ScreenSize {
+    const raw = ps(
... diff truncated: showing 800 of 1675 lines

for (let i = 0; i < clicks; i++) {
exec(`cliclick "kp:${direction === "down" ? "arrow-down" : "arrow-up"}"`);
}
}

macOS scroll simulates arrow keys instead of scrolling

Medium Severity

DarwinBackend.scroll simulates arrow key presses (kp:arrow-down / kp:arrow-up) instead of actual scroll wheel events. Arrow keys move cursors or selections in most applications — they don't scroll content. The DesktopBackend interface and the computer_scroll tool description both promise mouse wheel scrolling, but this implementation does something fundamentally different on macOS. cliclick lacks native scroll support, so an alternative like osascript with Quartz event APIs would be needed.

`Add-Type -AssemblyName System.Windows.Forms; ` +
`[System.Windows.Forms.SendKeys]::SendWait(${JSON.stringify(text)})`,
);
}

Windows typeText doesn't escape SendKeys special characters

Medium Severity

WindowsBackend.typeText passes text to SendKeys.SendWait without escaping special characters. SendKeys treats + as Shift, ^ as Ctrl, % as Alt, ~ as Enter, and {, }, (, ) as grouping/special key delimiters. Typing ordinary text like 2+2 or 100% will produce wrong output (modifier key combos instead of literal characters). These characters need to be wrapped in braces (e.g., {+}, {%}) before being sent.
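The brace-wrapping described above could look roughly like the following. This is a minimal sketch, not the PR's code: `escapeSendKeys`, `psSingleQuote`, and `buildSendKeysCommand` are hypothetical helper names, and the single-quoted PowerShell literal is an added assumption (it avoids `$` expansion, which a JSON-style double-quoted string would still allow).

```typescript
// Hypothetical helpers, not part of the PR.

// Characters SendKeys treats as modifiers (+ ^ % ~) or grouping/special
// syntax (parentheses, braces, brackets). Each must be wrapped in braces
// to be typed literally.
const SENDKEYS_SPECIAL = /[+^%~(){}\[\]]/g;

function escapeSendKeys(text: string): string {
  // Single pass, so braces added here are never escaped a second time.
  return text.replace(SENDKEYS_SPECIAL, (ch) => `{${ch}}`);
}

// PowerShell single-quoted literal: no variable expansion, and an
// embedded single quote is written as two single quotes.
function psSingleQuote(text: string): string {
  return `'${text.replace(/'/g, "''")}'`;
}

// Builds the PowerShell snippet that a typeText implementation might run.
function buildSendKeysCommand(text: string): string {
  return (
    "Add-Type -AssemblyName System.Windows.Forms; " +
    `[System.Windows.Forms.SendKeys]::SendWait(${psSingleQuote(escapeSendKeys(text))})`
  );
}
```

With this, `escapeSendKeys("2+2")` yields `2{+}2`, so the plus sign is typed literally rather than acting as Shift.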


typeText(text: string): void {
exec(`xdotool type --clearmodifiers -- ${JSON.stringify(text)}`);
}

Linux/macOS typeText corrupts text with dollar signs

Medium Severity

LinuxBackend.typeText and DarwinBackend.typeText use JSON.stringify(text) to produce a double-quoted shell string, but bash expands $variable, $(command), and backtick expressions inside double quotes. Typing text like "Price is $100" would produce "Price is 00" (since $1 is an empty positional parameter), and text containing $(...) would execute arbitrary commands. Single quotes or proper shell escaping would prevent this.
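Two ways to avoid the expansion are sketched below, under the assumption that commands are run through a POSIX shell. `shellSingleQuote` and `xdotoolTypeArgs` are hypothetical names, not functions from the PR. The first wraps the text in single quotes; the second builds an argument array that could be passed to `execFileSync`, which skips the shell entirely.

```typescript
// Hypothetical helpers, not part of the PR.

// POSIX single-quoting: nothing is expanded inside '...'. An embedded
// single quote is written as '\'' (close quote, escaped quote, reopen).
function shellSingleQuote(text: string): string {
  return "'" + text.replace(/'/g, "'\\''") + "'";
}

// Argument vector for xdotool. When passed to execFileSync("xdotool", args)
// no shell parses the text, so no quoting is needed at all.
function xdotoolTypeArgs(text: string): string[] {
  return ["type", "--clearmodifiers", "--", text];
}

// Example values (no command is executed here):
const quoted = shellSingleQuote("Price is $100");
const args = xdotoolTypeArgs("$(id)");
```

The argument-array form is usually the simpler choice, since it removes the quoting question rather than answering it.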


x?: number,
y?: number,
): void {
const prefix = button === "right" ? "rc" : "c";

macOS middle-click silently performs left-click instead

Low Severity

DarwinBackend.mouseClick maps button === "right" to "rc" but falls through to "c" (left click) for "middle". The tool and interface accept "middle" as a valid button, but the macOS backend silently performs a left click instead, with no error or warning. This could cause the agent to repeatedly fail at middle-click-dependent GUI tasks without understanding why.
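One way to surface the problem instead of hiding it is to map each button explicitly and fail loudly for the unsupported case. The sketch below assumes the backend only has cliclick's `c` (left click) and `rc` (right click) commands available; `cliclickClickPrefix` is a hypothetical name.

```typescript
// Hypothetical helper, not part of the PR.
type MouseButton = "left" | "right" | "middle";

// Returns the cliclick command prefix for a button, or throws when the
// button cannot be expressed, so the caller sees an error instead of a
// silent left click.
function cliclickClickPrefix(button: MouseButton): "c" | "rc" {
  switch (button) {
    case "left":
      return "c";
    case "right":
      return "rc";
    case "middle":
      throw new Error("middle-click is not supported by the cliclick backend");
  }
}
```

An error message returned to the agent lets it pick a different strategy rather than retrying the same click.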

Comment thread src/core/agents/offSecAgent/tools/computerUse/platform.ts
@Yuvanesh-ux
Collaborator

@joshkotrous do you have a video of computer use by apex?

cursoragent and others added 2 commits April 1, 2026 20:16
LinuxBackend.keyPress and DarwinBackend.keyPress interpolated the keys
string directly into shell commands via exec() with no quoting or
escaping. Shell metacharacters in the keys parameter would be interpreted
by /bin/sh, enabling arbitrary command execution.

Switch both implementations to execFileSync with an argument array,
which bypasses the shell entirely and passes arguments safely.

Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
Comment thread src/core/agents/offSecAgent/tools/computerUse/platform.ts
cliclick's kp: command only supports single key names (e.g. kp:return).
Modifier combos like ctrl+c require separate kd: (key down), kp: (tap),
and ku: (key up) arguments. Added mapKeysToClichClick() to translate
combo strings into the correct cliclick argument sequence.

Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
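The translation described in this commit might look roughly like the sketch below. It is not the PR's `mapKeysToClichClick()`; `toCliclickArgs` is a hypothetical name, and the sketch assumes that cliclick accepts modifier names (alt, cmd, ctrl, fn, shift) for `kd:`/`ku:`, named keys such as `return` or `tab` for `kp:`, and that a single character is sent with `t:`.

```typescript
// Hypothetical helper, not part of the PR.

// Modifier names assumed to be accepted by cliclick's kd:/ku: commands.
const CLICLICK_MODIFIERS = new Set(["alt", "cmd", "ctrl", "fn", "shift"]);

// Translates a combo such as "ctrl+c" into a cliclick argument sequence:
// hold modifiers, tap the key, release modifiers.
function toCliclickArgs(keys: string): string[] {
  const parts = keys
    .split("+")
    .map((p) => p.trim().toLowerCase())
    .filter((p) => p.length > 0);
  const key = parts.pop();
  if (key === undefined) throw new Error("empty key specification");

  const unknown = parts.filter((m) => !CLICLICK_MODIFIERS.has(m));
  if (unknown.length > 0) {
    throw new Error(`unsupported modifier(s): ${unknown.join(", ")}`);
  }

  // Assumption: single characters are typed with t:, named keys use kp:.
  const tap = key.length === 1 ? `t:${key}` : `kp:${key}`;
  if (parts.length === 0) return [tap];

  const mods = parts.join(",");
  return [`kd:${mods}`, tap, `ku:${mods}`];
}
```

The resulting array can be passed as arguments to the process directly, in line with the `execFileSync` change in the previous commit.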
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 5 total unresolved issues (including 4 from previous reviews).

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

Reviewed by Cursor Bugbot for commit ab4ddde.

"spawn_pentest_swarm",
"spawn_coding_agent",
"run_pentest_workflow",
"delegate_to_computer_use_agent",

Plan mode includes mutating computer-use delegation tool

Medium Severity

PLAN_MODE_TOOL_NAMES intentionally limits direct computer use tools to read-only ones (computer_screenshot, computer_mouse_move, computer_scroll) but also includes delegate_to_computer_use_agent, which spawns a subagent with full access to all computer use tools plus execute_command. This bypasses the plan-mode restriction, allowing clicks, typing, and dragging through the delegated agent.
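A guard along the following lines would keep delegation tools out of plan mode. It is a minimal sketch: `MUTATING_DELEGATION_TOOLS`, `candidatePlanModeTools`, and `planModeSafe` are hypothetical names, and the tool list is an illustrative subset rather than the PR's actual `PLAN_MODE_TOOL_NAMES`.

```typescript
// Hypothetical names throughout; not the PR's actual constants.

// Tools that spawn a subagent with mutating capabilities and therefore
// should not be reachable while planning.
const MUTATING_DELEGATION_TOOLS = new Set<string>([
  "delegate_to_computer_use_agent",
]);

// Illustrative subset of a plan-mode tool list.
const candidatePlanModeTools: readonly string[] = [
  "computer_screenshot",
  "computer_mouse_move",
  "computer_scroll",
  "delegate_to_computer_use_agent",
];

// Drops any tool that would bypass the read-only restriction.
function planModeSafe(tools: readonly string[]): string[] {
  return tools.filter((name) => !MUTATING_DELEGATION_TOOLS.has(name));
}

const planModeTools = planModeSafe(candidatePlanModeTools);
```

An alternative would be to keep the delegation tool but start the subagent with a read-only toolset when the parent is in plan mode.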

3 participants