feat: add computer use tools and agent for desktop GUI automation#523
feat: add computer use tools and agent for desktop GUI automation#523joshkotrous wants to merge 5 commits into
Conversation
Adds a complete computer use toolset for desktop automation via: - Linux: xdotool + scrot/ImageMagick - macOS: cliclick + screencapture Tools: - computer_screenshot: capture screen as base64 PNG - computer_mouse_click: click at (x,y) coordinates - computer_mouse_double_click: double-click at position - computer_mouse_move: move cursor without clicking - computer_mouse_drag: drag from one point to another - computer_type_text: type text via keyboard - computer_key_press: press keys/combos (ctrl+c, alt+Tab, etc.) - computer_scroll: scroll mouse wheel - computer_screen_info: get screen size, mouse pos, active window Agent: - ComputerUseAgent specialized subagent follows observe-plan-act loop - delegate_to_computer_use_agent orchestration tool for parent agents - Registered in ALL_TOOL_NAMES and PLAN_MODE_TOOL_NAMES - Added to toolset UI definitions Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
Adds WindowsBackend using PowerShell + .NET System.Windows.Forms and user32.dll P/Invoke for desktop automation on Windows: - Screenshot via System.Drawing.Graphics.CopyFromScreen - Mouse control via user32.dll mouse_event / SetCursorPos - Keyboard input via System.Windows.Forms.SendKeys - Window title via user32.dll GetWindowText - Screen size via System.Windows.Forms.Screen.PrimaryScreen No external dependencies required — uses built-in PowerShell/.NET. Includes mapKeysToSendKeys() to translate xdotool-style key names (ctrl+c, Return, alt+Tab) to .NET SendKeys format. Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 5 potential issues.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: keyPress passes unquoted string directly into shell command
- Replaced exec() (which uses shell interpolation) with execFileSync() using argument arrays in both LinuxBackend.keyPress and DarwinBackend.keyPress, bypassing the shell entirely and preventing injection.
Preview (276aa82ab0)
diff --git a/src/core/agents/offSecAgent/tools/computerUse/index.ts b/src/core/agents/offSecAgent/tools/computerUse/index.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/index.ts
@@ -1,0 +1,67 @@
+/**
+ * Computer Use tools — desktop automation via xdotool (Linux), cliclick (macOS),
+ * and PowerShell/.NET (Windows).
+ *
+ * Provides low-level desktop interaction primitives: screenshot, mouse clicks,
+ * keyboard input, scrolling, and drag operations. These tools enable agents
+ * to interact with graphical applications for penetration testing scenarios
+ * that require GUI interaction (e.g. thick-client apps, VNC sessions, RDP).
+ *
+ * The tools are platform-aware:
+ * - Linux: xdotool + scrot/ImageMagick
+ * - macOS: cliclick + screencapture
+ * - Windows: PowerShell + .NET System.Windows.Forms / user32.dll
+ */
+
+import type { ToolContext } from "../types";
+import { computerScreenshot } from "./screenshot";
+import { computerMouseClick } from "./mouseClick";
+import { computerMouseDoubleClick } from "./mouseDoubleClick";
+import { computerMouseMove } from "./mouseMove";
+import { computerMouseDrag } from "./mouseDrag";
+import { computerTypeText } from "./typeText";
+import { computerKeyPress } from "./keyPress";
+import { computerScroll } from "./scroll";
+import { computerScreenInfo } from "./screenInfo";
+
+export const COMPUTER_USE_TOOL_NAMES = [
+ "computer_screenshot",
+ "computer_mouse_click",
+ "computer_mouse_double_click",
+ "computer_mouse_move",
+ "computer_mouse_drag",
+ "computer_type_text",
+ "computer_key_press",
+ "computer_scroll",
+ "computer_screen_info",
+] as const;
+
+export type ComputerUseToolName = (typeof COMPUTER_USE_TOOL_NAMES)[number];
+
+export function createComputerUseToolset(ctx: ToolContext) {
+ return {
+ computer_screenshot: computerScreenshot(ctx),
+ computer_mouse_click: computerMouseClick(ctx),
+ computer_mouse_double_click: computerMouseDoubleClick(ctx),
+ computer_mouse_move: computerMouseMove(ctx),
+ computer_mouse_drag: computerMouseDrag(ctx),
+ computer_type_text: computerTypeText(ctx),
+ computer_key_press: computerKeyPress(ctx),
+ computer_scroll: computerScroll(ctx),
+ computer_screen_info: computerScreenInfo(ctx),
+ } as const;
+}
+
+export {
+ computerScreenshot,
+ computerMouseClick,
+ computerMouseDoubleClick,
+ computerMouseMove,
+ computerMouseDrag,
+ computerTypeText,
+ computerKeyPress,
+ computerScroll,
+ computerScreenInfo,
+};
+
+export { type DesktopBackend, type Platform, detectPlatform } from "./platform";
diff --git a/src/core/agents/offSecAgent/tools/computerUse/keyPress.ts b/src/core/agents/offSecAgent/tools/computerUse/keyPress.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/keyPress.ts
@@ -1,0 +1,56 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerKeyPress(_ctx: ToolContext) {
+ return tool({
+ description: `Press a key or key combination.
+
+Sends a key press event. Supports single keys and modifier combinations.
+
+Key names follow xdotool conventions on Linux, cliclick on macOS,
+and SendKeys on Windows:
+
+Single keys: Return, Escape, Tab, BackSpace, Delete, space, Up, Down, Left, Right,
+ Home, End, Page_Up, Page_Down, F1-F12
+
+Modifier combos (use + separator): ctrl+c, ctrl+v, ctrl+a, alt+Tab, alt+F4,
+ ctrl+shift+t, super+l (Windows/Super key)
+
+Common examples:
+- "Return" — press Enter
+- "Escape" — press Escape
+- "ctrl+c" — copy
+- "ctrl+v" — paste
+- "ctrl+a" — select all
+- "alt+Tab" — switch windows
+- "ctrl+shift+t" — reopen closed tab
+- "super+l" — lock screen`,
+ inputSchema: z.object({
+ keys: z
+ .string()
+ .describe(
+ 'Key or key combination to press (e.g. "Return", "ctrl+c", "alt+Tab")',
+ ),
+ toolCallDescription: z
+ .string()
+ .describe("Why you are pressing this key combination"),
+ }),
+ execute: async ({
+ keys,
+ }): Promise<{ success: boolean; message: string }> => {
+ try {
+ const backend = getDesktopBackend();
+ backend.keyPress(keys);
+ return {
+ success: true,
+ message: `Pressed key(s): ${keys}`,
+ };
+ } catch (error) {
+ const msg = error instanceof Error ? error.message : String(error);
+ return { success: false, message: `Key press failed: ${msg}` };
+ }
+ },
+ });
+}
diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseClick.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseClick.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseClick.ts
@@ -1,0 +1,45 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseClick(_ctx: ToolContext) {
+ return tool({
+ description: `Click the mouse at a specific position on screen.
+
+Performs a mouse click at the given (x, y) coordinates. Supports left, right,
+and middle button clicks. If coordinates are omitted, clicks at the current
+mouse position.
+
+Use after taking a screenshot to identify where to click.`,
+ inputSchema: z.object({
+ x: z.number().describe("X coordinate to click at"),
+ y: z.number().describe("Y coordinate to click at"),
+ button: z
+ .enum(["left", "right", "middle"])
+ .optional()
+ .default("left")
+ .describe("Mouse button to click"),
+ toolCallDescription: z
+ .string()
+ .describe("Why you are clicking at this position"),
+ }),
+ execute: async ({
+ x,
+ y,
+ button,
+ }): Promise<{ success: boolean; message: string }> => {
+ try {
+ const backend = getDesktopBackend();
+ backend.mouseClick(button, x, y);
+ return {
+ success: true,
+ message: `Clicked ${button ?? "left"} button at (${x}, ${y})`,
+ };
+ } catch (error) {
+ const msg = error instanceof Error ? error.message : String(error);
+ return { success: false, message: `Mouse click failed: ${msg}` };
+ }
+ },
+ });
+}
diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseDoubleClick.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseDoubleClick.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseDoubleClick.ts
@@ -1,0 +1,41 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseDoubleClick(_ctx: ToolContext) {
+ return tool({
+ description: `Double-click the mouse at a specific position on screen.
+
+Performs a double-click at the given (x, y) coordinates.
+If coordinates are omitted, double-clicks at the current mouse position.`,
+ inputSchema: z.object({
+ x: z.number().optional().describe("X coordinate to double-click at"),
+ y: z.number().optional().describe("Y coordinate to double-click at"),
+ toolCallDescription: z
+ .string()
+ .describe("Why you are double-clicking at this position"),
+ }),
+ execute: async ({
+ x,
+ y,
+ }): Promise<{ success: boolean; message: string }> => {
+ try {
+ const backend = getDesktopBackend();
+ backend.mouseDoubleClick(x, y);
+ const pos =
+ x != null && y != null ? `(${x}, ${y})` : "current position";
+ return {
+ success: true,
+ message: `Double-clicked at ${pos}`,
+ };
+ } catch (error) {
+ const msg = error instanceof Error ? error.message : String(error);
+ return {
+ success: false,
+ message: `Double-click failed: ${msg}`,
+ };
+ }
+ },
+ });
+}
diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseDrag.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseDrag.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseDrag.ts
@@ -1,0 +1,40 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseDrag(_ctx: ToolContext) {
+ return tool({
+ description: `Drag the mouse from one position to another.
+
+Performs a click-and-drag from (startX, startY) to (endX, endY).
+Useful for selecting text, moving elements, or interacting with sliders.`,
+ inputSchema: z.object({
+ startX: z.number().describe("Starting X coordinate"),
+ startY: z.number().describe("Starting Y coordinate"),
+ endX: z.number().describe("Ending X coordinate"),
+ endY: z.number().describe("Ending Y coordinate"),
+ toolCallDescription: z
+ .string()
+ .describe("Why you are performing this drag operation"),
+ }),
+ execute: async ({
+ startX,
+ startY,
+ endX,
+ endY,
+ }): Promise<{ success: boolean; message: string }> => {
+ try {
+ const backend = getDesktopBackend();
+ backend.mouseDrag(startX, startY, endX, endY);
+ return {
+ success: true,
+ message: `Dragged from (${startX}, ${startY}) to (${endX}, ${endY})`,
+ };
+ } catch (error) {
+ const msg = error instanceof Error ? error.message : String(error);
+ return { success: false, message: `Mouse drag failed: ${msg}` };
+ }
+ },
+ });
+}
diff --git a/src/core/agents/offSecAgent/tools/computerUse/mouseMove.ts b/src/core/agents/offSecAgent/tools/computerUse/mouseMove.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/mouseMove.ts
@@ -1,0 +1,36 @@
+import { tool } from "ai";
+import { z } from "zod";
+import type { ToolContext } from "../types";
+import { getDesktopBackend } from "./platform";
+
+export function computerMouseMove(_ctx: ToolContext) {
+ return tool({
+ description: `Move the mouse cursor to specific coordinates on screen.
+
+Moves the mouse to the given (x, y) coordinates without clicking.
+Useful for hovering over elements to trigger tooltips or menus.`,
+ inputSchema: z.object({
+ x: z.number().describe("X coordinate to move to"),
+ y: z.number().describe("Y coordinate to move to"),
+ toolCallDescription: z
+ .string()
+ .describe("Why you are moving the mouse to this position"),
+ }),
+ execute: async ({
+ x,
+ y,
+ }): Promise<{ success: boolean; message: string }> => {
+ try {
+ const backend = getDesktopBackend();
+ backend.mouseMove(x, y);
+ return {
+ success: true,
+ message: `Mouse moved to (${x}, ${y})`,
+ };
+ } catch (error) {
+ const msg = error instanceof Error ? error.message : String(error);
+ return { success: false, message: `Mouse move failed: ${msg}` };
+ }
+ },
+ });
+}
diff --git a/src/core/agents/offSecAgent/tools/computerUse/platform.test.ts b/src/core/agents/offSecAgent/tools/computerUse/platform.test.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/platform.test.ts
@@ -1,0 +1,73 @@
+import { describe, it, expect } from "vitest";
+import {
+ detectPlatform,
+ LinuxBackend,
+ DarwinBackend,
+ WindowsBackend,
+} from "./platform";
+
+const BACKEND_METHODS = [
+ "screenshot",
+ "mouseMove",
+ "mouseClick",
+ "mouseDoubleClick",
+ "typeText",
+ "keyPress",
+ "getMousePosition",
+ "getScreenSize",
+ "mouseDrag",
+ "scroll",
+ "getActiveWindowTitle",
+] as const;
+
+describe("Computer Use platform detection", () => {
+ it("should detect the current platform", () => {
+ const platform = detectPlatform();
+ expect(["linux", "darwin", "win32", "unsupported"]).toContain(platform);
+ });
+
+ it("should return linux on Linux", () => {
+ if (process.platform !== "linux") return;
+ expect(detectPlatform()).toBe("linux");
+ });
+
+ it("should return darwin on macOS", () => {
+ if (process.platform !== "darwin") return;
+ expect(detectPlatform()).toBe("darwin");
+ });
+
+ it("should return win32 on Windows", () => {
+ if (process.platform !== "win32") return;
+ expect(detectPlatform()).toBe("win32");
+ });
+});
+
+describe("LinuxBackend", () => {
+ it("should be constructable and implement DesktopBackend", () => {
+ const backend = new LinuxBackend();
+ expect(backend).toBeDefined();
+ for (const method of BACKEND_METHODS) {
+ expect(typeof backend[method]).toBe("function");
+ }
+ });
+});
+
+describe("DarwinBackend", () => {
+ it("should be constructable and implement DesktopBackend", () => {
+ const backend = new DarwinBackend();
+ expect(backend).toBeDefined();
+ for (const method of BACKEND_METHODS) {
+ expect(typeof backend[method]).toBe("function");
+ }
+ });
+});
+
+describe("WindowsBackend", () => {
+ it("should be constructable and implement DesktopBackend", () => {
+ const backend = new WindowsBackend();
+ expect(backend).toBeDefined();
+ for (const method of BACKEND_METHODS) {
+ expect(typeof backend[method]).toBe("function");
+ }
+ });
+});
diff --git a/src/core/agents/offSecAgent/tools/computerUse/platform.ts b/src/core/agents/offSecAgent/tools/computerUse/platform.ts
new file mode 100644
--- /dev/null
+++ b/src/core/agents/offSecAgent/tools/computerUse/platform.ts
@@ -1,0 +1,533 @@
+/**
+ * Platform detection and desktop automation backend.
+ *
+ * Linux → xdotool + scrot/import (ImageMagick)
+ * macOS → cliclick + screencapture
+ * Windows → PowerShell + .NET System.Windows.Forms / user32.dll
+ *
+ * Each backend implements the same {@link DesktopBackend} interface so
+ * individual tools stay platform-agnostic.
+ */
+
+import {
+ execFileSync,
+ execSync,
+ type ExecSyncOptionsWithStringEncoding,
+} from "child_process";
+
+const EXEC_OPTS: ExecSyncOptionsWithStringEncoding = {
+ encoding: "utf-8",
+ timeout: 15_000,
+ stdio: ["pipe", "pipe", "pipe"],
+};
+
+export type Platform = "linux" | "darwin" | "win32" | "unsupported";
+
+export function detectPlatform(): Platform {
+ const p = process.platform;
+ if (p === "linux") return "linux";
+ if (p === "darwin") return "darwin";
+ if (p === "win32") return "win32";
+ return "unsupported";
+}
+
+export interface ScreenSize {
+ width: number;
+ height: number;
+}
+
+export interface MousePosition {
+ x: number;
+ y: number;
+}
+
+// ---------------------------------------------------------------------------
+// Desktop backend interface
+// ---------------------------------------------------------------------------
+
+export interface DesktopBackend {
+ /** Take a screenshot and return the file path to the PNG. */
+ screenshot(outputPath: string): string;
+
+ /** Move the mouse to absolute coordinates. */
+ mouseMove(x: number, y: number): void;
+
+ /** Click at the current (or specified) position. */
+ mouseClick(
+ button?: "left" | "right" | "middle",
+ x?: number,
+ y?: number,
+ ): void;
+
+ /** Double-click at the current (or specified) position. */
+ mouseDoubleClick(x?: number, y?: number): void;
+
+ /** Type text via keyboard (handles special characters). */
+ typeText(text: string): void;
+
+ /** Press a key or key combination (e.g. "Return", "ctrl+c", "alt+Tab"). */
+ keyPress(keys: string): void;
+
+ /** Get current mouse position. */
+ getMousePosition(): MousePosition;
+
+ /** Get screen dimensions. */
+ getScreenSize(): ScreenSize;
+
+ /** Drag from one point to another. */
+ mouseDrag(startX: number, startY: number, endX: number, endY: number): void;
+
+ /** Scroll the mouse wheel. Positive = down, negative = up. */
+ scroll(amount: number, x?: number, y?: number): void;
+
+ /** Get the title of the currently active window. */
+ getActiveWindowTitle(): string;
+}
+
+// ---------------------------------------------------------------------------
+// Linux backend (xdotool + scrot)
+// ---------------------------------------------------------------------------
+
+function exec(cmd: string): string {
+ return execSync(cmd, EXEC_OPTS).trim();
+}
+
+export class LinuxBackend implements DesktopBackend {
+ screenshot(outputPath: string): string {
+ exec(
+ `scrot -o "${outputPath}" 2>/dev/null || import -window root "${outputPath}"`,
+ );
+ return outputPath;
+ }
+
+ mouseMove(x: number, y: number): void {
+ exec(`xdotool mousemove ${x} ${y}`);
+ }
+
+ mouseClick(
+ button: "left" | "right" | "middle" = "left",
+ x?: number,
+ y?: number,
+ ): void {
+ const btnMap = { left: 1, middle: 2, right: 3 } as const;
+ if (x != null && y != null) {
+ exec(`xdotool mousemove ${x} ${y} click ${btnMap[button]}`);
+ } else {
+ exec(`xdotool click ${btnMap[button]}`);
+ }
+ }
+
+ mouseDoubleClick(x?: number, y?: number): void {
+ if (x != null && y != null) {
+ exec(`xdotool mousemove ${x} ${y} click --repeat 2 --delay 80 1`);
+ } else {
+ exec(`xdotool click --repeat 2 --delay 80 1`);
+ }
+ }
+
+ typeText(text: string): void {
+ exec(`xdotool type --clearmodifiers -- ${JSON.stringify(text)}`);
+ }
+
+ keyPress(keys: string): void {
+ execFileSync("xdotool", ["key", "--clearmodifiers", keys], EXEC_OPTS);
+ }
+
+ getMousePosition(): MousePosition {
+ const raw = exec("xdotool getmouselocation --shell");
+ const xMatch = raw.match(/X=(\d+)/);
+ const yMatch = raw.match(/Y=(\d+)/);
+ return {
+ x: xMatch ? parseInt(xMatch[1]!, 10) : 0,
+ y: yMatch ? parseInt(yMatch[1]!, 10) : 0,
+ };
+ }
+
+ getScreenSize(): ScreenSize {
+ const raw = exec("xdotool getdisplaygeometry");
+ const [w, h] = raw.split(" ").map(Number);
+ return { width: w ?? 0, height: h ?? 0 };
+ }
+
+ mouseDrag(startX: number, startY: number, endX: number, endY: number): void {
+ exec(
+ `xdotool mousemove ${startX} ${startY} mousedown 1 mousemove ${endX} ${endY} mouseup 1`,
+ );
+ }
+
+ scroll(amount: number, x?: number, y?: number): void {
+ if (x != null && y != null) {
+ exec(`xdotool mousemove ${x} ${y}`);
+ }
+ const button = amount > 0 ? 5 : 4;
+ const clicks = Math.abs(amount);
+ exec(`xdotool click --repeat ${clicks} --delay 50 ${button}`);
+ }
+
+ getActiveWindowTitle(): string {
+ try {
+ return exec("xdotool getactivewindow getwindowname");
+ } catch {
+ return "(unknown)";
+ }
+ }
+}
+
+// ---------------------------------------------------------------------------
+// macOS backend (cliclick + screencapture)
+// ---------------------------------------------------------------------------
+
+export class DarwinBackend implements DesktopBackend {
+ screenshot(outputPath: string): string {
+ exec(`screencapture -x "${outputPath}"`);
+ return outputPath;
+ }
+
+ mouseMove(x: number, y: number): void {
+ exec(`cliclick m:${x},${y}`);
+ }
+
+ mouseClick(
+ button: "left" | "right" | "middle" = "left",
+ x?: number,
+ y?: number,
+ ): void {
+ const prefix = button === "right" ? "rc" : "c";
+ if (x != null && y != null) {
+ exec(`cliclick ${prefix}:${x},${y}`);
+ } else {
+ const pos = this.getMousePosition();
+ exec(`cliclick ${prefix}:${pos.x},${pos.y}`);
+ }
+ }
+
+ mouseDoubleClick(x?: number, y?: number): void {
+ if (x != null && y != null) {
+ exec(`cliclick dc:${x},${y}`);
+ } else {
+ const pos = this.getMousePosition();
+ exec(`cliclick dc:${pos.x},${pos.y}`);
+ }
+ }
+
+ typeText(text: string): void {
+ exec(`cliclick t:${JSON.stringify(text)}`);
+ }
+
+ keyPress(keys: string): void {
+ execFileSync("cliclick", [`kp:${keys}`], EXEC_OPTS);
+ }
+
+ getMousePosition(): MousePosition {
+ const raw = exec("cliclick p");
+ const match = raw.match(/(\d+),(\d+)/);
+ return {
+ x: match ? parseInt(match[1]!, 10) : 0,
+ y: match ? parseInt(match[2]!, 10) : 0,
+ };
+ }
+
+ getScreenSize(): ScreenSize {
+ const raw = exec(
+ `system_profiler SPDisplaysDataType | grep Resolution | head -1`,
+ );
+ const match = raw.match(/(\d+)\s*x\s*(\d+)/);
+ return {
+ width: match ? parseInt(match[1]!, 10) : 0,
+ height: match ? parseInt(match[2]!, 10) : 0,
+ };
+ }
+
+ mouseDrag(startX: number, startY: number, endX: number, endY: number): void {
+ exec(`cliclick dd:${startX},${startY} du:${endX},${endY}`);
+ }
+
+ scroll(amount: number, x?: number, y?: number): void {
+ if (x != null && y != null) {
+ exec(`cliclick m:${x},${y}`);
+ }
+ const direction = amount > 0 ? "down" : "up";
+ const clicks = Math.abs(amount);
+ for (let i = 0; i < clicks; i++) {
+ exec(`cliclick "kp:${direction === "down" ? "arrow-down" : "arrow-up"}"`);
+ }
+ }
+
+ getActiveWindowTitle(): string {
+ try {
+ return exec(
+ `osascript -e 'tell application "System Events" to get name of first application process whose frontmost is true'`,
+ );
+ } catch {
+ return "(unknown)";
+ }
+ }
+}
+
+// ---------------------------------------------------------------------------
+// Windows backend (PowerShell + .NET System.Windows.Forms / user32.dll)
+// ---------------------------------------------------------------------------
+
+/**
+ * Run a PowerShell snippet and return trimmed stdout.
+ *
+ * Uses `-NoProfile -NonInteractive -Command` so startup is fast and
+ * there is no profile pollution. The snippet can use any .NET class
+ * available in the default PowerShell/.NET runtime.
+ */
+function ps(script: string): string {
+ return execSync(
+ `powershell -NoProfile -NonInteractive -Command ${JSON.stringify(script)}`,
+ EXEC_OPTS,
+ ).trim();
+}
+
+/**
+ * Shared C# helper that is injected once per PowerShell call when we need
+ * mouse or keyboard simulation via `user32.dll` P/Invoke.
+ */
+const WIN32_INPUT_TYPE = `
+Add-Type -TypeDefinition @"
+using System;
+using System.Runtime.InteropServices;
+public class Win32Input {
+ [DllImport("user32.dll")] public static extern bool SetCursorPos(int X, int Y);
+ [DllImport("user32.dll")] public static extern bool GetCursorPos(out POINT lpPoint);
+ [DllImport("user32.dll")] public static extern void mouse_event(uint dwFlags, int dx, int dy, int dwData, int dwExtraInfo);
+ [DllImport("user32.dll")] public static extern IntPtr GetForegroundWindow();
+ [DllImport("user32.dll", CharSet=CharSet.Auto)] public static extern int GetWindowText(IntPtr hWnd, System.Text.StringBuilder lpString, int nMaxCount);
+ [StructLayout(LayoutKind.Sequential)] public struct POINT { public int X; public int Y; }
+
+ public const uint MOUSEEVENTF_LEFTDOWN = 0x0002;
+ public const uint MOUSEEVENTF_LEFTUP = 0x0004;
+ public const uint MOUSEEVENTF_RIGHTDOWN = 0x0008;
+ public const uint MOUSEEVENTF_RIGHTUP = 0x0010;
+ public const uint MOUSEEVENTF_MIDDLEDOWN = 0x0020;
+ public const uint MOUSEEVENTF_MIDDLEUP = 0x0040;
+ public const uint MOUSEEVENTF_WHEEL = 0x0800;
+}
+"@
+`;
+
+export class WindowsBackend implements DesktopBackend {
+ screenshot(outputPath: string): string {
+ ps(
+ `Add-Type -AssemblyName System.Windows.Forms; ` +
+ `$bmp = New-Object System.Drawing.Bitmap([System.Windows.Forms.Screen]::PrimaryScreen.Bounds.Width, [System.Windows.Forms.Screen]::PrimaryScreen.Bounds.Height); ` +
+ `$g = [System.Drawing.Graphics]::FromImage($bmp); ` +
+ `$g.CopyFromScreen(0, 0, 0, 0, $bmp.Size); ` +
+ `$bmp.Save('${outputPath.replace(/'/g, "''")}'); ` +
+ `$g.Dispose(); $bmp.Dispose()`,
+ );
+ return outputPath;
+ }
+
+ mouseMove(x: number, y: number): void {
+ ps(`${WIN32_INPUT_TYPE} [Win32Input]::SetCursorPos(${x}, ${y})`);
+ }
+
+ mouseClick(
+ button: "left" | "right" | "middle" = "left",
+ x?: number,
+ y?: number,
+ ): void {
+ const downUp =
+ button === "right"
+ ? "[Win32Input]::MOUSEEVENTF_RIGHTDOWN, [Win32Input]::MOUSEEVENTF_RIGHTUP"
+ : button === "middle"
+ ? "[Win32Input]::MOUSEEVENTF_MIDDLEDOWN, [Win32Input]::MOUSEEVENTF_MIDDLEUP"
+ : "[Win32Input]::MOUSEEVENTF_LEFTDOWN, [Win32Input]::MOUSEEVENTF_LEFTUP";
+
+ const movePrefix =
+ x != null && y != null ? `[Win32Input]::SetCursorPos(${x}, ${y}); ` : "";
+
+ ps(
+ `${WIN32_INPUT_TYPE} ${movePrefix}` +
+ `$d, $u = ${downUp}; ` +
+ `[Win32Input]::mouse_event($d, 0, 0, 0, 0); ` +
+ `Start-Sleep -Milliseconds 30; ` +
+ `[Win32Input]::mouse_event($u, 0, 0, 0, 0)`,
+ );
+ }
+
+ mouseDoubleClick(x?: number, y?: number): void {
+ const movePrefix =
+ x != null && y != null ? `[Win32Input]::SetCursorPos(${x}, ${y}); ` : "";
+
+ ps(
+ `${WIN32_INPUT_TYPE} ${movePrefix}` +
+ `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0); ` +
+ `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTUP, 0, 0, 0, 0); ` +
+ `Start-Sleep -Milliseconds 80; ` +
+ `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTDOWN, 0, 0, 0, 0); ` +
+ `[Win32Input]::mouse_event([Win32Input]::MOUSEEVENTF_LEFTUP, 0, 0, 0, 0)`,
+ );
+ }
+
+ typeText(text: string): void {
+ ps(
+ `Add-Type -AssemblyName System.Windows.Forms; ` +
+ `[System.Windows.Forms.SendKeys]::SendWait(${JSON.stringify(text)})`,
+ );
+ }
+
+ keyPress(keys: string): void {
+ const mapped = mapKeysToSendKeys(keys);
+ ps(
+ `Add-Type -AssemblyName System.Windows.Forms; ` +
+ `[System.Windows.Forms.SendKeys]::SendWait('${mapped}')`,
+ );
+ }
+
+ getMousePosition(): MousePosition {
+ const raw = ps(
+ `${WIN32_INPUT_TYPE} $p = New-Object Win32Input+POINT; ` +
+ `[Win32Input]::GetCursorPos([ref]$p); "$($p.X),$($p.Y)"`,
+ );
+ const [xStr, yStr] = raw.split(",");
+ return {
+ x: parseInt(xStr ?? "0", 10),
+ y: parseInt(yStr ?? "0", 10),
+ };
+ }
+
+ getScreenSize(): ScreenSize {
+ const raw = ps(
... diff truncated: showing 800 of 1675 linesYou can send follow-ups to this agent here.
| for (let i = 0; i < clicks; i++) { | ||
| exec(`cliclick "kp:${direction === "down" ? "arrow-down" : "arrow-up"}"`); | ||
| } | ||
| } |
There was a problem hiding this comment.
macOS scroll simulates arrow keys instead of scrolling
Medium Severity
DarwinBackend.scroll simulates arrow key presses (kp:arrow-down / kp:arrow-up) instead of actual scroll wheel events. Arrow keys move cursors or selections in most applications — they don't scroll content. The DesktopBackend interface and the computer_scroll tool description both promise mouse wheel scrolling, but this implementation does something fundamentally different on macOS. cliclick lacks native scroll support, so an alternative like osascript with Quartz event APIs would be needed.
| `Add-Type -AssemblyName System.Windows.Forms; ` + | ||
| `[System.Windows.Forms.SendKeys]::SendWait(${JSON.stringify(text)})`, | ||
| ); | ||
| } |
There was a problem hiding this comment.
Windows typeText doesn't escape SendKeys special characters
Medium Severity
WindowsBackend.typeText passes text to SendKeys.SendWait without escaping special characters. SendKeys treats + as Shift, ^ as Ctrl, % as Alt, ~ as Enter, and {, }, (, ) as grouping/special key delimiters. Typing ordinary text like 2+2 or 100% will produce wrong output (modifier key combos instead of literal characters). These characters need to be wrapped in braces (e.g., {+}, {%}) before being sent.
|
|
||
| typeText(text: string): void { | ||
| exec(`xdotool type --clearmodifiers -- ${JSON.stringify(text)}`); | ||
| } |
There was a problem hiding this comment.
Linux/macOS typeText corrupts text with dollar signs
Medium Severity
LinuxBackend.typeText and DarwinBackend.typeText use JSON.stringify(text) to produce a double-quoted shell string, but bash expands $variable, $(command), and backtick expressions inside double quotes. Typing text like "Price is $100" would produce "Price is 00" (since $1 is an empty positional parameter), and text containing $(...) would execute arbitrary commands. Single quotes or proper shell escaping would prevent this.
Additional Locations (1)
| x?: number, | ||
| y?: number, | ||
| ): void { | ||
| const prefix = button === "right" ? "rc" : "c"; |
There was a problem hiding this comment.
macOS middle-click silently performs left-click instead
Low Severity
DarwinBackend.mouseClick maps button === "right" to "rc" but falls through to "c" (left click) for "middle". The tool and interface accept "middle" as a valid button, but the macOS backend silently performs a left click instead, with no error or warning. This could cause the agent to repeatedly fail at middle-click-dependent GUI tasks without understanding why.
|
@joshkotrous do you have a video of computer use by apex? |
LinuxBackend.keyPress and DarwinBackend.keyPress interpolated the keys string directly into shell commands via exec() with no quoting or escaping. Shell metacharacters in the keys parameter would be interpreted by /bin/sh, enabling arbitrary command execution. Switch both implementations to execFileSync with an argument array, which bypasses the shell entirely and passes arguments safely. Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
cliclick's kp: command only supports single key names (e.g. kp:return). Modifier combos like ctrl+c require separate kd: (key down), kp: (tap), and ku: (key up) arguments. Added mapKeysToClichClick() to translate combo strings into the correct cliclick argument sequence. Co-authored-by: Josh Kotrous <joshkotrous@users.noreply.github.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 5 total unresolved issues (including 4 from previous reviews).
Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.
Reviewed by Cursor Bugbot for commit ab4ddde. Configure here.
| "spawn_pentest_swarm", | ||
| "spawn_coding_agent", | ||
| "run_pentest_workflow", | ||
| "delegate_to_computer_use_agent", |
There was a problem hiding this comment.
Plan mode includes mutating computer-use delegation tool
Medium Severity
PLAN_MODE_TOOL_NAMES intentionally limits direct computer use tools to read-only ones (computer_screenshot, computer_mouse_move, computer_scroll) but also includes delegate_to_computer_use_agent, which spawns a subagent with full access to all computer use tools plus execute_command. This bypasses the plan-mode restriction, allowing clicks, typing, and dragging through the delegated agent.
Reviewed by Cursor Bugbot for commit ab4ddde. Configure here.


What does this PR do?
Adds a complete computer use toolset and a specialized Computer Use agent for desktop GUI automation across all three major platforms.
Problem: Penetration testing scenarios involving thick-client apps, VNC/RDP sessions, native OS dialogs, and other graphical interfaces cannot be automated through browser tools or CLI alone.
Solution: Platform-aware desktop automation tools that provide mouse, keyboard, screenshot, and scroll capabilities:
Tools (9 new tools)
computer_screenshotcomputer_mouse_clickcomputer_mouse_double_clickcomputer_mouse_movecomputer_mouse_dragcomputer_type_textcomputer_key_presscomputer_scrollcomputer_screen_infoPlatform backends
Computer Use Agent
ComputerUseAgent— specialized subagent following an observe-plan-act loop (screenshot → identify elements → interact → verify)delegate_to_computer_use_agent— orchestration tool so parent agents can delegate GUI tasksALL_TOOL_NAMES,PLAN_MODE_TOOL_NAMES, and the toolset UI definitionsArchitecture
src/core/agents/offSecAgent/tools/computerUse/— platform backend + 9 tool filessrc/core/agents/specialized/computerUseAgent/— agent, prompts, typessrc/core/agents/offSecAgent/tools/delegateComputerUse.ts— orchestration tooltool()fromai+ Zod schemas +ToolContextspawnCodingAgent/delegateAuth(dynamic import, event bus forwarding, subagent lifecycle)How did you verify your code works?
bun run tsc— TypeScript type checking passesbun run lint— ESLint passesbun run format:check— Prettier passesbun run test— All 30 test files pass (583 tests), 7 skipped (integration tests needing live services)platform.test.ts) — verifies all three backends are constructable and implement the fullDesktopBackendinterface