A GUI automation agent that uses Google's Gemini API to intelligently detect and interact with UI elements based on natural language descriptions.
- Intelligent UI Detection: Uses Gemini AI to identify UI elements from screenshots
- Natural Language Tasks: Describe what you want to click in plain English
- Safety First: Clicks run automatically by default; pass --confirm to require confirmation before each click
- Flexible Models: Choose between gemini-1.5-flash (fast, maps to gemini-flash-latest) or gemini-1.5-pro (complex UIs, maps to gemini-pro-latest)
- Coordinate Mapping: Automatically converts normalized coordinates to your screen resolution
- Install dependencies:
pip install -r requirements.txt
- Set up your Gemini API key:
- Get your API key from Google AI Studio
- Create a .env file in the project root:
GEMINI_API_KEY=your_api_key_here
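To illustrate how a key in `.env` can reach the running process, here is a minimal standard-library sketch. It is an assumption about the mechanism, not the project's actual loader (which may well rely on a package such as python-dotenv); `load_env_file` and `get_api_key` are hypothetical helper names.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Hypothetical helper: variables already set in the environment win.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_api_key(path=".env"):
    """Return GEMINI_API_KEY, failing loudly if it is missing."""
    load_env_file(path)
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error surfacing later from the API call.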
Run with a high-level goal. The agent will analyze the screen and execute steps autonomously:
python3 linguist_assist.py "Log in to the website"
The agent will:
- Analyze the current screen
- Determine what action to take
- Execute the action
- Take a fresh screenshot
- Continue until the goal is achieved
Use a different model for complex UIs:
python3 linguist_assist.py "Fill out the form and submit it" --model gemini-1.5-pro
Set maximum steps:
python3 linguist_assist.py "Complete the checkout process" --max-steps 30

from linguist_assist import LinguistAssist
# Initialize the agent
agent = LinguistAssist(model_name="gemini-1.5-flash")
# Execute a high-level goal autonomously
agent.execute_task("Log in to the website")
# Set maximum steps for complex workflows
agent.execute_task("Complete the checkout process", max_steps=30)
# Or just detect coordinates without clicking
x, y = agent.detect_element("Find the search bar")
print(f"Element is at: ({x}, {y})")
# Single action (for simple tasks)
agent.click_element("Click the login button")

- Goal Analysis: Takes a high-level goal and breaks it down into steps
- Screenshot Capture: Takes a fresh screenshot of the current screen state
- AI Planning: Uses Gemini to analyze the screen and determine:
- Is the goal complete?
- What is the next action needed?
- Where should the action be performed?
- Action Execution: Performs the action (click) using pyautogui
- Iteration: Takes a new screenshot and repeats until the goal is achieved
- Adaptive: Adapts to UI changes after each action by analyzing fresh screenshots
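The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `linguist_assist.py`: `Plan`, `run_goal`, and the injected `capture`, `plan`, and `click` callables are hypothetical names, and the default of 20 steps is assumed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Plan:
    """Hypothetical result of one planning call."""
    done: bool                                # Is the goal complete?
    action: Optional[str] = None              # Next action, e.g. "click"
    target: Optional[Tuple[int, int]] = None  # Where to act, in pixels


def run_goal(
    goal: str,
    capture: Callable[[], Any],
    plan: Callable[[str, Any], Plan],
    click: Callable[[int, int], None],
    max_steps: int = 20,  # assumed default
) -> bool:
    """Screenshot, plan, act, repeat; True once the planner reports done."""
    for _ in range(max_steps):
        screenshot = capture()             # fresh view of the current screen
        decision = plan(goal, screenshot)  # in the real agent, a Gemini call
        if decision.done:
            return True
        if decision.action == "click" and decision.target is not None:
            click(*decision.target)
    return False  # step budget exhausted before the goal was reached
```

Passing the screenshot, planner, and click functions in as arguments keeps the loop testable without a display or an API key.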
- Automatic Execution: By default, clicks execute automatically after detection
- Optional Confirmation: Use the --confirm flag to require confirmation before clicking
- Failsafe: PyAutoGUI's failsafe is enabled (move mouse to corner to abort)
- Coordinate Display: Shows detected coordinates before clicking
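As a rough sketch of how these safeguards could fit together (the function name `guarded_click` and its parameters are hypothetical, not the project's API), the click can be gated behind an optional prompt while PyAutoGUI's failsafe stays on:

```python
from typing import Callable, Optional


def guarded_click(
    x: int,
    y: int,
    confirm: bool = False,
    ask: Callable[[str], str] = input,
    click: Optional[Callable[[int, int], None]] = None,
) -> bool:
    """Show the coordinates, optionally ask first, then click.

    Returns True if the click was performed, False if the user declined.
    """
    print(f"Detected coordinates: ({x}, {y})")
    if confirm:
        answer = ask("Click here? [y/N] ").strip().lower()
        if answer not in ("y", "yes"):
            return False
    if click is None:
        # Imported lazily so the confirmation logic is usable without a display.
        import pyautogui
        pyautogui.FAILSAFE = True  # moving the mouse to a screen corner aborts
        click = pyautogui.click
    click(x, y)
    return True
```

With `confirm=False` the behaviour matches the automatic default described above; `confirm=True` corresponds to what the `--confirm` flag asks for.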
- Python 3.7+
- Google Generative AI API key
- macOS, Linux, or Windows
LinguistAssist can run as a background service on macOS, allowing you to submit tasks without manually running the script each time.
- Install the service:
./install_service.sh
This will:
- Install LinguistAssist as a macOS Launch Agent
- Set up the service to start automatically on login
- Configure logging to ~/.linguist_assist/service.log

# Start the service
launchctl start com.jumpcloud.linguistassist
# Stop the service
launchctl stop com.jumpcloud.linguistassist
# Check service status
launchctl list | grep linguistassist
# View service logs
tail -f ~/.linguist_assist/service.log

Once the service is running, submit tasks using the submit_task.py script:
# Submit a task
python3 submit_task.py "Log in to the website"
# Submit with custom max steps
python3 submit_task.py "Complete checkout process" --max-steps 30
# Check task status
python3 submit_task.py --status <task-id>

Tasks are queued in ~/.linguist_assist/queue/ and processed sequentially. Results are saved in ~/.linguist_assist/completed/.
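A directory-backed queue of this kind might look roughly like the sketch below. The JSON file format, the file naming scheme, and the function names `submit_task` and `process_next` are assumptions made for illustration; only the `queue/` and `completed/` directory names come from the description above.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Callable, Optional


def submit_task(base: Path, goal: str, max_steps: int = 20) -> str:
    """Write one task file into <base>/queue and return its id."""
    queue = base / "queue"
    queue.mkdir(parents=True, exist_ok=True)
    task_id = uuid.uuid4().hex
    task = {"id": task_id, "goal": goal, "max_steps": max_steps, "status": "queued"}
    # A timestamp prefix lets a simple name sort approximate submission order.
    (queue / f"{time.time_ns()}-{task_id}.json").write_text(json.dumps(task))
    return task_id


def process_next(base: Path, run: Callable[[str, int], bool]) -> Optional[str]:
    """Run the oldest queued task, move its record to <base>/completed."""
    queue, completed = base / "queue", base / "completed"
    queue.mkdir(parents=True, exist_ok=True)
    completed.mkdir(parents=True, exist_ok=True)
    pending = sorted(queue.glob("*.json"))
    if not pending:
        return None
    path = pending[0]
    task = json.loads(path.read_text())
    try:
        task["status"] = "completed" if run(task["goal"], task["max_steps"]) else "failed"
    except Exception as exc:  # record the failure instead of stopping the worker
        task["status"] = "failed"
        task["error"] = str(exc)
    (completed / path.name).write_text(json.dumps(task))
    path.unlink()
    return task["id"]
```

Handling one file at a time is what makes processing sequential: the worker never starts a second task until the first record has been moved to `completed/`.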
./uninstall_service.sh

Note: The service runs as a user agent (not a system daemon) to ensure it has access to the GUI and screen recording permissions.
LinguistAssist includes a graphical user interface for easy task management.
python3 linguist_assist_gui.py
Or use the helper script:
./start_gui.sh

- Submit Tasks: Enter goals and submit them with a click
- Task Monitor: View all tasks (queued, processing, completed, failed)
- Task Details: Click on any task to see detailed information
- Auto-refresh: Automatically updates task status every 5 seconds
- Status Filtering: Filter tasks by status (all, queued, processing, completed, failed)
- API Settings: Configure API URL and API key
The GUI communicates with the Launch Agent service via the HTTP API, so the service must be running for the GUI to work.
LinguistAssist includes an HTTP API server that allows you to submit tasks remotely from any device or network.
- Install the API server:
./setup_remote_access.sh
This will:
- Install Flask dependencies
- Generate an API key (stored in ~/.linguist_assist/api_config.json)
- Set up the API server as a Launch Agent
- Configure the server to run on startup
- POST /api/v1/tasks - Submit a new task
- GET /api/v1/tasks/<id> - Get task status
- GET /api/v1/tasks - List all tasks
- DELETE /api/v1/tasks/<id> - Cancel a queued task
- GET /api/v1/health - Health check (no auth required)
All endpoints (except /health) require an API key in the X-API-Key header:
curl -H "X-API-Key: your-api-key" http://localhost:8080/api/v1/tasks \
-d '{"goal": "Log in to the website", "max_steps": 20}'
Get your API key:
cat ~/.linguist_assist/api_config.json | grep -A 1 api_keys

# On remote machine
ssh -L 8080:localhost:8080 user@your-mac-ip
# Then access API at http://localhost:8080
curl -H "X-API-Key: your-key" http://localhost:8080/api/v1/tasks \
-d '{"goal": "Click login button"}'

Update the API server to bind to 0.0.0.0 instead of 127.0.0.1:
# Edit ~/.linguist_assist/api_config.json and set "host": "0.0.0.0"
# Then restart: launchctl stop com.jumpcloud.linguistassist.api && launchctl start com.jumpcloud.linguistassist.api
Then access from remote machine:
curl -H "X-API-Key: your-key" http://your-mac-ip:8080/api/v1/tasks \
-d '{"goal": "Click login button"}'
Note: Ensure your firewall allows incoming connections on port 8080.

# Install ngrok, then expose local API
ngrok http 8080
# Access via ngrok URL (includes HTTPS)
curl -H "X-API-Key: your-key" https://abc123.ngrok.io/api/v1/tasks \
-d '{"goal": "Click login button"}'

from api_client_example import LinguistAssistAPIClient
client = LinguistAssistAPIClient("http://localhost:8080", "your-api-key")
# Submit task
result = client.submit_task("Log in to the website", max_steps=20)
task_id = result["task_id"]
# Wait for completion
status = client.wait_for_task(task_id)
print(f"Task completed: {status['status']}")

# Submit a task
python3 api_client_example.py --api-key YOUR_KEY "Log in to website"
# Submit and wait for completion
python3 api_client_example.py --api-key YOUR_KEY --wait "Log in to website"
# List all tasks
python3 api_client_example.py --api-key YOUR_KEY --list
# Check task status
python3 api_client_example.py --api-key YOUR_KEY --status <task-id>
# Cancel a task
python3 api_client_example.py --api-key YOUR_KEY --cancel <task-id>

# Start API server
launchctl start com.jumpcloud.linguistassist.api
# Stop API server
launchctl stop com.jumpcloud.linguistassist.api
# Check status
launchctl list | grep linguistassist.api
# View logs
tail -f ~/.linguist_assist/api.log

- API Key Authentication: All requests except the health check require a valid API key
- Rate Limiting: Configurable requests per minute (default: 60)
- IP Whitelisting: Optional IP-based access control
- Input Validation: Validates task goals and parameters
See REMOTE_ACCESS_PLAN.md for detailed architecture and security considerations.
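For readers curious how the API key check and per-minute rate limit listed above could work, here is a minimal sketch. It is not the server's actual code: `RateLimiter` and `key_is_valid` are hypothetical names, and the sliding-window approach is one assumed way to enforce the limit.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Callable, Iterable


class RateLimiter:
    """Sliding-window limiter: at most per_minute requests per client per 60 s."""

    def __init__(self, per_minute: int = 60, clock: Callable[[], float] = time.monotonic):
        self.per_minute = per_minute
        self.clock = clock
        self.hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client: str) -> bool:
        now = self.clock()
        window = self.hits[client]
        # Drop timestamps that have aged out of the 60-second window.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.per_minute:
            return False
        window.append(now)
        return True


def key_is_valid(presented: str, valid_keys: Iterable[str]) -> bool:
    """Compare the X-API-Key value against configured keys in constant time."""
    return any(
        hmac.compare_digest(presented.encode(), key.encode()) for key in valid_keys
    )
```

`hmac.compare_digest` avoids leaking how many leading characters of a key matched, which a plain `==` comparison can do through timing.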
- The agent uses normalized coordinates (0-1000) to work across different screen resolutions
- For simple UIs, use gemini-1.5-flash for faster responses
- For complex UIs with many elements, use gemini-1.5-pro for better accuracy
- Note: The google-generativeai package shows a deprecation warning but still works. Model names gemini-1.5-flash and gemini-1.5-pro are automatically mapped to gemini-flash-latest and gemini-pro-latest, respectively
- On macOS, you'll need to grant Screen Recording permissions in System Settings > Privacy & Security > Screen Recording
- When running as a service, ensure the Terminal/Python process has Screen Recording permissions
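
The normalized-coordinate note above comes down to a small piece of arithmetic. The sketch below shows one way to do the conversion; `to_pixels` is a hypothetical name, and clamping to the screen bounds is an added assumption rather than documented behaviour.

```python
from typing import Tuple


def to_pixels(
    norm_x: float,
    norm_y: float,
    screen_w: int,
    screen_h: int,
    scale: int = 1000,
) -> Tuple[int, int]:
    """Map a point on the 0-1000 normalized grid to screen pixels."""
    x = round(norm_x / scale * screen_w)
    y = round(norm_y / scale * screen_h)
    # Clamp so a value of exactly 1000 still lands on the last valid pixel.
    return min(max(x, 0), screen_w - 1), min(max(y, 0), screen_h - 1)


# The centre of the normalized grid on a 1920x1080 display:
print(to_pixels(500, 500, 1920, 1080))  # (960, 540)
```

In practice the screen size could come from `pyautogui.size()`, which returns the primary display's width and height.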