ColabGame is a clemgame benchmark for evaluating LLM agents in computer-use scenarios. Agents interact with an Ubuntu virtual machine environment (via OSWorld) to complete realistic computer-based tasks such as debugging code, processing data, editing images, and synthesizing research.
The game supports both single-agent and multi-agent topologies, allowing evaluation of collaborative agent behaviors across different communication structures.
ColabGame is a benchmark designed to evaluate the capabilities of Large Language Model (LLM) agents in realistic computer-use scenarios. Unlike traditional text-based benchmarks, ColabGame tests agents' ability to interact with actual operating systems, applications, and files within a virtual machine environment.
- Realistic Computer Tasks: Agents complete real-world tasks including debugging code, processing data, editing images, and synthesizing research
- Multi-Agent Support: Evaluate collaborative behaviors through various agent topologies (Single, Star, Mesh, Blackboard)
- OSWorld Integration: Leverages OSWorld's scalable, reproducible Ubuntu virtual machine environment
- Comprehensive Evaluation: Episode-level scoring with success metrics and request statistics
- Flexible Task Categories: Five task categories across three difficulty levels, plus support for OSWorld benchmark tasks
ColabGame uses OSWorld as an external computer substrate to enable LLM agents to interact with real computer environments. Agents can:
- Execute shell commands and scripts
- Manipulate files and directories
- Use applications (browsers, editors, office tools)
- Complete realistic computer-based tasks
Currently, ColabGame supports the Ubuntu Linux environment only.
Before installing ColabGame, ensure you have:
- Python 3.12+ installed
- ~25GB disk space (VM image + dependencies)
- Git installed
- VMware Workstation Pro or VMware Fusion 13 (for macOS with Apple chips)
Important: Clone OSWorld outside of the colabgame project directory (e.g., as a sibling folder).
- Install VMware (see Installation section below).
- Run the automated setup script:

      bash setup_osworld.sh

  The script will automatically configure the VM path in your `.env` file.
- Source your environment variables:

      export $(cat .env | xargs)

  This loads the VM path and other configuration from `.env`.
- Configure S3 credentials (if needed) in your `.env` file.
- Generate instances:

      python3 src/instancegenerator.py

- Run your first experiment:

      python3 -m clem run -g colabgame -m mock
OSWorld is a benchmark environment for evaluating multimodal agents in real computer environments. It provides scalable, reproducible virtual machines (Ubuntu and Windows) where agents can interact with actual operating systems, applications, and files. OSWorld supports realistic computer tasks including file operations, web browsing, and application use, with a comprehensive evaluation framework for agent capabilities.
For more information, see the official website, paper, and documentation.
After installing VMware (see Step 2 below), run the setup script:

    bash setup_osworld.sh [--osworld-dir PATH] [--skip-vm]

Arguments:
- `--osworld-dir PATH`: Specify OSWorld clone location (default: `../OSWorld`)
- `--skip-vm`: Skip VM initialization (for testing)
The script checks prerequisites, clones OSWorld, installs dependencies, downloads the Ubuntu VM (~20GB), and automatically configures the VM path in your .env file.

    git clone https://github.com/Nid989/OSWorld
    cd OSWorld
    pip install -e .

For non-virtualized systems (desktop, laptop, bare metal machine), use VMware.
- Install VMware Workstation Pro (for systems with Apple chips, install VMware Fusion 13).

  Note: Broadcom account required. License key may be requested during installation.

  Installation references:
- Verify installation:

      vmrun -T ws list      # Windows/Linux
      vmrun -T fusion list  # macOS

  You should see a message listing the currently running virtual machines.
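The verification step can also be scripted. The following is a minimal sketch, assuming `vmrun` is on your `PATH`; the helper names `vmrun_list_command` and `check_vmrun` are hypothetical and not part of ColabGame or OSWorld.

```python
import platform
import shutil
import subprocess


def vmrun_list_command(system: str) -> list[str]:
    """Build the `vmrun ... list` command for a platform name.

    `system` is the value returned by platform.system(), e.g. "Darwin".
    """
    # macOS ("Darwin") uses VMware Fusion; Windows and Linux use Workstation.
    host_type = "fusion" if system == "Darwin" else "ws"
    return ["vmrun", "-T", host_type, "list"]


def check_vmrun() -> str:
    """Run the verification command and return its standard output."""
    if shutil.which("vmrun") is None:
        raise RuntimeError("vmrun not found on PATH; is VMware installed?")
    command = vmrun_list_command(platform.system())
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `print(check_vmrun())` on a machine with VMware installed should print the same listing as the shell commands above.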
Note: VirtualBox is supported as an alternative to VMware. However, parallelism and macOS on Apple chips may not be well-supported.
Run quickstart to download and setup the VM (~20GB):

    cd OSWorld
    python quickstart.py

The VM will be saved to: `./vmware_vm_data/Ubuntu0/Ubuntu0.vmx`
The VM path is automatically configured by the setup_osworld.sh script and added to your .env file as VM_PATH. The system will use this path in the following priority order:
- Environment variable (`VM_PATH` from `.env` file) - Recommended
- Config file (`resources/config.yaml` - `system.vm_path`)
- Default fallback (from `src/utils/constants.py`)
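The priority order can be illustrated with a short sketch. This is not ColabGame's actual lookup code: the function `resolve_vm_path` and the constant `DEFAULT_VM_PATH` are hypothetical, and the config is passed in as an already-parsed dictionary to keep the example free of third-party dependencies.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical placeholder; the real default is defined in src/utils/constants.py.
DEFAULT_VM_PATH = "vmware_vm_data/Ubuntu0/Ubuntu0.vmx"


def resolve_vm_path(
    config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
    default: str = DEFAULT_VM_PATH,
) -> str:
    """Return the VM path using the documented priority order."""
    env = os.environ if env is None else env

    # 1. Environment variable VM_PATH (recommended).
    if env.get("VM_PATH"):
        return env["VM_PATH"]

    # 2. Config file entry system.vm_path.
    system_section = config.get("system") or {}
    if system_section.get("vm_path"):
        return system_section["vm_path"]

    # 3. Default fallback.
    return default
```

For example, `resolve_vm_path({"system": {"vm_path": "/cfg.vmx"}}, env={})` returns `/cfg.vmx`, because no environment variable overrides the config entry.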
If you need to manually set the VM path:
- Option 1: Add to `.env` file (Recommended)

      VM_PATH="/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"

  Then source it:

      export $(cat .env | xargs)

- Option 2: Set in `resources/config.yaml`

      system:
        vm_path: "/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"
Note: If you used the automated setup script, the VM path is already configured in your `.env` file. You only need to configure it manually if you installed the VM manually or moved it to a different location.
Add the OSWorld repository path to your PYTHONPATH. You can do this in two ways:
Option 1: Export directly in your shell

    export PYTHONPATH=/path/to/OSWorld:$PYTHONPATH

Option 2: Add to `.env` file

Add the following to your `.env` file in the colabgame project root:

    PYTHONPATH=/path/to/OSWorld
    VM_PATH=/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx

Then source the `.env` file:

    export $(cat .env | xargs)

Note: If you used the automated setup script (`setup_osworld.sh`), `VM_PATH` is already added to your `.env` file. You only need to add `PYTHONPATH` if it's not already there.
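The `export $(cat .env | xargs)` one-liner relies on shell word splitting, so values containing spaces may not survive it. If you launch ColabGame from a Python entry point, a small loader can serve as an alternative. This is a minimal sketch with hypothetical helper names (`parse_env_text`, `load_env_file`); it handles only simple `KEY=value` lines, comments, and one pair of surrounding quotes.

```python
import os
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path: str = ".env") -> None:
    """Read a .env file and copy its entries into os.environ."""
    with open(path, encoding="utf-8") as handle:
        os.environ.update(parse_env_text(handle.read()))
```

Note that variables set this way affect only the current Python process and its children; `PYTHONPATH` in particular must be set before the interpreter starts to influence imports.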
Configure the following S3 environment variables in your .env file. These are necessary for:
- Syncing screenshots from the environment to local results for transcribing
- Uploading ground truth contents for evaluating the environment state at the end of an instance run

    AWS_ACCESS_KEY_ID=your_access_key_id
    AWS_SECRET_ACCESS_KEY=your_secret_access_key
    AWS_REGION=your_aws_region
    S3_BUCKET_NAME=your_bucket_name

Replace the placeholder values with your actual AWS credentials and S3 bucket information.
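Before starting a long run, it can help to confirm that all four variables are set. The check below is a sketch; `missing_s3_vars` is a hypothetical helper, not a ColabGame function, and it only tests for presence, not for valid credentials.

```python
import os
from typing import List, Mapping, Optional

# The four variable names listed above.
REQUIRED_S3_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
)


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required S3 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]
```

An empty list from `missing_s3_vars()` means every variable has a non-empty value in the current environment.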
- Username: `user`
- Password: `password`
ColabGame experiments span five task categories across three difficulty levels:
| Category | Description |
|---|---|
| Debugging & Refactoring | Fix syntax errors, complete logic, update configurations |
| Tabular Data Reporting | Transfer, aggregate, and calculate data across files |
| Image Processing | Insert, resize, and caption images in documents |
| Research Synthesis | Extract web information, summarize content, integrate downloads |
| Workflow Orchestration | Gather and organize information across applications |
Additionally, ColabGame supports OSWorld benchmark tasks (chrome, gimp, libreoffice_writer, os).
Note: Not all instances are available for OSWorld tasks; we sample a subset from the original dataset.
ColabGame supports four agent topologies for evaluating different collaboration patterns:
- Single – A single agent with full environment access
- Star – Hub agent coordinates with spoke agents (centralized); spoke agents have environment access
- Mesh – All agents connected (decentralized); agents take turns with environment access
- Blackboard – Shared context model; agents contribute via a common workspace (round-robin fashion)
Topology configurations are defined in resources/topologies.
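To make the difference between the topologies concrete, the sketch below derives the direct agent-to-agent links each one implies. It is an illustration only: the function `communication_links` and the lowercase topology names are assumptions, and the actual configuration format in `resources/topologies` may look quite different.

```python
from itertools import combinations
from typing import List, Set, Tuple


def communication_links(topology: str, agents: List[str]) -> Set[Tuple[str, str]]:
    """Return undirected agent-to-agent links implied by a topology.

    For "star", the first agent in the list is treated as the hub.
    """
    if topology in ("single", "blackboard"):
        # Single: only one agent, so there is nobody to message.
        # Blackboard: agents contribute to a shared workspace rather than
        # messaging each other directly (assumed here to mean no direct links).
        return set()
    if topology == "star":
        hub, spokes = agents[0], agents[1:]
        return {(hub, spoke) for spoke in spokes}
    if topology == "mesh":
        # Every pair of agents is connected.
        return set(combinations(agents, 2))
    raise ValueError(f"unknown topology: {topology}")
```

With three agents, a star yields two links (hub to each spoke) while a mesh yields three (one per pair), which is one way to see why mesh communication overhead grows faster as agents are added.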
Prompt templates define how agents interact with tasks and the environment. These templates are customizable and located under resources/prompts.
The prompt system supports:
- Task-specific instructions
- Agent role definitions
- Communication protocols for multi-agent topologies
- Environment interaction guidelines
Experiments are configured in resources/config.yaml. This configuration file defines:
- Task categories and difficulty levels
- Agent topologies to evaluate
- Model configurations
- Evaluation parameters
Generate instances by running:

    uv run python src/instancegenerator.py

Instances are written to `in/instances.json`. Each instance contains:
- Task description and requirements
- Initial environment state
- Success criteria
- Agent configuration
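After generation, a quick sanity check is to count the instances per experiment. The sketch below assumes the file has a top-level `experiments` list whose entries carry a `name` and a `game_instances` list. That layout is an assumption about the generated file and is not confirmed by this README, so adjust the keys to match your output.

```python
import json
from typing import Any, Dict


def count_instances(data: Dict[str, Any]) -> Dict[str, int]:
    """Map each experiment name to its number of game instances.

    Assumes the keys "experiments", "name", and "game_instances".
    """
    return {
        experiment["name"]: len(experiment["game_instances"])
        for experiment in data["experiments"]
    }


def count_instances_in_file(path: str = "in/instances.json") -> Dict[str, int]:
    """Load the instances file and count instances per experiment."""
    with open(path, encoding="utf-8") as handle:
        return count_instances(json.load(handle))
```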
Run experiments using the clem CLI:

    # Run with a specific model
    python clem run -g colabgame -m <model_name>
    # Run with mock model for testing
    python clem run -g colabgame -m mock

To transcribe interactions into readable formats:

    python clem transcribe -g colabgame

To score completed episodes:

    python clem score -g colabgame

To generate evaluation tables:

    python clem eval

ColabGame uses the following metrics to evaluate agent performance:
Formula: Success × 100
The primary metric for task completion. A binary score where:
- 100: Task completed successfully
- 0: Task failed or aborted
Success is determined by:
- Final Environment State: Evaluation of the desktop environment state at episode completion
- Instance-Defined Criteria: Task-specific requirements (e.g., file creation, correct output, application state)
- No Abort Condition: The game must complete without errors or timeouts
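The scoring rule above reduces to a few lines. This is a sketch of the stated formula, not the scorer ColabGame ships; the names `main_score` and `mean_main_score` are hypothetical.

```python
from typing import Iterable, Tuple


def main_score(success: bool, aborted: bool) -> int:
    """Return Success x 100, where an aborted episode never counts as success."""
    return 100 if (success and not aborted) else 0


def mean_main_score(episodes: Iterable[Tuple[bool, bool]]) -> float:
    """Average the main score over (success, aborted) pairs."""
    scores = [main_score(success, aborted) for success, aborted in episodes]
    return sum(scores) / len(scores) if scores else 0.0
```

Because each episode scores either 0 or 100, the mean over a set of episodes can be read directly as a success percentage.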
Tracks agent behavior and compliance:
- Total Requests: Number of actions attempted by each agent
- Parsed Requests: Successfully formatted and executable requests
- Violated Requests: Actions that violate environment constraints or communication rules
These statistics help identify:
- Agent efficiency (fewer requests for same outcome)
- Instruction-following capability (high parse rate)
- Constraint adherence (low violation rate)
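The parse rate and violation rate mentioned above can be derived from the three counters. The sketch below assumes both rates are taken relative to total requests; the README does not state the denominators, so treat that as an assumption. The class name `RequestStats` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    """Per-agent request counters for one episode."""

    total: int     # actions attempted
    parsed: int    # successfully formatted and executable requests
    violated: int  # requests that broke environment or communication rules

    @property
    def parse_rate(self) -> float:
        """Fraction of requests that parsed (assumed denominator: total)."""
        return self.parsed / self.total if self.total else 0.0

    @property
    def violation_rate(self) -> float:
        """Fraction of requests that violated a rule (assumed denominator: total)."""
        return self.violated / self.total if self.total else 0.0
```

For example, `RequestStats(total=10, parsed=8, violated=2)` gives a parse rate of 0.8 and a violation rate of 0.2.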
For multi-agent topologies, additional metrics include:
- Collaboration Efficiency: Success rate relative to communication overhead
- Agent Contribution: Individual agent impact on task completion
- Communication Patterns: Message flow and coordination effectiveness
- OSWorld Official Website: https://os-world.github.io/
- OSWorld Paper: https://arxiv.org/abs/2404.07972
- OSWorld Documentation: https://timothyxxx.github.io/OSWorld/
- OSWorld Repository (Fork): https://github.com/Nid989/OSWorld