ColabGame

ColabGame is a clemgame benchmark for evaluating LLM agents in computer-use scenarios. Agents interact with an Ubuntu virtual machine environment (via OSWorld) to complete realistic computer-based tasks such as debugging code, processing data, editing images, and synthesizing research.

The game supports both single-agent and multi-agent topologies, allowing evaluation of collaborative agent behaviors across different communication structures.

Overview

What is ColabGame?

ColabGame is a benchmark designed to evaluate the capabilities of Large Language Model (LLM) agents in realistic computer-use scenarios. Unlike traditional text-based benchmarks, ColabGame tests agents' ability to interact with actual operating systems, applications, and files within a virtual machine environment.

Key Features

Realistic Computer Tasks: Agents complete real-world tasks including debugging code, processing data, editing images, and synthesizing research
Multi-Agent Support: Evaluate collaborative behaviors through various agent topologies (Single, Star, Mesh, Blackboard)
OSWorld Integration: Leverages OSWorld's scalable, reproducible Ubuntu virtual machine environment
Comprehensive Evaluation: Episode-level scoring with success metrics and request statistics
Flexible Task Categories: Five task categories across three difficulty levels, plus support for OSWorld benchmark tasks

Architecture

ColabGame uses OSWorld as an external computer substrate to enable LLM agents to interact with real computer environments. Agents can:

Execute shell commands and scripts
Manipulate files and directories
Use applications (browsers, editors, office tools)
Complete realistic computer-based tasks

Currently, ColabGame supports the Ubuntu Linux environment only.

Getting Started

Prerequisites

Before installing ColabGame, ensure you have:

Python 3.12+ installed
~25GB disk space (VM image + dependencies)
Git installed
VMware Workstation Pro, or VMware Fusion 13 on macOS with Apple silicon

Important: Clone OSWorld outside of the colabgame project directory (e.g., as a sibling folder).

Quick Start Guide

1. Install VMware (see the Installation section below).
2. Run the automated setup script:
```
bash setup_osworld.sh
```
   The script automatically configures the VM path in your .env file.
3. Load the environment variables from .env into your shell:
```
export $(cat .env | xargs)
```
   This loads the VM path and other configuration from .env. The command assumes simple KEY=VALUE lines without comments or spaces in the values.
4. Configure S3 credentials (if needed) in your .env file, then repeat step 3 so the new values are loaded.
5. Generate instances:
```
python3 src/instancegenerator.py
```
6. Run your first experiment:
```
clem run -g colabgame -m mock
```

Installation

OSWorld Environment Setup

OSWorld is a benchmark environment for evaluating multimodal agents in real computer environments. It provides scalable, reproducible virtual machines (Ubuntu and Windows) where agents can interact with actual operating systems, applications, and files. OSWorld supports realistic computer tasks including file operations, web browsing, and application use, with a comprehensive evaluation framework for agent capabilities.

For more information, see the official OSWorld website, paper, and documentation listed under References.

Option A: Automated Setup (Recommended)

After installing VMware (see Step 2 below), run the setup script:

bash setup_osworld.sh [--osworld-dir PATH] [--skip-vm]

Arguments:

--osworld-dir PATH: Specify OSWorld clone location (default: ../OSWorld)
--skip-vm: Skip VM initialization (for testing)

The script checks prerequisites, clones OSWorld, installs dependencies, downloads the Ubuntu VM (~20GB), and automatically configures the VM path in your .env file.

Option B: Manual Setup

Step 1: Clone OSWorld

git clone https://github.com/Nid989/OSWorld
cd OSWorld
pip install -e .

Step 2: Install VMware

For non-virtualized systems (desktop, laptop, bare metal machine), use VMware.

Install VMware Workstation Pro. On Macs with Apple silicon, install VMware Fusion 13 instead.

Note: Broadcom account required. License key may be requested during installation.

Verify installation:
```
vmrun -T ws list    # Windows/Linux
vmrun -T fusion list  # macOS
```
You should see a message listing the currently running virtual machines (for example, Total running VMs: 0 when none are running).

Note: VirtualBox can be used as an alternative to VMware, but parallel runs and macOS on Apple silicon may not be well supported.

Step 3: Initialize VM

Run the quickstart script to download and set up the VM (~20GB):

cd OSWorld
python quickstart.py

The VM will be saved to: ./vmware_vm_data/Ubuntu0/Ubuntu0.vmx

Environment Configuration

VM Path Configuration

The setup_osworld.sh script configures the VM path automatically and adds it to your .env file as VM_PATH. At runtime, the VM path is resolved from the following sources, in priority order:

Environment variable (VM_PATH from .env file) - Recommended
Config file (resources/config.yaml - system.vm_path)
Default fallback (from src/utils/constants.py)
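
The priority order above can be sketched in a few lines of Python. This is only an illustration, not the project's actual lookup code: the function name `resolve_vm_path`, the `DEFAULT_VM_PATH` placeholder, and the assumption that resources/config.yaml has already been parsed into a dictionary are all hypothetical.

```python
import os

# Placeholder only; the real default is defined in src/utils/constants.py.
DEFAULT_VM_PATH = "/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"


def resolve_vm_path(env=None, config=None, default=DEFAULT_VM_PATH):
    """Return the first VM path found: environment, then config, then default."""
    env = os.environ if env is None else env
    config = config or {}

    # 1. VM_PATH environment variable (typically loaded from .env)
    from_env = env.get("VM_PATH")
    if from_env:
        return from_env

    # 2. system.vm_path from resources/config.yaml (assumed already parsed)
    from_config = (config.get("system") or {}).get("vm_path")
    if from_config:
        return from_config

    # 3. Built-in default fallback
    return default
```

Passing `env` and `config` explicitly keeps the lookup easy to test; calling `resolve_vm_path()` with no arguments reads from the current process environment.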

If you need to manually set the VM path:

Option 1: Add to .env file (Recommended)

VM_PATH="/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"

Then source it:

export $(cat .env | xargs)

Option 2: Set in resources/config.yaml

system:
  vm_path: "/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"

Note: If you used the automated setup script, the VM path is already configured in your .env file. You only need to manually configure it if you installed the VM manually or moved it to a different location.

PYTHONPATH Setup

Add the OSWorld repository path to your PYTHONPATH. You can do this in two ways:

Option 1: Export directly in your shell

export PYTHONPATH=/path/to/OSWorld:$PYTHONPATH

Option 2: Add to .env file

Add the following to your .env file in the colabgame project root:

PYTHONPATH=/path/to/OSWorld
VM_PATH=/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx

Then source the .env file:

export $(cat .env | xargs)

Note: If you used the automated setup script (setup_osworld.sh), VM_PATH is already added to your .env file. You only need to add PYTHONPATH if it's not already there.
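
If you prefer to load the .env file from Python rather than from the shell, a minimal sketch could look like the following. The helper names `parse_env` and `load_env_file` are hypothetical and not part of ColabGame; the parser handles only plain KEY=VALUE lines, blank lines, `#` comments, and one pair of surrounding quotes.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, as in VM_PATH="..."
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path=".env", environ=os.environ):
    """Copy the parsed values into the given environment mapping."""
    with open(path, encoding="utf-8") as handle:
        parsed = parse_env(handle.read())
    environ.update(parsed)
    return parsed
```

Unlike the shell one-liner, this version tolerates comment lines and quoted values, but it overwrites variables such as PYTHONPATH rather than appending to them.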

S3 Configuration

Configure the following S3 environment variables in your .env file. These are necessary for:

Syncing screenshots from the environment to local results for transcribing
Uploading ground truth contents for evaluating the environment state at the end of an instance run

AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=your_aws_region
S3_BUCKET_NAME=your_bucket_name

Replace the placeholder values with your actual AWS credentials and S3 bucket information.
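
Before starting a run, it can help to confirm that all four variables are set and no longer hold the placeholder values. The sketch below is an assumption about how such a check might look; `missing_s3_vars` is a hypothetical helper, and treating any value that starts with `your_` as a placeholder is a convention taken from the example above, not from the project code.

```python
import os

REQUIRED_S3_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
)


def missing_s3_vars(env=None):
    """Return required S3 variables that are unset, empty, or placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_S3_VARS:
        value = env.get(name, "")
        # Values like "your_bucket_name" are the README placeholders
        if not value or value.startswith("your_"):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_s3_vars()
    if problems:
        print("Missing S3 configuration:", ", ".join(problems))
```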

VM Credentials

Username: user
Password: password

Configuration

Task Categories

ColabGame experiments span five task categories across three difficulty levels:

| Category | Description |
| --- | --- |
| Debugging & Refactoring | Fix syntax errors, complete logic, update configurations |
| Tabular Data Reporting | Transfer, aggregate, and calculate data across files |
| Image Processing | Insert, resize, and caption images in documents |
| Research Synthesis | Extract web information, summarize content, integrate downloads |
| Workflow Orchestration | Gather and organize information across applications |

Additionally, ColabGame supports OSWorld benchmark tasks (chrome, gimp, libreoffice_writer, os).

Note: Not all instances are available for OSWorld tasks; we sample a subset from the original dataset.

Agent Topologies

ColabGame supports four agent topologies for evaluating different collaboration patterns:

Single – A single agent with full environment access
Star – Hub agent coordinates with spoke agents (centralized); spoke agents have environment access
Mesh – All agents connected (decentralized); agents take turns with environment access
Blackboard – Shared context model; agents contribute via a common workspace (round-robin fashion)

Topology configurations are defined in resources/topologies.
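
To make the difference between the four topologies concrete, the sketch below derives who can exchange messages with whom. It is purely illustrative and does not reflect the file format in resources/topologies: the function `communication_links`, the lowercase topology names, the convention that the first agent is the hub, and the `"blackboard"` node are all assumptions.

```python
from itertools import combinations


def communication_links(topology, agents):
    """Return the set of undirected links implied by a topology."""
    if topology == "single":
        # One agent, nobody to talk to
        return set()
    if topology == "star":
        # Assumption: the first agent acts as the hub
        hub, *spokes = agents
        return {frozenset((hub, spoke)) for spoke in spokes}
    if topology == "mesh":
        # Every pair of agents is connected
        return {frozenset(pair) for pair in combinations(agents, 2)}
    if topology == "blackboard":
        # Agents share a common workspace instead of messaging each other
        return {frozenset((agent, "blackboard")) for agent in agents}
    raise ValueError(f"unknown topology: {topology}")
```

For three agents, Star yields two links (hub to each spoke), while Mesh yields three (every pair), which is why communication overhead grows faster in decentralized setups.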

Prompts

Prompt templates define how agents interact with tasks and the environment. These templates are customizable and located under resources/prompts.

The prompt system supports:

Task-specific instructions
Agent role definitions
Communication protocols for multi-agent topologies
Environment interaction guidelines

Instance Generation

Experiments are configured in resources/config.yaml. This configuration file defines:

Task categories and difficulty levels
Agent topologies to evaluate
Model configurations
Evaluation parameters

Generate instances by running:

uv run python src/instancegenerator.py

Instances are written to in/instances.json. Each instance contains:

Task description and requirements
Initial environment state
Success criteria
Agent configuration
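
After generation, a quick sanity check is to count how many instances each experiment contains. The sketch below assumes that in/instances.json follows the layout commonly used by clemgames, with a top-level `experiments` list whose entries have a `name` and a `game_instances` list. That layout is an assumption here, so adjust the keys if your file differs; `count_instances` and `load_instance_counts` are hypothetical helpers.

```python
import json


def count_instances(data):
    """Map each experiment name to its number of game instances."""
    counts = {}
    for index, experiment in enumerate(data.get("experiments", [])):
        # Fall back to a positional name if the experiment has none
        name = experiment.get("name", f"experiment_{index}")
        counts[name] = len(experiment.get("game_instances", []))
    return counts


def load_instance_counts(path="in/instances.json"):
    """Read the generated instances file and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return count_instances(json.load(handle))
```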

Usage

Running Experiments

Run experiments using the clem CLI:

# Run with a specific model
clem run -g colabgame -m <model_name>

# Run with mock model for testing
clem run -g colabgame -m mock

Transcribing Results

To transcribe interactions into readable formats:

clem transcribe -g colabgame

Scoring Episodes

To score completed episodes:

clem score -g colabgame

Generating Evaluation Tables

To generate evaluation tables:

clem eval

Scoring Metrics

Episode-Level Scores

ColabGame uses the following metrics to evaluate agent performance:

Bench Score

Formula: Success × 100

Bench Score is the primary metric for task completion. It is binary:

100: Task completed successfully
0: Task failed or aborted

Success Criteria

Success is determined by:

Final Environment State: Evaluation of the desktop environment state at episode completion
Instance-Defined Criteria: Task-specific requirements (e.g., file creation, correct output, application state)
No Abort Condition: The game must complete without errors or timeouts
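
The Bench Score rule and the abort condition can be expressed as a small function. This is a sketch of the stated formula, not the code in scorer.py; the names `bench_score` and `mean_bench_score` are hypothetical, and averaging across episodes is an assumption about how the scores might be aggregated.

```python
def bench_score(success, aborted=False):
    """Return 100 for a successful, non-aborted episode, else 0."""
    return 100 if success and not aborted else 0


def mean_bench_score(episodes):
    """Average Bench Score over (success, aborted) pairs; 0.0 if empty."""
    scores = [bench_score(success, aborted) for success, aborted in episodes]
    return sum(scores) / len(scores) if scores else 0.0
```

Because the score is binary, the mean over a set of episodes is simply the percentage of episodes that succeeded without aborting.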

Request Statistics

Request statistics track agent behavior and compliance:

Total Requests: Number of actions attempted by each agent
Parsed Requests: Successfully formatted and executable requests
Violated Requests: Actions that violate environment constraints or communication rules

These statistics help identify:

Agent efficiency (fewer requests for same outcome)
Instruction-following capability (high parse rate)
Constraint adherence (low violation rate)
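
The parse and violation rates mentioned above can be derived from the three counters. The sketch below assumes both rates are normalized by the total number of requests, which this document does not specify; the `RequestStats` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    total: int      # actions attempted by the agent
    parsed: int     # correctly formatted, executable requests
    violated: int   # requests that broke a constraint or rule

    @property
    def parse_rate(self):
        """Share of requests that were parsed successfully."""
        return self.parsed / self.total if self.total else 0.0

    @property
    def violation_rate(self):
        """Share of requests that violated a constraint."""
        return self.violated / self.total if self.total else 0.0
```

For example, `RequestStats(total=10, parsed=8, violated=2)` gives a parse rate of 0.8 and a violation rate of 0.2.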

Multi-Agent Metrics

For multi-agent topologies, additional metrics include:

Collaboration Efficiency: Success rate relative to communication overhead
Agent Contribution: Individual agent impact on task completion
Communication Patterns: Message flow and coordination effectiveness

References

OSWorld Official Website: https://os-world.github.io/
OSWorld Paper: https://arxiv.org/abs/2404.07972
OSWorld Documentation: https://timothyxxx.github.io/OSWorld/
OSWorld Repository (Fork): https://github.com/Nid989/OSWorld
