Nid989/colabgame

ColabGame

ColabGame is a clemgame benchmark for evaluating LLM agents in computer-use scenarios. Agents interact with an Ubuntu virtual machine environment (via OSWorld) to complete realistic computer-based tasks such as debugging code, processing data, editing images, and synthesizing research.

The game supports both single-agent and multi-agent topologies, allowing evaluation of collaborative agent behaviors across different communication structures.

Overview

What is ColabGame?

ColabGame is a benchmark designed to evaluate the capabilities of Large Language Model (LLM) agents in realistic computer-use scenarios. Unlike traditional text-based benchmarks, ColabGame tests agents' ability to interact with actual operating systems, applications, and files within a virtual machine environment.

Key Features

  • Realistic Computer Tasks: Agents complete real-world tasks including debugging code, processing data, editing images, and synthesizing research
  • Multi-Agent Support: Evaluate collaborative behaviors through various agent topologies (Single, Star, Mesh, Blackboard)
  • OSWorld Integration: Leverages OSWorld's scalable, reproducible Ubuntu virtual machine environment
  • Comprehensive Evaluation: Episode-level scoring with success metrics and request statistics
  • Flexible Task Categories: Five task categories across three difficulty levels, plus support for OSWorld benchmark tasks

Architecture

ColabGame uses OSWorld as an external computer substrate to enable LLM agents to interact with real computer environments. Agents can:

  • Execute shell commands and scripts
  • Manipulate files and directories
  • Use applications (browsers, editors, office tools)
  • Complete realistic computer-based tasks

Currently, ColabGame supports the Ubuntu Linux environment only.

Getting Started

Prerequisites

Before installing ColabGame, ensure you have:

  • Python 3.12+ installed
  • ~25GB disk space (VM image + dependencies)
  • Git installed
  • VMware Workstation Pro (Windows/Linux) or VMware Fusion 13 (macOS on Apple silicon)

Important: Clone OSWorld outside of the colabgame project directory (e.g., as a sibling folder).

Quick Start Guide

  1. Install VMware (see Installation section below)
  2. Run the automated setup script:
    bash setup_osworld.sh
    The script will automatically configure the VM path in your .env file.
  3. Source your environment variables:
    export $(cat .env | xargs)
    This loads the VM path and other configuration from .env.
  4. Configure S3 credentials (if needed) in your .env file, then re-run the export command from step 3 so the new values are loaded
  5. Generate instances:
    python3 src/instancegenerator.py
  6. Run your first experiment:
    python3 -m clem run -g colabgame -m mock

Installation

OSWorld Environment Setup

OSWorld is a benchmark environment for evaluating multimodal agents in real computer environments. It provides scalable, reproducible virtual machines (Ubuntu and Windows) where agents can interact with actual operating systems, applications, and files. OSWorld supports realistic computer tasks including file operations, web browsing, and application use, with a comprehensive evaluation framework for agent capabilities.

For more information, see the official website, paper, and documentation.

Option A: Automated Setup (Recommended)

After installing VMware (see Step 2 below), run the setup script:

bash setup_osworld.sh [--osworld-dir PATH] [--skip-vm]

Arguments:

  • --osworld-dir PATH: Specify OSWorld clone location (default: ../OSWorld)
  • --skip-vm: Skip VM initialization (for testing)

The script checks prerequisites, clones OSWorld, installs dependencies, downloads the Ubuntu VM (~20GB), and automatically configures the VM path in your .env file.

Option B: Manual Setup

Step 1: Clone OSWorld

git clone https://github.com/Nid989/OSWorld
cd OSWorld
pip install -e .

Step 2: Install VMware

For non-virtualized systems (desktop, laptop, bare metal machine), use VMware.

  1. Install VMware Workstation Pro (on Macs with Apple silicon, install VMware Fusion 13)

    Note: A Broadcom account is required, and a license key may be requested during installation.

  2. Verify installation:

    vmrun -T ws list    # Windows/Linux
    vmrun -T fusion list  # macOS

    You should see a message listing the currently running virtual machines.
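The same check can be scripted. The following is a minimal sketch that only builds the command shown above for the current platform; the helper name `vmrun_list_command` is hypothetical and not part of ColabGame or OSWorld.

```python
import platform


def vmrun_list_command(system=None):
    """Build the `vmrun ... list` argument vector for the given OS name."""
    system = system or platform.system()
    # Host type "fusion" on macOS, "ws" on Windows/Linux, as in the commands above.
    host_type = "fusion" if system == "Darwin" else "ws"
    return ["vmrun", "-T", host_type, "list"]


# Print the command for this machine; pass the list to subprocess.run to execute it.
print(" ".join(vmrun_list_command()))
```

The function returns a list rather than a string so it can be handed to `subprocess.run` without shell quoting concerns.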

Note: VirtualBox is supported as an alternative to VMware. However, parallelism and macOS on Apple silicon may not be well supported.

Step 3: Initialize VM

Run quickstart to download and set up the VM (~20GB):

cd OSWorld
python quickstart.py

The VM will be saved to: ./vmware_vm_data/Ubuntu0/Ubuntu0.vmx

Environment Configuration

VM Path Configuration

The VM path is automatically configured by the setup_osworld.sh script and added to your .env file as VM_PATH. The system resolves the VM path in the following priority order:

  1. Environment variable (VM_PATH from .env file) - Recommended
  2. Config file (resources/config.yaml - system.vm_path)
  3. Default fallback (from src/utils/constants.py)

If you need to manually set the VM path:

  1. Add to .env file (Recommended)

    VM_PATH="/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"

    Then source it:

    export $(cat .env | xargs)

  2. Set in resources/config.yaml

    system:
      vm_path: "/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"

Note: If you used the automated setup script, the VM path is already configured in your .env file. You only need to manually configure it if you installed the VM manually or moved it to a different location.
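The priority order described in this section can be sketched as a small lookup function. This is an illustration under stated assumptions, not the project's actual code: `resolve_vm_path` and `DEFAULT_VM_PATH` are hypothetical names, the config is passed in as an already-parsed dictionary, and the real default lives in src/utils/constants.py.

```python
import os

# Hypothetical placeholder; the real default is defined in src/utils/constants.py.
DEFAULT_VM_PATH = "/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"


def resolve_vm_path(env=None, config=None, default=DEFAULT_VM_PATH):
    """Return (path, source) using: environment variable > config file > default."""
    env = os.environ if env is None else env
    config = config or {}

    # 1. VM_PATH from the environment (loaded from .env).
    env_path = env.get("VM_PATH")
    if env_path:
        return env_path, "env"

    # 2. system.vm_path from the parsed resources/config.yaml.
    config_path = (config.get("system") or {}).get("vm_path")
    if config_path:
        return config_path, "config"

    # 3. Fallback default.
    return default, "default"


path, source = resolve_vm_path(env={}, config={"system": {"vm_path": "/vms/Ubuntu0.vmx"}})
print(path, source)
```

Returning the source alongside the path makes it easier to see which of the three locations supplied the value when a VM fails to start.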

PYTHONPATH Setup

Add the OSWorld repository path to your PYTHONPATH. You can do this in two ways:

Option 1: Export directly in your shell

export PYTHONPATH=/path/to/OSWorld:$PYTHONPATH

Option 2: Add to .env file

Add the following to your .env file in the colabgame project root:

PYTHONPATH=/path/to/OSWorld
VM_PATH=/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx

Then source the .env file:

export $(cat .env | xargs)

Note: If you used the automated setup script (setup_osworld.sh), VM_PATH is already added to your .env file. You only need to add PYTHONPATH if it's not already there.
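If you prefer to load the .env file from Python rather than through the shell, a small parser is enough for the simple KEY=VALUE format shown above. This is a minimal sketch; `parse_env` is a hypothetical helper and not part of ColabGame.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


sample = """
# colabgame settings
PYTHONPATH=/path/to/OSWorld
VM_PATH="/path/to/OSWorld/vmware_vm_data/Ubuntu0/Ubuntu0.vmx"
"""
settings = parse_env(sample)
print(settings["VM_PATH"])
```

Calling `os.environ.update(settings)` afterwards makes the values visible to child processes. Note that setting PYTHONPATH this way does not change the import path of the Python process that is already running.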

S3 Configuration

Configure the following S3 environment variables in your .env file. These are necessary for:

  • Syncing screenshots from the environment to local results for transcribing
  • Uploading ground truth contents for evaluating the environment state at the end of an instance run

AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=your_aws_region
S3_BUCKET_NAME=your_bucket_name

Replace the placeholder values with your actual AWS credentials and S3 bucket information.
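Before starting a run, it can help to confirm that all four variables are set. The sketch below checks only for presence, not validity; `missing_s3_vars` is a hypothetical helper name.

```python
import os

# The four variables listed above.
REQUIRED_S3_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
)


def missing_s3_vars(env=None):
    """Return the names of required S3 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


missing = missing_s3_vars()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
else:
    print("All S3 settings are present.")
```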

VM Credentials

  • Username: user
  • Password: password

Configuration

Task Categories

ColabGame experiments span five task categories across three difficulty levels:

Category | Description
Debugging & Refactoring | Fix syntax errors, complete logic, update configurations
Tabular Data Reporting | Transfer, aggregate, and calculate data across files
Image Processing | Insert, resize, and caption images in documents
Research Synthesis | Extract web information, summarize content, integrate downloads
Workflow Orchestration | Gather and organize information across applications

Additionally, ColabGame supports OSWorld benchmark tasks (chrome, gimp, libreoffice_writer, os).

Note: OSWorld tasks are sampled as a subset of the original dataset, so not every OSWorld instance is available.

Agent Topologies

ColabGame supports four agent topologies for evaluating different collaboration patterns:

  • Single – A single agent with full environment access
  • Star – Hub agent coordinates with spoke agents (centralized); spoke agents have environment access
  • Mesh – All agents connected (decentralized); agents take turns with environment access
  • Blackboard – Shared context model; agents contribute via a common workspace (round-robin fashion)

Topology configurations are defined in resources/topologies.
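To make the difference between the topologies concrete, the sketch below models each one as a list of direct communication links between agents. It is an illustration only and does not reflect the format of the files in resources/topologies; the function name, the agent names, and the convention that the first agent is the hub are all assumptions.

```python
from itertools import combinations


def topology_edges(kind, agents):
    """Return direct communication links between agents as (a, b) pairs."""
    if kind == "single":
        return []  # one agent, nobody to talk to
    if kind == "star":
        hub, *spokes = agents  # assumption: the first agent is the hub
        return [(hub, spoke) for spoke in spokes]
    if kind == "mesh":
        return list(combinations(agents, 2))  # every pair is connected
    if kind == "blackboard":
        return []  # agents write to a shared workspace instead of messaging directly
    raise ValueError(f"unknown topology: {kind}")


agents = ["agent_0", "agent_1", "agent_2"]
for kind in ("single", "star", "mesh", "blackboard"):
    print(kind, topology_edges(kind, agents))
```

With n agents, a star has n - 1 links and a mesh has n(n - 1)/2, which is one way to see why communication overhead grows faster in the decentralized setup.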

Prompts

Prompt templates define how agents interact with tasks and the environment. These templates are customizable and located under resources/prompts.

The prompt system supports:

  • Task-specific instructions
  • Agent role definitions
  • Communication protocols for multi-agent topologies
  • Environment interaction guidelines

Instance Generation

Experiments are configured in resources/config.yaml. This configuration file defines:

  • Task categories and difficulty levels
  • Agent topologies to evaluate
  • Model configurations
  • Evaluation parameters

Generate instances by running:

uv run python src/instancegenerator.py

Instances are written to in/instances.json. Each instance contains:

  • Task description and requirements
  • Initial environment state
  • Success criteria
  • Agent configuration

Usage

Running Experiments

Run experiments using the clem CLI:

# Run with a specific model
python3 -m clem run -g colabgame -m <model_name>

# Run with mock model for testing
python3 -m clem run -g colabgame -m mock

Transcribing Results

To transcribe interactions into readable formats:

python3 -m clem transcribe -g colabgame

Scoring Episodes

To score completed episodes:

python3 -m clem score -g colabgame

Generating Evaluation Tables

To generate evaluation tables:

python3 -m clem eval

Scoring Metrics

Episode-Level Scores

ColabGame uses the following metrics to evaluate agent performance:

Bench Score

Formula: Success × 100

The primary metric for task completion. A binary score where:

  • 100: Task completed successfully
  • 0: Task failed or aborted
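The formula can be written out directly. In this sketch the function names are hypothetical, and averaging over episodes is an assumption about how a summary figure might be computed rather than something the benchmark defines here.

```python
def bench_score(success, aborted=False):
    """Success x 100: 100 for a completed task, 0 for a failed or aborted one."""
    return 100 if success and not aborted else 0


def mean_bench_score(episodes):
    """Average score over (success, aborted) pairs; 0.0 for an empty list."""
    scores = [bench_score(success, aborted) for success, aborted in episodes]
    return sum(scores) / len(scores) if scores else 0.0


episodes = [(True, False), (False, False), (True, True), (True, False)]
print(mean_bench_score(episodes))
```

Because each episode scores either 0 or 100, the mean over a set of episodes is the success rate expressed as a percentage.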

Success Criteria

Success is determined by:

  1. Final Environment State: Evaluation of the desktop environment state at episode completion
  2. Instance-Defined Criteria: Task-specific requirements (e.g., file creation, correct output, application state)
  3. No Abort Condition: The game must complete without errors or timeouts

Request Statistics

Tracks agent behavior and compliance:

  • Total Requests: Number of actions attempted by each agent
  • Parsed Requests: Successfully formatted and executable requests
  • Violated Requests: Actions that violate environment constraints or communication rules

These statistics help identify:

  • Agent efficiency (fewer requests for same outcome)
  • Instruction-following capability (high parse rate)
  • Constraint adherence (low violation rate)
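One way to turn these counts into comparable figures is to normalize by the total number of requests. The definitions below (parsed / total and violated / total) are assumptions made for illustration, and `request_rates` is a hypothetical helper.

```python
def request_rates(total, parsed, violated):
    """Return parse and violation rates as fractions of total requests."""
    if total == 0:
        # No requests were made, so both rates are reported as zero.
        return {"parse_rate": 0.0, "violation_rate": 0.0}
    return {
        "parse_rate": parsed / total,
        "violation_rate": violated / total,
    }


rates = request_rates(total=20, parsed=18, violated=1)
print(rates)
```

Rates rather than raw counts allow agents that needed different numbers of requests to be compared on instruction following and constraint adherence.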

Multi-Agent Metrics

For multi-agent topologies, additional metrics include:

  • Collaboration Efficiency: Success rate relative to communication overhead
  • Agent Contribution: Individual agent impact on task completion
  • Communication Patterns: Message flow and coordination effectiveness
