Tolokaforge

A benchmarking harness for evaluating tool-using LLM agents. Multi-turn agent/user loops, sandboxed execution, deterministic grading, and rich telemetry — across any provider via LiteLLM.

Highlights

Agent + User Loop – Multi-turn conversations where both agent and user models call tools.
Sandboxed Execution – Tool calls proxy into Dockerized services with no external network access.
MCP-Compatible Tooling – Tasks declare tools via Model Context Protocol or built-ins.
Deterministic Grading – JSONPath assertions, state hashes, transcript rules, optional LLM judges.
Rich Metrics – pass@k, cost/token estimates, latency percentiles, failure attribution.
Distributed Runner – SQLite for local runs, Postgres for multi-machine execution.
Bring-Your-Own Models – Any provider supported by LiteLLM (OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, OpenRouter, and more).
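For readers unfamiliar with pass@k from the Rich Metrics bullet above: it is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of trials for a task and c the number that passed. This README does not say which formula Tolokaforge uses, so the sketch below only illustrates that standard estimator. The function name pass_at_k is hypothetical and not part of the package.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k trials passes.

    n: total trials run for the task
    c: trials that passed
    k: sample size being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made only of failing trials;
    # it is 0 when fewer than k trials failed, which yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Assuming n corresponds to the repeats setting of a run, a task with 4 trials and 1 pass gives pass_at_k(4, 1, 1) = 0.25.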

Installation

pip install tolokaforge                # core
pip install "tolokaforge[browser]"     # + Playwright
pip install "tolokaforge[all]"         # everything

Dev install (from a clone of this repository, using uv):

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

See the Python Package Guide (docs/PYTHON_PACKAGE.md) for all extras and programmatic API usage.

Quick Start

# 1. Configure provider keys (copy the template, then add your API keys to .env)
cp .env.example .env

# 2. Run one of the included examples
uv run tolokaforge run --config examples/native/coding/run_config.yaml

# 3. Check results
uv run tolokaforge status --run-dir results/coding_example
uv run tolokaforge analyze --trajectory results/coding_example/trials/<task_id>/0/trajectory.yaml

That's it. Docker services for browser / mobile / RAG tasks start automatically via auto_start_services (default: true).
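To inspect more than one trial, it can help to list every trajectory file in a run directory. The sketch below assumes the layout <run_dir>/trials/<task_id>/<trial>/trajectory.yaml, inferred from the single path shown in step 3. The helper names are hypothetical, and the real layout may contain more than this.

```python
from pathlib import Path


def _trial_key(path):
    # Numeric trial directories sort numerically (2 before 10);
    # any non-numeric names sort alphabetically after them.
    name = path.parent.name
    return (0, int(name), "") if name.isdecimal() else (1, 0, name)


def find_trajectories(run_dir):
    """Map each task id to its trajectory files, ordered by trial directory."""
    found = {}
    for path in Path(run_dir).glob("trials/*/*/trajectory.yaml"):
        task_id = path.parent.parent.name  # .../trials/<task_id>/<trial>/
        found.setdefault(task_id, []).append(path)
    for paths in found.values():
        paths.sort(key=_trial_key)
    return found
```

Calling find_trajectories("results/coding_example") returns a dict whose values are the paths you would pass to tolokaforge analyze --trajectory.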

What is a run config?

A run config (e.g. examples/native/coding/run_config.yaml) is a single YAML file that fully specifies an evaluation. The harness reads it and runs the benchmark:

models:                       # which LLM(s) drive the agent + user simulator
  agent:    {provider: openrouter, name: anthropic/claude-sonnet-4-6, ...}
  user:     {provider: openrouter, name: anthropic/claude-sonnet-4-6, ...}

orchestrator:                 # how the run is executed
  workers: 4                  # parallel trials
  repeats: 1                  # trials per task
  max_turns: 20

evaluation:                   # what to evaluate
  tasks_glob: "**/task.yaml"  # which tasks to load (relative to task_packs root or repo)
  task_packs:                 # optional: directories that contain task.yaml files
    - "examples/native/coding/dataset"
  output_dir: "results/coding_example"
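
As a rough illustration of how tasks_glob and task_packs interact, the sketch below expands the glob inside each pack directory and counts the trials a run would schedule (tasks times repeats). It assumes the glob is applied relative to each task_packs entry. The harness's actual resolution logic may differ, and both function names are hypothetical.

```python
from pathlib import Path


def resolve_tasks(task_packs, tasks_glob="**/task.yaml", root="."):
    """Expand tasks_glob inside every task pack directory under root."""
    tasks = []
    for pack in task_packs:
        for path in sorted((Path(root) / pack).glob(tasks_glob)):
            if path not in tasks:  # keep the first occurrence if packs overlap
                tasks.append(path)
    return tasks


def planned_trials(tasks, repeats):
    """Number of trials implied by `repeats` trials per task."""
    return len(tasks) * repeats
```

For the config above, resolve_tasks(["examples/native/coding/dataset"]) would collect every task.yaml under that directory, and planned_trials(tasks, 1) equals the number of tasks found.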

To write your own benchmark, copy a working example as a starting point:

cp examples/native/coding/run_config.yaml my_run.yaml
$EDITOR my_run.yaml         # change model, tasks_glob, output_dir
uv run tolokaforge run --config my_run.yaml

Every example under examples/ ships a run_config.yaml next to its task data. There is no global "default" config — the run config and the tasks it points at always travel together.
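
Because each example carries its own run_config.yaml, the configs can be discovered by walking examples/. The sketch below builds one command per config using the same CLI invocation as the Quick Start. The function name is hypothetical, and whether every example runs unattended (Docker, provider keys) is not something this README states.

```python
from pathlib import Path


def example_run_commands(examples_root="examples"):
    """Build one `uv run tolokaforge run --config <path>` argv list per config."""
    configs = sorted(Path(examples_root).rglob("run_config.yaml"))
    return [
        ["uv", "run", "tolokaforge", "run", "--config", str(cfg)]
        for cfg in configs
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) to launch that example.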

For distributed execution and advanced workflows, see the Runner Guide (docs/RUNNER.md).

Project Structure

tolokaforge/          # Installable Python package
├── cli/              # CLI commands (run, validate, status, analyze)
├── core/             # Orchestration, grading, metrics, queue
├── tools/            # Built-in + MCP tool system
└── env/              # Environment services (JSON DB, mock web, RAG)
examples/             # Reference task layouts with runnable run_config.yaml
├── native/           # default `native` adapter
│   ├── browser_task/
│   ├── coding/
│   ├── native_shared_domain/
│   └── tool_use/
└── terminal_bench/   # `terminal_bench` adapter (Docker compose)

Documentation

Topic	Link
Getting started	docs/GETTING_STARTED.md
Task authoring	docs/TASKS.md
Grading system	docs/GRADING.md
Tool reference	docs/TOOLS.md
Browser/mobile tools	docs/BROWSER_TOOLS.md
Runner & distributed execution	docs/RUNNER.md
Adapter architecture	docs/ADAPTER_ARCHITECTURE.md
Analytics & failure attribution	docs/ANALYTICS.md
Python package API	docs/PYTHON_PACKAGE.md
Task packs	docs/TASK_PACKS.md
Configuration reference	docs/REFERENCE.md
Security model	docs/SECURITY.md
Docker runtime	docs/BENCHMARK_BACKEND_DESIGNS.md
Benchmark types	docs/BENCHMARK_TYPES.md
Testing guide	tests/README.md

Examples

Example	Description
`examples/native/coding/`	Simplest native pattern — file-write grading
`examples/native/tool_use/`	Structured tool-call grading
`examples/native/browser_task/`	Browser tool against mock-web fixtures
`examples/native/native_shared_domain/`	`_shared/domain.yaml` + FastMCP pattern
`examples/terminal_bench/`	Docker-compose stacks with `terminal_bench` adapter

Testing

make test              # all tests
make test-unit         # fast, isolated
make test-functional   # mocked externals

See tests/README.md for integration/E2E tests and contribution guidelines.

License

Apache-2.0 — see LICENSE.

Contributing

See CONTRIBUTING.md.

Citation

Use CITATION.cff or CITATION.bib when referencing Tolokaforge in papers or reports.
