A comprehensive benchmarking framework for evaluating AI agents on science tasks. The system provides standardized environments, tools, and evaluation metrics to test agent performance across diverse materials science challenges.
- Python 3.10 or higher
- `uv` (recommended) or `pip` for package management
- Clone the repository

  ```bash
  git clone https://github.com/lamalab-org/corral.git
  cd corral
  ```

- Install the framework

  ```bash
  uv pip install -e .
  ```

- Install specific environment dependencies

  ```bash
  # create task environments
  cd tasks/samplemath && uv venv && uv pip install -e .  # create an env for running sample math
  # ... repeat for other tasks as needed
  ```

- Start a task environment server

  ```bash
  cd tasks/samplemath/samplemath
  python env.py  # Starts a server on http://localhost:8000
  ```

- Run the benchmark in another terminal

  ```python
  from corral import CorralRunner, CorralRouter
  from corral.agents import ReActAgent
  from corral.report import CorralWandbLogger

  # Set up the interface
  interface = CorralRouter("http://localhost:8000")

  # Set up the WandB logger
  wandblogger = CorralWandbLogger(
      project="corral",
      group="experiment_group",
      name="run_name",
  )

  # Set up the agent
  agent = ReActAgent(model="gpt-4o", max_iterations=10, temperature=0.1)

  # Run the benchmark
  runner = CorralRunner(interface, agent, logger=wandblogger)
  result = runner.bench()
  print(f"Overall score: {result.total_score:.2f}")
  ```
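Model calls go through LiteLLM and results can be logged to Weights & Biases, so set the usual provider keys before running. A minimal sketch (these environment variable names are LiteLLM's and wandb's standard defaults, not Corral-specific):

```python
import os

# Standard variables read by LiteLLM and wandb, respectively.
# Replace the placeholders with your own keys before running.
os.environ["OPENAI_API_KEY"] = "<your-openai-key>"
os.environ["WANDB_API_KEY"] = "<your-wandb-key>"
```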
```python
from corral import CorralRunner, CorralRouter
from corral.agents import ReActAgent

interface = CorralRouter("http://localhost:8000")
agent = ReActAgent(model="gpt-4o")
runner = CorralRunner(interface, agent)

# Run a specific task
result = runner.bench(task_ids=["math_1"])

# Run specific tasks
result = runner.bench(task_ids=["math_1", "math_2", "math_3"])

# Run all available tasks
result = runner.bench()  # Uses all tasks in the environment

# Run multiple trials per task
result = runner.bench(
    task_ids=["math_1", "math_2"],
    trials_per_task=3,
    k_values=[1, 2, 3],  # Evaluate with different k values for pass@k metrics
    tool_verbosity="MINIMAL",  # Options: FULL, MINIMAL, NONE
)

# Evaluate with different k values for pass@k metrics
result = runner.bench(trials_per_task=5, k_values=[1, 2, 3, 4, 5])
```

The framework includes several pre-built environments:
| Environment | Description |
|---|---|
| `samplemath` | Basic mathematical operations |
| `spectra_elucidation` | Spectroscopy/NMR spectra elucidation tasks |
| `corral_md` | LAMMPS molecular dynamics simulation setup |
| `catalyst` | Catalysis research and material design tasks |
| `afm` | Atomic force microscopy image analysis |
| `ml` | Machine learning model training and evaluation |
The framework includes several built-in agent types:
Uses the ReAct (Reasoning and Acting) framework for step-by-step problem solving.
```python
from corral.agents import ReActAgent

agent = ReActAgent(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" or any other model LiteLLM supports
    temperature=0.1,
    max_iterations=10,
)
```

Uses the native tool/function-calling capabilities of LLM providers to solve tasks.
```python
from corral.agents import ToolCallingAgent

agent = ToolCallingAgent(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" or any other model LiteLLM supports
    temperature=0.0,
    max_iterations=10,
)
```

Uses hierarchical planning: a high-level planner delegates low-level execution to other agents.
```python
from corral.agents import LLMPlanner

agent = LLMPlanner(model="gpt-4o", temperature=0.1, max_iterations=5)
```

Implements the Reflexion architecture (paper), which adds self-reflection and learning from mistakes.
```python
from corral.agents import ReflexionAgent, ToolCallingAgent

# Create the base agent (the "Actor")
base_agent = ToolCallingAgent(model="gpt-4o", max_iterations=10, temperature=0.1)

# Wrap it with Reflexion capabilities
reflexion_agent = ReflexionAgent(
    actor=base_agent,
    reflection_model="gpt-4o",  # Model for generating reflections
    reflection_temperature=0.0,  # Deterministic reflections
)

# Use it like any other agent
runner = CorralRunner(interface, reflexion_agent)
result = runner.bench(task_ids=["task_1"], trials_per_task=5)
```

The framework automatically saves checkpoints during benchmark runs.
Checkpoints are automatically searched and loaded when resuming interrupted runs.
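A minimal sketch of resuming, assuming resumption is triggered simply by re-running the same benchmark configuration (the checkpoint discovery itself happens inside the runner):

```python
# Re-running an interrupted benchmark with the same configuration picks up
# the saved checkpoint automatically; completed trials are not re-executed.
runner = CorralRunner(interface, agent)
result = runner.bench(task_ids=["math_1", "math_2"], trials_per_task=5)
```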
- Create the environment directory

  ```bash
  mkdir -p tasks/my_new_env/my_new_env
  cd tasks/my_new_env
  ```

- Create `pyproject.toml`

  ```toml
  [project]
  name = "my_new_env"
  version = "0.1.0"
  dependencies = [
      "corral",
      # Add your specific dependencies
  ]
  ```

- Create tools

  ```python
  # tasks/my_new_env/my_new_env/tools.py
  from corral.backend.tool import tool

  @tool
  def my_custom_tool(input_param: str) -> str:
      """Description of what the tool does.

      Args:
          input_param: Description of the parameter

      Returns:
          Description of the return value
      """
      # Your tool implementation
      return f"Processed: {input_param}"
  ```

  Note that the docstring has to be formatted correctly for the tool to be registered properly: it must describe the parameters and the return value, as in the example above.

- Implement the environment class

  ```python
  # tasks/my_new_env/my_new_env/env.py
  from corral.backend import Environment
  from corral.backend.server import create_benchmark_server

  from my_new_env.tools import my_custom_tool


  class MyEnvironment(Environment):
      def __init__(self, task_id: str, problem: str, answer: str):
          self.problem = problem
          self.correct_answer = answer
          super().__init__(task_id)
          # Add your tools
          self.add_tool(my_custom_tool)

      def get_task_prompt(self) -> str:
          return f"Solve this problem: {self.problem}"

      def score(self) -> float:
          if self.state.submitted_answer is None:
              return 0.0
          return 1.0 if self.state.submitted_answer == self.correct_answer else 0.0


  # Define your tasks
  environments = {
      "task_1": MyEnvironment("task_1", "Problem 1", "Answer 1"),
      "task_2": MyEnvironment("task_2", "Problem 2", "Answer 2"),
  }

  # Create the server
  if __name__ == "__main__":
      import uvicorn

      app = create_benchmark_server(environments)
      uvicorn.run(app, host="0.0.0.0", port=8000)
  ```
- Create the agent file

  ```python
  # src/corral/agents/my_agent.py
  from corral import CorralRouter
  from corral.agents import BaseAgent


  class MyAgent(BaseAgent):
      def __init__(self, model: str, **kwargs):
          super().__init__(model, **kwargs)
          # Add your agent-specific initialization

      def run(self, interface: CorralRouter, task_id: str) -> str:
          # Get task information
          guide = interface.get_task_guide(task_id)
          # Your agent logic here
          # Use interface.execute_tool() to call tools
          return "Your final answer"
  ```

- Add it to the agent registry

  ```python
  # src/corral/agents/__init__.py
  from .my_agent import MyAgent

  __all__ = ["MyAgent", ...]
  ```

- Test your agent

  ```python
  from corral import CorralRunner, CorralRouter
  from corral.agents.my_agent import MyAgent

  agent = MyAgent(model="gpt-4o")
  interface = CorralRouter("http://localhost:8000")
  runner = CorralRunner(interface, agent)
  result = runner.bench()
  ```
- Install development dependencies

  ```bash
  uv pip install -e .
  ```

- Install pre-commit hooks with commitizen commit-message checks

  ```bash
  pre-commit install --hook-type commit-msg --hook-type pre-push
  ```
```python
from corral.backend.tool import tool

@tool
def calculate_molecular_weight(formula: str) -> float:
    """Calculate molecular weight from chemical formula.

    Args:
        formula: Chemical formula (e.g., 'H2O', 'CH4')

    Returns:
        Molecular weight in g/mol
    """
    # Implementation here
    pass
```

Modal Tools (Cloud Execution)
Modal allows you to run computationally intensive tasks in the cloud. See Modal docs for setup.
```python
from modal import App, Image

from corral.utils.modal import modal_tool, MODAL_TOOL_REGISTRY

app = App("my-corral-tools")

@modal_tool(app=app, image=Image.debian_slim().pip_install("rdkit"), memory=1024)
def complex_calculation(data: str) -> str:
    """Run a computationally intensive task in the cloud."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(data)
    return f"Molecule has {mol.GetNumAtoms()} atoms"

# Access the tool
tool_instance = MODAL_TOOL_REGISTRY["complex_calculation"]
```

For Corral-specific usage, see the Modal App Documentation.
Corral tools can be easily converted to MCP format for use with MCP-compatible clients like Claude Desktop:
```python
from corral.backend.tool import tool
from corral.router.verbosity import ToolVerbosity

@tool
def my_scientific_tool(param: str) -> str:
    """Scientific tool description.

    Args:
        param: Parameter description

    Returns:
        Result description
    """
    return f"Result: {param}"

# Convert to MCP format
mcp_definition = my_scientific_tool.to_mcp()

# With a specific verbosity level
mcp_brief = my_scientific_tool.to_mcp(verbosity=ToolVerbosity.BRIEF)
```

Create a custom MCP server:
```python
import importlib
import inspect

from mcp.server import Server
from mcp.types import Tool as MCPTool

from corral.backend.tool import Tool

# Load tools from a module
module = importlib.import_module("my_domain.tools")
tools = {name: obj for name, obj in inspect.getmembers(module) if isinstance(obj, Tool)}

# Create the MCP server
server = Server("my-corral-tools")

@server.list_tools()
async def list_tools():
    return [MCPTool(**tool.to_mcp()) for tool in tools.values()]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    tool = tools[name]
    # Execute and return results
    ...
```

For more details on creating tools and MCP integration, see the Tools Documentation.
For environments requiring file I/O:
```bash
export CORRAL_FS_PROTOCOL=local
export BASE_IO_PATH=/path/to/work/directory
```
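How these variables are consumed is not shown here; a plausible sketch, assuming fsspec-style protocol handling (the use of `fsspec` and the path layout are assumptions, not documented behavior):

```python
import os

import fsspec

# Assumption: CORRAL_FS_PROTOCOL names an fsspec filesystem and
# BASE_IO_PATH is the root directory for task inputs/outputs.
fs = fsspec.filesystem(os.environ.get("CORRAL_FS_PROTOCOL", "local"))
base = os.environ["BASE_IO_PATH"]

with fs.open(f"{base}/scratch/output.txt", "w") as fh:
    fh.write("intermediate results")
```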
The framework provides comprehensive evaluation metrics:

```python
result = runner.bench(trials_per_task=10, k_values=[1, 3, 5])

# Access detailed results
print(f"Total score: {result.total_score}")
print(f"Pass@1: {result.pass_at_k[1]}")
print(f"Pass@3: {result.pass_at_k[3]}")
print(f"Average trials: {result.average_trials}")

# Per-task analysis
for task_id, task_result in result.task_results.items():
    print(f"Task {task_id}: {task_result.success_rate:.2f} success rate")
```
- Issues: Report bugs and request features on GitHub Issues
- Discussions: Join conversations on GitHub Discussions
- Contributing: See our Contributing Guide
This project is licensed under the MIT License - see the LICENSE file for details.
If you use Corral in your research, please consider citing:
```bibtex
@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.18805}
}
```