Corral: Scientific Agent Benchmark

A comprehensive benchmarking framework for evaluating AI agents on science tasks. The system provides standardized environments, tools, and evaluation metrics to test agent performance across diverse materials science challenges.

🚀 Getting Started

Prerequisites

  • Python 3.10 or higher
  • uv (recommended) or pip for package management

Installation

  1. Clone the repository

    git clone https://github.com/lamalab-org/corral.git
    cd corral
  2. Install the framework

    uv pip install -e .
  3. Install specific environment dependencies

    # create task environments
    cd tasks/samplemath && uv venv && uv pip install -e .  # create an env for running sample math
    # ... repeat for other tasks as needed

Quick Start

  1. Start a task environment server

    cd tasks/samplemath/samplemath
    python env.py  # Starts server on http://localhost:8000
  2. Run benchmark in another terminal

    from corral import CorralRunner, CorralRouter
    from corral.agents import ReActAgent
    from corral.report import CorralWandbLogger
    
    # Setup interface
    interface = CorralRouter("http://localhost:8000")
    # Setup the WandB logger
    wandblogger = CorralWandbLogger(
        project="corral",
        group="experiment_group",
        name="run_name",
    )
    # Setup the agent
    agent = ReActAgent(model="gpt-4o", max_iterations=10, temperature=0.1)
    
    # Run benchmark
    runner = CorralRunner(interface, agent, logger=wandblogger)
    result = runner.bench()
    
    print(f"Overall score: {result.total_score:.2f}")

📊 Running Benchmarks

Single Task Execution

from corral import CorralRunner, CorralRouter
from corral.agents import ReActAgent

interface = CorralRouter("http://localhost:8000")
agent = ReActAgent(model="gpt-4o")
runner = CorralRunner(interface, agent)

# Run specific task
result = runner.bench(task_ids=["math_1"])

Multiple Tasks

# Run specific tasks
result = runner.bench(task_ids=["math_1", "math_2", "math_3"])

# Run all available tasks
result = runner.bench()  # Uses all tasks in the environment

Multiple Trials with Different Parameters

# Run multiple trials per task
result = runner.bench(
    task_ids=["math_1", "math_2"],
    trials_per_task=3,
    k_values=[1, 2, 3],  # Evaluate with different k values for pass@k metrics
    tool_verbosity="MINIMAL",  # Options: FULL, MINIMAL, NONE
)

# Evaluate with different k values for pass@k metrics
result = runner.bench(trials_per_task=5, k_values=[1, 2, 3, 4, 5])
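
For reference, pass@k is usually reported with the unbiased estimator from Chen et al. (2021): given n trials of which c succeed, it is the probability that at least one of k sampled trials is correct. The sketch below shows that calculation; whether Corral uses this exact estimator internally is an assumption, not something this README states.

from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: n trials, c correct, k samples."""
    if n - c < k:
        return 1.0  # fewer than k incorrect trials, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


# e.g. 5 trials per task with 2 correct: pass@1 = 0.4, pass@3 = 0.9
print(pass_at_k(5, 2, 1), pass_at_k(5, 2, 3))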

🏗️ Available Environments

The framework includes several pre-built environments:

Environment            Description
samplemath             Basic mathematical operations
spectra_elucidation    Spectroscopy/NMR spectra elucidation tasks
corral_md              LAMMPS molecular dynamics simulation setup
catalyst               Catalysis research and material design tasks
afm                    Atomic force microscopy image analysis
ml                     Machine learning model training and evaluation
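
Each environment lives under tasks/ and, assuming it mirrors the samplemath layout shown in the installation steps, follows the same install-and-serve pattern (a sketch; directory names other than samplemath are inferred from the table above):

cd tasks/spectra_elucidation && uv venv && uv pip install -e .
cd spectra_elucidation && python env.py  # serves the environment on http://localhost:8000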

🤖 Available Agents

The framework includes several built-in agent types:

ReActAgent

Uses the ReAct (Reasoning and Acting) framework for step-by-step problem solving.

from corral.agents import ReActAgent

agent = ReActAgent(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" or any other model litellm supports
    temperature=0.1,
    max_iterations=10,
)

ToolCallingAgent

Uses the native tool/function-calling capabilities of LLM providers to solve tasks.

from corral.agents import ToolCallingAgent

agent = ToolCallingAgent(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" or any other model LiteLLM supports
    temperature=0.0,
    max_iterations=10,
)

LLMPlanner

Uses hierarchical planning: a high-level planner decomposes the task and delegates low-level execution to other agents.

from corral.agents import LLMPlanner

agent = LLMPlanner(model="gpt-4o", temperature=0.1, max_iterations=5)

ReflexionAgent

Implements the Reflexion architecture (Shinn et al., 2023), which adds self-reflection and learning from past mistakes.

from corral import CorralRunner, CorralRouter
from corral.agents import ReflexionAgent, ToolCallingAgent

interface = CorralRouter("http://localhost:8000")

# Create the base agent (the "Actor")
base_agent = ToolCallingAgent(model="gpt-4o", max_iterations=10, temperature=0.1)

# Wrap with Reflexion capabilities
reflexion_agent = ReflexionAgent(
    actor=base_agent,
    reflection_model="gpt-4o",  # Model for generating reflections
    reflection_temperature=0.0,  # Deterministic reflections
)

# Use like any other agent
runner = CorralRunner(interface, reflexion_agent)
result = runner.bench(task_ids=["task_1"], trials_per_task=5)

💾 Checkpoint System

The framework automatically saves checkpoints during benchmark runs.

If a run is interrupted, existing checkpoints are discovered and loaded on the next invocation, so the benchmark resumes where it left off.
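
In practice, resuming is just re-issuing the same benchmark call (a minimal sketch, assuming a runner configured as in the Quick Start):

# If a previous run of this benchmark was interrupted, its checkpoints are
# found and loaded automatically, so the run continues where it left off
# rather than starting over.
result = runner.bench(task_ids=["math_1", "math_2"], trials_per_task=3)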

🔧 Contributing

Adding a New Environment

  1. Create environment directory

    mkdir -p tasks/my_new_env/my_new_env
    cd tasks/my_new_env
  2. Create pyproject.toml

    [project]
    name = "my_new_env"
    version = "0.1.0"
    dependencies = [
        "corral",
        # Add your specific dependencies
    ]
  3. Create tools

    # tasks/my_new_env/my_new_env/tools.py
    from corral.backend.tool import tool
    
    
    @tool
    def my_custom_tool(input_param: str) -> str:
        """Description of what the tool does.
    
        Args:
            input_param: Description of the parameter
    
        Returns:
            Description of the return value
        """
        # Your tool implementation
        return f"Processed: {input_param}"

    Note that the docstring has to be formatted correctly for the tool to be registered properly. This means it has to include a description of the parameters and return values as in the example above.

  4. Implement environment class

    # tasks/my_new_env/my_new_env/env.py
    from corral.backend import Environment
    from corral.backend.server import create_benchmark_server

    # Import the tool defined in step 3 (assumes the package layout from step 1)
    from my_new_env.tools import my_custom_tool
    
    
    class MyEnvironment(Environment):
        def __init__(self, task_id: str, problem: str, answer: str):
            self.problem = problem
            self.correct_answer = answer
            super().__init__(task_id)
    
            # Add your tools
            self.add_tool(my_custom_tool)
    
        def get_task_prompt(self) -> str:
            return f"Solve this problem: {self.problem}"
    
        def score(self) -> float:
            if self.state.submitted_answer is None:
                return 0.0
            return 1.0 if self.state.submitted_answer == self.correct_answer else 0.0
    
    
    # Define your tasks
    environments = {
        "task_1": MyEnvironment("task_1", "Problem 1", "Answer 1"),
        "task_2": MyEnvironment("task_2", "Problem 2", "Answer 2"),
    }
    
    # Create server
    if __name__ == "__main__":
        app = create_benchmark_server(environments)
        import uvicorn
    
        uvicorn.run(app, host="0.0.0.0", port=8000)
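
Once the server is running (python env.py), the new tasks can be benchmarked like any other environment. A minimal sketch reusing the Quick Start pattern:

from corral import CorralRunner, CorralRouter
from corral.agents import ReActAgent

interface = CorralRouter("http://localhost:8000")
agent = ReActAgent(model="gpt-4o")
runner = CorralRunner(interface, agent)

# Task IDs match the keys of the `environments` dict defined above
result = runner.bench(task_ids=["task_1", "task_2"])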

Adding a New Agent

  1. Create agent file

    # src/corral/agents/my_agent.py
    from corral.agents import BaseAgent
    from corral import CorralRouter
    
    
    class MyAgent(BaseAgent):
        def __init__(self, model: str, **kwargs):
            super().__init__(model, **kwargs)
            # Add your agent-specific initialization
    
        def run(self, interface: CorralRouter, task_id: str) -> str:
            # Get task information
            guide = interface.get_task_guide(task_id)
    
            # Your agent logic here
            # Use interface.execute_tool() to call tools
    
            return "Your final answer"
  2. Add to agent registry

    # src/corral/agents/__init__.py
    from .my_agent import MyAgent
    
    __all__ = ["MyAgent", ...]
  3. Test your agent

    from corral.agents.my_agent import MyAgent
    from corral import CorralRunner, CorralRouter
    
    agent = MyAgent(model="gpt-4o")
    interface = CorralRouter("http://localhost:8000")
    runner = CorralRunner(interface, agent)
    
    result = runner.bench()

Development Setup

  1. Install development dependencies

    uv pip install -e .
  2. Install pre-commit hooks (including the commitizen commit-message hook)

    pre-commit install --hook-type commit-msg --hook-type pre-push
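
The installed hooks can also be run manually against the whole tree before pushing:

pre-commit run --all-files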

📋 Advanced Usage

Tool Creation

Standard Tools

from corral.backend.tool import tool


@tool
def calculate_molecular_weight(formula: str) -> float:
    """Calculate molecular weight from chemical formula.

    Args:
        formula: Chemical formula (e.g., 'H2O', 'CH4')

    Returns:
        Molecular weight in g/mol
    """
    # Implementation here
    pass

Modal Tools (Cloud Execution)

Modal allows you to run computationally intensive tasks in the cloud. See Modal docs for setup.

from corral.utils.modal import modal_tool, MODAL_TOOL_REGISTRY
from modal import App, Image

app = App("my-corral-tools")


@modal_tool(app=app, image=Image.debian_slim().pip_install("rdkit"), memory=1024)
def complex_calculation(data: str) -> str:
    """Run computationally intensive task in the cloud."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(data)
    return f"Molecule has {mol.GetNumAtoms()} atoms"


# Access the tool
tool_instance = MODAL_TOOL_REGISTRY["complex_calculation"]

For Corral-specific usage, see the Modal App Documentation.

MCP (Model Context Protocol) Integration

Corral tools can be easily converted to MCP format for use with MCP-compatible clients like Claude Desktop:

from corral.backend.tool import tool
from corral.router.verbosity import ToolVerbosity


@tool
def my_scientific_tool(param: str) -> str:
    """Scientific tool description.

    Args:
        param: Parameter description

    Returns:
        Result description
    """
    return f"Result: {param}"


# Convert to MCP format
mcp_definition = my_scientific_tool.to_mcp()

# With a specific verbosity level
mcp_minimal = my_scientific_tool.to_mcp(verbosity=ToolVerbosity.MINIMAL)

Create a custom MCP server:

from mcp.server import Server
from mcp.types import Tool as MCPTool
import importlib
import inspect
from corral.backend.tool import Tool

# Load tools from a module
module = importlib.import_module("my_domain.tools")
tools = {name: obj for name, obj in inspect.getmembers(module) if isinstance(obj, Tool)}

# Create MCP server
server = Server("my-corral-tools")


@server.list_tools()
async def list_tools():
    return [MCPTool(**tool.to_mcp()) for tool in tools.values()]


@server.call_tool()
async def call_tool(name: str, arguments: dict):
    tool = tools[name]
    # Execute and return results
    ...

For more details on creating tools and MCP integration, see the Tools Documentation.

Environment Configuration

For environments requiring file I/O:

export CORRAL_FS_PROTOCOL=local
export BASE_IO_PATH=/path/to/work/directory
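
The same configuration can be set from Python before starting a task server or runner (a small sketch; the path is a placeholder):

import os

# Equivalent to the shell exports above
os.environ["CORRAL_FS_PROTOCOL"] = "local"
os.environ["BASE_IO_PATH"] = "/path/to/work/directory"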

Evaluation Metrics

The framework provides comprehensive evaluation metrics:

result = runner.bench(trials_per_task=10, k_values=[1, 3, 5])

# Access detailed results
print(f"Total score: {result.total_score}")
print(f"Pass@1: {result.pass_at_k[1]}")
print(f"Pass@3: {result.pass_at_k[3]}")
print(f"Average trials: {result.average_trials}")

# Per-task analysis
for task_id, task_result in result.task_results.items():
    print(f"Task {task_id}: {task_result.success_rate:.2f} success rate")


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use Corral in your research, please consider citing:

@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.18805}
}
