A comprehensive benchmarking framework for evaluating AI agents on science tasks. The system provides standardized environments, tools, and evaluation metrics to test agent performance across diverse materials science challenges.
- Python 3.10 or higher
- `uv` (recommended) or `pip` for package management
- Clone the repository

  ```bash
  git clone https://github.com/lamalab-org/corral.git
  cd corral
  ```

- Install the framework

  ```bash
  uv pip install -e .
  ```

- Install specific environment dependencies

  ```bash
  # create task environments
  cd tasks/samplemath && uv venv && uv pip install -e .  # create an env for running sample math
  # ... repeat for other tasks as needed
  ```

- Start a task environment server

  ```bash
  cd tasks/samplemath/samplemath
  python env.py  # Starts a server on http://localhost:8000
  ```

- Run the benchmark in another terminal

  ```python
  from corral import CorralRunner, CorralRouter
  from corral.agents import ReActAgent
  from corral.report import CorralWandbLogger

  # Set up the interface
  interface = CorralRouter("http://localhost:8000")

  # Set up the WandB logger
  wandblogger = CorralWandbLogger(
      project="corral",
      group="experiment_group",
      name="run_name",
  )

  # Set up the agent
  agent = ReActAgent(model="gpt-4o", max_iterations=10, temperature=0.1)

  # Run the benchmark
  runner = CorralRunner(interface, agent, logger=wandblogger)
  result = runner.bench()
  print(f"Overall score: {result.total_score:.2f}")
  ```
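Model calls go through LiteLLM and results can be logged to Weights & Biases, so set the usual provider keys before running. A minimal sketch (these environment variable names are LiteLLM's and wandb's standard defaults, not Corral-specific):

```python
import os

# Standard variables read by LiteLLM and wandb, respectively.
# Replace the placeholders with your own keys before running.
os.environ["OPENAI_API_KEY"] = "<your-openai-key>"
os.environ["WANDB_API_KEY"] = "<your-wandb-key>"
```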
```python
from corral import CorralRunner, CorralRouter
from corral.agents import ReActAgent

interface = CorralRouter("http://localhost:8000")
agent = ReActAgent(model="gpt-4o")
runner = CorralRunner(interface, agent)

# Run a specific task
result = runner.bench(task_ids=["math_1"])

# Run specific tasks
result = runner.bench(task_ids=["math_1", "math_2", "math_3"])

# Run all available tasks
result = runner.bench()  # Uses all tasks in the environment

# Run multiple trials per task
result = runner.bench(
    task_ids=["math_1", "math_2"],
    trials_per_task=3,
    k_values=[1, 2, 3],  # Evaluate with different k values for pass@k metrics
    tool_verbosity="MINIMAL",  # Options: FULL, MINIMAL, NONE
)

# Evaluate with different k values for pass@k metrics
result = runner.bench(trials_per_task=5, k_values=[1, 2, 3, 4, 5])
```

The framework includes several pre-built environments:
| Environment | Description |
|---|---|
| `samplemath` | Basic mathematical operations |
| `spectra_elucidation` | Spectroscopy/NMR spectra elucidation tasks |
| `corral_md` | LAMMPS molecular dynamics simulation setup |
| `catalyst` | Catalysis research and material design tasks |
| `afm` | Atomic force microscopy image analysis |
| `ml` | Machine learning model training and evaluation |
The framework includes several built-in agent types:
Uses the ReAct (Reasoning and Acting) framework for step-by-step problem solving.
```python
from corral.agents import ReActAgent

agent = ReActAgent(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" or any other model LiteLLM supports
    temperature=0.1,
    max_iterations=10,
)
```

Uses the native tool/function-calling capabilities of LLM providers to solve tasks.
```python
from corral.agents import ToolCallingAgent

agent = ToolCallingAgent(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" or any other model LiteLLM supports
    temperature=0.0,
    max_iterations=10,
)
```

Uses hierarchical planning: a high-level planner delegates low-level execution to other agents.
```python
from corral.agents import LLMPlanner

agent = LLMPlanner(model="gpt-4o", temperature=0.1, max_iterations=5)
```

Implements the Reflexion architecture (paper), which adds self-reflection and learning from mistakes.
```python
from corral.agents import ReflexionAgent, ToolCallingAgent

# Create the base agent (the "Actor")
base_agent = ToolCallingAgent(model="gpt-4o", max_iterations=10, temperature=0.1)

# Wrap it with Reflexion capabilities
reflexion_agent = ReflexionAgent(
    actor=base_agent,
    reflection_model="gpt-4o",  # Model for generating reflections
    reflection_temperature=0.0,  # Deterministic reflections
)

# Use it like any other agent
runner = CorralRunner(interface, reflexion_agent)
result = runner.bench(task_ids=["task_1"], trials_per_task=5)
```

The framework automatically saves checkpoints during benchmark runs.
Checkpoints are automatically searched and loaded when resuming interrupted runs.
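A minimal sketch of resuming, assuming resumption is triggered simply by re-running the same benchmark configuration (the checkpoint discovery itself happens inside the runner):

```python
# Re-running an interrupted benchmark with the same configuration picks up
# the saved checkpoint automatically; completed trials are not re-executed.
runner = CorralRunner(interface, agent)
result = runner.bench(task_ids=["math_1", "math_2"], trials_per_task=5)
```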
- Create the environment directory

  ```bash
  mkdir -p tasks/my_new_env/my_new_env
  cd tasks/my_new_env
  ```

- Create `pyproject.toml`

  ```toml
  [project]
  name = "my_new_env"
  version = "0.1.0"
  dependencies = [
      "corral",
      # Add your specific dependencies
  ]
  ```

- Create tools

  ```python
  # tasks/my_new_env/my_new_env/tools.py
  from corral.backend.tool import tool

  @tool
  def my_custom_tool(input_param: str) -> str:
      """Description of what the tool does.

      Args:
          input_param: Description of the parameter

      Returns:
          Description of the return value
      """
      # Your tool implementation
      return f"Processed: {input_param}"
  ```

  Note that the docstring has to be formatted correctly for the tool to be registered properly: it must describe the parameters and the return value, as in the example above.

- Implement the environment class

  ```python
  # tasks/my_new_env/my_new_env/env.py
  from corral.backend import Environment
  from corral.backend.server import create_benchmark_server

  from my_new_env.tools import my_custom_tool


  class MyEnvironment(Environment):
      def __init__(self, task_id: str, problem: str, answer: str):
          self.problem = problem
          self.correct_answer = answer
          super().__init__(task_id)
          # Add your tools
          self.add_tool(my_custom_tool)

      def get_task_prompt(self) -> str:
          return f"Solve this problem: {self.problem}"

      def score(self) -> float:
          if self.state.submitted_answer is None:
              return 0.0
          return 1.0 if self.state.submitted_answer == self.correct_answer else 0.0


  # Define your tasks
  environments = {
      "task_1": MyEnvironment("task_1", "Problem 1", "Answer 1"),
      "task_2": MyEnvironment("task_2", "Problem 2", "Answer 2"),
  }

  # Create the server
  if __name__ == "__main__":
      import uvicorn

      app = create_benchmark_server(environments)
      uvicorn.run(app, host="0.0.0.0", port=8000)
  ```
- Create the agent file

  ```python
  # src/corral/agents/my_agent.py
  from corral import CorralRouter
  from corral.agents import BaseAgent


  class MyAgent(BaseAgent):
      def __init__(self, model: str, **kwargs):
          super().__init__(model, **kwargs)
          # Add your agent-specific initialization

      def run(self, interface: CorralRouter, task_id: str) -> str:
          # Get task information
          guide = interface.get_task_guide(task_id)
          # Your agent logic here
          # Use interface.execute_tool() to call tools
          return "Your final answer"
  ```

- Add it to the agent registry

  ```python
  # src/corral/agents/__init__.py
  from .my_agent import MyAgent

  __all__ = ["MyAgent", ...]
  ```

- Test your agent

  ```python
  from corral import CorralRunner, CorralRouter
  from corral.agents.my_agent import MyAgent

  agent = MyAgent(model="gpt-4o")
  interface = CorralRouter("http://localhost:8000")
  runner = CorralRunner(interface, agent)
  result = runner.bench()
  ```
- Install development dependencies

  ```bash
  uv pip install -e .
  ```

- Install pre-commit hooks with commitizen commit-message checks

  ```bash
  pre-commit install --hook-type commit-msg --hook-type pre-push
  ```
```python
from corral.backend.tool import tool

@tool
def calculate_molecular_weight(formula: str) -> float:
    """Calculate molecular weight from chemical formula.

    Args:
        formula: Chemical formula (e.g., 'H2O', 'CH4')

    Returns:
        Molecular weight in g/mol
    """
    # Implementation here
    pass
```

Modal Tools (Cloud Execution)
Modal allows you to run computationally intensive tasks in the cloud. See Modal docs for setup.
```python
from modal import App, Image

from corral.utils.modal import modal_tool, MODAL_TOOL_REGISTRY

app = App("my-corral-tools")

@modal_tool(app=app, image=Image.debian_slim().pip_install("rdkit"), memory=1024)
def complex_calculation(data: str) -> str:
    """Run a computationally intensive task in the cloud."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(data)
    return f"Molecule has {mol.GetNumAtoms()} atoms"

# Access the tool
tool_instance = MODAL_TOOL_REGISTRY["complex_calculation"]
```

For Corral-specific usage, see the Modal App Documentation.
Corral tools can be easily converted to MCP format for use with MCP-compatible clients like Claude Desktop:
```python
from corral.backend.tool import tool
from corral.router.verbosity import ToolVerbosity

@tool
def my_scientific_tool(param: str) -> str:
    """Scientific tool description.

    Args:
        param: Parameter description

    Returns:
        Result description
    """
    return f"Result: {param}"

# Convert to MCP format
mcp_definition = my_scientific_tool.to_mcp()

# With a specific verbosity level
mcp_brief = my_scientific_tool.to_mcp(verbosity=ToolVerbosity.BRIEF)
```

Create a custom MCP server:
```python
import importlib
import inspect

from mcp.server import Server
from mcp.types import Tool as MCPTool

from corral.backend.tool import Tool

# Load tools from a module
module = importlib.import_module("my_domain.tools")
tools = {name: obj for name, obj in inspect.getmembers(module) if isinstance(obj, Tool)}

# Create the MCP server
server = Server("my-corral-tools")

@server.list_tools()
async def list_tools():
    return [MCPTool(**tool.to_mcp()) for tool in tools.values()]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    tool = tools[name]
    # Execute and return results
    ...
```

For more details on creating tools and MCP integration, see the Tools Documentation.
For environments requiring file I/O:
```bash
export CORRAL_FS_PROTOCOL=local
export BASE_IO_PATH=/path/to/work/directory
```
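How these variables are consumed is not shown here; a plausible sketch, assuming fsspec-style protocol handling (the use of `fsspec` and the path layout are assumptions, not documented behavior):

```python
import os

import fsspec

# Assumption: CORRAL_FS_PROTOCOL names an fsspec filesystem and
# BASE_IO_PATH is the root directory for task inputs/outputs.
fs = fsspec.filesystem(os.environ.get("CORRAL_FS_PROTOCOL", "local"))
base = os.environ["BASE_IO_PATH"]

with fs.open(f"{base}/scratch/output.txt", "w") as fh:
    fh.write("intermediate results")
```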
The framework provides comprehensive evaluation metrics:

```python
result = runner.bench(trials_per_task=10, k_values=[1, 3, 5])

# Access detailed results
print(f"Total score: {result.total_score}")
print(f"Pass@1: {result.pass_at_k[1]}")
print(f"Pass@3: {result.pass_at_k[3]}")
print(f"Average trials: {result.average_trials}")

# Per-task analysis
for task_id, task_result in result.task_results.items():
    print(f"Task {task_id}: {task_result.success_rate:.2f} success rate")
```
- Issues: Report bugs and request features on GitHub Issues
- Discussions: Join conversations on GitHub Discussions
- Contributing: See our Contributing Guide
This project is licensed under the MIT License - see the LICENSE file for details.
If you use Corral in your research, please consider citing:
```bibtex
@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.18805}
}
```