A multimodal evaluation framework for scheduling LLM and VLM evaluations across HPC clusters. Built as an orchestration layer over lm-eval, lighteval, and lmms-eval, with a plugin system for contributing custom benchmarks.
- Schedule evaluations on multiple models and tasks: `oellm schedule-eval`
- Collect results and check for missing evaluations: `oellm collect-results`
- Task groups for pre-defined evaluation suites with automatic dataset pre-downloading
- Multi-cluster support with auto-detection (Leonardo, LUMI, JURECA)
- Image evaluation via lmms-eval (VQAv2, MMBench, MMMU, ChartQA, DocVQA, TextVQA, OCRBench, MathVista)
- Plugin system for contributing custom benchmarks without touching core code
- Automatic container builds via GitHub Actions
Prerequisites:
- Install uv
- Set `HF_HOME` to your HuggingFace cache directory (e.g. `export HF_HOME="/path/to/hf_home"`)
```sh
# Install
uv tool install -p 3.12 git+https://github.com/elliot-project/elliot-cli.git

# Run evaluations using a task group
oellm schedule-eval \
  --models "EleutherAI/pythia-160m" \
  --task_groups "open-sci-0.01"

# Image evaluation (requires venv with lmms-eval)
oellm schedule-eval \
  --models "llava-hf/llava-1.5-7b-hf" \
  --task_groups "image-vqa" \
  --venv_path ~/elliot-venv
```

This will automatically detect your cluster, download models and datasets, and submit a SLURM job array with cluster-specific resources.
For custom environments instead of containers, pass `--venv_path` (see docs/VENV.md).
Task groups are pre-defined evaluation suites in `task-groups.yaml`. Each group specifies its tasks, their n-shot settings, and HuggingFace dataset mappings.
| Group | Description | Engine |
|---|---|---|
| `open-sci-0.01` | COPA, MMLU, HellaSwag, ARC, etc. | lm-eval |
| `belebele-eu-5-shot` | Belebele in 23 European languages | lm-eval |
| `flores-200-eu-to-eng` | EU to English translation | lighteval |
| `flores-200-eng-to-eu` | English to EU translation | lighteval |
| `global-mmlu-eu` | Global MMLU in EU languages | lm-eval |
| `mgsm-eu` | Multilingual GSM8K | lm-eval |
| `generic-multilingual` | XWinograd, XCOPA, XStoryCloze | lm-eval |
| `include` | INCLUDE benchmarks (44 languages) | lm-eval |
Super groups: `oellm-multilingual` (all multilingual benchmarks combined)
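To picture what a group definition looks like, here is a hypothetical `task-groups.yaml` entry. The field names (`engine`, `tasks`, `num_fewshot`, `dataset`) are illustrative assumptions, not the actual schema — the source only states that each group lists tasks, n-shot settings, and HuggingFace dataset mappings:

```yaml
# Hypothetical sketch only -- consult task-groups.yaml for the real schema.
my-eval-suite:
  engine: lm-eval
  tasks:
    - name: hellaswag
      num_fewshot: 5
      dataset: Rowan/hellaswag     # HuggingFace dataset mapping
    - name: arc_challenge
      num_fewshot: 25
      dataset: allenai/ai2_arc
```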
| Group | Benchmark | Engine |
|---|---|---|
| `image-vqa` | All 8 benchmarks combined | lmms-eval |
| `image-vqav2` | VQAv2 | lmms-eval |
| `image-mmbench` | MMBench | lmms-eval |
| `image-mmmu` | MMMU | lmms-eval |
| `image-chartqa` | ChartQA | lmms-eval |
| `image-docvqa` | DocVQA | lmms-eval |
| `image-textvqa` | TextVQA | lmms-eval |
| `image-ocrbench` | OCRBench | lmms-eval |
| `image-mathvista` | MathVista | lmms-eval |
The lmms-eval adapter class (`llava_hf`, `qwen2_5_vl`, etc.) is auto-detected from the model name.
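The auto-detection can be pictured as a substring match over known adapter names. This is a minimal illustrative sketch, not oellm's actual implementation — the pattern table below is an assumption:

```python
# Illustrative sketch of name-based adapter selection.
# The pattern table is an assumption, not oellm's actual mapping.
ADAPTER_PATTERNS = [
    ("llava", "llava_hf"),
    ("qwen2.5-vl", "qwen2_5_vl"),
    ("qwen2-vl", "qwen2_vl"),
]

def detect_adapter(model_name: str) -> str:
    """Return the lmms-eval adapter class name for a HuggingFace model id."""
    name = model_name.lower()
    for pattern, adapter in ADAPTER_PATTERNS:
        if pattern in name:
            return adapter
    raise ValueError(f"no lmms-eval adapter matched {model_name!r}")
```

For example, `detect_adapter("llava-hf/llava-1.5-7b-hf")` would select the `llava_hf` adapter.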
Community-contributed benchmarks that run outside the standard evaluation engines. See the contrib registry for the full list.
```sh
# Run all 8 image benchmarks
oellm schedule-eval \
  --models "llava-hf/llava-1.5-7b-hf" \
  --task_groups "image-vqa" \
  --venv_path ~/elliot-venv

# Mix image and text benchmarks in one submission
oellm schedule-eval \
  --models "llava-hf/llava-1.5-7b-hf" \
  --task_groups "image-mmbench,open-sci-0.01" \
  --venv_path ~/elliot-venv
```
```sh
# Use multiple task groups or a super group
oellm schedule-eval --models "model-name" --task_groups "belebele-eu-5-shot,global-mmlu-eu"
oellm schedule-eval --models "model-name" --task_groups "oellm-multilingual"
```

```sh
# Basic collection
oellm collect-results /path/to/eval-output-dir

# Check for missing evaluations and create a CSV for re-running them
oellm collect-results /path/to/eval-output-dir --check --output_csv results.csv

# Re-schedule failed jobs
oellm schedule-eval --eval_csv_path results_missing.csv
```

Install:

```sh
uv tool install -p 3.12 git+https://github.com/elliot-project/elliot-cli.git
```

Update to latest:

```sh
uv tool upgrade oellm
```

For cluster-specific setup, see the documentation section.
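The `--check` flow can be pictured as a set difference between expected and completed (model, task) pairs. The sketch below is illustrative only — the result-file layout (`<output_dir>/<model>/<task>.json`) and the CSV columns are assumptions, not oellm's actual conventions:

```python
import csv
from pathlib import Path

# Illustrative sketch of missing-evaluation detection; the file layout
# and CSV columns are assumptions, not oellm's actual conventions.
def find_missing(output_dir: str, expected: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (model, task) pairs with no result file under output_dir."""
    done = {
        (p.parent.name, p.stem)  # assume <output_dir>/<model>/<task>.json
        for p in Path(output_dir).glob("*/*.json")
    }
    return [pair for pair in expected if pair not in done]

def write_missing_csv(pairs: list[tuple[str, str]], csv_path: str) -> None:
    """Write the missing pairs to a CSV suitable for re-scheduling."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "task"])
        writer.writerows(pairs)
```

The resulting CSV is the kind of artifact that `--eval_csv_path` consumes when re-scheduling failed jobs.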
```sh
git clone https://github.com/elliot-project/elliot-cli.git
cd elliot-cli
uv sync --extra dev

# Run all unit tests
uv run pytest tests/ -v

# Download-only mode for testing
uv run oellm schedule-eval --models "EleutherAI/pythia-160m" --task_groups "open-sci-0.01" --download_only
```

| Cluster | Guide |
|---|---|
| Leonardo (CINECA) | docs/LEONARDO.md |
| LUMI, JURECA | Coming soon |
| Doc | Description |
|---|---|
| Using a Virtual Environment | Setting up a custom venv with lm-eval, lmms-eval, and lighteval |
| Container Workflow | How Apptainer containers are built, deployed, and used |
| Doc | Description |
|---|---|
| Adding Tasks & Task Groups | YAML structure for defining new evaluation suites |
| Contributing Custom Benchmarks | Step-by-step guide for adding a contrib plugin |
| Contrib Registry | List of community-contributed benchmarks |
ELLIOT supports two paths for adding benchmarks:
- Benchmark already in lm-eval / lighteval / lmms-eval -- add a YAML entry to `task-groups.yaml`
- Fully custom benchmark -- drop a contrib plugin into `oellm/contrib/`
See the Contributing Guide for step-by-step instructions.
Containers are deployed manually since PR #46. To build and deploy, select "Run workflow" in Actions.