A multimodal evaluation framework for scheduling LLM and VLM evaluations across HPC clusters. Built as an orchestration layer over lm-eval, lighteval, and lmms-eval, with a plugin system for contributing custom benchmarks.
- Schedule evaluations on multiple models and tasks: `oellm schedule-eval`
- Collect results and check for missing evaluations: `oellm collect-results`
- Task groups for pre-defined evaluation suites with automatic dataset pre-downloading
- Multi-cluster support with auto-detection (Leonardo, LUMI, JURECA)
- Image evaluation via lmms-eval (VQAv2, MMBench, MMMU, ChartQA, DocVQA, TextVQA, OCRBench, MathVista)
- Plugin system for contributing custom benchmarks without touching core code
- Automatic container builds via GitHub Actions
Prerequisites:
- Install uv
- Set `HF_HOME` to your HuggingFace cache directory (e.g. `export HF_HOME="/path/to/hf_home"`)
```sh
# Install
uv tool install -p 3.12 git+https://github.com/elliot-project/elliot-cli.git

# Run evaluations using a task group
oellm schedule-eval \
  --models "EleutherAI/pythia-160m" \
  --task_groups "open-sci-0.01"

# Image evaluation (requires venv with lmms-eval)
oellm schedule-eval \
  --models "llava-hf/llava-1.5-7b-hf" \
  --task_groups "image-vqa" \
  --venv_path ~/elliot-venv
```

This will automatically detect your cluster, download models and datasets, and submit a SLURM job array with cluster-specific resources.
For custom environments instead of containers, pass `--venv_path` (see docs/VENV.md).
Task groups are pre-defined evaluation suites in `task-groups.yaml`. Each group specifies its tasks, their n-shot settings, and HuggingFace dataset mappings.
| Group | Description | Engine |
|---|---|---|
| `open-sci-0.01` | COPA, MMLU, HellaSwag, ARC, etc. | lm-eval |
| `belebele-eu-5-shot` | Belebele in 23 European languages | lm-eval |
| `flores-200-eu-to-eng` | EU to English translation | lighteval |
| `flores-200-eng-to-eu` | English to EU translation | lighteval |
| `global-mmlu-eu` | Global MMLU in EU languages | lm-eval |
| `mgsm-eu` | Multilingual GSM8K | lm-eval |
| `generic-multilingual` | XWinograd, XCOPA, XStoryCloze | lm-eval |
| `include` | INCLUDE benchmarks (44 languages) | lm-eval |
Super groups: `oellm-multilingual` (all multilingual benchmarks combined)
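To picture what a group definition looks like, here is a hypothetical `task-groups.yaml` entry. The field names (`engine`, `tasks`, `num_fewshot`, `dataset`) are illustrative assumptions, not the actual schema — the source only states that each group lists tasks, n-shot settings, and HuggingFace dataset mappings:

```yaml
# Hypothetical sketch only -- consult task-groups.yaml for the real schema.
my-eval-suite:
  engine: lm-eval
  tasks:
    - name: hellaswag
      num_fewshot: 5
      dataset: Rowan/hellaswag     # HuggingFace dataset mapping
    - name: arc_challenge
      num_fewshot: 25
      dataset: allenai/ai2_arc
```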
| Group | Benchmark | Engine |
|---|---|---|
| `image-vqa` | All 8 benchmarks combined | lmms-eval |
| `image-vqav2` | VQAv2 | lmms-eval |
| `image-mmbench` | MMBench | lmms-eval |
| `image-mmmu` | MMMU | lmms-eval |
| `image-chartqa` | ChartQA | lmms-eval |
| `image-docvqa` | DocVQA | lmms-eval |
| `image-textvqa` | TextVQA | lmms-eval |
| `image-ocrbench` | OCRBench | lmms-eval |
| `image-mathvista` | MathVista | lmms-eval |
The lmms-eval adapter class (`llava_hf`, `qwen2_5_vl`, etc.) is auto-detected from the model name.
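The auto-detection can be pictured as a substring match over known adapter names. This is a minimal illustrative sketch, not oellm's actual implementation — the pattern table below is an assumption:

```python
# Illustrative sketch of name-based adapter selection.
# The pattern table is an assumption, not oellm's actual mapping.
ADAPTER_PATTERNS = [
    ("llava", "llava_hf"),
    ("qwen2.5-vl", "qwen2_5_vl"),
    ("qwen2-vl", "qwen2_vl"),
]

def detect_adapter(model_name: str) -> str:
    """Return the lmms-eval adapter class name for a HuggingFace model id."""
    name = model_name.lower()
    for pattern, adapter in ADAPTER_PATTERNS:
        if pattern in name:
            return adapter
    raise ValueError(f"no lmms-eval adapter matched {model_name!r}")
```

For example, `detect_adapter("llava-hf/llava-1.5-7b-hf")` would select the `llava_hf` adapter.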
Community-contributed benchmarks that run outside the standard evaluation engines. See the contrib registry for the full list.
```sh
# Run all 8 image benchmarks
oellm schedule-eval \
  --models "llava-hf/llava-1.5-7b-hf" \
  --task_groups "image-vqa" \
  --venv_path ~/elliot-venv

# Mix image and text benchmarks in one submission
oellm schedule-eval \
  --models "llava-hf/llava-1.5-7b-hf" \
  --task_groups "image-mmbench,open-sci-0.01" \
  --venv_path ~/elliot-venv
```
```sh
# Use multiple task groups or a super group
oellm schedule-eval --models "model-name" --task_groups "belebele-eu-5-shot,global-mmlu-eu"
oellm schedule-eval --models "model-name" --task_groups "oellm-multilingual"
```

```sh
# Basic collection
oellm collect-results /path/to/eval-output-dir

# Check for missing evaluations and create a CSV for re-running them
oellm collect-results /path/to/eval-output-dir --check --output_csv results.csv

# Re-schedule failed jobs
oellm schedule-eval --eval_csv_path results_missing.csv
```

Install:

```sh
uv tool install -p 3.12 git+https://github.com/elliot-project/elliot-cli.git
```

Update to latest:

```sh
uv tool upgrade oellm
```

For cluster-specific setup, see the documentation section.
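The `--check` flow can be pictured as a set difference between expected and completed (model, task) pairs. The sketch below is illustrative only — the result-file layout (`<output_dir>/<model>/<task>.json`) and the CSV columns are assumptions, not oellm's actual conventions:

```python
import csv
from pathlib import Path

# Illustrative sketch of missing-evaluation detection; the file layout
# and CSV columns are assumptions, not oellm's actual conventions.
def find_missing(output_dir: str, expected: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (model, task) pairs with no result file under output_dir."""
    done = {
        (p.parent.name, p.stem)  # assume <output_dir>/<model>/<task>.json
        for p in Path(output_dir).glob("*/*.json")
    }
    return [pair for pair in expected if pair not in done]

def write_missing_csv(pairs: list[tuple[str, str]], csv_path: str) -> None:
    """Write the missing pairs to a CSV suitable for re-scheduling."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "task"])
        writer.writerows(pairs)
```

The resulting CSV is the kind of artifact that `--eval_csv_path` consumes when re-scheduling failed jobs.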
```sh
git clone https://github.com/elliot-project/elliot-cli.git
cd elliot-cli
uv sync --extra dev

# Run all unit tests
uv run pytest tests/ -v

# Download-only mode for testing
uv run oellm schedule-eval --models "EleutherAI/pythia-160m" --task_groups "open-sci-0.01" --download_only
```

| Cluster | Guide |
|---|---|
| Leonardo (CINECA) | docs/LEONARDO.md |
| LUMI, JURECA | Coming soon |
| Doc | Description |
|---|---|
| Using a Virtual Environment | Setting up a custom venv with lm-eval, lmms-eval, and lighteval |
| Container Workflow | How Apptainer containers are built, deployed, and used |
| Doc | Description |
|---|---|
| Adding Tasks & Task Groups | YAML structure for defining new evaluation suites |
| Contributing Custom Benchmarks | Step-by-step guide for adding a contrib plugin |
| Contrib Registry | List of community-contributed benchmarks |
ELLIOT supports two paths for adding benchmarks:
- Benchmark already in lm-eval / lighteval / lmms-eval -- add a YAML entry to `task-groups.yaml`
- Fully custom benchmark -- drop a contrib plugin into `oellm/contrib/`
See the Contributing Guide for step-by-step instructions.
Containers are deployed manually since PR #46. To build and deploy, select "Run workflow" in Actions.