Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark framework designed to assess mainstream Computer Vision (CV) and Natural Language Processing (NLP) workloads on novel accelerators. It has been developed and extensively tested on systems at the Jülich Supercomputing Centre (JSC).
CARAML provides a compact and automated benchmarking tool that leverages JUBE, a scripting-based framework for creating benchmark sets, running them across different systems, and evaluating results. Additionally, it includes power/energy measurements through the jpwr tool.
CARAML has been tested on the JURECA-DC Evaluation Platform, JURECA-DC, JEDI, and the WEST-AI Nodes. These systems include the following accelerators:

- AMD MI200 node with 4× MI250 GPUs (tag: `MI250`)
- Graphcore IPU-POD4 M2000 with 4× GC200 IPUs (tag: `GC200`)
- NVIDIA Ampere node (SXM) with 4× A100 GPUs (tag: `A100`)
- NVIDIA Hopper node (PCIe) with 4× H100 GPUs (tag: `H100`)
- NVIDIA Hopper node (NVLink) with 4× H100 GPUs (tag: `WAIH100`)
- NVIDIA Grace-Hopper chip with 1× GH200 GPU (tag: `GH200`)
- NVIDIA Grace-Hopper node with 4× GH200 GPUs (tag: `JEDI`)
CARAML currently provides two main benchmarks implemented in Python:
The image_classification model training benchmark is implemented in PyTorch and is designed to test image classification models such as ResNet50 on various accelerators. For IPUs, the graphcore/examples repository is used. Performance is measured in images/sec and energy is measured in Wh.
Note: Support for the Image Classification benchmark in TensorFlow has been discontinued.
The LLM-training benchmark is implemented in PyTorch with:
- Megatron-LM at commit `f7727433293427bef04858f67b2889fe9b177d88`, with a patch applied, for NVIDIA
- Megatron-LM-ROCm at commit `21045b59127cd2d5509f1ca27d81fae7b485bd22`, with a patch applied, for AMD
- graphcore/examples (forked version) for Graphcore
Performance is measured in tokens/sec and energy is recorded in Wh.
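These pinned versions are set up by the benchmark itself. For reference, a manual sketch of the NVIDIA checkout (the patch file name and location are hypothetical):

```bash
# Sketch only: reproduces the pinned Megatron-LM state used for NVIDIA.
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout f7727433293427bef04858f67b2889fe9b177d88
# Apply the CARAML patch for NVIDIA (hypothetical file name and path):
git apply ../megatron_nvidia.patch
```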
To run the benchmarks, you must install JUBE. Follow the JUBE Installation Documentation for setup instructions. The benchmarks are deployed using Apptainer containers and executed using SLURM on the tested accelerators.
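A minimal sketch of the installation (the module name on JSC systems and the PyPI package name are assumptions; the JUBE Installation Documentation is authoritative):

```bash
# On JSC systems, JUBE is typically available as an environment module:
module load JUBE
# Alternatively, install it into a local Python environment:
pip install JUBE
# Check that the jube command is available:
jube --version
```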
The benchmarks use the following data:

- Image Classification: Synthetic data is generated on the host machine for benchmarking. The IPU tag `synthetic` additionally allows for the generation of synthetic data directly on the IPU.
- LLM Training: A subset of the OSCAR dataset (790 samples, ~10 MB) is pre-processed using GPT-2 tokenizers. This data is provided in the `llm_data` directory.
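The pre-processed OSCAR subset ships with the repository, so no further preparation is needed. For reference, such data is typically tokenized with Megatron-LM's preprocessing tool roughly as follows (paths and file names are illustrative; flag names can differ between Megatron-LM versions, so verify against the pinned commit):

```bash
# Illustrative only: the llm_data directory already contains this output.
python tools/preprocess_data.py \
    --input oscar_subset.jsonl \
    --output-prefix oscar_gpt2 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
```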
Clone the repository and navigate into it:

```bash
git clone https://github.com/FZJ-JSC/CARAML.git
cd CARAML
```
To run the image classification benchmark:

- Modify the `system` and `model` parameters in the JUBE config.
- To pull the required container, use the `container` tag:

  ```bash
  jube run image_classification/image_classification_torch_benchmark.xml --tag container H100
  ```

  For JSC systems, `H100` can be replaced with `GH200`, `MI250`, and `GC200` as required.
- To run the benchmark with the defined configuration:

  ```bash
  jube run image_classification/image_classification_torch_benchmark.xml --tag H100
  ```

  `H100` can be replaced with `A100`, `WAIH100`, `GH200`, `JEDI`, `MI250`, and `GC200` as required.
- After the benchmark has been executed, use `jube continue` to postprocess the results:

  ```bash
  jube continue image_classification/image_classification_torch_benchmark_run -i last
  ```
- To generate the result table:

  ```bash
  jube result image_classification/image_classification_torch_benchmark_run -i last
  ```
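For convenience, the full sequence for a single system tag (`H100` here) as one copy-pasteable block; note that `jube run` submits a SLURM job, so wait for it to complete before postprocessing:

```bash
jube run image_classification/image_classification_torch_benchmark.xml --tag container H100
jube run image_classification/image_classification_torch_benchmark.xml --tag H100
# ...once the submitted SLURM job has finished:
jube continue image_classification/image_classification_torch_benchmark_run -i last
jube result image_classification/image_classification_torch_benchmark_run -i last
```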
To run the LLM training benchmark:

- Set the required `system` and `model` parameters in `llm_benchmark_nvidia_amd.yaml` for NVIDIA and AMD devices, and in `llm_benchmark_ipu.yaml` for Graphcore.
- To run the benchmark with the defined configuration for the `800M` GPT model with OSCAR data:

  ```bash
  jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
  ```

  `A100` can be replaced with `H100`, `WAIH100`, `GH200`, `JEDI`, and `MI250` for the respective systems, and `800M` can be replaced with `13B` or `175B` on systems with more node resources, such as `JEDI`, `H100`, `A100`, and `MI250`.
- To run the benchmark with the defined configuration for the `117M` GPT model on Graphcore with synthetic data:

  ```bash
  jube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
  ```

  If the `synthetic` tag is not given, the benchmark uses OSCAR data.
- After the benchmark has been executed, use `jube continue` to postprocess the results:

  ```bash
  jube continue llm_training/llm_benchmark_nvidia_amd_run -i last
  ```
- To generate the result table:

  ```bash
  jube result llm_training/llm_benchmark_nvidia_amd_run -i last
  ```
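For a Graphcore run, postprocessing follows the same pattern. The run directory name below is an assumption, mirroring the `llm_benchmark_nvidia_amd_run` naming above:

```bash
# Assumed run directory name for the IPU benchmark:
jube continue llm_training/llm_benchmark_ipu_run -i last
jube result llm_training/llm_benchmark_ipu_run -i last
```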
To use the PyTorch `torchrun` API on JSC systems, the fixed_torch_run.py fix is required. The fix resolves the issue described here.
Additionally, the hostname is appended with an `i` to allow communication over InfiniBand, as described here.
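A sketch of that idiom in a SLURM job script (illustrative; the benchmark's own scripts handle this):

```bash
# Illustrative: take the first node of the allocation as rendezvous host
# and append "i" so that traffic goes over the InfiniBand interface.
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)i"
export MASTER_ADDR
```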
If you use CARAML in your work, please cite:

```bibtex
@INPROCEEDINGS{10820809,
  author={John, Chelsea Maria and Nassyr, Stepan and Penke, Carolin and Herten, Andreas},
  booktitle={SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  title={Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML},
  year={2024},
  pages={1164-1176},
  doi={10.1109/SCW63240.2024.00158}
}
```

