Skip to content

chenxuniu/LLM-Inference-Engine-Benchmark

Repository files navigation

LLM-Inference-Engine-Benchmark

A comprehensive benchmarking tool for measuring energy consumption and power efficiency of Large Language Model (LLM) inference engines including vLLM, DeepSpeed, TensorRT-LLM, and Transformers. Featured in HotCarbon '25.

🎯 Overview

This tool provides the first systematic evaluation of power consumption across multiple LLM inference engines, offering:

  • Multi-Engine Support: Benchmarks vLLM, DeepSpeed, TensorRT-LLM, and Hugging Face Transformers
  • Fine-grained Analysis: Separates inference lifecycle into setup and token generation stages
  • Component-wise Monitoring: Tracks GPU, CPU, and DRAM power consumption separately
  • Comprehensive Metrics: Measures energy per token, per response, and per second
  • Real-world Workloads: Uses Alpaca dataset for realistic benchmarking scenarios

πŸ“Š Key Findings

Our research reveals that:

  • No single engine universally optimizes energy efficiency across all scenarios
  • vLLM and TensorRT-LLM excel in token generation efficiency under high-throughput workloads
  • Transformers and DeepSpeed are more efficient during the setup stage
  • Higher throughput generally improves energy efficiency by amortizing fixed costs
  • GPU optimization is critical as it dominates overall energy consumption (>50%)

πŸ› οΈ Features

Multi-Stage Analysis

  • Setup Stage: Engine initialization and model loading
  • Token Generation Stage: Actual inference execution

Component-wise Power Monitoring

  • GPU Power: Using NVIDIA Management Library (NVML)
  • CPU Power: Using Intel RAPL (Running Average Power Limit)
  • DRAM Power: Memory subsystem power consumption
  • Total System Power: Using IPMI for complete system monitoring

Flexible Workload Configurations

  • Standard Load: Batch size 128, 500 output tokens
  • High Concurrency: Batch size 256, 500 output tokens
  • High Throughput: Batch size 256, 2000 output tokens

πŸš€ Quick Start

Prerequisites

# System requirements
- Python 3.8+
- NVIDIA GPUs with CUDA support
- Intel CPUs with RAPL support (for CPU power monitoring)
- IPMI tools (for total system power monitoring)

Installation

  1. Clone the repository:

    git clone https://github.com/chenxuniu/LLM-Inference-Engine-Benchmark.git
    cd LLM-Energy-Benchmark
  2. Create virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Linux/macOS
    # venv\Scripts\activate  # On Windows
  3. Install dependencies:

    pip install torch transformers datasets nvidia-ml-py
    
    # Install specific engines (choose based on your needs)
    pip install vllm                    # For vLLM
    pip install deepspeed              # For DeepSpeed  
    pip install tensorrt-llm           # For TensorRT-LLM
  4. Setup system permissions:

    # For RAPL CPU power monitoring
    sudo chmod +r /sys/class/powercap/intel-rapl/*/energy_uj
    
    # For IPMI total power monitoring
    sudo modprobe ipmi_devintf
    sudo modprobe ipmi_si

Basic Usage

# Benchmark vLLM with Llama models
python llm_benchmark.py \
  --engine vllm \
  --models "TinyLlama/TinyLlama-1.1B-Chat-v1.0,meta-llama/Llama-2-7b-chat-hf" \
  --batch-sizes "128,256" \
  --output-tokens 500 \
  --num-samples 1024

# Benchmark all engines with different workloads
python llm_benchmark.py \
  --engine transformers \
  --models "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --batch-sizes "128,256" \
  --output-tokens 2000 \
  --num-samples 2048

πŸ“ˆ Sample Results

Our benchmarks on NVIDIA H100 GPUs show:

Engine Energy/Token (mJ) Setup Time (s) Peak Power (W)
vLLM 37 48.4 1382
TensorRT-LLM 91 30.2 1003
DeepSpeed 356 2.9 1063
Transformers 711 2.9 1055

Energy Per Token Results: (Heat Map):

Energy Per Token Results

The relationship between throughput and energy per token:

Energy Per Token Results

Results for Llama3.1-8B under High Throughput workload (BS: 256, Tokens: 2000)

πŸ“ Project Structure

LLM-Energy-Benchmark/
β”œβ”€β”€ llm_benchmark.py          # Main benchmarking script
β”œβ”€β”€ power_monitor.py          # Power monitoring utilities
β”œβ”€β”€ engines/
β”‚   β”œβ”€β”€ vllm_engine.py       # vLLM wrapper
β”‚   β”œβ”€β”€ deepspeed_engine.py  # DeepSpeed wrapper
β”‚   β”œβ”€β”€ transformer_engine.py # Transformers wrapper
β”‚   └── trtllm_engine.py     # TensorRT-LLM wrapper
β”œβ”€β”€ results/                 # Benchmark results (JSON)
β”œβ”€β”€ figures/                 # Research paper figures
└── README.md

πŸ”§ Advanced Configuration

Custom Datasets

# Modify prompts in llm_benchmark.py
prompts = load_alpaca_dataset(min_length=10, max_length=100)

Multiple GPU Setup

The tool automatically detects and utilizes multiple GPUs when available:

  • vLLM: Automatic tensor parallelism across GPUs
  • DeepSpeed: Distributed inference configuration
  • TensorRT-LLM: Multi-GPU optimization

Custom Metrics

Extend the PowerMonitor class to add custom metrics:

def custom_metric_calculation(self, duration, tokens):
    # Your custom energy efficiency metric
    return energy_per_complexity_unit

πŸ“Š Output Metrics

The tool provides comprehensive energy efficiency metrics:

Per-Token Metrics

  • Energy per token (J/token)
  • Component breakdown (GPU/CPU/DRAM)
  • Comparison across engines and models

Per-Response Metrics

  • Total energy per response
  • Latency vs energy trade-offs
  • Throughput efficiency analysis

System-Level Metrics

  • Power consumption over time
  • Component utilization patterns
  • Thermal and efficiency curves

🀝 Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-engine)
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

Adding New Engines

To add support for a new inference engine:

  1. Create a new engine wrapper in engines/
  2. Implement the required interface methods:
    • setup_model()
    • run_inference()
    • run_benchmark()
    • estimate_tokens()

πŸ”— Related Work

πŸ›Ÿ Support


Note: This tool requires specific hardware (NVIDIA GPUs, Intel CPUs with RAPL) and system permissions for accurate power monitoring. See the installation guide for detailed setup instructions.

About

A comprehensive benchmarking tool for measuring energy consumption and power efficiency of Large Language Model (LLM) inference engines including vLLM, DeepSpeed, TensorRT-LLM, and Transformers. Featured in HotCarbon '25.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages