A comprehensive benchmarking tool for measuring energy consumption and power efficiency of Large Language Model (LLM) inference engines including vLLM, DeepSpeed, TensorRT-LLM, and Transformers. Featured in HotCarbon '25.
This tool provides the first systematic evaluation of power consumption across multiple LLM inference engines, offering:
- Multi-Engine Support: Benchmarks vLLM, DeepSpeed, TensorRT-LLM, and Hugging Face Transformers
- Fine-grained Analysis: Separates inference lifecycle into setup and token generation stages
- Component-wise Monitoring: Tracks GPU, CPU, and DRAM power consumption separately
- Comprehensive Metrics: Measures energy per token, per response, and per second
- Real-world Workloads: Uses Alpaca dataset for realistic benchmarking scenarios
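The per-token and per-second metrics all reduce to integrating sampled power over the measurement window. A minimal sketch of that computation (the function names and sampling format here are illustrative, not the tool's actual API):

```python
def energy_joules(samples, interval_s):
    """Integrate power samples (watts), taken at a fixed interval,
    into total energy (joules) via the trapezoidal rule."""
    if len(samples) < 2:
        return 0.0
    return sum((a + b) / 2 * interval_s for a, b in zip(samples, samples[1:]))

def energy_per_token(samples, interval_s, tokens):
    """Average energy cost of each generated token."""
    return energy_joules(samples, interval_s) / tokens

# Example: a constant 400 W across 10 samples at 1 s intervals
# spans 9 s, i.e. 3600 J; over 100 tokens that is 36 J/token.
```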
Our research reveals that:
- No single engine universally optimizes energy efficiency across all scenarios
- vLLM and TensorRT-LLM excel in token generation efficiency under high-throughput workloads
- Transformers and DeepSpeed are more efficient during the setup stage
- Higher throughput generally improves energy efficiency by amortizing fixed costs
- GPU optimization is critical as it dominates overall energy consumption (>50%)
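The amortization effect behind the throughput finding can be made concrete: total energy is a fixed setup cost plus a per-token marginal cost, so effective energy per token falls toward the marginal cost as more tokens share the overhead. A toy model (the numbers are illustrative, not measurements from the paper):

```python
def effective_energy_per_token(setup_j, marginal_j_per_token, tokens):
    """Amortized energy per token: E_total = E_setup + tokens * e_marginal,
    divided through by the token count."""
    return setup_j / tokens + marginal_j_per_token

# With a 1000 J setup cost and 0.1 J/token marginal cost, serving 10x more
# tokens shrinks the amortized setup share from 1.0 J/token to 0.1 J/token.
```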
- Setup Stage: Engine initialization and model loading
- Token Generation Stage: Actual inference execution
- GPU Power: Using NVIDIA Management Library (NVML)
- CPU Power: Using Intel RAPL (Running Average Power Limit)
- DRAM Power: Memory subsystem power consumption
- Total System Power: Using IPMI for complete system monitoring
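RAPL exposes a monotonically increasing energy counter in microjoules (`energy_uj`) that wraps around at a domain-specific maximum, so average power over an interval is the counter delta divided by elapsed time. A hedged sketch of that calculation (the wrap handling assumes the domain's `max_energy_range_uj` value has been read separately):

```python
def rapl_avg_power_w(e0_uj, e1_uj, dt_s, max_range_uj):
    """Average power (W) between two RAPL energy_uj readings,
    accounting for counter wraparound."""
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:                 # counter wrapped past max_energy_range_uj
        delta_uj += max_range_uj
    return delta_uj / 1e6 / dt_s     # microjoules -> joules -> watts
```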
- Standard Load: Batch size 128, 500 output tokens
- High Concurrency: Batch size 256, 500 output tokens
- High Throughput: Batch size 256, 2000 output tokens
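Since the three workloads differ only in batch size and output length, they can be expressed as a small configuration table (a sketch only; the tool's own configuration format may differ):

```python
# Workload presets from the benchmark design (names are illustrative)
WORKLOADS = {
    "standard_load":    {"batch_size": 128, "output_tokens": 500},
    "high_concurrency": {"batch_size": 256, "output_tokens": 500},
    "high_throughput":  {"batch_size": 256, "output_tokens": 2000},
}
```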
# System requirements
- Python 3.8+
- NVIDIA GPUs with CUDA support
- Intel CPUs with RAPL support (for CPU power monitoring)
- IPMI tools (for total system power monitoring)
Clone the repository:

```bash
git clone https://github.com/chenxuniu/LLM-Inference-Engine-Benchmark.git
cd LLM-Energy-Benchmark
```
Create a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate   # On Linux/macOS
# venv\Scripts\activate    # On Windows
```
Install dependencies:

```bash
pip install torch transformers datasets nvidia-ml-py

# Install specific engines (choose based on your needs)
pip install vllm          # For vLLM
pip install deepspeed     # For DeepSpeed
pip install tensorrt-llm  # For TensorRT-LLM
```
Set up system permissions:

```bash
# For RAPL CPU power monitoring
sudo chmod +r /sys/class/powercap/intel-rapl/*/energy_uj

# For IPMI total power monitoring
sudo modprobe ipmi_devintf
sudo modprobe ipmi_si
```
```bash
# Benchmark vLLM with Llama models
python llm_benchmark.py \
    --engine vllm \
    --models "TinyLlama/TinyLlama-1.1B-Chat-v1.0,meta-llama/Llama-2-7b-chat-hf" \
    --batch-sizes "128,256" \
    --output-tokens 500 \
    --num-samples 1024
```
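The flags above might be parsed roughly as follows (an `argparse` sketch of the assumed CLI surface, not the script's actual parser):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="LLM inference energy benchmark")
    p.add_argument("--engine",
                   choices=["vllm", "deepspeed", "tensorrt-llm", "transformers"])
    p.add_argument("--models", type=lambda s: s.split(","),
                   help="Comma-separated Hugging Face model IDs")
    p.add_argument("--batch-sizes", type=lambda s: [int(x) for x in s.split(",")])
    p.add_argument("--output-tokens", type=int)
    p.add_argument("--num-samples", type=int)
    return p

args = build_parser().parse_args(
    ["--engine", "vllm",
     "--models", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
     "--batch-sizes", "128,256",
     "--output-tokens", "500",
     "--num-samples", "1024"])
```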
```bash
# Benchmark all engines with different workloads
python llm_benchmark.py \
    --engine transformers \
    --models "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --batch-sizes "128,256" \
    --output-tokens 2000 \
    --num-samples 2048
```

Our benchmarks on NVIDIA H100 GPUs show:
| Engine | Energy/Token (mJ) | Setup Time (s) | Peak Power (W) |
|---|---|---|---|
| vLLM | 37 | 48.4 | 1382 |
| TensorRT-LLM | 91 | 30.2 | 1003 |
| DeepSpeed | 356 | 2.9 | 1063 |
| Transformers | 711 | 2.9 | 1055 |
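The energy-per-token figures translate directly into per-response energy: at 37 mJ/token, a 500-token response costs about 18.5 J on vLLM, versus roughly 355 J on Transformers at 711 mJ/token:

```python
def energy_per_response_j(mj_per_token, tokens):
    """Convert a per-token energy figure (mJ) into whole-response energy (J)."""
    return mj_per_token * tokens / 1000  # mJ -> J

# Using the table above, for a 500-token response:
# vLLM:         energy_per_response_j(37, 500)
# Transformers: energy_per_response_j(711, 500)
```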
Energy per Token Results (Heat Map):

The relationship between throughput and energy per token:

Results for Llama3.1-8B under the High Throughput workload (BS: 256, Tokens: 2000)
```
LLM-Energy-Benchmark/
├── llm_benchmark.py          # Main benchmarking script
├── power_monitor.py          # Power monitoring utilities
├── engines/
│   ├── vllm_engine.py        # vLLM wrapper
│   ├── deepspeed_engine.py   # DeepSpeed wrapper
│   ├── transformer_engine.py # Transformers wrapper
│   └── trtllm_engine.py      # TensorRT-LLM wrapper
├── results/                  # Benchmark results (JSON)
├── figures/                  # Research paper figures
└── README.md
```
```python
# Modify prompts in llm_benchmark.py
prompts = load_alpaca_dataset(min_length=10, max_length=100)
```

The tool automatically detects and utilizes multiple GPUs when available:
- vLLM: Automatic tensor parallelism across GPUs
- DeepSpeed: Distributed inference configuration
- TensorRT-LLM: Multi-GPU optimization
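One common way such detection works is to size tensor parallelism from the number of visible GPUs; a simplified sketch using the `CUDA_VISIBLE_DEVICES` convention (the helper name is illustrative, not part of the tool):

```python
import os

def visible_gpu_count(default=1):
    """Count GPUs exposed via CUDA_VISIBLE_DEVICES; fall back to a default
    when the variable is unset."""
    devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if devices is None:
        return default
    devices = devices.strip()
    return 0 if not devices else len(devices.split(","))

# An engine wrapper could then pass this as e.g. a tensor-parallel degree.
```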
Extend the PowerMonitor class to add custom metrics:
```python
def custom_metric_calculation(self, duration, tokens):
    # Your custom energy efficiency metric
    return energy_per_complexity_unit
```

The tool provides comprehensive energy efficiency metrics:
- Energy per token (J/token)
- Component breakdown (GPU/CPU/DRAM)
- Comparison across engines and models
- Total energy per response
- Latency vs energy trade-offs
- Throughput efficiency analysis
- Power consumption over time
- Component utilization patterns
- Thermal and efficiency curves
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-engine`)
- Make your changes
- Add tests for new functionality
- Submit a pull request
To add support for a new inference engine:
- Create a new engine wrapper in `engines/`
- Implement the required interface methods: `setup_model()`, `run_inference()`, `run_benchmark()`, `estimate_tokens()`
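The required interface can be sketched as an abstract base class (the method names follow the list above; the signatures are assumptions, since the repository defines the actual ones):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class InferenceEngine(ABC):
    """Assumed shape of an engine wrapper under engines/."""

    @abstractmethod
    def setup_model(self, model_name: str) -> None:
        """Load the model and initialize the engine runtime."""

    @abstractmethod
    def run_inference(self, prompts: List[str], output_tokens: int) -> List[str]:
        """Generate completions for a batch of prompts."""

    @abstractmethod
    def run_benchmark(self, prompts: List[str], batch_size: int,
                      output_tokens: int) -> Dict:
        """Run the timed, power-monitored benchmark loop."""

    @abstractmethod
    def estimate_tokens(self, responses: List[str]) -> int:
        """Estimate how many tokens were generated."""
```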
- vLLM - High-throughput LLM serving
- DeepSpeed - Distributed training and inference
- TensorRT-LLM - NVIDIA optimized inference
- Transformers - Hugging Face model library
- Email: ncxhxgtg@gmail.com
Note: This tool requires specific hardware (NVIDIA GPUs, Intel CPUs with RAPL) and system permissions for accurate power monitoring. See the installation guide for detailed setup instructions.

