Complete monitoring solution for vLLM with Prometheus and Grafana.
## Components

- vLLM: LLM inference engine with Qwen 2.5-3B model
- DCGM Exporter: NVIDIA GPU metrics
- Node Exporter: System metrics (CPU, RAM, disk)
- Prometheus: Metrics collection and storage
- Grafana: Visualization dashboards
## Prerequisites

- Docker and Docker Compose
- NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit
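To sanity-check the prerequisites before deploying, GPU passthrough can be tested with any CUDA-enabled image (the image tag below is just an example):

```bash
# Docker and Compose should both respond with a version string
docker --version
docker-compose --version

# The NVIDIA Container Toolkit should let a container see the GPU
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```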
## Quick Start

### Option 1: Docker Compose

```bash
docker-compose up -d
```

### Option 2: Portainer

- Go to Portainer UI → Stacks → Add stack
- Paste the contents of `docker-compose.yml`
- Deploy the stack
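After deploying, a quick status check confirms all services came up (`vllm-qwen3`, `prometheus`, and `grafana` are the container names referenced later in this README; the exporters' names come from the compose file):

```bash
# Each service should show State "Up" (vLLM may take a while to load the model)
docker-compose ps
```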
## Access Points

- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- vLLM API: http://localhost:8001
- vLLM Metrics: http://localhost:8001/metrics
- DCGM Exporter: http://localhost:9401/metrics
- Node Exporter: http://localhost:9100/metrics
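All of these can be checked in one pass with a small loop (a sketch; assumes the stack is already up):

```bash
# Print an HTTP status code for every endpoint in the stack
for url in \
  http://localhost:3000 \
  http://localhost:9090 \
  http://localhost:8001/metrics \
  http://localhost:9401/metrics \
  http://localhost:9100/metrics; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$url -> HTTP $code"
done
```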
## Grafana Dashboards

- **NVIDIA DCGM Dashboard**
  - ID: 12239
  - Shows GPU utilization, memory, temperature, power
- **Node Exporter Full**
  - ID: 1860
  - Shows CPU, memory, disk, network metrics
- **vLLM Monitoring Dashboard**
  - Import from `grafana-dashboards/vllm-dashboard.json`
  - Shows request queue, token throughput, latency, cache usage
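The two community dashboards can be imported through the UI (Dashboards → Import → enter the ID) or scripted against Grafana's HTTP API. A sketch for the DCGM dashboard, assuming the default admin/admin credentials; the `DS_PROMETHEUS` input name and the `Prometheus` datasource value are assumptions that vary by dashboard and provisioning:

```bash
# Download the community dashboard JSON by its grafana.com ID
curl -s https://grafana.com/api/dashboards/12239/revisions/latest/download \
  -o dcgm-dashboard.json

# Import it via Grafana's dashboard import API, wiring up the datasource input
# (the input name/value below are assumptions; check the dashboard's __inputs)
curl -s -u admin:admin -X POST http://localhost:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -d "{\"dashboard\": $(cat dcgm-dashboard.json), \"overwrite\": true,
       \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\",
                     \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}"
```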
## Testing

Send a test request:

```bash
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'
```
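A single request barely registers on the dashboards; a short burst of concurrent requests makes the queue, throughput, and latency panels visibly move (a sketch; tune the count to your GPU):

```bash
# Fire 10 concurrent completions to generate visible load
for i in $(seq 1 10); do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-3B-Instruct",
         "prompt": "Write a haiku about GPUs.",
         "max_tokens": 100}' > /dev/null &
done
wait
```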
## Configuration

### vLLM

Modify the `command` of the `vllm` service in `docker-compose.yml` (see the sketch after this list for how the flags fit together):

- `--gpu-memory-utilization`: Fraction of GPU memory to use (default: 0.90)
- `--max-model-len`: Maximum context length (default: 4096)
- `--model`: Model to load
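For orientation, these flags sit on vLLM's OpenAI-compatible server invocation, roughly like this (a sketch, not the exact compose entry; this stack presumably maps host port 8001 to the container's port):

```bash
# Approximate shape of the server command the compose file configures
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-3B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```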
### Prometheus

- Scrape interval: 15s
- Retention: 15 days
- Config is embedded in `docker-compose.yml`
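With the stack running, metrics can also be queried straight from Prometheus. For example, token throughput (the `vllm:generation_tokens_total` counter is typical of recent vLLM versions; check `/metrics` for the exact name in yours):

```bash
# Tokens generated per second, averaged over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(vllm:generation_tokens_total[5m])'
```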
## Data Persistence

All data persists in Docker volumes and a local bind mount:

- `prometheus_data`: Prometheus metrics
- `grafana_data`: Grafana dashboards and settings
- `./hf_cache`: HuggingFace model cache
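Because the named volumes hold all metrics and dashboard state, they can be backed up with a throwaway container (a sketch; Compose may prefix volume names with the project name, so check `docker volume ls` first):

```bash
# Archive the Grafana volume into the current directory
docker run --rm \
  -v grafana_data:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/grafana_data.tar.gz -C /data .
```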
## Troubleshooting

Check container status:

```bash
docker ps
```

View logs:

```bash
docker logs vllm-qwen3
docker logs prometheus
docker logs grafana
```

Verify metrics endpoints:

```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check vLLM metrics
curl http://localhost:8001/metrics
```
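The raw targets payload is verbose; with `jq` installed, each scrape target's health can be summarized in one line (a sketch assuming `jq` is available):

```bash
# One line per target: scrape URL and whether the last scrape succeeded
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.scrapeUrl) \(.health)"'
```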
## Teardown

```bash
docker-compose down
```

To remove volumes as well:

```bash
docker-compose down -v
```