Complete monitoring solution for vLLM with Prometheus and Grafana.
## Components

- vLLM: LLM inference engine with Qwen 2.5-3B model
- DCGM Exporter: NVIDIA GPU metrics
- Node Exporter: System metrics (CPU, RAM, disk)
- Prometheus: Metrics collection and storage
- Grafana: Visualization dashboards
## Prerequisites

- Docker and Docker Compose
- NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit
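To sanity-check the prerequisites before deploying, GPU passthrough can be tested with any CUDA-enabled image (the image tag below is just an example):

```bash
# Docker and Compose should both respond with a version string
docker --version
docker-compose --version

# The NVIDIA Container Toolkit should let a container see the GPU
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```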
## Quick Start

### Option 1: Docker Compose

```bash
docker-compose up -d
```

### Option 2: Portainer

- Go to Portainer UI → Stacks → Add stack
- Paste the contents of `docker-compose.yml`
- Deploy the stack
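After deploying, a quick status check confirms all services came up (`vllm-qwen3`, `prometheus`, and `grafana` are the container names referenced later in this README; the exporters' names come from the compose file):

```bash
# Each service should show State "Up" (vLLM may take a while to load the model)
docker-compose ps
```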
## Access Points

- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- vLLM API: http://localhost:8001
- vLLM Metrics: http://localhost:8001/metrics
- DCGM Exporter: http://localhost:9401/metrics
- Node Exporter: http://localhost:9100/metrics
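All of these can be checked in one pass with a small loop (a sketch; assumes the stack is already up):

```bash
# Print an HTTP status code for every endpoint in the stack
for url in \
  http://localhost:3000 \
  http://localhost:9090 \
  http://localhost:8001/metrics \
  http://localhost:9401/metrics \
  http://localhost:9100/metrics; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "$url -> HTTP $code"
done
```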
## Grafana Dashboards

- **NVIDIA DCGM Dashboard**
  - ID: 12239
  - Shows GPU utilization, memory, temperature, power
- **Node Exporter Full**
  - ID: 1860
  - Shows CPU, memory, disk, network metrics
- **vLLM Monitoring Dashboard**
  - Import from `grafana-dashboards/vllm-dashboard.json`
  - Shows request queue, token throughput, latency, cache usage
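The two community dashboards can be imported through the UI (Dashboards → Import → enter the ID) or scripted against Grafana's HTTP API. A sketch for the DCGM dashboard, assuming the default admin/admin credentials; the `DS_PROMETHEUS` input name and the `Prometheus` datasource value are assumptions that vary by dashboard and provisioning:

```bash
# Download the community dashboard JSON by its grafana.com ID
curl -s https://grafana.com/api/dashboards/12239/revisions/latest/download \
  -o dcgm-dashboard.json

# Import it via Grafana's dashboard import API, wiring up the datasource input
# (the input name/value below are assumptions; check the dashboard's __inputs)
curl -s -u admin:admin -X POST http://localhost:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -d "{\"dashboard\": $(cat dcgm-dashboard.json), \"overwrite\": true,
       \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\",
                     \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}"
```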
## Testing

Send a test request:

```bash
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'
```
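A single request barely registers on the dashboards; a short burst of concurrent requests makes the queue, throughput, and latency panels visibly move (a sketch; tune the count to your GPU):

```bash
# Fire 10 concurrent completions to generate visible load
for i in $(seq 1 10); do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-3B-Instruct",
         "prompt": "Write a haiku about GPUs.",
         "max_tokens": 100}' > /dev/null &
done
wait
```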
## Configuration

### vLLM

Modify the `command` of the `vllm` service in `docker-compose.yml` (see the sketch after this list for how the flags fit together):

- `--gpu-memory-utilization`: Fraction of GPU memory to use (default: 0.90)
- `--max-model-len`: Maximum context length (default: 4096)
- `--model`: Model to load
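For orientation, these flags sit on vLLM's OpenAI-compatible server invocation, roughly like this (a sketch, not the exact compose entry; this stack presumably maps host port 8001 to the container's port):

```bash
# Approximate shape of the server command the compose file configures
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-3B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```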
### Prometheus

- Scrape interval: 15s
- Retention: 15 days
- Config is embedded in `docker-compose.yml`
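With the stack running, metrics can also be queried straight from Prometheus. For example, token throughput (the `vllm:generation_tokens_total` counter is typical of recent vLLM versions; check `/metrics` for the exact name in yours):

```bash
# Tokens generated per second, averaged over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(vllm:generation_tokens_total[5m])'
```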
## Data Persistence

All data persists in Docker volumes and a local bind mount:

- `prometheus_data`: Prometheus metrics
- `grafana_data`: Grafana dashboards and settings
- `./hf_cache`: HuggingFace model cache
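Because the named volumes hold all metrics and dashboard state, they can be backed up with a throwaway container (a sketch; Compose may prefix volume names with the project name, so check `docker volume ls` first):

```bash
# Archive the Grafana volume into the current directory
docker run --rm \
  -v grafana_data:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/grafana_data.tar.gz -C /data .
```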
## Troubleshooting

Check container status:

```bash
docker ps
```

View logs:

```bash
docker logs vllm-qwen3
docker logs prometheus
docker logs grafana
```

Verify metrics endpoints:

```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check vLLM metrics
curl http://localhost:8001/metrics
```
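The raw targets payload is verbose; with `jq` installed, each scrape target's health can be summarized in one line (a sketch assuming `jq` is available):

```bash
# One line per target: scrape URL and whether the last scrape succeeded
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.scrapeUrl) \(.health)"'
```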
## Teardown

```bash
docker-compose down
```

To remove volumes as well:

```bash
docker-compose down -v
```