A production-ready system for running multiple vLLM models with hot-swapping capability using vLLM's sleep mode feature. This enables efficient GPU memory management by putting inactive models to sleep (offloading to CPU RAM or discarding weights) while keeping the active model on GPU.
Current vLLM Version: v0.13.0
- Qwen3-VL Enhancements: EVS (Efficient Video Sampling) support and embeddings API
- Performance: 3x faster Whisper inference, DeepSeek V3.2 optimizations (up to 10.7% TTFT improvement)
- Quantization: W4A8 grouped GEMM on Hopper GPUs, MoE + LoRA support for AWQ/GPTQ
- Hardware: NVIDIA Blackwell Ultra (GB300) support with CUDA 13
- API: Model Context Protocol (MCP) integration, binary embeddings format
- 2025-12-30: Upgraded to vLLM v0.13.0
- 2025-12-02: Bumped vLLM to v0.11.2 across all model containers
- 2025-11-27: Added comprehensive monitoring stack (Prometheus, Grafana, Loki, DCGM)
- 2025-11-26: Refactored to use `config.yaml` as the single source of truth for port mappings
- 2025-11-25: Added Qwen3-Next-80B-A3B-Thinking model support
# 1. Install dependencies
sudo apt install yq # YAML parser for bootstrap script
# 2. Run bootstrap script (handles proper startup sequence)
./bootstrap.sh
# This will:
# - Load and cache all models (one at a time to avoid OOM)
# - Put all models to sleep
# - Wake only the default model
# - Start Model Manager and WebUI

Access points:
- Model Manager API: http://localhost:9000
- Open WebUI: http://localhost:3000
- Grafana Dashboard: http://localhost:3001 (admin/admin)
- Prometheus: http://localhost:9091
- vLLM Qwen3-VL-30B: http://localhost:8001
- vLLM Qwen3-VL-32B: http://localhost:8002
- vLLM Qwen3-Next-80B: http://localhost:8003
- vLLM GPT-OSS-20B: http://localhost:8004
# Check current status
curl -s http://localhost:9000/models | jq
# Switch to GPT-OSS
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id": "gpt-oss-20b"}' | jq
# Switch to Qwen3-VL-32B
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id": "qwen3-vl-32b"}' | jq

┌─────────────────┐
│ Open WebUI │ (Port 3000)
└────────┬────────┘
│
▼
┌─────────────────┐ ┌──────────────────────────┐
│ Model Manager │────▶│ vLLM Qwen3-VL-30B │ (Port 8001)
│ (Go Service) │ │ [Active/Sleep] │
│ Port 9000 │ └──────────────────────────┘
└─────────────────┘ ┌──────────────────────────┐
│ vLLM Qwen3-VL-32B │ (Port 8002)
│ [Active/Sleep] │
└──────────────────────────┘
┌──────────────────────────┐
│ vLLM Qwen3-Next-80B │ (Port 8003)
│ [Active/Sleep] │
└──────────────────────────┘
┌──────────────────────────┐
│ vLLM GPT-OSS-20B │ (Port 8004)
│ [Active/Sleep] │
└──────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Monitoring Stack │
├──────────────────┬──────────────┬──────────────┬─────────────┤
│ Prometheus │ Grafana │ Loki │ DCGM Export │
│ (Port 9091) │ (Port 3001) │ (Port 3100) │ (Port 9400) │
│ Metrics Store │ Dashboards │ Log Aggreg. │ GPU Metrics │
└──────────────────┴──────────────┴──────────────┴─────────────┘
│ │ │ │
└────────────────┴──────────────┴──────────────┘
│
┌────────┴─────────┐
│ Promtail │
│ Log Collector │
└──────────────────┘
homeGPT/
├── model-manager/ # Go service for model switching
│ ├── cmd/switcher/ # Main application entry point
│ ├── internal/ # Internal packages
│ │ ├── config/ # Configuration loading
│ │ ├── handlers/ # HTTP handlers (Gin)
│ │ ├── switcher/ # Core switching logic
│ │ └── vllm/ # vLLM HTTP client
│ ├── pkg/models/ # Shared data structures
│ ├── config.yaml # Model definitions & settings
│ ├── Dockerfile # Multi-stage build
│ └── go.mod # Go dependencies
├── docker/ # Modular Docker Compose files
│ ├── docker-compose.yml # Main orchestration (includes all)
│ ├── compose-model-manager.yml
│ ├── compose-monitoring.yml # Monitoring stack
│ ├── compose-vllm-qwen3-vl-30b-a3b.yml
│ ├── compose-vllm-qwen3-vl-32b.yml
│ ├── compose-vllm-qwen3-next-80b-a3b-thinking.yml
│ ├── compose-vllm-gpt-oss-20b.yml
│ └── compose-webui.yml
├── monitoring/ # Monitoring configuration
│ ├── prometheus.yml # Prometheus config & scrape targets
│ ├── loki-config.yml # Loki log aggregation config
│ ├── promtail-config.yml # Promtail log collection config
│ ├── grafana-datasources.yml # Grafana data sources
│ ├── grafana-dashboards.yml # Grafana dashboard provisioning
│ └── grafana/
│ └── dashboards/
│ ├── gpu-metrics.json # GPU monitoring dashboard
│ └── vllm-metrics.json # vLLM performance dashboard
├── logs/ # vLLM server logs
│ ├── qwen3-vl-30b-a3b/
│ ├── qwen3-vl-32b/
│ ├── qwen3-next-80b-a3b-thinking/
│ └── gpt-oss-20b/
├── config.yaml # Model Manager configuration
├── vllm-logging.json # vLLM logging configuration
├── bootstrap.sh # Automated startup script
└── test-switcher.sh # API test script
- Port: 9000
- Language: Go 1.21+
- Framework: Gin web framework
- Function: Orchestrates model switching by calling vLLM sleep/wake endpoints
- Features:
- Thread-safe concurrent request handling (sync.RWMutex)
- Memory-aware sleep level selection (Level 1 vs Level 2)
- Health monitoring with configurable retry logic
- RESTful HTTP API
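For illustration only, here is a minimal Gin sketch of the `/switch` handler; the `Switcher` interface is a stand-in for the real logic in `internal/switcher`, and none of this is the project's actual handler code.

```go
// Hypothetical sketch: how a Gin /switch handler could wrap the switcher.
// Names are stand-ins, not the repository's actual code.
package handlers

import (
	"context"
	"net/http"

	"github.com/gin-gonic/gin"
)

// Switcher stands in for the real switching logic in internal/switcher.
type Switcher interface {
	SwitchModel(ctx context.Context, modelID string) error
}

type switchRequest struct {
	ModelID string `json:"model_id" binding:"required"`
}

// RegisterRoutes wires the documented /health and /switch endpoints.
func RegisterRoutes(r *gin.Engine, s Switcher) {
	r.GET("/health", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"status": "ok"})
	})
	r.POST("/switch", func(c *gin.Context) {
		var req switchRequest
		if err := c.ShouldBindJSON(&req); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		if err := s.SwitchModel(c.Request.Context(), req.ModelID); err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
			return
		}
		c.JSON(http.StatusOK, gin.H{"active_model": req.ModelID})
	})
}
```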
- Prometheus (Port 9091) - Metrics collection and storage
- Scrapes vLLM metrics endpoints
- Collects GPU metrics from DCGM Exporter
- Stores time-series data for dashboards
- Grafana (Port 3001) - Visualization and dashboards
- Pre-configured dashboards for GPU metrics and vLLM performance
- Default credentials: admin/admin
- Auto-provisioned data sources (Prometheus, Loki)
- Loki (Port 3100) - Log aggregation
- Centralized log storage for all vLLM instances
- Queryable through Grafana
- Promtail - Log collector
- Scrapes logs from the `logs/` directory
- Ships logs to Loki with metadata labels
- DCGM Exporter (Port 9400) - NVIDIA GPU metrics
- Exposes detailed GPU utilization, memory, temperature
- Compatible with Prometheus
- `GET /health` - Service health check
- `GET /models` - List all models with current status (active/sleeping/switching/error)
- `POST /switch` - Switch to a different model. Request body: `{ "model_id": "qwen3-vl-30b" }`
vLLM v0.13.0 supports two sleep levels:
- Level 1: Offload model weights to CPU RAM (fast wake-up, requires RAM)
- Level 2: Discard model weights entirely (slow wake-up, no RAM needed)
The model manager automatically selects the appropriate level based on available RAM (configured in model-manager/config.yaml).
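The selection rule itself is simple. The `determineSleepLevel()` helper is named in the development notes further down; its real signature is not documented here, so the following is only a hedged sketch of the RAM-based rule (Level 1 when at least 64 GB is configured, Level 2 otherwise):

```go
// Hedged sketch of the RAM-based rule; the real determineSleepLevel() in
// internal/switcher/switcher.go may take different arguments.
package switcher

func determineSleepLevel(availableRAMGB float64) int {
	if availableRAMGB >= 64 {
		return 1 // offload weights to CPU RAM (fast wake-up)
	}
	return 2 // discard weights entirely (slow wake-up, no RAM needed)
}
```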
IMPORTANT: Development Mode Requirements
Sleep mode endpoints are ONLY available when running vLLM in development mode:
- Environment variable: `VLLM_SERVER_DEV_MODE=1` ✅ (configured in Docker Compose)
- Server flag: `--enable-sleep-mode` ✅ (configured in Docker Compose)
These endpoints should NOT be exposed to end users in production according to vLLM documentation.
Only available with VLLM_SERVER_DEV_MODE=1 and --enable-sleep-mode:
- `POST /sleep?level=1` or `POST /sleep?level=2` - Put model to sleep
- `POST /wake_up` - Wake up a sleeping model
- `GET /is_sleeping` - Check if the model is currently sleeping
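For reference, the Go side of these calls can be sketched as below. The project's real client is `internal/vllm/client.go`; the JSON field name in the `/is_sleeping` response is an assumption, so treat this as illustrative only.

```go
// Hedged sketch of a vLLM sleep-mode client; see internal/vllm/client.go
// for the project's actual implementation.
package vllm

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

type Client struct {
	BaseURL string // e.g. http://vllm-qwen3-vl-32b:8000
	HTTP    *http.Client
}

// Sleep puts the model to sleep at the given level (1 or 2).
func (c *Client) Sleep(ctx context.Context, level int) error {
	return c.post(ctx, fmt.Sprintf("/sleep?level=%d", level))
}

// WakeUp wakes a sleeping model.
func (c *Client) WakeUp(ctx context.Context) error {
	return c.post(ctx, "/wake_up")
}

// IsSleeping reports whether the model is currently sleeping.
// The "is_sleeping" JSON field is an assumption; verify against your vLLM version.
func (c *Client) IsSleeping(ctx context.Context) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.BaseURL+"/is_sleeping", nil)
	if err != nil {
		return false, err
	}
	resp, err := c.httpClient().Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var body struct {
		IsSleeping bool `json:"is_sleeping"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.IsSleeping, nil
}

func (c *Client) post(ctx context.Context, path string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.BaseURL+path, nil)
	if err != nil {
		return err
	}
	resp, err := c.httpClient().Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("vLLM returned %s for %s", resp.Status, path)
	}
	return nil
}

func (c *Client) httpClient() *http.Client {
	if c.HTTP != nil {
		return c.HTTP
	}
	return http.DefaultClient
}
```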
Follow these steps to add a new vLLM model to the system. Example: Adding Qwen3-VL-32B-Instruct-FP8.
Create docker/compose-vllm-<model-name>.yml following the naming convention:
services:
vllm-<model-name>:
image: vllm/vllm-openai:v0.13.0 # Pinned version for stability
container_name: vllm-<model-name>
restart: unless-stopped
command: >
<HuggingFace/Model-Name> # Model identifier from HuggingFace
--gpu-memory-utilization 0.90 # Adjust based on model size
--max-model-len 262144 # Context window (adjust as needed)
--max-num-batched-tokens 49152 # Batch size (adjust as needed)
--kv-cache-dtype fp8 # Use fp8 for memory efficiency
--tensor-parallel-size 2 # Number of GPUs (1 or 2)
--enable-chunked-prefill # Enable for long context
--enable-sleep-mode # Required for hot-swapping
--enable-auto-tool-choice # Enable tool calling if supported
--tool-call-parser hermes # Tool parser (if applicable)
ports:
- "<host-port>:8000" # Choose next available host port
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2 # Match tensor-parallel-size
capabilities: [gpu]
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
- ../vllm-logging.json:/logs/logging.json:ro
- ../logs/<model-name>:/logs
- ~/.cache/vllm:/root/.cache/vllm
- ~/.cache/triton:/root/.triton/cache
- ~/.cache/flashinfer:/root/.cache/flashinfer
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
- VLLM_SERVER_DEV_MODE=1 # Required for sleep mode
- CUDA_VISIBLE_DEVICES=0,1 # GPU indices (adjust as needed)
- VLLM_HOST_IP=127.0.0.1
- VLLM_CONFIGURE_LOGGING=1
- VLLM_LOGGING_CONFIG_PATH=/logs/logging.json
ipc: host
networks:
      - homegpt-network

Example (Qwen3-VL-32B):
services:
vllm-qwen3-vl-32b:
image: vllm/vllm-openai:v0.13.0
container_name: vllm-qwen3-vl-32b
restart: unless-stopped
command: >
Qwen/Qwen3-VL-32B-Instruct-FP8
--gpu-memory-utilization 0.90
--max-model-len 262144
--max-num-batched-tokens 49152
--kv-cache-dtype fp8
--tensor-parallel-size 2
--enable-chunked-prefill
--enable-sleep-mode
--enable-auto-tool-choice
--tool-call-parser hermes
ports:
- "8002:8000"
  # ... rest of config

Create the logs directory with a basic logging.json:
mkdir -p logs/<model-name>
echo '{}' > logs/<model-name>/logging.json

Edit docker/docker-compose.yml to include the new model service:
name: home-gpt
include:
- compose-model-manager.yml
- compose-vllm-qwen3-vl-30b-a3b.yml
- compose-vllm-qwen3-vl-32b.yml # Add new model here
- compose-vllm-qwen3-next-80b-a3b-thinking.yml
- compose-vllm-gpt-oss-20b.yml
- compose-webui.yml
networks:
homegpt-network:
    driver: bridge

Edit config.yaml to add the model configuration:
models:
# ... existing models ...
- id: <model-id> # Short identifier (e.g., qwen3-vl-32b)
name: "<Display Name>" # Human-readable name
container_name: "vllm-<model-name>" # Must match Docker service name
port: 8000 # Internal container port (always 8000)
host_port: <unique-port> # External port (8001, 8002, 8003, etc.)
gpu_memory_gb: <estimated-gb> # Approximate GPU memory usage
    startup_mode: disabled # Options: disabled | sleep | active

Example:
- id: qwen3-vl-32b
name: "Qwen 3 VL 32B (FP8)"
container_name: "vllm-qwen3-vl-32b"
port: 8000
host_port: 8002
gpu_memory_gb: 60.0
    startup_mode: disabled

Port Assignment Guidelines:
- Group related models together (e.g., VL models first, then text models)
- Use sequential ports (8001, 8002, 8003, 8004, ...)
- Document the port mapping in comments if needed
Edit docker/compose-webui.yml to expose the new model endpoint:
environment:
- OPENAI_API_BASE_URLS=http://vllm-model1:8000/v1;http://vllm-model2:8000/v1;http://vllm-newmodel:8000/v1
  - OPENAI_API_KEYS=dummy;dummy;dummy # Add one 'dummy' per model

Example:
environment:
- OPENAI_API_BASE_URLS=http://vllm-qwen3-vl-30b-a3b:8000/v1;http://vllm-qwen3-vl-32b:8000/v1;http://vllm-gpt-oss-20b:8000/v1;http://vllm-qwen3-next-80b-a3b-thinking:8000/v1
  - OPENAI_API_KEYS=dummy;dummy;dummy;dummy

cd docker
# Start the new model (it will download on first run)
docker compose up -d vllm-<model-name>
# Monitor logs to track download and initialization
docker compose logs -f vllm-<model-name>
# Restart WebUI to pick up the new endpoint
docker compose restart webui
# Optionally restart model-manager to reload config
docker compose restart model-manager

# Check model status
curl -s http://localhost:9000/models | jq
# Switch to the new model
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id": "<model-id>"}' | jq
# Verify it's active
curl -s http://localhost:9000/models | jq '.[] | select(.id=="<model-id>")'

GPU Memory Utilization:
- `0.90` - Default, safe for most models
- `0.95` - Aggressive, use for maximum context
- `0.85` - Conservative, if experiencing OOM
Context Window (max-model-len):
- `262144` - Ultra-long context (256K tokens)
- `131072` - Long context (128K tokens)
- `32768` - Standard (32K tokens)
- Reduce if experiencing OOM errors
Tensor Parallelism:
- `--tensor-parallel-size 1` - Single GPU
- `--tensor-parallel-size 2` - Two GPUs (common for 30B+ models)
- `--tensor-parallel-size 4` - Four GPUs (for 70B+ models)
Quantization Support:
- FP8/FP16 models work out of the box
- AWQ/GPTQ models are supported
- Adjust the `--quantization` flag if needed
Model download fails:
# Check HuggingFace token
echo $HF_TOKEN
# Pre-download manually
huggingface-cli login
huggingface-cli download Org/Model-Name

Out of memory during startup:
- Reduce `--gpu-memory-utilization`
- Reduce `--max-model-len`
- Increase `--tensor-parallel-size` (use more GPUs)
- Use a quantized version (AWQ/GPTQ/FP8)
Container won't start:
# Check detailed logs
docker compose logs vllm-<model-name>
# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu24.04 nvidia-smi
# Check port conflicts
netstat -tulpn | grep <host-port>

Model switches but doesn't respond:
- Ensure `VLLM_SERVER_DEV_MODE=1` is set
- Ensure `--enable-sleep-mode` is in the command
- Check the health endpoint: `curl http://localhost:<host-port>/health`
- Verify the model is awake: `curl http://localhost:<host-port>/is_sleeping`
Go Code Organization:
- `cmd/switcher/main.go` - Application entry point, initializes the Gin router
- `internal/config/` - Loads `config.yaml` into Go structs
- `internal/vllm/client.go` - HTTP client for the vLLM API (Sleep, WakeUp, Health, IsSleeping)
- `internal/switcher/switcher.go` - Core logic: SwitchModel orchestrates the sleep→wake→update flow
- `internal/handlers/` - Gin HTTP handlers wrapping switcher methods
- `pkg/models/` - Shared data structures (Model, Config, StatusEnum)
Key Implementation Details:
- Thread safety via `sync.RWMutex` in `switcher.go`
- Sleep level selection: checks the `available_ram_gb` config to decide between Level 1 and Level 2
- Health check retry logic: configurable `max_retries` and `health_check_interval_seconds`
- Error handling: errors are returned up the stack; handlers convert them to HTTP status codes
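The retry behavior can be pictured with a short sketch (field names follow `max_retries` and `health_check_interval_seconds` from `config.yaml`; the real loop in the switcher package may differ):

```go
// Hedged sketch of the health-check retry loop; not the project's actual code.
package switcher

import (
	"context"
	"fmt"
	"time"
)

// healthChecker stands in for the vLLM client's health call.
type healthChecker interface {
	Health(ctx context.Context) error
}

// waitUntilHealthy polls Health every intervalSeconds, up to maxRetries times,
// mirroring max_retries and health_check_interval_seconds in config.yaml.
func waitUntilHealthy(ctx context.Context, hc healthChecker, maxRetries, intervalSeconds int) error {
	interval := time.Duration(intervalSeconds) * time.Second
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err := hc.Health(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return fmt.Errorf("model not healthy after %d retries", maxRetries)
}
```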
cd model-manager
go mod download
go build -o switcher ./cmd/switcher
./switcher  # Runs on port 9000

Or with Docker:
cd docker
docker compose build model-manager

# Start services
cd docker
docker compose up -d
# Run tests
cd ..
./test-switcher.sh
# Manual testing
curl http://localhost:9000/health
curl http://localhost:9000/models
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id":"gpt-oss-20b"}'

The core switching algorithm is in internal/switcher/switcher.go:
func (s *Switcher) SwitchModel(ctx context.Context, targetModelID string) error {
// 1. Validate target model exists
// 2. Lock for exclusive access (mutex)
// 3. Find currently active model
// 4. Sleep active model (determine level based on RAM)
// 5. Wake up target model
// 6. Wait for health check to pass
// 7. Update model statuses
// 8. Return success/error
}

To modify behavior:
- Change sleep level logic: edit the `determineSleepLevel()` method
- Adjust retry behavior: edit `max_retries` in `config.yaml`
- Add new endpoints: add methods to `internal/handlers/handlers.go`
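Compressed into a self-contained example with hypothetical stand-in types, the flow above looks roughly like this; `internal/switcher/switcher.go` remains the source of truth.

```go
// Hypothetical, self-contained compression of the switch flow; the real types
// and helpers in internal/switcher differ.
package switchexample

import (
	"context"
	"fmt"
	"sync"
)

type endpoint interface { // stand-in for the vLLM client
	Sleep(ctx context.Context, level int) error
	WakeUp(ctx context.Context) error
	Health(ctx context.Context) error
}

type model struct {
	ID     string
	Active bool
	API    endpoint
}

type switcher struct {
	mu             sync.RWMutex
	models         map[string]*model
	availableRAMGB float64
}

func (s *switcher) SwitchModel(ctx context.Context, targetID string) error {
	target, ok := s.models[targetID] // 1. validate the target model exists
	if !ok {
		return fmt.Errorf("unknown model %q", targetID)
	}

	s.mu.Lock() // 2. only one switch at a time
	defer s.mu.Unlock()

	for _, m := range s.models { // 3. find the currently active model
		if m.Active && m.ID != targetID {
			level := 2 // 4. pick the sleep level from available RAM
			if s.availableRAMGB >= 64 {
				level = 1
			}
			if err := m.API.Sleep(ctx, level); err != nil {
				return fmt.Errorf("sleeping %s: %w", m.ID, err)
			}
			m.Active = false
		}
	}

	if err := target.API.WakeUp(ctx); err != nil { // 5. wake the target model
		return fmt.Errorf("waking %s: %w", targetID, err)
	}
	if err := target.API.Health(ctx); err != nil { // 6. health check (retry loop omitted here)
		return err
	}

	target.Active = true // 7. update statuses
	return nil           // 8. success
}
```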
models:
- id: "qwen3-vl-30b-a3b" # Unique identifier
name: "Qwen 3 VL 30B A3B (MoE AWQ)" # Display name
container_name: "vllm-qwen3-vl-30b-a3b" # Docker container name
port: 8000 # Internal container port
host_port: 8001 # External host-mapped port
gpu_memory_gb: 57.0 # GPU memory usage estimate
startup_mode: active # Initial status: disabled/sleep/active
- id: "qwen3-vl-32b"
name: "Qwen 3 VL 32B (FP8)"
container_name: "vllm-qwen3-vl-32b"
port: 8000
host_port: 8002
gpu_memory_gb: 60.0
startup_mode: disabled
# ... more models ...
switching:
available_ram_gb: 128.0 # Total RAM for sleep level decision
max_retries: 450 # Health check retries (15 min max)
  health_check_interval_seconds: 2     # Seconds between retries

Sleep Level Selection:
- If `available_ram_gb >= 64`: use Level 1 (offload to CPU RAM)
- Otherwise: use Level 2 (discard weights)
Startup Modes:
- `disabled`: Container not started at all
- `sleep`: Container started, model loaded, immediately put to sleep
- `active`: Container started, model loaded, and ready to serve (only ONE model should be active)
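For reference, the YAML above maps naturally onto Go structs along these lines. Field names and YAML tags are inferred from the config shown here (and `gopkg.in/yaml.v3` is used only as an example), not copied from the repo:

```go
// Hedged sketch of config structs matching config.yaml; the real definitions
// live in internal/config and pkg/models and may differ.
package config

import (
	"os"

	"gopkg.in/yaml.v3"
)

type Model struct {
	ID            string  `yaml:"id"`
	Name          string  `yaml:"name"`
	ContainerName string  `yaml:"container_name"`
	Port          int     `yaml:"port"`
	HostPort      int     `yaml:"host_port"`
	GPUMemoryGB   float64 `yaml:"gpu_memory_gb"`
	StartupMode   string  `yaml:"startup_mode"` // disabled | sleep | active
}

type Switching struct {
	AvailableRAMGB             float64 `yaml:"available_ram_gb"`
	MaxRetries                 int     `yaml:"max_retries"`
	HealthCheckIntervalSeconds int     `yaml:"health_check_interval_seconds"`
}

type Config struct {
	Models    []Model   `yaml:"models"`
	Switching Switching `yaml:"switching"`
}

// Load reads and parses the configuration file.
func Load(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```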
- OS: Linux (Ubuntu 24.04 LTS recommended)
- RAM: 64GB+ (128GB recommended for Level 1 sleep)
- GPU: NVIDIA GPU with 24GB+ VRAM (RTX 4090, 5090, A100, etc.)
- Storage: 100GB+ free space for model caching
- Docker: 23.0+
- NVIDIA Container Toolkit: Latest version
- Go: 1.21+ (for development)
# Check logs
docker compose logs model-manager
# Common issues:
# - config.yaml syntax error
# - Port 9000 already in use
# - Network not created

# Check vLLM instance logs
docker compose logs vllm-qwen
docker compose logs vllm-gptoss
# Verify sleep mode is enabled
curl http://localhost:8001/is_sleeping
curl http://localhost:8002/is_sleeping
# Ensure VLLM_SERVER_DEV_MODE=1 is set

# Pre-download models
huggingface-cli download QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
huggingface-cli download openai/gpt-oss-20b
# Or set HF_TOKEN in .env file and let vLLM download on first run

- Reduce `--gpu-memory-utilization` from 0.95 to 0.85
- Reduce `--max-model-len` to limit the context window
- Use smaller quantized models (AWQ, GPTQ)
- Ensure sufficient RAM for Level 1 sleep (128GB recommended)
# Check yq is installed
yq --version
# Install yq if needed
sudo apt install yq
# Run with verbose output
bash -x ./bootstrap.sh

The bootstrap.sh script handles the complex startup sequence required to run multiple vLLM models on a single GPU:
Phase 1: Sequential Model Loading
- Starts each model one at a time
- Polls the `/health` endpoint (max 7.5 min per model)
- Immediately puts the model to sleep once the health check passes
- This prevents OOM by ensuring only one model uses VRAM at a time
Phase 2: Activate Default Model
- Wakes up the default model (defined in `config.yaml`)
- Waits for the health check to confirm it's ready
Phase 3: Start Management Services
- Starts Model Manager (will resync and detect model states)
- Starts Open WebUI
Why Bootstrap is Necessary:
- Multiple models cannot be active simultaneously (VRAM constraints)
- Docker Compose's `depends_on` doesn't handle the sleep/wake sequence
- Models must be cached before switching works properly
- Proper state initialization prevents race conditions
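The real bootstrap is a bash script; purely to illustrate the ordering, the same sequence rendered in Go might look like the sketch below. It assumes the containers are already running, uses the host ports listed in this README, and treats the default model as the one on port 8001.

```go
// Hypothetical Go rendering of the bootstrap ordering; bootstrap.sh (bash)
// is the real implementation and also starts the containers themselves.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls /health until it returns 200 or attempts run out
// (225 attempts x 2s ≈ the 7.5 minute limit mentioned above).
func waitHealthy(base string, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(base + "/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("%s never became healthy", base)
}

func post(url string) error {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	endpoints := []string{
		"http://localhost:8001", "http://localhost:8002",
		"http://localhost:8003", "http://localhost:8004",
	}
	defaultModel := endpoints[0] // assumption: default model defined in config.yaml

	// Phase 1: wait for each model in turn, then put it to sleep.
	for _, ep := range endpoints {
		if err := waitHealthy(ep, 225, 2*time.Second); err != nil {
			panic(err)
		}
		if err := post(ep + "/sleep?level=1"); err != nil {
			panic(err)
		}
	}

	// Phase 2: wake only the default model and confirm it is ready.
	if err := post(defaultModel + "/wake_up"); err != nil {
		panic(err)
	}
	if err := waitHealthy(defaultModel, 225, 2*time.Second); err != nil {
		panic(err)
	}

	// Phase 3 (Model Manager and Open WebUI) is started via docker compose.
	fmt.Println("bootstrap sequence complete")
}
```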
Known Limitations:
- Direct vLLM Endpoint Access
  - Sending requests directly to sleeping vLLM endpoints causes crashes
  - Workaround: only use the Model Manager `/switch` API, not direct vLLM calls
  - Future: a Model Router component will proxy all requests to the active model
- Single GPU Only
  - Current implementation assumes all models share one GPU
  - Multi-GPU support requires architecture changes
- No Request Queuing
  - Requests sent during a model switch are lost
  - Future: queue requests during switch operations
- Manual WebUI Model Selection
  - WebUI shows all model endpoints, but only the active one works
  - Future: dynamic model list based on Model Manager state
- Sleep Mode Requires Dev Mode
  - `VLLM_SERVER_DEV_MODE=1` is required for sleep endpoints
  - Not recommended for production deployments per vLLM docs
- Grafana Dashboards (http://localhost:3001)
  - Login with `admin`/`admin`
  - Navigate to "Dashboards" to view:
    - GPU Metrics: Real-time GPU utilization, memory, temperature
    - vLLM Metrics: Request latency, throughput, model performance
- Prometheus (http://localhost:9091)
  - Query metrics directly using PromQL
  - View scrape targets and their health status
  - Explore available metrics from vLLM and DCGM
- Log Queries
  - View logs in Grafana's "Explore" section
  - Filter by container name, log level, or search terms
  - Correlate logs with metrics for debugging
GPU Metrics (from DCGM):
- `DCGM_FI_DEV_GPU_UTIL` - GPU utilization percentage
- `DCGM_FI_DEV_FB_USED` - GPU memory used (bytes)
- `DCGM_FI_DEV_GPU_TEMP` - GPU temperature (Celsius)
- `DCGM_FI_DEV_POWER_USAGE` - Power consumption (Watts)
vLLM Metrics:
- Request latency and throughput
- Model-specific performance metrics
- Error rates and health status
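These metrics can also be pulled outside Grafana through the Prometheus HTTP API. A hedged sketch using the official Go client (`github.com/prometheus/client_golang`) against one of the DCGM metric names above:

```go
// Hedged sketch: query a DCGM GPU metric from the Prometheus instance on :9091
// using the official Prometheus Go client.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9091"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instantaneous GPU utilization, one sample per GPU reported by DCGM.
	result, warnings, err := promAPI.Query(ctx, "DCGM_FI_DEV_GPU_UTIL", time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```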
Logs are automatically collected from all vLLM instances:
- Location: `logs/<model-name>/vllm-server.log.*`
- Access: Query through Grafana's Explore view or Loki API
Example Loki query:
{container_name="vllm-qwen3-vl-32b"} |= "error"
- WebSocket support for real-time status updates
- Open WebUI custom plugin for model selection UI
- Automatic model preloading on startup
- Model usage statistics and logging
- Support for multiple GPUs per model
- Graceful shutdown with state persistence
- Prometheus metrics export
- Admin dashboard for monitoring
- Alerting rules for GPU/memory thresholds
- Model performance comparison dashboards
- NVIDIA Driver (recommended: 525 or later)
- Docker Engine (23.0 or later)
- NVIDIA Container Toolkit
- Docker Compose V2
- Install NVIDIA Container Toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

- Verify NVIDIA Docker installation:
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu24.04 nvidia-smi

- vLLM Documentation
- vLLM Sleep Mode Guide
- Open WebUI Documentation
- NVIDIA Container Toolkit Guide
- Gin Web Framework
MIT License - See LICENSE file for details.