A production-ready system for running multiple vLLM models with hot-swapping capability using vLLM's sleep mode feature. This enables efficient GPU memory management by putting inactive models to sleep (offloading to CPU RAM or discarding weights) while keeping the active model on GPU.
Current vLLM Version: v0.13.0
- Qwen3-VL Enhancements: EVS (Efficient Video Sampling) support and embeddings API
- Performance: 3x faster Whisper inference, DeepSeek V3.2 optimizations (up to 10.7% TTFT improvement)
- Quantization: W4A8 grouped GEMM on Hopper GPUs, MoE + LoRA support for AWQ/GPTQ
- Hardware: NVIDIA Blackwell Ultra (GB300) support with CUDA 13
- API: Model Context Protocol (MCP) integration, binary embeddings format
- 2025-12-30: Upgraded to vLLM v0.13.0
- 2025-12-02: Bumped vLLM to v0.11.2 across all model containers
- 2025-11-27: Added comprehensive monitoring stack (Prometheus, Grafana, Loki, DCGM)
- 2025-11-26: Refactored to use `config.yaml` as the single source of truth for port mappings
- 2025-11-25: Added Qwen3-Next-80B-A3B-Thinking model support
# 1. Install dependencies
sudo apt install yq # YAML parser for bootstrap script
# 2. Run bootstrap script (handles proper startup sequence)
./bootstrap.sh
# This will:
# - Load and cache all models (one at a time to avoid OOM)
# - Put all models to sleep
# - Wake only the default model
# - Start Model Manager and WebUI

Access points:
- Model Manager API: http://localhost:9000
- Open WebUI: http://localhost:3000
- Grafana Dashboard: http://localhost:3001 (admin/admin)
- Prometheus: http://localhost:9091
- vLLM Qwen3-VL-30B: http://localhost:8001
- vLLM Qwen3-VL-32B: http://localhost:8002
- vLLM Qwen3-Next-80B: http://localhost:8003
- vLLM GPT-OSS-20B: http://localhost:8004
# Check current status
curl -s http://localhost:9000/models | jq
# Switch to GPT-OSS
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id": "gpt-oss-20b"}' | jq
# Switch to Qwen3-VL-32B
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id": "qwen3-vl-32b"}' | jq

┌─────────────────┐
│ Open WebUI │ (Port 3000)
└────────┬────────┘
│
▼
┌─────────────────┐ ┌──────────────────────────┐
│ Model Manager │────▶│ vLLM Qwen3-VL-30B │ (Port 8001)
│ (Go Service) │ │ [Active/Sleep] │
│ Port 9000 │ └──────────────────────────┘
└─────────────────┘ ┌──────────────────────────┐
│ vLLM Qwen3-VL-32B │ (Port 8002)
│ [Active/Sleep] │
└──────────────────────────┘
┌──────────────────────────┐
│ vLLM Qwen3-Next-80B │ (Port 8003)
│ [Active/Sleep] │
└──────────────────────────┘
┌──────────────────────────┐
│ vLLM GPT-OSS-20B │ (Port 8004)
│ [Active/Sleep] │
└──────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Monitoring Stack │
├──────────────────┬──────────────┬──────────────┬─────────────┤
│ Prometheus │ Grafana │ Loki │ DCGM Export │
│ (Port 9091) │ (Port 3001) │ (Port 3100) │ (Port 9400) │
│ Metrics Store │ Dashboards │ Log Aggreg. │ GPU Metrics │
└──────────────────┴──────────────┴──────────────┴─────────────┘
│ │ │ │
└────────────────┴──────────────┴──────────────┘
│
┌────────┴─────────┐
│ Promtail │
│ Log Collector │
└──────────────────┘
homeGPT/
├── model-manager/ # Go service for model switching
│ ├── cmd/switcher/ # Main application entry point
│ ├── internal/ # Internal packages
│ │ ├── config/ # Configuration loading
│ │ ├── handlers/ # HTTP handlers (Gin)
│ │ ├── switcher/ # Core switching logic
│ │ └── vllm/ # vLLM HTTP client
│ ├── pkg/models/ # Shared data structures
│ ├── config.yaml # Model definitions & settings
│ ├── Dockerfile # Multi-stage build
│ └── go.mod # Go dependencies
├── docker/ # Modular Docker Compose files
│ ├── docker-compose.yml # Main orchestration (includes all)
│ ├── compose-model-manager.yml
│ ├── compose-monitoring.yml # Monitoring stack
│ ├── compose-vllm-qwen3-vl-30b-a3b.yml
│ ├── compose-vllm-qwen3-vl-32b.yml
│ ├── compose-vllm-qwen3-next-80b-a3b-thinking.yml
│ ├── compose-vllm-gpt-oss-20b.yml
│ └── compose-webui.yml
├── monitoring/ # Monitoring configuration
│ ├── prometheus.yml # Prometheus config & scrape targets
│ ├── loki-config.yml # Loki log aggregation config
│ ├── promtail-config.yml # Promtail log collection config
│ ├── grafana-datasources.yml # Grafana data sources
│ ├── grafana-dashboards.yml # Grafana dashboard provisioning
│ └── grafana/
│ └── dashboards/
│ ├── gpu-metrics.json # GPU monitoring dashboard
│ └── vllm-metrics.json # vLLM performance dashboard
├── logs/ # vLLM server logs
│ ├── qwen3-vl-30b-a3b/
│ ├── qwen3-vl-32b/
│ ├── qwen3-next-80b-a3b-thinking/
│ └── gpt-oss-20b/
├── config.yaml # Model Manager configuration
├── vllm-logging.json # vLLM logging configuration
├── bootstrap.sh # Automated startup script
└── test-switcher.sh # API test script
- Port: 9000
- Language: Go 1.21+
- Framework: Gin web framework
- Function: Orchestrates model switching by calling vLLM sleep/wake endpoints
- Features:
- Thread-safe concurrent request handling (sync.RWMutex)
- Memory-aware sleep level selection (Level 1 vs Level 2)
- Health monitoring with configurable retry logic
- RESTful HTTP API
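For illustration only, here is a minimal Gin sketch of the `/switch` handler; the `Switcher` interface is a stand-in for the real logic in `internal/switcher`, and none of this is the project's actual handler code.

```go
// Hypothetical sketch: how a Gin /switch handler could wrap the switcher.
// Names are stand-ins, not the repository's actual code.
package handlers

import (
	"context"
	"net/http"

	"github.com/gin-gonic/gin"
)

// Switcher stands in for the real switching logic in internal/switcher.
type Switcher interface {
	SwitchModel(ctx context.Context, modelID string) error
}

type switchRequest struct {
	ModelID string `json:"model_id" binding:"required"`
}

// RegisterRoutes wires the documented /health and /switch endpoints.
func RegisterRoutes(r *gin.Engine, s Switcher) {
	r.GET("/health", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"status": "ok"})
	})
	r.POST("/switch", func(c *gin.Context) {
		var req switchRequest
		if err := c.ShouldBindJSON(&req); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		if err := s.SwitchModel(c.Request.Context(), req.ModelID); err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
			return
		}
		c.JSON(http.StatusOK, gin.H{"active_model": req.ModelID})
	})
}
```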
- Prometheus (Port 9091) - Metrics collection and storage
- Scrapes vLLM metrics endpoints
- Collects GPU metrics from DCGM Exporter
- Stores time-series data for dashboards
- Grafana (Port 3001) - Visualization and dashboards
- Pre-configured dashboards for GPU metrics and vLLM performance
- Default credentials: admin/admin
- Auto-provisioned data sources (Prometheus, Loki)
- Loki (Port 3100) - Log aggregation
- Centralized log storage for all vLLM instances
- Queryable through Grafana
- Promtail - Log collector
- Scrapes logs from the `logs/` directory
- Ships logs to Loki with metadata labels
- DCGM Exporter (Port 9400) - NVIDIA GPU metrics
- Exposes detailed GPU utilization, memory, temperature
- Compatible with Prometheus
- `GET /health` - Service health check
- `GET /models` - List all models with current status (active/sleeping/switching/error)
- `POST /switch` - Switch to a different model. Request body: `{ "model_id": "qwen3-vl-30b" }`
vLLM v0.13.0 supports two sleep levels:
- Level 1: Offload model weights to CPU RAM (fast wake-up, requires RAM)
- Level 2: Discard model weights entirely (slow wake-up, no RAM needed)
The model manager automatically selects the appropriate level based on available RAM (configured in model-manager/config.yaml).
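The selection rule itself is simple. The `determineSleepLevel()` helper is named in the development notes further down; its real signature is not documented here, so the following is only a hedged sketch of the RAM-based rule (Level 1 when at least 64 GB is configured, Level 2 otherwise):

```go
// Hedged sketch of the RAM-based rule; the real determineSleepLevel() in
// internal/switcher/switcher.go may take different arguments.
package switcher

func determineSleepLevel(availableRAMGB float64) int {
	if availableRAMGB >= 64 {
		return 1 // offload weights to CPU RAM (fast wake-up)
	}
	return 2 // discard weights entirely (slow wake-up, no RAM needed)
}
```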
IMPORTANT: Development Mode Requirements
Sleep mode endpoints are ONLY available when running vLLM in development mode:
- Environment variable: `VLLM_SERVER_DEV_MODE=1` ✅ (configured in Docker Compose)
- Server flag: `--enable-sleep-mode` ✅ (configured in Docker Compose)
These endpoints should NOT be exposed to end users in production according to vLLM documentation.
Only available with VLLM_SERVER_DEV_MODE=1 and --enable-sleep-mode:
- `POST /sleep?level=1` or `POST /sleep?level=2` - Put model to sleep
- `POST /wake_up` - Wake up a sleeping model
- `GET /is_sleeping` - Check if the model is currently sleeping
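For reference, the Go side of these calls can be sketched as below. The project's real client is `internal/vllm/client.go`; the JSON field name in the `/is_sleeping` response is an assumption, so treat this as illustrative only.

```go
// Hedged sketch of a vLLM sleep-mode client; see internal/vllm/client.go
// for the project's actual implementation.
package vllm

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

type Client struct {
	BaseURL string // e.g. http://vllm-qwen3-vl-32b:8000
	HTTP    *http.Client
}

// Sleep puts the model to sleep at the given level (1 or 2).
func (c *Client) Sleep(ctx context.Context, level int) error {
	return c.post(ctx, fmt.Sprintf("/sleep?level=%d", level))
}

// WakeUp wakes a sleeping model.
func (c *Client) WakeUp(ctx context.Context) error {
	return c.post(ctx, "/wake_up")
}

// IsSleeping reports whether the model is currently sleeping.
// The "is_sleeping" JSON field is an assumption; verify against your vLLM version.
func (c *Client) IsSleeping(ctx context.Context) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.BaseURL+"/is_sleeping", nil)
	if err != nil {
		return false, err
	}
	resp, err := c.httpClient().Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var body struct {
		IsSleeping bool `json:"is_sleeping"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.IsSleeping, nil
}

func (c *Client) post(ctx context.Context, path string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.BaseURL+path, nil)
	if err != nil {
		return err
	}
	resp, err := c.httpClient().Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("vLLM returned %s for %s", resp.Status, path)
	}
	return nil
}

func (c *Client) httpClient() *http.Client {
	if c.HTTP != nil {
		return c.HTTP
	}
	return http.DefaultClient
}
```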
Follow these steps to add a new vLLM model to the system. Example: Adding Qwen3-VL-32B-Instruct-FP8.
Create docker/compose-vllm-<model-name>.yml following the naming convention:
services:
vllm-<model-name>:
image: vllm/vllm-openai:v0.13.0 # Pinned version for stability
container_name: vllm-<model-name>
restart: unless-stopped
command: >
<HuggingFace/Model-Name> # Model identifier from HuggingFace
--gpu-memory-utilization 0.90 # Adjust based on model size
--max-model-len 262144 # Context window (adjust as needed)
--max-num-batched-tokens 49152 # Batch size (adjust as needed)
--kv-cache-dtype fp8 # Use fp8 for memory efficiency
--tensor-parallel-size 2 # Number of GPUs (1 or 2)
--enable-chunked-prefill # Enable for long context
--enable-sleep-mode # Required for hot-swapping
--enable-auto-tool-choice # Enable tool calling if supported
--tool-call-parser hermes # Tool parser (if applicable)
ports:
- "<host-port>:8000" # Choose next available host port
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2 # Match tensor-parallel-size
capabilities: [gpu]
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
- ../vllm-logging.json:/logs/logging.json:ro
- ../logs/<model-name>:/logs
- ~/.cache/vllm:/root/.cache/vllm
- ~/.cache/triton:/root/.triton/cache
- ~/.cache/flashinfer:/root/.cache/flashinfer
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
- VLLM_SERVER_DEV_MODE=1 # Required for sleep mode
- CUDA_VISIBLE_DEVICES=0,1 # GPU indices (adjust as needed)
- VLLM_HOST_IP=127.0.0.1
- VLLM_CONFIGURE_LOGGING=1
- VLLM_LOGGING_CONFIG_PATH=/logs/logging.json
ipc: host
networks:
      - homegpt-network

Example (Qwen3-VL-32B):
services:
vllm-qwen3-vl-32b:
image: vllm/vllm-openai:v0.13.0
container_name: vllm-qwen3-vl-32b
restart: unless-stopped
command: >
Qwen/Qwen3-VL-32B-Instruct-FP8
--gpu-memory-utilization 0.90
--max-model-len 262144
--max-num-batched-tokens 49152
--kv-cache-dtype fp8
--tensor-parallel-size 2
--enable-chunked-prefill
--enable-sleep-mode
--enable-auto-tool-choice
--tool-call-parser hermes
ports:
- "8002:8000"
  # ... rest of config

Create the logs directory with a basic logging.json:
mkdir -p logs/<model-name>
echo '{}' > logs/<model-name>/logging.json

Edit docker/docker-compose.yml to include the new model service:
name: home-gpt
include:
- compose-model-manager.yml
- compose-vllm-qwen3-vl-30b-a3b.yml
- compose-vllm-qwen3-vl-32b.yml # Add new model here
- compose-vllm-qwen3-next-80b-a3b-thinking.yml
- compose-vllm-gpt-oss-20b.yml
- compose-webui.yml
networks:
homegpt-network:
    driver: bridge

Edit config.yaml to add the model configuration:
models:
# ... existing models ...
- id: <model-id> # Short identifier (e.g., qwen3-vl-32b)
name: "<Display Name>" # Human-readable name
container_name: "vllm-<model-name>" # Must match Docker service name
port: 8000 # Internal container port (always 8000)
host_port: <unique-port> # External port (8001, 8002, 8003, etc.)
gpu_memory_gb: <estimated-gb> # Approximate GPU memory usage
    startup_mode: disabled # Options: disabled | sleep | active

Example:
- id: qwen3-vl-32b
name: "Qwen 3 VL 32B (FP8)"
container_name: "vllm-qwen3-vl-32b"
port: 8000
host_port: 8002
gpu_memory_gb: 60.0
    startup_mode: disabled

Port Assignment Guidelines:
- Group related models together (e.g., VL models first, then text models)
- Use sequential ports (8001, 8002, 8003, 8004, ...)
- Document the port mapping in comments if needed
Edit docker/compose-webui.yml to expose the new model endpoint:
environment:
- OPENAI_API_BASE_URLS=http://vllm-model1:8000/v1;http://vllm-model2:8000/v1;http://vllm-newmodel:8000/v1
  - OPENAI_API_KEYS=dummy;dummy;dummy # Add one 'dummy' per model

Example:
environment:
- OPENAI_API_BASE_URLS=http://vllm-qwen3-vl-30b-a3b:8000/v1;http://vllm-qwen3-vl-32b:8000/v1;http://vllm-gpt-oss-20b:8000/v1;http://vllm-qwen3-next-80b-a3b-thinking:8000/v1
  - OPENAI_API_KEYS=dummy;dummy;dummy;dummy

cd docker
# Start the new model (it will download on first run)
docker compose up -d vllm-<model-name>
# Monitor logs to track download and initialization
docker compose logs -f vllm-<model-name>
# Restart WebUI to pick up the new endpoint
docker compose restart webui
# Optionally restart model-manager to reload config
docker compose restart model-manager

# Check model status
curl -s http://localhost:9000/models | jq
# Switch to the new model
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id": "<model-id>"}' | jq
# Verify it's active
curl -s http://localhost:9000/models | jq '.[] | select(.id=="<model-id>")'

GPU Memory Utilization:
- `0.90` - Default, safe for most models
- `0.95` - Aggressive, use for maximum context
- `0.85` - Conservative, if experiencing OOM
Context Window (max-model-len):
- `262144` - Ultra-long context (256K tokens)
- `131072` - Long context (128K tokens)
- `32768` - Standard (32K tokens)
- Reduce if experiencing OOM errors
Tensor Parallelism:
- `--tensor-parallel-size 1` - Single GPU
- `--tensor-parallel-size 2` - Two GPUs (common for 30B+ models)
- `--tensor-parallel-size 4` - Four GPUs (for 70B+ models)
Quantization Support:
- FP8/FP16 models work out of the box
- AWQ/GPTQ models are supported
- Adjust the `--quantization` flag if needed
Model download fails:
# Check HuggingFace token
echo $HF_TOKEN
# Pre-download manually
huggingface-cli login
huggingface-cli download Org/Model-Name

Out of memory during startup:
- Reduce `--gpu-memory-utilization`
- Reduce `--max-model-len`
- Increase `--tensor-parallel-size` (use more GPUs)
- Use a quantized version (AWQ/GPTQ/FP8)
Container won't start:
# Check detailed logs
docker compose logs vllm-<model-name>
# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu24.04 nvidia-smi
# Check port conflicts
netstat -tulpn | grep <host-port>

Model switches but doesn't respond:
- Ensure `VLLM_SERVER_DEV_MODE=1` is set
- Ensure `--enable-sleep-mode` is in the command
- Check the health endpoint: `curl http://localhost:<host-port>/health`
- Verify the model is awake: `curl http://localhost:<host-port>/is_sleeping`
Go Code Organization:
- `cmd/switcher/main.go` - Application entry point, initializes the Gin router
- `internal/config/` - Loads `config.yaml` into Go structs
- `internal/vllm/client.go` - HTTP client for the vLLM API (Sleep, WakeUp, Health, IsSleeping)
- `internal/switcher/switcher.go` - Core logic: SwitchModel orchestrates the sleep→wake→update flow
- `internal/handlers/` - Gin HTTP handlers wrapping switcher methods
- `pkg/models/` - Shared data structures (Model, Config, StatusEnum)
Key Implementation Details:
- Thread safety via `sync.RWMutex` in `switcher.go`
- Sleep level selection: checks the `available_ram_gb` config to decide between Level 1 and Level 2
- Health check retry logic: configurable `max_retries` and `health_check_interval_seconds`
- Error handling: errors are returned up the stack; handlers convert them to HTTP status codes
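The retry behavior can be pictured with a short sketch (field names follow `max_retries` and `health_check_interval_seconds` from `config.yaml`; the real loop in the switcher package may differ):

```go
// Hedged sketch of the health-check retry loop; not the project's actual code.
package switcher

import (
	"context"
	"fmt"
	"time"
)

// healthChecker stands in for the vLLM client's health call.
type healthChecker interface {
	Health(ctx context.Context) error
}

// waitUntilHealthy polls Health every intervalSeconds, up to maxRetries times,
// mirroring max_retries and health_check_interval_seconds in config.yaml.
func waitUntilHealthy(ctx context.Context, hc healthChecker, maxRetries, intervalSeconds int) error {
	interval := time.Duration(intervalSeconds) * time.Second
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err := hc.Health(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return fmt.Errorf("model not healthy after %d retries", maxRetries)
}
```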
cd model-manager
go mod download
go build -o switcher ./cmd/switcher
./switcher  # Runs on port 9000

Or with Docker:
cd docker
docker compose build model-manager

# Start services
cd docker
docker compose up -d
# Run tests
cd ..
./test-switcher.sh
# Manual testing
curl http://localhost:9000/health
curl http://localhost:9000/models
curl -X POST http://localhost:9000/switch \
-H "Content-Type: application/json" \
-d '{"model_id":"gpt-oss-20b"}'

The core switching algorithm is in internal/switcher/switcher.go:
func (s *Switcher) SwitchModel(ctx context.Context, targetModelID string) error {
// 1. Validate target model exists
// 2. Lock for exclusive access (mutex)
// 3. Find currently active model
// 4. Sleep active model (determine level based on RAM)
// 5. Wake up target model
// 6. Wait for health check to pass
// 7. Update model statuses
// 8. Return success/error
}

To modify behavior:
- Change sleep level logic: edit the `determineSleepLevel()` method
- Adjust retry behavior: edit `max_retries` in `config.yaml`
- Add new endpoints: add methods to `internal/handlers/handlers.go`
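Compressed into a self-contained example with hypothetical stand-in types, the flow above looks roughly like this; `internal/switcher/switcher.go` remains the source of truth.

```go
// Hypothetical, self-contained compression of the switch flow; the real types
// and helpers in internal/switcher differ.
package switchexample

import (
	"context"
	"fmt"
	"sync"
)

type endpoint interface { // stand-in for the vLLM client
	Sleep(ctx context.Context, level int) error
	WakeUp(ctx context.Context) error
	Health(ctx context.Context) error
}

type model struct {
	ID     string
	Active bool
	API    endpoint
}

type switcher struct {
	mu             sync.RWMutex
	models         map[string]*model
	availableRAMGB float64
}

func (s *switcher) SwitchModel(ctx context.Context, targetID string) error {
	target, ok := s.models[targetID] // 1. validate the target model exists
	if !ok {
		return fmt.Errorf("unknown model %q", targetID)
	}

	s.mu.Lock() // 2. only one switch at a time
	defer s.mu.Unlock()

	for _, m := range s.models { // 3. find the currently active model
		if m.Active && m.ID != targetID {
			level := 2 // 4. pick the sleep level from available RAM
			if s.availableRAMGB >= 64 {
				level = 1
			}
			if err := m.API.Sleep(ctx, level); err != nil {
				return fmt.Errorf("sleeping %s: %w", m.ID, err)
			}
			m.Active = false
		}
	}

	if err := target.API.WakeUp(ctx); err != nil { // 5. wake the target model
		return fmt.Errorf("waking %s: %w", targetID, err)
	}
	if err := target.API.Health(ctx); err != nil { // 6. health check (retry loop omitted here)
		return err
	}

	target.Active = true // 7. update statuses
	return nil           // 8. success
}
```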
models:
- id: "qwen3-vl-30b-a3b" # Unique identifier
name: "Qwen 3 VL 30B A3B (MoE AWQ)" # Display name
container_name: "vllm-qwen3-vl-30b-a3b" # Docker container name
port: 8000 # Internal container port
host_port: 8001 # External host-mapped port
gpu_memory_gb: 57.0 # GPU memory usage estimate
startup_mode: active # Initial status: disabled/sleep/active
- id: "qwen3-vl-32b"
name: "Qwen 3 VL 32B (FP8)"
container_name: "vllm-qwen3-vl-32b"
port: 8000
host_port: 8002
gpu_memory_gb: 60.0
startup_mode: disabled
# ... more models ...
switching:
available_ram_gb: 128.0 # Total RAM for sleep level decision
max_retries: 450 # Health check retries (15 min max)
  health_check_interval_seconds: 2     # Seconds between retries

Sleep Level Selection:
- If `available_ram_gb >= 64`: use Level 1 (offload to CPU RAM)
- Otherwise: use Level 2 (discard weights)
Startup Modes:
- `disabled`: Container not started at all
- `sleep`: Container started, model loaded, immediately put to sleep
- `active`: Container started, model loaded, and ready to serve (only ONE model should be active)
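For reference, the YAML above maps naturally onto Go structs along these lines. Field names and YAML tags are inferred from the config shown here (and `gopkg.in/yaml.v3` is used only as an example), not copied from the repo:

```go
// Hedged sketch of config structs matching config.yaml; the real definitions
// live in internal/config and pkg/models and may differ.
package config

import (
	"os"

	"gopkg.in/yaml.v3"
)

type Model struct {
	ID            string  `yaml:"id"`
	Name          string  `yaml:"name"`
	ContainerName string  `yaml:"container_name"`
	Port          int     `yaml:"port"`
	HostPort      int     `yaml:"host_port"`
	GPUMemoryGB   float64 `yaml:"gpu_memory_gb"`
	StartupMode   string  `yaml:"startup_mode"` // disabled | sleep | active
}

type Switching struct {
	AvailableRAMGB             float64 `yaml:"available_ram_gb"`
	MaxRetries                 int     `yaml:"max_retries"`
	HealthCheckIntervalSeconds int     `yaml:"health_check_interval_seconds"`
}

type Config struct {
	Models    []Model   `yaml:"models"`
	Switching Switching `yaml:"switching"`
}

// Load reads and parses the configuration file.
func Load(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```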
- OS: Linux (Ubuntu 24.04 LTS recommended)
- RAM: 64GB+ (128GB recommended for Level 1 sleep)
- GPU: NVIDIA GPU with 24GB+ VRAM (RTX 4090, 5090, A100, etc.)
- Storage: 100GB+ free space for model caching
- Docker: 23.0+
- NVIDIA Container Toolkit: Latest version
- Go: 1.21+ (for development)
# Check logs
docker compose logs model-manager
# Common issues:
# - config.yaml syntax error
# - Port 9000 already in use
# - Network not created

# Check vLLM instance logs
docker compose logs vllm-qwen
docker compose logs vllm-gptoss
# Verify sleep mode is enabled
curl http://localhost:8001/is_sleeping
curl http://localhost:8002/is_sleeping
# Ensure VLLM_SERVER_DEV_MODE=1 is set

# Pre-download models
huggingface-cli download QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
huggingface-cli download openai/gpt-oss-20b
# Or set HF_TOKEN in .env file and let vLLM download on first run

- Reduce `--gpu-memory-utilization` from 0.95 to 0.85
- Reduce `--max-model-len` to limit the context window
- Use smaller quantized models (AWQ, GPTQ)
- Ensure sufficient RAM for Level 1 sleep (128GB recommended)
# Check yq is installed
yq --version
# Install yq if needed
sudo apt install yq
# Run with verbose output
bash -x ./bootstrap.sh

The bootstrap.sh script handles the complex startup sequence required to run multiple vLLM models on a single GPU:
Phase 1: Sequential Model Loading
- Starts each model one at a time
- Polls the `/health` endpoint (max 7.5 min per model)
- Immediately puts the model to sleep once the health check passes
- This prevents OOM by ensuring only one model uses VRAM at a time
Phase 2: Activate Default Model
- Wakes up the default model (defined in `config.yaml`)
- Waits for the health check to confirm it's ready
Phase 3: Start Management Services
- Starts Model Manager (will resync and detect model states)
- Starts Open WebUI
Why Bootstrap is Necessary:
- Multiple models cannot be active simultaneously (VRAM constraints)
- Docker Compose's `depends_on` doesn't handle the sleep/wake sequence
- Models must be cached before switching works properly
- Proper state initialization prevents race conditions
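The real bootstrap is a bash script; purely to illustrate the ordering, the same sequence rendered in Go might look like the sketch below. It assumes the containers are already running, uses the host ports listed in this README, and treats the default model as the one on port 8001.

```go
// Hypothetical Go rendering of the bootstrap ordering; bootstrap.sh (bash)
// is the real implementation and also starts the containers themselves.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls /health until it returns 200 or attempts run out
// (225 attempts x 2s ≈ the 7.5 minute limit mentioned above).
func waitHealthy(base string, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(base + "/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("%s never became healthy", base)
}

func post(url string) error {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	endpoints := []string{
		"http://localhost:8001", "http://localhost:8002",
		"http://localhost:8003", "http://localhost:8004",
	}
	defaultModel := endpoints[0] // assumption: default model defined in config.yaml

	// Phase 1: wait for each model in turn, then put it to sleep.
	for _, ep := range endpoints {
		if err := waitHealthy(ep, 225, 2*time.Second); err != nil {
			panic(err)
		}
		if err := post(ep + "/sleep?level=1"); err != nil {
			panic(err)
		}
	}

	// Phase 2: wake only the default model and confirm it is ready.
	if err := post(defaultModel + "/wake_up"); err != nil {
		panic(err)
	}
	if err := waitHealthy(defaultModel, 225, 2*time.Second); err != nil {
		panic(err)
	}

	// Phase 3 (Model Manager and Open WebUI) is started via docker compose.
	fmt.Println("bootstrap sequence complete")
}
```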
Known Limitations:
- Direct vLLM Endpoint Access
  - Sending requests directly to sleeping vLLM endpoints causes crashes
  - Workaround: only use the Model Manager `/switch` API, not direct vLLM calls
  - Future: a Model Router component will proxy all requests to the active model
- Single GPU Only
  - Current implementation assumes all models share one GPU
  - Multi-GPU support requires architecture changes
- No Request Queuing
  - Requests sent during a model switch are lost
  - Future: queue requests during switch operations
- Manual WebUI Model Selection
  - WebUI shows all model endpoints, but only the active one works
  - Future: dynamic model list based on Model Manager state
- Sleep Mode Requires Dev Mode
  - `VLLM_SERVER_DEV_MODE=1` is required for sleep endpoints
  - Not recommended for production deployments per vLLM docs
- Grafana Dashboards (http://localhost:3001)
  - Login with `admin`/`admin`
  - Navigate to "Dashboards" to view:
    - GPU Metrics: Real-time GPU utilization, memory, temperature
    - vLLM Metrics: Request latency, throughput, model performance
- Prometheus (http://localhost:9091)
  - Query metrics directly using PromQL
  - View scrape targets and their health status
  - Explore available metrics from vLLM and DCGM
- Log Queries
  - View logs in Grafana's "Explore" section
  - Filter by container name, log level, or search terms
  - Correlate logs with metrics for debugging
GPU Metrics (from DCGM):
- `DCGM_FI_DEV_GPU_UTIL` - GPU utilization percentage
- `DCGM_FI_DEV_FB_USED` - GPU memory used (bytes)
- `DCGM_FI_DEV_GPU_TEMP` - GPU temperature (Celsius)
- `DCGM_FI_DEV_POWER_USAGE` - Power consumption (Watts)
vLLM Metrics:
- Request latency and throughput
- Model-specific performance metrics
- Error rates and health status
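These metrics can also be pulled outside Grafana through the Prometheus HTTP API. A hedged sketch using the official Go client (`github.com/prometheus/client_golang`) against one of the DCGM metric names above:

```go
// Hedged sketch: query a DCGM GPU metric from the Prometheus instance on :9091
// using the official Prometheus Go client.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9091"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instantaneous GPU utilization, one sample per GPU reported by DCGM.
	result, warnings, err := promAPI.Query(ctx, "DCGM_FI_DEV_GPU_UTIL", time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```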
Logs are automatically collected from all vLLM instances:
- Location: `logs/<model-name>/vllm-server.log.*`
- Access: Query through Grafana's Explore view or Loki API
Example Loki query:
{container_name="vllm-qwen3-vl-32b"} |= "error"
- WebSocket support for real-time status updates
- Open WebUI custom plugin for model selection UI
- Automatic model preloading on startup
- Model usage statistics and logging
- Support for multiple GPUs per model
- Graceful shutdown with state persistence
- Prometheus metrics export
- Admin dashboard for monitoring
- Alerting rules for GPU/memory thresholds
- Model performance comparison dashboards
- NVIDIA Driver (recommended: 525 or later)
- Docker Engine (23.0 or later)
- NVIDIA Container Toolkit
- Docker Compose V2
- Install NVIDIA Container Toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

- Verify NVIDIA Docker installation:
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu24.04 nvidia-smi

- vLLM Documentation
- vLLM Sleep Mode Guide
- Open WebUI Documentation
- NVIDIA Container Toolkit Guide
- Gin Web Framework
MIT License - See LICENSE file for details.