Provider-agnostic on-demand GPU deployment (load balancing across GPUs) #26

Open
JasonWildMe wants to merge 1 commit into main from feat/provider-agnostic-gpu-deploy

Goal

Enable load-balancing the detector service across multiple on-demand GPUs while paying only for GPU time in use. The service is stateless and GPU-bound, so it fits serverless GPU platforms where the platform itself is the load balancer + autoscaler — no nginx/HAProxy needed.

The design priority is provider independence: one OCI image (built from the existing docker/dockerfile), serving plain HTTP, runs unchanged on RunPod, Cloud Run, or a VM. Each provider gets only a thin config file under deploy/.

Container portability (backward-compatible)

Defaults preserve current VM behavior, so the existing prod compose is unaffected.

  • app/main.py: PORT/HOST/DEVICE/WORKERS are now read from env (Cloud Run injects $PORT); ${MODEL_BASE} in the model config is expanded at startup.
  • app/model_config.json: /datasets/... → ${MODEL_BASE}/... (defaults to /datasets).
  • app/models/yolo_ultralytics.py: the detector weight now resolves through get_checkpoint_path, so it's URL-pullable like every other model (this was the only path that bypassed the resolver).
  • docker/docker-compose.prod.yml: WORKERS default 4 → 1. Extra workers each load a full copy of all 5 models into the same VRAM (OOM risk) with no throughput gain, since the GPU executes serially. Scale via replicas, not workers.
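
For reviewers who want the shape of the env handling without opening the diff, here is a minimal sketch of reading the settings and expanding ${MODEL_BASE}. It is not the code in app/main.py: the function names `load_settings` and `expand_model_base`, the HOST/PORT/DEVICE fallbacks, and the example config key are assumptions for illustration. Only the WORKERS default of 1 and the MODEL_BASE default of /datasets come from this PR.

```python
import json
import os
from string import Template


def load_settings(env=None):
    """Read runtime settings from the environment with VM-style fallbacks.

    The HOST, PORT, and DEVICE defaults here are assumed, not taken from the PR.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),  # Cloud Run injects PORT
        "device": env.get("DEVICE", "cuda"),
        "workers": int(env.get("WORKERS", "1")),
        "model_base": env.get("MODEL_BASE", "/datasets").rstrip("/"),
    }


def expand_model_base(raw_config: str, model_base: str) -> dict:
    """Substitute ${MODEL_BASE} in the raw JSON text, then parse it.

    safe_substitute leaves any other $-placeholders untouched instead of raising.
    """
    expanded = Template(raw_config).safe_substitute(MODEL_BASE=model_base)
    return json.loads(expanded)
```

With `MODEL_BASE` unset, a config entry such as `"${MODEL_BASE}/yolo/best.pt"` (a hypothetical path) expands to `/datasets/yolo/best.pt`, which is why the existing VM deployment keeps working unchanged.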

MODEL_BASE — the one per-environment decision

Because all loaders resolve through get_checkpoint_path, weights can be a mounted volume (/datasets, /runpod-volume/models) or an https:// object-store prefix (fetched + cached at boot). The URL option is the most portable: identical config everywhere.
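
The "mounted volume or URL prefix" behavior can be pictured roughly as follows. This is a hedged sketch, not the repository's get_checkpoint_path: the name `resolve_checkpoint`, the cache directory, and the cache-key scheme are all assumptions, and the real resolver may differ in every detail.

```python
import hashlib
import os
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def resolve_checkpoint(location: str, cache_dir: str = "/tmp/model-cache") -> str:
    """Return a local file path for a checkpoint given a path or an http(s) URL."""
    parsed = urlparse(location)
    if parsed.scheme not in ("http", "https"):
        # Mounted volume (e.g. /datasets/...): use the path as-is.
        return location

    # Key the cache on the full URL so equal file names from different
    # prefixes do not collide.
    digest = hashlib.sha256(location.encode("utf-8")).hexdigest()[:16]
    name = os.path.basename(parsed.path) or "checkpoint"
    target = Path(cache_dir) / f"{digest}-{name}"

    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        tmp = target.with_name(target.name + ".part")
        # Download under a temporary name, then rename atomically so a
        # partially downloaded file is never mistaken for a valid checkpoint.
        urllib.request.urlretrieve(location, str(tmp))
        os.replace(tmp, target)
    return str(target)
```

Because the download happens once per container instance, its duration adds to cold-start time; a mounted volume avoids that cost but ties the config to the provider's volume path.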

Deploy scaffold (deploy/)

  • deploy/README.md — portability contract, the two universal load-balancing knobs (concurrency=2 matching MAX_CONCURRENT_PREDICTIONS; min/max replicas), and the cold-start caveat.
  • deploy/cloudrun/service.yaml + deploy.sh — NVIDIA L4, min-instances=1 (warm baseline), concurrency=2, timeout=300.
  • deploy/runpod/endpoint.json — HTTP load-balancing serverless endpoint (deliberately not the queue/handler model, which would be RunPod-specific code), activeWorkers=1, concurrencyPerWorker=2.
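
To illustrate why the platform-level concurrency is set to match MAX_CONCURRENT_PREDICTIONS, the sketch below models an in-process cap with an asyncio semaphore. How the service actually enforces the limit is not shown in this PR description, so the semaphore, `run_batch`, and the sleep standing in for inference are assumptions. If the platform routed more than 2 requests to one replica, the extras would wait at this gate instead of being sent to another GPU.

```python
import asyncio

MAX_CONCURRENT_PREDICTIONS = 2  # value from this PR; the enforcement below is assumed


async def run_batch(n_requests: int, limit: int = MAX_CONCURRENT_PREDICTIONS) -> int:
    """Run n_requests fake predictions and return the peak number in flight."""
    gate = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def predict() -> None:
        nonlocal in_flight, peak
        async with gate:  # at most `limit` coroutines pass this point at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for GPU inference
            in_flight -= 1

    await asyncio.gather(*(predict() for _ in range(n_requests)))
    return peak
```

Running `asyncio.run(run_batch(8))` never reports more than 2 predictions in flight, so a platform concurrency above 2 would only add queueing latency on a single replica.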

Cold-start note for reviewers

The service eagerly loads 5 models (90s health start_period), so a true cold start is tens of seconds. Both providers default to a warm baseline (min-instances / activeWorkers = 1): this pays for roughly one GPU continuously (an L4 ≈ $0.84/hr, far below a multi-GPU VM), while bursts autoscale across additional GPUs. Set the value to 0 for true scale-to-zero if a cold first request is acceptable.
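
As a rough way to compare the warm baseline against scale-to-zero, the arithmetic can be sketched as below. The $0.84/hr figure is the approximate L4 rate quoted above; the 730 hours per month, the burst-hours input, and the function name are assumptions, and real bills depend on provider pricing, region, and CPU/memory charges not modeled here.

```python
L4_USD_PER_HOUR = 0.84  # approximate rate quoted in this PR
HOURS_PER_MONTH = 730.0  # assumed average month length in hours


def monthly_gpu_cost(warm_replicas: int, burst_gpu_hours: float = 0.0,
                     rate: float = L4_USD_PER_HOUR) -> float:
    """Estimate monthly GPU spend in USD: always-on replicas plus burst hours."""
    always_on_hours = warm_replicas * HOURS_PER_MONTH
    return round((always_on_hours + burst_gpu_hours) * rate, 2)
```

Under these assumptions one warm replica costs about $613 per month before any burst traffic, while scale-to-zero with 50 burst GPU-hours costs about $42, at the price of a cold first request.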

Testing

  • pytest tests/: 42 passed.
  • Verified ${MODEL_BASE} expansion both for the default (/datasets) and an https:// prefix.

🤖 Generated with Claude Code

Make the container portable across serverless/on-demand GPU providers so
the service can be load-balanced across multiple GPUs while paying only for
GPU time in use. The platform (RunPod / Cloud Run) acts as the load balancer
and autoscaler; one image runs unchanged on either or on a VM.

Container portability (backward-compatible; defaults preserve VM behavior):
- main.py: read PORT/HOST/DEVICE/WORKERS from env (Cloud Run injects PORT);
  expand ${MODEL_BASE} in model_config.json at startup.
- model_config.json: /datasets/... -> ${MODEL_BASE}/... (defaults to /datasets).
- yolo_ultralytics.py: resolve the detector weight via get_checkpoint_path so
  it is URL-pullable like the other models (was the only path that wasn't).
- docker-compose.prod.yml: WORKERS default 4 -> 1 (extra workers load a full
  copy of all models into the same VRAM with no throughput gain).

Deploy scaffold (thin per-provider configs over the same image):
- deploy/README.md: portability contract, load-balancing knobs, cold-start notes.
- deploy/cloudrun/{service.yaml,deploy.sh}: L4 GPU, min-instances=1, concurrency=2.
- deploy/runpod/endpoint.json: HTTP load-balancing endpoint, activeWorkers=1,
  concurrencyPerWorker=2 (matches MAX_CONCURRENT_PREDICTIONS).

Tests: 42 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>