Provider-agnostic on-demand GPU deployment (load balancing across GPUs) by JasonWildMe · Pull Request #26 · WildMeOrg/ml-service

JasonWildMe · 2026-06-01T17:50:04Z

Goal

Enable load-balancing the detector service across multiple on-demand GPUs while paying only for GPU time in use. The service is stateless and GPU-bound, so it fits serverless GPU platforms where the platform itself is the load balancer and autoscaler, so no nginx/HAProxy is needed.

The design priority is provider independence: one OCI image (built from the existing docker/dockerfile), serving plain HTTP, runs unchanged on RunPod, Cloud Run, or a VM. Each provider gets only a thin config file under deploy/.

Container portability (backward-compatible)

Defaults preserve current VM behavior, so the existing prod compose is unaffected.

app/main.py — PORT/HOST/DEVICE/WORKERS now read from env (Cloud Run injects $PORT); ${MODEL_BASE} in the model config is expanded at startup.
app/model_config.json — /datasets/... → ${MODEL_BASE}/... (defaults to /datasets).
app/models/yolo_ultralytics.py — detector weight resolves through get_checkpoint_path, so it's URL-pullable like every other model (this was the only path that bypassed the resolver).
docker/docker-compose.prod.yml — WORKERS default 4 → 1. Extra workers each load a full copy of all 5 models into the same VRAM (OOM risk) with no throughput gain, since the GPU executes serially. Scale via replicas, not workers.
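
For reviewers who want the shape of the main.py change without opening the diff, here is a minimal sketch of env-driven settings plus ${MODEL_BASE} expansion. It is not the code in this PR: the function names `load_settings` and `expand_model_base`, the HOST/PORT/DEVICE defaults, and the example weight path are assumptions for illustration. Only the variable names and the /datasets default come from the description above.

```python
import os


def load_settings(env=None):
    """Read runtime settings from the environment, with VM-style fallbacks."""
    env = os.environ if env is None else env
    return {
        "host": env.get("HOST", "0.0.0.0"),       # assumed default
        "port": int(env.get("PORT", "8000")),     # assumed default; Cloud Run injects PORT
        "device": env.get("DEVICE", "cuda"),      # assumed default
        "workers": int(env.get("WORKERS", "1")),  # one worker per GPU replica
    }


def expand_model_base(node, model_base):
    """Recursively replace the literal ${MODEL_BASE} in every string of a parsed config."""
    if isinstance(node, str):
        return node.replace("${MODEL_BASE}", model_base)
    if isinstance(node, dict):
        return {key: expand_model_base(value, model_base) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_model_base(value, model_base) for value in node]
    return node  # numbers, booleans, None pass through unchanged


# Hypothetical config entry; the real model_config.json layout may differ.
example_config = {"detector": {"weights": "${MODEL_BASE}/detector/weights.pt"}}
model_base = os.environ.get("MODEL_BASE", "/datasets")
expanded = expand_model_base(example_config, model_base)
```

Expanding after JSON parsing (rather than on the raw text) avoids breaking the file when a MODEL_BASE value contains characters that would need JSON escaping.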

`MODEL_BASE` — the one per-environment decision

Because all loaders resolve through get_checkpoint_path, weights can be a mounted volume (/datasets, /runpod-volume/models) or an https:// object-store prefix (fetched + cached at boot). The URL option is the most portable: identical config everywhere.
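
To make the two MODEL_BASE modes concrete, the sketch below shows one way a resolver could treat a mounted path and an https:// URL. It is an illustration only, not the repository's get_checkpoint_path: the name `resolve_checkpoint`, the cache directory, the cache-key scheme, and the injectable `fetch` parameter are all assumptions.

```python
import hashlib
import os
import urllib.request
from urllib.parse import urlparse


def resolve_checkpoint(path_or_url, cache_dir="/tmp/model-cache", fetch=None):
    """Return a local file path for a checkpoint given a path or an http(s) URL."""
    parsed = urlparse(path_or_url)
    if parsed.scheme not in ("http", "https"):
        return path_or_url  # mounted volume: use the path as-is

    fetch = fetch or urllib.request.urlretrieve  # (url, filename) downloader
    os.makedirs(cache_dir, exist_ok=True)

    # Key the cache on the full URL so equal basenames from different prefixes do not collide.
    digest = hashlib.sha256(path_or_url.encode("utf-8")).hexdigest()[:16]
    name = os.path.basename(parsed.path) or "checkpoint"
    local_path = os.path.join(cache_dir, f"{digest}-{name}")

    if not os.path.exists(local_path):
        tmp_path = local_path + ".part"
        fetch(path_or_url, tmp_path)      # download to a temporary name first
        os.replace(tmp_path, local_path)  # then rename, so a partial file is never cached
    return local_path
```

With this shape, a second call for the same URL returns the cached file without another download, which matches the "fetched + cached at boot" behavior described above.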

Deploy scaffold (`deploy/`)

deploy/README.md — portability contract, the two universal load-balancing knobs (concurrency=2 matching MAX_CONCURRENT_PREDICTIONS; min/max replicas), and the cold-start caveat.
deploy/cloudrun/service.yaml + deploy.sh — NVIDIA L4, min-instances=1 (warm baseline), concurrency=2, timeout=300.
deploy/runpod/endpoint.json — HTTP load-balancing serverless endpoint (deliberately not the queue/handler model, which would be RunPod-specific code), activeWorkers=1, concurrencyPerWorker=2.
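
Why the platform concurrency should equal MAX_CONCURRENT_PREDICTIONS: if the platform admits more requests per replica than the process will run at once, the surplus waits inside the replica instead of prompting a scale-out. The simulation below is a hedged sketch of that in-process gate using an asyncio semaphore; how the service actually enforces the limit is not shown in this PR, and `run_batch` is a hypothetical helper.

```python
import asyncio

MAX_CONCURRENT_PREDICTIONS = 2  # value the PR pairs with concurrency=2


async def run_batch(n_requests, limit=MAX_CONCURRENT_PREDICTIONS, work_seconds=0.01):
    """Simulate n_requests arriving at one replica; return the peak number in flight."""
    gate = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def predict():
        nonlocal in_flight, peak
        async with gate:  # at most `limit` predictions run concurrently
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_seconds)  # stand-in for GPU inference
            in_flight -= 1

    await asyncio.gather(*(predict() for _ in range(n_requests)))
    return peak


peak_in_flight = asyncio.run(run_batch(6))
```

Six simulated requests never exceed two in flight; the other four wait in the replica, which is the queueing the matching platform setting is meant to avoid.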

Cold-start note for reviewers

The service eagerly loads 5 models (90s health start_period), so a true cold start takes tens of seconds. Both providers default to a warm baseline (min-instances / activeWorkers = 1): this pays for roughly one GPU continuously (an L4 ≈ $0.84/hr, far below a multi-GPU VM), while bursts autoscale across additional GPUs. Set the minimum to 0 for true scale-to-zero if a cold first request is acceptable.
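
As a rough worked example of the warm-baseline cost, the arithmetic below uses the approximate L4 rate quoted above. The 30-day month and the helper name `warm_baseline_cost` are assumptions; actual pricing varies by provider and region, and burst replicas are billed on top.

```python
L4_HOURLY_USD = 0.84  # approximate rate quoted in this PR


def warm_baseline_cost(hours, min_replicas=1, hourly_usd=L4_HOURLY_USD):
    """Cost in USD of keeping min_replicas GPUs warm for the given number of hours."""
    return round(hours * min_replicas * hourly_usd, 2)


daily_usd = warm_baseline_cost(24)          # one warm replica for a day
monthly_usd = warm_baseline_cost(24 * 30)   # one warm replica for a 30-day month
scale_to_zero_usd = warm_baseline_cost(24 * 30, min_replicas=0)  # no idle cost
```

One warm L4 comes to about $20 per day, or about $605 per 30-day month, which is the price of avoiding a cold first request.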

Testing

pytest tests/ — 42 passed.
Verified ${MODEL_BASE} expansion both for the default (/datasets) and an https:// prefix.

🤖 Generated with Claude Code

Make the container portable across serverless/on-demand GPU providers so the service can be load-balanced across multiple GPUs while paying only for GPU time in use. The platform (RunPod / Cloud Run) acts as the load balancer and autoscaler; one image runs unchanged on either or on a VM.

Container portability (backward-compatible; defaults preserve VM behavior):
- main.py: read PORT/HOST/DEVICE/WORKERS from env (Cloud Run injects PORT); expand ${MODEL_BASE} in model_config.json at startup.
- model_config.json: /datasets/... -> ${MODEL_BASE}/... (defaults to /datasets).
- yolo_ultralytics.py: resolve the detector weight via get_checkpoint_path so it is URL-pullable like the other models (was the only path that wasn't).
- docker-compose.prod.yml: WORKERS default 4 -> 1 (extra workers load a full copy of all models into the same VRAM with no throughput gain).

Deploy scaffold (thin per-provider configs over the same image):
- deploy/README.md: portability contract, load-balancing knobs, cold-start notes.
- deploy/cloudrun/{service.yaml,deploy.sh}: L4 GPU, min-instances=1, concurrency=2.
- deploy/runpod/endpoint.json: HTTP load-balancing endpoint, activeWorkers=1, concurrencyPerWorker=2 (matches MAX_CONCURRENT_PREDICTIONS).

Tests: 42 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JasonWildMe wants to merge 1 commit into main from feat/provider-agnostic-gpu-deploy
