Provider-agnostic on-demand GPU deployment (load balancing across GPUs) #26
Open
JasonWildMe wants to merge 1 commit into
Make the container portable across serverless/on-demand GPU providers so
the service can be load-balanced across multiple GPUs while paying only for
GPU time in use. The platform (RunPod / Cloud Run) acts as the load balancer
and autoscaler; one image runs unchanged on either or on a VM.
Container portability (backward-compatible; defaults preserve VM behavior):
- main.py: read PORT/HOST/DEVICE/WORKERS from env (Cloud Run injects PORT);
expand ${MODEL_BASE} in model_config.json at startup.
- model_config.json: /datasets/... -> ${MODEL_BASE}/... (defaults to /datasets).
- yolo_ultralytics.py: resolve the detector weight via get_checkpoint_path so
it is URL-pullable like the other models (was the only path that wasn't).
- docker-compose.prod.yml: WORKERS default 4 -> 1 (extra workers load a full
copy of all models into the same VRAM with no throughput gain).
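The env-driven startup described above could look roughly like the sketch below. It is illustrative only: the `ServerSettings` class, the `load_settings` helper, and every default except WORKERS=1 are assumptions made for this example, not the code in main.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical container for the four env-driven settings."""
    host: str
    port: int
    device: str
    workers: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Read PORT/HOST/DEVICE/WORKERS from the environment.

    The fallback values for HOST, PORT, and DEVICE are assumed for
    illustration; only WORKERS=1 is taken from the PR text.
    """
    env = os.environ if env is None else env
    return ServerSettings(
        host=env.get("HOST", "0.0.0.0"),       # assumed default
        port=int(env.get("PORT", "8000")),     # assumed default; Cloud Run injects PORT
        device=env.get("DEVICE", "cuda"),      # assumed default
        workers=int(env.get("WORKERS", "1")),  # 1 worker per replica, per this PR
    )
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without mutating the process environment.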
Deploy scaffold (thin per-provider configs over the same image):
- deploy/README.md: portability contract, load-balancing knobs, cold-start notes.
- deploy/cloudrun/{service.yaml,deploy.sh}: L4 GPU, min-instances=1, concurrency=2.
- deploy/runpod/endpoint.json: HTTP load-balancing endpoint, activeWorkers=1,
concurrencyPerWorker=2 (matches MAX_CONCURRENT_PREDICTIONS).
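Why the platform concurrency should match MAX_CONCURRENT_PREDICTIONS can be shown with a small sketch. It assumes the service gates inference with something like a semaphore; the PR does not say how the limit is enforced, so `run_with_limit` and the sleep standing in for inference are hypothetical.

```python
import asyncio


async def run_with_limit(job_count: int, limit: int = 2) -> int:
    """Run `job_count` fake predictions, at most `limit` at a time.

    Returns the peak number of predictions that were in flight at once.
    """
    gate = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def predict() -> None:
        nonlocal in_flight, peak
        async with gate:               # wait here if `limit` jobs are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for GPU inference
            in_flight -= 1

    await asyncio.gather(*(predict() for _ in range(job_count)))
    return peak


if __name__ == "__main__":
    # Six requests arrive together; never more than two run concurrently.
    print(asyncio.run(run_with_limit(6, limit=2)))
```

If the platform routed more than two requests to one replica, the extra requests would only queue behind the gate, so setting the platform limit to the same value lets the load balancer send them to another GPU instead.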
Tests: 42 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Goal
Enable load-balancing the detector service across multiple on-demand GPUs while paying only for GPU time in use. The service is stateless and GPU-bound, so it fits serverless GPU platforms where the platform itself is the load balancer + autoscaler — no nginx/HAProxy needed.
The design priority is provider independence: one OCI image (built from the existing docker/dockerfile), serving plain HTTP, runs unchanged on RunPod, Cloud Run, or a VM. Each provider gets only a thin config file under deploy/.

Container portability (backward-compatible)
Defaults preserve current VM behavior, so the existing prod compose is unaffected.
- app/main.py: PORT/HOST/DEVICE/WORKERS are now read from env (Cloud Run injects $PORT); ${MODEL_BASE} in the model config is expanded at startup.
- app/model_config.json: /datasets/... → ${MODEL_BASE}/... (defaults to /datasets).
- app/models/yolo_ultralytics.py: the detector weight resolves through get_checkpoint_path, so it is URL-pullable like every other model (this was the only path that bypassed the resolver).
- docker/docker-compose.prod.yml: WORKERS default 4 → 1. Extra workers each load a full copy of all 5 models into the same VRAM (OOM risk) with no throughput gain, since the GPU executes serially. Scale via replicas, not workers.

MODEL_BASE: the one per-environment decision
Because all loaders resolve through get_checkpoint_path, weights can be a mounted volume (/datasets, /runpod-volume/models) or an https:// object-store prefix (fetched and cached at boot). The URL option is the most portable: identical config everywhere.

Deploy scaffold (deploy/)
- deploy/README.md: portability contract, the two universal load-balancing knobs (concurrency=2 matching MAX_CONCURRENT_PREDICTIONS; min/max replicas), and the cold-start caveat.
- deploy/cloudrun/service.yaml + deploy.sh: NVIDIA L4, min-instances=1 (warm baseline), concurrency=2, timeout=300.
- deploy/runpod/endpoint.json: HTTP load-balancing serverless endpoint (deliberately not the queue/handler model, which would be RunPod-specific code), activeWorkers=1, concurrencyPerWorker=2.

Cold-start note for reviewers
The service eagerly loads 5 models (90s health start_period), so a true cold start is tens of seconds. Both providers default to a warm baseline (min-instances / activeWorkers = 1): you pay for ~1 GPU continuously (an L4 ≈ $0.84/hr, far below a multi-GPU VM), and bursts autoscale across GPUs. Set min-instances / activeWorkers to 0 for true scale-to-zero if a cold first request is acceptable.

Testing
pytest tests/: 42 passed. ${MODEL_BASE} expansion was checked for both the default (/datasets) and an https:// prefix.

🤖 Generated with Claude Code
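The ${MODEL_BASE} check mentioned under Testing can be pictured with the sketch below. It is a minimal illustration under assumptions: `expand_model_base`, the config keys, and the example URL are hypothetical, and the real expansion in app/main.py may differ.

```python
import os
from typing import Any, Mapping, Optional

PLACEHOLDER = "${MODEL_BASE}"


def expand_model_base(config: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Return a copy of `config` with ${MODEL_BASE} replaced in every string.

    Falls back to /datasets when MODEL_BASE is unset, matching the default
    described in this PR. A trailing slash on the base is dropped so that
    joined paths do not contain a double slash.
    """
    env = os.environ if env is None else env
    base = env.get("MODEL_BASE", "/datasets").rstrip("/")

    def walk(value: Any) -> Any:
        if isinstance(value, str):
            return value.replace(PLACEHOLDER, base)
        if isinstance(value, dict):
            return {key: walk(item) for key, item in value.items()}
        if isinstance(value, list):
            return [walk(item) for item in value]
        return value  # numbers, booleans, None pass through unchanged

    return walk(config)


# Hypothetical config entry, for illustration only.
example = {"detector": {"checkpoint": "${MODEL_BASE}/weights/detector.pt"}}
```

Expanding after the JSON is parsed, rather than substituting into the raw text, avoids producing invalid JSON if the base value ever contains characters that need escaping.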