Distributed, Multi-GPU Inference Platform for Large Language Models
HydraServe is a distributed, multi-GPU inference platform for large language models (LLMs). It demonstrates how to serve, scale, and monitor foundation models across nodes using high-performance computing (HPC) principles and parameter-efficient fine-tuning (PEFT) techniques such as LoRA and QLoRA. Put simply, it is an end-to-end distributed LLM inference system that shows how production-grade orchestration, scheduling, and GPU-level optimization fit together in one platform.
When a client sends a text prompt (for example, "Explain photosynthesis"), the request flows through the HydraServe pipeline: a component-based, production-style architecture made up of several cooperating services. Just as OpenAI, Anthropic, and Gemini expose API gateways as the entry point to their LLM infrastructure, HydraServe's Gateway is the front door: it authenticates callers, routes requests, and streams responses back to clients. Behind the Gateway, a distributed system orchestrates GPU resources, manages model inference, and optimizes performance across multiple nodes.
- Multi-Node Scaling: Seamlessly scale inference across multiple GPU nodes
- Intelligent Scheduling: Advanced GPU allocation with round-robin, least-loaded, and consistent hashing strategies
- Dynamic Load Balancing: Automatic request distribution based on real-time GPU utilization
- Micro-batching: Group compatible requests to maximize GPU throughput
- vLLM Integration: PagedAttention and continuous batching for high throughput (target: >1000 tokens/sec per GPU)
- Multi-GPU Inference: Tensor and pipeline parallelism for models up to 405B parameters
- TensorRT-LLM Support: Kernel fusion and FP8/INT4 quantization for ultra-low latency
- KV-Cache Reuse: Distributed attention cache for cross-session state sharing and faster time-to-first-token
- Token Streaming: Real-time streaming responses via HTTP SSE and gRPC
- LoRA Hot-Swapping: Switch between fine-tuned adapters without model reloads
- QLoRA Quantization: Memory-efficient 4-bit/8-bit quantization for higher model density
- Multiple Models: Serve Mistral, LLaMA, Falcon, and any HuggingFace-compatible model
- Adapter Caching: Intelligent caching of frequently-used LoRA adapters
- Kubernetes Operator: Declarative model deployment with Custom Resource Definitions (CRDs)
- Auto-Scaling: GPU-aware horizontal autoscaling with KEDA integration
- Multi-Cloud Support: Deploy on GKE, EKS, AKS, or on-premises Kubernetes
- Zero-Downtime Updates: Rolling deployments and health-check based traffic management
- Authentication: JWT and API key-based authentication
- Rate Limiting: Per-tenant request throttling and quota management
- mTLS: Secure service-to-service communication
- RBAC Integration: Kubernetes role-based access control for model governance
- Distributed Tracing: End-to-end request tracking with OpenTelemetry
- Real-Time Metrics: Prometheus metrics for latency (p50/p99), throughput, and GPU utilization
- Grafana Dashboards: Pre-built dashboards for system health and performance analysis
- Structured Logging: Correlation ID tracking across all microservices
- gRPC-First Design: Efficient binary protocol with automatic client generation
- Mock Mode: Test without GPUs for rapid development
- Comprehensive Testing: Unit, integration, and load testing frameworks included
- Infrastructure as Code: Terraform modules and Helm charts for reproducible deployments
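
Two of the scheduling strategies named in the list above, round-robin and least-loaded, can be illustrated with a short sketch. This is not HydraServe's scheduler code (the scheduler service is written in Go); `GpuWorker`, `RoundRobinPicker`, and `LeastLoadedPicker` are hypothetical names, and "load" is assumed to mean the number of in-flight requests.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class GpuWorker:
    """Hypothetical view of one GPU worker as a scheduler might track it."""
    name: str
    in_flight: int = 0  # assumed load signal: requests currently assigned


class RoundRobinPicker:
    """Cycle through workers in a fixed order, ignoring load."""

    def __init__(self, workers):
        self._workers = list(workers)
        self._next = count()

    def pick(self):
        return self._workers[next(self._next) % len(self._workers)]


class LeastLoadedPicker:
    """Pick the worker with the fewest in-flight requests (ties: first listed)."""

    def __init__(self, workers):
        self._workers = list(workers)

    def pick(self):
        return min(self._workers, key=lambda w: w.in_flight)


workers = [
    GpuWorker("gpu-0", in_flight=3),
    GpuWorker("gpu-1", in_flight=1),
    GpuWorker("gpu-2", in_flight=2),
]
rr = RoundRobinPicker(workers)
print([rr.pick().name for _ in range(4)])      # ['gpu-0', 'gpu-1', 'gpu-2', 'gpu-0']
print(LeastLoadedPicker(workers).pick().name)  # gpu-1
```

Round-robin needs no load information but can overload a slow worker; least-loaded depends on a fresh load signal, which is presumably where the real-time GPU utilization mentioned above would come in.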
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────┐
│              Gateway (Go)               │
│  • HTTP/gRPC API                        │
│  • Authentication & Routing             │
│  • Token Streaming                      │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│             Scheduler (Go)              │
│  • GPU Assignment                       │
│  • Micro-batching                       │
│  • Load Balancing                       │
└─────────────┬───────────────────────────┘
              │
       ┌──────┴──────┐
       ▼             ▼
┌─────────────┐ ┌─────────────┐
│ Inference   │ │ KV-Cache    │
│ Engine (Py) │◄┤ Service     │
│ • vLLM      │ │ (Go/Python) │
│ • TensorRT  │ │ • Redis     │
│ • LoRA/     │ │ • Sharding  │
│   QLoRA     │ │ • Eviction  │
└─────────────┘ └─────────────┘
       │
       ▼
┌─────────────────────────────────────────┐
│      Observability (OpenTelemetry)      │
│  • Prometheus   • Grafana   • OTEL      │
└─────────────────────────────────────────┘
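
The Scheduler's micro-batching step in the diagram above can be sketched roughly as follows. The sketch assumes that two requests are "compatible" when they target the same model and the same LoRA adapter; that criterion, the `Request` type, and `max_batch_size` are illustrative assumptions rather than HydraServe's actual rules. A real scheduler would likely also bound how long a request may wait for its batch to fill.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical inference request as seen by the scheduler."""
    request_id: str
    model: str
    adapter: Optional[str] = None  # LoRA adapter name, if any


def micro_batches(requests, max_batch_size=4):
    """Group requests by (model, adapter), then split each group into chunks."""
    groups = defaultdict(list)
    for req in requests:
        groups[(req.model, req.adapter)].append(req)

    batches = []
    for group in groups.values():  # dicts keep insertion order
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


pending = [
    Request("r1", "mistralai/Mistral-7B-v0.1"),
    Request("r2", "mistralai/Mistral-7B-v0.1", adapter="tenant-a"),
    Request("r3", "mistralai/Mistral-7B-v0.1"),
]
for batch in micro_batches(pending, max_batch_size=2):
    print([r.request_id for r in batch])
# ['r1', 'r3']
# ['r2']
```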
- Distributed Inference Architecture - Multi-node GPU orchestration
- Observability & Operational Excellence - OTEL tracing, metrics, autoscaling
- Kernel & Serving Optimization - vLLM, TensorRT-LLM, TGI with IR compilation
- PEFT & Quantized Serving - LoRA hot-swap, QLoRA (int4/int8)
- Cloud-Native Control Plane - Kubernetes Operator with CRDs
- Backend & Distributed Systems - Go microservices with gRPC
- HPC-Level Scaling - KV-cache sharding, RDMA/NCCL support
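
KV-cache sharding and the scheduler's consistent hashing strategy both rest on the same idea: map a key (for example, a session ID) to a node so that adding or removing a node moves only a small share of the keys. The ring below is a minimal, generic sketch of that technique; the class name, the virtual-node count, and the choice of SHA-256 are assumptions, not details taken from HydraServe.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # First 8 bytes of SHA-256 as an unsigned integer.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key):
        """Return the node owning the first ring position at or after hash(key)."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-0", "cache-1", "cache-2"])
owner = ring.lookup("session-42")
ring.remove("cache-2" if owner != "cache-2" else "cache-1")
assert ring.lookup("session-42") == owner  # keys on surviving nodes stay put
```

Removing a node reassigns only the keys that node owned, which is what keeps cached attention state on the remaining nodes usable after a scale-down.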
hydra_serve/
├── services/
│ ├── gateway/ # API entry point (Go)
│ ├── scheduler/ # GPU allocation & batching (Go)
│ ├── inference/ # Model runtime (Python)
│ └── kvcache/ # Distributed cache (Go/Python)
├── operator/ # Kubernetes Operator (Go)
├── infra/
│ ├── helm/ # Kubernetes deployment charts
│ └── terraform/ # Cloud infrastructure (GKE/EKS/AKS)
├── obs/ # Monitoring dashboards
├── proto/ # gRPC service definitions
├── scripts/ # Developer tools & benchmarks
├── ci/ # GitHub Actions workflows
└── docs/ # Architecture & API documentation
| Layer | Technologies |
|---|---|
| Gateway | Go, gRPC, REST |
| Scheduler | Go, gRPC, consistent hashing |
| Inference | Python, FastAPI, vLLM, TensorRT-LLM, TGI, PEFT |
| KV-Cache | Go/Python, Redis, distributed caching |
| Operator | Go, Kubebuilder, CRDs |
| Infrastructure | Kubernetes, Helm, Terraform, KEDA |
| Observability | OpenTelemetry, Prometheus, Grafana |
| CI/CD | GitHub Actions, Makefile |
- Kubernetes cluster (GKE/EKS/AKS or local with kind/minikube)
- GPU nodes with NVIDIA drivers and container runtime
- Go 1.21+ (for building services and operator)
- Python 3.10+ (for inference engine)
- Helm 3+ (for deployment)
- Terraform (optional, for cloud provisioning)
# Build all services
make build
# Run tests
make test
# Deploy to local Kubernetes
make deploy-local
# Run observability stack
make obs-up

# Provision infrastructure (GKE example)
cd infra/terraform/gke
terraform init
terraform apply
# Deploy HydraServe
helm install hydraserve infra/helm/hydraserve \
--namespace hydraserve \
--create-namespace
# Access dashboard
kubectl port-forward -n hydraserve svc/grafana 3000:3000

import requests

# Send inference request
response = requests.post(
    "http://gateway:8080/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "Explain photosynthesis in simple terms:",
        "max_tokens": 200,
        "temperature": 0.7,
        "stream": True,
    },
    stream=True,
)

# Stream tokens
for chunk in response.iter_content(chunk_size=None):
    print(chunk.decode(), end="", flush=True)

- Mistral, LLaMA, Falcon, and any HuggingFace model
- LoRA adapter hot-swapping for multi-tenant serving
- QLoRA quantization (4-bit/8-bit) for memory efficiency
- vLLM for paged attention and continuous batching
- TensorRT-LLM for kernel fusion and FP8/INT4 quantization
- ONNX → TensorRT IR compilation pipeline
- KV-cache reuse across sessions
- Kubernetes Operator for declarative model management
- Horizontal autoscaling with KEDA (GPU-aware)
- Rolling updates with zero downtime
- Multi-cloud portability (GKE, EKS, AKS)
- Distributed tracing with OpenTelemetry
- Real-time metrics (latency, throughput, GPU utilization)
- Grafana dashboards for end-to-end visibility
- Cache hit-rate monitoring
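
Adapter hot-swapping and cache hit-rate monitoring, both listed above, can be pictured together as a small least-recently-used (LRU) cache that keeps a bounded number of adapters resident and counts hits and misses. This is a sketch under assumptions: `AdapterCache` and the `loader` callback are hypothetical, and LRU is only one plausible eviction policy.

```python
from collections import OrderedDict


class AdapterCache:
    """LRU cache for loaded LoRA adapters (illustrative sketch)."""

    def __init__(self, capacity, loader):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._loader = loader           # hypothetical callable: adapter_id -> weights
        self._entries = OrderedDict()   # least recently used entry first
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id):
        if adapter_id in self._entries:
            self._entries.move_to_end(adapter_id)  # mark as most recently used
            self.hits += 1
            return self._entries[adapter_id]
        self.misses += 1
        weights = self._loader(adapter_id)
        self._entries[adapter_id] = weights
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)      # evict least recently used
        return weights

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = AdapterCache(capacity=2, loader=lambda name: f"weights-for-{name}")
for name in ["tenant-a", "tenant-b", "tenant-a", "tenant-c", "tenant-b"]:
    cache.get(name)
print(cache.hits, cache.misses, cache.hit_rate())  # 1 4 0.2
```

In a real deployment the `hit_rate()` value would more likely be exported as a metric than printed.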
| Metric | Target |
|---|---|
| Latency (p50) | < 50ms time-to-first-token |
| Latency (p99) | < 200ms time-to-first-token |
| Throughput | > 1000 tokens/sec per GPU |
| GPU Utilization | > 85% |
| Cache Hit Rate | > 70% |
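
As an illustration of how the latency targets in the table could be checked against measured samples, the sketch below computes p50 and p99 with the nearest-rank method and compares them with the 50 ms and 200 ms time-to-first-token targets. The function names and the sample values are made up for the example; in practice these percentiles would more likely come from Prometheus histograms than from raw sample lists.

```python
import math

# Time-to-first-token targets from the table above, in milliseconds.
TTFT_TARGETS_MS = {50: 50.0, 99: 200.0}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with >= pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_ttft(samples_ms):
    """Return {"p50": bool, "p99": bool}; True means the target is met."""
    return {
        f"p{pct}": percentile(samples_ms, pct) < limit
        for pct, limit in TTFT_TARGETS_MS.items()
    }


samples_ms = [10, 20, 30, 250]  # made-up measurements
print(percentile(samples_ms, 50), percentile(samples_ms, 99))  # 20 250
print(check_ttft(samples_ms))  # {'p50': True, 'p99': False}
```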
# Build gateway (subshells keep each cd from affecting the next command)
(cd services/gateway && go build -o bin/gateway ./cmd/gateway)
# Build scheduler
(cd services/scheduler && go build -o bin/scheduler ./cmd/scheduler)
# Build inference engine (install Python dependencies)
(cd services/inference && pip install -r requirements.txt)

# Start Redis (for KV-cache)
docker run -d -p 6379:6379 redis:alpine
# Start services
./scripts/run-local.sh

# Run unit tests
make test
# Run integration tests
make test-integration
# Run load tests
./scripts/benchmark.sh

See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Built with ❤️ for high-performance distributed inference
