dgatlin/hydra_serve

🧠 HydraServe

Distributed, Multi-GPU Inference Platform for Large Language Models

HydraServe is a distributed, multi-GPU inference platform for large language models (LLMs). It demonstrates how to serve, scale, and monitor foundation models across nodes using high-performance computing (HPC) principles and parameter-efficient fine-tuning (PEFT) techniques such as LoRA and QLoRA. In short, it is an end-to-end distributed LLM inference system that shows how production-grade orchestration, scheduling, and GPU-level optimization fit together in one platform.

🚀 The Core Idea

When a client sends a text prompt (for example, "Explain photosynthesis"), it flows through the HydraServe pipeline: a component-based, production-style architecture made up of several cooperating services. Just as OpenAI, Anthropic, and Google (Gemini) expose API gateways as the entry point to their LLM infrastructure, HydraServe's Gateway is the front door. It handles authentication, routes requests, and streams responses back to clients. Behind the gateway, a distributed system orchestrates GPU resources, manages model inference, and optimizes performance across multiple nodes.

✨ Key Features

🎯 Distributed GPU Orchestration

  • Multi-Node Scaling: Seamlessly scale inference across multiple GPU nodes
  • Intelligent Scheduling: Advanced GPU allocation with round-robin, least-loaded, and consistent hashing strategies
  • Dynamic Load Balancing: Automatic request distribution based on real-time GPU utilization
  • Micro-batching: Group compatible requests to maximize GPU throughput
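The three scheduling strategies named above can be sketched in a few lines. This is only an illustration of the techniques, not HydraServe's scheduler (which this README describes as a Go service); the class names, the in-flight `load` counters, and the idea of hashing on a session key are assumptions made for the example.

```python
import bisect
import hashlib
import itertools


class RoundRobin:
    """Cycle through workers in a fixed order."""

    def __init__(self, workers):
        self._cycle = itertools.cycle(list(workers))

    def pick(self, key=None):
        return next(self._cycle)


class LeastLoaded:
    """Pick the worker with the fewest in-flight requests."""

    def __init__(self, workers):
        self.load = {w: 0 for w in workers}  # hypothetical in-flight counters

    def pick(self, key=None):
        worker = min(self.load, key=self.load.get)
        self.load[worker] += 1
        return worker

    def release(self, worker):
        self.load[worker] -= 1


class ConsistentHash:
    """Map a key to a worker on a hash ring with virtual nodes."""

    def __init__(self, workers, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Consistent hashing is the natural choice when requests from one session should keep landing on the same worker (for example, to reuse cached state), because removing a worker only remaps the keys that were assigned to it.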

⚡ High-Performance Inference

  • vLLM Integration: PagedAttention and continuous batching for high throughput (target: >1000 tokens/sec per GPU)
  • Multi-GPU Inference: Tensor and pipeline parallelism for models up to 405B parameters
  • TensorRT-LLM Support: Kernel fusion and FP8/INT4 quantization for ultra-low latency
  • KV-Cache Reuse: Distributed attention cache for cross-session state sharing and faster time-to-first-token
  • Token Streaming: Real-time streaming responses via HTTP SSE and gRPC

🔧 Multi-Tenant Model Serving

  • LoRA Hot-Swapping: Switch between fine-tuned adapters without model reloads
  • QLoRA Quantization: Memory-efficient 4-bit/8-bit quantization for higher model density
  • Multiple Models: Serve Mistral, LLaMA, Falcon, and any HuggingFace-compatible model
  • Adapter Caching: Intelligent caching of frequently-used LoRA adapters
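Adapter caching can be pictured as a small least-recently-used (LRU) cache in front of whatever loads adapter weights. The sketch below assumes an LRU policy and a hypothetical `loader` callback; the README does not state which eviction policy HydraServe actually uses.

```python
from collections import OrderedDict


class AdapterCache:
    """Keep the most recently used adapters in memory, evicting the oldest."""

    def __init__(self, loader, capacity=4):
        self._loader = loader      # hypothetical: adapter_id -> adapter weights
        self._capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)  # mark as most recently used
            self.hits += 1
            return self._cache[adapter_id]
        self.misses += 1
        adapter = self._loader(adapter_id)
        self._cache[adapter_id] = adapter
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)      # evict least recently used
        return adapter
```

The `hits` and `misses` counters are there so a hit rate can be exported as a metric alongside the other cache statistics.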

☁️ Cloud-Native & Production-Ready

  • Kubernetes Operator: Declarative model deployment with Custom Resource Definitions (CRDs)
  • Auto-Scaling: GPU-aware horizontal autoscaling with KEDA integration
  • Multi-Cloud Support: Deploy on GKE, EKS, AKS, or on-premises Kubernetes
  • Zero-Downtime Updates: Rolling deployments and health-check based traffic management

🔐 Enterprise Security & Governance

  • Authentication: JWT and API key-based authentication
  • Rate Limiting: Per-tenant request throttling and quota management
  • mTLS: Secure service-to-service communication
  • RBAC Integration: Kubernetes role-based access control for model governance
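Per-tenant throttling is commonly implemented with one token bucket per tenant. The following is a minimal sketch of that technique, assuming a token-bucket algorithm and an injectable clock for testing; `TokenBucket`, `TenantLimiter`, and their parameters are hypothetical names, not HydraServe APIs.

```python
import time


class TokenBucket:
    """Refill at `rate` tokens per second, up to `burst` tokens."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Add tokens earned since the last call, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """Lazily create one bucket per tenant so tenants are throttled independently."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self._rate, self._burst, self._clock = rate, burst, clock
        self._buckets = {}

    def allow(self, tenant_id):
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

A gateway would typically call `allow(tenant_id)` before forwarding a request and return HTTP 429 when it is `False`.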

📊 Observability & Monitoring

  • Distributed Tracing: End-to-end request tracking with OpenTelemetry
  • Real-Time Metrics: Prometheus metrics for latency (p50/p99), throughput, and GPU utilization
  • Grafana Dashboards: Pre-built dashboards for system health and performance analysis
  • Structured Logging: Correlation ID tracking across all microservices
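Correlation ID tracking usually means attaching one request-scoped ID to every log line a service emits. Here is a small sketch using only the Python standard library; the JSON field names and the `handle_request` helper are assumptions for illustration, and the Go services would need an equivalent mechanism.

```python
import contextvars
import json
import logging
import uuid

# Request-scoped correlation ID; "-" means "outside any request".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        return correlation_id.get()
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
```

To use it, attach `JsonFormatter()` to a `logging.StreamHandler` and pass the returned ID along to downstream services so their log lines can be joined on it.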

🏗️ Developer Experience

  • gRPC-First Design: Efficient binary protocol with automatic client generation
  • Mock Mode: Test without GPUs for rapid development
  • Comprehensive Testing: Unit, integration, and load testing frameworks included
  • Infrastructure as Code: Terraform modules and Helm charts for reproducible deployments

🚀 Core Architecture

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────┐
│  Gateway (Go)                           │
│  • HTTP/gRPC API                        │
│  • Authentication & Routing             │
│  • Token Streaming                      │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│  Scheduler (Go)                         │
│  • GPU Assignment                       │
│  • Micro-batching                       │
│  • Load Balancing                       │
└─────────────┬───────────────────────────┘
              │
       ┌──────┴──────┐
       ▼             ▼
┌─────────────┐ ┌─────────────┐
│ Inference   │ │ KV-Cache    │
│ Engine (Py) │◄┤ Service     │
│ • vLLM      │ │ (Go/Python) │
│ • TensorRT  │ │ • Redis     │
│ • LoRA/     │ │ • Sharding  │
│   QLoRA     │ │ • Eviction  │
└─────────────┘ └─────────────┘
       │
       ▼
┌─────────────────────────────────────────┐
│  Observability (OpenTelemetry)          │
│  • Prometheus • Grafana • OTEL          │
└─────────────────────────────────────────┘
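The Scheduler box lists micro-batching, which groups compatible requests before they reach the inference engine. A rough sketch of the idea follows, assuming that "compatible" means "same base model and same adapter" and that batches are capped at a fixed size; the `Request` fields and `micro_batches` function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    request_id: str
    model: str
    adapter: Optional[str] = None  # LoRA adapter name, if any


def micro_batches(pending, max_batch_size=8):
    """Group pending requests by (model, adapter), then split into capped batches."""
    groups = defaultdict(list)
    for req in pending:
        groups[(req.model, req.adapter)].append(req)

    batches = []
    for reqs in groups.values():
        for i in range(0, len(reqs), max_batch_size):
            batches.append(reqs[i:i + max_batch_size])
    return batches
```

A real scheduler would also bound how long a request may wait for its batch to fill, which this sketch leaves out.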

🔩 Design Pillars

  1. Distributed Inference Architecture - Multi-node GPU orchestration
  2. Observability & Operational Excellence - OTEL tracing, metrics, autoscaling
  3. Kernel & Serving Optimization - vLLM, TensorRT-LLM, TGI with IR compilation
  4. PEFT & Quantized Serving - LoRA hot-swap, QLoRA (int4/int8)
  5. Cloud-Native Control Plane - Kubernetes Operator with CRDs
  6. Backend & Distributed Systems - Go microservices with gRPC
  7. HPC-Level Scaling - KV-cache sharding, RDMA/NCCL support

📁 Project Structure

hydra_serve/
├── services/
│   ├── gateway/          # API entry point (Go)
│   ├── scheduler/        # GPU allocation & batching (Go)
│   ├── inference/        # Model runtime (Python)
│   └── kvcache/          # Distributed cache (Go/Python)
├── operator/             # Kubernetes Operator (Go)
├── infra/
│   ├── helm/             # Kubernetes deployment charts
│   └── terraform/        # Cloud infrastructure (GKE/EKS/AKS)
├── obs/                  # Monitoring dashboards
├── proto/                # gRPC service definitions
├── scripts/              # Developer tools & benchmarks
├── ci/                   # GitHub Actions workflows
└── docs/                 # Architecture & API documentation

⚙️ Tech Stack

| Layer          | Technologies                                      |
|----------------|---------------------------------------------------|
| Gateway        | Go, gRPC, REST                                    |
| Scheduler      | Go, gRPC, consistent hashing                      |
| Inference      | Python, FastAPI, vLLM, TensorRT-LLM, TGI, PEFT    |
| KV-Cache       | Go/Python, Redis, distributed caching             |
| Operator       | Go, Kubebuilder, CRDs                             |
| Infrastructure | Kubernetes, Helm, Terraform, KEDA                 |
| Observability  | OpenTelemetry, Prometheus, Grafana                |
| CI/CD          | GitHub Actions, Makefile                          |

🚦 Quick Start

Prerequisites

  • Kubernetes cluster (GKE/EKS/AKS or local with kind/minikube)
  • GPU nodes with NVIDIA drivers and container runtime
  • Go 1.21+ (for building services and operator)
  • Python 3.10+ (for inference engine)
  • Helm 3+ (for deployment)
  • Terraform (optional, for cloud provisioning)

Local Development

# Build all services
make build

# Run tests
make test

# Deploy to local Kubernetes
make deploy-local

# Run observability stack
make obs-up

Cloud Deployment

# Provision infrastructure (GKE example)
cd infra/terraform/gke
terraform init
terraform apply

# Deploy HydraServe
helm install hydraserve infra/helm/hydraserve \
  --namespace hydraserve \
  --create-namespace

# Access the Grafana dashboard at http://localhost:3000
kubectl port-forward -n hydraserve svc/grafana 3000:3000

🧪 Example Usage

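import requests

# Send a streaming inference request to the gateway
response = requests.post(
    "http://gateway:8080/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "Explain photosynthesis in simple terms:",
        "max_tokens": 200,
        "temperature": 0.7,
        "stream": True,
    },
    stream=True,  # read the response body incrementally
    timeout=60,   # arbitrary example value; tune for your deployment
)
response.raise_for_status()  # fail fast on HTTP errors
response.encoding = "utf-8"  # assumes the gateway emits UTF-8 text

# Stream tokens as they arrive; decode_unicode keeps multi-byte
# characters intact when they are split across chunks
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)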
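Because the gateway streams over HTTP SSE, a client may want the event payloads rather than the raw bytes. The helper below is a sketch of parsing the standard `data:` framing from server-sent events; it assumes the gateway follows that framing and handles only `data` fields and comments. The function name `iter_sse_data` is hypothetical.

```python
def iter_sse_data(lines):
    """Yield the data payload of each server-sent event from an iterable of text lines."""
    buffer = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An unterminated trailing event is dropped, as in the SSE processing model.
```

With `requests`, the lines could come from `response.iter_lines(decode_unicode=True)` after setting `response.encoding`, so that the helper receives text rather than bytes.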

🎯 Feature Summary

✅ Multi-Model Support

  • Mistral, LLaMA, Falcon, and any HuggingFace model
  • LoRA adapter hot-swapping for multi-tenant serving
  • QLoRA quantization (4-bit/8-bit) for memory efficiency

✅ Performance Optimization

  • vLLM for paged attention and continuous batching
  • TensorRT-LLM for kernel fusion and FP8/INT4 quantization
  • ONNX → TensorRT IR compilation pipeline
  • KV-cache reuse across sessions

✅ Production-Grade Operations

  • Kubernetes Operator for declarative model management
  • Horizontal autoscaling with KEDA (GPU-aware)
  • Rolling updates with zero downtime
  • Multi-cloud portability (GKE, EKS, AKS)

✅ Observability

  • Distributed tracing with OpenTelemetry
  • Real-time metrics (latency, throughput, GPU utilization)
  • Grafana dashboards for end-to-end visibility
  • Cache hit-rate monitoring

📊 Performance Characteristics

| Metric          | Target                          |
|-----------------|---------------------------------|
| Latency (p50)   | < 50 ms time-to-first-token     |
| Latency (p99)   | < 200 ms time-to-first-token    |
| Throughput      | > 1000 tokens/sec per GPU       |
| GPU Utilization | > 85%                           |
| Cache Hit Rate  | > 70%                           |
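To compare a benchmark run against the latency targets, the p50 and p99 values have to be computed from raw time-to-first-token samples. This sketch uses the nearest-rank percentile definition, which is one of several common definitions; the function names and the `TARGETS_MS` mapping are assumptions for illustration, with the limits copied from the table above.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Time-to-first-token targets in milliseconds: p50 < 50 ms, p99 < 200 ms.
TARGETS_MS = {50: 50.0, 99: 200.0}


def check_ttft(samples_ms):
    """Return {"p50": (value, meets_target), "p99": (value, meets_target)}."""
    report = {}
    for pct, limit in TARGETS_MS.items():
        value = percentile(samples_ms, pct)
        report[f"p{pct}"] = (value, value < limit)
    return report
```

Feeding it the per-request latencies from a load test gives a quick pass/fail view for each target.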

🛠️ Development

Building Services

# Run from the repository root; subshells keep the working directory unchanged

# Build gateway
(cd services/gateway && go build -o bin/gateway ./cmd/gateway)

# Build scheduler
(cd services/scheduler && go build -o bin/scheduler ./cmd/scheduler)

# Install inference engine dependencies
(cd services/inference && pip install -r requirements.txt)

Running Locally

# Start Redis (for KV-cache)
docker run -d -p 6379:6379 redis:alpine

# Start services
./scripts/run-local.sh

Testing

# Run unit tests
make test

# Run integration tests
make test-integration

# Run load tests
./scripts/benchmark.sh

📚 Documentation

Architecture and API documentation lives in docs/.

🤝 Contributing

See CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE for details.


Built with ❤️ for high-performance distributed inference
