MLOps-focused recommendation pipeline: distributed training, experiment tracking, containerized serving.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Download the Amazon Reviews dataset (JSONL or CSV format) and place it in the data/ directory:
data/
├── All_Beauty.jsonl # Small dataset for testing
└── Electronics.csv.gz # Large dataset for benchmarking
Supports JSONL, CSV, and CSV.GZ formats. Two split strategies: temporal (default) or leave-one-out.
# Basic (temporal split)
python data/preprocess.py --input data/All_Beauty.jsonl --output-dir data/processed
# With custom thresholds
python data/preprocess.py --input data/All_Beauty.jsonl --output-dir data/processed --min-user 3 --min-item 3
# Leave-one-out split
python data/preprocess.py --input data/All_Beauty.jsonl --output-dir data/processed --split-strategy leave-one-out
# Large CSV dataset
python data/preprocess.py --input data/Electronics.csv.gz --output-dir data/processed_electronics --min-user 3 --min-item 3

Preprocessing runs are tracked in MLflow under the preprocessing experiment.
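The two split strategies differ in which interactions are held out. A minimal illustration of each (not the actual preprocess.py code; the user_id and timestamp column names are assumptions):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, test_frac: float = 0.2):
    """Hold out the most recent fraction of interactions overall."""
    df = df.sort_values("timestamp")
    cutoff = int(len(df) * (1 - test_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

def leave_one_out_split(df: pd.DataFrame):
    """Hold out each user's single most recent interaction."""
    df = df.sort_values("timestamp")
    test = df.groupby("user_id").tail(1)   # last interaction per user
    train = df.drop(test.index)
    return train, test
```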
Train on a single process:

python training/train_single.py --data-dir data/processed --epochs 10

Or launch distributed training with PyTorch DDP (two processes):

torchrun --nproc_per_node=2 training/train_ddp.py --data-dir data/processed --epochs 10

For a realistic CPU-bound workload, the NCF model is upgraded to a NeuMF-style architecture (user/item embeddings + MLP head; sketched after the comparison below). On the preprocessed data/processed_electronics_10m subset:
- Compared to single-process NeuMF training, 2-process DDP (gloo, CPU) delivers roughly 1.2–1.3× higher throughput.
- Both runs are tracked in the ncf-training MLflow experiment, with training_mode=single vs. training_mode=ddp and world_size=2.
- A small helper script prints the comparison table:
python training/compare_runs.py --experiment ncf-training --last 2 --markdown

The output can be pasted directly into this README to document the exact numbers you measured on your machine.
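For reference, a minimal sketch of a NeuMF-style scorer as described above (user/item embeddings feeding an MLP head); layer sizes are illustrative and this is not the code in models/ncf.py:

```python
import torch
import torch.nn as nn

class NeuMFSketch(nn.Module):
    """Illustrative NeuMF-style scorer: concatenated user/item embeddings -> MLP -> score."""

    def __init__(self, n_users: int, n_items: int, emb_dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)  # one score per (user, item) pair
```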
Run the baselines for comparison:

python models/baselines.py --data-dir data/processed --baseline all

Evaluates popularity, random, and matrix factorization baselines. Results are logged to the MLflow baselines experiment.
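The popularity baseline is simple enough to sketch in a few lines (illustrative only, not the baselines.py implementation; the item_id column name is an assumption):

```python
import pandas as pd

def popularity_recommend(train: pd.DataFrame, top_k: int = 5) -> list:
    """Recommend the globally most-interacted items to every user."""
    return train["item_id"].value_counts().head(top_k).index.tolist()
```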
Evaluate a trained model:

python evaluation/evaluate.py --model-path outputs/model.pt --data-dir data/processed

Evaluation reports Recall@K and NDCG@K (see evaluation/metrics.py).
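For reference, minimal per-user definitions of these metrics (a sketch, not evaluation/metrics.py itself; ranked_items is the model's top-K list and relevant is the held-out ground truth):

```python
import math

def recall_at_k(ranked_items: list, relevant: set, k: int) -> float:
    """Fraction of the user's relevant items that appear in the top-k recommendations."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_k(ranked_items: list, relevant: set, k: int) -> float:
    """Discounted gain of hits in the top-k list, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```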
Run hyperparameter search across preprocessing and training configs:

python experiments/sweep.py --input data/All_Beauty.jsonl --config experiments/config.yaml

Features:
- YAML-based config for preprocessing and training hyperparameters
- Uses DDP by default for faster iteration
- Resource monitoring (auto-adjusts based on CPU/memory; see the sketch after this list)
- Cleans up processed data after each config (reproducible from params)
- All runs tagged with config_id for MLflow lineage tracking
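A minimal sketch of the kind of CPU/memory check the resource monitor can perform with psutil (illustrative thresholds, not the actual experiments/resource_monitor.py logic):

```python
import psutil

def system_overloaded(cpu_limit: float = 90.0, mem_limit: float = 85.0) -> bool:
    """Return True if CPU or memory utilization exceeds the given percentage thresholds."""
    cpu = psutil.cpu_percent(interval=1.0)   # average CPU utilization over 1 second
    mem = psutil.virtual_memory().percent    # current memory utilization
    return cpu > cpu_limit or mem > mem_limit
```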
Modes:
# Full sweep (preprocess + train for each config)
python experiments/sweep.py --input data/All_Beauty.jsonl --config experiments/config.yaml
# Ablation only (on existing preprocessed data)
python experiments/sweep.py --data-dir data/processed --config experiments/config.yaml

Browse tracked runs in the MLflow UI:

mlflow ui # http://127.0.0.1:5000

Start the FastAPI serving app and query it:

uvicorn serving.app:app --port 8000
curl http://localhost:8000/health
curl -X POST http://localhost:8000/recommend -H "Content-Type: application/json" -d '{"user_id": "USER_ID", "top_k": 5}'
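The request and response handled above are implemented in serving/app.py. A stripped-down sketch of such a FastAPI endpoint; model loading and real item scoring are omitted, so the placeholder response is illustrative only:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RecommendRequest(BaseModel):
    user_id: str
    top_k: int = 5

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/recommend")
def recommend(req: RecommendRequest) -> dict:
    # The real service scores candidate items for req.user_id with the trained model;
    # here we just return a placeholder list of the requested length.
    return {"user_id": req.user_id, "items": ["item"] * req.top_k}
```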
Build and run the serving image with Docker:

# Build
docker build -f docker/Dockerfile.serving -t recommendation-api .
# Run (mount data and model)
docker run -p 8000:8000 \
-v $(pwd)/data/processed:/app/data/processed \
-v $(pwd)/outputs:/app/outputs \
recommendation-api

Build and run the training image:

docker build -f docker/Dockerfile.training -t recommendation-training .
docker run \
-v $(pwd)/data/processed:/app/data/processed \
-v $(pwd)/outputs:/app/outputs \
recommendation-training

Prerequisites for the Minikube deployment: minikube and kubectl installed, Docker Desktop running, and an existing outputs/model.pt.
# 1. Start cluster
minikube start --driver=docker
# 2. Load image into Minikube (no registry needed)
minikube image load recommendation-api:latest
# 3. Mount local data into the Minikube VM — run each in a separate terminal, keep them running
minikube mount $(pwd)/outputs:/mnt/outputs
minikube mount $(pwd)/data/processed:/mnt/data/processed
# 4. Deploy
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
# 5. Wait for pod to be ready
kubectl get pods -w # wait for READY 1/1
# 6. Get service URL and test
minikube service recommendation-api --url
curl <URL>/health
curl -X POST <URL>/recommend \
-H "Content-Type: application/json" \
-d '{"user_id": "USER_ID", "top_k": 5}'# Enable metrics-server for HPA
minikube addons enable metrics-server
kubectl apply -f k8s/hpa.yaml
kubectl get hpa
# Manual scaling demo (scales 1 → 3 → 1 and verifies each pod)
bash k8s/test_scaling.sh
# Full smoke test (pods, /health, /recommend, HPA)
bash k8s/smoke_test.sh

Project structure:

├── data/
│ ├── preprocess.py # Raw data → parquet splits
│ └── dataset.py # PyTorch Dataset with negative sampling
├── models/
│ ├── ncf.py # Neural Collaborative Filtering
│ └── baselines.py # Popularity, Random, MF baselines
├── training/
│ ├── train_single.py # Single-node + MLflow
│ └── train_ddp.py # Distributed (PyTorch DDP)
├── evaluation/
│ ├── metrics.py # Recall@K, NDCG@K
│ └── evaluate.py # Model evaluation
├── experiments/
│ ├── sweep.py # Automated hyperparameter sweep
│ ├── resource_monitor.py # CPU/memory monitoring
│ └── config.yaml # Experiment configurations
├── serving/
│ └── app.py # FastAPI inference
├── docker/
│ ├── Dockerfile.serving
│ └── Dockerfile.training
└── k8s/
├── deployment.yaml # Pod spec, resource limits, health probes
├── service.yaml # NodePort service (port 30800)
├── hpa.yaml # Auto-scale 1–5 replicas at 70% CPU
├── test_scaling.sh # Manual scale-up/down demo
└── smoke_test.sh # Full deployment validation
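data/dataset.py (see the tree above) pairs each observed interaction with sampled negatives for pairwise training. A minimal sketch of that idea, assuming integer-encoded user and item IDs; not the actual implementation:

```python
import random
import torch
from torch.utils.data import Dataset

class InteractionsWithNegatives(Dataset):
    """Yields (user, positive_item, negative_item) triples for pairwise ranking losses."""

    def __init__(self, interactions: list[tuple[int, int]], n_items: int):
        self.interactions = interactions
        self.n_items = n_items
        self.seen = {}  # user -> set of items the user interacted with
        for user, item in interactions:
            self.seen.setdefault(user, set()).add(item)

    def __len__(self) -> int:
        return len(self.interactions)

    def __getitem__(self, idx: int):
        user, pos = self.interactions[idx]
        neg = random.randrange(self.n_items)
        while neg in self.seen[user]:          # resample until the item is unseen for this user
            neg = random.randrange(self.n_items)
        return torch.tensor(user), torch.tensor(pos), torch.tensor(neg)
```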
Design decisions:

| Decision | Why |
|---|---|
| Temporal split | Prevents leakage from future interactions into training |
| Leave-one-out split | Alternative for per-user evaluation |
| Configurable filtering | Balance data size vs. quality |
| Baselines (Pop, MF) | Benchmark to validate NCF improvement |
| BPR loss | Optimizes ranking rather than rating prediction (see the sketch below) |
| Volume mounts | Data stays external to images |
| MLflow | Tracks preprocessing, baselines, and training experiments |
| DDP with gloo | CPU-based distributed training |
| YAML experiment config | Research-grade, not hardcoded presets |
| DDP by default in sweep | Faster hyperparameter search |
| Separate DDP benchmark | One-time comparison, not every sweep |
| Resource monitoring | Prevents system overload during sweeps |
| Kubernetes NodePort | Exposes service on Minikube without ingress setup |
| hostPath volumes | Mirrors S3/NFS mount pattern — data stays outside the image |
| HPA on CPU | Auto-scales replicas; production would add custom latency metrics |
| imagePullPolicy: Never | Uses locally loaded image — no registry needed for local dev |
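For reference, the BPR loss mentioned in the table is a pairwise ranking objective; a minimal sketch given the model's scores for a positive and a sampled negative item:

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Bayesian Personalized Ranking: push each positive item's score above the sampled negative's."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```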