Distributed Recommendation System

MLOps-focused recommendation pipeline: distributed training, experiment tracking, containerized serving.

Setup

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Data

Download the Amazon Reviews dataset (JSONL or CSV format) and place it in the data/ directory:

data/
├── All_Beauty.jsonl         # Small dataset for testing
└── Electronics.csv.gz       # Large dataset for benchmarking
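A loader for both formats can be sketched as below. The helper name `load_reviews` is illustrative, and the exact column set depends on your copy of the Amazon Reviews dump:

```python
import pandas as pd

def load_reviews(path: str) -> pd.DataFrame:
    """Load raw interactions from JSONL or CSV(.gz).

    JSONL files hold one review object per line; for CSV, pandas
    infers gzip compression from the .gz suffix automatically.
    """
    if path.endswith(".jsonl"):
        return pd.read_json(path, lines=True)
    return pd.read_csv(path)
```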

Pipeline

1. Preprocess

Supports JSONL, CSV, and CSV.GZ formats. Two split strategies: temporal (default) or leave-one-out.

# Basic (temporal split)
python data/preprocess.py --input data/All_Beauty.jsonl --output-dir data/processed

# With custom thresholds
python data/preprocess.py --input data/All_Beauty.jsonl --output-dir data/processed --min-user 3 --min-item 3

# Leave-one-out split
python data/preprocess.py --input data/All_Beauty.jsonl --output-dir data/processed --split-strategy leave-one-out

# Large CSV dataset
python data/preprocess.py --input data/Electronics.csv.gz --output-dir data/processed_electronics --min-user 3 --min-item 3

Preprocessing is tracked in MLflow under the preprocessing experiment.
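The default temporal split works roughly as follows (a minimal sketch; the column name `timestamp` and the 80/10/10 proportions are assumptions, not the exact defaults of `data/preprocess.py`):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, val_frac: float = 0.1, test_frac: float = 0.1):
    """Split interactions chronologically: oldest rows go to train,
    newest to test, preventing future data from leaking into training."""
    df = df.sort_values("timestamp")
    n = len(df)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = df.iloc[: n - n_val - n_test]
    val = df.iloc[n - n_val - n_test : n - n_test]
    test = df.iloc[n - n_test :]
    return train, val, test
```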

2. Train (Single Node)

python training/train_single.py --data-dir data/processed --epochs 10

3. Train (Distributed)

torchrun --nproc_per_node=2 training/train_ddp.py --data-dir data/processed --epochs 10
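The core of the DDP setup can be sketched as below. `torchrun` injects `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` into the environment; the real `train_ddp.py` additionally needs a `DistributedSampler` for the dataloader. The helper name is illustrative:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> torch.nn.Module:
    """Join the process group (gloo = CPU backend) and wrap the model
    so gradients are all-reduced across workers on each backward pass."""
    dist.init_process_group(backend="gloo")
    return DDP(model)
```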

Distributed training benchmark (Electronics-10M)

For a realistic CPU-bound workload, the NCF model is upgraded to a NeuMF-style architecture (user/item embeddings + MLP head). On the preprocessed data/processed_electronics_10m subset:

  • Single-process NeuMF vs 2-process DDP (gloo, CPU) shows roughly 1.2–1.3× higher throughput for DDP.
  • Both runs are tracked in the ncf-training MLflow experiment, with training_mode=single vs training_mode=ddp and world_size=2.
  • A small helper script prints the comparison table:
python training/compare_runs.py --experiment ncf-training --last 2 --markdown

The output can be pasted directly into this README to document the exact numbers you measured on your machine.

4. Baselines

python models/baselines.py --data-dir data/processed --baseline all

Evaluates popularity, random, and matrix factorization baselines. Results logged to MLflow baselines experiment.
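The popularity baseline is conceptually simple; a minimal sketch (function name and signature are illustrative, not the interface of `models/baselines.py`):

```python
from collections import Counter

def popularity_recommend(train_interactions, seen, top_k=5):
    """Rank items by global interaction count in the training data,
    excluding items the user has already interacted with."""
    counts = Counter(item for _, item in train_interactions)
    ranked = [item for item, _ in counts.most_common()]
    return [i for i in ranked if i not in seen][:top_k]
```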

5. Evaluate NCF

python evaluation/evaluate.py --model-path outputs/model.pt --data-dir data/processed
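The two ranking metrics in `evaluation/metrics.py` follow the standard definitions, sketched here (binary relevance assumed):

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Log-discounted gain for each hit, normalized by the ideal ordering
    in which all relevant items are ranked first."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```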

6. Automated Experiment Sweep

Run hyperparameter search across preprocessing and training configs:

python experiments/sweep.py --input data/All_Beauty.jsonl --config experiments/config.yaml

Features:

  • YAML-based config for preprocessing and training hyperparameters
  • Uses DDP by default for faster iteration
  • Resource monitoring (auto-adjusts based on CPU/memory)
  • Cleans up processed data after each config (reproducible from params)
  • All runs tagged with config_id for MLflow lineage tracking

Modes:

# Full sweep (preprocess + train for each config)
python experiments/sweep.py --input data/All_Beauty.jsonl --config experiments/config.yaml

# Ablation only (on existing preprocessed data)
python experiments/sweep.py --data-dir data/processed --config experiments/config.yaml
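Expanding the YAML grid into concrete configs can be sketched as below. The flat `{param: [values, ...]}` layout is an assumption about `experiments/config.yaml`, and the `cfg_NNN` id scheme is illustrative:

```python
import itertools
import yaml

def expand_grid(config_path: str):
    """Yield one dict per point in the Cartesian product of the grid,
    each tagged with a config_id for MLflow lineage tracking."""
    with open(config_path) as f:
        grid = yaml.safe_load(f)
    keys = sorted(grid)
    for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        yield {"config_id": f"cfg_{i:03d}", **dict(zip(keys, values))}
```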

7. View Experiments

mlflow ui  # http://127.0.0.1:5000

Serving

Local

uvicorn serving.app:app --port 8000
curl http://localhost:8000/health
curl -X POST http://localhost:8000/recommend -H "Content-Type: application/json" -d '{"user_id": "USER_ID", "top_k": 5}'

Docker

# Build
docker build -f docker/Dockerfile.serving -t recommendation-api .

# Run (mount data and model)
docker run -p 8000:8000 \
  -v $(pwd)/data/processed:/app/data/processed \
  -v $(pwd)/outputs:/app/outputs \
  recommendation-api

Docker Training

docker build -f docker/Dockerfile.training -t recommendation-training .

docker run \
  -v $(pwd)/data/processed:/app/data/processed \
  -v $(pwd)/outputs:/app/outputs \
  recommendation-training

Kubernetes (Minikube)

Prerequisites: minikube and kubectl installed, Docker Desktop running, and outputs/model.pt exists.

# 1. Start cluster
minikube start --driver=docker

# 2. Load image into Minikube (no registry needed)
minikube image load recommendation-api:latest

# 3. Mount local data into the Minikube VM — run each in a separate terminal, keep them running
minikube mount $(pwd)/outputs:/mnt/outputs
minikube mount $(pwd)/data/processed:/mnt/data/processed

# 4. Deploy
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml

# 5. Wait for pod to be ready
kubectl get pods -w   # wait for READY 1/1

# 6. Get service URL and test
minikube service recommendation-api --url
curl <URL>/health
curl -X POST <URL>/recommend \
  -H "Content-Type: application/json" \
  -d '{"user_id": "USER_ID", "top_k": 5}'

Scaling

# Enable metrics-server for HPA
minikube addons enable metrics-server
kubectl apply -f k8s/hpa.yaml
kubectl get hpa

# Manual scaling demo (scales 1 → 3 → 1 and verifies each pod)
bash k8s/test_scaling.sh

# Full smoke test (pods, /health, /recommend, HPA)
bash k8s/smoke_test.sh

Project Structure

├── data/
│   ├── preprocess.py       # Raw data → parquet splits
│   └── dataset.py          # PyTorch Dataset with negative sampling
├── models/
│   ├── ncf.py              # Neural Collaborative Filtering
│   └── baselines.py        # Popularity, Random, MF baselines
├── training/
│   ├── train_single.py     # Single-node + MLflow
│   └── train_ddp.py        # Distributed (PyTorch DDP)
├── evaluation/
│   ├── metrics.py          # Recall@K, NDCG@K
│   └── evaluate.py         # Model evaluation
├── experiments/
│   ├── sweep.py            # Automated hyperparameter sweep
│   ├── resource_monitor.py # CPU/memory monitoring
│   └── config.yaml         # Experiment configurations
├── serving/
│   └── app.py              # FastAPI inference
├── docker/
│   ├── Dockerfile.serving
│   └── Dockerfile.training
└── k8s/
    ├── deployment.yaml     # Pod spec, resource limits, health probes
    ├── service.yaml        # NodePort service (port 30800)
    ├── hpa.yaml            # Auto-scale 1–5 replicas at 70% CPU
    ├── test_scaling.sh     # Manual scale-up/down demo
    └── smoke_test.sh       # Full deployment validation

Key Design Decisions

| Decision | Why |
| --- | --- |
| Temporal split | Prevents data leakage |
| Leave-one-out split | Alternative for per-user evaluation |
| Configurable filtering | Balance data size vs. quality |
| Baselines (Pop, MF) | Benchmark to validate NCF improvement |
| BPR loss | Ranking objective rather than rating prediction |
| Volume mounts | Data stays external to images |
| MLflow | Tracks preprocessing, baselines, and training experiments |
| DDP with gloo | CPU-based distributed training |
| YAML experiment config | Research-grade, not hardcoded presets |
| DDP by default in sweep | Faster hyperparameter search |
| Separate DDP benchmark | One-time comparison, not every sweep |
| Resource monitoring | Prevents system overload during sweeps |
| Kubernetes NodePort | Exposes service on Minikube without ingress setup |
| hostPath volumes | Mirrors S3/NFS mount pattern; data stays outside the image |
| HPA on CPU | Auto-scales replicas; production would add custom latency metrics |
| imagePullPolicy: Never | Uses locally loaded image; no registry needed for local dev |
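The BPR objective from the table above can be sketched in a few lines (a minimal version; `models/ncf.py` may structure it differently):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Bayesian Personalized Ranking: push each observed (positive) item's
    score above a sampled negative's, i.e. minimize -log sigmoid(pos - neg)."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```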
