CineMatch 🎬

Your personal AI-powered movie recommendation engine

CineMatch is an end-to-end machine learning system that delivers personalized movie recommendations using SVD collaborative filtering, trained on the MovieLens 1M dataset. It follows MLOps practices throughout, from data versioning and experiment tracking to containerized cloud deployment.

Live Demo

API: http://your-ecs-url:8000/docs

Features

Personalized recommendations using SVD matrix factorization
RMSE of 0.965 on a held-out test set (199,708 ratings)
REST API built with FastAPI and Pydantic validation
Dark Netflix-style UI with real movie posters via TMDB API
User watch history with color-coded ratings
Experiment tracking with MLflow (6 models compared)
Data versioning with DVC
Containerized with Docker
CI/CD via GitHub Actions
Deployed on AWS ECS Fargate

Architecture

MovieLens 1M
      ↓
Data Loading → Preprocessing → Feature Engineering
      ↓
User-Item Matrix (6040 × 3260, 95.94% sparse)
      ↓
Rating Normalization (subtract user mean)
      ↓
SVD Matrix Factorization (50 latent factors)
      ↓
FastAPI REST API
      ↓
CineMatch Streamlit UI
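
The middle of this pipeline (user-item matrix, mean normalization, truncated SVD) can be sketched in a few lines. This is a minimal illustration on made-up toy data, assuming SciPy's `svds` is the factorization routine; the variable names and the final clipping to the 1 to 5 range are assumptions, and the project's own code in `src/` may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy (user index, movie index, rating) triples; hypothetical data, not MovieLens.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 3, 1.0),
    (1, 0, 4.0), (1, 2, 5.0), (1, 4, 2.0),
    (2, 1, 2.0), (2, 3, 5.0), (2, 4, 4.0),
    (3, 0, 5.0), (3, 2, 4.0), (3, 3, 2.0),
]
n_users, n_movies = 4, 5
rows = np.array([r[0] for r in ratings])
cols = np.array([r[1] for r in ratings])
vals = np.array([r[2] for r in ratings], dtype=float)

# Per-user mean over observed ratings only (unrated cells are ignored).
user_mean = (
    np.bincount(rows, weights=vals, minlength=n_users)
    / np.bincount(rows, minlength=n_users)
)

# Sparse user-item matrix of mean-centered ratings; unrated cells stay 0.
centered = csr_matrix(
    (vals - user_mean[rows], (rows, cols)), shape=(n_users, n_movies)
)

# Truncated SVD with k latent factors (the production model above uses 50).
k = 2
u, s, vt = svds(centered, k=k)

# Reconstruct scores, add each user's mean back, and clip to the rating scale.
predicted = np.clip(u @ np.diag(s) @ vt + user_mean[:, None], 1.0, 5.0)
print(predicted.round(2))
```

Each row of `predicted` holds one user's estimated score for every movie, which is what makes serving a simple row lookup.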

Tech Stack

Layer	Technology
ML Model	SVD Matrix Factorization (SciPy)
Baseline Model	ItemKNN (scikit-learn)
Experiment Tracking	MLflow
Data Versioning	DVC
API	FastAPI + Pydantic
Frontend	Streamlit + TMDB API
Containerization	Docker
CI/CD	GitHub Actions
Cloud	AWS ECS Fargate + ECR + IAM
Testing	pytest (15 tests)
Language	Python 3.12
Package Manager	uv

Project Structure

movie-recommendation-mlops-end-to-end/
│
├── notebooks/
│   ├── 01_EDA.ipynb                  → data exploration
│   ├── 02_preprocessing.ipynb        → data cleaning and splitting
│   └── 03_model_training.ipynb       → model experiments
│
├── src/
│   ├── data/
│   │   ├── loader.py                 → load MovieLens .dat files
│   │   └── preprocessor.py          → filter, clean, split ratings
│   │
│   ├── features/
│   │   └── build_features.py        → sparse matrix + normalization
│   │
│   └── models/
│       ├── train.py                 → ItemKNN + SVD training + MLflow
│       ├── evaluate.py              → RMSE evaluation
│       └── predict.py               → recommendation generation
│
├── api/
│   ├── main.py                      → FastAPI app + endpoints
│   ├── schemas.py                   → request/response validation
│   └── services.py                  → API business logic
│
├── frontend/
│   └── app.py                       → CineMatch Streamlit UI
│
├── tests/
│   ├── test_data.py                 → data pipeline tests
│   ├── test_model.py                → model tests
│   └── test_api.py                  → API endpoint tests
│
├── data/
│   ├── raw/ml-1m/                   → MovieLens 1M dataset (DVC tracked)
│   └── processed/                   → matrices + mappings (DVC tracked)
│
├── models/                          → saved model artifacts (DVC tracked)
├── reports/                         → evaluation reports
├── .github/workflows/ci-cd.yml      → CI/CD pipeline
├── Dockerfile
├── main.py                          → training pipeline entry point
└── pyproject.toml

Dataset

MovieLens 1M dataset from GroupLens Research:

Stat	Value
Total ratings	1,000,209
Users	6,040
Movies	3,883
Rating scale	1 to 5 stars
Time period	2000 to 2003
Sparsity	95.94%
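
Sparsity is the share of user-item cells with no rating. The short calculation below is a sketch, not project code: it assumes the 95.94% figure refers to the 6040 × 3260 matrix from the Architecture section (presumably after filtering and splitting) and works backwards to the number of stored ratings that figure implies. The function names are hypothetical.

```python
def sparsity(n_rows: int, n_cols: int, n_stored: int) -> float:
    """Fraction of cells in an n_rows x n_cols matrix that hold no rating."""
    return 1.0 - n_stored / (n_rows * n_cols)


def implied_ratings(n_rows: int, n_cols: int, sparsity_fraction: float) -> int:
    """Number of stored ratings implied by a given sparsity."""
    return round(n_rows * n_cols * (1.0 - sparsity_fraction))


cells = 6040 * 3260
print(cells)                                 # 19690400
print(implied_ratings(6040, 3260, 0.9594))   # 799430
```

Roughly 0.8 million stored ratings would be consistent with the matrix holding only the training portion of the data, though the README does not state this explicitly.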

Key findings from EDA:

Positive bias: ratings skew toward the high end of the scale (4 to 5 stars)
Long tail problem: a small number of popular movies account for most of the ratings
No missing values or duplicates
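
The first two findings can be checked with simple counts. The snippet below is an illustrative sketch on made-up toy data, not the notebook code; `high_share` and `top_share` are hypothetical names.

```python
from collections import Counter

# Toy (movie_id, rating) pairs; hypothetical data standing in for ratings.dat.
ratings = [
    (1, 5), (1, 4), (1, 4), (1, 5), (1, 3),
    (2, 4), (2, 5), (2, 2),
    (3, 4),
    (4, 1),
]

# Positive bias: share of ratings that are 4 stars or higher.
high_share = sum(1 for _, r in ratings if r >= 4) / len(ratings)

# Long tail: share of all ratings captured by the most-rated movie(s).
per_movie = Counter(movie_id for movie_id, _ in ratings)
top_k = 1
top_share = sum(c for _, c in per_movie.most_common(top_k)) / len(ratings)

print(f"ratings >= 4 stars: {high_share:.0%}")
print(f"share held by top {top_k} movie(s): {top_share:.0%}")
```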

Model Performance

All experiments tracked and compared in MLflow:

Model	Parameters	RMSE
ItemKNN	K=10, cosine	2.72
ItemKNN	K=20, cosine	2.82
ItemKNN	K=50, cosine	2.88
SVD	50 factors	0.965 ← production
SVD	100 factors	1.02
SVD	200 factors	1.02

Final evaluation on full test set (199,708 ratings): RMSE 0.965
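
RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower is better and the unit is stars. A minimal standard-library sketch follows; the project's `evaluate.py` may compute it differently, for example with NumPy.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 0.5, 0, 1, -1, so the mean squared error is 0.5625.
print(rmse([4, 3, 5, 1], [3.5, 3.0, 4.0, 2.0]))  # 0.75
```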

Why SVD beats ItemKNN

SVD finds hidden latent factors (user taste profiles + movie features)
Rating normalization removes user bias (subtract user mean)
Handles sparse data better than direct similarity comparison
Predictions are fast: scores are read from a precomputed matrix instead of running a kneighbors query in a loop
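
Once scores are available as a lookup, producing recommendations reduces to picking the highest-scoring movies the user has not rated yet. The sketch below assumes one user's scores arrive as a `movie_id -> score` dict; the function name, the data layout, and the example scores are hypothetical.

```python
import heapq


def top_n(scores, seen, n=10):
    """Return the n best (movie_id, score) pairs, skipping already-rated movies.

    scores: dict mapping movie_id -> predicted score for a single user
    seen:   set of movie_ids the user has already rated
    """
    candidates = (
        (score, movie_id)
        for movie_id, score in scores.items()
        if movie_id not in seen
    )
    best = heapq.nlargest(n, candidates)
    return [(movie_id, score) for score, movie_id in best]


# Hypothetical predicted scores for one user; movie 1 was already rated.
user_scores = {1: 4.90, 3114: 4.26, 364: 4.25, 2: 3.10}
print(top_n(user_scores, seen={1}, n=2))  # [(3114, 4.26), (364, 4.25)]
```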

API Endpoints

Method	Endpoint	Description
GET	`/health`	Server health check
POST	`/recommendations`	Get personalized recommendations
GET	`/user/{id}/history`	Get user watch history

Example request

curl -X POST http://localhost:8000/recommendations \
  -H "Content-Type: application/json" \
  -d '{"user_id": 1, "n": 10}'

Example response

{
  "user_id": 1,
  "recommendations": [
    {
      "movie_id": 3114,
      "title": "Toy Story 2 (1999)",
      "predicted_score": 4.26
    },
    {
      "movie_id": 364,
      "title": "Lion King, The (1994)",
      "predicted_score": 4.25
    }
  ]
}
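
The same call can be made from Python using only the standard library. This is a small client sketch that assumes the API is running locally on port 8000 and returns the response shape shown above; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes a locally running API


def build_request(user_id: int, n: int = 10) -> urllib.request.Request:
    """Build the POST /recommendations request without sending it."""
    body = json.dumps({"user_id": user_id, "n": n}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/recommendations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(raw: str):
    """Turn the JSON response body into (movie_id, title, score) tuples."""
    payload = json.loads(raw)
    return [
        (r["movie_id"], r["title"], r["predicted_score"])
        for r in payload["recommendations"]
    ]


def fetch_recommendations(user_id: int, n: int = 10):
    """Send the request and parse the result (requires the API to be up)."""
    with urllib.request.urlopen(build_request(user_id, n), timeout=10) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

With the server running, `fetch_recommendations(1, n=10)` returns a list of tuples ready to display.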

Getting Started

Prerequisites

Python 3.12
uv package manager
Docker (optional)
TMDB API key (for movie posters)

Installation

# clone the repo
git clone https://github.com/harmandeep2993/movie-recommendation-mlops-end-to-end.git
cd movie-recommendation-mlops-end-to-end

# install dependencies
uv sync

# set up environment variables
cp .env.example .env
# add your TMDB_API_KEY to .env

Download dataset

Download MovieLens 1M from https://grouplens.org/datasets/movielens/1m/, extract the archive, and place the files in:

data/raw/ml-1m/
├── movies.dat
├── ratings.dat
└── users.dat
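
The `.dat` files are plain text with `::` as the field separator: `ratings.dat` rows are `UserID::MovieID::Rating::Timestamp` and `movies.dat` rows are `MovieID::Title::Genres`, with genres separated by `|`. A minimal parsing sketch follows; the function names are hypothetical and the project's `loader.py` may use pandas instead.

```python
def parse_rating(line: str):
    """Parse one ratings.dat row into (user_id, movie_id, rating, timestamp)."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def parse_movie(line: str):
    """Parse one movies.dat row into (movie_id, title, [genres])."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return int(movie_id), title, genres.split("|")


def load_ratings(path: str):
    """Read a whole ratings file; latin-1 decodes any byte without errors."""
    with open(path, encoding="latin-1") as handle:
        return [parse_rating(line) for line in handle if line.strip()]


print(parse_rating("1::1193::5::978300760"))
print(parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
```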

Train the model

python main.py

This runs the full pipeline:

Load and preprocess data
Build sparse User-Item Matrix
Normalize ratings
Train ItemKNN and SVD models
Track experiments in MLflow
Save best model
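
The preprocessing step is described only as "filter, clean, split", so the sketch below is a guess at its general shape rather than the actual implementation. It assumes a minimum-ratings filter per movie and a random 80/20 split (the 199,708-rating test set is close to 20% of the data); the threshold, the seed, and the function name are hypothetical.

```python
import random
from collections import Counter


def filter_and_split(ratings, min_ratings_per_movie=2, test_fraction=0.2, seed=42):
    """Drop rarely rated movies, then split the rest into (train, test)."""
    counts = Counter(movie_id for _, movie_id, _ in ratings)
    kept = [r for r in ratings if counts[r[1]] >= min_ratings_per_movie]

    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (user_id, movie_id, rating) triples; movie 99 has a single rating.
toy = [(u, m, 3 + (u + m) % 3) for u in range(1, 6) for m in (10, 20)]
toy.append((1, 99, 5))

train, test = filter_and_split(toy)
print(len(train), len(test))  # 8 2
```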

Run the API

uvicorn api.main:app --reload

API docs available at: http://localhost:8000/docs

Run the UI

streamlit run frontend/app.py

UI available at: http://localhost:8501
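
The UI needs a poster for each recommended title. MovieLens titles carry the year in parentheses and move leading articles to the end ("Lion King, The (1994)"), which does not match how TMDB lists them. The sketch below shows one way to normalize a title and build a TMDB v3 search URL. The helper names are hypothetical, and how `frontend/app.py` actually queries TMDB is not documented here.

```python
import re
from urllib.parse import urlencode

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p/w500"

_TITLE_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")
_ARTICLE_RE = re.compile(r"^(?P<rest>.*), (?P<article>The|A|An)$")


def split_title(movielens_title: str):
    """'Lion King, The (1994)' -> ('The Lion King', 1994)."""
    match = _TITLE_RE.match(movielens_title)
    if match:
        name, year = match.group("name"), int(match.group("year"))
    else:
        name, year = movielens_title.strip(), None
    article = _ARTICLE_RE.match(name)
    if article:
        name = f"{article.group('article')} {article.group('rest')}"
    return name, year


def search_url(movielens_title: str, api_key: str) -> str:
    """Build the TMDB movie search URL for a MovieLens title."""
    name, year = split_title(movielens_title)
    params = {"api_key": api_key, "query": name}
    if year is not None:
        params["year"] = year
    return f"{TMDB_SEARCH_URL}?{urlencode(params)}"


def poster_url(poster_path: str) -> str:
    """Turn a poster_path from a search result into a full image URL."""
    return f"{TMDB_IMAGE_BASE}{poster_path}"


print(split_title("Lion King, The (1994)"))
print(search_url("Toy Story 2 (1999)", api_key="YOUR_TMDB_API_KEY"))
```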

Run with Docker

# image names must be lowercase
docker build -t cinematch .
docker run -p 8000:8000 cinematch

View MLflow experiments

mlflow ui --backend-store-uri sqlite:///mlflow.db

MLflow UI available at: http://localhost:5000

MLOps Pipeline

Data Versioning (DVC)

dvc add data/raw/ml-1m
dvc add data/processed
dvc push

CI/CD (GitHub Actions)

Every push to main triggers:

push to GitHub
      ↓
run pytest tests (15 tests)
      ↓
build Docker image
      ↓
push to AWS ECR
      ↓
deploy to AWS ECS Fargate
      ↓
new version live

AWS Infrastructure

Service	Purpose
ECR	Docker image registry
ECS Fargate	Serverless container hosting
IAM	Access management for GitHub Actions

Tests

pytest tests/ -v

Test File	Coverage
test_data.py	data loading, preprocessing pipeline
test_model.py	model loading, predictions, artifacts
test_api.py	health endpoint, recommendations, history

Total: 15 tests, all passing.

Roadmap

Precompute recommendations for sub-100ms API response
PostgreSQL database for storing user ratings
Vector database (Qdrant) for semantic movie search
Neural Collaborative Filtering model
Model monitoring and automated retraining pipeline
A/B testing framework for model comparison

What I Learned

This project covers the full ML engineering lifecycle:

Building sparse matrix representations for recommendation systems
Implementing and comparing collaborative filtering algorithms
MLOps practices: experiment tracking, data versioning, CI/CD
Production API design with FastAPI
Docker containerization and cloud deployment
Writing clean modular Python code with proper testing

Author

Harman — Data Science and AI/ML Engineer, Berlin

GitHub · LinkedIn

License

MIT
