A hybrid recommendation engine combining Collaborative Filtering (NMF) and Content-Based Filtering (TF-IDF + Cosine Similarity) — built with Python, SQLite, and Streamlit.
- Features
- How It Works
- Project Structure
- Tech Stack
- Installation & Setup
- Usage
- Output
- Screenshots
- Future Improvements
- Author
| Feature | Description |
|---|---|
| 🔀 Hybrid Recommendations | Combines CF + Content-Based scores with tunable weight |
| 🎚️ Adjustable Alpha | Slide between collaborative and content-based filtering |
| 👤 Existing Users | Personalized recommendations from historical ratings |
| 🆕 Cold-Start Handling | Popularity-based or genre-based suggestions for new users |
| 🎯 Content-Based Mode | Recommendations from user-selected liked movies |
| 🌐 Interactive UI | Clean, browser-based interface powered by Streamlit |
- Reads `movies.csv` and `ratings.csv` using pandas
- Inserts into a SQLite database (`recsys.db`) with two tables:
  - `movies(movieId, title, genres, ...)`
  - `ratings(userId, movieId, rating, timestamp)`
- Adds indexes on key columns for fast query performance
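The database step above can be sketched roughly as follows. This is a minimal illustration, not the contents of `create_db.py`: the function name `build_database`, the index names, and the tiny in-memory frames standing in for the CSV files are all assumptions.

```python
import sqlite3

import pandas as pd


def build_database(movies: pd.DataFrame, ratings: pd.DataFrame,
                   db_path: str = ":memory:") -> sqlite3.Connection:
    """Write both frames to SQLite and index the lookup columns."""
    conn = sqlite3.connect(db_path)
    movies.to_sql("movies", conn, if_exists="replace", index=False)
    ratings.to_sql("ratings", conn, if_exists="replace", index=False)
    # Index names are illustrative (assumed), not taken from the project.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ratings_user ON ratings(userId)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ratings_movie ON ratings(movieId)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_movies_id ON movies(movieId)")
    conn.commit()
    return conn


# Tiny stand-ins for pd.read_csv("data/movies.csv") and pd.read_csv("data/ratings.csv")
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Heat (1995)"],
    "genres": ["Animation|Comedy", "Action|Crime"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [5.0, 3.0, 4.0],
    "timestamp": [0, 1, 2],
})

conn = build_database(movies, ratings)
row_count = conn.execute("SELECT COUNT(*) FROM ratings").fetchone()[0]
```

Passing a file path such as `recsys.db` instead of `":memory:"` would persist the database to disk.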
- Filters to top 500 users and top 1000 movies to keep the matrix manageable
- Builds a dense user × item rating matrix `R`
- Trains `sklearn.decomposition.NMF` to decompose `R` into:
  - `user_factors`, shape `(num_users, k)`
  - `item_factors`, shape `(k, num_movies)`
- Saves `nmf_user_factors.pkl`, `nmf_item_factors.pkl`, and ID ↔ index mapping dicts
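A rough sketch of the collaborative filtering training step, on toy data. Several details here are assumptions rather than the project's actual choices: ranking "top" users and movies by rating count, filling unrated cells with 0, and the NMF hyperparameters (`k = 2`, `init="nndsvda"`, `max_iter=500`).

```python
import pandas as pd
from sklearn.decomposition import NMF

ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 4],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 20, 30],
    "rating":  [5.0, 4.0, 4.0, 2.0, 5.0, 1.0, 3.0, 4.0, 2.0],
})

# Keep the most active users and most rated movies (assumed ranking: rating count).
top_users = ratings["userId"].value_counts().head(500).index
top_movies = ratings["movieId"].value_counts().head(1000).index
subset = ratings[ratings["userId"].isin(top_users) & ratings["movieId"].isin(top_movies)]

# Dense user x item matrix; unrated cells are filled with 0 (assumption).
R_df = subset.pivot_table(index="userId", columns="movieId",
                          values="rating", fill_value=0.0)
R = R_df.to_numpy(dtype=float)

# ID -> matrix index mappings, needed later to look up factors by ID.
user_index = {uid: i for i, uid in enumerate(R_df.index)}
movie_index = {mid: j for j, mid in enumerate(R_df.columns)}

k = 2  # number of latent factors (illustrative)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
user_factors = model.fit_transform(R)  # shape (num_users, k)
item_factors = model.components_       # shape (k, num_movies)
```

The factor arrays and mapping dicts could then be written out with `joblib.dump`, which is the serialization library listed in the tech stack.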
- Combines each movie's title + genres into a text field
- Applies TF-IDF vectorization across all movies
- Computes a full pairwise cosine similarity matrix
- Saves `tfidf_vectorizer.pkl`, `content_cosine_sim.npz`, and mapping dicts
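The content-based pipeline above might look roughly like this on three toy movies. Replacing the `|` genre separator with spaces and using the default `TfidfVectorizer` settings are assumptions. The project may tokenize differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)"],
    "genres": ["Animation|Children|Comedy", "Animation|Children|Comedy",
               "Action|Crime|Thriller"],
})

# Combine title and genres into one text field per movie.
movies["text"] = (
    movies["title"] + " " + movies["genres"].str.replace("|", " ", regex=False)
)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(movies["text"])  # sparse (num_movies, vocab_size)

# Full pairwise similarity matrix, shape (num_movies, num_movies).
cosine_sim = cosine_similarity(tfidf)
```

Here the two Toy Story entries share most of their tokens, so they score far closer to each other than either does to Heat.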
CF Score — predicted rating via NMF dot product:
CF Score = user_factors[u] · item_factors[:, m]
CB Score — for each candidate movie, averages cosine similarity against all movies the user has rated ≥ 4 stars, then normalizes to [0, 1].
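As a sketch of the CB score described above: the helper name `content_scores`, the dict-based rating input, and the use of min-max scaling for the normalization are assumptions. The text only states that scores end up in [0, 1].

```python
import numpy as np


def content_scores(cosine_sim, user_ratings, like_threshold=4.0):
    """Average similarity of every movie to the user's liked movies, scaled to [0, 1].

    user_ratings maps a movie's matrix index to the rating the user gave it.
    """
    liked = [m for m, r in user_ratings.items() if r >= like_threshold]
    if not liked:
        return np.zeros(cosine_sim.shape[0])
    raw = cosine_sim[:, liked].mean(axis=1)
    span = raw.max() - raw.min()
    if span == 0:
        return np.zeros_like(raw)
    return (raw - raw.min()) / span  # min-max normalization (assumed)


cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])
user_ratings = {0: 5.0, 1: 4.0, 2: 2.0}  # movies 0 and 1 count as liked
scores = content_scores(cosine_sim, user_ratings)
```

Movies the user has already rated would normally be filtered out of the candidate list before ranking.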
Both scores are normalized and blended:
Hybrid Score = α × CF Score + (1 - α) × CB Score
- α = 1.0 → Pure Collaborative Filtering
- α = 0.0 → Pure Content-Based Filtering
- α = 0.5 → Equal blend (recommended default)
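The blend can be sketched as below. `min_max` and `hybrid_scores` are hypothetical helper names, and min-max scaling is an assumed choice of normalization. The CF part is the dot product of one user's factor row with `item_factors`, which yields a predicted score for every movie at once.

```python
import numpy as np


def min_max(x):
    """Scale a vector to [0, 1]; a constant vector maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def hybrid_scores(user_vec, item_factors, cb_scores, alpha=0.5):
    """alpha * normalized CF score + (1 - alpha) * normalized CB score."""
    cf_raw = user_vec @ item_factors  # user_factors[u] . item_factors[:, m] for all m
    return alpha * min_max(cf_raw) + (1 - alpha) * min_max(cb_scores)


user_vec = np.array([1.0, 0.5])            # one row of user_factors, k = 2
item_factors = np.array([[4.0, 1.0, 2.0],
                         [0.0, 2.0, 2.0]])  # shape (k, num_movies)
cb_scores = np.array([0.0, 1.0, 0.5])

pure_cf = hybrid_scores(user_vec, item_factors, cb_scores, alpha=1.0)
pure_cb = hybrid_scores(user_vec, item_factors, cb_scores, alpha=0.0)
blended = hybrid_scores(user_vec, item_factors, cb_scores, alpha=0.5)
```

Normalizing both inputs first keeps α meaningful: without it, raw NMF predictions (on the rating scale) would dominate cosine similarities (at most 1) at any setting.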
| Method | How it works |
|---|---|
| Popularity-based | Score = mean_rating × log(rating_count) — surfaces well-rated and widely seen films |
| Liked movies | User picks titles they enjoyed → CB cosine similarity finds the closest matches |
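A small sketch of the popularity score from the table, assuming the natural logarithm and a hypothetical helper name `popularity_scores`:

```python
import numpy as np
import pandas as pd


def popularity_scores(ratings: pd.DataFrame) -> pd.DataFrame:
    """Rank movies by mean_rating * log(rating_count), highest first."""
    stats = ratings.groupby("movieId")["rating"].agg(
        mean_rating="mean", rating_count="count"
    )
    stats["score"] = stats["mean_rating"] * np.log(stats["rating_count"])
    return stats.sort_values("score", ascending=False)


ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3, 3, 3],
    "rating":  [5.0, 4.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0],
})
ranked = popularity_scores(ratings)
```

Because log(1) = 0, a movie with a single rating scores 0 regardless of how high that rating is, which is what keeps one-off five-star entries from topping the list.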
mini-recommender/
│
├── data/
│ ├── movies.csv
│ └── ratings.csv
│
├── models/
│ ├── nmf_user_factors.pkl
│ ├── nmf_item_factors.pkl
│ ├── tfidf_vectorizer.pkl
│ └── content_cosine_sim.npz
│
├── create_db.py # Sets up SQLite database
├── train_models.py # Trains NMF and TF-IDF models
├── recommender.py # Core recommendation logic
├── app.py # Streamlit frontend
├── recsys.db # SQLite database (auto-generated)
├── requirements.txt
└── README.md
| Layer | Technology |
|---|---|
| Frontend | Streamlit |
| Backend | Python 3.8+ |
| Database | SQLite |
| ML / Math | scikit-learn, scipy, numpy |
| Data | pandas |
| Serialization | joblib |
- Python 3.8+
- MovieLens dataset (`movies.csv` and `ratings.csv`)
git clone https://github.com/yourusername/mini-recommender.git
cd mini-recommender
pip install -r requirements.txt

Place the MovieLens CSV files in the `data/` directory:
data/
├── movies.csv
└── ratings.csv
python create_db.py
python train_models.py
streamlit run app.py

The app will open at http://localhost:8501 in your browser.
- Select a User ID from the dropdown
- Choose N (number of recommendations) and adjust the alpha slider
- Get hybrid recommendations with CF + CB scores
- See the user's own top-rated movies alongside results
Option A — Popularity-based:
Recommends widely-seen, well-rated movies using mean_rating × log(count) scoring.
Option B — Liked movies:
Pick titles you've enjoyed from a dropdown → content-based similarity returns the closest matches.
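One way Option B could work under the hood, as a hedged sketch: `recommend_from_liked` is a hypothetical function, and the similarity matrix here is hand-written toy data rather than output from the trained model.

```python
import numpy as np


def recommend_from_liked(liked_titles, titles, cosine_sim, n=2):
    """Return the n titles most similar, on average, to the picked titles."""
    title_to_idx = {t: i for i, t in enumerate(titles)}
    liked_idx = [title_to_idx[t] for t in liked_titles if t in title_to_idx]
    if not liked_idx:
        return []
    scores = cosine_sim[liked_idx].mean(axis=0)
    scores[liked_idx] = -np.inf  # never recommend a title the user just picked
    top = np.argsort(scores)[::-1][:n]
    return [titles[i] for i in top]


titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)", "Casino (1995)"]
cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])
picks = recommend_from_liked(["Toy Story (1995)"], titles, cosine_sim, n=2)
```

No rating history is needed for this path, which is why it suits brand-new users.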
Each recommendation includes:
| Field | Description |
|---|---|
| 🎬 Movie Title | Recommended movie name |
| ⭐ CF Score | Predicted rating from collaborative filtering |
| 🔀 Hybrid Score | Weighted combination of CF + CB scores |
| ❤️ Liked Movies | User's historically rated movies (existing users) |
- Real-time user feedback loop for online learning
- Neural Collaborative Filtering (NCF / deep learning)
- Sparse matrix handling + ALS for larger datasets (e.g. `implicit` library)
- Replace NMF truncation with `Surprise` library for better CF
- REST API with FastAPI or Flask
- Explicit model retrain path within the Streamlit app
- Error handling for missing or corrupt model files
- Docker containerization & cloud deployment (AWS / GCP / Heroku)
Reth Rebello
Engineering Student · ML Enthusiast · Data Analytics · Software Developer
Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes: `git commit -m 'Add your feature'`
- Push to the branch: `git push origin feature/your-feature`
- Open a Pull Request
This project is licensed under the MIT License.
If you found this useful, give it a ⭐ on GitHub!