A content-based movie recommendation engine built in Python and served as an interactive web app. Enter any movie and get 5 similar recommendations with posters, powered by NLP and cosine similarity on the TMDB 5000 dataset.
- Each movie's overview, genres, keywords, cast and director are combined into a single tags string
- Tags are stemmed using NLTK's PorterStemmer to normalize word forms
- A CountVectorizer converts all tags into a 4806 × 5000 word-count matrix
- Cosine similarity is computed between every pair of movies
- Given a movie, the app returns the top 5 most similar movies by cosine score
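The steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `Counter` stands in for scikit-learn's CountVectorizer, the hand-written `cosine` stands in for its cosine similarity, stemming is skipped, and the titles and tags in `MOVIES` are invented toy data.

```python
import math
from collections import Counter

# Toy data invented for illustration; the real tags are built from the TMDB 5000 CSVs.
MOVIES = {
    "Space Quest": "space alien astronaut mission rescue",
    "Star Voyage": "space astronaut mission planet crew",
    "Love in Paris": "romance paris love wedding",
    "Paris Wedding": "romance wedding paris family",
    "Alien Attack": "alien invasion space war",
}


def vectorize(tags):
    """Word-count vector for one tags string (stand-in for CountVectorizer)."""
    return Counter(tags.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, movies=MOVIES, n=5):
    """Return up to n titles most similar to `title`, best match first."""
    vectors = {name: vectorize(tags) for name, tags in movies.items()}
    query = vectors[title]
    # Score every other movie against the query, excluding the query itself.
    scored = [(cosine(query, vec), name) for name, vec in vectors.items() if name != title]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n]]


if __name__ == "__main__":
    print(recommend("Space Quest", n=2))
```

With the real dataset, the pairwise scores are computed once and stored as a matrix, so a lookup only needs to sort one row instead of recomputing similarities on every request.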
cinematch/
├── app.py # Streamlit web app
├── netflix_recommendation.py # Full notebook logic (all steps)
├── movies.pkl # Preprocessed movie dataframe
├── similarity.pkl # Cosine similarity matrix
├── requirements.txt # Python dependencies
└── README.md # You are here
You need two dataset files from Kaggle and two model files to run this project.
Download both CSV files and place them in the project root:
| File | Description | Link |
|---|---|---|
| tmdb_5000_movies.csv | Movie metadata (budget, genres, keywords, overview) | ⬇️ Download |
| tmdb_5000_credits.csv | Cast and crew information | ⬇️ Download |
If you don't want to run the notebook yourself, download the pre-built model files:
| File | Description | Link |
|---|---|---|
| movies.pkl | Preprocessed movie titles + tags | ⬇️ Download |
| similarity.pkl | 4806×4806 cosine similarity matrix | ⬇️ Download |
Place all downloaded files in the project root folder alongside app.py.
git clone https://github.com/YOUR_USERNAME/cinematch.git
cd cinematch
python3 -m venv venv
source venv/bin/activate # Mac / Linux
venv\Scripts\activate # Windows
pip install -r requirements.txt
Open app.py and replace the API key on this line:
API_KEY = "your_api_key_here"
Get a free key at → themoviedb.org (Sign up → Settings → API → Create → Developer → Copy API Key v3)
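As a rough sketch of how a poster lookup with that key might work: the snippet below calls the TMDB v3 movie-details endpoint and builds an image URL from the `poster_path` field of the response. It is not the code in app.py. The function names, the `w500` image size, and reading the key from a `TMDB_API_KEY` environment variable are assumptions made for illustration, and `urllib` is used only to keep the example dependency-free.

```python
import json
import os
import urllib.parse
import urllib.request

TMDB_MOVIE_URL = "https://api.themoviedb.org/3/movie/{movie_id}"
POSTER_BASE_URL = "https://image.tmdb.org/t/p/w500"  # w500 is one of several available sizes


def poster_url_from_payload(payload):
    """Build a full poster URL from a movie-details response, or None if no poster."""
    path = payload.get("poster_path")
    return POSTER_BASE_URL + path if path else None


def fetch_poster(movie_id, api_key=None, timeout=10):
    """Fetch movie details from TMDB and return the poster URL (hypothetical helper)."""
    # Assumption: fall back to an environment variable so the key stays out of source control.
    api_key = api_key or os.environ.get("TMDB_API_KEY", "")
    query = urllib.parse.urlencode({"api_key": api_key})
    url = TMDB_MOVIE_URL.format(movie_id=movie_id) + "?" + query
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return poster_url_from_payload(payload)
```

Returning `None` when `poster_path` is missing lets the app show a placeholder image instead of failing on movies without a poster.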
Download movies.pkl and similarity.pkl from the links above and place them in the project root.
Or generate them yourself by running all cells in netflix_recommendation.py inside Jupyter:
pip install jupyter
jupyter notebook
# Open netflix_recommendation.py and run all cells
# This will generate movies.pkl and similarity.pkl automatically
streamlit run app.py
The app will open at http://localhost:8501 in your browser.
| Tool | Purpose |
|---|---|
| Python 3.10+ | Core language |
| Pandas & NumPy | Data manipulation |
| NLTK | Text stemming |
| Scikit-learn | CountVectorizer + Cosine Similarity |
| Streamlit | Web app framework |
| TMDB API | Fetching movie posters |
streamlit
pandas
numpy
scikit-learn
nltk
requests
Install with:
pip install -r requirements.txt
Note: similarity.pkl is ~90MB. If GitHub rejects it, use Git LFS or host it on Google Drive and load it in the app via gdown.
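One possible shape for the Google Drive route is sketched below: download the file only when it is not already on disk. The helper names and the `YOUR_DRIVE_FILE_ID` placeholder are hypothetical, and `gdown` is a third-party package that would need to be added to requirements.txt.

```python
import os


def ensure_file(path, download):
    """Call download(path) only if the file is missing, then return the path."""
    if not os.path.exists(path):
        download(path)
    return path


def download_with_gdown(path, file_id="YOUR_DRIVE_FILE_ID"):
    """Download a shared Google Drive file to `path` (placeholder file ID)."""
    import gdown  # third-party: pip install gdown

    url = f"https://drive.google.com/uc?id={file_id}"
    gdown.download(url, path, quiet=False)
```

In the app this could be called as `ensure_file("similarity.pkl", download_with_gdown)` before unpickling, so the download happens only on the first run.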
Q: The app loads but posters are missing?
Your TMDB API key may not be activated yet. It can take up to 30 minutes after signup. Replace API_KEY in app.py with your key.
Q: similarity.pkl is too large to push to GitHub?
Use Git LFS:
brew install git-lfs # Mac
git lfs install
git lfs track "*.pkl"
git add .gitattributes similarity.pkl
git commit -m "Add model via LFS"
git push
Q: How do I regenerate the model files?
Run all cells in netflix_recommendation.py in Jupyter Notebook. It will create fresh movies.pkl and similarity.pkl files.
Gargi Joshi
- GitHub: @gargijoshi9
- LinkedIn: Gargi Joshi
This project is open source under the MIT License.
Built with ♥ using Python · Scikit-learn · NLTK · Streamlit · TMDB API

