This repository features a Content-Based Movie Recommendation System designed to provide personalized movie suggestions. Leveraging natural language processing and machine learning techniques, the system analyzes movie content to find and recommend films similar to a user's preferences.
- Content-Based Filtering: Recommends movies based on the similarity of their content (genres, cast, director, keywords, plot summaries, etc.) to a given movie or user preference.
- Interactive User Interface: Built with Streamlit, providing an easy-to-use web interface for users to get movie recommendations.
- Model Persistence: The processed data and similarity matrix are serialized using `pickle` for efficient loading and prediction without re-processing the entire dataset every time.
- Python: The core programming language.
- Jupyter Notebook: Used for exploratory data analysis, model development, and experimentation.
- Pandas: For efficient data manipulation and analysis of the movie dataset.
- scikit-learn (sklearn): Specifically for `CountVectorizer` and `cosine_similarity`.
- Pickle: For serializing and deserializing Python objects, particularly the processed movie data or recommendation model.
- Streamlit: For creating the interactive web application that serves the recommendations.

The core of this content-based recommendation system relies on the following components:
- Feature Extraction: Movie attributes such as genre, keywords, cast, and director are combined into a single string for each movie.
- Text Vectorization: `CountVectorizer` from `scikit-learn` is used to convert this combined textual data into a sparse matrix of token counts. Each movie is represented as a vector in this high-dimensional space, where each dimension corresponds to a unique word/token.
- Similarity Measurement: Cosine similarity is then applied to these vectorized representations. It measures the cosine of the angle between two vectors, so a higher score indicates greater similarity between the movies' content profiles.
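
The three components above can be sketched on a toy catalogue. This is a minimal illustration, not the project's notebook code: the movie titles, the field values, and the variable names (`movies`, `tags`, `similarity`) are made up for the example, and `CountVectorizer` is used with its default settings, which may differ from those used in the repository.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue (hypothetical data, not taken from the TMDB dataset).
movies = [
    {"title": "Space Quest", "genres": "scifi adventure",
     "keywords": "space alien", "cast": "alice bob", "director": "carol"},
    {"title": "Galaxy Run", "genres": "scifi action",
     "keywords": "space alien chase", "cast": "alice dave", "director": "carol"},
    {"title": "Paris Letters", "genres": "romance drama",
     "keywords": "paris love", "cast": "erin frank", "director": "grace"},
]

# Feature extraction: merge the attributes of each movie into one string.
fields = ("genres", "keywords", "cast", "director")
tags = [" ".join(movie[field] for field in fields).lower() for movie in movies]

# Text vectorization: one row per movie, one column per distinct token.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(tags)

# Similarity measurement: pairwise cosine similarity between all rows.
similarity = cosine_similarity(vectors)

for movie, row in zip(movies, similarity):
    print(movie["title"], [round(float(score), 2) for score in row])
```

In this toy data the two science-fiction titles share several tokens and therefore score well above zero against each other, while the romance title shares no tokens with "Space Quest" and scores 0.
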

The development process involved thorough data analysis and a structured training approach.
- Initial Data Inspection: The TMDB Movies Dataset was loaded into a Jupyter Notebook to inspect its structure, identify relevant columns (e.g., `genres`, `keywords`, `cast`, `crew`, `overview`, `title`), and check for missing values or inconsistencies.
- Feature Selection: Key textual and categorical features crucial for defining movie content were identified and extracted.
- Text Cleaning: Basic text preprocessing steps were performed on relevant text columns, which might include converting to lowercase, removing punctuation, and handling potential formatting issues to prepare data for vectorization.
- Feature Combination: Relevant textual features (e.g., `genres`, `keywords`, `cast`, `director`, `overview`) for each movie were combined into a single string. This unified text provides a comprehensive content profile for each film.
- Vectorization (CountVectorizer): An instance of `CountVectorizer` was fitted on the combined text data from all 10,000 movies. This step builds a vocabulary of all unique words/tokens across the dataset and transforms each movie's text into a numerical vector based on the frequency of these tokens.
- Similarity Matrix Generation: The `cosine_similarity` function was applied to the output of the `CountVectorizer`. This generated a large square matrix where each element `(i, j)` represents the cosine similarity between movie `i` and movie `j`. This matrix is the heart of the recommendation engine, allowing for quick retrieval of similar movies.
- Model Serialization: Both the fitted `CountVectorizer` object and the calculated `cosine_similarity` matrix were saved (pickled) to disk. This allows the Streamlit application to load these pre-computed components quickly without needing to re-process the entire dataset upon startup, significantly improving performance.
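
The last two steps, serializing the pre-computed results and using the similarity matrix for lookups, might look roughly like the sketch below. It uses only the standard library so it can run without the dataset. The file names (`titles.pkl`, `similarity.pkl`), the helper functions (`save_artifacts`, `load_artifacts`, `recommend`), and the hard-coded matrix are all assumptions made for illustration; the repository's notebook and `app.py` may store different objects under different names.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the notebook's outputs.
titles = ["Space Quest", "Galaxy Run", "Paris Letters"]
similarity = [
    [1.0, 0.67, 0.0],
    [0.67, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]


def save_artifacts(directory, titles, similarity):
    """Pickle the titles and the similarity matrix into `directory`."""
    with open(os.path.join(directory, "titles.pkl"), "wb") as handle:
        pickle.dump(titles, handle)
    with open(os.path.join(directory, "similarity.pkl"), "wb") as handle:
        pickle.dump(similarity, handle)


def load_artifacts(directory):
    """Load the pickled objects back, as an app could do at startup."""
    with open(os.path.join(directory, "titles.pkl"), "rb") as handle:
        loaded_titles = pickle.load(handle)
    with open(os.path.join(directory, "similarity.pkl"), "rb") as handle:
        loaded_similarity = pickle.load(handle)
    return loaded_titles, loaded_similarity


def recommend(title, titles, similarity, top_n=5):
    """Return the `top_n` titles most similar to `title`, best match first."""
    index = titles.index(title)
    # Pair each movie index with its score in row `index`, highest first.
    ranked = sorted(enumerate(similarity[index]),
                    key=lambda pair: pair[1], reverse=True)
    # Skip the query movie itself, which always has similarity 1.0.
    return [titles[i] for i, _ in ranked if i != index][:top_n]


with tempfile.TemporaryDirectory() as workdir:
    save_artifacts(workdir, titles, similarity)
    loaded_titles, loaded_similarity = load_artifacts(workdir)

recommendations = recommend("Space Quest", loaded_titles, loaded_similarity, top_n=2)
print(recommendations)
```

Because row `i` of the matrix already holds the scores of movie `i` against every other movie, a recommendation is just a sort of one row, which is why loading the pickled matrix is enough to serve requests. Note that `pickle.load` can execute arbitrary code, so only load `.pkl` files you created yourself or otherwise trust.
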
- Name: TMDB Movies Dataset from Kaggle

To run this project locally on your machine:
- Clone the Repository: `git clone https://github.com/sjain2580/Movie-Recommender-System.git`
- Navigate to the Project Directory: `cd Movie-Recommender-System`
- Create a Virtual Environment (Recommended): `python -m venv venv`
  - Activate the virtual environment:
    - Windows: `.\venv\Scripts\activate`
    - macOS/Linux: `source venv/bin/activate`
- Run the Jupyter Notebook: Make sure you have all the necessary libraries installed. Install Jupyter with `pip install jupyter`, then start it with `jupyter notebook`. Once the notebook opens, run it; it creates the necessary `.pkl` files. If it fails because a library is missing, install that library and run the notebook again.
- Run the Streamlit Application: `streamlit run app.py`
- Access the Application: Your default web browser should open the Streamlit application automatically. If it does not, copy the URL shown in your terminal into your browser.
Check the live app here - https://movie-recommenda-system.streamlit.app/
Feel free to reach out if you have any questions or just want to connect!
