This project implements a news recommendation engine using the MIND_small dataset (Microsoft News Dataset). The system predicts which news articles a user is likely to click based on their historical behavior and article content.
Implemented methods:
- Baseline: Most popular articles, ranked by recent click-through rate (CTR)
- Content-Based Filtering: Uses article features (titles/categories) and vector embeddings
- Collaborative Filtering: Item-item similarity based on user interaction matrices
- Hybrid (Score Fusion): Weighted combination of scores from all models
- Hybrid (Rank Fusion): Reciprocal Rank Fusion (RRF) with adaptive weights based on user history length
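
To make the rank-fusion idea concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is an illustration under stated assumptions, not the code in hybrid.py: the function names, the constant k = 60, the history-length threshold, and the weight values are all hypothetical.

```python
from collections import defaultdict


def adaptive_weights(history_len, min_history=5):
    """Hypothetical rule: rely more on popularity when the user history is short."""
    if history_len < min_history:
        return {"popular": 0.6, "content": 0.3, "collab": 0.1}
    return {"popular": 0.2, "content": 0.4, "collab": 0.4}


def rrf_fuse(rankings, weights, k=60):
    """Weighted RRF: score(item) = sum over models of weight / (k + rank)."""
    scores = defaultdict(float)
    for model, ranked_items in rankings.items():
        w = weights.get(model, 0.0)
        for rank, item in enumerate(ranked_items, start=1):
            scores[item] += w / (k + rank)
    # Highest fused score first; ties are broken by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Toy ranked lists from three models (article IDs are made up).
rankings = {
    "popular": ["N1", "N2", "N3"],
    "content": ["N3", "N1", "N4"],
    "collab": ["N4", "N3", "N2"],
}
fused = rrf_fuse(rankings, adaptive_weights(history_len=12))
print(fused)
```

Because RRF only looks at ranks, the models do not need comparable score scales, which is the main practical difference from score fusion.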
Requirements:
- Python 3.8+
- MINDsmall_train dataset files placed in ./data/MINDsmall_train (move them here after downloading from the MIND website)
You can install the required packages directly, but a virtual environment is strongly recommended to avoid version conflicts with other projects.
It keeps this project's dependencies isolated from your system-wide Python installation.
# 1. Create the environment
python -m venv .venv
# 2. Activate it
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
# 3. Install packages
pip install -r requirements.txt

If you prefer not to use a virtual environment, simply run:
pip install -r requirements.txt

The main entry point is main.py. To load data, initialize models, generate sample recommendations for a test user, and run the evaluation suite:
python src/main.py

- load_mind: Loads news, behaviors and interaction data
- setup_models:
  - Filters interactions by time (48-hour window) for popularity calculation
  - Generates TF-IDF/embeddings for content filtering
  - Computes the sparse similarity matrix for collaborative filtering
- run_recommenders: Performs a "Live Demo" for a specific user ID, showing their history and what each model suggests
- run_evaluation: Samples 5000 impressions to calculate performance metrics
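
The 48-hour popularity filter in setup_models can be sketched roughly as follows. This is a simplified illustration, not the project's implementation: it assumes the impression log has already been flattened into (timestamp, news_id, clicked) tuples, and the names recent_ctr, top_popular, and min_views are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recent_ctr(impressions, now, window_hours=48, min_views=1):
    """Return {news_id: CTR} using only impressions from the last window_hours."""
    cutoff = now - timedelta(hours=window_hours)
    clicks = defaultdict(int)
    views = defaultdict(int)
    for timestamp, news_id, clicked in impressions:
        if cutoff <= timestamp <= now:
            views[news_id] += 1
            clicks[news_id] += int(clicked)
    return {n: clicks[n] / views[n] for n in views if views[n] >= min_views}


def top_popular(ctr, n=5):
    """Rank articles by CTR, highest first; ties are broken by article ID."""
    return sorted(ctr, key=lambda news_id: (-ctr[news_id], news_id))[:n]


# Toy log (timestamps and IDs are made up).
now = datetime(2019, 11, 15, 12, 0)
log = [
    (datetime(2019, 11, 15, 9, 0), "N1", 1),
    (datetime(2019, 11, 15, 9, 5), "N1", 0),
    (datetime(2019, 11, 14, 20, 0), "N2", 1),
    (datetime(2019, 11, 10, 8, 0), "N3", 1),  # older than 48 hours, ignored
]
ctr = recent_ctr(log, now)
ranking = top_popular(ctr)
print(ctr, ranking)
```

Raising min_views is one way to keep an article with a single lucky click from topping the list.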
We use nDCG@5 (Normalized Discounted Cumulative Gain) to evaluate how well the models rank relevant articles within the top 5 suggestions in the impression logs.
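
For reference, nDCG@5 for a single impression can be computed as in the sketch below, using binary click labels as relevance. The function names are illustrative and are not taken from accuracy.py; the project's own handling of edge cases (for example, impressions without clicks) may differ.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the first k labels (position 1 has discount 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(labels[:k], start=1))


def ndcg_at_k(labels_in_ranked_order, k=5):
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = sorted(labels_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # assumption: an impression with no clicks scores 0
    return dcg_at_k(labels_in_ranked_order, k) / ideal_dcg


def ndcg_for_impression(scores, clicks, k=5):
    """Order the candidates by model score, then evaluate the click labels."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ndcg_at_k([clicks[i] for i in order], k)


# Five candidates; the clicked articles end up at ranks 2 and 5.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
clicks = [0, 1, 0, 0, 1]
value = ndcg_for_impression(scores, clicks)
print(round(value, 4))
```

The per-impression values are then averaged over the sampled impressions to obtain a single score per model.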
To check that the system does not simply create a filter bubble, we also measure diversity.
- Metric: Intra-list diversity based on article categories
- Goal: Ensure the recommended articles cover a variety of topics
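
One common way to compute category-based intra-list diversity is the share of article pairs in a recommendation list that belong to different categories. The sketch below uses that definition as an assumption; the exact formula in beyondAccuracy.py may differ, and the function name is hypothetical.

```python
from itertools import combinations


def intra_list_diversity(categories):
    """Fraction of unordered article pairs whose categories differ (0.0 to 1.0)."""
    pairs = list(combinations(categories, 2))
    if not pairs:
        return 0.0  # a list with fewer than two articles has no pairs to compare
    return sum(a != b for a, b in pairs) / len(pairs)


# Categories of four recommended articles (toy data).
recommended = ["sports", "sports", "news", "finance"]
diversity = intra_list_diversity(recommended)
print(round(diversity, 4))
```

A value of 0.0 means every recommended article shares one category, while 1.0 means no two articles do.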
├── data/
│   ├── processed/
│   │   ├── preprocessing_behaviors.py
│   │   └── preprocessing_news.py
│   └── MINDsmall_train/
│       ├── behaviors.tsv
│       ├── entity_embeddings.vec
│       ├── news.tsv
│       └── relation_embedding.vec
├── src/
│   ├── data/
│   │   └── load_mind.py
│   ├── evaluation/
│   │   ├── accuracy.py
│   │   └── beyondAccuracy.py
│   ├── models/
│   │   ├── popular.py
│   │   ├── collaborative.py
│   │   ├── content_based.py
│   │   └── hybrid.py
│   └── main.py
└── requirements.txt