Rockne/TDT4215
Project Documentation: News Recommendation System (TDT4215)

Overview

This project implements a news recommendation engine using the MINDsmall dataset (the small variant of the Microsoft News Dataset, MIND). The system predicts which news articles a user is likely to click based on their historical behavior and article content.

Implemented methods:

  • Baseline: Most-popular ranking based on recent click-through rate (CTR)
  • Content-Based Filtering: Uses article features (titles/categories) and vector embeddings
  • Collaborative Filtering: Item-item similarity based on user interaction matrices
  • Hybrid (Score Fusion): Weighted combination of scores from all models
  • Hybrid (Rank Fusion): Reciprocal Rank Fusion (RRF) with adaptive weights based on user history length
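
The rank-fusion idea above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the weight values and the history-length threshold in `adaptive_weights` are assumptions for demonstration.

```python
# Sketch of Reciprocal Rank Fusion (RRF) with adaptive weights.
# Each model is assumed to return a list of article IDs, best first.

def rrf_fuse(ranked_lists, weights, k=60):
    """Fuse rankings: score(d) = sum_i w_i / (k + rank_i(d))."""
    scores = {}
    for ranking, w in zip(ranked_lists, weights):
        for rank, article_id in enumerate(ranking, start=1):
            scores[article_id] = scores.get(article_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def adaptive_weights(history_length, threshold=5):
    # Illustrative: cold users lean on popularity/content signals,
    # warm users on collaborative signals.
    return (0.6, 0.3, 0.1) if history_length < threshold else (0.2, 0.3, 0.5)
```

The constant `k = 60` is the value commonly used in the RRF literature; it dampens the influence of top ranks so no single model dominates.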

Technical Setup

Prerequisites:

  • Python 3.8+
  • MINDsmall_train dataset files placed in ./data/MINDsmall_train (move them here after downloading from the MIND website)

Installation

You can install the required packages directly, but using a virtual environment is highly recommended to avoid version conflicts with other projects.

Option 1: Using a Virtual Environment (Recommended)

This keeps the project's dependencies isolated from the rest of your system.

# 1. Create the environment
python -m venv .venv

# 2. Activate it
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

# 3. Install packages
pip install -r requirements.txt

Option 2: Quick Start (Global Install)

If you prefer not to use a virtual environment, simply run:

pip install -r requirements.txt

How to run

The main entry point is main.py. To load data, initialize models, generate sample recommendations for a test user, and run the evaluation suite:

python src/main.py

Script Workflow

  1. load_mind: Loads news, behaviors and interaction data
  2. setup_models:
    • Filters interactions by time (48-hour window) for popularity calculation
    • Generates TF-IDF/embeddings for content filtering
    • Computes the sparse similarity matrix for collaborative filtering
  3. run_recommenders: Performs a "Live Demo" for a specific user ID, showing their history and what each model suggests
  4. run_evaluation: Samples 5000 impressions to calculate performance metrics
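
The popularity step in setup_models (filtering interactions to a 48-hour window, then ranking by CTR) can be sketched as below. The record layout (`time`, `article_id`, `clicked`) and the `min_impressions` cutoff are illustrative assumptions, not the repository's actual schema.

```python
from datetime import datetime, timedelta

def recent_ctr(interactions, now, window_hours=48, min_impressions=5):
    """CTR per article over interactions inside the time window.

    interactions: iterable of dicts with 'time', 'article_id', 'clicked'.
    Articles with fewer than min_impressions impressions are dropped to
    avoid noisy CTR estimates from tiny samples.
    """
    cutoff = now - timedelta(hours=window_hours)
    clicks, shows = {}, {}
    for it in interactions:
        if it["time"] < cutoff:
            continue  # outside the 48-hour window
        aid = it["article_id"]
        shows[aid] = shows.get(aid, 0) + 1
        clicks[aid] = clicks.get(aid, 0) + int(it["clicked"])
    return {a: clicks.get(a, 0) / n for a, n in shows.items() if n >= min_impressions}
```

Ranking the resulting dictionary by value then yields the most-popular baseline recommendation list.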

Evaluation Strategy

Accuracy Metrics

We use nDCG@5 (Normalized Discounted Cumulative Gain) to evaluate how well the models rank relevant articles within the top 5 suggestions in the impression logs.
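
With binary relevance (clicked = 1, not clicked = 0), the metric can be computed as in this short sketch; the function name is ours, not necessarily the one in src/evaluation/accuracy.py.

```python
import math

def ndcg_at_k(ranked_labels, k=5):
    """nDCG@k for one impression.

    ranked_labels: 0/1 relevance of candidates in the order the model
    ranked them. DCG discounts each hit by log2 of its position; nDCG
    normalizes by the DCG of a perfect ordering.
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_labels[:k]))
    ideal = sorted(ranked_labels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, placing the single clicked article first yields 1.0, while placing it second yields 1/log2(3) ≈ 0.63.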

Beyond-Accuracy Metrics

To ensure the system does not trap users in a filter bubble, we also measure diversity.

  • Metric: Intra-list diversity based on article categories
  • Goal: Ensure the recommended articles cover a variety of topics
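
One simple way to realize category-based intra-list diversity is the fraction of recommended pairs whose categories differ. This is a hedged sketch of that idea; the category lookup dict is an assumed input, and the repository may use a different formulation.

```python
from itertools import combinations

def intra_list_diversity(recommended_ids, categories):
    """Fraction of recommendation pairs with different categories.

    recommended_ids: ordered list of recommended article IDs.
    categories: dict mapping article ID -> category string.
    Returns 0.0 for lists with fewer than two items.
    """
    pairs = list(combinations(recommended_ids, 2))
    if not pairs:
        return 0.0
    different = sum(categories[a] != categories[b] for a, b in pairs)
    return different / len(pairs)
```

A score of 0.0 means every recommended article shares one category (a potential filter bubble); 1.0 means every pair differs.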

Project structure

├── data/
│   ├── processed/
│   │   ├── preprocessing_behaviors.py
│   │   └── preprocessing_news.py
│   └── MINDsmall_train/
│       ├── behaviors.tsv
│       ├── entity_embeddings.vec
│       ├── news.tsv
│       └── relation_embedding.vec
├── src/
│   ├── data/
│   │   └── load_mind.py
│   ├── evaluation/
│   │   ├── accuracy.py
│   │   └── beyondAccuracy.py
│   ├── models/
│   │   ├── popular.py
│   │   ├── collaborative.py
│   │   ├── content_based.py
│   │   └── hybrid.py
│   └── main.py
└── requirements.txt
