diff --git a/README.md b/README.md index 44e8ae9..320cb82 100644 --- a/README.md +++ b/README.md @@ -1,29 +1,35 @@ # **LLM-based Product Recommender System** -This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). The system is fine-tuned on Amazon appliances reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations. +This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). The system is fine-tuned on Amazon appliance reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations. --- ## **Project Overview** This project explores the use of large language models (LLMs) for product recommendation systems, leveraging their natural language understanding capabilities to predict which items a user might purchase or review next. The primary focus is not solely on achieving high prediction accuracy but rather on learning if LLMs can be effectively applied in the recommendation domain. +This code was run on a [Lightning AI](https://lightning.ai) L4 GPU. + ### **Key Features:** - **Sequential Recommendation System**: Predicts the next (maximum) 10 items a user might purchase in the future, based on their review history. - **Custom Data Processing**: Handles temporal data (order by time to prevent data/target/information leakage) and processes the sequence of user interactions, ensuring proper padding and handling of missing data. - **Fine-tuning GPT-2**: Fine-tuned GPT-2 medium model to handle recommendation tasks using user reviews, product metadata, and timestamps. -- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR and more. +- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR, Hit Rate, Normalized Discounted Cumulative Gain and more. - **Custom Padding Strategies**: The model can handle missing data with customizable padding strategies (repeat, special token, or no padding). + #### Achieved Metrics: - **Precision@10**: $2.2$% - **Recall@10**: $5.5$% -- **Mean Reciprocal Rank (MRR)**: $0.04$ +- **Mean Reciprocal Rank (MRR)**: $0.041$ +- **Hit Rate at 10 (HR@10)**: $5.7$% +- **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**: $0.04$ -These metrics provide insight into the model's ability to recommend relevant products. The precision@10 indicates that, on average, $2.2$% of the top-10 recommended items are correct. Recall@10 suggests the model can retrieve nearly $6$% of all relevant items. The MRR score of $0.04$ shows that correct recommendations are, on average, ranked quite low in the list. +These metrics provide insight into the model's ability to recommend relevant products. The precision@10 indicates that, on average, 2.2% of the top-10 recommended items are correct. Recall@10 suggests the model can retrieve nearly 6% of all relevant items. The MRR score of 0.041 shows that correct recommendations are ranked relatively low in the list. 
HR@10 reflects the percentage of times the correct product was recommended within the top-10, and NDCG@10 assesses both the ranking and relevance of the predicted items. -> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. The next iteration will involve comparing the LLM-based system's performance with more traditional baseline methods, such as collaborative filtering and matrix factorization, to assess its relative effectiveness. -To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with these baseline approaches. If one is focused on prediction accuracy alone, we could train the model on predicting the next product category a customer would purchase instead of exact items. +> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. + +To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with baseline approaches such as collaborative filtering and matrix factorization. If one is focused on prediction accuracy alone, we could train the model on predicting the next product category a customer would purchase instead of exact items. --- @@ -43,7 +49,7 @@ To reiterate, the primary focus of this project is not just to maximize performa ## **Data** -This project uses Amazon Appliance Reviews data for training, validation, and testing. The dataset can be broadky categorised into: +This project uses Amazon Appliance Reviews data for training, validation, and testing. The dataset can be broadly categorised into: - **User Reviews**: Including `user_id`, `review_text`, `parent_asin`, and `timestamp`. - **Product Metadata**: Including `parent_asin`, `title`, `category`, `features`, `price`, and more. @@ -75,8 +81,8 @@ To run this project locally, follow these steps: 1. **Clone the Repository**: ```bash - git clone https://github.com/yourusername/recsys-llm.git - cd recsys-llm + git clone https://github.com/babaniyi/LLMs-for-RecSys.git + cd LLMs-for-RecSys ``` 2. **Install Dependencies**: @@ -91,7 +97,6 @@ To run this project locally, follow these steps: 4. **Configure Environment Variables**: If using CUDA, ensure your environment is properly set up for PyTorch GPU usage. - To run the script. Go to the command line and run the following: If you want to run the model using Alpaca-style prompt **without** LORA (which is our baseline model). Run @@ -121,7 +126,7 @@ python gpt-experiment.py --run_solution phi3_and_lora ### **Data Preprocessing** -To curate quality training data, we focus on users that has purchased at least 10 materials. +To curate quality training data, we focus on users who have purchased at least 10 items. The data should be preprocessed before training. 
The `get_next_10_items` function processes user interaction data and ensures that each entry contains a list of the next 10 items based on historical interactions. ```python @@ -129,7 +134,7 @@ df_with_next_10_items = get_next_10_items_optimized(df, padding_strategy='repeat``` ### **Train/Test/Validation Split** -You can split the data while preserving its temporal time order to ensure there is no data and target leakage using the `temporal_train_val_test_split` function: +You can split the data while preserving its temporal order to ensure no data or target leakage using the `temporal_train_val_test_split` function: ```python train_df, val_df, test_df = split_data_temporal(df_with_next_10_items, train_ratio=0.7, val_ratio=0.1, test_ratio=0.2) @@ -141,17 +146,6 @@ train_df, val_df, test_df = split_data_temporal(df_with_next_10_items, train_rat ## **Training the Model** First, we calculate the initial training and validation set loss before we start training (the goal is to minimize the loss). The initial train and validation losses can be visualised in a plot `loss-plot-...pdf` which is generated by the model. -Finally, to train the model on the data, we use the `train_model_simple` function: - -```python -train_losses, val_losses, tokens_seen = train_model_simple( - model, train_loader, val_loader, optimizer, device, - num_epochs=num_epochs, eval_freq=5, eval_iter=5, - start_context=start_context, tokenizer=tokenizer, - special_chars=special_user_item_ids, - ) -``` - Make sure that you have your CUDA environment set up properly if using GPUs. The model will output training and validation loss metrics after every evaluation step. An example of data entry is given below. @@ -173,22 +167,29 @@ Example of the processed data sent to the model. ## **Evaluation** -The model supports evaluation metrics like Recall, Precision, Mean Reciprocal Rank. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. +The model supports evaluation metrics like **Recall**, **Precision**, **Mean Reciprocal Rank (MRR)**, **Hit Rate at 10 (HR@10)**, and **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. -You can evaluate the model using the `evaluate_recall_precision` function: +### Evaluation Metrics: +- **Precision@10**: Proportion of relevant items in the top-10 predictions. +- **Recall@10**: Proportion of relevant items retrieved from the actual items. +- **MRR**: The reciprocal rank of the first relevant item in the top-10 recommendations. +- **HR@10**: Binary value indicating whether any relevant item appears in the top-10 predictions (1 if yes, 0 if no). +- **NDCG@10**: Discounted Cumulative Gain normalized over the ideal ranking, taking into account the rank position of relevant items. + +You can evaluate the model using the `evaluate_metrics` function: ```python evaluate_metrics(output_list, k=10) ``` -Example of the output list which includes the output and model response. 
-``` + +Example of the output list, which includes the output and model response: + +```python [{ "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>", "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>", "model_response": "<|item_B0081E9HRY|>, <|item_B07CP1KY9M|>, <|item_B07MWVCVR4|>, <|item_B07QVKSMKK|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07V3ZF517|>, <|item_B07V3ZF517|>," - }, - - ] +}] ``` --- @@ -240,4 +241,4 @@ If you use this project or its code, please consider citing it as follows: month = {September}, github = {https://github.com/babaniyi/LLMs-for-RecSys} } -``` \ No newline at end of file +``` diff --git a/model_evaluation.ipynb b/model_evaluation.ipynb index c24d6a7..9bd6560 100644 --- a/model_evaluation.ipynb +++ b/model_evaluation.ipynb @@ -13,7 +13,8 @@ "metadata": {}, "outputs": [], "source": [ - "import json" + "import json\n", + "import numpy as np" ] }, { @@ -47,7 +48,7 @@ "source": [ "def evaluate_metrics(output, model_response, k=10):\n", " \"\"\"\n", - " Evaluate model's recommendation performance using precision@k, recall@k, and MRR.\n", + " Evaluate model's recommendation performance using precision@k, recall@k, MRR, HR@10, and NDCG@10.\n", " \n", " Args:\n", " output (str): The ground truth output containing the actual items (comma-separated string).\n", @@ -55,7 +56,7 @@ " k (int): The number of top items to consider for precision@k, recall@k, etc.\n", " \n", " Returns:\n", - " dict: A dictionary containing precision@k, recall@k, and MRR.\n", + " dict: A dictionary containing precision@k, recall@k, MRR, HR@10, and NDCG@10.\n", " \"\"\"\n", " # Parse the output and model_response strings into lists\n", " actual_items = [item.strip() for item in output.split(\",\") if \"item_\" in item]\n", @@ -67,27 +68,40 @@ " # Calculate true positives\n", " true_positives = set(actual_items) & set(predicted_items)\n", " \n", - " # Calculate precision@k (How many of the top-K predicted are correct)\n", + " # Calculate precision@k\n", " precision_at_k = len(true_positives) / min(len(predicted_items), k)\n", - "\n", - " # Calculate recall@k (How many of the relevant items are found in the top-K)\n", + " \n", + " # Calculate recall@k\n", " recall_at_k = len(true_positives) / len(actual_items) if len(actual_items) > 0 else 0\n", - "\n", + " \n", " # Calculate MRR (Mean Reciprocal Rank)\n", " mrr = 0\n", " for rank, predicted_item in enumerate(predicted_items, 1):\n", " if predicted_item in actual_items:\n", " mrr = 1 / rank\n", " break\n", + " \n", + " # Calculate HR@10 (Hit Rate at 10)\n", + " hit_rate_at_k = 1 if len(true_positives) > 0 else 0\n", + "\n", + " # Calculate NDCG@10 (Normalized Discounted Cumulative Gain)\n", + " dcg = 0\n", + " for i, item in enumerate(predicted_items):\n", + " if item in actual_items:\n", + " dcg += 1 / np.log2(i + 2)\n", + " idcg = sum([1 / np.log2(i + 2) for i in range(min(len(actual_items), k))])\n", + " ndcg_at_k = dcg / idcg if idcg > 0 else 0\n", "\n", " # Compile metrics into a dictionary\n", " metrics = {\n", - " \"precision@{}\".format(k): precision_at_k,\n", - " \"recall@{}\".format(k): recall_at_k,\n", - " \"mrr\": mrr\n", + " f\"precision@{k}\": precision_at_k,\n", + " f\"recall@{k}\": recall_at_k,\n", + " \"mrr\": mrr,\n", + " f\"hr@{k}\": hit_rate_at_k,\n", + " f\"ndcg@{k}\": ndcg_at_k\n", " }\n", "\n", - " return metrics" + " return metrics\n" ] }, { @@ -98,18 +112,20 @@ "source": [ "def 
evaluate_metrics_on_list(data_list, k=10):\n", " \"\"\"\n", - " Evaluate precision@k, recall@k, and MRR on a list of data.\n", + " Evaluate precision@k, recall@k, MRR, HR@10, and NDCG@10 on a list of data.\n", " \n", " Args:\n", " data_list (list): A list of dictionaries containing 'output' and 'model_response'.\n", " k (int): The number of top items to consider for precision@k, recall@k, etc.\n", " \n", " Returns:\n", - " dict: Average precision@k, recall@k, and MRR for the entire dataset.\n", + " dict: Average precision@k, recall@k, MRR, HR@10, and NDCG@10 for the entire dataset.\n", " \"\"\"\n", " precision_sum = 0\n", " recall_sum = 0\n", " mrr_sum = 0\n", + " hr_sum = 0\n", + " ndcg_sum = 0\n", " num_samples = len(data_list)\n", "\n", " for data in data_list:\n", @@ -123,12 +139,16 @@ " precision_sum += metrics[f\"precision@{k}\"]\n", " recall_sum += metrics[f\"recall@{k}\"]\n", " mrr_sum += metrics[\"mrr\"]\n", + " hr_sum += metrics[f\"hr@{k}\"]\n", + " ndcg_sum += metrics[f\"ndcg@{k}\"]\n", "\n", " # Calculate the average metrics over all samples\n", " avg_metrics = {\n", " f\"precision@{k}\": precision_sum / num_samples,\n", " f\"recall@{k}\": recall_sum / num_samples,\n", - " \"mrr\": mrr_sum / num_samples\n", + " \"mrr\": mrr_sum / num_samples,\n", + " f\"hr@{k}\": hr_sum / num_samples,\n", + " f\"ndcg@{k}\": ndcg_sum / num_samples\n", " }\n", "\n", " return avg_metrics\n" @@ -248,7 +268,6 @@ } ], "source": [ - "# Evaluate the metrics\n", "avg_metrics = evaluate_metrics_on_list(data_list_with_atleast_one_item_output, k=10)\n", "print(\"Average Metrics:\", avg_metrics)" ]
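As a closing illustration, here is a minimal usage sketch of the evaluation helpers added in the notebook diff above. It assumes the `evaluate_metrics` and `evaluate_metrics_on_list` cells (and the `numpy` import) have already been run; the sample entry is hypothetical and reuses the user/item tokens from the README example.

```python
# Hypothetical toy sample in the same output-list format shown in the README.
# Each entry only needs the "output" and "model_response" keys that
# evaluate_metrics_on_list reads; "input" is included for readability.
sample_outputs = [
    {
        "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>",
        "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>",
        "model_response": "<|item_B0081E9HRY|>, <|item_B07FTFD1XB|>, <|item_B07MWVCVR4|>",
    },
]

# Averages precision@10, recall@10, MRR, HR@10 and NDCG@10 over the list.
avg_metrics = evaluate_metrics_on_list(sample_outputs, k=10)
print("Average Metrics:", avg_metrics)

# For this single sample the one relevant item sits at rank 2 of three predictions,
# so (assuming the model response is parsed the same way as the ground-truth output)
# one would expect roughly precision@10 ≈ 0.33, recall@10 = 1.0, MRR = 0.5,
# HR@10 = 1, and NDCG@10 ≈ 0.63.
```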