From 569955c43bbc1f6a3e39fc6ecc5c17f741d311b2 Mon Sep 17 00:00:00 2001 From: Syed Tauhid Ullah Shah Date: Thu, 12 Sep 2024 00:34:55 -0600 Subject: [PATCH 1/7] Update README.md --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 44e8ae9..7c5e9c8 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,8 @@ This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). The system is fine-tuned on Amazon appliances reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations. + + --- ## **Project Overview** @@ -164,7 +166,7 @@ Example of the processed data sent to the model. { "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>", "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>", - "instruction": "Given a user purchased an an item with the following details, predict the next 10 items the user would purchase. Item id is <|item_B00Y5UZD5I|>. Rating of the item by user from 1 to 5 is 2.0. Text of the user review is To expensive, the quality is questionable... The coating is thin and becomes damaged very easy! I would not recommend for these reasons. Main category of the item is Tools & Home Improvement. Item name is GE WD28X10399 Upper Rack. Price USD is nan. Item details is manufacturer is: GE. part number is: WD28X10399. item weight is: 0.32 ounces. product dimensions is: 4.1 x 0.4 x 2.5 inches. item model number is: WD28X10399. item package quantity is: 1. certification is: Certified frustration-free, Not Applicable. included components is: Appliance-replacement-parts, Appliance Parts & Accessories. 
batteries included? is: No. batteries required? is: No. best sellers rank is: {'Tools & Home Improvement': 604524, 'Parts & Accessories': 91543}. date first available is: November 21, 2013.", + "instruction": "Given a user purchased an an item with the following details, predict the next 10 items the user would purchase. Item id is <|item_B00Y5UZD5I|>. Rating of the item by user from 1 to 5 is 2.0. Text of the user review is To expensive, the quality is questionable... The coating is thin and becomes damaged very easy! I would not recommend for these reasons. Main category of the item is Tools & Home Improvement. Item name is GE WD28X10399 Upper Rack. Price USD is nan. Item details is manufacturer is: GE. part number is: WD28X10399. item weight is: 0.32 ounces. product dimensions is: 4.1 x 0.4 x 2.5 inches. item model number is: WD28X10399. item package quantity is: 1. certification is: Certified frustration-free, Not Applicable. included components is: Appliance-replacement-parts, Appliance Parts & Accessories. batteries included? is: No. batteries required? is: No. best sellers rank is: {'Tools & Home Improvement': 604524, 'Parts & Accessories': 91543}. 
date first available is: November 21, 2013.", }, ``` @@ -240,4 +242,4 @@ If you use this project or its code, please consider citing it as follows: month = {September}, github = {https://github.com/babaniyi/LLMs-for-RecSys} } -``` \ No newline at end of file +``` From 1c3ffad01731e0c30e221158831d662c5b48576d Mon Sep 17 00:00:00 2001 From: Syed Tauhid Ullah Shah Date: Thu, 12 Sep 2024 00:38:11 -0600 Subject: [PATCH 2/7] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 7c5e9c8..455bd0c 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ This project explores the use of large language models (LLMs) for product recomm - **Sequential Recommendation System**: Predicts the next (maximum) 10 items a user might purchase in the future, based on their review history. - **Custom Data Processing**: Handles temporal data (order by time to prevent data/target/information leakage) and processes the sequence of user interactions, ensuring proper padding and handling of missing data. - **Fine-tuning GPT-2**: Fine-tuned GPT-2 medium model to handle recommendation tasks using user reviews, product metadata, and timestamps. -- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR and more. +- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR, HR@10, NDCG@10 and more. - **Custom Padding Strategies**: The model can handle missing data with customizable padding strategies (repeat, special token, or no padding). #### Achieved Metrics: @@ -175,7 +175,7 @@ Example of the processed data sent to the model. ## **Evaluation** -The model supports evaluation metrics like Recall, Precision, Mean Reciprocal Rank. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. 
+The model supports evaluation metrics such as Recall, Precision, Mean Reciprocal Rank (MRR), Hit Rate at 10 (HR@10), and Normalized Discounted Cumulative Gain at 10 (NDCG@10). After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. You can evaluate the model using the `evaluate_recall_precision` function: From 6776f34145c097e1cdcaeca1074bb48b475c98ed Mon Sep 17 00:00:00 2001 From: TauhidShah Date: Thu, 12 Sep 2024 00:44:07 -0600 Subject: [PATCH 3/7] Update model_evaluation.ipynb --- model_evaluation.ipynb | 53 +++++++++++++++++++++++++++++------------- 1 file changed, 37 insertions(+), 16 deletions(-) diff --git a/model_evaluation.ipynb b/model_evaluation.ipynb index c24d6a7..53dd5de 100644 --- a/model_evaluation.ipynb +++ b/model_evaluation.ipynb @@ -13,7 +13,8 @@ "metadata": {}, "outputs": [], "source": [ - "import json" + "import json\n", + "import numpy as np" ] }, { @@ -45,9 +46,11 @@ "metadata": {}, "outputs": [], "source": [ + "\n", + "\n", "def evaluate_metrics(output, model_response, k=10):\n", " \"\"\"\n", - " Evaluate model's recommendation performance using precision@k, recall@k, and MRR.\n", + " Evaluate model's recommendation performance using precision@k, recall@k, MRR, HR@10, and NDCG@10.\n", " \n", " Args:\n", " output (str): The ground truth output containing the actual items (comma-separated string).\n", @@ -55,7 +58,7 @@ " k (int): The number of top items to consider for precision@k, recall@k, etc.\n", " \n", " Returns:\n", - " dict: A dictionary containing precision@k, recall@k, and MRR.\n", + " dict: A dictionary containing precision@k, recall@k, MRR, HR@10, and NDCG@10.\n", " \"\"\"\n", " # Parse the output and model_response strings into lists\n", " actual_items = [item.strip() for item in output.split(\",\") if \"item_\" in item]\n", @@ -67,27 +70,40 @@ " # Calculate true positives\n", " true_positives = set(actual_items) & set(predicted_items)\n", " 
\n", - " # Calculate precision@k (How many of the top-K predicted are correct)\n", + " # Calculate precision@k\n", " precision_at_k = len(true_positives) / min(len(predicted_items), k)\n", - "\n", - " # Calculate recall@k (How many of the relevant items are found in the top-K)\n", + " \n", + " # Calculate recall@k\n", " recall_at_k = len(true_positives) / len(actual_items) if len(actual_items) > 0 else 0\n", - "\n", + " \n", " # Calculate MRR (Mean Reciprocal Rank)\n", " mrr = 0\n", " for rank, predicted_item in enumerate(predicted_items, 1):\n", " if predicted_item in actual_items:\n", " mrr = 1 / rank\n", " break\n", + " \n", + " # Calculate HR@10 (Hit Rate at 10)\n", + " hit_rate_at_k = 1 if len(true_positives) > 0 else 0\n", + "\n", + " # Calculate NDCG@10 (Normalized Discounted Cumulative Gain)\n", + " dcg = 0\n", + " for i, item in enumerate(predicted_items):\n", + " if item in actual_items:\n", + " dcg += 1 / np.log2(i + 2)\n", + " idcg = sum([1 / np.log2(i + 2) for i in range(min(len(actual_items), k))])\n", + " ndcg_at_k = dcg / idcg if idcg > 0 else 0\n", "\n", " # Compile metrics into a dictionary\n", " metrics = {\n", - " \"precision@{}\".format(k): precision_at_k,\n", - " \"recall@{}\".format(k): recall_at_k,\n", - " \"mrr\": mrr\n", + " f\"precision@{k}\": precision_at_k,\n", + " f\"recall@{k}\": recall_at_k,\n", + " \"mrr\": mrr,\n", + " f\"hr@{k}\": hit_rate_at_k,\n", + " f\"ndcg@{k}\": ndcg_at_k\n", " }\n", "\n", - " return metrics" + " return metrics\n" ] }, { @@ -98,18 +114,20 @@ "source": [ "def evaluate_metrics_on_list(data_list, k=10):\n", " \"\"\"\n", - " Evaluate precision@k, recall@k, and MRR on a list of data.\n", + " Evaluate precision@k, recall@k, MRR, HR@10, and NDCG@10 on a list of data.\n", " \n", " Args:\n", " data_list (list): A list of dictionaries containing 'output' and 'model_response'.\n", " k (int): The number of top items to consider for precision@k, recall@k, etc.\n", " \n", " Returns:\n", - " dict: Average precision@k, 
recall@k, and MRR for the entire dataset.\n", + " dict: Average precision@k, recall@k, MRR, HR@10, and NDCG@10 for the entire dataset.\n", " \"\"\"\n", " precision_sum = 0\n", " recall_sum = 0\n", " mrr_sum = 0\n", + " hr_sum = 0\n", + " ndcg_sum = 0\n", " num_samples = len(data_list)\n", "\n", " for data in data_list:\n", @@ -123,12 +141,16 @@ " precision_sum += metrics[f\"precision@{k}\"]\n", " recall_sum += metrics[f\"recall@{k}\"]\n", " mrr_sum += metrics[\"mrr\"]\n", + " hr_sum += metrics[f\"hr@{k}\"]\n", + " ndcg_sum += metrics[f\"ndcg@{k}\"]\n", "\n", " # Calculate the average metrics over all samples\n", " avg_metrics = {\n", " f\"precision@{k}\": precision_sum / num_samples,\n", " f\"recall@{k}\": recall_sum / num_samples,\n", - " \"mrr\": mrr_sum / num_samples\n", + " \"mrr\": mrr_sum / num_samples,\n", + " f\"hr@{k}\": hr_sum / num_samples,\n", + " f\"ndcg@{k}\": ndcg_sum / num_samples\n", " }\n", "\n", " return avg_metrics\n" @@ -248,9 +270,8 @@ } ], "source": [ - "# Evaluate the metrics\n", "avg_metrics = evaluate_metrics_on_list(data_list_with_atleast_one_item_output, k=10)\n", - "print(\"Average Metrics:\", avg_metrics)" + "print(\"Average Metrics:\", avg_metrics)\n" ] }, { From 9ce5631b0f4d83e10d230543d8b732accceb849f Mon Sep 17 00:00:00 2001 From: TauhidShah Date: Thu, 12 Sep 2024 00:44:21 -0600 Subject: [PATCH 4/7] Readme updated --- README (1).md | 122 +++++++++++++++++++++++++ README.md | 243 -------------------------------------------------- 2 files changed, 122 insertions(+), 243 deletions(-) create mode 100644 README (1).md delete mode 100644 README.md diff --git a/README (1).md b/README (1).md new file mode 100644 index 0000000..2fd891c --- /dev/null +++ b/README (1).md @@ -0,0 +1,122 @@ + +## **LLM-based Product Recommender System** + +This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). 
The system is fine-tuned on Amazon appliances reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations. + +--- + +## **Project Overview** +This project explores the use of large language models (LLMs) for product recommendation systems, leveraging their natural language understanding capabilities to predict which items a user might purchase or review next. The primary focus is not solely on achieving high prediction accuracy but rather on learning if LLMs can be effectively applied in the recommendation domain. + +### **Key Features:** +- **Sequential Recommendation System**: Predicts the next (maximum) 10 items a user might purchase in the future, based on their review history. +- **Custom Data Processing**: Handles temporal data (order by time to prevent data/target/information leakage) and processes the sequence of user interactions, ensuring proper padding and handling of missing data. +- **Fine-tuning GPT-2**: Fine-tuned GPT-2 medium model to handle recommendation tasks using user reviews, product metadata, and timestamps. +- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR, HR@10, and NDCG@10. +- **Custom Padding Strategies**: The model can handle missing data with customizable padding strategies (repeat, special token, or no padding). + +#### Achieved Metrics: +- **Precision@10**: 2.19% +- **Recall@10**: 5.36% +- **Mean Reciprocal Rank (MRR)**: 0.041 +- **Hit Rate at 10 (HR@10)**: 5.65% +- **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**: 0.0429 + +These metrics provide insight into the model's ability to recommend relevant products. The precision@10 indicates that, on average, 2.19% of the top-10 recommended items are correct. 
Recall@10 suggests the model can retrieve 5.36% of all relevant items. The MRR score of 0.041 shows that correct recommendations are ranked relatively low in the list. HR@10 reflects the percentage of times the correct product was recommended within the top-10, and NDCG@10 assesses both the ranking and relevance of the predicted items.
+
+> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. The next iteration will involve comparing the LLM-based system's performance with more traditional baseline methods, such as collaborative filtering and matrix factorization, to assess its relative effectiveness.
+
+To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with these baseline approaches. If one is focused on prediction accuracy alone, we could train the model to predict the next product category a customer would purchase instead of exact items.
+
+---
+
+## **Evaluation**
+
+The model supports evaluation metrics such as **Recall**, **Precision**, **Mean Reciprocal Rank (MRR)**, **Hit Rate at 10 (HR@10)**, and **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics.
+
+You can evaluate the model using the `evaluate_metrics` function:
+
+```python
+evaluate_metrics(output_list, k=10)
+```
+
+### Evaluation Metrics:
+- **Precision@10**: Proportion of relevant items in the top-10 predictions.
+- **Recall@10**: Proportion of relevant items retrieved from the actual items.
+- **MRR**: The reciprocal rank of the first relevant item in the top-10 recommendations. +- **HR@10**: Binary value indicating whether any relevant item appears in the top-10 predictions (1 if yes, 0 if no). +- **NDCG@10**: Discounted Cumulative Gain normalized over the ideal ranking, taking into account the rank position of relevant items. + +### Updated Results + +After applying these metrics, the following evaluation results were obtained: + +```python +Average Metrics: { + 'precision@10': 0.0219, + 'recall@10': 0.0536, + 'mrr': 0.0413, + 'hr@10': 0.0565, + 'ndcg@10': 0.0429 +} +``` + +Example of the output list, which includes the output and model response: + +```python +[{ + "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>", + "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>", + "model_response": "<|item_B0081E9HRY|>, <|item_B07CP1KY9M|>, <|item_B07MWVCVR4|>, <|item_B07QVKSMKK|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07V3ZF517|>, <|item_B07V3ZF517|>," +}] +``` + +--- + +## **Contributing** + +We welcome contributions to this project! Please follow the steps below if you would like to contribute: + +1. Fork the repository. +2. Create a feature branch: `git checkout -b feature/feature-name`. +3. Commit your changes: `git commit -m 'Add feature'`. +4. Push to the branch: `git push origin feature/feature-name`. +5. Open a pull request. + +--- + +## **License** + +This project is licensed under the MIT License. See the `LICENSE` file for more details. + +--- + +## **Contact** + +For any questions or issues, feel free to open an issue on this repository or reach out to `horlaneyee@gmail.com`. + +--- + +### **Acknowledgements** +> Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166. 
+ +> @article{hou2024bridging, + title={Bridging Language and Items for Retrieval and Recommendation}, + author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian}, + journal={arXiv preprint arXiv:2403.03952}, + year={2024} +} + +### **Citation** +If you use this project or its code, please consider citing it as follows: + +``` +@article{llms-for-product-recsys, + title = {LLM-based Product Recommender System}, + author = {Babaniyi Olaniyi}, + journal = {babaniyi.com}, + year = {2024}, + month = {September}, + github = {https://github.com/babaniyi/LLMs-for-RecSys} +} +``` diff --git a/README.md b/README.md deleted file mode 100644 index 44e8ae9..0000000 --- a/README.md +++ /dev/null @@ -1,243 +0,0 @@ -# **LLM-based Product Recommender System** - -This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). The system is fine-tuned on Amazon appliances reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations. - ---- - -## **Project Overview** -This project explores the use of large language models (LLMs) for product recommendation systems, leveraging their natural language understanding capabilities to predict which items a user might purchase or review next. The primary focus is not solely on achieving high prediction accuracy but rather on learning if LLMs can be effectively applied in the recommendation domain. - -### **Key Features:** -- **Sequential Recommendation System**: Predicts the next (maximum) 10 items a user might purchase in the future, based on their review history. 
-- **Custom Data Processing**: Handles temporal data (order by time to prevent data/target/information leakage) and processes the sequence of user interactions, ensuring proper padding and handling of missing data. -- **Fine-tuning GPT-2**: Fine-tuned GPT-2 medium model to handle recommendation tasks using user reviews, product metadata, and timestamps. -- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR and more. -- **Custom Padding Strategies**: The model can handle missing data with customizable padding strategies (repeat, special token, or no padding). - -#### Achieved Metrics: -- **Precision@10**: $2.2$% -- **Recall@10**: $5.5$% -- **Mean Reciprocal Rank (MRR)**: $0.04$ - -These metrics provide insight into the model's ability to recommend relevant products. The precision@10 indicates that, on average, $2.2$% of the top-10 recommended items are correct. Recall@10 suggests the model can retrieve nearly $6$% of all relevant items. The MRR score of $0.04$ shows that correct recommendations are, on average, ranked quite low in the list. - -> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. The next iteration will involve comparing the LLM-based system's performance with more traditional baseline methods, such as collaborative filtering and matrix factorization, to assess its relative effectiveness. - -To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with these baseline approaches. 
If one is focused on prediction accuracy alone, we could train the model on predicting the next product category a customer would purchase instead of exact items. - ---- - -## **Table of Contents** - -1. [Project Overview](#project-overview) -2. [Data](#data) -3. [Model Architecture](#model-architecture) -4. [Installation](#installation) -5. [Usage](#usage) -6. [Training the Model](#training-the-model) -7. [Evaluation](#evaluation) -8. [Contributing](#contributing) -9. [License](#license) - ---- - -## **Data** - -This project uses Amazon Appliance Reviews data for training, validation, and testing. The dataset can be broadky categorised into: -- **User Reviews**: Including `user_id`, `review_text`, `parent_asin`, and `timestamp`. -- **Product Metadata**: Including `parent_asin`, `title`, `category`, `features`, `price`, and more. - -Each row in the dataset is processed to generate the next 10 items that the user interacted with. Data is split into training, validation, and test sets, keeping the temporal nature of user interactions intact. - -[Download data here](https://amazon-reviews-2023.github.io/) - ---- - -## **Model Architecture** - -The project uses the GPT-2 medium architecture, consisting of: -- **Embedding Dimension**: 1024 -- **Number of Layers**: 24 -- **Number of Attention Heads**: 16 - -The model is fine-tuned to take as input: -- **User Reviews**: User-written text. -- **Item Metadata**: Key features like title, category, and other details. - -### **Loss Function**: -We calculate the loss using cross-entropy between the predicted next items and the actual items the user interacted with. - ---- - -## **Installation** - -To run this project locally, follow these steps: - -1. **Clone the Repository**: - ```bash - git clone https://github.com/yourusername/recsys-llm.git - cd recsys-llm - ``` - -2. **Install Dependencies**: - ```bash - pip install -r requirements.txt - ``` - -3. 
**Set Up the Data**: - - Download the data `Appliances.jsonl` and `meta_Appliances.jsonl` [data here](https://amazon-reviews-2023.github.io/) - - Place the data in `data/raw/` folder - -4. **Configure Environment Variables**: - If using CUDA, ensure your environment is properly set up for PyTorch GPU usage. - - -To run the script. Go to the command line and run the following: - -If you want to run the model using Alpaca-style prompt **without** LORA (which is our baseline model). Run -```bash -python gpt-experiment.py --run_solution baseline -``` - -Use Alpaca-style prompt **with** LORA -```bash -python gpt-experiment.py --run_solution alpaca_and_lora -``` - -If you want to run the model using Phi3-style prompt **without** LORA -```bash -python gpt-experiment.py --run_solution phi3_prompt -``` - -If you want to run the model using Phi3-style prompt **with** LORA -```bash -python gpt-experiment.py --run_solution phi3_and_lora -``` - - ---- - -## **Usage** - -### **Data Preprocessing** - -To curate quality training data, we focus on users that has purchased at least 10 materials. -The data should be preprocessed before training. The `get_next_10_items` function processes user interaction data and ensures that each entry contains a list of the next 10 items based on historical interactions. - -```python -df_with_next_10_items = get_next_10_items_optimized(df, padding_strategy='repeat', pad_token=50257) -``` - -### **Train/Test/Validation Split** -You can split the data while preserving its temporal time order to ensure there is no data and target leakage using the `temporal_train_val_test_split` function: - -```python -train_df, val_df, test_df = split_data_temporal(df_with_next_10_items, train_ratio=0.7, val_ratio=0.1, test_ratio=0.2) -``` - - ---- - -## **Training the Model** -First, we calculate the initial training and validation set loss before we start training (the goal is to minimize the loss). 
The initial train and validation losses can be visualised in a plot `loss-plot-...pdf` which is generated by the model. - -Finally, to train the model on the data, we use the `train_model_simple` function: - -```python -train_losses, val_losses, tokens_seen = train_model_simple( - model, train_loader, val_loader, optimizer, device, - num_epochs=num_epochs, eval_freq=5, eval_iter=5, - start_context=start_context, tokenizer=tokenizer, - special_chars=special_user_item_ids, - ) -``` - -Make sure that you have your CUDA environment set up properly if using GPUs. The model will output training and validation loss metrics after every evaluation step. - -An example of data entry is given below. -- **Input:** We set the input as the user_id in the following format <|user|>. -- **Output:** We set it as the next items purchased by the user and encoded as <|item_id|>. If items purchased are exhausted, it's encoded using a special token <|endoftext|> -- **Prompt:** We instruct the model with the following prompt `Given a user purchased an an item with the following details, predict the next 10 items the user would purchase.` - -Example of the processed data sent to the model. -``` -{ - "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>", - "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>", - "instruction": "Given a user purchased an an item with the following details, predict the next 10 items the user would purchase. Item id is <|item_B00Y5UZD5I|>. Rating of the item by user from 1 to 5 is 2.0. Text of the user review is To expensive, the quality is questionable... The coating is thin and becomes damaged very easy! I would not recommend for these reasons. Main category of the item is Tools & Home Improvement. Item name is GE WD28X10399 Upper Rack. Price USD is nan. Item details is manufacturer is: GE. part number is: WD28X10399. item weight is: 0.32 ounces. 
product dimensions is: 4.1 x 0.4 x 2.5 inches. item model number is: WD28X10399. item package quantity is: 1. certification is: Certified frustration-free, Not Applicable. included components is: Appliance-replacement-parts, Appliance Parts & Accessories. batteries included? is: No. batteries required? is: No. best sellers rank is: {'Tools & Home Improvement': 604524, 'Parts & Accessories': 91543}. date first available is: November 21, 2013.", - }, -``` - - ---- - -## **Evaluation** - -The model supports evaluation metrics like Recall, Precision, Mean Reciprocal Rank. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. - -You can evaluate the model using the `evaluate_recall_precision` function: - -```python -evaluate_metrics(output_list, k=10) -``` -Example of the output list which includes the output and model response. -``` -[{ - "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>", - "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>", - "model_response": "<|item_B0081E9HRY|>, <|item_B07CP1KY9M|>, <|item_B07MWVCVR4|>, <|item_B07QVKSMKK|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07V3ZF517|>, <|item_B07V3ZF517|>," - }, - - ] -``` - ---- - - -## **Contributing** - -We welcome contributions to this project! Please follow the steps below if you would like to contribute: - -1. Fork the repository. -2. Create a feature branch: `git checkout -b feature/feature-name`. -3. Commit your changes: `git commit -m 'Add feature'`. -4. Push to the branch: `git push origin feature/feature-name`. -5. Open a pull request. - ---- - -## **License** - -This project is licensed under the MIT License. See the `LICENSE` file for more details. 
- ---- - -## **Contact** - -For any questions or issues, feel free to open an issue on this repository or reach out to `horlaneyee@gmail.com`. - ---- - -### **Acknowledgements** -> Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166. - -> @article{hou2024bridging, - title={Bridging Language and Items for Retrieval and Recommendation}, - author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian}, - journal={arXiv preprint arXiv:2403.03952}, - year={2024} -} - -### **Citation** -If you use this project or its code, please consider citing it as follows: - -``` -@article{llms-for-product-recsys, - title = {LLM-based Product Recommender System}, - author = {Babaniyi Olaniyi}, - journal = {babaniyi.com}, - year = {2024}, - month = {September}, - github = {https://github.com/babaniyi/LLMs-for-RecSys} -} -``` \ No newline at end of file From 7594ccd19a80ccf5473396aa482e2d1fe13d95a4 Mon Sep 17 00:00:00 2001 From: TauhidShah Date: Thu, 12 Sep 2024 00:46:07 -0600 Subject: [PATCH 5/7] Update Read.me --- README (1).md | 122 -------------------------------------------------- 1 file changed, 122 deletions(-) delete mode 100644 README (1).md diff --git a/README (1).md b/README (1).md deleted file mode 100644 index 2fd891c..0000000 --- a/README (1).md +++ /dev/null @@ -1,122 +0,0 @@ - -## **LLM-based Product Recommender System** - -This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). The system is fine-tuned on Amazon appliances reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations. 
- ---- - -## **Project Overview** -This project explores the use of large language models (LLMs) for product recommendation systems, leveraging their natural language understanding capabilities to predict which items a user might purchase or review next. The primary focus is not solely on achieving high prediction accuracy but rather on learning if LLMs can be effectively applied in the recommendation domain. - -### **Key Features:** -- **Sequential Recommendation System**: Predicts the next (maximum) 10 items a user might purchase in the future, based on their review history. -- **Custom Data Processing**: Handles temporal data (order by time to prevent data/target/information leakage) and processes the sequence of user interactions, ensuring proper padding and handling of missing data. -- **Fine-tuning GPT-2**: Fine-tuned GPT-2 medium model to handle recommendation tasks using user reviews, product metadata, and timestamps. -- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR, HR@10, and NDCG@10. -- **Custom Padding Strategies**: The model can handle missing data with customizable padding strategies (repeat, special token, or no padding). - -#### Achieved Metrics: -- **Precision@10**: 2.19% -- **Recall@10**: 5.36% -- **Mean Reciprocal Rank (MRR)**: 0.041 -- **Hit Rate at 10 (HR@10)**: 5.65% -- **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**: 0.0429 - -These metrics provide insight into the model's ability to recommend relevant products. The precision@10 indicates that, on average, 2.19% of the top-10 recommended items are correct. Recall@10 suggests the model can retrieve nearly 5.36% of all relevant items. The MRR score of 0.041 shows that correct recommendations are ranked relatively low in the list. HR@10 reflects the percentage of times the correct product was recommended within the top-10, and NDCG@10 assesses both the ranking and relevance of the predicted items. 
- -> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. The next iteration will involve comparing the LLM-based system's performance with more traditional baseline methods, such as collaborative filtering and matrix factorization, to assess its relative effectiveness. - -To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with these baseline approaches. If one is focused on prediction accuracy alone, we could train the model on predicting the next product category a customer would purchase instead of exact items. - ---- - -## **Evaluation** - -The model supports evaluation metrics like **Recall**, **Precision**, **Mean Reciprocal Rank (MRR)**, **Hit Rate at 10 (HR@10)**, and **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. - -You can evaluate the model using the `evaluate_metrics` function: - -```python -evaluate_metrics(output_list, k=10) -``` - -### Evaluation Metrics: -- **Precision@10**: Proportion of relevant items in the top-10 predictions. -- **Recall@10**: Proportion of relevant items retrieved from the actual items. -- **MRR**: The reciprocal rank of the first relevant item in the top-10 recommendations. -- **HR@10**: Binary value indicating whether any relevant item appears in the top-10 predictions (1 if yes, 0 if no). -- **NDCG@10**: Discounted Cumulative Gain normalized over the ideal ranking, taking into account the rank position of relevant items. 
- -### Updated Results - -After applying these metrics, the following evaluation results were obtained: - -```python -Average Metrics: { - 'precision@10': 0.0219, - 'recall@10': 0.0536, - 'mrr': 0.0413, - 'hr@10': 0.0565, - 'ndcg@10': 0.0429 -} -``` - -Example of the output list, which includes the output and model response: - -```python -[{ - "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>", - "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>", - "model_response": "<|item_B0081E9HRY|>, <|item_B07CP1KY9M|>, <|item_B07MWVCVR4|>, <|item_B07QVKSMKK|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07PJ8H3W5|>, <|item_B07V3ZF517|>, <|item_B07V3ZF517|>," -}] -``` - ---- - -## **Contributing** - -We welcome contributions to this project! Please follow the steps below if you would like to contribute: - -1. Fork the repository. -2. Create a feature branch: `git checkout -b feature/feature-name`. -3. Commit your changes: `git commit -m 'Add feature'`. -4. Push to the branch: `git push origin feature/feature-name`. -5. Open a pull request. - ---- - -## **License** - -This project is licensed under the MIT License. See the `LICENSE` file for more details. - ---- - -## **Contact** - -For any questions or issues, feel free to open an issue on this repository or reach out to `horlaneyee@gmail.com`. - ---- - -### **Acknowledgements** -> Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166. 
- -> @article{hou2024bridging, - title={Bridging Language and Items for Retrieval and Recommendation}, - author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian}, - journal={arXiv preprint arXiv:2403.03952}, - year={2024} -} - -### **Citation** -If you use this project or its code, please consider citing it as follows: - -``` -@article{llms-for-product-recsys, - title = {LLM-based Product Recommender System}, - author = {Babaniyi Olaniyi}, - journal = {babaniyi.com}, - year = {2024}, - month = {September}, - github = {https://github.com/babaniyi/LLMs-for-RecSys} -} -``` From bded799090c8ea923b450010046d2a5eaeebeff5 Mon Sep 17 00:00:00 2001 From: TauhidShah Date: Thu, 12 Sep 2024 00:49:45 -0600 Subject: [PATCH 6/7] Add HR@10 and NDCG@10 metrics and update README --- model_evaluation.ipynb | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/model_evaluation.ipynb b/model_evaluation.ipynb index 53dd5de..9bd6560 100644 --- a/model_evaluation.ipynb +++ b/model_evaluation.ipynb @@ -46,8 +46,6 @@ "metadata": {}, "outputs": [], "source": [ - "\n", - "\n", "def evaluate_metrics(output, model_response, k=10):\n", " \"\"\"\n", " Evaluate model's recommendation performance using precision@k, recall@k, MRR, HR@10, and NDCG@10.\n", @@ -271,7 +269,7 @@ ], "source": [ "avg_metrics = evaluate_metrics_on_list(data_list_with_atleast_one_item_output, k=10)\n", - "print(\"Average Metrics:\", avg_metrics)\n" + "print(\"Average Metrics:\", avg_metrics)" ] }, { From 6e54608537a63ba43dd16ac4ac982e9ad7a65f35 Mon Sep 17 00:00:00 2001 From: Syed Tauhid Ullah Shah Date: Thu, 12 Sep 2024 12:41:59 -0600 Subject: [PATCH 7/7] Update README.md --- README.md | 174 ++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 148 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index 2fd891c..320cb82 100644 --- a/README.md +++ b/README.md @@ -1,45 +1,174 @@ +# **LLM-based Product Recommender System** -## 
**LLM-based Product Recommender System**

This repository contains a project focused on building a product recommendation system using Large Language Models (LLMs). The system is fine-tuned on Amazon appliance reviews and product metadata to predict the **exact** next products a user may purchase based on their historical interactions. The recommendation model leverages GPT-2 as a foundation and incorporates custom padding and data processing strategies to handle sequential product recommendations.

---

## **Project Overview**
This project explores the use of large language models (LLMs) for product recommendation systems, leveraging their natural language understanding capabilities to predict which items a user might purchase or review next. The primary focus is not solely on achieving high prediction accuracy but rather on learning whether LLMs can be effectively applied in the recommendation domain.

This code was run on a [Lightning AI](https://lightning.ai) L4 GPU.

### **Key Features:**
- **Sequential Recommendation System**: Predicts the next (maximum) 10 items a user might purchase in the future, based on their review history.
- **Custom Data Processing**: Handles temporal data (ordered by time to prevent data/target/information leakage) and processes the sequence of user interactions, ensuring proper padding and handling of missing data.
- **Fine-tuning GPT-2**: Fine-tuned a GPT-2 medium model to handle recommendation tasks using user reviews, product metadata, and timestamps.
- **Evaluation Metrics**: Supports evaluation with metrics such as Recall, Precision, MRR, Hit Rate, and Normalized Discounted Cumulative Gain.
- **Custom Padding Strategies**: The model can handle missing data with customizable padding strategies (repeat, special token, or no padding).

#### Achieved Metrics:
- **Precision@10**: 2.2%
- **Recall@10**: 5.4%
- **Mean Reciprocal Rank (MRR)**: 0.041
- **Hit Rate at 10 (HR@10)**: 5.7%
- **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**: 0.043

These metrics provide insight into the model's ability to recommend relevant products. Precision@10 indicates that, on average, 2.2% of the top-10 recommended items are correct, and recall@10 shows the model retrieves about 5.4% of all relevant items. The MRR score of 0.041 means the first correct recommendation tends to appear low in the ranking. HR@10 reflects how often a correct product appears within the top-10, and NDCG@10 assesses both the ranking and relevance of the predicted items.
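For intuition, these metrics can be computed per user roughly as follows. This is an illustrative sketch, not the repository's `evaluate_metrics` implementation; the function name and example item ids are hypothetical.

```python
import math

def metrics_at_k(recommended, actual, k=10):
    """Per-user precision@k, recall@k, MRR, HR@k and NDCG@k with binary relevance."""
    top_k = recommended[:k]
    hits = [item for item in top_k if item in actual]

    precision = len(hits) / k
    recall = len(hits) / len(actual) if actual else 0.0

    # MRR: reciprocal rank of the first relevant item (0 if none in the top-k).
    mrr = 0.0
    for rank, item in enumerate(top_k, start=1):
        if item in actual:
            mrr = 1.0 / rank
            break

    hr = 1.0 if hits else 0.0

    # NDCG@k: DCG with binary gains, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(top_k, start=1) if item in actual)
    ideal_hits = min(len(actual), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {"precision": precision, "recall": recall, "mrr": mrr, "hr": hr, "ndcg": ndcg}

# Example: one relevant item, recommended at rank 2.
print(metrics_at_k(["B01", "B07FTFD1XB", "B03"], {"B07FTFD1XB"}, k=10))
```

The numbers reported above are these per-user values averaged over all evaluated users.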
-> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. The next iteration will involve comparing the LLM-based system's performance with more traditional baseline methods, such as collaborative filtering and matrix factorization, to assess its relative effectiveness. -To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with these baseline approaches. If one is focused on prediction accuracy alone, we could train the model on predicting the next product category a customer would purchase instead of exact items. +> While these numbers are modest and indicate that the model is far from perfect in recommending the exact next items, they still offer valuable insight into the potential of LLMs in capturing user intent and product features. + +To reiterate, the primary focus of this project is not just to maximize performance but to explore whether LLMs can offer a viable approach to recommendation tasks. The experiment is ongoing, and future improvements will focus on refining the model and comparing its performance with these baseline approaches (such as collaborative filtering and matrix factorization). If one is focused on prediction accuracy alone, we could train the model on predicting the next product category a customer would purchase instead of exact items. --- -## **Evaluation** +## **Table of Contents** -The model supports evaluation metrics like **Recall**, **Precision**, **Mean Reciprocal Rank (MRR)**, **Hit Rate at 10 (HR@10)**, and **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**. 
After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. +1. [Project Overview](#project-overview) +2. [Data](#data) +3. [Model Architecture](#model-architecture) +4. [Installation](#installation) +5. [Usage](#usage) +6. [Training the Model](#training-the-model) +7. [Evaluation](#evaluation) +8. [Contributing](#contributing) +9. [License](#license) -You can evaluate the model using the `evaluate_metrics` function: +--- + +## **Data** + +This project uses Amazon Appliance Reviews data for training, validation, and testing. The dataset can be broadly categorised into: +- **User Reviews**: Including `user_id`, `review_text`, `parent_asin`, and `timestamp`. +- **Product Metadata**: Including `parent_asin`, `title`, `category`, `features`, `price`, and more. + +Each row in the dataset is processed to generate the next 10 items that the user interacted with. Data is split into training, validation, and test sets, keeping the temporal nature of user interactions intact. + +[Download data here](https://amazon-reviews-2023.github.io/) + +--- + +## **Model Architecture** + +The project uses the GPT-2 medium architecture, consisting of: +- **Embedding Dimension**: 1024 +- **Number of Layers**: 24 +- **Number of Attention Heads**: 16 + +The model is fine-tuned to take as input: +- **User Reviews**: User-written text. +- **Item Metadata**: Key features like title, category, and other details. + +### **Loss Function**: +We calculate the loss using cross-entropy between the predicted next items and the actual items the user interacted with. + +--- + +## **Installation** + +To run this project locally, follow these steps: + +1. **Clone the Repository**: + ```bash + git clone https://github.com/babaniyi/LLMs-for-RecSys.git + cd LLMs-for-RecSys + ``` + +2. **Install Dependencies**: + ```bash + pip install -r requirements.txt + ``` + +3. 
**Set Up the Data**:
   - Download `Appliances.jsonl` and `meta_Appliances.jsonl` from the [Amazon Reviews dataset](https://amazon-reviews-2023.github.io/).
   - Place the files in the `data/raw/` folder.

4. **Configure Environment Variables**:
   If using CUDA, ensure your environment is properly set up for PyTorch GPU usage.

To run the script, go to the command line and run one of the following:

To run the model with an Alpaca-style prompt **without** LoRA (our baseline model):
```bash
python gpt-experiment.py --run_solution baseline
```

To run with an Alpaca-style prompt **with** LoRA:
```bash
python gpt-experiment.py --run_solution alpaca_and_lora
```

To run with a Phi-3-style prompt **without** LoRA:
```bash
python gpt-experiment.py --run_solution phi3_prompt
```

To run with a Phi-3-style prompt **with** LoRA:
```bash
python gpt-experiment.py --run_solution phi3_and_lora
```

---

## **Usage**

### **Data Preprocessing**

To curate quality training data, we focus on users who have purchased at least 10 items. The data should be preprocessed before training: the `get_next_10_items_optimized` function processes user interaction data and ensures that each entry contains a list of the next 10 items based on historical interactions.

```python
df_with_next_10_items = get_next_10_items_optimized(df, padding_strategy='repeat', pad_token=50257)
```

### **Train/Test/Validation Split**
You can split the data while preserving its temporal order, preventing data and target leakage:

```python
train_df, val_df, test_df = split_data_temporal(df_with_next_10_items, train_ratio=0.7, val_ratio=0.1, test_ratio=0.2)
```

---

## **Training the Model**
First, we calculate the initial training and validation set loss before we start training (the goal is to minimize the loss).
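This initial-loss check can be sketched as follows. It is a generic PyTorch sketch, and the model and data-loader objects in the usage comments are placeholders, not the repository's exact API:

```python
import torch

def mean_loss(model, data_loader, device, num_batches=None):
    """Average next-token cross-entropy over (up to) num_batches of a loader."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for i, (input_ids, target_ids) in enumerate(data_loader):
            if num_batches is not None and i >= num_batches:
                break
            input_ids = input_ids.to(device)
            target_ids = target_ids.to(device)
            logits = model(input_ids)            # (batch, seq_len, vocab_size)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), target_ids.flatten())
            total += loss.item()
            count += 1
    return total / max(count, 1)

# Hypothetical usage before the first training step:
# device = "cuda" if torch.cuda.is_available() else "cpu"
# print("Initial train loss:", mean_loss(model, train_loader, device, num_batches=5))
# print("Initial val loss:  ", mean_loss(model, val_loader, device, num_batches=5))
```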
The initial train and validation losses can be visualised in the plot `loss-plot-...pdf` generated during training.

Make sure that your CUDA environment is set up properly if using GPUs. The model outputs training and validation loss metrics after every evaluation step.

An example data entry is given below.
- **Input:** The input is the user id encoded as a special token, e.g. `<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>`.
- **Output:** The output is the next items purchased by the user, each encoded as `<|item_id|>`. If the user's future purchases are exhausted, the remaining slots are padded with the special token `<|endoftext|>`.
- **Prompt:** We instruct the model with the following prompt: `Given a user purchased an an item with the following details, predict the next 10 items the user would purchase.`

Example of the processed data sent to the model:
```
{
  "input": "<|user_AE2BFR2EGPHCYISLCTPOX2AQHKVQ|>",
  "output": "<|item_B07FTFD1XB|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>, <|endoftext|>",
  "instruction": "Given a user purchased an an item with the following details, predict the next 10 items the user would purchase. Item id is <|item_B00Y5UZD5I|>. Rating of the item by user from 1 to 5 is 2.0. Text of the user review is To expensive, the quality is questionable... The coating is thin and becomes damaged very easy! I would not recommend for these reasons. Main category of the item is Tools & Home Improvement. Item name is GE WD28X10399 Upper Rack. Price USD is nan. Item details is manufacturer is: GE. part number is: WD28X10399. item weight is: 0.32 ounces. product dimensions is: 4.1 x 0.4 x 2.5 inches. item model number is: WD28X10399. item package quantity is: 1. certification is: Certified frustration-free, Not Applicable. included components is: Appliance-replacement-parts, Appliance Parts & Accessories. batteries included? is: No. batteries required? is: No.
best sellers rank is: {'Tools & Home Improvement': 604524, 'Parts & Accessories': 91543}. date first available is: November 21, 2013.", + }, ``` + +--- + +## **Evaluation** + +The model supports evaluation metrics like **Recall**, **Precision**, **Mean Reciprocal Rank (MRR)**, **Hit Rate at 10 (HR@10)**, and **Normalized Discounted Cumulative Gain at 10 (NDCG@10)**. After generating predictions for a batch of reviews, the recommended items are compared with the actual next items to calculate these metrics. + ### Evaluation Metrics: - **Precision@10**: Proportion of relevant items in the top-10 predictions. - **Recall@10**: Proportion of relevant items retrieved from the actual items. @@ -47,18 +176,10 @@ evaluate_metrics(output_list, k=10) - **HR@10**: Binary value indicating whether any relevant item appears in the top-10 predictions (1 if yes, 0 if no). - **NDCG@10**: Discounted Cumulative Gain normalized over the ideal ranking, taking into account the rank position of relevant items. -### Updated Results - -After applying these metrics, the following evaluation results were obtained: +You can evaluate the model using the `evaluate_metrics` function: ```python -Average Metrics: { - 'precision@10': 0.0219, - 'recall@10': 0.0536, - 'mrr': 0.0413, - 'hr@10': 0.0565, - 'ndcg@10': 0.0429 -} +evaluate_metrics(output_list, k=10) ``` Example of the output list, which includes the output and model response: @@ -73,6 +194,7 @@ Example of the output list, which includes the output and model response: --- + ## **Contributing** We welcome contributions to this project! Please follow the steps below if you would like to contribute: