GitHub - beclaire5/NLP_project

NLP-powered insights into academic trends in AI for pricing and promotion in large retail systems

Barontini Chiara c.barontini@studenti.luiss.it , Biggi Daniele daniele.biggi@studenti.luiss.it , Baldo Michele michele.baldo@studenti.luiss.it
Deloitte 05/2025

Section 1: Introduction

This project investigates recent academic developments at the intersection of Artificial Intelligence, pricing, promotion, and global supply chain operations. Using the OpenAlex API, we collected and cleaned a dataset of approximately 9,000 scholarly articles, focusing on research published in the last decade. We applied BERTopic, a modern topic modeling technique, to extract and cluster key themes across abstracts. The model identified around 40 coherent research topics, ranging from supply chain management to dynamic pricing. Building on this, we developed a recommendation system that combines semantic similarity and topic probability to suggest the most relevant articles for a user's query. The system was evaluated using precision and recall at k, and it returned thematically aligned results for most queries. The outcome is a scalable tool for navigating academic literature, with practical applications for businesses and researchers seeking targeted knowledge in AI-driven decision-making.

Section 2: Methods

2.1 Data Collection: Our objective was to gather academic articles on the relationship between Artificial Intelligence, pricing and promotion, and large-scale retail operations. We used the OpenAlex API because it is a free, open platform that indexes millions of scholarly works with rich metadata and does not require an API key. To narrow down the vast amount of data available on OpenAlex, we identified the concept IDs corresponding to our themes (e.g. AI: "C154945302") and grouped them into more than 30 logical 'AND' combinations to capture the research intersections of interest. In addition, we restricted the search to articles published within the last 10 years. Parallel fetching was employed to speed up the process. To ensure uniqueness, we tracked DOIs during the fetching stage and skipped duplicates. This approach led to a dataset of slightly fewer than 10,000 unique publications.
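The collection logic described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the second concept ID is a made-up placeholder, the helper names `build_filter` and `collect_unique` are hypothetical, and the network call is injected as a `fetch_page` function so the sketch runs offline. The filter string assumes OpenAlex's documented convention that repeating a filter attribute acts as a logical AND.

```python
from concurrent.futures import ThreadPoolExecutor

AI_CONCEPT = "C154945302"        # concept ID quoted in the text
OTHER_CONCEPT = "C0000000001"    # hypothetical placeholder, not a real concept ID


def build_filter(concept_ids, start_year, end_year):
    """Build an OpenAlex-style filter string for one AND combination."""
    # Repeating the attribute is assumed to mean AND across concepts.
    parts = [f"concepts.id:{cid}" for cid in concept_ids]
    parts.append(f"from_publication_date:{start_year}-01-01")
    parts.append(f"to_publication_date:{end_year}-12-31")
    return ",".join(parts)


def collect_unique(filters, fetch_page, max_workers=4):
    """Fetch every filter in parallel and keep the first work seen per DOI."""
    seen_dois = set()
    unique_works = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; deduplication happens in the
        # calling thread, so the shared set needs no locking.
        for works in pool.map(fetch_page, filters):
            for work in works:
                doi = work.get("doi")
                if doi is not None and doi in seen_dois:
                    continue  # duplicate DOI, skip it
                if doi is not None:
                    seen_dois.add(doi)
                unique_works.append(work)
    return unique_works


# Stand-in for the real HTTP request, so the example needs no network access.
def fake_fetch(filter_string):
    if "2016-01-01" in filter_string:
        return [{"doi": "10.1/a"}, {"doi": "10.1/b"}]
    return [{"doi": "10.1/b"}, {"doi": "10.1/c"}]


filters = [
    build_filter([AI_CONCEPT, OTHER_CONCEPT], 2016, 2020),
    build_filter([AI_CONCEPT, OTHER_CONCEPT], 2021, 2025),
]
works = collect_unique(filters, fake_fetch)
```

In a real run, `fetch_page` would issue the request against the OpenAlex works endpoint with the filter string and handle pagination; keeping it as a parameter also makes the deduplication easy to test in isolation.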
2.2 Data cleaning: We began the cleaning stage by identifying missing values across all columns, computing both the absolute number and the percentage of missing entries per column. We dropped columns that had a high percentage of missing values or were not relevant for our analysis. Next, we standardized the date fields by converting them to a datetime format using Pandas. We also removed entries that lacked a DOI, as these links are essential for accessing the full article and its abstract. Duplicate DOIs were dropped. After cleaning, more than 9,000 articles remained.
2.3 Exploratory Data Analysis (EDA):
This phase focused on understanding the structure and characteristics of the dataset after initial loading and filtering. The dataset's columns were inspected to gain a sense of the available information. The majority of publications were journal articles written in English. We visualized the temporal distribution of publications using a histogram (Fig. 1), which revealed peaks in research around the years 2021, 2022, and 2023. We then identified the most cited papers, such as: "Blockchain technology and its relationships to sustainable supply chain management"; "Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy"; "Big Data in Smart Farming – A review"; "Industry 4.0 technologies: Implementation patterns in manufacturing companies". To analyze author influence, we plotted the top contributing authors by publication frequency in a horizontal bar chart. Finally, we inspected the relationship between citation count and publication year with an interactive Plotly graph (Fig. 2).
Fig. 1: Temporal distribution of publications (histogram).
Fig. 2: Citation count by publication year (interactive Plotly graph).
2.4 Abstract fetching and text preprocessing: We collected the abstracts rather than the full texts because of copyright constraints. We originally attempted to use BeautifulSoup to scrape the abstract from the web page behind each DOI link. However, the results were unreliable, because the location of the abstract in the HTML differed from one publisher page to the next. We therefore opted for PyAlex, a Python wrapper for OpenAlex, to fetch abstracts using each paper's DOI, which made the process much smoother and cleaner. The abstracts were reconstructed from the abstract_inverted_index field, which maps each word to the positions at which it occurs. Because of potential rate limits and computational costs, we ran the function on small batches of the data at a time.
Next, we moved to text preprocessing. We filtered the dataset to include only English papers and merged the title and the abstract into a new column.
Using the SpaCy NLP library, we performed standard preprocessing steps: lowercasing, punctuation and number removal, lemmatization, stopword removal, and elimination of short or non-alphabetic tokens. The result was a cleaned column ready to be used for topic modeling.
2.5 BERTopic Model: We employed BERTopic, a state-of-the-art topic modeling technique, to cluster and label the abstracts into meaningful themes. We selected it for its ability to generate high-quality, interpretable topics. The BERTopic model was initialized with a UMAP model for dimensionality reduction, an HDBSCAN model for clustering, and a CountVectorizer that removes English stopwords from the abstracts. We used the embedding model paraphrase-MiniLM-L12-v2 to convert the text into vectors that capture its meaning. We chose it because it is lightweight and efficient without sacrificing much performance. Throughout the process, we extensively tweaked the parameters of both UMAP and HDBSCAN to optimize the results. Specifically, we tuned the UMAP parameters to balance local and global structure, and we adjusted the HDBSCAN parameters to detect clusters of varying size and density. After several trials, we settled on the final configuration. After fitting the model, 40 topics were extracted, excluding the outlier group (-1). The outliers are documents that do not clearly fall into any of the identified topics, that is, data points that HDBSCAN could not assign to a cluster. In our case there were around 700, which was a good result for our dataset size compared to the other configurations we tried. Overall, the topics span a wide variety of themes, from supply chain management to sustainability, dynamic pricing, and blockchain technology, as well as more niche topics like Covid-19 and food supply chains.
Topics like Topic 0 (supply chain management) and Topic 1 (sales and promotion strategies) represent broader, more prevalent themes, while topics with fewer documents, such as Topic 23 (green supply chain optimization), represent more specialized themes.
The embeddings computed with the paraphrase-MiniLM-L12-v2 model were stored in df['embedding'] to be used later in the recommendation system for calculating cosine similarity.
2.6 Recommendation system:
The recommendation system requires the user to type a query in the notebook. The query is embedded and compared with the abstract embeddings by cosine similarity. The system outputs a list of k articles with their DOIs, titles, abstracts, assigned topics, and 'topic_probability', which measures how likely an article is to belong to its topic. The number of recommendations in the list can be changed by the user. The cosine similarity between the query and each article is stored in the df['similarity_score'] column, and a ranking based on it alone is also printed. However, we wanted the probability of belonging to a topic to influence the recommendations as well. First, we applied a logarithmic function to the df['topic_probability'] column to compress its values, most of which were very close to one. The results are stored in df['compressed_topic_belonging']. We then used the following formula for the final score:
final_score = alpha * df['similarity_score'] + (1 - alpha) * df['compressed_topic_belonging']
Here alpha is a parameter between zero and one that determines which of the two variables has more impact on the final ranking. Currently, alpha is set to 0.8, as we believe that cosine similarity between the query and the results is the most important factor. Finally, the recommendation function also assigns the query to a topic with a certain probability (df['query_topic']). For example, if a user writes "dynamic pricing", the system assigns it to "topic 3", the most likely one. Because this is an informative indicator, it was included in the calculation of the final score.
On top of the final score formula, every time df['query_topic'] matches the df['topic'] of an article, 0.1 is added to that article's final score, giving priority to articles from the query's topic.
The df['query_topic'] column was calculated with the BERTopic library using the topic_model.transform() function.
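The scoring rule above can be illustrated with a small, dependency-free sketch. The weighted sum with alpha = 0.8 and the 0.1 topic bonus come from the text; the exact logarithmic compression is not specified there, so `compress_probability` below uses log(1 + p) / log(2) purely as an assumption, and all function names are hypothetical. Plain Python lists stand in for the embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_probability(p):
    """Assumed compression: log(1 + p) / log(2), mapping [0, 1] onto [0, 1]."""
    return math.log1p(p) / math.log(2)


def final_score(similarity, topic_probability, doc_topic, query_topic,
                alpha=0.8, topic_bonus=0.1):
    """Weighted sum of similarity and compressed topic probability, plus bonus."""
    score = alpha * similarity + (1 - alpha) * compress_probability(topic_probability)
    if doc_topic == query_topic:
        score += topic_bonus  # article shares the query's inferred topic
    return score


def rank_documents(query_vec, docs, query_topic, k=5, alpha=0.8):
    """Return the k best (score, doi) pairs, highest score first."""
    scored = []
    for doc in docs:
        sim = cosine_similarity(query_vec, doc["embedding"])
        score = final_score(sim, doc["topic_probability"], doc["topic"],
                            query_topic, alpha=alpha)
        scored.append((score, doc["doi"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: two-dimensional "embeddings" and made-up DOIs.
docs = [
    {"doi": "a", "embedding": [1.0, 0.0], "topic_probability": 1.0, "topic": 3},
    {"doi": "b", "embedding": [0.0, 1.0], "topic_probability": 1.0, "topic": 3},
    {"doi": "c", "embedding": [1.0, 0.0], "topic_probability": 1.0, "topic": 5},
]
ranked = rank_documents([1.0, 0.0], docs, query_topic=3, k=3)
```

In this toy example the article that is both similar to the query and in the query's topic ranks first, the equally similar article from another topic ranks second, and the dissimilar article ranks last despite receiving the topic bonus, which reflects the choice of a high alpha.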
2.7 Evaluation metrics: To evaluate our BERTopic model we used the coherence metric (range 0 < x < 1), which measures how well the words in each topic fit together, in other words whether they make sense as a group. The coherence score of the model was around 0.55.
Furthermore, we calculated precision and recall to evaluate our recommendation system, based on the topic assignment of the documents. After splitting the data into training and test sets (80% and 20%), we used the abstracts of the test set as queries for the recommendation system and checked whether the recommended articles belonged to the same topic as the query. Initially, we measured both metrics at k=5, that is, for the top 5 recommendations. We expected this setting to be too conservative but used it as a starting point, and then increased the number of recommended papers to 20. The results for the top 5 recommendations were not surprising: precision = 0.80 and recall = 0.015. This means that, on average, 4 out of 5 recommended articles belong to the same topic as the query, while only 1.5% of all articles belonging to that topic appear among the top 5 recommendations. After increasing the number of recommended articles to 20, recall rose to 0.047, roughly three times the original value, while precision decreased to 0.69. From a precision-recall trade-off perspective, we believe that recommending the top 20 papers is better than recommending only 5, for reasons discussed in the following section.
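As a rough illustration of how topic-based precision and recall at k can be computed, consider the sketch below. The function names are hypothetical, and the topic size of 267 is an invented figure chosen only so that the toy numbers resemble the magnitudes reported above; it is not taken from the project.

```python
def hits_at_k(recommended_topics, query_topic, k):
    """Count how many of the top-k recommendations share the query's topic."""
    return sum(1 for topic in recommended_topics[:k] if topic == query_topic)


def precision_recall_at_k(recommended_topics, query_topic, topic_size, k):
    """Precision = hits / k; recall = hits / number of articles in the topic."""
    hits = hits_at_k(recommended_topics, query_topic, k)
    precision = hits / k if k > 0 else 0.0
    recall = hits / topic_size if topic_size > 0 else 0.0
    return precision, recall


def mean_precision_at_k(runs, k):
    """Average precision at k over (recommended_topics, query_topic) pairs."""
    if not runs or k <= 0:
        return 0.0
    total = sum(hits_at_k(topics, query_topic, k) / k for topics, query_topic in runs)
    return total / len(runs)


# One query from topic 3: four of the five recommendations share its topic.
# topic_size=267 is an illustrative assumption, not a figure from the project.
precision, recall = precision_recall_at_k([3, 3, 3, 3, 7], query_topic=3,
                                          topic_size=267, k=5)

# Averaging over two test queries, as done across the whole test set.
average = mean_precision_at_k([([3, 3, 3, 3, 7], 3), ([1, 2, 1, 1, 1], 1)], k=5)
```

Because recall divides by the size of the whole topic, it stays small for any fixed k when topics contain hundreds of articles, which is why raising k from 5 to 20 improves recall while precision drops.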

Section 3: Results and Discussion

The topic modeling clustered thousands of articles into roughly 40 groups based on semantic similarity. The coherence score of the topics is 0.5492, while the precision at 20 recommended articles equals 91%. Recall at 20, on the other hand, is 0.040, meaning that the top 20 recommendations cover about 4% of all articles in the query's topic. When searching for "dynamic pricing in large scale retail" in the recommendation system, the final score of the top article is above 0.90, which suggests that the article is strongly related to the query and probably worth reading. The following results, in descending order, also have high final scores and can be used for research purposes. It is worth noting that, while the scores were high for this query, the final score may decrease for queries from other scientific fields, which were not the main target of this project. However, the model is flexible and can be adapted to any research area by changing the keywords used at the beginning, when building the dataset. Since the current dataset is built around our chosen themes, it is unlikely for a user to find an article about history or American corporate law; it is more likely to find related articles within slightly different topics. Consider any search engine such as Google: the results on the front page for a given query differ only slightly from one another without going off track. From an academic point of view, it is useful not to have a recommendation system with precision equal to 1, but rather to have some differentiated results that help consider other aspects of the research or analyze problems from a different perspective. The search engines of big companies behave similarly.
The outcomes of this project can be applied to everyday business in numerous ways. The dataset of thousands of academic papers organized into topics, together with a recommendation system that displays the most relevant papers for a search-engine-style query, makes this project a valuable source of knowledge for companies. Deloitte and other consultancy firms need data and information to offer a professional service to the firms they support. Knowledge means competitive advantage, and the speed with which companies gain it is crucial in a fast-moving economy. Customizability is what distinguishes this project from traditional scholarly search engines like Google Scholar or SSRN: for instance, a user who wants to read more articles from the same topic can easily do so by filtering the dataframe by topic number. Moreover, engines like SSRN recommend articles of any kind, which makes their results more scattered than those of our focused model.

Section 4: Conclusions

This project stands as a concrete example of how Artificial Intelligence and Natural Language Processing (NLP) can be applied not only to analyze large volumes of textual data, but also to return personalized informational value in complex domains like academic research. Starting from a carefully curated data collection and cleaning process, it was possible to construct a highly relevant and thematically rich corpus. This, in turn, enabled the extraction of coherent insights through BERTopic and the delivery of user-tailored recommendations. What we see as the true strength of this work is the integration of quantitative structure and qualitative value: on one hand, topic modeling organizes the academic content into meaningful clusters; on the other, the recommendation engine introduces an interactive and interpretive layer, capable of dynamically responding to user queries. This balance between structure and flexibility is what makes the project relevant and practically valuable, potentially even beyond academia.
That said, there is still room for improvement. For instance, relying solely on abstracts, although understandable given copyright constraints, limits the depth of semantic analysis. Additionally, while effective, the current recommendation algorithm does not yet factor in qualitative indicators such as source credibility or publication impact, which could significantly enrich the final output.
Looking ahead, the system could be enhanced by: incorporating user feedback to make it more adaptive over time; developing an interactive visualization of topics and inter-article relationships; extending the system to business domains; and integrating advanced indicators to recommend not just similar papers, but also the most cited and influential ones.
In conclusion, this work does not merely solve a technical challenge—it lays the groundwork for a scalable model of intelligent knowledge access, one that can evolve and respond to the complex informational needs of researchers, professionals, and educators alike. It is a solid first step toward more context-aware and useful recommendation systems in the era of information overload.

Appendix A: Code Description

Data Collection: This section fetches scholarly data from the OpenAlex API. The requests library in Python was used to interact with the API. The goal was to gather articles based on specific concept IDs (such as pricing, promotion, supply chain, AI, etc.) grouped in different combinations, each representing an area of interest. These combinations are then paired with specific year ranges (2016-2020 and 2021-2025) to filter for recent articles. Retrieval is sped up through parallel fetching with ThreadPoolExecutor. The code handles rate limits by retrying when the API returns a 429 status. It also avoids storing duplicate DOIs by checking each DOI against those already seen.
Data Cleaning: Columns with over 50% missing values or deemed irrelevant are dropped. Date columns are converted to proper datetime format using pd.to_datetime(). Articles without a DOI are removed, and duplicate records are dropped based on the DOI column. Finally, the number of unique DOIs is checked, ensuring the dataset is clean and ready for analysis.
EDA: The code analyses the 'language' and 'type' columns using the value_counts function to check category frequencies. Then it extracts metadata from the authorship column, which contains nested lists of dictionaries, including names, countries, and affiliations. A histogram visualizes the distribution of publications over time, while the most cited papers are identified by sorting on citation counts. The top contributing authors are found by splitting the 'Author_name' column and counting publications, with results displayed in a horizontal bar chart. Lastly, an interactive scatter plot created with Plotly explores the relationship between publication year and citation count.
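The data cleaning steps described above might look roughly like the following pandas sketch. The column names `doi` and `publication_date`, the helper name `clean_works`, and the toy rows are assumptions made for illustration; the 50% threshold is the one stated in the text.

```python
import pandas as pd


def clean_works(df, max_missing=0.5):
    """Drop sparse columns, parse dates, and keep one row per DOI."""
    # Share of missing values per column (0.0 to 1.0).
    missing_share = df.isna().mean()
    # Keep columns at or below the threshold; always keep the DOI column.
    keep = [col for col in df.columns
            if missing_share[col] <= max_missing or col == "doi"]
    cleaned = df[keep].copy()
    # Convert the date column to datetime; unparseable values become NaT.
    if "publication_date" in cleaned.columns:
        cleaned["publication_date"] = pd.to_datetime(
            cleaned["publication_date"], errors="coerce"
        )
    # Remove rows without a DOI, then duplicates based on the DOI.
    cleaned = cleaned.dropna(subset=["doi"]).drop_duplicates(subset="doi")
    return cleaned.reset_index(drop=True)


# Toy input with a duplicate DOI, a missing DOI, and a mostly empty column.
raw = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", None, "10.1/b"],
    "publication_date": ["2021-03-01", "2021-03-01", "2020-01-01", "2023-07-15"],
    "mostly_empty": [None, None, None, "x"],
})
clean = clean_works(raw)
```

Checking `clean["doi"].nunique()` against `len(clean)` afterwards is a quick way to confirm that every remaining row has a distinct DOI.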
Pseudocode (data collection):
    Initialize API URL, headers, and concept IDs
    Define year ranges and empty set for tracking seen DOIs
    FOR each concept_group in grouped_concepts:
        FOR each year_range in year_ranges:
            SET API parameters (concept_group, year_range, pagination)
            FOR each page from 1 to max_pages:
                FETCH data from API
                IF rate limit exceeded (status 429):
                    WAIT and retry
                IF successful (status 200):
                    FOR each document in results:
                        IF DOI is unique:
                            Add document to results
                            Add DOI to seen DOIs set
                        ELSE:
                            Skip document
                WAIT 0.1 seconds
    Parallelize fetching using ThreadPoolExecutor for faster retrieval
    Convert results to pandas DataFrame
    Save DataFrame to CSV
Abstract fetching and text preprocessing: The code first fetches abstracts from the OpenAlex API using the DOI of each paper. The function get_abstract_or_title retrieves the abstract, reconstructing it from the inverted index if available, or falls back to the paper's title. The tqdm library is used to track progress. Since this is a long process, we run it in 5 batches. After fetching, the code removes any rows whose abstract is still missing. For text preprocessing, the dataset is filtered to include only English abstracts, and titles and abstracts are combined in a new column for topic modeling. A preprocessing function is then applied, which converts the text to lowercase, removes special characters and numbers, and uses the SpaCy library for tokenization and lemmatization, excluding stopwords, punctuation, and short words. The cleaned text is stored in a new column.
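To make the inverted-index step concrete, here is a minimal sketch of how an abstract can be rebuilt from an `abstract_inverted_index` mapping of words to positions, with a fallback to the title. The function names `reconstruct_abstract` and `abstract_or_title` and the sample record are illustrative assumptions, not the project's own get_abstract_or_title implementation.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild the abstract text from a {word: [positions]} mapping."""
    if not inverted_index:
        return None
    # Invert the mapping: position -> word.
    words_by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word
    # Join the words in order of their positions.
    return " ".join(words_by_position[pos] for pos in sorted(words_by_position))


def abstract_or_title(work):
    """Return the reconstructed abstract, or the title if none is available."""
    abstract = reconstruct_abstract(work.get("abstract_inverted_index"))
    return abstract if abstract else work.get("title")


# Made-up record in the shape of an OpenAlex work.
sample_work = {
    "title": "A study of pricing",
    "abstract_inverted_index": {
        "Dynamic": [0],
        "pricing": [1, 4],
        "uses": [2],
        "demand-based": [3],
    },
}
text = abstract_or_title(sample_work)
fallback = abstract_or_title({"title": "Only a title", "abstract_inverted_index": None})
```

A word that occurs several times simply lists several positions, so the reconstruction needs no special handling for repeated words.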
Pseudocode (BERTopic):
    Initialize CountVectorizer to remove English stopwords
    Extract processed abstracts into a list of texts
    Initialize UMAP for dimensionality reduction with specified parameters
    Initialize HDBSCAN for clustering with specified parameters
    Initialize BERTopic with UMAP, HDBSCAN, and vectorizer, using 'paraphrase-MiniLM-L12-v2' for embeddings
    Fit the BERTopic model to the texts to get topics and probabilities
    Add the topics and probabilities to the DataFrame as new columns
    Retrieve and display topic information using get_topic_info()
BERTopic: BERTopic is used to perform topic modeling on the dataset. A custom CountVectorizer is created to remove English stopwords. The text data, extracted from the processed_abstracts column, is then passed into the BERTopic model. The UMAP model is used for dimensionality reduction, while the HDBSCAN clustering model groups similar documents into topics. Tuning was done by adjusting the parameters n_neighbors, n_components, and min_dist for UMAP, and min_cluster_size, min_samples, and cluster_selection_method for HDBSCAN. The embedding model 'paraphrase-MiniLM-L12-v2' is chosen to convert the text to semantic embeddings. The model is set to extract 50 topics (nr_topics=50). After fitting the model to the text data, the topic assignments (topics) and their probabilities (probs) are returned. These results are then added back to the DataFrame in new columns. Finally, the get_topic_info() method is called to retrieve a summary of the topics, providing details on the most frequent words associated with each topic.
Recommendation system: Using sentence-transformers, the code loads the 'paraphrase-MiniLM-L12-v2' model to convert abstracts into numerical embeddings stored in the DataFrame's 'embeddings' column. Then a function recommend_combined is defined, which recommends documents for a text query by combining semantic similarity and topical relevance.
It first embeds the query using a SentenceTransformer and computes cosine similarity against the document embeddings. At the same time, it derives a topical relevance score by applying a non-linear transformation to each document's BERTopic assignment probability. These two scores are then combined via a weighted sum (alpha), with an additional bonus for documents whose BERTopic assignment matches the query's inferred topic. Then, the top N documents are returned. The next part prompts the user to enter a query, passes the query to the recommend_combined function to get recommendations, and prints the query's assigned topic and the recommendation list.
Evaluation metrics: The data is split with the scikit-learn function train_test_split into a training set (80%) and a test set (20%). The function evaluate_precision_at_k assesses the performance of the recommendation system. For each document in the test set, it uses the document's abstract as a query to retrieve a list of top-k recommended papers from the training set. Then it calculates the precision at k by checking how many of the recommended papers share the same BERTopic topic as the test document. It returns the average precision across all test documents.

Appendix B: Author Contribution

Barontini Chiara: Methodology, Software, Investigation, Data curation, Writing
Biggi Daniele: Conceptualization, Software, Validation, Visualization, Writing
Baldo Michele: Visualization, Supervision, Writing

Repository files:
.gitattributes
Logo_LUISS.png
README.md
doc_embeddings.npy
project.zip
recommender_app.py
requirements.txt
