Codebase and data for bibliometric analysis
Our analysis is automatically performed by the code in this repository. It relies on CSVs containing publication metadata (retrieved from Scopus) and uses the Scopus APIs to access additional metadata.
To run the code, you need a Scopus API key (see below). Make sure your institution provides access to Scopus before running the code.
Access the Scopus Developer Portal and follow the instructions to get an API key.
Create a `config.json` file with the following format:

```json
{
  "apikey": "PASTE_YOUR_API_KEY_HERE",
  "insttoken": ""
}
```

Store the file in the `config` folder.
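As a quick sanity check before launching the pipeline, you can verify that the config file is in place and the placeholder key has been replaced. This is a minimal sketch (the helper name `load_scopus_config` is illustrative, not part of the repository's code):

```python
import json


def load_scopus_config(path="config/config.json"):
    """Load the Scopus API credentials from the config file."""
    with open(path) as f:
        cfg = json.load(f)
    # Fail early if the placeholder key was never replaced
    if not cfg.get("apikey") or cfg["apikey"] == "PASTE_YOUR_API_KEY_HERE":
        raise ValueError("Set a valid Scopus API key in config/config.json")
    return cfg
```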
The code is entirely written in Python 3 and has been tested with Python 3.8 and Python 3.13.2. Dependencies are listed in requirements.txt. You can use a Python environment manager (e.g. Pipenv) to install them.
Start by installing Pipenv (if you don't have it already):

```shell
pip install --user pipenv
```

Then install the requirements:

```shell
pipenv install -r requirements.txt
```

Typical installation time is in the order of a few minutes.
The pipeline is contained entirely in the `main.py` file. To execute it, you can either activate the environment and run the file:

```shell
pipenv shell
python3 main.py
```

Or run it directly through Pipenv:

```shell
pipenv run python3 main.py
```

The data generated by the analysis will be in the `out` folder.
Important note: for testing purposes, the code is set to run on only 20 entries of the query-based dataset. In `main.py`, we set:

```python
# Limit number of entries (for testing)
limit_entries_to = 20
```

Typical execution time is in the order of 10 minutes, depending on the speed of your internet connection.
To run on the full dataset, set:

```python
limit_entries_to = None
```

This might take 5-10 hours, depending on the speed of the connection. If the full dataset is required, we recommend using the pre-generated data on Zenodo (see below).
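A convenient property of this setting is that Python slicing treats a `None` stop index as "to the end", so a single expression can cover both the test run and the full run. A minimal sketch (the helper `take_entries` is illustrative, not the repository's actual code):

```python
def take_entries(entries, limit_entries_to):
    """Return the first `limit_entries_to` entries, or all of them when None.

    Slicing with a None stop index returns the whole sequence, so the
    same expression handles both the test and the full configuration.
    """
    return entries[:limit_entries_to]


rows = list(range(100))
assert len(take_entries(rows, 20)) == 20     # testing configuration
assert len(take_entries(rows, None)) == 100  # full dataset
```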
The input data used in the analysis are in the data folder.
- `QueryBased_Scopus.csv`: the query-based dataset (not enriched) downloaded from Scopus
- `TWIMO_Scopus`: the TWIMO dataset (not enriched) downloaded from Scopus
- `geo` folder: geographic information used in the analysis (abbreviations of the US states, shape of the world map, etc.)
- `networks` folder: bibliographic networks exported from VOSviewer and used to generate the chord charts
- `regex` folder: regex patterns used in the analysis
- `thesaurus` folder: synonyms used in the analysis to prevent duplication of terms and authors
- `journals` folder: a file with the journal abbreviations that have been used
All generated data is in the `out` folder. A pre-generated version of the data can be downloaded from Zenodo (see below).
To allow the full reproduction of our results without having full access to Scopus APIs, we uploaded a pre-generated version of our data on Zenodo.
To use it with our script, download the data:

```shell
python3 src/fetch_zenodo_data.py
```

Then change the following parts of the main script:

```python
out_folder = Path(f"demo/demo")  # changed from `out`

# Limit number of entries (for testing)
limit_entries_to = None  # changed from 20
```

The preprocessing pipeline generates the following data:
- `out/preprocessing/not_found_in_TWIMO_QueryBased_Scopus.csv`: list of papers in the TWIMO dataset that are not present in the query-based dataset
- `out/preprocessing/QueryBased_Scopus_1951-2009.csv`: query-based records published between 1951 and 2009
- `out/preprocessing/QueryBased_Scopus_1951-2016.csv`: query-based records published between 1951 and 2016
- `out/preprocessing/QueryBased_Scopus_1951-2021.csv`: query-based records published between 1951 and 2021
- `out/preprocessing/QueryBased_Scopus_random_samples.csv`: 200 random samples from the query-based dataset, used to estimate the number of false positives
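Drawing a reproducible random sample like the 200 records above can be done with the standard library alone. A hedged sketch (the function name, the fixed seed, and the column layout are illustrative assumptions, not the repository's actual implementation):

```python
import csv
import random


def sample_records(in_csv, out_csv, n=200, seed=42):
    """Draw a reproducible random sample of rows from a Scopus CSV export."""
    with open(in_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = reader.fieldnames
    random.seed(seed)  # a fixed seed keeps the sample reproducible
    sample = random.sample(rows, min(n, len(rows)))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sample)
```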
The extraction pipeline retrieves the full bibliometric information used for the analysis. It runs on the TWIMO dataset (output: `out/twimo`), on the query-based dataset (output: `out/query-based`), and on different time frames of the query-based dataset (until 2009, until 2016, and until 2021).
For each, it generates the following files:
- `abstracts.json`: all metadata for each publication, including data not contained in the CSV file
- `author_keywords.csv`: author keywords and their frequency in the Query-Based dataset
- `authors.csv`: authors and their frequency in the Query-Based dataset
- `funding_sponsors_percentage.txt`: percentage of records with funding metadata
- `funding_sponsors.csv`: funding bodies and their frequency in the Query-Based dataset
- `indexed_keywords.csv`: indexed keywords automatically generated by Scopus in the Query-Based dataset (not used in our bibliometric analysis)
- `institutions_raw.csv`: institutions and their frequency in the Query-Based dataset
- `institutions.csv`: institutions and their frequency in the Query-Based dataset, enriched with geographic data
- `institutions.json`: institutions and their frequency in the Query-Based dataset in JSON format, enriched with geographic data
- `journals_metrics.json`: quartile and CiteScore information for journals in the Query-Based dataset
- `journals.csv`: journals and their frequency in the Query-Based dataset
- `PM_cancer_types.csv`: pattern-matching results using regexes for cancer types in the Query-Based dataset
- `PM_modelling_approaches.csv`: pattern-matching results using regexes for modeling methods in the Query-Based dataset
- `subjects.csv`: journal scientific subjects in the Query-Based dataset
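After a run, a quick way to check that the extraction step produced what you expect is to list the generated files and count the records in `abstracts.json`. A minimal sketch, assuming `abstracts.json` is a JSON array of publication records (the helper name `inventory` is illustrative; adjust if the actual file structure differs):

```python
import json
from pathlib import Path


def inventory(out_dir):
    """List the generated files and count the records in abstracts.json."""
    out = Path(out_dir)
    files = sorted(p.name for p in out.iterdir() if p.is_file())
    # abstracts.json holds the full per-publication metadata
    abstracts = json.loads((out / "abstracts.json").read_text())
    return {"files": files, "n_publications": len(abstracts)}
```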
The last part of the code generates the figures used to inspect the data. These are stored in:

- `out/figures`
- `out/supplementary_figures`
Some figures (e.g. the chord charts and world maps) are generated using D3.js. The raw data behind these figures is generated by Python inside the folders `src/network` (for the chord charts) and `src/map` (for the world map). The .html files used to generate the figures are stored in the same folders.
To reproduce them, start an HTTP server using Python:

```shell
python3 -m http.server
```

Then, using your preferred browser, go to http://localhost:8000/ to access the .html files. Clicking on a file should display the corresponding figure.
Warning: the figures are optimized for the full dataset. If you're running the code for testing purposes (see above), don't expect the figures to be polished or to look the same as those in the manuscript.
Warning: not all figures were generated with Python. Some were produced with external software and are not reproduced here. For instance, some figures were composed in PowerPoint, and some of the matrices were generated using Morpheus.
- `config`: contains the `config.json` file with your API key (you must create it yourself; see Quick start).
- `out`: contains all output files generated by the pipeline, including bibliometric data, extracted metrics, and visualizations used in the analysis.
- `data`: contains the input data used in the analysis.
- `src`: contains the code used in the analysis.
Franco Pradelli (franco.pradelli94@gmail.com)
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.