MathOnco/MathOnco-Bibliometrics

75 Years of Mathematical Oncology

Codebase and data for bibliometric analysis

Our analysis is automatically performed by the code in this repository. It relies on CSVs containing publication metadata (retrieved from Scopus) and uses the Scopus APIs to access additional metadata.

To run the code you need a Scopus API key (see below); make sure your institution provides access to Scopus first.

Quick start

Scopus API Key

Access the Scopus Developer Portal and follow the instructions to get an API key.

Create a config.json with the following format:

{
    "apikey": "PASTE_YOUR_API_KEY_HERE",
    "insttoken": ""
}

Store the file in the config folder.
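The pipeline presumably reads this file at startup. A minimal sketch of how the credentials could be loaded, assuming the key names shown in the example above (the actual loading code in this repository may differ):

```python
import json

def load_scopus_credentials(config_path="config/config.json"):
    """Read the Scopus API key (and optional institutional token) from config.json."""
    with open(config_path) as f:
        config = json.load(f)
    apikey = config.get("apikey", "")
    if not apikey or apikey == "PASTE_YOUR_API_KEY_HERE":
        raise ValueError("Set a valid Scopus API key in config/config.json")
    # insttoken may be empty when your institution does not require one
    return apikey, config.get("insttoken", "")
```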

Install code dependencies

The code is entirely written in Python 3 and has been tested with Python 3.8 and Python 3.13.2. Dependencies are listed in requirements.txt. You can use a Python environment manager (e.g. Pipenv) to install them.

Start by installing Pipenv (if you don't have it already):

pip install --user pipenv

And install the requirements:

pipenv install -r requirements.txt

Typical installation time is on the order of a few minutes.

Execute the code

The pipeline is contained entirely in the main.py file. To execute it, you can either activate the environment and run the file:

pipenv shell
python3 main.py

Or run directly using pipenv:

pipenv run python3 main.py

The data generated from the analysis will be in the out folder.

Important Note: for testing purposes, the code is set to run on only 20 entries of the query-based dataset. In main.py, we set

# Limit number of entries (for testing)
limit_entries_to = 20

Typical execution time is on the order of 10 minutes, depending on the speed of the internet connection.

To run on the full dataset, set limit_entries_to = None. This might take 5 to 10 hours, depending on the speed of the connection. If the full dataset is required, it is recommended to use the pre-generated data on Zenodo (see below).
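The limit is presumably applied by truncating the list of records before the extraction loop. A hedged sketch of the pattern; apply_entry_limit is a hypothetical helper, and the actual main.py may slice a DataFrame instead:

```python
def apply_entry_limit(records, limit_entries_to=20):
    """Return the first `limit_entries_to` records, or all of them when None.

    Illustrative only: mirrors the `limit_entries_to` switch described in the
    README, not the exact code in main.py.
    """
    if limit_entries_to is None:
        return records  # full dataset: expect hours of Scopus API calls
    return records[:limit_entries_to]
```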

Input data

The input data used in the analysis are in the data folder.

  • QueryBased_Scopus.csv: The query-based dataset (not enriched) downloaded from Scopus
  • TWIMO_Scopus: The TWIMO dataset (not enriched) downloaded from Scopus
  • geofolder: contains geographic information used in the analysis (abbreviations of the US states, shape of the world map, etc)
  • networks folder: Bibliographic networks exported from VOSViewer and used to generate the chord charts
  • regex folder: Regex patterns used in the analysis
  • thesaurus folder: synonyms used in the analysis to prevent duplication of terms and authors
  • journals folder: contains a file with the journal abbreviations that have been used
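The thesaurus files are used to merge synonymous terms and author names. A minimal sketch of how such a mapping could be applied; the two-column term,replacement layout and the helper names are assumptions, so check the actual files in data/thesaurus:

```python
import csv

def load_thesaurus(path):
    """Load a synonym -> canonical-term mapping from a two-column CSV.

    Assumes a header row followed by `term,replacement` rows; the real
    thesaurus files may use a different layout.
    """
    mapping = {}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for term, replacement in reader:
            mapping[term.strip().lower()] = replacement.strip()
    return mapping

def normalize(term, thesaurus):
    """Replace a term with its canonical form when a synonym entry exists."""
    return thesaurus.get(term.strip().lower(), term.strip())
```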

Generated data

All generated data is in the out folder. To download a pre-generated version of the data, go to Zenodo or see below.

Pre-generated data

To allow full reproduction of our results without full access to the Scopus APIs, we uploaded a pre-generated version of our data to Zenodo.

To use it with our script, download the data:

python3 src/fetch_zenodo_data.py

Then change the following parts of the main script:

out_folder = Path(f"demo/demo")  # changed from `out`
# Limit number of entries (for testing)
limit_entries_to = None  # Changed from 20

Preprocessing ('out/preprocessing')

The preprocessing pipeline generates the following data:

  • out/preprocessing/not_found_in_TWIMO_QueryBased_Scopus.csv: list of papers in the TWIMO dataset that are not present in the Query-Based Dataset.
  • out/preprocessing/QueryBased_Scopus_1951-2009.csv: Query-Based records published between 1951 and 2009.
  • out/preprocessing/QueryBased_Scopus_1951-2016.csv: Query-Based records published between 1951 and 2016.
  • out/preprocessing/QueryBased_Scopus_1951-2021.csv: Query-Based records published between 1951 and 2021.
  • out/preprocessing/QueryBased_Scopus_random_samples.csv: 200 random samples from the Query-Based dataset. We used them to estimate the number of false positives.
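A random-sample file like the one above can be drawn reproducibly with a fixed seed. The sketch below is illustrative: the seed value, function name, and use of the csv module are assumptions, not the pipeline's actual code:

```python
import csv
import random

def sample_records(in_csv, out_csv, n=200, seed=42):
    """Draw n random rows from a Scopus CSV export, keeping the header row."""
    with open(in_csv, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    random.seed(seed)  # fixed seed so the sample is reproducible
    sample = random.sample(body, min(n, len(body)))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sample)
```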

Extraction of Bibliometric Data

The extraction pipeline retrieves the full bibliometric information used for the analysis. It runs on the TWIMO dataset (output: out/twimo), on the query-based dataset (output: out/query-based), and on different time frames of the query-based dataset (until 2009, until 2016, and until 2021).

For each, it generates the following files:

  • abstracts.json: all metadata for each publication, including data not contained in the CSV file
  • author_keywords.csv: author keywords and their frequency in the Query-Based dataset
  • authors.csv: authors and their frequency in the Query-Based dataset
  • funding_sponsors_percentage.txt: percentage of records with funding metadata
  • funding_sponsors.csv: funding bodies and their frequency in the Query-Based dataset
  • indexed_keywords.csv: indexed keywords automatically generated by Scopus in the Query-Based dataset (not used in our bibliometric analysis)
  • institutions_raw.csv: institutions and their frequency in the Query-Based dataset
  • institutions.csv: institutions and their frequency in the Query-Based dataset, enriched with geographic data
  • institutions.json: institutions and their frequency in the Query-Based dataset in JSON format, enriched with geographic data
  • journals_metrics.json: quartile and CiteScore information for journals in the Query-Based dataset
  • journals.csv: journals and their frequency in the Query-Based dataset
  • PM_cancer_types.csv: pattern matching results using regexes for cancer types in the Query-Based dataset
  • PM_modelling_approaches.csv: pattern matching results using regexes for modeling methods in the Query-Based dataset
  • subjects.csv: journal scientific subjects in the Query-Based dataset
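The frequency files above can be computed by counting items across papers. As an illustration, here is one way frequencies like those in author_keywords.csv could be derived from per-paper keyword strings; the semicolon separator follows Scopus's CSV export convention, but the function itself is a sketch, not the repository's code:

```python
from collections import Counter

def count_author_keywords(keyword_fields):
    """Count keyword frequencies across papers.

    `keyword_fields` holds one string per paper, with keywords separated by
    semicolons as in Scopus CSV exports. Returns (keyword, count) pairs
    sorted by descending frequency, lower-cased to merge case variants.
    """
    counts = Counter()
    for field in keyword_fields:
        for kw in field.split(";"):
            kw = kw.strip().lower()
            if kw:
                counts[kw] += 1
    return counts.most_common()
```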

Figures

Matplotlib Figures

The last part of the code generates the figures to inspect the data. These are stored in:

  • out/figures
  • out/supplementary_figures

D3.js figures

Some figures (e.g. the chord charts and world maps) are generated using D3.js. The raw data for these figures are generated by Python inside the folders src/network (for the chord charts) and src/map (for the world map). The .html files that generate the figures are stored in the same folders.

To reproduce them, start an HTTP server using Python:

python3 -m http.server

Then, using your preferred browser, go to http://localhost:8000/ to access the .html files. Click a file to view the figure.

Warning: figures are optimized for the full dataset. If you are running the code for testing purposes (see above), do not expect the figures to match the manuscript or to look polished.

Warning: not all figures were generated with Python. Some were produced with external software and are not reproduced here. For instance, some figures were composed in PowerPoint, and some of the matrices were generated using Morpheus.

Repository structure

config folder

Contains the config.json file with your API key (you must create it yourself; see Quick start).

out folder

Contains all output files generated by the pipeline, including bibliometric data, extracted metrics, and visualizations used in the analysis.

data folder

Contains the input data used in the analysis.

src folder

Contains the code used in the analysis.

Meta

Franco Pradelli (franco.pradelli94@gmail.com)

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
