Codebase and data for bibliometric analysis
Our analysis is automatically performed by the code in this repository. It relies on CSVs containing publication metadata (retrieved from Scopus) and uses the Scopus APIs to access additional metadata.
To run the code, you need a Scopus API key (see below). Make sure your institution provides access to Scopus before running the code.
Access the Scopus Developer Portal and follow the instructions to get an API key.
Create a `config.json` file with the following format:

```json
{
  "apikey": "PASTE_YOUR_API_KEY_HERE",
  "insttoken": ""
}
```

Store the file in the `config` folder.
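As a quick sanity check before launching the pipeline, you can verify that the config file is in place and the placeholder key has been replaced. This is a minimal sketch (the helper name `load_scopus_config` is illustrative, not part of the repository's code):

```python
import json


def load_scopus_config(path="config/config.json"):
    """Load the Scopus API credentials from the config file."""
    with open(path) as f:
        cfg = json.load(f)
    # Fail early if the placeholder key was never replaced
    if not cfg.get("apikey") or cfg["apikey"] == "PASTE_YOUR_API_KEY_HERE":
        raise ValueError("Set a valid Scopus API key in config/config.json")
    return cfg
```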
The code is entirely written in Python 3 and has been tested with Python 3.8 and Python 3.13.2. Dependencies are listed in requirements.txt. You can use a Python environment manager (e.g. Pipenv) to install them.
Start by installing Pipenv (if you don't have it already):

```shell
pip install --user pipenv
```

Then install the requirements:

```shell
pipenv install -r requirements.txt
```

Typical installation time is in the order of a few minutes.
The pipeline is contained entirely in the `main.py` file. To execute it, you can either activate the environment and run the file:

```shell
pipenv shell
python3 main.py
```

Or run it directly through Pipenv:

```shell
pipenv run python3 main.py
```

The data generated by the analysis will be in the `out` folder.
Important note: for testing purposes, the code is set to run on only 20 entries of the query-based dataset. In `main.py`, we set:

```python
# Limit number of entries (for testing)
limit_entries_to = 20
```

Typical execution time is in the order of 10 minutes, depending on the speed of your internet connection.
To run on the full dataset, set:

```python
limit_entries_to = None
```

This might take 5-10 hours, depending on the speed of the connection. If the full dataset is required, we recommend using the pre-generated data on Zenodo (see below).
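A convenient property of this setting is that Python slicing treats a `None` stop index as "to the end", so a single expression can cover both the test run and the full run. A minimal sketch (the helper `take_entries` is illustrative, not the repository's actual code):

```python
def take_entries(entries, limit_entries_to):
    """Return the first `limit_entries_to` entries, or all of them when None.

    Slicing with a None stop index returns the whole sequence, so the
    same expression handles both the test and the full configuration.
    """
    return entries[:limit_entries_to]


rows = list(range(100))
assert len(take_entries(rows, 20)) == 20     # testing configuration
assert len(take_entries(rows, None)) == 100  # full dataset
```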
The input data used in the analysis are in the data folder.
- `QueryBased_Scopus.csv`: the query-based dataset (not enriched) downloaded from Scopus
- `TWIMO_Scopus`: the TWIMO dataset (not enriched) downloaded from Scopus
- `geo` folder: geographic information used in the analysis (abbreviations of the US states, shape of the world map, etc.)
- `networks` folder: bibliographic networks exported from VOSviewer and used to generate the chord charts
- `regex` folder: regex patterns used in the analysis
- `thesaurus` folder: synonyms used in the analysis to prevent duplication of terms and authors
- `journals` folder: a file with the journal abbreviations that have been used
All generated data is in the `out` folder. A pre-generated version of the data can be downloaded from Zenodo (see below).
To allow the full reproduction of our results without having full access to Scopus APIs, we uploaded a pre-generated version of our data on Zenodo.
To use it with our script, download the data:

```shell
python3 src/fetch_zenodo_data.py
```

Then change the following parts of the main script:

```python
out_folder = Path(f"demo/demo")  # changed from `out`

# Limit number of entries (for testing)
limit_entries_to = None  # changed from 20
```

The preprocessing pipeline generates the following data:
- `out/preprocessing/not_found_in_TWIMO_QueryBased_Scopus.csv`: list of papers in the TWIMO dataset that are not present in the query-based dataset
- `out/preprocessing/QueryBased_Scopus_1951-2009.csv`: query-based records published between 1951 and 2009
- `out/preprocessing/QueryBased_Scopus_1951-2016.csv`: query-based records published between 1951 and 2016
- `out/preprocessing/QueryBased_Scopus_1951-2021.csv`: query-based records published between 1951 and 2021
- `out/preprocessing/QueryBased_Scopus_random_samples.csv`: 200 random samples from the query-based dataset, used to estimate the number of false positives
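Drawing a reproducible random sample like the 200 records above can be done with the standard library alone. A hedged sketch (the function name, the fixed seed, and the column layout are illustrative assumptions, not the repository's actual implementation):

```python
import csv
import random


def sample_records(in_csv, out_csv, n=200, seed=42):
    """Draw a reproducible random sample of rows from a Scopus CSV export."""
    with open(in_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = reader.fieldnames
    random.seed(seed)  # a fixed seed keeps the sample reproducible
    sample = random.sample(rows, min(n, len(rows)))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sample)
```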
The extraction pipeline retrieves the full bibliometric information used for the analysis. It runs on the TWIMO dataset (output: `out/twimo`), on the query-based dataset (output: `out/query-based`), and on different time frames of the query-based dataset (until 2009, until 2016, and until 2021).
For each, it generates the following files:
- `abstracts.json`: all metadata for each publication, including data not contained in the CSV file
- `author_keywords.csv`: author keywords and their frequency in the Query-Based dataset
- `authors.csv`: authors and their frequency in the Query-Based dataset
- `funding_sponsors_percentage.txt`: percentage of records with funding metadata
- `funding_sponsors.csv`: funding bodies and their frequency in the Query-Based dataset
- `indexed_keywords.csv`: indexed keywords automatically generated by Scopus in the Query-Based dataset (not used in our bibliometric analysis)
- `institutions_raw.csv`: institutions and their frequency in the Query-Based dataset
- `institutions.csv`: institutions and their frequency in the Query-Based dataset, enriched with geographic data
- `institutions.json`: institutions and their frequency in the Query-Based dataset in JSON format, enriched with geographic data
- `journals_metrics.json`: quartile and CiteScore information for journals in the Query-Based dataset
- `journals.csv`: journals and their frequency in the Query-Based dataset
- `PM_cancer_types.csv`: pattern-matching results using regexes for cancer types in the Query-Based dataset
- `PM_modelling_approaches.csv`: pattern-matching results using regexes for modeling methods in the Query-Based dataset
- `subjects.csv`: journal scientific subjects in the Query-Based dataset
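After a run, a quick way to check that the extraction step produced what you expect is to list the generated files and count the records in `abstracts.json`. A minimal sketch, assuming `abstracts.json` is a JSON array of publication records (the helper name `inventory` is illustrative; adjust if the actual file structure differs):

```python
import json
from pathlib import Path


def inventory(out_dir):
    """List the generated files and count the records in abstracts.json."""
    out = Path(out_dir)
    files = sorted(p.name for p in out.iterdir() if p.is_file())
    # abstracts.json holds the full per-publication metadata
    abstracts = json.loads((out / "abstracts.json").read_text())
    return {"files": files, "n_publications": len(abstracts)}
```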
The last part of the code generates the figures used to inspect the data. These are stored in:

- `out/figures`
- `out/supplementary_figures`
Some figures (e.g. the chord charts and world maps) are generated using D3.js. The raw data behind these figures is generated by Python inside the folders `src/network` (for the chord charts) and `src/map` (for the world map). The .html files used to generate the figures are stored in the same folders.
To reproduce them, start an HTTP server using Python:

```shell
python3 -m http.server
```

Then, using your preferred browser, go to http://localhost:8000/ to access the .html files. Clicking on a file should display the corresponding figure.
Warning: the figures are optimized for the full dataset. If you're running the code for testing purposes (see above), don't expect the figures to be polished or to look the same as those in the manuscript.
Warning: not all figures were generated with Python. Some were produced with external software and are not reproduced here. For instance, some figures were composed in PowerPoint, and some of the matrices were generated using Morpheus.
- `config`: contains the `config.json` file with your API key (you must create it yourself; see Quick start).
- `out`: contains all output files generated by the pipeline, including bibliometric data, extracted metrics, and visualizations used in the analysis.
- `data`: contains the input data used in the analysis.
- `src`: contains the code used in the analysis.
Franco Pradelli (franco.pradelli94@gmail.com)
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.