This repository contains a three-notebook workflow for retrieving OpenAlex publication data, calculating institution-level metrics, and producing figures for a core-area analysis. The workflow was developed by Manuel Llano and Kasper Abcouwer (UvA UB), inspired by the methodology of Jules van Rooij (RUG).
The notebooks are intentionally kept separate so each stage can be run, checked, and rerun independently:
- 01-data-retrieval.ipynb retrieves and prepares OpenAlex data.
- 02-metrics.ipynb calculates metrics and creates institution-level tables.
- 03-figures.ipynb generates exploratory figures from the processed data.
.
├── notebooks/
│ ├── 01-data-retrieval.ipynb
│ ├── 02-metrics.ipynb
│ └── 03-figures.ipynb
├── data/
│ ├── raw/
│ │ └── input.csv
│ ├── interim/
│ │ └── openalex-total-institute-output.csv
│ └── processed/
│ ├── institute_core_area.csv
│ ├── global_core_area.csv
│ └── institution_table_global_<timestamp>.xlsx
├── README.md
└── requirements.txt
The data/raw/ folder contains the manually prepared input data, for instance derived from a CRIS export. The data/interim/ folder contains retrieved OpenAlex data before final processing. The data/processed/ folder contains analysis-ready datasets and exported tables.
If the notebooks are stored in the repository root rather than a notebooks/ folder, keep the same data/ structure and adjust relative paths if needed.
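One way to avoid editing relative paths by hand is to locate the data/ folder at runtime. The sketch below is not taken from the notebooks; the helper name `find_data_dir` is hypothetical, and it simply assumes that a data/ folder exists at or above the directory the notebook runs from.

```python
from pathlib import Path


def find_data_dir(start=None):
    """Walk upwards from `start` until a directory containing data/ is found.

    Hypothetical helper: works whether the notebook runs from the
    repository root or from a notebooks/ subfolder.
    """
    current = Path(start or Path.cwd()).resolve()
    for candidate in [current, *current.parents]:
        data_dir = candidate / "data"
        if data_dir.is_dir():
            return data_dir
    raise FileNotFoundError(f"No data/ folder found at or above {current}")
```

A notebook could then build paths such as `find_data_dir() / "raw" / "input.csv"` regardless of where it is stored.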
The retrieval notebook expects a curated and deduplicated CSV file at:
data/raw/input.csv
At minimum, the input file should contain a DOI column:
doi
The workflow also expects the following columns when available, because they are used for reference and downstream analysis:
work_id.pure
research_unit
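Before starting the retrieval it can help to check the input file against these expectations. The following is a minimal sketch, not the notebook's own code: the function names and the DOI normalization rules (lowercasing, stripping a resolver prefix) are assumptions made for illustration.

```python
import pandas as pd

REQUIRED_COLUMNS = ["doi"]
OPTIONAL_COLUMNS = ["work_id.pure", "research_unit"]


def normalize_doi(value):
    """Lowercase a DOI and strip a leading resolver prefix (assumed convention)."""
    doi = str(value).strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def check_input(df):
    """Return (cleaned frame, absent optional columns, DOIs that occur more than once)."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Input file is missing required columns: {missing}")
    absent_optional = [col for col in OPTIONAL_COLUMNS if col not in df.columns]
    # Rows without a DOI cannot be matched to OpenAlex, so they are dropped here.
    cleaned = df.dropna(subset=["doi"]).copy()
    cleaned["doi"] = cleaned["doi"].map(normalize_doi)
    duplicated = cleaned.loc[cleaned["doi"].duplicated(keep=False), "doi"].unique().tolist()
    return cleaned, absent_optional, duplicated
```

For example, `check_input(pd.read_csv("data/raw/input.csv"))` would report duplicate DOIs that survived manual deduplication instead of silently dropping them.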
The workflow uses the OpenAlex API. To use the polite pool and, optionally, an API key, set these environment variables before running the notebooks:
export EMAIL="your.email@example.com"
export OPENALEX_API_KEY="your_openalex_api_key"
Install the main dependencies with:
pip install pandas requests plotly itables openpyxl
A minimal requirements.txt would be:
pandas
requests
plotly
itables
openpyxl
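As an illustration of how the EMAIL and OPENALEX_API_KEY environment variables might be picked up inside a notebook, the sketch below turns them into query parameters. The helper name is hypothetical, and the use of the `mailto` and `api_key` query parameters reflects our understanding of the OpenAlex API rather than the notebooks' actual code.

```python
import os


def openalex_auth_params(environ=None):
    """Build optional OpenAlex query parameters from environment variables.

    Hypothetical helper: unset or empty variables are simply skipped,
    so the workflow still runs without an email or API key.
    """
    env = os.environ if environ is None else environ
    params = {}
    email = env.get("EMAIL", "").strip()
    api_key = env.get("OPENALEX_API_KEY", "").strip()
    if email:
        params["mailto"] = email  # identifies the caller for the polite pool
    if api_key:
        params["api_key"] = api_key
    return params
```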
Run:
01-data-retrieval.ipynb
This notebook starts from the curated institute output in data/raw/input.csv, retrieves matching OpenAlex Works by DOI, and stores the enriched institute output in:
data/interim/openalex-total-institute-output.csv
It then identifies the institute core area based on selected topics and writes:
data/processed/institute_core_area.csv
Finally, it retrieves the global core area from OpenAlex for the same topic set and publication period, writing:
data/processed/global_core_area.csv
The retrieved OpenAlex fields include institution identifiers, institution display names, open-access information, FWCI, and citation-normalized percentile information.
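The DOI lookup can be sketched as a batched request against the OpenAlex works endpoint. This is an assumption-laden illustration rather than the notebook's implementation: the function names, the batch size of 50, and the exact `select` field list are choices made here, and `auth_params` stands for whatever email or API-key parameters are in use.

```python
OPENALEX_WORKS_URL = "https://api.openalex.org/works"
# Assumed field selection, mirroring the fields listed above.
SELECT_FIELDS = (
    "id,doi,publication_year,authorships,open_access,"
    "fwci,citation_normalized_percentile,topics"
)


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_doi_query(dois, auth_params=None):
    """Query parameters for one batch; '|' combines filter values with OR."""
    params = {
        "filter": "doi:" + "|".join(dois),
        "per-page": len(dois),
        "select": SELECT_FIELDS,
    }
    params.update(auth_params or {})
    return params


def fetch_works_by_doi(dois, batch_size=50, auth_params=None, session=None):
    """Retrieve the OpenAlex Works matching `dois`, one request per batch."""
    if session is None:
        import requests  # imported lazily so the helpers above work offline
        session = requests.Session()
    works = []
    for batch in chunked(list(dois), batch_size):
        response = session.get(
            OPENALEX_WORKS_URL,
            params=build_doi_query(batch, auth_params),
            timeout=30,
        )
        response.raise_for_status()
        works.extend(response.json()["results"])
    return works
```

DOIs without a match in OpenAlex are simply absent from the result, so comparing the returned `doi` values against the input list shows which publications were not found.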
Run:
02-metrics.ipynb
This notebook reads the processed institute and global core-area datasets and builds institution-level metrics tables.
Institution sorting and aggregation are based on OpenAlex institution IDs. Institution names are retained as display labels, which avoids merging or splitting institutions incorrectly when display names vary.
The institution table includes, among other metrics:
- total publication count;
- FWCI;
- citation-normalized percentile metrics;
- top 10% / top 1% citation-normalized indicators, where available;
- open-access publication percentage;
- incoming and outgoing citation ranking;
- internal collaboration metrics.
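A few of these metrics can be sketched as a pandas aggregation keyed on the institution ID. The column names (`work_id`, `institution_id`, `institution_name`, `fwci`, `is_oa`, `is_in_top_10_percent`) and the one-row-per-work-and-institution layout are assumptions for this example; the notebook's actual definitions, for instance of FWCI at institution level, may differ.

```python
import pandas as pd


def institution_metrics(works):
    """Aggregate work-level rows into one row per OpenAlex institution ID."""
    # Count each work once per institution, even if several authors share it.
    unique_pairs = works.drop_duplicates(subset=["work_id", "institution_id"])
    table = unique_pairs.groupby("institution_id").agg(
        # The name is only a display label; grouping never depends on it.
        institution_name=("institution_name", "first"),
        publications=("work_id", "nunique"),
        mean_fwci=("fwci", "mean"),
        oa_percentage=("is_oa", lambda s: 100 * s.mean()),
        top10_percentage=("is_in_top_10_percent", lambda s: 100 * s.mean()),
    )
    return table.sort_values("publications", ascending=False).reset_index()
```

Because the grouping key is the ID, two spellings of the same institution's display name end up in a single row instead of being split.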
The interactive table is shown in the notebook using itables. An Excel export is written to:
data/processed/institution_table_global_<timestamp>.xlsx
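A timestamped file name of this shape could be built as follows. The helper name and the `%Y%m%d_%H%M%S` timestamp format are assumptions; the format the notebook actually uses is not stated here.

```python
from datetime import datetime
from pathlib import Path


def export_path(processed_dir, now=None):
    """Return a timestamped Excel path so repeated runs do not overwrite each other."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")  # assumed format
    return Path(processed_dir) / f"institution_table_global_{stamp}.xlsx"
```

A pandas DataFrame could then be written with its `to_excel` method, which relies on openpyxl for .xlsx files.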
Run:
03-figures.ipynb
This notebook reads:
data/processed/institute_core_area.csv
data/processed/global_core_area.csv
It creates visualizations for topic activity and citation impact over time, comparing the institute core area with the global core area.
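The comparison over time can be sketched as a yearly aggregation followed by a Plotly line chart. This is an illustrative example only: it assumes both CSV files contain `publication_year` and `fwci` columns, and the function names and the choice of mean FWCI as the impact measure are not taken from the notebook.

```python
import pandas as pd


def yearly_mean_fwci(institute, global_):
    """Stack yearly mean FWCI for both datasets into one long-format frame."""
    frames = []
    for label, df in (("institute core area", institute), ("global core area", global_)):
        yearly = df.groupby("publication_year", as_index=False)["fwci"].mean()
        yearly["dataset"] = label
        frames.append(yearly)
    return pd.concat(frames, ignore_index=True)


def plot_fwci_over_time(yearly):
    """Draw one line per dataset; plotly is imported only when plotting."""
    import plotly.express as px

    return px.line(
        yearly,
        x="publication_year",
        y="fwci",
        color="dataset",
        labels={"publication_year": "Publication year", "fwci": "Mean FWCI"},
        title="Citation impact over time",
    )
```

Calling `plot_fwci_over_time(yearly_mean_fwci(institute_df, global_df)).show()` in a notebook would render the figure inline.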