Ian Van Buskirk, Marilena Hohmann, Ekaterina Landgren, Johan Ugander, Aaron Clauset, Daniel B. Larremore
https://arxiv.org/abs/2603.00807
This repository includes de-identified data and code to reproduce the figures and analyses of the manuscript. See installation instructions below.
Academic publishing requires solving a collective coordination problem: among thousands of possible venues, which deserve a community's attention? This repository contains the data and code to reproduce the figures in our paper, which surveys 3,510 US tenure-track faculty to understand how publication preferences vary across fields, institutions, and individuals.
Key findings:
- Fields occupy a wide spectrum of consensus: Economics and Chemistry show strong agreement on elite venues; Computer Science and Engineering show fragmented preferences across hundreds of outlets.
- Within fields, faculty at more prestigious institutions and men prefer higher-ranked venues.
- Researchers publish more successfully in their personal favorite venues than in their field's consensus favorites.
- Journal Impact Factors explain only ~64% of pairwise preference choices, systematically undervaluing what fields actually prefer.
This public-facing repository was improved for reproducibility, readability, and clarity with the help of AI tools. All original analyses and writing from which this repository was derived, including the original research notebooks and code, were written by humans.
```
.
├── public_data/                    # Shareable dataset (start here)
│   ├── respondents.csv             # One row per survey respondent
│   ├── comparisons.csv             # One row per pairwise comparison
│   ├── venue_selections.csv        # Venues shown per respondent in initial pass
│   ├── venues.csv                  # Venue metadata (OpenAlex + JIF)
│   ├── publication_counts.csv      # Publication counts by venue per respondent
│   └── field_jif_rankings.csv      # Field-level consensus + JIF ordinal ranks
│
├── reproducibility/                # Figure scripts
│   ├── run_all.py                  # Run everything (start here)
│   ├── compute_rankings.py         # Produces derived/ from public_data/
│   ├── fig1.py … fig5.py           # Main paper figures
│   ├── si_fig1.py … si_fig4.py     # Supplementary figures
│   └── derived/                    # Intermediate outputs (generated by compute_rankings.py)
│       ├── individual_rankings.csv
│       ├── field_rankings_per_user.csv
│       ├── field_consensus_rankings.csv
│       └── global_rankings_per_user.csv  # Global LOO SpringRank (all fields)
│
├── networks/                       # Pre-built network datasets (GraphML + CSV)
│   ├── README.md                   # Schema, attribute tables, and Python examples
│   ├── graphml/
│   │   ├── comparison/             # Directed venue comparison multigraphs (all + 13 fields)
│   │   └── bipartite/              # Respondent × venue bipartite graphs (all + 13 fields)
│   └── csv/
│       ├── comparison/             # Edge list + node attributes (all + 13 fields)
│       └── bipartite/              # Edge list + node attributes (all + 13 fields)
│
├── utils/                          # Shared plotting utilities
│   ├── field_colors.py             # Per-field color palette (tab20)
│   └── short_venue_names.py        # Abbreviated venue name lookup
│
└── README.md                       # This file
```
All data files are in `public_data/`. Survey responses are de-identified: names, institution names, email addresses, and timestamps have been removed.

`respondents.csv`: one row per survey respondent (US tenure-track faculty; fields with ≥ 100 participants).
| Column | Description |
|---|---|
| `user_db_id` | Anonymous respondent ID |
| `field` | Academic field (13 fields) |
| `subfield` | Subfield within field |
| `career_stage` | Assistant / Associate / Full Professor |
| `top_init_venue_db_id` | Top-ranked aspiration venue ID |
| `mid_init_venue_db_id` | Mid-ranked aspiration venue ID |
| `low_init_venue_db_id` | Low-ranked aspiration venue ID |
| `venue_bug` | True if Science was not displayed correctly (exclude from analyses) |
| `gender` | `gm` (men) / `gf` (women) / NaN (not reported) |
| `aa_area` | Academic area from affiliation database |
| `clusters_20` | Field cluster assignment (k=20) |
| `clusters_40` | Field cluster assignment (k=40) |
| `institution_bin_3` | Institution prestige tercile (1 = most prestigious) |
| `institution_bin_10` | Institution prestige decile (1 = most prestigious) |
| `academia_prestige_bin_3` | Academia-wide prestige tercile |
| `academia_prestige_bin_10` | Academia-wide prestige decile (used in regressions) |
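
For orientation, here is a minimal pandas sketch (paths assume you run from the project root) that loads the respondent table and applies the `venue_bug` exclusion noted above:

```python
import pandas as pd

# Load respondents; path assumes the project root as working directory.
respondents = pd.read_csv("public_data/respondents.csv")

# Exclude rows affected by the Science display bug, per the note above.
# fillna guards against missing values in the flag column (an assumption
# about how the CSV is encoded).
bug = respondents["venue_bug"].fillna(False).astype(bool)
clean = respondents[~bug]

# Respondents per field (13 fields, each with >= 100 participants).
print(clean["field"].value_counts())
```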
`comparisons.csv`: one row per pairwise comparison made during the survey. Rows are provided only for fields with ≥ 100 respondents.
| Column | Description |
|---|---|
| `user_db_id` | Respondent ID |
| `is_tie` | True if respondent indicated no preference |
| `pref_venue_db_id` | ID of the preferred venue (if not a tie) |
| `other_venue_db_id` | ID of the non-preferred venue (if not a tie) |
| `field` | Field in which the comparison was made |
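
And a companion sketch (same project-root path assumption) that joins comparisons to venue metadata and tallies outright wins per venue; ties are handled separately, as described in the SpringRank notes below:

```python
import pandas as pd

comparisons = pd.read_csv("public_data/comparisons.csv")
venues = pd.read_csv("public_data/venues.csv")

# Outright wins per venue, ties excluded (tie rows have no preferred venue).
wins = (
    comparisons.loc[~comparisons["is_tie"], "pref_venue_db_id"]
    .astype("int64")
    .value_counts()
    .rename_axis("venue_db_id")
    .reset_index(name="wins")
)

# Attach human-readable names from venues.csv.
wins = wins.merge(venues[["venue_db_id", "name"]], on="venue_db_id", how="left")
print(wins.head(10))
```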
`venue_selections.csv`: all venues presented to each respondent during their initial ranking pass. Provided only for fields with ≥ 100 respondents.
| Column | Description |
|---|---|
| `user_db_id` | Respondent ID |
| `would_publish` | True if respondent rated this venue positively |
| `venue_db_id` | Venue ID |
| `field` | Field context |
`venues.csv`: metadata for all venues that appeared in the survey, for fields with ≥ 100 respondents.
| Column | Description |
|---|---|
| `venue_db_id` | Venue ID (links to other tables) |
| `venue_oa_id` | OpenAlex venue ID |
| `name` | Full venue name |
| `users_selected` | Number of respondents who selected this venue |
| `display_name` | Display name used in survey |
| `2yr_mean_citedness` | 2-year mean citedness (OpenAlex) |
| `works_count` | Number of works indexed (OpenAlex) |
| `cited_by_count` | Total citation count (OpenAlex) |
| `issn_l` | Linking ISSN |
| `first_topic` | Primary topic (OpenAlex) |
| `db_concept` | Concept/field label |
| `jcr_jif` | Journal Citation Reports Impact Factor (2023) |
| `jcr_name` | JCR journal name |
`publication_counts.csv`: number of publications per respondent in each of their nominated venues, drawn from OpenAlex. Used in Figure 5 (preference realization).
`field_jif_rankings.csv`: field-level consensus SpringRank scores (alpha=20) alongside JIF ordinal ranks and selection fractions. One row per (field, venue) pair. All statistics are aggregates; the file contains no individual-level data.

Key columns: `field`, `venue_db_id`, `ordinal_porc` (consensus rank), `ordinal_jif` (JIF rank), `subset_ordinal_porc`, `subset_ordinal_jif`, `frac_selected`.
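
As an illustration of how these columns can be used (a sketch, not part of the paper's pipeline), one can measure rank agreement between the consensus and JIF orderings within each field:

```python
import pandas as pd
from scipy.stats import spearmanr

fjr = pd.read_csv("public_data/field_jif_rankings.csv")

# Spearman rank correlation between consensus rank (ordinal_porc) and
# JIF rank (ordinal_jif), computed within each field on venues having both.
for field, grp in fjr.dropna(subset=["ordinal_porc", "ordinal_jif"]).groupby("field"):
    rho, p = spearmanr(grp["ordinal_porc"], grp["ordinal_jif"])
    print(f"{field}: Spearman rho = {rho:.2f} (p = {p:.2g}, n = {len(grp)})")
```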
`global_rankings_per_user.csv`: leave-one-out global SpringRank (alpha=20) scores. For each respondent, scores are computed from all comparisons (across all fields) except that respondent's own, restricted to venues they actually compared. Generated by `compute_rankings.py`. Used in SI Figure 3.
Pre-built network files are in `networks/`. Two network types are provided, each in GraphML and CSV formats, for the global dataset and separately for each of the 13 fields.
| Network type | Nodes | Edges | Global size |
|---|---|---|---|
| Venue comparison multigraph (directed) | Academic venues | One directed edge per pairwise comparison; ties produce two edges (A→B and B→A) at weight 0.5 | 5,686 nodes · 166,136 edges |
| Respondent × venue bipartite (undirected) | Respondents + venues | Edge if venue was in respondent's consideration set | 9,196 nodes · 55,455 edges |
See `networks/README.md` for full attribute schemas and Python read-in examples (NetworkX, graph-tool, pandas).
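
That README is the authoritative reference; as a quick illustration, a NetworkX read might look like the sketch below (the filename is illustrative, not guaranteed; check `networks/graphml/` for the actual names):

```python
import networkx as nx

# Illustrative path: check networks/graphml/comparison/ for actual filenames.
G = nx.read_graphml("networks/graphml/comparison/all_fields.graphml")
print(type(G).__name__, "with", G.number_of_nodes(), "nodes and",
      G.number_of_edges(), "edges")

# Inspect the attributes stored on one node (e.g., venue metadata).
node, attrs = next(iter(G.nodes(data=True)))
print(node, attrs)
```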
- Python 3.9 or later
- uv (recommended) or pip

Python package dependencies:

```
matplotlib
seaborn
scipy
numpy
pandas
statsmodels
springrank
pyprojroot
```
```
git clone https://github.com/<org>/<repo>.git
cd <repo>
```

Using uv (recommended):

```
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if needed
~/.local/bin/uv venv .venv
~/.local/bin/uv pip install matplotlib seaborn scipy numpy pandas statsmodels springrank pyprojroot
```

Or using standard pip:

```
python3 -m venv .venv
source .venv/bin/activate
pip install matplotlib seaborn scipy numpy pandas statsmodels springrank pyprojroot
```

Verify the installation:

```
.venv/bin/python -c "import pandas, matplotlib, seaborn, springrank; print('OK')"
```

Then run everything:

```
.venv/bin/python reproducibility/run_all.py
```

Expected total runtime: ~15–25 minutes on a modern laptop.
| Stage | Script | Time |
|---|---|---|
| Compute rankings | `compute_rankings.py` | ~10–15 min |
| Figures 1–5, SI 1–3 | `fig1.py` … `si_fig3.py` | < 1 min each |
| SI Figure 4 (permutation test) | `si_fig4.py` | ~3 min |
If `reproducibility/derived/` already exists (e.g. on a second run), skip the ranking step:

```
.venv/bin/python reproducibility/run_all.py --skip-rankings
```

Step 1: generate the intermediate ranking files (run once; takes ~10–15 min):
```
.venv/bin/python reproducibility/compute_rankings.py
```

This reads `public_data/comparisons.csv` and produces four files in `reproducibility/derived/`:
- `individual_rankings.csv` – per-user SpringRank (alpha=0) scores
- `field_rankings_per_user.csv` – leave-one-out field SpringRank (alpha=20) scores
- `field_consensus_rankings.csv` – field-level consensus SpringRank (alpha=20) scores
- `global_rankings_per_user.csv` – leave-one-out global SpringRank (alpha=20) scores (all fields pooled)
Step 2: produce each figure:

```
.venv/bin/python reproducibility/fig1.py     # Fig 1: field structure + heatmap
.venv/bin/python reproducibility/fig2.py     # Fig 2: preference agreement scatter
.venv/bin/python reproducibility/fig3.py     # Fig 3: network visualizations (HTML)
.venv/bin/python reproducibility/fig4.py     # Fig 4: aspiration/preference regressions
.venv/bin/python reproducibility/fig5.py     # Fig 5: preference realization
.venv/bin/python reproducibility/si_fig1.py  # SI Fig 1: full regression forest plot
.venv/bin/python reproducibility/si_fig2.py  # SI Fig 2: JIF vs consensus rankings
.venv/bin/python reproducibility/si_fig3.py  # SI Fig 3: comparison accuracy
.venv/bin/python reproducibility/si_fig4.py  # SI Fig 4: permutation robustness check
```

All scripts are run from the project root and write their output PDFs to `reproducibility/`.
Note on Figure 3: `fig3.py` produces 13 HTML files (one per field) containing D3.js force-directed network graphs. These were imported into Keynote as images to compose the final Figure 3 panel.
Note on SI Figure 4: The permutation test is stochastic. Each run produces a statistically equivalent figure, but it will not be pixel-identical to the published version.
Rankings are computed using SpringRank, a physics-inspired hierarchical ranking algorithm. The key parameters:
- `alpha=0`: individual user rankings (no regularization); used in Figure 2.
- `alpha=20`: field-level consensus rankings (regularization toward a neutral prior); used in Figures 4–5 and SI figures.
- Leave-one-out (LOO): for fairness analyses, each user's ranking is computed from the field's comparisons excluding their own, preventing circularity.
Indifferent responses (`is_tie=True`) constitute 8.1% of all comparisons (N = 12,482). In the SpringRank adjacency matrix, each tie counts as a half-win for each venue: `A[i,j] += 0.5` and `A[j,i] += 0.5`.
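
To make the construction concrete, here is a self-contained numpy sketch that builds this adjacency matrix and solves the SpringRank linear system directly. The actual pipeline uses the springrank package; this sketch also assumes that tie rows still carry both venue IDs, which the schema above does not state explicitly:

```python
import numpy as np
import pandas as pd

comps = pd.read_csv("public_data/comparisons.csv").dropna(
    subset=["pref_venue_db_id", "other_venue_db_id"]
)
ids = pd.unique(comps[["pref_venue_db_id", "other_venue_db_id"]].values.ravel())
idx = {v: k for k, v in enumerate(ids)}
n = len(ids)

# A[i, j] counts venue i's wins over venue j; ties are half-wins both ways.
A = np.zeros((n, n))
for row in comps.itertuples():
    i, j = idx[row.pref_venue_db_id], idx[row.other_venue_db_id]
    if row.is_tie:
        A[i, j] += 0.5
        A[j, i] += 0.5
    else:
        A[i, j] += 1.0

def springrank_scores(A, alpha):
    """Solve [D_out + D_in + alpha*I - (A + A.T)] s = d_out - d_in.

    alpha > 0 keeps the system nonsingular; the alpha=0 case needs an
    extra constraint to pin down SpringRank's translation invariance.
    """
    d_out, d_in = A.sum(axis=1), A.sum(axis=0)
    M = np.diag(d_out + d_in) + alpha * np.eye(len(A)) - (A + A.T)
    return np.linalg.solve(M, d_out - d_in)

scores = springrank_scores(A, alpha=20.0)  # the field-consensus setting
```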
If you use this data or code, please cite:
```bibtex
@article{vanbuskirk2026consensus,
  title   = {Consensus and fragmentation in academic publication preferences},
  author  = {Van Buskirk, Ian and Hohmann, Marilena and Landgren, Ekaterina and
             Ugander, Johan and Clauset, Aaron and Larremore, Daniel B.},
  journal = {arXiv:2603.00807},
  year    = {2026},
}
```

Data: CC BY 4.0. Code: MIT License.
Daniel B. Larremore – daniel.larremore@colorado.edu