Ian Van Buskirk, Marilena Hohmann, Ekaterina Landgren, Johan Ugander, Aaron Clauset, Daniel B. Larremore
https://arxiv.org/abs/2603.00807
This repository includes de-identified data and code to reproduce the figures and analyses of the manuscript. See installation instructions below.
Academic publishing requires solving a collective coordination problem: among thousands of possible venues, which deserve a community's attention? This repository contains the data and code to reproduce the figures in our paper, which surveys 3,510 US tenure-track faculty to understand how publication preferences vary across fields, institutions, and individuals.
Key findings:
- Fields occupy a wide spectrum of consensus: Economics and Chemistry show strong agreement on elite venues; Computer Science and Engineering show fragmented preferences across hundreds of outlets.
- Within fields, faculty at more prestigious institutions and men prefer higher-ranked venues.
- Researchers publish more successfully in their personal favorite venues than in their field's consensus favorites.
- Journal Impact Factors explain only ~64% of pairwise preference choices, systematically undervaluing what fields actually prefer.
This public-facing repository was improved for reproducibility, readability, and clarity with the help of AI tools. All original analyses and writing from which this repository was derived, including the original research notebooks and code, were written by humans.
```
.
├── public_data/                    # Shareable dataset (start here)
│   ├── respondents.csv             # One row per survey respondent
│   ├── comparisons.csv             # One row per pairwise comparison
│   ├── venue_selections.csv        # Venues shown per respondent in initial pass
│   ├── venues.csv                  # Venue metadata (OpenAlex + JIF)
│   ├── publication_counts.csv      # Publication counts by venue per respondent
│   └── field_jif_rankings.csv      # Field-level consensus + JIF ordinal ranks
│
├── reproducibility/                # Figure scripts
│   ├── run_all.py                  # Run everything (start here)
│   ├── compute_rankings.py         # Produces derived/ from public_data/
│   ├── fig1.py … fig5.py           # Main paper figures
│   ├── si_fig1.py … si_fig4.py     # Supplementary figures
│   └── derived/                    # Intermediate outputs (generated by compute_rankings.py)
│       ├── individual_rankings.csv
│       ├── field_rankings_per_user.csv
│       ├── field_consensus_rankings.csv
│       └── global_rankings_per_user.csv  # Global LOO SpringRank (all fields)
│
├── networks/                       # Pre-built network datasets (GraphML + CSV)
│   ├── README.md                   # Schema, attribute tables, and Python examples
│   ├── graphml/
│   │   ├── comparison/             # Directed venue comparison multigraphs (all + 13 fields)
│   │   └── bipartite/              # Respondent × venue bipartite graphs (all + 13 fields)
│   └── csv/
│       ├── comparison/             # Edge list + node attributes (all + 13 fields)
│       └── bipartite/              # Edge list + node attributes (all + 13 fields)
│
├── utils/                          # Shared plotting utilities
│   ├── field_colors.py             # Per-field color palette (tab20)
│   └── short_venue_names.py        # Abbreviated venue name lookup
│
└── README.md                       # This file
```
All data files are in `public_data/`. Survey responses are de-identified: names, institution names, email addresses, and timestamps have been removed.

`respondents.csv`: one row per survey respondent (US tenure-track faculty; fields with ≥ 100 participants).
| Column | Description |
|---|---|
| `user_db_id` | Anonymous respondent ID |
| `field` | Academic field (13 fields) |
| `subfield` | Subfield within field |
| `career_stage` | Assistant / Associate / Full Professor |
| `top_init_venue_db_id` | Top-ranked aspiration venue ID |
| `mid_init_venue_db_id` | Mid-ranked aspiration venue ID |
| `low_init_venue_db_id` | Low-ranked aspiration venue ID |
| `venue_bug` | True if Science was not displayed correctly (exclude from analyses) |
| `gender` | `gm` (men) / `gf` (women) / NaN (not reported) |
| `aa_area` | Academic area from affiliation database |
| `clusters_20` | Field cluster assignment (k=20) |
| `clusters_40` | Field cluster assignment (k=40) |
| `institution_bin_3` | Institution prestige tercile (1 = most prestigious) |
| `institution_bin_10` | Institution prestige decile (1 = most prestigious) |
| `academia_prestige_bin_3` | Academia-wide prestige tercile |
| `academia_prestige_bin_10` | Academia-wide prestige decile (used in regressions) |
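
For orientation, here is a minimal pandas sketch (paths assume you run from the project root) that loads the respondent table and applies the `venue_bug` exclusion noted above:

```python
import pandas as pd

# Load respondents; path assumes the project root as working directory.
respondents = pd.read_csv("public_data/respondents.csv")

# Exclude rows affected by the Science display bug, per the note above.
# fillna guards against missing values in the flag column (an assumption
# about how the CSV is encoded).
bug = respondents["venue_bug"].fillna(False).astype(bool)
clean = respondents[~bug]

# Respondents per field (13 fields, each with >= 100 participants).
print(clean["field"].value_counts())
```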
`comparisons.csv`: one row per pairwise comparison made during the survey. Rows are provided only for fields with ≥ 100 respondents.
| Column | Description |
|---|---|
| `user_db_id` | Respondent ID |
| `is_tie` | True if respondent indicated no preference |
| `pref_venue_db_id` | ID of the preferred venue (if not a tie) |
| `other_venue_db_id` | ID of the non-preferred venue (if not a tie) |
| `field` | Field in which the comparison was made |
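
And a companion sketch (same project-root path assumption) that joins comparisons to venue metadata and tallies outright wins per venue; ties are handled separately, as described in the SpringRank notes below:

```python
import pandas as pd

comparisons = pd.read_csv("public_data/comparisons.csv")
venues = pd.read_csv("public_data/venues.csv")

# Outright wins per venue, ties excluded (tie rows have no preferred venue).
wins = (
    comparisons.loc[~comparisons["is_tie"], "pref_venue_db_id"]
    .astype("int64")
    .value_counts()
    .rename_axis("venue_db_id")
    .reset_index(name="wins")
)

# Attach human-readable names from venues.csv.
wins = wins.merge(venues[["venue_db_id", "name"]], on="venue_db_id", how="left")
print(wins.head(10))
```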
`venue_selections.csv`: all venues presented to each respondent during their initial ranking pass. Provided only for fields with ≥ 100 respondents.
| Column | Description |
|---|---|
| `user_db_id` | Respondent ID |
| `would_publish` | True if respondent rated this venue positively |
| `venue_db_id` | Venue ID |
| `field` | Field context |
`venues.csv`: metadata for all venues that appeared in the survey, for fields with ≥ 100 respondents.
| Column | Description |
|---|---|
| `venue_db_id` | Venue ID (links to other tables) |
| `venue_oa_id` | OpenAlex venue ID |
| `name` | Full venue name |
| `users_selected` | Number of respondents who selected this venue |
| `display_name` | Display name used in survey |
| `2yr_mean_citedness` | 2-year mean citedness (OpenAlex) |
| `works_count` | Number of works indexed (OpenAlex) |
| `cited_by_count` | Total citation count (OpenAlex) |
| `issn_l` | Linking ISSN |
| `first_topic` | Primary topic (OpenAlex) |
| `db_concept` | Concept/field label |
| `jcr_jif` | Journal Citation Reports Impact Factor (2023) |
| `jcr_name` | JCR journal name |
`publication_counts.csv`: number of publications per respondent in each of their nominated venues, drawn from OpenAlex. Used in Figure 5 (preference realization).
`field_jif_rankings.csv`: field-level consensus SpringRank scores (alpha=20) alongside JIF ordinal ranks and selection fractions. One row per (field, venue) pair. All statistics are aggregates; the file contains no individual-level data.

Key columns: `field`, `venue_db_id`, `ordinal_porc` (consensus rank), `ordinal_jif` (JIF rank), `subset_ordinal_porc`, `subset_ordinal_jif`, `frac_selected`.
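
As an illustration of how these columns can be used (a sketch, not part of the paper's pipeline), one can measure rank agreement between the consensus and JIF orderings within each field:

```python
import pandas as pd
from scipy.stats import spearmanr

fjr = pd.read_csv("public_data/field_jif_rankings.csv")

# Spearman rank correlation between consensus rank (ordinal_porc) and
# JIF rank (ordinal_jif), computed within each field on venues having both.
for field, grp in fjr.dropna(subset=["ordinal_porc", "ordinal_jif"]).groupby("field"):
    rho, p = spearmanr(grp["ordinal_porc"], grp["ordinal_jif"])
    print(f"{field}: Spearman rho = {rho:.2f} (p = {p:.2g}, n = {len(grp)})")
```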
`global_rankings_per_user.csv`: leave-one-out global SpringRank (alpha=20) scores. For each respondent, scores are computed from all comparisons (across all fields) except that respondent's own, restricted to venues they actually compared. Generated by `compute_rankings.py`. Used in SI Figure 3.
Pre-built network files are in `networks/`. Two network types are provided, each in GraphML and CSV formats, for the global dataset and separately for each of the 13 fields.
| Network type | Nodes | Edges | Global size |
|---|---|---|---|
| Venue comparison multigraph (directed) | Academic venues | One directed edge per pairwise comparison; ties produce two edges (A→B and B→A) at weight 0.5 | 5,686 nodes · 166,136 edges |
| Respondent × venue bipartite (undirected) | Respondents + venues | Edge if venue was in respondent's consideration set | 9,196 nodes · 55,455 edges |
See `networks/README.md` for full attribute schemas and Python read-in examples (NetworkX, graph-tool, pandas).
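
That README is the authoritative reference; as a quick illustration, a NetworkX read might look like the sketch below (the filename is illustrative, not guaranteed; check `networks/graphml/` for the actual names):

```python
import networkx as nx

# Illustrative path: check networks/graphml/comparison/ for actual filenames.
G = nx.read_graphml("networks/graphml/comparison/all_fields.graphml")
print(type(G).__name__, "with", G.number_of_nodes(), "nodes and",
      G.number_of_edges(), "edges")

# Inspect the attributes stored on one node (e.g., venue metadata).
node, attrs = next(iter(G.nodes(data=True)))
print(node, attrs)
```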
- Python 3.9 or later
- uv (recommended) or pip

Python package dependencies:

```
matplotlib
seaborn
scipy
numpy
pandas
statsmodels
springrank
pyprojroot
```
```
git clone https://github.com/<org>/<repo>.git
cd <repo>
```

Using uv (recommended):

```
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv if needed
~/.local/bin/uv venv .venv
~/.local/bin/uv pip install matplotlib seaborn scipy numpy pandas statsmodels springrank pyprojroot
```

Or using standard pip:

```
python3 -m venv .venv
source .venv/bin/activate
pip install matplotlib seaborn scipy numpy pandas statsmodels springrank pyprojroot
```

Verify the installation:

```
.venv/bin/python -c "import pandas, matplotlib, seaborn, springrank; print('OK')"
```

Then run everything:

```
.venv/bin/python reproducibility/run_all.py
```

Expected total runtime: ~15–25 minutes on a modern laptop.
| Stage | Script | Time |
|---|---|---|
| Compute rankings | `compute_rankings.py` | ~10–15 min |
| Figures 1–5, SI 1–3 | `fig1.py` … `si_fig3.py` | < 1 min each |
| SI Figure 4 (permutation test) | `si_fig4.py` | ~3 min |
If `reproducibility/derived/` already exists (e.g. on a second run), skip the ranking step:

```
.venv/bin/python reproducibility/run_all.py --skip-rankings
```

Step 1: generate the intermediate ranking files (run once; takes ~10–15 min):
```
.venv/bin/python reproducibility/compute_rankings.py
```

This reads `public_data/comparisons.csv` and produces four files in `reproducibility/derived/`:
- `individual_rankings.csv` – per-user SpringRank (alpha=0) scores
- `field_rankings_per_user.csv` – leave-one-out field SpringRank (alpha=20) scores
- `field_consensus_rankings.csv` – field-level consensus SpringRank (alpha=20) scores
- `global_rankings_per_user.csv` – leave-one-out global SpringRank (alpha=20) scores (all fields pooled)
Step 2: produce each figure:

```
.venv/bin/python reproducibility/fig1.py     # Fig 1: field structure + heatmap
.venv/bin/python reproducibility/fig2.py     # Fig 2: preference agreement scatter
.venv/bin/python reproducibility/fig3.py     # Fig 3: network visualizations (HTML)
.venv/bin/python reproducibility/fig4.py     # Fig 4: aspiration/preference regressions
.venv/bin/python reproducibility/fig5.py     # Fig 5: preference realization
.venv/bin/python reproducibility/si_fig1.py  # SI Fig 1: full regression forest plot
.venv/bin/python reproducibility/si_fig2.py  # SI Fig 2: JIF vs consensus rankings
.venv/bin/python reproducibility/si_fig3.py  # SI Fig 3: comparison accuracy
.venv/bin/python reproducibility/si_fig4.py  # SI Fig 4: permutation robustness check
```

All scripts are run from the project root and write their output PDFs to `reproducibility/`.
Note on Figure 3: `fig3.py` produces 13 HTML files (one per field) containing D3.js force-directed network graphs. These were imported into Keynote as images to compose the final Figure 3 panel.
Note on SI Figure 4: The permutation test is stochastic. Each run produces a statistically equivalent figure, but it will not be pixel-identical to the published version.
Rankings are computed using SpringRank, a physics-inspired hierarchical ranking algorithm. The key parameters:
- `alpha=0`: individual user rankings (no regularization); used in Figure 2.
- `alpha=20`: field-level consensus rankings (regularization toward a neutral prior); used in Figures 4–5 and SI figures.
- Leave-one-out (LOO): for fairness analyses, each user's ranking is computed from the field's comparisons excluding their own, preventing circularity.
Indifferent responses (`is_tie=True`) constitute 8.1% of all comparisons (N = 12,482). In the SpringRank adjacency matrix, each tie counts as a half-win for each venue: `A[i,j] += 0.5` and `A[j,i] += 0.5`.
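
To make the construction concrete, here is a self-contained numpy sketch that builds this adjacency matrix and solves the SpringRank linear system directly. The actual pipeline uses the springrank package; this sketch also assumes that tie rows still carry both venue IDs, which the schema above does not state explicitly:

```python
import numpy as np
import pandas as pd

comps = pd.read_csv("public_data/comparisons.csv").dropna(
    subset=["pref_venue_db_id", "other_venue_db_id"]
)
ids = pd.unique(comps[["pref_venue_db_id", "other_venue_db_id"]].values.ravel())
idx = {v: k for k, v in enumerate(ids)}
n = len(ids)

# A[i, j] counts venue i's wins over venue j; ties are half-wins both ways.
A = np.zeros((n, n))
for row in comps.itertuples():
    i, j = idx[row.pref_venue_db_id], idx[row.other_venue_db_id]
    if row.is_tie:
        A[i, j] += 0.5
        A[j, i] += 0.5
    else:
        A[i, j] += 1.0

def springrank_scores(A, alpha):
    """Solve [D_out + D_in + alpha*I - (A + A.T)] s = d_out - d_in.

    alpha > 0 keeps the system nonsingular; the alpha=0 case needs an
    extra constraint to pin down SpringRank's translation invariance.
    """
    d_out, d_in = A.sum(axis=1), A.sum(axis=0)
    M = np.diag(d_out + d_in) + alpha * np.eye(len(A)) - (A + A.T)
    return np.linalg.solve(M, d_out - d_in)

scores = springrank_scores(A, alpha=20.0)  # the field-consensus setting
```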
If you use this data or code, please cite:
```bibtex
@article{vanbuskirk2026consensus,
  title   = {Consensus and fragmentation in academic publication preferences},
  author  = {Van Buskirk, Ian and Hohmann, Marilena and Landgren, Ekaterina and
             Ugander, Johan and Clauset, Aaron and Larremore, Daniel B.},
  journal = {arXiv:2603.00807},
  year    = {2026},
}
```

Data: CC BY 4.0. Code: MIT License.
Daniel B. Larremore – daniel.larremore@colorado.edu