Open-source harmonization of U.S. higher-education survey data into reproducible analytical panels.
The current scope is three integrated datasets:
- NSF HERD (Higher Education Research and Development survey, FY 1972–2024): the R&D expenditure-OUT face.
- Federal S&E Support (NSF Survey of Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions, FY 1971–2023): the federal funding-IN face.
- NSF GSS (Survey of Graduate Students and Postdoctorates in Science and Engineering, FY 1972–2024): the human-capital face, covering the federally supported graduate students and postdocs that research funding trains.
The three are joined on the institution-year via a cross-survey identity spine and GSS's native IPEDS UnitID, completing the funding → people → productivity picture. The roadmap continues with the IPEDS IC snapshot (dataset #4), then other NCSES surveys.
Higher-education survey data has methodological discontinuities — era boundaries in survey instruments, encoding shifts, taxonomy redesigns, infrastructure changes. Most published analyses treat the data as if those discontinuities don't exist, or skip the eras where they do.
Quadrivium applies Reconstructive Harmonization:
(a) reconstruct what each era can support on its own terms (rules, crosswalks, validated reconstructions);
(b) decompose what crossing a discontinuity actually involves into named, quantified components (real growth, definitional change, population expansion, residual unmeasurables);
(c) publish both the reconstruction and the decomposition with sufficient documentation that a cold reader can use either without misreading the discontinuity.
This is not a bridge across discontinuities. It is the discipline of making operational data legible across them by being precise about what is reconstructible, what is decomposable, and what remains unmeasurable. See docs/methods_notes/reconstructive_harmonization.md for the methodological account applied to the 2010 HERD era boundary.
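As a hedged illustration of step (b), the sketch below splits an apparent jump across an era boundary into the named components and computes the residual as whatever they leave unexplained. It assumes the components combine additively, which is a simplification made here for illustration. The function name `decompose_jump` and every figure are hypothetical placeholders, not values from the HERD decomposition; the methods note carries the real account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decomposition:
    real_growth: float
    definitional_change: float
    population_expansion: float
    residual: float  # whatever the named components do not explain

    @property
    def total(self) -> float:
        return (self.real_growth + self.definitional_change
                + self.population_expansion + self.residual)


def decompose_jump(before: float, after: float, *, real_growth: float,
                   definitional_change: float,
                   population_expansion: float) -> Decomposition:
    """Additive split of (after - before); the residual is computed, never assumed zero."""
    jump = after - before
    residual = jump - (real_growth + definitional_change + population_expansion)
    return Decomposition(real_growth, definitional_change,
                         population_expansion, residual)


# Hypothetical figures in arbitrary units, for illustration only.
d = decompose_jump(100.0, 112.0, real_growth=5.0,
                   definitional_change=4.0, population_expansion=2.0)
print(d.residual)  # 1.0
```

Keeping the residual as an explicit field, rather than folding it into one of the named components, mirrors the point that some of a discontinuity stays unmeasurable and should be reported as such.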
- data/harmonized/herd_panel.parquet: 50-year field-level R&D expenditure panel (FY 1975–2024), two parallel reconstructed series across the 2010 era boundary.
- data/harmonized/herd_panel_attributes.parquet: institution-year Q4/Q5 attribute sibling with medical-school and clinical-trials share and value columns.
- data/harmonized/herd_personnel.parquet: Q15 headcount + Q16 FTE personnel panel for FY 2022–2024 (the microdata-bearing years; NCSES Data Table 26 publishes institution totals for FY 2020–2024, but FY 2020–2021 are aggregate-only, with no per-institution microdata). Carries no quality_flag column, a documented imputation-provenance asymmetry with the financial panel (see docs/methods_notes/herd_panel_etl_scoping.md §12).
- data/harmonized/fedsupport_obligations.parquet: Federal S&E Support full-series long panel (FY 1971–2023), carrying federal S&E obligations by department × agency × broad/detailed activity × institution × type × state, with a native IPEDS UnitID and ana_*/no_match/matched status column. All universes (higher-ed / nonprofit / FFRDC) are carried; filter via institution_type.
- data/harmonized/fedsupport_institution_year.parquet: HERD-join-ready aggregate of higher-ed (academic + consortium), matched-UNITID rows, aggregated to (fiscal_year, ipeds_unitid) with R&D / S&E-support / total obligation columns. R&D-broad is the like-for-like counterpart to HERD federal R&D expenditure. See the discontinuity methods note docs/methods_notes/fedsupport/discontinuities.md.
- data/harmonized/gss_support.parquet: the GSS funding-of-human-capital face, with full-time graduate students by support mechanism × federal agency × federal/nonfederal, FY 1972–2024, native IPEDS UnitID. The direct join to HERD/FedSupport federal agencies (NIH, NSF, DOD, DOE, USDA, NASA).
- data/harmonized/gss_race.parquet: GSS enrollment by enrollment-status × degree level × gender × race, FY 1972–2024 (OMB-1997 race taxonomy pre-bridged across the 2017 redesign).
- data/harmonized/gss_pd_nfr.parquet: GSS postdoctoral appointees and doctorate-holding non-faculty researchers (support, demographics, degree, citizenship); carries a measure_group discriminator (the source sheet is overlapping marginal tables, so sum within a group, never across).
See the GSS methods note docs/methods_notes/gss/reconstructive_harmonization_gss.md and its three boundary decompositions (2017 / 1984–87 / 2014). GSS field names are provisional (count-matched; 91/131 codes; field_coarse = Science/Engineering/Health only; ~40 historical codes NULL pending the NCSES field-code reference).
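The "sum within a group, never across" rule for gss_pd_nfr.parquet can be sketched with toy rows. Only the `measure_group` column name comes from the panel description; the `category` and `count` columns, the group labels, and the numbers are hypothetical stand-ins, so check the real schema before adapting this.

```python
from collections import defaultdict

# Toy rows standing in for gss_pd_nfr.parquet. Each measure_group is a separate
# marginal table describing the SAME people from a different angle.
rows = [
    {"measure_group": "support", "category": "federal", "count": 60},
    {"measure_group": "support", "category": "nonfederal", "count": 40},
    {"measure_group": "citizenship", "category": "us", "count": 55},
    {"measure_group": "citizenship", "category": "temporary_visa", "count": 45},
]


def totals_by_group(rows):
    """Sum counts inside each measure_group; groups are never added together."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["measure_group"]] += r["count"]
    return dict(totals)


totals = totals_by_group(rows)
naive = sum(r["count"] for r in rows)  # WRONG: counts every person once per group

print(totals)  # {'support': 100, 'citizenship': 100}
print(naive)   # 200
```

In this toy case each group independently totals 100 people, while the naive sum across groups reports 200, which is the double counting the discriminator exists to prevent.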
Companion validation reports in validation/reports/ carry the reconciliation against published NSF / NCSES ground truth.
git clone https://github.com/QuinnyXu/quadrivium.git quadrivium
cd quadrivium
uv sync
uv run python etl/build_herd_panel.py # rebuild HERD financial + attribute parquets
uv run python etl/build_herd_personnel.py # rebuild HERD personnel parquet
uv run python etl/build_fedsupport_obligations.py # rebuild Federal S&E Support panels
uv run python etl/build_fedsupport_identity_spine.py # rebuild the cross-survey identity spine
uv run python etl/acquire_gss.py # convert GSS zips -> CSV (acquisition)
uv run python etl/build_gss_support.py # rebuild GSS support / race / pd_nfr panels
uv run python etl/build_gss_race.py
uv run python etl/build_gss_pd_nfr.py
The GSS panel builds read the committed crosswalks under crosswalks/gss/ (column maps + field_code_map.csv); see data/harmonized/MANIFEST.md for the full GSS regeneration order.
Requirements: Python 3.12 and uv (installed locally; this repo pins uv 0.11.8 in the lockfile and runtime deps to duckdb==1.5.2 + pypdf==6.10.2).
Raw NSF HERD zips are not redistributed via git. SHA-256 manifests in data/raw/MANIFEST.md document the exact files that reproduce the harmonized outputs; download from NSF's HERD survey archive (URLs listed in the MANIFEST).
A cold reader with the lockfile, the raw zips named in data/raw/MANIFEST.md, and the NCSES reference PDFs in data/reference/ reaches the same harmonized parquet bit-equivalently (modulo parquet writer determinism on a fixed input-and-code-version pair).
Ships in the deposit (tracked in git, CC-BY-4.0): the harmonized parquets in data/harmonized/ (SHA-256s pinned in data/harmonized/MANIFEST.md), plus the crosswalks, the methods notes, the validation reports, the NCSES reference PDFs (data/reference/), the lockfile, and the build scripts. You can use the harmonized panels directly, or rebuild them.
Fetched from NSF (not redistributed): the 53 raw HERD year zips and 13 short-form zips. Their SHA-256s and download URLs are in data/raw/MANIFEST.md; they are U.S. government work, staged by checksum rather than redistributed (the provenance-clean choice — the zip is the bit-identical artifact NSF shipped). A consumer rebuilding from raw obtains them from NSF's HERD archive.
The integrity round-trip: raw-zip SHAs (NSF-fetched, data/raw/MANIFEST.md) → uv sync + build → harmonized-parquet SHAs (deposit-shipped, data/harmonized/MANIFEST.md). A consumer who fetches the raw zips, verifies them against data/raw/MANIFEST.md, and runs uv sync and the build scripts reproduces the harmonized SHAs in data/harmonized/MANIFEST.md. This round-trip is verified end-to-end from a clean checkout: the harmonized panel rebuilds to the exact pinned SHA, and the FY 2024 verification grid re-asserts 58/58 at +0.000%.
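A minimal sketch of the checksum step in that round-trip, using only the Python standard library. The helper names `sha256_of` and `verify` are hypothetical and are not part of the repo's ETL. The sketch assumes the expected digests have already been copied out of the relevant MANIFEST.md into a dict, because the manifest's layout is not described here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large zips are never fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(expected: dict[str, str], root: Path) -> list[str]:
    """Return the file names that are missing or whose digest differs."""
    bad = []
    for name, digest in expected.items():
        p = root / name
        if not p.is_file() or sha256_of(p) != digest.lower():
            bad.append(name)
    return bad
```

Under these assumptions, `verify(expected, Path("data/raw"))` returning an empty list before the build, and the same call against data/harmonized/ after it, corresponds to the two ends of the round-trip.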
Methods-note figures are not deposit runtime. To rebuild figures:
uv sync --group charts
uv run --group charts python etl/spikes/era_2010_decomposition_chart.py
uv run --group charts python etl/spikes/herd_question_count_cliff_chart.py
The HERD methods note lives at docs/methods_notes/reconstructive_harmonization.md. The deposit's personnel sibling README is at docs/methods_notes/herd_personnel_README.md. The HERD per-year profile is at docs/methods_notes/herd_profile.md.
The full HD 2.1 / HD 2.4 implementation contract — schema, era handling, codeset policy, validation gates — is in docs/methods_notes/herd_panel_etl_scoping.md and docs/hd_2_1_scoping.md.
quadrivium/
├── CLAUDE.md          project doctrine, locked decisions
├── README.md          you are here
├── LICENSE            MIT (code)
├── LICENSE-DATA.md    CC-BY-4.0 (data)
├── crosswalks/        discipline + question-mapping CSVs (decision_rationale tracked)
├── data/
│   ├── raw/           raw NSF zips (gitignored payload); MANIFEST.md is the SHA-256 anchor
│   ├── harmonized/    canonical parquets
│   └── reference/     NCSES reference PDFs; MANIFEST.md is the staging anchor
├── docs/              methods notes, scoping, source documents
├── etl/               loaders, builders, spikes
└── validation/        reconciliation reports, per-year profiling
Quadrivium is at Stage 1 of a three-stage trajectory:
- Stage 1 (current) — open datasets. HERD, Federal S&E Support, and NSF GSS harmonized (current). Future migrations: the IPEDS IC snapshot (the authoritative-identity spine backfill), then other NCSES surveys. Each migration applies the Reconstructive Harmonization methodology to that survey's discontinuities; the schema and validation patterns adapt to the survey's structure, the methodology does not.
- Stage 2 (planned) — platform. Interactive query and comparative-panel surface on top of the harmonized data.
- Stage 3 (planned) — commercial analytics. Analytics built on the platform.
Stages 2 and 3 are not built now; they are the durable framing of where the project goes. Stage-1 work does not assume Stage-2 readiness.
- Code: MIT (see LICENSE).
- Data: CC-BY-4.0 (see LICENSE-DATA.md).
If you use quadrivium's harmonized panels in research, please cite the deposit and the methods note. Machine-readable citation metadata is in CITATION.cff — the single source of truth for the DOI. The DOI below is the concept DOI (all versions), minted on Zenodo.
Plain text:
Quadrivium contributors (2026). Quadrivium: Reconstructive Harmonization of U.S. Higher-Education Survey Data. Version 4.0.0. Zenodo. DOI: 10.5281/zenodo.20404785 (concept DOI, all versions; see CITATION.cff). License: CC-BY-4.0. Version 4.0 contains three datasets: HERD (R&D expenditure-OUT panels), Federal S&E Support (federal funding-IN), and NSF GSS (graduate-student & postdoc human-capital, FY 1972–2024). They are joined on the institution-year via the cross-survey institution-identity spine. GSS field names are provisional (count-matched; see CITATION.cff and the GSS methods note).
BibTeX:
@dataset{quadrivium_2026,
author = {{Quadrivium contributors}},
title = {{Quadrivium: Reconstructive Harmonization of U.S. Higher-Education Survey Data}},
year = {2026},
version = {4.0.0},
publisher = {Zenodo},
doi = {10.5281/zenodo.20404785},
note = {Concept DOI (all versions); v4.0.0 version DOI 10.5281/zenodo.20530949; v3.0.0 version DOI 10.5281/zenodo.20514381; v2.0.0 version DOI 10.5281/zenodo.20469884; v1.0.0 (HERD-only) version DOI 10.5281/zenodo.20404786. Data CC-BY-4.0; code MIT.}
}
External contribution flow is currently issue-based. To propose a crosswalk amendment or methodology extension, open a GitHub issue with: the proposed change, the empirical anchor (which raw HERD year and file, or which published NSF document), and the decision_rationale you would add to the crosswalk row. See CONTRIBUTING.md for full proposal guidance; pull-request mechanics arrive at the platform stage.