Fix reproducibility, statistical validity, and code hygiene #5

Open: urme-b wants to merge 10 commits into main from fix/reproducibility-stats-hygiene

urme-b (Owner) commented Jul 4, 2026

Addresses the three defects that capped the repo's score, plus the remaining data-governance gaps. All work is honest: nothing fabricates absent data.

Reproducibility

  • Portable kernels: normalized the kernelspec in all 37 notebooks from pdf_processing/conda-base-py to python3, so a clean git clone no longer dies with NoSuchKernel. CI now passes --nbmake-kernel=python3.
  • Generating pipeline: pipeline/build_group_summaries.py regenerates data/group_results/ and reconciles the case-study participant with group row P02 (verified equal to within 1e-6). It documents that raw data for P01 and P03–P10 was not released, so those rows are summary-only (not fabricated).
  • Reproducible headline stats: mms.stats.icc1 recomputes the README's ICCs (0.22/0.45/0.61) exactly and adds the missing 95% CIs (the HRV interval crosses zero, so that estimate is underpowered).
  • DATA_PROVENANCE.md; .python-version 3.9→3.11 (matches CI); bounded-pinned deps; pyproject.toml.
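
The kernel normalization described above could be sketched roughly as follows. This is a minimal illustration, not the script used in this PR: the function names and the `display_name` value are assumptions, and it relies only on the standard notebook JSON layout (`metadata.kernelspec`).

```python
import json
from pathlib import Path

# Assumed target: the stock "python3" kernelspec that ipykernel registers.
TARGET_KERNELSPEC = {
    "name": "python3",
    "display_name": "Python 3",
    "language": "python",
}


def normalize_kernelspec(nb: dict) -> bool:
    """Point a parsed notebook at the python3 kernel; return True if it changed."""
    metadata = nb.setdefault("metadata", {})
    if metadata.get("kernelspec") == TARGET_KERNELSPEC:
        return False
    metadata["kernelspec"] = dict(TARGET_KERNELSPEC)
    return True


def normalize_tree(root: str) -> list:
    """Rewrite every .ipynb under root in place; return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        if ".ipynb_checkpoints" in path.parts:
            continue  # skip Jupyter's autosave copies
        nb = json.loads(path.read_text(encoding="utf-8"))
        if normalize_kernelspec(nb):
            text = json.dumps(nb, indent=1, ensure_ascii=False) + "\n"
            path.write_text(text, encoding="utf-8")
            changed.append(str(path))
    return changed
```

Calling `normalize_tree(".")` from the repo root would rewrite any notebook whose kernelspec differs from the target and leave already-normalized notebooks untouched, which keeps the resulting diff small.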

Statistical validity

  • SDNN broadcast bug fixed: build_hrv.ipynb now writes a rolling 30-beat SDNN/RMSSD (distinct values 1 → ~1600), with the schema preserved so 1_hrv.ipynb still runs.
  • Pseudoreplication — single-subject non-independence caveats + a question-level aggregated comparison in 1_hr.ipynb; clustering caveat in 3_clustering.ipynb.
  • Multiplicity — Benjamini-Hochberg FDR correction on the group correlation matrix (mms.stats.corr_matrix_fdr).
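
For readers unfamiliar with two of the fixes above, here is a rough pure-Python sketch of a rolling 30-beat SDNN/RMSSD and of the Benjamini-Hochberg adjustment. It is illustrative only: the function names are hypothetical, the use of the sample standard deviation is an assumption, and the actual code in build_hrv.ipynb and mms.stats.corr_matrix_fdr may differ.

```python
import math
from statistics import stdev


def rolling_sdnn_rmssd(rr_ms, window=30):
    """Return one (sdnn, rmssd) pair per full window of RR intervals (ms)."""
    if window < 2:
        raise ValueError("window must cover at least two beats")
    out = []
    for end in range(window, len(rr_ms) + 1):
        beats = rr_ms[end - window:end]
        diffs = [b - a for a, b in zip(beats, beats[1:])]
        sdnn = stdev(beats)  # sample SD of this window (an assumption)
        rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
        out.append((sdnn, rmssd))
    return out


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Because each window yields its own value, the output varies beat to beat instead of repeating one scalar on every row, which is the symptom the "distinct values 1" figure describes. For a correlation matrix, the p-values of the unique off-diagonal pairs would be flattened into one list before adjustment.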

Code hygiene

  • mms/ package (io, hrv, stats, fixation, paths) — kills the 173-call read_csv duplication.
  • Deleted 4 near-duplicate clustering notebooks (kept canonical 3_clustering.ipynb); renamed 16 space/cryptic notebooks to snake_case; purpose headers added (markdown coverage 0/37 → 33/33).
  • tests/test_mms.py asserts computed values (not just that files parse). Suite: 131 passing.

Data governance

  • data/README.md — data dictionary for all 12 CSV schemas.
  • DATA_LICENSE.md — dual-license (code MIT; data CC-BY-4.0 + no-re-identification term).
  • scripts/deidentify_timestamps.py — dry-run-by-default tool to shift absolute timestamps to relative session time (removes a re-identification vector), preserving cross-stream alignment. Not applied — it rewrites released data; run with --apply when ready.
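
The timestamp shift can be pictured with the sketch below. It is an assumption-laden illustration rather than the contents of scripts/deidentify_timestamps.py: the function names are hypothetical, and streams are modeled as an in-memory dict of timestamp lists instead of CSV files.

```python
def session_origin(streams):
    """Earliest timestamp across all streams; used as the shared zero point."""
    stamps = [t for ts in streams.values() for t in ts]
    if not stamps:
        raise ValueError("no timestamps to shift")
    return min(stamps)


def deidentify(streams, apply=False):
    """Return (streams, origin). Streams are shifted only when apply=True."""
    origin = session_origin(streams)
    if not apply:
        # Dry run: hand back the input untouched so nothing is rewritten.
        return streams, origin
    # One shared origin for every stream keeps their relative offsets intact.
    shifted = {name: [t - origin for t in ts] for name, ts in streams.items()}
    return shifted, origin
```

Subtracting a single origin from every stream, rather than each stream's own minimum, is what preserves cross-stream alignment: the gap between any two events is the same before and after the shift.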

Data integrity

P01's raw data is genuinely absent from the repo, so its known Session 02 == Session 03 HRV duplicate (both 65.39) is flagged and guarded by a test (test_no_undocumented_session_duplicates) rather than guessed. Restoring the true value requires the original source.
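
A guard of this kind might look like the sketch below. The helper names and the row layout `(participant, session, hrv)` are assumptions made for illustration; only the P01 Session 02/03 duplicate comes from the description above, and the real test_no_undocumented_session_duplicates may be structured differently.

```python
from collections import defaultdict

# Documented above: P01 Session 02 and Session 03 share the same HRV value.
KNOWN_DUPLICATES = {("P01", ("02", "03"))}


def find_session_duplicates(rows):
    """Return {(participant, (session, ...))} for sessions sharing one HRV value."""
    by_value = defaultdict(list)
    for participant, session, hrv in rows:
        by_value[(participant, hrv)].append(session)
    return {
        (participant, tuple(sorted(sessions)))
        for (participant, _), sessions in by_value.items()
        if len(sessions) > 1
    }


def undocumented_duplicates(rows, known=KNOWN_DUPLICATES):
    """Duplicates that are not on the documented allowlist."""
    return find_session_duplicates(rows) - set(known)
```

A test would then assert that `undocumented_duplicates(rows)` is empty, so the known P01 case passes while any new identical-value pair fails loudly instead of slipping in unnoticed.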

🤖 Generated with Claude Code

urme-b force-pushed the fix/reproducibility-stats-hygiene branch from 4b52c02 to 54297f3 on July 5, 2026 08:17