Fix reproducibility, statistical validity, and code hygiene #5
Open
urme-b wants to merge 10 commits into
Addresses the three defects that capped the repo's score, plus the remaining data-governance gaps. All work is honest: nothing fabricates absent data.
Reproducibility
- `pdf_processing/`: `conda-base-py` → `python3`, so a clean `git clone` no longer dies with `NoSuchKernel`. CI now passes `--nbmake-kernel=python3`.
- `pipeline/build_group_summaries.py` regenerates `data/group_results/` and reconciles the case-study participant (verified == group row P02) to <1e-6. Documents that P01/P03–P10 raw data was not released, so those rows are summary-only (not fabricated).
- `mms.stats.icc1` recomputes the README's ICCs (0.22/0.45/0.61) exactly and adds the missing 95% CIs (HRV's crosses zero → underpowered).
- `DATA_PROVENANCE.md`; `.python-version` 3.9 → 3.11 (matches CI); bounded-pinned deps; `pyproject.toml`.

Statistical validity
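The ICC recomputation listed under Reproducibility can be illustrated with a minimal sketch. It assumes a one-way random-effects ICC(1,1) and, because the PR does not say how the 95% CIs are derived, uses a percentile bootstrap over subjects as a stand-in. The function names and the toy rows below are hypothetical and are not the repo's `mms.stats.icc1` or its data.

```python
import random
from statistics import mean


def icc1(rows):
    """One-way random-effects ICC(1,1); rows = one list of k scores per subject."""
    n, k = len(rows), len(rows[0])
    grand = mean(v for row in rows for v in row)
    subj_means = [mean(row) for row in rows]
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum(
        (v - m) ** 2 for row, m in zip(rows, subj_means) for v in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def bootstrap_ci(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole subjects (an assumed method)."""
    rng = random.Random(seed)
    estimates = sorted(
        icc1([rng.choice(rows) for _ in rows]) for _ in range(n_boot)
    )
    lo = estimates[int(n_boot * alpha / 2)]
    hi = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


if __name__ == "__main__":
    # Made-up scores for three subjects, two sessions each (illustration only).
    toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print("ICC(1):", icc1(toy))
    print("95% CI:", bootstrap_ci(toy))
```

An interval whose lower bound falls below zero, as reported for HRV, means the data cannot distinguish the estimated reliability from none at all.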
- `build_hrv.ipynb` now writes a rolling 30-beat SDNN/RMSSD (distinct values 1 → ~1600), schema preserved so `1_hrv.ipynb` still runs.
- … `1_hr.ipynb`; clustering caveat in `3_clustering.ipynb`.
- … (`mms.stats.corr_matrix_fdr`).

Code hygiene
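For the rolling 30-beat SDNN/RMSSD change listed under Statistical validity, the following is a rough sketch of what such a computation could look like. It assumes RR intervals in milliseconds, a sample standard deviation for SDNN, and one output pair per full window. `rolling_hrv` is a hypothetical name, not the code in `build_hrv.ipynb`.

```python
import math
from statistics import stdev


def rolling_hrv(rr_ms, window=30):
    """Return a list of (SDNN, RMSSD) pairs, one per full window of RR intervals."""
    out = []
    for end in range(window, len(rr_ms) + 1):
        w = rr_ms[end - window:end]
        # SDNN: sample standard deviation of the RR intervals in the window.
        sdnn = stdev(w)
        # RMSSD: root mean square of successive RR differences in the window.
        diffs = [b - a for a, b in zip(w, w[1:])]
        rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
        out.append((sdnn, rmssd))
    return out


if __name__ == "__main__":
    # Synthetic RR series alternating 800 ms / 810 ms (illustration only).
    rr = [800 + (i % 2) * 10 for i in range(40)]
    for sdnn, rmssd in rolling_hrv(rr)[:3]:
        print(round(sdnn, 3), round(rmssd, 3))
```

Computing the metrics per window rather than once per recording is what lets the output column take many distinct values instead of a single constant.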
- `mms/` package (`io`, `hrv`, `stats`, `fixation`, `paths`): kills the 173-call `read_csv` duplication.
- … `3_clustering.ipynb`); renamed 16 space/cryptic notebooks to snake_case; purpose headers added (markdown coverage 0/37 → 33/33).
- `tests/test_mms.py` asserts computed values (not just that files parse). Suite: 131 passing.

Data governance
- `data/README.md`: data dictionary for all 12 CSV schemas.
- `DATA_LICENSE.md`: dual-license (code MIT; data CC-BY-4.0 + no-re-identification term).
- `scripts/deidentify_timestamps.py`: dry-run-by-default tool to shift absolute timestamps to relative session time (removes a re-identification vector), preserving cross-stream alignment. Not applied, because it rewrites released data; run with `--apply` when ready.

Data integrity
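The timestamp shift that `scripts/deidentify_timestamps.py` is described as performing under Data governance can be sketched roughly as below. The sketch assumes each stream is a list of epoch-second timestamps and that a single shared origin (the earliest timestamp across all streams) is subtracted everywhere. The function and stream names are hypothetical, and the real script may work differently.

```python
def shift_to_session_time(streams):
    """Map absolute timestamps to seconds since the earliest sample in any stream.

    streams: dict of stream name -> list of absolute timestamps (seconds).
    Using one shared origin keeps the offsets between streams unchanged.
    """
    origin = min(t for ts in streams.values() for t in ts)
    return {name: [t - origin for t in ts] for name, ts in streams.items()}


if __name__ == "__main__":
    # Made-up timestamps for two streams of one session (illustration only).
    session = {"hr": [1000.0, 1001.0], "gaze": [1000.5, 1002.0]}
    shifted = shift_to_session_time(session)
    # Dry run: report what would change without writing anything back.
    for name, ts in shifted.items():
        print(name, ts)
```

Subtracting a per-stream origin instead would also hide wall-clock time, but it would break the alignment between streams, which is why a common origin is assumed here.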
P01's raw data is genuinely absent from the repo, so its known `Session 02 == Session 03` HRV duplicate (both 65.39) is flagged and guarded by a test (`test_no_undocumented_session_duplicates`) rather than guessed. Restoring the true value requires the original source.

🤖 Generated with Claude Code
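A guard like the one described under Data integrity might look something like the sketch below. It assumes per-participant HRV summaries keyed by session and an explicit allowlist of documented duplicates. Only the P01 value 65.39 comes from this PR; the other numbers, the data layout, and the helper name are made-up placeholders, and the actual test in `tests/test_mms.py` may differ.

```python
from itertools import combinations

# Duplicates that are known and documented, so the guard tolerates them.
KNOWN_DUPLICATES = {("P01", "Session 02", "Session 03")}


def find_session_duplicates(hrv, tol=1e-9):
    """Return (participant, session_a, session_b) for sessions with equal values."""
    dups = set()
    for pid, sessions in hrv.items():
        for (s1, v1), (s2, v2) in combinations(sorted(sessions.items()), 2):
            if abs(v1 - v2) <= tol:
                dups.add((pid, s1, s2))
    return dups


def test_no_undocumented_session_duplicates():
    # Placeholder table; everything except the two 65.39 entries is invented.
    hrv = {
        "P01": {"Session 01": 1.0, "Session 02": 65.39, "Session 03": 65.39},
        "P02": {"Session 01": 2.0, "Session 02": 3.0},
    }
    # Any duplicate outside the documented allowlist fails the test.
    assert find_session_duplicates(hrv) <= KNOWN_DUPLICATES


if __name__ == "__main__":
    test_no_undocumented_session_duplicates()
    print("no undocumented duplicates")
```

The allowlist keeps the known P01 duplicate visible without letting a new, unexplained duplicate slip in unnoticed.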