A cross-domain (LLM ↔ biology) application of a single two-metric framework — Gini and effective N — to cell-type compositional aging and to attention-head differentiation in transformer training
Theodor Spiro | ORCID 0009-0004-5382-9346 | tspiro@vaika.org
📄 Preprint: paper/draft_v2.pdf — manuscript draft, in preparation
🧮 Main analysis script: scripts/07_pythia_overlay.py (the cross-domain overlay)
📦 Companion paper (arXiv): Universal statistical signatures of evolution in artificial intelligence architectures, Spiro 2026, arXiv:2604.10571 — the broader DFE-universality hypothesis this paper instantiates in a specific substrate pair
Aging of cell-type composition in biological tissues and functional differentiation of trained neural networks have been studied in disjoint literatures. We apply a single two-metric framework — the Gini coefficient and the effective number of contributing components (Hill-1 number, eff_N) — to mouse cell-type proportion distributions and to Pythia-410M head-importance distributions, and demonstrate:
- Pythia training and bone-marrow aging move in the same quadrant of the (ΔGini, Δeff_N) plane. Pythia: (+0.145, −45.5) over 143k training steps, concentrating function into fewer heads. Mouse bone marrow on Smart-seq2 FACS: (+0.088, −2.19); 10x Droplet: (+0.038, −0.54), driven by myeloid skewing and loss of B-lineage differentiation intermediates.
- Kidney and Limb Muscle also crystallize at coarse cell-type granularity; Lung and Spleen fall in the opposite (dispersion) quadrant, driven by immune-compartment expansion. The dispersion direction for Lung and Spleen replicates on the independent Kimmel et al. (2019) cohort at all five Leiden clustering resolutions tested.
- Granularity is a primary parameter, not a nuisance. Kidney direction flips with clustering granularity within both cohorts (crystallization at coarse, dispersion at fine). A within-cell-type Leiden analysis of TMS Kidney resolves the apparent paradox: podocytes crystallize (p = 0.016 on both metrics) while macrophages disperse (p = 0.004). The tissue-level result is the net of opposing sub-type dynamics.
- Caloric restriction in rat bone marrow reverses the aging shift. On the Calico atlas (Zou et al. 2022), CR rescues 64% of the Gini drift and 57% of the eff_N drift; cell-level bootstrap places P(rescue > 0) = 1.000 on both metrics. Biological replication remains limited (n = 2 per condition); treat as proof-of-concept.
- The correspondence is narrower than full DFE universality. This work demonstrates a specific substrate-independent compositional signature; it does not claim causal equivalence between transformer training and bone-marrow aging. It is a proof-of-concept instance of the broader hypothesis developed in the companion arXiv preprint.
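As a rough illustration of the two metrics, the sketch below computes a Gini coefficient and the Hill-1 effective number (the exponential of Shannon entropy) for a vector of proportions, then takes the old-minus-young deltas that define a point in the (ΔGini, Δeff_N) plane. It is not taken from the repository's scripts: the function names `gini` and `effective_n`, the particular Gini estimator, and the toy proportion vectors are all assumptions made for illustration.

```python
import math


def gini(proportions):
    """Gini coefficient of a non-negative vector (0 means perfectly even).

    Uses the sorted-rank formula; the estimator used in the paper may differ.
    """
    x = sorted(proportions)
    n, total = len(x), sum(x)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive entry")
    weighted = sum(rank * value for rank, value in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def effective_n(proportions):
    """Hill number of order 1: exp(Shannon entropy) of the normalized vector."""
    total = sum(proportions)
    if total <= 0:
        raise ValueError("need at least one positive entry")
    probs = [v / total for v in proportions if v > 0]  # 0 * log(0) treated as 0
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Toy vectors (NOT data from the paper): an even and a skewed composition.
young = [0.25, 0.25, 0.25, 0.25]
old = [0.70, 0.10, 0.10, 0.10]

delta_gini = gini(old) - gini(young)
delta_eff_n = effective_n(old) - effective_n(young)
print(f"dGini = {delta_gini:+.3f}, d_eff_N = {delta_eff_n:+.3f}")
```

A positive ΔGini together with a negative Δeff_N corresponds to the crystallization quadrant described above; the opposite signs correspond to dispersion. Note that Gini depends on how many zero-abundance categories are included in the vector, so the category set should be fixed before comparing groups.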
| Dataset | Source | Use |
|---|---|---|
| Tabula Muris Senis (FACS + Droplet) | figshare doi:10.6084/m9.figshare.12654728 (Schaum et al. 2020) | Primary cohort; 4 ages × 23 tissues × 2 platforms |
| Kimmel et al. 2019 | GEO GSE132901 | Independent cohort for direction validation across 5 Leiden resolutions |
| Calico rat aging atlas | GEO GSE141784 (Zou et al. 2022) | Caloric-restriction rescue test on bone marrow |
| Pythia 410M ablation CSV | Companion repo mool32/functional-differentiation-dfe | Head-importance trajectory across 8 checkpoints |
├── paper/
│ ├── draft_v2.md # Manuscript source (Markdown)
│ └── draft_v2.pdf # Compiled PDF (built by scripts/15_build_pdf.py)
├── scripts/ # 16 numbered pipeline scripts (00-15) + utils.py
├── data/ # 24 intermediate CSVs (committed; reproduces every paper number)
├── figures/ # 13 publication PNGs at 300 DPI
├── component_analysis.md # Cell-type drivers of tissue-level deltas
├── kimmel_validation_summary.md
├── pythia_overlay_summary.md
├── results_summary.md
├── results_addendum.md
└── substate_summary.md
| Step | Script | Purpose |
|---|---|---|
| 0 | 00_explore.py | Structural summary of TMS FACS and Droplet arms |
| 1 | 01_compute_proportions.py | Per (mouse, tissue) cell-type proportions |
| 2 | 02_compute_metrics.py | Per (mouse, tissue) Gini / eff_N / count |
| 3 | 03_make_figures.py | Tissue-level metrics vs age |
| 4 | 04_summary.py | Per-tissue young-vs-old Mann-Whitney + BH-FDR |
| 5 | 05_platform_concordance.py | FACS-vs-Droplet concordance |
| 6 | 06_component_analysis.py | Cell-type drivers of tissue-level deltas |
| 7 | 07_pythia_overlay.py | The cross-domain overlay: Pythia trajectory in the (Gini, eff_N) plane |
| 8 | 08_substate_analysis.py | Within-cell-type Leiden granularity test |
| 9 | 09_substate_figure.py | Scale-invariance figure (Kidney vs Spleen) |
| 10 | 10_kimmel_validation.py | Independent cohort (Kimmel 2019), Leiden 0.8 |
| 11 | 11_kimmel_robustness.py | Kimmel at 5 Leiden resolutions |
| 12 | 12_calico_cr_marrow.py | CR rescue test on Calico rat bone marrow |
| 13 | 13_tms_kidney_leiden.py | TMS Kidney methodological symmetry test |
| 14 | 14_overlay_with_ci.py | Pythia overlay with per-mouse bootstrap CI |
| 15 | 15_build_pdf.py | Compile draft_v2.md → draft_v2.pdf with embedded figures |
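Step 4 pairs per-tissue Mann-Whitney tests with a Benjamini-Hochberg (BH) false-discovery-rate correction. The sketch below shows only the BH adjustment, in plain Python, applied to a list of raw p-values such as those a Mann-Whitney test per tissue would produce. The function name `bh_adjust` and the example p-values are hypothetical placeholders, not values or code from 04_summary.py.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four hypothetical tissues (illustration only).
raw_p = {"tissue_A": 0.01, "tissue_B": 0.04, "tissue_C": 0.03, "tissue_D": 0.5}
q_values = dict(zip(raw_p, bh_adjust(list(raw_p.values()))))
for tissue, q in q_values.items():
    print(f"{tissue}: p = {raw_p[tissue]:.3f}, q = {q:.4f}")
```

With many tissues and two platforms, the size of the family of tests being corrected changes the q-values, so the family should be stated alongside any reported significance.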
| Figure (paper ref) | File | Description |
|---|---|---|
| 1 | fig_pythia_overlay_v2.png | Pythia + biology in (Gini, eff_N) plane with bootstrap CI |
| 2 | fig_platform_concordance.png | FACS vs Droplet per-tissue medians |
| 3 | fig_substate_scale.png | Sub-cell-type Leiden, Kidney vs Spleen |
| 4 | fig_kimmel_validation.png | Independent cohort validation |
| 5 | fig_calico_cr_rescue.png | CR rescue of marrow crystallization |
| S1 | fig_kimmel_robustness.png | Kimmel direction vs clustering resolution |
| S2 | fig_direction_heatmap.png | Cross-platform direction heat map |
| S3 | fig_kidney_symmetry.png | TMS kidney at matching Leiden resolutions |
The repository commits all intermediate CSVs, so all paper numbers reproduce from data/ without re-downloading the upstream data. To re-run from scratch you need:
- Tabula Muris Senis FACS + Droplet .h5ad (figshare doi:10.6084/m9.figshare.12654728)
- Kimmel 2019 preprocessed .h5ad (GEO GSE132901, see github.com/mjibanezsole/aging_pipeline for preprocessing)
- Calico rat aging atlas .h5ad (GEO GSE141784)
- Pythia 410M ablation CSV (companion repo mool32/functional-differentiation-dfe)
Update scripts/utils.py and scripts/07_pythia_overlay.py to point at your local data paths.
git clone https://github.com/mool32/clonal-crystallization-aging.git
cd clonal-crystallization-aging
pip install scanpy anndata numpy pandas scipy matplotlib leidenalg fpdf2

# Numbered scripts execute in order; each is self-contained.
python scripts/01_compute_proportions.py
python scripts/02_compute_metrics.py
python scripts/03_make_figures.py
python scripts/04_summary.py
python scripts/05_platform_concordance.py
python scripts/06_component_analysis.py
python scripts/07_pythia_overlay.py # cross-domain overlay
python scripts/08_substate_analysis.py
python scripts/09_substate_figure.py
python scripts/10_kimmel_validation.py
python scripts/11_kimmel_robustness.py
python scripts/12_calico_cr_marrow.py
python scripts/13_tms_kidney_leiden.py
python scripts/14_overlay_with_ci.py
python scripts/15_build_pdf.py # compiles paper/draft_v2.pdf

Total runtime ≈ 2 hours on a single workstation; peak memory ≈ 16 GB when the Droplet .h5ad (7.7 GB) is loaded.
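Because the numbered scripts run in order, the command list above could also be driven by a small runner. The following is a minimal sketch, not part of the repository: the helper names `ordered_scripts` and `run_pipeline`, and the `first` parameter, are hypothetical. It assumes each script is launched from the repository root and stops at the first failure.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches names such as "07_pythia_overlay.py"; utils.py is deliberately excluded.
NUMBERED = re.compile(r"^(\d{2})_.+\.py$")


def ordered_scripts(names):
    """Return the numbered script names sorted by their two-digit prefix."""
    numbered = [name for name in names if NUMBERED.match(name)]
    return sorted(numbered, key=lambda name: int(NUMBERED.match(name).group(1)))


def run_pipeline(script_dir="scripts", first=1):
    """Run every numbered script with prefix >= first, stopping on the first error."""
    root = Path(script_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"script directory not found: {root}")
    names = [path.name for path in root.glob("*.py")]
    for name in ordered_scripts(names):
        if int(NUMBERED.match(name).group(1)) < first:
            continue
        print(f"running {name}", flush=True)
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, str(root / name)], check=True)
```

Calling `run_pipeline()` from the repository root would mirror the command list above, which starts at 01; `run_pipeline(first=10)` would resume from the Kimmel validation step after an interruption.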
@article{spiro2026crystallization,
author = {Spiro, Theodor},
title = {Clonal crystallization as a shared signature of bone-marrow aging and neural-network training},
journal = {bioRxiv},
year = {2026},
note = {Manuscript in preparation. Companion paper: arXiv:2604.10571}
}

Please also cite the underlying data sources (TMS, Kimmel, Calico rat, Pythia) per their own citation policies.
Theodor Spiro — tspiro@vaika.org
- Code (scripts/, utils.py): MIT (see LICENSE)
- Data (data/*.csv): CC-BY 4.0, with upstream citation requirements honored for TMS, Kimmel, and Calico datasets
- Figures (figures/*.png): CC-BY 4.0
- Manuscript (paper/draft_v2.md and .pdf): CC-BY 4.0