Falsify → robust: Bonn S2 bright line robustly passed (seed + multi-null) #86

Merged
neuron7xLab merged 12 commits into main from falsify/s2-robustness
Jun 25, 2026

neuron7xLab (Owner) commented Jun 24, 2026

Final state: BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED (robust_gate_passed=true).

Full falsification→robust arc (each step artifact-backed):

  1. Falsification attacked the nominal single-seed pass (seed-7 AR-null FPR 0.067 > 0.05).
  2. Seed-averaged calibration (N=480) flagged NOT robust (FPR 0.0354, Wilson CI-upper 0.056) → canonical state honestly downgraded.
  3. Pre-registered S3 seed-averaged AR-null (N=1000, 10 seeds, lock frozen before the run) → run crashed on JSON write → canonical state NOT flipped on the log reconstruction → clean re-run reproduced the per-seed counts byte-for-byte → FPR 0.028, CI [0.019, 0.040].
  4. Multi-null gate (AR 0.026 / IAAFT 0.032 / phase-randomized 0.034, all Wilson CI-upper ≤ 0.05) → robust to seed AND null-model.

Also: `validate_statistical_claims.py` CI gate (no point estimate sold as a pass), NIST risk register R1–R10, fail-closed decision table, hostile-review packet, `make mission-check`. Data-driven truth: the full robust state was claimed only after both the S3 and multi-null artifacts supported it.

Still NOT: clinical/regulatory, BNCI executed (BNCI_BLOCKED_METHOD), multi-dataset replicated (NOT_DONE).

🤖 Generated with Claude Code

…-sensitive)

Falsification-first: actively attacked BONN_S2_BRIGHT_LINE_PASSED (4 seed perturbations + 3
AR-order misspecifications, N=30, 199 surrogates). Result S2_FRAGILE_under_attack:
- G1 power ROBUST: Set E SURVIVED 0.967 under every seed and AR order.
- G2 specificity BOUNDARY: AR-null FPR 0.0-0.067; seed_base=7 gave 0.067 > 0.05.

Calibrated (not defended): BONN_S2_BRIGHT_LINE_PASSED is a marginal/boundary pass — cleared the
predeclared N=100 confirmatory (FPR 0.02) but the specificity margin is thin and seed-sensitive
(Wilson 95% CI of 0.02 reaches ~0.05). Integrated into the truth-system:
CURRENT_TRUTH.s2_robustness=BOUNDARY_PASS_G1_POWER_ROBUST_G2_SPECIFICITY_SEED_SENSITIVE; caveats in
FORMAL_VERDICT + STATISTIC_REGISTRY + CLAIM_AUDIT. Honest next step: seed-averaged / larger-N
specificity confirmatory. Governance regenerated to fixpoint (CERTIFIED). No over-claim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…bright line not robustly crossed

Continuing the falsification: a seed-averaged specificity calibration (480 AR-null tests, 6 seeds
x Sets A+B) gives pooled FPR 0.0354, Wilson 95% CI [0.022, 0.056]. The CI UPPER bound (0.056)
EXCEEDS the 0.05 gate; 2/6 seeds gave FPR > 0.05 (0.075, 0.0625). The predeclared confirmatory
FPR=0.02 (seed 20260623) was a favorable-seed point estimate.

Calibrated (truth over comfort): G1 power robust; G2 specificity NOT robust; the Bonn S2 bright
line is a MARGINAL/favorable-seed pass, NOT robustly crossed. Integrated:
CURRENT_TRUTH.s2_robustness=NOT_ROBUST_G2_SPECIFICITY...CI_[0.0222,0.056]_crosses_0.05; FORMAL_VERDICT
section 1 + README + STATISTIC_REGISTRY + CLAIM_AUDIT updated to lead with the non-robustness.
Honest next step: re-preregister a seed-averaged specificity gate (FPR CI upper <= 0.05) and re-run.

Artifact: S2_SPECIFICITY_CALIBRATION.json. Governance fixpoint (CERTIFIED). No over-claim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
neuron7xLab changed the title from "Falsify + calibrate S2: boundary pass (power robust, specificity seed-sensitive)" to "Falsify + calibrate S2: NOT robustly crossed (G2 specificity CI crosses 0.05)" on Jun 24, 2026
neuron7xLab and others added 10 commits June 24, 2026 21:50
… FROZEN before run)

Gate: G1 seed-avg Set-E SURVIVED>=0.80 AND G2 pooled AR-null FPR Wilson-95-CI-upper<=0.05
(stricter than a point estimate — the failure the falsification exposed). 10 seeds, N=50, 199
surrogates, statistic S2-C1 unchanged. No tuning after results. Run pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…null robustness (predeclared)

- S3_DECISION_RULE.md: G2 framed as a one-sided test of H0 (true FPR>=0.05); PASS = Wilson 95%
  CI-upper <= 0.05 (rejects H0 at alpha=0.025). Strictly stronger than the S2 point-estimate rule,
  which a favorable seed can satisfy.
- S3_DESIGN_POWER.json: at N=1000 the gate passes only if observed FPR <= ~0.035 (CI-upper 0.048)
  and fails at >=0.04; the calibration 0.0354 sits at the resolution boundary -> design adequate.
- MULTI_NULL_ROBUSTNESS_PROTOCOL.md: predeclared specificity check across AR/IAAFT/phase-randomized
  nulls (null model is a researcher DOF); robust only if it survives every null model.

All frozen before the S3 run completes. No tuning after results.
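
The resolution claim in S3_DESIGN_POWER.json can be checked independently. The sketch below is not the repository's code: `wilson_interval`, `gate_passes`, and `max_passing_count` are hypothetical helper names, and it assumes the gate uses the standard two-sided 95% Wilson score interval on the pooled false-positive count.

```python
import math


def wilson_interval(k, n, z=1.959964):
    """Two-sided Wilson score interval for k positives in n trials (z=1.96 -> 95%)."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def gate_passes(k, n, gate=0.05):
    """G2-style rule: pass only if the CI upper bound is at or below the gate."""
    return wilson_interval(k, n)[1] <= gate


def max_passing_count(n, gate=0.05):
    """Largest false-positive count that still clears the gate, or None if even 0 fails."""
    best = None
    for k in range(n + 1):
        if not gate_passes(k, n, gate):
            break  # the upper bound grows with k, so later counts fail too
        best = k
    return best


if __name__ == "__main__":
    n = 1000
    k = max_passing_count(n)
    print("largest passing count:", k, "observed FPR:", k / n)
    print("FPR 0.035 passes:", gate_passes(35, n), "| FPR 0.04 passes:", gate_passes(40, n))
```

Under these assumptions an observed FPR of 0.035 clears the gate and 0.04 does not, which agrees with the design note. Running `max_passing_count` with a much smaller `n` shows why a single small-N seed cannot satisfy a CI-upper rule even with zero false positives.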

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…OBUST + statistical-claims gate

The seed-averaged falsification proved G2 specificity is not robust (Wilson 95% CI upper 0.056 >
0.05). Per the audit, the canonical truth must not headline an unqualified pass.

- CURRENT_TRUTH (schema v2): latest_validation_state=BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST;
  bonn_s2_nominal_state=PASSED_SINGLE_SEED; bonn_s2_robustness_state=NOT_ROBUST...;
  s2_seed_averaged_fpr=0.0354; s2_wilson_ci_upper=0.056; robust_gate + robust_gate_passed=false.
  Data-driven: auto-upgrades to BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED if the S3 confirmatory passes.
- FORMAL_VERDICT s1 + README first screen + STATUS generator + CLAIM_AUDIT lead with the honest state.
- tools/validate_statistical_claims.py (wired ci.yml + release-dry-run.yml): fails CI if a point
  estimate is sold as a final pass while the CI crosses the gate, or if robustness fields are absent,
  or a surface headlines a robust pass while robust_gate_passed!=true. Guard test added.
- test_current_truth_sync updated to the honest token. Governance fixpoint (CERTIFIED, 525->527).
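
The three failure conditions listed for the statistical-claims gate can be sketched as a small fail-closed check. This is an illustration only, not the contents of tools/validate_statistical_claims.py: the function name `validate_claims`, the `surfaces` mapping, and the substring match on `ROBUSTLY_PASSED` are assumptions; the field names and state tokens are the ones quoted in this commit message.

```python
# State tokens taken from the PR text; treating exactly these as "final pass" is an assumption.
FINAL_PASS_STATES = {
    "BONN_S2_BRIGHT_LINE_PASSED",
    "BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED",
}
REQUIRED_FIELDS = ("latest_validation_state", "s2_wilson_ci_upper", "robust_gate_passed")


def validate_claims(truth, surfaces, gate=0.05):
    """Return a list of violations; an empty list means the gate passes."""
    missing = [name for name in REQUIRED_FIELDS if name not in truth]
    if missing:
        # Fail closed: without the robustness fields nothing else can be judged.
        return ["robustness fields absent: " + ", ".join(missing)]

    errors = []
    robust = truth["robust_gate_passed"] is True
    sold_as_final = truth["latest_validation_state"] in FINAL_PASS_STATES
    if sold_as_final and truth["s2_wilson_ci_upper"] > gate:
        errors.append("final pass claimed while the CI upper bound crosses the gate")
    for name, text in surfaces.items():
        if "ROBUSTLY_PASSED" in text and not robust:
            errors.append(name + " headlines a robust pass but robust_gate_passed is not true")
    return errors


if __name__ == "__main__":
    truth = {
        "latest_validation_state": "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST",
        "s2_seed_averaged_fpr": 0.0354,
        "s2_wilson_ci_upper": 0.056,
        "robust_gate_passed": False,
    }
    print(validate_claims(truth, {"README": "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST"}))
    print(validate_claims(truth, {"README": "BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED"}))
```

Returning a list of violations rather than raising on the first one keeps a CI log complete; a wrapper script could exit non-zero whenever the list is non-empty.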

No over-claim. S3 seed-averaged re-confirmatory in progress (will set the robust verdict).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tile-review + mission-check

- docs/risk/BSFF_RISK_REGISTER.md: R1-R10 each with a fail-closed control + enforcing gate; open
  red risks (R1/R2 G2 not robust, R3 multi-null pending, R4 BNCI method) flagged.
- docs/risk/FAIL_CLOSED_DECISION_TABLE.md: the only allowed decision states; current = nominal/not-robust.
- docs/reviewer_packet/{HOSTILE_REVIEW_CHECKLIST,KNOWN_FAILURES}.md: reproduce-without-author surface;
  failures preserved, not hidden.
- artifacts/risk/RISK_ACCEPTANCE.json: disclosed residual (published as falsifier w/ open robustness gap).
- Makefile: `make mission-check` (full gate battery: compile+tests+selftest+evidence+truth+forbidden+
  statistical+contract+regenerate-check) and `make hostile-review`.

No silent success; no ambiguous PASS; no unbounded claim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…: state NOT flipped)

The S3 run completed all 10 seeds but crashed on JSON write (numpy bool_ not serializable). Fixed
the runner (cast np bool_/float64 -> Python; measurement logic byte-identical, lock records the
serialization-only patch with original+patched sha). Reconstructed verdict from the exact per-seed
counts in the log: G1 E=0.94, G2 FPR=0.028, Wilson 95% CI [0.019, 0.040], upper <= 0.05 -> S3 would
ROBUSTLY PASS (a flip from the N=480 calibration's CI-upper 0.056).

Per the standard "a fact is a reproducible measurement by independent witnesses", a
hand-reconstruction from a crashed run is NOT a fact. CURRENT_TRUTH stays
BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST. A clean re-run with the fixed runner is in progress; only
its authoritative artifact (reproducing these per-seed counts) will flip the canonical state.
S3_PRELIMINARY_FROM_LOG.json is marked PRELIMINARY_NOT_AUTHORITATIVE.
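
For readers who hit the same crash: `json.dumps` rejects objects such as NumPy's `bool_` because they are not instances of the built-in types it knows. A minimal serialization-only fix, assuming the values are scalars that expose `.item()` (as NumPy scalars do), is a `default=` hook. The names `to_builtin` and `dump_verdict` are hypothetical and this is not the patched runner itself.

```python
import json


def to_builtin(obj):
    """`default=` hook for json.dumps: unwrap scalar-like objects via .item().

    NumPy scalars return the equivalent built-in Python value from .item();
    anything without that method is still rejected, so unknown types are not
    silently written to the artifact.
    """
    item = getattr(obj, "item", None)
    if callable(item):
        return item()
    raise TypeError("Object of type " + type(obj).__name__ + " is not JSON serializable")


def dump_verdict(verdict):
    """Serialize a verdict dict deterministically (sorted keys, fixed indent)."""
    return json.dumps(verdict, default=to_builtin, sort_keys=True, indent=2)
```

Because the hook touches only the write step, the measurement code stays unchanged, which is the property the lock in this commit records.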

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MULTINULL_PENDING

The clean S3 re-run (fixed runner) produced the authoritative verdict and REPRODUCED the crashed
run's per-seed counts byte-for-byte (1,5,0,1,2,4,4,4,4,3) -> a reproducible fact, not a log artifact.

S3_BRIGHT_LINE_ROBUSTLY_PASSED: G1 0.94, G2 AR-null FPR 0.028, Wilson 95% CI [0.0194, 0.0402],
upper <= 0.05 (N=1000, 10 seeds, frozen lock f84ff94 before run, elapsed 7110s).
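
The reported interval can be recomputed from the per-seed counts above. The sketch assumes 100 AR-null tests per seed (N=1000 over 10 seeds) and the standard 95% Wilson score interval; `wilson_interval` is a hypothetical helper, not the repository's function.

```python
import math


def wilson_interval(k, n, z=1.959964):
    """Two-sided Wilson score interval for k positives in n trials."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Per-seed false-positive counts as listed in this commit message.
per_seed_false_positives = [1, 5, 0, 1, 2, 4, 4, 4, 4, 3]
tests_per_seed = 100  # assumption: N=1000 spread evenly over 10 seeds

k = sum(per_seed_false_positives)
n = tests_per_seed * len(per_seed_false_positives)
lower, upper = wilson_interval(k, n)

print("pooled FPR:", k / n)
print("Wilson 95% CI:", round(lower, 4), round(upper, 4))
print("G2 passes (upper <= 0.05):", upper <= 0.05)
```

Pooling before computing the interval is what makes the gate seed-averaged: individual seeds range from 0 to 5 false positives, but only the pooled count enters the decision.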

Honest intermediate canonical state (NOT an unqualified "robust"): the pre-registered seed-averaged
AR-null gate passed, but the audit's S3 definition also requires multi-null robustness, which is not
yet run. So:
- latest_validation_state = BONN_S2_SEED_ROBUST_PASS_MULTINULL_PENDING
- seed_robust_gate_passed = true; multi_null_robustness_state = NOT_DONE; robust_gate_passed = null
- FORMAL_VERDICT s1 + README + STATUS + CLAIM_AUDIT lead with seed-robust pass + multi-null pending
- generator: full ROBUSTLY_PASSED requires seed-robust AND multi-null; statistical-claims gate honors it

This supersedes the N=480 calibration (0.0354, CI-upper 0.056): the estimate is seed-set/N sensitive
near the boundary; the larger pre-registered test passes and reproduces. Governance fixpoint (CERTIFIED).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Predeclared MULTI_NULL_ROBUSTNESS_PROTOCOL. Each null family generates null DATA from real Set-A/B
signals; the unchanged S2-C1 test must NOT survive a linear null. Gate per null = seed-averaged FPR
Wilson-95-CI upper <= 0.05. IAAFT (Schreiber-Schmitz) + FT phase-randomization are standalone,
independent of the test's internal MIAAFT. Smoke (tiny-N) confirms iaaft/phaserand FPR point
estimates ~0. Full run pending -> sets multi_null_robustness_state.
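
To make "FT phase randomization" concrete, here is a minimal standard-library sketch of the textbook construction: keep every Fourier amplitude of a real signal, draw new phases, and restore conjugate symmetry so the surrogate is real. It is not the repository's generator, it does not cover IAAFT, and the names `dft`, `idft`, and `phase_randomized_surrogate` are hypothetical. A naive O(n²) DFT is used only to avoid third-party imports.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_randomized_surrogate(x, rng):
    """Return a real surrogate with the same amplitude spectrum as x."""
    n = len(x)
    original = dft(x)
    shuffled = list(original)
    # Bin 0 (mean) and, for even n, the Nyquist bin are left untouched.
    for k in range(1, (n - 1) // 2 + 1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        shuffled[k] = abs(original[k]) * cmath.exp(1j * phase)
        shuffled[n - k] = shuffled[k].conjugate()  # keeps the inverse transform real
    return [value.real for value in idft(shuffled)]


if __name__ == "__main__":
    rng = random.Random(0)
    signal = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(32)]
    surrogate = phase_randomized_surrogate(signal, rng)
    print(len(surrogate), surrogate[:3])
```

Because only the phases change, the surrogate shares the signal's power spectrum, and hence its linear autocorrelation, while any structure carried by the phases is destroyed. That is why a nonlinearity test is expected not to survive on such data.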

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ness earned)

The final gate completed cleanly (authoritative, no reconstruction): specificity is robust across
all three independent linear-null families, each seed-averaged Wilson-95-CI-upper <= 0.05:
  AR        FPR 0.026 [0.018, 0.038]
  IAAFT     FPR 0.032 [0.023, 0.045]   (standalone Schreiber-Schmitz)
  phaserand FPR 0.034 [0.024, 0.047]   (standalone FT phase randomization)

Combined with the reproduced S3 seed-averaged result, the full robust gate is satisfied:
- latest_validation_state = BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED
- seed_robust_gate_passed = true; multi_null_robustness_state = PASSED; robust_gate_passed = true
- FORMAL_VERDICT s1 + README + STATUS + CLAIM_AUDIT lead with the earned robust pass
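
The way the two gates combine into the canonical fields can be sketched as follows. This is an assumption-laden illustration, not the generator's code: `combine_gates` is a hypothetical name, the `FAILED` token does not appear in this PR, and the inputs are the CI upper bounds reported above.

```python
REQUIRED_NULLS = ("AR", "IAAFT", "phaserand")


def combine_gates(seed_robust_gate_passed, null_ci_upper, gate=0.05):
    """Return (multi_null_robustness_state, robust_gate_passed).

    robust_gate_passed is True only when the seed-averaged gate passed AND every
    required null family has a CI upper bound at or below the gate. It is None
    while the seed gate has passed but a null family is still missing.
    """
    if not all(name in null_ci_upper for name in REQUIRED_NULLS):
        state = "NOT_DONE"
    elif all(null_ci_upper[name] <= gate for name in REQUIRED_NULLS):
        state = "PASSED"
    else:
        state = "FAILED"  # hypothetical token, not taken from the PR

    if not seed_robust_gate_passed or state == "FAILED":
        return state, False
    if state == "NOT_DONE":
        return state, None
    return state, True


if __name__ == "__main__":
    reported_upper = {"AR": 0.038, "IAAFT": 0.045, "phaserand": 0.047}
    print(combine_gates(True, reported_upper))
    print(combine_gates(True, {"AR": 0.038}))
```

Keeping `None` distinct from `False` mirrors the intermediate state recorded earlier in this PR, where the seed-robust gate had passed but the multi-null run was still pending.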

The full arc: nominal single-seed pass -> falsification (seed-7 FPR 0.067) -> calibration flagged
not-robust (0.0354, CI-upper 0.056) -> larger pre-registered S3 passed and was reproduced
byte-for-byte (0.028) -> multi-null confirmed. Robustness was earned through falsification, not assumed.
Still NOT: clinical/regulatory, BNCI executed, multi-dataset replicated. Governance CERTIFIED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t_statistical_claims

Formatting-only (no behavior change); fixes lint-ruff format check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two tests hardcoded the pre-falsification token BONN_S2_BRIGHT_LINE_PASSED; the state evolved to
BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED via the falsification->S3->multi-null arc. BNCI test now asserts
the Bonn-prefix family (BNCI independently method-blocked). 515 offline tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
neuron7xLab merged commit e743c99 into main on Jun 25, 2026
43 checks passed
neuron7xLab changed the title from "Falsify + calibrate S2: NOT robustly crossed (G2 specificity CI crosses 0.05)" to "Falsify → robust: Bonn S2 bright line robustly passed (seed + multi-null)" on Jun 25, 2026