Falsify → robust: Bonn S2 bright line robustly passed (seed + multi-null) by neuron7xLab · Pull Request #86 · neuron7xLab/bsff

neuron7xLab · 2026-06-24T17:36:51Z

Final state: BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED (robust_gate_passed=true).

Full falsification→robust arc (each step artifact-backed):

- Falsification attacked the nominal single-seed pass (seed-7 AR-null FPR 0.067 > 0.05).
- Seed-averaged calibration (N=480) flagged NOT robust (FPR 0.0354, Wilson CI-upper 0.056) → canonical state honestly downgraded.
- Pre-registered S3 seed-averaged AR-null (N=1000, 10 seeds, frozen lock before run) → crashed on JSON write → did NOT flip on the crash reconstruction → clean re-run reproduced byte-for-byte → FPR 0.028, CI [0.019, 0.040].
- Multi-null gate (AR 0.026 / IAAFT 0.032 / phase-randomized 0.034, all Wilson CI-upper ≤ 0.05) → robust to seed AND null-model.

Also: validate_statistical_claims.py CI gate (no point-estimate-as-pass), NIST risk register R1–R10, fail-closed table, hostile-review packet, make mission-check. Data-driven truth: full robust claimed only after S3 ∧ multi-null artifacts proved it.

Still NOT: clinical/regulatory, BNCI executed (BNCI_BLOCKED_METHOD), multi-dataset replicated (NOT_DONE).

🤖 Generated with Claude Code

…-sensitive)
Falsification-first: actively attacked BONN_S2_BRIGHT_LINE_PASSED (4 seed perturbations + 3 AR-order misspecifications, N=30, 199 surrogates).
Result S2_FRAGILE_under_attack:
- G1 power ROBUST: Set E SURVIVED 0.967 under every seed and AR order.
- G2 specificity BOUNDARY: AR-null FPR 0.0-0.067; seed_base=7 gave 0.067 > 0.05.
Calibrated (not defended): BONN_S2_BRIGHT_LINE_PASSED is a marginal/boundary pass. It cleared the predeclared N=100 confirmatory (FPR 0.02), but the specificity margin is thin and seed-sensitive (Wilson 95% CI of 0.02 reaches ~0.05).
Integrated into the truth-system: CURRENT_TRUTH.s2_robustness=BOUNDARY_PASS_G1_POWER_ROBUST_G2_SPECIFICITY_SEED_SENSITIVE; caveats in FORMAL_VERDICT + STATISTIC_REGISTRY + CLAIM_AUDIT.
Honest next step: seed-averaged / larger-N specificity confirmatory.
Governance regenerated to fixpoint (CERTIFIED). No over-claim.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector · 2026-06-24T17:36:56Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

…bright line not robustly crossed
Continuing the falsification: a seed-averaged specificity calibration (480 AR-null tests, 6 seeds x Sets A+B) gives pooled FPR 0.0354, Wilson 95% CI [0.022, 0.056]. The CI UPPER bound (0.056) EXCEEDS the 0.05 gate; 2/6 seeds gave FPR > 0.05 (0.075, 0.0625). The predeclared confirmatory FPR=0.02 (seed 20260623) was a favorable-seed point estimate.
Calibrated (truth over comfort): G1 power robust; G2 specificity NOT robust; the Bonn S2 bright line is a MARGINAL/favorable-seed pass, NOT robustly crossed.
Integrated: CURRENT_TRUTH.s2_robustness=NOT_ROBUST_G2_SPECIFICITY...CI_[0.0222,0.056]_crosses_0.05; FORMAL_VERDICT section 1 + README + STATISTIC_REGISTRY + CLAIM_AUDIT updated to lead with the non-robustness.
Honest next step: re-preregister a seed-averaged specificity gate (FPR CI upper <= 0.05) and re-run.
Artifact: S2_SPECIFICITY_CALIBRATION.json. Governance fixpoint (CERTIFIED). No over-claim.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
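The interval quoted in this commit can be checked by hand. The sketch below computes a two-sided Wilson score interval for a pooled false-positive count. The count of 17 false positives is an assumption inferred from 0.0354 × 480 (the commit states only the rate), and `wilson_interval` is an illustrative name, not a function from the repository.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for k events in n trials (z=1.96 -> 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed count: 17 false positives in 480 AR-null tests (17/480 ~= 0.0354).
low, high = wilson_interval(17, 480)
gate_passed = high <= 0.05  # the gate compares the CI upper bound, not the point estimate
print(f"FPR={17 / 480:.4f} CI=[{low:.4f}, {high:.4f}] gate_passed={gate_passed}")
```

Under that assumed count the upper bound lands just above 0.05, which is why the point estimate alone (0.0354 < 0.05) is not treated as a pass.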

… FROZEN before run)
Gate: G1 seed-avg Set-E SURVIVED>=0.80 AND G2 pooled AR-null FPR Wilson-95-CI-upper<=0.05 (stricter than a point estimate, which is the failure the falsification exposed).
10 seeds, N=50, 199 surrogates, statistic S2-C1 unchanged. No tuning after results. Run pending.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…null robustness (predeclared)
- S3_DECISION_RULE.md: G2 framed as a one-sided test of H0 (true FPR>=0.05); PASS = Wilson 95% CI-upper <= 0.05 (rejects H0 at alpha=0.025). Strictly stronger than the S2 point-estimate rule a favorable seed can satisfy.
- S3_DESIGN_POWER.json: at N=1000 the gate passes only if observed FPR <= ~0.035 (CI-upper 0.048) and fails at >=0.04; the calibration 0.0354 sits at the resolution boundary -> design adequate.
- MULTI_NULL_ROBUSTNESS_PROTOCOL.md: predeclared specificity check across AR/IAAFT/phase-randomized nulls (null model is a researcher DOF); robust only if it survives every null model.
All frozen before the S3 run completes. No tuning after results.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
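The design-power statement (passes near FPR 0.035, fails at 0.04 for N=1000) follows directly from the CI-upper rule. A minimal sketch, assuming the gate is exactly "Wilson 95% upper bound <= 0.05"; the function names are illustrative and not taken from S3_DESIGN_POWER.json:

```python
import math

GATE = 0.05
Z = 1.959964  # two-sided 95% normal quantile


def wilson_upper(k: int, n: int, z: float = Z) -> float:
    """Upper bound of the Wilson score interval for k events in n trials."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + half) / denom


def max_passing_count(n: int, gate: float = GATE) -> int:
    """Largest false-positive count whose upper bound is <= gate, or -1 if none."""
    best = -1
    for k in range(n + 1):
        if wilson_upper(k, n) > gate:
            break  # the upper bound only grows as k increases
        best = k
    return best


for k in (28, 35, 40):
    print(k, round(wilson_upper(k, 1000), 4), wilson_upper(k, 1000) <= GATE)
print("max passing count at N=1000:", max_passing_count(1000))
```

The same helper shows why a small-N run cannot satisfy this rule at all: with very few trials even zero false positives leaves the upper bound above 0.05, so `max_passing_count` returns -1.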

…OBUST + statistical-claims gate
The seed-averaged falsification proved G2 specificity is not robust (Wilson 95% CI upper 0.056 > 0.05). Per the audit, the canonical truth must not headline an unqualified pass.
- CURRENT_TRUTH (schema v2): latest_validation_state=BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST; bonn_s2_nominal_state=PASSED_SINGLE_SEED; bonn_s2_robustness_state=NOT_ROBUST...; s2_seed_averaged_fpr=0.0354; s2_wilson_ci_upper=0.056; robust_gate + robust_gate_passed=false. Data-driven: auto-upgrades to BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED if the S3 confirmatory passes.
- FORMAL_VERDICT s1 + README first screen + STATUS generator + CLAIM_AUDIT lead with the honest state.
- tools/validate_statistical_claims.py (wired ci.yml + release-dry-run.yml): fails CI if a point estimate is sold as a final pass while the CI crosses the gate, or if robustness fields are absent, or a surface headlines a robust pass while robust_gate_passed!=true. Guard test added.
- test_current_truth_sync updated to the honest token.
Governance fixpoint (CERTIFIED, 525->527). No over-claim. S3 seed-averaged re-confirmatory in progress (will set the robust verdict).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
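The three failure conditions listed for the statistical-claims gate can be sketched as a small checker. This is not the code in tools/validate_statistical_claims.py; `check_claims`, the choice of required fields, and the heuristic "a state ending in `_PASSED` is a final pass" are all assumptions made for illustration, using field names that appear in the commit message.

```python
from typing import Any

GATE = 0.05
# Assumed minimal set of robustness fields; the real schema may require more.
REQUIRED_FIELDS = ("latest_validation_state", "s2_wilson_ci_upper", "robust_gate_passed")


def check_claims(truth: dict[str, Any], headline: str) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    missing = [name for name in REQUIRED_FIELDS if name not in truth]
    if missing:
        return [f"robustness fields absent: {missing}"]

    errors = []
    # Assumed heuristic: a state token ending in _PASSED is sold as a final pass.
    final_pass = truth["latest_validation_state"].endswith("_PASSED")
    if final_pass and truth["s2_wilson_ci_upper"] > GATE:
        errors.append("final pass claimed while the CI upper bound crosses the gate")
    if "ROBUSTLY_PASSED" in headline and truth["robust_gate_passed"] is not True:
        errors.append("surface headlines a robust pass while robust_gate_passed != true")
    return errors


honest = {
    "latest_validation_state": "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST",
    "s2_wilson_ci_upper": 0.056,
    "robust_gate_passed": False,
}
overclaim = dict(honest, latest_validation_state="BONN_S2_BRIGHT_LINE_PASSED")

print(check_claims(honest, "Bonn S2: nominal pass, G2 not robust"))
print(check_claims(overclaim, "BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED"))
```

Returning the full list of violations rather than stopping at the first one keeps a CI log readable when several surfaces drift at once.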

…tile-review + mission-check
- docs/risk/BSFF_RISK_REGISTER.md: R1-R10 each with a fail-closed control + enforcing gate; open red risks (R1/R2 G2 not robust, R3 multi-null pending, R4 BNCI method) flagged.
- docs/risk/FAIL_CLOSED_DECISION_TABLE.md: the only allowed decision states; current = nominal/not-robust.
- docs/reviewer_packet/{HOSTILE_REVIEW_CHECKLIST,KNOWN_FAILURES}.md: reproduce-without-author surface; failures preserved, not hidden.
- artifacts/risk/RISK_ACCEPTANCE.json: disclosed residual (published as falsifier w/ open robustness gap).
- Makefile: `make mission-check` (full gate battery: compile+tests+selftest+evidence+truth+forbidden+statistical+contract+regenerate-check) and `make hostile-review`.
No silent success; no ambiguous PASS; no unbounded claim.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…: state NOT flipped)
The S3 run completed all 10 seeds but crashed on JSON write (numpy bool_ not serializable). Fixed the runner (cast np bool_/float64 -> Python; measurement logic byte-identical, lock records the serialization-only patch with original+patched sha).
Reconstructed verdict from the exact per-seed counts in the log: G1 E=0.94, G2 FPR=0.028, Wilson 95% CI [0.019, 0.040], upper <= 0.05 -> S3 would ROBUSTLY PASS (a flip from the N=480 calibration's CI-upper 0.056).
Per the standard "a fact is a reproducible measurement by independent witnesses", a hand-reconstruction from a crashed run is NOT a fact. CURRENT_TRUTH stays BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST. A clean re-run with the fixed runner is in progress; only its authoritative artifact (reproducing these per-seed counts) will flip the canonical state. S3_PRELIMINARY_FROM_LOG.json is marked PRELIMINARY_NOT_AUTHORITATIVE.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
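The crash described here is a common one: `json` refuses NumPy's `bool_` because it is not a subclass of Python's `bool`. The commit says the runner was patched with explicit casts; the sketch below shows a different but related technique, a `default=` hook that unwraps any scalar exposing `.item()` (as NumPy scalars do). To stay dependency-free it uses a stand-in class instead of importing NumPy; `to_builtin` and `FakeScalar` are hypothetical names.

```python
import json


def to_builtin(obj):
    """json.dumps default hook: unwrap scalar-like objects through .item()."""
    item = getattr(obj, "item", None)
    if callable(item):
        return item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeScalar:
    """Stand-in for a NumPy scalar such as np.bool_ (illustration only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


verdict = {"g2_passed": FakeScalar(True), "fpr": FakeScalar(0.028)}

try:
    json.dumps(verdict)  # no hook: fails the way the original runner did
    crashed = False
except TypeError:
    crashed = True

text = json.dumps(verdict, default=to_builtin, sort_keys=True)
print(crashed, text)
```

A hook of this kind changes only serialization, so it is compatible with the commit's requirement that the measurement logic stay byte-identical.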

…MULTINULL_PENDING
The clean S3 re-run (fixed runner) produced the authoritative verdict and REPRODUCED the crashed run's per-seed counts byte-for-byte (1,5,0,1,2,4,4,4,4,3) -> a reproducible fact, not a log artifact.
S3_BRIGHT_LINE_ROBUSTLY_PASSED: G1 0.94, G2 AR-null FPR 0.028, Wilson 95% CI [0.0194, 0.0402], upper <= 0.05 (N=1000, 10 seeds, frozen lock f84ff94 before run, elapsed 7110s).
Honest intermediate canonical state (NOT an unqualified "robust"): the pre-registered seed-averaged AR-null gate passed, but the audit's S3 definition also requires multi-null robustness, which is not yet run. So:
- latest_validation_state = BONN_S2_SEED_ROBUST_PASS_MULTINULL_PENDING
- seed_robust_gate_passed = true; multi_null_robustness_state = NOT_DONE; robust_gate_passed = null
- FORMAL_VERDICT s1 + README + STATUS + CLAIM_AUDIT lead with seed-robust pass + multi-null pending
- generator: full ROBUSTLY_PASSED requires seed-robust AND multi-null; statistical-claims gate honors it
This supersedes the N=480 calibration (0.0354, CI-upper 0.056): the estimate is seed-set/N sensitive near the boundary; the larger pre-registered test passes and reproduces.
Governance fixpoint (CERTIFIED).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
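The generator rule stated here (full ROBUSTLY_PASSED requires seed-robust AND multi-null, with a null verdict while multi-null is pending) can be written as a three-valued decision. A minimal sketch using the state tokens from this thread; `derive_state` is a hypothetical name, and treating any unrecognised multi-null value as not robust is an assumption in the fail-closed spirit of the PR, not something the commit specifies.

```python
from typing import Optional


def derive_state(
    seed_robust_gate_passed: bool, multi_null_robustness_state: str
) -> tuple[str, Optional[bool]]:
    """Return (latest_validation_state, robust_gate_passed)."""
    if not seed_robust_gate_passed:
        return "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST", False
    if multi_null_robustness_state == "NOT_DONE":
        # Seed gate passed, multi-null not yet run: verdict is undecided (null).
        return "BONN_S2_SEED_ROBUST_PASS_MULTINULL_PENDING", None
    if multi_null_robustness_state == "PASSED":
        return "BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED", True
    # Assumption: anything else (e.g. a failed multi-null run) fails closed.
    return "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST", False


print(derive_state(False, "NOT_DONE"))
print(derive_state(True, "NOT_DONE"))
print(derive_state(True, "PASSED"))
```

Keeping `robust_gate_passed` as `None` rather than `False` during the pending phase distinguishes "not yet measured" from "measured and failed".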

Predeclared MULTI_NULL_ROBUSTNESS_PROTOCOL. Each null family generates null DATA from real Set-A/B signals; the unchanged S2-C1 test must NOT survive a linear null. Gate per null = seed-averaged FPR Wilson-95-CI upper <= 0.05. IAAFT (Schreiber-Schmitz) + FT phase-randomization are standalone, independent of the test's internal MIAAFT.
Smoke (tiny-N) confirms iaaft/phaserand FPR point estimates ~0. Full run pending -> sets multi_null_robustness_state.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
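For readers unfamiliar with the FT phase-randomized null, the idea is to keep each Fourier amplitude of the real signal and draw fresh random phases, which preserves the power spectrum (the linear structure) while destroying nonlinear structure. The sketch below is a textbook version with a naive O(n²) DFT so it needs only the standard library; it is not the repository's generator, and every name in it is illustrative.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (fine for short demo signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def phase_randomized_surrogate(x, rng):
    """Surrogate with the same Fourier amplitudes as x but random phases."""
    n = len(x)
    spectrum = dft(x)
    shuffled = list(spectrum)  # DC bin (and Nyquist bin for even n) stay untouched
    for k in range(1, (n - 1) // 2 + 1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        shuffled[k] = abs(spectrum[k]) * cmath.exp(1j * phase)
        shuffled[n - k] = shuffled[k].conjugate()  # keeps the output real-valued
    return [value.real for value in idft(shuffled)]


signal = [
    math.sin(2 * math.pi * 3 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16) + 0.1 * t
    for t in range(16)
]
surrogate = phase_randomized_surrogate(signal, random.Random(7))
max_amp_error = max(
    abs(abs(a) - abs(b)) for a, b in zip(dft(signal), dft(surrogate))
)
print(f"max amplitude difference: {max_amp_error:.2e}")
```

IAAFT extends this by iteratively restoring the original amplitude distribution as well; that refinement is omitted here.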

…ness earned)
The final gate completed cleanly (authoritative, no reconstruction): specificity is robust across all three independent linear-null families, each seed-averaged Wilson-95-CI-upper <= 0.05:
  AR         FPR 0.026 [0.018, 0.038]
  IAAFT      FPR 0.032 [0.023, 0.045] (standalone Schreiber-Schmitz)
  phaserand  FPR 0.034 [0.024, 0.047] (standalone FT phase randomization)
Combined with the reproduced S3 seed-averaged result, the full robust gate is satisfied:
- latest_validation_state = BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED
- seed_robust_gate_passed = true; multi_null_robustness_state = PASSED; robust_gate_passed = true
- FORMAL_VERDICT s1 + README + STATUS + CLAIM_AUDIT lead with the earned robust pass
The full arc: nominal single-seed pass -> falsification (seed-7 FPR 0.067) -> calibration flagged not-robust (0.0354, CI-upper 0.056) -> larger pre-registered S3 passed and was reproduced byte-for-byte (0.028) -> multi-null confirmed. Robustness was earned through falsification, not assumed.
Still NOT: clinical/regulatory, BNCI executed, multi-dataset replicated. Governance CERTIFIED.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t_statistical_claims
Formatting-only (no behavior change); fixes lint-ruff format check.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two tests hardcoded the pre-falsification token BONN_S2_BRIGHT_LINE_PASSED; the state evolved to BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED via the falsification->S3->multi-null arc. BNCI test now asserts the Bonn-prefix family (BNCI independently method-blocked). 515 offline tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

neuron7xLab changed the title ~~Falsify + calibrate S2: boundary pass (power robust, specificity seed-sensitive)~~ Falsify + calibrate S2: NOT robustly crossed (G2 specificity CI crosses 0.05) Jun 24, 2026

neuron7xLab and others added 10 commits June 24, 2026 21:50

ruff format: generate_current_truth, validate_statistical_claims, tes…

6eae021


neuron7xLab merged commit e743c99 into main Jun 25, 2026
43 checks passed

neuron7xLab changed the title ~~Falsify + calibrate S2: NOT robustly crossed (G2 specificity CI crosses 0.05)~~ Falsify → robust: Bonn S2 bright line robustly passed (seed + multi-null) Jun 25, 2026

neuron7xLab merged 12 commits into main from falsify/s2-robustness
