Falsify → robust: Bonn S2 bright line robustly passed (seed + multi-null) #86
Merged
…-sensitive)
Falsification-first: actively attacked BONN_S2_BRIGHT_LINE_PASSED (4 seed perturbations + 3 AR-order misspecifications, N=30, 199 surrogates). Result S2_FRAGILE_under_attack:
- G1 power ROBUST: Set E SURVIVED 0.967 under every seed and AR order.
- G2 specificity BOUNDARY: AR-null FPR 0.0-0.067; seed_base=7 gave 0.067 > 0.05.
Calibrated (not defended): BONN_S2_BRIGHT_LINE_PASSED is a marginal/boundary pass. It cleared the predeclared N=100 confirmatory (FPR 0.02), but the specificity margin is thin and seed-sensitive (Wilson 95% CI of 0.02 reaches ~0.05).
Integrated into the truth-system: CURRENT_TRUTH.s2_robustness=BOUNDARY_PASS_G1_POWER_ROBUST_G2_SPECIFICITY_SEED_SENSITIVE; caveats in FORMAL_VERDICT + STATISTIC_REGISTRY + CLAIM_AUDIT.
Honest next step: seed-averaged / larger-N specificity confirmatory. Governance regenerated to fixpoint (CERTIFIED). No over-claim.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bright line not robustly crossed Continuing the falsification: a seed-averaged specificity calibration (480 AR-null tests, 6 seeds x Sets A+B) gives pooled FPR 0.0354, Wilson 95% CI [0.022, 0.056]. The CI UPPER bound (0.056) EXCEEDS the 0.05 gate; 2/6 seeds gave FPR > 0.05 (0.075, 0.0625). The predeclared confirmatory FPR=0.02 (seed 20260623) was a favorable-seed point estimate. Calibrated (truth over comfort): G1 power robust; G2 specificity NOT robust; the Bonn S2 bright line is a MARGINAL/favorable-seed pass, NOT robustly crossed. Integrated: CURRENT_TRUTH.s2_robustness=NOT_ROBUST_G2_SPECIFICITY...CI_[0.0222,0.056]_crosses_0.05; FORMAL_VERDICT section 1 + README + STATISTIC_REGISTRY + CLAIM_AUDIT updated to lead with the non-robustness. Honest next step: re-preregister a seed-averaged specificity gate (FPR CI upper <= 0.05) and re-run. Artifact: S2_SPECIFICITY_CALIBRATION.json. Governance fixpoint (CERTIFIED). No over-claim. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
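The interval quoted in this commit can be checked with a minimal sketch of the Wilson score interval. The function name `wilson_interval` is hypothetical, and the count of 17 false positives is an inference from the reported pooled FPR (0.0354 × 480 ≈ 17), not a number stated in the commit.

```python
import math


def wilson_interval(k, n, z=1.959964):
    """Wilson score interval for k events in n trials (no continuity correction)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 17 false positives out of 480 AR-null tests (inferred from FPR 0.0354).
lo, hi = wilson_interval(17, 480)
gate = 0.05
print(f"FPR={17 / 480:.4f}  CI=[{lo:.4f}, {hi:.4f}]  upper<=gate: {hi <= gate}")
```

Under that assumption the upper bound lands above the 0.05 gate even though the point estimate sits below it, which is the distinction the commit draws between a point-estimate pass and a robust pass.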
… FROZEN before run) Gate: G1 seed-avg Set-E SURVIVED>=0.80 AND G2 pooled AR-null FPR Wilson-95-CI-upper<=0.05 (stricter than a point estimate — the failure the falsification exposed). 10 seeds, N=50, 199 surrogates, statistic S2-C1 unchanged. No tuning after results. Run pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…null robustness (predeclared)
- S3_DECISION_RULE.md: G2 framed as a one-sided test of H0 (true FPR>=0.05); PASS = Wilson 95% CI-upper <= 0.05 (rejects H0 at alpha=0.025). Strictly stronger than the S2 point-estimate rule a favorable seed can satisfy.
- S3_DESIGN_POWER.json: at N=1000 the gate passes only if observed FPR <= ~0.035 (CI-upper 0.048) and fails at >=0.04; the calibration 0.0354 sits at the resolution boundary -> design adequate.
- MULTI_NULL_ROBUSTNESS_PROTOCOL.md: predeclared specificity check across AR/IAAFT/phase-randomized nulls (null model is a researcher DOF); robust only if it survives every null model.
All frozen before the S3 run completes. No tuning after results.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
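The design-power statement can be explored with a short sketch that searches for the largest false-positive count whose Wilson 95% upper bound still clears the gate. The helper names `wilson_upper` and `max_passing_count` are hypothetical, and the sketch assumes the standard Wilson interval without continuity correction; the actual S3_DESIGN_POWER.json computation may differ.

```python
import math


def wilson_upper(k, n, z=1.959964):
    """Upper limit of the Wilson score interval for k events in n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center + half


def max_passing_count(n, gate=0.05):
    """Largest count k with wilson_upper(k, n) <= gate, or -1 if none passes."""
    best = -1
    for k in range(n + 1):
        if wilson_upper(k, n) <= gate:
            best = k
        else:
            break  # the upper bound grows with k, so later counts fail too
    return best


n = 1000
k_max = max_passing_count(n)
print(f"N={n}: gate passes for observed FPR <= {k_max / n:.3f}")
print(f"CI-upper at FPR 0.035: {wilson_upper(35, n):.3f}")
print(f"CI-upper at FPR 0.040: {wilson_upper(40, n):.3f}")
```

The cutoff this produces is close to the "~0.035" quoted above, and the calibration estimate of 0.0354 sits right next to it, which is why the commit describes it as the resolution boundary.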
…OBUST + statistical-claims gate
The seed-averaged falsification proved G2 specificity is not robust (Wilson 95% CI upper 0.056 > 0.05). Per the audit, the canonical truth must not headline an unqualified pass.
- CURRENT_TRUTH (schema v2): latest_validation_state=BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST; bonn_s2_nominal_state=PASSED_SINGLE_SEED; bonn_s2_robustness_state=NOT_ROBUST...; s2_seed_averaged_fpr=0.0354; s2_wilson_ci_upper=0.056; robust_gate + robust_gate_passed=false. Data-driven: auto-upgrades to BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED if the S3 confirmatory passes.
- FORMAL_VERDICT s1 + README first screen + STATUS generator + CLAIM_AUDIT lead with the honest state.
- tools/validate_statistical_claims.py (wired ci.yml + release-dry-run.yml): fails CI if a point estimate is sold as a final pass while the CI crosses the gate, or if robustness fields are absent, or a surface headlines a robust pass while robust_gate_passed!=true. Guard test added.
- test_current_truth_sync updated to the honest token.
Governance fixpoint (CERTIFIED, 525->527). No over-claim. S3 seed-averaged re-confirmatory in progress (will set the robust verdict).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
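The three failure conditions listed for the statistical-claims gate could be sketched roughly as follows. This is an illustration only, not the contents of tools/validate_statistical_claims.py: the function `check_claims`, the `headlines` mapping, and the rule that an "unqualified pass" is any state ending in `_PASSED` are all assumptions made for the example. Field names are taken from the commit message.

```python
GATE = 0.05
REQUIRED_FIELDS = (
    "bonn_s2_robustness_state",
    "s2_seed_averaged_fpr",
    "s2_wilson_ci_upper",
    "robust_gate_passed",
)


def check_claims(truth, headlines):
    """Return a list of violations; an empty list means the gate passes.

    truth:     dict shaped like CURRENT_TRUTH (assumed layout).
    headlines: hypothetical mapping of surface name -> headline text.
    """
    errors = []

    # Rule: robustness fields must be present.
    missing = [f for f in REQUIRED_FIELDS if f not in truth]
    if missing:
        errors.append(f"missing robustness fields: {missing}")

    # Rule: no point estimate sold as a final pass while the CI crosses the gate.
    state = truth.get("latest_validation_state", "")
    ci_upper = truth.get("s2_wilson_ci_upper")
    if state.endswith("_PASSED") and ci_upper is not None and ci_upper > GATE:
        errors.append(f"{state} claimed while CI upper {ci_upper} > {GATE}")

    # Rule: no surface may headline a robust pass unless robust_gate_passed is true.
    if truth.get("robust_gate_passed") is not True:
        for surface, text in headlines.items():
            if "ROBUSTLY_PASSED" in text:
                errors.append(f"{surface} headlines a robust pass without the gate")

    return errors


honest = {
    "latest_validation_state": "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST",
    "bonn_s2_robustness_state": "NOT_ROBUST...",  # value elided in the source
    "s2_seed_averaged_fpr": 0.0354,
    "s2_wilson_ci_upper": 0.056,
    "robust_gate_passed": False,
}
print(check_claims(honest, {"README": "Bonn S2: nominal pass, G2 not robust"}))
```

A CI wrapper would exit non-zero whenever the returned list is non-empty, so an over-claim fails the build instead of relying on reviewer attention.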
…tile-review + mission-check
- docs/risk/BSFF_RISK_REGISTER.md: R1-R10 each with a fail-closed control + enforcing gate; open
red risks (R1/R2 G2 not robust, R3 multi-null pending, R4 BNCI method) flagged.
- docs/risk/FAIL_CLOSED_DECISION_TABLE.md: the only allowed decision states; current = nominal/not-robust.
- docs/reviewer_packet/{HOSTILE_REVIEW_CHECKLIST,KNOWN_FAILURES}.md: reproduce-without-author surface;
failures preserved, not hidden.
- artifacts/risk/RISK_ACCEPTANCE.json: disclosed residual (published as falsifier w/ open robustness gap).
- Makefile: `make mission-check` (full gate battery: compile+tests+selftest+evidence+truth+forbidden+
statistical+contract+regenerate-check) and `make hostile-review`.
No silent success; no ambiguous PASS; no unbounded claim.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…: state NOT flipped)
The S3 run completed all 10 seeds but crashed on JSON write (numpy bool_ not serializable). Fixed the runner (cast np bool_/float64 -> Python; measurement logic byte-identical, lock records the serialization-only patch with original+patched sha). Reconstructed verdict from the exact per-seed counts in the log: G1 E=0.94, G2 FPR=0.028, Wilson 95% CI [0.019, 0.040], upper <= 0.05 -> S3 would ROBUSTLY PASS (a flip from the N=480 calibration's CI-upper 0.056).
Per the standard "a fact is a reproducible measurement by independent witnesses", a hand-reconstruction from a crashed run is NOT a fact. CURRENT_TRUTH stays BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST. A clean re-run with the fixed runner is in progress; only its authoritative artifact (reproducing these per-seed counts) will flip the canonical state. S3_PRELIMINARY_FROM_LOG.json is marked PRELIMINARY_NOT_AUTHORITATIVE.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
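One common way to avoid this class of crash is a `default` hook for `json.dumps`, sketched below. This is not the runner's actual patch (the commit says it casts values explicitly); `to_builtin` and `dump_verdict` are hypothetical names. The hook relies on the `tolist()` method that NumPy arrays and NumPy scalars provide, so the block itself needs no NumPy import.

```python
import json


def to_builtin(obj):
    """json `default` hook: convert NumPy-style values to plain Python types.

    json.dumps calls this only for objects it cannot serialize itself,
    such as numpy.bool_ or numpy.ndarray.
    """
    if hasattr(obj, "tolist"):  # NumPy scalars and arrays both expose tolist()
        return obj.tolist()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def dump_verdict(verdict):
    """Serialize a verdict dict, tolerating NumPy scalars in its values."""
    return json.dumps(verdict, default=to_builtin, sort_keys=True)
```

Calling `dump_verdict({"gate_passed": flag})` with `flag` a `numpy.bool_` would then emit a JSON boolean rather than raising `TypeError`. Keeping the conversion at the serialization boundary leaves the measurement code untouched, which matches the commit's "serialization-only patch" constraint.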
…MULTINULL_PENDING
The clean S3 re-run (fixed runner) produced the authoritative verdict and REPRODUCED the crashed run's per-seed counts byte-for-byte (1,5,0,1,2,4,4,4,4,3) -> a reproducible fact, not a log artifact. S3_BRIGHT_LINE_ROBUSTLY_PASSED: G1 0.94, G2 AR-null FPR 0.028, Wilson 95% CI [0.0194, 0.0402], upper <= 0.05 (N=1000, 10 seeds, frozen lock f84ff94 before run, elapsed 7110s).
Honest intermediate canonical state (NOT an unqualified "robust"): the pre-registered seed-averaged AR-null gate passed, but the audit's S3 definition also requires multi-null robustness, which is not yet run. So:
- latest_validation_state = BONN_S2_SEED_ROBUST_PASS_MULTINULL_PENDING
- seed_robust_gate_passed = true; multi_null_robustness_state = NOT_DONE; robust_gate_passed = null
- FORMAL_VERDICT s1 + README + STATUS + CLAIM_AUDIT lead with seed-robust pass + multi-null pending
- generator: full ROBUSTLY_PASSED requires seed-robust AND multi-null; statistical-claims gate honors it
This supersedes the N=480 calibration (0.0354, CI-upper 0.056): the estimate is seed-set/N sensitive near the boundary; the larger pre-registered test passes and reproduces. Governance fixpoint (CERTIFIED).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
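The generator rule described here (full ROBUSTLY_PASSED requires seed-robust AND multi-null, with `robust_gate_passed` left null while multi-null is pending) might look like the sketch below. The function `derive_state` is hypothetical, and mapping a failed seed gate back to the earlier not-robust token is an assumption; combinations the commits do not describe raise instead of guessing.

```python
def derive_state(seed_robust_gate_passed, multi_null_robustness_state):
    """Return (latest_validation_state, robust_gate_passed).

    robust_gate_passed is tri-state: True, False, or None (undecided).
    """
    if not seed_robust_gate_passed:
        # Assumption: a failed seed-averaged gate keeps the earlier honest token.
        return "BONN_NOMINAL_S2_PASS_BUT_G2_NOT_ROBUST", False
    if multi_null_robustness_state == "NOT_DONE":
        # Seed gate passed, multi-null not yet run: no robust verdict either way.
        return "BONN_S2_SEED_ROBUST_PASS_MULTINULL_PENDING", None
    if multi_null_robustness_state == "PASSED":
        return "BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED", True
    # Other combinations are not specified in the source; fail closed.
    raise ValueError(f"unhandled state: {multi_null_robustness_state!r}")


print(derive_state(True, "NOT_DONE"))
```

Using `None` rather than `False` for the pending case keeps "not yet measured" distinct from "measured and failed", so a downstream gate that requires `robust_gate_passed is True` still blocks a robust headline.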
Predeclared MULTI_NULL_ROBUSTNESS_PROTOCOL. Each null family generates null DATA from real Set-A/B signals; the unchanged S2-C1 test must NOT survive a linear null. Gate per null = seed-averaged FPR Wilson-95-CI upper <= 0.05. IAAFT (Schreiber-Schmitz) + FT phase-randomization are standalone, independent of the test's internal MIAAFT. Smoke (tiny-N) confirms iaaft/phaserand FPR point estimates ~0. Full run pending -> sets multi_null_robustness_state. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ness earned)
The final gate completed cleanly (authoritative, no reconstruction): specificity is robust across all three independent linear-null families, each seed-averaged Wilson-95-CI-upper <= 0.05:
- AR: FPR 0.026 [0.018, 0.038]
- IAAFT: FPR 0.032 [0.023, 0.045] (standalone Schreiber-Schmitz)
- phaserand: FPR 0.034 [0.024, 0.047] (standalone FT phase randomization)
Combined with the reproduced S3 seed-averaged result, the full robust gate is satisfied:
- latest_validation_state = BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED
- seed_robust_gate_passed = true; multi_null_robustness_state = PASSED; robust_gate_passed = true
- FORMAL_VERDICT s1 + README + STATUS + CLAIM_AUDIT lead with the earned robust pass
The full arc: nominal single-seed pass -> falsification (seed-7 FPR 0.067) -> calibration flagged not-robust (0.0354, CI-upper 0.056) -> larger pre-registered S3 passed and was reproduced byte-for-byte (0.028) -> multi-null confirmed. Robustness was earned through falsification, not assumed.
Still NOT: clinical/regulatory, BNCI executed, multi-dataset replicated. Governance CERTIFIED.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
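The "survives every null model" rule reduces to an all-of check over the per-family CI upper bounds. The sketch below uses the intervals reported in this commit; `all_nulls_pass` and the dictionary layout are hypothetical.

```python
GATE = 0.05

# (FPR, CI lower, CI upper) per null family, as reported in the commit message.
reported = {
    "AR": (0.026, 0.018, 0.038),
    "IAAFT": (0.032, 0.023, 0.045),
    "phaserand": (0.034, 0.024, 0.047),
}


def all_nulls_pass(results, gate=GATE):
    """True only if every null family's Wilson CI upper bound is <= gate."""
    if not results:
        return False  # fail closed: no evidence is not a pass
    return all(upper <= gate for _fpr, _lower, upper in results.values())


for family, (fpr, lower, upper) in reported.items():
    print(f"{family:10s} FPR={fpr:.3f} CI=[{lower:.3f}, {upper:.3f}] pass={upper <= GATE}")
print("multi-null gate:", all_nulls_pass(reported))
```

Requiring the conjunction rather than a pooled estimate means a single weak null family cannot be averaged away, which is the point of treating the null model as a researcher degree of freedom.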
…t_statistical_claims Formatting-only (no behavior change); fixes lint-ruff format check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two tests hardcoded the pre-falsification token BONN_S2_BRIGHT_LINE_PASSED; the state evolved to BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED via the falsification->S3->multi-null arc. BNCI test now asserts the Bonn-prefix family (BNCI independently method-blocked). 515 offline tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final state: `BONN_S2_BRIGHT_LINE_ROBUSTLY_PASSED` (`robust_gate_passed=true`).
Full falsification→robust arc (each step artifact-backed):
Also: `validate_statistical_claims.py` CI gate (no point-estimate-as-pass), NIST risk register R1–R10, fail-closed table, hostile-review packet, `make mission-check`. Data-driven truth: full robust claimed only after S3 ∧ multi-null artifacts proved it.
Still NOT: clinical/regulatory, BNCI executed (`BNCI_BLOCKED_METHOD`), multi-dataset replicated (`NOT_DONE`).
🤖 Generated with Claude Code