v3.46 de-overfit: fix probes 4.22/4.23/4.24 (held-out prompts / paraphrase queries / 4 domains); audit on v3.44-Trained ckpt = 19/26 #20
Draft
FluffyAIcode wants to merge 3 commits into main
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis
Delta vs v3.42 (untrained 17/26):
- FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
- PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
- Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25
First 26-case run to exceed the 17+/-1 eval-time plateau.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass
Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)
Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
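The startup behaviour described in this commit could look roughly like the sketch below. It is not the runner's actual code: the function name `configure_determinism` and its `environ` parameter are hypothetical, and only the environment variable name and the two torch calls come from the commit message. torch is imported lazily, so the hook does nothing when the flag is unset.

```python
import os
from typing import Mapping, Optional


def configure_determinism(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Apply deterministic torch settings when AMS_DETERMINISTIC=1.

    Hypothetical helper: returns True if the settings were applied,
    False if the flag is absent or has any other value.
    """
    env = os.environ if environ is None else environ
    if env.get("AMS_DETERMINISTIC") != "1":
        return False
    # Imported only when needed, so the flag-off path has no torch dependency.
    import torch

    torch.set_num_threads(1)
    # warn_only=True makes non-deterministic ops warn instead of raising.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return True
```

To match the commit's "before SUT import" ordering, a runner would call this at the top of the module, ahead of the statement that imports the system under test.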
…ame total, stronger meaning)
SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.
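As an illustration of the 4.24 pass rule above, here is a minimal leave-one-out nearest-neighbour sketch. It makes several assumptions: descriptors are compared by cosine similarity, the held-out accuracy is taken over the held-out items within the same 4-domain run, and the names `loo_nn_hits` and `probe_424_pass` are hypothetical. The runner may compute both metrics differently.

```python
import math
from typing import List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def loo_nn_hits(vectors: Sequence[Sequence[float]], labels: Sequence[str]) -> List[bool]:
    """For each item, check whether its nearest other item shares its label."""
    hits = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = max(others, key=lambda j: cosine(v, vectors[j]))
        hits.append(labels[nearest] == labels[i])
    return hits


def probe_424_pass(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[str],
    heldout: Set[str],
    min_all: float = 0.65,      # threshold (a) from the spec update
    min_heldout: float = 0.70,  # threshold (b) from the spec update
) -> Tuple[float, float, bool]:
    """Return (accuracy over all items, accuracy over held-out items, pass flag)."""
    hits = loo_nn_hits(vectors, labels)
    acc_all = sum(hits) / len(hits)
    held = [h for h, lab in zip(hits, labels) if lab in heldout]
    acc_held = sum(held) / len(held)  # assumes at least one held-out item
    return acc_all, acc_held, (acc_all >= min_all and acc_held >= min_heldout)
```

Labels are passed in by the caller rather than derived from keyword matching, which mirrors the text-identity labelling described in the spec update.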
Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric
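The two 4.23 checks named above (query disjointness and median rank of the best rare keyword) can be sketched as follows. This is an assumption-laden illustration: the function signatures are hypothetical, the per-query vocabulary scores stand in for whatever the tail slot actually produces, and ranks are taken as 1-based.

```python
from statistics import median
from typing import Iterable, Sequence, Set


def query_disjoint_from_rare_keywords(
    query_token_ids: Iterable[int], rare_keyword_ids: Set[int]
) -> bool:
    """True when no query token is one of the rare keyword ids."""
    return set(query_token_ids).isdisjoint(rare_keyword_ids)


def rank_of_best_rare(scores: Sequence[float], rare_keyword_ids: Set[int]) -> int:
    """1-based rank, over the whole vocabulary, of the top-scoring rare keyword."""
    best = max(scores[i] for i in rare_keyword_ids)
    return 1 + sum(1 for s in scores if s > best)


def median_rank_of_best_rare(
    per_query_scores: Iterable[Sequence[float]], rare_keyword_ids: Set[int]
) -> float:
    """Median of rank_of_best_rare across all paraphrase queries."""
    return median(rank_of_best_rare(s, rare_keyword_ids) for s in per_query_scores)
```

A lower median rank means the rare keywords surface nearer the top of the vocabulary; the earlier runner update stated a threshold of <= 100 for this metric.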
Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.
Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
    loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
    loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
    per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.
No SUT code changed (per user constraint). Only runner + spec.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Scope
Runner + spec changes only. No SUT modification. Fixes circular / selection-biased test designs in three cipher probes identified in review of v3.45-Runner-Update.
Overfit critique being addressed
- 4.22, fix: … "Tell me about", "Please describe", "Explain how"); require both sets to pass independently.
- 4.23, critique: … mem.source_text. The query contains the rare tokens that the tail slot is tested against. Round-trip.
- 4.23, fix: … corpus_paraphrase_music() (token-disjoint paraphrases). Identify dominant memory from ctx.diag; evaluate tail slot against that memory's rare_keyword_ids. Inline check that query tokens are disjoint from rare keywords.
- 4.24, critique: … CIPHER_MUSIC_KEYWORDS / CIPHER_SPACE_KEYWORDS matching. The keyword lists were hand-written against the same corpora. Circular labeling.
- 4.24, fix: … CIPHER_*_KEYWORDS list, never referenced by cases 4.1–4.19. Pass requires both (a) 4-domain loo_nn >= 0.65 and (b) held-out-2-domain loo_nn >= 0.70.
Result on the same v3.44-Trained checkpoint
19 / 26 pass, identical to v3.45-Runner-Update. No case changed pass/fail status. The meaning of each result is stronger under the de-overfit metrics.
Key numeric outcomes:
4.22 PASS (selection bias refuted)
- Set A (selected): delta = 11.0, margin_wins = 3/3
- Set B (held-out): delta = 10.0, margin_wins = 3/3
4.23 FAIL (circularity removed, residual gap reduced)
- … ctx.diag.dominant_per_batch[0] is set)
4.24 FAIL at 4-domain, PASS at held-out-2-domain: overfit hypothesis falsified
- loo_nn_accuracy_all_4 = 0.625 (threshold ≥ 0.65)
- loo_nn_accuracy_heldout_2 = 0.875 (threshold ≥ 0.70, PASS)
If the encoder were memorizing the music/space pair (test overfit), music and space should score highest and held-out domains should score at random. The opposite pattern is observed: held-out performs best, hand-crafted performs worst. This measurement falsifies the overfit hypothesis for 4.24. The FAIL on the 4-domain metric is a genuine capability gap in the hybrid encoder (β=0.8 collapses music/space hidden_means together); it is not a test-design artifact.
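As a consistency check, both 4.24 accuracies can be recomputed from the per-domain counts reported in the commit message (cooking 4/4, finance 3/4, music 1/4, space 2/4). The sketch assumes the held-out figure is the average over the cooking and finance items from the same 4-domain run; the variable names are illustrative only.

```python
# Per-domain correct counts as reported in the commit message (4 items each).
correct = {"cooking": 4, "finance": 3, "music": 1, "space": 2}
items_per_domain = 4
heldout = ("cooking", "finance")
handcrafted = ("music", "space")

acc_all_4 = sum(correct.values()) / (items_per_domain * len(correct))
acc_heldout_2 = sum(correct[d] for d in heldout) / (items_per_domain * len(heldout))
acc_handcrafted_2 = sum(correct[d] for d in handcrafted) / (items_per_domain * len(handcrafted))

print(acc_all_4)          # 10/16
print(acc_heldout_2)      # 7/8
print(acc_handcrafted_2)  # 3/8
```

Under that assumption the counts reproduce the reported 0.625 and 0.875, and the hand-crafted pair alone comes out at 0.375, which is the inverted pattern the paragraph above describes.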
Spec updates (V331_BLACKBOX_TEST_SPEC.md)
Added "De-overfit notice" blocks to Sections 4.22, 4.23, 4.24 documenting the changes and the falsifiability tests now present in each probe.
Artifacts
- V331_BLACKBOX_TEST_SPEC.md (updated)
- v331_blackbox_eval.py (updated)
- reports/v346_deoverfit_blackbox/{report.json, report.md, runner.log, audit_feedback.md}
Dependencies
Builds on SPEC PR #18 (Section 1.1 / 4-meta.1) and PR #19 (v3.45 runner metrics). If those merge first, this one is a clean fast-forward.