
v3.46 de-overfit: fix probes 4.22/4.23/4.24 (held-out prompts / paraphrase queries / 4 domains); audit on v3.44-Trained ckpt = 19/26 #20

Draft

FluffyAIcode wants to merge 3 commits into main from AgentMemory/v346-deoverfit-probes-audit-7e97

Conversation

@FluffyAIcode
Owner

Scope

Runner + spec changes only. No SUT modification. Fixes circular / selection-biased test designs in three cipher probes identified in review of v3.45-Runner-Update.

Overfit critique being addressed

| probe | circular element in pre-v3.46 design | v3.46 fix |
| --- | --- | --- |
| 4.22 | 3 prompts hand-selected for "Qwen's unconditional top-12 is function-token-dominated". Selection bias: a PASS on these 3 does not imply generalization. | Add held-out set B ("Tell me about", "Please describe", "Explain how"); require both sets to pass independently. |
| 4.23 | Query = `mem.source_text`, so the query contains the very rare tokens the tail slot is tested against. Round trip. | Query = `corpus_paraphrase_music()` (token-disjoint paraphrases). Identify the dominant memory from `ctx.diag`; evaluate the tail slot against that memory's `rare_keyword_ids`. Inline check that query tokens are disjoint from the rare keywords. |
| 4.24 | 2 domains (music, space) with labels assigned by `CIPHER_MUSIC_KEYWORDS` / `CIPHER_SPACE_KEYWORDS` matching. The keyword lists were hand-written against the same corpora: circular labeling. | 4 domains (music, space, cooking, finance). Label by source-text identity against runner-owned corpus tuples, not keyword matching. Cooking and finance are held-out: never in any `CIPHER_*_KEYWORDS` list, never referenced by cases 4.1–4.19. Pass requires both (a) 4-domain loo_nn >= 0.65 and (b) held-out-2-domain loo_nn >= 0.70. |
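The 4.23 disjointness requirement reduces to a simple set check. The helper name `query_disjoint_from_rare_keywords` appears in the runner changes below, but the body here is a sketch, not the runner's actual code; token ids are illustrative:

```python
def query_disjoint_from_rare_keywords(query_token_ids, rare_keyword_ids):
    """True iff the paraphrase query shares no token id with the rare
    keywords that the tail slot will be scored against (no round trip)."""
    return set(query_token_ids).isdisjoint(rare_keyword_ids)

# Illustrative token ids only.
assert query_disjoint_from_rare_keywords([5, 17, 99], {2048, 31337})
assert not query_disjoint_from_rare_keywords([5, 31337], {2048, 31337})
```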

Result on the same v3.44-Trained checkpoint

19 / 26 pass, identical to v3.45-Runner-Update. No case changed pass/fail status. The meaning of each result is stronger under the de-overfit metrics.

Key numeric outcomes:

4.22 PASS (selection bias refuted)

  • Set A (hand-picked): delta = 11.0, margin_wins = 3/3
  • Set B (held-out generic): delta = 10.0, margin_wins = 3/3
  • Held-out set passes at ~90% of the magnitude of the selected set (delta 10.0 vs 11.0)

4.23 FAIL (circularity removed, residual gap reduced)

  • Median rank of best rare: 759 / 151936 (was 4291 under v3.45 round-trip → 5.7× improvement)
  • Paraphrase query successfully retrieves dominant memory (ctx.diag.dominant_per_batch[0] is set)
  • Top-20 intersection still = 0 → direction is broadly correct, but concentration is insufficient at 60 training steps

4.24 FAIL at 4-domain, PASS at held-out-2-domain — overfit hypothesis falsified

  • loo_nn_accuracy_all_4 = 0.625 (threshold ≥ 0.65, FAIL)
  • loo_nn_accuracy_heldout_2 = 0.875 (threshold ≥ 0.70, PASS)
  • Per-domain: cooking 4/4, finance 3/4, music 1/4, space 2/4
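For reference, leave-one-out nearest-neighbour accuracy as used by the 4.24 thresholds can be computed as below. This is a generic cosine-similarity sketch, not the runner's exact code:

```python
import numpy as np

def loo_nn_accuracy(embeddings, labels):
    """Each point is classified by the label of its nearest cosine
    neighbour among all *other* points; return the fraction correct."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors -> cosine
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                   # leave-one-out: no self-match
    nearest = sims.argmax(axis=1)
    return sum(labels[i] == labels[j] for i, j in enumerate(nearest)) / len(labels)
```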

If the encoder were memorizing the music/space pair (test overfit), music and space should score highest and held-out domains should score at random. The opposite pattern is observed: held-out performs best, hand-crafted performs worst. This measurement falsifies the overfit hypothesis for 4.24. The FAIL on the 4-domain metric is a genuine capability gap in the hybrid encoder (β=0.8 collapses music/space hidden_means together); it is not a test-design artifact.

Spec updates (V331_BLACKBOX_TEST_SPEC.md)

Added "De-overfit notice" blocks to Sections 4.22, 4.23, 4.24 documenting the changes and the falsifiability tests now present in each probe.

Artifacts

  • V331_BLACKBOX_TEST_SPEC.md (updated)
  • v331_blackbox_eval.py (updated)
  • reports/v346_deoverfit_blackbox/{report.json, report.md, runner.log, audit_feedback.md}

Dependencies

Builds on SPEC PR #18 (Section 1.1 / 4-meta.1) and PR #19 (v3.45 runner metrics). If those merge first, this one is a clean fast-forward.


cursoragent and others added 3 commits April 20, 2026 15:32
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)
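The deterministic-startup bullet corresponds to roughly the following. The env-var name and the two torch calls come from the commit message; pre-setting the flag is purely for illustration:

```python
import os
import torch

os.environ.setdefault("AMS_DETERMINISTIC", "1")  # illustration: pretend the flag is set

if os.environ.get("AMS_DETERMINISTIC") == "1":
    torch.set_num_threads(1)                                  # single CPU thread
    torch.use_deterministic_algorithms(True, warn_only=True)  # warn, don't raise
# ...the SUT import happens only after this point...
```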

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false
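The two corrected metrics in that list (4.23 rank-of-best-rare, 4.25 starter mass ratio) can be sketched as follows; these are generic reconstructions with illustrative ids and logits, not the runner's code:

```python
import numpy as np

def rank_of_best_rare(vocab_logits, rare_ids):
    """1-based rank of the best-placed rare-keyword id when the vocabulary
    logits are sorted descending; lower is better (threshold: median <= 100)."""
    order = np.argsort(-np.asarray(vocab_logits, dtype=float))
    position = {int(tok): i for i, tok in enumerate(order)}
    return 1 + min(position[t] for t in rare_ids)

def starter_mass_ratio(logits_short, logits_long, starter_ids):
    """4.25 sketch: positive logit mass on starter tokens for the long
    prefix (B) over the short prefix (A); PASS if the prompt average > 1.10."""
    mass = lambda logits: sum(max(float(logits[t]), 0.0) for t in starter_ids)
    return mass(logits_long) / mass(logits_short)
```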

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ame total, stronger meaning)

SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.
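Source-text-identity labeling removes the keyword circularity. A sketch, with hypothetical one-line corpora standing in for the runner-owned tuples:

```python
# Hypothetical stand-ins for the runner-owned corpus tuples.
CORPORA = {
    "music":   ("a passage about a cello suite",),
    "space":   ("a passage about orbiter telemetry",),
    "cooking": ("a passage about a sourdough starter",),   # held-out
    "finance": ("a passage about the bond yield curve",),  # held-out
}

def domain_of(source_text):
    """Label a memory by exact source-text identity, never by keyword match."""
    for domain, texts in CORPORA.items():
        if source_text in texts:
            return domain
    raise KeyError("source text not in any runner-owned corpus")
```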

Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric

Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.

Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at near-equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
    loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
    loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
    per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
  The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.

No SUT code changed (per user constraint). Only runner + spec.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>