
v3.46 de-overfit: fix probes 4.22/4.23/4.24 (held-out prompts / paraphrase queries / 4 domains); audit on v3.44-Trained ckpt = 19/26 #20

Draft

FluffyAIcode wants to merge 3 commits into main from AgentMemory/v346-deoverfit-probes-audit-7e97

Conversation

@FluffyAIcode
Owner

Scope

Runner + spec changes only. No SUT modification. Fixes circular / selection-biased test designs in three cipher probes identified in review of v3.45-Runner-Update.

Overfit critique being addressed

| probe | circular element in pre-v3.46 design | v3.46 fix |
| --- | --- | --- |
| 4.22 | 3 prompts hand-selected for "Qwen's unconditional top-12 is function-token-dominated". Selection bias: a PASS on these 3 does not imply generalization. | Add held-out set B ("Tell me about", "Please describe", "Explain how"); require both sets to pass independently. |
| 4.23 | Query = `mem.source_text`, so the query contains the very rare tokens the tail slot is tested against. Round trip. | Query = `corpus_paraphrase_music()` (token-disjoint paraphrases). Identify the dominant memory from `ctx.diag`; evaluate the tail slot against that memory's `rare_keyword_ids`. Inline check that query tokens are disjoint from the rare keywords. |
| 4.24 | 2 domains (music, space) with labels assigned by `CIPHER_MUSIC_KEYWORDS` / `CIPHER_SPACE_KEYWORDS` matching. The keyword lists were hand-written against the same corpora: circular labeling. | 4 domains (music, space, cooking, finance). Label by source-text identity against runner-owned corpus tuples, not keyword matching. Cooking and finance are held-out: never in any `CIPHER_*_KEYWORDS` list, never referenced by cases 4.1–4.19. Pass requires both (a) 4-domain loo_nn >= 0.65 and (b) held-out-2-domain loo_nn >= 0.70. |
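The 4.23 disjointness requirement reduces to a simple set check. The helper name `query_disjoint_from_rare_keywords` appears in the runner changes below, but the body here is a sketch, not the runner's actual code; token ids are illustrative:

```python
def query_disjoint_from_rare_keywords(query_token_ids, rare_keyword_ids):
    """True iff the paraphrase query shares no token id with the rare
    keywords that the tail slot will be scored against (no round trip)."""
    return set(query_token_ids).isdisjoint(rare_keyword_ids)

# Illustrative token ids only.
assert query_disjoint_from_rare_keywords([5, 17, 99], {2048, 31337})
assert not query_disjoint_from_rare_keywords([5, 31337], {2048, 31337})
```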

Result on the same v3.44-Trained checkpoint

19 / 26 pass, identical to v3.45-Runner-Update. No case changed pass/fail status. The meaning of each result is stronger under the de-overfit metrics.

Key numeric outcomes:

4.22 PASS (selection bias refuted)

  • Set A (hand-picked): delta = 11.0, margin_wins = 3/3
  • Set B (held-out generic): delta = 10.0, margin_wins = 3/3
  • Held-out set passes at ~90% of the magnitude of the selected set (delta 10.0 vs 11.0)

4.23 FAIL (circularity removed, residual gap reduced)

  • Median rank of best rare: 759 / 151936 (was 4291 under v3.45 round-trip → 5.7× improvement)
  • Paraphrase query successfully retrieves dominant memory (ctx.diag.dominant_per_batch[0] is set)
  • Top-20 intersection still = 0 → direction is broadly correct, but concentration is insufficient at 60 training steps

4.24 FAIL at 4-domain, PASS at held-out-2-domain — overfit hypothesis falsified

  • loo_nn_accuracy_all_4 = 0.625 (threshold ≥ 0.65, FAIL)
  • loo_nn_accuracy_heldout_2 = 0.875 (threshold ≥ 0.70, PASS)
  • Per-domain: cooking 4/4, finance 3/4, music 1/4, space 2/4
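For reference, leave-one-out nearest-neighbour accuracy as used by the 4.24 thresholds can be computed as below. This is a generic cosine-similarity sketch, not the runner's exact code:

```python
import numpy as np

def loo_nn_accuracy(embeddings, labels):
    """Each point is classified by the label of its nearest cosine
    neighbour among all *other* points; return the fraction correct."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors -> cosine
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                   # leave-one-out: no self-match
    nearest = sims.argmax(axis=1)
    return sum(labels[i] == labels[j] for i, j in enumerate(nearest)) / len(labels)
```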

If the encoder were memorizing the music/space pair (test overfit), music and space should score highest and held-out domains should score at random. The opposite pattern is observed: held-out performs best, hand-crafted performs worst. This measurement falsifies the overfit hypothesis for 4.24. The FAIL on the 4-domain metric is a genuine capability gap in the hybrid encoder (β=0.8 collapses music/space hidden_means together); it is not a test-design artifact.

Spec updates (V331_BLACKBOX_TEST_SPEC.md)

Added "De-overfit notice" blocks to Sections 4.22, 4.23, 4.24 documenting the changes and the falsifiability tests now present in each probe.

Artifacts

  • V331_BLACKBOX_TEST_SPEC.md (updated)
  • v331_blackbox_eval.py (updated)
  • reports/v346_deoverfit_blackbox/{report.json, report.md, runner.log, audit_feedback.md}

Dependencies

Builds on SPEC PR #18 (Section 1.1 / 4-meta.1) and PR #19 (v3.45 runner metrics). If those merge first, this one is a clean fast-forward.


cursoragent and others added 3 commits April 20, 2026 15:32
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)
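The deterministic-startup bullet corresponds to roughly the following. The env-var name and the two torch calls come from the commit message; pre-setting the flag is purely for illustration:

```python
import os
import torch

os.environ.setdefault("AMS_DETERMINISTIC", "1")  # illustration: pretend the flag is set

if os.environ.get("AMS_DETERMINISTIC") == "1":
    torch.set_num_threads(1)                                  # single CPU thread
    torch.use_deterministic_algorithms(True, warn_only=True)  # warn, don't raise
# ...the SUT import happens only after this point...
```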

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false
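The two corrected metrics in that list (4.23 rank-of-best-rare, 4.25 starter mass ratio) can be sketched as follows; these are generic reconstructions with illustrative ids and logits, not the runner's code:

```python
import numpy as np

def rank_of_best_rare(vocab_logits, rare_ids):
    """1-based rank of the best-placed rare-keyword id when the vocabulary
    logits are sorted descending; lower is better (threshold: median <= 100)."""
    order = np.argsort(-np.asarray(vocab_logits, dtype=float))
    position = {int(tok): i for i, tok in enumerate(order)}
    return 1 + min(position[t] for t in rare_ids)

def starter_mass_ratio(logits_short, logits_long, starter_ids):
    """4.25 sketch: positive logit mass on starter tokens for the long
    prefix (B) over the short prefix (A); PASS if the prompt average > 1.10."""
    mass = lambda logits: sum(max(float(logits[t]), 0.0) for t in starter_ids)
    return mass(logits_long) / mass(logits_short)
```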

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ame total, stronger meaning)

SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.
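Source-text-identity labeling removes the keyword circularity. A sketch, with hypothetical one-line corpora standing in for the runner-owned tuples:

```python
# Hypothetical stand-ins for the runner-owned corpus tuples.
CORPORA = {
    "music":   ("a passage about a cello suite",),
    "space":   ("a passage about orbiter telemetry",),
    "cooking": ("a passage about a sourdough starter",),   # held-out
    "finance": ("a passage about the bond yield curve",),  # held-out
}

def domain_of(source_text):
    """Label a memory by exact source-text identity, never by keyword match."""
    for domain, texts in CORPORA.items():
        if source_text in texts:
            return domain
    raise KeyError("source text not in any runner-owned corpus")
```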

Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric

Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.

Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at near-equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
    loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
    loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
    per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
  The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.

No SUT code changed (per user constraint). Only runner + spec.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>