v3.47 mechanism-1 diagnostic: frozen Qwen attention pool outperforms learned MemoryContextEncoder by 30% on 4.24 #21
Draft
FluffyAIcode wants to merge 4 commits into main from
Conversation
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained, 17/26):
- FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
- PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
- Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17±1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
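The [J-1] hook body isn't shown on this page. A minimal sketch of what an AMS_TRAINED_WEIGHTS env hook could look like, assuming the variable carries a checkpoint path; the function name and the strict=False choice are illustrative, not the repo's code:

```python
import os

import torch


def maybe_load_trained_weights(model: torch.nn.Module) -> bool:
    """Hypothetical sketch of the [J-1] hook: if AMS_TRAINED_WEIGHTS names a
    checkpoint file, load it into the SUT before the audit runs."""
    ckpt_path = os.environ.get("AMS_TRAINED_WEIGHTS")
    if not ckpt_path:
        return False  # untrained weights: v3.42-equivalent behavior
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state, strict=False)  # tolerate non-SUT keys
    return True
```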
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10, averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
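The startup guard maps directly onto two torch calls; a minimal sketch of the behavior the commit describes (the point is that it runs before the SUT import):

```python
import os

import torch

# Pin CPU threading and request deterministic kernels before the SUT module
# is imported. warn_only=True makes ops without a deterministic
# implementation warn instead of raise.
if os.environ.get("AMS_DETERMINISTIC") == "1":
    torch.set_num_threads(1)
    torch.use_deterministic_algorithms(True, warn_only=True)
```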
…ame total, stronger meaning)
SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace the round-trip query (mem.source_text, which embeds the very rare keywords the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). The tokens checked are verified inline to be disjoint from rare_keywords.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.
Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric
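For reference, the LOO NN accuracy that the 4.24 protocol relies on is short to implement. A minimal sketch assuming plain cosine similarity; the runner's exact centering and its held-out subsetting may differ:

```python
import numpy as np


def loo_nn_accuracy(embs: np.ndarray, labels: list[str]) -> float:
    """Leave-one-out nearest-neighbor accuracy: each embedding is classified
    by the domain label of its nearest cosine neighbor among the others."""
    x = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-match
    nearest = sims.argmax(axis=1)
    return float(np.mean([labels[i] == labels[j]
                          for i, j in enumerate(nearest)]))
```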
Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.
Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.
No SUT code changed (per user constraint). Only runner + spec.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
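One concrete rendering of the 4.23 statistic, under the reading suggested above: rank candidates by similarity to a paraphrase query and report the best rank any of the memory's rare keywords achieves. The candidate set and embedding space here are assumptions, not the runner's code:

```python
import numpy as np


def rank_of_best_rare(query_emb: np.ndarray, cand_embs: np.ndarray,
                      cand_tokens: list[str],
                      rare_keywords: set[str]) -> int:
    """Best (lowest) cosine-similarity rank achieved by any rare keyword;
    rank 1 = most similar candidate. Sketch only, assumed semantics."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    order = np.argsort(-(c @ q))  # candidate indices, best first
    ranks = {cand_tokens[i]: r + 1 for r, i in enumerate(order)}
    return min((ranks[t] for t in rare_keywords if t in ranks),
               default=len(cand_tokens) + 1)
```

The median of this value across paraphrase queries would be median_rank_of_best_rare (pass threshold <= 100).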
…ned encoder by 30% rel
Runner-only change. Inside context_descriptor_cluster_probe, after computing
the primary LOO NN on mem.context_descriptor, the runner also computes LOO NN
on mem.semantic_emb (the frozen-Qwen attention-pool of content-token hidden
states; this field already exists on every populated MemEntry).
Same ckpt/v344_trained.pt, same v3.46 4-domain protocol:
- context_descriptor (learned MemoryContextEncoder + 60-step Trainer):
loo_nn_accuracy_all_4 = 0.625 (10/16) -- FAIL
loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
per-domain: music 1/4, space 2/4, cooking 4/4, finance 3/4
- semantic_emb (frozen Qwen last-layer attention pool, zero trainable params):
loo_nn_accuracy_all_4 = 0.812 (13/16) -- PASS
loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
per-domain: music 3/4, space 3/4, cooking 4/4, finance 3/4
Delta +0.188 absolute (+30% relative). Music domain +0.50.
Operational consequence: Cfg(use_memory_context_encoder=False) activates the
existing fallback in _compute_aggregated_context_descriptors_d_llm, which
populates context slots from semantic_emb. No SUT code change. Next audit
prediction: 4.24 FAIL -> PASS, total 19/26 -> 20/26.
Overall: 19/26 (same total as v3.46; primary criteria unchanged).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
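For orientation: a zero-trainable-parameter attention pool over content-token hidden states can be as simple as softmax-weighting each state by its similarity to the mean state. The actual pooling lives in scheme_b_v344.MemLLM._compute_content_semantic_emb and may use a different score; this is a generic sketch:

```python
import torch


def parameter_free_attention_pool(hidden: torch.Tensor,
                                  content_mask: torch.Tensor) -> torch.Tensor:
    """Pool last-layer hidden states (seq, d) over content tokens using a
    parameter-free attention score (similarity to the mean state)."""
    h = hidden[content_mask]                 # (n_content, d)
    query = h.mean(dim=0)                    # no learned query vector
    scores = h @ query / h.shape[-1] ** 0.5  # scaled dot-product scores
    weights = torch.softmax(scores, dim=0)
    return weights @ h                       # (d,) pooled semantic_emb
```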
Scope
Runner-only change. No SUT, no Cfg, no ckpt retraining. The runner's 4.24 probe additionally computes LOO NN accuracy on mem.semantic_emb (the frozen-Qwen attention pool of content-token hidden states, already stored on every MemEntry by scheme_b_v344.MemLLM._compute_content_semantic_emb) and emits both numbers side by side.

Measurement

Same ckpt/v344_trained.pt, same 4-domain 16-memory protocol (music / space / cooking / finance):

| embedding source | loo_nn_accuracy_all_4 | loo_nn_accuracy_heldout_2 |
| --- | --- | --- |
| context_descriptor (learned encoder + 60-step Trainer) | 0.625 (10/16) | 0.875 (7/8) |
| semantic_emb (frozen Qwen pool, 0 trainable params) | 0.812 (13/16) | 0.875 (7/8) |

Delta: +0.188 absolute (+30% relative). Music domain +0.50; space +0.25.
Confirmed independently by an N=8 smoke probe: semantic_emb 8/8, context_descriptor 2/8.

Interpretation
The current MemoryContextEncoder (random orthogonal Linear(d_LLM, d_ctx=128) + β=0.8 hybrid with hidden_mean) actively degrades the Qwen-pool signal. The signal Qwen's forward pass produces for free already clusters the 4 domains with 81% LOO NN accuracy; routing it through a 60-step-trained learned projection drops that to 62%.
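Read literally, the parenthetical above is a random orthogonal projection applied to a β-weighted blend of the two embeddings. How β combines them is my reading, so treat the blend below as an assumption; the real class lives in scheme_b_v344.py:

```python
import torch
import torch.nn as nn


class MemoryContextEncoderSketch(nn.Module):
    """Sketch of the described encoder: random-orthogonal Linear projection
    of a beta-weighted blend of semantic_emb and hidden_mean. Blend order
    and normalization are assumptions."""
    def __init__(self, d_llm: int, d_ctx: int = 128, beta: float = 0.8):
        super().__init__()
        self.proj = nn.Linear(d_llm, d_ctx, bias=False)
        nn.init.orthogonal_(self.proj.weight)  # random orthogonal init
        self.beta = beta

    def forward(self, semantic_emb: torch.Tensor,
                hidden_mean: torch.Tensor) -> torch.Tensor:
        blended = self.beta * semantic_emb + (1 - self.beta) * hidden_mean
        return self.proj(blended)
```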
Operational follow-up (out of scope for this PR)

scheme_b_v344._compute_aggregated_context_descriptors_d_llm already contains the fallback:
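The original block is not captured on this page; a plausible reconstruction, where every name beyond use_memory_context_encoder, semantic_emb, and context_descriptor is a guess:

```python
# Hypothetical reconstruction, not the actual source of
# _compute_aggregated_context_descriptors_d_llm.
for mem in populated_entries:
    if cfg.use_memory_context_encoder:
        # learned path: MemoryContextEncoder over the pooled embedding
        mem.context_descriptor = self.context_encoder(
            mem.semantic_emb, mem.hidden_mean)
    else:
        # fallback: populate the context slot from the frozen-Qwen pool
        mem.context_descriptor = mem.semantic_emb
```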
Setting Cfg(use_memory_context_encoder=False) activates this fallback. No SUT code change. Predicted effect on the same ckpt: 4.24 FAIL → PASS, total 19/26 → 20/26. This PR does not run that follow-up; it only establishes the diagnostic data that motivates it.

Overall
26 cases, 1498.0 s, 19/26 pass (unchanged from v3.46). Primary 4.24 criterion still FAILs; the diagnostic exposes the mechanism.
Artifacts
- reports/v347_mechanism1_blackbox/report.json
- reports/v347_mechanism1_blackbox/report.md
- reports/v347_mechanism1_blackbox/runner.log
- reports/v347_mechanism1_blackbox/audit_feedback.md

Dependencies
Builds on PR #20 (v3.46 de-overfit metrics). Clean fast-forward if #20 merges first.