v3.47 mechanism-1 diagnostic: frozen Qwen attention pool outperforms learned MemoryContextEncoder by 30% on 4.24 #21
Draft
FluffyAIcode wants to merge 4 commits into main from
Conversation
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained, 17/26):
- FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
- PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
- Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17±1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
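The [J-1] hook body isn't shown on this page. A minimal sketch of what an AMS_TRAINED_WEIGHTS env hook could look like, assuming the variable carries a checkpoint path; the function name and the strict=False choice are illustrative, not the repo's code:

```python
import os

import torch


def maybe_load_trained_weights(model: torch.nn.Module) -> bool:
    """Hypothetical sketch of the [J-1] hook: if AMS_TRAINED_WEIGHTS names a
    checkpoint file, load it into the SUT before the audit runs."""
    ckpt_path = os.environ.get("AMS_TRAINED_WEIGHTS")
    if not ckpt_path:
        return False  # untrained weights: v3.42-equivalent behavior
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state, strict=False)  # tolerate non-SUT keys
    return True
```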
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10, averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
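The startup guard maps directly onto two torch calls; a minimal sketch of the behavior the commit describes (the point is that it runs before the SUT import):

```python
import os

import torch

# Pin CPU threading and request deterministic kernels before the SUT module
# is imported. warn_only=True makes ops without a deterministic
# implementation warn instead of raise.
if os.environ.get("AMS_DETERMINISTIC") == "1":
    torch.set_num_threads(1)
    torch.use_deterministic_algorithms(True, warn_only=True)
```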
…ame total, stronger meaning)
SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace the round-trip query (mem.source_text, which embeds the very rare keywords the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). The tokens checked are verified inline to be disjoint from rare_keywords.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.
Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric
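For reference, the LOO NN accuracy that the 4.24 protocol relies on is short to implement. A minimal sketch assuming plain cosine similarity; the runner's exact centering and its held-out subsetting may differ:

```python
import numpy as np


def loo_nn_accuracy(embs: np.ndarray, labels: list[str]) -> float:
    """Leave-one-out nearest-neighbor accuracy: each embedding is classified
    by the domain label of its nearest cosine neighbor among the others."""
    x = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-match
    nearest = sims.argmax(axis=1)
    return float(np.mean([labels[i] == labels[j]
                          for i, j in enumerate(nearest)]))
```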
Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.
Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.
No SUT code changed (per user constraint). Only runner + spec.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
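One concrete rendering of the 4.23 statistic, under the reading suggested above: rank candidates by similarity to a paraphrase query and report the best rank any of the memory's rare keywords achieves. The candidate set and embedding space here are assumptions, not the runner's code:

```python
import numpy as np


def rank_of_best_rare(query_emb: np.ndarray, cand_embs: np.ndarray,
                      cand_tokens: list[str],
                      rare_keywords: set[str]) -> int:
    """Best (lowest) cosine-similarity rank achieved by any rare keyword;
    rank 1 = most similar candidate. Sketch only, assumed semantics."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    order = np.argsort(-(c @ q))  # candidate indices, best first
    ranks = {cand_tokens[i]: r + 1 for r, i in enumerate(order)}
    return min((ranks[t] for t in rare_keywords if t in ranks),
               default=len(cand_tokens) + 1)
```

The median of this value across paraphrase queries would be median_rank_of_best_rare (pass threshold <= 100).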
…ned encoder by 30% rel
Runner-only change. Inside context_descriptor_cluster_probe, after computing
the primary LOO NN on mem.context_descriptor, the runner also computes LOO NN
on mem.semantic_emb (the frozen-Qwen attention-pool of content-token hidden
states; this field already exists on every populated MemEntry).
Same ckpt/v344_trained.pt, same v3.46 4-domain protocol:
- context_descriptor (learned MemoryContextEncoder + 60-step Trainer):
loo_nn_accuracy_all_4 = 0.625 (10/16) -- FAIL
loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
per-domain: music 1/4, space 2/4, cooking 4/4, finance 3/4
- semantic_emb (frozen Qwen last-layer attention pool, zero trainable params):
loo_nn_accuracy_all_4 = 0.812 (13/16) -- PASS
loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
per-domain: music 3/4, space 3/4, cooking 4/4, finance 3/4
Delta +0.188 absolute (+30% relative). Music domain +0.50.
Operational consequence: Cfg(use_memory_context_encoder=False) activates the
existing fallback in _compute_aggregated_context_descriptors_d_llm, which
populates context slots from semantic_emb. No SUT code change. Next audit
prediction: 4.24 FAIL -> PASS, total 19/26 -> 20/26.
Overall: 19/26 (same total as v3.46; primary criteria unchanged).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
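For orientation: a zero-trainable-parameter attention pool over content-token hidden states can be as simple as softmax-weighting each state by its similarity to the mean state. The actual pooling lives in scheme_b_v344.MemLLM._compute_content_semantic_emb and may use a different score; this is a generic sketch:

```python
import torch


def parameter_free_attention_pool(hidden: torch.Tensor,
                                  content_mask: torch.Tensor) -> torch.Tensor:
    """Pool last-layer hidden states (seq, d) over content tokens using a
    parameter-free attention score (similarity to the mean state)."""
    h = hidden[content_mask]                 # (n_content, d)
    query = h.mean(dim=0)                    # no learned query vector
    scores = h @ query / h.shape[-1] ** 0.5  # scaled dot-product scores
    weights = torch.softmax(scores, dim=0)
    return weights @ h                       # (d,) pooled semantic_emb
```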
Scope
Runner-only change. No SUT, no Cfg, no ckpt retraining. The runner's 4.24 probe additionally computes LOO NN accuracy on mem.semantic_emb (the frozen-Qwen attention pool of content-token hidden states, already stored on every MemEntry by scheme_b_v344.MemLLM._compute_content_semantic_emb) and emits both numbers side by side.

Measurement

Same ckpt/v344_trained.pt, same 4-domain 16-memory protocol (music / space / cooking / finance):

| embedding source | loo_nn_accuracy_all_4 | loo_nn_accuracy_heldout_2 |
| --- | --- | --- |
| context_descriptor (learned encoder + 60-step Trainer) | 0.625 (10/16) | 0.875 (7/8) |
| semantic_emb (frozen Qwen pool, 0 trainable params) | 0.812 (13/16) | 0.875 (7/8) |

Delta: +0.188 absolute (+30% relative). Music domain +0.50; space +0.25.
Confirmed independently by an N=8 smoke probe: semantic_emb 8/8, context_descriptor 2/8.

Interpretation
The current MemoryContextEncoder (random orthogonal Linear(d_LLM, d_ctx=128) + β=0.8 hybrid with hidden_mean) actively degrades the Qwen-pool signal. The signal Qwen's forward pass produces for free already clusters the 4 domains with 81% LOO NN accuracy; routing it through a 60-step-trained learned projection drops that to 62%.
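Read literally, the parenthetical above is a random orthogonal projection applied to a β-weighted blend of the two embeddings. How β combines them is my reading, so treat the blend below as an assumption; the real class lives in scheme_b_v344.py:

```python
import torch
import torch.nn as nn


class MemoryContextEncoderSketch(nn.Module):
    """Sketch of the described encoder: random-orthogonal Linear projection
    of a beta-weighted blend of semantic_emb and hidden_mean. Blend order
    and normalization are assumptions."""
    def __init__(self, d_llm: int, d_ctx: int = 128, beta: float = 0.8):
        super().__init__()
        self.proj = nn.Linear(d_llm, d_ctx, bias=False)
        nn.init.orthogonal_(self.proj.weight)  # random orthogonal init
        self.beta = beta

    def forward(self, semantic_emb: torch.Tensor,
                hidden_mean: torch.Tensor) -> torch.Tensor:
        blended = self.beta * semantic_emb + (1 - self.beta) * hidden_mean
        return self.proj(blended)
```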
Operational follow-up (out of scope for this PR)

scheme_b_v344._compute_aggregated_context_descriptors_d_llm already contains the fallback:
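The original block is not captured on this page; a plausible reconstruction, where every name beyond use_memory_context_encoder, semantic_emb, and context_descriptor is a guess:

```python
# Hypothetical reconstruction, not the actual source of
# _compute_aggregated_context_descriptors_d_llm.
for mem in populated_entries:
    if cfg.use_memory_context_encoder:
        # learned path: MemoryContextEncoder over the pooled embedding
        mem.context_descriptor = self.context_encoder(
            mem.semantic_emb, mem.hidden_mean)
    else:
        # fallback: populate the context slot from the frozen-Qwen pool
        mem.context_descriptor = mem.semantic_emb
```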
Setting Cfg(use_memory_context_encoder=False) activates this fallback. No SUT code change. Predicted effect on the same ckpt: 4.24 FAIL → PASS, total 19/26 → 20/26. This PR does not run that follow-up; it only establishes the diagnostic data that motivates it.

Overall
26 cases, 1498.0 s, 19/26 pass (unchanged from v3.46). Primary 4.24 criterion still FAILs; the diagnostic exposes the mechanism.
Artifacts
- reports/v347_mechanism1_blackbox/report.json
- reports/v347_mechanism1_blackbox/report.md
- reports/v347_mechanism1_blackbox/runner.log
- reports/v347_mechanism1_blackbox/audit_feedback.md

Dependencies
Builds on PR #20 (v3.46 de-overfit metrics). Clean fast-forward if #20 merges first.