v3.46: revert [E] top1-exclusive content_bias (single Cfg flip) [draft pre-audit] by FluffyAIcode · Pull Request #27 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-21T17:00:13Z

Audit result: 21/26 pass (same as v3.45-cond-buffer)

The revert of [E] was a structural cleanup: it removes a test-directed Cfg addition introduced in v3.44-rewrite. The revert did not move any primary metric in the fresh-init audit. Every case state is identical to v3.45-cond-buffer.

case	v3.48	v3.45-cond-buffer	v3.46	delta
all 26	—	21/26	21/26	0

Elapsed 1456 s on CPU, AMS_DETERMINISTIC=1, fresh init.

Handoff document

SPRINT_CLOSEOUT_v3.46.md (added in this PR) is the full context for a new agent with GPU access to continue from here. It contains:

Current state (v3.46, 21/26 fresh-init ceiling, carrier mapping per mechanism)
Sprint timeline with all branch names, PR numbers, audit deltas, per-change root cause
Five prediction errors categorized (unit mismatch, scope mismatch, magnitude blindness, regression blindness, dead-path)
Three anti-patterns (threshold chasing, decode-time metric patching, dead-Cfg-path mechanisms)
Five remaining FAILs root-caused to two zero-init dilution paths (tail_head.slot_heads[1] and vocab_proj.proj[-1])
Training protocol: train_v346.py skeleton, checkpoint location, audit re-run commands, what NOT to change
Sanity prompts before training
Scope limits (no Δ prediction, no "channel works" phrasing, no post-hoc Cfg tuning unless structural revert)

Any agent picking up this work should read that doc first.

Prediction postmortem (same mistake as v3.44-rewrite)

I wrote: "v3.48 baseline had 4.7/4.8/4.21 all PASS under top1_exclusive=False, so the revert returns to a known-viable point."

Wrong. v3.48 was 120-step trained. v3.46 is fresh init. Same Cfg, different model state. Comparing a trained baseline to a fresh-init baseline and expecting the same numbers is the same category of error the v3.44-rewrite audit exposed.

Fresh-init repetition signature (4.7 / 4.8 / 4.21)

4.8  avg_unique_token_ratio = 0.343   (threshold >= 0.35, diff 0.007)
     avg_max_token_run      = 2       (threshold <= 4, PASS component)
     avg_repeated_bigram    = 0.057   (threshold <= 0.20, PASS component)
     'The pianist pian pian midnight Pell pian Ell night pian noct midnight practiced midnight midnight pianian noct practiced'

4.21 avg_max_repeat_per_content_token = 4.67    (threshold <= 3, diff 1.67)
     'The pianist pian piano pian pian hours pian Tao pian perfect hours hours perfectperfectAppPerfectSoftware'

4.7  music_margin = 0, space_margin = 0   (threshold > 0)
     music output misses all music keywords; space output misses all space keywords
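The 4.8 and 4.21 figures above are repetition statistics over generated tokens. A minimal sketch of how such statistics can be computed follows; the function names are hypothetical, the runner's exact definitions and tokenization are not shown in this PR, and plain whitespace tokens are assumed here, so the numbers will not match the audit values.

```python
from collections import Counter


def unique_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def max_token_run(tokens):
    """Length of the longest run of identical consecutive tokens."""
    best = run = 0
    prev = object()  # sentinel that never equals a real token
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best


def repeated_bigram_ratio(tokens):
    """Fraction of bigram occurrences that repeat an earlier bigram."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(bigrams)


def max_repeat_per_token(tokens):
    """Highest occurrence count of any single token."""
    return max(Counter(tokens).values(), default=0)


# Assumption: whitespace split stands in for the real tokenizer.
sample = "a b b b c a b".split()
stats = {
    "unique_ratio": unique_token_ratio(sample),
    "max_run": max_token_run(sample),
    "repeated_bigram": repeated_bigram_ratio(sample),
    "max_repeat": max_repeat_per_token(sample),
}
```

Averaging these per-output values over the probe's prompts would give quantities comparable in shape to avg_unique_token_ratio and avg_max_repeat_per_content_token, assuming the runner aggregates by a simple mean.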

Root cause (pinned and documented)

Fresh init has two zero-initialized paths that, once trained, would dilute the concentration of content_bias across the vocabulary during decode:

tail_head.slot_heads[1] is zero-init per tail_head_zero_init_tied=True, so tail_head(fiber) = 0 on slot_1. Without a trained head, the slot carries only α × residual — a fixed direction shared by all prompts in a given memory set.
MemoryVocabProjector.proj[-1] is zero-init (intentional at class definition), so vocab_bias = vocab_proj(fiber, wte) = 0. The lg += vocab_bias × semantic_boost_scale term contributes nothing at fresh init.
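The effect of a zero-initialized head on the additive slot can be shown with a toy calculation. This is a sketch only: the helper names, the dimensions, and the value alpha=1.5 are illustrative assumptions, not taken from the SUT.

```python
def linear(weight, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def additive_tail_slot(head_weight, fiber, residual, alpha):
    """slot = tail_head(fiber) + alpha * residual (the additive path)."""
    head_out = linear(head_weight, fiber)
    return [h + alpha * r for h, r in zip(head_out, residual)]


# Zero-initialized head: 2 outputs, 3 inputs (toy sizes, assumed).
zero_head = [[0.0] * 3 for _ in range(2)]
residual = [0.6, -0.8]

# Two different fibers (stand-ins for two different prompts).
slot_a = additive_tail_slot(zero_head, [1.0, 2.0, 3.0], residual, alpha=1.5)
slot_b = additive_tail_slot(zero_head, [-4.0, 0.5, 9.0], residual, alpha=1.5)

# With zero weights the fiber has no influence: both slots equal alpha * residual.
same_slot = slot_a == slot_b
```

The same reasoning applies to the vocab projector: a zero final layer yields vocab_bias = 0 for every fiber, so the semantic_boost_scale term adds nothing until training moves those weights.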

The result: at fresh init, the only live contribution to next-token logits beyond the backbone itself is the aggregated content_bias over the content tokens of the top-k retrieved memories. With the music/space corpus that set is roughly 12 distinct keywords, and content_repeat_penalty = 2.5 × k only overcomes the bias at k >= 6–7, so within a 20-step generation the decoder locks into repeating those keywords.
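The break-even point in the paragraph above can be approximated with a deliberately simplified model: a keyword keeps winning while its bias exceeds the accumulated penalty 2.5 × k, and backbone logits are ignored. The function name is hypothetical, and the bias values used below are back-solved from the stated k range rather than measured.

```python
import math


def first_winning_repeat_count(bias_logit, penalty_per_repeat=2.5):
    """Smallest repeat count k with penalty_per_repeat * k >= bias_logit.

    Simplified model (assumption): the penalty grows linearly in k and the
    token stops winning as soon as the penalty matches its bias.
    """
    return max(1, math.ceil(bias_logit / penalty_per_repeat))


# Illustrative biases only: 15.0 to 17.5 logits reproduce k = 6 to 7.
k_low = first_winning_repeat_count(15.0)
k_high = first_winning_repeat_count(17.5)
```

Under this model, each of roughly 12 biased keywords can be emitted several times before its penalty catches up, which is enough to fill a 20-step generation with repeats.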

This is not a Cfg bug. It is what the untrained channel looks like.

4.16 confirmed [C]-only

The revert removed [E] but kept [C] (use_inter_domain_margin=True, retrieval_crowding_lambda=0.15). 4.16 remains PASS with retrieval_miss=0. This confirms [E] never carried 4.16.

Why this branch is still the cleanest trained-start point

Even though the revert did not change the audit number, v3.46 removes a test-directed Cfg addition that had no independent structural justification. If training runs on top of this branch:

tail_head.slot_heads[1] and vocab_proj acquire non-zero weights via tail_semantic_anchor_loss and semantic_alignment_loss.
The dilution paths become live → concentration of content_bias diffuses → repetition unwinds.
vocab_bias becomes non-trivial → prefix's attenuated signal gets a direct decode-side supplement → 4.11/4.19 have a shot.

Starting from v3.44-rewrite would carry [E] into training, and [E]'s concentration would fight the dilution during training, a known bad regime.

Blocker on the training path

torch.cuda.is_available() = False on this cloud agent VM. No /dev/nvidia*, no nvidia-smi, no CUDA_* env var. The training workstream is blocked on a GPU-enabled instance being attached to this cloud agent.
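A GPU-enabled agent can repeat the same four checks before starting training. The sketch below uses only standard-library calls plus an optional torch import; the function name and report keys are hypothetical.

```python
import glob
import os
import shutil


def gpu_probe():
    """Collect the four signals listed above into one report dict."""
    report = {
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "nvidia_smi": shutil.which("nvidia-smi"),  # None when not on PATH
        "cuda_env": sorted(k for k in os.environ if k.startswith("CUDA_")),
    }
    try:
        import torch  # optional: absent on some CPU-only images
        report["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda"] = None
    return report


report = gpu_probe()
```

On the VM described above, this would be expected to show an empty device_nodes list, nvidia_smi of None, an empty cuda_env list, and torch_cuda of False.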

Once a GPU-enabled agent picks up from SPRINT_CLOSEOUT_v3.46.md, follow Section 5 in that doc. The audit will re-run against ckpt/v346_trained.pt and results will go into a child PR.

Axes (v3.49 runner reporting)

axis	metric	status
A compression	ratio 8.97 / threshold 10.0	FAIL
B injection cost	164224 per-step, O(1) in N	PASS
C fidelity	8/11 / threshold 9	FAIL (pre-training gap on 4.7/4.11/4.19)
D stability	2/3 / threshold all-pass	FAIL (4.21 pre-training gap)

Per SPEC Section 7.7: this PR's audit report frames 4.7 / 4.11 / 4.19 / 4.21 as pre-training axis-C/D gaps, not as channel-absent. [A] attention-pool (4.24 @ 0.9375), [C] cluster-crowding (4.16 retrieval_miss=0), [D] refresh (4.13 bit-identical), [B-revert]+cond-buffer (4.23 rank-of-control = 1) carry their respective axis contributions.

- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis
Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25
First 26-case run to exceed the 17+/-1 eval-time plateau.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…nism hook; audit on v3.44-Trained ckpt: 19/26 pass
Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)
Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
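The startup hook described in this commit can be sketched as follows. The function name and the env parameter are hypothetical; the two torch calls are the ones named in the commit message, and torch is imported lazily so the function is a no-op when the flag is unset.

```python
import os


def configure_determinism(env=os.environ):
    """Apply deterministic settings when AMS_DETERMINISTIC=1; return whether applied."""
    if env.get("AMS_DETERMINISTIC") != "1":
        return False
    import torch  # imported only when the flag is set

    torch.set_num_threads(1)  # single-threaded CPU kernels
    # warn_only=True emits a warning instead of raising on ops without a
    # deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return True


# With the flag absent or set to another value, nothing is changed.
applied_unset = configure_determinism({})
applied_zero = configure_determinism({"AMS_DETERMINISTIC": "0"})
```

Calling this before importing the SUT matches the ordering stated in the commit, so that any module-level tensor work in the SUT already runs under the deterministic settings.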

…ame total, stronger meaning)
SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.
Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric
Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.
Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
    loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
    loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
    per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
  The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.
No SUT code changed (per user constraint). Only runner + spec.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ned encoder by 30% rel
Runner-only change. Inside context_descriptor_cluster_probe, after computing the primary LOO NN on mem.context_descriptor, the runner also computes LOO NN on mem.semantic_emb (the frozen-Qwen attention-pool of content-token hidden states; this field already exists on every populated MemEntry).
Same ckpt/v344_trained.pt, same v3.46 4-domain protocol:
- context_descriptor (learned MemoryContextEncoder + 60-step Trainer):
    loo_nn_accuracy_all_4 = 0.625 (10/16) -- FAIL
    loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
    per-domain: music 1/4, space 2/4, cooking 4/4, finance 3/4
- semantic_emb (frozen Qwen last-layer attention pool, zero trainable params):
    loo_nn_accuracy_all_4 = 0.812 (13/16) -- PASS
    loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
    per-domain: music 3/4, space 3/4, cooking 4/4, finance 3/4
Delta +0.188 absolute (+30% relative). Music domain +0.50.
Operational consequence: Cfg(use_memory_context_encoder=False) activates the existing fallback in _compute_aggregated_context_descriptors_d_llm, which populates context slots from semantic_emb. No SUT code change.
Next audit prediction: 4.24 FAIL -> PASS, total 19/26 -> 20/26.
Overall: 19/26 (same total as v3.46; primary criteria unchanged).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
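For readers unfamiliar with the metric, leave-one-out nearest-neighbor (LOO NN) accuracy can be sketched as below. Cosine similarity is an assumption here (the commit does not state the runner's distance), and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def loo_nn_accuracy(vectors, labels):
    """Fraction of entries whose nearest other entry shares their label."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two entries")
    hits = 0
    for i, query in enumerate(vectors):
        # Hold out entry i and find the most similar remaining entry.
        best_j = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(query, vectors[j]),
        )
        hits += labels[best_j] == labels[i]
    return hits / n


# Toy descriptors: two well-separated domains give a perfect score.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["music", "music", "space", "space"]
toy_accuracy = loo_nn_accuracy(toy_vectors, toy_labels)
```

With 16 entries, as in the 4-domain protocol, 10 correct neighbors gives 0.625 and 13 gives 0.8125, which is where the reported values come from.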

…diction partially refuted)
Training driver train_v348.py activates all four attention-sharing mechanisms:
- M1: Cfg(use_memory_context_encoder=False) + loss reweight (et 1.5->3.0, sa 3.0->1.0, tsa 0.5->0.1, fs 0.4->0.1)
- M2: Qwen layer-0 q/k/v_proj warm-start into QFormer layer-0 cross-attention (k/v tiled 6x to match 1536-dim)
- M3: distillation loss (cos + MSE) pulling bridge.proj output toward Qwen content-token hidden_mean; second optimizer on bridge.proj params only
- M4: bridge.proj.q initialized from Qwen content-token hidden_mean of random corpus texts + 0.005 noise
Runner change: 4.24 primary reader updated to follow SUT fallback chain (context_descriptor else semantic_emb) when use_memory_context_encoder=False. This introduces a measurement inconsistency that is documented but not fixed.
Training: 120 steps, 2685.8s (44.8 min), 22.4 s/step single-threaded.
Final training metrics (vs v3.44-Trained @ 60 steps):
  total_loss: 44.0 -> 17.5 (2.5x deeper)
  recon_loss: 4.8 -> 2.08 (2.3x lower)
  vocab_anchor: -0.22 -> -0.33 (50% deeper)
  bridge cos(Qwen-pool): new signal, peaked at 0.87, sustained 0.77
Audit: 26 cases, 1423.8s, 19/26 pass. Unchanged from v3.46 and v3.47.
Delta analysis:
  4.24 primary all_4: unchanged 0.625 (measurement issue in runner)
  4.24 primary heldout_2: 0.875 -> 0.750 (REGRESSION from M3 target mismatch)
  4.24 diagnostic all_4: 0.812 (matches v3.47 prediction, confirms M1 in principle)
  4.23 median rank: 759 -> 1089 (REGRESSION from M2+M3 pulling tail slot toward Qwen mean)
Mechanism diagnosis:
- M1 (disable learned encoder) works structurally: the diagnostic metric reading mem.semantic_emb achieves 0.812/0.875 LOO NN, same as v3.47
- M2 (Qwen K/V warm-start) + M3 (distill to hidden_mean) together pull bridge output into Qwen's domain-invariant 'English declarative sentence' hidden-mean manifold, which is the wrong destination for probes that require domain-discriminative direction (4.23, 4.24 heldout)
- M4 (pool-init queries) neutral
- Net: +1 (M1) - 2 (M2+M3) = -1 vs v3.47 prediction; observed 19/26
Falsifiable next steps (not in this PR):
- Revert M2+M3, keep M1+M4: predicted 20/26
- Change M3 target to WTE-centroid-of-strict-content-starters: predicted >= 20/26
- Fix 4.24 primary reader to uniformly follow SUT fallback: predicted 20/26 on current ckpt
Artifacts: ckpt/v348_stacked.pt (453 MB, not tracked), ckpt/v348_train_log.jsonl, reports/v348_stacked_blackbox/*.
No SUT code changed (per user constraint).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tution ban
Runner (v331_blackbox_eval.py, context_descriptor_cluster_probe):
- Removes the v3.48 fallback that read mem.semantic_emb when mem.context_descriptor was None (i.e., when the SUT is configured with Cfg(use_memory_context_encoder=False)). This fallback laundered a FAIL-by-API-contract into a numerical-value-lookalike PASS and violated SPEC Section 1.1.3 (no audit-time-only code paths).
- Primary metric now reads MemEntry.context_descriptor literally. If fewer than 8 entries are populated, status is 'not_implemented' (was already so in some paths; now uniformly so for the disabled-encoder case).
- Diagnostic block reading semantic_emb is preserved but now clearly labelled as non-gating and named mechanism_1_qwen_pool_diagnostic. Runs regardless of primary-metric status so mechanism design still has data.
- Bumps metric_version to v3.49.
SPEC (V331_BLACKBOX_TEST_SPEC.md):
- Section 4.24 gains a 'Substitution ban (v3.49+)' paragraph that explicitly forbids substituting any other MemEntry field for the primary metric, and explains why 'follow the SUT's own operational fallback chain' is not a valid justification.
- Section 7.9 added: retraction notice for the v3.48 4.24 primary metric and for any overall pass count that relied on it.
No SUT change. No mocks. No checkpoint deletions.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… inter-domain margin / D deterministic save-load / E top1-exclusive bias / F circuit breaker
Target 7 persistent FAILs in v3.48 audit (4.7/4.11/4.13/4.16/4.19/4.23/4.24).
[A] MemoryContextEncoder: replace single orthogonal Linear with 1-layer attention pool. Q=learnable Parameter(d_ctx); K,V=Linear(d_LLM, 2*d_ctx) over content-token hidden states; residual shortcut via orthogonal proj_wte(wte_centroid) at weight 0.3. write() path passes content hidden states per-batch.
[B] ContentSemanticTailHead.combine_with_residual: slot_1..n-1 = alpha * rare_keyword_residual + beta * LN(tail_head_output), with per-slot learnable beta (init 0.3) and LayerNorm on head_out to bound magnitude. slot_0 stays pure head_out. New Trainer.slot_residual_alignment_loss = relu(floor - cos(slot, residual)) at floor=0.5.
[C] Inter-domain margin: AMM.maybe_recluster triggers KMeans on semantic_emb every mem_recluster_every_writes=4 writes, stamping MemEntry.cluster_id. DirectionTree.retrieve and AMM.retrieve_multi apply retrieval_crowding_lambda=0.15 penalty to cross-cluster entries. Trainer.inter_domain_margin_loss uses same KMeans weak labels for fiber-direction margin (same>=0.6, cross<=0.3).
[D] Deterministic save/load: PrefixAligner._calibrated flag prevents recalibration; save/load iterate mid-sorted; _sorted_set replaces list(set()) on all token-id unions; ContentTokenClassifier exposes SHA256 fingerprint, saved+verified on load; store dump includes SHA256 fingerprint for double-save stability check.
[E] Content bias top-1 exclusive + rest fallback: b = 0.7 * build(top1, floor=0.5) + 0.3 * build(rest, floor=0.2).
[F] CircuitBreaker in MemLLM.generate: records -log P(chosen) per step, baseline = first 3 steps mean. 3 consecutive steps above 1.5 * baseline flip active; 5-step hysteresis. When active, mixture_gate ceiling clamped to 0.3 (only affects mixture path if use_mixture_decoding enabled).
No runner/spec changes. Same SUT entry via AgentMemorySystem.py.
Ready for v3.49-runner audit on fresh-init + trained-ckpt.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
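The [F] circuit breaker described above can be sketched as a small state machine. This is not the SUT's implementation: the class layout is hypothetical, and the "5-step hysteresis" is interpreted here as "release after 5 consecutive steps at or below the threshold", which is an assumption.

```python
class CircuitBreaker:
    """Toy breaker over per-step surprisal, i.e. -log P(chosen token)."""

    def __init__(self, warmup=3, ratio=1.5, trip_after=3,
                 release_after=5, gate_ceiling=0.3):
        self.warmup = warmup
        self.ratio = ratio
        self.trip_after = trip_after
        self.release_after = release_after  # assumed meaning of the hysteresis
        self.gate_ceiling = gate_ceiling
        self.baseline = None
        self.active = False
        self._warmup_vals = []
        self._above = 0
        self._below = 0

    def update(self, neg_log_prob):
        """Record one decode step and return the current active flag."""
        if self.baseline is None:
            # The first `warmup` steps only build the baseline mean.
            self._warmup_vals.append(neg_log_prob)
            if len(self._warmup_vals) == self.warmup:
                self.baseline = sum(self._warmup_vals) / self.warmup
            return self.active
        if neg_log_prob > self.ratio * self.baseline:
            self._above += 1
            self._below = 0
        else:
            self._below += 1
            self._above = 0
        if not self.active and self._above >= self.trip_after:
            self.active = True
        elif self.active and self._below >= self.release_after:
            self.active = False
        return self.active

    def clamp_gate(self, gate):
        """Cap the mixture gate at the ceiling while the breaker is active."""
        return min(gate, self.gate_ceiling) if self.active else gate


breaker = CircuitBreaker()
for nll in [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]:
    breaker.update(nll)
tripped_gate = breaker.clamp_gate(0.8)
```

Because the clamp acts only on the mixture gate, a breaker like this has no effect on decoding unless the mixture path is enabled, consistent with the caveat in the commit.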

…ns diagnostic getters
These pre-existing pure tree-topology inspectors are depended on by probes 4.1 (leaf_capacity_stability) and 4.2 (degenerate_direction_boundary). The rewrite inadvertently dropped them; restored verbatim.
No audit-time-only semantics: max_depth() and leaf_size_violations() only read existing _Node tree structure, which is the same code path the SUT uses at runtime (insert/split/rebalance). §1.1.3 clear.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Total pass: 18/26 (v3.48 stacked-trained was 19/26). Elapsed: 1519 s on CPU. Deterministic mode active.
Head-to-head vs v3.48:
  UP (+2):
    4.24 context_descriptor_cluster_probe (FAIL -> PASS)
    4.16 retrieval_generation_alignment_audit (FAIL -> PASS)
  DOWN (-3):
    4.8 degeneration_quality (PASS -> FAIL)
    4.21 decode_repetition_feedback_probe (PASS -> FAIL)
    4.25 prefix_length_scaling_probe (PASS -> FAIL)
FAIL signatures:
  4.24 -> PASS: loo_nn_all_4 = 0.9375 (15/16), heldout = 1.0 (8/8). [A] attention-pool ctx encoder with residual shortcut produced the intended gain. Primary metric now exceeds v3.48 Qwen-pool diagnostic (0.81) on same corpus, under v3.49 no-substitution rule.
  4.16 -> PASS: diagnoses = {aligned:2, bridge_unused:1, retrieval_miss:0}. [C] inter-domain margin + crowding prevented the music<->space mix on the satellites prompt.
  4.8 -> FAIL: outputs show repetition 'pian pian Chop pian noct pian...'. avg_max_repeat=4.33 (>3) and avg_unique_ratio=0.25. [E] top1-exclusive content_bias at weight 0.7 + floor 0.5 concentrates mass on the dominant memory's top starters, which the repetition guards cannot pull apart at this scale.
  4.21 -> FAIL: same repetition cascade (avg_max_repeat_per_content_token = 4.33, threshold 3). Downstream of the same [E] concentration.
  4.25 -> FAIL: mass_B/mass_A = 1.065, threshold 1.10. [B] residual-dominant tail_slot at fixed alpha=1.5 and beta=0.3 bounds the extra mass from doubling L_mem: extra tail slots now contribute mostly clamped residual + small beta*LN(head), not free head output, so the starter-mass ratio flattens toward 1.0.
Persistent FAILs (unchanged from v3.48):
  4.23 keyword_specific_tail_slot: median_rank = 1402 (was 1089). [B] alignment by cosine is not the same as WTE-rank recovery; the rank metric still reads the post-LN combined slot, which is near residual direction only by cosine, not in the raw logit argmax.
  4.11 retrieval_topk_semantic_shift: both hit counts still 0. prefix continues to route to meta-starters, independent of [C]/[E].
  4.13 save_load_consistency: output_a != output_b still differ; [D] fingerprint-stable save but generate() stochasticity at bf16 not fully pinned.
  4.19 stepwise_label_mass_alignment_audit: label-mass trajectory mis-aligned; cascade of 4.11.
  4.7 semantic_memory_counterfactual_pairs: repetition garbage, same root cause as 4.8/4.21.
Axes (v3.49 runner reporting):
  A compression: ratio 8.97 < 10 FAIL (ctx_desc added floats)
  B injection: 164224 per-step, O(1) in N, PASS
  C fidelity: 6/11, threshold 9 FAIL
  D stability: 1/3 PASS (save_load + decode_repetition FAIL)
SUT fresh-init; no training; no ckpt.
The [A] win validates the attention-pool mechanism design; the DOWN triplet (4.8/4.21/4.25) shows [E]/[B] changes overshot without a counterweight on repetition and mass preservation.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…write)
[#1] Revert [B] residual-dominant tail-slot decomposition.
  Cfg.tail_slot_residual_dominant: True -> False.
  loss_weights['slot_residual_alignment']: 0.3 -> 0.0.
  In v3.44-rewrite the combine_with_residual path produced slot_1 = alpha*residual (L2=1.07) + beta*LN(head_out) (L2=11.76) so LN(head_out) dominated the direction. On fresh init with zero-init slot_heads[1], LN(0) reduces to LayerNorm gamma direction (uniform), which is far from every rare-keyword WTE direction, so 4.23 median_rank went to 1402 (v3.48 baseline 1089). Disabling the decomposition routes EmbBridge.inject back to the additive path: slot_1 = tail_head(fiber) + alpha * residual, which in fresh init equals alpha * residual and points by construction at the rare-keyword centroid direction.
[#3] Refresh rare_keyword_ids at end of write().
  MemLLM.write() now calls self._refresh_rare_keyword_indices() after the last store_mem, so fresh-path and load-path both compute rare_keyword_ids via the same algorithm at the same timing. Pre-patch: write() left MemEntry.rare_keyword_ids=[] (set by store_mem), while load_memory() called _refresh_rare_keyword_indices after loading, leaving model_a and model_b with different rare_keyword_ids for the same mid -> _compute_rare_keyword_wte_residual returned None for model_a (empty lists) and a non-zero tensor for model_b, diverging prefix_cond -> 4.13 FAILs by string-inequality under greedy decoding.
Diagnostic: diag_4_13_rare_keyword_equiv.py verifies after #3 that all per-memory fields (base/fiber/dirn/semantic_emb/context_descriptor/content_token_ids/expanded_content_ids/strict_starter_ids/rare_keyword_ids) are bit-identical between fresh+save and load on corpus_general (the corpus 4.13 writes). The script runs to CLEAN. This does not guarantee 4.13 will PASS -- it only confirms the known source is closed. Remaining sources, if any, live downstream of MemEntry fields in the bridge / aligner / or backbone path.
No changes to:
- [A] attention-pool ctx encoder
- [C] inter-domain margin + cluster crowding
- [E] top1-exclusive content_bias
- [F] circuit breaker (still hooked only to mixture_gate ceiling, use_mixture_decoding=False by default -> still a dead path)
- runner
- SPEC
Scope: exactly two Cfg flags and one call-site added. Structural risk: minimal (one is a revert, one is a timing alignment).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Targets directly hit:
  4.13 save_load_consistency : FAIL -> PASS (outputs bit-identical)
  4.25 prefix_length_scaling : FAIL -> PASS (mass_B/mass_A = 1.543 >= 1.10)
Targets held (no regression from v3.44-rewrite):
  4.24 context_descriptor_cluster_probe: PASS (0.9375 / 1.0)
  4.16 retrieval_generation_alignment_audit: PASS
Targets still FAIL (same as v3.44-rewrite, unaddressed by #1/#3):
  4.23 keyword_specific_tail_slot_probe: median_rank=1402, hit=0
  4.8 / 4.21 / 4.7 : decoder repetition triple (will be addressed by #2)
  4.11 / 4.19 : prefix-token-class mismatch (will be addressed by #5)
Surprising finding on 4.23:
  The diagnostic dump (diag_4_23_slot_direction.py) reveals that bridge._last_tail_slots read by 4.23 does NOT come from prefix_cond - it comes from the SECOND inject call inside _build_contrastive_uncond_prefix, which is called with rare_keyword_wte_residual=None. This overwrites _last_tail_slots and _last_residual with the uncond contrastive prefix's values. The probe has been reading the uncond tail since at least v3.42.
  This is a pre-existing diagnostic-buffer aliasing bug, not a change-#1 regression. It explains why v3.48 (median_rank=1089) and v3.45 (median_rank=1402) both point at whitespace/punct - both are reading tail slots that were rebuilt without rare-keyword residual.
  Fix belongs in a separate PR (write residual to a second buffer in cond path, or snapshot bridge._last_tail_slots before uncond inject).
axis_coverage under v3.49 runner reporting:
  A compression : ratio 8.97 (< 10) FAIL
  B injection : 164224 floats, O(1) PASS
  C fidelity : 7/11 (threshold 9) FAIL
  D stability : 2/3 (4.21 FAIL) FAIL
elapsed: 1508 s on CPU, AMS_DETERMINISTIC=1, fresh init.
This audit validates:
- #1 revert did not regress anything and recovered 4.25 (predicted by the plan's 'LN-bounded extra slot mass' magnitude calculus).
- #3 refresh timing alignment recovered 4.13 (predicted by the plan's 'rare_keyword_ids fresh-vs-load asymmetry' mechanism).
This audit does not validate:
- any claim about 4.23 reachability; 4.23 has a pre-existing aliasing bug that the current plan's change #2 ([B] replacement) cannot fix because the replacement would still be overwritten by the uncond inject call.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Problem: MemLLM.prepare_decode_context calls EmbBridge.inject twice -- once for prefix_cond (with rare_keyword_wte_residual=residual), and then in _build_contrastive_uncond_prefix a second time with rare_keyword_wte_residual=None. Both writes go to the same buffers bridge._last_tail_slots, _last_residual, etc, so the second call clobbers the first. Case 4.23 reads bridge._last_tail_slots AFTER prepare_decode_context returns and therefore always sees the uncond prefix's tail slot, which by construction carries no rare-keyword signal.
Observed: top-5 = [' ', ',', '.', ' (', '1'] on both v3.44-rewrite (median_rank=1402) and v3.48 (median_rank=1089); neither number tells us anything about whether the cond-path tail carries rare-keyword information.
Minimal fix, strict scope:
SUT (scheme_b_v344.py):
- EmbBridge.__init__: add _last_cond_fiber_summary / _last_cond_tail_slots / _last_cond_context_slot / _last_cond_tail_pre_renorm / _last_cond_residual / _last_cond_inject_diag (all None or {}).
- EmbBridge.inject signature: + is_cond_path: bool = True
- EmbBridge.inject epilogue: when is_cond_path=True, mirror self._last_* into self._last_cond_*. When False, only the shared _last_* are written (unchanged).
- MemLLM._build_contrastive_uncond_prefix: pass is_cond_path=False on its inject call. Default True everywhere else covers training and the main prefix_cond path.
Runner (v331_blackbox_eval.py):
- keyword_specific_tail_slot_probe: add local helper _get_tail_slots_cond_preferred that returns bridge._last_cond_tail_slots if present, else bridge._last_tail_slots. Used in both paths (roundtrip and paraphrase).
- Emit 'tail_slots_source' in the probe return payload so the audit report records which buffer was actually read.
- metric_version bumped to v3.50 to mark the source change.
No Cfg change. No algorithm change. No SPEC change.
Training path untouched (defaults to is_cond_path=True, which mirrors to _last_cond_*; since audit probes always re-run prepare_decode_context before reading, training-time mirror state is never observed by audit code).
Pre-audit verification (diag_4_23_cond_buffer.py):
  query 1: She performed Beethoven sonatas with delicate phrasing...
    _last_tail_slots slot_1 L2=0.0000 top5=[' ', ',', '.', ' (', '1']
    _last_cond_tail_slots slot_1 L2=1.0251 top5=[' control', ' Control', '控制', 'control', 'Control']
    rank of 'control' = 1 (was 1402)
    top20 ∩ rare_dom = {2524} size=1
  query 2: Harmonic analysis and ear training...
    same pattern, rank of 'control' = 1
This is sufficient to make 4.23 measurable. Whether 4.23 PASSes under the primary metric is now a function of the cond-path algorithm, not of which buffer the probe happens to read.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
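The clobber and the mirror fix can be reproduced in a toy model. ToyBridge and the two free functions below are hypothetical stand-ins, not the SUT classes; only the attribute names and the is_cond_path flag follow the commit, and the head output is fixed at zero to mimic fresh init.

```python
class ToyBridge:
    """Minimal stand-in for a bridge with shared and cond-only debug buffers."""

    def __init__(self):
        self._last_tail_slots = None
        self._last_cond_tail_slots = None

    def inject(self, fiber, rare_keyword_wte_residual=None, is_cond_path=True):
        residual = rare_keyword_wte_residual
        if residual is None:
            residual = [0.0] * len(fiber)
        head_out = [0.0] * len(fiber)  # fresh-init head contributes nothing
        tail = [h + r for h, r in zip(head_out, residual)]
        self._last_tail_slots = tail            # shared buffer: always written
        if is_cond_path:
            self._last_cond_tail_slots = tail   # mirror: cond path only
        return tail


def prepare_decode_context(bridge, fiber, residual):
    """Two inject calls, cond then uncond, as described in the problem statement."""
    cond = bridge.inject(fiber, rare_keyword_wte_residual=residual)
    uncond = bridge.inject(fiber, rare_keyword_wte_residual=None,
                           is_cond_path=False)
    return cond, uncond


def tail_slots_cond_preferred(bridge):
    """Read the cond mirror if it exists, else fall back to the shared buffer."""
    cond = bridge._last_cond_tail_slots
    return cond if cond is not None else bridge._last_tail_slots


bridge = ToyBridge()
prepare_decode_context(bridge, fiber=[1.0, 2.0], residual=[0.6, -0.8])
shared_view = bridge._last_tail_slots          # overwritten by the uncond call
cond_view = tail_slots_cond_preferred(bridge)  # still the cond-path tail
```

After prepare_decode_context returns, the shared buffer holds the all-zero uncond tail while the mirror keeps the residual-bearing cond tail, which is the same split the pre-audit diagnostic shows (L2=0.0000 versus L2=1.0251).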

… regressions)
4.23 FAIL -> PASS. Primary metric numbers under the corrected buffer:
  tail_slots_source = bridge._last_cond_tail_slots (new)
  mean_intersection_size_top20_paraphrase = 1.0 (threshold >= 1.0)
  median_rank_of_best_rare_paraphrase = 1.0 (threshold <= 100.0)
  hit_ratio_at_least_one_top20_paraphrase = 1.0 (threshold >= 0.5)
  n_paraphrase_queries_evaluated = 2
This matches the pre-audit diag_4_23_cond_buffer.py output:
  rank of ' control' = 1 on both paraphrases
  top-5 centered = [' control', ' Control', '控制', 'control', 'Control']
  top20 intersect rare_dom = {2524}
The result validates the causal claim made when the aliasing bug was identified in the v3.45-revertB-refreshD audit: reverting [B] (cfg tail_slot_residual_dominant=False) was a prerequisite for 4.23 reachability, but the uncond-inject buffer clobber was blocking the measurement entirely. Both together are required.
axis coverage v3.49 runner reporting:
  A compression: 8.97 / 10.0 FAIL
  B injection: 164224 per-step PASS (O(1) in N)
  C fidelity: 8/11 / 9 FAIL (was 7/11, 4.23 added)
  D stability: 2/3 FAIL (4.21 still FAIL)
Remaining FAILs, unchanged from the prior audit:
  4.7 semantic_memory_counterfactual_pairs (repetition garbage)
  4.8 degeneration_quality (repetition, same root as 4.7)
  4.11 retrieval_topk_semantic_shift (prefix to meta-starter mismatch)
  4.19 stepwise_label_mass_alignment_audit (cascade of 4.11)
  4.21 decode_repetition_feedback_probe (repetition, same root as 4.7/4.8)
These five are the cases that plan #2 (narrow E) and #5 (rare_keyword floor) were designed to address. They are independent of the 4.23 fix in this PR.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Single Cfg flip: use_top1_exclusive_content_bias: True -> False
Rationale (from v3.45-cond-buffer post-audit review):
[E] was the sole cause of the 4.7 / 4.8 / 4.21 regressions introduced in v3.44-rewrite. With top1_weight=0.7 and top1_relevance_floor=0.5 enabled the content_bias on ~8 top-1 tokens reached ~+22 logit while content_repeat_penalty=2.5 only wins the decode race at k>=10, but cyclic_content_max_count=5 hard-masks at k=5, leaving 5 steps where the bias outran anti-repetition. Observed output: 'The pianist practiced pian pian Chop pian noct pian midnight Chop Chop noct'
[E] was originally credited (alongside [C]) with the 4.16 flip from v3.48 to v3.44-rewrite. Re-examination of 4.16's diag (retrieval_miss=0, retrieved_majority correct on space prompt) shows the flip is entirely attributable to [C] cluster-crowding at the retrieval stage, which does not depend on [E]. [E] was adding concentration on top of an already-fixed retrieval, and the concentration broke the decode-race balance.
Revert restores the aggregated top-k path that v3.48 and v3.42 used, both of which PASSed 4.7 / 4.8 / 4.21.
This is a revert, not a new mechanism, and it does not touch:
- [A] attention-pool ctx encoder (4.24 carrier)
- [C] inter-domain margin + retrieval crowding (4.16 carrier)
- [D] write-time rare_keyword refresh (4.13 carrier)
- [B-revert] combine_with_residual disabled (4.23 + 4.25 carriers)
- [v3.45 cond-buffer] cond-path inject mirror (4.23 measurability)
4.11 / 4.19 are not addressed in this revert. Fresh-init prefix cannot transmit lexical content through 28 Qwen layers at a magnitude above the modal-starter baseline without training; that is a channel-level pre-training gap, not a Cfg-level fix. If 4.11 / 4.19 remain FAIL under this revert, the plan per SPEC Section 7.7 is to report them as 'pre-training gap on axis C' rather than add test-directed decode-time mechanisms.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Revert [E] was a structural cleanup (removing a test-directed Cfg addition from v3.44-rewrite) but did not move any primary metric number in the fresh-init audit. All 5 FAILs from v3.45-cond-buffer remain FAIL with matching signatures.

Prediction postmortem: I claimed 'v3.48 baseline had 4.7/4.8/4.21 all PASS under top1_exclusive=False, so the revert returns to a known-viable point'. This is wrong: v3.48 was 120-step trained, v3.46 is fresh init. Same Cfg, different model state.

The fresh-init repetition signature (4.7/4.8/4.21):
  4.8  avg_unique_token_ratio = 0.343 (threshold >= 0.35, diff 0.007)
  4.21 avg_max_repeat = 4.67 (threshold <= 3, diff 1.67)
  4.7  music_margin = space_margin = 0 (threshold > 0)

Root cause identified: fresh init has tail_head.slot_heads[1] = zero-init and vocab_proj = zero-init. Both are the learned dilution mechanisms that would distribute content_bias mass across the vocabulary during decode. Without them, aggregated top-k content_bias concentrates on ~12 keywords (music/space corpus word set) and content_repeat_penalty = 2.5 * k only wins at k >= 6-7, while generation length is 20 steps -- the race stays locked in repetition. This is not a Cfg-level bug; it is what an untrained channel looks like.

Confirmed: 4.16 is carried entirely by [C] cluster-crowding, not by [E]. After [E] revert, 4.16 still PASSes (retrieval_miss=0), which matches the v3.44-rewrite diag (aligned:2, retrieval_miss:0, bridge_unused:1).

All five remaining FAILs (4.7 4.8 4.11 4.19 4.21) are identified as pre-training gaps:
  axis C pre-training gap: 4.7 4.8 4.11 4.19
  axis D pre-training gap: 4.21 (repetition race same root as above)

This branch is the cleanest fresh-init starting point for running the Trainer: [A] attn-pool + [C] cluster-crowding + [D] refresh-timing + [B-revert] + cond-buffer, without the v3.44-rewrite [E] test-directed addition.

GPU status: not available in this cloud agent VM. Training is blocked on a GPU-enabled instance being configured for this agent at cursor.com/onboard. Audit iteration on fresh init has reached its ceiling at 21/26.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
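The decode race described in this commit message can be sketched as a comparison between a fixed content_bias logit and a repeat penalty that grows as 2.5 * k. The sketch below is only an illustration under a purely linear model: first_winning_repeat is a hypothetical helper, the bias values are made-up inputs rather than measured logits, and the real decode path may combine these terms differently.

```python
from typing import Optional


def first_winning_repeat(
    bias_logit: float,
    penalty_per_repeat: float = 2.5,  # content_repeat_penalty from the commit message
    max_k: int = 20,                  # generation length quoted in the commit message
) -> Optional[int]:
    """Smallest repeat count k at which penalty_per_repeat * k exceeds the bias.

    Returns None if the penalty never overtakes the bias within max_k repeats.
    Assumption: penalty grows linearly in k and the bias is constant.
    """
    for k in range(1, max_k + 1):
        if penalty_per_repeat * k > bias_logit:
            return k
    return None


# Hypothetical bias magnitudes, chosen only to show the shape of the race.
for bias in (14.0, 16.0, 60.0):
    print(f"bias={bias:5.1f} -> penalty wins at k={first_winning_repeat(bias)}")
```

In this model a bias of 14.0 or 16.0 is overtaken at k=6 or k=7 respectively, and a bias of 60.0 is never overtaken within 20 steps. The point of the sketch is that every repeat before the crossover is a step where the bias still wins.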

Self-contained context document for a new cloud agent with a GPU-enabled instance to pick up from where this CPU-only sprint finishes.

Covers:
- Current state (v3.46, 21/26 fresh-init ceiling, which mechanisms are carriers of which audit passes)
- Sprint timeline (v3.44-rewrite -> v3.45-revertB-refreshD -> v3.45-cond-buffer -> v3.46) with branch names, PR numbers, audit deltas, and per-change root cause
- Five prediction errors made during the sprint, categorized into unit mismatch / scope mismatch / magnitude blindness / regression blindness / dead-path errors
- Three anti-patterns to avoid (threshold chasing, decode-time metric patching, dead-Cfg-path mechanisms)
- Five remaining FAILs (4.7 / 4.8 / 4.11 / 4.19 / 4.21) root-caused to two zero-init dilution paths (tail_head.slot_heads[1] and vocab_proj.proj[-1]) that only training can activate
- Training protocol: train_v346.py skeleton, checkpoint location, audit re-run command sequence, what NOT to change post-training
- Explicit list of open PRs (#23-#27) and suggested child-branch naming for the GPU agent
- Sanity prompts to run before starting training
- Scope limits: no Delta prediction, no 'channel works' phrasing, no post-hoc Cfg tuning unless it is a revert with structural justification

No SUT/runner/SPEC changes in this commit. Pure documentation.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Child PR of #27.

Training driver train_v346.py run for 60 steps on NVIDIA H200 (vast.ai), elapsed 335 s, mechanism observables per §5.6 moved into target range (tail_head slot1 |w|_mean: 0 -> 7.30e-4; vocab_proj |w|_mean: 0 -> 5.49e-4, both in [1e-4, 1e-2]). Necessary conditions met; sufficient: not.

Audit with AMS_TRAINED_WEIGHTS=ckpt/v346_trained.pt, AMS_DETERMINISTIC=1, elapsed 1250 s.

Results (as data, per SPEC §7.7 norm, no Delta-pass-count was predicted):

PASS 18, FAIL 8 (was 21, 5).
Zero cases flipped FAIL -> PASS.
Three cases flipped PASS -> FAIL:
- 4.17 retrieval_prefix_decode_correlation_audit (prefix_l2_shift = 3.22e+11, correlation undefined -- trained prefix magnitude blew up)
- 4.20 rerank_stability_probe (space_P2 jaccard 0.429 < 0.6)
- 4.25 prefix_length_scaling_probe (L_mem 8->16 reduces starter mass to 0.82x, probe requires >1.10x)

Regressions 4.8/4.21 also got worse: 'The pianist' unique_ratio 0.343 -> 0.296, avg_max_repeat 4.67 -> 5.0.

Axis C: 8/11 -> 6/11. Axis D: 2/3 -> 1/3.

Structural read (§1.5): 60 steps on 12-text corpus with semantic_alignment weight 3.0 and no prefix-norm constraint caused the ctx encoder to saturate prefix magnitude while tail/vocab paths gained just enough weight to reinforce the corpus's own repetition pattern. This is §5.7 option-A territory (pre-amplification gap) confirmed with data rather than predicted.

Artifacts committed:
- reports/v346_trained_blackbox/report.{json,md}
- reports/v346_trained_blackbox/stdout.log
- reports/v346_trained_blackbox/train_log.jsonl
- reports/v346_trained_blackbox/train_stdout.log

No Cfg changes (§5.4), no Trainer loss additions (§5.4). ckpt/v346_trained.pt is git-ignored per existing ckpt/*.pt rule; provenance recorded in the torch.save blob and in report metadata.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
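The mechanism-observable check quoted above (|w|_mean inside [1e-4, 1e-2]) can be sketched as follows. This is a sketch only: abs_mean and in_target_range are hypothetical helper names, plain Python floats stand in for the actual weight tensors, and the real check in train_v346.py or the SPEC may be computed differently.

```python
from typing import Iterable


def abs_mean(weights: Iterable[float]) -> float:
    """Mean absolute value of a flat sequence of weights."""
    values = list(weights)
    if not values:
        raise ValueError("abs_mean requires at least one weight")
    return sum(abs(w) for w in values) / len(values)


def in_target_range(value: float, low: float = 1e-4, high: float = 1e-2) -> bool:
    """True if value lies inside the closed target range [low, high]."""
    return low <= value <= high


# Values reported in the commit message; 0.0 is the fresh-init (zero-init) state.
reported = {
    "tail_head slot1 (fresh init)": 0.0,
    "tail_head slot1 (trained)": 7.30e-4,
    "vocab_proj (trained)": 5.49e-4,
}
for name, value in reported.items():
    print(f"{name}: |w|_mean={value:.2e} in_range={in_target_range(value)}")
```

Both trained values fall inside the range and the zero-init value does not. As the commit message states, being in range is only a necessary condition; the audit results show it was not sufficient.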

cursoragent and others added 16 commits April 20, 2026 15:32

FluffyAIcode mentioned this pull request Apr 21, 2026

v3.46-trained: 60-step training + audit (18/26, −3 vs fresh 21/26) #28

Draft

FluffyAIcode wants to merge 16 commits into main from AgentMemory/v346-revertE-topk-nonexclusive-7e97


