v3.45 (staged): revert [B], align rare_keyword refresh timing on write [draft pre-audit] by FluffyAIcode · Pull Request #25 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-21T15:22:26Z

Audit result: 20/26 pass (elapsed 1508 s)

Staged PR implementing only #1 (revert [B]) and #3 (refresh rare_keywords on write) of the v3.45 plan.

Head-to-head

case	v3.48	v3.44-rewrite	v3.45 (this PR)	delta vs v3.44r
4.13 save_load_consistency	FAIL	FAIL	PASS	[+]
4.25 prefix_length_scaling_probe	PASS	FAIL	PASS	[+]
4.24 context_descriptor_cluster_probe	FAIL	PASS	PASS	.
4.16 retrieval_generation_alignment_audit	FAIL	PASS	PASS	.
4.23 keyword_specific_tail_slot_probe	FAIL	FAIL	FAIL	.
4.8 degeneration_quality	PASS	FAIL	FAIL	.
4.21 decode_repetition_feedback_probe	PASS	FAIL	FAIL	.
4.7 semantic_memory_counterfactual_pairs	FAIL	FAIL	FAIL	.
4.11 retrieval_topk_semantic_shift	FAIL	FAIL	FAIL	.
4.19 stepwise_label_mass_alignment_audit	FAIL	FAIL	FAIL	.

Totals: v3.48 = 19/26, v3.44-rewrite = 18/26, v3.45 = 20/26.

Primary metric numbers

4.13 (pass condition: out_a == out_b under greedy):

output_a: The pianist hours piano piano practiced piano noct piano perfect difficult noct practiced practiced hours hours noct noct difficult difficult
output_b: The pianist hours piano piano practiced piano noct piano perfect difficult noct practiced practiced hours hours noct noct difficult difficult

Bit-identical. Consistent with diag_4_13_rare_keyword_equiv.py reporting CLEAN on all MemEntry fields.

4.25 (pass condition: avg_mass_ratio_B_over_A > 1.10):

L_mem_A = 8, L_mem_B = 16
avg_mass_ratio_B_over_A = 1.542808217358133

Matches the plan's magnitude prediction: without LN bounding each extra tail slot, the additive path tail_head + alpha*residual lets extra slots contribute unbounded mass, recovering the v3.48 regime.
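
The magnitude argument can be illustrated with a toy calculation. This is a minimal sketch, not the repository's `ContentSemanticTailHead`: the helper names, the 8-dimensional vectors, the zero residual, and the unit-gain LayerNorm are assumptions made for illustration. Only alpha = 1.5 and beta = 0.3 are taken from the commit messages. It shows that the additive path's slot norm grows with the head output, whereas the LN path pins it near beta * sqrt(d).

```python
import math


def l2(x):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(v * v for v in x))


def layer_norm(x, eps=1e-5):
    """LayerNorm with unit gain and zero bias (an assumption for this toy)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def additive_slot(head_out, residual, alpha):
    """Additive path: slot = tail_head + alpha * residual."""
    return [h + alpha * r for h, r in zip(head_out, residual)]


def ln_bounded_slot(head_out, residual, alpha, beta):
    """LN path: slot = alpha * residual + beta * LN(tail_head)."""
    normed = layer_norm(head_out)
    return [alpha * r + beta * n for r, n in zip(residual, normed)]


# Zero residual isolates the effect of the head-output magnitude.
residual = [0.0] * 8
head_small = [0.1 * i for i in range(-4, 4)]
head_large = [10.0 * v for v in head_small]

additive_growth = (
    l2(additive_slot(head_large, residual, 1.5))
    / l2(additive_slot(head_small, residual, 1.5))
)
bounded_growth = (
    l2(ln_bounded_slot(head_large, residual, 1.5, 0.3))
    / l2(ln_bounded_slot(head_small, residual, 1.5, 0.3))
)

print(f"additive path norm growth: {additive_growth:.3f}")
print(f"LN-bounded path norm growth: {bounded_growth:.3f}")
```

Scaling the head output by 10 scales the additive slot by 10 but leaves the LN-bounded slot essentially unchanged, which is the flattening toward a ratio of 1.0 that the v3.44-rewrite audit reported for 4.25.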

4.24 (unchanged from v3.44-rewrite, held):

loo_nn_accuracy_all_4  = 0.9375  (15/16)
loo_nn_accuracy_heldout_2 = 1.0  (8/8)
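
For readers unfamiliar with the metric, leave-one-out nearest-neighbour (LOO NN) accuracy can be sketched as below. The function names and toy vectors are hypothetical, and cosine similarity is an assumption; the runner's actual distance function is not stated in this PR.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def loo_nn_accuracy(vectors, labels):
    """For each item, find its nearest neighbour among all OTHER items
    and count a hit when that neighbour carries the same label."""
    hits = 0
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = max(others, key=lambda j: cosine(v, vectors[j]))
        if labels[nearest] == labels[i]:
            hits += 1
    return hits / len(vectors)


# Toy descriptors: two well-separated domains, two entries each.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["music", "music", "space", "space"]
toy_accuracy = loo_nn_accuracy(toy_vectors, toy_labels)
print(toy_accuracy)
```

Under this definition the reported 0.9375 corresponds to 15 of 16 entries whose nearest other entry shares their domain label.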

Surprising finding on 4.23

I ran diag_4_23_slot_direction.py after seeing 4.23 still at median_rank = 1402. It reveals:

[4.23 diag] bridge._last_residual: shape=None
[4.23 diag] bridge._last_tail_slots[0, s=1]  L2 = 0.0000

prepare_decode_context calls bridge.inject(..., rare_keyword_wte_residual=residual) (cond path) and then calls _build_contrastive_uncond_prefix which calls bridge.inject(..., rare_keyword_wte_residual=None) (uncond path). The second inject overwrites bridge._last_tail_slots and _last_residual with the uncond prefix's buffers, which were built without the residual. The 4.23 probe reads these buffers after both injects return, so it has always been reading the uncond tail.

This is a pre-existing diagnostic-buffer aliasing bug (same behavior on v3.42 / v3.44 / v3.48 — v3.48's median_rank=1089 and v3.45's median_rank=1402 both show top-5 = punctuation/whitespace). It is not caused by change #1 and cannot be fixed by change #2 as originally drafted. A separate fix is required: write residual/tail slots to a dedicated cond-only buffer in EmbBridge.inject before the uncond call overwrites the shared buffer.
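
The aliasing and one possible shape of the fix can be reproduced in a few lines. This is a toy stand-in, not `EmbBridge`: the class, the list-based "tensors", and the `_last_cond_tail_slots` attribute are hypothetical, and treating "residual is not None" as the marker for the cond path is an assumption.

```python
class ToyBridge:
    """Toy stand-in for a bridge that keeps diagnostic buffers."""

    def __init__(self):
        self._last_tail_slots = None       # shared buffer (aliased)
        self._last_cond_tail_slots = None  # hypothetical cond-only buffer

    def inject(self, fiber, rare_keyword_wte_residual=None):
        if rare_keyword_wte_residual is None:
            slots = list(fiber)
        else:
            slots = [f + r for f, r in zip(fiber, rare_keyword_wte_residual)]
        # Every call overwrites the shared buffer.
        self._last_tail_slots = slots
        # Sketch of the fix: snapshot only when the residual was applied.
        if rare_keyword_wte_residual is not None:
            self._last_cond_tail_slots = list(slots)
        return slots


def prepare_decode_context(bridge, fiber, residual):
    """Cond inject followed by uncond inject, as described above."""
    cond = bridge.inject(fiber, rare_keyword_wte_residual=residual)
    uncond = bridge.inject(fiber, rare_keyword_wte_residual=None)
    return cond, uncond


bridge = ToyBridge()
cond, uncond = prepare_decode_context(
    bridge, fiber=[0.0, 0.0, 0.0], residual=[1.0, 2.0, 3.0]
)

# A probe that reads the shared buffer sees the uncond tail (all zeros here).
print(bridge._last_tail_slots)
# A probe that reads the cond-only buffer sees the residual-bearing tail.
print(bridge._last_cond_tail_slots)
```

With a zero fiber, the shared buffer ends up with zero L2 norm after both injects, which matches the `L2 = 0.0000` line in the diagnostic output.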

Axis coverage (v3.49 runner)

axis	metric	pass
A compression	ratio 8.97 / threshold 10.0	FAIL
B injection cost	164224 floats/step, O(1) in N	PASS
C fidelity	7/11 / threshold 9	FAIL
D stability	2/3 (4.21 FAIL)	FAIL (axis requires all_pass=True)

What this audit validated

#1 (revert [B]) delivered 4.25 recovery via the magnitude mechanism in the plan: the additive path lets extra tail slots contribute unbounded mass under L_mem scaling, while combine_with_residual had bounded their L2 via LN.
#3 (refresh timing alignment) delivered 4.13 via the mechanism in the plan: under identical MemEntry contents, generate(greedy) produces identical outputs.
Neither change regressed any prior-PASS case.

What this audit did NOT validate

Any claim about 4.23 reachability via any change that modifies how the residual is combined into the tail slot. As long as the second inject (uncond path) overwrites the shared diagnostic buffers, the probe reads the wrong tensor. This must be fixed separately before attempting #2's [B] replacement.

Next decision point

Two candidate paths; no commitment:

A. Proceed with #2 (narrow [E], raise repeat-penalty). Targets 4.8/4.21/4.7. Does not touch 4.23.
B. Fix the 4.23 diagnostic-buffer aliasing first (new PR, no Cfg change), to make 4.23 measurable, then re-audit, then decide whether #2/#5 are still needed.

Awaiting direction.

- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ame total, stronger meaning)

SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.

Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric

Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.

Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
    loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
    loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
    per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
  The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.

No SUT code changed (per user constraint). Only runner + spec.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…ned encoder by 30% rel

Runner-only change. Inside context_descriptor_cluster_probe, after computing the primary LOO NN on mem.context_descriptor, the runner also computes LOO NN on mem.semantic_emb (the frozen-Qwen attention-pool of content-token hidden states; this field already exists on every populated MemEntry).

Same ckpt/v344_trained.pt, same v3.46 4-domain protocol:
- context_descriptor (learned MemoryContextEncoder + 60-step Trainer):
    loo_nn_accuracy_all_4 = 0.625 (10/16) -- FAIL
    loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
    per-domain: music 1/4, space 2/4, cooking 4/4, finance 3/4
- semantic_emb (frozen Qwen last-layer attention pool, zero trainable params):
    loo_nn_accuracy_all_4 = 0.812 (13/16) -- PASS
    loo_nn_accuracy_heldout_2 = 0.875 (7/8) -- pass
    per-domain: music 3/4, space 3/4, cooking 4/4, finance 3/4

Delta +0.188 absolute (+30% relative). Music domain +0.50.

Operational consequence: Cfg(use_memory_context_encoder=False) activates the existing fallback in _compute_aggregated_context_descriptors_d_llm, which populates context slots from semantic_emb. No SUT code change.

Next audit prediction: 4.24 FAIL -> PASS, total 19/26 -> 20/26.
Overall: 19/26 (same total as v3.46; primary criteria unchanged).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…diction partially refuted)

Training driver train_v348.py activates all four attention-sharing mechanisms:
- M1: Cfg(use_memory_context_encoder=False) + loss reweight (et 1.5->3.0, sa 3.0->1.0, tsa 0.5->0.1, fs 0.4->0.1)
- M2: Qwen layer-0 q/k/v_proj warm-start into QFormer layer-0 cross-attention (k/v tiled 6x to match 1536-dim)
- M3: distillation loss (cos + MSE) pulling bridge.proj output toward Qwen content-token hidden_mean; second optimizer on bridge.proj params only
- M4: bridge.proj.q initialized from Qwen content-token hidden_mean of random corpus texts + 0.005 noise

Runner change: 4.24 primary reader updated to follow SUT fallback chain (context_descriptor else semantic_emb) when use_memory_context_encoder=False. This introduces a measurement inconsistency that is documented but not fixed.

Training: 120 steps, 2685.8s (44.8 min), 22.4 s/step single-threaded.

Final training metrics (vs v3.44-Trained @ 60 steps):
  total_loss: 44.0 -> 17.5 (2.5x deeper)
  recon_loss: 4.8 -> 2.08 (2.3x lower)
  vocab_anchor: -0.22 -> -0.33 (50% deeper)
  bridge cos(Qwen-pool): new signal, peaked at 0.87, sustained 0.77

Audit: 26 cases, 1423.8s, 19/26 pass. Unchanged from v3.46 and v3.47.

Delta analysis:
  4.24 primary all_4: unchanged 0.625 (measurement issue in runner)
  4.24 primary heldout_2: 0.875 -> 0.750 (REGRESSION from M3 target mismatch)
  4.24 diagnostic all_4: 0.812 (matches v3.47 prediction, confirms M1 in principle)
  4.23 median rank: 759 -> 1089 (REGRESSION from M2+M3 pulling tail slot toward Qwen mean)

Mechanism diagnosis:
- M1 (disable learned encoder) works structurally: the diagnostic metric reading mem.semantic_emb achieves 0.812/0.875 LOO NN, same as v3.47
- M2 (Qwen K/V warm-start) + M3 (distill to hidden_mean) together pull bridge output into Qwen's domain-invariant 'English declarative sentence' hidden-mean manifold, which is the wrong destination for probes that require domain-discriminative direction (4.23, 4.24 heldout)
- M4 (pool-init queries) neutral
- Net: +1 (M1) - 2 (M2+M3) = -1 vs v3.47 prediction; observed 19/26

Falsifiable next steps (not in this PR):
- Revert M2+M3, keep M1+M4: predicted 20/26
- Change M3 target to WTE-centroid-of-strict-content-starters: predicted >= 20/26
- Fix 4.24 primary reader to uniformly follow SUT fallback: predicted 20/26 on current ckpt

Artifacts: ckpt/v348_stacked.pt (453 MB, not tracked), ckpt/v348_train_log.jsonl, reports/v348_stacked_blackbox/*.

No SUT code changed (per user constraint).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tution ban

Runner (v331_blackbox_eval.py, context_descriptor_cluster_probe):
- Removes the v3.48 fallback that read mem.semantic_emb when mem.context_descriptor was None (i.e., when the SUT is configured with Cfg(use_memory_context_encoder=False)). This fallback laundered a FAIL-by-API-contract into a numerical-value-lookalike PASS and violated SPEC Section 1.1.3 (no audit-time-only code paths).
- Primary metric now reads MemEntry.context_descriptor literally. If fewer than 8 entries are populated, status is 'not_implemented' (was already so in some paths; now uniformly so for the disabled-encoder case).
- Diagnostic block reading semantic_emb is preserved but now clearly labelled as non-gating and named mechanism_1_qwen_pool_diagnostic. Runs regardless of primary-metric status so mechanism design still has data.
- Bumps metric_version to v3.49.

SPEC (V331_BLACKBOX_TEST_SPEC.md):
- Section 4.24 gains a 'Substitution ban (v3.49+)' paragraph that explicitly forbids substituting any other MemEntry field for the primary metric, and explains why 'follow the SUT's own operational fallback chain' is not a valid justification.
- Section 7.9 added: retraction notice for the v3.48 4.24 primary metric and for any overall pass count that relied on it.

No SUT change. No mocks. No checkpoint deletions.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… inter-domain margin / D deterministic save-load / E top1-exclusive bias / F circuit breaker

Target 7 persistent FAILs in v3.48 audit (4.7/4.11/4.13/4.16/4.19/4.23/4.24).

[A] MemoryContextEncoder: replace single orthogonal Linear with 1-layer attention pool. Q=learnable Parameter(d_ctx); K,V=Linear(d_LLM, 2*d_ctx) over content-token hidden states; residual shortcut via orthogonal proj_wte(wte_centroid) at weight 0.3. write() path passes content hidden states per-batch.

[B] ContentSemanticTailHead.combine_with_residual: slot_1..n-1 = alpha * rare_keyword_residual + beta * LN(tail_head_output), with per-slot learnable beta (init 0.3) and LayerNorm on head_out to bound magnitude. slot_0 stays pure head_out. New Trainer.slot_residual_alignment_loss = relu(floor - cos(slot, residual)) at floor=0.5.

[C] Inter-domain margin: AMM.maybe_recluster triggers KMeans on semantic_emb every mem_recluster_every_writes=4 writes, stamping MemEntry.cluster_id. DirectionTree.retrieve and AMM.retrieve_multi apply retrieval_crowding_lambda=0.15 penalty to cross-cluster entries. Trainer.inter_domain_margin_loss uses same KMeans weak labels for fiber-direction margin (same>=0.6, cross<=0.3).

[D] Deterministic save/load: PrefixAligner._calibrated flag prevents recalibration; save/load iterate mid-sorted; _sorted_set replaces list(set()) on all token-id unions; ContentTokenClassifier exposes SHA256 fingerprint, saved+verified on load; store dump includes SHA256 fingerprint for double-save stability check.

[E] Content bias top-1 exclusive + rest fallback: b = 0.7 * build(top1, floor=0.5) + 0.3 * build(rest, floor=0.2).

[F] CircuitBreaker in MemLLM.generate: records -log P(chosen) per step, baseline = first 3 steps mean. 3 consecutive steps above 1.5 * baseline flip active; 5-step hysteresis. When active, mixture_gate ceiling clamped to 0.3 (only affects mixture path if use_mixture_decoding enabled).

No runner/spec changes. Same SUT entry via AgentMemorySystem.py.
Ready for v3.49-runner audit on fresh-init + trained-ckpt.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
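
The [F] trip rule can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the `CircuitBreaker` in `MemLLM.generate`: the class and attribute names are hypothetical, and "5-step hysteresis" is interpreted here as "release after 5 consecutive steps at or below the threshold", which the commit message does not spell out.

```python
class ToyCircuitBreaker:
    """Trips when -log P(chosen) stays above ratio * baseline."""

    def __init__(self, warmup=3, ratio=1.5, trip_after=3, release_after=5):
        self.warmup = warmup
        self.ratio = ratio
        self.trip_after = trip_after
        self.release_after = release_after  # assumed meaning of hysteresis
        self._warmup_values = []
        self.baseline = None
        self._above = 0
        self._below = 0
        self.active = False

    def update(self, neg_log_p):
        """Record one decoding step and return the current active flag."""
        if self.baseline is None:
            # Baseline is the mean of the first `warmup` steps.
            self._warmup_values.append(neg_log_p)
            if len(self._warmup_values) == self.warmup:
                self.baseline = sum(self._warmup_values) / self.warmup
            return self.active
        if neg_log_p > self.ratio * self.baseline:
            self._above += 1
            self._below = 0
        else:
            self._below += 1
            self._above = 0
        if not self.active and self._above >= self.trip_after:
            self.active = True
        elif self.active and self._below >= self.release_after:
            self.active = False
        return self.active


breaker = ToyCircuitBreaker()
steps = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
states = [breaker.update(s) for s in steps]
print(states)
```

In this toy trace the breaker trips on the third consecutive step above 1.5 times the baseline and releases on the fifth consecutive calm step. A caller would clamp the mixture-gate ceiling while `active` is true.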

…ns diagnostic getters

These pre-existing pure tree-topology inspectors are depended on by probes 4.1 (leaf_capacity_stability) and 4.2 (degenerate_direction_boundary). The rewrite inadvertently dropped them; restored verbatim.

No audit-time-only semantics: max_depth() and leaf_size_violations() only read existing _Node tree structure, which is the same code path the SUT uses at runtime (insert/split/rebalance). §1.1.3 clear.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Total pass: 18/26 (v3.48 stacked-trained was 19/26).
Elapsed: 1519 s on CPU. Deterministic mode active.

Head-to-head vs v3.48:
  UP (+2):
    4.24 context_descriptor_cluster_probe (FAIL -> PASS)
    4.16 retrieval_generation_alignment_audit (FAIL -> PASS)
  DOWN (-3):
    4.8 degeneration_quality (PASS -> FAIL)
    4.21 decode_repetition_feedback_probe (PASS -> FAIL)
    4.25 prefix_length_scaling_probe (PASS -> FAIL)

FAIL signatures:
  4.24 -> PASS: loo_nn_all_4 = 0.9375 (15/16), heldout = 1.0 (8/8). [A] attention-pool ctx encoder with residual shortcut produced the intended gain. Primary metric now exceeds v3.48 Qwen-pool diagnostic (0.81) on same corpus, under v3.49 no-substitution rule.
  4.16 -> PASS: diagnoses = {aligned:2, bridge_unused:1, retrieval_miss:0}. [C] inter-domain margin + crowding prevented the music<->space mix on the satellites prompt.
  4.8 -> FAIL: outputs show repetition 'pian pian Chop pian noct pian...'. avg_max_repeat=4.33 (>3) and avg_unique_ratio=0.25. [E] top1-exclusive content_bias at weight 0.7 + floor 0.5 concentrates mass on the dominant memory's top starters, which the repetition guards cannot pull apart at this scale.
  4.21 -> FAIL: same repetition cascade (avg_max_repeat_per_content_token = 4.33, threshold 3). Downstream of the same [E] concentration.
  4.25 -> FAIL: mass_B/mass_A = 1.065, threshold 1.10. [B] residual-dominant tail_slot at fixed alpha=1.5 and beta=0.3 bounds the extra mass from doubling L_mem: extra tail slots now contribute mostly clamped residual + small beta*LN(head), not free head output, so the starter-mass ratio flattens toward 1.0.

Persistent FAILs (unchanged from v3.48):
  4.23 keyword_specific_tail_slot: median_rank = 1402 (was 1089). [B] alignment by cosine is not the same as WTE-rank recovery; the rank metric still reads the post-LN combined slot, which is near residual direction only by cosine, not in the raw logit argmax.
  4.11 retrieval_topk_semantic_shift: both hit counts still 0. prefix continues to route to meta-starters, independent of [C]/[E].
  4.13 save_load_consistency: output_a and output_b still differ; [D] fingerprint-stable save but generate() stochasticity at bf16 not fully pinned.
  4.19 stepwise_label_mass_alignment_audit: label-mass trajectory mis-aligned; cascade of 4.11.
  4.7 semantic_memory_counterfactual_pairs: repetition garbage, same root cause as 4.8/4.21.

Axes (v3.49 runner reporting):
  A compression: ratio 8.97 < 10 FAIL (ctx_desc added floats)
  B injection: 164224 per-step, O(1) in N, PASS
  C fidelity: 6/11, threshold 9 FAIL
  D stability: 1/3 PASS (save_load + decode_repetition FAIL)

SUT fresh-init; no training; no ckpt.

The [A] win validates the attention-pool mechanism design; the DOWN triplet (4.8/4.21/4.25) shows [E]/[B] changes overshot without a counterweight on repetition and mass preservation.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…write)

[#1] Revert [B] residual-dominant tail-slot decomposition.
  Cfg.tail_slot_residual_dominant: True -> False.
  loss_weights['slot_residual_alignment']: 0.3 -> 0.0.
  In v3.44-rewrite the combine_with_residual path produced slot_1 = alpha*residual (L2=1.07) + beta*LN(head_out) (L2=11.76) so LN(head_out) dominated the direction. On fresh init with zero-init slot_heads[1], LN(0) reduces to LayerNorm gamma direction (uniform), which is far from every rare-keyword WTE direction, so 4.23 median_rank went to 1402 (v3.48 baseline 1089). Disabling the decomposition routes EmbBridge.inject back to the additive path: slot_1 = tail_head(fiber) + alpha * residual, which in fresh init equals alpha * residual and points by construction at the rare-keyword centroid direction.

[#3] Refresh rare_keyword_ids at end of write().
  MemLLM.write() now calls self._refresh_rare_keyword_indices() after the last store_mem, so fresh-path and load-path both compute rare_keyword_ids via the same algorithm at the same timing.
  Pre-patch: write() left MemEntry.rare_keyword_ids=[] (set by store_mem), while load_memory() called _refresh_rare_keyword_indices after loading, leaving model_a and model_b with different rare_keyword_ids for the same mid -> _compute_rare_keyword_wte_residual returned None for model_a (empty lists) and a non-zero tensor for model_b, diverging prefix_cond -> 4.13 FAILs by string-inequality under greedy decoding.

Diagnostic: diag_4_13_rare_keyword_equiv.py verifies after #3 that all per-memory fields (base/fiber/dirn/semantic_emb/context_descriptor/content_token_ids/expanded_content_ids/strict_starter_ids/rare_keyword_ids) are bit-identical between fresh+save and load on corpus_general (the corpus 4.13 writes). The script runs to CLEAN. This does not guarantee 4.13 will PASS -- it only confirms the known source is closed. Remaining sources, if any, live downstream of MemEntry fields in the bridge / aligner / or backbone path.

No changes to:
- [A] attention-pool ctx encoder
- [C] inter-domain margin + cluster crowding
- [E] top1-exclusive content_bias
- [F] circuit breaker (still hooked only to mixture_gate ceiling, use_mixture_decoding=False by default -> still a dead path)
- runner
- SPEC

Scope: exactly two Cfg flags and one call-site added.
Structural risk: minimal (one is a revert, one is a timing alignment).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
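
The fresh-vs-load asymmetry behind [#3] can be reproduced with a toy store. This is a sketch only: `ToyMemLLM`, its dict-based entries, and the "token id appears in exactly one memory" rarity rule are hypothetical stand-ins, since the real `_refresh_rare_keyword_indices` algorithm is not shown in this PR.

```python
import copy
from collections import Counter


class ToyMemLLM:
    def __init__(self, refresh_on_write):
        self.store = {}  # mid -> {"content_token_ids", "rare_keyword_ids"}
        self.refresh_on_write = refresh_on_write

    def _refresh_rare_keyword_indices(self):
        # Hypothetical rarity rule: ids that occur in exactly one memory.
        counts = Counter(
            t for e in self.store.values() for t in set(e["content_token_ids"])
        )
        for entry in self.store.values():
            entry["rare_keyword_ids"] = sorted(
                t for t in set(entry["content_token_ids"]) if counts[t] == 1
            )

    def write(self, token_lists):
        for tokens in token_lists:
            mid = len(self.store)
            self.store[mid] = {
                "content_token_ids": list(tokens),
                "rare_keyword_ids": [],  # left empty at store time
            }
        if self.refresh_on_write:  # the post-patch behaviour
            self._refresh_rare_keyword_indices()

    def save(self):
        return copy.deepcopy(self.store)

    @classmethod
    def load(cls, dump, refresh_on_write):
        model = cls(refresh_on_write)
        model.store = copy.deepcopy(dump)
        model._refresh_rare_keyword_indices()  # load path always refreshes
        return model


def rare_ids(model):
    return {mid: e["rare_keyword_ids"] for mid, e in model.store.items()}


corpus = [[1, 2, 3], [3, 4], [5, 1]]

pre_a = ToyMemLLM(refresh_on_write=False)
pre_a.write(corpus)
pre_b = ToyMemLLM.load(pre_a.save(), refresh_on_write=False)

post_a = ToyMemLLM(refresh_on_write=True)
post_a.write(corpus)
post_b = ToyMemLLM.load(post_a.save(), refresh_on_write=True)

print("pre-patch equal:", rare_ids(pre_a) == rare_ids(pre_b))
print("post-patch equal:", rare_ids(post_a) == rare_ids(post_b))
```

Without the refresh at the end of `write()`, the fresh model carries empty lists while the loaded model carries populated ones, so anything derived from those ids can diverge between the two.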

Targets directly hit:
  4.13 save_load_consistency : FAIL -> PASS (outputs bit-identical)
  4.25 prefix_length_scaling : FAIL -> PASS (mass_B/mass_A = 1.543 >= 1.10)

Targets held (no regression from v3.44-rewrite):
  4.24 context_descriptor_cluster_probe: PASS (0.9375 / 1.0)
  4.16 retrieval_generation_alignment_audit: PASS

Targets still FAIL (same as v3.44-rewrite, unaddressed by #1/#3):
  4.23 keyword_specific_tail_slot_probe: median_rank=1402, hit=0
  4.8 / 4.21 / 4.7 : decoder repetition triple (will be addressed by #2)
  4.11 / 4.19 : prefix-token-class mismatch (will be addressed by #5)

Surprising finding on 4.23:
  The diagnostic dump (diag_4_23_slot_direction.py) reveals that bridge._last_tail_slots read by 4.23 does NOT come from prefix_cond - it comes from the SECOND inject call inside _build_contrastive_uncond_prefix, which is called with rare_keyword_wte_residual=None. This overwrites _last_tail_slots and _last_residual with the uncond contrastive prefix's values. The probe has been reading the uncond tail since at least v3.42.
  This is a pre-existing diagnostic-buffer aliasing bug, not a change-#1 regression. It explains why v3.48 (median_rank=1089) and v3.45 (median_rank=1402) both point at whitespace/punct - both are reading tail slots that were rebuilt without rare-keyword residual. Fix belongs in a separate PR (write residual to a second buffer in cond path, or snapshot bridge._last_tail_slots before uncond inject).

axis_coverage under v3.49 runner reporting:
  A compression : ratio 8.97 (< 10) FAIL
  B injection : 164224 floats, O(1) PASS
  C fidelity : 7/11 (threshold 9) FAIL
  D stability : 2/3 (4.21 FAIL) FAIL

elapsed: 1508 s on CPU, AMS_DETERMINISTIC=1, fresh init.

This audit validates:
- #1 revert did not regress anything and recovered 4.25 (predicted by the plan's 'LN-bounded extra slot mass' magnitude calculus).
- #3 refresh timing alignment recovered 4.13 (predicted by the plan's 'rare_keyword_ids fresh-vs-load asymmetry' mechanism).

This audit does not validate:
- any claim about 4.23 reachability; 4.23 has a pre-existing aliasing bug that the current plan's change #2 ([B] replacement) cannot fix because the replacement would still be overwritten by the uncond inject call.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 11 commits April 20, 2026 15:32

FluffyAIcode wants to merge 11 commits into main from AgentMemory/v345-revertB-refreshD-7e97