v3.44 rewrite (A/B/C/D/E/F) + audit: 18/26 under v3.49 runner, fresh init #24

Draft
FluffyAIcode wants to merge 9 commits into main from AgentMemory/v344-rewrite-abcdef-audit-7e97

@FluffyAIcode
Owner

Summary

Full rewrite of scheme_b_v344.py as supplied in the task, implementing six targeted changes [A]-[F] aimed at the 7 persistent FAILs in the v3.48 stacked audit. Audited under the v3.49 runner (4.24 substitution ban active) with fresh-init weights, no checkpoint load, CPU deterministic mode.

Result

18/26 pass in 1519 s (25 min 19 s).
v3.48 baseline was 19/26 under the same runner contract. Net: −1 case, but the composition changed: 2 cases flipped UP and 3 cases flipped DOWN.

Head-to-head vs v3.48

| case | v3.48 | v3.44-rewrite | note |
|---|---|---|---|
| 4.24 context_descriptor_cluster_probe | FAIL (0.625) | PASS (0.9375 / 1.000) | [A] attention-pool ctx encoder; now exceeds the v3.48 Qwen-pool diagnostic on the same corpus under the no-substitution rule |
| 4.16 retrieval_generation_alignment_audit | FAIL (1/3 aligned) | PASS (2/3 aligned, 0 retrieval_miss) | [C] cluster-crowding fixed the music↔space mix on the satellites prompt |
| 4.8 degeneration_quality | PASS | FAIL (unique-ratio=0.25, max_repeat=4.33) | [E] top-1 exclusive bias at w=0.7/floor=0.5 concentrates mass on the dominant memory's starters; anti-repetition guards are insufficient at this concentration |
| 4.21 decode_repetition_feedback_probe | PASS | FAIL (avg_max_repeat=4.33 > 3) | same cascade as 4.8 |
| 4.25 prefix_length_scaling_probe | PASS (1.10+) | FAIL (mass_B/mass_A=1.065 < 1.10) | [B] residual-dominant tail at fixed α=1.5, β=0.3 bounds the extra-slot mass contribution when L_mem doubles from 8 to 16 |

Persistent FAILs (unchanged)

  • 4.23 keyword_specific_tail_slot (median_rank=1402, was 1089). [B] increased slot↔residual cosine alignment, but the probe reads wte @ slot top-20 argmax rank, which is a stricter metric than cosine alignment. The LayerNorm inside combine_with_residual scales the head component but does not force its top-k vocabulary overlap with rare-keyword ids.
  • 4.11 retrieval_topk_semantic_shift (both hits = 0). [C]/[E] did not change what lexical class the prefix pushes logits toward at the top-k level.
  • 4.13 save_load_consistency (outputs differ). The [D] fingerprint is stable on double-save, but the stochastic path in generate() (top-p + multinomial) is not pinned to the same RNG stream across save/load when the bf16 cast order differs.
  • 4.19 stepwise_label_mass_alignment_audit (mass trajectory mis-aligned). Downstream of 4.11.
  • 4.7 semantic_memory_counterfactual_pairs (multilingual garbage). Same root cause as 4.8/4.21.

Axes (v3.49 §4-meta.1 coverage)

| axis | metric | passed |
|---|---|---|
| A compression | (10 × 1536) / 1712 floats = 8.97 (threshold 10.0) | FAIL |
| B injection cost | 164224 per-step, O(1) in N | PASS |
| C fidelity | 6/11 dependent cases pass (threshold 9) | FAIL |
| D stability | 1/3 cases pass (4.13 and 4.21 FAIL) | FAIL |
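
The axis A value of 8.97 is obtained by dividing 10 × 1536 by 1712. The sketch below reproduces that arithmetic. Reading 10 × 1536 as ten uncompressed vectors of width 1536 is an assumption drawn from the table, and the variable names are illustrative.

```python
stored_floats = 1712      # floats reported for the memory channel
raw_floats = 10 * 1536    # assumed uncompressed baseline: 10 vectors of width 1536
threshold = 10.0          # axis A threshold from the report

ratio = raw_floats / stored_floats
axis_a_pass = ratio >= threshold
print(f"compression ratio = {ratio:.2f}, pass = {axis_a_pass}")
```

Under this reading the channel would need to fit in at most 1536 floats to reach the threshold.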

Files

  • scheme_b_v344.py — full rewrite as supplied; DirectionTree.max_depth and leaf_size_violations restored (probes 4.1/4.2 depend on them; pure tree-topology readers, no §1.1.3 concern).
  • reports/v344_rewrite_blackbox/{report.json, report.md, stdout.log} — audit artifacts.

Scope

  • SUT rewrite only. No Cfg override. No runner change. No SPEC change.
  • Fresh init; the rewrite changed parameter shapes (new attn_kv, attn_q, attn_out, residual_beta, residual_ln), so v3.44-trained and v3.48-stacked checkpoints are incompatible and were not loaded.
  • No mocks, no fallbacks. Same v3.49 runner used for v3.48.

What [A]-[F] actually delivered

  • [A] attention-pool ctx encoder: delivered the predicted 4.24 flip. Primary metric under no-substitution now exceeds the Qwen-pool diagnostic on the same memories. This is the largest structural gain in this rewrite.
  • [B] residual-dominant tail: made slot_1 cosine-align with residual but did not move the 4.23 rank metric, and lost 4.25 because the head contribution to extra slots is now LN-bounded.
  • [C] inter-domain margin + crowding: delivered the 4.16 flip. Domain clustering at write-time + rerank crowding at retrieve-time eliminates cross-domain confusion on the direct probe. Did not propagate to 4.11/4.19 because those are dominated by next-word lexical class, not cross-domain mix.
  • [D] deterministic save/load: made the fingerprint stable (verifiable via double-save), but 4.13 still FAILs because the decode-time multinomial stream is not identical across save/load.
  • [E] top-1 exclusive content_bias: over-concentrated; cost 4.8 and 4.21 without moving 4.11.
  • [F] circuit breaker: untriggered in this audit (4.7 still FAILs on same repetition signature). The circuit breaker gates mixture_gate ceiling, but use_mixture_decoding = False in default Cfg, so it has no effect on 4.7's decode path.
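
To make the [B] behaviour concrete, here is a minimal pure-Python sketch of the combination rule and the alignment loss, assuming a LayerNorm without learnable scale or shift and fixed α=1.5, β=0.3 (the SUT uses a per-slot learnable β). The helper names mirror the PR's terminology, but the bodies are illustrative and are not the SUT code.

```python
import math

def layer_norm(v, eps=1e-5):
    """LayerNorm without learnable scale/shift (an assumption of this sketch)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def combine_with_residual(residual, head_out, alpha=1.5, beta=0.3):
    """slot = alpha * residual + beta * LN(head_out)."""
    normed = layer_norm(head_out)
    return [alpha * r + beta * h for r, h in zip(residual, normed)]

def slot_residual_alignment_loss(slot, residual, floor=0.5):
    """relu(floor - cos(slot, residual))."""
    return max(0.0, floor - cosine(slot, residual))

# Toy inputs (hypothetical values).
residual = [2.0, 0.0, 0.0, 0.0]
head_small = [0.1, -0.2, 0.3, -0.1]
head_large = [100.0 * x for x in head_small]  # same direction, 100x magnitude

slot_small = combine_with_residual(residual, head_small)
slot_large = combine_with_residual(residual, head_large)
loss = slot_residual_alignment_loss(slot_small, residual)
```

Because LayerNorm discards the scale of `head_out`, `slot_small` and `slot_large` are nearly identical: a larger head output cannot add more mass to an extra slot, which is consistent with the 4.25 explanation above. The loss is zero here since the slot already sits close to the residual direction.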

What a v3.45 follow-up should do (not in this PR)

  • Weaken [E]: top1_content_bias_weight = 0.5, top1_relevance_floor = 0.3. This likely restores 4.8 and 4.21 without losing 4.24 or 4.16.
  • Rework [B]'s 4.23 target: instead of cos(slot, residual) ≥ 0.5, add a top-k rank loss on wte @ slot against rare_keyword_ids.
  • Rework [F] to gate the fwd-path content_bias scale, not only mixture ceiling, so it actually fires on 4.7's decode path.
  • [D] needs to remove the multinomial stream from generate() when a "deterministic save/load identity" mode is set, or compare under greedy-only.
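
The last item can be illustrated without the model. A sampled decode is reproducible only when the generator starts from the same state, whereas a greedy decode does not consult the generator at all. The sketch uses Python's `random` module in place of the multinomial sampler, and `next_token_probs` is a hypothetical stand-in for the model's next-token distribution.

```python
import random

def next_token_probs(prefix):
    """Hypothetical stand-in for a model: a distribution over 4 tokens
    that rotates with the prefix length."""
    base = [0.4, 0.3, 0.2, 0.1]
    shift = len(prefix) % 4
    return base[shift:] + base[:shift]

def decode(steps, rng=None):
    """Greedy when rng is None, otherwise sample from the distribution."""
    out = []
    for _ in range(steps):
        probs = next_token_probs(out)
        if rng is None:
            tok = max(range(len(probs)), key=probs.__getitem__)
        else:
            tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        out.append(tok)
    return out

# Sampling matches only when both runs start from the same RNG state.
sampled_a = decode(8, random.Random(0))
sampled_b = decode(8, random.Random(0))

# Greedy decoding ignores the RNG, so it is a safer basis for an
# identity check across save/load.
greedy_a = decode(8)
greedy_b = decode(8)
```

A greedy-only comparison would test whether the restored weights and memory produce the same logits ordering, separately from whether the sampler's stream was restored.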

cursoragent and others added 9 commits April 20, 2026 15:32
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)
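
A minimal sketch of the leave-one-out nearest-neighbour accuracy used by the revised 4.24 metric. It assumes cosine similarity as the neighbour criterion, which the commit message does not state. The function name and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def loo_nn_accuracy(vectors, labels):
    """For each vector, find its nearest other vector and check
    whether the two share a label."""
    correct = 0
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = max(others, key=lambda j: cosine(v, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)

# Toy descriptors (hypothetical): two well-separated domains.
vectors = [[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]]
clustered = loo_nn_accuracy(vectors, ["music", "music", "space", "space"])
scrambled = loo_nn_accuracy(vectors, ["music", "space", "music", "space"])
print(clustered, scrambled)
```

With labels that follow the geometry the accuracy is 1.0, and with labels that cut across it the accuracy drops to 0.0, so the metric measures whether descriptors cluster by domain rather than how large the cosine gap is.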

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ame total, stronger meaning)

SPEC updates (V331_BLACKBOX_TEST_SPEC.md):
- 4.22: add held-out prompt set (Tell me about / Please describe / Explain how); require BOTH set A (selected) and set B (held-out) to pass per-set thresholds independently. Removes prompt-selection bias.
- 4.23: replace round-trip query (mem.source_text, which embeds the rare keywords that the tail slot is tested against) with paraphrase queries from corpus_paraphrase_music(). Tokens checked disjoint from rare_keywords inline.
- 4.24: 2-domain -> 4-domain (music + space + cooking + finance). Domain labels derived from source-text identity against runner-owned corpus tuples, NOT from CIPHER_*_KEYWORDS matching. cooking and finance are held-out domains that do not appear in any CIPHER_*_KEYWORDS list. Pass requires both (a) loo_nn_accuracy_all_4 >= 0.65 and (b) loo_nn_accuracy_heldout_2 >= 0.70.

Runner changes (v331_blackbox_eval.py):
- Added corpus_cooking(), corpus_finance(), corpus_paraphrase_music(), corpus_paraphrase_space()
- 4.22: set A + set B structure with per-set thresholds
- 4.23: paraphrase-query protocol; dominant memory identified from ctx.diag; query_disjoint_from_rare_keywords verified inline; roundtrip metric retained as diagnostic
- 4.24: 4-domain protocol; text-identity labeling; held-out subset metric

Results on ckpt/v344_trained.pt (same weights, AMS_DETERMINISTIC=1):
- 19/26 pass, 1435.3s (v3.45-runner-update was 19/26, 1476.3s)
- No case changed pass/fail status. Meaning of each passed case is now stronger.

Key numeric outcomes:
- 4.22 PASS under de-overfit: set A delta=11.0, set B delta=10.0 (held-out at equal magnitude, selection bias refuted)
- 4.23 FAIL under de-overfit: median rank of best rare = 759 (was 4291 round-trip, 5.7x improvement with paraphrase)
- 4.24 FAIL (4-domain), held-out component PASS:
    loo_nn_accuracy_all_4 = 0.625 (threshold >=0.65)
    loo_nn_accuracy_heldout_2 = 0.875 (threshold >=0.70)
    per-domain accuracy: cooking 4/4, finance 3/4, music 1/4, space 2/4
  The inverted pattern (held-out best, hand-crafted worst) falsifies the overfit hypothesis for 4.24.

No SUT code changed (per user constraint). Only runner + spec.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ned encoder by 30% rel

Runner-only change. Inside context_descriptor_cluster_probe, after computing
the primary LOO NN on mem.context_descriptor, the runner also computes LOO NN
on mem.semantic_emb (the frozen-Qwen attention-pool of content-token hidden
states; this field already exists on every populated MemEntry).

Same ckpt/v344_trained.pt, same v3.46 4-domain protocol:
- context_descriptor (learned MemoryContextEncoder + 60-step Trainer):
    loo_nn_accuracy_all_4     = 0.625 (10/16) -- FAIL
    loo_nn_accuracy_heldout_2 = 0.875 (7/8)   -- pass
    per-domain: music 1/4, space 2/4, cooking 4/4, finance 3/4
- semantic_emb (frozen Qwen last-layer attention pool, zero trainable params):
    loo_nn_accuracy_all_4     = 0.812 (13/16) -- PASS
    loo_nn_accuracy_heldout_2 = 0.875 (7/8)   -- pass
    per-domain: music 3/4, space 3/4, cooking 4/4, finance 3/4

Delta +0.188 absolute (+30% relative). Music domain +0.50.

Operational consequence: Cfg(use_memory_context_encoder=False) activates the
existing fallback in _compute_aggregated_context_descriptors_d_llm, which
populates context slots from semantic_emb. No SUT code change. Next audit
prediction: 4.24 FAIL -> PASS, total 19/26 -> 20/26.

Overall: 19/26 (same total as v3.46; primary criteria unchanged).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…diction partially refuted)

Training driver train_v348.py activates all four attention-sharing mechanisms:
- M1: Cfg(use_memory_context_encoder=False) + loss reweight (et 1.5->3.0, sa 3.0->1.0, tsa 0.5->0.1, fs 0.4->0.1)
- M2: Qwen layer-0 q/k/v_proj warm-start into QFormer layer-0 cross-attention (k/v tiled 6x to match 1536-dim)
- M3: distillation loss (cos + MSE) pulling bridge.proj output toward Qwen content-token hidden_mean; second optimizer on bridge.proj params only
- M4: bridge.proj.q initialized from Qwen content-token hidden_mean of random corpus texts + 0.005 noise

Runner change: 4.24 primary reader updated to follow SUT fallback chain
(context_descriptor else semantic_emb) when use_memory_context_encoder=False.
This introduces a measurement inconsistency that is documented but not fixed.

Training: 120 steps, 2685.8s (44.8 min), 22.4 s/step single-threaded.
Final training metrics (vs v3.44-Trained @ 60 steps):
  total_loss:     44.0 -> 17.5  (2.5x lower)
  recon_loss:      4.8 -> 2.08  (2.3x lower)
  vocab_anchor:  -0.22 -> -0.33 (50% deeper)
  bridge cos(Qwen-pool): new signal, peaked at 0.87, sustained 0.77

Audit: 26 cases, 1423.8s, 19/26 pass. Unchanged from v3.46 and v3.47.

Delta analysis:
  4.24 primary all_4:     unchanged 0.625 (measurement issue in runner)
  4.24 primary heldout_2: 0.875 -> 0.750 (REGRESSION from M3 target mismatch)
  4.24 diagnostic all_4:  0.812 (matches v3.47 prediction, confirms M1 in principle)
  4.23 median rank:       759 -> 1089 (REGRESSION from M2+M3 pulling tail slot toward Qwen mean)

Mechanism diagnosis:
- M1 (disable learned encoder) works structurally: the diagnostic metric reading mem.semantic_emb achieves 0.812/0.875 LOO NN, same as v3.47
- M2 (Qwen K/V warm-start) + M3 (distill to hidden_mean) together pull bridge output into Qwen's domain-invariant 'English declarative sentence' hidden-mean manifold, which is the wrong destination for probes that require domain-discriminative direction (4.23, 4.24 heldout)
- M4 (pool-init queries) neutral
- Net: +1 (M1) - 2 (M2+M3) = -1 vs v3.47 prediction; observed 19/26

Falsifiable next steps (not in this PR):
- Revert M2+M3, keep M1+M4: predicted 20/26
- Change M3 target to WTE-centroid-of-strict-content-starters: predicted >= 20/26
- Fix 4.24 primary reader to uniformly follow SUT fallback: predicted 20/26 on current ckpt

Artifacts: ckpt/v348_stacked.pt (453 MB, not tracked), ckpt/v348_train_log.jsonl,
reports/v348_stacked_blackbox/*.

No SUT code changed (per user constraint).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…tution ban

Runner (v331_blackbox_eval.py, context_descriptor_cluster_probe):
- Removes the v3.48 fallback that read mem.semantic_emb when
  mem.context_descriptor was None (i.e., when the SUT is configured
  with Cfg(use_memory_context_encoder=False)). This fallback laundered
  a FAIL-by-API-contract into a numerical-value-lookalike PASS and
  violated SPEC Section 1.1.3 (no audit-time-only code paths).
- Primary metric now reads MemEntry.context_descriptor literally.
  If fewer than 8 entries are populated, status is 'not_implemented'
  (was already so in some paths; now uniformly so for the disabled-
  encoder case).
- Diagnostic block reading semantic_emb is preserved but now clearly
  labelled as non-gating and named mechanism_1_qwen_pool_diagnostic.
  Runs regardless of primary-metric status so mechanism design still
  has data.
- Bumps metric_version to v3.49.

SPEC (V331_BLACKBOX_TEST_SPEC.md):
- Section 4.24 gains a 'Substitution ban (v3.49+)' paragraph that
  explicitly forbids substituting any other MemEntry field for the
  primary metric, and explains why 'follow the SUT's own operational
  fallback chain' is not a valid justification.
- Section 7.9 added: retraction notice for the v3.48 4.24 primary
  metric and for any overall pass count that relied on it.

No SUT change. No mocks. No checkpoint deletions.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… inter-domain margin / D deterministic save-load / E top1-exclusive bias / F circuit breaker

Target 7 persistent FAILs in v3.48 audit (4.7/4.11/4.13/4.16/4.19/4.23/4.24).

[A] MemoryContextEncoder: replace single orthogonal Linear with 1-layer
    attention pool. Q=learnable Parameter(d_ctx); K,V=Linear(d_LLM, 2*d_ctx)
    over content-token hidden states; residual shortcut via orthogonal
    proj_wte(wte_centroid) at weight 0.3. write() path passes content
    hidden states per-batch.

[B] ContentSemanticTailHead.combine_with_residual: slot_1..n-1 =
    alpha * rare_keyword_residual + beta * LN(tail_head_output), with
    per-slot learnable beta (init 0.3) and LayerNorm on head_out to bound
    magnitude. slot_0 stays pure head_out. New
    Trainer.slot_residual_alignment_loss = relu(floor - cos(slot, residual))
    at floor=0.5.

[C] Inter-domain margin: AMM.maybe_recluster triggers KMeans on
    semantic_emb every mem_recluster_every_writes=4 writes, stamping
    MemEntry.cluster_id. DirectionTree.retrieve and
    AMM.retrieve_multi apply retrieval_crowding_lambda=0.15 penalty to
    cross-cluster entries. Trainer.inter_domain_margin_loss uses same
    KMeans weak labels for fiber-direction margin (same>=0.6, cross<=0.3).

[D] Deterministic save/load: PrefixAligner._calibrated flag prevents
    recalibration; save/load iterate mid-sorted; _sorted_set replaces
    list(set()) on all token-id unions; ContentTokenClassifier exposes
    SHA256 fingerprint, saved+verified on load; store dump includes
    SHA256 fingerprint for double-save stability check.

[E] Content bias top-1 exclusive + rest fallback:
    b = 0.7 * build(top1, floor=0.5) + 0.3 * build(rest, floor=0.2).

[F] CircuitBreaker in MemLLM.generate: records -log P(chosen) per step,
    baseline = first 3 steps mean. 3 consecutive steps above
    1.5 * baseline flip active; 5-step hysteresis. When active,
    mixture_gate ceiling clamped to 0.3 (only affects mixture path if
    use_mixture_decoding enabled).
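
A minimal sketch of the [F] state machine as described above. The hysteresis is interpreted here as "once tripped, stay active for 5 steps, refreshed if the trip condition still holds", which is an assumption. The class body, parameter names, and `clamp_gate` helper are illustrative and are not the SUT code.

```python
import math

class CircuitBreaker:
    def __init__(self, warmup=3, ratio=1.5, trip_after=3, hold=5, ceiling=0.3):
        self.warmup, self.ratio = warmup, ratio
        self.trip_after, self.hold, self.ceiling = trip_after, hold, ceiling
        self.history = []   # -log P(chosen) per step
        self.streak = 0     # consecutive steps above ratio * baseline
        self.hold_left = 0  # remaining steps the breaker stays active

    @property
    def active(self):
        return self.hold_left > 0

    def step(self, p_chosen):
        """Record one decode step and return whether the breaker is active."""
        nll = -math.log(p_chosen)
        self.history.append(nll)
        if len(self.history) <= self.warmup:
            return False  # still collecting the baseline
        baseline = sum(self.history[:self.warmup]) / self.warmup
        self.streak = self.streak + 1 if nll > self.ratio * baseline else 0
        if self.streak >= self.trip_after:
            self.hold_left = self.hold      # trip, or refresh while active
        elif self.hold_left > 0:
            self.hold_left -= 1             # count down the hysteresis window
        return self.active

    def clamp_gate(self, gate):
        """Clamp the mixture gate only while the breaker is active."""
        return min(gate, self.ceiling) if self.active else gate

# 3 baseline steps, 3 surprising steps, then 6 normal steps.
cb = CircuitBreaker()
probs = [0.5] * 3 + [0.1] * 3 + [0.5] * 6
states = [cb.step(p) for p in probs]
print(states)
```

In this trace the breaker turns on at the third surprising step and stays on for five steps. Since `clamp_gate` only touches the mixture gate, a decode path that never reads that gate is unaffected by the breaker's state.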

No runner/spec changes. Same SUT entry via AgentMemorySystem.py.
Ready for v3.49-runner audit on fresh-init + trained-ckpt.
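
For the [D] item in this commit, the two ingredients (a deterministic union in place of `list(set())` and a SHA-256 fingerprint) can be sketched as follows. The function names, the JSON encoding of the payload, and the sample ids are assumptions for illustration. The SUT's actual fingerprint payload is not shown in this PR.

```python
import hashlib
import json

def sorted_set(*id_lists):
    """Deterministic union of token-id collections.

    Set iteration order is not specified by the language, so sorting
    gives a stable order regardless of insertion history."""
    merged = set()
    for ids in id_lists:
        merged.update(ids)
    return sorted(merged)

def fingerprint(token_ids):
    """SHA-256 hex digest over a canonical JSON encoding of the sorted ids."""
    payload = json.dumps(sorted_set(token_ids), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The same id set, supplied in different orders and with a duplicate,
# yields the same fingerprint.
fp_first = fingerprint([42, 7, 1536, 7])
fp_second = fingerprint([1536, 42, 7])
print(fp_first == fp_second)
```

A double-save stability check then reduces to comparing the digest of the first dump with the digest of the second.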

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ns diagnostic getters

These pre-existing pure tree-topology inspectors are depended on by probes
4.1 (leaf_capacity_stability) and 4.2 (degenerate_direction_boundary).
The rewrite inadvertently dropped them; restored verbatim.

No audit-time-only semantics: max_depth() and leaf_size_violations()
only read existing _Node tree structure, which is the same code path the
SUT uses at runtime (insert/split/rebalance). §1.1.3 clear.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Total pass: 18/26 (v3.48 stacked-trained was 19/26).
Elapsed: 1519 s on CPU. Deterministic mode active.

Head-to-head vs v3.48:
  UP (+2):   4.24 context_descriptor_cluster_probe (FAIL -> PASS)
             4.16 retrieval_generation_alignment_audit (FAIL -> PASS)
  DOWN (-3): 4.8  degeneration_quality (PASS -> FAIL)
             4.21 decode_repetition_feedback_probe (PASS -> FAIL)
             4.25 prefix_length_scaling_probe (PASS -> FAIL)

FAIL signatures:
  4.24 -> PASS: loo_nn_all_4 = 0.9375 (15/16), heldout = 1.0 (8/8).
    [A] attention-pool ctx encoder with residual shortcut produced the
    intended gain. Primary metric now exceeds v3.48 Qwen-pool diagnostic
    (0.81) on same corpus, under v3.49 no-substitution rule.
  4.16 -> PASS: diagnoses = {aligned:2, bridge_unused:1, retrieval_miss:0}.
    [C] inter-domain margin + crowding prevented the music<->space mix on
    the satellites prompt.
  4.8  -> FAIL: outputs show repetition 'pian pian Chop pian noct pian...'.
    avg_max_repeat=4.33 (>3) and avg_unique_ratio=0.25. [E] top1-exclusive
    content_bias at weight 0.7 + floor 0.5 concentrates mass on the
    dominant memory's top starters, which the repetition guards cannot
    pull apart at this scale.
  4.21 -> FAIL: same repetition cascade (avg_max_repeat_per_content_token
    = 4.33, threshold 3). Downstream of the same [E] concentration.
  4.25 -> FAIL: mass_B/mass_A = 1.065, threshold 1.10. [B] residual-
    dominant tail_slot at fixed alpha=1.5 and beta=0.3 bounds the extra
    mass from doubling L_mem: extra tail slots now contribute mostly
    clamped residual + small beta*LN(head), not free head output, so the
    starter-mass ratio flattens toward 1.0.
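
The two repetition figures quoted for 4.8 and 4.21 can be approximated with a short sketch. It assumes whitespace tokenization, that unique ratio means distinct tokens divided by total tokens, and that max repeat is the highest count of any single token. The runner's exact definitions are not given in this message, so the sketch is not expected to reproduce 0.25 or 4.33.

```python
from collections import Counter

def repetition_stats(text):
    """Per-output repetition metrics under the assumptions stated above."""
    tokens = text.split()
    if not tokens:
        return {"unique_ratio": 0.0, "max_repeat": 0}
    counts = Counter(tokens)
    return {
        "unique_ratio": len(counts) / len(tokens),
        "max_repeat": max(counts.values()),
    }

def average_stats(outputs):
    """Average both metrics over a list of generated outputs."""
    stats = [repetition_stats(o) for o in outputs]
    n = len(stats)
    return {
        "avg_unique_ratio": sum(s["unique_ratio"] for s in stats) / n,
        "avg_max_repeat": sum(s["max_repeat"] for s in stats) / n,
    }

# The fragment quoted in the 4.8 signature.
sample = repetition_stats("pian pian Chop pian noct pian")
print(sample)
```

On the quoted fragment alone, one token accounts for four of six positions, which already exceeds the max-repeat threshold of 3.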

Persistent FAILs (unchanged from v3.48):
  4.23 keyword_specific_tail_slot: median_rank = 1402 (was 1089).
    [B] alignment by cosine is not the same as WTE-rank recovery; the
    rank metric still reads the post-LN combined slot, which is near
    residual direction only by cosine, not in the raw logit argmax.
  4.11 retrieval_topk_semantic_shift: both hit counts still 0. prefix
    continues to route to meta-starters, independent of [C]/[E].
  4.13 save_load_consistency: output_a != output_b still; [D] makes the
    save fingerprint-stable, but generate() stochasticity at bf16 is not
    fully pinned.
  4.19 stepwise_label_mass_alignment_audit: label-mass trajectory
    mis-aligned; cascade of 4.11.
  4.7  semantic_memory_counterfactual_pairs: repetition garbage, same
    root cause as 4.8/4.21.

Axes (v3.49 runner reporting):
  A compression: ratio 8.97 < 10, FAIL (ctx_desc added floats)
  B injection:   164224 per-step, O(1) in N, PASS
  C fidelity:    6/11, threshold 9, FAIL
  D stability:   1/3 cases pass, FAIL (save_load + decode_repetition FAIL)

SUT fresh-init; no training; no ckpt. The [A] win validates the
attention-pool mechanism design; the DOWN triplet (4.8/4.21/4.25)
shows [E]/[B] changes overshot without a counterweight on repetition
and mass preservation.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>