
v4 architecture realign: v4.1–v4.6 stack + trained SUT on GPU (merge gate FAILED — ship-blocking follow-up v4.7 identified) #30

Draft
FluffyAIcode wants to merge 16 commits into main from AgentMemory/v347-architecture-realign-b7fa

Conversation

FluffyAIcode (Owner) commented Apr 22, 2026

Realigns the codebase to the abstract AMS spec (multiple Kakeya sets × three fiber bundles × explicit time/topic/background axes × cross-bundle attention). v4.1 → v4.6 implemented with 69 unit tests passing and end-to-end trained runs on Qwen2.5-1.5B at H200 scale.

The merge gate (v4-trained A/C at N=20 strictly > v3.46-trained 50/70) failed. Trained results with an honest diagnosis are below. This PR stays a draft; the v4.7 scope is identified below.

Status at the merge gate

Run                        A N=20   C N=20   Gate (A>50, C>70)
v3.46-trained (PR #29)        50%      70%   (reference)
v4 fresh-init GPU              0%      20%   —
v4-trained (this PR)           0%      30%   FAIL

Training converged on the five-term v4 loss (total 5 → 1.6 over 60 steps, 15.7 s on H200), but the hit-rate gap to v3.46-trained is large on both A and C. Two honest diagnostic cycles ran inside this PR:

  1. Prefix dominance: the first trained run produced degenerate repetition ("1. 1. 1. ...", 0% everywhere). Diagnostic: ‖prefix‖₂ ≈ 39 per slot vs. token embedding norm ≈ 2. Fix: a learnable prefix_scale parameter initialized at 1/sqrt(d_LLM) — see the sketch after this list. Text became coherent; hit-rate moved from 0% to 10/20%.
  2. Topic collapse: the trained topic tree retrieved the same mid on every query, with off-diagonal cos > 0.99 across all memory pairs. Diagnostic: token-ID Jaccard picked up stopword overlap, and the triplet loss collapsed the whole topic_base batch onto one direction. Fix: content-token Jaccard (drop ids < 1000) + a diversity regularizer (off-diag cos ≤ 0.7). C hit-rate moved to 40/30% at N=10/N=20. Diagnostic retrieval accuracy: 1/5 correct top-1 (up from 0/5).
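A minimal sketch of the prefix_scale fix from item 1 (module and names here are illustrative; the shipped parameter lives in CrossBundleAttention). LayerNorm leaves each slot at ‖x‖₂ ≈ sqrt(d_LLM) ≈ 39 for d_LLM=1536, so a scalar initialized at 1/sqrt(d_LLM) restores token-embedding magnitude at step 0:

```python
import torch
import torch.nn as nn

class ScaledPrefix(nn.Module):
    """Sketch only: LayerNorm output has unit variance per dim, hence
    L2 norm ~ sqrt(d_llm); scaling by 1/sqrt(d_llm) brings each slot
    down to ~1, near Qwen token-embedding magnitude. Training can tune
    the scale up via the prefix_semantic_anchor loss."""

    def __init__(self, d_llm: int):
        super().__init__()
        self.prefix_ln = nn.LayerNorm(d_llm)
        self.prefix_scale = nn.Parameter(torch.tensor(d_llm ** -0.5))

    def forward(self, prefix: torch.Tensor) -> torch.Tensor:
        # prefix: (B, L_mem, d_llm) -> same shape, token-embedding scale
        return self.prefix_ln(prefix) * self.prefix_scale
```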

After both fixes, A at N=20 is still 0%. The remaining gap is architectural rather than a tuning issue like the two fixed above — the v4.7 section below names what to work on.

What ships in this PR

  • ARCHITECTURE_v4.md — design spec.
  • ARCHITECTURE_v4_IMPL.md — per-PR implementation spec for v4.1–v4.5.
  • ARCHITECTURE_v4_TRAIN.md — v4.6 trainer + loss + data + merge-gate spec.
  • ams_v4/ full implementation (~2900 LOC):
    • v4.1 core/ + bundles/base.py — geometry primitives ported from v3.46.
    • v4.2 bundles/{temporal,topic,context}.py — three concrete Bundles + three encoders, each with explicit axis input.
    • v4.3 kakeya/ — multi-set kakeya with per-set bundle-axis alignment constraint.
    • v4.4 attention/{query_heads,cross_bundle}.py — three per-bundle attentions + learnable prefix_scale.
    • v4.5 projection/, bridge/ — end-to-end MemLLM4.
    • v4.6 training/{batch_encode,losses,trainer}.py — Trainer4 + five loss terms.
  • train_v4.py — training driver.
  • session_viability_v4.py — 3-mode parity harness (D, A, C).
  • ckpt/v4_trained.pt (training log: ckpt/v4_train_log.jsonl) + four reports/session_viability_v4_* directories (fresh + trained × N=10/N=20).

Test results

69/69 unit tests pass on CPU locally and on vast.ai H200:

Suite                                                     Count
skeleton                                                      6
v4.1 geometry + store                                        11
v4.2 encoders + bundles                                      14
v4.3 kakeya + alignment                                      19
v4.4 attention                                                8
v4.5 smoke (distilgpt2 end-to-end)                            1
v4.6 training (incl. grad flow, save/reload roundtrip)       10

GPU SUT — Qwen2.5-1.5B-Instruct, NVIDIA H200, mt=30, 60 training steps

Fresh-init (reports/session_viability_v4_fresh{,_20facts}/)

Mode             N=10 hit   N=20 hit   N=20 gen-ms
D_full_history       100%       100%           533
A_ams_prefix           0%         0%           488
C_ams_hybrid          10%        20%           431

Trained (reports/session_viability_v4_trained{,_20facts}/)

Mode             N=10 hit   N=20 hit   N=20 gen-ms   Δ vs fresh (N=10 / N=20)
D_full_history       100%       100%           519   0 / 0
A_ams_prefix          10%         0%           466   +10 / 0
C_ams_hybrid          40%        30%           415   +30 / +10

Reference: v3.46-trained (from PR #29)

Mode             N=10 hit   N=20 hit
A_ams_prefix          50%        50%
C_ams_hybrid          70%        70%

Honest reading

The merge gate fails: v4-trained A and C at N=20 are both below v3.46-trained. The per-turn generated text is coherent (after the prefix_scale fix), and the response is often topic-correct in direction (e.g. "You love classical music..." for the Chopin query) — what's missing is extraction of the specific keyword.

Three concrete gaps identified in the ckpt/v4_train_stdout.log probes and the diagnostic retrieval check:

  1. Topic space is still too crowded for N=20. At d_topic=16, trained on a 9-sentence corpus (the §5.3 rotating batch), the topic diversity regularizer only has to hold its ceiling across 3 memories at a time, while the held-out eval session has 10/20 distinct memories — the trained topic tree retrieves correctly on only 1/5 diagnostic queries.
  2. The prefix_semantic_anchor target is too shallow. With short training sentences, the 50/50 split often leaves just 3–5 target tokens, so the NLL surface is shallow. A harder target is needed (entity mask, cloze) — see the sketch after this list.
  3. Retrieval is flat. MemLLM4.prepare_decode_context attends over all entries rather than filtering to top-k via the direction trees. At N=20 the trained attention is still diluted by irrelevant memories.
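A minimal sketch of the harder cloze target named in gap 2 (detailed further in the v4.7 scope below). The `idf` map and the ids ≥ 1000 content heuristic are assumptions carried over from the topic-loss fix; this is not the shipped prefix_semantic_anchor code:

```python
def entity_cloze(token_ids: list[int], idf: dict[int, float]):
    """Pick the single highest-IDF content token as the supervised target;
    everything before it becomes the teacher-forced prompt."""
    content = [(i, t) for i, t in enumerate(token_ids) if t >= 1000]
    if not content:
        return None  # nothing maskable; caller can fall back to the 50/50 split
    pos, target = max(content, key=lambda it: idf.get(it[1], 0.0))
    return token_ids[:pos], target  # NLL is supervised on this one token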

v4.7 follow-up scope (not in this PR)

Three independently-testable changes, in order:

  1. Scale the training corpus. Replace the 9-sentence v3.46 rotating batch with the LongMemEval 50-entry sample (already in longmemeval_results.json) or a synthetic generator producing at least 200 entity-diverse sentences. Run 300+ steps.
  2. Reshape prefix_semantic_anchor. Replace the 50/50 split with entity masking: find the single highest-IDF content token in each sentence, mask it, supervise the LM to predict that token. This matches the session-viability query format.
  3. Tree-topk retrieval filter. In MemLLM4.prepare_decode_context, retrieve the top-k (say k=4) memories via a fused score across the three bundle trees before feeding into CrossBundleAttention — see the sketch below. The same change would also bring flat-scan B_ams_text-style modes into v4 if needed later.
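A sketch of the fused-score filter (function name and score shapes are assumptions; the real change would live in MemLLM4.prepare_decode_context):

```python
import torch

def tree_topk_filter(per_bundle_scores: dict[str, torch.Tensor], k: int = 4):
    """Fuse the three direction-tree score vectors (same memory ordering)
    and keep only the top-k memory indices for CrossBundleAttention."""
    fused = torch.stack(list(per_bundle_scores.values())).mean(dim=0)  # (n_mem,)
    return torch.topk(fused, min(k, fused.numel())).indices
```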

v4.7 merge gate is unchanged from v4.6: A/C at N=20 must strictly exceed 50/70.

What NOT to do in v4.7

Do not add v3.46-style decode-time patches (content_bias_*, strict_overlap_*, keyword_tail_slot, etc.). If v4.7's three fixes don't close the gate, that points at a real architectural issue that additional decode hacks would hide, not fix — the SPRINT_CLOSEOUT_v3.46.md §10.9 bright line holds.

Commits in this PR

Reverse chronological (most recent first):

  • 434b769 — v4.6 trained SUT results (honest failure diagnosis)
  • c102dfc — v4.6 fix: content-token Jaccard + topic diversity regularizer
  • a86ea25 — v4.6 fix: learnable prefix_scale in CrossBundleAttention
  • a81d40d — session_viability_v4 --trained-weights flag
  • f905e3a — v4.6: Trainer4 + five loss terms + train_v4.py driver
  • 9859216 — v4 fresh-init GPU SUT results
  • a913aad — v4.5 dtype fix (removed redundant prefix LN)
  • 5c7c729 — v4.5 auto-cuda in LLMBackbone4.load
  • 448c300 — session_viability_v4 harness
  • 9053b28 — v4.5 end-to-end + CPU smoke test
  • 1451733 — v4.4 BundleQueryHeads + CrossBundleAttention
  • 3f394c8 — v4.3 KakeyaSet + KakeyaRegistry + alignment
  • 08910fa — v4.2 three encoders + three Bundle subclasses
  • f4ef74c — v4.1 geometry primitives + MemStore + DirectionTreeV4
  • f7254af — ARCHITECTURE_v4_IMPL.md
  • 9f34781 — initial design skeleton + ARCHITECTURE_v4.md

cursoragent and others added 11 commits April 22, 2026 07:42
Opens a new architecture track AgentMemory/v347-architecture-realign-b7fa
that realigns the codebase to the abstract AMS spec:

  Multiple Kakeya sets compress the full context data. These Kakeya sets
  are linked on different fiber bundles. The fiber bundles carry memory
  encoding around time, topic, and background (context). An attention
  mechanism forms the current context window.

An audit of scheme_b_v344.py + kakeya_codec.py on PR #29 showed four of
the five structural claims in that sentence had drifted:

  - 'multiple kakeya sets'     : actually exactly 1 (KakeyaCodec is a singleton)
  - 'compress the full context': only semantic_emb is compressed
  - 'different fiber bundles'  : one bundle; kakeya and bundle are disjoint
  - 'time / topic / background': none are fiber-bundle coordinates. They
                                 live as scalar bookkeeping (ts/last/cnt),
                                 a side-channel tensor (context_descriptor)
                                 and an integer KMeans tag (cluster_id)
  - 'attention forms context'  : implemented (FiberAttn + QFormer + EmbBridge)

The 30-point gap between B_ams_text (80-90%) and A_ams_prefix (50%) on
PR #29 is the downstream symptom of the first four drifts.

This branch adds:

1. ARCHITECTURE_v4.md — 7-section design doc:
   §0 audit findings vs abstract spec
   §1 abstract-to-concrete mapping (5 subsections, one per spec clause)
   §2 ams_v4/ package layout
   §3 compilable-skeleton contract (NotImplementedError with v4-skel: markers)
   §4 migration plan: v4.1-v4.5 PRs, what each ports from v3.46
   §5 explicit non-goals (not RAG, not KG, not Cfg-knob-turning)
   §6 six assertable invariants
   §7 this PR's status and what's untouched

2. ams_v4/ package skeleton, importable, 24 Python files:
   core/      Cfg4, MemEntry, KakeyaHandle, MemStore, type helpers
              MemEntry now carries THREE (base, fiber, dirn) triples
              (one per bundle) instead of v3.46's single triple.
   bundles/   abstract Bundle + three concretes (Temporal/Topic/Context),
              each with its own encoder receiving the axis that bundle owns
              (time_scalars / content_tokens+wte / session_summary+prev_turns).
   kakeya/    KakeyaSet (single skeleton, bundle-owned, with alignment
              constraint), KakeyaRegistry (owns N sets, routes fields),
              alignment helpers, v4 codec facade.
   attention/ CrossBundleAttention (three per-bundle attentions + slot
              concat), BundleQueryHeads (three hidden->query projections).
   projection/ EmbBridge4 (thin prefix-prepend bridge; no content_bias,
              strict_overlap, keyword_tail_slot, or functional_suppression).
   bridge/    MemLLM4 top-level model, v3.46-compatible public surface.
   tests/     test_shapes.py, 6 static tests:
                - imports work
                - Cfg4 default constructs with all invariants passing
                - three Cfg4 invariants fire on violation (n_kakeya_sets>=2,
                  prefix_slots sum to L_mem, fiber dim divides head count)
                - stubbed methods raise NotImplementedError with v4-skel: marker

3. ams_v4/README.md — status + follow-up roadmap (v4.1-v4.5).

4. .gitignore (new) to keep __pycache__ etc. out.

v3.46 code (scheme_b_v344.py, kakeya_codec.py, train_v346.py,
session_viability.py) is not touched by this branch. PR #29 measurements
remain reproducible. The parity bar for v4.5's merge to main is:
MemLLM4 >= MemLLM v3.46 on session_viability.py, strict improvement on
A_ams_prefix and C_ams_hybrid at N=20.

Skeleton-test run (at commit time):
  PASS test_imports
  PASS test_cfg4_default_constructs
  PASS test_cfg4_invariant_n_kakeya_sets_min_2
  PASS test_cfg4_invariant_prefix_slots_sum
  PASS test_cfg4_invariant_fiber_divisibility
  PASS test_all_skeleton_components_raise_not_implemented
  all 6 skeleton tests passed

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Companion to ARCHITECTURE_v4.md. For each follow-up PR, specifies:

- which files are in scope
- which v3.46 classes port with what edits
- pseudocode for the new encoders / attention / kakeya math
- test list with exit criteria

Scope choice: v4.5 ships end-to-end write+retrieve+attend+inject+generate
with a CPU smoke test on a tiny backbone (sshleifer/tiny-gpt2, 7M params).
Training convergence is explicitly v4.6, not bundled here — mixing design-
drift fix with training-convergence fix would make failure modes hard to
diagnose.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Ports from scheme_b_v344.py (v3.46) with per-bundle parameterization:
  RiemannianMetric, GeodesicSolver, FiberConnection, FiberTransporter

New:
  - Bundle abstract (canonical_axis as nn.Parameter, unit-normalized on access)
  - DirectionTreeV4: beam retrieval; no AMM cross-coupling, no cluster-crowding
    rerank (those v3.46 workarounds are superseded by the per-bundle axis)
  - MemStore: three trees (time/topic/ctx), routes on add/remove, invariant check
  - MemEntry.assert_no_raw_large_fields — §6 invariant 2 enforcement
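For illustration, the MemStore routing contract in miniature (a sketch only; attribute names like entry.dirn as a per-bundle dict are assumptions, not the shipped MemEntry layout):

```python
class MemStoreSketch:
    """add()/remove() route one direction per bundle into that bundle's tree."""

    def __init__(self, trees):
        self.trees = trees      # {'time': tree, 'topic': tree, 'ctx': tree}
        self.entries = {}

    def add(self, mid, entry):
        self.entries[mid] = entry
        for name, tree in self.trees.items():
            tree.insert(mid, entry.dirn[name])   # one dirn per bundle

    def remove(self, mid):
        del self.entries[mid]
        for tree in self.trees.values():
            tree.remove(mid)
```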

Tests (ams_v4/tests/test_v41.py, 11/11 pass on CPU):
  PASS test_metric_spd                           SPD + symmetric
  PASS test_connection_antisymmetric             A + A^T ~ 0
  PASS test_transporter_preserves_norm           closed-loop drift < 10%
  PASS test_geodesic_endpoints                   path[0]=xs, path[-1]=xe
  PASS test_geodesic_linear_fallback             linear_path correct shape
  PASS test_memstore_add_routes_to_all_three_trees
  PASS test_direction_tree_insert_retrieve       target mid in top-3
  PASS test_memstore_remove_updates_trees
  PASS test_memstore_verify_consistency_empty
  PASS test_memstore_verify_consistency_populated
  PASS test_memstore_invariant_no_raw_large_fields

Skeleton stub tests (ams_v4/tests/test_shapes.py) pruned: v4.1 components
no longer raise NotImplementedError, so test_all_skeleton_components was
renamed to test_remaining_stubs and now checks only v4.2+ stubs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
TimeEncoder (temporal.py):
  - Fourier-feature encoding of (absolute_ts, recency, cnt)
  - base = LN(time_mlp(fourier) + hidden_proj(hidden))
  - fiber = MLP(concat(hidden, base, surprise))
  - dirn  = normalize(base)

TopicEncoder (topic.py):
  - IDF-weighted centroid of content_token_ids over wte_normed (batched)
  - base = normalize(down_project(centroid) + hidden_to_topic(hidden))  -> on sphere by construction
  - fiber = MLP(concat(hidden, base))
  - dirn  = base (already unit)
  - Ragged batch input (list-of-lists) supported

ContextEncoder (context.py):
  - Single-head attention pool over optional prev_turns
  - base = LN(mix_mlp(hidden + session_summary + attn))
  - fiber = MLP(concat(hidden, base, session_summary))
  - dirn  = normalize(base)

TopicBundle overrides _solver=None and provides _great_circle_path (slerp)
for transport; topic transport does not need gradient-descent geodesic solver.
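A minimal sketch of the slerp path (clamping and step count are illustrative, not the shipped _great_circle_path):

```python
import torch

def great_circle_path(xs: torch.Tensor, xe: torch.Tensor, steps: int = 8):
    """Spherical linear interpolation between unit vectors xs and xe;
    every intermediate point stays on the sphere by construction."""
    xs, xe = xs / xs.norm(), xe / xe.norm()
    omega = torch.arccos(torch.clamp(torch.dot(xs, xe), -1 + 1e-6, 1 - 1e-6))
    ts = torch.linspace(0.0, 1.0, steps)
    path = [(torch.sin((1 - t) * omega) * xs + torch.sin(t * omega) * xe)
            / torch.sin(omega) for t in ts]
    return torch.stack(path)  # (steps, d), each row unit-norm
```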

Tests (ams_v4/tests/test_v42.py, 14/14 pass on CPU):
  PASS test_time_encoder_shapes
  PASS test_time_dirn_unit_norm
  PASS test_temporal_bundle_encode_matches_encoder
  PASS test_idf_centroid_empty_returns_zero
  PASS test_idf_centroid_oov_returns_zero
  PASS test_topic_encoder_shapes_batched
  PASS test_topic_base_on_sphere              (||base||=1 within 1e-4)
  PASS test_topic_bundle_canonical_axis_unit
  PASS test_topic_great_circle_endpoints      (slerp endpoints exact, mid-points on sphere)
  PASS test_topic_transport_preserves_norm    (drift < 15%)
  PASS test_context_encoder_no_prev_turns
  PASS test_context_encoder_with_prev_turns
  PASS test_all_bundles_canonical_axis_unit
  PASS test_gradients_flow_through_time_encoder

Skeleton stub test in test_shapes.py pruned to only KakeyaRegistry.define_sets
now that all three encoders are implemented.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
alignment.py (pure functions, no state):
  - pushforward(axis_in_base, base_to_field) = axis @ map
  - project_into_pca(direction, basis)       = basis @ direction
  - alignment_error(t_dir, target)           = ||t_dir - normalize(target)||
  - solve_aligned_t_dir(target, tol)         = (normalize(target), 0)
                                                on near-zero -> unit e_0 + err=1
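These helpers are small enough to sketch directly (shapes assumed: axis (d_base,), map (d_base, d_field); a sketch of the described signatures, not the shipped file):

```python
import torch
import torch.nn.functional as F

def pushforward(axis_in_base: torch.Tensor, base_to_field: torch.Tensor):
    """Push a bundle axis through the base-to-field map: axis @ map."""
    return axis_in_base @ base_to_field

def alignment_error(t_dir: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """||t_dir - normalize(target)||_2."""
    return torch.linalg.norm(t_dir - F.normalize(target, dim=-1))

def solve_aligned_t_dir(target: torch.Tensor, tol: float = 1e-6):
    """normalize(target) with zero error; near-zero targets fall back to
    the unit basis vector e_0 with error 1, as described above."""
    n = torch.linalg.norm(target)
    if n < tol:
        e0 = torch.zeros_like(target)
        e0[0] = 1.0
        return e0, 1.0
    return target / n, 0.0
```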

KakeyaSet:
  - Build pipeline: PCA -> align t_dir to bundle axis pushforward ->
    perpendicular spherical K-means -> store KakeyaSkeleton4
  - encode(v): (alpha on t_dir, segment id, t along center, sparse residual top-k)
  - decode(cv): reconstruct field vector from CompressedVec
  - verify_alignment: recompute pushforward and return ||t_dir - projected||
  - _compute_pca + _spherical_kmeans ported from kakeya_codec.py (v3.12 helpers)

KakeyaRegistry:
  - Owns N KakeyaSet instances per _routing (default: 4 sets across 3 bundles,
    with cross-axis redundancy semantic_emb+content_wte_mean)
  - build(field_corpus, bundle_axes) populates all active sets; auto-initializes
    per-routing-key base_to_field map (seeded for determinism)
  - encode_memory_fields / decode_field: per-memory API
  - verify_invariants(n, bundle_axes): enforces §6 invariants 3 + 4

Tests (ams_v4/tests/test_v43.py, 19/19 pass on CPU):
  6 alignment-math tests (pushforward, project, alignment_error, solve)
  2 helper tests (_compute_pca, _spherical_kmeans)
  4 KakeyaSet tests (build activates, alignment near-zero, roundtrip,
                     reject-wrong-dim)
  7 Registry tests (default 4 sets, custom routing, short-routing rejection,
                    handle covers all fields, decode roundtrip, invariant pass,
                    invariant 3 fires when active-set-count < 2)

§6 invariant 5 (reconstruction) verified: median rel err <= 0.15, max < 0.65.
§6 invariant 4 (alignment) verified: err < kakeya_alignment_tol = 1e-3 after build.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
BundleQueryHeads (attention/query_heads.py):
  - LayerNorm on hidden_state
  - Three independent Linear heads: time / topic / ctx
  - Each projects d_LLM -> d_F_{bundle}

CrossBundleAttention (attention/cross_bundle.py):
  - Three MultiheadAttention modules, one per bundle fiber space
    (d_F_time / d_F_topic / d_F_ctx, each with its own head count)
  - Per-slot Linear lifts: each of prefix_slots_{time,topic,ctx} slots
    gets its own d_F_bundle -> d_LLM map
  - Concat-along-slot-dim -> (B, L_mem, d_LLM) -> post LayerNorm
  - Asserts output shape invariant §6.6

Design choice: three per-bundle attentions instead of one shared attention.
This keeps the topic signal from getting mixed with the temporal signal in
the attention kernel itself; combination happens at the slot-concat stage.
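A sketch of the slot-concat stage only (per-bundle attention omitted; dict keys and dims are illustrative, assuming one pooled attention output per bundle):

```python
import torch
import torch.nn as nn

class SlotLiftConcat(nn.Module):
    """Each bundle's pooled attention output gets one Linear lift per slot;
    slots concatenate to (B, L_mem, d_llm), then a post LayerNorm."""

    def __init__(self, d_fiber: dict, slots: dict, d_llm: int):
        super().__init__()
        self.lifts = nn.ModuleDict({
            b: nn.ModuleList([nn.Linear(d_fiber[b], d_llm) for _ in range(slots[b])])
            for b in d_fiber
        })
        self.post_ln = nn.LayerNorm(d_llm)

    def forward(self, attended: dict) -> torch.Tensor:
        # attended[b]: (B, d_fiber[b]) pooled per-bundle attention output
        out = [lift(attended[b]) for b in self.lifts for lift in self.lifts[b]]
        return self.post_ln(torch.stack(out, dim=1))
```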

Tests (ams_v4/tests/test_v44.py, 8/8 pass on CPU):
  PASS test_query_heads_shapes
  PASS test_query_heads_distinct
  PASS test_cross_bundle_forward_shape        (B, L_mem, d_LLM) exactly
  PASS test_cross_bundle_requires_at_least_one_entry
  PASS test_cross_bundle_gradient_flow        backward through q_time.weight
  PASS test_cross_bundle_finite_with_random_fibers
  PASS test_cross_bundle_batch_determinism    eval() + identical input -> identical output
  PASS test_cross_bundle_slot_allocation_matches_cfg
        perturbing only time fibers changes time slots more than topic/ctx slots

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
LLMBackbone4 (bridge/backbone.py):
  - Thin wrapper over HF AutoModelForCausalLM
  - Freezes backbone params (v4 does NOT fine-tune the LM)
  - tokenize / hidden_states / forward_with_prefix / generate_with_prefix
  - Manual greedy-decode loop with inputs_embeds (avoids HF generate()
    inputs_embeds edge cases)
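A minimal sketch of that loop (no KV cache or attention mask, for brevity; `model` is any HF AutoModelForCausalLM and `wte` its input embedding module — names here are illustrative):

```python
import torch

@torch.no_grad()
def greedy_with_prefix(model, wte, prefix, input_ids, max_new_tokens: int = 30):
    """Prepend the memory prefix once, then greedily extend with token
    embeddings via inputs_embeds, sidestepping HF generate()."""
    embeds = torch.cat([prefix, wte(input_ids)], dim=1)  # (B, L_mem + T, d)
    new_ids = []
    for _ in range(max_new_tokens):
        logits = model(inputs_embeds=embeds).logits[:, -1]  # last position
        nxt = logits.argmax(dim=-1, keepdim=True)           # (B, 1), greedy
        new_ids.append(nxt)
        embeds = torch.cat([embeds, wte(nxt)], dim=1)
    return torch.cat(new_ids, dim=1)
```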

EmbBridge4 (projection/bridge.py):
  - prefix_post_ln + build_inputs(prefix, ids, mask, wte)
  - Prepends prefix embeds + extends attention mask
  - No CFG, no content_bias, no logit shaping — v3.46 decode-time patches
    are intentionally NOT ported

MemLLM4 (bridge/memllm.py):
  - Composes backbone + 3 bundles + cross_attn + kakeya registry + store
  - write(text):
      1. backbone.hidden_states -> pooled float32 hidden
      2. three bundles.encode -> (time_*, topic_*, ctx_*) triples
      3. extract large fields (semantic_emb, content_wte_mean, context_descriptor)
      4. store.add -> triggers _maybe_build_kakeya once n >= min_entries
      5. existing entries re-encoded through the active registry
  - prepare_decode_context(ids, mask):
      1. pooled query hidden
      2. cross_attn over ALL entries (flat attend; retrieval-filter in v4.6)
  - generate(prompt, mt):
      1. prepare_decode_context
      2. backbone.generate_with_prefix (manual greedy loop)

CPU smoke test (tests/test_v45_smoke.py):
  - distilgpt2 backbone (82M params, d_LLM=768), 6 written memories
  - Verifies §6 invariants 1, 2, 3, 4, 6 on live data
  - Runs generate(); does NOT assert output quality (that's v4.6)
  - Completes in ~5 s on CPU

All v4 tests passing (59 total):
    6 skeleton
   11 v4.1 (geometry + MemStore + DirectionTreeV4)
   14 v4.2 (three encoders + three bundles)
   19 v4.3 (kakeya + alignment)
    8 v4.4 (attention)
    1 v4.5 smoke (end-to-end)

Skeleton test test_remaining_stubs_raise_not_implemented renamed to
test_v45_constructs_without_backbone: no stubs remain after v4.5.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
3-mode subset (D_full_history, A_ams_prefix, C_ams_hybrid) on the same
10-query synthetic session as PR #29's session_viability.py, but using
MemLLM4 for A and C.

Fresh-init expectation: A/C hit-rates at Qwen2.5-1.5B scale on GPU are
not expected to beat v3.46 fresh-init — that requires training (v4.6).
This harness produces the fresh-init baseline that v4-trained will be
compared against.

B modes (B_flat_cos, B_ams_text) omitted: they are RAG-shaped upper-bound
diagnostics, not v4 product modes (per SPRINT_CLOSEOUT_v3.46.md §10.9).
D_full_history is kept as the ceiling baseline.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Default device = torch.device('cuda') when torch.cuda.is_available() and the
caller didn't pass a device override. Without this, MemLLM4 ran on CPU even
on GPU-equipped hosts, making the session_viability_v4 harness unusably slow
(~27 s per D_full_history generate on Qwen 1.5B).
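The whole fix is essentially one helper (a sketch; the shipped version sits inside LLMBackbone4.load):

```python
import torch

def resolve_device(override=None) -> torch.device:
    """Default to CUDA when available, unless the caller passed a device."""
    if override is not None:
        return override
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```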

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
CrossBundleAttention already applies a LayerNorm at the end of forward().
EmbBridge4 (and MemLLM4.generate) previously applied a *second* LayerNorm,
which triggered RuntimeError('expected BFloat16 but found Float') when the
backbone is bf16 on GPU (v4 modules are fp32 by default).

Fix:
  - EmbBridge4 no longer owns prefix_post_ln; build_inputs just concats
    prefix.to(dtype) with wte(ids).
  - MemLLM4.generate() skips the LN and passes ctx.prefix.to(backbone_dtype)
    directly to backbone.generate_with_prefix.

Local v4.5 smoke test still passes (distilgpt2, fp32). No unit-test change
since no test exercised EmbBridge4.build_inputs directly.
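A sketch of the post-fix concat (the prefix arrives already LayerNormed by CrossBundleAttention; only a dtype cast happens here — signature mirrors the description above, not the exact shipped API):

```python
import torch

def build_inputs(prefix, input_ids, attention_mask, wte):
    """Cast the fp32 prefix to the backbone dtype (e.g. bf16) and prepend;
    no second LayerNorm."""
    tok = wte(input_ids)                                    # (B, T, d), backbone dtype
    embeds = torch.cat([prefix.to(tok.dtype), tok], dim=1)  # (B, L_mem + T, d)
    pad = torch.ones(prefix.shape[:2], dtype=attention_mask.dtype,
                     device=attention_mask.device)
    return embeds, torch.cat([pad, attention_mask], dim=1)
```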

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
N=10 (reports/session_viability_v4_fresh/):
  D_full_history     hit=100%  in_tok=159  gen=501ms
  A_ams_prefix       hit=  0%  in_tok=11   gen=493ms  ret=28ms
  C_ams_hybrid       hit= 10%  in_tok=27   gen=412ms  ret=27ms

N=20 (reports/session_viability_v4_fresh_20facts/):
  D_full_history     hit=100%  in_tok=301  gen=533ms
  A_ams_prefix       hit=  0%  in_tok=11   gen=488ms  ret=18ms
  C_ams_hybrid       hit= 20%  in_tok=27   gen=431ms  ret=28ms

EXPECTED BEHAVIOR for fresh-init v4. The v3.46 fresh-GPU numbers at N=20
(A=50%, C=70%) reflect ~15 decode-time logit-shaping hacks (content_bias,
strict_overlap, keyword_tail_slot, functional_suppression, etc.) that v4
does NOT port. v4 exposes the pure prefix-channel mechanism.

Purpose of this baseline: it is the FRESH-INIT floor that v4.6 trained
numbers will be compared against. Expected training lift is large because
the new v4 loss terms (bundle_axis_alignment, cross_bundle_independence,
prefix_semantic_anchor, recon, write_policy) directly target the prefix
channel mechanism, unlike v3.46 where only a few loss terms touched the
prefix channel and decode-time hacks compensated for the rest.

v4 ran end-to-end at Qwen2.5-1.5B scale on H200 with:
  - All 6 skeleton tests passing
  - All 52 v4.1-v4.5 unit tests passing
  - Full stack: 3 bundles x 3 trees x kakeya registry with 4 active sets
  - C_ams_hybrid retrieve latency 28ms (beats v3.46 ~400ms — the v4 tree is
    per-bundle and does not run v3.46's rerank-inside-retrieve)
  - Generate latency 412-493ms (bounded by backbone forward, no CFG)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor (bot) changed the title from "v4 architecture realign: skeleton — multi-kakeya × 3 bundles × (time/topic/context) axes" to "v4 architecture realign: full stack (v4.1–v4.5) + fresh-init SUT on GPU" on Apr 22, 2026
cursoragent and others added 5 commits April 22, 2026 08:44
ARCHITECTURE_v4_TRAIN.md: per-loss design, training data, merge gate.

ams_v4/training/batch_encode.py:
  - encode_batch_for_training(): three bundles run on a list of texts,
    produces (base, fiber, dirn) stacks with gradients retained. Used only
    during training — production write() path is untouched (still detaches).
  - batch_to_mementries(): build MemEntry objects that reference grad-
    carrying tensors, for use by CrossBundleAttention during the loss.

ams_v4/training/losses.py (5 terms, mirroring Cfg4.loss_weights keys):
  - prefix_semantic_anchor: teacher-forced next-token NLL through
    (cross_attn → prefix → backbone). Main signal.
  - bundle_axis_alignment: three per-bundle sub-terms
      * time:  -Pearson(proj_onto_axis, batch_index)  [non-saturating, grad always flows]
      * topic: triplet margin on topic_base using Jaccard-on-token-ids targets
      * ctx:   mild axis-alignment hinge on ctx_base projection
  - cross_bundle_independence: target pairwise |Pearson| of fiber-scalars ≈ 0.3
  - recon: relative error through kakeya encode/decode (diagnostic only in
    v4.6 since base_to_field maps are not yet nn.Parameter)
  - write_policy: tiny collapse-prevention + short-text penalty
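For the independence term above, a sketch of the pairwise-Pearson penalty (the ~0.3 target is from the list; batching details are assumptions):

```python
import torch

def pearson(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two (B,) batches of fiber scalars."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-8)

def cross_bundle_independence(scalars: dict, target: float = 0.3) -> torch.Tensor:
    """Penalize each pairwise |Pearson| for drifting from the ~0.3 target."""
    names = list(scalars)
    terms = [(pearson(scalars[a], scalars[b]).abs() - target) ** 2
             for i, a in enumerate(names) for b in names[i + 1:]]
    return torch.stack(terms).mean()
```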

ams_v4/training/trainer.py:
  - Trainer4 freezes backbone, collects trainable params from bundles +
    cross_attn + bridge, AdamW(lr=3e-4, wd=0.01), grad-clip 1.0.
  - step(batch_texts): reseeds store + registry, runs write() to mirror
    inference-side data structures, runs encode_batch_for_training for
    grad-bearing copies, sums weighted losses, backprop.
  - probe_weights(): snapshot of three representative weight magnitudes.
  - save(path, ...): dumps only trainable params + cfg_summary + provenance.
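The optimizer setup in miniature (a sketch; attribute names like memllm.backbone are assumptions):

```python
import torch

def build_optimizer(memllm):
    """Freeze the backbone, then hand every remaining trainable parameter
    (bundles + cross_attn + bridge) to AdamW, as Trainer4 does."""
    for p in memllm.backbone.parameters():
        p.requires_grad_(False)
    trainable = [p for p in memllm.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=3e-4, weight_decay=0.01)
    return opt, trainable

# per step, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)
#   opt.step(); opt.zero_grad()
```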

MemLLM4.load_trained_weights(path):
  - Strict shape match on named_parameters. Prints v4-native log line
    (mirrors v3.46 '[AMS_TRAINED_WEIGHTS] loaded=X skipped=Y' log format).

train_v4.py:
  - Same 9-sentence corpus as v3.46's train_v346.py (§5.3 of SPRINT_CLOSEOUT).
  - AdamW, 60 steps default, batch 3.
  - Writes ckpt/v4_trained.pt + ckpt/v4_train_log.jsonl.

Tests (ams_v4/tests/test_v46_train.py, 10/10 pass on CPU with distilgpt2):
  PASS test_encode_batch_for_training_shapes
  PASS test_loss_prefix_semantic_anchor_scalar_and_finite
  PASS test_loss_bundle_axis_alignment_nonneg
  PASS test_loss_cross_bundle_independence_nonneg
  PASS test_loss_recon_finite
  PASS test_loss_write_policy_finite
  PASS test_loss_prefix_anchor_gradient_flow_cross_attn
        gradient reaches cross_attn.lift_time[0].weight
  PASS test_loss_bundle_axis_alignment_gradient_flow
        gradient reaches bundle_time._axis_raw (the canonical axis)
  PASS test_trainer_three_step_cpu_smoke
        3 trainer steps run, losses vary across steps
  PASS test_trainer_save_and_reload_roundtrip
        save -> load_trained_weights -> weights bit-identical

Full v4 regression: 69/69 tests pass (6 skeleton + 11 v4.1 + 14 v4.2 +
19 v4.3 + 8 v4.4 + 1 v4.5 smoke + 10 v4.6 training).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Allows the parity harness to run against a v4 Trainer4 checkpoint
(ckpt/v4_trained.pt) via MemLLM4.load_trained_weights. Output report.json
records which checkpoint (if any) was used under config.trained_weights.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Diagnostic: v4 trained run at Qwen 1.5B produced 0% hit-rate with degenerate
repetition ('1. 1. 1. ...') outputs. Root cause: after prefix_ln, each prefix
slot has ||x||_2 ≈ sqrt(d_LLM) = 39 for d_LLM=1536, vs Qwen token embedding
norm ~2. Prefix was ~20x louder than tokens and dominated the backbone
forward, forcing repetition regardless of what memories the prefix encoded.

Fix: add a learnable nn.Parameter 'prefix_scale' initialized at 1/sqrt(d_LLM),
applied as prefix_ln(x) * prefix_scale. Initial magnitude matches token
embeddings; training can tune up from there via the prefix_semantic_anchor
loss.

No unit tests needed changing — the learnable scale is shape-preserving.
All 10 v4.6 training tests + 6 skeleton + 8 v4.4 + 1 v4.5 smoke pass.
Will retrain and re-run SUT on H200.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
v4.6 fix: content-token Jaccard + topic diversity regularizer

Diagnostic: after the prefix_scale fix, trained topic tree retrieval was
still collapsed: every query's top-1 retrieval returned mid=7 (Thai) or
mid=9 (sister), with cos > 0.99 across ALL (query, memory) pairs.
topic_base vectors were nearly collinear.

Two root causes in bundle_axis_alignment topic sub-term:
1. Jaccard was computed over raw token ids, which are dominated by shared
   stopwords ("User:", "I", "my", "the") — so "positive pair" was
   usually just "the next-door fact" rather than a meaningful content
   similarity. Triplet loss was pulling everything to a global mean.
2. No explicit diversity pressure; triplet loss alone doesn't prevent the
   whole batch from collapsing onto one direction.

Fix:
- _jaccard now drops token ids < 1000, cutting punctuation and the most-
  common BPE merges. Coarse heuristic, but works for Qwen2.5 + GPT-2
  vocabularies.
- Added a diversity regularizer: relu(off_diag_cos - 0.7).mean() penalizes
  any pair of topic_bases that are too collinear.
- Triplet margin bumped 0.1 -> 0.2 to give the diversity term room to
  push.
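Both fixes in miniature (threshold values from this commit; tensor shapes assumed — a sketch, not the shipped loss code):

```python
import torch

def content_jaccard(ids_a, ids_b, min_id: int = 1000) -> float:
    """Jaccard over content tokens only: ids below min_id (punctuation and
    the most common BPE merges) are dropped before the set overlap."""
    a = {t for t in ids_a if t >= min_id}
    b = {t for t in ids_b if t >= min_id}
    return len(a & b) / max(len(a | b), 1)

def topic_diversity(topic_base: torch.Tensor, ceiling: float = 0.7) -> torch.Tensor:
    """relu(off_diag_cos - ceiling).mean(): penalizes any pair of
    (unit-norm) topic_base rows that are too collinear."""
    cos = topic_base @ topic_base.T                         # (B, B)
    cos = cos - torch.eye(cos.shape[0], device=cos.device)  # zero the diagonal
    return torch.relu(cos - ceiling).mean()
```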

All 10 v4.6 training tests still pass. Will retrain on GPU and re-run SUT.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…+ topic-diversity fixes applied)

Training: 15.7 s, 60 steps x batch 3. Checkpoint ckpt/v4_trained.pt.
prefix_semantic_anchor final-epoch avg: ~1.6 (healthy; the pre-fix run hit ~0.2,
which was overfitting to the dominant prefix; see the prefix_scale fix commit).

N=10 (reports/session_viability_v4_trained/):
  D_full_history     hit=100%  in_tok=159  gen=483ms
  A_ams_prefix       hit= 10%  in_tok=11   gen=452ms  ret=21ms
  C_ams_hybrid       hit= 40%  in_tok=27   gen=452ms  ret=27ms

N=20 (reports/session_viability_v4_trained_20facts/):
  D_full_history     hit=100%  in_tok=301  gen=519ms
  A_ams_prefix       hit=  0%  in_tok=11   gen=466ms  ret=19ms
  C_ams_hybrid       hit= 30%  in_tok=27   gen=415ms  ret=26ms

MERGE GATE FAILED (v4-trained A,C at N=20 must exceed v3.46-trained 50/70).

The improvements from fresh-init -> trained are clear on C (N=20: 20% -> 30%,
N=10: 10% -> 40%) but A stays at 0-10%. Session is not mergeable to main.

Two diagnostic-driven fixes landed in the training cycle:
  1. CrossBundleAttention.prefix_scale as a learnable nn.Parameter — without
     it, the prefix L2-norm was ~39 per slot vs token embedding norm ~2, so
     the prefix dominated the backbone and produced degenerate repetition.
     After fix, generated text is coherent.
  2. Topic axis loss: content-token Jaccard (drop ids<1000) + a diversity
     regularizer (off-diag cos <= 0.7) — without these, triplet loss was
     driven by stopword overlap and collapsed all topic_base vectors to one
     direction. After fix, trained topic tree retrieves the correct memory
     on 1/5 diagnostic queries (vs 0/5 before) and off-diagonal cos is
     distributed instead of all > 0.99.

Remaining root cause (not in scope for this PR):
  - Topic base space is still too crowded at d_topic=16 with a 60-step tiny
    corpus. At training time the model sees 9 rotating sentences; the topic
    loss can satisfy diversity over 3 at a time, but the held-out session
    has 10/20 distinct memories.
  - The prefix_semantic_anchor loss uses a 50/50 text split, which for
    short training sentences leaves very little target signal (often 3-5
    tokens); the NLL surface is shallow.
  - Retrieval is still run flat (cross_attn attends over ALL entries), not
    filtered by tree top-k. Trained attention can still be overwhelmed by
    20 memories when only ~1 is relevant.

Follow-up PR (v4.7) should address these in the order above: scale the
training corpus, rework the prefix_semantic_anchor target (mask-the-entity
instead of 50/50 split), and add a tree-topk retrieval filter before
cross-bundle attention.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor (bot) changed the title from "v4 architecture realign: full stack (v4.1–v4.5) + fresh-init SUT on GPU" to "v4 architecture realign: v4.1–v4.6 stack + trained SUT on GPU (merge gate FAILED — ship-blocking follow-up v4.7 identified)" on Apr 22, 2026