
v4 architecture realign: v4.1–v4.6 stack + trained SUT on GPU (merge gate FAILED — ship-blocking follow-up v4.7 identified) #30

Draft
FluffyAIcode wants to merge 16 commits into main from AgentMemory/v347-architecture-realign-b7fa

Conversation

FluffyAIcode (Owner) commented Apr 22, 2026

Realigns the codebase to the abstract AMS spec (multiple Kakeya sets × three fiber bundles × explicit time/topic/background axes × cross-bundle attention). v4.1 → v4.6 implemented with 69 unit tests passing and end-to-end trained runs on Qwen2.5-1.5B at H200 scale.

The merge gate (v4-trained A/C at N=20 strictly > v3.46-trained 50/70) failed. Trained results with an honest diagnosis are below. This PR stays a draft; the v4.7 scope is identified below.

Status at the merge gate

Run                        A N=20   C N=20   Gate (A>50, C>70)
v3.46-trained (PR #29)        50%      70%   (reference)
v4 fresh-init GPU              0%      20%   —
v4-trained (this PR)           0%      30%   FAIL

Training converged on the five-term v4 loss (total 5 → 1.6 over 60 steps, 15.7 s on H200), but the hit-rate gap to v3.46-trained is large on both A and C. Two honest diagnostic cycles ran inside this PR:

  1. Prefix dominance: the first trained run produced degenerate repetition ("1. 1. 1. ...", 0% everywhere). Diagnostic: ‖prefix‖₂ ≈ 39 per slot vs. token embedding norm ≈ 2. Fix: a learnable prefix_scale parameter initialized at 1/sqrt(d_LLM) — see the sketch after this list. Text became coherent; hit-rate moved from 0% to 10/20%.
  2. Topic collapse: the trained topic tree retrieved the same mid on every query, with off-diagonal cos > 0.99 across all memory pairs. Diagnostic: token-ID Jaccard picked up stopword overlap, and the triplet loss collapsed the whole topic_base batch onto one direction. Fix: content-token Jaccard (drop ids < 1000) + a diversity regularizer (off-diag cos ≤ 0.7). C hit-rate moved to 40/30% at N=10/N=20. Diagnostic retrieval accuracy: 1/5 correct top-1 (up from 0/5).
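A minimal sketch of the prefix_scale fix from item 1 (module and names here are illustrative; the shipped parameter lives in CrossBundleAttention). LayerNorm leaves each slot at ‖x‖₂ ≈ sqrt(d_LLM) ≈ 39 for d_LLM=1536, so a scalar initialized at 1/sqrt(d_LLM) restores token-embedding magnitude at step 0:

```python
import torch
import torch.nn as nn

class ScaledPrefix(nn.Module):
    """Sketch only: LayerNorm output has unit variance per dim, hence
    L2 norm ~ sqrt(d_llm); scaling by 1/sqrt(d_llm) brings each slot
    down to ~1, near Qwen token-embedding magnitude. Training can tune
    the scale up via the prefix_semantic_anchor loss."""

    def __init__(self, d_llm: int):
        super().__init__()
        self.prefix_ln = nn.LayerNorm(d_llm)
        self.prefix_scale = nn.Parameter(torch.tensor(d_llm ** -0.5))

    def forward(self, prefix: torch.Tensor) -> torch.Tensor:
        # prefix: (B, L_mem, d_llm) -> same shape, token-embedding scale
        return self.prefix_ln(prefix) * self.prefix_scale
```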

After both fixes, A at N=20 is still 0%. The remaining gap is architectural rather than a tuning issue like the two fixed above — the v4.7 section below names what to work on.

What ships in this PR

  • ARCHITECTURE_v4.md — design spec.
  • ARCHITECTURE_v4_IMPL.md — per-PR implementation spec for v4.1–v4.5.
  • ARCHITECTURE_v4_TRAIN.md — v4.6 trainer + loss + data + merge-gate spec.
  • ams_v4/ full implementation (~2900 LOC):
    • v4.1 core/ + bundles/base.py — geometry primitives ported from v3.46.
    • v4.2 bundles/{temporal,topic,context}.py — three concrete Bundles + three encoders, each with explicit axis input.
    • v4.3 kakeya/ — multi-set kakeya with per-set bundle-axis alignment constraint.
    • v4.4 attention/{query_heads,cross_bundle}.py — three per-bundle attentions + learnable prefix_scale.
    • v4.5 projection/, bridge/ — end-to-end MemLLM4.
    • v4.6 training/{batch_encode,losses,trainer}.py — Trainer4 + five loss terms.
  • train_v4.py — training driver.
  • session_viability_v4.py — 3-mode parity harness (D, A, C).
  • ckpt/v4_trained.pt (training log: ckpt/v4_train_log.jsonl) + four reports/session_viability_v4_* directories (fresh + trained × N=10/N=20).

Test results

69/69 unit tests pass on CPU locally and on vast.ai H200:

Suite                                                     Count
skeleton                                                      6
v4.1 geometry + store                                        11
v4.2 encoders + bundles                                      14
v4.3 kakeya + alignment                                      19
v4.4 attention                                                8
v4.5 smoke (distilgpt2 end-to-end)                            1
v4.6 training (incl. grad flow, save/reload roundtrip)       10

GPU SUT — Qwen2.5-1.5B-Instruct, NVIDIA H200, mt=30, 60 training steps

Fresh-init (reports/session_viability_v4_fresh{,_20facts}/)

Mode             N=10 hit   N=20 hit   N=20 gen-ms
D_full_history       100%       100%           533
A_ams_prefix           0%         0%           488
C_ams_hybrid          10%        20%           431

Trained (reports/session_viability_v4_trained{,_20facts}/)

Mode             N=10 hit   N=20 hit   N=20 gen-ms   Δ vs fresh (N=10 / N=20)
D_full_history       100%       100%           519   0 / 0
A_ams_prefix          10%         0%           466   +10 / 0
C_ams_hybrid          40%        30%           415   +30 / +10

Reference: v3.46-trained (from PR #29)

Mode             N=10 hit   N=20 hit
A_ams_prefix          50%        50%
C_ams_hybrid          70%        70%

Honest reading

The merge gate fails: v4-trained A and C at N=20 are both below v3.46-trained. The per-turn generated text is coherent (after the prefix_scale fix), and the response is often topic-correct in direction (e.g. "You love classical music..." for the Chopin query) — what's missing is extraction of the specific keyword.

Three concrete gaps identified in the ckpt/v4_train_stdout.log probes and the diagnostic retrieval check:

  1. Topic space is still too crowded for N=20. At d_topic=16, trained on a 9-sentence corpus (the §5.3 rotating batch), the topic diversity regularizer only has to hold its ceiling across 3 memories at a time, while the held-out eval session has 10/20 distinct memories — the trained topic tree retrieves correctly on only 1/5 diagnostic queries.
  2. The prefix_semantic_anchor target is too shallow. With short training sentences, the 50/50 split often leaves just 3–5 target tokens, so the NLL surface is shallow. A harder target is needed (entity mask, cloze) — see the sketch after this list.
  3. Retrieval is flat. MemLLM4.prepare_decode_context attends over all entries rather than filtering to top-k via the direction trees. At N=20 the trained attention is still diluted by irrelevant memories.
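A minimal sketch of the harder cloze target named in gap 2 (detailed further in the v4.7 scope below). The `idf` map and the ids ≥ 1000 content heuristic are assumptions carried over from the topic-loss fix; this is not the shipped prefix_semantic_anchor code:

```python
def entity_cloze(token_ids: list[int], idf: dict[int, float]):
    """Pick the single highest-IDF content token as the supervised target;
    everything before it becomes the teacher-forced prompt."""
    content = [(i, t) for i, t in enumerate(token_ids) if t >= 1000]
    if not content:
        return None  # nothing maskable; caller can fall back to the 50/50 split
    pos, target = max(content, key=lambda it: idf.get(it[1], 0.0))
    return token_ids[:pos], target  # NLL is supervised on this one token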

v4.7 follow-up scope (not in this PR)

Three independently-testable changes, in order:

  1. Scale the training corpus. Replace the 9-sentence v3.46 rotating batch with the LongMemEval 50-entry sample (already in longmemeval_results.json) or a synthetic generator producing at least 200 entity-diverse sentences. Run 300+ steps.
  2. Reshape prefix_semantic_anchor. Replace the 50/50 split with entity masking: find the single highest-IDF content token in each sentence, mask it, supervise the LM to predict that token. This matches the session-viability query format.
  3. Tree-topk retrieval filter. In MemLLM4.prepare_decode_context, retrieve the top-k (say k=4) memories via a fused score across the three bundle trees before feeding into CrossBundleAttention — see the sketch below. The same change would also bring flat-scan B_ams_text-style modes into v4 if needed later.
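A sketch of the fused-score filter (function name and score shapes are assumptions; the real change would live in MemLLM4.prepare_decode_context):

```python
import torch

def tree_topk_filter(per_bundle_scores: dict[str, torch.Tensor], k: int = 4):
    """Fuse the three direction-tree score vectors (same memory ordering)
    and keep only the top-k memory indices for CrossBundleAttention."""
    fused = torch.stack(list(per_bundle_scores.values())).mean(dim=0)  # (n_mem,)
    return torch.topk(fused, min(k, fused.numel())).indices
```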

v4.7 merge gate is unchanged from v4.6: A/C at N=20 must strictly exceed 50/70.

What NOT to do in v4.7

Do not add v3.46-style decode-time patches (content_bias_*, strict_overlap_*, keyword_tail_slot, etc.). If v4.7's three fixes don't close the gate, that points at a real architectural issue that additional decode hacks would hide, not fix — the SPRINT_CLOSEOUT_v3.46.md §10.9 bright line holds.

Commits in this PR

Reverse chronological (most recent first):

  • 434b769 — v4.6 trained SUT results (honest failure diagnosis)
  • c102dfc — v4.6 fix: content-token Jaccard + topic diversity regularizer
  • a86ea25 — v4.6 fix: learnable prefix_scale in CrossBundleAttention
  • a81d40d — session_viability_v4 --trained-weights flag
  • f905e3a — v4.6: Trainer4 + five loss terms + train_v4.py driver
  • 9859216 — v4 fresh-init GPU SUT results
  • a913aad — v4.5 dtype fix (removed redundant prefix LN)
  • 5c7c729 — v4.5 auto-cuda in LLMBackbone4.load
  • 448c300 — session_viability_v4 harness
  • 9053b28 — v4.5 end-to-end + CPU smoke test
  • 1451733 — v4.4 BundleQueryHeads + CrossBundleAttention
  • 3f394c8 — v4.3 KakeyaSet + KakeyaRegistry + alignment
  • 08910fa — v4.2 three encoders + three Bundle subclasses
  • f4ef74c — v4.1 geometry primitives + MemStore + DirectionTreeV4
  • f7254af — ARCHITECTURE_v4_IMPL.md
  • 9f34781 — initial design skeleton + ARCHITECTURE_v4.md

cursoragent and others added 11 commits April 22, 2026 07:42
Opens a new architecture track AgentMemory/v347-architecture-realign-b7fa
that realigns the codebase to the abstract AMS spec:

  Multiple Kakeya sets compress the full context data. These Kakeya sets
  are linked on different fiber bundles. The fiber bundles carry memory
  encoding around time, topic, and background (context). An attention
  mechanism forms the current context window.

An audit of scheme_b_v344.py + kakeya_codec.py on PR #29 showed four of
the five structural claims in that sentence had drifted:

  - 'multiple kakeya sets'     : actually exactly 1 (KakeyaCodec is a singleton)
  - 'compress the full context': only semantic_emb is compressed
  - 'different fiber bundles'  : one bundle; kakeya and bundle are disjoint
  - 'time / topic / background': none are fiber-bundle coordinates. They
                                 live as scalar bookkeeping (ts/last/cnt),
                                 a side-channel tensor (context_descriptor)
                                 and an integer KMeans tag (cluster_id)
  - 'attention forms context'  : implemented (FiberAttn + QFormer + EmbBridge)

The 30-point gap between B_ams_text (80-90%) and A_ams_prefix (50%) on
PR #29 is the downstream symptom of the first four drifts.

This branch adds:

1. ARCHITECTURE_v4.md — 7-section design doc:
   §0 audit findings vs abstract spec
   §1 abstract-to-concrete mapping (5 subsections, one per spec clause)
   §2 ams_v4/ package layout
   §3 compilable-skeleton contract (NotImplementedError with v4-skel: markers)
   §4 migration plan: v4.1-v4.5 PRs, what each ports from v3.46
   §5 explicit non-goals (not RAG, not KG, not Cfg-knob-turning)
   §6 six assertable invariants
   §7 this PR's status and what's untouched

2. ams_v4/ package skeleton, importable, 24 Python files:
   core/      Cfg4, MemEntry, KakeyaHandle, MemStore, type helpers
              MemEntry now carries THREE (base, fiber, dirn) triples
              (one per bundle) instead of v3.46's single triple.
   bundles/   abstract Bundle + three concretes (Temporal/Topic/Context),
              each with its own encoder receiving the axis that bundle owns
              (time_scalars / content_tokens+wte / session_summary+prev_turns).
   kakeya/    KakeyaSet (single skeleton, bundle-owned, with alignment
              constraint), KakeyaRegistry (owns N sets, routes fields),
              alignment helpers, v4 codec facade.
   attention/ CrossBundleAttention (three per-bundle attentions + slot
              concat), BundleQueryHeads (three hidden->query projections).
   projection/ EmbBridge4 (thin prefix-prepend bridge; no content_bias,
              strict_overlap, keyword_tail_slot, or functional_suppression).
   bridge/    MemLLM4 top-level model, v3.46-compatible public surface.
   tests/     test_shapes.py, 6 static tests:
                - imports work
                - Cfg4 default constructs with all invariants passing
                - three Cfg4 invariants fire on violation (n_kakeya_sets>=2,
                  prefix_slots sum to L_mem, fiber dim divides head count)
                - stubbed methods raise NotImplementedError with v4-skel: marker

3. ams_v4/README.md — status + follow-up roadmap (v4.1-v4.5).

4. .gitignore (new) to keep __pycache__ etc. out.

v3.46 code (scheme_b_v344.py, kakeya_codec.py, train_v346.py,
session_viability.py) is not touched by this branch. PR #29 measurements
remain reproducible. The parity bar for v4.5's merge to main is:
MemLLM4 >= MemLLM v3.46 on session_viability.py, strict improvement on
A_ams_prefix and C_ams_hybrid at N=20.

Skeleton-test run (at commit time):
  PASS test_imports
  PASS test_cfg4_default_constructs
  PASS test_cfg4_invariant_n_kakeya_sets_min_2
  PASS test_cfg4_invariant_prefix_slots_sum
  PASS test_cfg4_invariant_fiber_divisibility
  PASS test_all_skeleton_components_raise_not_implemented
  all 6 skeleton tests passed

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Companion to ARCHITECTURE_v4.md. For each follow-up PR, specifies:

- which files are in scope
- which v3.46 classes port with what edits
- pseudocode for the new encoders / attention / kakeya math
- test list with exit criteria

Scope choice: v4.5 ships end-to-end write+retrieve+attend+inject+generate
with a CPU smoke test on a tiny backbone (sshleifer/tiny-gpt2, 7M params).
Training convergence is explicitly v4.6, not bundled here — mixing design-
drift fix with training-convergence fix would make failure modes hard to
diagnose.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Ports from scheme_b_v344.py (v3.46) with per-bundle parameterization:
  RiemannianMetric, GeodesicSolver, FiberConnection, FiberTransporter

New:
  - Bundle abstract (canonical_axis as nn.Parameter, unit-normalized on access)
  - DirectionTreeV4: beam retrieval; no AMM cross-coupling, no cluster-crowding
    rerank (those v3.46 workarounds are superseded by the per-bundle axis)
  - MemStore: three trees (time/topic/ctx), routes on add/remove, invariant check
  - MemEntry.assert_no_raw_large_fields — §6 invariant 2 enforcement
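For illustration, the MemStore routing contract in miniature (a sketch only; attribute names like entry.dirn as a per-bundle dict are assumptions, not the shipped MemEntry layout):

```python
class MemStoreSketch:
    """add()/remove() route one direction per bundle into that bundle's tree."""

    def __init__(self, trees):
        self.trees = trees      # {'time': tree, 'topic': tree, 'ctx': tree}
        self.entries = {}

    def add(self, mid, entry):
        self.entries[mid] = entry
        for name, tree in self.trees.items():
            tree.insert(mid, entry.dirn[name])   # one dirn per bundle

    def remove(self, mid):
        del self.entries[mid]
        for tree in self.trees.values():
            tree.remove(mid)
```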

Tests (ams_v4/tests/test_v41.py, 11/11 pass on CPU):
  PASS test_metric_spd                           SPD + symmetric
  PASS test_connection_antisymmetric             A + A^T ~ 0
  PASS test_transporter_preserves_norm           closed-loop drift < 10%
  PASS test_geodesic_endpoints                   path[0]=xs, path[-1]=xe
  PASS test_geodesic_linear_fallback             linear_path correct shape
  PASS test_memstore_add_routes_to_all_three_trees
  PASS test_direction_tree_insert_retrieve       target mid in top-3
  PASS test_memstore_remove_updates_trees
  PASS test_memstore_verify_consistency_empty
  PASS test_memstore_verify_consistency_populated
  PASS test_memstore_invariant_no_raw_large_fields

Skeleton stub tests (ams_v4/tests/test_shapes.py) pruned: v4.1 components
no longer raise NotImplementedError, so test_all_skeleton_components was
renamed to test_remaining_stubs and now checks only v4.2+ stubs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
TimeEncoder (temporal.py):
  - Fourier-feature encoding of (absolute_ts, recency, cnt)
  - base = LN(time_mlp(fourier) + hidden_proj(hidden))
  - fiber = MLP(concat(hidden, base, surprise))
  - dirn  = normalize(base)

TopicEncoder (topic.py):
  - IDF-weighted centroid of content_token_ids over wte_normed (batched)
  - base = normalize(down_project(centroid) + hidden_to_topic(hidden))  -> on sphere by construction
  - fiber = MLP(concat(hidden, base))
  - dirn  = base (already unit)
  - Ragged batch input (list-of-lists) supported

ContextEncoder (context.py):
  - Single-head attention pool over optional prev_turns
  - base = LN(mix_mlp(hidden + session_summary + attn))
  - fiber = MLP(concat(hidden, base, session_summary))
  - dirn  = normalize(base)

TopicBundle overrides _solver=None and provides _great_circle_path (slerp)
for transport; topic transport does not need gradient-descent geodesic solver.
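A minimal sketch of the slerp path (clamping and step count are illustrative, not the shipped _great_circle_path):

```python
import torch

def great_circle_path(xs: torch.Tensor, xe: torch.Tensor, steps: int = 8):
    """Spherical linear interpolation between unit vectors xs and xe;
    every intermediate point stays on the sphere by construction."""
    xs, xe = xs / xs.norm(), xe / xe.norm()
    omega = torch.arccos(torch.clamp(torch.dot(xs, xe), -1 + 1e-6, 1 - 1e-6))
    ts = torch.linspace(0.0, 1.0, steps)
    path = [(torch.sin((1 - t) * omega) * xs + torch.sin(t * omega) * xe)
            / torch.sin(omega) for t in ts]
    return torch.stack(path)  # (steps, d), each row unit-norm
```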

Tests (ams_v4/tests/test_v42.py, 14/14 pass on CPU):
  PASS test_time_encoder_shapes
  PASS test_time_dirn_unit_norm
  PASS test_temporal_bundle_encode_matches_encoder
  PASS test_idf_centroid_empty_returns_zero
  PASS test_idf_centroid_oov_returns_zero
  PASS test_topic_encoder_shapes_batched
  PASS test_topic_base_on_sphere              (||base||=1 within 1e-4)
  PASS test_topic_bundle_canonical_axis_unit
  PASS test_topic_great_circle_endpoints      (slerp endpoints exact, mid-points on sphere)
  PASS test_topic_transport_preserves_norm    (drift < 15%)
  PASS test_context_encoder_no_prev_turns
  PASS test_context_encoder_with_prev_turns
  PASS test_all_bundles_canonical_axis_unit
  PASS test_gradients_flow_through_time_encoder

Skeleton stub test in test_shapes.py pruned to only KakeyaRegistry.define_sets
now that all three encoders are implemented.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
alignment.py (pure functions, no state):
  - pushforward(axis_in_base, base_to_field) = axis @ map
  - project_into_pca(direction, basis)       = basis @ direction
  - alignment_error(t_dir, target)           = ||t_dir - normalize(target)||
  - solve_aligned_t_dir(target, tol)         = (normalize(target), 0)
                                                on near-zero -> unit e_0 + err=1
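These helpers are small enough to sketch directly (shapes assumed: axis (d_base,), map (d_base, d_field); a sketch of the described signatures, not the shipped file):

```python
import torch
import torch.nn.functional as F

def pushforward(axis_in_base: torch.Tensor, base_to_field: torch.Tensor):
    """Push a bundle axis through the base-to-field map: axis @ map."""
    return axis_in_base @ base_to_field

def alignment_error(t_dir: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """||t_dir - normalize(target)||_2."""
    return torch.linalg.norm(t_dir - F.normalize(target, dim=-1))

def solve_aligned_t_dir(target: torch.Tensor, tol: float = 1e-6):
    """normalize(target) with zero error; near-zero targets fall back to
    the unit basis vector e_0 with error 1, as described above."""
    n = torch.linalg.norm(target)
    if n < tol:
        e0 = torch.zeros_like(target)
        e0[0] = 1.0
        return e0, 1.0
    return target / n, 0.0
```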

KakeyaSet:
  - Build pipeline: PCA -> align t_dir to bundle axis pushforward ->
    perpendicular spherical K-means -> store KakeyaSkeleton4
  - encode(v): (alpha on t_dir, segment id, t along center, sparse residual top-k)
  - decode(cv): reconstruct field vector from CompressedVec
  - verify_alignment: recompute pushforward and return ||t_dir - projected||
  - _compute_pca + _spherical_kmeans ported from kakeya_codec.py (v3.12 helpers)

KakeyaRegistry:
  - Owns N KakeyaSet instances per _routing (default: 4 sets across 3 bundles,
    with cross-axis redundancy semantic_emb+content_wte_mean)
  - build(field_corpus, bundle_axes) populates all active sets; auto-initializes
    per-routing-key base_to_field map (seeded for determinism)
  - encode_memory_fields / decode_field: per-memory API
  - verify_invariants(n, bundle_axes): enforces §6 invariants 3 + 4

Tests (ams_v4/tests/test_v43.py, 19/19 pass on CPU):
  6 alignment-math tests (pushforward, project, alignment_error, solve)
  2 helper tests (_compute_pca, _spherical_kmeans)
  4 KakeyaSet tests (build activates, alignment near-zero, roundtrip,
                     reject-wrong-dim)
  7 Registry tests (default 4 sets, custom routing, short-routing rejection,
                    handle covers all fields, decode roundtrip, invariant pass,
                    invariant 3 fires when active-set-count < 2)

§6 invariant 5 (reconstruction) verified: median rel err <= 0.15, max < 0.65.
§6 invariant 4 (alignment) verified: err < kakeya_alignment_tol = 1e-3 after build.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
BundleQueryHeads (attention/query_heads.py):
  - LayerNorm on hidden_state
  - Three independent Linear heads: time / topic / ctx
  - Each projects d_LLM -> d_F_{bundle}

CrossBundleAttention (attention/cross_bundle.py):
  - Three MultiheadAttention modules, one per bundle fiber space
    (d_F_time / d_F_topic / d_F_ctx, each with its own head count)
  - Per-slot Linear lifts: each of prefix_slots_{time,topic,ctx} slots
    gets its own d_F_bundle -> d_LLM map
  - Concat-along-slot-dim -> (B, L_mem, d_LLM) -> post LayerNorm
  - Asserts output shape invariant §6.6

Design choice: three per-bundle attentions instead of one shared attention.
This keeps the topic signal from getting mixed with the temporal signal in
the attention kernel itself; combination happens at the slot-concat stage.
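A sketch of the slot-concat stage only (per-bundle attention omitted; dict keys and dims are illustrative, assuming one pooled attention output per bundle):

```python
import torch
import torch.nn as nn

class SlotLiftConcat(nn.Module):
    """Each bundle's pooled attention output gets one Linear lift per slot;
    slots concatenate to (B, L_mem, d_llm), then a post LayerNorm."""

    def __init__(self, d_fiber: dict, slots: dict, d_llm: int):
        super().__init__()
        self.lifts = nn.ModuleDict({
            b: nn.ModuleList([nn.Linear(d_fiber[b], d_llm) for _ in range(slots[b])])
            for b in d_fiber
        })
        self.post_ln = nn.LayerNorm(d_llm)

    def forward(self, attended: dict) -> torch.Tensor:
        # attended[b]: (B, d_fiber[b]) pooled per-bundle attention output
        out = [lift(attended[b]) for b in self.lifts for lift in self.lifts[b]]
        return self.post_ln(torch.stack(out, dim=1))
```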

Tests (ams_v4/tests/test_v44.py, 8/8 pass on CPU):
  PASS test_query_heads_shapes
  PASS test_query_heads_distinct
  PASS test_cross_bundle_forward_shape        (B, L_mem, d_LLM) exactly
  PASS test_cross_bundle_requires_at_least_one_entry
  PASS test_cross_bundle_gradient_flow        backward through q_time.weight
  PASS test_cross_bundle_finite_with_random_fibers
  PASS test_cross_bundle_batch_determinism    eval() + identical input -> identical output
  PASS test_cross_bundle_slot_allocation_matches_cfg
        perturbing only time fibers changes time slots more than topic/ctx slots

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
LLMBackbone4 (bridge/backbone.py):
  - Thin wrapper over HF AutoModelForCausalLM
  - Freezes backbone params (v4 does NOT fine-tune the LM)
  - tokenize / hidden_states / forward_with_prefix / generate_with_prefix
  - Manual greedy-decode loop with inputs_embeds (avoids HF generate()
    inputs_embeds edge cases)
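A minimal sketch of that loop (no KV cache or attention mask, for brevity; `model` is any HF AutoModelForCausalLM and `wte` its input embedding module — names here are illustrative):

```python
import torch

@torch.no_grad()
def greedy_with_prefix(model, wte, prefix, input_ids, max_new_tokens: int = 30):
    """Prepend the memory prefix once, then greedily extend with token
    embeddings via inputs_embeds, sidestepping HF generate()."""
    embeds = torch.cat([prefix, wte(input_ids)], dim=1)  # (B, L_mem + T, d)
    new_ids = []
    for _ in range(max_new_tokens):
        logits = model(inputs_embeds=embeds).logits[:, -1]  # last position
        nxt = logits.argmax(dim=-1, keepdim=True)           # (B, 1), greedy
        new_ids.append(nxt)
        embeds = torch.cat([embeds, wte(nxt)], dim=1)
    return torch.cat(new_ids, dim=1)
```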

EmbBridge4 (projection/bridge.py):
  - prefix_post_ln + build_inputs(prefix, ids, mask, wte)
  - Prepends prefix embeds + extends attention mask
  - No CFG, no content_bias, no logit shaping — v3.46 decode-time patches
    are intentionally NOT ported

MemLLM4 (bridge/memllm.py):
  - Composes backbone + 3 bundles + cross_attn + kakeya registry + store
  - write(text):
      1. backbone.hidden_states -> pooled float32 hidden
      2. three bundles.encode -> (time_*, topic_*, ctx_*) triples
      3. extract large fields (semantic_emb, content_wte_mean, context_descriptor)
      4. store.add -> triggers _maybe_build_kakeya once n >= min_entries
      5. existing entries re-encoded through the active registry
  - prepare_decode_context(ids, mask):
      1. pooled query hidden
      2. cross_attn over ALL entries (flat attend; retrieval-filter in v4.6)
  - generate(prompt, mt):
      1. prepare_decode_context
      2. backbone.generate_with_prefix (manual greedy loop)

CPU smoke test (tests/test_v45_smoke.py):
  - distilgpt2 backbone (82M params, d_LLM=768), 6 written memories
  - Verifies §6 invariants 1, 2, 3, 4, 6 on live data
  - Runs generate(); does NOT assert output quality (that's v4.6)
  - Completes in ~5 s on CPU

All v4 tests passing (59 total):
    6 skeleton
   11 v4.1 (geometry + MemStore + DirectionTreeV4)
   14 v4.2 (three encoders + three bundles)
   19 v4.3 (kakeya + alignment)
    8 v4.4 (attention)
    1 v4.5 smoke (end-to-end)

Skeleton test test_remaining_stubs_raise_not_implemented renamed to
test_v45_constructs_without_backbone: no stubs remain after v4.5.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
3-mode subset (D_full_history, A_ams_prefix, C_ams_hybrid) on the same
10-query synthetic session as PR #29's session_viability.py, but using
MemLLM4 for A and C.

Fresh-init expectation: A/C hit-rates at Qwen2.5-1.5B scale on GPU are
not expected to beat v3.46 fresh-init — that requires training (v4.6).
This harness produces the fresh-init baseline that v4-trained will be
compared against.

B modes (B_flat_cos, B_ams_text) omitted: they are RAG-shaped upper-bound
diagnostics, not v4 product modes (per SPRINT_CLOSEOUT_v3.46.md §10.9).
D_full_history is kept as the ceiling baseline.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Default device = torch.device('cuda') when torch.cuda.is_available() and the
caller didn't pass a device override. Without this, MemLLM4 ran on CPU even
on GPU-equipped hosts, making the session_viability_v4 harness unusably slow
(~27 s per D_full_history generate on Qwen 1.5B).
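The whole fix is essentially one helper (a sketch; the shipped version sits inside LLMBackbone4.load):

```python
import torch

def resolve_device(override=None) -> torch.device:
    """Default to CUDA when available, unless the caller passed a device."""
    if override is not None:
        return override
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```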

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
CrossBundleAttention already applies a LayerNorm at the end of forward().
EmbBridge4 (and MemLLM4.generate) previously applied a *second* LayerNorm,
which triggered RuntimeError('expected BFloat16 but found Float') when the
backbone is bf16 on GPU (v4 modules are fp32 by default).

Fix:
  - EmbBridge4 no longer owns prefix_post_ln; build_inputs just concats
    prefix.to(dtype) with wte(ids).
  - MemLLM4.generate() skips the LN and passes ctx.prefix.to(backbone_dtype)
    directly to backbone.generate_with_prefix.

Local v4.5 smoke test still passes (distilgpt2, fp32). No unit-test change
since no test exercised EmbBridge4.build_inputs directly.
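A sketch of the post-fix concat (the prefix arrives already LayerNormed by CrossBundleAttention; only a dtype cast happens here — signature mirrors the description above, not the exact shipped API):

```python
import torch

def build_inputs(prefix, input_ids, attention_mask, wte):
    """Cast the fp32 prefix to the backbone dtype (e.g. bf16) and prepend;
    no second LayerNorm."""
    tok = wte(input_ids)                                    # (B, T, d), backbone dtype
    embeds = torch.cat([prefix.to(tok.dtype), tok], dim=1)  # (B, L_mem + T, d)
    pad = torch.ones(prefix.shape[:2], dtype=attention_mask.dtype,
                     device=attention_mask.device)
    return embeds, torch.cat([pad, attention_mask], dim=1)
```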

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
N=10 (reports/session_viability_v4_fresh/):
  D_full_history     hit=100%  in_tok=159  gen=501ms
  A_ams_prefix       hit=  0%  in_tok=11   gen=493ms  ret=28ms
  C_ams_hybrid       hit= 10%  in_tok=27   gen=412ms  ret=27ms

N=20 (reports/session_viability_v4_fresh_20facts/):
  D_full_history     hit=100%  in_tok=301  gen=533ms
  A_ams_prefix       hit=  0%  in_tok=11   gen=488ms  ret=18ms
  C_ams_hybrid       hit= 20%  in_tok=27   gen=431ms  ret=28ms

EXPECTED BEHAVIOR for fresh-init v4. The v3.46 fresh-GPU numbers at N=20
(A=50%, C=70%) reflect ~15 decode-time logit-shaping hacks (content_bias,
strict_overlap, keyword_tail_slot, functional_suppression, etc.) that v4
does NOT port. v4 exposes the pure prefix-channel mechanism.

Purpose of this baseline: it is the FRESH-INIT floor that v4.6 trained
numbers will be compared against. Expected training lift is large because
the new v4 loss terms (bundle_axis_alignment, cross_bundle_independence,
prefix_semantic_anchor, recon, write_policy) directly target the prefix
channel mechanism, unlike v3.46 where only a few loss terms touched the
prefix channel and decode-time hacks compensated for the rest.

v4 ran end-to-end at Qwen2.5-1.5B scale on H200 with:
  - All 6 skeleton tests passing
  - All 52 v4.1-v4.5 unit tests passing
  - Full stack: 3 bundles x 3 trees x kakeya registry with 4 active sets
  - C_ams_hybrid retrieve latency 28ms (beats v3.46 ~400ms — the v4 tree is
    per-bundle and does not run v3.46's rerank-inside-retrieve)
  - Generate latency 412-493ms (bounded by backbone forward, no CFG)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor (bot) changed the title from "v4 architecture realign: skeleton — multi-kakeya × 3 bundles × (time/topic/context) axes" to "v4 architecture realign: full stack (v4.1–v4.5) + fresh-init SUT on GPU" on Apr 22, 2026
cursoragent and others added 5 commits April 22, 2026 08:44
ARCHITECTURE_v4_TRAIN.md: per-loss design, training data, merge gate.

ams_v4/training/batch_encode.py:
  - encode_batch_for_training(): three bundles run on a list of texts,
    produces (base, fiber, dirn) stacks with gradients retained. Used only
    during training — production write() path is untouched (still detaches).
  - batch_to_mementries(): build MemEntry objects that reference grad-
    carrying tensors, for use by CrossBundleAttention during the loss.

ams_v4/training/losses.py (5 terms, mirroring Cfg4.loss_weights keys):
  - prefix_semantic_anchor: teacher-forced next-token NLL through
    (cross_attn → prefix → backbone). Main signal.
  - bundle_axis_alignment: three per-bundle sub-terms
      * time:  -Pearson(proj_onto_axis, batch_index)  [non-saturating, grad always flows]
      * topic: triplet margin on topic_base using Jaccard-on-token-ids targets
      * ctx:   mild axis-alignment hinge on ctx_base projection
  - cross_bundle_independence: target pairwise |Pearson| of fiber-scalars ≈ 0.3
  - recon: relative error through kakeya encode/decode (diagnostic only in
    v4.6 since base_to_field maps are not yet nn.Parameter)
  - write_policy: tiny collapse-prevention + short-text penalty
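For the independence term above, a sketch of the pairwise-Pearson penalty (the ~0.3 target is from the list; batching details are assumptions):

```python
import torch

def pearson(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two (B,) batches of fiber scalars."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-8)

def cross_bundle_independence(scalars: dict, target: float = 0.3) -> torch.Tensor:
    """Penalize each pairwise |Pearson| for drifting from the ~0.3 target."""
    names = list(scalars)
    terms = [(pearson(scalars[a], scalars[b]).abs() - target) ** 2
             for i, a in enumerate(names) for b in names[i + 1:]]
    return torch.stack(terms).mean()
```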

ams_v4/training/trainer.py:
  - Trainer4 freezes backbone, collects trainable params from bundles +
    cross_attn + bridge, AdamW(lr=3e-4, wd=0.01), grad-clip 1.0.
  - step(batch_texts): reseeds store + registry, runs write() to mirror
    inference-side data structures, runs encode_batch_for_training for
    grad-bearing copies, sums weighted losses, backprop.
  - probe_weights(): snapshot of three representative weight magnitudes.
  - save(path, ...): dumps only trainable params + cfg_summary + provenance.
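The optimizer setup in miniature (a sketch; attribute names like memllm.backbone are assumptions):

```python
import torch

def build_optimizer(memllm):
    """Freeze the backbone, then hand every remaining trainable parameter
    (bundles + cross_attn + bridge) to AdamW, as Trainer4 does."""
    for p in memllm.backbone.parameters():
        p.requires_grad_(False)
    trainable = [p for p in memllm.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=3e-4, weight_decay=0.01)
    return opt, trainable

# per step, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)
#   opt.step(); opt.zero_grad()
```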

MemLLM4.load_trained_weights(path):
  - Strict shape match on named_parameters. Prints v4-native log line
    (mirrors v3.46 '[AMS_TRAINED_WEIGHTS] loaded=X skipped=Y' log format).

train_v4.py:
  - Same 9-sentence corpus as v3.46's train_v346.py (§5.3 of SPRINT_CLOSEOUT).
  - AdamW, 60 steps default, batch 3.
  - Writes ckpt/v4_trained.pt + ckpt/v4_train_log.jsonl.

Tests (ams_v4/tests/test_v46_train.py, 10/10 pass on CPU with distilgpt2):
  PASS test_encode_batch_for_training_shapes
  PASS test_loss_prefix_semantic_anchor_scalar_and_finite
  PASS test_loss_bundle_axis_alignment_nonneg
  PASS test_loss_cross_bundle_independence_nonneg
  PASS test_loss_recon_finite
  PASS test_loss_write_policy_finite
  PASS test_loss_prefix_anchor_gradient_flow_cross_attn
        gradient reaches cross_attn.lift_time[0].weight
  PASS test_loss_bundle_axis_alignment_gradient_flow
        gradient reaches bundle_time._axis_raw (the canonical axis)
  PASS test_trainer_three_step_cpu_smoke
        3 trainer steps run, losses vary across steps
  PASS test_trainer_save_and_reload_roundtrip
        save -> load_trained_weights -> weights bit-identical

Full v4 regression: 69/69 tests pass (6 skeleton + 11 v4.1 + 14 v4.2 +
19 v4.3 + 8 v4.4 + 1 v4.5 smoke + 10 v4.6 training).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Allows the parity harness to run against a v4 Trainer4 checkpoint
(ckpt/v4_trained.pt) via MemLLM4.load_trained_weights. Output report.json
records which checkpoint (if any) was used under config.trained_weights.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Diagnostic: v4 trained run at Qwen 1.5B produced 0% hit-rate with degenerate
repetition ('1. 1. 1. ...') outputs. Root cause: after prefix_ln, each prefix
slot has ||x||_2 ≈ sqrt(d_LLM) = 39 for d_LLM=1536, vs Qwen token embedding
norm ~2. Prefix was ~20x louder than tokens and dominated the backbone
forward, forcing repetition regardless of what memories the prefix encoded.

Fix: add a learnable nn.Parameter 'prefix_scale' initialized at 1/sqrt(d_LLM),
applied as prefix_ln(x) * prefix_scale. Initial magnitude matches token
embeddings; training can tune up from there via the prefix_semantic_anchor
loss.

No unit tests needed changing — the learnable scale is shape-preserving.
All 10 v4.6 training tests + 6 skeleton + 8 v4.4 + 1 v4.5 smoke pass.
Will retrain and re-run SUT on H200.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
v4.6 fix: content-token Jaccard + topic diversity regularizer

Diagnostic: after the prefix_scale fix, trained topic tree retrieval was
still collapsed: every query's top-1 retrieval returned mid=7 (Thai) or
mid=9 (sister), with cos > 0.99 across ALL (query, memory) pairs.
topic_base vectors were nearly collinear.

Two root causes in bundle_axis_alignment topic sub-term:
1. Jaccard was computed over raw token ids, which are dominated by shared
   stopwords ("User:", "I", "my", "the") — so "positive pair" was
   usually just "the next-door fact" rather than a meaningful content
   similarity. Triplet loss was pulling everything to a global mean.
2. No explicit diversity pressure; triplet loss alone doesn't prevent the
   whole batch from collapsing onto one direction.

Fix:
- _jaccard now drops token ids < 1000, cutting punctuation and the most-
  common BPE merges. Coarse heuristic, but works for Qwen2.5 + GPT-2
  vocabularies.
- Added a diversity regularizer: relu(off_diag_cos - 0.7).mean() penalizes
  any pair of topic_bases that are too collinear.
- Triplet margin bumped 0.1 -> 0.2 to give the diversity term room to
  push.
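Both fixes in miniature (threshold values from this commit; tensor shapes assumed — a sketch, not the shipped loss code):

```python
import torch

def content_jaccard(ids_a, ids_b, min_id: int = 1000) -> float:
    """Jaccard over content tokens only: ids below min_id (punctuation and
    the most common BPE merges) are dropped before the set overlap."""
    a = {t for t in ids_a if t >= min_id}
    b = {t for t in ids_b if t >= min_id}
    return len(a & b) / max(len(a | b), 1)

def topic_diversity(topic_base: torch.Tensor, ceiling: float = 0.7) -> torch.Tensor:
    """relu(off_diag_cos - ceiling).mean(): penalizes any pair of
    (unit-norm) topic_base rows that are too collinear."""
    cos = topic_base @ topic_base.T                         # (B, B)
    cos = cos - torch.eye(cos.shape[0], device=cos.device)  # zero the diagonal
    return torch.relu(cos - ceiling).mean()
```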

All 10 v4.6 training tests still pass. Will retrain on GPU and re-run SUT.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…+ topic-diversity fixes applied)

Training: 15.7 s, 60 steps x batch 3. Checkpoint ckpt/v4_trained.pt.
prefix_semantic_anchor final-epoch avg: ~1.6 (healthy; the pre-fix run hit ~0.2,
which was overfitting to the dominant prefix; see the prefix_scale fix commit).

N=10 (reports/session_viability_v4_trained/):
  D_full_history     hit=100%  in_tok=159  gen=483ms
  A_ams_prefix       hit= 10%  in_tok=11   gen=452ms  ret=21ms
  C_ams_hybrid       hit= 40%  in_tok=27   gen=452ms  ret=27ms

N=20 (reports/session_viability_v4_trained_20facts/):
  D_full_history     hit=100%  in_tok=301  gen=519ms
  A_ams_prefix       hit=  0%  in_tok=11   gen=466ms  ret=19ms
  C_ams_hybrid       hit= 30%  in_tok=27   gen=415ms  ret=26ms

MERGE GATE FAILED (v4-trained A,C at N=20 must exceed v3.46-trained 50/70).

The improvements from fresh-init -> trained are clear on C (N=20: 20% -> 30%,
N=10: 10% -> 40%) but A stays at 0-10%. Session is not mergeable to main.

Two diagnostic-driven fixes landed in the training cycle:
  1. CrossBundleAttention.prefix_scale as a learnable nn.Parameter — without
     it, the prefix L2-norm was ~39 per slot vs token embedding norm ~2, so
     the prefix dominated the backbone and produced degenerate repetition.
     After fix, generated text is coherent.
  2. Topic axis loss: content-token Jaccard (drop ids<1000) + a diversity
     regularizer (off-diag cos <= 0.7) — without these, triplet loss was
     driven by stopword overlap and collapsed all topic_base vectors to one
     direction. After fix, trained topic tree retrieves the correct memory
     on 1/5 diagnostic queries (vs 0/5 before) and off-diagonal cos is
     distributed instead of all > 0.99.

Remaining root cause (not in scope for this PR):
  - Topic base space is still too crowded at d_topic=16 with a 60-step tiny
    corpus. At training time the model sees 9 rotating sentences; the topic
    loss can satisfy diversity over 3 at a time, but the held-out session
    has 10/20 distinct memories.
  - The prefix_semantic_anchor loss uses a 50/50 text split, which for
    short training sentences leaves very little target signal (often 3-5
    tokens); the NLL surface is shallow.
  - Retrieval is still run flat (cross_attn attends over ALL entries), not
    filtered by tree top-k. Trained attention can still be overwhelmed by
    20 memories when only ~1 is relevant.

Follow-up PR (v4.7) should address these in the order above: scale the
training corpus, rework the prefix_semantic_anchor target (mask-the-entity
instead of 50/50 split), and add a tree-topk retrieval filter before
cross-bundle attention.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor (bot) changed the title from "v4 architecture realign: full stack (v4.1–v4.5) + fresh-init SUT on GPU" to "v4 architecture realign: v4.1–v4.6 stack + trained SUT on GPU (merge gate FAILED — ship-blocking follow-up v4.7 identified)" on Apr 22, 2026