v4 architecture realign: v4.1–v4.6 stack + trained SUT on GPU (merge gate FAILED — ship-blocking follow-up v4.7 identified)#30
Draft
FluffyAIcode wants to merge 16 commits into main from AgentMemory/v347-architecture-realign-b7fa
Conversation
Opens a new architecture track AgentMemory/v347-architecture-realign-b7fa that realigns the codebase to the abstract AMS spec:

> Multiple Kakeya sets compress the full context data. These Kakeya sets are linked on different fiber bundles. The fiber bundles carry memory encoding around time, topic, and background (context). An attention mechanism forms the current context window.

An audit of scheme_b_v344.py + kakeya_codec.py on PR #29 showed four of the five structural claims in that sentence had drifted:

- "multiple kakeya sets": actually exactly 1 (KakeyaCodec is a singleton)
- "compress the full context": only semantic_emb is compressed
- "different fiber bundles": one bundle; kakeya and bundle are disjoint
- "time / topic / background": none are fiber-bundle coordinates. They live as scalar bookkeeping (ts/last/cnt), a side-channel tensor (context_descriptor), and an integer KMeans tag (cluster_id)
- "attention forms context": implemented (FiberAttn + QFormer + EmbBridge)

The 30-point gap between B_ams_text (80-90%) and A_ams_prefix (50%) on PR #29 is the downstream symptom of the first four drifts.

This branch adds:

1. ARCHITECTURE_v4.md — 7-section design doc:
   §0 audit findings vs abstract spec
   §1 abstract-to-concrete mapping (5 subsections, one per spec clause)
   §2 ams_v4/ package layout
   §3 compilable-skeleton contract (NotImplementedError with v4-skel: markers)
   §4 migration plan: v4.1–v4.5 PRs, what each ports from v3.46
   §5 explicit non-goals (not RAG, not KG, not Cfg-knob-turning)
   §6 six assertable invariants
   §7 this PR's status and what's untouched
2. ams_v4/ package skeleton, importable, 24 Python files:
   core/       Cfg4, MemEntry, KakeyaHandle, MemStore, type helpers.
               MemEntry now carries THREE (base, fiber, dirn) triples (one per
               bundle) instead of v3.46's single triple.
   bundles/    abstract Bundle + three concretes (Temporal/Topic/Context), each
               with its own encoder receiving the axis that bundle owns
               (time_scalars / content_tokens+wte / session_summary+prev_turns).
   kakeya/     KakeyaSet (single skeleton, bundle-owned, with alignment
               constraint), KakeyaRegistry (owns N sets, routes fields),
               alignment helpers, v4 codec facade.
   attention/  CrossBundleAttention (three per-bundle attentions + slot concat),
               BundleQueryHeads (three hidden->query projections).
   projection/ EmbBridge4 (thin prefix-prepend bridge; no content_bias,
               strict_overlap, keyword_tail_slot, or functional_suppression).
   bridge/     MemLLM4 top-level model, v3.46-compatible public surface.
   tests/      test_shapes.py, 6 static tests:
               - imports work
               - Cfg4 default constructs with all invariants passing
               - three Cfg4 invariants fire on violation (n_kakeya_sets >= 2,
                 prefix_slots sum to L_mem, fiber dim divides head count)
               - stubbed methods raise NotImplementedError with v4-skel: marker
3. ams_v4/README.md — status + follow-up roadmap (v4.1–v4.5).
4. .gitignore (new) to keep __pycache__ etc. out.

v3.46 code (scheme_b_v344.py, kakeya_codec.py, train_v346.py, session_viability.py) is not touched by this branch. PR #29 measurements remain reproducible. The parity bar for v4.5's merge to main is: MemLLM4 >= MemLLM v3.46 on session_viability.py, strict improvement on A_ams_prefix and C_ams_hybrid at N=20.
Skeleton-test run (at commit time):
PASS test_imports
PASS test_cfg4_default_constructs
PASS test_cfg4_invariant_n_kakeya_sets_min_2
PASS test_cfg4_invariant_prefix_slots_sum
PASS test_cfg4_invariant_fiber_divisibility
PASS test_all_skeleton_components_raise_not_implemented
all 6 skeleton tests passed
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
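For readers skimming the skeleton contract: a minimal sketch of the three Cfg4 invariants those tests exercise. Field names here (L_mem, prefix_slots_*, d_F_time, n_heads_time) are assumptions inferred from this PR's prose, not the actual ams_v4 source.

```python
# Hedged sketch of the three Cfg4 invariants named above; field names are
# assumptions based on this PR's prose, not the real ams_v4 Cfg4.
from dataclasses import dataclass

@dataclass
class Cfg4Sketch:
    n_kakeya_sets: int = 4
    L_mem: int = 12
    prefix_slots_time: int = 4
    prefix_slots_topic: int = 4
    prefix_slots_ctx: int = 4
    d_F_time: int = 32
    n_heads_time: int = 4

    def __post_init__(self):
        # invariant: at least two kakeya sets (the point of the v4 realign)
        assert self.n_kakeya_sets >= 2, "n_kakeya_sets must be >= 2"
        # invariant: per-bundle prefix slots must tile the full prefix
        slots = (self.prefix_slots_time + self.prefix_slots_topic
                 + self.prefix_slots_ctx)
        assert slots == self.L_mem, "prefix slots must sum to L_mem"
        # divisibility invariant (direction assumed: fiber dim splits evenly
        # across attention heads)
        assert self.d_F_time % self.n_heads_time == 0
```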
Companion to ARCHITECTURE_v4.md. For each follow-up PR, specifies:
- which files are in scope
- which v3.46 classes port with what edits
- pseudocode for the new encoders / attention / kakeya math
- test list with exit criteria

Scope choice: v4.5 ships end-to-end write+retrieve+attend+inject+generate with a CPU smoke test on a tiny backbone (sshleifer/tiny-gpt2, 7M params). Training convergence is explicitly v4.6, not bundled here — mixing a design-drift fix with a training-convergence fix would make failure modes hard to diagnose.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Ports from scheme_b_v344.py (v3.46) with per-bundle parameterization:
RiemannianMetric, GeodesicSolver, FiberConnection, FiberTransporter
New:
- Bundle abstract (canonical_axis as nn.Parameter, unit-normalized on access)
- DirectionTreeV4: beam retrieval; no AMM cross-coupling, no cluster-crowding
rerank (those v3.46 workarounds are superseded by the per-bundle axis)
- MemStore: three trees (time/topic/ctx), routes on add/remove, invariant check
- MemEntry.assert_no_raw_large_fields — §6 invariant 2 enforcement
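A minimal sketch of the three-tree routing described above, under the assumption that DirectionTreeV4 exposes insert(mid, dirn) / remove(mid); a dict-backed stub stands in for the real tree, and all names are illustrative, not the actual MemStore API.

```python
# Sketch only: dict-backed stand-ins for DirectionTreeV4 and MemEntry.
from dataclasses import dataclass

class TreeStub:
    def __init__(self):
        self.dirs = {}
    def insert(self, mid, dirn):
        self.dirs[mid] = dirn
    def remove(self, mid):
        self.dirs.pop(mid, None)

@dataclass
class EntrySketch:
    dirn: dict  # {"time": tensor, "topic": tensor, "ctx": tensor}
    def assert_no_raw_large_fields(self):
        pass    # real MemEntry enforces §6 invariant 2 here

class MemStoreSketch:
    def __init__(self):
        self.trees = {"time": TreeStub(), "topic": TreeStub(), "ctx": TreeStub()}
        self.entries = {}

    def add(self, mid, entry):
        entry.assert_no_raw_large_fields()   # invariant checked at the door
        self.entries[mid] = entry
        # route each bundle's dirn into that bundle's tree
        for name, tree in self.trees.items():
            tree.insert(mid, entry.dirn[name])

    def remove(self, mid):
        self.entries.pop(mid)
        for tree in self.trees.values():
            tree.remove(mid)
```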
Tests (ams_v4/tests/test_v41.py, 11/11 pass on CPU):
PASS test_metric_spd SPD + symmetric
PASS test_connection_antisymmetric A + A^T ~ 0
PASS test_transporter_preserves_norm closed-loop drift < 10%
PASS test_geodesic_endpoints path[0]=xs, path[-1]=xe
PASS test_geodesic_linear_fallback linear_path correct shape
PASS test_memstore_add_routes_to_all_three_trees
PASS test_direction_tree_insert_retrieve target mid in top-3
PASS test_memstore_remove_updates_trees
PASS test_memstore_verify_consistency_empty
PASS test_memstore_verify_consistency_populated
PASS test_memstore_invariant_no_raw_large_fields
Skeleton stub tests (ams_v4/tests/test_shapes.py) pruned: v4.1 components
no longer raise NotImplementedError, so test_all_skeleton_components was
renamed to test_remaining_stubs and now checks only v4.2+ stubs.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
TimeEncoder (temporal.py):
- Fourier-feature encoding of (absolute_ts, recency, cnt)
- base = LN(time_mlp(fourier) + hidden_proj(hidden))
- fiber = MLP(concat(hidden, base, surprise))
- dirn = normalize(base)
TopicEncoder (topic.py):
- IDF-weighted centroid of content_token_ids over wte_normed (batched)
- base = normalize(down_project(centroid) + hidden_to_topic(hidden)) -> on sphere by construction
- fiber = MLP(concat(hidden, base))
- dirn = base (already unit)
- Ragged batch input (list-of-lists) supported
ContextEncoder (context.py):
- Single-head attention pool over optional prev_turns
- base = LN(mix_mlp(hidden + session_summary + attn))
- fiber = MLP(concat(hidden, base, session_summary))
- dirn = normalize(base)
TopicBundle overrides _solver=None and provides _great_circle_path (slerp) for transport; topic transport does not need gradient-descent geodesic solver.
Tests (ams_v4/tests/test_v42.py, 14/14 pass on CPU):
PASS test_time_encoder_shapes
PASS test_time_dirn_unit_norm
PASS test_temporal_bundle_encode_matches_encoder
PASS test_idf_centroid_empty_returns_zero
PASS test_idf_centroid_oov_returns_zero
PASS test_topic_encoder_shapes_batched
PASS test_topic_base_on_sphere (||base||=1 within 1e-4)
PASS test_topic_bundle_canonical_axis_unit
PASS test_topic_great_circle_endpoints (slerp endpoints exact, mid-points on sphere)
PASS test_topic_transport_preserves_norm (drift < 15%)
PASS test_context_encoder_no_prev_turns
PASS test_context_encoder_with_prev_turns
PASS test_all_bundles_canonical_axis_unit
PASS test_gradients_flow_through_time_encoder
Skeleton stub test in test_shapes.py pruned to only KakeyaRegistry.define_sets now that all three encoders are implemented.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
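As an illustration of the TimeEncoder recipe above, a hedged sketch; dims, module names, and the fixed random Fourier frequencies are assumptions, not the temporal.py source.

```python
# Sketch of the (base, fiber, dirn) recipe for the temporal encoder; all
# hyperparameters and names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeEncoderSketch(nn.Module):
    def __init__(self, d_llm=768, d_base=32, d_fiber=32, n_freq=8):
        super().__init__()
        # fixed random frequencies for the 3 time scalars
        self.freqs = nn.Parameter(torch.randn(3, n_freq), requires_grad=False)
        self.time_mlp = nn.Sequential(
            nn.Linear(2 * 3 * n_freq, d_base), nn.GELU(), nn.Linear(d_base, d_base))
        self.hidden_proj = nn.Linear(d_llm, d_base)
        self.ln = nn.LayerNorm(d_base)
        self.fiber_mlp = nn.Sequential(
            nn.Linear(d_llm + d_base + 1, d_fiber), nn.GELU(), nn.Linear(d_fiber, d_fiber))

    def forward(self, hidden, time_scalars, surprise):
        # Fourier features of (absolute_ts, recency, cnt): (B, 3) -> (B, 6*n_freq)
        ang = time_scalars.unsqueeze(-1) * self.freqs
        fourier = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        base = self.ln(self.time_mlp(fourier) + self.hidden_proj(hidden))
        fiber = self.fiber_mlp(torch.cat([hidden, base, surprise], dim=-1))
        dirn = F.normalize(base, dim=-1)
        return base, fiber, dirn
```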
alignment.py (pure functions, no state):
- pushforward(axis_in_base, base_to_field) = axis @ map
- project_into_pca(direction, basis) = basis @ direction
- alignment_error(t_dir, target) = ||t_dir - normalize(target)||
- solve_aligned_t_dir(target, tol) = (normalize(target), 0)
on near-zero -> unit e_0 + err=1
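Since alignment.py is specified as pure functions, the bullet list above translates almost directly to code. A sketch under the assumption that inputs are torch tensors; signatures are inferred from the list, not copied from the module.

```python
# Pure-function sketch of the alignment helpers described above.
import torch
import torch.nn.functional as F

def pushforward(axis_in_base: torch.Tensor, base_to_field: torch.Tensor) -> torch.Tensor:
    return axis_in_base @ base_to_field

def project_into_pca(direction: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    return basis @ direction

def alignment_error(t_dir: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.linalg.norm(t_dir - F.normalize(target, dim=-1))

def solve_aligned_t_dir(target: torch.Tensor, tol: float = 1e-6):
    # near-zero target: fall back to unit e_0 and report error 1
    if torch.linalg.norm(target) < tol:
        e0 = torch.zeros_like(target)
        e0[0] = 1.0
        return e0, torch.tensor(1.0)
    return F.normalize(target, dim=-1), torch.tensor(0.0)
```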
KakeyaSet:
- Build pipeline: PCA -> align t_dir to bundle axis pushforward ->
perpendicular spherical K-means -> store KakeyaSkeleton4
- encode(v): (alpha on t_dir, segment id, t along center, sparse residual top-k)
- decode(cv): reconstruct field vector from CompressedVec
- verify_alignment: recompute pushforward and return ||t_dir - projected||
- _compute_pca + _spherical_kmeans ported from kakeya_codec.py (v3.12 helpers)
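A hedged sketch of the encode() decomposition named above (alpha on t_dir, segment id, t along the center, sparse top-k residual). CompressedVecSketch and every name here are illustrative; the real CompressedVec layout may differ.

```python
# Illustrative decomposition of a field vector into the four parts listed
# above. Assumes unit t_dir and unit segment centers; not the actual codec.
import torch
from dataclasses import dataclass

@dataclass
class CompressedVecSketch:
    alpha: float                 # component along t_dir
    seg: int                     # nearest perpendicular K-means segment
    t: float                     # position along that segment's center
    residual_idx: torch.Tensor   # top-k residual coordinates
    residual_val: torch.Tensor

def encode_sketch(v, t_dir, centers, k=8):
    alpha = float(v @ t_dir)
    perp = v - alpha * t_dir                 # remove the t_dir component
    sims = centers @ perp
    seg = int(sims.argmax())
    t = float(sims[seg])
    residual = perp - t * centers[seg]       # what the segment misses
    _, idx = residual.abs().topk(k)          # keep only top-k coordinates
    return CompressedVecSketch(alpha, seg, t, idx, residual[idx])
```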
KakeyaRegistry:
- Owns N KakeyaSet instances per _routing (default: 4 sets across 3 bundles,
with cross-axis redundancy semantic_emb+content_wte_mean)
- build(field_corpus, bundle_axes) populates all active sets; auto-initializes
per-routing-key base_to_field map (seeded for determinism)
- encode_memory_fields / decode_field: per-memory API
- verify_invariants(n, bundle_axes): enforces §6 invariants 3 + 4
Tests (ams_v4/tests/test_v43.py, 19/19 pass on CPU):
6 alignment-math tests (pushforward, project, alignment_error, solve)
2 helper tests (_compute_pca, _spherical_kmeans)
4 KakeyaSet tests (build activates, alignment near-zero, roundtrip,
reject-wrong-dim)
7 Registry tests (default 4 sets, custom routing, short-routing rejection,
handle covers all fields, decode roundtrip, invariant pass,
invariant 3 fires when active-set-count < 2)
§6 invariant 5 (reconstruction) verified: median rel err <= 0.15, max < 0.65.
§6 invariant 4 (alignment) verified: err < kakeya_alignment_tol = 1e-3 after build.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
BundleQueryHeads (attention/query_heads.py):
- LayerNorm on hidden_state
- Three independent Linear heads: time / topic / ctx
- Each projects d_LLM -> d_F_{bundle}
CrossBundleAttention (attention/cross_bundle.py):
- Three MultiheadAttention modules, one per bundle fiber space
(d_F_time / d_F_topic / d_F_ctx, each with its own head count)
- Per-slot Linear lifts: each of prefix_slots_{time,topic,ctx} slots
gets its own d_F_bundle -> d_LLM map
- Concat-along-slot-dim -> (B, L_mem, d_LLM) -> post LayerNorm
- Asserts output shape invariant §6.6
Design choice: three per-bundle attentions instead of one shared attention.
This keeps the topic signal from getting mixed with the temporal signal in
the attention kernel itself; combination happens at the slot-concat stage.
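A sketch of one per-bundle attention with its per-slot lifts; the full module runs three of these (time/topic/ctx) and concatenates their slot outputs along dim=1 into (B, L_mem, d_LLM) before the post LayerNorm. Dims, the query shape, and all names are assumptions.

```python
# One bundle's attention + per-slot lifts, as described above; illustrative.
import torch
import torch.nn as nn

class PerBundleAttnSketch(nn.Module):
    def __init__(self, d_llm=768, d_f=32, n_heads=4, n_slots=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_f, n_heads, batch_first=True)
        # one lift per prefix slot: d_f -> d_llm
        self.lifts = nn.ModuleList(nn.Linear(d_f, d_llm) for _ in range(n_slots))
        self.post_ln = nn.LayerNorm(d_llm)

    def forward(self, q, fibers):
        # q: (B, d_f) from a BundleQueryHeads projection (assumed shape);
        # fibers: (B, n_entries, d_f) for this bundle's memory fibers
        pooled, _ = self.attn(q.unsqueeze(1), fibers, fibers)  # (B, 1, d_f)
        pooled = pooled.squeeze(1)
        slots = [lift(pooled) for lift in self.lifts]          # n_slots x (B, d_llm)
        out = torch.stack(slots, dim=1)                        # (B, n_slots, d_llm)
        return self.post_ln(out)
```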
Tests (ams_v4/tests/test_v44.py, 8/8 pass on CPU):
PASS test_query_heads_shapes
PASS test_query_heads_distinct
PASS test_cross_bundle_forward_shape (B, L_mem, d_LLM) exactly
PASS test_cross_bundle_requires_at_least_one_entry
PASS test_cross_bundle_gradient_flow backward through q_time.weight
PASS test_cross_bundle_finite_with_random_fibers
PASS test_cross_bundle_batch_determinism eval() + identical input -> identical output
PASS test_cross_bundle_slot_allocation_matches_cfg
perturbing only time fibers changes time slots more than topic/ctx slots
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
LLMBackbone4 (bridge/backbone.py):
- Thin wrapper over HF AutoModelForCausalLM
- Freezes backbone params (v4 does NOT fine-tune the LM)
- tokenize / hidden_states / forward_with_prefix / generate_with_prefix
- Manual greedy-decode loop with inputs_embeds (avoids HF generate()
inputs_embeds edge cases)
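The manual greedy loop is worth spelling out, since avoiding HF generate() with inputs_embeds is the point. A minimal sketch (no KV cache, single sequence); the function name is illustrative, not LLMBackbone4's actual method.

```python
# Minimal greedy decode over inputs_embeds with a prepended prefix; sketch
# only, assuming a standard HF causal LM and tokenizer.
import torch

@torch.no_grad()
def greedy_with_prefix(model, tok, prefix_embeds, prompt, max_new=30):
    wte = model.get_input_embeddings()
    ids = tok(prompt, return_tensors="pt").input_ids.to(prefix_embeds.device)
    embeds = torch.cat([prefix_embeds, wte(ids)], dim=1)
    out_ids = []
    for _ in range(max_new):
        logits = model(inputs_embeds=embeds).logits[:, -1]  # last position
        nxt = logits.argmax(dim=-1, keepdim=True)           # greedy pick
        out_ids.append(nxt)
        embeds = torch.cat([embeds, wte(nxt)], dim=1)       # append embedding
    return tok.decode(torch.cat(out_ids, dim=1)[0])
```

A production loop would also cache KV states; the sketch recomputes the full forward each step for clarity.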
EmbBridge4 (projection/bridge.py):
- prefix_post_ln + build_inputs(prefix, ids, mask, wte)
- Prepends prefix embeds + extends attention mask
- No CFG, no content_bias, no logit shaping — v3.46 decode-time patches
are intentionally NOT ported
MemLLM4 (bridge/memllm.py):
- Composes backbone + 3 bundles + cross_attn + kakeya registry + store
- write(text):
1. backbone.hidden_states -> pooled float32 hidden
2. three bundles.encode -> (time_*, topic_*, ctx_*) triples
3. extract large fields (semantic_emb, content_wte_mean, context_descriptor)
4. store.add -> triggers _maybe_build_kakeya once n >= min_entries
5. existing entries re-encoded through the active registry
- prepare_decode_context(ids, mask):
1. pooled query hidden
2. cross_attn over ALL entries (flat attend; retrieval-filter in v4.6)
- generate(prompt, mt):
1. prepare_decode_context
2. backbone.generate_with_prefix (manual greedy loop)
CPU smoke test (tests/test_v45_smoke.py):
- distilgpt2 backbone (82M params, d_LLM=768), 6 written memories
- Verifies §6 invariants 1, 2, 3, 4, 6 on live data
- Runs generate(); does NOT assert output quality (that's v4.6)
- Completes in ~5 s on CPU
All v4 tests passing (59 total):
6 skeleton
11 v4.1 (geometry + MemStore + DirectionTreeV4)
14 v4.2 (three encoders + three bundles)
19 v4.3 (kakeya + alignment)
8 v4.4 (attention)
1 v4.5 smoke (end-to-end)
Skeleton test test_remaining_stubs_raise_not_implemented renamed to
test_v45_constructs_without_backbone: no stubs remain after v4.5.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
3-mode subset (D_full_history, A_ams_prefix, C_ams_hybrid) on the same 10-query synthetic session as PR #29's session_viability.py, but using MemLLM4 for A and C.

Fresh-init expectation: A/C hit-rates at Qwen2.5-1.5B scale on GPU are not expected to beat v3.46 fresh-init — that requires training (v4.6). This harness produces the fresh-init baseline that v4-trained will be compared against.

B modes (B_flat_cos, B_ams_text) omitted: they are RAG-shaped upper-bound diagnostics, not v4 product modes (per SPRINT_CLOSEOUT_v3.46.md §10.9). D_full_history is kept as the ceiling baseline.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Default device = torch.device('cuda') when torch.cuda.is_available() and the
caller didn't pass a device override. Without this, MemLLM4 ran on CPU even
on GPU-equipped hosts, making the session_viability_v4 harness unusably slow
(~27 s per D_full_history generate on Qwen 1.5B).
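The rule is one conditional; a sketch assuming an optional device argument on the loader:

```python
# Default-device resolution as described above; illustrative helper name.
import torch

def resolve_device(device=None):
    if device is not None:
        return torch.device(device)
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```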
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
CrossBundleAttention already applies a LayerNorm at the end of forward().
EmbBridge4 (and MemLLM4.generate) previously applied a *second* LayerNorm,
which triggered RuntimeError('expected BFloat16 but found Float') when the
backbone is bf16 on GPU (v4 modules are fp32 by default).
Fix:
- EmbBridge4 no longer owns prefix_post_ln; build_inputs just concats
prefix.to(dtype) with wte(ids).
- MemLLM4.generate() skips the LN and passes ctx.prefix.to(backbone_dtype)
directly to backbone.generate_with_prefix.
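A sketch of the post-fix build_inputs: a plain concat with a dtype cast and mask extension, no LayerNorm. Argument names are assumptions.

```python
# Post-fix build_inputs shape: cast the prefix to the token-embedding dtype
# (bf16 on GPU, fp32 on CPU) and extend the attention mask; sketch only.
import torch

def build_inputs(prefix, ids, mask, wte):
    tok_embeds = wte(ids)                          # (B, T, d_LLM)
    prefix = prefix.to(tok_embeds.dtype)           # avoid bf16/fp32 mismatch
    embeds = torch.cat([prefix, tok_embeds], dim=1)
    pre_mask = torch.ones(prefix.shape[:2], dtype=mask.dtype, device=mask.device)
    return embeds, torch.cat([pre_mask, mask], dim=1)
```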
Local v4.5 smoke test still passes (distilgpt2, fp32). No unit-test change
since no test exercised EmbBridge4.build_inputs directly.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
N=10 (reports/session_viability_v4_fresh/):
D_full_history hit=100% in_tok=159 gen=501ms
A_ams_prefix hit= 0% in_tok=11 gen=493ms ret=28ms
C_ams_hybrid hit= 10% in_tok=27 gen=412ms ret=27ms
N=20 (reports/session_viability_v4_fresh_20facts/):
D_full_history hit=100% in_tok=301 gen=533ms
A_ams_prefix hit= 0% in_tok=11 gen=488ms ret=18ms
C_ams_hybrid hit= 20% in_tok=27 gen=431ms ret=28ms
EXPECTED BEHAVIOR for fresh-init v4. The v3.46 fresh-GPU numbers at N=20
(A=50%, C=70%) reflect ~15 decode-time logit-shaping hacks (content_bias,
strict_overlap, keyword_tail_slot, functional_suppression, etc.) that v4
does NOT port. v4 exposes the pure prefix-channel mechanism.
Purpose of this baseline: it is the FRESH-INIT floor that v4.6 trained
numbers will be compared against. Expected training lift is large because
the new v4 loss terms (bundle_axis_alignment, cross_bundle_independence,
prefix_semantic_anchor, recon, write_policy) directly target the prefix
channel mechanism, unlike v3.46 where only a few loss terms touched the
prefix channel and decode-time hacks compensated for the rest.
v4 ran end-to-end at Qwen2.5-1.5B scale on H200 with:
- All 6 skeleton tests passing
- All 52 v4.1-v4.5 unit tests passing
- Full stack: 3 bundles x 3 trees x kakeya registry with 4 active sets
- C_ams_hybrid retrieve latency 28ms (beats v3.46 ~400ms — the v4 tree is
per-bundle and does not run v3.46's rerank-inside-retrieve)
- Generate latency 412-493ms (bounded by backbone forward, no CFG)
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
ARCHITECTURE_v4_TRAIN.md: per-loss design, training data, merge gate.
ams_v4/training/batch_encode.py:
- encode_batch_for_training(): three bundles run on a list of texts,
produces (base, fiber, dirn) stacks with gradients retained. Used only
during training — production write() path is untouched (still detaches).
- batch_to_mementries(): build MemEntry objects that reference grad-
carrying tensors, for use by CrossBundleAttention during the loss.
ams_v4/training/losses.py (5 terms, mirroring Cfg4.loss_weights keys):
- prefix_semantic_anchor: teacher-forced next-token NLL through
(cross_attn → prefix → backbone). Main signal.
- bundle_axis_alignment: three per-bundle sub-terms
* time: -Pearson(proj_onto_axis, batch_index) [non-saturating, grad always flows]
* topic: triplet margin on topic_base using Jaccard-on-token-ids targets
* ctx: mild axis-alignment hinge on ctx_base projection
- cross_bundle_independence: target pairwise |Pearson| of fiber-scalars ≈ 0.3
- recon: relative error through kakeya encode/decode (diagnostic only in
v4.6 since base_to_field maps are not yet nn.Parameter)
- write_policy: tiny collapse-prevention + short-text penalty
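To make the independence target concrete, a sketch of the cross_bundle_independence term, assuming each bundle's fiber is summarized to one scalar per sample; the 0.3 target is from the list above, while the deviation-penalty form is a guess.

```python
# Sketch: push pairwise |Pearson| of per-bundle fiber scalars toward 0.3.
import torch

def pearson(a, b, eps=1e-8):
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def cross_bundle_independence(s_time, s_topic, s_ctx, target=0.3):
    # s_*: (B,) scalar summaries of each bundle's fiber (assumed)
    pairs = [(s_time, s_topic), (s_time, s_ctx), (s_topic, s_ctx)]
    return torch.stack(
        [(pearson(a, b).abs() - target).abs() for a, b in pairs]).mean()
```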
ams_v4/training/trainer.py:
- Trainer4 freezes backbone, collects trainable params from bundles +
cross_attn + bridge, AdamW(lr=3e-4, wd=0.01), grad-clip 1.0.
- step(batch_texts): reseeds store + registry, runs write() to mirror
inference-side data structures, runs encode_batch_for_training for
grad-bearing copies, sums weighted losses, backprop.
- probe_weights(): snapshot of three representative weight magnitudes.
- save(path, ...): dumps only trainable params + cfg_summary + provenance.
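A sketch of the optimizer setup and one training step, assuming the components named above exist; the hyperparameters are the stated ones (lr=3e-4, wd=0.01, clip=1.0), everything else is illustrative.

```python
# Trainer4-shaped step, sketched; model.reseed_store/write and losses_fn are
# assumed interfaces, not the real trainer.py API.
import torch

def make_optimizer(modules):
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=3e-4, weight_decay=0.01)

def train_step(model, losses_fn, weights, opt, batch_texts):
    model.reseed_store()                     # mirror inference-side structures
    for t in batch_texts:
        model.write(t)
    terms = losses_fn(model, batch_texts)    # dict of the 5 loss terms
    total = sum(weights[k] * v for k, v in terms.items())
    opt.zero_grad()
    total.backward()
    torch.nn.utils.clip_grad_norm_(
        [p for g in opt.param_groups for p in g["params"]], 1.0)
    opt.step()
    return float(total)
```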
MemLLM4.load_trained_weights(path):
- Strict shape match on named_parameters. Prints v4-native log line
(mirrors v3.46 '[AMS_TRAINED_WEIGHTS] loaded=X skipped=Y' log format).
train_v4.py:
- Same 9-sentence corpus as v3.46's train_v346.py (§5.3 of SPRINT_CLOSEOUT).
- AdamW, 60 steps default, batch 3.
- Writes ckpt/v4_trained.pt + ckpt/v4_train_log.jsonl.
Tests (ams_v4/tests/test_v46_train.py, 10/10 pass on CPU with distilgpt2):
PASS test_encode_batch_for_training_shapes
PASS test_loss_prefix_semantic_anchor_scalar_and_finite
PASS test_loss_bundle_axis_alignment_nonneg
PASS test_loss_cross_bundle_independence_nonneg
PASS test_loss_recon_finite
PASS test_loss_write_policy_finite
PASS test_loss_prefix_anchor_gradient_flow_cross_attn
gradient reaches cross_attn.lift_time[0].weight
PASS test_loss_bundle_axis_alignment_gradient_flow
gradient reaches bundle_time._axis_raw (the canonical axis)
PASS test_trainer_three_step_cpu_smoke
3 trainer steps run, losses vary across steps
PASS test_trainer_save_and_reload_roundtrip
save -> load_trained_weights -> weights bit-identical
Full v4 regression: 69/69 tests pass (6 skeleton + 11 v4.1 + 14 v4.2 +
19 v4.3 + 8 v4.4 + 1 v4.5 smoke + 10 v4.6 training).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Allows the parity harness to run against a v4 Trainer4 checkpoint (ckpt/v4_trained.pt) via MemLLM4.load_trained_weights. Output report.json records which checkpoint (if any) was used under config.trained_weights. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Diagnostic: v4 trained run at Qwen 1.5B produced 0% hit-rate with degenerate
repetition ('1. 1. 1. ...') outputs. Root cause: after prefix_ln, each prefix
slot has ||x||_2 ≈ sqrt(d_LLM) = 39 for d_LLM=1536, vs Qwen token embedding
norm ~2. Prefix was ~20x louder than tokens and dominated the backbone
forward, forcing repetition regardless of what memories the prefix encoded.
Fix: add a learnable nn.Parameter 'prefix_scale' initialized at 1/sqrt(d_LLM),
applied as prefix_ln(x) * prefix_scale. Initial magnitude matches token
embeddings; training can tune up from there via the prefix_semantic_anchor
loss.
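The fix in code shape (a minimal sketch; only prefix_ln and prefix_scale are from the commit, the wrapper module is illustrative):

```python
# Learnable scalar after prefix LN, initialized so the prefix starts at
# token-embedding magnitude rather than ~sqrt(d_LLM).
import math
import torch
import torch.nn as nn

class PrefixScaledSketch(nn.Module):
    def __init__(self, d_llm):
        super().__init__()
        self.prefix_ln = nn.LayerNorm(d_llm)
        # LN output has ||x||_2 ~ sqrt(d_llm); start the prefix at ~1
        self.prefix_scale = nn.Parameter(torch.tensor(1.0 / math.sqrt(d_llm)))

    def forward(self, prefix):
        return self.prefix_ln(prefix) * self.prefix_scale
```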
No unit test needed to change — the learnable scale is shape-preserving.
All 10 v4.6 training tests + 6 skeleton + 8 v4.4 + 1 v4.5 smoke pass.
Will retrain and re-run SUT on H200.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
v4.6 fix: content-token Jaccard + topic diversity regularizer
Diagnostic: after the prefix_scale fix, trained topic tree retrieval was
still collapsed: every query's top-1 retrieval returned mid=7 (Thai) or
mid=9 (sister), with cos > 0.99 across ALL (query, memory) pairs.
topic_base vectors were nearly collinear.
Two root causes in bundle_axis_alignment topic sub-term:
1. Jaccard was computed over raw token ids, which are dominated by shared
stopwords ("User:", "I", "my", "the") — so "positive pair" was
usually just "the next-door fact" rather than a meaningful content
similarity. Triplet loss was pulling everything to a global mean.
2. No explicit diversity pressure; triplet loss alone doesn't prevent the
whole batch from collapsing onto one direction.
Fix:
- _jaccard now drops token ids < 1000, cutting punctuation and the most-
common BPE merges. Coarse heuristic, but works for Qwen2.5 + GPT-2
vocabularies.
- Added a diversity regularizer: relu(off_diag_cos - 0.7).mean() penalizes
any pair of topic_bases that are too collinear.
- Triplet margin bumped 0.1 -> 0.2 to give the diversity term room to
push.
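Both fixes are small enough to show inline. A sketch assuming token-id lists and unit-normalized topic_base rows; the functions are illustrative, not the exact losses.py code:

```python
# Content-token Jaccard (drop ids < 1000) and the diversity regularizer
# relu(off_diag_cos - 0.7).mean(), as described above.
import torch

def jaccard_content(ids_a, ids_b, min_id=1000):
    a = {i for i in ids_a if i >= min_id}  # drop punctuation + common merges
    b = {i for i in ids_b if i >= min_id}
    return len(a & b) / max(len(a | b), 1)

def topic_diversity(topic_base, thresh=0.7):
    # topic_base: (B, d) unit rows; penalize near-collinear pairs
    cos = topic_base @ topic_base.T
    off = cos - torch.eye(len(cos), device=cos.device)  # zero the diagonal
    return torch.relu(off - thresh).mean()
```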
All 10 v4.6 training tests still pass. Will retrain on GPU and re-run SUT.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…+ topic-diversity fixes applied)
Training: 15.7 s, 60 steps x batch 3. Checkpoint ckpt/v4_trained.pt.
Psa final epoch avg: ~1.6 (healthy; pre-fix run hit ~0.2 which was
overfitting the dominant prefix; see prefix_scale fix commit).
N=10 (reports/session_viability_v4_trained/):
D_full_history hit=100% in_tok=159 gen=483ms
A_ams_prefix hit= 10% in_tok=11 gen=452ms ret=21ms
C_ams_hybrid hit= 40% in_tok=27 gen=452ms ret=27ms
N=20 (reports/session_viability_v4_trained_20facts/):
D_full_history hit=100% in_tok=301 gen=519ms
A_ams_prefix hit= 0% in_tok=11 gen=466ms ret=19ms
C_ams_hybrid hit= 30% in_tok=27 gen=415ms ret=26ms
MERGE GATE FAILED (v4-trained A/C at N=20 must exceed v3.46-trained 50/70).
The improvement from fresh-init -> trained is clear on C (N=20: 20% -> 30%,
N=10: 10% -> 40%), but A stays at 0-10%. The branch is not mergeable to main.
Two diagnostic-driven fixes landed in the training cycle:
1. CrossBundleAttention.prefix_scale as a learnable nn.Parameter — without
it, the prefix L2-norm was ~39 per slot vs token embedding norm ~2, so
the prefix dominated the backbone and produced degenerate repetition.
After fix, generated text is coherent.
2. Topic axis loss: content-token Jaccard (drop ids<1000) + a diversity
regularizer (off-diag cos <= 0.7) — without these, triplet loss was
driven by stopword overlap and collapsed all topic_base vectors to one
direction. After fix, trained topic tree retrieves the correct memory
on 1/5 diagnostic queries (vs 0/5 before) and off-diagonal cos is
distributed instead of all > 0.99.
Remaining root cause (not in scope for this PR):
- Topic base space is still too crowded at d_topic=16 with a 60-step tiny
corpus. At training time the model sees 9 rotating sentences; the topic
loss can satisfy diversity over 3 at a time, but the held-out session
has 10/20 distinct memories.
- The prefix_semantic_anchor loss uses a 50/50 text split, which for
short training sentences leaves very little target signal (often 3-5
tokens); the NLL surface is shallow.
- Retrieval is still run flat (cross_attn attends over ALL entries), not
filtered by tree top-k. Trained attention can still be overwhelmed by
20 memories when only ~1 is relevant.
Follow-up PR (v4.7) should address these in the order above: scale the
training corpus, rework the prefix_semantic_anchor target (mask-the-entity
instead of 50/50 split), and add a tree-topk retrieval filter before
cross-bundle attention.
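For the proposed entity-mask target, a sketch of what "mask the single highest-IDF content token and supervise only that position" could look like; idf, mask_id, and the -100 ignore-index plumbing are assumptions about a future v4.7 implementation.

```python
# Proposed (v4.7) entity-mask target construction; purely illustrative.
import torch

def entity_mask_target(token_ids, idf, mask_id):
    # pick the single highest-IDF token as the supervised position
    scores = torch.tensor([idf.get(t, 0.0) for t in token_ids])
    pos = int(scores.argmax())
    inp = list(token_ids)
    inp[pos] = mask_id                 # model must recover the masked entity
    labels = [-100] * len(token_ids)   # ignore every other position
    labels[pos] = token_ids[pos]
    return inp, labels
```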
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Realigns the codebase to the abstract AMS spec (multiple Kakeya sets × three fiber bundles × explicit time/topic/background axes × cross-bundle attention). v4.1 → v4.6 implemented with 69 unit tests passing and end-to-end trained runs on Qwen2.5-1.5B at H200 scale.
Merge gate (v4-trained A/C at N=20 strictly > v3.46-trained 50/70) failed. Trained results with honest diagnosis below. This PR stays draft; v4.7 scope is identified.
Status at the merge gate
Training converged on the five-term v4 loss (total 5→1.6 over 60 steps, 15.7s on H200). But the hit-rate gap to v3.46-trained is large on both A and C. Two honest diagnostic cycles inside this PR:
"1. 1. 1. ...", 0% everywhere). Diagnostic: ‖prefix‖₂ ≈ 39 per slot vs token embedding norm ≈ 2. Fix: learnableprefix_scaleparameter initialized at1/sqrt(d_LLM). Text became coherent; hit-rate moved from 0→10/20%.After both fixes, A at N=20 is still 0%. Remaining gap is architectural, not a tuning issue of the two I fixed — the v4.7 section below names what to work on.
What ships in this PR
- ARCHITECTURE_v4.md — design spec.
- ARCHITECTURE_v4_IMPL.md — per-PR implementation spec for v4.1–v4.5.
- ARCHITECTURE_v4_TRAIN.md — v4.6 trainer + loss + data + merge-gate spec.
- ams_v4/ full implementation (~2900 LOC):
  - core/ + bundles/base.py — geometry primitives ported from v3.46.
  - bundles/{temporal,topic,context}.py — three concrete Bundles + three encoders, each with explicit axis input.
  - kakeya/ — multi-set kakeya with per-set bundle-axis alignment constraint.
  - attention/{query_heads,cross_bundle}.py — three per-bundle attentions + learnable prefix_scale.
  - projection/, bridge/ — end-to-end MemLLM4.
  - training/{batch_encode,losses,trainer}.py — Trainer4 + five loss terms.
- train_v4.py — training driver.
- session_viability_v4.py — 3-mode parity harness (D, A, C).
- ckpt/v4_trained.pt (via ckpt/v4_train_log.jsonl) + four reports/session_viability_v4_* directories (fresh + trained × N=10/N=20).

Test results
69/69 unit tests pass on CPU locally and on vast.ai H200:
GPU SUT — Qwen2.5-1.5B-Instruct, NVIDIA H200, mt=30, 60 training steps
Fresh-init (reports/session_viability_v4_fresh{,_20facts}/):

| Mode | hit (N=10) | hit (N=20) |
| --- | --- | --- |
| D_full_history | 100% | 100% |
| A_ams_prefix | 0% | 0% |
| C_ams_hybrid | 10% | 20% |

Trained (reports/session_viability_v4_trained{,_20facts}/):

| Mode | hit (N=10) | hit (N=20) |
| --- | --- | --- |
| D_full_history | 100% | 100% |
| A_ams_prefix | 10% | 0% |
| C_ams_hybrid | 40% | 30% |

Reference: v3.46-trained (from PR #29):

| Mode | hit (N=20) |
| --- | --- |
| A_ams_prefix | 50% |
| C_ams_hybrid | 70% |

Honest reading
The merge gate fails. v4-trained A and C at N=20 are both below v3.46-trained. The per-turn generated text is coherent (after the prefix_scale fix), and the direction of the response is often topic-correct (e.g. "You love classical music..." for the chopin query) — what's missing is the fluent extraction of the specific keyword.

Three concrete gaps identified in the ckpt/v4_train_stdout.log probes and the diagnostic retrieval check:

1. Topic base space is too crowded. At d_topic=16, with a 9-sentence training corpus (§5.3 rotating batch), the topic diversity regularizer can satisfy its ceiling over only 3 memories at a time, but the held-out eval session has 10/20 distinct memories — the trained topic tree retrieves correctly on only 1/5 diagnostic queries.
2. The prefix_semantic_anchor target is too shallow. The 50/50 split of short training sentences often leaves just 3–5 target tokens; the NLL surface is shallow. Need a harder target (entity mask, cloze).
3. Retrieval is still flat. MemLLM4.prepare_decode_context attends over all entries rather than filtering to top-k via the direction trees. At N=20 the trained attention is still diluted by irrelevant memories.

v4.7 follow-up scope (not in this PR)
Three independently-testable changes, in order:
1. Scale the training corpus: existing data (longmemeval_results.json) or a synthetic generator producing at least 200 entity-diverse sentences. Run 300+ steps.
2. Rework the prefix_semantic_anchor target. Replace the 50/50 split with entity masking: find the single highest-IDF content token in each sentence, mask it, supervise the LM to predict that token. This matches the session-viability query format.
3. Add a tree-top-k retrieval filter. In MemLLM4.prepare_decode_context, retrieve top-k (say k=4) via a fused score across the three bundle trees before feeding into CrossBundleAttention. The same change would also bring flat-scan B_ams_text-style modes into v4 if needed later.

v4.7 merge gate is unchanged from v4.6: A/C at N=20 must strictly exceed 50/70.
What NOT to do in v4.7
Do not add v3.46-style decode-time patches (content_bias_*, strict_overlap_*, keyword_tail_slot, etc.). If v4.7's three fixes don't close the gate, that points at a real architectural issue that additional decode hacks would hide, not fix — SPRINT_CLOSEOUT_v3.46.md §10.9 bright line holds.

Commits in this PR
Chronological, most recent first:
- 434b769 — v4.6 trained SUT results (honest failure diagnosis)
- c102dfc — v4.6 fix: content-token Jaccard + topic diversity regularizer
- a86ea25 — v4.6 fix: learnable prefix_scale in CrossBundleAttention
- a81d40d — session_viability_v4 --trained-weights flag
- f905e3a — v4.6: Trainer4 + five loss terms + train_v4.py driver
- 9859216 — v4 fresh-init GPU SUT results
- a913aad — v4.5 dtype fix (removed redundant prefix LN)
- 5c7c729 — v4.5 auto-cuda in LLMBackbone4.load
- 448c300 — session_viability_v4 harness
- 9053b28 — v4.5 end-to-end + CPU smoke test
- 1451733 — v4.4 BundleQueryHeads + CrossBundleAttention
- 3f394c8 — v4.3 KakeyaSet + KakeyaRegistry + alignment
- 08910fa — v4.2 three encoders + three Bundle subclasses
- f4ef74c — v4.1 geometry primitives + MemStore + DirectionTreeV4
- f7254af — ARCHITECTURE_v4_IMPL.md
- 9f34781 — initial design skeleton + ARCHITECTURE_v4.md