v3.46 · session-layer viability: prefix channel (A/C) is the ship track; B_ams_text is a diagnostic (#29)
Draft
FluffyAIcode wants to merge 8 commits into AgentMemory/v346-trained-gpu-7e97
Conversation
Scaffolding per SPRINT_CLOSEOUT_v3.46.md §10 viability framework.
Decides whether AMS is already useful as a low-cost session layer
before committing to blackbox-audit improvements (P0–P4).
Five modes on a 20-turn synthetic session (10 facts + 10 targeted recall
queries, expected-keyword-in-answer as hit criterion):
D_full_history - everything in prompt (ceiling baseline, tokens O(N))
B_flat_cos - flat cosine over semantic_emb -> text inject
B_ams_text - full AMS retrieval pipeline -> text inject
A_ams_prefix - AMS prefix injection only (blackbox mechanism)
C_ams_hybrid - prefix + top-1 retrieved source_text
Each mode reports per-turn (retrieve_ms, generate_ms, input_tokens,
output_tokens, answer_hit) and aggregate (hit_rate, avg_*). Writes
reports/session_viability/{report.json, report.md}.
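The per-turn record and aggregate described above can be sketched as follows. This is a minimal sketch; the helper names and dict shapes are hypothetical, and only the field names (`retrieve_ms`, `generate_ms`, `input_tokens`, `output_tokens`, `answer_hit`, `hit_rate`, `avg_*`) come from the report spec above:

```python
def answer_hit(answer: str, expected_keyword: str) -> bool:
    # Hit criterion from the protocol: the expected keyword appears
    # as a case-insensitive substring of the model's answer.
    return expected_keyword.lower() in answer.lower()

def turn_record(retrieve_ms, generate_ms, input_tokens, output_tokens, hit):
    # One per-turn row of the report.
    return {"retrieve_ms": retrieve_ms, "generate_ms": generate_ms,
            "input_tokens": input_tokens, "output_tokens": output_tokens,
            "answer_hit": hit}

def aggregate(turns):
    # Aggregate block: hit_rate plus avg_* over the numeric fields.
    n = len(turns)
    agg = {"hit_rate": sum(t["answer_hit"] for t in turns) / n}
    for k in ("retrieve_ms", "generate_ms", "input_tokens", "output_tokens"):
        agg[f"avg_{k}"] = sum(t[k] for t in turns) / n
    return agg
```

A substring criterion like this is deliberately lenient; it measures fact recall, not answer fluency, which matters for reading the A/C prefix modes later.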
All modes reuse existing SUT APIs (MemLLM.write, prepare_decode_context,
generate, AMM.tree.store, DirectionTree retrieve); no Cfg changes,
no loss changes, no SUT modifications. Picks up AMS_TRAINED_WEIGHTS
env var transparently via MemLLM.load (so we can compare trained vs
fresh by toggling the env).
Run:
python3 session_viability.py --mt 40 \
--out reports/session_viability_trained
AMS_TRAINED_WEIGHTS=ckpt/v346_trained.pt python3 session_viability.py \
--mt 40 --out reports/session_viability_trained
Next: execute on vast.ai H200, fill decision table, update SPRINT §10.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…_emb for query
Two bugs surfaced in local CPU smoke:
1. _seed_memory passed training_mode=False to MemLLM.write, so the write-gate
could silently drop facts (store size = 0). Since this is a measurement
spike, not a training claim, use training_mode=True to unconditionally
persist every seeded fact. Added a sanity WARN if stored != facts.
2. _retrieve_flat_cos pooled o['hs'][0] as if it were a [T, D] tensor, but
fwd(ids, mask)['hs'] is a list of per-layer hidden states. Use the same
query embedding path that MemLLM.write() uses:
hs_pooled = layer_pool(hs_list) # [B, T, d_LLM]
sem_q = _compute_content_semantic_emb(hs_pooled, ids, mask) # [B, d_LLM]
so query and stored semantic_emb live in the same space.
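A self-contained sketch of the corrected query path, with numpy stand-ins for `layer_pool` (mean over layers) and `_compute_content_semantic_emb` (masked mean over tokens). The real project functions are more involved; this only illustrates the shape discipline the fix restores:

```python
import numpy as np

def query_semantic_emb(hs_list, mask):
    """hs_list is a LIST of per-layer [B, T, D] arrays (the bug treated
    hs_list[0] as if it were a [T, D] tensor). Pool layers first, then
    take a masked mean over tokens."""
    hs_pooled = np.stack(hs_list, axis=0).mean(axis=0)             # [B, T, D]
    m = mask[..., None].astype(hs_pooled.dtype)                    # [B, T, 1]
    return (hs_pooled * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)  # [B, D]

def flat_cos_top1(sem_q, store):
    """Cosine similarity of the query embedding [1, D] against stored
    semantic_emb rows [N, D]; both now live in the same space."""
    qn = sem_q[0] / np.linalg.norm(sem_q[0])
    sn = store / np.linalg.norm(store, axis=-1, keepdims=True)
    return int(np.argmax(sn @ qn))
```

With the original bug, `sem_q` was pooled from a single layer's activations while the stored `semantic_emb` came from the full layer pool, so cosine scores compared vectors from different spaces.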
Local fresh-init CPU smoke (mt=12) now shows healthy numbers:
D_full_history: 80% hit / 159 in_tok
B_flat_cos: 50% hit / 55 in_tok
B_ams_text: 30% hit / 56 in_tok
A_ams_prefix: 40% hit / 11 in_tok
C_ams_hybrid: 70% hit / 26 in_tok
C_hybrid at 70% with 6x less input cost than D is already a non-trivial
signal for the decision framework.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… 20-turn session)
Decision-framework data per SPRINT_CLOSEOUT_v3.46.md §10. vast.ai
is currently SSH-rejecting (Connection reset by peer); running on the
local cloud-agent VM's CPU as the fallback so the scaffolding is
exercised end-to-end and the fresh-init decision table exists.
Trained-ckpt comparison will land as a follow-up commit once vast.ai
recovers.
Results (mt=30):
D_full_history hit=100% in_tok=159 gen=4138ms (ceiling, O(N) tokens)
B_flat_cos hit= 80% in_tok= 55 gen=4187ms <- strong baseline
B_ams_text hit= 70% in_tok= 56 gen=4030ms <- full AMS retrieval
A_ams_prefix hit= 60% in_tok= 11 gen=19722ms <- prefix only
C_ams_hybrid hit= 70% in_tok= 26 gen=21147ms <- prefix + top-1 text
Three robust signals:
1. B_flat_cos beats B_ams_text 80% vs 70% on N=10 small store, short
queries. The strict-overlap gate + rerank hurt at this scale; the
tree's recall advantage is for larger N.
2. Prefix-only A_ams_prefix at 60% is non-trivial — axis-C routing works.
Answers lack fluency (confirming blackbox axis-C finding) but keyword
presence is 6/10.
3. C_ams_hybrid at 26 tokens = 16% of D's token cost matches the
retrieval-based B modes on hit-rate with a 2x token reduction.
Prefix + short text is a real Pareto point.
Decision: AMS has independent commercial value as a session layer **today**
at v3.46-trained with an imperfect blackbox. The P0-P4 climb becomes
nice-to-have, not must-have. The most useful near-term improvement is
reducing A/C generate time (~5x slower than the text-only modes, due to
the CFG double forward + logit shaping).
Added .hf_home/ to .gitignore (local HF cache dir).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…pdate §10

New --n-facts flag (default 10). N=10 is identity-only; N=20 adds 10
distractors. Queries are the same 10 in both.

N=20 run (reports/session_viability_fresh_20facts/):
D_full_history hit=100% in_tok=301 (159->301, confirmed O(N) growth)
B_flat_cos     hit= 70% (-10 vs N=10)
B_ams_text     hit= 90% (+20 vs N=10) <- ranking INVERSION
A_ams_prefix   hit= 60% (flat)
C_ams_hybrid   hit= 70% (flat)

Key finding: the AMS retrieval pipeline (DirectionTree + strict-overlap
gate + rerank) **beats** flat cosine at N=20 by 20 points, reversing the
N=10 ordering. This is a clean mechanistic advantage from the retrieval
side of AMS, which the blackbox audit does NOT measure at all.

Decision update in §10.5:
- B_ams_text at N=20 delivers 90% at 18% of D's input-token cost.
- This is the immediately-shippable commercial value of the codebase.
- A_ams_prefix / C_ams_hybrid do NOT yet justify themselves against
  B_ams_text on hit-rate or cost.
- Revised recommendation: ship B_ams_text as-is; P0-P4 blackbox
  improvements become a research track with an explicit bar: trained A
  or C must beat B_ams_text at N=20 on both hit-rate and total cost.

Still pending: trained checkpoint re-run once vast.ai SSH recovers.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
PR #29 needs the trained-ckpt comparison to complete the decision
framework, but this agent's remote (vast.ai) is SSH-rejecting
(Connection reset by peer) and the trained ckpt/v346_trained.pt
(455 MB, git-ignored) exists only on that remote.

SESSION_VIABILITY_HANDOFF.md is the complete pickup document for the
next agent with GPU access:
§1 what's done (PR scaffold, fresh-init N=10/N=20 data, §10 framework)
§2 task (2 trained reports + decision table update)
§3 how to get the ckpt (vast.ai scp; or retrain from scratch per §5.3)
§4 run protocol with loader sanity check
§5 PR update instructions with §10.7/§10.8 templates
§6 guardrails inherited from SPRINT_CLOSEOUT_v3.46.md (no Cfg/loss changes)
§7 success criteria (atomic commits, actual numbers, final decision)
§8 stuck-path fallbacks

Each step is copy-paste-ready. The next agent should be able to complete
PR #29 autonomously in ~30 min on an H200.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…N=20

All 4 runs on NVIDIA H200, Qwen2.5-1.5B-Instruct, mt=30, same 10 queries.
Trained ckpt: v346_trained.pt (60 steps, 193.5s, post_probe
tail_head_slot1_abs_mean=7.25e-4, vocab_proj_last_abs_mean=5.09e-4 —
matches handoff expected values within noise). Loader verification on
both trained runs: loaded=202 skipped=0 shape_errs=0
provenance=AgentMemory/v346-revertE-topk-nonexclusive-7e97.

Trained N=10:   D=100%, B_flat=80%, B_ams=80%, A_prefix=50%, C_hybrid=70%
Trained N=20:   D=100%, B_flat=70%, B_ams=80%, A_prefix=50%, C_hybrid=70%
Fresh-GPU N=10: D=100%, B_flat=80%, B_ams=80%, A_prefix=50%, C_hybrid=70%
Fresh-GPU N=20: D=100%, B_flat=70%, B_ams=80%, A_prefix=50%, C_hybrid=70%

Trained numbers are identical to fresh-GPU on all 5 modes — 60-step
training does not move hit-rate at current mt=30. Also notable: GPU
fresh-init B_ams_text N=20 is 80% (not the 90% previously seen on CPU) —
single-turn stochasticity in the greedy decode path, not a retrieval
change.

§10.7/§10.8 decision update to follow in a separate commit.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
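The loader verification above can be expressed as a small fail-fast helper. This is a hedged sketch: the function name and report-dict shape are hypothetical, and only the counter fields (`loaded`, `skipped`, `shape_errs`, `provenance`) come from the log line in the commit message:

```python
def check_load_report(report: dict, expected_provenance: str) -> bool:
    """Fail fast if a checkpoint load was partial or mismatched."""
    assert report["loaded"] > 0, "no tensors loaded"
    assert report["skipped"] == 0, f"skipped={report['skipped']}"
    assert report["shape_errs"] == 0, f"shape_errs={report['shape_errs']}"
    assert report["provenance"] == expected_provenance, report["provenance"]
    return True
```

Running this before the benchmark turns a silently-partial load into a hard error, which is what makes "trained vs fresh" comparisons trustworthy.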
§10.7: trained + fresh-GPU N=10/20 tables with Δ fresh-GPU columns.
§10.8: three decision questions answered from data:
1. Training lifts A_prefix past B_ams_text at N=20? NO (50% vs 80%)
2. Training lifts C_hybrid past B_ams_text at N=20? NO (70% vs 80%)
3. A/C generate-time still ~5× text modes? YES (actually ~25× on H200)
Ship B_ams_text. Move P0-P4 blackbox audit to research track.
Also noted: CPU-vs-GPU numeric drift shifts one borderline N=20
B_ams_text query ('davis'), so the §10.3 CPU 90% becomes 80% on
GPU. Not a bug in session_viability.py — bf16 vs fp32 numerics in
layer_pool + _compute_content_semantic_emb tip one top-k ordering.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
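The "one borderline query flips" mechanism can be reproduced in miniature: two retrieval scores that differ by less than half an ulp of the lower precision collapse to a tie once cast down, and the top-1 index changes. numpy has no bfloat16, so fp16 stands in here; the rounding mechanism is the same (this is an illustrative toy, not the project's actual pooling code):

```python
import numpy as np

# Two candidates whose fp32 scores differ by 0.001 -- below the fp16
# resolution near 4.0 (ulp = 2**-8 ~= 0.0039), so both round to 4.0.
query = np.ones(4, dtype=np.float32)
cands = np.array([[1.0, 1.0, 1.0, 1.0],       # candidate 0: score 4.000
                  [1.0, 1.0, 1.0, 1.001]],    # candidate 1: score 4.001
                 dtype=np.float32)

scores_fp32 = cands @ query                   # [4.000, 4.001]
scores_fp16 = scores_fp32.astype(np.float16)  # both round to 4.0 -- a tie

top_fp32 = int(np.argmax(scores_fp32))        # candidate 1 wins
top_fp16 = int(np.argmax(scores_fp16))        # tie -> first index: candidate 0
```

In the real run the divergence comes from bf16-vs-fp32 accumulation inside the pooling path rather than an explicit downcast, but the effect on a near-tied top-k ordering is the same.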
Retracts the §10.5/§10.8 "ship B_ams_text as cheaper RAG backend"
wording. That framing collapsed AMS to a product category it was never
trying to be (RAG) and demoted its actual mechanism (the
prefix/hidden-state injection channel tested by A_ams_prefix /
C_ams_hybrid) to "research track". That was wrong.

§10.9 now states the correct frame and re-reads §10.7 accordingly:
- B_ams_text is a retrieval-side diagnostic + upper bound, NOT a product
  line. It exercises RAG-shaped prompt prepending, not AMS.
- A_ams_prefix / C_ams_hybrid are the actual product. A=50%, C=70% at
  N=20 is the AMS-mechanism measurement. 60-step training did not move it.
- P0-P4 blackbox-audit work returns to the ship track.
- Success bar restated: A or C must match D_full_history at a large
  fraction of D's token cost — beating B_ams_text is a necessary
  intermediate milestone, not the finish line, because B_ams_text is a
  different product category (RAG-shaped).
- §10.3's "inversion at N=20" re-read as architectural good news for the
  retrieval side, not as a reason to ship B_ams_text.

Raw numbers in §10.3 and §10.7 are unchanged; only the product framing
is corrected.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request on Apr 22, 2026:
3-mode subset (D_full_history, A_ams_prefix, C_ams_hybrid) on the same
10-query synthetic session as PR #29's session_viability.py, but using
MemLLM4 for A and C.

Fresh-init expectation: A/C hit-rates at Qwen2.5-1.5B scale on GPU are
not expected to beat v3.46 fresh-init — that requires training (v4.6).
This harness produces the fresh-init baseline that v4-trained will be
compared against.

B modes (B_flat_cos, B_ams_text) omitted: they are RAG-shaped
upper-bound diagnostics, not v4 product modes (per
SPRINT_CLOSEOUT_v3.46.md §10.9). D_full_history is kept as the ceiling
baseline.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Child of #28. Decision-framework spike per SPRINT_CLOSEOUT_v3.46.md §10.

Framing (read this first)
AMS is not a RAG system and is not a knowledge graph. Its core mechanism
is a prefix / hidden-state injection channel: memories are encoded into
continuous vectors delivered into the backbone's forward pass without
going through text re-prompting. That channel is what A_ams_prefix and
C_ams_hybrid exercise in this benchmark. B_flat_cos and B_ams_text are
RAG-shaped modes; they are retained as diagnostics and upper bounds, not
as product candidates. DirectionTree is a continuous-embedding routing
structure.

An earlier revision of this PR concluded "ship B_ams_text as cheaper RAG
backend, move P0–P4 to research track." That was a framing error. It is
retracted in SPRINT_CLOSEOUT_v3.46.md §10.9 and replaced with the
decision below.

Protocol
- session_viability.py (~550 lines): 5-mode benchmark on a synthetic
  session. Hit criterion: expected_keyword substring in answer.
- Reuses MemLLM.write, prepare_decode_context, MemLLM.generate,
  layer_pool, _compute_content_semantic_emb.
- Picks up AMS_TRAINED_WEIGHTS via the loader added in v3.46-trained:
  60-step training + audit (18/26, −3 vs fresh 21/26) #28.
- Trained ckpt: train_v346.py --steps 60 → ckpt/v346_trained.pt in
  193.5 s. post_probe = {tail_head_slot1_abs_mean: 7.25e-4,
  vocab_proj_last_abs_mean: 5.09e-4} (handoff expected 7.30e-4 /
  5.49e-4; matches within <10% noise).
- Loader verification: loaded=202 skipped=0 shape_errs=0
  provenance=AgentMemory/v346-revertE-topk-nonexclusive-7e97.

What each mode measures:
- D_full_history: everything in the prompt; ceiling baseline, O(N) tokens.
- B_flat_cos: flat cosine over semantic_emb; retrieved source_text
  injected as prompt text.
- B_ams_text: can the full AMS pipeline (DirectionTree + gate + rerank)
  retrieve the right fact? Upper bound on the retrieval side.
- A_ams_prefix: AMS prefix injection only (the blackbox mechanism).
- C_ams_hybrid: prefix + top-1 retrieved source_text.

The B modes are not product candidates. They are there to isolate whether
a failure lives on the retrieval side or the prefix side.
Data

Fresh-init CPU (reports/session_viability_fresh/, ..._20facts/)

N=10 (mt=30):
D_full_history hit=100% in_tok=159
B_flat_cos     hit= 80% in_tok= 55
B_ams_text     hit= 70% in_tok= 56
A_ams_prefix   hit= 60% in_tok= 11
C_ams_hybrid   hit= 70% in_tok= 26

N=20:
D_full_history hit=100% in_tok=301
B_flat_cos     hit= 70%
B_ams_text     hit= 90%
A_ams_prefix   hit= 60%
C_ams_hybrid   hit= 70%

Fresh-init GPU (H200, reports/session_viability_fresh_gpu/, ..._20facts/)

N=10:
D_full_history hit=100%
B_flat_cos     hit= 80%
B_ams_text     hit= 80%
A_ams_prefix   hit= 50%
C_ams_hybrid   hit= 70%

N=20:
D_full_history hit=100%
B_flat_cos     hit= 70%
B_ams_text     hit= 80%
A_ams_prefix   hit= 50%
C_ams_hybrid   hit= 70%

Trained ckpt (H200, 60 steps, reports/session_viability_trained/, ..._20facts/)

N=10:
D_full_history hit=100%
B_flat_cos     hit= 80%
B_ams_text     hit= 80%
A_ams_prefix   hit= 50%
C_ams_hybrid   hit= 70%

N=20:
D_full_history hit=100%
B_flat_cos     hit= 70%
B_ams_text     hit= 80%
A_ams_prefix   hit= 50%
C_ams_hybrid   hit= 70%

What the data says about AMS's core mechanism
Read under the corrected frame (§10.9 in SPRINT_CLOSEOUT_v3.46.md):

- A_ams_prefix at 50% N=20, trained or fresh, is the actual state of
  AMS's mechanism as of v3.46. 60 steps of the §5.3 rotating 6-sentence
  corpus moves vocab_proj weights by ~5e-4 abs mean and does not
  measurably change any hit-rate. That is a scale/data finding, not a
  mechanism failure — but the mechanism is visibly unfinished.
- B_ams_text at 80–90% N=20 is an upper-bound diagnostic on the
  retrieval side. It says: given the right source_text in the prompt,
  Qwen2.5-1.5B can answer 8–9 / 10 queries. That tells us DirectionTree
  + gate + rerank are finding the right memory; the remaining gap lives
  in the prefix channel, not in retrieval.
- B_ams_text − A_ams_prefix = 30 points at N=20. This is the gap P0–P4
  has to close. It is the single number that defines "AMS works" vs
  "AMS is a nicer embedding RAG".

Corrected decision (supersedes the earlier "ship B_ams_text" wording)
- B_ams_text is a retained diagnostic and upper bound, not a product
  line. It does not ship as an AMS artifact — doing so would be shipping
  a RAG product with AMS-flavored retrieval, which is not what AMS is.
- Success bar: A_ams_prefix or C_ams_hybrid matches D_full_history at a
  large fraction of D's token cost. Beating B_ams_text is a necessary
  intermediate milestone, not the finish line, because B_ams_text is a
  different product category.
- Re-run A_ams_prefix / C_ams_hybrid at N=10 / N=20 with a larger
  training budget. This tests whether the 30-point prefix-channel
  deficit is a scale problem or a mechanism problem. The current
  60-step / 6-sentence budget cannot distinguish the two.

Notes
- GPU B_ams_text at N=20 is 80%, not the 90% observed on CPU in §10.3.
  Greedy decode is deterministic given identical tokens, but bf16-vs-fp32
  numerics in layer_pool + _compute_content_semantic_emb shift one
  borderline query ('davis') across the retrieval gate. Not a bug in
  session_viability.py.
- A_ams_prefix hit-rate is 50% on GPU vs 60% on CPU at N=10 — the same
  single-turn noise class on a 10-query session.

What this PR does NOT claim
- A_ams_prefix = 50% at N=20 is not a shippable number.

Commits

- 153069c — scaffolding (5 modes, synthetic 20-turn session, JSON+MD report)
- dbbf850 — fix: write-gate drop + correct query embedding via layer_pool + content_sem_emb
- ed9e8e3 — N=10 fresh-init CPU results
- 6c2eec6 — N=20 distractor run
- 6692510 — SESSION_VIABILITY_HANDOFF.md for next GPU agent
- fd487f6 — GPU runs: trained ckpt + fresh-GPU baseline, N=10/20, 12 report files
- a4fc23d — §10.7/§10.8 tables and first-pass decision (now superseded by §10.9)
- 4e85eb3 — §10.9 framing correction: AMS is not RAG, not a KG; P0–P4 back on ship track; B_ams_text is a diagnostic, not a product