Skip to content

v3.46 · session-layer viability: prefix channel (A/C) is the ship track; B_ams_text is a diagnostic#29

Draft
FluffyAIcode wants to merge 8 commits intoAgentMemory/v346-trained-gpu-7e97from
AgentMemory/v346-session-viability-7e97
Draft

v3.46 · session-layer viability: prefix channel (A/C) is the ship track; B_ams_text is a diagnostic#29
FluffyAIcode wants to merge 8 commits intoAgentMemory/v346-trained-gpu-7e97from
AgentMemory/v346-session-viability-7e97

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

@FluffyAIcode FluffyAIcode commented Apr 22, 2026

Child of #28. Decision-framework spike per SPRINT_CLOSEOUT_v3.46.md §10.

Status: complete. Fresh-init (CPU + GPU), trained-ckpt (GPU), framing correction, and the corrected final decision are all in this PR.


Framing (read this first)

AMS is not a RAG system and is not a knowledge graph. Its core mechanism is a prefix / hidden-state injection channel: memories are encoded into continuous vectors delivered into the backbone's forward pass without going through text re-prompting. That channel is what A_ams_prefix and C_ams_hybrid exercise in this benchmark.

  • RAG = retrieve text, prepend to prompt. B_flat_cos and B_ams_text are RAG-shaped modes in this benchmark. They are retained as diagnostics and upper bounds, not as product candidates.
  • Knowledge graph = explicit entities / relations / symbolic query. AMS has none of these; DirectionTree is a continuous-embedding routing structure.

An earlier revision of this PR concluded "ship B_ams_text as cheaper RAG backend, move P0–P4 to research track." That was a framing error. It is retracted in SPRINT_CLOSEOUT_v3.46.md §10.9 and replaced with the decision below.

Protocol

  • session_viability.py (~550 lines): 5-mode benchmark on a synthetic session.
  • 10 targeted-recall queries over a growing store of facts. Hit = expected_keyword substring in answer.
  • Backbone: Qwen2.5-1.5B-Instruct (bf16). Two store sizes: N=10 (identity-only) and N=20 (+10 distractors), same 10 queries.
  • No Cfg changes, no loss changes, no SUT changes. All existing APIs: MemLLM.write, prepare_decode_context, MemLLM.generate, layer_pool, _compute_content_semantic_emb.
  • Honors AMS_TRAINED_WEIGHTS via the loader added in v3.46-trained: 60-step training + audit (18/26, −3 vs fresh 21/26) #28.
  • Trained runs on NVIDIA H200 (vast.ai). train_v346.py --steps 60ckpt/v346_trained.pt in 193.5 s. post_probe = {tail_head_slot1_abs_mean: 7.25e-4, vocab_proj_last_abs_mean: 5.09e-4} (handoff expected 7.30e-4 / 5.49e-4; matches within <10 % noise).
  • Loader verification on both trained runs: loaded=202 skipped=0 shape_errs=0 provenance=AgentMemory/v346-revertE-topk-nonexclusive-7e97.

What each mode measures:

Mode Category What it tells us
D_full_history no retrieval, full context Ceiling. The backbone's own answer quality when it has all facts in the prompt.
B_flat_cos RAG-shaped Does a naïve cosine-over-semantic_emb retrieve the right fact?
B_ams_text RAG-shaped Does AMS's full retrieval stack (DirectionTree + gate + rerank) retrieve the right fact? Upper bound on the retrieval side.
A_ams_prefix AMS core mechanism Does AMS's prefix channel deliver the memory into the forward pass without re-prompting?
C_ams_hybrid AMS core + 1 crutch text memory Same mechanism with a single text memory as scaffolding — an intermediate diagnostic.

The B modes are not product candidates. They are there to isolate whether a failure lives on the retrieval side or the prefix side.

Data

Fresh-init CPU (reports/session_viability_fresh/, ..._20facts/)

N=10

Mode Hit-rate avg in-tokens avg retrieve ms avg generate ms
D_full_history 100% 159 0.0 4138
B_flat_cos 80% 55 144 4187
B_ams_text 70% 56 526 4030
A_ams_prefix 60% 11 453 19722
C_ams_hybrid 70% 26 471 21147

N=20

Mode Hit-rate avg in-tokens avg retrieve ms avg generate ms
D_full_history 100% 301 0.0 4590
B_flat_cos 70% 55 119 3954
B_ams_text 90% 54 544 4025
A_ams_prefix 60% 11 473 18502
C_ams_hybrid 70% 26 455 20320

Fresh-init GPU (H200, reports/session_viability_fresh_gpu/, ..._20facts/)

N=10

Mode Hit-rate avg in-tokens avg retrieve ms avg generate ms
D_full_history 100% 159 0.0 516
B_flat_cos 80% 55 31 507
B_ams_text 80% 55 416 386
A_ams_prefix 50% 11 500 14898
C_ams_hybrid 70% 27 428 15363

N=20

Mode Hit-rate avg in-tokens avg retrieve ms avg generate ms
D_full_history 100% 301 0.0 511
B_flat_cos 70% 55 32 486
B_ams_text 80% 55 411 370
A_ams_prefix 50% 11 513 15096
C_ams_hybrid 70% 27 450 15311

Trained ckpt (H200, 60 steps, reports/session_viability_trained/, ..._20facts/)

N=10

Mode Hit-rate avg in-tokens avg retrieve ms avg generate ms Δ fresh-GPU hit Δ fresh-GPU gen-ms
D_full_history 100% 159 0.0 537 0 +21
B_flat_cos 80% 55 31 526 0 +20
B_ams_text 80% 55 415 376 0 −9
A_ams_prefix 50% 11 452 13033 0 −1865
C_ams_hybrid 70% 27 466 14520 0 −843

N=20

Mode Hit-rate avg in-tokens avg retrieve ms avg generate ms Δ fresh-GPU hit Δ fresh-GPU gen-ms
D_full_history 100% 301 0.0 539 0 +28
B_flat_cos 70% 55 32 498 0 +13
B_ams_text 80% 55 417 373 0 +3
A_ams_prefix 50% 11 435 12900 0 −2196
C_ams_hybrid 70% 27 447 13853 0 −1458

What the data says about AMS's core mechanism

Read under the corrected frame (§10.9 in SPRINT_CLOSEOUT_v3.46.md):

  1. A_ams_prefix at 50 % N=20, trained or fresh, is the actual state of AMS's mechanism as of v3.46. 60 steps of the §5.3 rotating 6-sentence corpus moves vocab_proj weights by ~5e-4 abs mean and does not measurably change any hit-rate. That is a scale/data finding, not a mechanism failure — but the mechanism is visibly unfinished.
  2. B_ams_text at 80–90 % N=20 is an upper-bound diagnostic on the retrieval side. It says: given the right source_text in the prompt, Qwen2.5-1.5B can answer 8–9 / 10 queries. That tells us DirectionTree + gate + rerank are finding the right memory; the remaining gap lives in the prefix channel, not in retrieval.
  3. Prefix-channel deficit: B_ams_textA_ams_prefix = 30 points at N=20. This is the gap P0–P4 has to close. It is the single number that defines "AMS works" vs "AMS is a nicer embedding RAG".
  4. A/C generate time is ~25× text modes on H200, not the ~5× seen on CPU. CFG double-forward + logit-shaping dominates wall time on fast hardware. This is architectural, not a regression; it is part of what P0–P4 needs to address (or justify).

Corrected decision (supersedes the earlier "ship B_ams_text" wording)

  1. P0–P4 blackbox-audit work is the ship track. It is the only track that exercises AMS's actual mechanism. The earlier "move P0–P4 to research track" wording was a framing error, retracted in §10.9.
  2. B_ams_text is a retained diagnostic and upper bound, not a product line. It does not ship as an AMS artifact — doing so would be shipping a RAG product with AMS-flavored retrieval, which is not what AMS is.
  3. Success bar restated correctly: AMS is done when A_ams_prefix or C_ams_hybrid matches D_full_history at a large fraction of D's token cost. Beating B_ams_text is a necessary intermediate milestone, not the finish line, because B_ams_text is a different product category.
  4. Immediate next experiment (a follow-up PR, not this one): 10× more training on a real corpus, re-run A_ams_prefix / C_ams_hybrid at N=10 / N=20. This tests whether the 30-point prefix-channel deficit is a scale problem or a mechanism problem. The current 60-step / 6-sentence budget cannot distinguish the two.

Notes

  • CPU vs GPU cross-cut: GPU fresh-init B_ams_text at N=20 is 80 %, not the 90 % observed on CPU in §10.3. Greedy decode is deterministic given identical tokens, but bf16-vs-fp32 numerics in layer_pool + _compute_content_semantic_emb shift one borderline query (davis) across the retrieval gate. Not a bug in session_viability.py.
  • A_ams_prefix hit-rate is 50 % on GPU vs 60 % on CPU at N=10 — same single-turn noise class on a 10-query session.

What this PR does NOT claim

  • It does not claim AMS is ready to ship as a product today. A_ams_prefix = 50 % at N=20 is not a shippable number.
  • It does not claim P0–P4 will work with more training; only that the current 60-step / 6-sentence training budget cannot move the prefix channel at all, so that budget cannot be used to conclude anything about the ceiling.
  • It does not benchmark AMS against any RAG system. AMS is not RAG; the comparison is not apples-to-apples and is not done here.

Commits

  • 153069c — scaffolding (5 modes, synthetic 20-turn session, JSON+MD report)
  • dbbf850 — fix: write-gate drop + correct query embedding via layer_pool + content_sem_emb
  • ed9e8e3 — N=10 fresh-init CPU results
  • 6c2eec6 — N=20 distractor run
  • 6692510SESSION_VIABILITY_HANDOFF.md for next GPU agent
  • fd487f6GPU runs: trained ckpt + fresh-GPU baseline, N=10/20, 12 report files
  • a4fc23d — §10.7/§10.8 tables and first-pass decision (now superseded by §10.9)
  • 4e85eb3§10.9 framing correction: AMS is not RAG, not a KG; P0–P4 back on ship track; B_ams_text is a diagnostic, not a product
Open in Web Open in Cursor 

cursoragent and others added 4 commits April 22, 2026 04:45
Scaffolding per SPRINT_CLOSEOUT_v3.46.md \u00a710 viability framework.
Decides whether AMS is already useful as a low-cost session layer
before committing to blackbox-audit improvements (P0\u2013P4).

Five modes on a 20-turn synthetic session (10 facts + 10 targeted recall
queries, expected-keyword-in-answer as hit criterion):

  D_full_history  - everything in prompt (ceiling baseline, tokens O(N))
  B_flat_cos      - flat cosine over semantic_emb -> text inject
  B_ams_text      - full AMS retrieval pipeline -> text inject
  A_ams_prefix    - AMS prefix injection only (blackbox mechanism)
  C_ams_hybrid    - prefix + top-1 retrieved source_text

Each mode reports per-turn (retrieve_ms, generate_ms, input_tokens,
output_tokens, answer_hit) and aggregate (hit_rate, avg_*).  Writes
reports/session_viability/{report.json, report.md}.

All modes reuse existing SUT APIs (MemLLM.write, prepare_decode_context,
generate, AMM.tree.store, DirectionTree retrieve); no Cfg changes,
no loss changes, no SUT modifications.  Picks up AMS_TRAINED_WEIGHTS
env var transparently via MemLLM.load (so we can compare trained vs
fresh by toggling the env).

Run:
  python3 session_viability.py --mt 40 \
    --out reports/session_viability_trained
  AMS_TRAINED_WEIGHTS=ckpt/v346_trained.pt python3 session_viability.py \
    --mt 40 --out reports/session_viability_trained

Next: execute on vast.ai H200, fill decision table, update SPRINT \u00a710.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…_emb for query

Two bugs surfaced in local CPU smoke:

1. _seed_memory passed training_mode=False to MemLLM.write, so the write-gate
   could silently drop facts (store size = 0).  Since this is a measurement
   spike, not a training claim, use training_mode=True to unconditionally
   persist every seeded fact.  Added a sanity WARN if stored != facts.

2. _retrieve_flat_cos pooled o['hs'][0] as if it were a [T, D] tensor, but
   fwd(ids, mask)['hs'] is a list of per-layer hidden states.  Use the same
   query embedding path that MemLLM.write() uses:
     hs_pooled = layer_pool(hs_list)                       # [B, T, d_LLM]
     sem_q = _compute_content_semantic_emb(hs_pooled, ids, mask)   # [B, d_LLM]
   so query and stored semantic_emb live in the same space.

Local fresh-init CPU smoke (mt=12) now shows healthy numbers:
  D_full_history: 80% hit / 159 in_tok
  B_flat_cos:     50% hit /  55 in_tok
  B_ams_text:     30% hit /  56 in_tok
  A_ams_prefix:   40% hit /  11 in_tok
  C_ams_hybrid:   70% hit /  26 in_tok

C_hybrid at 70% with 6x less input cost than D is already a non-trivial
signal for the decision framework.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… 20-turn session)

Decision-framework data per SPRINT_CLOSEOUT_v3.46.md \u00a710.  vast.ai
is currently SSH-rejecting (Connection reset by peer); running on the
local cloud-agent VM's CPU as the fallback so the scaffolding is
exercised end-to-end and the fresh-init decision table exists.
Trained-ckpt comparison will land as a follow-up commit once vast.ai
recovers.

Results (mt=30):
  D_full_history  hit=100%  in_tok=159  gen=4138ms  (ceiling, O(N) tokens)
  B_flat_cos      hit= 80%  in_tok= 55  gen=4187ms  <- strong baseline
  B_ams_text      hit= 70%  in_tok= 56  gen=4030ms  <- full AMS retrieval
  A_ams_prefix    hit= 60%  in_tok= 11  gen=19722ms <- prefix only
  C_ams_hybrid    hit= 70%  in_tok= 26  gen=21147ms <- prefix + top-1 text

Three robust signals:
  1. B_flat_cos beats B_ams_text 80% vs 70% on N=10 small store, short
     queries.  The strict-overlap gate + rerank hurt at this scale; the
     tree's recall advantage is for larger N.
  2. Prefix-only A_ams_prefix at 60% is non-trivial — axis-C routing works.
     Answers lack fluency (confirming blackbox axis-C finding) but keyword
     presence is 6/10.
  3. C_ams_hybrid at 26 tokens = 16% of D's token cost matches the
     retrieval-based B modes on hit-rate with a 2x token reduction.
     Prefix + short text is a real Pareto point.

Decision: AMS has independent commercial value as a session layer **today**
at v3.46-trained with an imperfect blackbox.  P0-P4 climb becomes
nice-to-have not must-have.  Most useful near-term improvement is
reducing A/C generate time (5x slower than text-only, due to CFG double-
forward + logit-shaping).

Added .hf_home/ to .gitignore (local HF cache dir).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…pdate \u00a710

New --n-facts flag (default 10). N=10 is identity-only; N=20 adds 10
distractors.  Queries are the same 10 in both.

N=20 run (reports/session_viability_fresh_20facts/):
  D_full_history  hit=100%  in_tok=301  (159->301, confirmed O(N) growth)
  B_flat_cos      hit= 70%  (-10 vs N=10)
  B_ams_text      hit= 90%  (+20 vs N=10)  <- ranking INVERSION
  A_ams_prefix    hit= 60%  (flat)
  C_ams_hybrid    hit= 70%  (flat)

Key finding: the AMS retrieval pipeline (DirectionTree + strict-overlap
gate + rerank) **beats** flat cosine at N=20 by 20 points, reversing
the N=10 ordering.  This is a clean mechanistic advantage from the
retrieval side of AMS, which the blackbox audit does NOT measure at all.

Decision update in \u00a710.5:
- B_ams_text at N=20 delivers 90% at 18% of D's input-token cost.
- This is the immediately-shippable commercial value of the codebase.
- A_ams_prefix / C_ams_hybrid do NOT yet justify themselves against
  B_ams_text on hit-rate or cost.
- Revised recommendation: ship B_ams_text as-is; P0-P4 blackbox
  improvements become research track with explicit bar: 'trained A or C
  must beat B_ams_text at N=20 on both hit-rate and total cost'.

Still pending: trained checkpoint re-run once vast.ai SSH recovers.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot changed the title v3.46 · session-layer viability spike (5 modes × 20-turn session) v3.46 · session-layer viability: B_ams_text wins at N=20 (90% / 54 tokens) Apr 22, 2026
PR #29 needs trained-ckpt comparison to complete the decision
framework, but this agent's remote (vast.ai) is SSH-rejecting
(Connection reset by peer) and the trained ckpt/v346_trained.pt
(455 MB, git-ignored) exists only on that remote.

SESSION_VIABILITY_HANDOFF.md is the complete pickup document for
the next agent with GPU access:
  \u00a71 what's done (PR scaffold, fresh-init N=10/N=20 data, \u00a710 framework)
  \u00a72 task (2 trained reports + decision table update)
  \u00a73 how to get ckpt (vast.ai scp; or retrain from scratch per \u00a75.3)
  \u00a74 run protocol with loader sanity check
  \u00a75 PR update instructions with \u00a710.7/\u00a710.8 templates
  \u00a76 guardrails inherited from SPRINT_CLOSEOUT_v3.46.md (no Cfg/loss changes)
  \u00a77 success criteria (atomic commits, actual numbers, final decision)
  \u00a78 stuck-path fallbacks

Each step is copy-paste-ready.  The next agent should be able to
complete PR #29 autonomously in ~30 min on an H200.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot changed the title v3.46 · session-layer viability: B_ams_text wins at N=20 (90% / 54 tokens) v3.46 · session-layer viability: B_ams_text wins at N=20 (90% / 54 tokens) [handoff: needs GPU agent] Apr 22, 2026
cursoragent and others added 2 commits April 22, 2026 06:53
…N=20

All 4 runs on NVIDIA H200, Qwen2.5-1.5B-Instruct, mt=30, same 10 queries.
Trained ckpt: v346_trained.pt (60 steps, 193.5s, post_probe
tail_head_slot1_abs_mean=7.25e-4, vocab_proj_last_abs_mean=5.09e-4 —
matches handoff expected values within noise).

Loader verification on both trained runs: loaded=202 skipped=0 shape_errs=0
provenance=AgentMemory/v346-revertE-topk-nonexclusive-7e97.

Trained N=10: D=100%, B_flat=80%, B_ams=80%, A_prefix=50%, C_hybrid=70%
Trained N=20: D=100%, B_flat=70%, B_ams=80%, A_prefix=50%, C_hybrid=70%
Fresh-GPU N=10: D=100%, B_flat=80%, B_ams=80%, A_prefix=50%, C_hybrid=70%
Fresh-GPU N=20: D=100%, B_flat=70%, B_ams=80%, A_prefix=50%, C_hybrid=70%

Trained numbers are identical to fresh-GPU on all 5 modes — 60-step training
does not move hit-rate at current mt=30. Also notable: GPU fresh-init
B_ams_text N=20 is 80% (not the 90% previously seen on CPU) — single-
turn stochasticity in the greedy decode path, not a retrieval change.

§10.7/§10.8 decision update to follow in a separate commit.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
§10.7: trained + fresh-GPU N=10/20 tables with Δ fresh-GPU columns.
§10.8: three decision questions answered from data:

1. Training lifts A_prefix past B_ams_text at N=20? NO (50% vs 80%)
2. Training lifts C_hybrid past B_ams_text at N=20? NO (70% vs 80%)
3. A/C generate-time still ~5× text modes? YES (actually ~25× on H200)

Ship B_ams_text. Move P0-P4 blackbox audit to research track.

Also noted: CPU-vs-GPU numeric drift shifts one borderline N=20
B_ams_text query ('davis'), so the §10.3 CPU 90% becomes 80% on
GPU. Not a bug in session_viability.py — bf16 vs fp32 numerics in
layer_pool + _compute_content_semantic_emb tip one top-k ordering.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot changed the title v3.46 · session-layer viability: B_ams_text wins at N=20 (90% / 54 tokens) [handoff: needs GPU agent] v3.46 · session-layer viability: B_ams_text wins; training does not lift A/C past B Apr 22, 2026
Retracts the §10.5/§10.8 'ship B_ams_text as cheaper RAG backend'
wording. That framing collapsed AMS to a product category it was
never trying to be (RAG) and demoted its actual mechanism (the
prefix/hidden-state injection channel tested by A_ams_prefix /
C_ams_hybrid) to 'research track'. That was wrong.

§10.9 now states the correct frame and re-reads §10.7 accordingly:

- B_ams_text is a retrieval-side diagnostic + upper bound, NOT a
  product line. It exercises RAG-shaped prompt prepending, not AMS.
- A_ams_prefix / C_ams_hybrid are the actual product. A=50%, C=70%
  at N=20 is the AMS-mechanism measurement. 60-step training did
  not move it.
- P0-P4 blackbox-audit work returns to the ship track.
- Success bar restated: A or C must match D_full_history at a large
  fraction of D's token cost — beating B_ams_text is a necessary
  intermediate milestone, not the finish line, because B_ams_text
  is a different product category (RAG-shaped).
- §10.3's 'inversion at N=20' re-read as architectural good news
  for the retrieval side, not as a reason to ship B_ams_text.

Raw numbers in §10.3 and §10.7 are unchanged; only the product
framing is corrected.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot changed the title v3.46 · session-layer viability: B_ams_text wins; training does not lift A/C past B v3.46 · session-layer viability: prefix channel (A/C) is the ship track; B_ams_text is a diagnostic Apr 22, 2026
cursor Bot pushed a commit that referenced this pull request Apr 22, 2026
3-mode subset (D_full_history, A_ams_prefix, C_ams_hybrid) on the same
10-query synthetic session as PR #29's session_viability.py, but using
MemLLM4 for A and C.

Fresh-init expectation: A/C hit-rates at Qwen2.5-1.5B scale on GPU are
not expected to beat v3.46 fresh-init — that requires training (v4.6).
This harness produces the fresh-init baseline that v4-trained will be
compared against.

B modes (B_flat_cos, B_ams_text) omitted: they are RAG-shaped upper-bound
diagnostics, not v4 product modes (per SPRINT_CLOSEOUT_v3.46.md §10.9).
D_full_history is kept as the ceiling baseline.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants