v3.44-Trained black-box audit: 60-step CPU training breaks 17/26 plateau by FluffyAIcode · Pull Request #17 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-20T15:32:44Z

Scope

SUT: scheme_b_v344.py = scheme_b_v342.py + [J-1] weight-load hook. No Cfg changes.
Runner / spec: unmodified.
Training: 60 steps of Trainer.step() on CPU (batch 3, Adam lr=1e-4). Took 398.5 s.
Audit: 26 cases with AMS_TRAINED_WEIGHTS=ckpt/v344_trained.pt. Took 1404.3 s.

Result

18 / 26 pass, first 26-case run to exceed the 17±1 eval-time plateau held across v3.37 → v3.43.

Delta vs v3.42 (untrained, 17/26)

Transition	Count	Cases
FAIL → PASS	2	4.12 `prefix_stepwise_drift_trajectory`; 4.21 `decode_repetition_feedback_probe`
PASS → FAIL	1	4.13 `retrieval_generation_alignment_audit` (training instability @ 60 steps — output drifts into Qwen's multilingual token space)
Persistent FAIL	8	4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25 (+ 4.13)

Mechanism: why training broke the plateau

4.12 / 4.21 depend on vocab_proj and reranker learned weights. vocab_proj.proj[-1] was zero-init in v3.42 (std = 0); after 60 steps std = 7e-4, enough to add ~+1 logit semantic boost to content tokens at step 0, breaking the "key key key" attractor.
Training loss: 307.6 → 44.2 (7×). Saturation in context_separation (→0 by step 14) indicates that loss is mis-specified (clamps all pairs; see §4.3 in feedback).

Hypothesis test: which FAILs are trainable?

Pre-training prediction matrix:

predicted trainable	actual
4.15	❌ not yet — `vocab_proj` std too low at 60 steps
4.23	❌ Qwen vocab geometry, not trainable
4.24	❌ loss function has a sign bug (pushes same-domain apart)
4.12 (unpredicted)	✅ FIXED
4.21 (unpredicted)	✅ FIXED

The "eval-time vs training-time" partitioning was directionally correct but case-specific assignments were wrong. Learned vocab_proj / reranker weights carry more degrees of freedom than any Cfg scalar, which is why training broke cases that scalar tuning could not.

Next-step projections (not executed)

Fix context_separation_loss to triplet form → expect 4.24 PASS
Train to 300+ steps → expect 4.15 PASS (probability crossing the 0.01 quantisation)
Projection: 20/26 achievable without Cfg changes

Artifacts

scheme_b_v344.py + train_v344.py
ckpt/v344_trained.pt (453 MB — not tracked, reproducible by python3 train_v344.py --steps 60)
ckpt/train_log.jsonl + ckpt/train_stdout.log
reports/v344_trained_blackbox/{report.json, report.md, runner.log, audit_feedback.md}

- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook - train_v344.py: CPU training driver (60 steps, 398.5s) - ckpt/train_log.jsonl + train_stdout.log: training diagnostics - reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s) - audit_feedback.md: Section 7 compliant analysis Delta vs v3.42 (untrained 17/26): FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps) Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25 First 26-case run to exceed the 17+/-1 eval-time plateau. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v3.44-Trained black-box audit: 60-step CPU training breaks 17/26 plateau#17

v3.44-Trained black-box audit: 60-step CPU training breaks 17/26 plateau#17
FluffyAIcode wants to merge 1 commit intomainfrom
AgentMemory/v344-trained-audit-7e97

FluffyAIcode commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented Apr 20, 2026

Scope

Result

Delta vs v3.42 (untrained, 17/26)

Mechanism: why training broke the plateau

Hypothesis test: which FAILs are trainable?

Next-step projections (not executed)

Artifacts

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants