Skip to content

v3.44-Trained black-box audit: 60-step CPU training breaks 17/26 plateau#17

Draft
FluffyAIcode wants to merge 1 commit intomainfrom
AgentMemory/v344-trained-audit-7e97
Draft

v3.44-Trained black-box audit: 60-step CPU training breaks 17/26 plateau#17
FluffyAIcode wants to merge 1 commit intomainfrom
AgentMemory/v344-trained-audit-7e97

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

Scope

  • SUT: scheme_b_v344.py = scheme_b_v342.py + [J-1] weight-load hook. No Cfg changes.
  • Runner / spec: unmodified.
  • Training: 60 steps of Trainer.step() on CPU (batch 3, Adam lr=1e-4). Took 398.5 s.
  • Audit: 26 cases with AMS_TRAINED_WEIGHTS=ckpt/v344_trained.pt. Took 1404.3 s.

Result

18 / 26 pass, first 26-case run to exceed the 17±1 eval-time plateau held across v3.37 → v3.43.

Delta vs v3.42 (untrained, 17/26)

Transition Count Cases
FAIL → PASS 2 4.12 prefix_stepwise_drift_trajectory; 4.21 decode_repetition_feedback_probe
PASS → FAIL 1 4.13 retrieval_generation_alignment_audit (training instability @ 60 steps — output drifts into Qwen's multilingual token space)
Persistent FAIL 8 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25 (+ 4.13)

Mechanism: why training broke the plateau

  • 4.12 / 4.21 depend on vocab_proj and reranker learned weights. vocab_proj.proj[-1] was zero-init in v3.42 (std = 0); after 60 steps std = 7e-4, enough to add ~+1 logit semantic boost to content tokens at step 0, breaking the "key key key" attractor.
  • Training loss: 307.6 → 44.2 (7×). Saturation in context_separation (→0 by step 14) indicates that loss is mis-specified (clamps all pairs; see §4.3 in feedback).

Hypothesis test: which FAILs are trainable?

Pre-training prediction matrix:

predicted trainable actual
4.15 ❌ not yet — vocab_proj std too low at 60 steps
4.23 ❌ Qwen vocab geometry, not trainable
4.24 ❌ loss function has a sign bug (pushes same-domain apart)
4.12 (unpredicted) ✅ FIXED
4.21 (unpredicted) ✅ FIXED

The "eval-time vs training-time" partitioning was directionally correct but case-specific assignments were wrong. Learned vocab_proj / reranker weights carry more degrees of freedom than any Cfg scalar, which is why training broke cases that scalar tuning could not.

Next-step projections (not executed)

  • Fix context_separation_loss to triplet form → expect 4.24 PASS
  • Train to 300+ steps → expect 4.15 PASS (probability crossing the 0.01 quantisation)
  • Projection: 20/26 achievable without Cfg changes

Artifacts

  • scheme_b_v344.py + train_v344.py
  • ckpt/v344_trained.pt (453 MB — not tracked, reproducible by python3 train_v344.py --steps 60)
  • ckpt/train_log.jsonl + ckpt/train_stdout.log
  • reports/v344_trained_blackbox/{report.json, report.md, runner.log, audit_feedback.md}
Open in Web Open in Cursor 

- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants