Runner v3.45 update: 4.23/4.24/4.25 corrected metrics + axis-coverage + determinism hook + audit on v3.44-Trained (19/26) by FluffyAIcode · Pull Request #19 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-20T17:10:12Z

Scope

Non-SUT changes only:

v331_blackbox_eval.py: probes 4.23 / 4.24 / 4.25 replaced with the v3.45 metrics from SPEC PR #18 ("Spec correction: compression-communication channel definition (Section 1.1) + rewrite of probes 4.22–4.25 + axis-coverage reporting"); axis-coverage emission added; AMS_DETERMINISTIC env hook added
V331_BLACKBOX_TEST_SPEC.md: synced in from SPEC PR #18
reports/v345_runner_update_blackbox/: 26-case audit against the same ckpt as v3.44-Trained

No scheme_b_vXXX.py or AgentMemorySystem.py changes.

Metric corrections (per SPEC PR #18)

| probe | pre-v3.45 metric | v3.45 metric |
| --- | --- | --- |
| 4.23 `keyword_specific_tail_slot_probe` | `top-3(wte @ slot_1) ∩ rare_keywords >= 1` (dominated by Qwen 2.5 token ids 0/1/2 near WTE mean) | `top-20(wte_centered @ slot_1_centered) ∩ rare >= 1` AND `median(rank_of_best_rare) <= 100` out of 151936 |
| 4.24 `context_descriptor_cluster_probe` | `intra - inter >= 0.15` (JL variance 0.58 at N=3) | `loo_nn_accuracy >= 0.75` (Clopper-Pearson CI bounded) |
| 4.25 `prefix_length_scaling_probe` | `content_starters_top12_B >= top12_A + 1` (saturates at 12/12) | `avg(starter_positive_logit_mass_B / mass_A) > 1.10` over 3 prompts |
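The v3.45 form of probe 4.23 can be sketched roughly as below. This is a minimal illustration, not the runner's code: the helper names (`centered`, `rare_keyword_ranks`, `probe_4_23_passes`) are hypothetical, the slot vector is assumed to be centered by subtracting the WTE mean, ranks are assumed to be 0-based, and the toy embedding matrix stands in for the real 151936-row one.

```python
from statistics import median

def centered(rows):
    """Subtract the per-dimension mean of `rows` from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows], mean

def rare_keyword_ranks(wte, slot, rare_ids, k=20):
    """Return (|top-k ∩ rare|, best rank of any rare id) for one slot vector."""
    wte_c, mean = centered(wte)
    slot_c = [s - m for s, m in zip(slot, mean)]  # assumption: same mean as WTE
    scores = [sum(a * b for a, b in zip(row, slot_c)) for row in wte_c]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank_of = {tok: r for r, tok in enumerate(order)}  # 0 = highest score
    hits = len(set(order[:k]) & set(rare_ids))
    best_rank = min(rank_of[t] for t in rare_ids)
    return hits, best_rank

def probe_4_23_passes(results, max_median_rank=100):
    """results: one (hits, best_rank) pair per prompt."""
    hits = [h for h, _ in results]
    ranks = [r for _, r in results]
    return sum(hits) / len(hits) >= 1.0 and median(ranks) <= max_median_rank

# Toy 4-token vocabulary in 2 dimensions (illustrative values only).
toy_wte = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
toy_result = rare_keyword_ranks(toy_wte, [0.9, 0.1], rare_ids=[1], k=2)
```

Centering both sides removes the component shared by all token embeddings, which is the reason the table gives for the old metric being dominated by tokens near the WTE mean.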

Axis-coverage emission

Every report now emits the Section 4-meta.1 table with A/B/C/D axis results:

{
  "axis_a_compression":    { "ratio": 8.97, "threshold": 10.0, "passed": false },
  "axis_b_injection_cost": { "per_step_floats": 164224, "depends_on_N": false, "passed": true },
  "axis_c_fidelity":       { "passed_over_total": "5/11", "threshold_K": 9, "passed": false },
  "axis_d_stability":      { "passed_over_total": "2/3",  "threshold_all_pass": true, "passed": false }
}

Note: axis A fails at 8.97 because the formula includes the semantic_emb field (d_LLM=1536). If semantic_emb is excluded (it is a cached hidden_mean, not part of the compressed code), the ratio is 87. The runner reports the literal sum, following the Section 7.3 instruction to avoid narrative.
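As a rough sketch of how a reader might consume the axis-coverage table above, the snippet below parses the emitted fragment and derives an overall verdict. The `channel_passes_all_axes` name matches the field mentioned in the commit message; treating it as a plain conjunction of the four `passed` flags is an assumption, and `failed_axes` is a hypothetical helper.

```python
import json

# The axis-coverage fragment exactly as emitted in the report above.
REPORT_FRAGMENT = """
{
  "axis_a_compression":    { "ratio": 8.97, "threshold": 10.0, "passed": false },
  "axis_b_injection_cost": { "per_step_floats": 164224, "depends_on_N": false, "passed": true },
  "axis_c_fidelity":       { "passed_over_total": "5/11", "threshold_K": 9, "passed": false },
  "axis_d_stability":      { "passed_over_total": "2/3",  "threshold_all_pass": true, "passed": false }
}
"""

def channel_passes_all_axes(axes):
    """Assumed aggregation: the channel passes only if every axis passes."""
    return all(axis["passed"] for axis in axes.values())

def failed_axes(axes):
    """Names of the axes whose `passed` flag is false, sorted for stable output."""
    return sorted(name for name, axis in axes.items() if not axis["passed"])

axes = json.loads(REPORT_FRAGMENT)
overall = channel_passes_all_axes(axes)
failing = failed_axes(axes)
```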

Determinism hook

v331_blackbox_eval.py reads AMS_DETERMINISTIC=1 at import time and calls torch.set_num_threads(1) + torch.use_deterministic_algorithms(True, warn_only=True) before importing the SUT.
Zero SUT change; just an environment-driven runner-side setting.
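The hook described above could look roughly like the following. This is a sketch under assumptions, not the code in v331_blackbox_eval.py: the function name `apply_determinism_hook` is hypothetical, and `torch` is imported lazily here only so the environment check can be exercised without PyTorch installed.

```python
import os

def apply_determinism_hook(env=os.environ):
    """Apply runner-side determinism settings if AMS_DETERMINISTIC=1.

    Returns True when the settings were applied, False otherwise.
    """
    if env.get("AMS_DETERMINISTIC") != "1":
        return False
    import torch  # lazy import: only needed when the hook is active
    torch.set_num_threads(1)
    # warn_only=True warns on ops without a deterministic implementation
    # instead of raising an error.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return True
```

In the runner this would be called once at import time, before the SUT module is imported, so the settings are already in place when the SUT builds its model.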

Audit result (ckpt = v3.44-Trained's `ckpt/v344_trained.pt`, unchanged)

26 cases, elapsed 1476.3 s on CPU single-threaded.
Pass: 19 / 26 (v3.44-Trained under old runner: 18/26).

Delta vs v3.44-Trained (same weights, only runner metrics changed):

| case_id | prior_passed | current_passed | notes |
| --- | --- | --- | --- |
| 4.25 prefix_length_scaling_probe | false | true | v3.45 metric `avg_mass_ratio=1.38` crosses threshold 1.10 |

No other case changed.
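For orientation, the 4.25 mass-ratio metric that flipped this case can be sketched as below. The exact definition of "starter positive logit mass" is not given here, so this assumes it is the sum of positive logits over a set of content-starter token ids; the function names and toy logits are hypothetical.

```python
def positive_logit_mass(logits, starter_ids):
    """Assumed definition: sum of the positive logits over the starter token ids."""
    return sum(max(logits[i], 0.0) for i in starter_ids)

def avg_mass_ratio(pairs, starter_ids):
    """pairs: one (logits_A, logits_B) tuple per prompt; returns mean of mass_B / mass_A."""
    ratios = []
    for logits_a, logits_b in pairs:
        mass_a = positive_logit_mass(logits_a, starter_ids)
        if mass_a <= 0.0:
            raise ValueError("mass_A must be positive for the ratio to be defined")
        ratios.append(positive_logit_mass(logits_b, starter_ids) / mass_a)
    return sum(ratios) / len(ratios)

# Toy logits over a 3-token vocabulary (illustrative values only).
toy_pairs = [([1.0, -2.0, 1.0], [2.0, 0.5, 1.0])]
toy_ratio = avg_mass_ratio(toy_pairs, starter_ids=[0, 1])
toy_passed = toy_ratio > 1.10
```

Unlike the old top-12 count, a ratio of this kind has no ceiling at 12/12, which is why it can still separate configurations A and B after the count saturates.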

Per-failing-case evidence

| case | metric | value | threshold |
| --- | --- | --- | --- |
| 4.23 | median_rank_of_best_rare | 4291 | ≤ 100 |
| 4.23 | mean_intersection_size_top20 | 0.0 | ≥ 1.0 |
| 4.24 | loo_nn_accuracy | 0.60 (3/5) | ≥ 0.75 |
| 4.24 | music_gap (diagnostic) | −0.08 | n/a |
| 4.24 | space_gap (diagnostic) | +0.25 | n/a |
| 4.25 | avg_mass_ratio_B_over_A | 1.38 | > 1.10 (PASS) |
| 4.13 | divergence_step under AMS_DETERMINISTIC=1 | 1 (after `"piano"`) | 0 (no divergence) |
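The `loo_nn_accuracy` figure for 4.24 is a leave-one-out nearest-neighbour score. A minimal sketch of that computation follows, assuming cosine similarity as the distance and using made-up 2-D vectors with `music` / `space` labels; neither the vectors nor the function names come from the runner.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def loo_nn_accuracy(vectors, labels):
    """Fraction of items whose nearest other item carries the same label."""
    correct = 0
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = max(others, key=lambda j: cosine(v, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)

# Toy descriptors: the last "music" item sits inside the "space" cluster,
# so it is misclassified and also steals the neighbour slot of one "space" item.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
toy_labels = ["music", "music", "space", "space", "music"]
toy_accuracy = loo_nn_accuracy(toy_vectors, toy_labels)
```

With only five items each miss moves the score by 0.20, which is the motivation for bounding the result with a confidence interval as the metric table notes.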

Falsifiable predictions (Section 7.6)

4.24: setting Cfg(context_hybrid_hidden_weight=0.1) and retraining predicts loo_nn_accuracy ≥ 0.75 and music_gap > 0
4.23: extending training from 60 to 300 steps predicts median_rank_of_best_rare decreases monotonically in step count
4.13: save_load divergence persists under AMS_DETERMINISTIC=1, so root cause is not thread scheduling; next candidates are torch.randperm in PrefixAligner.calibrate or memory-state mutation during load
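The 4.23 prediction is falsifiable by a simple monotonicity check over (step, median rank) measurements. The sketch below shows one way to express that check; the function name and the sample points are hypothetical and are not measurements from this audit.

```python
def decreases_monotonically(points, strict=False):
    """points: iterable of (training_step, median_rank) pairs, in any order.

    Returns True if median_rank never increases (or always decreases,
    when strict=True) as training_step grows.
    """
    ranks = [rank for _, rank in sorted(points)]
    if strict:
        return all(later < earlier for earlier, later in zip(ranks, ranks[1:]))
    return all(later <= earlier for earlier, later in zip(ranks, ranks[1:]))

# Hypothetical series: the second one rises at the last step and would falsify the prediction.
supports_prediction = decreases_monotonically([(1, 30), (2, 20), (3, 10)])
falsifies_prediction = not decreases_monotonically([(1, 30), (2, 20), (3, 25)])
```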

Artifacts

reports/v345_runner_update_blackbox/report.json
reports/v345_runner_update_blackbox/report.md
reports/v345_runner_update_blackbox/runner.log
reports/v345_runner_update_blackbox/audit_feedback.md (Section 7.7 compliant)

Dependencies

SPEC PR #18 ("Spec correction: compression-communication channel definition (Section 1.1) + rewrite of probes 4.22–4.25 + axis-coverage reporting") must merge first. This PR's V331_BLACKBOX_TEST_SPEC.md content is identical to that branch; if #18 merges unchanged, this one is a clean fast-forward.

- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):

- FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
- PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
- Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):

- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:

- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 2 commits April 20, 2026 15:32

FluffyAIcode wants to merge 2 commits into main from AgentMemory/v345-runner-update-audit-7e97
