Runner v3.45 update: 4.23/4.24/4.25 corrected metrics + axis-coverage + determinism hook + audit on v3.44-Trained (19/26)#19

Draft
FluffyAIcode wants to merge 2 commits into main from AgentMemory/v345-runner-update-audit-7e97

Scope

Non-SUT changes only:

No scheme_b_vXXX.py or AgentMemorySystem.py changes.

Metric corrections (per SPEC PR #18)

| probe | pre-v3.45 metric | v3.45 metric |
| --- | --- | --- |
| 4.23 keyword_specific_tail_slot_probe | top-3(wte @ slot_1) ∩ rare_keywords >= 1 (dominated by Qwen 2.5 token ids 0/1/2 near WTE mean) | top-20(wte_centered @ slot_1_centered) ∩ rare >= 1 AND median(rank_of_best_rare) <= 100 out of 151936 |
| 4.24 context_descriptor_cluster_probe | intra - inter >= 0.15 (JL variance 0.58 at N=3) | loo_nn_accuracy >= 0.75 (Clopper-Pearson CI bounded) |
| 4.25 prefix_length_scaling_probe | content_starters_top12_B >= top12_A + 1 (saturates at 12/12) | avg(starter_positive_logit_mass_B / mass_A) > 1.10 over 3 prompts |
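The corrected 4.23 metric can be illustrated with a minimal sketch. It assumes that both the embedding table and the slot vector are centered by subtracting the mean embedding row, that ranks are 1-based, and that the intersection condition is checked on the mean across prompts (as the reported `mean_intersection_size_top20` suggests). The function names `centered_scores`, `probe_one_prompt`, and `probe_passes` are hypothetical and do not come from `v331_blackbox_eval.py`.

```python
from statistics import mean, median

def centered_scores(wte, slot):
    """Dot product of each mean-centered embedding row with the centered slot.

    Assumption: the slot is centered with the same mean as the embedding table.
    """
    dim = len(slot)
    mu = [sum(row[j] for row in wte) / len(wte) for j in range(dim)]
    slot_c = [slot[j] - mu[j] for j in range(dim)]
    return [sum((row[j] - mu[j]) * slot_c[j] for j in range(dim)) for row in wte]

def probe_one_prompt(wte, slot, rare_ids, k=20):
    """Return (top-k intersection size, 1-based rank of the best rare token)."""
    scores = centered_scores(wte, slot)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = {tok: pos + 1 for pos, tok in enumerate(order)}
    hits = sum(1 for tok in rare_ids if rank[tok] <= k)
    best = min(rank[tok] for tok in rare_ids)
    return hits, best

def probe_passes(per_prompt, max_median_rank=100):
    """Apply both conditions of the v3.45 metric to per-prompt results."""
    hits = [h for h, _ in per_prompt]
    ranks = [r for _, r in per_prompt]
    return mean(hits) >= 1.0 and median(ranks) <= max_median_rank

# Toy vocabulary of four 2-d embeddings whose mean is the origin.
toy_wte = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
toy_result = probe_one_prompt(toy_wte, [0.9, 0.1], rare_ids={3}, k=2)
print(toy_result, probe_passes([toy_result]))
```

Centering removes the component shared by all embeddings, which is the effect the table attributes to token ids 0/1/2 sitting near the WTE mean.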

Axis-coverage emission

Every report now emits the Section 4-meta.1 table with A/B/C/D axis results:

{
  "axis_a_compression":    { "ratio": 8.97, "threshold": 10.0, "passed": false },
  "axis_b_injection_cost": { "per_step_floats": 164224, "depends_on_N": false, "passed": true },
  "axis_c_fidelity":       { "passed_over_total": "5/11", "threshold_K": 9, "passed": false },
  "axis_d_stability":      { "passed_over_total": "2/3",  "threshold_all_pass": true, "passed": false }
}

Note: axis A fails at 8.97 because the formula includes the semantic_emb field (d_LLM = 1536). If semantic_emb is excluded (it is a cached hidden_mean, not part of the compressed code), the ratio is 87. The runner reports the literal sum, following the Section 7.3 instruction to avoid narrative.
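As a reading aid, the axis table above can be reduced to a single verdict. The sketch below assumes the aggregate is a plain conjunction of the four `passed` flags (the commit message names it `channel_passes_all_axes`), and that the per-axis rules are `ratio >= threshold`, `passed count >= threshold_K`, and "every case passes" for axis D. Those comparison rules are inferred from the field names, not taken from the runner.

```python
import json

# Axis-coverage fragment copied from the report shown above.
REPORT_FRAGMENT = """
{
  "axis_a_compression":    { "ratio": 8.97, "threshold": 10.0, "passed": false },
  "axis_b_injection_cost": { "per_step_floats": 164224, "depends_on_N": false, "passed": true },
  "axis_c_fidelity":       { "passed_over_total": "5/11", "threshold_K": 9, "passed": false },
  "axis_d_stability":      { "passed_over_total": "2/3",  "threshold_all_pass": true, "passed": false }
}
"""

axes = json.loads(REPORT_FRAGMENT)

# Aggregate verdict: every axis must pass.
failed_axes = [name for name, axis in axes.items() if not axis["passed"]]
channel_passes_all_axes = not failed_axes

def split_fraction(text):
    """Parse a 'passed/total' string such as '5/11' into two ints."""
    passed, total = text.split("/")
    return int(passed), int(total)

# Recompute each flag from its raw fields (assumed comparison rules).
a = axes["axis_a_compression"]
c_passed, _ = split_fraction(axes["axis_c_fidelity"]["passed_over_total"])
d_passed, d_total = split_fraction(axes["axis_d_stability"]["passed_over_total"])
recomputed = {
    "axis_a_compression": a["ratio"] >= a["threshold"],
    "axis_b_injection_cost": not axes["axis_b_injection_cost"]["depends_on_N"],
    "axis_c_fidelity": c_passed >= axes["axis_c_fidelity"]["threshold_K"],
    "axis_d_stability": d_passed == d_total,
}
reported = {name: axis["passed"] for name, axis in axes.items()}

print(channel_passes_all_axes, failed_axes, recomputed == reported)
```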

Determinism hook

  • v331_blackbox_eval.py reads AMS_DETERMINISTIC=1 at import time and calls torch.set_num_threads(1) + torch.use_deterministic_algorithms(True, warn_only=True) before importing the SUT.
  • Zero SUT change; just an environment-driven runner-side setting.
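A minimal sketch of such an environment-driven hook follows. The variable name `AMS_DETERMINISTIC` and the two torch calls come from the description above; the function names, the exact string comparison against `"1"`, and the lazy import of torch are assumptions made for illustration.

```python
import os

def determinism_requested(env=os.environ):
    """True when the runner is asked for deterministic execution."""
    return env.get("AMS_DETERMINISTIC") == "1"

def apply_determinism(env=os.environ):
    """Configure torch for determinism; must run before the SUT is imported.

    Returns True if the settings were applied, False if the hook is off.
    """
    if not determinism_requested(env):
        return False
    import torch  # imported lazily so the hook costs nothing when disabled

    torch.set_num_threads(1)
    # warn_only=True logs a warning for ops without a deterministic
    # implementation instead of raising.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return True

if __name__ == "__main__":
    applied = apply_determinism()
    print("deterministic mode:", applied)
    # The SUT import would follow here in the runner.
```

Keeping the check in the runner rather than in the SUT means the same checkpoint and SUT code can be audited with and without the setting.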

Audit result (ckpt = v3.44-Trained's ckpt/v344_trained.pt, unchanged)

  • 26 cases, elapsed 1476.3 s on CPU single-threaded.
  • Pass: 19 / 26 (v3.44-Trained under old runner: 18/26).

Delta vs v3.44-Trained (same weights, only runner metrics changed):

| case_id | prior_passed | current_passed | notes |
| --- | --- | --- | --- |
| 4.25 prefix_length_scaling_probe | false | true | v3.45 metric avg_mass_ratio=1.38 crosses threshold 1.10 |

No other case changed.
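For readers unfamiliar with the 4.25 metric, here is a small sketch of a mass ratio of this kind. It assumes "starter positive logit mass" means the sum of the positive parts of the logits at a set of content-starter token ids, and that A and B are two prefix configurations scored on the same prompts. Both readings are assumptions, and `positive_logit_mass` and `avg_mass_ratio` are hypothetical names.

```python
def positive_logit_mass(logits, starter_ids):
    """Sum of max(logit, 0) over the content-starter token ids."""
    return sum(max(logits[t], 0.0) for t in starter_ids)

def avg_mass_ratio(pairs, starter_ids):
    """Average of mass_B / mass_A over (logits_A, logits_B) prompt pairs."""
    ratios = []
    for logits_a, logits_b in pairs:
        mass_a = positive_logit_mass(logits_a, starter_ids)
        if mass_a == 0.0:
            raise ValueError("mass_A is zero; the ratio is undefined")
        ratios.append(positive_logit_mass(logits_b, starter_ids) / mass_a)
    return sum(ratios) / len(ratios)

# Toy example: one prompt, three-token vocabulary, starters are ids 0 and 1.
toy_a = [1.0, -2.0, 1.0]
toy_b = [2.0, 0.5, 1.0]
toy_ratio = avg_mass_ratio([(toy_a, toy_b)], starter_ids=[0, 1])
print(toy_ratio, toy_ratio > 1.10)
```

Unlike a top-12 count, a ratio of this form does not saturate once all starters are already in the top 12, which is the limitation the metric table attributes to the old check.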

Per-failing-case evidence

| case | metric | value | threshold |
| --- | --- | --- | --- |
| 4.23 | median_rank_of_best_rare | 4291 | ≤ 100 |
| 4.23 | mean_intersection_size_top20 | 0.0 | ≥ 1.0 |
| 4.24 | loo_nn_accuracy | 0.60 (3/5) | ≥ 0.75 |
| 4.24 | music_gap (diagnostic) | −0.08 | |
| 4.24 | space_gap (diagnostic) | +0.25 | |
| 4.25 | avg_mass_ratio_B_over_A | 1.38 | > 1.10 (PASS) |
| 4.13 | divergence_step under AMS_DETERMINISTIC=1 | 1 (after "piano") | 0 (no divergence) |
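The `loo_nn_accuracy` figure for 4.24 is a leave-one-out nearest-neighbour accuracy: each descriptor is classified by the label of its closest other descriptor. The sketch below assumes cosine similarity as the closeness measure and uses made-up 2-d vectors with `music` and `space` labels. The real probe's descriptors, distance, and label set are not shown in this PR, so treat every name here as hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def loo_nn_accuracy(vectors, labels):
    """Fraction of points whose nearest other point shares their label."""
    correct = 0
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = max(others, key=lambda j: cosine(v, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)

# Toy descriptors: three "music" points near the x-axis, two "space" points
# near the y-axis, so every nearest neighbour has the matching label.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["music", "music", "music", "space", "space"]
toy_accuracy = loo_nn_accuracy(toy_vectors, toy_labels)
print(toy_accuracy, toy_accuracy >= 0.75)
```

With five descriptors the accuracy moves in steps of 0.20, so the reported 0.60 (3/5) needs one more correct neighbour to clear the 0.75 threshold.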

Falsifiable predictions (Section 7.6)

  • 4.24: setting Cfg(context_hybrid_hidden_weight=0.1) and retraining predicts loo_nn_accuracy ≥ 0.75 and music_gap > 0
  • 4.23: extending training from 60 to 300 steps predicts median_rank_of_best_rare decreases monotonically in step count
  • 4.13: save_load divergence persists under AMS_DETERMINISTIC=1, so root cause is not thread scheduling; next candidates are torch.randperm in PrefixAligner.calibrate or memory-state mutation during load

Artifacts

  • reports/v345_runner_update_blackbox/report.json
  • reports/v345_runner_update_blackbox/report.md
  • reports/v345_runner_update_blackbox/runner.log
  • reports/v345_runner_update_blackbox/audit_feedback.md (Section 7.7 compliant)

Dependencies


cursoragent and others added 2 commits April 20, 2026 15:32
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
  FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
  PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
  Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>