Runner v3.45 update: 4.23/4.24/4.25 corrected metrics + axis-coverage + determinism hook + audit on v3.44-Trained (19/26) #19
Draft
FluffyAIcode wants to merge 2 commits into main
- scheme_b_v344.py: v3.42 clone + [J-1] AMS_TRAINED_WEIGHTS env hook
- train_v344.py: CPU training driver (60 steps, 398.5s)
- ckpt/train_log.jsonl + train_stdout.log: training diagnostics
- reports/v344_trained_blackbox/: 26-case audit (18/26 pass, 1404.3s)
- audit_feedback.md: Section 7 compliant analysis

Delta vs v3.42 (untrained 17/26):
- FAIL -> PASS: 4.12 prefix_stepwise_drift_trajectory, 4.21 decode_repetition_feedback_probe
- PASS -> FAIL: 4.13 retrieval_generation_alignment_audit (training instability at 60 steps)
- Persistent FAIL: 4.7, 4.10, 4.15, 4.17, 4.23, 4.24, 4.25

First 26-case run to exceed the 17+/-1 eval-time plateau.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nism hook; audit on v3.44-Trained ckpt: 19/26 pass

Changes to v331_blackbox_eval.py (non-SUT):
- 4.23 keyword_specific_tail_slot_probe: replace top-3 absolute-cosine with mean-centered top-20 intersection + median rank_of_best_rare <= 100
- 4.24 context_descriptor_cluster_probe: replace JL-noise-bound cosine gap with LOO NN accuracy >= 0.75 (retain cosine metrics as diagnostics)
- 4.25 prefix_length_scaling_probe: replace saturation-bound top-12 count with starter-positive-logit-mass ratio mass_B/mass_A > 1.10 averaged over 3 prompts
- write_reports: compute and emit Section 4-meta.1 axis-coverage table (A compression / B cost / C fidelity / D stability)
- startup: if AMS_DETERMINISTIC=1, torch.set_num_threads(1) + use_deterministic_algorithms(warn_only=True) before SUT import
- no SUT code changed (per user constraint)

Audit on ckpt/v344_trained.pt with AMS_DETERMINISTIC=1 + AMS_TRAINED_WEIGHTS:
- 19/26 pass (v3.44-Trained: 18/26; same weights)
- 4.25 transitions FAIL -> PASS (avg_mass_ratio=1.38, threshold >1.10)
- 4.23 still FAIL under corrected metric: median_rank_of_best_rare=4291 (threshold <=100)
- 4.24 still FAIL under corrected metric: loo_nn_accuracy=0.60 (threshold >=0.75)
- 4.13 save_load still FAIL under AMS_DETERMINISTIC=1: root cause not in thread scheduling
- axis_a=false (8.97 vs 10.0), axis_b=true, axis_c=5/11, axis_d=2/3; channel_passes_all_axes=false

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Scope
Non-SUT changes only:
- v331_blackbox_eval.py: probes 4.23 / 4.24 / 4.25 replaced with the v3.45 metrics from SPEC PR #18 ("Spec correction: compression-communication channel definition (Section 1.1) + rewrite of probes 4.22–4.25 + axis-coverage reporting"); axis-coverage emission added; AMS_DETERMINISTIC env hook added
- V331_BLACKBOX_TEST_SPEC.md: synced in from SPEC PR #18
- reports/v345_runner_update_blackbox/: 26-case audit against same ckpt as v3.44-Trained

No scheme_b_vXXX.py or AgentMemorySystem.py changes.

Metric corrections (per SPEC PR #18)
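Two of the corrected metrics in this section lend themselves to a short illustration. The following is a minimal NumPy sketch, not the runner's code: the function names, the choice to subtract the WTE row mean from both the embedding table and the slot vector, the 1-based ranks, and the use of cosine similarity for the nearest-neighbour search are all assumptions made for illustration.

```python
import numpy as np

def centered_topk_hit_and_best_rank(wte, slot, rare_ids, k=20):
    """Sketch of a mean-centered top-k check (hypothetical helper).

    Returns (hit, best_rank): whether any rare id lands in the top-k
    centered scores, and the 1-based rank of the best-scoring rare id.
    """
    wte = np.asarray(wte, dtype=float)
    mean = wte.mean(axis=0)                      # assumed centering: WTE row mean
    scores = (wte - mean) @ (np.asarray(slot, dtype=float) - mean)
    order = np.argsort(-scores, kind="stable")   # token ids, best score first
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(1, len(order) + 1)   # 1-based rank per token id
    best_rank = int(rank[np.asarray(list(rare_ids))].min())
    return best_rank <= k, best_rank

def loo_nn_accuracy(vectors, labels):
    """Leave-one-out 1-NN accuracy under cosine similarity (assumed metric)."""
    x = np.asarray(vectors, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)               # a point may not vote for itself
    nearest = sim.argmax(axis=1)
    labels = np.asarray(labels)
    return float((labels[nearest] == labels).mean())
```

Under these assumptions, the per-sample best_rank values would be aggregated with a median before comparison against 100, and the loo_nn_accuracy value would be compared against 0.75.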
| Probe | Previous metric | v3.45 metric |
|---|---|---|
| 4.23 keyword_specific_tail_slot_probe | top-3(wte @ slot_1) ∩ rare_keywords >= 1 (dominated by Qwen 2.5 token ids 0/1/2 near WTE mean) | top-20(wte_centered @ slot_1_centered) ∩ rare >= 1 AND median(rank_of_best_rare) <= 100 out of 151936 |
| 4.24 context_descriptor_cluster_probe | intra - inter >= 0.15 (JL variance 0.58 at N=3) | loo_nn_accuracy >= 0.75 (Clopper-Pearson CI bounded) |
| 4.25 prefix_length_scaling_probe | content_starters_top12_B >= top12_A + 1 (saturates at 12/12) | avg(starter_positive_logit_mass_B / mass_A) > 1.10 over 3 prompts |

Axis-coverage emission
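The pass/fail flags in the axis-coverage table can be derived mechanically from the raw numbers. A minimal sketch, assuming axis A passes when ratio >= threshold, axis B passes when the per-step cost does not depend on N, axis C passes when at least threshold_K cases pass, and axis D requires every case to pass; the function name and argument names are hypothetical and not taken from the runner.

```python
def axis_coverage(ratio, ratio_threshold, per_step_floats, depends_on_n,
                  fidelity_passed, fidelity_total, fidelity_k,
                  stability_passed, stability_total):
    """Build an axis-coverage table and the overall channel verdict (sketch)."""
    table = {
        "axis_a_compression": {
            "ratio": ratio,
            "threshold": ratio_threshold,
            "passed": ratio >= ratio_threshold,          # assumed comparison
        },
        "axis_b_injection_cost": {
            "per_step_floats": per_step_floats,
            "depends_on_N": depends_on_n,
            "passed": not depends_on_n,                  # assumed rule
        },
        "axis_c_fidelity": {
            "passed_over_total": f"{fidelity_passed}/{fidelity_total}",
            "threshold_K": fidelity_k,
            "passed": fidelity_passed >= fidelity_k,
        },
        "axis_d_stability": {
            "passed_over_total": f"{stability_passed}/{stability_total}",
            "threshold_all_pass": True,
            "passed": stability_passed == stability_total,
        },
    }
    # The channel passes only if every axis passes.
    channel_passes_all_axes = all(axis["passed"] for axis in table.values())
    return table, channel_passes_all_axes
```

Called with the figures reported in this PR (8.97 against 10.0, 164224 floats independent of N, 5/11 with K = 9, 2/3), this sketch yields a failing channel verdict.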
Every report now emits the Section 4-meta.1 table with A/B/C/D axis results:
    {
      "axis_a_compression": { "ratio": 8.97, "threshold": 10.0, "passed": false },
      "axis_b_injection_cost": { "per_step_floats": 164224, "depends_on_N": false, "passed": true },
      "axis_c_fidelity": { "passed_over_total": "5/11", "threshold_K": 9, "passed": false },
      "axis_d_stability": { "passed_over_total": "2/3", "threshold_all_pass": true, "passed": false }
    }

Note axis A fails at 8.97 because the formula includes the semantic_emb (d_LLM=1536) field. If semantic_emb is excluded (it is a cached hidden_mean, not part of the compressed code), ratio = 87. The runner reports the literal sum per the Section 7.3 instruction to avoid narrative.

Determinism hook
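As an illustration of an environment-gated determinism hook of the kind this section covers, here is a minimal sketch. The function name apply_determinism_hook and the lazy torch import are assumptions for illustration; only the environment variable and the two torch calls come from this PR.

```python
import os

def apply_determinism_hook(env=os.environ):
    """Apply single-threaded, deterministic torch settings when requested.

    Returns True if the settings were applied, False otherwise.
    """
    if env.get("AMS_DETERMINISTIC") != "1":
        return False
    import torch  # imported lazily so the disabled path has no torch dependency
    torch.set_num_threads(1)
    # warn_only=True emits a warning instead of raising on nondeterministic ops.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return True

# Run the hook at startup; the SUT import would come after this line.
apply_determinism_hook()
```

Running the hook before the SUT is imported means any torch work done at SUT import time already sees the deterministic settings.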
v331_blackbox_eval.py reads AMS_DETERMINISTIC=1 at import time and calls torch.set_num_threads(1) + torch.use_deterministic_algorithms(True, warn_only=True) before importing the SUT.

Audit result (ckpt = v3.44-Trained's ckpt/v344_trained.pt, unchanged)

Delta vs v3.44-Trained (same weights, only runner metrics changed):
- 4.25 prefix_length_scaling_probe: FAIL -> PASS; avg_mass_ratio=1.38 crosses threshold 1.10

No other case changed.
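The avg_mass_ratio figure can be illustrated with a small sketch. It assumes that "starter positive logit mass" means the sum of the positive logits over a fixed set of content-starter token ids, and that A and B are the two conditions being compared per prompt; both readings, and all function names, are assumptions rather than the runner's definitions.

```python
import numpy as np

def starter_positive_logit_mass(logits, starter_ids):
    """Sum of positive logits over the starter token ids (assumed definition)."""
    starter_logits = np.asarray(logits, dtype=float)[list(starter_ids)]
    return float(np.clip(starter_logits, 0.0, None).sum())

def avg_mass_ratio(logits_a, logits_b, starter_ids):
    """Mean over prompts of mass_B / mass_A."""
    ratios = []
    for a, b in zip(logits_a, logits_b):
        mass_a = starter_positive_logit_mass(a, starter_ids)
        if mass_a <= 0.0:
            raise ValueError("mass_A must be positive to form a ratio")
        ratios.append(starter_positive_logit_mass(b, starter_ids) / mass_a)
    return float(np.mean(ratios))

def passes_mass_ratio(ratio, threshold=1.10):
    """Strict inequality, matching the '> 1.10' threshold."""
    return ratio > threshold
```

Unlike a top-12 count, a ratio of this form does not hit a ceiling when every top slot is already a content starter, which is the saturation problem the corrected metric is meant to avoid.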
Per-failing-case evidence
"piano")

Falsifiable predictions (Section 7.6)
- 4.24: Cfg(context_hybrid_hidden_weight=0.1) and retraining predicts loo_nn_accuracy ≥ 0.75 and music_gap > 0
- 4.23: median_rank_of_best_rare decreases monotonically in step count
- 4.13 save_load: still fails under AMS_DETERMINISTIC=1, so root cause is not thread scheduling; next candidates are torch.randperm in PrefixAligner.calibrate or memory-state mutation during load

Artifacts
- reports/v345_runner_update_blackbox/report.json
- reports/v345_runner_update_blackbox/report.md
- reports/v345_runner_update_blackbox/runner.log
- reports/v345_runner_update_blackbox/audit_feedback.md (Section 7.7 compliant)

Dependencies
V331_BLACKBOX_TEST_SPEC.md content is identical to the SPEC PR #18 branch; if #18 merges unchanged, this one is a clean fast-forward.