v3.37 black-box audit — 14/19 PASS (new best), 4.16 retrieval fix confirmed#9
Draft
FluffyAIcode wants to merge 2 commits intomainfrom
Draft
v3.37 black-box audit — 14/19 PASS (new best), 4.16 retrieval fix confirmed#9FluffyAIcode wants to merge 2 commits intomainfrom
FluffyAIcode wants to merge 2 commits intomainfrom
Conversation
v3.37 introduces two structural fixes over v3.36:
[C-5] IDF-weighted content bias: rare domain tokens get ~2x boost relative
to high-frequency cross-domain repeaters.
[C-6] Multi-signal DirectionTree.retrieve: beam search + centroid cosine +
forward maxsim (IDF-weighted) rerank, preserving the (qdir, bw)
signature so the unmodified runner sees a richer candidate list.
Retains [C-4] guidance_active gate, [C-1..3] A-*, B-* fixes.
Vendors scheme_b_v321..v330 and v331_blackbox_eval.py for the audit.
Ignore __pycache__.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (unmodified) against v3.37 as SUT.
Results (14/19 PASS, 5/19 FAIL):
PASS: leaf_capacity_stability, degenerate_direction_boundary,
metric_trainability, no_grad_generation,
counterfactual_memory_influence, prefix_logit_drift_audit,
repetition_segment_audit, prefix_stepwise_drift_trajectory,
retrieval_generation_alignment_audit,
retrieval_prefix_decode_correlation_audit,
prompt_diversity_without_memory, save_load_consistency,
training_cache_isolation, cheating_heuristics
FAIL: semantic_memory_grounding, semantic_memory_counterfactual_pairs,
degeneration_quality, retrieval_topk_semantic_shift,
stepwise_label_mass_alignment_audit
Version evolution (PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12,
v3.35: 13, v3.36: 12, v3.37: 14 (new best)
Targeted fixes confirmed:
4.16 retrieval_generation_alignment_audit FAIL -> PASS
([C-6] multi-signal tree.retrieve rerank): retrieval_miss=0
on music/space queries (vs 1-2 retrieval_miss in v3.36).
4.12 repetition_segment_audit returned to PASS
(v3.36 regressed, v3.37 restored with bad_segment_ratio=0.11).
Residual FAILs all trace to either:
(a) keyword-list / backbone vocab distribution mismatch (4.7, 4.11),
which IDF [C-5] mitigates but does not eliminate — Qwen's top-12
on generic prompts still favors stop-function tokens.
(b) upstream simplification in runner's retrieve_memory_ids path
for stepwise aligned counts (4.19 inject stage).
(c) new regression in semantic_memory_grounding (4.6) — needs
future investigation (backbone produced long Chinese tangents).
(d) degeneration_quality (4.8) threshold tight under stochastic seeds.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
v3.37 Black-Box Audit — 14/19 PASS (new best across all audited versions)
Switches the System Under Test to
scheme_b_v337.pyand runs the unmodifiedv331_blackbox_eval.pyagainst it under theV331_BLACKBOX_TEST_SPEC.mdpolicy: no monkeypatching, no source modification of the runner, no mocked returns, honest failures.Summary
reports/v337_blackbox/{report.json,report.md,runner.log}Version evolution (PASS count)
Fixes in v3.37
content_biastensor is now multiplied by its corpus IDF (clamped to[idf_floor, idf_bias_max_boost=3.0]). Rare domain tokens get ~2× the boost of high-frequency repeaters, pushing them into the decoder's top-k.DirectionTree.retrieve— a PyTorchforward_pre_hookon the backbone captures the most recent query ids intoamm._last_query_ids.tree.retrieve(qdir, bw)now runs:0.2·dir + 0.4·centroid + 0.4·fwdAll prior fixes retained:
[C-4]_mem_guidance_activegate,[C-1..3],[A-*],[B-*].Case-by-case results
Key wins
[C-6]. On the music query, tree.retrieve now returns[1, 0, 3, 6, 2](4 music / 1 space) instead of a mostly-space set; diagnoses:{aligned: 2, retrieval_miss: 0, bridge_unused: 1}.Residual FAILs
All residual failures are structural, not fixable by further logit-shaping:
chopin,pianist,nocturnes, …) doesn't match the tokens Qwen naturally emits for generic prompts like "A strong explanation should mention". Even with IDF boost, the relative logit gap to" the"," a"," at"cannot be closed at the top-12 cutoff without destroying other properties.injectstage where logit label-mass is zero; fixing this needs training-time alignment ofContentSemanticTailHeadto keyword tokens, not more runtime shaping.Artifacts
reports/v337_blackbox/report.json— full structured results with per-case metricsreports/v337_blackbox/report.md— human-readable audit report (1099.4 s elapsed)reports/v337_blackbox/runner.log— full runner stdout/stderr