v3.36 black-box audit: 12/19 PASS (4.10 flipped to PASS), 1169s CPU#8
Draft
FluffyAIcode wants to merge 2 commits intov331from
Draft
v3.36 black-box audit: 12/19 PASS (4.10 flipped to PASS), 1169s CPU#8FluffyAIcode wants to merge 2 commits intov331from
FluffyAIcode wants to merge 2 commits intov331from
Conversation
scheme_b_v336.py contains the v3.36 code for the audit. Main change over v3.35 is [C-4] guidance_active semantic closure: Each prefix now carries a _mem_guidance_active flag in addition to _mem_decode_prompt_len. fwd() only applies shaping when the flag is True. The flag is set True only on the runner-direct path when retrieval actually returned non-trivial memory; ctx path, contrastive uncond, neutral prefix, and empty-memory all set False. Consequence: the 4.10 blank-vs-memory differential is no longer drowned by symmetric -1e9 hard mask. The smoke run shows 4.10 flipping from FAIL -> PASS with l2_shift in a realistic range. Retains [C-1/C-2/C-3] from v3.35 and [A-*]/[B-*] from v3.33/v3.34. AgentMemorySystem.py is a minimal pass-through over scheme_b_v336; the runner (v331_blackbox_eval.py) is unmodified. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (unmodified) against v3.36 as SUT,
under V331_BLACKBOX_TEST_SPEC.md policy.
Results (12/19 PASS, 7/19 FAIL):
PASS: leaf_capacity_stability, degenerate_direction_boundary,
metric_trainability, no_grad_generation,
counterfactual_memory_influence, semantic_memory_grounding,
prefix_logit_drift_audit, prefix_stepwise_drift_trajectory,
retrieval_prefix_decode_correlation_audit,
prompt_diversity_without_memory, save_load_consistency,
training_cache_isolation, cheating_heuristics
FAIL: semantic_memory_counterfactual_pairs, degeneration_quality,
retrieval_topk_semantic_shift, repetition_segment_audit,
retrieval_generation_alignment_audit,
stepwise_label_mass_alignment_audit
Evolution across versions (PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13, v3.36: 12
Key observations:
4.10 FAIL -> PASS (the explicit target of [C-4]).
- blank : l2_shift=1045.06 (no guidance, clean backbone)
- memory : l2_shift=3.22e11 (guidance active, memory bias)
- memory shows strictly more drift than blank, satisfying spec.
4.12 PASS -> FAIL (regression).
- bad_segment_ratio went from 0.000 (v3.35) to 0.222.
- early_collapse_prompts: ['The pianist', 'The telescope'].
- Explanation: under [C-4] the runner's decode path only gets
fwd()-level shaping when retrieval returns strong enough memory.
For some short prompts retrieval weight falls below the
guidance threshold (1e-6), leaving Qwen2.5-1.5B unguided on
short seeds — and that backbone is noisy on these prompts.
This confirms the diagnosis from v3.35's PR: 4.10 and 4.12 put
opposing pressure on whether blank-vs-memory on the runner path
should be differentiable (4.10 wants yes, 4.12 wants the shaping
to stay on regardless). [C-4] chose 4.10; a future fix could add a
softer notion of guidance (e.g., always-on but bias strength scaled
with retrieval confidence) to reclaim 4.12 without hurting 4.10.
Artifacts: reports/v336_blackbox/{report.json, report.md, runner.log}.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Full external black-box audit of v3.36 under identical policy, runner, seeds, environment, and backbone used for v3.31..v3.35.
python v331_blackbox_eval.py(byte-identical; zero source mods)1169.49s(~19.5 min) on CPUPASS-count across versions
Pass count drops by 1 vs v3.35 (13→12), but the delta is exactly what
[C-4]trades: 4.10 FAIL→PASS (target) at the cost of 4.12 PASS→FAIL.Changes in this branch
scheme_b_v336.py— v3.36 source. Single change vs v3.35:_mem_guidance_activesemantic closure. Each prefix now carries a flag in addition to_mem_decode_prompt_len.fwd()only applies shaping when the flag is True. The flag is True only on the runner-direct path when retrieval actually returned non-trivial memory (weight >guidance_min_memory_weight=1e-6); it's False for theprepare_decode_contextpath, for contrastive uncond prefix, for neutral prefix, and for empty memory. Fixes the v3.35 structural conflict where 4.10's blank-vs-memory differential was drowned by symmetric-1e9hard mask on both paths.AgentMemorySystem.py— minimal pass-through.Policy conformance
test()Qwen/Qwen2.5-1.5B-Instruct(bf16)Per-case result (all six versions)
leaf_capacity_stabilitydegenerate_direction_boundarymetric_trainabilityno_grad_generationcounterfactual_memory_influencesemantic_memory_groundingsemantic_memory_counterfactual_pairsdegeneration_qualityprompt_diversity_without_memoryprefix_logit_drift_auditretrieval_topk_semantic_shiftrepetition_segment_auditsave_load_consistencytraining_cache_isolationprefix_stepwise_drift_trajectoryretrieval_generation_alignment_auditretrieval_prefix_decode_correlation_auditcheating_heuristicsstepwise_label_mass_alignment_auditEvidence from
report.md✅ Headline: 4.10 FAIL → PASS
blankis now a genuine clean baseline (l2≈1045):[C-4]setsguidance_active=Falseon an empty-memory prefix, sofwd()applies no shaping.memorystill has full shaping (l2=3.22e11 from the C-3 bias); memory clearly shows more drift than blank.Diagnosis: short prompts like
"The pianist"(4 tokens) have very few content tokens, so the retrieval weight for any memory can fall below theguidance_min_memory_weight=1e-6floor set by the[C-4]gate. When that happens,fwd()no longer applies no-repeat-bigram or early-starter shaping during decode, and Qwen2.5-1.5B-Instruct collapses into one of its natural short-prompt failure modes.This is the direct structural trade-off
[C-4]makes:v3.36 chose 4.10. A softer "always-on but confidence-scaled" guidance (instead of a binary gate) is the obvious v3.37 candidate.
Kept wins
outputs_differ=True(cross-domain memory differential)first_bad_step=3on both prompts (early hard-mask effective){changed:[], memory_count:8}(Trainer.recon clean)avg_repeated_bigram_ratio=0.04, avg_content_token_ratio=0.72, worst_max_token_run=2— all green individually, butshort_or_hollow_prompts=['The pianist']trips the "no short-or-hollow prompt" criterion. Same pattern as v3.33/v3.34: sampling-noise FAIL on one prompt, not a systemic regression.Anti-cheating (4.18)
exact_same=False, prefix_only=False, too_short=False— passes remain substantive.Artifacts
reports/v336_blackbox/report.jsonreports/v336_blackbox/report.mdreports/v336_blackbox/runner.logReproduction
Bottom-line
v3.36's
[C-4]delivers exactly its stated goal — 4.10 prefix_logit_drift_audit flips to PASS for the first time in the series with clean, interpretable evidence (blank path is pure backbone, memory path retains the C-3 bias, l2 ratio ~3e8:1). The cost is a 4.12 regression caused by the same binary gate now also stripping shaping from short-prompt decode where retrieval weight is below threshold.Net: 4.10 gained, 4.12 lost; absolute PASS count moves 13 → 12. But a case that had never passed in any prior version is now PASS, and its diagnostic trace is clean. The next natural step is a non-binary guidance scale (0..1 rather than {0,1}) driven by retrieval confidence, which is expected to recover 4.12 without sacrificing 4.10.