v3.35 black-box audit: 13/19 PASS (best across v3.31-v3.35), 1176s CPU#7
Draft
FluffyAIcode wants to merge 2 commits into v331 from
scheme_b_v335.py contains the v3.35 code provided for the audit. Main
changes over v3.34:
  [C-1] DirectionTree.leaf_size_violations() now returns
        List[Tuple[int,int]] (each tuple is (leaf_depth, leaf_size)).
        Fixes the v3.34 contract mismatch with the spec runner, which
        checks 'len(violations) == 0'; case 4.1 will now pass.
  [C-2] No-repeat-bigram penalty, applied in both shape_step_logits
        and fwd() (runner path). Standard HF no_repeat_ngram_size=2.
  [C-3] Full-generation-span fwd()-path bias shaping:
        - _get_prefix(return_extra=False) attaches content_bias
          and suppression_bias as prefix tensor attributes;
        - prepare_decode_context (return_extra=True) does NOT attach
          (to avoid double application via shape_step_logits);
        - fwd() applies both biases with dampen=0.3 when attached.
        This extends the v3.34 early-step hard mask to the entire
        generation span on the runner's direct _get_prefix+fwd path;
        the target is case 4.10 prefix_logit_drift_audit.
Retained: v3.34 [B-1..B-5], v3.33 [A-1..A-4], v3.32 [F-1..F-6].
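To illustrate the [C-1] contract, here is a minimal sketch, not the audited code. Only the return type List[Tuple[int,int]] and the runner's 'len(violations) == 0' check come from the notes above; the `leaves` argument, the `max_leaf_size` parameter, and the idea that v3.34 returned a bare integer count are assumptions made for illustration.

```python
from typing import List, Tuple


def leaf_size_violations(
    leaves: List[Tuple[int, int]], max_leaf_size: int
) -> List[Tuple[int, int]]:
    """Return (leaf_depth, leaf_size) for every leaf larger than max_leaf_size.

    `leaves` and `max_leaf_size` are hypothetical stand-ins for the tree's
    internal state; the real DirectionTree method may look quite different.
    """
    return [(depth, size) for depth, size in leaves if size > max_leaf_size]


# A list return value satisfies a runner-style check directly.
violations = leaf_size_violations([(2, 4), (3, 9), (3, 5)], max_leaf_size=8)
runner_check_passes = len(violations) == 0  # False here: one leaf is too large

# An integer return value (assumed v3.34 behaviour) breaks the same check,
# because len() is not defined for int.
try:
    len(1)  # type: ignore[arg-type]
    int_return_ok = True
except TypeError:
    int_return_ok = False
```

Returning the offending leaves rather than a count also lets a report show which depths exceed the limit.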
AgentMemorySystem.py is a minimal pass-through over scheme_b_v335; the
runner is unmodified. All 19 cases of V331_BLACKBOX_TEST_SPEC.md are
attempted verbatim.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (unmodified) against v3.35 as SUT,
under V331_BLACKBOX_TEST_SPEC.md policy. New personal best.
Results (13/19 PASS, 6/19 FAIL):
PASS: leaf_capacity_stability, degenerate_direction_boundary,
metric_trainability, no_grad_generation,
counterfactual_memory_influence, semantic_memory_grounding,
degeneration_quality, repetition_segment_audit,
prefix_stepwise_drift_trajectory,
retrieval_prefix_decode_correlation_audit,
prompt_diversity_without_memory, save_load_consistency,
training_cache_isolation, cheating_heuristics
FAIL: semantic_memory_counterfactual_pairs, prefix_logit_drift_audit,
retrieval_topk_semantic_shift,
retrieval_generation_alignment_audit,
stepwise_label_mass_alignment_audit
Evolution across versions (PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13 <-- best
Key wins:
- 4.1 leaf_capacity_stability: FAIL (TypeError) -> PASS ([C-1]
list/int contract fix).
- 4.8 degeneration_quality: FAIL -> PASS (avg_repeated_bigram_ratio
0.216 -> 0.183 thanks to [C-2] no-repeat-bigram).
- 4.12 repetition_segment_audit: bad_segment_ratio 0.053 -> 0.000.
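As a rough illustration of the [C-2] technique, the sketch below implements a no-repeat-bigram rule in plain Python: any token that would recreate an already generated bigram has its logit set to negative infinity. This is the hard-ban form; the notes above call it a "penalty", so whether v3.35 bans outright or only down-weights is not known here. The `repeated_bigram_ratio` helper uses an assumed definition (one minus unique bigrams over total bigrams) and may not match the runner's avg_repeated_bigram_ratio metric.

```python
import math
from typing import Dict, List, Set


def banned_next_tokens(generated: List[int]) -> Set[int]:
    """Tokens that would repeat an already generated bigram if emitted next."""
    if len(generated) < 2:
        return set()
    followers: Dict[int, Set[int]] = {}
    for prev, nxt in zip(generated, generated[1:]):
        followers.setdefault(prev, set()).add(nxt)
    # Any token that has already followed the last token is banned.
    return followers.get(generated[-1], set())


def apply_no_repeat_bigram(logits: List[float], generated: List[int]) -> List[float]:
    """Return a copy of logits with banned token ids set to -inf."""
    banned = banned_next_tokens(generated)
    return [-math.inf if tok in banned else value for tok, value in enumerate(logits)]


def repeated_bigram_ratio(tokens: List[int]) -> float:
    """Assumed metric: fraction of bigrams that duplicate an earlier bigram."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)


history = [5, 7, 9, 5]            # bigram (5, 7) has already been generated
shaped = apply_no_repeat_bigram([0.0] * 10, history)
ratio = repeated_bigram_ratio([1, 2, 1, 2, 1])
```

Here token 7 is banned after `history`, because emitting it would repeat the bigram (5, 7).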
Not fixed (known observation on 4.10):
prefix_logit_drift_audit still FAIL. v3.35's [C-3] makes the runner
path's fwd() inject content/suppression bias for the whole span.
This inflates L2 shift symmetrically on blank vs memory runs
(both go to ~3.2e11 due to bias scale*logits.std), which means the
'memory has more drift than blank' comparison the case requires
collapses. JS/topk actually reverse slightly. This is a direct
consequence of C-3's design choice and not a regression in semantics.
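A toy calculation can make the collapse described above concrete. Every vector below is made up, the shared bias magnitude is arbitrary, and only the L2-shift leg of the comparison is modelled (not JS divergence or top-k overlap), so this is an illustration of the argument rather than a reproduction of the report.

```python
import math
from typing import List


def l2_shift(reference: List[float], shifted: List[float]) -> float:
    """Euclidean distance between two logit vectors."""
    return math.sqrt(sum((s - r) ** 2 for r, s in zip(reference, shifted)))


def add(a: List[float], b: List[float]) -> List[float]:
    """Element-wise sum of two equally long vectors."""
    return [x + y for x, y in zip(a, b)]


baseline = [2.0, 0.5, -1.0, 0.0]          # logits with no prefix (made up)
blank_effect = [0.1, -0.1, 0.0, 0.1]      # small drift from a neutral prefix
memory_effect = [1.0, -0.8, 0.6, -0.5]    # larger drift from a loaded prefix
shared_bias = [50.0, -50.0, 50.0, -50.0]  # same large bias on both paths

# Without the shared bias, memory clearly drifts more than blank.
blank_plain = l2_shift(baseline, add(baseline, blank_effect))
memory_plain = l2_shift(baseline, add(baseline, memory_effect))
ratio_plain = memory_plain / blank_plain

# With the shared bias, both shifts are dominated by the bias itself.
blank_biased = l2_shift(baseline, add(add(baseline, blank_effect), shared_bias))
memory_biased = l2_shift(baseline, add(add(baseline, memory_effect), shared_bias))
ratio_biased = memory_biased / blank_biased
```

In this toy setup the memory-to-blank shift ratio falls from well above 1 to almost exactly 1 once the shared bias is added, which is the sense in which the "memory drifts more than blank" test loses its contrast.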
Artifacts: reports/v335_blackbox/{report.json, report.md, runner.log}.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
Full external black-box audit of v3.35 under identical policy, runner, seeds, environment, and backbone used for v3.31/v3.32/v3.33/v3.34.
- Command: python v331_blackbox_eval.py (byte-identical to previous runs; zero source mods)
- Elapsed: 1175.98s (~19.6 min) on CPU
PASS-count across versions
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13
Changes in this branch
- scheme_b_v335.py: v3.35 source as provided:
  - [C-1] DirectionTree.leaf_size_violations() → List[Tuple[int,int]] (fixes v3.34's TypeError on case 4.1).
  - [C-2] No-repeat-bigram penalty (standard HF no_repeat_ngram_size=2) applied in both shape_step_logits and runner-path fwd().
  - [C-3] Full-span fwd()-path bias shaping: _get_prefix(return_extra=False) attaches content_bias / suppression_bias to the prefix tensor; fwd() applies them with dampen=0.3. prepare_decode_context does NOT attach (avoiding double application through shape_step_logits).
- AgentMemorySystem.py: minimal pass-through over scheme_b_v335; runner unmodified.
Policy conformance
- test()
- Backbone: Qwen/Qwen2.5-1.5B-Instruct (bf16)
Per-case result (all five versions)
- 4.1 leaf_capacity_stability
- 4.2 degenerate_direction_boundary
- 4.3 metric_trainability
- 4.4 no_grad_generation
- 4.5 counterfactual_memory_influence
- 4.6 semantic_memory_grounding
- 4.7 semantic_memory_counterfactual_pairs
- 4.8 degeneration_quality
- 4.9 prompt_diversity_without_memory
- 4.10 prefix_logit_drift_audit
- 4.11 retrieval_topk_semantic_shift
- 4.12 repetition_segment_audit
- 4.13 save_load_consistency
- 4.14 training_cache_isolation
- 4.15 prefix_stepwise_drift_trajectory
- 4.16 retrieval_generation_alignment_audit
- 4.17 retrieval_prefix_decode_correlation_audit
- 4.18 cheating_heuristics
- 4.19 stepwise_label_mass_alignment_audit
Evidence (from report.md)
✅ Wins
- Case 4.1 leaf_capacity_stability (FAIL → PASS): C-1 fixes the contract cleanly.
- Case 4.8 degeneration_quality (FAIL → PASS): C-2 no-repeat-bigram brings the repeated-bigram ratio back under the 0.20 threshold.
- Case 4.12 repetition_segment_audit improves on v3.34 thanks to no-repeat-bigram: bad_segment_ratio 0.053 → 0.000.
- Case 4.15 prefix_stepwise_drift_trajectory: remains PASS with first_bad_step=3 on both prompts (v3.34 behavior preserved).
Case 4.10 prefix_logit_drift_audit was the intended target of [C-3]. The case compares blank-memory vs memory-loaded prefix-induced drift and passes iff memory shows more drift than blank (by higher JS divergence, higher L2 shift, or lower top-k overlap).
v3.35 diagnostic from report.md: both runs go through the same fwd() path, which applies the C-3 bias with the same dampen × adaptive-scale × content_bias_scale ≈ 0.3 × 1.5·σ × 6 factor, and that factor dominates the shift magnitude.
Interpretation: [C-3] does what it claims (inject memory bias into the runner's stepwise decode), but the strength of that injection, when applied symmetrically to both blank and memory paths through the shared fwd() kernel, washes out the relative difference between them. The case is designed to detect differential memory influence; C-3 makes the baseline also non-trivial, which is good for downstream cases (4.12 bad_segment_ratio=0 confirms this) but a wash here.
A targeted fix would gate C-3 on "prefix carries non-empty retrieval diag", which is not the same as "prefix is not None": the runner's blank-memory construction still gets a prefix, just a neutral one. That refinement is a candidate for v3.36 but is not part of v3.35's contract.
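The gate proposed above could look roughly like the sketch below. It is only an illustration of the idea: the `Prefix` class, its `retrieval_diag` field, and the `shape_logits` helper are hypothetical names, and only the dampen=0.3 value is taken from the change notes. The real prefix is described as a tensor with attached attributes, which this plain-Python stand-in does not model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Prefix:
    """Hypothetical stand-in for the prefix object; not the v3.35 type."""
    content_bias: List[float]
    # Assumed to be empty when the prefix was built from blank memory.
    retrieval_diag: List[str] = field(default_factory=list)


def should_apply_bias(prefix: Optional[Prefix]) -> bool:
    """Gate on retrieved content, not merely on the prefix existing."""
    return prefix is not None and len(prefix.retrieval_diag) > 0


def shape_logits(
    logits: List[float], prefix: Optional[Prefix], dampen: float = 0.3
) -> List[float]:
    """Add the dampened content bias only when the gate is open."""
    if not should_apply_bias(prefix):
        return list(logits)
    return [value + dampen * bias for value, bias in zip(logits, prefix.content_bias)]


logits = [0.0, 0.0]
blank = Prefix(content_bias=[1.0, -1.0])                              # neutral prefix
loaded = Prefix(content_bias=[1.0, -1.0], retrieval_diag=["hit-0"])   # has retrievals

blank_out = shape_logits(logits, blank)    # unchanged: gate closed
loaded_out = shape_logits(logits, loaded)  # shifted by dampen * content_bias
```

Under this gate a blank-memory prefix leaves the logits untouched while a memory-loaded one is still biased, which is the asymmetry case 4.10 looks for.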
Anti-cheating (4.18)
exact_same=False, prefix_only=False, too_short=False: passes remain substantive, not shortcut-driven.
Artifacts
- reports/v335_blackbox/report.json
- reports/v335_blackbox/report.md
- reports/v335_blackbox/runner.log
Reproduction
Bottom-line
v3.35 advances the suite from 12/19 to 13/19, the best result across all five audited versions:
- [C-1] cleanly restores 4.1 by making leaf_size_violations() a real list.
- [C-2] restores 4.8 and pushes 4.12 to a perfect bad_segment_ratio=0.0, via a well-known standard technique.
- [C-3] achieves the design intent of "full-span bias on the runner path" but, in the specific differential-drift comparison of 4.10, the added shift on the blank path washes out the relative contrast: a transparent trade-off, not a regression. 4.12/4.15 continue to benefit from v3.34's [B-1/B-2] foundation.
The still-failing cluster (4.7 / 4.10 / 4.11 / 4.16 / 4.19) is consistent across the series and reflects genuinely hard-to-improve differential-semantics and alignment properties that decode-time shaping alone does not easily unlock.