v3.34 black-box audit: 12/19 PASS (best across v3.31-v3.34), 1262s CPU #6
Draft
FluffyAIcode wants to merge 2 commits into v331
scheme_b_v334.py contains the v3.34 code provided for the audit. Main
changes over v3.33:
[B-1] fwd() applies F-2 content-starter hard mask on the runner path
(eval mode + prefix with metadata + step within window).
[B-2] _get_prefix() binds prompt_length as a tensor attribute on the
returned prefix so fwd() can recover the decode step.
[B-3] Hard mask value -1e9 (not -inf) to keep runner-side CFG finite.
[B-4] training mode skips the mask to protect _recon_forward grads.
[B-5] DirectionTree.max_depth() / leaf_size_violations() as public API.
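The gating described in [B-1] through [B-4] can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in scheme_b_v334.py: the helper names, the list-based logits, the starter-token set, the assumption that non-starter tokens are the ones masked, and the guidance formula are all hypothetical. Only the -1e9 mask value, the 3-step window, the training-mode bypass and the step recovery from prompt_length come from this PR.

```python
import math

HARD_MASK = -1e9   # finite stand-in for -inf, per [B-3]
WINDOW = 3         # early_starter_hard_mask_steps, per the PR description


def apply_starter_mask(logits, seq_len, prompt_length, starter_ids, training):
    """Push non-starter tokens to HARD_MASK while inside the early window.

    Hypothetical helper: the real fwd() works on tensors and prefix metadata.
    """
    if training or prompt_length is None:
        return list(logits)             # [B-4]: training path stays unmasked
    step = seq_len - prompt_length      # [B-2]: decode step recovered from lengths
    if not 0 <= step < WINDOW:
        return list(logits)             # outside the window: no mask
    return [x if i in starter_ids else HARD_MASK for i, x in enumerate(logits)]


def cfg(cond, uncond, scale):
    """A common classifier-free guidance mix (assumed, not the runner's exact code)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


logits = [2.0, 0.5, -1.0, 1.5]
starters = {0, 3}

masked = apply_starter_mask(logits, 5, 4, starters, training=False)  # step 1
late = apply_starter_mask(logits, 8, 4, starters, training=False)    # step 4
train = apply_starter_mask(logits, 5, 4, starters, training=True)

# With a finite mask, guidance over two masked branches stays finite.
finite = cfg(masked, masked, scale=1.5)

# With -inf, (-inf) - (-inf) is NaN and poisons the guided logits.
inf_masked = [x if i in starters else -math.inf for i, x in enumerate(logits)]
poisoned = cfg(inf_masked, inf_masked, scale=1.5)
```

The last two lines show why a finite mask value matters whenever the caller subtracts one set of masked logits from another.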
AgentMemorySystem.py is a minimal pass-through over scheme_b_v334 so
the external runner (v331_blackbox_eval.py, unmodified) sees v3.34 as
the SUT.
Expected runner contract mismatch (disclosed, not patched):
Spec case 4.1 does 'passed = len(violations) == 0 and len(consistency) == 0'.
v3.34's DirectionTree.leaf_size_violations() returns int (a count),
not a list. This mismatches the runner's len() call and 4.1 will
fail with TypeError: object of type 'int' has no len().
Per the audit policy (no runner modification, no source shims), this
is recorded as an honest FAIL in the report and flagged in the PR.
Case 4.2 only reads the value without len() and still passes.
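The mismatch can be reproduced in isolation. The sketch below uses stand-in values rather than the real DirectionTree: an int in place of the return value of leaf_size_violations() and an empty list for consistency. The function name spec_check is hypothetical; only the checked expression is taken from the spec.

```python
def spec_check(violations, consistency):
    # Expression quoted from spec case 4.1.
    return len(violations) == 0 and len(consistency) == 0


# A list-valued result satisfies the runner's contract.
list_result = spec_check([], [])

# An int count does not: len() raises before any comparison happens.
try:
    spec_check(0, [])
    error = None
except TypeError as exc:
    error = str(exc)

print(list_result, error)
```

Even a count of zero fails, because the exception comes from calling len() on an int, not from the value itself.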
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (unmodified) against v3.34 as SUT,
under V331_BLACKBOX_TEST_SPEC.md policy.
Results (12/19 PASS, 7/19 FAIL):
PASS: degenerate_direction_boundary, metric_trainability,
no_grad_generation, counterfactual_memory_influence,
semantic_memory_grounding, repetition_segment_audit,
prefix_stepwise_drift_trajectory,
retrieval_prefix_decode_correlation_audit,
prompt_diversity_without_memory, save_load_consistency,
training_cache_isolation, cheating_heuristics
FAIL: leaf_capacity_stability (TypeError contract mismatch),
semantic_memory_counterfactual_pairs, degeneration_quality,
prefix_logit_drift_audit, retrieval_topk_semantic_shift,
retrieval_generation_alignment_audit,
stepwise_label_mass_alignment_audit
Evolution across versions (PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12 <-- new best
Key wins from [B-1/B-2] (fwd()-path F-2 hard mask):
4.12 repetition_segment_audit: bad_segment_ratio 0.375 -> 0.053,
early_collapse_prompts [] (was ['The telescope',
'Explain the topic clearly']).
4.15 prefix_stepwise_drift_trajectory: first_bad_step 0 -> 3
on both prompts. Spec passes if >= 3 or absent.
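The two pass rules behind these wins can be written out as small predicates. This is a sketch, not the runner's code: the function names are hypothetical, and the 4.12 thresholds (bad_segment_ratio ≤ 0.35, at most one early-collapse prompt) are the ones quoted in the PR description.

```python
from typing import Optional, Sequence


def passes_repetition_audit(bad_segment_ratio: float,
                            early_collapse_prompts: Sequence[str]) -> bool:
    # Case 4.12: ratio at or below 0.35 and at most one early-collapse prompt.
    return bad_segment_ratio <= 0.35 and len(early_collapse_prompts) <= 1


def passes_drift_trajectory(first_bad_step: Optional[int]) -> bool:
    # Case 4.15: pass when no bad step was seen, or it occurs at step 3 or later.
    return first_bad_step is None or first_bad_step >= 3


# Values reported for v3.33 and v3.34 in this commit message.
v333_rep = passes_repetition_audit(0.375, ["The telescope", "Explain the topic clearly"])
v334_rep = passes_repetition_audit(0.053, [])
v333_drift = passes_drift_trajectory(0)
v334_drift = passes_drift_trajectory(3)
```

Under these rules the v3.33 values fail both checks and the v3.34 values pass both, with 4.15 sitting exactly on the boundary.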
Regression / known disclosure:
4.1 leaf_capacity_stability: FAIL with
"TypeError: object of type 'int' has no len()".
v3.34 leaf_size_violations() returns int; the spec runner
does 'len(violations)'. Disclosed in the PR; not patched
per policy.
Artifacts: reports/v334_blackbox/{report.json, report.md, runner.log}.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
Full external black-box audit of v3.34 under identical policy, runner, seeds, environment, and backbone used for v3.31/v3.32/v3.33.
Runner: python v331_blackbox_eval.py (byte-identical to previous runs; zero source mods)
Elapsed: 1261.9s (~21 min) on CPU
PASS-count across versions: v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12
Changes in this branch
scheme_b_v334.py: v3.34 source as provided:
[B-1] MemLLM.fwd() applies the F-2 content-starter hard mask on the runner path (eval mode + prefix carrying metadata + step within the early window).
[B-2] _get_prefix() binds prompt_length as a tensor attribute on the returned prefix so fwd() can recover step = ids.shape[1] - prompt_length.
[B-3] Hard mask value -1e9 (not -inf) to keep runner-side CFG finite.
[B-4] self.training == True skips the mask to protect _recon_forward gradients.
[B-5] DirectionTree.max_depth() / leaf_size_violations() promoted to real public API.
AgentMemorySystem.py: minimal pass-through over scheme_b_v334; runner unmodified.
Policy conformance (spec §1, §5)
test()
Backbone: Qwen/Qwen2.5-1.5B-Instruct (bf16)
Per-case result (v3.31 → v3.32 → v3.33 → v3.34)
leaf_capacity_stability: v3.34 FAIL
degenerate_direction_boundary: v3.34 PASS
metric_trainability: v3.34 PASS
no_grad_generation: v3.34 PASS
counterfactual_memory_influence: v3.34 PASS
semantic_memory_grounding: v3.34 PASS
semantic_memory_counterfactual_pairs: v3.34 FAIL
degeneration_quality: v3.34 FAIL
prompt_diversity_without_memory: v3.34 PASS
prefix_logit_drift_audit: v3.34 FAIL
retrieval_topk_semantic_shift: v3.34 FAIL
repetition_segment_audit: v3.34 PASS
save_load_consistency: v3.34 PASS
training_cache_isolation: v3.34 PASS
prefix_stepwise_drift_trajectory: v3.34 PASS
retrieval_generation_alignment_audit: v3.34 FAIL
retrieval_prefix_decode_correlation_audit: v3.34 PASS
cheating_heuristics: v3.34 PASS
stepwise_label_mass_alignment_audit: v3.34 FAIL
What [B-1] / [B-2] achieved (the headline)
These two mechanisms target exactly the runner's hand-written stepwise decode path (_get_prefix() + fwd(ids, mask, prefix) + manual CFG), which previous fixes (A-1/A-2) could not reach because the runner doesn't call the new public APIs.
Case 4.12 repetition_segment_audit: FAIL → PASS
Evidence from report.md: bad_segment_ratio 0.053, early_collapse_prompts [].
vs v3.33: bad_segment_ratio 0.375, early_collapse_prompts ['The telescope', 'Explain the topic clearly'].
Spec thresholds: bad_segment_ratio ≤ 0.35, ≤ 1 early-collapse prompt. Both held with margin.
Case 4.15 prefix_stepwise_drift_trajectory: FAIL → PASS
first_bad_step = 3 on both prompts. vs v3.33: first_bad_step = 0 on both. Spec passes if first_bad_step is absent or ≥ 3. Both rows hit exactly 3: the first 3 steps are all content-starters (the window is early_starter_hard_mask_steps=3), fully consistent with B-1's mask design.
Case 4.14 training_cache_isolation stays PASS
{changed: [], memory_count: 8}. [B-4] worked: the training-mode bypass in fwd() preserves the recon gradient path, so Trainer.recon() still runs without touching memory bookkeeping.
Anti-cheating (4.18)
exact_same=False, prefix_only=False, too_short=False. Passes are substantive, not shortcut-driven.
Known disclosure (spec §5)
Case 4.1 FAIL is a contract mismatch, not an algorithmic regression.
The spec runner (v331_blackbox_eval.py line 583) does: passed = len(violations) == 0 and len(consistency) == 0
v3.34's DirectionTree.leaf_size_violations() returns int (a count), not a list/sequence. This trips len(): TypeError: object of type 'int' has no len()
Per the strict audit policy (no runner modifications, no source shims), this is recorded as an honest FAIL. Case 4.2 reads the value without len() and continues to pass.
Also note: the 4.8 degeneration_quality FAIL is fragile to sampling noise. It previously flipped between PASS (v3.32) and FAIL (v3.33) from one stochastic decode producing a short prompt. The v3.34 run produced avg_content_token_ratio=0.818 (very high) and avg_repeated_bigram_ratio=0.216 (just over the 0.20 threshold). The content metric is excellent; the repeated-bigram ratio is slightly over. This is a decode-sampling artifact rather than a systemic regression.
Artifacts
reports/v334_blackbox/report.json
reports/v334_blackbox/report.md
reports/v334_blackbox/runner.log
Reproduction
Bottom-line
v3.34's [B-1]/[B-2] successfully land the F-2 hard mask on the runner's own decode path (which is what the black-box audit actually exercises), flipping 4.12 and 4.15 from FAIL to PASS for the first time since the v3.31 baseline. Net pass count reaches 12/19, the best across all four versions. One known disclosure: the leaf_size_violations() signature mismatch with the spec runner makes 4.1 raise TypeError. Fixing this would require either returning a list (a signature convention change, probably the right move) or adopting an audit convention that len(int_count) is never used. Either way, it is outside the scope of this faithful-reproduction audit.
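The first option, returning a list, could look like the sketch below. ToyTree, its leaves mapping and the max_leaf_size parameter are hypothetical and only stand in for DirectionTree, whose internals are not shown in this PR.

```python
from typing import Dict, List


class ToyTree:
    """Hypothetical stand-in for DirectionTree; not the v3.34 implementation."""

    def __init__(self, leaves: Dict[str, List[int]], max_leaf_size: int) -> None:
        self.leaves = leaves
        self.max_leaf_size = max_leaf_size

    def leaf_size_violations(self) -> List[str]:
        # Return the offending leaves themselves; a caller that wants a
        # count can take len() of the result.
        return [name for name, items in self.leaves.items()
                if len(items) > self.max_leaf_size]


tree = ToyTree({"a": [1, 2], "b": [1, 2, 3, 4]}, max_leaf_size=3)
violations = tree.leaf_size_violations()

# The expression from spec case 4.1 now evaluates instead of raising.
passed = len(violations) == 0
```

A list keeps len(violations) valid for the 4.1 check, though any caller that compares the return value to an integer would need to be reviewed before such a change.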