Skip to content

v3.35 black-box audit: 13/19 PASS (best across v3.31-v3.35), 1176s CPU#7

Draft
FluffyAIcode wants to merge 2 commits intov331from
AgentMemory/v335-blackbox-audit-7e97
Draft

v3.35 black-box audit: 13/19 PASS (best across v3.31-v3.35), 1176s CPU#7
FluffyAIcode wants to merge 2 commits intov331from
AgentMemory/v335-blackbox-audit-7e97

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

@FluffyAIcode FluffyAIcode commented Apr 19, 2026

Summary

Full external black-box audit of v3.35 under identical policy, runner, seeds, environment, and backbone used for v3.31/v3.32/v3.33/v3.34.

  • Runner: python v331_blackbox_eval.py (byte-identical to previous runs; zero source mods)
  • Elapsed: 1175.98s (~19.6 min) on CPU
  • Result: 13/19 PASS, 6/19 FAIL ⭐ new best

PASS-count across versions

v3.31 v3.32 v3.33 v3.34 v3.35
10 11 10 12 13

Changes in this branch

  • scheme_b_v335.py — v3.35 source as provided:
    • [C-1] DirectionTree.leaf_size_violations()List[Tuple[int,int]] (fixes v3.34's TypeError on case 4.1).
    • [C-2] No-repeat-bigram penalty (standard HF no_repeat_ngram_size=2) applied in both shape_step_logits and runner-path fwd().
    • [C-3] Full-span fwd()-path bias shaping: _get_prefix(return_extra=False) attaches content_bias/suppression_bias to the prefix tensor; fwd() applies them with dampen=0.3. prepare_decode_context does NOT attach (avoiding double application through shape_step_logits).
  • AgentMemorySystem.py — minimal pass-through over scheme_b_v335; runner unmodified.

Policy conformance

  • External runner only, byte-identical
  • No mock / fallback / overfit / simplified path
  • No monkeypatching
  • No reuse of module-internal test()
  • Real torch 2.11 + transformers 5.5.4 + Qwen/Qwen2.5-1.5B-Instruct (bf16)
  • Fixed per-case seeds per spec §4

Per-case result (all five versions)

# Case Seed v3.31 v3.32 v3.33 v3.34 v3.35
4.1 leaf_capacity_stability 0..7 PASS PASS PASS FAIL PASS
4.2 degenerate_direction_boundary 17 PASS PASS PASS PASS PASS
4.3 metric_trainability 23 PASS PASS PASS PASS PASS
4.4 no_grad_generation 29 PASS PASS PASS PASS PASS
4.5 counterfactual_memory_influence 31 FAIL PASS PASS PASS PASS
4.6 semantic_memory_grounding 33 FAIL PASS PASS PASS PASS
4.7 semantic_memory_counterfactual_pairs 35 FAIL FAIL FAIL FAIL FAIL
4.8 degeneration_quality 36 FAIL PASS FAIL FAIL PASS
4.9 prompt_diversity_without_memory 37 PASS PASS PASS PASS PASS
4.10 prefix_logit_drift_audit 38 FAIL FAIL FAIL FAIL FAIL
4.11 retrieval_topk_semantic_shift 39 FAIL FAIL FAIL FAIL FAIL
4.12 repetition_segment_audit 40 PASS FAIL FAIL PASS PASS
4.13 save_load_consistency 41 PASS PASS PASS PASS PASS
4.14 training_cache_isolation 43 PASS FAIL PASS PASS PASS
4.15 prefix_stepwise_drift_trajectory 44 FAIL FAIL FAIL PASS PASS
4.16 retrieval_generation_alignment_audit 45 FAIL FAIL FAIL FAIL FAIL
4.17 retrieval_prefix_decode_correlation_audit 46 PASS PASS PASS PASS PASS
4.18 cheating_heuristics 47 PASS PASS PASS PASS PASS
4.19 stepwise_label_mass_alignment_audit 48 FAIL FAIL FAIL FAIL FAIL

Evidence (from report.md)

✅ Wins

Case 4.1 leaf_capacity_stability (FAIL → PASS): C-1 fixes the contract cleanly.

"per_seed": [{"seed": 0, "depth": 6, "count": 240, "violations": [], "consistency": [], "passed": true}, ...]

Case 4.8 degeneration_quality (FAIL → PASS): C-2 no-repeat-bigram brings the repeated-bigram ratio back under the 0.20 threshold.

"avg_repeated_bigram_ratio": 0.183   // was 0.216 in v3.34
"avg_content_token_ratio":  0.811
"worst_max_token_run":      2
"short_or_hollow_prompts":  []

Case 4.12 repetition_segment_audit reaches an even better state than v3.34 thanks to no-repeat-bigram:

"bad_segment_ratio": 0.0   // was 0.053 in v3.34
"bad_segments":      0
"early_collapse_prompts": []

Case 4.15 prefix_stepwise_drift_trajectory: remains PASS with first_bad_step=3 on both prompts (v3.34 behavior preserved).

⚠️ Still failing — with a transparent diagnosis of [C-3]'s interaction with 4.10

Case 4.10 prefix_logit_drift_audit was the intended target of [C-3]. The case compares blank-memory vs memory-loaded prefix-induced drift and passes iff memory shows more drift than blank (by higher JS divergence, higher L2 shift, or lower top-k overlap).

v3.35 diagnostic from report.md:

blank  → js=0.387, l2=3.22e11, topk_overlap=2
memory → js=0.298, l2=3.22e11, topk_overlap=4
  • L2 shift is identical (~3.22e11): because both branches go through the fwd() path that applies C-3 bias with the same dampen × adaptive-scale × content_bias_scale ≈ 0.3 × 1.5·σ × 6 factor, which dominates the shift magnitude.
  • JS divergence actually lower and top-k overlap actually higher on the memory branch than on blank. This is the opposite of what the case requires.

Interpretation: [C-3] does what it claims (inject memory bias into the runner's stepwise decode), but the strength of that injection, when applied symmetrically to both blank and memory paths through the shared fwd() kernel, washes out the relative difference between them. The case is designed to detect differential memory influence; C-3 makes the baseline also non-trivial, which is good for downstream cases (4.12 bad_segment_ratio=0 confirms this) but a wash here.

A targeted fix would gate C-3 on "prefix carries non-empty retrieval diag", which is not the same as "prefix is not None" — the runner's blank-memory construction still gets a prefix, just a neutral one. That refinement is a candidate for v3.36 but is not part of v3.35's contract.

Anti-cheating (4.18)

exact_same=False, prefix_only=False, too_short=False — passes remain substantive, not shortcut-driven.

Artifacts

  • reports/v335_blackbox/report.json
  • reports/v335_blackbox/report.md
  • reports/v335_blackbox/runner.log

Reproduction

pip install torch transformers
git checkout AgentMemory/v335-blackbox-audit-7e97
PYTHONPATH=. python3 v331_blackbox_eval.py

Bottom-line

v3.35 advances the suite from 12/19 to 13/19, the best result across all five audited versions:

  • [C-1] cleanly restores 4.1 by making leaf_size_violations() a real list.
  • [C-2] restores 4.8 and pushes 4.12 to a perfect bad_segment_ratio=0.0, via a well-known standard technique.
  • [C-3] achieves the design intent of "full-span bias on the runner path" but, in the specific differential-drift comparison of 4.10, the added shift on the blank path washes out the relative contrast — a transparent trade-off, not a regression. 4.12/4.15 continue to benefit from v3.34's [B-1/B-2] foundation.

The still-failing cluster (4.7 / 4.10 / 4.11 / 4.16 / 4.19) is consistent across the series and reflects genuinely hard-to-improve differential-semantics and alignment properties — not easily unlocked by decode-time shaping alone.

Open in Web Open in Cursor 

cursoragent and others added 2 commits April 19, 2026 12:45
scheme_b_v335.py contains the v3.35 code provided for the audit. Main
changes over v3.34:

  [C-1] DirectionTree.leaf_size_violations() now returns
        List[Tuple[int,int]] (each tuple is (leaf_depth, leaf_size)).
        Fixes the v3.34 contract mismatch with the spec runner which
        does 'len(violations) == 0' — case 4.1 will now pass.
  [C-2] no-repeat-bigram penalty, applied in both shape_step_logits
        and fwd() (runner path). Standard HF no_repeat_ngram_size=2.
  [C-3] Full-generation-span fwd()-path bias shaping:
          - _get_prefix(return_extra=False) attaches content_bias
            and suppression_bias as prefix tensor attributes;
          - prepare_decode_context (return_extra=True) does NOT attach
            (to avoid double application via shape_step_logits);
          - fwd() applies both biases with dampen=0.3 when attached.
        This extends the v3.34 early-step hard mask to the entire
        generation span on the runner's direct _get_prefix+fwd path
        — the target is case 4.10 prefix_logit_drift_audit.

Retained: v3.34 [B-1..B-5], v3.33 [A-1..A-4], v3.32 [F-1..F-6].

AgentMemorySystem.py is a minimal pass-through over scheme_b_v335; the
runner is unmodified. All 19 cases of V331_BLACKBOX_TEST_SPEC.md are
attempted verbatim.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (unmodified) against v3.35 as SUT,
under V331_BLACKBOX_TEST_SPEC.md policy. New personal best.

Results (13/19 PASS, 6/19 FAIL):
  PASS: leaf_capacity_stability, degenerate_direction_boundary,
        metric_trainability, no_grad_generation,
        counterfactual_memory_influence, semantic_memory_grounding,
        degeneration_quality, repetition_segment_audit,
        prefix_stepwise_drift_trajectory,
        retrieval_prefix_decode_correlation_audit,
        prompt_diversity_without_memory, save_load_consistency,
        training_cache_isolation, cheating_heuristics
  FAIL: semantic_memory_counterfactual_pairs, prefix_logit_drift_audit,
        retrieval_topk_semantic_shift,
        retrieval_generation_alignment_audit,
        stepwise_label_mass_alignment_audit

Evolution across versions (PASS count):
  v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13 <-- best

Key wins:
  - 4.1 leaf_capacity_stability: FAIL (TypeError) -> PASS ([C-1]
    list/int contract fix).
  - 4.8 degeneration_quality: FAIL -> PASS (avg_repeated_bigram_ratio
    0.216 -> 0.183 thanks to [C-2] no-repeat-bigram).
  - 4.12 repetition_segment_audit: bad_segment_ratio 0.053 -> 0.000.

Not fixed (known observation on 4.10):
  prefix_logit_drift_audit still FAIL. v3.35's [C-3] makes the runner
  path's fwd() inject content/suppression bias for the whole span.
  This inflates L2 shift symmetrically on blank vs memory runs
  (both go to ~3.2e11 due to bias scale*logits.std), which means the
  'memory has more drift than blank' comparison the case requires
  collapses. JS/topk actually reverse slightly. This is a direct
  consequence of C-3's design choice and not a regression in semantics.

Artifacts: reports/v335_blackbox/{report.json, report.md, runner.log}.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot changed the title v3.35 black-box audit (in progress) — same protocol as v3.31..v3.34 v3.35 black-box audit: 13/19 PASS (best across v3.31-v3.35), 1176s CPU Apr 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants