
v3.34 black-box audit: 12/19 PASS (best across v3.31-v3.34), 1262s CPU #6

Draft
FluffyAIcode wants to merge 2 commits into v331 from
AgentMemory/v334-blackbox-audit-7e97

Conversation


@FluffyAIcode FluffyAIcode commented Apr 19, 2026

Summary

Full external black-box audit of v3.34 under identical policy, runner, seeds, environment, and backbone used for v3.31/v3.32/v3.33.

  • Runner: python v331_blackbox_eval.py (byte-identical to previous runs; zero source mods)
  • Elapsed: 1261.9s (~21 min) on CPU
  • Result: 12/19 PASS, 7/19 FAIL ⭐ new best

PASS-count across versions

v3.31 v3.32 v3.33 v3.34
10 11 10 12

Changes in this branch

  • scheme_b_v334.py — v3.34 source as provided:
    • [B-1] MemLLM.fwd() applies the F-2 content-starter hard mask on the runner path (eval mode + prefix carrying metadata + step within early-window).
    • [B-2] _get_prefix() binds prompt_length as a tensor attribute on the returned prefix so fwd() can recover step = ids.shape[1] - prompt_length.
    • [B-3] Hard mask value -1e9 (not -inf) to keep runner-side CFG finite.
    • [B-4] self.training == True skips the mask to protect _recon_forward gradients.
    • [B-5] DirectionTree.max_depth() / leaf_size_violations() promoted to real public API.
  • AgentMemorySystem.py — minimal pass-through over scheme_b_v334; runner unmodified.
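As a rough sketch of how [B-1] through [B-4] interact on the runner path (all names below are illustrative stand-ins, not the actual scheme_b_v334 API; logits are plain Python lists rather than tensors):

```python
NEG = -1e9  # [B-3]: finite mask value, so downstream CFG arithmetic stays finite

class Prefix(list):
    """Stand-in for the prefix tensor; [B-2] binds prompt_length onto it."""

class EarlyStarterMasker:
    def __init__(self, starter_ids, window=3):
        self.starter_ids = set(starter_ids)   # content-starter token ids
        self.window = window                  # early_starter_hard_mask_steps
        self.training = False

    def get_prefix(self, prompt_ids):
        prefix = Prefix([0.0] * 4)
        prefix.prompt_length = len(prompt_ids)  # [B-2]: metadata on the prefix
        return prefix

    def fwd(self, ids, prefix, logits):
        if self.training:                        # [B-4]: never mask in training
            return logits
        step = len(ids) - getattr(prefix, "prompt_length", len(ids))
        if 0 <= step < self.window:              # [B-1]: early-window hard mask
            return [NEG if i in self.starter_ids else v
                    for i, v in enumerate(logits)]
        return logits
```

The runner only ever calls _get_prefix() and fwd(), which is why binding the metadata onto the returned prefix is what lets the mask reach the hand-written decode loop.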

Policy conformance (spec §1, §5)

  • External runner only, byte-identical to v3.31 run
  • No mock / fallback / overfit / simplified path (see disclosure below)
  • No monkeypatching
  • No reuse of module-internal test()
  • Real torch 2.11 + transformers 5.5.4 + Qwen/Qwen2.5-1.5B-Instruct (bf16)
  • Fixed per-case seeds per spec §4

Per-case result (v3.31 → v3.32 → v3.33 → v3.34)

# Case Seed v3.31 v3.32 v3.33 v3.34
4.1 leaf_capacity_stability 0..7 PASS PASS PASS FAIL (TypeError, disclosed)
4.2 degenerate_direction_boundary 17 PASS PASS PASS PASS
4.3 metric_trainability 23 PASS PASS PASS PASS
4.4 no_grad_generation 29 PASS PASS PASS PASS
4.5 counterfactual_memory_influence 31 FAIL PASS PASS PASS
4.6 semantic_memory_grounding 33 FAIL PASS PASS PASS
4.7 semantic_memory_counterfactual_pairs 35 FAIL FAIL FAIL FAIL
4.8 degeneration_quality 36 FAIL PASS FAIL FAIL
4.9 prompt_diversity_without_memory 37 PASS PASS PASS PASS
4.10 prefix_logit_drift_audit 38 FAIL FAIL FAIL FAIL
4.11 retrieval_topk_semantic_shift 39 FAIL FAIL FAIL FAIL
4.12 repetition_segment_audit 40 PASS FAIL FAIL PASS
4.13 save_load_consistency 41 PASS PASS PASS PASS
4.14 training_cache_isolation 43 PASS FAIL PASS PASS
4.15 prefix_stepwise_drift_trajectory 44 FAIL FAIL FAIL PASS
4.16 retrieval_generation_alignment_audit 45 FAIL FAIL FAIL FAIL
4.17 retrieval_prefix_decode_correlation_audit 46 PASS PASS PASS PASS
4.18 cheating_heuristics 47 PASS PASS PASS PASS
4.19 stepwise_label_mass_alignment_audit 48 FAIL FAIL FAIL FAIL

What [B-1] / [B-2] achieved (the headline)

These two mechanisms target exactly the runner's hand-written stepwise decode path (_get_prefix() + fwd(ids, mask, prefix) + manual CFG), which previous fixes (A-1/A-2) could not reach because the runner doesn't call the new public APIs.
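A side note on why the mask value interacts with this path: the runner's manual CFG combines conditional and unconditional logits itself. Assuming a standard classifier-free-guidance rule (the exact formula in v331_blackbox_eval.py is not reproduced here), an -inf mask makes the combination NaN, while -1e9 stays finite, which is the [B-3] rationale:

```python
import math

def cfg_combine(cond, uncond, scale=1.5):
    # standard CFG rule (assumed): uncond + scale * (cond - uncond)
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

inf_masked = cfg_combine([float("-inf")], [float("-inf")])  # -inf - -inf = nan
finite_masked = cfg_combine([-1e9], [-1e9])                 # stays at -1e9
```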

Case 4.12 repetition_segment_audit: FAIL → PASS

Evidence from report.md:

"aggregate": {
  "bad_segment_ratio": 0.053,
  "total_segments": 19,
  "bad_segments": 1,
  "early_collapse_prompts": []
}

vs v3.33:

"aggregate": {
  "bad_segment_ratio": 0.375,
  "bad_segments": 3,
  "early_collapse_prompts": ["The telescope", "Explain the topic clearly"]
}

Spec thresholds: bad_segment_ratio ≤ 0.35, ≤1 early-collapse prompt. Both held with margin.
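The quoted thresholds can be restated as a small check (the aggregate keys mirror report.md; the pass rule itself is paraphrased from the spec text above):

```python
def repetition_segment_pass(agg, max_ratio=0.35, max_collapse=1):
    return (agg["bad_segment_ratio"] <= max_ratio
            and len(agg["early_collapse_prompts"]) <= max_collapse)

v334_aggregate = {"bad_segment_ratio": 0.053, "early_collapse_prompts": []}
v333_aggregate = {"bad_segment_ratio": 0.375,
                  "early_collapse_prompts": ["The telescope",
                                             "Explain the topic clearly"]}
```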

Case 4.15 prefix_stepwise_drift_trajectory: FAIL → PASS

"Key piano ideas include"   first_bad_step = 3
"Explain the topic clearly" first_bad_step = 3

vs v3.33: first_bad_step = 0 on both prompts. The spec passes if first_bad_step is absent or ≥ 3. Both rows hit exactly 3: the first three steps are all content-starters (the early_starter_hard_mask_steps=3 window), fully consistent with B-1's mask design.
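The 4.15 pass rule and its relation to the mask window can be restated schematically (an assumed shape, not the runner's actual code):

```python
def drift_trajectory_pass(first_bad_step, min_good_steps=3):
    """4.15 passes when first_bad_step is absent or >= 3."""
    return first_bad_step is None or first_bad_step >= min_good_steps

# With early_starter_hard_mask_steps = 3, steps 0..2 are hard-masked,
# so the earliest step that can go bad is step 3 -- matching both rows.
window = 3
first_possible_bad = window
```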

Case 4.14 training_cache_isolation stays PASS

{changed: [], memory_count: 8} — [B-4] worked: the training-mode bypass in fwd() preserves the recon gradient path, so Trainer.recon() still runs without touching memory bookkeeping.
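The isolation property that 4.14 verifies can be sketched as diffing memory bookkeeping around a training step (the dict-based system shape here is purely illustrative; the real check lives in v331_blackbox_eval.py):

```python
import copy

def run_isolated_training_step(system, training_step):
    """Return a 4.14-style report: which memory keys changed, and the count."""
    before = copy.deepcopy(system["memory"])
    training_step(system)
    after = system["memory"]
    changed = [k for k in before if before[k] != after.get(k)]
    return {"changed": changed, "memory_count": len(after)}
```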

Anti-cheating (4.18)

exact_same=False, prefix_only=False, too_short=False — passes are substantive, not shortcut-driven.

Known disclosure (spec §5)

Case 4.1 FAIL is a contract mismatch, not an algorithmic regression.

The spec runner (v331_blackbox_eval.py line 583) does:

violations = tree.leaf_size_violations()
...
passed = len(violations) == 0 and len(consistency) == 0

v3.34's DirectionTree.leaf_size_violations() returns int (a count), not a list/sequence. This trips len():

TypeError: object of type 'int' has no len()

Per the strict audit policy (no runner modifications, no source shims), this is recorded as an honest FAIL. Case 4.2 reads the value without len() and continues to pass.
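The mismatch reproduces in isolation. Both functions below are schematic stand-ins for DirectionTree.leaf_size_violations(): the int-returning v3.34 convention that trips len(), and a list-returning convention the runner's call site would accept:

```python
def leaf_size_violations_v334(leaf_sizes, cap):
    # v3.34 convention: a count (int)
    return sum(1 for s in leaf_sizes if s > cap)

def leaf_size_violations_runner_compatible(leaf_sizes, cap):
    # runner-compatible convention: the violating sizes themselves (list)
    return [s for s in leaf_sizes if s > cap]

count = leaf_size_violations_v334([3, 9, 4], cap=8)
try:
    len(count)  # what the runner does with the return value
    tripped = False
except TypeError:  # object of type 'int' has no len()
    tripped = True
```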

Also note: the 4.8 degeneration_quality FAIL is fragile to sampling noise; it previously flipped between PASS (v3.32) and FAIL (v3.33) from one stochastic decode producing a short prompt. The v3.34 run produced avg_content_token_ratio=0.818 (very high) and avg_repeated_bigram_ratio=0.216, slightly over the 0.20 threshold. The content metric is excellent; the repeated-bigram ratio is marginally over. This is a decode-sampling artifact rather than a systemic regression.
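For context on the borderline metric, one plausible definition of a repeated-bigram ratio (the audit's exact formula is not shown, so this is an assumption) is the fraction of bigrams already seen earlier in the sequence:

```python
def repeated_bigram_ratio(tokens):
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    seen, repeats = set(), 0
    for bg in bigrams:
        if bg in seen:
            repeats += 1
        else:
            seen.add(bg)
    return repeats / len(bigrams)
```

Under this definition a looping continuation like "a b a b a b" scores 0.6, while fully novel text scores 0.0; values just over 0.20 are consistent with occasional short loops rather than wholesale collapse.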

Artifacts

  • reports/v334_blackbox/report.json
  • reports/v334_blackbox/report.md
  • reports/v334_blackbox/runner.log

Reproduction

pip install torch transformers
git checkout AgentMemory/v334-blackbox-audit-7e97
PYTHONPATH=. python3 v331_blackbox_eval.py

Bottom line

v3.34's [B-1]/[B-2] successfully land the F-2 hard mask on the runner's own decode path, which is what the black-box audit actually exercises, flipping 4.12 and 4.15 from FAIL to PASS for the first time since the v3.31 baseline. The net pass count reaches 12/19, the best across all four versions. One known disclosure remains: the leaf_size_violations() signature mismatch with the spec runner makes 4.1 raise TypeError. Fixing it would mean either returning a list (a signature-convention change, probably the right move) or adopting an audit convention that never calls len() on the count; either way, it is outside the scope of this faithful-reproduction audit.


cursoragent and others added 2 commits April 19, 2026 11:14
scheme_b_v334.py contains the v3.34 code provided for the audit. Main
changes over v3.33:

  [B-1] fwd() applies F-2 content-starter hard mask on the runner path
        (eval mode + prefix with metadata + step within window).
  [B-2] _get_prefix() binds prompt_length as a tensor attribute on the
        returned prefix so fwd() can recover the decode step.
  [B-3] Hard mask value -1e9 (not -inf) to keep runner-side CFG finite.
  [B-4] training mode skips the mask to protect _recon_forward grads.
  [B-5] DirectionTree.max_depth() / leaf_size_violations() as public API.

AgentMemorySystem.py is a minimal pass-through over scheme_b_v334 so
the external runner (v331_blackbox_eval.py, unmodified) sees v3.34 as
the SUT.

Expected runner contract mismatch (disclosed, not patched):
  Spec case 4.1 does 'passed = len(violations) == 0 and len(consistency) == 0'.
  v3.34's DirectionTree.leaf_size_violations() returns int (a count),
  not a list. This mismatches the runner's len() call and 4.1 will
  fail with TypeError: object of type 'int' has no len().
  Per the audit policy (no runner modification, no source shims), this
  is recorded as an honest FAIL in the report and flagged in the PR.
  Case 4.2 only reads the value without len() and still passes.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (unmodified) against v3.34 as SUT,
under V331_BLACKBOX_TEST_SPEC.md policy.

Results (12/19 PASS, 7/19 FAIL):
  PASS: degenerate_direction_boundary, metric_trainability,
        no_grad_generation, counterfactual_memory_influence,
        semantic_memory_grounding, repetition_segment_audit,
        prefix_stepwise_drift_trajectory,
        retrieval_prefix_decode_correlation_audit,
        prompt_diversity_without_memory, save_load_consistency,
        training_cache_isolation, cheating_heuristics
  FAIL: leaf_capacity_stability (TypeError contract mismatch),
        semantic_memory_counterfactual_pairs, degeneration_quality,
        prefix_logit_drift_audit, retrieval_topk_semantic_shift,
        retrieval_generation_alignment_audit,
        stepwise_label_mass_alignment_audit

Evolution across versions (PASS count):
  v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12  <-- new best

Key wins from [B-1/B-2] (fwd()-path F-2 hard mask):
  4.12 repetition_segment_audit: bad_segment_ratio 0.375 -> 0.053,
       early_collapse_prompts [] (was ['The telescope',
       'Explain the topic clearly']).
  4.15 prefix_stepwise_drift_trajectory: first_bad_step 0 -> 3
       on both prompts. Spec passes if >= 3 or absent.

Regression / known disclosure:
  4.1 leaf_capacity_stability: FAIL with
      'TypeError: object of type int has no len()'.
      v3.34 leaf_size_violations() returns int; the spec runner
      does 'len(violations)'. Disclosed in the PR; not patched
      per policy.

Artifacts: reports/v334_blackbox/{report.json, report.md, runner.log}.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
The cursor bot changed the title from "v3.34 black-box audit (in progress) — same protocol as v3.31/v3.32/v3.33" to "v3.34 black-box audit: 12/19 PASS (best across v3.31-v3.34), 1262s CPU" on Apr 19, 2026.