v3.38 black-box audit: 15/19 original PASS (new best), 1/2/4 probe PASS/NI/FAIL #11

Draft
FluffyAIcode wants to merge 2 commits into main from AgentMemory/v338-blackbox-audit-7e97

FluffyAIcode (Owner) commented Apr 20, 2026

v3.38 Black-Box Audit — 26 cases, 1299.7s

Full run of v331_blackbox_eval.py (external runner) against scheme_b_v338.py as the SUT, under a strict policy of no mocks, no fallbacks, no overfitting, and no simplification. Original cases 4.1–4.19 are untouched; new cipher-system probes 4.20–4.26 are appended.

Headline

| Metric | Value |
|---|---|
| Original 19: PASS | 15 / 19 |
| Original 19: FAIL | 4 / 19 |
| Cipher probes (new): PASS | 1 / 7 |
| Cipher probes: not_implemented (honest, non-blocking) | 2 / 7 |
| Cipher probes: FAIL | 4 / 7 |
| Elapsed | 1299.7 s, CPU, Qwen2.5-1.5B-Instruct |

Version evolution (original-19 PASS count)

| Version | v3.31 | v3.32 | v3.33 | v3.34 | v3.35 | v3.36 | v3.37 | v3.38 |
|---|---|---|---|---|---|---|---|---|
| PASS / 19 | 10 | 11 | 10 | 12 | 13 | 12 | 14 | 15 |

Movement on original-19 from v3.37 to v3.38

  • 4.8 degeneration_quality FAIL → PASS (new: [D-4] anti-collapse). All metrics comfortably within thresholds.
  • All other 18 original cases unchanged vs. v3.37.

Cipher probes

| # | Probe | Result | Key metric |
|---|---|---|---|
| 4.20 | rerank_stability_probe | FAIL | Jaccard=1.0 on both paraphrase pairs (retrieval is stable at top-1), but the probe requires Spearman ≥ 0.5 on the shared top-5, and upstream gating shrank top-5 to length 1. Spearman is undefined on a single element (the runner reports 0.0). Diagnostic of over-filtering, not instability. |
| 4.21 | decode_repetition_feedback_probe | PASS | avg_max_repeat=1.67, no bigram repeats, no trigram locks. [D-4] verified end-to-end. |
| 4.22 | functional_token_suppression_probe | FAIL | Margin condition met on 0/3; avg_starter_delta=0.67 (need ≥ 1.5). L_functional_suppression is a training-time loss; this audit runs eval-only, so the bridge has not yet been optimized against it. |
| 4.23 | keyword_specific_tail_slot_probe | FAIL | mean_intersection=0 over 4 memories. Same root cause as 4.22: tail[1]'s rare-keyword KL needs a training gradient, which this audit did not supply. |
| 4.24 | context_descriptor_cluster_probe | NOT_IMPLEMENTED (honest) | missing_api: MemEntry.context_descriptor field. v3.38 implements [D-3] as a per-query aggregate (MemLLM._compute_context_descriptor), not a per-memory stored field. The spec wording is per-memory, so this is reported as not_implemented per Section 5. |
| 4.25 | prefix_length_scaling_probe | FAIL | L_mem 8 → 16 gave 3 → 2 content starters in top-12 (expected +1). Slot-norm band ratio=1.001 (passes). This suggests capacity is not the bottleneck in the current training regime: the bridge is under-fit, not under-sized. |
| 4.26 | mixture_distribution_gate_probe | NOT_IMPLEMENTED (honest) | missing_api: DecodeContext.mixture_gate. v3.38 shapes logits additively (content_bias + suppression_bias + CFG). No convex mixture gate exists. |
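The 4.20 outcome (Jaccard=1.0 but no usable Spearman) follows from how rank correlation behaves on very short lists. The sketch below is illustrative only and is not the runner's code: the function names are hypothetical, and it assumes unique memory ids with no tied ranks. It shows why a shared top-k of length 1 yields perfect set overlap while leaving rank correlation undefined.

```python
from typing import Hashable, Optional, Sequence


def jaccard(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Set overlap of two result lists; two empty lists count as identical."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def spearman_on_shared(a: Sequence[Hashable], b: Sequence[Hashable]) -> Optional[float]:
    """Spearman rho over items present in both lists.

    Assumes unique ids (no ties). Returns None when fewer than two items
    are shared, because rank correlation is undefined in that case.
    """
    in_b = set(b)
    shared = [x for x in a if x in in_b]  # shared items, in a's order
    n = len(shared)
    if n < 2:
        return None
    shared_set = set(shared)
    rank_a = {x: i for i, x in enumerate(shared)}
    rank_b = {x: i for i, x in enumerate(y for y in b if y in shared_set)}
    d2 = sum((rank_a[x] - rank_b[x]) ** 2 for x in shared)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical over-filtered retrieval: both paraphrases return one memory.
top_a, top_b = ["m3"], ["m3"]
print(jaccard(top_a, top_b))             # full set overlap
print(spearman_on_shared(top_a, top_b))  # None: nothing to rank
```

Returning `None` rather than a number keeps "undefined" distinguishable from "uncorrelated"; a runner that coerces it to 0.0 will fail any threshold check, which matches the behaviour reported for 4.20.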

Interpretation

The structural pieces [D-3] and [D-4] delivered immediate eval-time wins (4.8 flipped, 4.21 passes, 4.10 drift audit stable). [D-1] and [D-2] are training-time losses: the wiring is in place (verified by `python -c` smoke checks on loss gradient flow, and by the fact that the probes compute without error), but the audit runner executes eval-only against the fresh initialization. Under the strict no-training policy, their effect is therefore expected to be invisible to the probes.

This cleanly separates two audit signals:

  1. Decode-time structural fixes ([D-3], [D-4]): verified right now.
  2. Training-time structural fixes ([D-1], [D-2]): wiring verified; convergence verification requires a training-phase audit, which this runner does not perform.
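To make the distinction concrete, here is a minimal sketch of a hinge-style margin loss. It is not the actual L_functional_suppression from scheme_b_v338.py, whose exact form is not shown here. It assumes a scalar `delta` (standing in for a quantity like avg_starter_delta) and reuses the probe threshold 1.5 as the margin; both choices are assumptions made for illustration.

```python
def hinge_margin_loss(delta: float, margin: float = 1.5) -> float:
    """Zero once delta clears the margin, linear penalty below it."""
    return max(0.0, margin - delta)


def hinge_margin_grad(delta: float, margin: float = 1.5) -> float:
    """d(loss)/d(delta): -1 while the margin is violated, else 0."""
    return -1.0 if delta < margin else 0.0


def train(delta: float, steps: int, lr: float = 0.1) -> float:
    """Toy gradient descent on a single hypothetical trainable scalar."""
    for _ in range(steps):
        delta -= lr * hinge_margin_grad(delta)
    return delta


eval_only = 0.67                  # no optimizer steps: value never moves
trained = train(0.67, steps=10)   # gradient pushes delta past the margin
print(hinge_margin_loss(eval_only), hinge_margin_loss(trained))
```

The loss yields a non-zero gradient whenever the margin is violated, so the wiring can be verified without training, but the measured margin only changes once optimizer steps are actually taken.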

Artifacts

  • reports/v338_blackbox/report.json — full structured results (8.6k lines)
  • reports/v338_blackbox/report.md — per-case markdown (includes all 26)
  • reports/v338_blackbox/runner.log — raw stdout from the 1299.7 s run

cursoragent and others added 2 commits April 20, 2026 05:23
[Implementation v3.38]
Adds scheme_b_v338.py with four [D-*] structural upgrades on top of v3.37:
  [D-1] L_functional_suppression hinge loss on prefix-injected logits.
  [D-2] Rare-keyword tail slot: tail[1] KL-target is IDF-top-K strict starters.
  [D-3] Context descriptor: per-query weighted aggregate of retrieved
        memory semantic_emb vectors, projected via ContextHead into one
        prefix slot.
  [D-4] Anti-collapse: per-token content_bias history decay + degeneration
        detector (low unique_ratio triggers global bias dampen).
All [C-1..C-6] from v3.37 preserved. AgentMemorySystem.py redirects to v3.38.
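
As a rough illustration of the [D-4] idea (not the code in scheme_b_v338.py), the sketch below combines a per-token history decay with a degeneration detector based on the unique-token ratio. Every name and constant here (`decay`, `window`, `min_unique`, `dampen`) is a hypothetical choice for this example.

```python
from collections import Counter
from typing import Dict, List


def unique_ratio(tokens: List[int], window: int = 16) -> float:
    """Fraction of distinct tokens in the most recent window."""
    recent = tokens[-window:]
    return len(set(recent)) / len(recent) if recent else 1.0


def shaped_bias(
    content_bias: Dict[int, float],
    history: List[int],
    decay: float = 0.5,
    window: int = 16,
    min_unique: float = 0.5,
    dampen: float = 0.25,
) -> Dict[int, float]:
    """Decay each token's bias by how often it was already emitted, and
    dampen all biases when recent output looks degenerate."""
    counts = Counter(history)
    scale = dampen if unique_ratio(history, window) < min_unique else 1.0
    return {
        tok: bias * (decay ** counts[tok]) * scale
        for tok, bias in content_bias.items()
    }


bias = {7: 2.0, 8: 2.0}
print(shaped_bias(bias, [7, 7, 3, 4]))  # token 7 decayed, no global dampen
print(shaped_bias(bias, [7, 7, 7, 7]))  # collapsed history: all biases dampened
```

Both mechanisms run purely at decode time, which is consistent with their effect showing up in an eval-only audit.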

[Runner v331_blackbox_eval.py]
Extends the external audit runner with seven Cipher-System Structural
Probes (4.20-4.26) as specified in V331_BLACKBOX_TEST_SPEC.md section 4.
Original 19 cases are not modified.

Gating rules implemented per spec section 4-meta:
  - hard_PASS probes (4.20, 4.21, 4.22, 4.25) block suite PASS on failure.
  - PASS_or_not_implemented probes (4.23, 4.24, 4.26) do not block on
    honest 'not_implemented', but do block on 'fail'.
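
A minimal sketch of the gating rules above, applied to the outcomes reported for this run. The function and constant names are hypothetical, and treating 'not_implemented' on a hard_PASS probe as blocking is an assumption; the spec text quoted here only says that failures block.

```python
HARD_PASS = {"4.20", "4.21", "4.22", "4.25"}
PASS_OR_NOT_IMPLEMENTED = {"4.23", "4.24", "4.26"}


def blocks_suite(probe_id: str, result: str) -> bool:
    """Return True if this probe outcome prevents a suite-level PASS."""
    if result == "pass":
        return False
    if result == "not_implemented":
        # Honest not_implemented is tolerated only for the second group
        # (assumption: it blocks for hard_PASS probes).
        return probe_id not in PASS_OR_NOT_IMPLEMENTED
    return True  # 'fail' blocks in both groups


# Outcomes as reported for v3.38.
results = {
    "4.20": "fail", "4.21": "pass", "4.22": "fail", "4.23": "fail",
    "4.24": "not_implemented", "4.25": "fail", "4.26": "not_implemented",
}
blocking = sorted(p for p, r in results.items() if blocks_suite(p, r))
print(blocking)
```

Under these rules, 4.23 blocks even though it belongs to the lenient group, because it emitted 'fail' rather than 'not_implemented'.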

Honesty notes (truthful not_implemented emissions):
  - 4.24 context_descriptor_cluster_probe: spec wording requires per-
    MemEntry stored descriptor; v3.38 exposes per-query descriptor only,
    so field 'MemEntry.context_descriptor' is absent -> not_implemented.
  - 4.26 mixture_distribution_gate_probe: v3.38 uses additive CFG, not a
    convex mixture; DecodeContext.mixture_gate absent -> not_implemented.

No mocks, no fallbacks, no overfit, no simplified replacement paths.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (no mocks, no fallbacks, no overfit,
no simplification) against v3.38 as SUT.

Original cases 4.1-4.19:  PASS=15, FAIL=4
Cipher probes 4.20-4.26:  PASS=1, NOT_IMPLEMENTED=2, FAIL=4

Version evolution (original-19 PASS count):
  v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12,
  v3.35: 13, v3.36: 12, v3.37: 14, v3.38: 15 (new best)

Key movements on original-19 from v3.37 to v3.38:
  4.8 degeneration_quality FAIL -> PASS ([D-4] anti-collapse worked).
  All other original cases unchanged vs v3.37.

Cipher probe outcomes:
  4.20 rerank_stability_probe: FAIL. Jaccard=1.0 on both pairs (top-1
    identical for both paraphrases) but Spearman is reported as 0.0
    because top-5 had only 1 element under upstream-gate filtering; the
    probe requires Spearman>=0.5, which cannot be computed on a length-1
    intersection. Diagnostic of retrieval over-filtering, not instability.
  4.21 decode_repetition_feedback_probe: PASS. avg_max_repeat=1.67,
    no bigram repeats, no trigram locks. [D-4] working end-to-end.
  4.22 functional_token_suppression_probe: FAIL at margin (0/3). v3.38
    added L_functional_suppression as a training-time loss, but this
    audit runs eval-only without the training loop, so the bridge has
    never been optimized against this margin. Correctly detected.
  4.23 keyword_specific_tail_slot_probe: FAIL. mean_intersection=0,
    hit_ratio=0 over 4 memories. Same root cause: tail[1]'s rare-keyword
    KL target has no gradient without training.
  4.24 context_descriptor_cluster_probe: NOT_IMPLEMENTED (honest).
    Spec wording requires per-MemEntry stored descriptor; v3.38 has
    per-query aggregate only (MemLLM._compute_context_descriptor).
  4.25 prefix_length_scaling_probe: FAIL. L_mem 8 -> 16 gave 3 -> 2
    content starters (not the expected +1 improvement). Slot-norm band
    passed (ratio=1.001). This signals that capacity is not currently
    the bottleneck; an undertrained bridge is.
  4.26 mixture_distribution_gate_probe: NOT_IMPLEMENTED (honest).
    v3.38 uses additive shaping, not convex mixture.
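
To illustrate the difference that makes 4.26 not_implemented, the sketch below contrasts additive logit shaping with a convex mixture of two distributions. It is a toy example with hypothetical function names, not the v3.38 decode path; in particular, it models the CFG term as just another additive vector, which is an assumption.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def additive_shaping(logits: Sequence[float], *biases: Sequence[float]) -> List[float]:
    """Sum bias vectors into the logits, then normalize once."""
    shaped = [l + sum(b[i] for b in biases) for i, l in enumerate(logits)]
    return softmax(shaped)


def convex_mixture(p_lm: Sequence[float], p_mem: Sequence[float], gate: float) -> List[float]:
    """Blend two already-normalized distributions with gate in [0, 1]."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * m + (1.0 - gate) * l for l, m in zip(p_lm, p_mem)]


logits = [1.0, 2.0, 3.0]
content_bias = [0.5, 0.0, 0.0]
suppression_bias = [0.0, 0.0, -0.5]
print(additive_shaping(logits, content_bias, suppression_bias))
print(convex_mixture(softmax(logits), [0.6, 0.3, 0.1], gate=0.3))
```

Additive shaping has no single scalar that interpolates between two distributions, so there is nothing for a probe looking for a mixture gate to read.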

Interpretation: the structural pieces for [D-1]/[D-2] landed cleanly
and wire through the graph, but require training epochs to actually
converge the hinge margin and rare-keyword KL. [D-3] is per-query,
so 4.24 is a spec-wording mismatch, not a regression. [D-4] delivered
immediately because it is pure decode-time logic, hence the 4.8 PASS
and the 4.21 probe PASS.
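
The 4.24 mismatch can be pictured with a small sketch that contrasts a per-query aggregate with a per-memory stored field. `MemEntry` and `semantic_emb` are names taken from the notes above, but the class layout, the weighting scheme, and the probe check below are all hypothetical; the ContextHead projection is omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MemEntry:
    text: str
    semantic_emb: List[float]  # note: no stored context_descriptor field


def compute_context_descriptor(
    entries: Sequence[MemEntry], weights: Sequence[float]
) -> List[float]:
    """Per-query descriptor: weighted mean of retrieved embeddings."""
    total = sum(weights)
    if not entries or total <= 0:
        raise ValueError("need at least one entry and positive total weight")
    dim = len(entries[0].semantic_emb)
    return [
        sum(w * e.semantic_emb[i] for e, w in zip(entries, weights)) / total
        for i in range(dim)
    ]


def probe_status(entry: MemEntry) -> str:
    """A per-memory probe can only run if the stored field exists."""
    return "ready" if hasattr(entry, "context_descriptor") else "not_implemented"


retrieved = [MemEntry("a", [1.0, 0.0]), MemEntry("b", [0.0, 1.0])]
print(compute_context_descriptor(retrieved, weights=[3.0, 1.0]))
print(probe_status(retrieved[0]))
```

The descriptor exists, but only as a value computed per query, so a probe that inspects individual memory entries finds nothing to cluster.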

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot changed the title from "v3.38 black-box audit (with cipher-system probes 4.20–4.26)" to "v3.38 black-box audit: 15/19 original PASS (new best), 1/2/4 probe PASS/NI/FAIL" on Apr 20, 2026