v3.38 black-box audit — 15/19 original PASS (new best), 1/2/4 probe PASS/NI/FAIL by FluffyAIcode · Pull Request #11 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-20T05:23:54Z

v3.38 Black-Box Audit — 26 cases, 1299.7s

Full run of v331_blackbox_eval.py (external runner) against scheme_b_v338.py as SUT, under a strict no-mocks / no-fallbacks / no-overfit / no-simplification policy. Original cases 4.1–4.19 are untouched; new cipher-system probes 4.20–4.26 are appended.

Headline

| Metric | Value |
|---|---|
| Original 19: PASS | 15 / 19 |
| Original 19: FAIL | 4 / 19 |
| Cipher probes (new): PASS | 1 / 7 |
| Cipher probes: `not_implemented` (honest, non-blocking) | 2 / 7 |
| Cipher probes: FAIL | 4 / 7 |
| Elapsed | 1299.7 s, CPU, Qwen2.5-1.5B-Instruct |

Version evolution (original-19 PASS count)

| Version | v3.31 | v3.32 | v3.33 | v3.34 | v3.35 | v3.36 | v3.37 | v3.38 |
|---|---|---|---|---|---|---|---|---|
| PASS / 19 | 10 | 11 | 10 | 12 | 13 | 12 | 14 | 15 |

Movement on original-19 from v3.37 to v3.38

4.8 degeneration_quality FAIL → PASS (new: [D-4] anti-collapse). All metrics comfortably within thresholds.
All other 18 original cases unchanged vs. v3.37.
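The commit message describes [D-4] as per-token content_bias history decay plus a degeneration detector in which a low unique_ratio triggers a global bias dampen. The following is a minimal sketch of that idea, not the v3.38 code: the class name `AntiCollapseBias`, the parameters `decay`, `window`, `min_unique_ratio`, and `dampen`, and the multiplicative form of the decay are all assumptions made for illustration.

```python
from collections import defaultdict


class AntiCollapseBias:
    """Hypothetical [D-4]-style rule: decay a token's bias each time it is
    emitted, and dampen all biases when recent output looks degenerate."""

    def __init__(self, decay=0.5, window=16, min_unique_ratio=0.5, dampen=0.25):
        self.decay = decay                        # assumed multiplicative decay per prior use
        self.window = window                      # number of recent tokens inspected
        self.min_unique_ratio = min_unique_ratio  # below this, output counts as degenerate
        self.dampen = dampen                      # global scale applied when degenerate
        self.use_count = defaultdict(int)
        self.history = []

    def observe(self, token_id):
        """Record a token that was just emitted."""
        self.use_count[token_id] += 1
        self.history.append(token_id)

    def unique_ratio(self):
        """Fraction of distinct tokens among the most recent `window` tokens."""
        recent = self.history[-self.window:]
        if not recent:
            return 1.0
        return len(set(recent)) / len(recent)

    def is_degenerate(self):
        # Only judge once a full window of tokens has been seen.
        return (len(self.history) >= self.window
                and self.unique_ratio() < self.min_unique_ratio)

    def adjust(self, token_id, base_bias):
        """Return the content bias for `token_id` after decay and dampening."""
        bias = base_bias * (self.decay ** self.use_count[token_id])
        if self.is_degenerate():
            bias *= self.dampen
        return bias
```

Because both steps are pure decode-time logic, a rule of this shape would take effect without any training, which is consistent with 4.8 flipping in an eval-only run.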

Cipher probes

| # | Probe | Result | Key metric |
|---|---|---|---|
| 4.20 | `rerank_stability_probe` | FAIL | Jaccard=1.0 on both paraphrase pairs (retrieval is stable at top-1), but the probe requires Spearman ≥ 0.5 on the shared top-5, and upstream gating shrank the top-5 to length 1, on which Spearman is undefined. Diagnostic of over-filtering, not instability. |
| 4.21 | `decode_repetition_feedback_probe` | PASS | avg_max_repeat=1.67, no bigram repeats, no trigram locks. [D-4] end-to-end verified. |
| 4.22 | `functional_token_suppression_probe` | FAIL | Margin condition 0/3; avg_starter_delta=0.67 (need ≥1.5). `L_functional_suppression` is a training-time loss; this audit runs eval-only, so the bridge has not yet been optimized against it. |
| 4.23 | `keyword_specific_tail_slot_probe` | FAIL | mean_intersection=0 over 4 memories. Same root cause as 4.22: tail[1]'s rare-keyword KL needs training gradient, which this audit did not supply. |
| 4.24 | `context_descriptor_cluster_probe` | NOT_IMPLEMENTED (honest) | `missing_api: MemEntry.context_descriptor field`. v3.38 implements [D-3] as a per-query aggregate (`MemLLM._compute_context_descriptor`), not a per-memory stored field. Spec wording is per-memory, so reported as not_implemented per Section 5. |
| 4.25 | `prefix_length_scaling_probe` | FAIL | L_mem 8 → 16 gave 3 → 2 content starters in top-12 (expected +1). Slot-norm band ratio=1.001 (passes). This suggests capacity is not the bottleneck in the current training regime: the bridge is under-fit, not under-sized. |
| 4.26 | `mixture_distribution_gate_probe` | NOT_IMPLEMENTED (honest) | `missing_api: DecodeContext.mixture_gate`. v3.38 shapes logits additively (content_bias + suppression_bias + CFG). No convex mixture gate exists. |
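The 4.20 result turns on a property of rank correlation: Spearman needs at least two shared items, so a length-1 top-k cannot satisfy a Spearman threshold even when Jaccard is perfect. A small sketch of the two metrics follows. The function names and the choice to return `None` for the undefined case are assumptions, and the lists are assumed to contain no duplicate ids. The commit message reports Spearman=0.0 for this case; how the runner represents an undefined value internally is not stated here.

```python
def jaccard(a, b):
    """Set overlap of two retrieval lists; 1.0 when both are empty."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def spearman_on_shared(rank_a, rank_b):
    """Spearman rho over the items present in both ranked lists.

    Returns None when fewer than two items are shared, because the
    coefficient is undefined there (the denominator n*(n^2-1) is zero).
    """
    in_b = set(rank_b)
    shared = [x for x in rank_a if x in in_b]
    n = len(shared)
    if n < 2:
        return None
    # Re-rank each list within the shared subset, preserving original order.
    pos_a = {x: i for i, x in enumerate(shared)}
    pos_b = {x: i for i, x in enumerate(y for y in rank_b if y in pos_a)}
    d2 = sum((pos_a[x] - pos_b[x]) ** 2 for x in shared)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy case mirroring the shape of the 4.20 failure: one surviving memory per query.
top_a, top_b = ["m1"], ["m1"]
print(jaccard(top_a, top_b))             # 1.0
print(spearman_on_shared(top_a, top_b))  # None
```

Under this reading, any threshold check on Spearman fails by construction once gating leaves a single candidate, which matches the "over-filtering, not instability" diagnosis.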

Interpretation

The structural pieces [D-3] and [D-4] delivered immediate eval-time wins (4.8 flipped, 4.21 passes, 4.10 drift audit stable). [D-1] and [D-2] are training-time losses: the wiring is in place (verified by `python -c` smoke checks on loss gradient flow, and by the fact that the probes compute without error), but the audit runner executes eval-only against a fresh initialization. Under the strict no-training policy, their effect is therefore invisible to the probes, as expected.

This cleanly separates two audit signals:

Decode-time structural fixes ([D-3], [D-4]): verified right now.
Training-time structural fixes ([D-1], [D-2]): wiring verified; convergence verification requires a training-phase audit, which this runner does not perform.
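On the decode-time side, probe 4.21 reports avg_max_repeat together with bigram and trigram repetition. The sketch below shows one plausible way to compute such numbers. The exact definitions used by the runner are not given in this PR, so treating max_repeat as the longest run of consecutive identical tokens is an assumption, as are the function names.

```python
from collections import Counter


def max_consecutive_repeat(tokens):
    """Length of the longest run of identical adjacent tokens (0 if empty)."""
    best = run = 0
    prev = object()  # sentinel that equals no real token
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best


def repeated_ngrams(tokens, n):
    """Map each n-gram that occurs more than once to its count."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count > 1}


def avg_max_repeat(samples):
    """Mean of max_consecutive_repeat over several generated samples."""
    return sum(max_consecutive_repeat(s) for s in samples) / len(samples)


# Toy samples, not data from the audit.
samples = [["the", "cat", "sat"], ["go", "go", "now"]]
print(avg_max_repeat(samples))               # 1.5
print(repeated_ngrams(["a", "b", "a", "b"], 2))  # {('a', 'b'): 2}
```

Metrics of this kind depend only on generated text, so they can be checked in an eval-only run, unlike the margin and KL targets behind [D-1] and [D-2].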

Artifacts

reports/v338_blackbox/report.json — full structured results (8.6k lines)
reports/v338_blackbox/report.md — per-case markdown (includes all 26)
reports/v338_blackbox/runner.log — raw stdout from the 1299.7 s run

[Implementation v3.38]
Adds scheme_b_v338.py with four [D-*] structural upgrades on top of v3.37:
[D-1] L_functional_suppression hinge loss on prefix-injected logits.
[D-2] Rare-keyword tail slot: tail[1] KL-target is IDF-top-K strict starters.
[D-3] Context descriptor: per-query weighted aggregate of retrieved memory semantic_emb vectors, projected via ContextHead into one prefix slot.
[D-4] Anti-collapse: per-token content_bias history decay + degeneration detector (low unique_ratio triggers global bias dampen).
All [C-1..C-6] from v3.37 preserved. AgentMemorySystem.py redirects to v3.38.

[Runner v331_blackbox_eval.py]
Extends the external audit runner with seven Cipher-System Structural Probes (4.20-4.26) as specified in V331_BLACKBOX_TEST_SPEC.md section 4. Original 19 cases are not modified.

Gating rules implemented per spec section 4-meta:
- hard_PASS probes (4.20, 4.21, 4.22, 4.25) block suite PASS on failure.
- PASS_or_not_implemented probes (4.23, 4.24, 4.26) do not block on honest 'not_implemented', but do block on 'fail'.

Honesty notes (truthful not_implemented emissions):
- 4.24 context_descriptor_cluster_probe: spec wording requires per-MemEntry stored descriptor; v3.38 exposes per-query descriptor only, so field 'MemEntry.context_descriptor' is absent -> not_implemented.
- 4.26 mixture_distribution_gate_probe: v3.38 uses additive CFG, not a convex mixture; DecodeContext.mixture_gate absent -> not_implemented.

No mocks, no fallbacks, no overfit, no simplified replacement paths.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
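The gating rules in this commit message can be sketched as a small function. This is an illustration, not the runner's code: the names `HARD_PASS`, `PASS_OR_NI`, and `blocking_probes`, and the lowercase status string 'pass', are assumptions. The commit message does not say how a hard_PASS probe reporting 'not_implemented' is treated; the sketch assumes anything other than 'pass' blocks.

```python
# Probe groups as listed in the commit message (spec section 4-meta).
HARD_PASS = {"4.20", "4.21", "4.22", "4.25"}
PASS_OR_NI = {"4.23", "4.24", "4.26"}


def blocking_probes(results):
    """Return the probe ids that block suite PASS.

    `results` maps a probe id to 'pass', 'fail', or 'not_implemented'.
    """
    blockers = []
    for probe_id, status in sorted(results.items()):
        if probe_id in HARD_PASS and status != "pass":
            # Assumption: a hard_PASS probe blocks on any non-pass status.
            blockers.append(probe_id)
        elif probe_id in PASS_OR_NI and status == "fail":
            # An honest 'not_implemented' is tolerated; 'fail' is not.
            blockers.append(probe_id)
    return blockers


# Statuses taken from the cipher-probe results reported in this PR.
v338 = {
    "4.20": "fail", "4.21": "pass", "4.22": "fail", "4.23": "fail",
    "4.24": "not_implemented", "4.25": "fail", "4.26": "not_implemented",
}
print(blocking_probes(v338))  # ['4.20', '4.22', '4.23', '4.25']
```

Applied to the reported statuses, the four FAIL probes block and the two not_implemented probes do not, which agrees with the "honest, non-blocking" label in the headline.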

Full run of v331_blackbox_eval.py (no mocks, no fallbacks, no overfit, no simplification) against v3.38 as SUT.

Original cases 4.1-4.19: PASS=15, FAIL=4
Cipher probes 4.20-4.26: PASS=1, NOT_IMPLEMENTED=2, FAIL=4

Version evolution (original-19 PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13, v3.36: 12, v3.37: 14, v3.38: 15 (new best)

Key movements on original-19 from v3.37 to v3.38:
- 4.8 degeneration_quality FAIL -> PASS ([D-4] anti-collapse worked).
- All other original cases unchanged vs v3.37.

Cipher probe outcomes:
- 4.20 rerank_stability_probe: FAIL. Jaccard=1.0 on both pairs (top-1 identical for both paraphrases) but Spearman=0.0 because top-5 had only 1 element under upstream-gate filtering; the probe requires Spearman>=0.5 and cannot be computed on a length-1 intersection. Diagnostic of retrieval over-filtering, not instability.
- 4.21 decode_repetition_feedback_probe: PASS. avg_max_repeat=1.67, no bigram repeats, no trigram locks. [D-4] working end-to-end.
- 4.22 functional_token_suppression_probe: FAIL at margin (0/3). v3.38 added L_functional_suppression as a training-time loss, but this audit runs eval-only without the training loop, so the bridge has never been optimized against this margin. Correctly detected.
- 4.23 keyword_specific_tail_slot_probe: FAIL. mean_intersection=0, hit_ratio=0 over 4 memories. Same root cause: tail[1]'s rare-keyword KL target has no gradient without training.
- 4.24 context_descriptor_cluster_probe: NOT_IMPLEMENTED (honest). Spec wording requires per-MemEntry stored descriptor; v3.38 has per-query aggregate only (MemLLM._compute_context_descriptor).
- 4.25 prefix_length_scaling_probe: FAIL. L_mem 8 -> 16 gave 3 -> 2 content starters (not the expected +1 improvement). Slot-norm band passed (ratio=1.001). Signals that capacity is not currently the bottleneck; bridge undertrained is.
- 4.26 mixture_distribution_gate_probe: NOT_IMPLEMENTED (honest). v3.38 uses additive shaping, not convex mixture.

Interpretation: the structural pieces for [D-1]/[D-2] landed cleanly and wire through the graph, but require training epochs to actually converge the hinge margin and rare-keyword KL. [D-3] is per-query, so 4.24 is a spec-wording mismatch, not a regression. [D-4] delivered immediately because it is pure decode-time logic, hence the 4.8 PASS and the 4.21 probe PASS.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 2 commits April 20, 2026 05:23

cursor Bot changed the title ~~v3.38 black-box audit (with cipher-system probes 4.20–4.26)~~ v3.38 black-box audit — 15/19 original PASS (new best), 1/2/4 probe PASS/NI/FAIL Apr 20, 2026

FluffyAIcode wants to merge 2 commits into main from AgentMemory/v338-blackbox-audit-7e97
