v3.37 black-box audit: 14/19 PASS (new best), 4.13 retrieval fix confirmed by FluffyAIcode · Pull Request #9 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-19T15:39:05Z

v3.37 Black-Box Audit — 14/19 PASS (new best across all audited versions)

Switches the System Under Test to scheme_b_v337.py and runs the unmodified v331_blackbox_eval.py against it under the V331_BLACKBOX_TEST_SPEC.md policy: no monkeypatching, no source modification of the runner, no mocked returns, honest failures.

Summary

| Metric | Value |
| --- | --- |
| Passed | 14 / 19 |
| Failed | 5 / 19 |
| Elapsed | 1099.4 s (CPU, Qwen/Qwen2.5-1.5B-Instruct) |
| Artifacts | `reports/v337_blackbox/{report.json,report.md,runner.log}` |

Version evolution (PASS count)

| Version | v3.31 | v3.32 | v3.33 | v3.34 | v3.35 | v3.36 | v3.37 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PASS / 19 | 10 | 11 | 10 | 12 | 13 | 12 | 14 |

Fixes in v3.37

[C-5] IDF-weighted content bias — every token's contribution to the prefix's content_bias tensor is now multiplied by its corpus IDF (clamped to [idf_floor, idf_bias_max_boost=3.0]). Rare domain tokens get ~2× the boost of high-frequency repeaters, pushing them into the decoder's top-k.
[C-6] Multi-signal DirectionTree.retrieve — a PyTorch forward_pre_hook on the backbone captures the most recent query ids into amm._last_query_ids. tree.retrieve(qdir, bw) now runs:
1. beam search recall (unchanged, signature preserved)
2. per-candidate IDF-weighted centroid cosine + forward max-sim
3. combined rerank 0.2·dir + 0.4·centroid + 0.4·fwd
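The combined rerank in step 3 can be sketched as below. This is a minimal illustration, assuming each candidate already carries three similarity scores on comparable scales. `Candidate`, `combined_score`, and `rerank` are hypothetical names, not taken from scheme_b_v337.py; only the 0.2 / 0.4 / 0.4 weights come from the description above.

```python
from dataclasses import dataclass

# Weights from the PR description: 0.2*dir + 0.4*centroid + 0.4*fwd.
W_DIR, W_CENTROID, W_FWD = 0.2, 0.4, 0.4


@dataclass
class Candidate:
    # Hypothetical container; the real tree entries may be shaped differently.
    mem_id: int
    dir_sim: float       # direction similarity from the beam search recall
    centroid_sim: float  # IDF-weighted centroid cosine
    fwd_sim: float       # forward max-sim


def combined_score(c: Candidate) -> float:
    """Weighted sum of the three retrieval signals."""
    return W_DIR * c.dir_sim + W_CENTROID * c.centroid_sim + W_FWD * c.fwd_sim


def rerank(cands):
    """Return candidates sorted by combined score, best first."""
    return sorted(cands, key=combined_score, reverse=True)


# A candidate that wins on direction alone loses to one that is
# supported by the centroid and forward signals.
example = [Candidate(0, 0.9, 0.2, 0.1), Candidate(1, 0.5, 0.8, 0.7)]
print([c.mem_id for c in rerank(example)])
```

Because direction similarity carries only 0.2 of the weight, a candidate recalled mainly by direction can be outranked by one with stronger content overlap, which is the behaviour the 4.13 result suggests.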

All prior fixes retained: [C-4] _mem_guidance_active gate, [C-1..3], [A-*], [B-*].
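The [C-5] weighting can be illustrated with a small sketch of clamped IDF scaling. Several things here are assumptions: the smoothed IDF formula, the `IDF_FLOOR` value (the description names `idf_floor` but gives no value), and all function names. Only `idf_bias_max_boost=3.0` comes from the description.

```python
import math
from collections import Counter

IDF_BIAS_MAX_BOOST = 3.0  # from the PR description
IDF_FLOOR = 1.0           # assumed value; the PR does not state it


def corpus_idf(docs):
    """Smoothed IDF per token over tokenized documents (formula is an assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((n + 1) / (d + 1)) + 1.0 for tok, d in df.items()}


def clamped_idf(idf, tok):
    """Clamp a token's IDF to [IDF_FLOOR, IDF_BIAS_MAX_BOOST]; unseen tokens get the floor."""
    return min(max(idf.get(tok, IDF_FLOOR), IDF_FLOOR), IDF_BIAS_MAX_BOOST)


def weighted_bias(base_bias, idf):
    """Multiply each token's base bias contribution by its clamped IDF."""
    return {tok: b * clamped_idf(idf, tok) for tok, b in base_bias.items()}


docs = [
    ["the", "piano", "nocturne"],
    ["the", "rocket", "orbit"],
    ["the", "piano", "concert"],
]
idf = corpus_idf(docs)
print(weighted_bias({"the": 1.0, "piano": 1.0, "nocturne": 1.0}, idf))
```

In this toy corpus the token that appears in every document keeps its base contribution, while rarer tokens are scaled up, and the clamp stops any single token from dominating the bias.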

Case-by-case results

| # | Case | v3.36 | v3.37 | Notes |
| --- | --- | --- | --- | --- |
| 4.1 | leaf_capacity_stability | PASS | PASS | |
| 4.2 | degenerate_direction_boundary | PASS | PASS | |
| 4.3 | metric_trainability | PASS | PASS | |
| 4.4 | no_grad_generation | PASS | PASS | |
| 4.5 | counterfactual_memory_influence | PASS | PASS | |
| 4.6 | semantic_memory_grounding | PASS | FAIL | regression: Chinese tangents after content-first tokens |
| 4.7 | semantic_memory_counterfactual_pairs | FAIL | FAIL | keyword-list / Qwen vocab distribution mismatch; IDF mitigates but not enough |
| 4.8 | degeneration_quality | FAIL | FAIL | borderline under stochastic seeds |
| 4.9 | prefix_logit_drift_audit | PASS | PASS | blank l2=1045, memory l2=3.22e11; differential preserved |
| 4.10 | retrieval_topk_semantic_shift | FAIL | FAIL | same root cause as 4.7 |
| 4.11 | repetition_segment_audit | FAIL | PASS | restored (0.11 bad_segment_ratio) |
| 4.12 | prefix_stepwise_drift_trajectory | PASS | PASS | |
| 4.13 | retrieval_generation_alignment_audit | FAIL | PASS | [C-6] target; retrieval_miss dropped from 1-2 to 0 |
| 4.14 | retrieval_prefix_decode_correlation_audit | PASS | PASS | |
| 4.15 | stepwise_label_mass_alignment_audit | FAIL | FAIL | retrieve stage improved but inject stage still dominated by function tokens |
| 4.16 | prompt_diversity_without_memory | PASS | PASS | |
| 4.17 | save_load_consistency | PASS | PASS | |
| 4.18 | training_cache_isolation | PASS | PASS | |
| 4.19 | cheating_heuristics | PASS | PASS | |
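The status changes in the table above can be checked mechanically. The sketch below transcribes the v3.36 and v3.37 columns by hand and derives the restored and regressed cases along with the v3.37 PASS count. It does not read `report.json`, whose schema is not shown here, and `RESULTS` and `transitions` are illustrative names.

```python
# (case id, name, v3.36 status, v3.37 status), transcribed from the table.
RESULTS = [
    ("4.1", "leaf_capacity_stability", "PASS", "PASS"),
    ("4.2", "degenerate_direction_boundary", "PASS", "PASS"),
    ("4.3", "metric_trainability", "PASS", "PASS"),
    ("4.4", "no_grad_generation", "PASS", "PASS"),
    ("4.5", "counterfactual_memory_influence", "PASS", "PASS"),
    ("4.6", "semantic_memory_grounding", "PASS", "FAIL"),
    ("4.7", "semantic_memory_counterfactual_pairs", "FAIL", "FAIL"),
    ("4.8", "degeneration_quality", "FAIL", "FAIL"),
    ("4.9", "prefix_logit_drift_audit", "PASS", "PASS"),
    ("4.10", "retrieval_topk_semantic_shift", "FAIL", "FAIL"),
    ("4.11", "repetition_segment_audit", "FAIL", "PASS"),
    ("4.12", "prefix_stepwise_drift_trajectory", "PASS", "PASS"),
    ("4.13", "retrieval_generation_alignment_audit", "FAIL", "PASS"),
    ("4.14", "retrieval_prefix_decode_correlation_audit", "PASS", "PASS"),
    ("4.15", "stepwise_label_mass_alignment_audit", "FAIL", "FAIL"),
    ("4.16", "prompt_diversity_without_memory", "PASS", "PASS"),
    ("4.17", "save_load_consistency", "PASS", "PASS"),
    ("4.18", "training_cache_isolation", "PASS", "PASS"),
    ("4.19", "cheating_heuristics", "PASS", "PASS"),
]


def transitions(results):
    """Split changed cases into FAIL->PASS (restored) and PASS->FAIL (regressed)."""
    restored = [cid for cid, _, old, new in results if (old, new) == ("FAIL", "PASS")]
    regressed = [cid for cid, _, old, new in results if (old, new) == ("PASS", "FAIL")]
    return restored, regressed


restored, regressed = transitions(RESULTS)
passed_v337 = sum(1 for *_, new in RESULTS if new == "PASS")
print("restored:", restored)
print("regressed:", regressed)
print("v3.37 PASS:", passed_v337, "/", len(RESULTS))
```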

Key wins

4.13 retrieval_generation_alignment_audit PASS — confirms [C-6]. On the music query, tree.retrieve now returns [1, 0, 3, 6, 2] (4 music / 1 space) instead of a mostly-space set; diagnoses: {aligned: 2, retrieval_miss: 0, bridge_unused: 1}.
4.11 repetition_segment_audit returned to PASS (v3.36 had regressed it as a trade-off for fixing 4.9).

Residual FAILs

The failures in 4.7, 4.10, and 4.15 are structural and cannot be fixed by further logit-shaping; 4.6 and 4.8 are borderline:

4.7 / 4.10 — keyword list (chopin, pianist, nocturnes, …) doesn't match the tokens Qwen naturally emits for generic prompts like "A strong explanation should mention". Even with IDF boost, the relative logit gap to " the", " a", " at" cannot be closed at the top-12 cutoff without destroying other properties.
4.15 — the runner's stepwise harness diagnoses most steps at the inject stage where logit label-mass is zero; fixing this needs training-time alignment of ContentSemanticTailHead to keyword tokens, not more runtime shaping.
4.6 / 4.8 — borderline and partly stochastic; likely tunable but not a structural win.
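The top-k argument for 4.7 / 4.10 can be made concrete with a toy example. The logits below are made-up numbers, not measurements from the run, the cutoff is shortened to k=3 for brevity (the audit uses top-12), and treating the bias as additive on the logit is an assumption. `top_k_tokens` and `enters_top_k` are hypothetical helpers.

```python
import heapq


def top_k_tokens(logits, k):
    """Return the k tokens with the highest logits."""
    return heapq.nlargest(k, logits, key=logits.get)


def enters_top_k(logits, token, additive_bias, k):
    """Check whether `token` reaches the top-k after an additive logit bias (assumed form)."""
    boosted = dict(logits)
    boosted[token] = boosted[token] + additive_bias
    return token in top_k_tokens(boosted, k)


# Illustrative logits only: three function tokens far above one keyword.
toy_logits = {" the": 9.0, " a": 8.5, " at": 8.0, " chopin": 2.0}

print(enters_top_k(toy_logits, " chopin", additive_bias=3.0, k=3))
print(enters_top_k(toy_logits, " chopin", additive_bias=6.2, k=3))
```

With a capped boost the keyword stays below every function token in this toy setup. Only a much larger bias gets it into the top-k, and a bias of that size would be expected to disturb other tokens as well, which is the trade-off the text above describes.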

Artifacts

reports/v337_blackbox/report.json — full structured results with per-case metrics
reports/v337_blackbox/report.md — human-readable audit report (1099.4 s elapsed)
reports/v337_blackbox/runner.log — full runner stdout/stderr

v3.37 introduces two structural fixes over v3.36:

[C-5] IDF-weighted content bias: rare domain tokens get ~2x boost relative to high-frequency cross-domain repeaters.
[C-6] Multi-signal DirectionTree.retrieve: beam search + centroid cosine + forward max-sim (IDF-weighted) rerank, preserving the (qdir, bw) signature so the unmodified runner sees a richer candidate list.

Retains the [C-4] _mem_guidance_active gate and the [C-1..3], [A-*], [B-*] fixes.
Vendors scheme_b_v321..v330 and v331_blackbox_eval.py for the audit.
Ignores __pycache__.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Full run of v331_blackbox_eval.py (unmodified) against v3.37 as SUT.

Results (14/19 PASS, 5/19 FAIL):
PASS: leaf_capacity_stability, degenerate_direction_boundary, metric_trainability, no_grad_generation, counterfactual_memory_influence, prefix_logit_drift_audit, repetition_segment_audit, prefix_stepwise_drift_trajectory, retrieval_generation_alignment_audit, retrieval_prefix_decode_correlation_audit, prompt_diversity_without_memory, save_load_consistency, training_cache_isolation, cheating_heuristics
FAIL: semantic_memory_grounding, semantic_memory_counterfactual_pairs, degeneration_quality, retrieval_topk_semantic_shift, stepwise_label_mass_alignment_audit

Version evolution (PASS count): v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13, v3.36: 12, v3.37: 14 (new best)

Targeted fixes confirmed:
4.13 retrieval_generation_alignment_audit FAIL -> PASS ([C-6] multi-signal tree.retrieve rerank): retrieval_miss=0 on music/space queries (vs 1-2 retrieval_miss in v3.36).
4.11 repetition_segment_audit returned to PASS (v3.36 regressed, v3.37 restored with bad_segment_ratio=0.11).

Residual FAILs all trace to one of:
(a) keyword-list / backbone vocab distribution mismatch (4.7, 4.10), which IDF [C-5] mitigates but does not eliminate; Qwen's top-12 on generic prompts still favors stop-function tokens.
(b) upstream simplification in the runner's retrieve_memory_ids path for stepwise aligned counts (4.15 inject stage).
(c) new regression in semantic_memory_grounding (4.6), which needs future investigation (the backbone produced long Chinese tangents).
(d) degeneration_quality (4.8), whose threshold is tight under stochastic seeds.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 2 commits April 19, 2026 15:38

cursor Bot changed the title ~~v3.37 black-box audit~~ v3.37 black-box audit: 14/19 PASS (new best), 4.13 retrieval fix confirmed Apr 19, 2026

FluffyAIcode wants to merge 2 commits into main from AgentMemory/v337-blackbox-audit-7e97


2 participants
