v3.40 black-box audit by FluffyAIcode · Pull Request #13 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-20T08:49:51Z

1. Run parameters

| Field | Value |
| --- | --- |
| SUT | `scheme_b_v340.py` |
| Runner | `v331_blackbox_eval.py` (unchanged) |
| Device | CPU |
| Backbone | `Qwen/Qwen2.5-1.5B-Instruct` (bf16) |
| Elapsed | 1309.40 s |
| Runner exit code | 1 |

2. Count summary

| Metric | Count |
| --- | --- |
| total | 26 |
| pass | 16 |
| fail | 10 |
| not_implemented | 0 |
| error | 0 |
| blocking_fail | 8 |

3. Delta vs. v3.39

| case | prior_passed | current_passed | prior_status | current_status |
| --- | --- | --- | --- | --- |
| rerank_stability_probe | false | true | fail | pass |
| retrieval_prefix_decode_correlation_audit | false | true | fail | pass |
| prefix_stepwise_drift_trajectory | true | false | pass | fail |

4. Cross-version pass counts (original 4.1 – 4.19)

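| version | 3.31 | 3.32 | 3.33 | 3.34 | 3.35 | 3.36 | 3.37 | 3.38 | 3.39 | 3.40 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pass / 19 | 10 | 11 | 10 | 12 | 13 | 12 | 14 | 15 | 13 | 13 |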

5. Failing-case evidence (measured)

| case | metric | threshold | observed | gap |
| --- | --- | --- | --- | --- |
| 4.6 semantic_memory_grounding | `space_margin > 0` | `> 0` | `0.0` | 0.0 (fails the strict `> 0` bound) |
| 4.7 semantic_memory_counterfactual_pairs | all 2 prompts pass | all pass | 1 of 2 failed | n/a |
| 4.10 retrieval_topk_semantic_shift | ≥1 prompt aligns | ≥ 1 | 0 | −1 |
| 4.12 prefix_stepwise_drift_trajectory | `first_bad_step >= 3` | `>= 3` | row 0: `0`; row 1: `4` | row 0 gap `−3` |
| 4.15 stepwise_label_mass_alignment_audit | 0 inject-stage rows | 0 | 2 | +2 |
| 4.17 save_load_consistency | `output_a == output_b` | identical | divergence at token index 4 | n/a |
| 4.22 functional_token_suppression_probe | `delta ≥ 1.5` | ≥ 1.5 | `0.3333` | −1.167 |
| 4.23 keyword_specific_tail_slot_probe | `mean_intersection ≥ 1.0` | ≥ 1.0 | `0.0` | −1.0 |
| 4.24 context_descriptor_cluster_probe | `intra − inter ≥ 0.15` (both) | ≥ 0.15 | music `0.0909`, space `0.0290` | −0.0591, −0.1210 |
| 4.25 prefix_length_scaling_probe | `starters_B ≥ starters_A + 1` | `B ≥ 4` (A=3) | `2` | −2 |
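
The gap column above reads as observed minus threshold for the numeric rows. The following is a minimal sketch of that arithmetic, with values copied from the table; the dictionary keys and the `gap` helper are illustrative names, not part of the audit runner. It also cross-checks the failing-case list against the counts in sections 2 and 4.

```python
# Illustrative only: recompute the "gap" column as observed - threshold.
# Keys and the helper name are hypothetical; values come from the table above.
rows = {
    "4.12 first_bad_step (row 0)": (0, 3),
    "4.15 inject-stage rows": (2, 0),
    "4.22 delta": (0.3333, 1.5),
    "4.23 mean_intersection": (0.0, 1.0),
    "4.24 music intra - inter": (0.0909, 0.15),
    "4.24 space intra - inter": (0.0290, 0.15),
    "4.25 starters_B": (2, 4),
}


def gap(observed, threshold):
    """Signed distance from the threshold, rounded to 4 decimals."""
    return round(observed - threshold, 4)


gaps = {name: gap(obs, thr) for name, (obs, thr) in rows.items()}
for name, value in gaps.items():
    print(f"{name}: {value:+}")

# Cross-check: 10 failing cases in total, and the ones numbered 4.1 to 4.19
# account for the "pass / 19" figure reported for v3.40.
failing_ids = ["4.6", "4.7", "4.10", "4.12", "4.15", "4.17",
               "4.22", "4.23", "4.24", "4.25"]
original_fails = [c for c in failing_ids if int(c.split(".")[1]) <= 19]
passes_of_19 = 19 - len(original_fails)
print(len(failing_ids), passes_of_19)
```

The 4.22 row comes out as −1.1667 at four decimals, which the table reports as −1.167.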

6. Full report

reports/v340_blackbox/audit_feedback.md (Section 7 compliant)
reports/v340_blackbox/report.json
reports/v340_blackbox/report.md
reports/v340_blackbox/runner.log

7. Compliance note

This description and audit_feedback.md conform to V331_BLACKBOX_TEST_SPEC.md Section 7. Mechanism notes H1–H6 are marked non-normative and stated as falsifiable predictions tied to named code elements.

v3.40 [F-1..F-7]:

[F-1] prepare_decode_context / generate default update_stats=False. Memory is immutable during inference; save -> generate -> load -> generate is a pure function of (mem_state, prompt, rng).

[F-2] AMM._preserve_min_keep applied at every retrieval filter stage (strict_overlap, upstream, hard, score, coherence, bidi_gap, mean_center). Cfg.retrieval_min_keep_for_rerank=5. Cfg.mc_min_keep 1 -> 3. RetrievalDiag.min_keep_enforcements counts invocations.

[F-3] MemLLM.fwd adds pure_function_mask penalty when guidance is active. Cfg.use_fwd_function_suppression, fwd_function_suppression_scale=5.0, fwd_function_suppression_decay=0.04, fwd_function_suppression_floor=0.3. Independent of shape_step_logits [E-3] so audit probes that sample fwd output directly observe the margin shift.

[F-4] _compute_rare_keyword_wte_residual uses target_scale = sqrt(d_LLM) matching post-LN slot magnitude. Residual magnitude now coherent with slot_head output instead of target_std * sqrt(d_LLM) which was order-of-magnitude larger on average.

[F-5] MemoryContextEncoder: Linear -> LN -> SiLU -> Linear -> LN -> SiLU -> Linear. Orthogonal init on all 3 Linears. encode() applies per-sample mean-centering before L2-normalize to remove the constant-bias drift that pulled v3.39 descriptors toward one axis.

[F-6] effective_tail_slots = base + (L_mem - 8) // 2. keyword_tail_top_k 8. Slot s in [1, n_slots-1] receives the (s-1)-th rare keyword centroid as residual, so tail slots anchor to distinct content directions instead of sharing one.

[F-7] fwd_path_bias_dampen 0.3 -> 0.25; wte_residual_alpha 0.6 -> 0.5. Reduces aggregate shaping strength applied at high-retrieval queries (targets the 4.14 correlation regression from v3.39).

MemEntry fields and MemLLM.save_memory/load_memory preserve context_descriptor. DecodeContext.mixture_gate / memory_logit_bias present; Cfg.use_mixture_decoding remains False by default (set to True by probe 4.26). All prior [C-*]/[D-*]/[E-*] fixes preserved.

No mocks, no fallbacks. Audit runner v331_blackbox_eval.py unchanged on this branch.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
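
Two of the notes above, [F-5] and [F-6], describe small numeric steps that a toy example can make concrete. The sketch below is not taken from `scheme_b_v340.py`. It assumes "per-sample mean-centering" means subtracting each descriptor's own mean over its feature dimension, and it uses a made-up `base` value, since the commit message does not state one. All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumes a nonzero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def center_then_normalize(vec):
    """Assumed reading of [F-5]: subtract the sample's own mean, then L2-normalize."""
    mean = sum(vec) / len(vec)
    centered = [x - mean for x in vec]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        return centered  # constant vector: nothing is left after centering
    return [x / norm for x in centered]


def cosine(unit_a, unit_b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(unit_a, unit_b))


def effective_tail_slots(base, l_mem):
    """Formula quoted in [F-6]; `base` is a placeholder value here."""
    return base + (l_mem - 8) // 2


# Two toy descriptors that differ in content but share a large constant offset.
bias = 10.0
a = [bias + 1.0, bias - 1.0, bias, bias]
b = [bias, bias, bias + 1.0, bias - 1.0]

cos_plain = cosine(l2_normalize(a), l2_normalize(b))
cos_centered = cosine(center_then_normalize(a), center_then_normalize(b))
print(cos_plain, cos_centered)

# Tail-slot count for two prefix lengths, with a hypothetical base of 2.
print(effective_tail_slots(2, 8), effective_tail_slots(2, 16))
```

Without centering, the shared offset dominates and the two toy descriptors look nearly identical under cosine similarity; after centering they are orthogonal. Note that in Python `//` floors toward negative infinity, so the quoted formula yields fewer than `base` slots when `L_mem` is below 8; whether the real code guards against that case is not stated in the commit message.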

Artifacts: report.json, report.md, runner.log.

Feedback file follows V331_BLACKBOX_TEST_SPEC.md Section 7: run parameters, 26-row per-case table, count summary (pass=16, fail=10, ni=0, error=0, blocking=8), delta vs v3.39 (3 state changes), per-failing-case evidence for all 10 fails with measured metric, threshold, and gap, 6 falsifiable mechanism notes (H1-H6), artifact links. No celebratory / consolation / hype / emotive language.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 2 commits April 20, 2026 08:49

FluffyAIcode wants to merge 2 commits into main from AgentMemory/v340-blackbox-audit-7e97
