v3.40 black-box audit #13

Draft
FluffyAIcode wants to merge 2 commits into main from AgentMemory/v340-blackbox-audit-7e97

FluffyAIcode commented Apr 20, 2026

1. Run parameters

| Field | Value |
|---|---|
| SUT | scheme_b_v340.py |
| Runner | v331_blackbox_eval.py (unchanged) |
| Device | CPU |
| Backbone | Qwen/Qwen2.5-1.5B-Instruct (bf16) |
| Elapsed | 1309.40 s |
| Runner exit code | 1 |

2. Count summary

| Metric | Count |
|---|---|
| total | 26 |
| pass | 16 |
| fail | 10 |
| not_implemented | 0 |
| error | 0 |
| blocking_fail | 8 |

3. Delta vs. v3.39

| case | prior_passed | current_passed | prior_status | current_status |
|---|---|---|---|---|
| rerank_stability_probe | false | true | fail | pass |
| retrieval_prefix_decode_correlation_audit | false | true | fail | pass |
| prefix_stepwise_drift_trajectory | true | false | pass | fail |

4. Cross-version pass counts (original 4.1 – 4.19)

| version | 3.31 | 3.32 | 3.33 | 3.34 | 3.35 | 3.36 | 3.37 | 3.38 | 3.39 | 3.40 |
|---|---|---|---|---|---|---|---|---|---|---|
| pass / 19 | 10 | 11 | 10 | 12 | 13 | 12 | 14 | 15 | 13 | 13 |

5. Failing-case evidence (measured)

| case | metric | threshold | observed | gap |
|---|---|---|---|---|
| 4.6 semantic_memory_grounding | space_margin > 0 | > 0 | 0.0 | −0.0 (strict) |
| 4.7 semantic_memory_counterfactual_pairs | all 2 prompts pass | all pass | 1 of 2 failed | |
| 4.10 retrieval_topk_semantic_shift | ≥1 prompt aligns | ≥ 1 | 0 | −1 |
| 4.12 prefix_stepwise_drift_trajectory | first_bad_step >= 3 | >= 3 | row 0: 0; row 1: 4 | row 0 gap −3 |
| 4.15 stepwise_label_mass_alignment_audit | 0 inject-stage rows | 0 | 2 | +2 |
| 4.17 save_load_consistency | output_a == output_b | identical | divergence at token index 4 | |
| 4.22 functional_token_suppression_probe | delta ≥ 1.5 | ≥ 1.5 | 0.3333 | −1.167 |
| 4.23 keyword_specific_tail_slot_probe | mean_intersection ≥ 1.0 | ≥ 1.0 | 0.0 | −1.0 |
| 4.24 context_descriptor_cluster_probe | intra − inter ≥ 0.15 (both) | ≥ 0.15 | music 0.0909, space 0.0290 | −0.0591, −0.1210 |
| 4.25 prefix_length_scaling_probe | starters_B ≥ starters_A + 1 | B ≥ 4 (A=3) | 2 | −2 |
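
The numeric entries in the gap column are consistent with reading gap as observed minus threshold. A minimal Python check of that reading, using only values from the table above, is sketched here. The helper name `gap` and the row labels are illustrative and are not part of the runner or the report schema.

```python
# Assumption: gap = observed - threshold. This reproduces the numeric rows
# of the table; it is not taken from the runner's source.
def gap(observed: float, threshold: float) -> float:
    """Signed distance from the threshold, rounded to 4 decimals."""
    return round(observed - threshold, 4)


# (observed, threshold) pairs copied from the failing-case table.
rows = {
    "4.22 delta": (0.3333, 1.5),
    "4.23 mean_intersection": (0.0, 1.0),
    "4.24 music": (0.0909, 0.15),
    "4.24 space": (0.0290, 0.15),
    "4.25 starters_B": (2, 4),
}

gaps = {name: gap(obs, thr) for name, (obs, thr) in rows.items()}
for name, g in gaps.items():
    print(f"{name}: {g:+.4f}")
```

Under this reading a negative gap means the observed value fell short of a lower bound, while the +2 reported for 4.15 means the observed count exceeded an upper bound of 0.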

6. Full report

  • reports/v340_blackbox/audit_feedback.md (Section 7 compliant)
  • reports/v340_blackbox/report.json
  • reports/v340_blackbox/report.md
  • reports/v340_blackbox/runner.log

7. Compliance note

This description and audit_feedback.md conform to V331_BLACKBOX_TEST_SPEC.md Section 7. Mechanism notes H1–H6 are marked non-normative and stated as falsifiable predictions tied to named code elements.


cursoragent and others added 2 commits April 20, 2026 08:49
v3.40 [F-1..F-7]:
  [F-1] prepare_decode_context / generate default update_stats=False.
        Memory is immutable during inference; save -> generate -> load ->
        generate is a pure function of (mem_state, prompt, rng).
  [F-2] AMM._preserve_min_keep applied at every retrieval filter stage
        (strict_overlap, upstream, hard, score, coherence, bidi_gap,
        mean_center). Cfg.retrieval_min_keep_for_rerank=5. Cfg.mc_min_keep
        1 -> 3. RetrievalDiag.min_keep_enforcements counts invocations.
  [F-3] MemLLM.fwd adds pure_function_mask penalty when guidance is active.
        Cfg.use_fwd_function_suppression, fwd_function_suppression_scale=5.0,
        fwd_function_suppression_decay=0.04, fwd_function_suppression_floor=0.3.
        Independent of shape_step_logits [E-3] so audit probes that sample
        fwd output directly observe the margin shift.
  [F-4] _compute_rare_keyword_wte_residual uses target_scale = sqrt(d_LLM)
        matching post-LN slot magnitude. Residual magnitude now coherent
        with slot_head output instead of target_std * sqrt(d_LLM) which was
        order-of-magnitude larger on average.
  [F-5] MemoryContextEncoder: Linear -> LN -> SiLU -> Linear -> LN -> SiLU
        -> Linear. Orthogonal init on all 3 Linears. encode() applies
        per-sample mean-centering before L2-normalize to remove the
        constant-bias drift that pulled v3.39 descriptors toward one axis.
  [F-6] effective_tail_slots = base + (L_mem - 8) // 2. keyword_tail_top_k
        8. Slot s in [1, n_slots-1] receives the (s-1)-th rare keyword
        centroid as residual, so tail slots anchor to distinct content
        directions instead of sharing one.
  [F-7] fwd_path_bias_dampen 0.3 -> 0.25; wte_residual_alpha 0.6 -> 0.5.
        Reduces aggregate shaping strength applied at high-retrieval
        queries (targets the 4.14 correlation regression from v3.39).
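
The min-keep floor described in [F-2] can be illustrated with a small standalone sketch. The actual body of AMM._preserve_min_keep is not shown in this PR, so everything below is an assumption: the candidate shape, the backfill-by-score policy, and the names `preserve_min_keep` and `run_filters` are hypothetical and only show one plausible way a per-stage floor and an enforcement counter could interact.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (memory id, retrieval score); hypothetical shape


def preserve_min_keep(
    before: List[Candidate],
    after: List[Candidate],
    min_keep: int,
) -> Tuple[List[Candidate], bool]:
    """Backfill `after` from `before` (best score first) up to `min_keep` items.

    Returns the repaired list and whether the floor had to be enforced.
    """
    target = min(min_keep, len(before))  # cannot keep more than we started with
    if len(after) >= target:
        return after, False
    kept_ids = {cid for cid, _ in after}
    dropped = sorted(
        (c for c in before if c[0] not in kept_ids),
        key=lambda c: c[1],
        reverse=True,
    )
    return after + dropped[: target - len(after)], True


def run_filters(
    cands: List[Candidate],
    filters: List[Callable[[Candidate], bool]],
    min_keep: int,
) -> Tuple[List[Candidate], int]:
    """Apply each filter stage in order, enforcing the floor after every stage."""
    enforcements = 0
    for keep in filters:
        filtered = [c for c in cands if keep(c)]
        cands, enforced = preserve_min_keep(cands, filtered, min_keep)
        enforcements += int(enforced)
    return cands, enforcements


pool = [("a", 0.9), ("b", 0.7), ("c", 0.4), ("d", 0.2), ("e", 0.1), ("f", 0.05)]
stages = [lambda c: c[1] > 0.5, lambda c: c[1] > 0.05]
survivors, enforcements = run_filters(pool, stages, min_keep=5)
print([cid for cid, _ in survivors], enforcements)
```

In this toy run the first stage would keep only two candidates, so the floor restores the three best dropped ones; the second stage leaves five survivors and does not trigger the floor. The counter plays the role that RetrievalDiag.min_keep_enforcements is described as playing.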

MemEntry fields and MemLLM.save_memory/load_memory preserve context_descriptor.
DecodeContext.mixture_gate / memory_logit_bias present; Cfg.use_mixture_decoding
remains False by default (set to True by probe 4.26).
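
The effect of the [F-5] step "per-sample mean-centering before L2-normalize" can be seen in a few lines of plain Python. This is a sketch under the assumption that "per-sample" means subtracting the mean over the feature dimension of each descriptor; the function names are hypothetical and the real encode() operates on tensors inside MemoryContextEncoder.

```python
import math
from typing import List


def center_and_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Subtract the per-sample mean, then scale to unit L2 norm."""
    mean = sum(vec) / len(vec)
    centered = [x - mean for x in vec]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / max(norm, eps) for x in centered]


def cosine_unit(a: List[float], b: List[float]) -> float:
    """Cosine similarity for inputs that are already unit-norm."""
    return sum(x * y for x, y in zip(a, b))


raw = [0.2, -0.1, 0.4, 0.0]
biased = [x + 3.0 for x in raw]  # same vector plus a constant offset on every dimension

desc_raw = center_and_normalize(raw)
desc_biased = center_and_normalize(biased)
similarity = cosine_unit(desc_raw, desc_biased)
print(f"cosine after centering: {similarity:.6f}")
```

Because the constant offset is removed before normalization, both inputs map to the same descriptor. Without the centering step the offset would dominate the direction of the normalized vector, which is the kind of single-axis pull the commit message attributes to v3.39. A vector whose entries are all equal centers to zero, which is why the sketch clamps the norm with `eps`.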

All prior [C-*]/[D-*]/[E-*] fixes preserved. No mocks, no fallbacks.
Audit runner v331_blackbox_eval.py unchanged on this branch.
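
The slot arithmetic in [F-6] can be worked through directly. The sketch below takes the formula and the slot-to-centroid rule from the commit message; the value `base=2`, the string placeholders for centroids, and both function names are hypothetical, and the relationship between `n_slots` and `effective_tail_slots` in the real code is not stated in this PR.

```python
from typing import Dict, List


def effective_tail_slots(base: int, l_mem: int) -> int:
    """[F-6] formula: base + (L_mem - 8) // 2."""
    return base + (l_mem - 8) // 2


def assign_centroids(n_slots: int, centroids: List[str]) -> Dict[int, str]:
    """Map slot s in [1, n_slots - 1] to the (s - 1)-th centroid, when one exists."""
    return {
        s: centroids[s - 1]
        for s in range(1, n_slots)
        if s - 1 < len(centroids)
    }


# Hypothetical base of 2 tail slots at the reference prefix length of 8.
for l_mem in (8, 12, 16):
    print(l_mem, effective_tail_slots(2, l_mem))

# Slot 0 receives no keyword centroid; slots 1..3 receive centroids 0..2.
mapping = assign_centroids(4, ["k0", "k1", "k2", "k3", "k4"])
print(mapping)
```

With integer floor division, each two additional memory positions beyond 8 add one tail slot. Note that in Python the same expression goes negative for L_mem below 8 (for example `(7 - 8) // 2` is -1), so a real implementation would presumably guard that range.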

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Artifacts: report.json, report.md, runner.log.
Feedback file follows V331_BLACKBOX_TEST_SPEC.md Section 7:
  run parameters, 26-row per-case table, count summary
  (pass=16, fail=10, ni=0, error=0, blocking=8), delta vs v3.39
  (3 state changes), per-failing-case evidence for all 10 fails with
  measured metric, threshold, and gap, 6 falsifiable mechanism notes
  (H1-H6), artifact links. No celebratory / consolation / hype /
  emotive language.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>