Spec: add Section 7 (Reporting Discipline) + Cipher-System Structural Probes (4.20–4.26) by FluffyAIcode · Pull Request #10 · FluffyAIcode/AgentMemorySystem

FluffyAIcode · 2026-04-20T04:53:50Z

Summary

Extends V331_BLACKBOX_TEST_SPEC.md with two additions:

Cipher-System Structural Probes (4.20 – 4.26): seven probes that encode the upgrade proposals derived from the cipher-system analysis as fixed-seed, public-API tests with binary pass/fail criteria. Each probe reports pass, fail, or not_implemented per the rules in Section 4-meta.
Section 7: Reporting Discipline (mandatory): normative rules governing every human-authored audit output (reports, PR descriptions, commit messages, analysis documents). Banned language categories are enumerated (celebratory, consolation, hype, emotive). Required report structure is fixed (run parameters, per-case table, count summary, delta vs. prior, per-failing-case evidence, optional mechanism notes, artifact links). Writing rules require numerical measurements in place of comparative adjectives. Non-compliant reports are not mergeable.
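The required report structure described above lends itself to a mechanical pre-merge check. The following is a minimal sketch, not the spec's enforcement mechanism: the section names are taken from this summary, while `missing_sections`, `banned_hits`, and the two sample banned terms are hypothetical illustrations (the actual banned-term lists are defined in Section 7 of the spec and are not reproduced here).

```python
import re

# Section names as listed in the PR summary. "Mechanism notes" is optional
# there, so it is not checked. Plain substring matching is an assumption.
REQUIRED_SECTIONS = [
    "run parameters",
    "per-case table",
    "count summary",
    "delta vs. prior",
    "per-failing-case evidence",
    "artifact links",
]

# Hypothetical sample terms for illustration only; the real lists live in
# Section 7 of V331_BLACKBOX_TEST_SPEC.md.
BANNED_EXAMPLES = {
    "celebratory": ["congratulations"],
    "hype": ["groundbreaking"],
}


def missing_sections(report: str) -> list[str]:
    """Return the required section names that do not appear in the report."""
    lowered = report.lower()
    return [name for name in REQUIRED_SECTIONS if name not in lowered]


def banned_hits(report: str) -> list[tuple[str, str]]:
    """Return (category, term) pairs for each sample banned term found."""
    hits = []
    for category, terms in BANNED_EXAMPLES.items():
        for term in terms:
            pattern = rf"\b{re.escape(term)}\b"
            if re.search(pattern, report, flags=re.IGNORECASE):
                hits.append((category, term))
    return hits
```

A report would be considered structurally complete under this sketch when `missing_sections` returns an empty list and `banned_hits` returns no pairs.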

File changes

V331_BLACKBOX_TEST_SPEC.md (+667 lines total on this branch; Section 7 appended as +54 lines in the latest commit).

Enforcement scope

Section 7 applies to v3.40 and forward. Prior reports are not mandated to be rewritten.
The audit runner's JSON output is machine-generated and is not subject to Section 7.1 or 7.3.
No implementation code changes. Runner code is not modified by this PR.

v3.37 introduces two structural fixes over v3.36:

[C-5] IDF-weighted content bias: rare domain tokens get ~2x boost relative to high-frequency cross-domain repeaters.
[C-6] Multi-signal DirectionTree.retrieve: beam search + centroid cosine + forward maxsim (IDF-weighted) rerank, preserving the (qdir, bw) signature so the unmodified runner sees a richer candidate list.

Retains [C-4] guidance_active gate, [C-1..3] A-*, B-* fixes.
Vendors scheme_b_v321..v330 and v331_blackbox_eval.py for the audit.
Ignore __pycache__.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
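To illustrate the idea behind the [C-5] IDF-weighted bias, here is a minimal sketch of how inverse document frequency separates rare domain tokens from tokens that repeat across domains. The commit does not state the formula v3.37 uses, so the smoothed IDF below is an assumption (a variant common in TF-IDF implementations), and `idf_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs: list[list[str]]) -> dict[str, float]:
    """Smoothed IDF per token: ln((1 + N) / (1 + df)) + 1.

    This formula is an assumption for illustration; the weighting that
    v3.37 actually applies is not given in the commit message.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


# Toy corpus: "the" appears in every document, the domain tokens in one each.
corpus = [["the", "piano"], ["the", "orbit"], ["the", "melody"]]
weights = idf_weights(corpus)
```

In this toy corpus the token that occurs in every document receives the minimum weight, and each single-document token receives a larger one. The size of that gap depends on the corpus and the formula, so this sketch does not reproduce the ~2x figure quoted in the commit.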

Full run of v331_blackbox_eval.py (unmodified) against v3.37 as SUT.

Results (14/19 PASS, 5/19 FAIL):

PASS: leaf_capacity_stability, degenerate_direction_boundary, metric_trainability, no_grad_generation, counterfactual_memory_influence, prefix_logit_drift_audit, repetition_segment_audit, prefix_stepwise_drift_trajectory, retrieval_generation_alignment_audit, retrieval_prefix_decode_correlation_audit, prompt_diversity_without_memory, save_load_consistency, training_cache_isolation, cheating_heuristics

FAIL: semantic_memory_grounding, semantic_memory_counterfactual_pairs, degeneration_quality, retrieval_topk_semantic_shift, stepwise_label_mass_alignment_audit

Version evolution (PASS count): v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12, v3.35: 13, v3.36: 12, v3.37: 14 (new best)

Targeted fixes confirmed:
4.16 retrieval_generation_alignment_audit FAIL -> PASS ([C-6] multi-signal tree.retrieve rerank): retrieval_miss=0 on music/space queries (vs 1-2 retrieval_miss in v3.36).
4.12 repetition_segment_audit returned to PASS (v3.36 regressed, v3.37 restored with bad_segment_ratio=0.11).

Each residual FAIL traces to one of the following:
(a) keyword-list / backbone vocab distribution mismatch (4.7, 4.11), which IDF [C-5] mitigates but does not eliminate: Qwen's top-12 on generic prompts still favors stop-function tokens.
(b) upstream simplification in the runner's retrieve_memory_ids path for stepwise aligned counts (4.19 inject stage).
(c) new regression in semantic_memory_grounding (4.6), which needs future investigation (backbone produced long Chinese tangents).
(d) degeneration_quality (4.8) threshold tight under stochastic seeds.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
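The version evolution above can be restated as per-version deltas, which is the kind of "delta vs. prior" figure the reporting rules call for. This is a small sketch using only the PASS counts quoted in the commit message; the helper name `pass_deltas` is hypothetical and not part of the runner.

```python
# PASS counts per version, copied from the commit message above.
PASS_COUNTS = {
    "v3.31": 10,
    "v3.32": 11,
    "v3.33": 10,
    "v3.34": 12,
    "v3.35": 13,
    "v3.36": 12,
    "v3.37": 14,
}
TOTAL_CASES = 19  # 14 PASS + 5 FAIL in the v3.37 run


def pass_deltas(counts: dict[str, int]) -> list[tuple[str, int, int]]:
    """Return (version, pass_count, delta_vs_prior) for each version after the first."""
    versions = list(counts)  # dicts preserve insertion order
    return [
        (cur, counts[cur], counts[cur] - counts[prev])
        for prev, cur in zip(versions, versions[1:])
    ]


for version, passed, delta in pass_deltas(PASS_COUNTS):
    print(f"{version}: {passed}/{TOTAL_CASES} PASS ({delta:+d} vs. prior)")
```

The final row reports v3.37 at 14/19 with a delta of +2 against v3.36, and the list also shows the two versions (v3.33 and v3.36) where the count dropped by one.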

…s (4.20-4.26)

Adds a forward-looking subsuite that turns the 'cipher system' structural-upgrade proposals into concrete black-box probes. Each probe carries a fixed seed, an explicit setup, purely public-API observations, and binary pass/fail criteria that honour the original Section 1 no-mock / no-fallback / no-overfit policy.

Mapping from cipher attribute to probe and targeted FAIL:

Probe                                    Cipher attribute        Targeted FAIL
4.20 rerank_stability_probe              invocation strategy     4.6
4.21 decode_repetition_feedback_probe    anti-collapse           4.8
4.22 functional_token_suppression_probe  expressive volume       4.7 / 4.10
4.23 keyword_specific_tail_slot_probe    expressive vocabulary   4.15 inject
4.24 context_descriptor_cluster_probe    invocation strategy     4.6 / 4.9
4.25 prefix_length_scaling_probe         expressive capacity     4.7 / 4.10
4.26 mixture_distribution_gate_probe     expressive form         4.7 / 4.10 / 4.15

P2/P3 upgrades that are not yet implemented (4.23, 4.24, 4.26) are allowed to emit status = 'not_implemented' rather than fail; the policy forbids silencing such probes or satisfying them via prompt-keyed shortcuts.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
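The three-valued probe status described in this commit can be sketched as a small enum with a validity check. This is an illustration under stated assumptions, not the runner's code: the `Status` enum and `status_allowed` helper are hypothetical, and the sketch assumes the not_implemented allowance is limited to the three probes the commit names (4.23, 4.24, 4.26).

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_IMPLEMENTED = "not_implemented"


# Probes the commit message names as P2/P3 upgrades not yet implemented.
# Restricting the allowance to exactly these three is an assumption.
MAY_BE_NOT_IMPLEMENTED = {"4.23", "4.24", "4.26"}


def status_allowed(probe_id: str, status: Status) -> bool:
    """Return True if the probe may legitimately report this status.

    A probe outside the allowed set that reports not_implemented is
    treated as invalid, so a probe cannot be silenced that way.
    """
    if status is Status.NOT_IMPLEMENTED:
        return probe_id in MAY_BE_NOT_IMPLEMENTED
    return True
```

For example, `status_allowed("4.23", Status.NOT_IMPLEMENTED)` is accepted, while the same status for probe 4.20 is rejected, which forces that probe to resolve to pass or fail.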

Normative rules for human-authored audit reports, PR descriptions, commit messages, and inter-version comparisons. Banned categories (celebratory, consolation, hype, emotive) are enumerated. Required report sections (run parameters, per-case table, counts, delta, per-failing-case evidence, mechanism notes, artifacts) are fixed. Writing rules require measured numbers instead of comparative adjectives. Enforcement applies from v3.40 onward; prior reports are not mandated to be rewritten. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 4 commits April 19, 2026 15:38

cursor Bot changed the title ~~Spec: add Cipher-System Structural Probes (4.20–4.26) for v3.38+ upgrades~~ Spec: add Section 7 (Reporting Discipline) + Cipher-System Structural Probes (4.20–4.26) Apr 20, 2026

FluffyAIcode wants to merge 4 commits into main from AgentMemory/v338-cipher-probes-7e97
