Spec: add Section 7 (Reporting Discipline) + Cipher-System Structural Probes (4.20–4.26) #10
Draft
FluffyAIcode wants to merge 4 commits into main
v3.37 introduces two structural fixes over v3.36:
[C-5] IDF-weighted content bias: rare domain tokens get ~2x boost relative
to high-frequency cross-domain repeaters.
[C-6] Multi-signal DirectionTree.retrieve: beam search + centroid cosine +
forward maxsim (IDF-weighted) rerank, preserving the (qdir, bw)
signature so the unmodified runner sees a richer candidate list.
Retains [C-4] guidance_active gate, [C-1..3] A-*, B-* fixes.
Vendors scheme_b_v321..v330 and v331_blackbox_eval.py for the audit.
Ignore __pycache__.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
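The [C-5] note above describes boosting rare domain tokens relative to high-frequency cross-domain repeaters. A minimal sketch of that idea follows, assuming a smoothed IDF of the form log((1 + N) / (1 + df)) + 1; this is one common formulation and not necessarily the one used in v3.37. The names `idf_weights`, `content_bias`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed IDF per token over a list of tokenized memory entries."""
    n_docs = len(docs)
    # Document frequency: number of entries that contain each token at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def content_bias(tokens, idf, base_bias=1.0):
    """Scale a base bias by each token's IDF; unseen tokens are treated as rare."""
    default = max(idf.values(), default=1.0)
    return {tok: base_bias * idf.get(tok, default) for tok in tokens}


# Toy corpus (assumption): "the" repeats across every entry, "sonata" appears once.
corpus = [
    ["the", "piano", "sonata"],
    ["the", "orbit", "rocket"],
    ["the", "piano", "tempo"],
    ["the", "rocket", "launch"],
]
idf = idf_weights(corpus)
bias = content_bias(["the", "sonata"], idf)
```

In this toy corpus the single-entry token receives a larger bias than the token that appears in every entry; the exact ratio depends on corpus size and smoothing.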
Full run of v331_blackbox_eval.py (unmodified) against v3.37 as SUT.
Results (14/19 PASS, 5/19 FAIL):
PASS: leaf_capacity_stability, degenerate_direction_boundary,
metric_trainability, no_grad_generation,
counterfactual_memory_influence, prefix_logit_drift_audit,
repetition_segment_audit, prefix_stepwise_drift_trajectory,
retrieval_generation_alignment_audit,
retrieval_prefix_decode_correlation_audit,
prompt_diversity_without_memory, save_load_consistency,
training_cache_isolation, cheating_heuristics
FAIL: semantic_memory_grounding, semantic_memory_counterfactual_pairs,
degeneration_quality, retrieval_topk_semantic_shift,
stepwise_label_mass_alignment_audit
Version evolution (PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12,
v3.35: 13, v3.36: 12, v3.37: 14 (new best)
Targeted fixes confirmed:
4.16 retrieval_generation_alignment_audit FAIL -> PASS
([C-6] multi-signal tree.retrieve rerank): retrieval_miss=0
on music/space queries (vs 1-2 retrieval_miss in v3.36).
4.12 repetition_segment_audit returned to PASS
(v3.36 regressed, v3.37 restored with bad_segment_ratio=0.11).
Residual FAILs each trace to one of:
(a) keyword-list / backbone vocab distribution mismatch (4.7, 4.11),
which IDF [C-5] mitigates but does not eliminate — Qwen's top-12
on generic prompts still favors stop-function tokens.
(b) upstream simplification in runner's retrieve_memory_ids path
for stepwise aligned counts (4.19 inject stage).
(c) new regression in semantic_memory_grounding (4.6) — needs
future investigation (backbone produced long Chinese tangents).
(d) degeneration_quality (4.8) threshold tight under stochastic seeds.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
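The version-evolution figures in the commit message above can be turned into per-version deltas with a few lines. This is only a sketch for checking the arithmetic: the PASS counts and the 19-case total for v3.37 come from the message, while the helper name `pass_deltas` is hypothetical.

```python
# PASS counts per version, copied from the commit message above.
PASS_COUNTS = {
    "v3.31": 10, "v3.32": 11, "v3.33": 10, "v3.34": 12,
    "v3.35": 13, "v3.36": 12, "v3.37": 14,
}
TOTAL_CASES_V337 = 19  # the v3.37 run reports 14/19 PASS, 5/19 FAIL


def pass_deltas(counts):
    """Change in PASS count of each version relative to its predecessor."""
    versions = list(counts)  # dicts preserve insertion order
    return {cur: counts[cur] - counts[prev] for prev, cur in zip(versions, versions[1:])}


deltas = pass_deltas(PASS_COUNTS)
fail_count_v337 = TOTAL_CASES_V337 - PASS_COUNTS["v3.37"]
best_version = max(PASS_COUNTS, key=PASS_COUNTS.get)
```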
…s (4.20-4.26)
Adds a forward-looking subsuite that turns the 'cipher system' structural-upgrade proposals into concrete black-box probes. Each probe carries a fixed seed, an explicit setup, purely public-API observations, and binary pass/fail criteria that honour the original Section 1 no-mock / no-fallback / no-overfit policy.
Mapping from cipher attribute to probe and targeted FAIL:
ID    probe                               cipher attribute       targeted FAIL
4.20  rerank_stability_probe              invocation strategy    4.6
4.21  decode_repetition_feedback_probe    anti-collapse          4.8
4.22  functional_token_suppression_probe  expressive volume      4.7 / 4.10
4.23  keyword_specific_tail_slot_probe    expressive vocabulary  4.15 inject
4.24  context_descriptor_cluster_probe    invocation strategy    4.6 / 4.9
4.25  prefix_length_scaling_probe         expressive capacity    4.7 / 4.10
4.26  mixture_distribution_gate_probe     expressive form        4.7 / 4.10 / 4.15
P2/P3 upgrades that are not yet implemented (4.23, 4.24, 4.26) are allowed to emit status = 'not_implemented' rather than fail; the policy forbids silencing such probes or satisfying them via prompt-keyed shortcuts.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
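The 'not_implemented' rule above could be tallied along the following lines. This is a sketch under stated assumptions, not the runner's actual code: the status strings other than 'not_implemented', the function names, and the choice to count a 'not_implemented' from a probe outside 4.23 / 4.24 / 4.26 as a fail are all assumptions made for illustration.

```python
VALID_STATUSES = {"pass", "fail", "not_implemented"}
# Per the commit message, only these probes may report 'not_implemented'.
MAY_BE_NOT_IMPLEMENTED = {"4.23", "4.24", "4.26"}


def effective_status(probe_id, status):
    """Normalize a reported status; unknown statuses are rejected outright."""
    if status not in VALID_STATUSES:
        raise ValueError(f"probe {probe_id}: unknown status {status!r}")
    # Assumption: an implemented probe cannot opt out, so this counts as a fail.
    if status == "not_implemented" and probe_id not in MAY_BE_NOT_IMPLEMENTED:
        return "fail"
    return status


def summarize(results):
    """Count probes per effective status; 'not_implemented' is never merged into 'pass'."""
    summary = {status: 0 for status in sorted(VALID_STATUSES)}
    for probe_id, status in results.items():
        summary[effective_status(probe_id, status)] += 1
    return summary


example = summarize({"4.20": "pass", "4.21": "not_implemented", "4.23": "not_implemented"})
```

Keeping 'not_implemented' as its own bucket means such probes stay visible in every report instead of being silenced.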
Normative rules for human-authored audit reports, PR descriptions, commit messages, and inter-version comparisons.
Banned categories (celebratory, consolation, hype, emotive) are enumerated.
Required report sections (run parameters, per-case table, counts, delta, per-failing-case evidence, mechanism notes, artifacts) are fixed.
Writing rules require measured numbers instead of comparative adjectives.
Enforcement applies from v3.40 onward; prior reports are not mandated to be rewritten.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
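The fixed list of required report sections lends itself to a mechanical check. The sketch below assumes each section appears as a heading on its own line whose text matches the section name; the spec may define a different heading format, and `missing_sections` is a hypothetical helper, not part of the spec or runner.

```python
# Section names as listed in the commit message above.
REQUIRED_SECTIONS = (
    "run parameters",
    "per-case table",
    "counts",
    "delta",
    "per-failing-case evidence",
    "mechanism notes",
    "artifacts",
)


def missing_sections(report_text):
    """Return the required sections that have no matching heading line, in spec order."""
    # Normalize each line: drop surrounding whitespace, '#' markers, and a trailing colon.
    headings = {line.strip().strip("#:").strip().lower() for line in report_text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


draft = """## Run parameters
seed=0
## Per-case table
## Counts
## Per-failing-case evidence
## Mechanism notes
"""
gaps = missing_sections(draft)
```

Returning the gaps in spec order makes the output easy to paste into a review comment.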
Summary
Extends V331_BLACKBOX_TEST_SPEC.md with two additions:
File changes
V331_BLACKBOX_TEST_SPEC.md (+667 lines total on this branch; Section 7 appended as +54 lines in the latest commit).
Enforcement scope