v3.38 black-box audit — 15/19 original PASS (new best), 1/2/4 probe PASS/NI/FAIL#11
Draft
FluffyAIcode wants to merge 2 commits intomainfrom
Draft
v3.38 black-box audit — 15/19 original PASS (new best), 1/2/4 probe PASS/NI/FAIL#11FluffyAIcode wants to merge 2 commits intomainfrom
FluffyAIcode wants to merge 2 commits intomainfrom
Conversation
[Implementation v3.38]
Adds scheme_b_v338.py with four [D-*] structural upgrades on top of v3.37:
[D-1] L_functional_suppression hinge loss on prefix-injected logits.
[D-2] Rare-keyword tail slot: tail[1] KL-target is IDF-top-K strict starters.
[D-3] Context descriptor: per-query weighted aggregate of retrieved
memory semantic_emb vectors, projected via ContextHead into one
prefix slot.
[D-4] Anti-collapse: per-token content_bias history decay + degeneration
detector (low unique_ratio triggers global bias dampen).
All [C-1..C-6] from v3.37 preserved. AgentMemorySystem.py redirects to v3.38.
[Runner v331_blackbox_eval.py]
Extends the external audit runner with seven Cipher-System Structural
Probes (4.20-4.26) as specified in V331_BLACKBOX_TEST_SPEC.md section 4.
Original 19 cases are not modified.
Gating rules implemented per spec section 4-meta:
- hard_PASS probes (4.20, 4.21, 4.22, 4.25) block suite PASS on failure.
- PASS_or_not_implemented probes (4.23, 4.24, 4.26) do not block on
honest 'not_implemented', but do block on 'fail'.
Honesty notes (truthful not_implemented emissions):
- 4.24 context_descriptor_cluster_probe: spec wording requires per-
MemEntry stored descriptor; v3.38 exposes per-query descriptor only,
so field 'MemEntry.context_descriptor' is absent -> not_implemented.
- 4.26 mixture_distribution_gate_probe: v3.38 uses additive CFG, not a
convex mixture; DecodeContext.mixture_gate absent -> not_implemented.
No mocks, no fallbacks, no overfit, no simplified replacement paths.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Full run of v331_blackbox_eval.py (no mocks, no fallbacks, no overfit,
no simplification) against v3.38 as SUT.
Original cases 4.1-4.19: PASS=15, FAIL=4
Cipher probes 4.20-4.26: PASS=1, NOT_IMPLEMENTED=2, FAIL=4
Version evolution (original-19 PASS count):
v3.31: 10, v3.32: 11, v3.33: 10, v3.34: 12,
v3.35: 13, v3.36: 12, v3.37: 14, v3.38: 15 (new best)
Key movements on original-19 from v3.37 to v3.38:
4.8 degeneration_quality FAIL -> PASS ([D-4] anti-collapse worked).
All other original cases unchanged vs v3.37.
Cipher probe outcomes:
4.20 rerank_stability_probe: FAIL. Jaccard=1.0 on both pairs (top-1
identical for both paraphrases) but Spearman=0.0 because top-5 had
only 1 element under upstream-gate filtering; the probe requires
Spearman>=0.5 and cannot be computed on a length-1 intersection.
Diagnostic of retrieval over-filtering, not instability.
4.21 decode_repetition_feedback_probe: PASS. avg_max_repeat=1.67,
no bigram repeats, no trigram locks. [D-4] working end-to-end.
4.22 functional_token_suppression_probe: FAIL at margin (0/3). v3.38
added L_functional_suppression as a training-time loss, but this
audit runs eval-only without the training loop, so the bridge has
never been optimized against this margin. Correctly detected.
4.23 keyword_specific_tail_slot_probe: FAIL. mean_intersection=0,
hit_ratio=0 over 4 memories. Same root cause: tail[1]'s rare-keyword
KL target has no gradient without training.
4.24 context_descriptor_cluster_probe: NOT_IMPLEMENTED (honest).
Spec wording requires per-MemEntry stored descriptor; v3.38 has
per-query aggregate only (MemLLM._compute_context_descriptor).
4.25 prefix_length_scaling_probe: FAIL. L_mem 8 -> 16 gave 3 -> 2
content starters (not the expected +1 improvement). Slot-norm band
passed (ratio=1.001). Signals that capacity is not currently the
bottleneck; bridge undertrained is.
4.26 mixture_distribution_gate_probe: NOT_IMPLEMENTED (honest).
v3.38 uses additive shaping, not convex mixture.
Interpretation: the structural pieces for [D-1]/[D-2] landed cleanly
and wire through the graph, but require training epochs to actually
converge the hinge margin and rare-keyword KL. [D-3] is per-query,
so 4.24 is a spec-wording mismatch, not a regression. [D-4] delivered
immediately because it is pure decode-time logic, hence the 4.8 PASS
and the 4.21 probe PASS.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
v3.38 Black-Box Audit — 26 cases, 1299.7s
Full run of
v331_blackbox_eval.py(external runner) againstscheme_b_v338.pyas SUT, under strict no mocks / no fallbacks / no overfit / no simplification policy. Original cases 4.1–4.19 untouched; new cipher-system probes 4.20–4.26 appended.Headline
not_implemented(honest, non-blocking)Version evolution (original-19 PASS count)
Movement on original-19 from v3.37 to v3.38
degeneration_qualityFAIL → PASS (new: [D-4] anti-collapse). All metrics comfortably within thresholds.Cipher probes
rerank_stability_probedecode_repetition_feedback_probefunctional_token_suppression_probeL_functional_suppressionis a training-time loss; this audit runs eval-only, so the bridge has not yet been optimized against it.keyword_specific_tail_slot_probecontext_descriptor_cluster_probemissing_api: MemEntry.context_descriptor field. v3.38 implements [D-3] as a per-query aggregate (MemLLM._compute_context_descriptor), not a per-memory stored field. Spec wording is per-memory, so reported as not_implemented per Section 5.prefix_length_scaling_probemixture_distribution_gate_probemissing_api: DecodeContext.mixture_gate. v3.38 shapes logits additively (content_bias + suppression_bias + CFG). No convex mixture gate exists.Interpretation
The structural pieces [D-3] and [D-4] delivered immediate eval-time wins (4.8 flipped, 4.21 passes, 4.10 drift audit stable). [D-1] and [D-2] are training-time losses — the wiring is in place (verified by
python -csmoke checks on loss gradient flow, and by the fact that the probes compute without error), but the audit runner executes eval-only against the fresh initialization. Under the strict no-training policy, their effect is correctly invisible to the probes.This cleanly separates two audit signals:
Artifacts
reports/v338_blackbox/report.json— full structured results (8.6k lines)reports/v338_blackbox/report.md— per-case markdown (includes all 26)reports/v338_blackbox/runner.log— raw stdout from the 1299.7 s run