feat(ENG-335): multi-head ONNX + temperature scaling + API contract fixes (v0.7) #63
…ixes
Adds opt-in support for dual-head ([batch, 2]) ONNX classifiers, post-hoc
temperature scaling for calibrated probability semantics, and the multi-head
decision rule (block iff main >= mainThreshold AND aux < auxThreshold). All
behind opt-in config — single-head consumption stays the back-compat default.
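The decision rule above can be sketched as a pure function. This is only an illustration, not the library's implementation: `PairScore`, `shouldBlockChunk`, and `blocksAny` are hypothetical names, and the "any chunk blocks" aggregation is an assumption. Only the `mainThreshold` / `auxThreshold` field names come from the config described here.

```typescript
// Hypothetical sketch of the rule: block iff main >= mainThreshold AND aux < auxThreshold.
interface MultiheadThresholds {
  mainThreshold: number;
  auxThreshold: number;
}

interface PairScore {
  main: number; // main-head probability
  aux: number; // aux-head probability; a high value vetoes the block
}

function shouldBlockChunk(score: PairScore, t: MultiheadThresholds): boolean {
  return score.main >= t.mainThreshold && score.aux < t.auxThreshold;
}

// Assumption: content is blocked when at least one chunk satisfies the rule.
function blocksAny(chunks: PairScore[], t: MultiheadThresholds): boolean {
  return chunks.some((c) => shouldBlockChunk(c, t));
}
```

A chunk with a high main score and a high aux score is therefore allowed: the aux head acts as a veto, not as a second vote to block.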
API additions:
- tier2Config.multihead?: { mainThreshold, auxThreshold }
- tier2Config.temperatureT?: number (raw sigmoid when 1.0)
- OnnxClassifier.classifyPair / classifyBatchPair (main + aux)
- Tier2Classifier.classifyChunksBatchPair / isMultihead / getMultiheadConfig
- Tier2Classifier auto-loads calibration defaults from classifier_config.json
- DefenseResult.tier2AuxScore, tier2MultiheadBlocked
- DefenseResult.tier2RawScore (debug; see Bug 3 below)
- getDefaultModelPath exported
Three latent API contract bugs uncovered during calibration are fixed here:
Bug 1 — tier2Config.highRiskThreshold overrides never propagated to the
block gate. Visible only when calibrated thresholds land between the
override and the un-propagated default (0.8). Latent since multi-head
support was added. Fix: PromptDefense constructor now syncs threshold
overrides into this.config.tier2.* alongside the Tier2Classifier copy.
Bug 2 — DENSITY_SUB_THRESHOLD was hardcoded in raw-sigmoid space. Under
temperature scaling, scores compress toward 0.5 and the literal 0.75
cutoff stops counting "high" events, causing density damping to silently
under-fire. Fix: rescale in logit space — sigmoid(log(3) / T). T=1 is a
no-op; T=2.41 yields ~0.612.
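The rescaling in Bug 2 can be checked with a few lines of arithmetic. The sketch below assumes nothing beyond the formula quoted above; `densitySubThreshold` is a hypothetical helper name. Since log(3) is the logit of 0.75, dividing it by T moves the raw-space cutoff into calibrated space.

```typescript
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// log(3) is the logit of 0.75, so sigmoid(log(3) / T) is the 0.75 cutoff
// expressed in temperature-scaled space.
function densitySubThreshold(temperatureT: number): number {
  return sigmoid(Math.log(3) / temperatureT);
}

console.log(densitySubThreshold(1).toFixed(3)); // 0.750
console.log(densitySubThreshold(2.41).toFixed(3)); // 0.612
```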
Bug 3 — tier2Score returned the raw max-chunk main, but the block gate
used tier2EffectiveScore (post-density). Operators comparing
tier2Score >= highRiskThreshold got a different answer than
result.allowed === false. Fix: tier2Score now reports the effective score
that drove the decision; the pre-density max-chunk main is surfaced as
tier2RawScore for forensics. Under multi-head aux veto, tier2Score is
undefined (no block-driving score) — operators should check
tier2MultiheadBlocked when they need the rule's verdict explicitly.
229 tests pass. Default model path still points at minilm-full-aug — the v5
multi-head model with calibrated defaults lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…metadata
Replaces the legacy minilm-full-aug ONNX model with minilm-multihead-v5, a
dual-head MiniLM-L6 fine-tuned with code/docs/git aux supervision. Single-head
consumption by default — no aux behavior change unless callers opt into
tier2Config.multihead. Calibrated probability semantics by default via the
model's classifier_config.json:calibration block.
Calibration metadata (model self-describes):
temperatureT: 2.41
highRiskThreshold: 0.64 (math-equivalent to raw 0.8 at T=2.41)
ece: 0.09
fitted_on: labeled plugin events 2026-05-13
Tier2Classifier auto-loads these defaults at construction; user-provided
tier2Config still wins. Models without a calibration block (custom paths
pointing at non-v5 models) fall back to library defaults (T=1, threshold=0.8).
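The "math-equivalent to raw 0.8 at T=2.41" note in the calibration metadata follows from `sigmoid(logit / T)`. A minimal sketch, with `toCalibrated` as a hypothetical helper name that is not part of the library:

```typescript
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function logit(p: number): number {
  return Math.log(p / (1 - p));
}

// Map a raw-sigmoid probability (or threshold) into temperature-scaled space.
function toCalibrated(rawProbability: number, temperatureT: number): number {
  return sigmoid(logit(rawProbability) / temperatureT);
}

console.log(toCalibrated(0.8, 2.41).toFixed(2)); // 0.64
console.log(toCalibrated(0.8, 1).toFixed(2)); // 0.80
```

The same mapping explains the migration note below: any raw score above 0.5 moves down toward 0.5 and any score below 0.5 moves up, so hardcoded comparisons tuned on raw scores need re-checking.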
Migration:
- Callers using the default config now receive calibrated probabilities.
tier2Score values for the same content will shift toward 0.5 (less
saturated). Re-check any hardcoded threshold comparisons.
- Callers explicitly setting tier2Config.highRiskThreshold see no semantic
change other than Bug 1 (previous commit) finally honoring overrides.
- Callers explicitly setting onnxModelPath: ".../minilm-full-aug" break —
that directory is no longer shipped. v5 ships as the only bundled model.
Build / packaging:
- scripts/copy-models.cjs replaces an inline package.json one-liner.
MODEL_DIRS lists the bundled variants; add new models here.
- npm pack size: 18.5 MB (was projected 90+ MB with all session variants).
- dist size: 23 MB (was 100 MB with all variants).
Pruning:
- Removed legacy minilm-full-aug binary.
- Removed v3, v4c, v6, v31 dev variants — kept in classifier-eval workspace
and on Modal volume for benchmarking; not in the npm tarball.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ead thresholds
Three review-feedback fixes on the v0.7 branch:
1. tier2Score under aux veto. Previously set to undefined when the multi-head
rule rescued content (rationale: "no block-driving score"). That preserved
the strict invariant `tier2Score >= highRiskThreshold ⇔ allowed === false`
but produced incoherent operator telemetry: a high `tier2RawScore` with
`tier2Score: undefined` is hard to reason about on dashboards.
New behavior: under aux veto, `tier2Score = 0`, so the operator triple
(`tier2Score`, `riskLevel`, `allowed`) tells one coherent story: zero / low /
true. The model's actual main signal is preserved on `tier2RawScore`, and
`tier2MultiheadBlocked: false` + `tier2AuxScore` give rule-level context for
anyone debugging the decision.
Combined with the riskLevel-from-tier2EffectiveScore derivation, the operator
invariant `tier2Score >= highRiskThreshold ⇔ allowed === false` holds in
single-head and multi-head-rule-fired modes; multi-head aux-veto is the third
branch and now reads consistently as "zero contribution".
2. MultiheadConfig JSDoc. The field-level docstrings claimed `Default: 0.5`
and `Default: 0.3`, which was misleading because both fields are required (no
library default) and (0.5, 0.3) is the operating point that produced our
documented AS regression. Rewrote the interface docblock to point at the
FP-benchmark-validated `(0.5, 0.8)` raw / `(0.5, 0.64)` calibrated default,
with a reference to evals/RESULTS.md for the threshold sweep.
3. tier2Score JSDoc on DefenseResult. Rewritten to enumerate the three modes
(single-head, multi-head rule fired, multi-head aux veto) with the exact
value semantics for each.
Also: trimmed over-commenting in specs/tier2-multihead.spec.ts (~95 lines
removed). Kept the non-obvious context (threshold-arithmetic notes, the "2/6
ticket variants" operational fact); removed the line-by-line narrative.
290 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…null assertion
Three small follow-ups to the review feedback:
1. readCalibrationDefaults now distinguishes ENOENT (silent: legacy models
without classifier_config.json) from other read failures and JSON parse
errors (warn). A typo in a shipped calibration block now surfaces at
construction time instead of silently falling back to library defaults.
2. OnnxClassifier throws on non-positive or non-finite temperatureT. T must
be in (0, ∞); zero, negative, NaN, and Infinity now produce a clear error
rather than being silently coerced to 1. Calibration with invalid T is a
programming error, not graceful-degradation territory.
3. Replaced `this.tier2Classifier!` non-null assertion at prompt-defense.ts
with a captured local inside the existing narrowed block. Lint is now
warning-free; biome check passes cleanly.
+1 test for temperatureT validation. 291 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
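The temperatureT guard described in this commit might look roughly like the following. This is a sketch under stated assumptions: `assertValidTemperature` is a hypothetical name, and the error type and message are illustrative, not the wording `OnnxClassifier` actually uses.

```typescript
// Hypothetical guard: T must lie in (0, ∞). Zero, negatives, NaN, and
// Infinity throw instead of being coerced to 1.
function assertValidTemperature(temperatureT: number): number {
  if (typeof temperatureT !== "number" || !isFinite(temperatureT) || temperatureT <= 0) {
    throw new RangeError(
      "temperatureT must be a finite number greater than 0, got " + String(temperatureT),
    );
  }
  return temperatureT;
}
```

Throwing at construction keeps a bad calibration value from silently producing raw-sigmoid scores that are then compared against calibrated thresholds.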
The bundled v5 model auto-loads its fitted T from classifier_config.json, so
most callers should never set this field. Trim the JSDoc on
Tier2ClassifierConfig.temperatureT and remove temperature references from
MultiheadConfig threshold docs so the public surface reads as a single,
simple knob rather than asking consumers to reason about model internals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loaded thresholds apply
Tier2Classifier merges hardcoded defaults < model classifier_config.json <
caller `tier2Config`. The previous Bug 1 fix only synced the caller-override
path; thresholds auto-loaded from a model's calibration block (the new v0.7
default) reached Tier2Classifier's internal copy but not the gate at
`this.config.tier2.highRiskThreshold`.
Result: bundled v5 ships `highRiskThreshold: 0.64`, Tier2Classifier sees
0.64, gate stays at the library default 0.8. A calibrated score of ~0.75 on
an attack lands `riskLevel: "high"` with `allowed: true`, exactly the
incoherent triple Bug 1 was supposed to eliminate. Discovered when the
AgentShield score dropped from 86.7 to 80.9 on the v0.7 candidate: 36 attacks
flipped from block to allow on the model-auto-load path.
Fix: drop the pre-construction sync (Tier2Classifier already applies caller
overrides via its 3-tier merge) and read back from
`tier2Classifier.getConfig()` after construction. The readback is
authoritative regardless of whether the threshold came from library defaults,
model auto-load, or caller override: a single source of truth for the gate.
Regression test: model-level calibration auto-load must propagate to the
gate, demonstrated against the bundled v5 model with no caller config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR upgrades Tier 2’s ONNX pipeline to support multi-head models (main+aux), adds temperature scaling for calibrated probabilities, and updates the public API/telemetry to reflect “effective” (decision-driving) scores while preserving raw scores for forensics.
Changes:
- Added multi-head inference support (`[batch, 2]`) and a configurable decision rule using main/aux thresholds.
- Introduced temperature scaling (`sigmoid(logit / T)`) and auto-loading of model calibration defaults from `classifier_config.json`.
- Adjusted PromptDefense result fields/semantics (`tier2Score` as effective score; added `tier2RawScore`, `tier2AuxScore`, `tier2MultiheadBlocked`) and updated bundled default model + build packaging.
Reviewed changes
Copilot reviewed 10 out of 15 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/types.ts | Extends Tier2Result to optionally report aux score for multi-head models. |
| src/core/prompt-defense.ts | Implements multi-head rule + effective/raw score reporting and threshold propagation logic in the Tier 2 gate. |
| src/classifiers/tier2-classifier.ts | Loads calibration defaults from classifier_config.json; adds multi-head config plumbing and temperature accessor. |
| src/classifiers/onnx-classifier.ts | Adds output-mode detection (single vs multi), main+aux APIs, and temperature scaling in score computation. |
| src/classifiers/models/minilm-multihead-v5/tokenizer_config.json | Bundled model asset for the new default model. |
| src/classifiers/models/minilm-multihead-v5/config.json | Bundled model asset for the new default model. |
| src/classifiers/models/minilm-multihead-v5/classifier_config.json | Bundled calibration defaults (temperatureT, thresholds) consumed at runtime. |
| src/classifiers/models/minilm-full-aug/.gitkeep | Placeholder intended to prevent silent loads of the legacy model path. |
| specs/tier2-multihead.spec.ts | Adds tests for multi-head behavior, calibration loading, and the three reported API contract bugs. |
| specs/tier2-classifier.spec.ts | Updates expectations around default thresholds now coming from model calibration. |
| specs/onnx-classifier.spec.ts | Updates bundled model path reference to the new default model directory. |
| scripts/copy-models.cjs | Replaces inline copy logic with a dedicated post-build asset mirroring script. |
| package.json | Updates copy-models script to use scripts/copy-models.cjs. |
…ig, tighten invariant docs
Three review-driven fixes:
1. Multihead config against a single-head model used to silently disable
Tier 2: `classifyChunksBatchPair` returned `{ main, aux: null }` rows,
the rule's "no chunk blocks" path triggered aux-veto, and
tier2EffectiveScore collapsed to 0. Detect every-aux-null after the
batched call and set `tier2SkipReason` so the misconfig surfaces.
2. JSDoc on `tier2Score` claimed the invariant
`tier2Score >= highRiskThreshold ⇔ allowed === false` held
unconditionally. It doesn't — `blockHighRisk: false` keeps
`allowed: true` regardless, and Tier 1 detections can drive
`allowed: false` independently. Reword to state the conditions.
3. Inline comment on the return claimed the multihead veto sets
`tier2Score` to undefined; the implementation sets 0. Update the
comment to match.
Also fix the header comment in scripts/copy-models.cjs: it claimed the
script writes to dist/classifiers/models/<name> but it writes to
dist/models/<name>.
Adds a regression spec for #1 using `vi.spyOn` against
`Tier2Classifier.prototype.classifyChunksBatchPair`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
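The every-aux-null check from fix 1 above can be sketched as follows. The row shape `{ main, aux: null }` comes from the commit message; `detectMultiheadMisconfig` and the reason string are hypothetical placeholders, not the value the library writes to `tier2SkipReason`.

```typescript
interface PairRow {
  main: number;
  aux: number | null; // null when the model has no aux head
}

// Returns a skip reason when multihead is configured but every row lacks an
// aux score (i.e. the model is single-head); otherwise undefined.
// The reason text is a placeholder for illustration.
function detectMultiheadMisconfig(rows: PairRow[]): string | undefined {
  const everyAuxNull = rows.length > 0 && rows.every((r) => r.aux === null);
  return everyAuxNull
    ? "multihead configured but model returned no aux scores"
    : undefined;
}
```

Checking after the batched call, rather than per chunk, distinguishes a genuine aux veto (aux present and high) from a model that cannot produce aux scores at all.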
readCalibrationDefaults() does a sync readFileSync + JSON.parse on every
Tier2Classifier construction. When callers create a PromptDefense per
request, that's ~50-200µs of blocked event loop per call — ~100ms/s at
1k req/s, ~1s/s (one saturated core) at 10k req/s — on a file whose
contents are bundled at build time and never change at runtime.
Cache the result in a module-level Map keyed by modelDir, mirroring the
_sessionCache pattern already used for ONNX sessions. First call on a
modelDir reads from disk; every subsequent call returns from memory.
`null` is a valid cached value ("no calibration block for this model"),
so probe with `.has()` rather than `=== undefined`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
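A minimal sketch of the caching pattern from this commit, assuming a loader callback in place of the real `readFileSync` + `JSON.parse` path. `readCalibrationCached`, `CalibrationDefaults`, and the `load` parameter are hypothetical names introduced for illustration.

```typescript
interface CalibrationDefaults {
  temperatureT?: number;
  highRiskThreshold?: number;
}

// Module-level cache keyed by model directory. `null` is a legitimate cached
// value meaning "this model has no calibration block".
const calibrationCache = new Map<string, CalibrationDefaults | null>();

function readCalibrationCached(
  modelDir: string,
  load: (dir: string) => CalibrationDefaults | null,
): CalibrationDefaults | null {
  // Probe with .has(): a cached null must count as a hit, not trigger a reload.
  if (calibrationCache.has(modelDir)) {
    return calibrationCache.get(modelDir) ?? null;
  }
  const loaded = load(modelDir);
  calibrationCache.set(modelDir, loaded);
  return loaded;
}
```

With this shape, a legacy model directory without a calibration block is read from disk once and then served from memory like any other entry.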
|
AgentShield re-run on … Matches the v5 calibration-findings baseline; threshold-sync fix (…
Run config: defaults (…
Branch is ready to merge.
willleeney
left a comment
Review — Multi-head ONNX + Temperature Scaling (v0.7)
Solid work. The core logic and invariants are correct, the three bug fixes are real and well-targeted, and the test coverage is strong (355 new lines in tier2-multihead.spec.ts). AgentShield benchmark confirms parity at 86.7.
Main actionable items: #1 (undefined clobber risk on config merge) and #5 (overly loose threshold assertion). The rest are cleanup / readability.
Six inline comments below.
…lt assertion
Two PR-review fixes (#63 threads 1 and 5):
1. Tier2Classifier's caller-config spread used to overwrite model-loaded
calibration defaults when the caller passed explicitly-undefined keys. The
common pattern `{ temperatureT: settings.t ?? undefined }` (building config
conditionally from optional settings) would silently flow `undefined` into
OnnxClassifier, bypass its positive-finite guard, and leave the classifier at
T=1 without warning. Filter undefined keys out of the partial before
spreading so model defaults survive.
2. The `.getConfig()` regression test for the model-auto-loaded
highRiskThreshold was loosened to `> 0 && <= 1` in an earlier commit, which
passes for any positive value, including the library default 0.8 that the
auto-load is supposed to override. An accidentally-removed or malformed
calibration block would slip through silently. Replace with
`toBeCloseTo(0.64, 2)` to assert the exact shipped value.
Adds a regression spec covering the clobber path: constructing with
`{ temperatureT: undefined, highRiskThreshold: undefined }` must preserve
v5's calibrated defaults (0.64, 2.41).
294 tests pass. Lint clean. Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
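The clobber path in fix 1 comes from object spread copying explicitly-undefined keys. The sketch below shows the three-tier precedence (library defaults < model defaults < caller config) while skipping undefined values. It uses per-field checks rather than the filter-then-spread approach the commit describes, and `resolveTier2Settings` plus both interface names are hypothetical.

```typescript
interface Tier2Overrides {
  temperatureT?: number | undefined;
  highRiskThreshold?: number | undefined;
}

interface Tier2Resolved {
  temperatureT: number;
  highRiskThreshold: number;
}

// Later layers win, but an explicitly-undefined key never overwrites an
// earlier layer's value.
function resolveTier2Settings(
  libraryDefaults: Tier2Resolved,
  modelDefaults: Tier2Overrides,
  callerConfig: Tier2Overrides,
): Tier2Resolved {
  const resolved: Tier2Resolved = { ...libraryDefaults };
  for (const layer of [modelDefaults, callerConfig]) {
    if (layer.temperatureT !== undefined) resolved.temperatureT = layer.temperatureT;
    if (layer.highRiskThreshold !== undefined) resolved.highRiskThreshold = layer.highRiskThreshold;
  }
  return resolved;
}

const library = { temperatureT: 1, highRiskThreshold: 0.8 };
const model = { temperatureT: 2.41, highRiskThreshold: 0.64 };

// A plain spread lets the undefined key clobber the model default:
const clobbered = { ...library, ...model, ...{ temperatureT: undefined } };
console.log(clobbered.temperatureT); // undefined

// The guarded merge keeps it:
console.log(resolveTier2Settings(library, model, { temperatureT: undefined }).temperatureT); // 2.41
```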
Address PR #63 review nits 2-4 and 6:
- Density damping was computed unconditionally then immediately overwritten
under multi-head. Move it into the single-head `else if` branch so it only
runs when the result actually drives a decision.
- `tier2AuxScore` was assigned twice on the multi-head rule-fired path (first
to the global-max-main chunk's aux, then overwritten with the
rule-triggering chunk's aux). Rename the eager target to a local
`auxOfMaxMain` and write `tier2AuxScore` exactly once in the branch that's
keeping it. End-state semantics preserved across all three paths
(single-head undefined / rule fired = mhTopBlockAux / aux veto =
auxOfMaxMain).
- Drop the unnecessary `getTemperature?.()` optional chain; the method is
always defined on Tier2Classifier.
- Add the missing trailing newline to v5's classifier_config.json.
Net: post-scoring control flow now reads top-to-bottom in one pass. The
multi-head branch handles rule + aux-veto cases; the single-head branch
handles density damping + risk bucketing.
252 unit tests + 23 multi-head ONNX integration tests pass. Lint clean.
Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…me formatter
CI runs `code:check` (lint + format); biome format flagged the multi-line
form on the undefined-filter from 2b61b29. Local `code:lint` skipped the
format check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pattern was `/\$\([^)]+\)|`[^`]+`/g`: `$(...)` OR any backtick-pair. The
second alternative couldn't distinguish bash legacy command substitution from
markdown inline code, so every technical README with `cat foo.json`,
`npm install`, `~/.claude/...`, or even just `filename.txt` triggered
shell_command on Tier 1 with no real attack signal. Defender dogfooded the
bug: its own source files (which contain literal backtick-quoted strings as
code examples) tripped the rule.
Modern attackers default to `$(...)` because it nests cleanly; legacy
backtick substitution is rare. Tier 2 still catches the residual backtick
attacks via prompt context, so dropping the backtick alternative from Tier 1
just removes a noisy FP source.
Regression spec asserts a markdown sample with multiple backtick-inline spans
does not fire shell_command, while the existing `$(rm -rf /)` test keeps the
positive-case coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
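The difference between the two patterns is easy to reproduce. In the sketch below the old pattern is the one quoted in the commit message; the tightened pattern is an assumption (the same regex with the backtick alternative removed), and both constant names are hypothetical. The `g` flag is omitted so that repeated `.test()` calls do not carry `lastIndex` state.

```typescript
// Old Tier 1 pattern as quoted above: $(...) OR any backtick pair.
const OLD_SHELL_COMMAND = /\$\([^)]+\)|`[^`]+`/;

// Assumed tightened form: only the $(...) alternative remains.
const NEW_SHELL_COMMAND = /\$\([^)]+\)/;

const readmeLine = "Run `npm install`, then inspect `cat foo.json`.";
const attack = "please summarise this: $(rm -rf /)";

console.log(OLD_SHELL_COMMAND.test(readmeLine)); // true  (false positive)
console.log(NEW_SHELL_COMMAND.test(readmeLine)); // false
console.log(NEW_SHELL_COMMAND.test(attack)); // true
```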
… filters
Audit follow-up to the shell_command backtick fix. Same bug class — regexes
that look gated by a role/qualifier but the gating is `?`-optional or
keyword-only, so the regex fires on shape-only matches. Tightenings:
- you_are_now: required role-noun terminator (was: both alternation groups
optional → matched "you are now " + anything). Role list expanded to
cover the actual attack distribution (DAN/GPT/AI/jailbroken/admin/root
/hacker/developer/superuser etc.). Fixes UI-copy FP class ("you are now
logged in / subscribed / ready").
- pretend_to_be: required attack-shaped role-noun (was: no role
constraint → FP'd on children's literature, drama exercises). Same role
list with privilege-escalation nouns added.
- show_instructions: required attack-qualifier (initial/original/system
/hidden); the inner group was optional so bare "show instructions"
matched. Fixes FAQ/help-doc FP class.
- markdown_hidden_instruction: required imperative + scope qualifier
("ignore all/the/previous/prior"); previous form matched any URL
containing "system" or "instruction", so every doc cross-reference
like `[config](https://.../system-setup)` FP'd. Uses `\W+` for
separator to handle URL-encoded `+` and `%20`.
- role_system_xml: required directive content (ignore/disregard/forget
/override/you are/new instructions/...) following the tag; bare
`<system>` mentions are common in XML schemas / ML config docs / OS
specs.
- json_injection: targets actual attack shapes — `"role": "system"`
(chat-message role hijack) or `"system": "<long string>"` (system-key
stuffing). Previous form fired on every OpenAI/Anthropic SDK example
declaring those JSON keys.
- confusable_homoglyphs: Cyrillic block now requires mixed-script
adjacency to Latin letter (the actual attack: `аdmin` with Cyrillic
'а'). Pure Russian text no longer FPs. Cherokee + Phonetic Extensions
blocks remain aggressive — those are essentially never in real
customer content.
Two pre-existing tests updated because they relied on the over-broad
patterns:
- "should return medium risk" — fixture now uses "pretend to be a hacker"
instead of "pretend to be a helpful assistant" (the latter is a benign
roleplay request that only triggered the old over-broad regex).
- "should detect markdown link with hidden instruction" — fixture URL
now uses the attack-shape `?p=ignore+all+previous+instructions` instead
of just `ignore-instructions.com`.
14 new regression specs pin both directions of each fix (FP-class fixture
does NOT fire / attack-shape fixture DOES fire). 268 unit tests pass,
23/23 multihead ONNX specs pass, biome check clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
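The mixed-script adjacency idea behind the confusable_homoglyphs change can be sketched with a small regex. This is a hypothetical pattern written for illustration, not the one in the repository: it flags a Cyrillic letter (U+0400 to U+04FF) that directly touches an ASCII Latin letter, and ignores text that is Cyrillic throughout.

```typescript
// Hypothetical pattern: Cyrillic letter immediately next to a Latin letter.
const MIXED_CYRILLIC_LATIN = /[A-Za-z][\u0400-\u04FF]|[\u0400-\u04FF][A-Za-z]/;

const spoofedAdmin = "\u0430dmin"; // Cyrillic 'а' followed by Latin "dmin"
const russianText = "Привет, мир";
const plainAdmin = "admin";

console.log(MIXED_CYRILLIC_LATIN.test(spoofedAdmin)); // true
console.log(MIXED_CYRILLIC_LATIN.test(russianText)); // false
console.log(MIXED_CYRILLIC_LATIN.test(plainAdmin)); // false
```

Because whitespace and punctuation separate the scripts in ordinary bilingual text, a sentence such as "Привет John" does not match either.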
Summary
- Multi-head ONNX support (`[batch, 2]` main+aux output) with a documented decision rule: `block iff main >= mainThreshold AND aux < auxThreshold`.
- Temperature scaling, `score = sigmoid(logit / T)`, so bundled models can ship pre-calibrated and consumers get well-behaved probabilities without retuning their own thresholds.
- Bundled default model changed from `minilm-full-aug` to `minilm-multihead-v5`. The new model ships with a `classifier_config.json` calibration block (`temperatureT: 2.41`, `highRiskThreshold: 0.64`) that auto-loads at construction; library hardcoded defaults < model defaults < caller config.
- `highRiskThreshold` set on `Tier2ClassifierConfig` (caller-provided or model-auto-loaded) wasn't propagated to the gate's threshold copy, so it silently fell back to the framework default.
- The `DENSITY_SUB_THRESHOLD` constant in the density adjustment was hardcoded against raw scores; under temperature scaling it would never trigger. Now rescaled via `sigmoid(log(3)/T)`.
- `tier2Score` previously reported the raw max-chunk score even when density adjustment or aux-veto changed the actual decision. It now reports the effective score that determined `riskLevel` / `allowed`; the raw pre-adjustment value is preserved on the new `tier2RawScore` field for forensics.
- `tier2Score >= highRiskThreshold ⇔ result.allowed === false`. Under aux veto, `tier2Score = 0` so the public triple (`tier2Score`, `riskLevel`, `allowed`) tells one coherent story; the un-vetoed main is on `tier2RawScore`.
AgentShield results
Target baseline (prior v5 single-head): Final 86.7.
Expected on this PR with calibrated v5 multi-head bundled by default:
Re-run pending against `feat/tier2-v0.7` HEAD; numbers will be posted as a comment before merge.
What's in the API
- `Tier2ClassifierConfig.multihead?: { mainThreshold, auxThreshold }`: opt-in multi-head decision rule. Both fields are required (no library default) because the right operating point is model- and traffic-specific. For the bundled model, FP-benchmark validation gives `{ 0.5, 0.64 }`.
- `Tier2ClassifierConfig.temperatureT?: number`: advanced; override only when shipping a custom ONNX model. The bundled model auto-loads its fitted T.
- `tier2RawScore` field on `DefenseResult`: forensic value when density or aux-veto rewrites the effective score.
Notes
- … `onnxModelPath` (or no config at all, using the bundled model) gets strict main-only blocking with validated defaults.
- `tier2Score` semantics changed from "max raw chunk score" to "effective score backing the decision." Direct numeric comparisons that bypass `riskLevel` / `allowed` should re-read against `tier2RawScore`.
- … `src/classifiers/models/minilm-multihead-v5/`. The legacy `minilm-full-aug` directory is replaced with a `.gitkeep` placeholder so the old path errors loudly rather than loading silently.
Test plan
- `npx vitest run`: 292 tests pass (22 in `tier2-multihead.spec.ts`).
- `npx biome check src/`: clean.
- … `tsdown`); `npm pack` dry-run at 18.5 MB.
- … `0803063`: Final 86.7 (matches v5 baseline). See comment for category breakdown.
- … `0.7.0` on `feat:` commits.
- … `@stackone/defender` to `^0.7.0` in the plugin repo and remove the dual-model shadow eval.
🤖 Generated with Claude Code