Skip to content

feat(ENG-335): multi-head ONNX + temperature scaling + API contract fixes (v0.7)#63

Merged
hiskudin merged 14 commits into
mainfrom
feat/tier2-v0.7
May 14, 2026
Merged

feat(ENG-335): multi-head ONNX + temperature scaling + API contract fixes (v0.7)#63
hiskudin merged 14 commits into
mainfrom
feat/tier2-v0.7

Conversation

@hiskudin

@hiskudin hiskudin commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds multi-head ONNX classifier support ([batch, 2] main+aux output) with a documented decision rule: block iff main >= mainThreshold AND aux < auxThreshold.
  • Adds post-hoc temperature scaling — score = sigmoid(logit / T) — so bundled models can ship pre-calibrated and consumers get well-behaved probabilities without retuning their own thresholds.
  • Swaps the bundled default from minilm-full-aug to minilm-multihead-v5. The new model ships with a classifier_config.json calibration block (temperatureT: 2.41, highRiskThreshold: 0.64) that auto-loads at construction; library hardcoded defaults < model defaults < caller config.
  • Fixes 3 latent API contract bugs surfaced by 0.7 work:
    1. highRiskThreshold set on Tier2ClassifierConfig (caller-provided or model-auto-loaded) wasn't propagated to the gate's threshold copy, so it silently fell back to the framework default.
    2. The DENSITY_SUB_THRESHOLD constant in the density adjustment was hardcoded against raw scores; under temperature scaling it would never trigger. Now rescaled via sigmoid(log(3)/T).
    3. tier2Score previously reported the raw max-chunk score even when density adjustment or aux-veto changed the actual decision. It now reports the effective score that determined riskLevel/allowed; the raw pre-adjustment value is preserved on the new tier2RawScore field for forensics.
  • Establishes the operator invariant: tier2Score >= highRiskThreshold ⇔ result.allowed === false. Under aux veto, tier2Score = 0 so the public triple (tier2Score, riskLevel, allowed) tells one coherent story; the un-vetoed main is on tier2RawScore.

AgentShield results

Target baseline (prior v5 single-head): Final 86.7.

Expected on this PR with calibrated v5 multi-head bundled by default:

Final ~86.7   (matching v5 calibration-findings baseline)
  PI · Jail · DE · TA · OR · MA · Prov

Re-run pending against feat/tier2-v0.7 HEAD; numbers will be posted as a comment before merge.

What's in the API

  • Tier2ClassifierConfig.multihead?: { mainThreshold, auxThreshold } — opt-in multi-head decision rule. Both fields are required (no library default) because the right operating point is model- and traffic-specific. For the bundled model, FP-benchmark validation gives { 0.5, 0.64 }.
  • Tier2ClassifierConfig.temperatureT?: number — advanced; override only when shipping a custom ONNX model. The bundled model auto-loads its fitted T.
  • New tier2RawScore field on DefenseResult — forensic value when density or aux-veto rewrites the effective score.

Notes

  • Default behavior is unchanged for single-head consumers: passing only onnxModelPath (or no config at all, using the bundled model) gets strict main-only blocking with validated defaults.
  • Breaking-ish: tier2Score semantics changed from "max raw chunk score" to "effective score backing the decision." Direct numeric comparisons that bypass riskLevel/allowed should re-read against tier2RawScore.
  • Bundled artefacts: v5 ONNX (~22 MB) lands at src/classifiers/models/minilm-multihead-v5/. The legacy minilm-full-aug directory is replaced with a .gitkeep placeholder so the old path errors loudly rather than loading silently.

Test plan

  • npx vitest run — 292 tests pass (22 in tier2-multihead.spec.ts).
  • npx biome check src/ — clean.
  • Build clean (tsdown); npm pack dry-run at 18.5 MB.
  • AgentShield re-run on 0803063: Final 86.7 (matches v5 baseline). See comment for category breakdown.
  • After merge: release-please bumps to 0.7.0 on feat: commits.
  • After publish: bump @stackone/defender to ^0.7.0 in the plugin repo and remove the dual-model shadow eval.

🤖 Generated with Claude Code

hiskudin and others added 6 commits May 13, 2026 14:56
…ixes

Adds opt-in support for dual-head ([batch, 2]) ONNX classifiers, post-hoc
temperature scaling for calibrated probability semantics, and the multi-head
decision rule (block iff main >= mainThreshold AND aux < auxThreshold). All
behind opt-in config — single-head consumption stays the back-compat default.

API additions:
  - tier2Config.multihead?: { mainThreshold, auxThreshold }
  - tier2Config.temperatureT?: number  (raw sigmoid when 1.0)
  - OnnxClassifier.classifyPair / classifyBatchPair  (main + aux)
  - Tier2Classifier.classifyChunksBatchPair / isMultihead / getMultiheadConfig
  - Tier2Classifier auto-loads calibration defaults from classifier_config.json
  - DefenseResult.tier2AuxScore, tier2MultiheadBlocked
  - DefenseResult.tier2RawScore (debug; see Bug 3 below)
  - getDefaultModelPath exported

Three latent API contract bugs uncovered during calibration are fixed here:

  Bug 1 — tier2Config.highRiskThreshold overrides never propagated to the
  block gate. Visible only when calibrated thresholds land between the
  override and the un-propagated default (0.8). Latent since multi-head
  support was added. Fix: PromptDefense constructor now syncs threshold
  overrides into this.config.tier2.* alongside the Tier2Classifier copy.

  Bug 2 — DENSITY_SUB_THRESHOLD was hardcoded in raw-sigmoid space. Under
  temperature scaling, scores compress toward 0.5 and the literal 0.75
  cutoff stops counting "high" events, causing density damping to silently
  under-fire. Fix: rescale in logit space — sigmoid(log(3) / T). T=1 is a
  no-op; T=2.41 yields ~0.612.

  Bug 3 — tier2Score returned the raw max-chunk main, but the block gate
  used tier2EffectiveScore (post-density). Operators comparing
  tier2Score >= highRiskThreshold got a different answer than
  result.allowed === false. Fix: tier2Score now reports the effective score
  that drove the decision; the pre-density max-chunk main is surfaced as
  tier2RawScore for forensics. Under multi-head aux veto, tier2Score is
  undefined (no block-driving score) — operators should check
  tier2MultiheadBlocked when they need the rule's verdict explicitly.

229 tests pass. Default model path still points at minilm-full-aug — the v5
multi-head model with calibrated defaults lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…metadata

Replaces the legacy minilm-full-aug ONNX model with minilm-multihead-v5, a
dual-head MiniLM-L6 fine-tuned with code/docs/git aux supervision. Single-head
consumption by default — no aux behavior change unless callers opt into
tier2Config.multihead. Calibrated probability semantics by default via the
model's classifier_config.json:calibration block.

Calibration metadata (model self-describes):
  temperatureT: 2.41
  highRiskThreshold: 0.64  (math-equivalent to raw 0.8 at T=2.41)
  ece: 0.09
  fitted_on: labeled plugin events 2026-05-13

Tier2Classifier auto-loads these defaults at construction; user-provided
tier2Config still wins. Models without a calibration block (custom paths
pointing at non-v5 models) fall back to library defaults (T=1, threshold=0.8).

Migration:
  - Callers using the default config now receive calibrated probabilities.
    tier2Score values for the same content will shift toward 0.5 (less
    saturated). Re-check any hardcoded threshold comparisons.
  - Callers explicitly setting tier2Config.highRiskThreshold see no semantic
    change other than Bug 1 (previous commit) finally honoring overrides.
  - Callers explicitly setting onnxModelPath: ".../minilm-full-aug" break —
    that directory is no longer shipped. v5 ships as the only bundled model.

Build / packaging:
  - scripts/copy-models.cjs replaces an inline package.json one-liner.
    MODEL_DIRS lists the bundled variants; add new models here.
  - npm pack size: 18.5 MB (was projected 90+ MB with all session variants).
  - dist size: 23 MB (was 100 MB with all variants).

Pruning:
  - Removed legacy minilm-full-aug binary.
  - Removed v3, v4c, v6, v31 dev variants — kept in classifier-eval workspace
    and on Modal volume for benchmarking; not in the npm tarball.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ead thresholds

Three review-feedback fixes on the v0.7 branch:

1. tier2Score under aux veto. Previously set to undefined when the multi-head
   rule rescued content (rationale: "no block-driving score"). That preserved
   the strict invariant `tier2Score >= highRiskThreshold ⇔ allowed === false`
   but produced incoherent operator telemetry — high `tier2RawScore` with
   `tier2Score: undefined` is hard to reason about on dashboards.

   New behavior: under aux veto, `tier2Score = 0` so the operator triple
   (`tier2Score`, `riskLevel`, `allowed`) tells one coherent story — zero /
   low / true. The model's actual main signal is preserved on `tier2RawScore`,
   and `tier2MultiheadBlocked: false` + `tier2AuxScore` give rule-level
   context for anyone debugging the decision.

   Combined with the riskLevel-from-tier2EffectiveScore derivation, the
   operator invariant `tier2Score >= highRiskThreshold ⇔ allowed === false`
   holds in single-head and multi-head-rule-fired modes; multi-head aux-veto
   is the third branch and now reads consistently as "zero contribution".

2. MultiheadConfig JSDoc. The field-level docstrings claimed `Default: 0.5`
   and `Default: 0.3` — misleading because both fields are required (no
   library default) and (0.5, 0.3) is the operating point that produced our
   documented AS regression. Rewrote the interface docblock to point at the
   FP-benchmark-validated `(0.5, 0.8)` raw / `(0.5, 0.64)` calibrated default,
   with a reference to evals/RESULTS.md for the threshold sweep.

3. tier2Score JSDoc on DefenseResult. Rewritten to enumerate the three
   modes (single-head, multi-head rule fired, multi-head aux veto) with the
   exact value semantics for each.

Also: trimmed over-commenting in specs/tier2-multihead.spec.ts (~95 lines
removed). Kept the non-obvious context (threshold-arithmetic notes, the
"2/6 ticket variants" operational fact); removed the line-by-line narrative.

290 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…null assertion

Three small follow-ups to the review feedback:

1. readCalibrationDefaults now distinguishes ENOENT (silent — legacy models
   without classifier_config.json) from other read failures and JSON parse
   errors (warn). A typo in a shipped calibration block now surfaces at
   construction time instead of silently falling back to library defaults.

2. OnnxClassifier throws on non-positive or non-finite temperatureT. T must
   be in (0, ∞); zero, negative, NaN, and Infinity now produce a clear
   error rather than being silently coerced to 1. Calibration with invalid
   T is a programming error, not graceful-degradation territory.

3. Replaced `this.tier2Classifier!` non-null assertion at prompt-defense.ts
   with a captured local inside the existing narrowed block. Lint is now
   warning-free; biome check passes cleanly.

+1 test for temperatureT validation. 291 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundled v5 model auto-loads its fitted T from classifier_config.json,
so most callers should never set this field. Trim the JSDoc on
Tier2ClassifierConfig.temperatureT and remove temperature references from
MultiheadConfig threshold docs so the public surface reads as a single,
simple knob rather than asking consumers to reason about model internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 15:37
@hiskudin hiskudin requested a review from a team as a code owner May 13, 2026 15:37
…loaded thresholds apply

Tier2Classifier merges hardcoded defaults < model classifier_config.json
< caller `tier2Config`. The previous Bug 1 fix only synced the
caller-override path; thresholds auto-loaded from a model's calibration
block (the new v0.7 default) reached Tier2Classifier's internal copy but
not the gate at `this.config.tier2.highRiskThreshold`.

Result: bundled v5 ships `highRiskThreshold: 0.64`, Tier2Classifier sees
0.64, gate stays at the library default 0.8. A calibrated score of ~0.75
on an attack lands `riskLevel: "high"` with `allowed: true` — exactly the
incoherent triple Bug 1 was supposed to eliminate. Discovered when the
AgentShield score dropped from 86.7 to 80.9 on the v0.7 candidate: 36
attacks flipped from block to allow on the model-auto-load path.

Fix: drop the pre-construction sync (Tier2Classifier already applies
caller overrides via its 3-tier merge) and read back from
`tier2Classifier.getConfig()` after construction. The readback is
authoritative regardless of whether the threshold came from library
defaults, model auto-load, or caller override — a single source of truth
for the gate.

Regression test: model-level calibration auto-load must propagate to the
gate, demonstrated against the bundled v5 model with no caller config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades Tier 2’s ONNX pipeline to support multi-head models (main+aux), adds temperature scaling for calibrated probabilities, and updates the public API/telemetry to reflect “effective” (decision-driving) scores while preserving raw scores for forensics.

Changes:

  • Added multi-head inference support ([batch, 2]) and a configurable decision rule using main/aux thresholds.
  • Introduced temperature scaling (sigmoid(logit / T)) and auto-loading of model calibration defaults from classifier_config.json.
  • Adjusted PromptDefense result fields/semantics (tier2Score as effective score; added tier2RawScore, tier2AuxScore, tier2MultiheadBlocked) and updated bundled default model + build packaging.

Reviewed changes

Copilot reviewed 10 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/types.ts Extends Tier2Result to optionally report aux score for multi-head models.
src/core/prompt-defense.ts Implements multi-head rule + effective/raw score reporting and threshold propagation logic in the Tier 2 gate.
src/classifiers/tier2-classifier.ts Loads calibration defaults from classifier_config.json; adds multi-head config plumbing and temperature accessor.
src/classifiers/onnx-classifier.ts Adds output-mode detection (single vs multi), main+aux APIs, and temperature scaling in score computation.
src/classifiers/models/minilm-multihead-v5/tokenizer_config.json Bundled model asset for the new default model.
src/classifiers/models/minilm-multihead-v5/config.json Bundled model asset for the new default model.
src/classifiers/models/minilm-multihead-v5/classifier_config.json Bundled calibration defaults (temperatureT, thresholds) consumed at runtime.
src/classifiers/models/minilm-full-aug/.gitkeep Placeholder intended to prevent silent loads of the legacy model path.
specs/tier2-multihead.spec.ts Adds tests for multi-head behavior, calibration loading, and the three reported API contract bugs.
specs/tier2-classifier.spec.ts Updates expectations around default thresholds now coming from model calibration.
specs/onnx-classifier.spec.ts Updates bundled model path reference to the new default model directory.
scripts/copy-models.cjs Replaces inline copy logic with a dedicated post-build asset mirroring script.
package.json Updates copy-models script to use scripts/copy-models.cjs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/core/prompt-defense.ts
Comment thread src/core/prompt-defense.ts Outdated
Comment thread src/core/prompt-defense.ts
Comment thread src/core/prompt-defense.ts Outdated
Comment thread scripts/copy-models.cjs Outdated
Comment thread src/classifiers/onnx-classifier.ts
@hiskudin hiskudin changed the title feat(tier2): multi-head ONNX + temperature scaling + API contract fixes (v0.7) feat(ENG-335): multi-head ONNX + temperature scaling + API contract fixes (v0.7) May 13, 2026
hiskudin and others added 2 commits May 13, 2026 17:03
…ig, tighten invariant docs

Three review-driven fixes:

1. Multihead config against a single-head model used to silently disable
   Tier 2: `classifyChunksBatchPair` returned `{ main, aux: null }` rows,
   the rule's "no chunk blocks" path triggered aux-veto, and
   tier2EffectiveScore collapsed to 0. Detect every-aux-null after the
   batched call and set `tier2SkipReason` so the misconfig surfaces.

2. JSDoc on `tier2Score` claimed the invariant
   `tier2Score >= highRiskThreshold ⇔ allowed === false` held
   unconditionally. It doesn't — `blockHighRisk: false` keeps
   `allowed: true` regardless, and Tier 1 detections can drive
   `allowed: false` independently. Reword to state the conditions.

3. Inline comment on the return claimed the multihead veto sets
   `tier2Score` to undefined; the implementation sets 0. Update the
   comment to match.

Also fix the header comment in scripts/copy-models.cjs: it claimed the
script writes to dist/classifiers/models/<name> but it writes to
dist/models/<name>.

Adds a regression spec for #1 using `vi.spyOn` against
`Tier2Classifier.prototype.classifyChunksBatchPair`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
readCalibrationDefaults() does a sync readFileSync + JSON.parse on every
Tier2Classifier construction. When callers create a PromptDefense per
request, that's ~50-200µs of blocked event loop per call — ~100ms/s at
1k req/s, ~1s/s (one saturated core) at 10k req/s — on a file whose
contents are bundled at build time and never change at runtime.

Cache the result in a module-level Map keyed by modelDir, mirroring the
_sessionCache pattern already used for ONNX sessions. First call on a
modelDir reads from disk; every subsequent call returns from memory.
`null` is a valid cached value ("no calibration block for this model"),
so probe with `.has()` rather than `=== undefined`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hiskudin

Copy link
Copy Markdown
Collaborator Author

AgentShield re-run on 0803063 — 86.7 ✅

Matches the v5 calibration-findings baseline; threshold-sync fix (87c742b) restored the regression caught on the pre-fix candidate.

Category Pre-fix candidate Post-fix 0803063 Δ
Final 80.9 86.7 +5.8
Composite 81.6 88.2 +6.6
Prompt Injection 85.4 91.7 +6.3
Jailbreak 73.3 82.2 +8.9
Data Exfiltration 77.0 88.5 +11.5
Tool Abuse 75.0 81.3 +6.3
Over-Refusal 95.4 92.3 −3.1
Multi-Agent 94.3 97.1 +2.8
Provenance 65.0 70.0 +5.0

Run config: defaults (cfg: {}, no env vars) → single-head v5 with auto-loaded calibration from classifier_config.json (T=2.41, highRiskThreshold=0.64). 537 test cases. Result file: agentshield-benchmark/results/2026-05-14T08-19-49-654Z.json.

Branch is ready to merge.

@willleeney willleeney left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Multi-head ONNX + Temperature Scaling (v0.7)

Solid work. The core logic and invariants are correct, the three bug fixes are real and well-targeted, and the test coverage is strong (355 new lines in tier2-multihead.spec.ts). AgentShield benchmark confirms parity at 86.7.

Main actionable items: #1 (undefined clobber risk on config merge) and #5 (overly loose threshold assertion). The rest are cleanup / readability.

Six inline comments below.

Comment thread src/classifiers/tier2-classifier.ts Outdated
Comment thread src/core/prompt-defense.ts Outdated
Comment thread src/core/prompt-defense.ts Outdated
Comment thread src/core/prompt-defense.ts Outdated
Comment thread specs/tier2-classifier.spec.ts Outdated
Comment thread src/classifiers/models/minilm-multihead-v5/classifier_config.json
hiskudin and others added 2 commits May 14, 2026 10:26
…lt assertion

Two PR-review fixes (#63 threads 1 and 5):

1. Tier2Classifier's caller-config spread used to overwrite model-loaded
   calibration defaults when the caller passed explicitly-undefined keys.
   The common pattern `{ temperatureT: settings.t ?? undefined }` (building
   config conditionally from optional settings) would silently flow
   `undefined` into OnnxClassifier, bypass its positive-finite guard, and
   leave the classifier at T=1 without warning. Filter undefined keys out
   of the partial before spreading so model defaults survive.

2. The `.getConfig()` regression test for the model-auto-loaded
   highRiskThreshold was loosened to `> 0 && <= 1` in an earlier commit,
   which passes for any positive value — including the library default 0.8
   that the auto-load is supposed to override. An accidentally-removed or
   malformed calibration block would slip through silently. Replace with
   `toBeCloseTo(0.64, 2)` to assert the exact shipped value.

Adds a regression spec covering the clobber path: constructing with
`{ temperatureT: undefined, highRiskThreshold: undefined }` must preserve
v5's calibrated defaults (0.64, 2.41).

294 tests pass. Lint clean. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR #63 review nits 2-4 and 6:

- Density damping was computed unconditionally then immediately overwritten
  under multi-head. Move it into the single-head `else if` branch so it
  only runs when the result actually drives a decision.
- `tier2AuxScore` was assigned twice on the multi-head rule-fired path
  (first to the global-max-main chunk's aux, then overwritten with the
  rule-triggering chunk's aux). Rename the eager target to a local
  `auxOfMaxMain` and write `tier2AuxScore` exactly once in the branch
  that's keeping it. End-state semantics preserved across all three
  paths (single-head undefined / rule fired = mhTopBlockAux / aux veto
  = auxOfMaxMain).
- Drop the unnecessary `getTemperature?.()` optional chain — the method
  is always defined on Tier2Classifier.
- Add the missing trailing newline to v5's classifier_config.json.

Net: post-scoring control flow now reads top-to-bottom in one pass —
multi-head branch handles rule + aux-veto cases, single-head branch
handles density damping + risk bucketing. 252 unit tests + 23 multi-head
ONNX integration tests pass. Lint clean. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/classifiers/tier2-classifier.ts
Comment thread src/classifiers/tier2-classifier.ts
hiskudin and others added 3 commits May 14, 2026 10:37
…me formatter

CI runs `code:check` (lint + format); biome format flagged the multi-line
form on the undefined-filter from 2b61b29. Local `code:lint` skipped the
format check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pattern was `/\$\([^)]+\)|`[^`]+`/g` — `$(...)` OR any backtick-pair.
The second alternative couldn't distinguish bash legacy command
substitution from markdown inline code, so every technical README with
`cat foo.json`, `npm install`, `~/.claude/...`, or even just `filename.txt`
triggered shell_command on Tier 1 with no real attack signal. Defender
dogfooded the bug — its own source files (which contain literal
backtick-quoted strings as code examples) tripped the rule.

Modern attackers default to `$(...)` because it nests cleanly; legacy
backtick substitution is rare. Tier 2 still catches the residual backtick
attacks via prompt context — Tier 1 dropping the regex just removes a
noisy FP source.

Regression spec asserts a markdown sample with multiple backtick-inline
spans does not fire shell_command, while the existing `$(rm -rf /)` test
keeps the positive-case coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… filters

Audit follow-up to the shell_command backtick fix. Same bug class — regexes
that look gated by a role/qualifier but the gating is `?`-optional or
keyword-only, so the regex fires on shape-only matches. Tightenings:

- you_are_now: required role-noun terminator (was: both alternation groups
  optional → matched "you are now " + anything). Role list expanded to
  cover the actual attack distribution (DAN/GPT/AI/jailbroken/admin/root
  /hacker/developer/superuser etc.). Fixes UI-copy FP class ("you are now
  logged in / subscribed / ready").

- pretend_to_be: required attack-shaped role-noun (was: no role
  constraint → FP'd on children's literature, drama exercises). Same role
  list with privilege-escalation nouns added.

- show_instructions: required attack-qualifier (initial/original/system
  /hidden); the inner group was optional so bare "show instructions"
  matched. Fixes FAQ/help-doc FP class.

- markdown_hidden_instruction: required imperative + scope qualifier
  ("ignore all/the/previous/prior"); previous form matched any URL
  containing "system" or "instruction", so every doc cross-reference
  like `[config](https://.../system-setup)` FP'd. Uses `\W+` for
  separator to handle URL-encoded `+` and `%20`.

- role_system_xml: required directive content (ignore/disregard/forget
  /override/you are/new instructions/...) following the tag; bare
  `<system>` mentions are common in XML schemas / ML config docs / OS
  specs.

- json_injection: targets actual attack shapes — `"role": "system"`
  (chat-message role hijack) or `"system": "<long string>"` (system-key
  stuffing). Previous form fired on every OpenAI/Anthropic SDK example
  declaring those JSON keys.

- confusable_homoglyphs: Cyrillic block now requires mixed-script
  adjacency to Latin letter (the actual attack: `аdmin` with Cyrillic
  'а'). Pure Russian text no longer FPs. Cherokee + Phonetic Extensions
  blocks remain aggressive — those are essentially never in real
  customer content.

Two pre-existing tests updated because they relied on the over-broad
patterns:
- "should return medium risk" — fixture now uses "pretend to be a hacker"
  instead of "pretend to be a helpful assistant" (the latter is a benign
  roleplay request that only triggered the old over-broad regex).
- "should detect markdown link with hidden instruction" — fixture URL
  now uses the attack-shape `?p=ignore+all+previous+instructions` instead
  of just `ignore-instructions.com`.

14 new regression specs pin both directions of each fix (FP-class fixture
does NOT fire / attack-shape fixture DOES fire). 268 unit tests pass,
23/23 multihead ONNX specs pass, biome check clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@willleeney willleeney left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@hiskudin hiskudin merged commit 616cc10 into main May 14, 2026
3 checks passed
@hiskudin hiskudin deleted the feat/tier2-v0.7 branch May 14, 2026 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants