fix(ENG-259): upgrade Tier 2 ML model to v31 by hiskudin · Pull Request #59 · StackOneHQ/defender

hiskudin · 2026-05-07T08:59:54Z

Summary

Bumps the bundled all-MiniLM-L6-v2 ONNX model from v5 (current production) to v31. Drop-in replacement — no API, peer-dep, or model-format changes. Same 22 MB int8-quantized ONNX, same tokenizer.

v31 builds on v28 (this PR's earlier candidate) by adding emoji-CI-benign training data with a per-sample loss weight of 0.7. This closes the [bracket-tag] + ✅ + imperative residual that real Claude Code plugin sessions still hit on v28 (a lefthook output line scored 0.978 → 0.332). The 0.7× weight was tuned to avoid the connector-FPR regression caused by v29, the unweighted emoji-CI variant.

This PR ships only the model swap. The feat/obfuscation-normalisation branch is independent and ships separately.

Training data delta vs v5

v31 training adds four sources on top of v5's mix:

dev-tooling-hardneg-curated (1395 benign rows) — tracebacks from bigcode/the-stack-github-issues, AI-vocab commit subjects from JetBrains-Research/commit-chronicle, lint output. Swarm-reviewed for label noise.
agentshield-shape-attacks (~250 generated attacks) — imperative + tool reference + path/command + harmful intent shape (the AgentShield tool-abuse regression class identified via per-case diagnostic).
system-prompt-extraction-attacks (~170 generated attacks) — direct, authority, roleplay, encoded, exfil-via-URL, tool-smuggle, encoded-output, cross-context, fake-system shapes.
emoji-ci-benign (141 benign rows, per-sample loss weight 0.7×) — CI/lint output with checkmark/cross/warning emojis across pytest, eslint, lefthook, GitHub Actions, terraform plan, npm audit shapes.
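
As a rough illustration of what a per-sample loss weight of 0.7× does, the sketch below scales each sample's binary cross-entropy by its weight before averaging. The choice of binary cross-entropy, the function name `weighted_bce`, and the toy batch are assumptions for illustration; the PR does not state the actual training loss or code.

```python
import math

def weighted_bce(probs, labels, weights):
    """Mean binary cross-entropy with each sample's loss scaled by its weight.

    Assumption: the loss is averaged over the batch size, not the weight sum.
    """
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * loss
    return total / len(probs)

# Hypothetical mini-batch: label 1 = attack, label 0 = benign.
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]

# The third row stands in for an emoji-ci-benign sample, down-weighted to 0.7.
unweighted = weighted_bce(probs, labels, [1.0, 1.0, 1.0])
weighted = weighted_bce(probs, labels, [1.0, 1.0, 0.7])
```

With the weight below 1.0, a misclassified emoji-CI row still contributes to the loss, but less than a full-weight row, which is one plausible reading of how the 0.7× setting tempers its influence on the other benign corpora.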

Generators draw template phrasings from public prompt-injection literature (OWASP LLM01, HackAPrompt, Greshake et al.) — never from AgentShield case texts. Audit-clean: max cosine similarity 0.75 across all 568 generated rows vs AgentShield corpus (scripts/audit_contamination.py in training repo).
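
A contamination check of this kind can be sketched as a maximum pairwise cosine similarity between generated rows and the benchmark corpus. This is a minimal sketch, not the contents of scripts/audit_contamination.py, which are not shown in this PR. The function names and the toy 2-D vectors are hypothetical, and real inputs would be sentence embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_cross_similarity(generated, reference):
    """Highest similarity between any generated row and any reference row."""
    return max(cosine(g, r) for g in generated for r in reference)

# Toy embeddings standing in for generated rows and benchmark cases.
generated = [[1.0, 0.0], [0.6, 0.8]]
reference = [[0.0, 1.0], [1.0, 1.0]]

worst = max_cross_similarity(generated, reference)
```

The single worst-case value is what gets compared against whatever ceiling the audit considers acceptable. The PR reports 0.75 as the observed maximum but does not state the cutoff.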

Benchmark results

All numbers from the production TypeScript pipeline (npm path) at threshold 0.8.

AgentShield 1.0 (537 cases)

Category	Weight	v5	v31	Δ
Prompt Injection	20%	88.8	92.7	+3.9
Jailbreak	10%	75.6	77.8	+2.2
Data Exfiltration	15%	79.3	88.5	+9.2
Tool Abuse	15%	65.0	78.8	+13.8
Over-Refusal	15%	96.9	92.3	−4.6
Multi-Agent	10%	97.1	100.0	+2.9
Provenance & Audit	5%	65.0	80.0	+15.0
Latency Overhead	10%	100.0	100.0	0
Composite		81.3	88.3	+7.0
Score		80.9	86.9	+6.0

Six categories improved (PI, JB, DE, TA, MA, PA) at the cost of a −4.6 Over-Refusal regression. Net composite: +7.0.

StackOne enterprise connector FPR (940 benign HRIS/ATS/MS-Teams payloads, TS pipeline)

	v5 prod	v31	Δ
Overall TS-pipeline FPR	16/940 (1.70%)	8/940 (0.85%)	−0.85pp (50% reduction)

(An earlier internal Python eval reported all variants at 0/940 — that script doesn't faithfully replicate the production pipeline and has been deprecated. The TS-pipeline numbers above reflect the user-visible reality.)

Claude Code dev-tool FPR (884-row open-source corpus)

Sources: glaiveai/glaive-function-calling-v2 + JetBrains-Research/jupyter-errors-dataset + ffuuugor/agentdojo-dump.

	v5	v31
Overall FPR	7.92% (70/884)	4.75% (42/884) — 40% reduction
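
The FPR percentages and relative reductions in the two tables above follow directly from the raw counts. The short check below reproduces them; the helper names are illustrative only.

```python
def fpr(flagged, total):
    """False-positive rate: flagged benign rows over all benign rows."""
    return flagged / total

def relative_reduction(before, after):
    """Fractional drop from the old rate to the new one."""
    return (before - after) / before

# Counts taken from the connector and dev-tool tables.
connector_v5, connector_v31 = fpr(16, 940), fpr(8, 940)
devtool_v5, devtool_v31 = fpr(70, 884), fpr(42, 884)

connector_cut = relative_reduction(connector_v5, connector_v31)
devtool_cut = relative_reduction(devtool_v5, devtool_v31)

print(f"connector: {connector_v5:.2%} -> {connector_v31:.2%} ({connector_cut:.0%} reduction)")
print(f"dev-tool:  {devtool_v5:.2%} -> {devtool_v31:.2%} ({devtool_cut:.0%} reduction)")
```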

Captured Claude Code FP regression suite (7 cases from real plugin sessions)

FP	v5 score	v31 score	v31 verdict
FP-4 lefthook commit summary	0.997	0.021	✅ pass
FP-7 git push confirmation	0.531	0.003	✅ pass
FP-10 Python traceback	0.842	0.003	✅ pass
FP-11 gh api JSON listing	0.989	0.009	✅ pass
FP-12 Modal CUDA banner	0.966	0.152	✅ pass
FP-13 Python tuple PII output	0.978	0.793	⚠️ borderline (warn)
FP-14 HF model card MCP response	0.999	0.989	❌ still BLOCK
SANITY real injection	1.000	1.000	✅ BLOCK (preserved)

5 of 7 production FPs resolved; FP-13 drops to borderline (warn) and FP-14 still blocks. SANITY injection detection preserved.
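
The v31 verdicts in the table above can be reproduced from the scores with a simple banding rule. The 0.8 block threshold comes from this PR. The warn band (`WARN_MARGIN`) is a hypothetical value chosen only so the sketch matches the table, since the PR does not state where the plugin draws the warn line.

```python
BLOCK_THRESHOLD = 0.8  # from the PR: benchmarks run at threshold 0.8
WARN_MARGIN = 0.05     # hypothetical: the real warn band is not stated

def verdict(score):
    """Map a classifier score to pass / warn / block."""
    if score >= BLOCK_THRESHOLD:
        return "block"
    if score >= BLOCK_THRESHOLD - WARN_MARGIN:
        return "warn"
    return "pass"

# v31 scores for the seven captured false positives.
v31_scores = {
    "FP-4": 0.021, "FP-7": 0.003, "FP-10": 0.003, "FP-11": 0.009,
    "FP-12": 0.152, "FP-13": 0.793, "FP-14": 0.989,
}

verdicts = {name: verdict(score) for name, score in v31_scores.items()}
resolved = sum(v == "pass" for v in verdicts.values())
sanity = verdict(1.000)  # the real injection must still block
```

Keeping the sanity case in the same suite as the benign cases guards against a model that fixes false positives by simply scoring everything low.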

Adversarial TPR (broad smoke test)

7 attack categories — DAN, system-prompt extraction, authority impersonation, tool-abuse, markdown data-exfil, roleplay jailbreak, command injection in tool args — 7/7 BLOCK at scores 0.902–1.000.

Latency

	v5	v31
P50	7ms	7ms
P95	15ms	14ms

No regression.

Why this warrants a fix release

Same API surface, no breaking changes, no peer-dep changes. release-please should bump 0.6.3 → 0.6.4 from the fix(...) commit prefix.
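
The expected bump follows the usual Conventional Commits mapping, in which a `fix` commit produces a patch release. The sketch below is a simplified stand-in for that rule, not release-please's own code, and its real behaviour depends on repository configuration that this PR does not show. `next_version` and `BUMP_BY_TYPE` are hypothetical names.

```python
import re

# Simplified Conventional Commits mapping (assumption: no custom config).
BUMP_BY_TYPE = {"fix": "patch", "feat": "minor"}

def next_version(current, commit_subject):
    """Return the version implied by one commit subject, or current if none."""
    match = re.match(r"^(\w+)(\([^)]*\))?:", commit_subject)
    if match is None:
        return current
    major, minor, patch = (int(part) for part in current.split("."))
    bump = BUMP_BY_TYPE.get(match.group(1))
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current

bumped = next_version("0.6.3", "fix(ENG-259): upgrade Tier 2 ML model to v31")
```

The scope in parentheses (here the ticket ID) does not affect the bump. Only the commit type before it does.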

Trade-off acknowledged

The only category regression vs v5 is Over-Refusal (−4.6; three more AgentShield benign cases now block). It is concentrated on AgentShield cases that share surface features with attack categories (the same structural trade-off documented in the training writeup). The +13.8 Tool Abuse and +15.0 Provenance gains more than compensate at the composite level, and connector FPR is cut in half (16/940 → 8/940).

Test plan

CI passes (existing test suite is API-level; model swap shouldn't affect any existing test)
Smoke test in a downstream consumer (Claude Code plugin) for ≥24h to catch any unexpected FP regression
Re-run TS-pipeline connector FPR after merge to confirm the 0.85% number holds in CI

Full writeup

Variant history (v5 → v32, including contamination/dead-end attempts) is documented in: stackone-agent-redteaming/guard/classifier-eval/docs/experiments/2026-05-05-claude-code-fp-mitigation.md.

🤖 Generated with Claude Code

Copilot

Copilot wasn't able to review any files in this pull request.


… + emoji-CI)

Bumps the bundled all-MiniLM-L6-v2 ONNX model from v5 (current production) to v31. Drop-in replacement: same 22 MB int8-quantized ONNX, same tokenizer.

v31 = v28 (PR's previous candidate) + emoji-CI-benign training data with per-sample loss weight 0.7. v28 was already a strong improvement over v5; v31 additionally closes the [bracket-tag] + ✅ + imperative residual that real Claude Code plugin sessions still hit on v28 (lefthook output line scored 0.978 → 0.332), without the connector-FPR regression that v29 (the unweighted emoji-CI variant) caused.

Training data sources beyond v5's mix:
- dev-tooling-hardneg-curated (1395 swarm-reviewed benign rows): tracebacks, AI-vocab commit subjects, lint output.
- agentshield-shape-attacks (~250 generated): imperative+tool+path attacks.
- system-prompt-extraction-attacks (~170 generated): direct/authority/roleplay/encoded/exfil-via-URL/tool-smuggle/encoded-output/cross-context/fake-system shapes.
- emoji-ci-benign (141 generated, weighted 0.7×): CI/lint output with checkmark/cross/warning emojis across pytest, eslint, lefthook, GitHub Actions, terraform plan, npm audit shapes.

Generators draw template phrasings from public prompt-injection literature (OWASP LLM01, HackAPrompt, Greshake et al.), audit-clean against AS corpus (max cosine similarity 0.75 across all 568 generated rows; verified via scripts/audit_contamination.py in the training repo).

AgentShield 1.0 results (published @stackone/defender 0.6.3 + this model swapped in vs the v5 model):
Composite: 88.3 vs 81.3 (+7.0)
Score: 86.9 vs 80.9 (+6.0)
Tool Abuse: 78.8 vs 65.0 (+13.8)
Provenance: 80.0 vs 65.0 (+15.0)
Data Exfiltration: 88.5 vs 79.3 (+9.2)
Prompt Injection: 92.7 vs 88.8 (+3.9)
Multi-Agent: 100.0 vs 97.1 (+2.9)
Jailbreak: 77.8 vs 75.6 (+2.2)
Over-Refusal: 92.3 vs 96.9 (-4.6) ← only regression
Latency: 7/14ms vs 7/15ms (no regression)

Connector FPR (940 benign HRIS/ATS/MS-Teams payloads, TS-pipeline at threshold 0.8):
v5 prod: 16/940 (1.70%)
v31: 8/940 (0.85%) ← 50% reduction

Claude Code dev-tool FPR (884-row open-source corpus from glaive-fn-calling + jupyter-errors + agentdojo): 7.92% → 4.75% (40% reduction).

7 captured Claude Code FPs from real plugin sessions (regression suite): v5 fixed 0/7. v31 fixes 5/7 directly (lefthook, git-push, traceback, gh-api-json-listing, CUDA-banner) and brings 1 more (Python tuple PII) to borderline-pass. SANITY injection still BLOCKS at 1.000.

The Over-Refusal -4.6 trade-off is the only category regression. It's concentrated on AgentShield benign cases that share surface features with attack categories, the same structural trade-off documented in the v22-v32 training writeup. The +13.8 Tool Abuse and +15.0 Provenance gains more than compensate at the composite level.

Full writeup with all 32 trained variants and the methodology is at: stackone-agent-redteaming/guard/classifier-eval/docs/experiments/2026-05-05-claude-code-fp-mitigation.md

Copilot AI review requested due to automatic review settings May 7, 2026 08:59

hiskudin requested a review from a team as a code owner May 7, 2026 08:59

Copilot AI reviewed May 7, 2026


hiskudin changed the title ~~fix(tier2): upgrade Tier 2 ML model to v28~~ fix(ENG-259): upgrade Tier 2 ML model to v28 May 7, 2026

hiskudin force-pushed the feat/model-v28 branch from 304bd24 to b196c2d Compare May 7, 2026 13:04

hiskudin changed the title ~~fix(ENG-259): upgrade Tier 2 ML model to v28~~ fix(tier2): upgrade Tier 2 ML model to v31 May 7, 2026

hiskudin changed the title ~~fix(tier2): upgrade Tier 2 ML model to v31~~ fix(ENG-259): upgrade Tier 2 ML model to v31 May 7, 2026

hiskudin closed this May 13, 2026

hiskudin deleted the feat/model-v28 branch May 26, 2026 13:43

hiskudin wants to merge 1 commit into main from feat/model-v28
