feat(ENG-12396): upgrade ML classifier to jbv5 (AgentShield 73.7 → 81.1) by hiskudin · Pull Request #22 · StackOneHQ/defender

hiskudin · 2026-03-17T14:43:09Z

Summary

Upgrades the quantized ONNX model from baseline (full-aug-dojo) to full-aug-dojo-jailbreak-jbv5
Fixes Google 2FA/security alert emails being incorrectly flagged as injections
Model size unchanged at 22 MB

Benchmark improvements

Metric	Before	After	Delta
AgentShield score	73.7	81.1	+7.4
OR Penalty	3.51	0.00	eliminated
Jailbreak detection	48.9%	66.7%	+17.8 pts
Prompt injection	79.5%	83.9%	+4.4 pts
DAN-variant	20%	80%	+60 pts
Data exfiltration	85.1%	83.9%	-1.2 pts
Multi-agent security	80.0%	91.4%	+11.4 pts
Email FP (Google security alerts)	❌	✅ 4/4	fixed
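
The Delta column can be re-derived from the Before and After columns. A minimal sketch that does this with the numeric rows copied from the table above (the dictionary and function names are illustrative and not part of the defender repo):

```python
# Before/after figures copied from the benchmark table above.
# Percentages are stored as plain numbers; names are illustrative only.
BENCHMARKS = {
    "AgentShield score": (73.7, 81.1),
    "Jailbreak detection (%)": (48.9, 66.7),
    "Prompt injection (%)": (79.5, 83.9),
    "DAN-variant (%)": (20.0, 80.0),
    "Data exfiltration (%)": (85.1, 83.9),
    "Multi-agent security (%)": (80.0, 91.4),
}


def deltas(table):
    """Return {metric: after - before}, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in table.items()}


def regressions(table):
    """Return the metrics whose value went down between the two models."""
    return [name for name, delta in deltas(table).items() if delta < 0]


if __name__ == "__main__":
    for name, delta in deltas(BENCHMARKS).items():
        print(f"{name}: {delta:+.1f}")
    print("Regressions:", regressions(BENCHMARKS))
```

Data exfiltration is the only row that regresses, which matches the -1.2 pts entry in the table.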

Training additions over previous model (jbv2 → jbv5)

Three public jailbreak datasets, new relative to the baseline and carried over unchanged from jbv2:

jackhhao/jailbreak-classification — 527 examples (DAN, roleplay, persona-override)
lmsys/toxic-chat (jailbreaking=1) — 113 human-verified real jailbreaks
rubend18/ChatGPT-Jailbreak-Prompts — 79 classic named templates

Plus email false-positive fix:

darkknight25/phishing_benign_email_dataset (benign class) — 100 transactional notification emails (invoices, shipping confirmations, 2FA notices, password resets)
pos_weight=0.8 — re-tightens the attack boundary after benign data addition
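
For readers unfamiliar with `pos_weight`: in the convention used by PyTorch's `BCEWithLogitsLoss`, it multiplies the loss term of positive (here, attack) examples. The PR does not say which framework or loss function the training run used, so the sketch below is an assumption that only illustrates the weighting itself, in plain Python:

```python
import math


def weighted_bce(logit, label, pos_weight=1.0):
    """Binary cross-entropy for one example, with the positive term scaled.

    Mirrors the per-example formula documented for a pos_weight-style loss:
        -(pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))
    Assumption: the actual training code may use a different loss.
    """
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1.0 - p))


if __name__ == "__main__":
    # Same logit, attack label (1) vs benign label (0), with pos_weight=0.8.
    print("attack example loss:", weighted_bce(0.0, 1, pos_weight=0.8))
    print("benign example loss:", weighted_bce(0.0, 0, pos_weight=0.8))
```

With `pos_weight=0.8`, an attack example contributes 0.8 times the loss it would contribute unweighted, while benign examples are unaffected.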

Total training set: ~9,157 examples.
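
The total is consistent with the counts quoted in this PR. A quick arithmetic check, assuming the jbv5 set is exactly the jbv2 set (9,057 examples, per the jbv2 commit message) plus the 100 benign emails; the PR only gives the total as approximate:

```python
# Dataset sizes as quoted in the PR description.
JAILBREAK_DATASETS = {
    "jackhhao/jailbreak-classification": 527,
    "lmsys/toxic-chat (jailbreaking=1)": 113,
    "rubend18/ChatGPT-Jailbreak-Prompts": 79,
}
BENIGN_EMAILS = 100  # darkknight25/phishing_benign_email_dataset, benign class
JBV2_TOTAL = 9_057   # total reported in the jbv2 commit message

# Attack examples added by the three jailbreak datasets.
NEW_ATTACK_EXAMPLES = sum(JAILBREAK_DATASETS.values())

# Assumption: jbv5 = jbv2 + benign emails, with nothing else added or removed.
JBV5_TOTAL = JBV2_TOTAL + BENIGN_EMAILS

if __name__ == "__main__":
    print("new attack examples:", NEW_ATTACK_EXAMPLES)
    print("estimated jbv5 total:", JBV5_TOTAL)
```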

Email FP fix

Google 2FA and security alert emails were scoring 0.99 (blocked) with jbv2 because the training set had no benign examples in the account-security semantic neighbourhood. jbv5 scores them at 0.27–0.33 (allowed).
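
To make the blocked/allowed distinction concrete, here is a minimal sketch of a score-to-decision step. The PR does not state the decision threshold, so `BLOCK_THRESHOLD = 0.5` and the `decide` helper are hypothetical; any cut-off above 0.33 and at or below 0.99 would reproduce the outcomes described above.

```python
# Hypothetical cut-off: the PR does not state the real threshold.
BLOCK_THRESHOLD = 0.5


def decide(score, threshold=BLOCK_THRESHOLD):
    """Map a classifier score in [0, 1] to a blocked/allowed decision."""
    return "blocked" if score >= threshold else "allowed"


if __name__ == "__main__":
    # Scores quoted in the PR for Google 2FA / security alert emails.
    print("jbv2 (0.99):", decide(0.99))
    print("jbv5 (0.27):", decide(0.27))
    print("jbv5 (0.33):", decide(0.33))
```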

Other changes

Remove incorrect Git LFS section from README (model files are plain binaries)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace baseline ONNX with full-aug-dojo-jailbreak-jbv2 variant:
- AgentShield score: 73.7 → 79.8 (+3.3 pts, new best)
- Jailbreak detection: 48.9% → 68.9% (+20 pts)
- Prompt injection: 79.5% → 92.7% (+13.2 pts)
- DAN-variant subcategory: 20% → 80%

Training additions over v1: jailbreakbench (527), toxic-chat (113), chatgpt-jailbreaks (79) — 719 new attack examples, 9,057 total.

Model size unchanged at 22 MB quantized ONNX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai

No issues found across 4 files

Copilot

Pull request overview

Updates project documentation to reflect that model binaries are not managed via Git LFS, aligning the README with the PR’s stated “remove incorrect Git LFS section” change.

Changes:

Removed the README’s “Git LFS” subsection under Development.

glebedel

LGTM

hiskudin and others added 3 commits March 12, 2026 10:00

fix: downgrade nanoid to 3.3.11 to fix CJS compatibility

4b070fd

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: remove Git LFS section from README

57bc2a0

ONNX model files are committed as plain binaries, not LFS objects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 17, 2026 14:43

Merge branch 'main' into feat/upgrade-model-jbv2

8e35447

Copilot started reviewing on behalf of hiskudin March 17, 2026 14:44

cubic-dev-ai Bot reviewed Mar 17, 2026

hiskudin changed the title ~~feat: upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)~~ feat(ENG-12396) : upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) Mar 17, 2026

hiskudin changed the title ~~feat(ENG-12396) : upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)~~ feat(ENG-12396): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) Mar 17, 2026

Copilot AI reviewed Mar 17, 2026

hiskudin changed the title ~~feat(ENG-12396): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)~~ feat(ENG-12396): upgrade ML classifier to jbv5 (AgentShield 73.7 → 81.1) Mar 20, 2026

feat: upgrade ML classifier to jbv5 (AgentShield 79.8 → 81.1)

800034c

Replaces jbv2 ONNX model with jbv5. Fixes Google 2FA/security alert emails being flagged as injections while improving overall benchmark score. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

glebedel approved these changes Mar 23, 2026

hiskudin merged commit 51f50ce into main Mar 23, 2026
3 checks passed

stackone-devops-service-account mentioned this pull request Mar 23, 2026

chore(main): release defender 0.5.0 #24

Merged

hiskudin deleted the feat/upgrade-model-jbv2 branch March 24, 2026 14:54

hiskudin merged 5 commits into main from feat/upgrade-model-jbv2
