
feat(ENG-12396): upgrade ML classifier to jbv5 (AgentShield 73.7 → 81.1)#22

Merged
hiskudin merged 5 commits into main from feat/upgrade-model-jbv2 on Mar 23, 2026

@hiskudin hiskudin commented Mar 17, 2026

Collaborator

Summary

  • Upgrades the quantized ONNX model from baseline (full-aug-dojo) to full-aug-dojo-jailbreak-jbv5
  • Fixes Google 2FA/security alert emails being incorrectly flagged as injections
  • Model size unchanged at 22 MB

Benchmark improvements

| Metric | Before | After | Delta |
|---|---|---|---|
| AgentShield score | 73.7 | 81.1 | +7.4 |
| OR Penalty | 3.51 | 0.00 | eliminated |
| Jailbreak detection | 48.9% | 66.7% | +17.8 pts |
| Prompt injection | 79.5% | 83.9% | +4.4 pts |
| DAN-variant | 20% | 80% | +60 pts |
| Data exfiltration | 85.1% | 83.9% | -1.2 pts |
| Multi-agent security | 80.0% | 91.4% | +11.4 pts |
| Email FP (Google security alerts) | | ✅ | 4/4 fixed |
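
For reviewers who want to double-check the Delta column, here is a minimal sketch that recomputes it from the Before/After values above. The dictionary and function names are illustrative only and are not part of this repository.

```python
# Before/after values copied from the benchmark table in this PR description.
BENCHMARKS = {
    "AgentShield score": (73.7, 81.1),
    "Jailbreak detection": (48.9, 66.7),
    "Prompt injection": (79.5, 83.9),
    "DAN-variant": (20.0, 80.0),
    "Data exfiltration": (85.1, 83.9),
    "Multi-agent security": (80.0, 91.4),
}


def deltas(benchmarks):
    """Return {metric: after - before}, rounded to one decimal place."""
    return {
        name: round(after - before, 1)
        for name, (before, after) in benchmarks.items()
    }


def regressions(benchmarks):
    """Return the metrics whose value dropped between the two models."""
    return [name for name, delta in deltas(benchmarks).items() if delta < 0]


if __name__ == "__main__":
    for name, delta in deltas(BENCHMARKS).items():
        print(f"{name}: {delta:+.1f}")
```

Data exfiltration is the only metric in the table that moves down between the two models.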

Training additions over previous model (jbv2 → jbv5)

Three public jailbreak datasets, new relative to the baseline and unchanged from jbv2:

  • jackhhao/jailbreak-classification — 527 examples (DAN, roleplay, persona-override)
  • lmsys/toxic-chat (jailbreaking=1) — 113 human-verified real jailbreaks
  • rubend18/ChatGPT-Jailbreak-Prompts — 79 classic named templates

Plus email false-positive fix:

  • darkknight25/phishing_benign_email_dataset (benign class) — 100 transactional notification emails (invoices, shipping confirmations, 2FA notices, password resets)
  • pos_weight=0.8 — re-tightens the attack boundary after benign data addition
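
The PR does not say which training framework or loss function was used. As a rough sketch, assuming `pos_weight` has the same meaning as in PyTorch's `BCEWithLogitsLoss` (a multiplier on the positive-class, i.e. attack, loss term), its effect can be illustrated in plain Python. The function name `weighted_bce` is hypothetical.

```python
import math


def weighted_bce(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit.

    The positive-class term is scaled by pos_weight; the negative-class
    term is left unscaled. This mirrors the documented formula of
    PyTorch's BCEWithLogitsLoss, which is an assumption about the
    training setup, not something stated in this PR.
    """
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(pos_weight * label * math.log(p)
             + (1 - label) * math.log(1.0 - p))


if __name__ == "__main__":
    # Same logit, attack label: the loss shrinks by a factor of 0.8.
    print(weighted_bce(0.0, 1, pos_weight=1.0))
    print(weighted_bce(0.0, 1, pos_weight=0.8))
    # Benign label: pos_weight has no effect.
    print(weighted_bce(0.0, 0, pos_weight=0.8))
```

Under that assumption, a value below 1 makes an error on an attack example cost slightly less than the same-sized error on a benign example.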

Total training set: ~9,157 examples.
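
The total can be cross-checked against the figures quoted elsewhere in this PR. The sketch below uses the dataset sizes listed above and the 9,057 total from the jbv2 commit message; the variable names are illustrative.

```python
# Attack-side datasets, sizes as listed in this PR description.
ATTACK_ADDITIONS = {
    "jackhhao/jailbreak-classification": 527,
    "lmsys/toxic-chat (jailbreaking=1)": 113,
    "rubend18/ChatGPT-Jailbreak-Prompts": 79,
}

# Benign-side addition that is new in jbv5.
BENIGN_ADDITIONS = {
    "darkknight25/phishing_benign_email_dataset (benign class)": 100,
}

# Total reported in the jbv2 commit message; it already includes the
# attack examples above.
JBV2_TOTAL = 9_057

new_attacks = sum(ATTACK_ADDITIONS.values())
jbv5_total = JBV2_TOTAL + sum(BENIGN_ADDITIONS.values())

if __name__ == "__main__":
    print(new_attacks)
    print(jbv5_total)
```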

Email FP fix

With jbv2, Google 2FA and security alert emails scored 0.99 (blocked) because the training set had no benign examples in the account-security semantic neighbourhood. jbv5 scores them at 0.27–0.33 (allowed).
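
The blocked/allowed outcome follows from comparing the classifier score against a cutoff. The PR does not state the actual threshold, so the sketch below assumes a hypothetical value of 0.5 purely for illustration; `decide` and `BLOCK_THRESHOLD` are not names from this repository.

```python
# Hypothetical cutoff: the real threshold is not stated in this PR.
BLOCK_THRESHOLD = 0.5


def decide(score, threshold=BLOCK_THRESHOLD):
    """Map a classifier score in [0, 1] to a blocked/allowed decision."""
    return "blocked" if score >= threshold else "allowed"


if __name__ == "__main__":
    # Scores reported in this PR for Google security alert emails.
    for model, score in [("jbv2", 0.99), ("jbv5", 0.27), ("jbv5", 0.33)]:
        print(model, score, decide(score))
```

Any cutoff between 0.33 and 0.99 would give the same outcome for the scores reported here.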

Other changes

  • Remove incorrect Git LFS section from README (model files are plain binaries)
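
One way to confirm that a checked-in model file is a plain binary rather than a Git LFS pointer is to look at its first bytes: per the Git LFS spec, LFS pointer files are small text files whose first line is `version https://git-lfs.github.com/spec/v1`. The helper below is a sketch and not part of this repository; the function name is illustrative.

```python
# First line of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Running it over the committed `.onnx` files should return `False` for each one if they are stored as plain binaries.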

🤖 Generated with Claude Code

hiskudin and others added 3 commits March 12, 2026 10:00
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace baseline ONNX with full-aug-dojo-jailbreak-jbv2 variant:
- AgentShield score: 73.7 → 79.8 (+6.1 pts, new best)
- Jailbreak detection: 48.9% → 68.9% (+20 pts)
- Prompt injection: 79.5% → 92.7% (+13.2 pts)
- DAN-variant subcategory: 20% → 80%

Training additions over v1: jailbreakbench (527), toxic-chat (113),
chatgpt-jailbreaks (79) — 719 new attack examples, 9,057 total.
Model size unchanged at 22 MB quantized ONNX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ONNX model files are committed as plain binaries, not LFS objects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 17, 2026 14:43

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


No issues found across 4 files

@hiskudin hiskudin changed the title from "feat: upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" to "feat(ENG-12396) : upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" on Mar 17, 2026
@hiskudin hiskudin changed the title from "feat(ENG-12396) : upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" to "feat(ENG-12396): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" on Mar 17, 2026

Copilot AI left a comment


Pull request overview

Updates project documentation to reflect that model binaries are not managed via Git LFS, aligning the README with the PR’s stated “remove incorrect Git LFS section” change.

Changes:

  • Removed the README’s “Git LFS” subsection under Development.


@hiskudin hiskudin changed the title from "feat(ENG-12396): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" to "feat(ENG-12396): upgrade ML classifier to jbv5 (AgentShield 73.7 → 81.1)" on Mar 20, 2026
Replaces jbv2 ONNX model with jbv5. Fixes Google 2FA/security alert
emails being flagged as injections while improving overall benchmark score.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@glebedel glebedel left a comment

Contributor


LGTM

@hiskudin hiskudin merged commit 51f50ce into main Mar 23, 2026
3 checks passed
@hiskudin hiskudin deleted the feat/upgrade-model-jbv2 branch March 24, 2026 14:54
3 participants