fix: upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) #1
Merged
Replace baseline ONNX with full-aug-dojo-jailbreak-jbv2 variant:
- AgentShield score: 73.7 → 79.8 (+3.3 pts, new best)
- Jailbreak detection: 48.9% → 68.9% (+20 pts)
- Prompt injection: 79.5% → 92.7% (+13.2 pts)
- DAN-variant subcategory: 20% → 80%

Also sync README with JS package:
- Add banner image and badges header (PyPI-adapted)
- Update description to match JS package wording
- Remove incorrect Git LFS section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces jbv2 ONNX model with jbv5. Fixes Google 2FA/security alert emails being flagged as injections while improving overall benchmark score. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Upgrades the bundled Tier 2 ML classifier (quantized ONNX) to the jbv2 variant and refreshes the repository README to match the JS package’s branding and messaging.
Changes:
- Updates the README header with a centered banner + badges and adjusts the one-line project description.
- Removes the README’s Git LFS section (previously incorrect per PR description).
- (Per PR metadata) Upgrades the bundled ONNX classifier to `full-aug-dojo-jailbreak-jbv2` with improved benchmark results.
Each sentence is now truncated to max_text_length before being passed to the ONNX classifier, consistent with the truncation applied in classify(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
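The truncation described in this commit can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function signature, the character-level slicing, and the default of 512 are assumptions, and `classify_batch` is passed in as a plain callable standing in for the ONNX classifier.

```python
from typing import Callable, List


def classify_by_sentence(
    sentences: List[str],
    classify_batch: Callable[[List[str]], List[float]],
    max_text_length: int = 512,  # assumed default, for illustration only
) -> List[float]:
    """Truncate every sentence, then score the whole list in one call."""
    # Assumption: truncation is a simple character slice, mirroring
    # whatever classify() applies to full texts.
    truncated = [sentence[:max_text_length] for sentence in sentences]
    return classify_batch(truncated)
```

Truncating before the classifier call keeps per-sentence inputs bounded in the same way as whole-text inputs, so both paths see text of comparable length.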
Replaces jbv5 model with jbv2 (full-aug-dojo-jailbreak-jbv2). AgentShield: 73.7 → 79.8 (composite 77.2 → 87.4, penalty 3.51 → 7.54) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Track risky_field_names on sanitizer metadata for Tier 2 field selection
- Tier2Config.tier2_fields, PromptDefense(tier2_fields=...), and Node-style extract_strings with fields_for_tier2 precedence (explicit > risky names > all)
- DefenseResult.tier2_skip_reason for empty input and classifier skips
- Module-level ONNX session/tokenizer cache keyed by resolved path; warn on load failure
- Tests for metadata, Tier 2 scoping mocks, skip reasons, shared ONNX session

Made-with: Cursor
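The field precedence named in this commit (explicit > risky names > all) could look something like the sketch below. Both helpers are hypothetical stand-ins written for illustration; the real `extract_strings` and field-selection code may differ, and treating an empty explicit list as "not set" is an assumption.

```python
from typing import Any, Iterable, List, Optional, Set


def select_tier2_fields(
    explicit: Optional[Iterable[str]],
    risky_field_names: Optional[Iterable[str]],
) -> Optional[Set[str]]:
    """Pick the field names to scan; None means 'scan every string'."""
    if explicit:  # assumption: an empty list falls through to the next rule
        return set(explicit)
    if risky_field_names:
        return set(risky_field_names)
    return None


def extract_strings(
    value: Any, fields: Optional[Set[str]] = None, _matched: bool = False
) -> List[str]:
    """Collect string leaves, restricted to subtrees under a matching key."""
    if isinstance(value, str):
        return [value] if (fields is None or _matched) else []
    if isinstance(value, dict):
        out: List[str] = []
        for key, child in value.items():
            hit = _matched or (fields is not None and key in fields)
            out.extend(extract_strings(child, fields, hit))
        return out
    if isinstance(value, (list, tuple)):
        out = []
        for child in value:
            out.extend(extract_strings(child, fields, _matched))
        return out
    return []  # numbers, booleans, None: nothing to classify
```

With a payload such as `{"subject": "hi", "body": {"text": "..."}}`, passing `fields={"body"}` yields only the strings nested under `body`, while `fields=None` yields every string in the payload.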
- onnx_classifier: keep session cache and logging; add _load_failed on ImportError; use src-relative model path from main
- prompt_defense: single strict _extract_strings (no string leaves under non-matching keys); dedupe DefenseResult; drop sanitizer tier2 kwargs removed on main
- types: single tier2_skip_reason field; keep tier2_fields comment
- tests: Tier 2 scoping expectations match ENG-12518

Made-with: Cursor
BREAKING CHANGE: Drop ToolSanitizationRule, config/sanitizer tool_rules, use_default_tool_rules, and get_tool_rule/should_skip_field. Matches @stackone/defender post ENG-12594.

- Tier2 classify_by_sentence uses one classify_batch call
- Per cache-key threading.Lock for concurrent ONNX load + session cache

Made-with: Cursor
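A per-cache-key lock around a shared session cache might be sketched like this. It is an illustration under stated assumptions rather than the package's implementation: `get_session`, the module-level dictionaries, and the injected `loader` callable (standing in for creating an ONNX session) are all hypothetical names.

```python
import threading
from pathlib import Path
from typing import Any, Callable, Dict

_sessions: Dict[str, Any] = {}          # resolved path -> loaded session
_locks: Dict[str, threading.Lock] = {}  # resolved path -> load lock
_locks_guard = threading.Lock()         # protects the _locks dict itself


def _lock_for(key: str) -> threading.Lock:
    """Return the lock dedicated to one cache key, creating it if needed."""
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())


def get_session(model_path: str, loader: Callable[[str], Any]) -> Any:
    """Load a model at most once per resolved path, even under contention."""
    key = str(Path(model_path).resolve())
    session = _sessions.get(key)
    if session is not None:  # fast path: already loaded
        return session
    with _lock_for(key):
        # Re-check inside the lock: another thread may have finished loading.
        session = _sessions.get(key)
        if session is None:
            session = loader(key)
            _sessions[key] = session
        return session
```

Using one lock per key means two threads loading the same model serialize on that model only, while loads of different model paths can proceed in parallel.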
Summary
- `enable_tier2` default: `False` → `True` to match TypeScript SDK behaviour

Changes
- `models/minilm-full-aug/`: updated to jbv2 model (md5: `18b50c8a27b669dfc9c940bd42fa7b4d`)
- `src/stackone_defender/core/prompt_defense.py`: `enable_tier2: bool = False` → `True`
- `README.md`: updated `(default: False)` comment to `(default: True)`, removed redundant `enable_tier2=True` from examples

Why enable_tier2 defaults to True
The TypeScript SDK (`@stackone/defender`) has always defaulted `enableTier2` to `true` via `options.enableTier2 ?? true`. The Python SDK had an inconsistent `False` default, meaning users had to explicitly opt in to ML classification. This fix aligns the two SDKs.

jbv2 Model Performance (AgentShield benchmark)
🤖 Generated with Claude Code