fix(ENG-259): upgrade Tier 2 ML model to v31 #59

Closed
hiskudin wants to merge 1 commit into main from feat/model-v28

@hiskudin hiskudin commented May 7, 2026
Collaborator

Summary

Bumps the bundled all-MiniLM-L6-v2 ONNX model from v5 (current production) to v31. Drop-in replacement — no API, peer-dep, or model-format changes. Same 22 MB int8-quantized ONNX, same tokenizer.

v31 builds on v28 (this PR's earlier candidate) by adding emoji-CI-benign training data with per-sample loss weight 0.7 to close the [bracket-tag] + ✅ + imperative residual that real Claude Code plugin sessions still hit on v28 (lefthook output scored 0.978 → 0.332). The 0.7× weighting was tuned to avoid the connector-FPR regression that the unweighted v29 variant caused.
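To illustrate what a per-sample loss weight of 0.7 does, here is a minimal sketch of a weighted binary cross-entropy. The function name `weighted_bce`, the normalization by the sum of weights, and the example scores are assumptions for illustration; the actual training code lives in the training repo and may differ.

```python
import math

def weighted_bce(probs, labels, weights):
    """Binary cross-entropy where each sample's loss is scaled by its weight.

    Normalizing by the sum of weights is one common choice (an assumption
    here, not taken from the training repo).
    """
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * loss
    return total / sum(weights)

# Illustrative values only: the third row stands in for an emoji-CI benign
# sample (label 0) that the model currently scores too high.
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]

unweighted = weighted_bce(probs, labels, [1.0, 1.0, 1.0])
downweighted = weighted_bce(probs, labels, [1.0, 1.0, 0.7])
print(f"unweighted={unweighted:.4f} downweighted={downweighted:.4f}")
```

With the weight below 1.0, the emoji-CI rows still pull the model toward "benign" but with less force than the other sources, which is the mechanism the 0.7× setting relies on to avoid the v29 connector-FPR regression.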

This PR ships only the model swap. The feat/obfuscation-normalisation branch is independent and ships separately.

Training data delta vs v5

v31 training adds four sources on top of v5's mix:

  • dev-tooling-hardneg-curated (1395 benign rows) — tracebacks from bigcode/the-stack-github-issues, AI-vocab commit subjects from JetBrains-Research/commit-chronicle, lint output. Swarm-reviewed for label noise.
  • agentshield-shape-attacks (~250 generated attacks) — imperative + tool reference + path/command + harmful intent shape (the AgentShield tool-abuse regression class identified via per-case diagnostic).
  • system-prompt-extraction-attacks (~170 generated attacks) — direct, authority, roleplay, encoded, exfil-via-URL, tool-smuggle, encoded-output, cross-context, fake-system shapes.
  • emoji-ci-benign (141 benign rows, per-sample loss weight 0.7×) — CI/lint output with checkmark/cross/warning emojis across pytest, eslint, lefthook, GitHub Actions, terraform plan, npm audit shapes.

Generators draw template phrasings from public prompt-injection literature (OWASP LLM01, HackAPrompt, Greshake et al.) — never from AgentShield case texts. Audit-clean: max cosine similarity 0.75 across all 568 generated rows vs AgentShield corpus (scripts/audit_contamination.py in training repo).
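The contamination check described above can be sketched as a maximum pairwise cosine similarity between embeddings of generated rows and embeddings of benchmark cases. This is not the contents of scripts/audit_contamination.py; the function names, the toy 2-d vectors, and the `limit` parameter are hypothetical, and the PR does not state what cutoff counts as clean.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_similarity(generated, benchmark):
    """Highest similarity between any generated row and any benchmark case."""
    return max(cosine(g, b) for g in generated for b in benchmark)

def audit(generated, benchmark, limit):
    """Return (worst similarity, passed). `limit` is a hypothetical cutoff."""
    worst = max_similarity(generated, benchmark)
    return worst, worst <= limit

# Toy embeddings, for illustration only.
generated = [[1.0, 0.0], [1.0, 1.0]]
benchmark = [[0.0, 1.0]]
worst, passed = audit(generated, benchmark, limit=0.8)
print(f"max cosine similarity={worst:.3f} passed={passed}")
```

Reporting the maximum rather than the mean is the conservative choice: a single near-duplicate of a benchmark case is enough to fail the audit.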

Benchmark results

All numbers from the production TypeScript pipeline (npm path) at threshold 0.8.

AgentShield 1.0 (537 cases)

| Category | Weight | v5 | v31 | Δ |
|---|---|---|---|---|
| Prompt Injection | 20% | 88.8 | 92.7 | +3.9 |
| Jailbreak | 10% | 75.6 | 77.8 | +2.2 |
| Data Exfiltration | 15% | 79.3 | 88.5 | +9.2 |
| Tool Abuse | 15% | 65.0 | 78.8 | +13.8 |
| Over-Refusal | 15% | 96.9 | 92.3 | −4.6 |
| Multi-Agent | 10% | 97.1 | 100.0 | +2.9 |
| Provenance & Audit | 5% | 65.0 | 80.0 | +15.0 |
| Latency Overhead | 10% | 100.0 | 100.0 | 0 |
| Composite | | 81.3 | 88.3 | +7.0 |
| Score | | 80.9 | 86.9 | +6.0 |

Six categories materially improved (PI, JB, DE, TA, MA, PA) at the cost of a −4.6 Over-Refusal regression. Net composite +7.0.

StackOne enterprise connector FPR (940 benign HRIS/ATS/MS-Teams payloads, TS pipeline)

| Metric | v5 prod | v31 | Δ |
|---|---|---|---|
| Overall TS-pipeline FPR | 16/940 (1.70%) | 8/940 (0.85%) | −0.85pp (50% reduction) |

(An earlier internal Python eval reported all variants at 0/940 — that script doesn't faithfully replicate the production pipeline and has been deprecated. The TS-pipeline numbers above reflect the user-visible reality.)
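For reference, the FPR figures follow from simple counting at the block threshold. The sketch below is not the production TypeScript pipeline; the function names are hypothetical, and treating a score equal to the threshold as flagged (`>=`) is an assumption.

```python
def false_positive_rate(scores, threshold=0.8):
    """Fraction of benign payloads scored at or above the block threshold."""
    flagged = sum(1 for s in scores if s >= threshold)
    return flagged / len(scores)

def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.5 for a halving."""
    return (before - after) / before

# Counts taken from the table above; synthetic scores reproduce them.
v5_scores = [0.9] * 16 + [0.1] * 924   # 16 of 940 benign payloads flagged
v31_scores = [0.9] * 8 + [0.1] * 932   # 8 of 940 benign payloads flagged

v5_fpr = false_positive_rate(v5_scores)
v31_fpr = false_positive_rate(v31_scores)
print(f"v5={v5_fpr:.2%} v31={v31_fpr:.2%} "
      f"reduction={relative_reduction(v5_fpr, v31_fpr):.0%}")
```

The same two functions applied to the dev-tool corpus counts (70/884 and 42/884) give the 40% relative reduction reported in the next section.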

Claude Code dev-tool FPR (884-row open-source corpus)

Sources: glaiveai/glaive-function-calling-v2 + JetBrains-Research/jupyter-errors-dataset + ffuuugor/agentdojo-dump.

| Metric | v5 | v31 | Change |
|---|---|---|---|
| Overall FPR | 7.92% (70/884) | 4.75% (42/884) | 40% reduction |

Captured Claude Code FP regression suite (7 cases from real plugin sessions)

| FP | v5 score | v31 score | v31 verdict |
|---|---|---|---|
| FP-4 lefthook commit summary | 0.997 | 0.021 | ✅ pass |
| FP-7 git push confirmation | 0.531 | 0.003 | ✅ pass |
| FP-10 Python traceback | 0.842 | 0.003 | ✅ pass |
| FP-11 gh api JSON listing | 0.989 | 0.009 | ✅ pass |
| FP-12 Modal CUDA banner | 0.966 | 0.152 | ✅ pass |
| FP-13 Python tuple PII output | 0.978 | 0.793 | ⚠️ borderline (warn) |
| FP-14 HF model card MCP response | 0.999 | 0.989 | ❌ still BLOCK |
| SANITY real injection | 1.000 | 1.000 | ✅ BLOCK (preserved) |

5 of 7 production FPs resolved; FP-13 sits just under the 0.8 threshold (0.793, warn) and FP-14 still blocks. SANITY injection detection preserved.
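The verdict column can be read as a three-way bucketing of the classifier score. The sketch below assumes a block threshold of 0.8 (stated in this PR) and a warn band starting at 0.5; the 0.5 lower bound and the names `verdict`, `BLOCK_THRESHOLD`, and `WARN_THRESHOLD` are hypothetical and are not taken from the package.

```python
from collections import Counter

BLOCK_THRESHOLD = 0.8  # threshold used for all benchmarks in this PR
WARN_THRESHOLD = 0.5   # hypothetical lower bound of the warn band

def verdict(score):
    """Map a classifier score to pass / warn / block."""
    if score >= BLOCK_THRESHOLD:
        return "block"
    if score >= WARN_THRESHOLD:
        return "warn"
    return "pass"

# v31 scores for the seven captured false positives, from the suite above.
v31_scores = {
    "FP-4": 0.021, "FP-7": 0.003, "FP-10": 0.003, "FP-11": 0.009,
    "FP-12": 0.152, "FP-13": 0.793, "FP-14": 0.989,
}

counts = Counter(verdict(s) for s in v31_scores.values())
print(dict(counts))
```

Under these assumptions the bucketing reproduces the 5 pass, 1 warn, 1 block split reported for v31.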

Adversarial TPR (broad smoke test)

7 attack categories — DAN, system-prompt extraction, authority impersonation, tool-abuse, markdown data-exfil, roleplay jailbreak, command injection in tool args — 7/7 BLOCK at scores 0.902–1.000.

Latency

| Percentile | v5 | v31 |
|---|---|---|
| P50 | 7 ms | 7 ms |
| P95 | 15 ms | 14 ms |

No regression.
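As a reminder of how P50/P95 figures like these are typically derived, here is a nearest-rank percentile sketch. The PR does not say which percentile method the benchmark harness uses, and the sample latencies below are synthetic, not measured data.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic per-call latencies in milliseconds, for illustration only.
latencies_ms = [5, 6, 7, 7, 7, 8, 9, 10, 12, 15]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50={p50}ms P95={p95}ms")
```

Nearest-rank always returns an observed sample, which keeps tail figures such as P95 easy to trace back to a concrete call.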

Why this warrants a fix release

Same API surface, no breaking changes, no peer-dep changes. release-please should bump 0.6.3 → 0.6.4 from the fix(...) commit prefix.
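The expected bump follows the usual Conventional Commits to semver mapping. The sketch below is a simplified illustration of that mapping, not release-please's implementation; the `bump` helper is hypothetical, and release-please's real behaviour (especially for pre-1.0 versions) depends on repository configuration.

```python
import re

def bump(version, commit_subject):
    """Apply the textbook Conventional Commits rule to a MAJOR.MINOR.PATCH version."""
    major, minor, patch = (int(part) for part in version.split("."))
    match = re.match(r"^(\w+)(\([^)]*\))?(!)?:", commit_subject)
    if match is None:
        return version  # not a conventional commit: no release
    kind, _scope, breaking = match.groups()
    if breaking:
        return f"{major + 1}.0.0"
    if kind == "feat":
        return f"{major}.{minor + 1}.0"
    if kind == "fix":
        return f"{major}.{minor}.{patch + 1}"
    return version  # docs, chore, etc. do not trigger a bump here

next_version = bump("0.6.3", "fix(ENG-259): upgrade Tier 2 ML model to v31")
print(next_version)
```

Note that the bump is driven by the commit prefix, not the branch name, so the `feat/model-v28` branch does not by itself imply a minor release.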

Trade-off acknowledged

The only category regression vs v5 is Over-Refusal (−4.6; three more AgentShield benign cases now block). This is concentrated on AgentShield cases that share surface features with attack categories (the same structural trade-off documented in the training writeup). The +13.8 Tool Abuse and +15.0 Provenance gains more than compensate at the composite level, and connector FPR is cut in half (16/940 → 8/940).

Test plan

  • CI passes (existing test suite is API-level; model swap shouldn't affect any existing test)
  • Smoke test in a downstream consumer (Claude Code plugin) for ≥24h to catch any unexpected FP regression
  • Re-run TS-pipeline connector FPR after merge to confirm the 0.85% number holds in CI

Full writeup

Variant history (v5 → v32, including contamination/dead-end attempts) is documented in: stackone-agent-redteaming/guard/classifier-eval/docs/experiments/2026-05-05-claude-code-fp-mitigation.md.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 7, 2026 08:59
@hiskudin hiskudin requested a review from a team as a code owner May 7, 2026 08:59

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.


@hiskudin hiskudin changed the title from "fix(tier2): upgrade Tier 2 ML model to v28" to "fix(ENG-259): upgrade Tier 2 ML model to v28" May 7, 2026
… + emoji-CI)

Bumps the bundled all-MiniLM-L6-v2 ONNX model from v5 (current production)
to v31. Drop-in replacement — same 22 MB int8-quantized ONNX, same tokenizer.

v31 = v28 (PR's previous candidate) + emoji-CI-benign training data with
per-sample loss weight 0.7. v28 was already a strong improvement over v5;
v31 additionally closes the [bracket-tag] + ✅ + imperative residual that
real Claude Code plugin sessions still hit on v28 (lefthook output line
scored 0.978 → 0.332), without the connector-FPR regression that v29 (the
unweighted emoji-CI variant) caused.

Training data sources beyond v5's mix:
  - dev-tooling-hardneg-curated (1395 swarm-reviewed benign rows): tracebacks,
    AI-vocab commit subjects, lint output.
  - agentshield-shape-attacks (~250 generated): imperative+tool+path attacks.
  - system-prompt-extraction-attacks (~170 generated): direct/authority/
    roleplay/encoded/exfil-via-URL/tool-smuggle/encoded-output/cross-context
    /fake-system shapes.
  - emoji-ci-benign (141 generated, weighted 0.7×): CI/lint output with
    checkmark/cross/warning emojis across pytest, eslint, lefthook,
    GitHub Actions, terraform plan, npm audit shapes.

Generators draw template phrasings from public prompt-injection literature
(OWASP LLM01, HackAPrompt, Greshake et al.), audit-clean against AS corpus
(max cosine similarity 0.75 across all 568 generated rows; verified via
scripts/audit_contamination.py in the training repo).

AgentShield 1.0 results (published @stackone/defender 0.6.3 + this model
swapped in vs the v5 model):

  Composite:        88.3 vs 81.3 (+7.0)
  Score:            86.9 vs 80.9 (+6.0)
  Tool Abuse:       78.8 vs 65.0 (+13.8)
  Provenance:       80.0 vs 65.0 (+15.0)
  Data Exfiltration: 88.5 vs 79.3 (+9.2)
  Prompt Injection: 92.7 vs 88.8 (+3.9)
  Multi-Agent:     100.0 vs 97.1 (+2.9)
  Jailbreak:        77.8 vs 75.6 (+2.2)
  Over-Refusal:     92.3 vs 96.9 (-4.6) ← only regression
  Latency:          7/14ms vs 7/15ms (no regression)

Connector FPR (940 benign HRIS/ATS/MS-Teams payloads, TS-pipeline at threshold 0.8):
  v5 prod: 16/940 (1.70%)
  v31:      8/940 (0.85%)  ← 50% reduction

Claude Code dev-tool FPR (884-row open-source corpus from glaive-fn-calling
+ jupyter-errors + agentdojo): 7.92% → 4.75% (40% reduction).

7 captured Claude Code FPs from real plugin sessions (regression suite):
  v5 fixed 0/7. v31 fixes 5/7 directly (lefthook, git-push, traceback,
  gh-api-json-listing, CUDA-banner) and brings 1 more (Python tuple PII)
  to borderline-pass. SANITY injection still BLOCKS at 1.000.

The Over-Refusal -4.6 trade-off is the only category regression. It's
concentrated on AgentShield benign cases that share surface features with
attack categories — the same structural trade-off documented in the v22-v32
training writeup. The +13.8 Tool Abuse and +15.0 Provenance gains more
than compensate at the composite level.

Full writeup with all 32 trained variants and the methodology is at:
stackone-agent-redteaming/guard/classifier-eval/docs/experiments/2026-05-05-claude-code-fp-mitigation.md
@hiskudin hiskudin changed the title from "fix(ENG-259): upgrade Tier 2 ML model to v28" to "fix(tier2): upgrade Tier 2 ML model to v31" May 7, 2026
@hiskudin hiskudin changed the title from "fix(tier2): upgrade Tier 2 ML model to v31" to "fix(ENG-259): upgrade Tier 2 ML model to v31" May 7, 2026
@hiskudin hiskudin closed this May 13, 2026
@hiskudin hiskudin deleted the feat/model-v28 branch May 26, 2026 13:43
2 participants