Security audit framework for agentic AI systems. Applies STRIDE threat modeling, a 100-scenario attack suite (prompt injection, data poisoning, multi-turn escalation, tool misuse), and a 4-component detection pipeline to BioTeam-AI, a 23-agent bioinformatics system with Docker sandboxing.
Author: JangKeun Kim, Weill Cornell Medicine (jak4013@med.cornell.edu) Status: Defensive security research tool. See Limitations and Responsible Use.
The 100-scenario attack dataset is published on HuggingFace and can be used to evaluate any agentic AI system without cloning this repo.
from datasets import load_dataset
scenarios = load_dataset("jang1563/agentshield-attack-scenarios", split="train")
print(f"Loaded {len(scenarios)} scenarios across {len(set(s['category'] for s in scenarios))} categories")
print("Example:", scenarios[0])For local detection, point AgentShield's output classifier at Constitutional BioGuard (~5ms/query, no API cost).
Evaluated using Claude Haiku (claude-haiku-4-5-20251001) as the target agent and the BioGuard DeBERTa classifier as the success judge. All 100 scenarios run undefended and defended; 100 benign baselines used for FPR.
| Criterion | Target | Actual | Status |
|---|---|---|---|
| ASR reduction | >= 80% | 100% | PASS |
| Direct injection detection | >= 90% | 100% | PASS |
| Multi-turn escalation detection | >= 70% | 100% | PASS |
| False positive rate | < 5% | 1.0% (1/100, Wilson 95% CI [0.2%, 5.5%]) | PASS |
| Category | Undefended ASR | Defended ASR | Reduction | Detection Rate |
|---|---|---|---|---|
| Direct Injection | 96% | 0% | 100% | 100% |
| Indirect Injection | 68% | 0% | 100% | 84% |
| Multi-Turn Escalation | 100% | 0% | 100% | 100% |
| Tool Misuse | 100% | 0% | 100% | 100% |
User Input
|
v
[1. Input Classifier] -----> Block if injection detected
| (18 regex patterns + encoding detection)
v
[Mock Agent] processes input
|
|--- tool_calls ---> [2. Tool Auditor] --> Block if unauthorized
| (allow-list + arg validation + rate limiting)
v
Agent Response
|
v
[3. Output Classifier] ----> Block if unsafe content
| (BioGuard DeBERTa, ~5ms/query)
v
[4. Trajectory Monitor] ---> Block if escalation detected
| (sliding window: absolute + monotonic + WMA)
v
Delivered to User (or blocked)
STRIDE-based analysis of BioTeam-AI identifying 9 attack surfaces:
| ID | Surface | STRIDE Categories | Risk |
|---|---|---|---|
| AS-001 | Agent substitution bypass | Spoofing, EoP | Critical |
| AS-002 | Tool access leakage | EoP | Critical |
| AS-003 | Input sanitization gaps | Tampering, ID | High |
| AS-004 | Memory poisoning (ChromaDB) | Tampering | Critical |
| AS-005 | Docker sandbox escape | EoP | High |
| AS-006 | Auth bypass in dev mode | Spoofing | Critical |
| AS-007 | Workflow hijacking | Tampering | High |
| AS-008 | Rate limit evasion | DoS | Medium |
| AS-009 | Langfuse data leakage | ID | Medium |
Full details in docs/threat_model.md. For OWASP Agentic Application risk coverage, see docs/architecture.md.
| Category | Count | Description |
|---|---|---|
| Direct Injection | 25 | System prompt leak, role override, encoding bypass, DAN, many-shot, fiction framing, urgency injection, MCP tool description poisoning |
| Indirect Injection | 25 | Memory poisoning, API injection, cross-agent, workflow hijack, BibTeX/FASTA/UniProt/NCBI metadata injection, cross-session persistence |
| Multi-Turn Escalation | 25 | Gradual escalation across all 7 NSABB categories, CoSafe-inspired coreference, slow-burn social engineering |
| Tool Misuse | 25 | Unauthorized access, arg injection, sandbox escape, DoS, MCP tool poisoning, tool chain escalation, async race conditions |
Full catalogue in docs/attack_catalogue.md.
See Try It in 30 Seconds above. Just pip install datasets and load from HuggingFace.
Prerequisites:
- Python >= 3.10
- Constitutional BioGuard trained model (for output classifier; auto-downloads from HuggingFace)
Installation:
git clone https://github.com/jang1563/agentshield
cd agentshield
pip install -e ".[dev]"Run evaluation:
# Generate threat model
python scripts/run_threat_model.py
# Run full evaluation (attacks + detection + metrics + figures + report)
python scripts/run_evaluation.py --agent-mode scripted
# Or run attacks separately
python scripts/run_attacks.py --mode undefended
python scripts/run_attacks.py --mode defendeddata/results/evaluation_summary.json: All metricsdata/results/figures/: ASR comparison, escalation trajectories, detector performancedocs/evaluation_report.md: Full evaluation reportdocs/threat_model.md: STRIDE threat model
Pattern-based injection detector with 18 injection patterns, 5 role override patterns, 5 encoding indicators, 3 resource exhaustion patterns, and obfuscation detection (homoglyphs, leetspeak). Optional PromptGuard-86M integration for production use.
Safety filter with 3 modes:
- Local model (primary): Constitutional BioGuard DeBERTa classifier (~5ms/query)
- LLM judge: Claude API with 4-dimensional rubric (refusal, completeness, specificity, safety)
- Keyword fallback: Pattern matching for known dangerous content
Novel sliding-window algorithm tracking per-turn risk scores:
- Absolute threshold: Any single turn risk > 0.7
- Monotonic escalation: 3+ consecutive risk score increases
- Weighted moving average: WMA exceeds 0.5 over sliding window; threshold lowered to 0.4 for 2-turn histories to catch steep single-step escalations
Authorization enforcement against AGENT_AGENTIC_TOOLS mapping:
- Tool access allow-list enforcement
- Argument pattern validation (gene symbols, HGVS notation)
- Shell injection marker detection (13 patterns)
- Call frequency rate limiting (20 calls/minute/agent)
All attacks run against mock agents (not live BioTeam-AI):
- No risk of actual safety violations during testing
- Deterministic and reproducible results
- Security audit is read-only (does not modify BioTeam-AI)
Mock agent modes: echo (passthrough), scripted (pattern-matched responses), llm (Claude API).
agentshield/
├── agentshield/ # Python package
│ ├── config.py
│ ├── threat_model/ # STRIDE framework
│ │ ├── stride.py # Data models
│ │ ├── bioteam_surfaces.py # 9 attack surfaces
│ │ └── risk_matrix.py # Likelihood x impact scoring
│ ├── attacks/ # 100 attack scenarios (25 per category)
│ │ ├── base.py # AttackScenario + AttackResult
│ │ ├── direct_injection.py
│ │ ├── indirect_injection.py
│ │ ├── multi_turn_escalation.py
│ │ ├── tool_misuse.py
│ │ └── runner.py
│ ├── detectors/ # 4-component pipeline
│ │ ├── base.py # DetectorBase ABC
│ │ ├── input_classifier.py
│ │ ├── output_classifier.py
│ │ ├── trajectory_monitor.py
│ │ ├── tool_auditor.py
│ │ └── pipeline.py
│ ├── simulation/ # Mock agent framework
│ │ ├── mock_agent.py
│ │ ├── mock_tools.py
│ │ ├── mock_memory.py
│ │ └── conversation.py
│ └── evaluation/ # Metrics + reporting
│ ├── metrics.py
│ ├── benchmarks.py
│ ├── figures.py
│ └── report_generator.py
├── data/
│ ├── attack_scenarios/
│ ├── benign_baselines/
│ └── results/
├── docs/
│ ├── threat_model.md
│ ├── attack_catalogue.md
│ ├── architecture.md
│ └── evaluation_report.md
├── scripts/
│ ├── run_threat_model.py
│ ├── run_attacks.py
│ ├── run_detection.py
│ └── run_evaluation.py
└── tests/
AgentShield's output classifier uses Constitutional BioGuard's trained DeBERTa model for real-time content classification. Set the model path via environment variable:
export BIOGUARD_MODEL_DIR=/path/to/constitutional_bioguard/models/deberta_bioguard_v1Related projects:
- bio-overrefusal-v0.1: 201-query expert-annotated dataset measuring legitimate-biology FPR for frontier models
- ambiguity-casebook: 36 dual-use boundary cases for classifier stress-testing
- bio-constitution-rules: 30 rules library covering 6 bio domains, validated by 5-fold CV
- Not a complete defense on its own (recommended as one layer in defense-in-depth: upstream policy + downstream model refusal training + human review)
- Not a substitute for production red-teaming or formal penetration testing
- Not validated on adaptive attackers; static suite assumes attacker does not iterate based on detection feedback
- Not a tool for attacking systems you do not own or have explicit authorization to test
Statistical power: Each attack category has 25 scenarios. With n=25, a 90% detection rate has a 95% confidence interval of approximately ±12 percentage points. Results should be interpreted as directional, not definitive benchmarks.
Mock agent gap: All attacks run against scripted mock agents, not live BioTeam-AI. Scripted responses may not reflect how real models behave under adversarial pressure.
Static attack suite: All 100 scenarios are fixed. An adaptive attacker who iterates based on detection feedback would likely achieve higher ASR than the reported values.
OWASP ASI coverage: AgentShield directly addresses 5 of the 10 OWASP Agentic Application risks (2026). ASI06 (cascading failures), ASI08 (rogue agents), and ASI10 (inter-agent communication) are outside scope. See docs/architecture.md for the full mapping.
Indirect injection coverage: 16% of indirect injection attacks (structured data injection via API responses, metadata fields) are detected by BioGuard's content classifier but not by the input classifier's pattern rules. All are blocked; detection attribution varies by attack vector.
Benign FPR scope: The 1.0% FPR was measured on 100 benign bioinformatics queries. FPR in other domains or with synthesis-planning queries (cloning, Gibson assembly) may be higher due to BioGuard's conservative scoring of dual-use biology content.
AgentShield is a defensive security research tool. The attack scenarios are designed to probe and improve safety properties of agentic AI systems, not to enable malicious use.
Intended uses:
- Security auditing of your own agentic AI systems before deployment
- Research on prompt injection and multi-turn escalation defenses
- Building detection pipelines for production agent systems
Do not use the attack scenarios to attack systems you do not own or have explicit authorization to test. The threat model and attack catalogue describe vulnerabilities in BioTeam-AI (a research system the author built and controls) and are published to support the security research community, not to provide a playbook for unauthorized access.
The 1.0% false positive rate means legitimate requests will occasionally be blocked. Calibrate thresholds for your deployment context.
| Artifact | Link |
|---|---|
| GitHub | github.com/jang1563/agentshield |
| Attack Scenarios Dataset | huggingface.co/datasets/jang1563/agentshield-attack-scenarios |
| BioGuard Classifier | huggingface.co/jang1563/constitutional-bioguard-deberta-v1 |
See CITATION.cff for the structured citation, or use:
@software{kim2026agentshield,
author = {Kim, JangKeun},
title = {AgentShield: Security Audit Framework for Agentic AI Systems},
year = {2026},
url = {https://github.com/jang1563/agentshield},
version = {0.1.0},
}
MIT (see LICENSE).