AgentShield

Security audit framework for agentic AI systems. Applies STRIDE threat modeling, a 100-scenario attack suite (prompt injection, data poisoning, multi-turn escalation, tool misuse), and a 4-component detection pipeline to BioTeam-AI, a 23-agent bioinformatics system with Docker sandboxing.

Author: JangKeun Kim, Weill Cornell Medicine (jak4013@med.cornell.edu) Status: Defensive security research tool. See Limitations and Responsible Use.

Try It in 30 Seconds

The 100-scenario attack dataset is published on HuggingFace and can be used to evaluate any agentic AI system without cloning this repo.

from datasets import load_dataset

scenarios = load_dataset("jang1563/agentshield-attack-scenarios", split="train")
print(f"Loaded {len(scenarios)} scenarios across {len(set(s['category'] for s in scenarios))} categories")
print("Example:", scenarios[0])

For local detection, point AgentShield's output classifier at Constitutional BioGuard (~5ms/query, no API cost).

Results

Evaluated using Claude Haiku (claude-haiku-4-5-20251001) as the target agent and the BioGuard DeBERTa classifier as the success judge. All 100 scenarios run undefended and defended; 100 benign baselines used for FPR.

Criterion	Target	Actual	Status
ASR reduction	>= 80%	100%	PASS
Direct injection detection	>= 90%	100%	PASS
Multi-turn escalation detection	>= 70%	100%	PASS
False positive rate	< 5%	1.0% (1/100, Wilson 95% CI [0.2%, 5.5%])	PASS

Per-Category Breakdown

Category	Undefended ASR	Defended ASR	Reduction	Detection Rate
Direct Injection	96%	0%	100%	100%
Indirect Injection	68%	0%	100%	84%
Multi-Turn Escalation	100%	0%	100%	100%
Tool Misuse	100%	0%	100%	100%

Architecture

User Input
    |
    v
[1. Input Classifier] -----> Block if injection detected
    |                         (18 regex patterns + encoding detection)
    v
[Mock Agent] processes input
    |
    |--- tool_calls ---> [2. Tool Auditor] --> Block if unauthorized
    |                     (allow-list + arg validation + rate limiting)
    v
Agent Response
    |
    v
[3. Output Classifier] ----> Block if unsafe content
    |                         (BioGuard DeBERTa, ~5ms/query)
    v
[4. Trajectory Monitor] ---> Block if escalation detected
    |                         (sliding window: absolute + monotonic + WMA)
    v
Delivered to User (or blocked)

Threat Model

STRIDE-based analysis of BioTeam-AI identifying 9 attack surfaces:

ID	Surface	STRIDE Categories	Risk
AS-001	Agent substitution bypass	Spoofing, EoP	Critical
AS-002	Tool access leakage	EoP	Critical
AS-003	Input sanitization gaps	Tampering, ID	High
AS-004	Memory poisoning (ChromaDB)	Tampering	Critical
AS-005	Docker sandbox escape	EoP	High
AS-006	Auth bypass in dev mode	Spoofing	Critical
AS-007	Workflow hijacking	Tampering	High
AS-008	Rate limit evasion	DoS	Medium
AS-009	Langfuse data leakage	ID	Medium

Full details in docs/threat_model.md. For OWASP Agentic Application risk coverage, see docs/architecture.md.

Attack Suite (100 Scenarios)

Category	Count	Description
Direct Injection	25	System prompt leak, role override, encoding bypass, DAN, many-shot, fiction framing, urgency injection, MCP tool description poisoning
Indirect Injection	25	Memory poisoning, API injection, cross-agent, workflow hijack, BibTeX/FASTA/UniProt/NCBI metadata injection, cross-session persistence
Multi-Turn Escalation	25	Gradual escalation across all 7 NSABB categories, CoSafe-inspired coreference, slow-burn social engineering
Tool Misuse	25	Unauthorized access, arg injection, sandbox escape, DoS, MCP tool poisoning, tool chain escalation, async race conditions

Full catalogue in docs/attack_catalogue.md.

Quick Start

Use the Attack Dataset Only

See Try It in 30 Seconds above. Just pip install datasets and load from HuggingFace.

Run the Full Evaluation Pipeline

Prerequisites:

Python >= 3.10
Constitutional BioGuard trained model (for output classifier; auto-downloads from HuggingFace)

Installation:

git clone https://github.com/jang1563/agentshield
cd agentshield
pip install -e ".[dev]"

Run evaluation:

# Generate threat model
python scripts/run_threat_model.py

# Run full evaluation (attacks + detection + metrics + figures + report)
python scripts/run_evaluation.py --agent-mode scripted

# Or run attacks separately
python scripts/run_attacks.py --mode undefended
python scripts/run_attacks.py --mode defended

Key Outputs

data/results/evaluation_summary.json: All metrics
data/results/figures/: ASR comparison, escalation trajectories, detector performance
docs/evaluation_report.md: Full evaluation report
docs/threat_model.md: STRIDE threat model

Detection Pipeline Components

1. Input Classifier

Pattern-based injection detector with 18 injection patterns, 5 role override patterns, 5 encoding indicators, 3 resource exhaustion patterns, and obfuscation detection (homoglyphs, leetspeak). Optional PromptGuard-86M integration for production use.

2. Output Classifier

Safety filter with 3 modes:

Local model (primary): Constitutional BioGuard DeBERTa classifier (~5ms/query)
LLM judge: Claude API with 4-dimensional rubric (refusal, completeness, specificity, safety)
Keyword fallback: Pattern matching for known dangerous content

3. Trajectory Monitor

Novel sliding-window algorithm tracking per-turn risk scores:

Absolute threshold: Any single turn risk > 0.7
Monotonic escalation: 3+ consecutive risk score increases
Weighted moving average: WMA exceeds 0.5 over sliding window; threshold lowered to 0.4 for 2-turn histories to catch steep single-step escalations

4. Tool Auditor

Authorization enforcement against AGENT_AGENTIC_TOOLS mapping:

Tool access allow-list enforcement
Argument pattern validation (gene symbols, HGVS notation)
Shell injection marker detection (13 patterns)
Call frequency rate limiting (20 calls/minute/agent)

Simulation Approach

All attacks run against mock agents (not live BioTeam-AI):

No risk of actual safety violations during testing
Deterministic and reproducible results
Security audit is read-only (does not modify BioTeam-AI)

Mock agent modes: echo (passthrough), scripted (pattern-matched responses), llm (Claude API).

Project Structure

agentshield/
├── agentshield/                  # Python package
│   ├── config.py
│   ├── threat_model/             # STRIDE framework
│   │   ├── stride.py             # Data models
│   │   ├── bioteam_surfaces.py   # 9 attack surfaces
│   │   └── risk_matrix.py        # Likelihood x impact scoring
│   ├── attacks/                  # 100 attack scenarios (25 per category)
│   │   ├── base.py               # AttackScenario + AttackResult
│   │   ├── direct_injection.py
│   │   ├── indirect_injection.py
│   │   ├── multi_turn_escalation.py
│   │   ├── tool_misuse.py
│   │   └── runner.py
│   ├── detectors/                # 4-component pipeline
│   │   ├── base.py               # DetectorBase ABC
│   │   ├── input_classifier.py
│   │   ├── output_classifier.py
│   │   ├── trajectory_monitor.py
│   │   ├── tool_auditor.py
│   │   └── pipeline.py
│   ├── simulation/               # Mock agent framework
│   │   ├── mock_agent.py
│   │   ├── mock_tools.py
│   │   ├── mock_memory.py
│   │   └── conversation.py
│   └── evaluation/               # Metrics + reporting
│       ├── metrics.py
│       ├── benchmarks.py
│       ├── figures.py
│       └── report_generator.py
├── data/
│   ├── attack_scenarios/
│   ├── benign_baselines/
│   └── results/
├── docs/
│   ├── threat_model.md
│   ├── attack_catalogue.md
│   ├── architecture.md
│   └── evaluation_report.md
├── scripts/
│   ├── run_threat_model.py
│   ├── run_attacks.py
│   ├── run_detection.py
│   └── run_evaluation.py
└── tests/

Cross-Project Integration

AgentShield's output classifier uses Constitutional BioGuard's trained DeBERTa model for real-time content classification. Set the model path via environment variable:

export BIOGUARD_MODEL_DIR=/path/to/constitutional_bioguard/models/deberta_bioguard_v1

Related projects:

bio-overrefusal-v0.1: 201-query expert-annotated dataset measuring legitimate-biology FPR for frontier models
ambiguity-casebook: 36 dual-use boundary cases for classifier stress-testing
bio-constitution-rules: 30 rules library covering 6 bio domains, validated by 5-fold CV

What This Is Not

Not a complete defense on its own (recommended as one layer in defense-in-depth: upstream policy + downstream model refusal training + human review)
Not a substitute for production red-teaming or formal penetration testing
Not validated on adaptive attackers; static suite assumes attacker does not iterate based on detection feedback
Not a tool for attacking systems you do not own or have explicit authorization to test

Limitations

Statistical power: Each attack category has 25 scenarios. With n=25, a 90% detection rate has a 95% confidence interval of approximately ±12 percentage points. Results should be interpreted as directional, not definitive benchmarks.

Mock agent gap: All attacks run against scripted mock agents, not live BioTeam-AI. Scripted responses may not reflect how real models behave under adversarial pressure.

Static attack suite: All 100 scenarios are fixed. An adaptive attacker who iterates based on detection feedback would likely achieve higher ASR than the reported values.

OWASP ASI coverage: AgentShield directly addresses 5 of the 10 OWASP Agentic Application risks (2026). ASI06 (cascading failures), ASI08 (rogue agents), and ASI10 (inter-agent communication) are outside scope. See docs/architecture.md for the full mapping.

Indirect injection coverage: 16% of indirect injection attacks (structured data injection via API responses, metadata fields) are detected by BioGuard's content classifier but not by the input classifier's pattern rules. All are blocked; detection attribution varies by attack vector.

Benign FPR scope: The 1.0% FPR was measured on 100 benign bioinformatics queries. FPR in other domains or with synthesis-planning queries (cloning, Gibson assembly) may be higher due to BioGuard's conservative scoring of dual-use biology content.

Responsible Use

AgentShield is a defensive security research tool. The attack scenarios are designed to probe and improve safety properties of agentic AI systems, not to enable malicious use.

Intended uses:

Security auditing of your own agentic AI systems before deployment
Research on prompt injection and multi-turn escalation defenses
Building detection pipelines for production agent systems

Do not use the attack scenarios to attack systems you do not own or have explicit authorization to test. The threat model and attack catalogue describe vulnerabilities in BioTeam-AI (a research system the author built and controls) and are published to support the security research community, not to provide a playbook for unauthorized access.

The 1.0% false positive rate means legitimate requests will occasionally be blocked. Calibrate thresholds for your deployment context.

Resources

Artifact	Link
GitHub	github.com/jang1563/agentshield
Attack Scenarios Dataset	huggingface.co/datasets/jang1563/agentshield-attack-scenarios
BioGuard Classifier	huggingface.co/jang1563/constitutional-bioguard-deberta-v1

Citation

See CITATION.cff for the structured citation, or use:

@software{kim2026agentshield,
  author = {Kim, JangKeun},
  title  = {AgentShield: Security Audit Framework for Agentic AI Systems},
  year   = {2026},
  url    = {https://github.com/jang1563/agentshield},
  version = {0.1.0},
}

License

MIT (see LICENSE).

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
agentshield		agentshield
docs		docs
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AgentShield

Try It in 30 Seconds

Results

Per-Category Breakdown

Architecture

Threat Model

Attack Suite (100 Scenarios)

Quick Start

Use the Attack Dataset Only

Run the Full Evaluation Pipeline

Key Outputs

Detection Pipeline Components

1. Input Classifier

2. Output Classifier

3. Trajectory Monitor

4. Tool Auditor

Simulation Approach

Project Structure

Cross-Project Integration

What This Is Not

Limitations

Responsible Use

Resources

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AgentShield

Try It in 30 Seconds

Results

Per-Category Breakdown

Architecture

Threat Model

Attack Suite (100 Scenarios)

Quick Start

Use the Attack Dataset Only

Run the Full Evaluation Pipeline

Key Outputs

Detection Pipeline Components

1. Input Classifier

2. Output Classifier

3. Trajectory Monitor

4. Tool Auditor

Simulation Approach

Project Structure

Cross-Project Integration

What This Is Not

Limitations

Responsible Use

Resources

Citation

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages