Agent Security Scanner Benchmark: Can Bandit, Semgrep, or MCP Scanners Detect Agent Vulnerabilities?

No agent security scanner achieves a Youden Index above 0.30 across OWASP Agentic AI categories. The scanners show zero complementarity: all three combined detect exactly what Sigil alone detects (80%). Tool poisoning (ASI01) and identity/privilege attacks (ASI03) have 0% detection by Cisco and MEDUSA at discriminating thresholds. AOQL (Average Outgoing Quality Limit) varies 23x across scanners, from 0.04 to 0.92.

Key Results

Finding	Metric	Evidence
No scanner achieves adequate discrimination	Max Youden Index: 0.30 (Sigil+bandit)	37 MCP test cases (25 vulnerable, 12 safe)
Scanner union = Sigil alone	Combined TPR = 80% = Sigil TPR	Cisco and MEDUSA detections are strict subsets
AOQL spans 23x across scanners	0.04 (MEDUSA best) to 0.92 (Cisco)	Operating Characteristic curve analysis
Strong category-level specialization	ASI01/ASI03: 0% detection; ASI05: 100%	5 OWASP Agentic AI categories
MEDUSA: starkest tradeoff	96% TPR / 100% FPR → 16% TPR / 0% FPR	Score threshold sweep
Cisco MCP Scanner: lowest detection	8% TPR (2/25) at all operating points	Only detects ASI05 code execution
Statistically significant differences	Fisher's exact p<0.001 (Bonferroni-corrected)	Sigil vs Cisco, Sigil vs MEDUSA
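
The significance row can be sanity-checked with a small standard-library sketch. It assumes the tests compare detected versus missed counts on the 25 vulnerable cases (20, 2, and 4 detections, derived from the reported TPRs, with MEDUSA taken at its high-threshold operating point) and that the Bonferroni correction covers three pairwise comparisons. The repository's exact inputs may differ, and `fisher_exact_two_sided` is an illustrative helper, not the project's code.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# (detected, missed) out of 25 vulnerable cases; operating points are assumed
counts = {"sigil": (20, 5), "cisco": (2, 23), "medusa": (4, 21)}
pairs = [("sigil", "cisco"), ("sigil", "medusa"), ("cisco", "medusa")]

adjusted = {}
for s1, s2 in pairs:
    p = fisher_exact_two_sided(*counts[s1], *counts[s2])
    # Bonferroni: multiply by the number of comparisons, cap at 1
    adjusted[(s1, s2)] = min(1.0, p * len(pairs))
    print(f"{s1} vs {s2}: corrected p = {adjusted[(s1, s2)]:.2e}")
```

Under these assumptions both Sigil comparisons stay below 0.001 after correction, while Cisco and MEDUSA are not distinguishable from each other.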

Best Operating Points by Scanner

Scanner	Operating Point	TPR	FPR	Youden	TPR 95% CI
Cisco MCP Scanner	OP1 (static, all)	0.08	0.00	0.08	[0.01, 0.26]
MEDUSA	OP3 (high threshold)	0.16	0.00	0.16	[0.05, 0.36]
MEDUSA	OP1 (any finding)	0.96	1.00	-0.04	[0.80, 1.00]
Sigil+bandit	OP1 (score >13)	0.80	0.50	0.30	[0.59, 0.93]
Sigil+bandit	OP2 (score >19)	0.36	0.25	0.11	[0.18, 0.57]
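
Each row of the table can be derived from a confusion matrix. The sketch below computes TPR, FPR, the Youden Index (TPR minus FPR), and a 95% interval for TPR. The source does not state which interval method was used; this sketch assumes an exact (Clopper-Pearson) binomial interval, solved by bisection so that only the standard library is needed. The function names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial interval for k successes in n trials."""
    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def operating_point(tp, fn, fp, tn):
    """Summarize one scanner operating point from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "youden": tpr - fpr,
        "tpr_ci": clopper_pearson(tp, tp + fn),
    }

# Sigil+bandit OP1: 20 of 25 vulnerable flagged, 6 of 12 safe flagged
sigil_op1 = operating_point(tp=20, fn=5, fp=6, tn=6)
print(sigil_op1)
```

With the counts implied by the Sigil+bandit OP1 row, the result should land close to the reported TPR interval, which suggests (but does not confirm) that an exact interval was used.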

The Finding

We benchmarked three agent security scanners — Cisco MCP Scanner (v4.6.0), MEDUSA (v2026.4.0), and Sigil (with bandit integration) — against a ground-truth corpus of 37 MCP server test cases covering 5 OWASP Agentic AI Security categories. Traditional SAST tools (bandit, semgrep, CodeQL) were not designed for agent-specific vulnerabilities, and the MCP-specific scanners don't fill the gap.

The core result: the scanners don't complement each other. Adding Cisco and MEDUSA to Sigil adds zero detection coverage. Even the best scanner (Sigil) reaches only TPR=0.80 at FPR=0.50: it flags half of the safe servers (6 of 12) as vulnerable in order to catch 80% (20 of 25) of the vulnerable ones.

Category-level analysis reveals why: ASI01 (tool poisoning) and ASI03 (identity/privilege) have 0% detection by Cisco and MEDUSA at discriminating thresholds. These are arguably the most dangerous agentic attack categories, and no scanner reliably detects them.
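
The complementarity claim reduces to set arithmetic over per-case detections. The sketch below uses placeholder case IDs (not the repository's real corpus labels) arranged so that the Cisco and MEDUSA detections are subsets of Sigil's, as the results describe. It assumes MEDUSA is taken at its high-threshold operating point (OP3).

```python
# Placeholder IDs for the 25 vulnerable cases; real corpus labels will differ
vulnerable = {f"vuln_{i:02d}" for i in range(1, 26)}

detections = {
    "sigil": {f"vuln_{i:02d}" for i in range(1, 21)},   # 20/25 -> TPR 0.80
    "medusa": {f"vuln_{i:02d}" for i in range(1, 5)},   # 4/25  -> TPR 0.16
    "cisco": {f"vuln_{i:02d}" for i in range(1, 3)},    # 2/25  -> TPR 0.08
}

def marginal_gain(base, others, detections):
    """Cases found by `others` that `base` misses; empty means no complementarity."""
    extra = set().union(*(detections[name] for name in others))
    return extra - detections[base]

union = set().union(*detections.values())
union_tpr = len(union & vulnerable) / len(vulnerable)
gain = marginal_gain("sigil", ["medusa", "cisco"], detections)

print(f"union TPR = {union_tpr:.2f}, extra cases beyond Sigil = {len(gain)}")
```

Running the same check per OWASP category on the real detection sets would show which categories no scanner covers.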

Quick Start

git clone https://github.com/rexcoleman/cycle12-agent-security-tooling.git
cd cycle12-agent-security-tooling
pip install -r requirements.txt
bash reproduce.sh                    # full reproduction

Scanners Evaluated

Scanner	Version	Type	Detection Approach
Cisco MCP Scanner	v4.6.0	MCP-specific	Pattern matching on tool descriptions
MEDUSA	v2026.4.0	MCP-specific	Static analysis + LLM-assisted scoring
Sigil+bandit	latest	General + Python	AST analysis + security linting

Methodology

Test corpus: 37 MCP server implementations (25 vulnerable across 5 OWASP categories, 12 safe controls)
Analysis: Operating Characteristic curves, Youden Index optimization, AOQL computation
Statistical tests: Fisher's exact test with Bonferroni correction for pairwise comparisons
Framework: Manufacturing QA methodology (OC curves, AOQL) adapted for security scanner evaluation

Full methodology in EXPERIMENTAL_DESIGN.md. All results in FINDINGS.md.
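
As a rough illustration of the Operating Characteristic analysis and Youden Index optimization listed above, the sketch below sweeps score thresholds over labeled cases and picks the threshold with the highest Youden Index. The scores and labels are made up for the example. The strict "score greater than threshold" rule mirrors the "score >13" and "score >19" operating points, but the repository's actual scoring pipeline is not shown here.

```python
def sweep(scores, labels, thresholds):
    """Return one (threshold, TPR, FPR, Youden) tuple per threshold.

    scores: per-case scanner scores; labels: True if the case is vulnerable.
    A case is flagged when its score is strictly greater than the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s > t)
        tpr, fpr = tp / positives, fp / negatives
        points.append((t, tpr, fpr, tpr - fpr))
    return points

# Made-up example: four vulnerable cases followed by two safe controls
scores = [25, 21, 15, 9, 16, 4]
labels = [True, True, True, True, False, False]

points = sweep(scores, labels, thresholds=[0, 13, 19])
best = max(points, key=lambda p: p[3])  # operating point with the highest Youden Index

for t, tpr, fpr, j in points:
    print(f"score > {t}: TPR={tpr:.2f} FPR={fpr:.2f} Youden={j:.2f}")
print("best threshold:", best[0])
```

A threshold of 0 flags everything and yields a Youden Index of 0, which is the same pattern as MEDUSA's "any finding" operating point.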

Figures

Figure: Operating Characteristic curves for all scanners
Figure: Category-level detection heatmap

Related Work

Blog post: Can Bandit or Semgrep Detect Agent Vulnerabilities? — Accessible summary of this research
agent-skill-scanner — PyPI-installable agent security scanner (SE-157)
agent-skill-scan-action — GitHub Action for agent security (SE-158)
agent-skill-scan-mcp — MCP server for agent security checks (SE-159)
controllability-bound — Defense difficulty decomposition framework

Citation

@software{coleman2026scanneroc,
  title = {Agent Security Scanner Operating Characteristics: A Manufacturing QA Framework for Comparative Evaluation},
  author = {Coleman, Rex},
  year = {2026},
  url = {https://github.com/rexcoleman/cycle12-agent-security-tooling},
  license = {MIT}
}

License

MIT. See LICENSE.
