An autonomous, LLM-driven penetration testing agent that doesn't just find vulnerabilities — it exploits them.
pentest-agent uses large language models to iteratively plan, execute, and analyze security assessments against target systems. It combines AI reasoning with 40 offensive security skills to conduct authorized penetration tests, from reconnaissance through exploitation.
Tested against 45 of the 104 CTF challenges in the XBOW validation benchmark (the Easy-difficulty set reported below). Each challenge runs in Docker with a random flag that the agent must find through exploitation.
| Difficulty | Solved | Rate |
|---|---|---|
| Easy | 28/45 | 62% |
| Vulnerability Type | Solved | Rate |
|---|---|---|
| xss | 1/8 | 12% |
| command_injection | 5/6 | 83% |
| sqli | 2/5 | 40% |
| information_disclosure | 3/5 | 60% |
| privilege_escalation | 4/5 | 80% |
| idor | 3/4 | 75% |
| default_credentials | 4/4 | 100% |
| ssti | 4/4 | 100% |
| lfi | 2/4 | 50% |
| business_logic | 4/4 | 100% |
99.64s avg per challenge · 134,383 total tokens · Claude Code (zero-config, no API key)
Early iterations relied on prompt engineering to guide the LLM in writing correct HTTP/login code from scratch using urllib.request. This approach proved insufficient:
- Login misinterpretation: The LLM would call `json.loads()` on HTML responses from `/token`, get a `JSONDecodeError`, and wrongly conclude the login failed, when a 200 response with HTML actually means success (cookie/session-based auth). This single failure mode caused the agent to spend 10+ iterations stuck retrying credentials that had already worked.
- Brittle HTTP plumbing: Every `python_exec` script reimplemented cookie handling, token extraction, redirect following, and error handling. Small variations (missing `http.cookiejar`, not checking `Set-Cookie`, crashing on `HTTPError`) caused cascading failures.
- Prompt instructions ignored: Despite explicit instructions ("use `http.cookiejar`", "don't `json.loads()` first"), the LLM would revert to its trained patterns and write the exact code the prompts warned against. Adding more prompt constraints didn't improve reliability.
The current approach uses a helper library pattern: instead of teaching the LLM to write correct HTTP code, we provide a pre-tested CTFClient class (ctf_helpers.py) that handles login detection, cookie/JWT auth, response parsing, and flag extraction correctly. The LLM only needs to call CTFClient.login() — not reimplement the plumbing. This is deployed to /tmp/ctf_helpers.py before every python_exec invocation.
Additional reliability measures:
- Action deduplication: Hash `(skill, params)` to prevent the agent from repeating identical actions
- Structured fact extraction: Parse raw stdout for credentials, flags, status codes, and endpoints before passing to the Planner, so the LLM doesn't misinterpret raw output
- Schema validation: Validate Planner JSON output has required fields before execution
- Rolling analysis window: Last 3 results fed to the Planner so it doesn't lose context from earlier steps
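Three of these measures (deduplication, schema validation, and the rolling window) can be sketched with the standard library alone. The names `action_key`, `validate_plan`, `ActionHistory`, and the `REQUIRED_FIELDS` set are hypothetical; the real Planner schema may require different fields.

```python
import hashlib
import json
from collections import deque

REQUIRED_FIELDS = {"skill", "params"}  # assumed required Planner fields


def action_key(skill: str, params: dict) -> str:
    """Stable hash of (skill, params); key order inside params does not matter."""
    canonical = json.dumps([skill, params], sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()


def validate_plan(plan: object) -> list[str]:
    """Return a list of schema problems; an empty list means the plan is usable."""
    if not isinstance(plan, dict):
        return ["plan is not a JSON object"]
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(plan))]
    if "params" in plan and not isinstance(plan["params"], dict):
        problems.append("params must be an object")
    return problems


class ActionHistory:
    """Tracks executed actions and keeps only the most recent results."""

    def __init__(self, window: int = 3) -> None:
        self._seen: set[str] = set()
        self.recent_results: deque[dict] = deque(maxlen=window)

    def is_duplicate(self, skill: str, params: dict) -> bool:
        return action_key(skill, params) in self._seen

    def record(self, skill: str, params: dict, result: dict) -> None:
        self._seen.add(action_key(skill, params))
        self.recent_results.append(result)  # older entries fall off automatically
```

Serializing with `sort_keys=True` before hashing means two plans that differ only in parameter order are treated as the same action.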
- Autonomous agent loop — Plans attacks, executes tools, analyzes results, chains vulnerabilities, and adapts strategy
- Multi-agent architecture — Planner/Executor/Analyzer roles with clean context windows prevent reasoning degradation
- 40 built-in skills across 10 categories: recon, web, injection, auth, exploit, fuzzing, stress testing, and more
- MCP server mode — Expose all skills to Claude Desktop, VS Code Copilot, Cursor, or any MCP-compatible client
- Exploitation-focused — Doesn't stop at detection. Proves exploitability with RCE, auth bypass, data exfiltration
- Vulnerability chaining — Automatically chains findings (SSRF → cloud creds, LFI → RCE, JWT weakness + IDOR → account takeover)
- Multi-provider LLM support — Anthropic, OpenAI, or any LiteLLM-compatible model
- Zero-config with Claude Code — Uses your existing Claude subscription, no API keys needed
- Safety-first design — Scope enforcement, rate limiting, risk levels, time windows, approval gates
- Knowledge persistence — SQLite-based learning across engagements
- Docker sandboxing — Isolate tool execution in containers
- Multi-format reporting — JSON, HTML, and Markdown reports with evidence
- XBOW benchmarking — Built-in runner for the 104-challenge XBOW CTF benchmark suite with parallel execution and detailed metrics
- Plugin architecture — Drop in a Python file and it's auto-discovered
pip install -e .
# With MCP server support (for Claude Desktop / IDE integration)
pip install -e ".[mcp]"
pentest-agent init engagement.yaml
# Edit engagement.yaml with your targets and scope
# Zero-config if you have Claude Code installed; no API key needed!
pentest-agent run --target https://target.example.com
# From config file
pentest-agent run --config engagement.yaml
# Explicit provider (if you prefer direct API access)
pentest-agent run \
--config engagement.yaml \
--provider anthropic \
--model claude-sonnet-4-20250514 \
--report-format all \
--max-iterations 30
pentest-agent skills
pentest-agent history
pentest-agent can run as an MCP (Model Context Protocol) server, exposing all 40 skills as tools that any MCP-compatible AI client can call directly. This means you can use pentest-agent's skills through natural conversation in Claude Desktop, VS Code Copilot, Cursor, or any other MCP client.
- Install with MCP support:
pip install -e ".[mcp]"
- Add to your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
{
"mcpServers": {
"pentest-agent": {
"command": "pentest-agent",
"args": ["mcp-serve"]
}
}
}
- Restart Claude Desktop. You'll see pentest-agent's 40 skills in the tools menu.
Add to your Claude Code MCP config (~/.claude/claude_code_config.json):
{
"mcpServers": {
"pentest-agent": {
"command": "pentest-agent",
"args": ["mcp-serve"]
}
}
}

{
"mcpServers": {
"pentest-agent": {
"command": "pentest-agent",
"args": ["mcp-serve"]
}
}
}

Once connected, just talk to your AI assistant:
- "Scan example.com for open ports and services" → calls `nmap_port_scan`
- "Test this login form for SQL injection" → calls `sqli_check`
- "Check if the JWT token is vulnerable to alg:none" → calls `jwt_attack`
- "Fuzz for hidden API endpoints on the target" → calls `dir_fuzz` and `param_fuzz`
- "Run a full SSTI test against this search parameter" → calls `ssti_check`
- "List all available skills" → calls the `list_skills` meta-tool
- "Show me all findings so far" → calls the `get_findings` meta-tool
The AI client handles reasoning and orchestration — it picks the right skills, chains them together, and interprets results, while pentest-agent provides the security testing capabilities.
You can also run the MCP server directly:
# stdio mode (for client integration)
pentest-agent mcp-serve
# Or run the module directly
python -m pentest_agent.mcp_server

┌─────────────────────────────────────────────────────────────┐
│ CLI (click + rich) / MCP Server (stdio) │
├─────────────────────────────────────────────────────────────┤
│ PentestAgent (Coordinator) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ PLANNER │ │ EXECUTOR │ │ ANALYZER │ │
│ │ (LLM role) │→ │(deterministic│→ │ (LLM role) │ │
│ │ Decides │ │ Validates │ │ Interprets │ │
│ │ next action │ │ & executes) │ │ results │ │
│ └──────┬───────┘ └──────────────┘ └────────┬─────────┘ │
│ │ ↑ intelligence │ │
│ └──────────┘ feedback loop ────────────┘ │
│ │
│ ┌───────────┐ ┌───────────┐ ┌────────────┐ ┌───────────┐ │
│ │ LLM │ │ Safety │ │ Skill │ │ State │ │
│ │ Provider │ │ Control │ │ Registry │ │ Manager │ │
│ └───────────┘ └───────────┘ └────────────┘ └───────────┘ │
│ ┌───────────┐ ┌───────────┐ ┌────────────┐ │
│ │ Reporter │ │ Knowledge │ │ Evidence │ │
│ │ │ │ Base │ │ Graph │ │
│ └───────────┘ └───────────┘ └────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Skills (40 Plugins) │
│ recon/ web/ inject/ auth/ exploit/ fuzz/ stress/ │
│ network/ cloud/ vuln/ │
├─────────────────────────────────────────────────────────────┤
│ Docker Sandbox (optional) │
└─────────────────────────────────────────────────────────────┘
Each agent role gets a clean context window per call — no conversation history accumulation — preventing reasoning degradation over long engagements.
1. Initialize → Load config, register skills, set scope
2. PLANNER → Analyzes state + last analysis, decides next skill(s) to run
3. EXECUTOR → Validates safety, executes skill(s) — in parallel when possible
4. ANALYZER → Interprets results, populates evidence graph, extracts attack chains
5. Feed back → Analyzer's intelligence feeds forward to Planner
6. DAG check → If task graph has independent ready tasks, batch them for parallel execution
7. Reflect → Every 8 iterations, Analyzer assesses progress & strategy
8. Repeat → Go to 2 (until objectives met or max iterations)
9. Report → Generate reports with evidence graph, attack chains, and task execution history
The Executor is deterministic (no LLM) — skill execution shouldn't depend on AI reasoning. Only the Planner and Analyzer use the LLM, each with focused system prompts.
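The loop above can be sketched as a small driver that wires three roles together. This is an outline under stated assumptions, not the project's coordinator: `run_loop` and its callable parameters are hypothetical, the `done` key is an invented stop signal, and parallel execution and DAG batching are left out.

```python
from typing import Callable, Optional

Plan = dict      # e.g. {"skill": "nmap_port_scan", "params": {...}}
Analysis = dict  # whatever the Analyzer hands back to the Planner


def run_loop(
    plan_next: Callable[[Optional[Analysis]], Plan],   # PLANNER (LLM role)
    execute: Callable[[Plan], dict],                    # EXECUTOR (deterministic)
    analyze: Callable[[Plan, dict], Analysis],          # ANALYZER (LLM role)
    reflect: Callable[[list[Analysis]], None],          # periodic strategy review
    max_iterations: int = 30,
    reflect_every: int = 8,
) -> list[Analysis]:
    history: list[Analysis] = []
    last_analysis: Optional[Analysis] = None
    for iteration in range(1, max_iterations + 1):
        # Each role receives only what it needs, not the whole conversation.
        plan = plan_next(last_analysis)
        if plan.get("done"):  # assumed signal that objectives are met
            break
        result = execute(plan)
        last_analysis = analyze(plan, result)
        history.append(last_analysis)
        if iteration % reflect_every == 0:
            reflect(history)
    return history
```

Passing the roles in as plain callables keeps the Executor free of any LLM dependency and makes the loop easy to exercise with stubs.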
Every finding is backed by a chain of evidence. The evidence graph prevents hallucination by requiring proof at every level:
EVIDENCE (raw facts) → HYPOTHESIS (needs testing) → VULNERABILITY (confirmed) → EXPLOIT (proven)
Node types:
- Evidence: Raw data from tool output (e.g., "nmap found port 8080 running Tomcat 9.0.50"). Always `proven` confidence.
- Hypothesis: Inferred possibility that needs validation (e.g., "Tomcat 9.0.50 may be vulnerable to CVE-2021-42013"). Starts at `low` confidence.
- Vulnerability: Confirmed security issue backed by evidence (e.g., "Path traversal confirmed via `..;/` bypass"). `high` confidence.
- Exploit: Proven exploitation with demonstrated impact (e.g., "RCE achieved via JSP upload through path traversal"). `proven` confidence.
Edge types:
- `supports`: Evidence supports a hypothesis
- `confirms`: Evidence/test confirms a vulnerability
- `chains_to`: One finding enables another (attack chain: SSRF → cloud creds → lateral movement)
- `disproves`: Evidence disproves a hypothesis
- `leads_to`: One node leads to discovering another
Anti-hallucination: Vulnerabilities and exploits without evidence backing are flagged as unsubstantiated in the final report. The Analyzer creates hypotheses for uncertain observations and promotes them to vulnerabilities only when confirmed by further testing.
Attack chain tracking: The graph automatically identifies multi-step attack paths (e.g., port scan → version detection → CVE match → path traversal → RCE) and reports them as exploitable chains with full evidence trails.
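A minimal sketch of the "unsubstantiated" check, assuming a finding counts as backed when some path of non-`disproves` edges leads to it from an evidence node. The `EvidenceGraph` class below is a stand-in written for this example; the project's real graph may store confidence levels and apply stricter rules.

```python
from dataclasses import dataclass, field

NODE_TYPES = {"evidence", "hypothesis", "vulnerability", "exploit"}
# Assumption: every edge type except "disproves" can carry proof forward.
BACKING_EDGES = {"supports", "confirms", "chains_to", "leads_to"}


@dataclass
class EvidenceGraph:
    nodes: dict[str, str] = field(default_factory=dict)               # node id -> node type
    edges: list[tuple[str, str, str]] = field(default_factory=list)   # (source, edge type, target)

    def add_node(self, node_id: str, node_type: str) -> None:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.append((source, edge_type, target))

    def is_substantiated(self, node_id: str, _visiting: frozenset = frozenset()) -> bool:
        """True if the node is evidence or is reachable from evidence via backing edges."""
        if self.nodes[node_id] == "evidence":
            return True
        if node_id in _visiting:  # guard against cycles
            return False
        visiting = _visiting | {node_id}
        return any(
            edge_type in BACKING_EDGES and self.is_substantiated(source, visiting)
            for source, edge_type, target in self.edges
            if target == node_id
        )

    def unsubstantiated(self) -> list[str]:
        """Vulnerabilities and exploits with no evidence trail behind them."""
        return sorted(
            node_id
            for node_id, node_type in self.nodes.items()
            if node_type in {"vulnerability", "exploit"} and not self.is_substantiated(node_id)
        )
```

For example, a vulnerability confirmed by an evidence node passes the check, while one added with no incoming edges is returned by `unsubstantiated()`.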
Tasks are organized as a Directed Acyclic Graph (DAG) where edges represent dependencies. Independent tasks execute in parallel via asyncio.gather, significantly speeding up engagements.
recon-nmap ──┐
recon-dns  ──┼──→ fuzz-dirs   ──┐
recon-sub  ──┘    fuzz-params ──┼──→ inject-sqli ──→ exploit-chain
                                └──→ inject-xss
                                └──→ inject-ssti
In this example:
- 3 recon tasks run in parallel (no dependencies on each other)
- 2 fuzz tasks run in parallel once recon-nmap completes
- 3 injection tasks run in parallel once both fuzz tasks complete
- exploit-chain only starts after injection confirms a vulnerability
The Planner can dynamically modify the DAG during execution via graph edit operations (ADD_TASK, UPDATE_TASK, ADD_DEPENDENCY, DEPRECATE_TASK). When the Analyzer discovers new attack surface, the Planner adds new tasks with appropriate dependencies — no full plan regeneration needed.
Cycle detection prevents invalid dependency chains. Topological sort determines execution order. Tasks are batched by priority when multiple are ready.

| Skill | Risk | Description |
|---|---|---|
| `nmap_port_scan` | active | Port scanning and service detection |
| `dns_enumeration` | passive | DNS record enumeration and zone transfers |
| `subdomain_enum` | passive | Subdomain discovery (brute-force + subfinder) |
| `whois_lookup` | passive | WHOIS registration information |

| Skill | Risk | Description |
|---|---|---|
| `http_header_check` | passive | Security header analysis |
| `dir_bruteforce` | active | Directory and file discovery |
| `nuclei_scan` | active | Nuclei vulnerability scanner |
| `ssl_check` | passive | SSL/TLS certificate and config analysis |
| `tech_detect` | passive | Technology fingerprinting |
| `web_crawler` | passive | Page, form, and endpoint discovery |
| `waf_detect` | passive | WAF and CDN detection |
| `graphql_audit` | active | GraphQL introspection, schema dump, batching abuse, DoS testing |

| Skill | Risk | Description |
|---|---|---|
| `sqli_check` | active | SQL injection testing (sqlmap) |
| `xss_check` | active | Reflected XSS testing |
| `ssti_check` | intrusive | SSTI across 7+ template engines with RCE proof (Jinja2, Twig, FreeMarker, ERB, Mako, Smarty, EL) |
| `command_injection` | intrusive | OS command injection: output-based, blind time-based with verification, OOB callbacks |
| `nosql_injection` | intrusive | MongoDB operator injection ($ne, $gt, $regex), $where JS injection, blind time-based |
| `path_traversal` | intrusive | Path traversal / LFI with encoding bypasses, null bytes, PHP wrappers (php://filter, data://, expect://) |
| `ssrf_check` | intrusive | SSRF targeting cloud metadata (AWS/GCP/Azure IMDS), internal services, protocol smuggling (file://, gopher://) |
| `xxe_injection` | intrusive | XML External Entity: file read, blind OOB exfiltration, SVG upload, encoding bypasses |

| Skill | Risk | Description |
|---|---|---|
| `jwt_attack` | intrusive | JWT alg:none bypass, weak secret brute-force, algorithm confusion (RS256→HS256), kid injection |
| `auth_bypass` | active | HTTP verb tampering, path/header bypasses, 401/403 bypass, IDOR, parameter privilege escalation |
| `session_attack` | active | Session fixation, cookie attribute audit, token entropy analysis, logout invalidation |

| Skill | Risk | Description |
|---|---|---|
| `race_condition` | intrusive | Concurrent request racing for TOCTOU/double-spend bugs |
| `request_smuggling` | intrusive | HTTP request smuggling (CL.TE, TE.CL, TE.TE) via raw sockets |
| `prototype_pollution` | intrusive | JS prototype pollution via JSON body and query params with persistence check |
| `deserialization` | active | Detect insecure deserialization (Java, PHP, Python pickle, .NET ViewState) |
| `metasploit_exploit` | intrusive | Metasploit module execution |
| `metasploit_search` | passive | Search Metasploit for exploits |
| `cve_lookup` | passive | CVE search via NVD API |

| Skill | Risk | Description |
|---|---|---|
| `param_fuzz` | active | Hidden parameter discovery (100+ names) and value fuzzing (30+ payloads) |
| `dir_fuzz` | active | Directory/file brute-force with 150+ paths, extension testing, soft-404 detection |

| Skill | Risk | Description |
|---|---|---|
| `smb_enum` | active | SMB share and user enumeration |
| `snmp_enum` | active | SNMP community string and info enumeration |
| `service_bruteforce` | intrusive | Credential brute-force (Hydra) |

| Skill | Risk | Description |
|---|---|---|
| `s3_bucket_check` | passive | AWS S3 misconfiguration checks |
| `azure_storage_check` | passive | Azure Blob Storage checks |
| `gcp_bucket_check` | passive | GCP Cloud Storage checks |

| Skill | Risk | Description |
|---|---|---|
| `slowloris` | intrusive | Slow-connection DoS resilience testing via raw sockets |
| `resource_exhaustion` | intrusive | Body size, parameter count, JSON depth, header length, concurrency thresholds |

The agent follows an aggressive methodology, driven by LLM reasoning:
- Recon — Maps the full attack surface: subdomains, ports, services, technologies, endpoints
- Discovery — Fuzzes for hidden directories, parameters, APIs, and files
- Injection — Tests every discovered input for SQLi, XSS, SSTI, command injection, NoSQL injection, XXE, path traversal
- Auth attacks — Probes authentication: JWT manipulation, session flaws, auth bypass, IDOR
- Exploitation — SSRF for cloud credentials, LFI-to-RCE via PHP wrappers, prototype pollution, request smuggling, race conditions
- Chaining — Combines findings for maximum impact (e.g., SSRF → AWS IAM creds → lateral movement)
- Stress testing — Tests resilience against slowloris, resource exhaustion
The agent doesn't just report "possible SQL injection" — it proves it with extracted data, achieved RCE, or bypassed authentication.
Create a new file in any skills subdirectory:
from pentest_agent.skills.base import BaseSkill


class MyCustomSkill(BaseSkill):
    @property
    def name(self) -> str:
        return "my_custom_skill"

    @property
    def category(self) -> str:
        return "web"

    @property
    def description(self) -> str:
        return "Does something useful"

    @property
    def parameters(self) -> dict:
        return {
            "target": {"type": "str", "required": True, "description": "Target URL"},
        }

    @property
    def risk_level(self) -> str:
        return "active"  # passive | active | intrusive

    async def execute(self, params, state):
        # Your tool logic here
        return {
            "success": True,
            "summary": "What happened",
            "data": {"details": "here"},
            "errors": [],
        }

Skills are auto-discovered: just drop the file in and it's available.
name: My Assessment
targets:
  - https://target.example.com
  - 10.0.1.0/24
scope:
  allowed_hosts:
    - 10.0.1.0/24
    - target.example.com
  excluded_hosts:
    - 10.0.1.1
  allowed_domains:
    - "*.example.com"
  allowed_ports: [80, 443, 8080]
  max_severity: intrusive  # passive | active | intrusive
  time_window:
    start: "2024-01-15T09:00:00Z"
    end: "2024-01-15T18:00:00Z"
objectives:
  - Identify and exploit web application vulnerabilities
  - Attempt authentication bypass and privilege escalation
  - Test for RCE via injection and deserialization
  - Enumerate exposed services and cloud misconfigurations
llm:
  provider: auto  # auto | claude-code | anthropic | openai | litellm
  model: claude-sonnet-4-20250514
max_iterations: 50
require_approval:
  - command_injection
  - ssti_check
  - ssrf_check
  - path_traversal
  - nosql_injection
  - xxe_injection
  - jwt_attack
  - race_condition
  - request_smuggling
  - slowloris
  - resource_exhaustion
  - service_bruteforce
  - metasploit_exploit
report:
  format: all  # json | html | markdown | all
  output_dir: ./reports

pentest-agent enforces multiple safety layers:
| Control | Description |
|---|---|
| Scope enforcement | Only approved targets, domains, hosts, and ports |
| Exclusion lists | Hosts that must never be touched |
| Risk level caps | Maximum intrusiveness (passive → active → intrusive) |
| Rate limiting | Global RPM and per-target request caps |
| Time windows | Testing only during approved hours |
| Approval gates | Human sign-off for intrusive operations |
| Audit logging | Every action recorded with timestamps |
Intrusive skills (command injection, SSTI, SSRF, stress tests, etc.) require explicit approval by default.
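To make the scope and exclusion rules concrete, here is a small sketch of a target check built on the standard `ipaddress` and `fnmatch` modules. The `Scope` class and `allows` method are hypothetical and only mirror the field names from the example config; the project's actual enforcement (rate limits, time windows, risk caps) is not modeled.

```python
import fnmatch
import ipaddress
from dataclasses import dataclass, field


def _host_in(host: str, entries: list[str]) -> bool:
    """True if host equals an entry or is an IP inside an entry's CIDR range."""
    for entry in entries:
        if host == entry:
            return True
        try:
            if ipaddress.ip_address(host) in ipaddress.ip_network(entry, strict=False):
                return True
        except ValueError:
            continue  # host or entry is a hostname, not an IP or CIDR
    return False


@dataclass
class Scope:
    allowed_hosts: list[str] = field(default_factory=list)
    excluded_hosts: list[str] = field(default_factory=list)
    allowed_domains: list[str] = field(default_factory=list)
    allowed_ports: list[int] = field(default_factory=list)

    def allows(self, host: str, port: int) -> bool:
        if _host_in(host, self.excluded_hosts):
            return False  # exclusions always win
        if self.allowed_ports and port not in self.allowed_ports:
            return False
        if _host_in(host, self.allowed_hosts):
            return True
        # Wildcard domain patterns such as "*.example.com".
        return any(
            fnmatch.fnmatchcase(host.lower(), pattern.lower())
            for pattern in self.allowed_domains
        )
```

Checking exclusions before anything else means a host listed in both `allowed_hosts` (via a CIDR range) and `excluded_hosts` is still rejected.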
docker compose up pentest-agent
docker compose build sandbox
The sandbox image includes Kali tools (nmap, sqlmap, hydra, nuclei, subfinder, etc.) and runs with security constraints (read-only filesystem, memory limits, no privilege escalation).
pentest-agent includes a built-in benchmark runner compatible with the XBOW validation benchmarks — 104 Docker-based CTF challenges that are the industry standard for evaluating AI penetration testing agents.
Each challenge spins up a vulnerable Docker application with a random flag. The agent must find and extract the flag through exploitation — not just detection.
# Clone the XBOW benchmark suite
git clone https://github.com/xbow-engineering/validation-benchmarks.git
# Run all challenges
pentest-agent benchmark ./validation-benchmarks
# Run a specific category
pentest-agent benchmark ./validation-benchmarks --category web
# Run 10 challenges with 3 in parallel
pentest-agent benchmark ./validation-benchmarks -n 10 --parallel 3
# Filter by difficulty
pentest-agent benchmark ./validation-benchmarks --difficulty easy
# Use a specific model
pentest-agent benchmark ./validation-benchmarks --model claude-sonnet-4-20250514 --provider anthropic

1. Discovery → Scan benchmark directory for challenges (docker-compose.yml + metadata)
2. Setup → Build & start Docker container with random flag (FLAG{<hex>})
3. Agent run → pentest-agent runs a full engagement against the challenge
4. Flag check → Search all agent output, findings, and evidence for the flag
5. Teardown → Stop and remove containers
6. Report → Aggregate results by category, difficulty, and overall success rate
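The flag steps (2 and 4) amount to generating a random `FLAG{<hex>}` value and later searching everything the agent produced for that exact string. A minimal sketch, with `make_flag` and `flag_found` as hypothetical helper names and the 16-byte flag length as an assumption:

```python
import re
import secrets

FLAG_PATTERN = re.compile(r"FLAG\{[0-9a-f]+\}")


def make_flag(num_bytes: int = 16) -> str:
    """Generate a random flag in the FLAG{<hex>} format."""
    return "FLAG{" + secrets.token_hex(num_bytes) + "}"


def flag_found(expected_flag: str, *outputs: str) -> bool:
    """True if the exact flag appears in any captured output, finding, or evidence text."""
    return any(expected_flag in text for text in outputs)
```

Matching the exact generated value, rather than anything that merely looks like a flag, avoids counting a hallucinated or stale flag as a pass.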
Results are saved to ./benchmark-results/benchmark-results.json with:
- Per-challenge: pass/fail, time, iterations, tokens, findings count
- By category: success rate per challenge type (web, crypto, misc, etc.)
- By difficulty: success rate per difficulty level
- Aggregates: overall success rate, total time, total tokens, averages
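The per-category and per-difficulty rates can be derived from the per-challenge records with a simple grouping pass. The record keys used below (`category`, `difficulty`, `passed`) are assumptions for illustration; check the generated JSON for the actual field names.

```python
from collections import defaultdict


def success_rates(results: list[dict], key: str) -> dict[str, float]:
    """Percentage of passed challenges, grouped by the given record key."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in results:
        group = record[key]
        totals[group] += 1
        passed[group] += 1 if record["passed"] else 0
    return {group: round(100 * passed[group] / totals[group], 1) for group in totals}


sample = [
    {"category": "web", "difficulty": "easy", "passed": True},
    {"category": "web", "difficulty": "easy", "passed": False},
    {"category": "crypto", "difficulty": "easy", "passed": True},
]
by_category = success_rates(sample, "category")
by_difficulty = success_rates(sample, "difficulty")
```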
┌───────────────────┬────────┐
│ Metric │ Value │
├───────────────────┼────────┤
│ Total challenges │ 104 │
│ Passed │ 42 │
│ Failed │ 58 │
│ Errored │ 4 │
│ Success rate │ 40.4% │
│ Total time │ 5420s │
│ Total tokens │ 892000 │
└───────────────────┴────────┘
The runner expects this directory layout:
validation-benchmarks/
  web/
    sqli-login/
      docker-compose.yml   # Must accept FLAG env var
      challenge.json       # Optional: {difficulty, description, target_port}
    xss-reflected/
      docker-compose.yml
  crypto/
    weak-jwt/
      docker-compose.yml
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Lint
ruff check .
# Type check
mypy pentest_agent

- Python 3.11+
- One of the following for LLM access:
  - Claude Code CLI installed and authenticated (zero-config, uses your existing subscription)
  - `ANTHROPIC_API_KEY` environment variable for direct Anthropic API
  - `OPENAI_API_KEY` for OpenAI-compatible models
- Security tools installed for skills you want to use (nmap, sqlmap, nuclei, etc.)
- Docker (optional, for sandboxed execution)
Plug-and-play: If you have Claude Code installed, just `pip install -e .` and run. No API keys needed.
MIT
This tool is designed for authorized security testing only. Always obtain proper written authorization before testing any systems. Unauthorized access to computer systems is illegal. The authors are not responsible for misuse of this tool.