pentest-agent

An autonomous, LLM-driven penetration testing agent that doesn't just find vulnerabilities — it exploits them.

pentest-agent uses large language models to iteratively plan, execute, and analyze security assessments against target systems. It combines AI reasoning with 40 offensive security skills to conduct authorized penetration tests, from reconnaissance through exploitation.

XBOW Benchmark: 62.2% (28/45)

Tested against 45 real-world CTF challenges from the XBOW validation benchmark (the full suite contains 104 challenges). Each challenge runs in Docker with a random flag that the agent must find through exploitation.

Difficulty	Solved	Rate
Easy	28/45	62%

Vulnerability Type	Solved	Rate
xss	1/8	12%
command_injection	5/6	83%
sqli	2/5	40%
information_disclosure	3/5	60%
privilege_escalation	4/5	80%
idor	3/4	75%
default_credentials	4/4	100%
ssti	4/4	100%
lfi	2/4	50%
business_logic	4/4	100%

99.64s avg per challenge · 134,383 total tokens · Claude Code (zero-config, no API key)

Approach: Why a Helper Library Instead of Prompt Engineering

Early iterations relied on prompt engineering to guide the LLM in writing correct HTTP/login code from scratch using urllib.request. This approach proved insufficient:

Login misinterpretation: The LLM would call json.loads() on HTML responses from /token, get a JSONDecodeError, and wrongly conclude the login failed — when a 200 response with HTML actually means success (cookie/session-based auth). This single failure mode caused the agent to spend 10+ iterations stuck retrying credentials that had already worked.
Brittle HTTP plumbing: Every python_exec script reimplemented cookie handling, token extraction, redirect following, and error handling. Small variations (missing http.cookiejar, not checking Set-Cookie, crashing on HTTPError) caused cascading failures.
Prompt instructions ignored: Despite explicit instructions ("use http.cookiejar", "don't json.loads() first"), the LLM would revert to its trained patterns and write the exact code the prompts warned against. Adding more prompt constraints didn't improve reliability.

The current approach uses a helper library pattern: instead of teaching the LLM to write correct HTTP code, we provide a pre-tested CTFClient class (ctf_helpers.py) that handles login detection, cookie/JWT auth, response parsing, and flag extraction correctly. The LLM only needs to call CTFClient.login() rather than reimplement the plumbing. The helper file is deployed to /tmp/ctf_helpers.py before every python_exec invocation.
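The login misinterpretation described above can be avoided by deciding from the status code, cookies, and content type before touching the body. The following is a minimal sketch of that idea, not the actual CTFClient code; the function name classify_login_response, its arguments, and its return labels are hypothetical.

```python
import json
from typing import Optional


def classify_login_response(
    status: int, content_type: str, body: str, set_cookie: Optional[str]
) -> str:
    """Return 'success', 'failure', or 'unknown' for a login HTTP response.

    Hypothetical helper: the real ctf_helpers.py may use different rules.
    """
    # A cookie on a 2xx/3xx response is treated as session-based success,
    # whatever the body format is (HTML is fine).
    if 200 <= status < 400 and set_cookie:
        return "success"

    # Only parse JSON when the server declares a JSON body.
    if "application/json" in content_type.lower():
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            return "unknown"  # malformed body is not proof of failure
        if isinstance(payload, dict) and ("access_token" in payload or "token" in payload):
            return "success"
        return "failure"

    if status in (401, 403):
        return "failure"

    # Non-JSON body, no cookie, no auth error: not enough information.
    return "unknown"
```

The point of the design is that a JSONDecodeError never gets interpreted as "wrong credentials"; an undecidable response is reported as unknown so the caller can probe an authenticated page instead of retrying the login.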

Additional reliability measures:

Action deduplication: Hash (skill, params) to prevent the agent from repeating identical actions
Structured fact extraction: Parse raw stdout for credentials, flags, status codes, and endpoints before passing to the Planner, so the LLM doesn't misinterpret raw output
Schema validation: Validate Planner JSON output has required fields before execution
Rolling analysis window: Last 3 results fed to the Planner so it doesn't lose context from earlier steps
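Three of the measures above (action deduplication, schema validation, and the rolling window) can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: the names ActionLog, action_key, validate_plan, and the two required fields are hypothetical.

```python
import hashlib
import json
from collections import deque

# Assumed minimal Planner schema; the real required fields may differ.
REQUIRED_FIELDS = ("skill", "params")


def action_key(skill: str, params: dict) -> str:
    """Stable hash of (skill, params); key order in params does not matter."""
    canonical = json.dumps({"skill": skill, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_plan(plan: dict) -> list:
    """Return a list of schema errors (empty when the plan is usable)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in plan]
    if "params" in plan and not isinstance(plan["params"], dict):
        errors.append("params must be an object")
    return errors


class ActionLog:
    """Tracks executed actions and keeps only the most recent results."""

    def __init__(self, window: int = 3) -> None:
        self._seen = set()
        self.recent = deque(maxlen=window)  # rolling window for the Planner

    def should_run(self, plan: dict) -> bool:
        """Reject invalid plans and exact repeats of earlier actions."""
        if validate_plan(plan):
            return False
        key = action_key(plan["skill"], plan["params"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def record(self, result: dict) -> None:
        """Append a result; the oldest one drops out once the window is full."""
        self.recent.append(result)
```

Serializing with sort_keys=True makes the hash independent of dictionary ordering, so the same action expressed with reordered parameters is still recognized as a repeat.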

Features

Autonomous agent loop — Plans attacks, executes tools, analyzes results, chains vulnerabilities, and adapts strategy
Multi-agent architecture — Planner/Executor/Analyzer roles with clean context windows prevent reasoning degradation
40 built-in skills across 10 categories: recon, web, injection, auth, exploit, fuzzing, stress testing, and more
MCP server mode — Expose all skills to Claude Desktop, VS Code Copilot, Cursor, or any MCP-compatible client
Exploitation-focused — Doesn't stop at detection. Proves exploitability with RCE, auth bypass, data exfiltration
Vulnerability chaining — Automatically chains findings (SSRF → cloud creds, LFI → RCE, JWT weakness + IDOR → account takeover)
Multi-provider LLM support — Anthropic, OpenAI, or any LiteLLM-compatible model
Zero-config with Claude Code — Uses your existing Claude subscription, no API keys needed
Safety-first design — Scope enforcement, rate limiting, risk levels, time windows, approval gates
Knowledge persistence — SQLite-based learning across engagements
Docker sandboxing — Isolate tool execution in containers
Multi-format reporting — JSON, HTML, and Markdown reports with evidence
XBOW benchmarking — Built-in runner for the 104-challenge XBOW CTF benchmark suite with parallel execution and detailed metrics
Plugin architecture — Drop in a Python file and it's auto-discovered

Quick Start

Install

pip install -e .

# With MCP server support (for Claude Desktop / IDE integration)
pip install -e ".[mcp]"

Create an engagement config

pentest-agent init engagement.yaml
# Edit engagement.yaml with your targets and scope

Run

# Zero-config if you have Claude Code installed — no API key needed!
pentest-agent run --target https://target.example.com

# From config file
pentest-agent run --config engagement.yaml

# Explicit provider (if you prefer direct API access)
pentest-agent run \
  --config engagement.yaml \
  --provider anthropic \
  --model claude-sonnet-4-20250514 \
  --report-format all \
  --max-iterations 30

List available skills

pentest-agent skills

View engagement history

pentest-agent history

MCP Server Mode

pentest-agent can run as an MCP (Model Context Protocol) server, exposing all 40 skills as tools that any MCP-compatible AI client can call directly. This means you can use pentest-agent's skills through natural conversation in Claude Desktop, VS Code Copilot, Cursor, or any other MCP client.

Setup with Claude Desktop

Install with MCP support:

pip install -e ".[mcp]"

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "pentest-agent": {
      "command": "pentest-agent",
      "args": ["mcp-serve"]
    }
  }
}

Restart Claude Desktop. You'll see pentest-agent's 40 skills in the tools menu.

Setup with Claude Code

Add to your Claude Code MCP config (~/.claude/claude_code_config.json):

{
  "mcpServers": {
    "pentest-agent": {
      "command": "pentest-agent",
      "args": ["mcp-serve"]
    }
  }
}

Setup with VS Code (Copilot / Continue)

Add the same server entry to the MCP configuration of your VS Code extension:

{
  "mcpServers": {
    "pentest-agent": {
      "command": "pentest-agent",
      "args": ["mcp-serve"]
    }
  }
}

What you can do

Once connected, just talk to your AI assistant:

"Scan example.com for open ports and services" → calls nmap_port_scan
"Test this login form for SQL injection" → calls sqli_check
"Check if the JWT token is vulnerable to alg:none" → calls jwt_attack
"Fuzz for hidden API endpoints on the target" → calls dir_fuzz and param_fuzz
"Run a full SSTI test against this search parameter" → calls ssti_check
"List all available skills" → calls the list_skills meta-tool
"Show me all findings so far" → calls the get_findings meta-tool

The AI client handles reasoning and orchestration — it picks the right skills, chains them together, and interprets results, while pentest-agent provides the security testing capabilities.

Run standalone

You can also run the MCP server directly:

# stdio mode (for client integration)
pentest-agent mcp-serve

# Or run the module directly
python -m pentest_agent.mcp_server

Architecture

┌─────────────────────────────────────────────────────────────┐
│          CLI (click + rich) / MCP Server (stdio)            │
├─────────────────────────────────────────────────────────────┤
│                   PentestAgent (Coordinator)                 │
│                                                             │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐ │
│   │   PLANNER    │  │   EXECUTOR   │  │    ANALYZER      │ │
│   │  (LLM role)  │→ │(deterministic│→ │   (LLM role)     │ │
│   │  Decides     │  │  Validates   │  │   Interprets     │ │
│   │  next action │  │  & executes) │  │   results        │ │
│   └──────┬───────┘  └──────────────┘  └────────┬─────────┘ │
│          │          ↑ intelligence              │           │
│          └──────────┘ feedback loop ────────────┘           │
│                                                             │
│  ┌───────────┐ ┌───────────┐ ┌────────────┐ ┌───────────┐  │
│  │ LLM       │ │ Safety    │ │ Skill      │ │ State     │  │
│  │ Provider  │ │ Control   │ │ Registry   │ │ Manager   │  │
│  └───────────┘ └───────────┘ └────────────┘ └───────────┘  │
│  ┌───────────┐ ┌───────────┐ ┌────────────┐                 │
│  │ Reporter  │ │ Knowledge │ │ Evidence   │                 │
│  │           │ │ Base      │ │ Graph      │                 │
│  └───────────┘ └───────────┘ └────────────┘                 │
├─────────────────────────────────────────────────────────────┤
│                     Skills (40 Plugins)                      │
│  recon/ web/ inject/ auth/ exploit/ fuzz/ stress/           │
│  network/ cloud/ vuln/                                      │
├─────────────────────────────────────────────────────────────┤
│               Docker Sandbox (optional)                      │
└─────────────────────────────────────────────────────────────┘

Multi-Agent Loop (Planner → Executor → Analyzer)

Each agent role gets a clean context window per call — no conversation history accumulation — preventing reasoning degradation over long engagements.

1. Initialize → Load config, register skills, set scope
2. PLANNER    → Analyzes state + last analysis, decides next skill(s) to run
3. EXECUTOR   → Validates safety, executes skill(s) — in parallel when possible
4. ANALYZER   → Interprets results, populates evidence graph, extracts attack chains
5. Feed back  → Analyzer's intelligence feeds forward to Planner
6. DAG check  → If task graph has independent ready tasks, batch them for parallel execution
7. Reflect    → Every 8 iterations, Analyzer assesses progress & strategy
8. Repeat     → Go to 2 (until objectives met or max iterations)
9. Report     → Generate reports with evidence graph, attack chains, and task execution history

The Executor is deterministic (no LLM) — skill execution shouldn't depend on AI reasoning. Only the Planner and Analyzer use the LLM, each with focused system prompts.
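The control flow of the loop above can be sketched as a plain function that receives the three roles as callables. This is a simplified illustration, not the PentestAgent coordinator: run_engagement, the shape of the state dictionary, and the callable signatures are assumptions, and safety checks and parallel batching are omitted.

```python
def run_engagement(plan, execute, analyze, reflect, max_iterations=30, reflect_every=8):
    """Drive a Planner -> Executor -> Analyzer loop.

    plan(state)     -> next action, or None when objectives are met (LLM role)
    execute(action) -> raw result (deterministic, no LLM)
    analyze(result) -> interpretation fed to the next plan() call (LLM role)
    reflect(state)  -> periodic strategy review
    """
    state = {"results": [], "analysis": None, "reflections": 0, "done": False}
    for iteration in range(1, max_iterations + 1):
        # Each role only sees `state`, not an accumulated conversation.
        action = plan(state)
        if action is None:
            state["done"] = True
            break
        result = execute(action)
        state["results"].append(result)
        state["analysis"] = analyze(result)
        # Periodic reflection, every `reflect_every` iterations.
        if iteration % reflect_every == 0:
            reflect(state)
            state["reflections"] += 1
    return state
```

Passing the roles in as callables keeps the Executor swappable for a stub in tests, which matches the idea that skill execution should not depend on LLM reasoning.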

Evidence Graph

Every finding is backed by a chain of evidence. The evidence graph prevents hallucination by requiring proof at every level:

EVIDENCE (raw facts)  →  HYPOTHESIS (needs testing)  →  VULNERABILITY (confirmed)  →  EXPLOIT (proven)

Node types:

Evidence: Raw data from tool output (e.g., "nmap found port 8080 running Tomcat 9.0.50"). Confidence is always proven.
Hypothesis: Inferred possibility that needs validation (e.g., "Tomcat 9.0.50 may be vulnerable to a known path traversal CVE"). Confidence starts at low.
Vulnerability: Confirmed security issue backed by evidence (e.g., "Path traversal confirmed via ..;/ bypass"). Confidence is high.
Exploit: Proven exploitation with demonstrated impact (e.g., "RCE achieved via JSP upload through path traversal"). Confidence is proven.

Edge types:

supports — Evidence supports a hypothesis
confirms — Evidence/test confirms a vulnerability
chains_to — One finding enables another (attack chain: SSRF → cloud creds → lateral movement)
disproves — Evidence disproves a hypothesis
leads_to — One node leads to discovering another

Anti-hallucination: Vulnerabilities and exploits without evidence backing are flagged as unsubstantiated in the final report. The Analyzer creates hypotheses for uncertain observations and promotes them to vulnerabilities only when confirmed by further testing.

Attack chain tracking: The graph automatically identifies multi-step attack paths (e.g., port scan → version detection → CVE match → path traversal → RCE) and reports them as exploitable chains with full evidence trails.
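One way to picture the anti-hallucination check is a small graph in which a vulnerability or exploit counts as substantiated only if it can be traced back to an evidence node. The sketch below is an assumption about how such a check could work, not the project's data model; the class EvidenceGraph and its methods are hypothetical, and only the node and edge type names come from the lists above.

```python
from dataclasses import dataclass, field

NODE_TYPES = {"evidence", "hypothesis", "vulnerability", "exploit"}
EDGE_TYPES = {"supports", "confirms", "chains_to", "disproves", "leads_to"}


@dataclass
class EvidenceGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> node type
    edges: list = field(default_factory=list)  # (source, edge_type, target)

    def add_node(self, node_id: str, node_type: str) -> None:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges.append((source, edge_type, target))

    def _is_backed(self, node_id: str, seen=None) -> bool:
        """True if node_id is evidence or reachable backwards from evidence."""
        if seen is None:
            seen = set()
        if self.nodes[node_id] == "evidence":
            return True
        seen.add(node_id)
        for source, edge_type, target in self.edges:
            # A 'disproves' edge never counts as backing.
            if target == node_id and edge_type != "disproves" and source not in seen:
                if self._is_backed(source, seen):
                    return True
        return False

    def unsubstantiated(self) -> list:
        """Vulnerabilities and exploits with no trail back to raw evidence."""
        return [
            node_id
            for node_id, node_type in self.nodes.items()
            if node_type in ("vulnerability", "exploit") and not self._is_backed(node_id)
        ]
```

With this rule an exploit that chains from a confirmed vulnerability inherits that vulnerability's evidence trail, while a finding asserted without any incoming link is reported as unsubstantiated.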

Parallel Task Execution (DAG Scheduler)

Tasks are organized as a Directed Acyclic Graph (DAG) where edges represent dependencies. Independent tasks execute in parallel via asyncio.gather, significantly speeding up engagements.

recon-nmap ──┐
recon-dns  ──┼──→ fuzz-dirs  ──┐
recon-sub  ──┘    fuzz-params ──┼──→ inject-sqli ──→ exploit-chain
                               └──→ inject-xss
                               └──→ inject-ssti

In this example:

3 recon tasks run in parallel (no dependencies on each other)
2 fuzz tasks run in parallel once recon-nmap completes
3 injection tasks run in parallel once both fuzz tasks complete
exploit-chain only starts after injection confirms a vulnerability

The Planner can dynamically modify the DAG during execution via graph edit operations (ADD_TASK, UPDATE_TASK, ADD_DEPENDENCY, DEPRECATE_TASK). When the Analyzer discovers new attack surface, the Planner adds new tasks with appropriate dependencies — no full plan regeneration needed.

Cycle detection prevents invalid dependency chains. Topological sort determines execution order. Tasks are batched by priority when multiple are ready.
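The scheduling idea (topological ordering, cycle detection, and running each ready batch with asyncio.gather) can be sketched as follows. This is a minimal illustration rather than the project's scheduler: ready_batches and run_dag are hypothetical names, the dependency map format is an assumption, and priorities and dynamic graph edits are left out.

```python
import asyncio


def ready_batches(deps: dict) -> list:
    """Group tasks into batches; every task's dependencies sit in earlier batches.

    `deps` maps a task name to the set of task names it depends on.
    Raises ValueError on a cycle (or on a dependency that is not a known task).
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    batches = []
    while remaining:
        batch = sorted(task for task, d in remaining.items() if d <= done)
        if not batch:
            raise ValueError(f"unresolvable dependencies among: {sorted(remaining)}")
        batches.append(batch)
        done.update(batch)
        for task in batch:
            del remaining[task]
    return batches


async def run_dag(deps: dict, run_task) -> dict:
    """Run each batch concurrently; `run_task(name)` must be a coroutine function."""
    results = {}
    for batch in ready_batches(deps):
        outputs = await asyncio.gather(*(run_task(name) for name in batch))
        results.update(zip(batch, outputs))
    return results
```

A batch-by-batch scheduler is simpler than one that starts each task the moment its own dependencies finish; it gives up some concurrency in exchange for an execution order that is easy to log and reason about.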

Built-in Skills (40)

Reconnaissance

Skill	Risk	Description
`nmap_port_scan`	active	Port scanning and service detection
`dns_enumeration`	passive	DNS record enumeration and zone transfers
`subdomain_enum`	passive	Subdomain discovery (brute-force + subfinder)
`whois_lookup`	passive	WHOIS registration information

Web Application

Skill	Risk	Description
`http_header_check`	passive	Security header analysis
`dir_bruteforce`	active	Directory and file discovery
`nuclei_scan`	active	Nuclei vulnerability scanner
`ssl_check`	passive	SSL/TLS certificate and config analysis
`tech_detect`	passive	Technology fingerprinting
`web_crawler`	passive	Page, form, and endpoint discovery
`waf_detect`	passive	WAF and CDN detection
`graphql_audit`	active	GraphQL introspection, schema dump, batching abuse, DoS testing

Injection

Skill	Risk	Description
`sqli_check`	active	SQL injection testing (sqlmap)
`xss_check`	active	Reflected XSS testing
`ssti_check`	intrusive	SSTI across 7+ template engines with RCE proof (Jinja2, Twig, FreeMarker, ERB, Mako, Smarty, EL)
`command_injection`	intrusive	OS command injection — output-based, blind time-based with verification, OOB callbacks
`nosql_injection`	intrusive	MongoDB operator injection ($ne, $gt, $regex), $where JS injection, blind time-based
`path_traversal`	intrusive	Path traversal / LFI with encoding bypasses, null bytes, PHP wrappers (php://filter, data://, expect://)
`ssrf_check`	intrusive	SSRF targeting cloud metadata (AWS/GCP/Azure IMDS), internal services, protocol smuggling (file://, gopher://)
`xxe_injection`	intrusive	XML External Entity — file read, blind OOB exfiltration, SVG upload, encoding bypasses

Authentication & Authorization

Skill	Risk	Description
`jwt_attack`	intrusive	JWT alg:none bypass, weak secret brute-force, algorithm confusion (RS256→HS256), kid injection
`auth_bypass`	active	HTTP verb tampering, path/header bypasses, 401/403 bypass, IDOR, parameter privilege escalation
`session_attack`	active	Session fixation, cookie attribute audit, token entropy analysis, logout invalidation

Exploitation

Skill	Risk	Description
`race_condition`	intrusive	Concurrent request racing for TOCTOU/double-spend bugs
`request_smuggling`	intrusive	HTTP request smuggling (CL.TE, TE.CL, TE.TE) via raw sockets
`prototype_pollution`	intrusive	JS prototype pollution via JSON body and query params with persistence check
`deserialization`	active	Detect insecure deserialization (Java, PHP, Python pickle, .NET ViewState)
`metasploit_exploit`	intrusive	Metasploit module execution
`metasploit_search`	passive	Search Metasploit for exploits
`cve_lookup`	passive	CVE search via NVD API

Fuzzing & Discovery

Skill	Risk	Description
`param_fuzz`	active	Hidden parameter discovery (100+ names) and value fuzzing (30+ payloads)
`dir_fuzz`	active	Directory/file brute-force with 150+ paths, extension testing, soft-404 detection

Network

Skill	Risk	Description
`smb_enum`	active	SMB share and user enumeration
`snmp_enum`	active	SNMP community string and info enumeration
`service_bruteforce`	intrusive	Credential brute-force (Hydra)

Cloud

Skill	Risk	Description
`s3_bucket_check`	passive	AWS S3 misconfiguration checks
`azure_storage_check`	passive	Azure Blob Storage checks
`gcp_bucket_check`	passive	GCP Cloud Storage checks

Stress Testing

Skill	Risk	Description
`slowloris`	intrusive	Slow-connection DoS resilience testing via raw sockets
`resource_exhaustion`	intrusive	Body size, parameter count, JSON depth, header length, concurrency thresholds

How the Agent Attacks

The agent follows an aggressive methodology, driven by LLM reasoning:

Recon — Maps the full attack surface: subdomains, ports, services, technologies, endpoints
Discovery — Fuzzes for hidden directories, parameters, APIs, and files
Injection — Tests every discovered input for SQLi, XSS, SSTI, command injection, NoSQL injection, XXE, path traversal
Auth attacks — Probes authentication: JWT manipulation, session flaws, auth bypass, IDOR
Exploitation — SSRF for cloud credentials, LFI-to-RCE via PHP wrappers, prototype pollution, request smuggling, race conditions
Chaining — Combines findings for maximum impact (e.g., SSRF → AWS IAM creds → lateral movement)
Stress testing — Tests resilience against slowloris, resource exhaustion

The agent doesn't just report "possible SQL injection" — it proves it with extracted data, achieved RCE, or bypassed authentication.

Adding Custom Skills

Create a new file in any skills subdirectory:

from pentest_agent.skills.base import BaseSkill

class MyCustomSkill(BaseSkill):
    @property
    def name(self) -> str:
        return "my_custom_skill"

    @property
    def category(self) -> str:
        return "web"

    @property
    def description(self) -> str:
        return "Does something useful"

    @property
    def parameters(self) -> dict:
        return {
            "target": {"type": "str", "required": True, "description": "Target URL"},
        }

    @property
    def risk_level(self) -> str:
        return "active"  # passive | active | intrusive

    async def execute(self, params, state):
        # Your tool logic here
        return {
            "success": True,
            "summary": "What happened",
            "data": {"details": "here"},
            "errors": [],
        }

Skills are auto-discovered — just drop the file in and it's available.
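How auto-discovery might work is sketched below: import every Python file under a directory and collect the classes defined there that expose an execute method. This is an assumption about the mechanism, not the project's loader; discover_skills is a hypothetical name, and the real registry presumably checks for BaseSkill subclasses instead of duck typing.

```python
import importlib.util
import inspect
from pathlib import Path


def discover_skills(directory: str) -> dict:
    """Map class name -> class for every skill-like class found under `directory`."""
    found = {}
    for path in sorted(Path(directory).rglob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        # Load the file as a standalone module under a synthetic name.
        spec = importlib.util.spec_from_file_location(f"plugin_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            defined_here = cls.__module__ == module.__name__
            if defined_here and callable(getattr(cls, "execute", None)):
                found[cls.__name__] = cls
    return found
```

Note that importing a plugin file executes its top-level code, so a loader like this should only be pointed at directories you trust.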

Engagement Configuration

name: My Assessment
targets:
  - https://target.example.com
  - 10.0.1.0/24

scope:
  allowed_hosts:
    - 10.0.1.0/24
    - target.example.com
  excluded_hosts:
    - 10.0.1.1
  allowed_domains:
    - "*.example.com"
  allowed_ports: [80, 443, 8080]
  max_severity: intrusive   # passive | active | intrusive
  time_window:
    start: "2024-01-15T09:00:00Z"
    end: "2024-01-15T18:00:00Z"

objectives:
  - Identify and exploit web application vulnerabilities
  - Attempt authentication bypass and privilege escalation
  - Test for RCE via injection and deserialization
  - Enumerate exposed services and cloud misconfigurations

llm:
  provider: auto          # auto | claude-code | anthropic | openai | litellm
  model: claude-sonnet-4-20250514

max_iterations: 50
require_approval:
  - command_injection
  - ssti_check
  - ssrf_check
  - path_traversal
  - nosql_injection
  - xxe_injection
  - jwt_attack
  - race_condition
  - request_smuggling
  - slowloris
  - resource_exhaustion
  - service_bruteforce
  - metasploit_exploit

report:
  format: all             # json | html | markdown | all
  output_dir: ./reports

Safety Controls

pentest-agent enforces multiple safety layers:

Control	Description
Scope enforcement	Only approved targets, domains, hosts, and ports
Exclusion lists	Hosts that must never be touched
Risk level caps	Maximum intrusiveness (passive → active → intrusive)
Rate limiting	Global RPM and per-target request caps
Time windows	Testing only during approved hours
Approval gates	Human sign-off for intrusive operations
Audit logging	Every action recorded with timestamps

Intrusive skills (command injection, SSTI, SSRF, stress tests, etc.) require explicit approval by default.
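Scope enforcement, risk caps, and time windows reduce to a few pure checks that run before any skill executes. The sketch below uses the field names from the engagement configuration above, but the functions themselves (host_in_scope, risk_allowed, in_time_window) are hypothetical and are not the project's Safety Control module.

```python
import ipaddress
from datetime import datetime
from fnmatch import fnmatch

RISK_ORDER = ["passive", "active", "intrusive"]


def host_in_scope(host, allowed_hosts, excluded_hosts, allowed_domains) -> bool:
    """Exclusions win; IPs are matched against networks, names against patterns."""
    if host in excluded_hosts:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None  # not an IP literal, treat as a hostname
    if addr is not None:
        for entry in allowed_hosts:
            try:
                if addr in ipaddress.ip_network(entry, strict=False):
                    return True
            except ValueError:
                continue  # entry is a hostname, not a network
        return False
    return host in allowed_hosts or any(fnmatch(host, pat) for pat in allowed_domains)


def risk_allowed(skill_risk: str, max_severity: str) -> bool:
    """A skill may run only if its risk level does not exceed the cap."""
    return RISK_ORDER.index(skill_risk) <= RISK_ORDER.index(max_severity)


def in_time_window(now: str, start: str, end: str) -> bool:
    """Compare ISO 8601 timestamps; a trailing 'Z' is normalized to +00:00."""
    def parse(value: str) -> datetime:
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parse(start) <= parse(now) <= parse(end)
```

For example, with the sample configuration 10.0.1.25 is in scope, 10.0.1.1 is rejected by the exclusion list, and an intrusive skill is refused when max_severity is active.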

Docker

Run the agent in Docker

docker compose up pentest-agent

Build the sandbox image

docker compose build sandbox

The sandbox image includes Kali tools (nmap, sqlmap, hydra, nuclei, subfinder, etc.) and runs with security constraints (read-only filesystem, memory limits, no privilege escalation).

Benchmarking (XBOW)

pentest-agent includes a built-in benchmark runner compatible with the XBOW validation benchmarks: 104 Docker-based CTF challenges widely used to evaluate AI penetration testing agents.

Each challenge spins up a vulnerable Docker application with a random flag. The agent must find and extract the flag through exploitation — not just detection.

Running benchmarks

# Clone the XBOW benchmark suite
git clone https://github.com/xbow-engineering/validation-benchmarks.git

# Run all challenges
pentest-agent benchmark ./validation-benchmarks

# Run a specific category
pentest-agent benchmark ./validation-benchmarks --category web

# Run 10 challenges with 3 in parallel
pentest-agent benchmark ./validation-benchmarks -n 10 --parallel 3

# Filter by difficulty
pentest-agent benchmark ./validation-benchmarks --difficulty easy

# Use a specific model
pentest-agent benchmark ./validation-benchmarks --model claude-sonnet-4-20250514 --provider anthropic

How it works

1. Discovery  → Scan benchmark directory for challenges (docker-compose.yml + metadata)
2. Setup      → Build & start Docker container with random flag (FLAG{<hex>})
3. Agent run  → pentest-agent runs a full engagement against the challenge
4. Flag check → Search all agent output, findings, and evidence for the flag
5. Teardown   → Stop and remove containers
6. Report     → Aggregate results by category, difficulty, and overall success rate
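Steps 2 and 4 above (random flag generation and the flag check) can be sketched as follows. The FLAG{<hex>} format comes from the list above; the function names and the 16-byte default are assumptions, since the length of the hex string is not stated.

```python
import re
import secrets

# Matches flags of the form FLAG{<lowercase hex>}.
FLAG_PATTERN = re.compile(r"FLAG\{[0-9a-f]+\}")


def new_flag(n_bytes: int = 16) -> str:
    """Generate a fresh random flag (n_bytes of randomness, hex-encoded)."""
    return "FLAG{" + secrets.token_hex(n_bytes) + "}"


def flag_captured(expected: str, *texts: str) -> bool:
    """True if the exact expected flag appears in any captured text."""
    return any(expected in text for text in texts)


def extract_flags(text: str) -> list:
    """All flag-shaped strings in `text`, in order of appearance."""
    return FLAG_PATTERN.findall(text)
```

Checking for the exact expected flag, rather than anything flag-shaped, prevents a hallucinated or stale flag in the agent's output from counting as a pass.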

Benchmark output

Results are saved to ./benchmark-results/benchmark-results.json with:

Per-challenge: pass/fail, time, iterations, tokens, findings count
By category: success rate per challenge type (web, crypto, misc, etc.)
By difficulty: success rate per difficulty level
Aggregates: overall success rate, total time, total tokens, averages

┌───────────────────┬────────┐
│ Metric            │  Value │
├───────────────────┼────────┤
│ Total challenges  │    104 │
│ Passed            │     42 │
│ Failed            │     58 │
│ Errored           │      4 │
│ Success rate      │  40.4% │
│ Total time        │ 5420s  │
│ Total tokens      │ 892000 │
└───────────────────┴────────┘
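The aggregation behind a summary like this can be sketched in a few lines. The per-challenge field names used here (category, status, tokens) are hypothetical, since the exact schema of benchmark-results.json is not shown.

```python
from collections import defaultdict


def summarize(results: list) -> dict:
    """Aggregate per-challenge records into overall and per-category rates."""
    by_category = defaultdict(lambda: {"passed": 0, "total": 0})
    for record in results:
        bucket = by_category[record["category"]]
        bucket["total"] += 1
        if record["status"] == "passed":
            bucket["passed"] += 1

    total = len(results)
    passed = sum(1 for record in results if record["status"] == "passed")
    return {
        "total": total,
        "passed": passed,
        # Percentages are rounded to one decimal place.
        "success_rate": round(100 * passed / total, 1) if total else 0.0,
        "total_tokens": sum(record.get("tokens", 0) for record in results),
        "by_category": {
            name: {**bucket, "rate": round(100 * bucket["passed"] / bucket["total"], 1)}
            for name, bucket in by_category.items()
        },
    }
```

Errored challenges count toward the total but not toward passes, which is consistent with the table above: 42 passed out of 104 gives 40.4%.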

Challenge structure

The runner expects this directory layout:

validation-benchmarks/
  web/
    sqli-login/
      docker-compose.yml    # Must accept FLAG env var
      challenge.json        # Optional: {difficulty, description, target_port}
    xss-reflected/
      docker-compose.yml
  crypto/
    weak-jwt/
      docker-compose.yml

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .

# Type check
mypy pentest_agent

Requirements

Python 3.11+
One of the following for LLM access:
- Claude Code CLI installed and authenticated (zero-config, uses your existing subscription)
- ANTHROPIC_API_KEY environment variable for direct Anthropic API
- OPENAI_API_KEY for OpenAI-compatible models
Security tools installed for skills you want to use (nmap, sqlmap, nuclei, etc.)
Docker (optional, for sandboxed execution)

Plug-and-play: If you have Claude Code installed, just pip install -e . and run. No API keys needed.

License

MIT

Disclaimer

This tool is designed for authorized security testing only. Always obtain proper written authorization before testing any systems. Unauthorized access to computer systems is illegal. The authors are not responsible for misuse of this tool.
