ROGUE — Red-team every way a high-stakes AI agent can fail

The Red-Team That Never Sleeps.

Powered end-to-end by 5 Bright Data products · built for the Bright Data real-time AI-agents hackathon (results pending)

ROGUE measures every place a high-stakes AI agent can go wrong — whether the model can be broken, whether the human oversight around it is meaningful, and whether the knowledge it accumulates is safe — each against an independent, continuously-refreshed standard, with a reproducible signed record. And it closes the loop: it doesn't just find the break, it generates and verifies the fix (you own the runtime — ROGUE never sits in your request path). The continuous open-web harvest behind the model surface runs on just $0.05–$0.30 of Bright Data a day.

🥇 The first continuous open-web red-team you can query over MCP.

ROGUE harvests new jailbreaks through Bright Data's MCP, reproduces each one against your config, and serves the results back through its own MCP server — so you can ask Claude / Cursor "which live attacks breach my config?" from your editor. A two-way MCP loop — harvest and distribution — that no other red-team tool closes.

See it live

Dashboard: https://rogue-eosin.vercel.app — live, deployed.
Trailer: watch the 45-second trailer on YouTube (preview below).
Dataset: 358 attack primitives across 15 families, MIT-licensed and access-gated (defensive-research-only terms — see RESPONSIBLE_RELEASE.md).
In Slack: point a Slack incoming webhook at ROGUE and the daily threat brief + every new HIGH/CRITICAL breach post straight to your workspace (the platform integration also files findings to Jira). ROGUE comes to where your team already triages.

ROGUE_1080p_av1.mp4

Why ROGUE

Other LLM red-teams run a fixed attack set you have to keep updating. ROGUE is the only one that does all of this together:

Harvests live, every day: new jailbreaks and prompt injections pulled from 19 open-web sources (via all 5 Bright Data products), so your report is never older than yesterday.
Reproduces against your exact config — your model and its system-prompt, not a generic safety benchmark (tool-call scoping is on the hosted roadmap).
Is queryable over MCP, both ways — it harvests through MCP and serves results through its own MCP server, so you can ask "what breaches a model like mine?" from inside Cursor or Claude. No other red-team closes that loop.
Measures three surfaces, signed — the model, the human approval gate, and the shared skill-pool — each scored against an independent answer key and emitted as a tamper-evident attestation.
Runs on the LLM you choose — the judge and extraction models are configurable (JUDGE_MODEL), any provider or a local model (Ollama via OPENAI_BASE_URL); not locked to one vendor.

Each ingredient exists somewhere; no competitor does the whole combination — that's what makes ROGUE a continuous, queryable, multi-surface red-team rather than a one-off scan.

Use it in 30 seconds

Query ROGUE from your IDE — hosted MCP, zero setup

The MCP server is mounted into the live API, so there is nothing to clone or run:

https://rogue-private.onrender.com/mcp/

The dashboard home has one-click Add to Cursor / Add to VS Code buttons; for Claude Desktop, add it as a custom connector. It exposes ~19 tools — read-only corpus/breach queries plus scan / report / benchmark actions. Full tool list + local install: MCP integration below.

Submit an endpoint, get a report — hosted API

POST /v1/scans with a target → ROGUE queues it for the same scan engine behind the dashboard and MCP, returning a scored report as JSON, HTML, or a CISO-ready PDF on completion. The hosted /v1 API is live and key-authorized today (private beta), but the background worker that drains the scan queue isn't deployed yet, so a queued scan does not complete on the host. For a graded report today, run it locally (below) or point the SDK at your own target — the identical engine, the identical report.

Run it locally — the full app (dashboard + API)

Self-host the whole thing — Postgres + API + the Next.js dashboard — with one command. It migrates and seeds a redacted snapshot of the real all-time breach matrix on startup, so every surface is fully populated on first boot — no scan, no keys. (The attack payloads + model responses are redacted to [redacted], exactly like the public site; the verdicts/rates are the real ones.)

git clone https://github.com/nguiaSoren/ROGUE && cd ROGUE
cp .env.example .env                                       # demo data needs no keys
docker compose -f docker-compose.full.yml up -d            # detached: ~30s to migrate, seed, and start

Open http://localhost:3000 — /feed, /matrix, /analytics, and /brief run against your own local instance, no account and no hosted site required. (Follow startup with docker compose -f docker-compose.full.yml logs -f.)

Fill it with your model's data. ROGUE scans a model endpoint (any OpenAI-compatible API URL — your gateway or a hosted provider), not local files. The stack runs detached, so stay in the same terminal: install the rogue CLI on the host and point it at your endpoint with --persist so each result is written into the same DB the dashboard reads:

pip install rogue-live-redteam                            # the CLI, on the host (or: pip install -e . from this clone)
export ANTHROPIC_API_KEY=sk-ant-...                       # the judge that grades each response (or repoint JUDGE_MODEL)
rogue scan --endpoint https://api.company.com/v1 --model my-model --persist --config-name "my-bot"
# (writes to $DATABASE_URL; its local default already matches the stack's Postgres, so no config needed)

Then open http://localhost:3000/matrix?config=my-bot — the breach matrix scoped to your deployment. (The judge LLM costs API spend per scan; point JUDGE_MODEL at a local model — Ollama via OPENAI_BASE_URL — to keep it ~$0.)

Want a dashboard that's only your data? Bring the stack up with SEED_DEMO=0 and the DB starts empty — then every surface (/feed, /matrix, /analytics, /brief) shows nothing but your own scans, no demo rows to filter past:

SEED_DEMO=0 docker compose -f docker-compose.full.yml up -d   # empty DB, detached
rogue scan --endpoint https://api.company.com/v1 --model my-model --persist --config-name my-bot
# → http://localhost:3000 — every surface is now 100% your data

Just the backend API, no dashboard (for development)

Skip the frontend — bring up a plain Postgres and run the API with hot-reload:

git clone https://github.com/nguiaSoren/ROGUE && cd ROGUE
cp .env.example .env          # add your keys
docker compose up -d && uv sync --extra dev
uv run alembic upgrade head && uv run python scripts/ops/seed_demo_data.py
uv run uvicorn rogue.api.main:app --reload

Scan your own model — the SDK

Install from PyPI — the rogue CLI + Python SDK, no clone needed (Python 3.11+):

pip install rogue-live-redteam

Scan any OpenAI-compatible target in three lines (plus a judge key — ROGUE grades every response; see docs/SDK.md):

from rogue import Client
client = Client(
    endpoint="https://api.company.com/v1", api_key="sk-...",   # or Client(provider="openai")
    system_prompt="<your production system prompt>",           # red-team your REAL deployment, not a bare model
)
report = client.scan(pack="aggressive", budget=10.0)
print(report.summary()); report.to_html("scan.html")

…or from the CLI: rogue scan --provider openai --pack aggressive --system-prompt-file ./system_prompt.txt (--system-prompt "…" for inline; both also work with --persist).

No API key handy? Clone the repo and run the offline demo (mocked target + judge → an HTML report): PYTHONPATH=src python3 examples/sdk_quickstart.py.

Integrations

ROGUE meets your team where it already works:

Surface	Status	What you get
Your IDE — MCP	✅ Available now · keyless	One config block in Claude Desktop / Cursor / Windsurf / VS Code; the editor's agent queries the live threat DB on the spot. Add an account to launch full scans without leaving your work. `https://rogue-private.onrender.com/mcp`
Your chat & tracker — Slack + Jira	✅ Slack alerts now · ⏳ auto-fan-out rolling out	Point a Slack incoming webhook (`SLACK_WEBHOOK_URL`) at ROGUE and the daily threat brief + new CRITICAL/HIGH breaches post to your workspace automatically — works today. Or connect Slack + Jira as per-org integrations (Fernet-encrypted creds) and file findings via the MCP action tools (`send_slack_alert` / `create_jira_ticket`); automatic fan-out on every scan completion is rolling out with the hosted worker. Setup
API & SDK — REST `/v1` + Python	✅ live · ⏳ hosted scans rolling out	The `/v1` REST API + OpenAPI spec are live and key-authorized at `https://rogue-private.onrender.com/v1`. The Python SDK runs real scans today against your own target (`pip install rogue-live-redteam`; `from rogue import Client` — see `docs/SDK.md`). Hosted scan execution (a `POST /v1/scans` that completes server-side) is rolling out.
Security tooling — SOAR / SIEM	🔜 Coming soon	Splunk / Palo Alto Cortex connectors to pipe findings into your existing security stack. On the roadmap, not available today.

What ROGUE does

Five-layer pipeline: Harvest → Extract → Dedupe → Reproduce → Diff.

Harvest: 19 open-web sources fetched via 5 Bright Data products.
Extract: an LLM agent structures each fetched document into an AttackPrimitive.
Dedupe: pgvector cosine similarity clusters near-duplicate attacks.
Reproduce: each canonical primitive runs against your DeploymentConfig, 5 trials each.
Diff: a separate judge model issues a verdict on each trial; the daily diff ships to Slack, MCP, and the dashboard.
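
To make the Dedupe layer concrete, here is a minimal pure-Python sketch of clustering near-duplicates by cosine similarity. It is an illustration only: ROGUE does this inside Postgres with pgvector, and the greedy assignment strategy, the function names, and the 0.9 threshold are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def cluster_near_duplicates(embeddings, threshold=0.9):
    """Greedy clustering: each vector joins the first cluster whose
    canonical member (index 0 of the cluster) is at least `threshold`
    similar; otherwise it starts a new cluster.
    The 0.9 threshold is an assumed value, not ROGUE's setting."""
    clusters = []
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            canonical = embeddings[cluster[0]]
            if cosine_similarity(canonical, vec) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy 2-D embeddings: the first two point almost the same way.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
clusters = cluster_near_duplicates(embeddings)
print(clusters)  # [[0, 1], [2]]
```

The first index in each cluster plays the role of the canonical primitive that the Reproduce layer would then run.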

New to the codebase? docs/PROJECT_STRUCTURE.md maps every directory to its pipeline layer and the architecture doc that explains it.

What ROGUE red-teams

ROGUE measures every place a high-stakes AI agent can go wrong — whether the agent can be broken, whether the human oversight around it is meaningful, and whether the knowledge it accumulates is safe — each against an independent, continuously-refreshed standard, and each backed by a result rather than a claim:

The model. Does a live jailbreak or prompt-injection break your deployment? The daily breach matrix replays open-web attacks against your model × system-prompt, graded by a human-calibrated judge. Finding: most claimed jailbreaks don't even reproduce — Claimed Potency Does Not Predict Reproduction.
The human gate. When a person "approves" an AI action, does that approval mean anything? ROGUE measures a reviewer's false-approve rate against an independent answer key — the rubber-stamping failure mode regulators now care about (oversight).
The agent's memory. Does a shared agent skill-pool leak one user's secrets to the next? ROGUE plants canaries in scrubbed skills and measures recovery — 85% leaked on a weak model despite an explicit never-reveal instruction (Scrubbing Is Not Containment).
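
As a rough illustration of the human-gate metric, the sketch below computes a false-approve rate: among the actions an independent answer key says should be rejected, the share the reviewer approved anyway. The data shapes, labels, and function name are hypothetical and are not taken from ROGUE's code.

```python
def false_approve_rate(reviewer_decisions, answer_key):
    """Fraction of should-reject actions that the reviewer approved.

    Both arguments map an action id to "approve" or "reject"
    (an assumed encoding for this sketch)."""
    should_reject = [a for a, verdict in answer_key.items() if verdict == "reject"]
    if not should_reject:
        return 0.0  # nothing to falsely approve
    wrongly_approved = sum(
        1 for a in should_reject if reviewer_decisions.get(a) == "approve"
    )
    return wrongly_approved / len(should_reject)


# Hypothetical answer key and reviewer log.
answer_key = {"a1": "reject", "a2": "approve", "a3": "reject", "a4": "reject"}
reviewer = {"a1": "approve", "a2": "approve", "a3": "reject", "a4": "approve"}

rate = false_approve_rate(reviewer, answer_key)
print(f"false-approve rate: {rate:.2f}")  # false-approve rate: 0.67
```

A reviewer who approves everything scores 1.0 on this metric, which is the rubber-stamping case the section describes.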

…and it closes the loop (assurance-native remediation). Finding a breach is half the job. ROGUE generates a verified mitigation — a system-prompt patch, a tool-permission scope, distilled fine-tuning data — and re-tests it against the same live corpus to prove it actually closed the breach without over-blocking (measured with the same calibrated judge). ROGUE generates and verifies the fix; you own the runtime — it never sits in your request path.

One engine, one independent standard — same operation each time (fire inputs at an AI decision-maker, capture what it does, score it against the standard, emit a reproducible signed record).

Research

ROGUE's findings are written up as papers and posts — PAPERS.md is the index, and each entry links to its preprint plus the code and data in this repo that reproduces it.

Allocation Is a Capability-Growth Mechanism — in a self-growing red-team, evaluation allocation is a capability lever, not an efficiency layer (8 of 20 starved candidates graduate vs 0 of 20; Fisher p = 0.003). · arXiv cs.CR×cs.LG — preprint posting soon
Consummation-Gated Breach Judges — one gate template ("engagement ≠ breach; consummation = breach") calibrates breach judges across classes, validated against human labels four ways. · arXiv cs.CR×cs.CL — preprint posting soon
Claimed Potency Does Not Predict Reproduction — most open-web jailbreaks don't survive as working carriers in deployment context, and a source's claimed rate carries no usable signal (Spearman −0.10). · arXiv cs.CR (lead paper) — preprint posting soon
Scrubbing Is Not Containment — canary leakage from shared agent skill pools tracks alignment, not model size. · workshop paper + Hugging Face blog — posting soon
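
The Fisher figure quoted for the allocation result can be checked by hand. The sketch below computes a two-sided Fisher exact test for the 2×2 table "8 of 20 graduate vs 0 of 20" using only the standard library. That the reported p-value is two-sided is an assumption; it is the variant that matches the quoted 0.003.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that row 1 holds k of the col1 "successes".
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against float ties.
    return sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_observed * (1 + 1e-9)
    )


# 8 of 20 graduate in one arm, 0 of 20 in the other.
p_value = fisher_exact_two_sided(8, 12, 0, 20)
print(round(p_value, 4))  # 0.0033
```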

Deep dives

The mechanics behind the pipeline, each on its own page:

Bright Data integration. Five BD products end-to-end, plus a self-tuning ε-greedy SERP bandit that allocates the daily harvest budget by yield (novel primitives per dollar) at $0.05–$0.30 per harvest. → docs/bright-data.md
Multimodal red-team. Refused text jailbreaks become real images and audio via deterministic black-box renderers, climbing an autonomous escalation ladder that stops at the first breach; Bright Data sources real carrier images to composite onto. → docs/multimodal.md
Self-growing attack repertoire. ROGUE harvests reusable techniques, not just payloads — classifying, routing, and graduating / retiring / resurrecting them on live breach evidence, with a governed renderer registry and grammar-driven planning (the planner-willingness finding: 22% → 100% by changing only the planner). → docs/self-growing-repertoire.md
Judge calibration. Every breach number is an LLM verdict, so the judge is validated against independent human labels four ways — in-distribution FP 2.56%, WildGuardTest harm 88.5%, StrongREJECT −26% inflation, JBB 91.0% human agreement (top of field, reproducible from data/calibration/), up from a 70.3% v1 judge after a diagnosed recalibration. → docs/judge-calibration.md
Benchmark — coverage over time. Frozen AdvBench / JBB goal sets run through ROGUE's own graduated ladder against a fixed target, to answer "is this month's ROGUE better than last month's?" (honest caveat: still N=1, pre-recalibration). → docs/benchmark.md
Dashboard tour. A 5-second pitch and a 5-minute deep-dive: cinematic home, /feed war room (attacks replayed as ATTACKER → MODEL → JUDGE), /matrix breach heatmap, /brief threat brief. → docs/dashboard.md
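
For readers unfamiliar with the ε-greedy bandit mentioned under Bright Data integration, here is a minimal sketch of the idea: with probability ε explore a random SERP query, otherwise exploit the query with the best yield so far (novel primitives per dollar). The class, its method names, and the arm labels are hypothetical; ROGUE's actual allocator is described in docs/bright-data.md.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy allocation over search queries ("arms"),
    rewarding each arm by novel primitives per dollar spent."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.spend = {arm: 0.0 for arm in arms}
        self.novel = {arm: 0 for arm in arms}

    def yield_per_dollar(self, arm):
        if self.spend[arm] == 0:
            return float("inf")  # untried arms are tried first
        return self.novel[arm] / self.spend[arm]

    def select(self):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.spend))
        return max(self.spend, key=self.yield_per_dollar)

    def update(self, arm, cost, novel_primitives):
        self.spend[arm] += cost
        self.novel[arm] += novel_primitives


# epsilon=0.0 makes the demo deterministic (pure exploitation).
bandit = EpsilonGreedyBandit(["query_a", "query_b"], epsilon=0.0, seed=0)
bandit.update("query_a", cost=0.01, novel_primitives=3)
bandit.update("query_b", cost=0.01, novel_primitives=1)
best = bandit.select()
print(best)  # query_a
```

In a real run ε would be above zero so that low-yield queries are still sampled occasionally and can recover if the open web shifts.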

Capabilities

15-family attack taxonomy (OWASP LLM Top 10 + MITRE ATLAS aligned) — see docs/taxonomy.md.
14-slot payload-template vocabulary for cross-deployment reproduction.
19-source open-web harvest list — see docs/sources.md. Not a fixed set: add your own with a ~30-line plugin → docs/adding-sources.md.
8-model target panel (GPT-5.4 Nano, Claude Haiku 4.5, Llama-3.1-8B, Mistral Small, Gemini 3.1 Flash-Lite, Claude Opus 4.8, + two audio targets) — cheap-tier models per lab, an open-weight reliability anchor, a frontier reference, and audio endpoints for multimodal coverage.
Judge-model verdict pipeline (REFUSED / EVADED / PARTIAL_BREACH / FULL_BREACH), human-validated four ways — see Judge calibration.
Daily threat brief (markdown + JSON) + Slack webhook.
ROGUE-as-MCP-server: query the attack DB from Claude Desktop / Cursor / Windsurf.
True multimodal red-team and a self-growing technique repertoire (see Deep dives).
External benchmark layer against frozen AdvBench / JailbreakBench goal sets.
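
The four verdict labels above lend themselves to a small enum. The sketch below tallies per-trial verdicts into a breach rate. Counting both PARTIAL_BREACH and FULL_BREACH as breaches is an assumption made for illustration, as are the class and function names; see docs/judge-calibration.md for how ROGUE actually scores.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    REFUSED = "REFUSED"
    EVADED = "EVADED"
    PARTIAL_BREACH = "PARTIAL_BREACH"
    FULL_BREACH = "FULL_BREACH"


# Assumption: both breach grades count toward the breach rate.
BREACH_VERDICTS = frozenset({Verdict.PARTIAL_BREACH, Verdict.FULL_BREACH})


def summarize(verdicts):
    """Return per-label counts and the overall breach rate."""
    counts = Counter(verdicts)
    breaches = sum(counts[v] for v in BREACH_VERDICTS)
    rate = breaches / len(verdicts) if verdicts else 0.0
    return {
        "counts": {v.name: counts[v] for v in Verdict},
        "breach_rate": rate,
    }


# Five trials of one primitive against one config.
trial_verdicts = [
    Verdict.REFUSED,
    Verdict.REFUSED,
    Verdict.EVADED,
    Verdict.PARTIAL_BREACH,
    Verdict.FULL_BREACH,
]
summary = summarize(trial_verdicts)
print(summary["breach_rate"])  # 0.4
```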

Roadmap

Expand source coverage — deeper Web Scraper API integration brings the next ~100 open-web sources online.
Tool-aware scans — supply your agent's tool schemas so a reproduction exercises the full model × system-prompt × tools surface (today's self-serve scan covers model × system-prompt; tool-call scoping lands with the hosted path).
Customer SDK — a drop-in SDK that lands ROGUE verdicts in the workflows teams already run (private beta; SOAR/SIEM connectors planned).
Break bandit — a second, contextual Thompson-sampling bandit that learns how to break (which escalation strategy to try first per attack-family × target); the control surface and reward log are already built and instrumented in prod.
Enterprise — RBAC, audit logs, and compliance reporting for teams that need them.

Run it yourself

Everything below is for builders — connecting ROGUE to your tools, running it locally, or driving the pipeline.

Architecture

See docs/architecture.md for the five-layer pipeline diagram and the locked stack table.

MCP integration

ROGUE exposes its threat-intelligence database as a producer-side MCP server — Claude Desktop / Cursor / Windsurf users query the live breach matrix from inside their IDE.

Hosted (recommended, zero setup). The server is mounted into the live API at https://rogue-private.onrender.com/mcp/. Use the Add to Cursor / Add to VS Code buttons on the dashboard home, or add it as a custom connector in Claude Desktop (Settings → Customize → add a custom connector → paste the URL). The hosted server exposes the read-only query tools and the action tools (validate / scan / report / benchmark + Level-3 workflow tools) — ~19 in all.

Local (against your own DB), one command:

uv run python scripts/ops/install_mcp.py                  # Claude Desktop (default)
uv run python scripts/ops/install_mcp.py --client cursor  # or: --client windsurf

This detects the client's config path, merges in the rogue server entry pointing at your checkout (preserving every other key), and backs up the old file first. It's idempotent; --dry-run previews, --uninstall removes. Then restart the client. Requires a populated DB (run harvest_once.py + reproduce_once.py at least once); the deployed build reads the live Neon DB.

Read-only query tools: query_attacks, query_diff, query_threat_brief, query_breaches_for_config, query_attack_detail, query_worst_attacks. After connecting, ask Claude "What new attacks broke our customer-support config in the last 24 hours?" and it will call query_diff + query_breaches_for_config and summarize.
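
Under the hood, an MCP client invokes a tool with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that message so its shape is visible; it sends nothing. The `config_name` argument is a hypothetical parameter name, since the real input schema for `query_breaches_for_config` is whatever the server advertises via `tools/list`, and a real client would first complete the MCP initialize handshake.

```python
import json


def build_tool_call(tool_name, arguments, request_id=1):
    """Build an MCP `tools/call` JSON-RPC 2.0 request as a dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# `config_name` is an assumed argument name, for illustration only.
request = build_tool_call(
    "query_breaches_for_config",
    {"config_name": "customer-support"},
)
print(json.dumps(request, indent=2))
```

In practice the editor's agent builds and sends this for you; the sketch is useful mainly when debugging a custom client.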

Transport. Stdio by default (the Claude Desktop path). For remote clients, serve over HTTP:

ROGUE_MCP_TRANSPORT=streamable-http uv run python -m rogue.mcp_server.server
# serves http://127.0.0.1:8001/mcp  (ROGUE_MCP_HOST / ROGUE_MCP_PORT override the bind)

Pipeline CLI reference

The two $-billed driver scripts spend Bright Data + LLM credit and write the live DB — run them deliberately. All flags are optional.

harvest_once.py — harvest → extract → dedup → persist

uv run python scripts/harvest/harvest_once.py --since 1d

Flag	Default	What it does
`--since`	`1d`	Harvest window (`1d`, `14d`, `6h`).
`--x-handles`	off	Comma-separated X handles to scrape this run (X is off by default — BD's profile scraper is slow).
`--database-url`	`$DATABASE_URL`	Target SQLAlchemy URL.
`--extraction-model`	Claude Haiku 4.5	Provider-prefixed extraction model (prompt-cached).
`--embedding-model`	`text-embedding-3-small`	Embedding model for dedup.

Env toggles: EXTRACTION_CONCURRENCY · HARVEST_INGEST_IMAGES=0 · HARVEST_FOLLOW_LINKS=0. For a single known-fresh URL, use scripts/harvest/harvest_url.py --url "https://x.com/.../status/<id>".

reproduce_once.py — render → target panel → judge → persist

uv run python scripts/reproduce/reproduce_once.py --primitive-limit 50 --judge-batch

Flag	Default	What it does
`--primitive-limit N`	all	Cap how many primitives are reproduced (top-N by `reproducibility_score`).
`--only-unreproduced`	off	Reproduce only primitives with no `breach_results` yet.
`--primitive-ids A,B,…`	—	Reproduce exactly the named primitives (overrides other filters).
`--n-trials N`	5	Trials per (primitive × config) — powers the bootstrap CI.
`--multimodal-only`	off	Only image/audio primitives, rendered as real media.
`--persona NAME`	off	PAP persona wrap (the B side of the A/B).
`--escalate`	off	Inline auto-ladder for panel-wide refusals (costly; bound with `--escalate-max-spend`).
`--candidate-quota N`	0	Reserve N guaranteed harvested-candidate attempts before early-stop (scheduler policy).
`--judge-batch`	off	Grade via the Anthropic Batch API (50% off + caching; baseline-only).
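
The `--n-trials` row mentions a bootstrap confidence interval. As a sketch of what that can look like, the code below computes a percentile bootstrap CI over per-trial breach outcomes (1 = breach, 0 = no breach). The percentile method, the resample count, and the function name are assumptions for this example, not a description of ROGUE's implementation.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 trial outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the trials with replacement and record each breach rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Five trials for one (primitive x config) cell: two breaches.
trials = [1, 0, 0, 1, 0]
point_estimate = sum(trials) / len(trials)
low, high = bootstrap_ci(trials)
print(point_estimate, (low, high))
```

With only five trials the interval is necessarily wide, which is why raising `--n-trials` tightens the estimate at the cost of more target and judge calls.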

scripts/reproduce/candidate_quota_ab.py runs the candidate-quota A/B (the empirical baseline for the break-bandit).

Add your own source

ROGUE's sources are plugins, not a hard-coded list. To harvest from a forum, blog, repo, or feed it doesn't cover yet, write one SourcePlugin subclass — declare a name, a source_type, the required_capabilities it needs to fetch (e.g. UNLOCK for a page, SERP for a search), and an async fetch_since(fetcher, since) that returns RawDocuments. Your plugin owns what the content means; the injected fetcher owns how the bytes arrive. Register it in default_plugins() and the next harvest run extracts, dedupes, and reproduces from it like any built-in. Full walkthrough + a copy-paste example: docs/adding-sources.md.
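
The plugin contract described above can be sketched as follows. Everything here is a self-contained stand-in: the `Capability`, `RawDocument`, and `SourcePlugin` classes are simplified local definitions, and `fetcher.unlock(...)`, `ExampleForumPlugin`, and the example URL are hypothetical. The real base class, field names, and fetcher interface live in the repo and docs/adding-sources.md.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Capability(Enum):  # stand-in for ROGUE's capability names
    UNLOCK = "unlock"
    SERP = "serp"


@dataclass
class RawDocument:  # stand-in; the real fields may differ
    url: str
    text: str
    fetched_at: datetime


class SourcePlugin:  # stand-in for the real base class
    name: str
    source_type: str
    required_capabilities: tuple

    async def fetch_since(self, fetcher, since):
        raise NotImplementedError


class ExampleForumPlugin(SourcePlugin):
    name = "example_forum"
    source_type = "forum"
    required_capabilities = (Capability.UNLOCK,)
    INDEX_URL = "https://forum.example.com/latest"  # hypothetical

    async def fetch_since(self, fetcher, since):
        # The plugin decides what to fetch; the injected fetcher
        # decides how the bytes arrive. Filtering posts by `since`
        # is omitted here to keep the sketch short.
        html = await fetcher.unlock(self.INDEX_URL)  # hypothetical method
        return [
            RawDocument(
                url=self.INDEX_URL,
                text=html,
                fetched_at=datetime.now(timezone.utc),
            )
        ]


class FakeFetcher:
    """Offline stub so the sketch runs without network access."""

    async def unlock(self, url):
        return f"<html>stub page for {url}</html>"


since = datetime.now(timezone.utc) - timedelta(days=1)
docs = asyncio.run(ExampleForumPlugin().fetch_since(FakeFetcher(), since))
print(len(docs), docs[0].url)
```

Injecting the fetcher is what lets the same plugin run against a stub in tests and against a live fetcher in a harvest.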

Repository layout

src/rogue/     # Python package (schemas, harvest, extract, dedupe, reproduce, diff, mcp_server, db, api)
docs/          # architecture, schemas, taxonomy, sources, budget + the deep-dive pages
tests/         # schema round-trip tests + golden fixtures
scripts/       # harvest/ (harvest_once.py), reproduce/ (reproduce_once.py), calibration/, ops/
frontend/      # Next.js dashboard

Built by

Benaja Soren Obounou Lekogo Nguia — AI Systems Engineer; previously Grand-Prize winner at Yonsei University for LLM security tooling (GPTFuzz optimization), adversarial-ML research at AIM Intelligence (HWARANG red-team series).

"I built ROGUE solo in 6 days because Bright Data abstracted away 5 different anti-bot stacks I'd otherwise have spent weeks on. The MCP Server plus pre-built Reddit / X scrapers turned a 6-week project into a 6-day project."

— Benaja Soren Obounou Lekogo Nguia

License

MIT. See LICENSE.
