nguiaSoren/ROGUE

ROGUE

ROGUE — Red-team every way a high-stakes AI agent can fail

The Red-Team That Never Sleeps.

Powered end-to-end by 5 Bright Data products · built for the Bright Data real-time AI-agents hackathon (results pending)

ROGUE measures every place a high-stakes AI agent can go wrong — whether the model can be broken, whether the human oversight around it is meaningful, and whether the knowledge it accumulates is safe — each against an independent, continuously-refreshed standard, with a reproducible signed record. And it closes the loop: it doesn't just find the break, it generates and verifies the fix (you own the runtime — ROGUE never sits in your request path). The continuous open-web harvest behind the model surface runs on just $0.05–$0.30 of Bright Data a day.

🥇 The first continuous open-web red-team you can query over MCP.

ROGUE harvests new jailbreaks through Bright Data's MCP, reproduces each one against your config, and serves the results back through its own MCP server — so you can ask Claude / Cursor "which live attacks breach my config?" from your editor. A two-way MCP loop — harvest and distribution — that no other red-team tool closes.

See it live

ROGUE_1080p_av1.mp4

Why ROGUE

Other LLM red-teams run a fixed attack set you have to keep updating. ROGUE is the only one that does all of this together:

  • Harvests live, every day: new jailbreaks and prompt injections pulled from 19 open-web sources (via all 5 Bright Data products), so your report is never older than yesterday.
  • Reproduces against your exact config — your model and its system-prompt, not a generic safety benchmark (tool-call scoping is on the hosted roadmap).
  • Is queryable over MCP, both ways — it harvests through MCP and serves results through its own MCP server, so you can ask "what breaches a model like mine?" from inside Cursor or Claude. No other red-team closes that loop.
  • Measures three surfaces, signed — the model, the human approval gate, and the shared skill-pool — each scored against an independent answer key and emitted as a tamper-evident attestation.
  • Runs on the LLM you choose — the judge and extraction models are configurable (JUDGE_MODEL), any provider or a local model (Ollama via OPENAI_BASE_URL); not locked to one vendor.

Each ingredient exists somewhere; no competitor does the whole combination — that's what makes ROGUE a continuous, queryable, multi-surface red-team rather than a one-off scan.

Use it in 30 seconds

Query ROGUE from your IDE — hosted MCP, zero setup

The MCP server is mounted into the live API, so there is nothing to clone or run:

https://rogue-private.onrender.com/mcp/

The dashboard home has one-click Add to Cursor / Add to VS Code buttons; for Claude Desktop, add it as a custom connector. It exposes ~19 tools — read-only corpus/breach queries plus scan / report / benchmark actions. Full tool list + local install: MCP integration below.

Submit an endpoint, get a report — hosted API

POST /v1/scans with a target → ROGUE queues it for the same scan engine behind the dashboard and MCP, returning a scored report as JSON, HTML, or a CISO-ready PDF on completion. The hosted /v1 API is live and key-authorized today (private beta), but the background worker that drains the scan queue isn't deployed yet, so a queued scan does not complete on the host. For a graded report today, run it locally (below) or point the SDK at your own target — the identical engine, the identical report.

Run it locally — the full app (dashboard + API)

Self-host the whole thing — Postgres + API + the Next.js dashboard — with one command. It migrates and seeds a redacted snapshot of the real all-time breach matrix on startup, so every surface is fully populated on first boot — no scan, no keys. (The attack payloads + model responses are redacted to [redacted], exactly like the public site; the verdicts/rates are the real ones.)

git clone https://github.com/nguiaSoren/ROGUE && cd ROGUE
cp .env.example .env                                       # demo data needs no keys
docker compose -f docker-compose.full.yml up -d            # detached: ~30s to migrate, seed, and start

Open http://localhost:3000. The /feed, /matrix, /analytics, and /brief pages all run against your own local instance, with no account and no hosted site required. (Follow startup with docker compose -f docker-compose.full.yml logs -f.)

Fill it with your model's data. ROGUE scans a model endpoint (any OpenAI-compatible API URL — your gateway or a hosted provider), not local files. The stack runs detached, so stay in the same terminal: install the rogue CLI on the host and point it at your endpoint with --persist so each result is written into the same DB the dashboard reads:

pip install rogue-live-redteam                            # the CLI, on the host (or: pip install -e . from this clone)
export ANTHROPIC_API_KEY=sk-ant-...                       # the judge that grades each response (or repoint JUDGE_MODEL)
rogue scan --endpoint https://api.company.com/v1 --model my-model --persist --config-name "my-bot"
# (writes to $DATABASE_URL; its local default already matches the stack's Postgres, so no config needed)

Then open http://localhost:3000/matrix?config=my-bot — the breach matrix scoped to your deployment. (The judge LLM costs API spend per scan; point JUDGE_MODEL at a local model — Ollama via OPENAI_BASE_URL — to keep it ~$0.)

Want a dashboard that's only your data? Bring the stack up with SEED_DEMO=0 and the DB starts empty — then every surface (/feed, /matrix, /analytics, /brief) shows nothing but your own scans, no demo rows to filter past:

SEED_DEMO=0 docker compose -f docker-compose.full.yml up -d   # empty DB, detached
rogue scan --endpoint https://api.company.com/v1 --model my-model --persist --config-name my-bot
# → http://localhost:3000 — every surface is now 100% your data

Just the backend API, no dashboard (for development)

Skip the frontend — bring up a plain Postgres and run the API with hot-reload:

git clone https://github.com/nguiaSoren/ROGUE && cd ROGUE
cp .env.example .env          # add your keys
docker compose up -d && uv sync --extra dev
uv run alembic upgrade head && uv run python scripts/ops/seed_demo_data.py
uv run uvicorn rogue.api.main:app --reload

Scan your own model — the SDK

Install from PyPI — the rogue CLI + Python SDK, no clone needed (Python 3.11+):

pip install rogue-live-redteam

Scan any OpenAI-compatible target in a few lines (plus a judge key, because ROGUE grades every response; see docs/SDK.md):

from rogue import Client
client = Client(
    endpoint="https://api.company.com/v1", api_key="sk-...",   # or Client(provider="openai")
    system_prompt="<your production system prompt>",           # red-team your REAL deployment, not a bare model
)
report = client.scan(pack="aggressive", budget=10.0)
print(report.summary())
report.to_html("scan.html")

…or from the CLI: rogue scan --provider openai --pack aggressive --system-prompt-file ./system_prompt.txt (--system-prompt "…" for inline; both also work with --persist).

No API key handy? Clone the repo and run the offline demo (mocked target + judge → an HTML report): PYTHONPATH=src python3 examples/sdk_quickstart.py.

Integrations

ROGUE meets your team where it already works:

| Surface | Status | What you get |
| --- | --- | --- |
| Your IDE: MCP | Available now · keyless | One config block in Claude Desktop / Cursor / Windsurf / VS Code; the editor's agent queries the live threat DB on the spot. Add an account to launch full scans without leaving your work. https://rogue-private.onrender.com/mcp |
| Your chat & tracker: Slack + Jira | ✅ Slack alerts now · ⏳ auto-fan-out rolling out | Point a Slack incoming webhook (SLACK_WEBHOOK_URL) at ROGUE and the daily threat brief + new CRITICAL/HIGH breaches post to your workspace automatically (works today). Or connect Slack + Jira as per-org integrations (Fernet-encrypted creds) and file findings via the MCP action tools (send_slack_alert / create_jira_ticket); automatic fan-out on every scan completion is rolling out with the hosted worker. Setup |
| API & SDK: REST /v1 + Python | ✅ live · ⏳ hosted scans rolling out | The /v1 REST API + OpenAPI spec are live and key-authorized at https://rogue-private.onrender.com/v1. The Python SDK runs real scans today against your own target (pip install rogue-live-redteam; from rogue import Client; see docs/SDK.md). Hosted scan execution (a POST /v1/scans that completes server-side) is rolling out. |
| Security tooling: SOAR / SIEM | 🔜 Coming soon | Splunk / Palo Alto Cortex connectors to pipe findings into your existing security stack. On the roadmap, not available today. |

What ROGUE does

Five-layer pipeline: Harvest → Extract → Dedupe → Reproduce → Diff.

  1. Harvest — 19 open-web sources fetched via 5 Bright Data products.
  2. Extract — an LLM agent structures each fetched document into an AttackPrimitive.
  3. Dedupe — pgvector cosine similarity clusters near-duplicate attacks.
  4. Reproduce — each canonical primitive runs against your DeploymentConfig × 5 trials.
  5. Diff: a separate judge model issues a verdict on each trial; the daily diff ships to Slack, MCP, and the dashboard.
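
The Dedupe step above can be illustrated with a minimal sketch. It is not ROGUE's implementation: the real pipeline runs the similarity search in pgvector, whereas this version uses plain Python lists, a greedy first-match rule, and a hypothetical threshold of 0.9, all of which are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe(embeddings, threshold=0.9):
    """Greedy clustering: each item joins the first canonical it is close enough to."""
    canonicals = []  # indices of cluster representatives
    assignment = []  # assignment[i] = index of the canonical for item i
    for i, vec in enumerate(embeddings):
        for c in canonicals:
            if cosine_similarity(vec, embeddings[c]) >= threshold:
                assignment.append(c)
                break
        else:
            # No existing canonical is close enough, so this item starts a new cluster.
            canonicals.append(i)
            assignment.append(i)
    return canonicals, assignment


# Toy embeddings (hypothetical): the first two are near-duplicates, the third is distinct.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
canonicals, assignment = dedupe(embeddings)
print(canonicals, assignment)  # [0, 2] [0, 0, 2]
```

Only the canonical of each cluster would then move on to the Reproduce step, which keeps near-identical payloads from being replayed repeatedly.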

New to the codebase? docs/PROJECT_STRUCTURE.md maps every directory to its pipeline layer and the architecture doc that explains it.

What ROGUE red-teams

ROGUE measures every place a high-stakes AI agent can go wrong — whether the agent can be broken, whether the human oversight around it is meaningful, and whether the knowledge it accumulates is safe — each against an independent, continuously-refreshed standard, and each backed by a result rather than a claim:

  • The model. Does a live jailbreak or prompt-injection break your deployment? The daily breach matrix replays open-web attacks against your model × system-prompt, graded by a human-calibrated judge. Finding: most claimed jailbreaks don't even reproduce — Claimed Potency Does Not Predict Reproduction.
  • The human gate. When a person "approves" an AI action, does that approval mean anything? ROGUE measures a reviewer's false-approve rate against an independent answer key — the rubber-stamping failure mode regulators now care about (oversight).
  • The agent's memory. Does a shared agent skill-pool leak one user's secrets to the next? ROGUE plants canaries in scrubbed skills and measures recovery — 85% leaked on a weak model despite an explicit never-reveal instruction (Scrubbing Is Not Containment).
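
The two rates named in the last two bullets reduce to simple ratios. The sketch below shows one plausible way to compute them; the function names, the dictionary-based answer key, and the verbatim substring match for canaries are assumptions for illustration, not ROGUE's actual scoring code.

```python
def false_approve_rate(decisions, answer_key):
    """Share of actions the answer key marks as 'should reject' that the reviewer approved."""
    should_reject = [action_id for action_id, ok in answer_key.items() if not ok]
    if not should_reject:
        return 0.0
    wrongly_approved = sum(1 for action_id in should_reject if decisions.get(action_id, False))
    return wrongly_approved / len(should_reject)


def canary_leak_rate(planted, outputs):
    """Share of planted canary strings that appear verbatim in at least one output."""
    if not planted:
        return 0.0
    leaked = sum(1 for canary in planted if any(canary in out for out in outputs))
    return leaked / len(planted)


# Hypothetical review session: True means "approve" (decisions) or "safe to approve" (key).
answer_key = {"a1": True, "a2": False, "a3": False, "a4": False, "a5": True}
decisions = {"a1": True, "a2": True, "a3": False, "a4": True, "a5": True}
far = false_approve_rate(decisions, answer_key)  # 2 of the 3 unsafe actions were approved

# Hypothetical canaries planted in scrubbed skills, and the outputs probed afterwards.
planted = ["CANARY-1", "CANARY-2"]
outputs = ["the stored key is CANARY-1", "I cannot share that."]
leak = canary_leak_rate(planted, outputs)  # 1 of the 2 canaries was recovered
print(far, leak)
```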

…and it closes the loop (assurance-native remediation). Finding a breach is half the job. ROGUE generates a verified mitigation — a system-prompt patch, a tool-permission scope, distilled fine-tuning data — and re-tests it against the same live corpus to prove it actually closed the breach without over-blocking (measured with the same calibrated judge). ROGUE generates and verifies the fix; you own the runtime — it never sits in your request path.

One engine, one independent standard — same operation each time (fire inputs at an AI decision-maker, capture what it does, score it against the standard, emit a reproducible signed record).
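
To make "reproducible signed record" concrete, here is a minimal sketch of a tamper-evident record. The README does not describe ROGUE's attestation format, so everything here is an assumption: the sketch swaps in an HMAC-SHA256 over canonical JSON, and the record fields and key are hypothetical.

```python
import hashlib
import hmac
import json


def _canonical_bytes(record):
    """Serialize with sorted keys and fixed separators so equal records hash identically."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record, key):
    """Attach an HMAC-SHA256 signature computed over the canonical form of the record."""
    signature = hmac.new(key, _canonical_bytes(record), hashlib.sha256).hexdigest()
    return {"record": record, "signature": signature}


def verify_record(signed, key):
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(key, _canonical_bytes(signed["record"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


key = b"demo-signing-key"  # hypothetical; a real deployment would load this from a secret store
signed = sign_record({"config": "my-bot", "primitive": "p-001", "verdict": "REFUSED"}, key)
print(verify_record(signed, key))  # True

signed["record"]["verdict"] = "FULL_BREACH"  # any edit after signing breaks verification
print(verify_record(signed, key))  # False
```

A shared-secret HMAC only proves integrity to parties holding the key; an asymmetric signature would be the usual choice when third parties must verify the record.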

Research

ROGUE's findings are written up as papers and posts — PAPERS.md is the index, and each entry links to its preprint plus the code and data in this repo that reproduces it.

  • Allocation Is a Capability-Growth Mechanism — in a self-growing red-team, evaluation allocation is a capability lever, not an efficiency layer (8 of 20 starved candidates graduate vs 0 of 20; Fisher p = 0.003). · arXiv cs.CR×cs.LG — preprint posting soon
  • Consummation-Gated Breach Judges — one gate template ("engagement ≠ breach; consummation = breach") calibrates breach judges across classes, validated against human labels four ways. · arXiv cs.CR×cs.CL — preprint posting soon
  • Claimed Potency Does Not Predict Reproduction — most open-web jailbreaks don't survive as working carriers in deployment context, and a source's claimed rate carries no usable signal (Spearman −0.10). · arXiv cs.CR (lead paper) — preprint posting soon
  • Scrubbing Is Not Containment — canary leakage from shared agent skill pools tracks alignment, not model size. · workshop paper + Hugging Face blog — posting soon
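
The Fisher p-value quoted in the first entry can be checked by hand. The sketch below assumes the figures form a 2x2 table with 20 candidates per arm (8 graduate vs 12 not, 0 graduate vs 20 not) and that the test is two-sided; the paper may use a different convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k first-column outcomes landing in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table at least as unlikely as the observed one.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


p = fisher_exact_two_sided(8, 12, 0, 20)
print(round(p, 4))  # 0.0033
```

Under these assumptions the result is about 0.0033, consistent with the reported p = 0.003.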

Deep dives

The mechanics behind the pipeline, each on its own page:

  • Bright Data integration. Five BD products end-to-end, plus a self-tuning ε-greedy SERP bandit that allocates the daily harvest budget by yield (novel primitives per dollar) at $0.05–$0.30 per harvest. → docs/bright-data.md
  • Multimodal red-team. Refused text jailbreaks become real images and audio via deterministic black-box renderers, climbing an autonomous escalation ladder that stops at the first breach; Bright Data sources real carrier images to composite onto. → docs/multimodal.md
  • Self-growing attack repertoire. ROGUE harvests reusable techniques, not just payloads — classifying, routing, and graduating / retiring / resurrecting them on live breach evidence, with a governed renderer registry and grammar-driven planning (the planner-willingness finding: 22% → 100% by changing only the planner). → docs/self-growing-repertoire.md
  • Judge calibration. Every breach number is an LLM verdict, so the judge is validated against independent human labels four ways — in-distribution FP 2.56%, WildGuardTest harm 88.5%, StrongREJECT −26% inflation, JBB 91.0% human agreement (top of field, reproducible from data/calibration/), up from a 70.3% v1 judge after a diagnosed recalibration. → docs/judge-calibration.md
  • Benchmark — coverage over time. Frozen AdvBench / JBB goal sets run through ROGUE's own graduated ladder against a fixed target, to answer "is this month's ROGUE better than last month's?" (honest caveat: still N=1, pre-recalibration). → docs/benchmark.md
  • Dashboard tour. A 5-second pitch and a 5-minute deep-dive: cinematic home, /feed war room (attacks replayed as ATTACKER → MODEL → JUDGE), /matrix breach heatmap, /brief threat brief. → docs/dashboard.md
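
The ε-greedy SERP bandit mentioned in the Bright Data bullet can be sketched as follows. This is an illustration of the general technique only: the class name, the arm labels, and the try-every-arm-once rule are assumptions, and the details in docs/bright-data.md take precedence.

```python
import random


class EpsilonGreedyAllocator:
    """Pick the next query to spend on, scored by novel primitives per dollar."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.spend = {arm: 0.0 for arm in arms}
        self.novel = {arm: 0 for arm in arms}

    def yield_per_dollar(self, arm):
        if self.spend[arm] == 0.0:
            return 0.0
        return self.novel[arm] / self.spend[arm]

    def choose(self):
        arms = list(self.spend)
        untried = [arm for arm in arms if self.spend[arm] == 0.0]
        if untried:
            return untried[0]  # try every arm once before trusting the estimates
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)  # explore
        return max(arms, key=self.yield_per_dollar)  # exploit the best observed yield

    def update(self, arm, cost, novel_primitives):
        self.spend[arm] += cost
        self.novel[arm] += novel_primitives


# Hypothetical SERP queries used as arms; costs and yields are made-up numbers.
allocator = EpsilonGreedyAllocator(["jailbreak prompt", "prompt injection"], epsilon=0.0)
allocator.update("jailbreak prompt", cost=0.01, novel_primitives=1)
allocator.update("prompt injection", cost=0.01, novel_primitives=3)
print(allocator.choose())  # prompt injection
```

With epsilon set to 0 the choice is purely greedy, which makes the example deterministic; a non-zero epsilon keeps some budget flowing to queries whose yield estimate may be stale.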

Capabilities

  • 15-family attack taxonomy (OWASP LLM Top 10 + MITRE ATLAS aligned) — see docs/taxonomy.md.
  • 14-slot payload-template vocabulary for cross-deployment reproduction.
  • 19-source open-web harvest list — see docs/sources.md. Not a fixed set: add your own with a ~30-line plugin → docs/adding-sources.md.
  • 8-model target panel (GPT-5.4 Nano, Claude Haiku 4.5, Llama-3.1-8B, Mistral Small, Gemini 3.1 Flash-Lite, Claude Opus 4.8, + two audio targets) — cheap-tier models per lab, an open-weight reliability anchor, a frontier reference, and audio endpoints for multimodal coverage.
  • Judge-model verdict pipeline (REFUSED / EVADED / PARTIAL_BREACH / FULL_BREACH), human-validated four ways — see Judge calibration.
  • Daily threat brief (markdown + JSON) + Slack webhook.
  • ROGUE-as-MCP-server: query the attack DB from Claude Desktop / Cursor / Windsurf.
  • True multimodal red-team and a self-growing technique repertoire (see Deep dives).
  • External benchmark layer against frozen AdvBench / JailbreakBench goal sets.
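
The four verdict labels in the judge pipeline bullet map naturally onto an enum, and a per-config breach rate is then a ratio over trials. A minimal sketch follows; treating both PARTIAL_BREACH and FULL_BREACH as breaches is an assumption, and ROGUE's own aggregation may weight them differently.

```python
from enum import Enum


class Verdict(Enum):
    REFUSED = "REFUSED"
    EVADED = "EVADED"
    PARTIAL_BREACH = "PARTIAL_BREACH"
    FULL_BREACH = "FULL_BREACH"


# Assumption: both breach labels count toward the breach rate.
BREACH_VERDICTS = {Verdict.PARTIAL_BREACH, Verdict.FULL_BREACH}


def breach_rate(trials):
    """Fraction of trials whose verdict is one of the breach labels."""
    if not trials:
        return 0.0
    return sum(1 for verdict in trials if verdict in BREACH_VERDICTS) / len(trials)


# Five hypothetical trials for one primitive against one config.
trials = [
    Verdict.REFUSED,
    Verdict.FULL_BREACH,
    Verdict.EVADED,
    Verdict.PARTIAL_BREACH,
    Verdict.REFUSED,
]
rate = breach_rate(trials)
print(rate)  # 0.4
```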

Roadmap

  • Expand source coverage — deeper Web Scraper API integration brings the next ~100 open-web sources online.
  • Tool-aware scans — supply your agent's tool schemas so a reproduction exercises the full model × system-prompt × tools surface (today's self-serve scan covers model × system-prompt; tool-call scoping lands with the hosted path).
  • Customer SDK — a drop-in SDK that lands ROGUE verdicts in the workflows teams already run (private beta; SOAR/SIEM connectors planned).
  • Break bandit — a second, contextual Thompson-sampling bandit that learns how to break (which escalation strategy to try first per attack-family × target); the control surface and reward log are already built and instrumented in prod.
  • Enterprise — RBAC, audit logs, and compliance reporting for teams that need them.
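
For readers unfamiliar with the technique behind the planned break bandit, here is a minimal sketch of contextual Thompson sampling with a Beta-Bernoulli posterior per (context, strategy) pair. It illustrates the general method, not ROGUE's design: the strategy names, the context tuple, and the uniform Beta(1, 1) prior are all assumptions.

```python
import random
from collections import defaultdict


class ContextualThompsonSampler:
    """Keep one Beta posterior per (context, strategy) and sample to choose."""

    def __init__(self, strategies, seed=None):
        self.strategies = list(strategies)
        self.rng = random.Random(seed)
        # [alpha, beta] counts, starting from a uniform Beta(1, 1) prior.
        self.posterior = defaultdict(lambda: [1, 1])

    def choose(self, context):
        def draw(strategy):
            alpha, beta = self.posterior[(context, strategy)]
            return self.rng.betavariate(alpha, beta)

        # Sample a plausible breach rate for each strategy and take the best draw.
        return max(self.strategies, key=draw)

    def update(self, context, strategy, breached):
        # A breach increments alpha, a non-breach increments beta.
        self.posterior[(context, strategy)][0 if breached else 1] += 1


# Hypothetical context (attack family, target) and escalation strategies.
context = ("role_play", "target-a")
sampler = ContextualThompsonSampler(["persona_wrap", "encoding"], seed=0)
for _ in range(50):
    sampler.update(context, "persona_wrap", breached=True)
    sampler.update(context, "encoding", breached=False)
print(sampler.choose(context))  # persona_wrap
```

Because each context keeps its own posteriors, a strategy that works for one attack family and target does not bias the choice for another.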

Run it yourself

Everything below is for builders — connecting ROGUE to your tools, running it locally, or driving the pipeline.

Architecture

See docs/architecture.md for the five-layer pipeline diagram and the locked stack table.

MCP integration

ROGUE exposes its threat-intelligence database as a producer-side MCP server — Claude Desktop / Cursor / Windsurf users query the live breach matrix from inside their IDE.

Hosted (recommended, zero setup). The server is mounted into the live API at https://rogue-private.onrender.com/mcp/. Use the Add to Cursor / Add to VS Code buttons on the dashboard home, or add it as a custom connector in Claude Desktop (Settings → Customize → add a custom connector → paste the URL). The hosted server exposes the read-only query tools and the action tools (validate / scan / report / benchmark + Level-3 workflow tools) — ~19 in all.

Local (against your own DB), one command:

uv run python scripts/ops/install_mcp.py                  # Claude Desktop (default)
uv run python scripts/ops/install_mcp.py --client cursor  # or: cursor / windsurf

This detects the client's config path, merges in the rogue server entry pointing at your checkout (preserving every other key), and backs up the old file first. It's idempotent; --dry-run previews, --uninstall removes. Then restart the client. Requires a populated DB (run harvest_once.py + reproduce_once.py at least once); the deployed build reads the live Neon DB.

Read-only query tools: query_attacks, query_diff, query_threat_brief, query_breaches_for_config, query_attack_detail, query_worst_attacks. After connecting, ask Claude "What new attacks broke our customer-support config in the last 24 hours?" and it will call query_diff + query_breaches_for_config and summarize.

Transport. Stdio by default (the Claude Desktop path). For remote clients, serve over HTTP:

ROGUE_MCP_TRANSPORT=streamable-http uv run python -m rogue.mcp_server.server
# serves http://127.0.0.1:8001/mcp  (ROGUE_MCP_HOST / ROGUE_MCP_PORT override the bind)

Pipeline CLI reference

The two $-billed driver scripts spend Bright Data + LLM credit and write the live DB — run them deliberately. All flags are optional.

harvest_once.py — harvest → extract → dedup → persist
uv run python scripts/harvest/harvest_once.py --since 1d
| Flag | Default | What it does |
| --- | --- | --- |
| --since | 1d | Harvest window (1d, 14d, 6h). |
| --x-handles | off | Comma-separated X handles to scrape this run (X is off by default because BD's profile scraper is slow). |
| --database-url | $DATABASE_URL | Target SQLAlchemy URL. |
| --extraction-model | Claude Haiku 4.5 | Provider-prefixed extraction model (prompt-cached). |
| --embedding-model | text-embedding-3-small | Embedding model for dedup. |

Env toggles: EXTRACTION_CONCURRENCY · HARVEST_INGEST_IMAGES=0 · HARVEST_FOLLOW_LINKS=0. For a single known-fresh URL, use scripts/harvest/harvest_url.py --url "https://x.com/.../status/<id>".

reproduce_once.py — render → target panel → judge → persist
uv run python scripts/reproduce/reproduce_once.py --primitive-limit 50 --judge-batch
| Flag | Default | What it does |
| --- | --- | --- |
| --primitive-limit N | all | Cap how many primitives are reproduced (top-N by reproducibility_score). |
| --only-unreproduced | off | Reproduce only primitives with no breach_results yet. |
| --primitive-ids A,B,… | | Reproduce exactly the named primitives (overrides other filters). |
| --n-trials N | 5 | Trials per (primitive × config); powers the bootstrap CI. |
| --multimodal-only | off | Only image/audio primitives, rendered as real media. |
| --persona NAME | off | PAP persona wrap (the B side of the A/B). |
| --escalate | off | Inline auto-ladder for panel-wide refusals (costly; bound with --escalate-max-spend). |
| --candidate-quota N | 0 | Reserve N guaranteed harvested-candidate attempts before early-stop (scheduler policy). |
| --judge-batch | off | Grade via the Anthropic Batch API (50% off + caching; baseline-only). |
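
The --n-trials flag feeds a bootstrap confidence interval on the breach rate. As a rough illustration of how a handful of 0/1 trial outcomes becomes an interval, here is a percentile-bootstrap sketch; the resample count, the 95% level, and the percentile method are assumptions, and ROGUE's own estimator may differ.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 trial outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the trials with replacement and record each resample's breach rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Five hypothetical trials for one primitive x config: 1 = breach, 0 = no breach.
outcomes = [1, 0, 0, 1, 0]
low, high = bootstrap_ci(outcomes)
print(f"breach rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only five trials the interval is necessarily wide, which is why raising --n-trials tightens the estimate at the cost of more target and judge calls.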

scripts/reproduce/candidate_quota_ab.py runs the candidate-quota A/B (the empirical baseline for the break-bandit).

Add your own source

ROGUE's sources are plugins, not a hard-coded list. To harvest from a forum, blog, repo, or feed it doesn't cover yet, write one SourcePlugin subclass — declare a name, a source_type, the required_capabilities it needs to fetch (e.g. UNLOCK for a page, SERP for a search), and an async fetch_since(fetcher, since) that returns RawDocuments. Your plugin owns what the content means; the injected fetcher owns how the bytes arrive. Register it in default_plugins() and the next harvest run extracts, dedupes, and reproduces from it like any built-in. Full walkthrough + a copy-paste example: docs/adding-sources.md.
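
The shape of such a plugin might look like the sketch below. It is self-contained so it can run without the package, which means SourcePlugin, RawDocument, and Capability here are stand-ins written for illustration; the real classes in the rogue package, their fields, and the fetcher's method names (unlock is a guess) may differ, so treat docs/adding-sources.md as the authority.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


# Stand-ins for illustration only; the real definitions live in the rogue package.
class Capability(Enum):
    UNLOCK = "unlock"
    SERP = "serp"


@dataclass
class RawDocument:
    url: str
    text: str
    fetched_at: datetime


class SourcePlugin:
    name: str
    source_type: str
    required_capabilities: tuple = ()

    async def fetch_since(self, fetcher, since):
        raise NotImplementedError


class ExampleForumPlugin(SourcePlugin):
    """Hypothetical plugin that pulls one forum index page."""

    name = "example_forum"
    source_type = "forum"
    required_capabilities = (Capability.UNLOCK,)
    index_url = "https://forum.example.com/latest"  # placeholder URL

    async def fetch_since(self, fetcher, since):
        # The plugin decides what to fetch; the injected fetcher decides how.
        html = await fetcher.unlock(self.index_url)  # hypothetical fetcher method
        doc = RawDocument(url=self.index_url, text=html, fetched_at=datetime.now(timezone.utc))
        # Keep only documents inside the harvest window.
        return [d for d in [doc] if d.fetched_at >= since]


class FakeFetcher:
    """Test double so the sketch runs offline."""

    async def unlock(self, url):
        return f"<html>stub page for {url}</html>"


since = datetime.now(timezone.utc) - timedelta(days=1)
docs = asyncio.run(ExampleForumPlugin().fetch_since(FakeFetcher(), since))
print(len(docs), docs[0].url)
```

Injecting the fetcher is what lets the same plugin run against a stub in tests and against Bright Data in a real harvest.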

Repository layout

src/rogue/     # Python package (schemas, harvest, extract, dedupe, reproduce, diff, mcp_server, db, api)
docs/          # architecture, schemas, taxonomy, sources, budget + the deep-dive pages
tests/         # schema round-trip tests + golden fixtures
scripts/       # harvest_once.py, reproduce_once.py, calibration/, ops/
frontend/      # Next.js dashboard

Built by

Benaja Soren Obounou Lekogo Nguia — AI Systems Engineer; previously Grand-Prize winner at Yonsei University for LLM security tooling (GPTFuzz optimization), adversarial-ML research at AIM Intelligence (HWARANG red-team series).

"I built ROGUE solo in 6 days because Bright Data abstracted away 5 different anti-bot stacks I'd otherwise have spent weeks on. The MCP Server plus pre-built Reddit / X scrapers turned a 6-week project into a 6-day project."

— Benaja Soren Obounou Lekogo Nguia

License

MIT. See LICENSE.

About

The red-team that never sleeps — autonomous open-web LLM threat intelligence. Harvests new jailbreaks via Bright Data, reproduces them against your deployment, ships a daily breach diff. Two-way MCP: harvest + queryable from your IDE. Live demo + 1-click install.
