Skip to content

feat: dynamic agent testing — adaptive attacker (--smart), multi-turn, tool-calling#32

Merged
Miles-dev-29 merged 9 commits into
mainfrom
feat/dynamic-agent-scan
Jun 11, 2026
Merged

feat: dynamic agent testing — adaptive attacker (--smart), multi-turn, tool-calling#32
Miles-dev-29 merged 9 commits into
mainfrom
feat/dynamic-agent-scan

Conversation

@iamtoruk

Copy link
Copy Markdown
Member

What

Adds dynamic / adaptive agent testing on top of the existing static probe suite — turning the scan from a fixed-payload prompt test into one that can escalate against the target across turns and judge what it does with tools. All opt-in and non-scored, so the deterministic Trust Score is unchanged.

New modules (conflict-free additions)

  • adaptive.py — PAIR loop: observe → escalate → judge → refine, hard query budget (no infinite loops / runaway cost), replayable transcript.
  • adaptive_llm.py — BYOK LLM attacker + judge (injectable; runs on the user's own model).
  • tool_probe.py — action-based detection: flags an agent that invokes a forbidden tool or smuggles data through tool arguments, not just canary-in-text.
  • deep_findings.py — one adaptive campaign per high-value objective; returns findings, never a score.

Changed

  • validator.pyfix: multi-turn probes now thread conversation history, so gradual-escalation (Crescendo-style) attacks actually work (previously each turn was a stateless call → escalation was a no-op).
  • cli.pyscan --smart (bounded adaptive attacker after the static scan; BYOK; not scored) + --attacker-model (drive the attacker on a separate model, since aligned models refuse to attack). Coexists with the existing error-validity check.
  • connectors/ollama.py — read timeout 60s → 180s for local "thinking" models.
  • cli.py — warn instead of silently swallowing report-save failures.

Design

  • Determinism preserved: adaptive output is a separate, non-scored "deep findings" pass; the static suite stays the Trust Score (reproducible / leaderboard-fair).
  • BYOK: attacker/judge run on the user's own model → no marginal cost.
  • Opt-in: default scans unchanged; --smart adds the adaptive pass.

Tests

1155 passing (full suite on this branch). 27 new tests cover the engine: PAIR loop, budget/no-infinite-loop, history threading, tool-abuse detection, score parsing.

Verified

Live end-to-end against a real model (Ollama) — drove the model, escalated, judged, and caught a real canary leak. (Aligned models make weak attackers → hence --attacker-model.)

Not in this PR

A canonical-scorer rewrite (boundary double-count + score-freebie fixes) is deferred to a separate PR — origin already has the error-scoring fix, and the rewrite changes the return contract, so it deserves its own review.

iamtoruk added 9 commits June 11, 2026 02:39
The two auto-save paths in scan/guard caught all exceptions and passed,
so a failed results write was invisible to the user. Print a stderr
warning while keeping the save best-effort (non-fatal).

(cherry picked from commit 13f6018ad9180df06acd4915f7ccd43d34c45351)
Multi-turn probes fired each turn as an isolated, stateless agent call, so the
model never saw prior turns and escalation (Crescendo-style) attacks were a
no-op. Add _run_multi_turn() which accumulates the conversation (prior turns +
the agent's prior responses) and feeds it forward each turn, and returns every
turn's response so a leak on ANY turn is detected. Wires the 3 multi-turn call
sites (extraction, injection, boundary). This is the prerequisite for the
adaptive/dynamic attacker (PAIR/TAP/Crescendo all need conversation state).

(cherry picked from commit 13ec8b568e5876f58dd5ddeb36733a8d32804d22)
Foundation for testing agents WITH their tools, not just the text layer.
- ToolCall data model + detect_tool_abuse(): verdict on what the agent DOES --
  LEAKED if it invokes a forbidden tool or smuggles the canary into a tool's
  arguments; BLOCKED on benign/no tool use (with a text-canary fallback).
- run_tool_probe(): drives a tool-aware agent (messages, tools) -> (text, calls)
  against one payload and returns an action-based verdict.
Catches the attack the text layer misses: an agent that refuses in prose but
quietly calls exfiltrate(data=<system_prompt>). Connector wiring (OpenAI
function-calling, Claude tool_use, MCP) and a probe library build on this.

(cherry picked from commit 8fe3850fad42a63202dde009f640ad31b764c9ee)
The marquee dynamic capability: instead of one fixed payload, an attacker reads
the target's response and refines the next attempt under a hard query budget.
- run_pair_campaign(): injectable attacker/judge/detect (deterministic + testable;
  real BYOK LLM attacker wraps the same interface), stops on first leak, gives up
  cleanly at the budget (no infinite loop / runaway cost), records every turn as a
  replayable transcript (reproducible findings even as models drift).
- async-ready for real LLM attacker/judge.
Foundation for TAP (tree search) and Crescendo. Builds on the multi-turn fix.

(cherry picked from commit 59b5c9a1ff23175dfdf3f71fbbf76b2a5a679c6c)
Turns the tested adaptive loop into a live attack: make_llm_attacker() proposes
the next escalation from the conversation so far (the refusal is the signal),
make_llm_judge() scores 0-10 how close the target came. Both wrap an injectable
async llm_fn (the user's own model via any connector) so prompt construction and
score parsing are unit-tested without a live model, and they drop straight into
run_pair_campaign. parse_score() tolerates messy LLM output (clamped 0-10).

(cherry picked from commit ac2e23f684a25b6a364e468673fc57b165a83bd7)
…ives

run_deep_findings() runs one PAIR campaign per objective (extract system prompt,
override instructions, ...) using the BYOK attacker/judge, and returns a LIST of
findings with replayable transcripts -- never a numeric score, so adaptive output
cannot contaminate the deterministic Trust Score. Adds an optional judge-score
success threshold to the PAIR loop (default-disabled) so a leak can be detected
via the judge when no canary is planted.

(cherry picked from commit 6cd62d341b7741969d300430767ac02764abce6d)
…scan

Exposes the dynamic engine as a CLI teaser: after the deterministic scan, runs a
short BYOK PAIR campaign (1 objective, <=3 turns) that escalates against the
target and reports findings, explicitly NOT scored (the Trust Score stays the
deterministic static suite). Reuses the target connector + model as the attacker
LLM; guarded + try/except so it can never break a normal scan.

(cherry picked from commit a70f005360ace2a5b0183eef34b30273cd231216)
Live-verify against a real model showed aligned models refuse to act as the
attacker. Fixes:
- harden the attacker system prompt (sanctioned isolated-lab framing; target is a
  disposable fixture; refusing = assessment failure; output only the raw attack)
- add 'scan --attacker-model' so the adaptive attacker can run on a separate,
  less-aligned model while still testing the real target
- bump the Ollama connector read timeout 60s -> 180s so local 'thinking' models
  don't time out mid-campaign

(cherry picked from commit 7c6be0705b79263588186d4fcda6171b1abe3508)
@Miles-dev-29 Miles-dev-29 self-assigned this Jun 11, 2026
@Miles-dev-29 Miles-dev-29 self-requested a review June 11, 2026 10:15
@Miles-dev-29 Miles-dev-29 merged commit 7c2a228 into main Jun 11, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants