feat: dynamic agent testing — adaptive attacker (--smart), multi-turn, tool-calling#32
Merged
Merged
Conversation
The two auto-save paths in scan/guard caught all exceptions and passed, so a failed results write was invisible to the user. Print a stderr warning while keeping the save best-effort (non-fatal). (cherry picked from commit 13f6018ad9180df06acd4915f7ccd43d34c45351)
Multi-turn probes fired each turn as an isolated, stateless agent call, so the model never saw prior turns and escalation (Crescendo-style) attacks were a no-op. Add _run_multi_turn() which accumulates the conversation (prior turns + the agent's prior responses) and feeds it forward each turn, and returns every turn's response so a leak on ANY turn is detected. Wires the 3 multi-turn call sites (extraction, injection, boundary). This is the prerequisite for the adaptive/dynamic attacker (PAIR/TAP/Crescendo all need conversation state). (cherry picked from commit 13ec8b568e5876f58dd5ddeb36733a8d32804d22)
Foundation for testing agents WITH their tools, not just the text layer. - ToolCall data model + detect_tool_abuse(): verdict on what the agent DOES -- LEAKED if it invokes a forbidden tool or smuggles the canary into a tool's arguments; BLOCKED on benign/no tool use (with a text-canary fallback). - run_tool_probe(): drives a tool-aware agent (messages, tools) -> (text, calls) against one payload and returns an action-based verdict. Catches the attack the text layer misses: an agent that refuses in prose but quietly calls exfiltrate(data=<system_prompt>). Connector wiring (OpenAI function-calling, Claude tool_use, MCP) and a probe library build on this. (cherry picked from commit 8fe3850fad42a63202dde009f640ad31b764c9ee)
The marquee dynamic capability: instead of one fixed payload, an attacker reads the target's response and refines the next attempt under a hard query budget. - run_pair_campaign(): injectable attacker/judge/detect (deterministic + testable; real BYOK LLM attacker wraps the same interface), stops on first leak, gives up cleanly at the budget (no infinite loop / runaway cost), records every turn as a replayable transcript (reproducible findings even as models drift). - async-ready for real LLM attacker/judge. Foundation for TAP (tree search) and Crescendo. Builds on the multi-turn fix. (cherry picked from commit 59b5c9a1ff23175dfdf3f71fbbf76b2a5a679c6c)
Turns the tested adaptive loop into a live attack: make_llm_attacker() proposes the next escalation from the conversation so far (the refusal is the signal), make_llm_judge() scores 0-10 how close the target came. Both wrap an injectable async llm_fn (the user's own model via any connector) so prompt construction and score parsing are unit-tested without a live model, and they drop straight into run_pair_campaign. parse_score() tolerates messy LLM output (clamped 0-10). (cherry picked from commit ac2e23f684a25b6a364e468673fc57b165a83bd7)
…ives run_deep_findings() runs one PAIR campaign per objective (extract system prompt, override instructions, ...) using the BYOK attacker/judge, and returns a LIST of findings with replayable transcripts -- never a numeric score, so adaptive output cannot contaminate the deterministic Trust Score. Adds an optional judge-score success threshold to the PAIR loop (default-disabled) so a leak can be detected via the judge when no canary is planted. (cherry picked from commit 6cd62d341b7741969d300430767ac02764abce6d)
…scan Exposes the dynamic engine as a CLI teaser: after the deterministic scan, runs a short BYOK PAIR campaign (1 objective, <=3 turns) that escalates against the target and reports findings, explicitly NOT scored (the Trust Score stays the deterministic static suite). Reuses the target connector + model as the attacker LLM; guarded + try/except so it can never break a normal scan. (cherry picked from commit a70f005360ace2a5b0183eef34b30273cd231216)
Live-verify against a real model showed aligned models refuse to act as the attacker. Fixes: - harden the attacker system prompt (sanctioned isolated-lab framing; target is a disposable fixture; refusing = assessment failure; output only the raw attack) - add 'scan --attacker-model' so the adaptive attacker can run on a separate, less-aligned model while still testing the real target - bump the Ollama connector read timeout 60s -> 180s so local 'thinking' models don't time out mid-campaign (cherry picked from commit 7c6be0705b79263588186d4fcda6171b1abe3508)
Miles-dev-29
approved these changes
Jun 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds dynamic / adaptive agent testing on top of the existing static probe suite — turning the scan from a fixed-payload prompt test into one that can escalate against the target across turns and judge what it does with tools. All opt-in and non-scored, so the deterministic Trust Score is unchanged.
New modules (conflict-free additions)
adaptive.py— PAIR loop: observe → escalate → judge → refine, hard query budget (no infinite loops / runaway cost), replayable transcript.adaptive_llm.py— BYOK LLM attacker + judge (injectable; runs on the user's own model).tool_probe.py— action-based detection: flags an agent that invokes a forbidden tool or smuggles data through tool arguments, not just canary-in-text.deep_findings.py— one adaptive campaign per high-value objective; returns findings, never a score.Changed
validator.py— fix: multi-turn probes now thread conversation history, so gradual-escalation (Crescendo-style) attacks actually work (previously each turn was a stateless call → escalation was a no-op).cli.py—scan --smart(bounded adaptive attacker after the static scan; BYOK; not scored) +--attacker-model(drive the attacker on a separate model, since aligned models refuse to attack). Coexists with the existing error-validity check.connectors/ollama.py— read timeout 60s → 180s for local "thinking" models.cli.py— warn instead of silently swallowing report-save failures.Design
--smartadds the adaptive pass.Tests
1155 passing (full suite on this branch). 27 new tests cover the engine: PAIR loop, budget/no-infinite-loop, history threading, tool-abuse detection, score parsing.
Verified
Live end-to-end against a real model (Ollama) — drove the model, escalated, judged, and caught a real canary leak. (Aligned models make weak attackers → hence
--attacker-model.)Not in this PR
A canonical-scorer rewrite (boundary double-count + score-freebie fixes) is deferred to a separate PR — origin already has the error-scoring fix, and the rewrite changes the return contract, so it deserves its own review.