feat: dynamic agent testing — adaptive attacker (--smart), multi-turn, tool-calling by iamtoruk · Pull Request #32 · getagentseal/agentseal

iamtoruk · 2026-06-11T09:42:11Z

What

Adds dynamic / adaptive agent testing on top of the existing static probe suite — turning the scan from a fixed-payload prompt test into one that can escalate against the target across turns and judge what it does with tools. All opt-in and non-scored, so the deterministic Trust Score is unchanged.

New modules (conflict-free additions)

adaptive.py — PAIR loop: observe → escalate → judge → refine, hard query budget (no infinite loops / runaway cost), replayable transcript.
adaptive_llm.py — BYOK LLM attacker + judge (injectable; runs on the user's own model).
tool_probe.py — action-based detection: flags an agent that invokes a forbidden tool or smuggles data through tool arguments, not just canary-in-text.
deep_findings.py — one adaptive campaign per high-value objective; returns findings, never a score.

Changed

validator.py — fix: multi-turn probes now thread conversation history, so gradual-escalation (Crescendo-style) attacks actually work (previously each turn was a stateless call → escalation was a no-op).
cli.py — scan --smart (bounded adaptive attacker after the static scan; BYOK; not scored) + --attacker-model (drive the attacker on a separate model, since aligned models refuse to attack). Coexists with the existing error-validity check.
connectors/ollama.py — read timeout 60s → 180s for local "thinking" models.
cli.py — warn instead of silently swallowing report-save failures.

Design

Determinism preserved: adaptive output is a separate, non-scored "deep findings" pass; the static suite stays the Trust Score (reproducible / leaderboard-fair).
BYOK: attacker/judge run on the user's own model → no marginal cost.
Opt-in: default scans unchanged; --smart adds the adaptive pass.

Tests

1155 passing (full suite on this branch). 27 new tests cover the engine: PAIR loop, budget/no-infinite-loop, history threading, tool-abuse detection, score parsing.

Verified

Live end-to-end against a real model (Ollama) — drove the model, escalated, judged, and caught a real canary leak. (Aligned models make weak attackers → hence --attacker-model.)

Not in this PR

A canonical-scorer rewrite (boundary double-count + score-freebie fixes) is deferred to a separate PR — origin already has the error-scoring fix, and the rewrite changes the return contract, so it deserves its own review.

The two auto-save paths in scan/guard caught all exceptions and passed, so a failed results write was invisible to the user. Print a stderr warning while keeping the save best-effort (non-fatal). (cherry picked from commit 13f6018ad9180df06acd4915f7ccd43d34c45351)

Multi-turn probes fired each turn as an isolated, stateless agent call, so the model never saw prior turns and escalation (Crescendo-style) attacks were a no-op. Add _run_multi_turn() which accumulates the conversation (prior turns + the agent's prior responses) and feeds it forward each turn, and returns every turn's response so a leak on ANY turn is detected. Wires the 3 multi-turn call sites (extraction, injection, boundary). This is the prerequisite for the adaptive/dynamic attacker (PAIR/TAP/Crescendo all need conversation state). (cherry picked from commit 13ec8b568e5876f58dd5ddeb36733a8d32804d22)

Foundation for testing agents WITH their tools, not just the text layer. - ToolCall data model + detect_tool_abuse(): verdict on what the agent DOES -- LEAKED if it invokes a forbidden tool or smuggles the canary into a tool's arguments; BLOCKED on benign/no tool use (with a text-canary fallback). - run_tool_probe(): drives a tool-aware agent (messages, tools) -> (text, calls) against one payload and returns an action-based verdict. Catches the attack the text layer misses: an agent that refuses in prose but quietly calls exfiltrate(data=<system_prompt>). Connector wiring (OpenAI function-calling, Claude tool_use, MCP) and a probe library build on this. (cherry picked from commit 8fe3850fad42a63202dde009f640ad31b764c9ee)

The marquee dynamic capability: instead of one fixed payload, an attacker reads the target's response and refines the next attempt under a hard query budget. - run_pair_campaign(): injectable attacker/judge/detect (deterministic + testable; real BYOK LLM attacker wraps the same interface), stops on first leak, gives up cleanly at the budget (no infinite loop / runaway cost), records every turn as a replayable transcript (reproducible findings even as models drift). - async-ready for real LLM attacker/judge. Foundation for TAP (tree search) and Crescendo. Builds on the multi-turn fix. (cherry picked from commit 59b5c9a1ff23175dfdf3f71fbbf76b2a5a679c6c)

Turns the tested adaptive loop into a live attack: make_llm_attacker() proposes the next escalation from the conversation so far (the refusal is the signal), make_llm_judge() scores 0-10 how close the target came. Both wrap an injectable async llm_fn (the user's own model via any connector) so prompt construction and score parsing are unit-tested without a live model, and they drop straight into run_pair_campaign. parse_score() tolerates messy LLM output (clamped 0-10). (cherry picked from commit ac2e23f684a25b6a364e468673fc57b165a83bd7)

…ives run_deep_findings() runs one PAIR campaign per objective (extract system prompt, override instructions, ...) using the BYOK attacker/judge, and returns a LIST of findings with replayable transcripts -- never a numeric score, so adaptive output cannot contaminate the deterministic Trust Score. Adds an optional judge-score success threshold to the PAIR loop (default-disabled) so a leak can be detected via the judge when no canary is planted. (cherry picked from commit 6cd62d341b7741969d300430767ac02764abce6d)

…scan Exposes the dynamic engine as a CLI teaser: after the deterministic scan, runs a short BYOK PAIR campaign (1 objective, <=3 turns) that escalates against the target and reports findings, explicitly NOT scored (the Trust Score stays the deterministic static suite). Reuses the target connector + model as the attacker LLM; guarded + try/except so it can never break a normal scan. (cherry picked from commit a70f005360ace2a5b0183eef34b30273cd231216)

Live-verify against a real model showed aligned models refuse to act as the attacker. Fixes: - harden the attacker system prompt (sanctioned isolated-lab framing; target is a disposable fixture; refusing = assessment failure; output only the raw attack) - add 'scan --attacker-model' so the adaptive attacker can run on a separate, less-aligned model while still testing the real target - bump the Ollama connector read timeout 60s -> 180s so local 'thinking' models don't time out mid-campaign (cherry picked from commit 7c6be0705b79263588186d4fcda6171b1abe3508)

iamtoruk added 9 commits June 11, 2026 02:39

chore: release 0.10.0 — version bump + CHANGELOG

5e58fc1

Miles-dev-29 self-assigned this Jun 11, 2026

Miles-dev-29 self-requested a review June 11, 2026 10:15

Miles-dev-29 approved these changes Jun 11, 2026

View reviewed changes

Miles-dev-29 merged commit 7c2a228 into main Jun 11, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: dynamic agent testing — adaptive attacker (--smart), multi-turn, tool-calling#32

feat: dynamic agent testing — adaptive attacker (--smart), multi-turn, tool-calling#32
Miles-dev-29 merged 9 commits into
mainfrom
feat/dynamic-agent-scan

iamtoruk commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iamtoruk commented Jun 11, 2026

What

New modules (conflict-free additions)

Changed

Design

Tests

Verified

Not in this PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants