[Feature]: LLM-as-guardrail SecurityAnalyzer with self-exploration safety-experience injection (ToolShield) #14040

@xli04

Description

Is there an existing feature request for this?

  • I have searched existing issues and feature requests, and this is not a duplicate.

Summary

  • Opt-in guardrail-LLM SecurityAnalyzer + ToolShield safety experiences (from Unsafer in Many Turns, Li et al. 2026; work by Northeastern, UIUC, UC Berkeley, and Virtue AI).
  • Separates actor from judge; catches risky actions that the executor LLM won't reliably self-flag.
  • Attack Success Rate (ASR) on multi-turn attack sequences (from the same paper): 75–88% → 7–10% (Claude Sonnet 4.5 / Qwen3.5-Plus / Gemini 3 Flash).
  • Minimal cost overhead on 50 benign Terminal-Bench tasks.
  • Default analyzer unchanged.

Problem or Use Case

What problem or use case are you trying to solve?

The current default, LLMSecurityAnalyzer, asks the same LLM that generates a tool call to self-annotate security_risk in the same response. A model that has already decided to execute a risky tool call is unlikely to flag that call as risky at the same time: the context that causes the bad action also biases the self-assessment. Self-annotation also has known reliability issues on weaker models, which silently omit or hallucinate the security_risk field (ALL-3921). The rule-based Invariant analyzer can't capture dynamic multi-turn attack structure and has had stability problems (#5264). The design issue that introduced the current analyzer (ALL-3014) itself noted that rule-based approaches aren't general enough.

We measured this on a 100-task subset of MT-AgentRisk (arXiv:2602.13379, Feb 2026), a benchmark that transforms single-turn harmful tasks from established works such as OpenAgentSafety and SafeArena into multi-turn sequences. Distributing the harmful goal across turns bypasses the current single-turn safety mechanism, so executor LLMs can't reliably identify the risk. With the current default analyzer, Attack Success Rate (ASR) ranges from 75% to 88% across three frontier models. That's the number we'd like to bring down.

Background: MT-AgentRisk and ToolShield

MT-AgentRisk (arXiv:2602.13379) is a multi-turn agent safety benchmark. It takes single-turn harmful tasks from established benchmarks such as OpenAgentSafety and SafeArena (365 tasks across Filesystem-MCP, Browser/Playwright-MCP, PostgreSQL-MCP, Notion-MCP, and Terminal), then transforms each one into a sequence of steps that together achieve the harmful goal. A single-turn rm -rf / is rejected; broken into three innocuous-seeming commands across turns, it executes.

ToolShield (same paper) is a training-free, tool-agnostic defense. When the agent first encounters a tool, it autonomously generates test cases, executes them in the sandbox, and observes what actually happens downstream. From those observations it distills structured safety experiences. At deployment, the relevant experiences are injected into the guardrail LLM's context to guide the judgement.
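
The injection step described above can be pictured with a small sketch. Everything here is an illustrative assumption rather than the toolshield package's actual format: the `SafetyExperience` record, the `build_guardrail_prompt` helper, the base prompt wording, and the example lessons are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyExperience:
    """Hypothetical record for one distilled lesson about a tool."""

    tool: str
    lesson: str


# Hypothetical base instructions for the guardrail LLM.
BASE_PROMPT = (
    "You are a security reviewer. Rate the proposed action as "
    "LOW, MEDIUM, or HIGH risk."
)


def build_guardrail_prompt(
    experiences: list[SafetyExperience], active_tools: set[str]
) -> str:
    """Inject only the experiences that belong to currently active tools."""
    relevant = [e for e in experiences if e.tool in active_tools]
    if not relevant:
        return BASE_PROMPT
    lines = [f"- [{e.tool}] {e.lesson}" for e in relevant]
    return BASE_PROMPT + "\n\nSafety experiences:\n" + "\n".join(lines)


# Made-up lessons, for illustration only.
experiences = [
    SafetyExperience("terminal", "Judge a command together with the files earlier turns created."),
    SafetyExperience("postgres", "Check whether a DELETE or UPDATE is scoped by a WHERE clause."),
]
prompt = build_guardrail_prompt(experiences, {"terminal"})
```

Keeping the base prompt and experiences as a stable prefix, with per-action details appended later, is also what would make the static context a candidate for provider-side prompt caching.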

Proposed Solution

We'd like to contribute an opt-in alternative with two parts. First, it separates the actor from the judge: a distinct guardrail LLM reviews each proposed action rather than relying on the actor to self-report. Second, and more importantly, ToolShield distills safety experiences by red-teaming each tool in the sandbox once, and the resulting guidelines are injected into the guardrail's context at decision time. It's built on the existing SecurityAnalyzer interface, so it requires no changes to core, tools, or the ConfirmRisky enforcement path.

Two PRs: one SDK change and one small frontend change.

PR 1 — ToolShieldLLMSecurityAnalyzer (SDK)

  • New SecurityAnalyzerBase subclass; for each ActionEvent, issues
    a separate completion to a guardrail LLM and returns
    LOW / MEDIUM / HIGH / UNKNOWN.

  • Configurable fields:

    • llm — any SDK LLM
    • history_window — number of prior actions included as context (default 20)
    • safety_experiences — optional string injected into the system
      prompt
  • Failure handling (opinionated, logged distinctly):

    • LLM infrastructure error → UNKNOWN (lets ConfirmRisky apply
      its configured fallback)
    • Unparseable guardrail output → HIGH (conservative)
  • Pairs with ConfirmRisky unchanged; no changes to Agent or tool
    executors.

  • Per-tool safety experiences come from the existing toolshield
    PyPI package via an optional extra
    (pip install openhands-sdk[toolshield]) — no new SDK submodule.

  • Three helpers populate safety_experiences:

    • default_safety_experiences() — terminal + filesystem seed
    • load_safety_experiences(tool_names) — explicit list
    • auto_detect_safety_experiences() — probes localhost MCP ports
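
A minimal sketch of the per-action flow and the two failure paths listed above, written without importing the SDK. The `Risk` enum, `parse_verdict`, `assess`, and the `complete` callable are hypothetical stand-ins for illustration; the real analyzer would subclass SecurityAnalyzerBase and call an SDK LLM instead.

```python
import re
from enum import Enum
from typing import Callable, Sequence


class Risk(str, Enum):
    """Stand-in for the SDK's risk levels (assumed names)."""

    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    UNKNOWN = "UNKNOWN"


# Only these three labels count as a valid guardrail verdict.
_VERDICT = re.compile(r"\b(LOW|MEDIUM|HIGH)\b")


def parse_verdict(text: str) -> Risk:
    """Map raw guardrail output to a risk level; unparseable output -> HIGH."""
    found = set(_VERDICT.findall(text.upper()))
    if len(found) == 1:
        return Risk(found.pop())
    return Risk.HIGH  # zero or conflicting labels: stay conservative


def assess(
    action: str,
    history: Sequence[str],
    complete: Callable[[str], str],
    history_window: int = 20,
) -> Risk:
    """Ask a separate guardrail completion to rate one proposed action."""
    # Guard the zero case: history[-0:] would return the whole history.
    recent = list(history[-history_window:]) if history_window > 0 else []
    prompt = (
        "Prior actions:\n" + "\n".join(recent) + f"\n\nProposed action:\n{action}"
    )
    try:
        reply = complete(prompt)
    except Exception:
        # Infrastructure failure: defer to the confirmation policy's fallback.
        return Risk.UNKNOWN
    return parse_verdict(reply)
```

Treating the two failures differently is the point of the design: an outage says nothing about the action, so the decision is handed back to ConfirmRisky, while a reply that cannot be parsed is treated as a sign that something is off with the judgment itself.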

PR 2 — Frontend dropdown entry (OpenHands app repo)

  • Adds "ToolShield Analyzer" to the analyzer dropdown in
    verification-settings.tsx (~10 lines + i18n keys). No new sub-form;
    the guardrail reads its LLM config from the existing SDK LLM
    profile system, and experiences come from the toolshield optional
    extra.
  • Default remains LLMSecurityAnalyzer.

Evaluation

100-task harmful subset of MT-AgentRisk; 50 benign tasks from Terminal-Bench (35 easy + 15 medium). Three models tested. For each row we use the same model as both the executor and the guardrail.

Attack Success Rate on harmful tasks (lower is better)

| Model | LLMSecurityAnalyzer (current default) | + LLM Guardrail | + LLM Guardrail + ToolShield |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 | 88% | 16% | 7% |
| Qwen3.5-Plus | 75% | 14% | 10% |
| Gemini 3 Flash | 81% | 18% | 10% |

The current default security analyzer performs the worst because distributed multi-turn harm bypasses the single-turn safety reasoning that the actor LLM relies on. Adding a separate guardrail improves performance by introducing a second judgment. ToolShield builds on this further: the distilled safety experiences give the guardrail more actionable, tool-specific knowledge to guide its decisions.

End-to-end cost on 50 benign Terminal-Bench tasks (USD), with terminal + filesystem experiences injected:

| Model | Baseline | + LLM Guardrail | + LLM Guardrail + ToolShield | Completion rate (across all 3 settings) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | $30.06 | $31.24 | $32.29 | 80% |
| Qwen3.5-Plus | $7.99 | $8.19 | $8.49 | 78% |
| Gemini 3 Flash | $4.30 | $4.54 | $4.60 | 72% |

And more importantly, no benign action was blocked under the LLM Guardrail + ToolShield setting.

LLM Guardrail + ToolShield adds only minimal overhead compared with Baseline. Once generated, experiences are transferable across models: the same artifacts can serve different guardrail LLMs without regeneration. For providers that support prompt caching, the guardrail's static context (system prompt + experiences) can also be cached server-side, further reducing per-call cost.
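
To put a number on the overhead, the relative cost increase can be computed directly from the cost table. The figures are copied from the table; the `overhead_pct` helper is just arithmetic introduced for this example.

```python
# Total cost in USD over the 50 benign tasks, copied from the cost table:
# (Baseline, + LLM Guardrail, + LLM Guardrail + ToolShield)
costs = {
    "Claude Sonnet 4.5": (30.06, 31.24, 32.29),
    "Qwen3.5-Plus": (7.99, 8.19, 8.49),
    "Gemini 3 Flash": (4.30, 4.54, 4.60),
}


def overhead_pct(baseline: float, cost: float) -> float:
    """Relative extra cost over the baseline, in percent."""
    return (cost - baseline) / baseline * 100


for model, (baseline, guardrail, toolshield) in costs.items():
    print(
        f"{model}: guardrail +{overhead_pct(baseline, guardrail):.1f}%, "
        f"with ToolShield +{overhead_pct(baseline, toolshield):.1f}%"
    )
```

This works out to roughly 2.5 to 5.6 percent for the guardrail alone and roughly 6.3 to 7.4 percent with ToolShield experiences injected, depending on the model.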

Alternatives Considered

  • Keep the status quo (LLMSecurityAnalyzer): cheapest, but as shown above, same-LLM self-annotation is structurally weak against multi-turn attacks (75–88% ASR). The ALL-3921 reliability bug on weaker models further limits this option.
  • Extend the existing LLMSecurityAnalyzer with a second self-call: we considered asking the actor LLM to re-evaluate its own action as a second pass. It's cheaper than a separate guardrail call, but the underlying bias (same context → same blind spot) persists; a distinct judge with its own context is what actually breaks the pattern.
  • Rule-based InvariantAnalyzer: can't capture dynamic multi-turn attack structure, and has had stability problems (#5264). The original design discussion (ALL-3014) already acknowledges that rule-based approaches aren't general enough for agent safety.
  • Fine-tuning a dedicated guardrail model: would give the best per-call accuracy, but requires training pipeline + dataset curation, per-model retraining, and ongoing maintenance. We want a training-free path that works with any off-the-shelf LLM so operators can swap in whatever they already trust.

Priority / Severity

Medium - Would improve experience

Estimated Scope

Medium - New feature with moderate complexity

Feature Area

Agent / AI behavior

Technical Implementation Ideas (Optional)

Do you have thoughts on the technical implementation?

  • Lives in openhands-sdk at openhands/sdk/security/ alongside LLMSecurityAnalyzer. No changes to Agent, tool executors, or the ConfirmRisky enforcement path.
  • Reuses the existing LLM abstraction for the guardrail, so any provider LiteLLM supports works.
  • ToolShield experiences are JSON-serializable and cached per-tool-name in the workspace/settings area.
  • Graceful degradation: LLM infrastructure error → UNKNOWN (ConfirmRisky then prompts the user, matching what the current default does for missing annotations); unparseable guardrail output → HIGH (conservative, logged).
  • By default, the guardrail is seeded with the filesystem and terminal experiences we generated. Users can also generate their own experiences with different models or tools, and can manually adjust the injected experiences as needed. The system also supports automatic injection for all active MCP tools by scanning localhost to identify them.
  • Happy to gate this behind a feature flag for the first release.
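
As a rough sketch of the per-tool-name cache mentioned above, experiences could be stored as one JSON file per tool. The file layout, the `save_experiences` and `load_experiences` helpers, and the assumption that tool names are safe to use as file names are all hypothetical; this is not the toolshield package's API or on-disk format.

```python
import json
from pathlib import Path


def save_experiences(cache_dir: Path, tool_name: str, experiences: list[str]) -> Path:
    """Write one tool's experiences to <cache_dir>/<tool_name>.json."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{tool_name}.json"
    payload = {"tool": tool_name, "experiences": experiences}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_experiences(cache_dir: Path, tool_names: list[str]) -> str:
    """Concatenate cached experiences for the given tools, skipping uncached ones."""
    sections = []
    for name in tool_names:
        path = cache_dir / f"{name}.json"
        if not path.exists():
            continue  # nothing generated for this tool yet
        data = json.loads(path.read_text(encoding="utf-8"))
        bullets = "\n".join(f"- {item}" for item in data["experiences"])
        sections.append(f"## {name}\n{bullets}")
    return "\n\n".join(sections)
```

Because the cache is keyed by tool name rather than by model, the same files could be reused when the guardrail LLM is swapped, which matches the transferability claim in the evaluation.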

Additional Context

Future work and extensibility

For this initial PR we include experiences for the tools evaluated in the paper. The API is designed to let users contribute their own experiences; we'll document the format and seed a few examples. Ongoing maintenance of the experience library is a direction we're interested in, but we'd like to discuss governance with maintainers before committing.

Additional context

  • Paper: Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents, arXiv:2602.13379
  • Related: ALL-3014 (LLM risk analyzer design), ALL-3190 (selector UI), ALL-3921 (self-annotation reliability bug), #5264 (Invariant stability), ALL-2325 (unified security policy)

Metadata

Assignees

No one assigned

    Labels

    Stale (inactive for 40 days), enhancement (new feature or request)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
