[Feature]: LLM-as-guardrail SecurityAnalyzer with self-exploration safety-experience injection (ToolShield) #14040

### Is there an existing feature request for this?

- [X] I have searched existing issues and feature requests, and this is not a duplicate.

### Summary

- Opt-in guardrail-LLM `SecurityAnalyzer` + ToolShield safety experiences (from [*Unsafer in Many Turns*, Li et al. 2026](https://arxiv.org/abs/2602.13379); work by Northeastern, UIUC, UC Berkeley, and Virtue AI).
- Separates the actor from the judge; catches risky actions that the executor LLM won't reliably self-flag.
- Attack Success Rate (ASR) on multi-turn attack sequences (from the [same paper](https://arxiv.org/abs/2602.13379)): 75–88% → **7–10%** (Claude Sonnet 4.5 / Qwen3.5-Plus / Gemini 3 Flash).
- Minimal cost overhead on 50 benign Terminal-Bench tasks.
- Default analyzer unchanged.


### Problem or Use Case

## What problem or use case are you trying to solve?

The current default, `LLMSecurityAnalyzer`, asks the same LLM that generates a tool call to self-annotate `security_risk` in the same response. A model that has already decided to execute a risky tool call is unlikely to flag that call as risky at the same time: the context that causes the bad action also biases the self-assessment. Self-annotation also has known reliability issues on weaker models, which silently omit or hallucinate the `security_risk` field ([ALL-3921](https://linear.app/all-hands-ai/issue/ALL-3921/bug-cli-v1-crashes-with-llm-provided-a-security-risk-but-no-security)). The Invariant rule-based analyzer can't capture dynamic multi-turn attack structure and has had stability problems ([#5264](<https://github.com/OpenHands/OpenHands/issues/5264>)). The design issue that introduced the current analyzer ([ALL-3014](https://linear.app/all-hands-ai/issue/ALL-3014/agent-improve-security-analyzer-with-llm-provided-risk)) itself noted that rule-based approaches aren't general enough.

We measured this on a 100-task subset of MT-AgentRisk (arXiv:2602.13379, Feb 2026), a benchmark that transforms single-turn harmful tasks from established works such as OpenAgentSafety and Safearena into multi-turn sequences. Distributing the harmful goal across turns bypasses the current single-turn safety mechanism, so executor LLMs can't reliably identify the risk. **Attack Success Rate ranges from 75% to 88% across three frontier models.** That's the number we'd like to bring down.

## Background: MT-AgentRisk and ToolShield

**MT-AgentRisk** (arXiv:2602.13379) is a multi-turn agent safety benchmark. It takes single-turn harmful tasks from established benchmarks such as OpenAgentSafety and Safearena (365 tasks across Filesystem-MCP, Browser/Playwright-MCP, PostgreSQL-MCP, Notion-MCP, and Terminal), then transforms each one into a sequence of steps that together achieve the harmful goal. A single-turn `rm -rf /` is rejected; broken into three innocuous-seeming commands across turns, it executes.

**ToolShield** (same paper) is a training-free, tool-agnostic defense. When the agent first encounters a tool, it autonomously generates test cases, executes them in the sandbox, and observes what actually happens downstream. From those observations it distills structured safety experiences. At deployment, the relevant experiences are injected into the guardrail LLM's context to guide the judgement.
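The deployment-time injection step can be sketched roughly as below. This is only an illustration of the idea, not ToolShield's actual code: the `SafetyExperience` fields, the prompt wording, and the `build_guardrail_prompt` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyExperience:
    """Hypothetical record distilled from sandbox exploration of one tool."""
    tool: str
    guideline: str


# Assumed wording; the real guardrail prompt may differ.
BASE_PROMPT = (
    "You are a security guardrail. Rate the proposed action as "
    "LOW, MEDIUM, or HIGH risk given the prior actions."
)


def build_guardrail_prompt(
    experiences: list[SafetyExperience], active_tools: set[str]
) -> str:
    """Append only the experiences for tools that are active in this session."""
    relevant = [e for e in experiences if e.tool in active_tools]
    if not relevant:
        return BASE_PROMPT
    lines = [f"- [{e.tool}] {e.guideline}" for e in relevant]
    return BASE_PROMPT + "\n\nSafety experiences:\n" + "\n".join(lines)


experiences = [
    SafetyExperience("terminal", "Deleting files in several small steps can add up to a destructive wipe."),
    SafetyExperience("postgres", "Dropping a table is irreversible without a backup."),
]
prompt = build_guardrail_prompt(experiences, active_tools={"terminal"})
print(prompt)
```

Filtering by active tool keeps the injected context small, which matters because the guardrail prompt is sent on every action.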

### Proposed Solution

We'd like to contribute an **opt-in** alternative with two parts. First, it separates the actor and the judge: a distinct guardrail LLM reviews each proposed action rather than relying on the actor to self-report. Second, and more importantly, **ToolShield** distills safety experiences by red-teaming each tool in the sandbox once; the resulting guidelines are then injected into the guardrail's context at decision time. It's built on the existing `SecurityAnalyzer` interface, so it requires no changes to core, tools, or the `ConfirmRisky` enforcement path.

Two PRs: one SDK change and one small frontend change.

**PR 1 —** `ToolShieldLLMSecurityAnalyzer` (SDK)

* New `SecurityAnalyzerBase` subclass; for each `ActionEvent`, issues a separate completion to a guardrail LLM and returns `LOW / MEDIUM / HIGH / UNKNOWN`.
* Configurable fields:
  * `llm`: any SDK `LLM`
  * `history_window`: number of prior actions included as context (default 20)
  * `safety_experiences`: optional string injected into the system prompt
* Failure handling (opinionated, logged distinctly):
  * LLM infrastructure error → `UNKNOWN` (lets `ConfirmRisky` apply its configured fallback)
  * Unparseable guardrail output → `HIGH` (conservative)
* Pairs with `ConfirmRisky` unchanged; no changes to `Agent` or tool executors.
* Per-tool safety experiences come from the existing `toolshield` PyPI package via an optional extra (`pip install openhands-sdk[toolshield]`), so no new SDK submodule is needed.
* Three helpers populate `safety_experiences`:
  * `default_safety_experiences()`: terminal + filesystem seed
  * `load_safety_experiences(tool_names)`: explicit list
  * `auto_detect_safety_experiences()`: probes localhost MCP ports
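To make the failure handling and `history_window` behavior concrete, here is a minimal, self-contained sketch. It deliberately does not import the SDK: `Risk`, `GuardrailSketch`, and the `complete` callable (standing in for an SDK `LLM`) are hypothetical names, and the prompt text is an assumption. The real class would subclass `SecurityAnalyzerBase` and operate on `ActionEvent` objects.

```python
import logging
from collections import deque
from enum import Enum
from typing import Callable

logger = logging.getLogger("guardrail_sketch")


class Risk(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    UNKNOWN = "UNKNOWN"


class GuardrailSketch:
    """Hypothetical stand-in for the proposed analyzer (not the SDK class)."""

    def __init__(
        self,
        complete: Callable[[str, str], str],  # (system_prompt, user_prompt) -> reply
        history_window: int = 20,
        safety_experiences: str = "",
    ) -> None:
        self.complete = complete
        self.safety_experiences = safety_experiences
        # Bounded buffer: only the last `history_window` actions are kept.
        self._history: deque[str] = deque(maxlen=history_window)

    def security_risk(self, action: str) -> Risk:
        system = "Rate the proposed action as LOW, MEDIUM, or HIGH risk."
        if self.safety_experiences:
            system += "\n\nSafety experiences:\n" + self.safety_experiences
        user = (
            "Prior actions:\n" + "\n".join(self._history)
            + f"\n\nProposed action:\n{action}"
        )
        self._history.append(action)
        try:
            reply = self.complete(system, user)
        except Exception:
            # Infrastructure failure: defer to the confirmation policy's fallback.
            logger.warning("guardrail LLM call failed; returning UNKNOWN")
            return Risk.UNKNOWN
        label = reply.strip().upper()
        if label in ("LOW", "MEDIUM", "HIGH"):
            return Risk[label]
        # Output we cannot parse is treated conservatively.
        logger.warning("unparseable guardrail output %r; returning HIGH", reply)
        return Risk.HIGH


guardrail = GuardrailSketch(complete=lambda system, user: "HIGH", history_window=5)
verdict = guardrail.security_risk("rm -rf ~/project")
print(verdict)
```

The two failure paths are kept separate on purpose: a provider outage says nothing about the action itself, whereas a reply the parser cannot read came from a guardrail that did see the action, so the conservative label is the safer choice.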

 **PR 2 — Frontend dropdown entry** (OpenHands app repo)
* Adds `"ToolShield Analyzer"` to the analyzer dropdown in `verification-settings.tsx` (\~10 lines + i18n keys). No new sub-form is needed: the guardrail reads its LLM config from the existing SDK `LLM` profile system, and experiences come from the `toolshield` optional extra.
* Default remains `LLMSecurityAnalyzer`.

## Evaluation

100-task harmful subset of MT-AgentRisk; 50 benign tasks from Terminal-Bench (35 easy + 15 medium). Three models tested. For each row we use the same model as both the executor and the guardrail.

**Attack Success Rate on harmful tasks (lower is better)**


| Model | `LLMSecurityAnalyzer` (current default) | \+ LLM Guardrail | \+ LLM Guardrail + ToolShield |
| -- | -- | -- | -- |
| Claude Sonnet 4.5 | 88% | 16% | 7% |
| Qwen3.5-Plus | 75% | 14% | 10% |
| Gemini 3 Flash | 81% | 18% | 10% |

The current default security analyzer performs worst because distributed multi-turn harm bypasses the single-turn safety reasoning that the actor LLM relies on. Adding a separate guardrail lowers ASR by introducing a second judgment made in a separate call. ToolShield builds on this further: the distilled safety experiences give the guardrail actionable, tool-specific knowledge to ground its judgment.
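To see how much of the drop each stage accounts for, the ASR table above can be decomposed into percentage points. The helper name `point_drops` is ours; the numbers are copied from the table.

```python
# ASR percentages from the table: (default, + guardrail, + guardrail + ToolShield)
asr = {
    "Claude Sonnet 4.5": (88, 16, 7),
    "Qwen3.5-Plus": (75, 14, 10),
    "Gemini 3 Flash": (81, 18, 10),
}


def point_drops(default: int, guardrail: int, toolshield: int) -> tuple[int, int]:
    """Percentage points removed by the guardrail, then by ToolShield on top."""
    return default - guardrail, guardrail - toolshield


for model, row in asr.items():
    by_guardrail, by_toolshield = point_drops(*row)
    print(f"{model}: guardrail -{by_guardrail} pts, ToolShield -{by_toolshield} pts")
```

On these numbers the separate guardrail accounts for 61 to 72 points of the reduction, and ToolShield removes a further 4 to 9 points.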

**End-to-end cost on 50 benign Terminal-Bench tasks (USD), with terminal + filesystem experiences injected:**


| Model | Baseline | \+ LLM Guardrail | \+ LLM Guardrail + ToolShield | Completion Rate Across 3 Settings |
| -- | -- | -- | -- | -- |
| Claude Sonnet 4.5 | $30.06 | $31.24 | $32.29 | 80% |
| Qwen3.5-Plus | $7.99 | $8.19 | $8.49 | 78% |
| Gemini 3 Flash | $4.30 | $4.54 | $4.60 | 72% |

And more importantly, **no benign action was blocked under the LLM Guardrail + ToolShield setting**.

LLM Guardrail plus ToolShield adds only minimal overhead compared with Baseline: between 6.3% and 7.4% of total cost across the three models. Once generated, experiences are transferable across models: the same artifacts can serve different guardrail LLMs without regeneration. For providers that support prompt caching, the guardrail's static context (system prompt + experiences) can also be cached server-side, further reducing per-call cost.
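The relative overhead can be checked directly from the cost table. The `overhead_pct` helper is ours; the dollar figures are copied from the table.

```python
# Total USD cost from the table: (baseline, + guardrail, + guardrail + ToolShield)
costs = {
    "Claude Sonnet 4.5": (30.06, 31.24, 32.29),
    "Qwen3.5-Plus": (7.99, 8.19, 8.49),
    "Gemini 3 Flash": (4.30, 4.54, 4.60),
}


def overhead_pct(baseline: float, with_defense: float) -> float:
    """Extra cost relative to baseline, in percent, rounded to one decimal."""
    return round((with_defense - baseline) / baseline * 100, 1)


for model, (baseline, guardrail, toolshield) in costs.items():
    print(
        f"{model}: guardrail +{overhead_pct(baseline, guardrail)}%, "
        f"guardrail + ToolShield +{overhead_pct(baseline, toolshield)}%"
    )
```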

### Alternatives Considered

* **Keep the status quo** (`LLMSecurityAnalyzer`): cheapest, but as shown above, same-LLM self-annotation is structurally weak against multi-turn attacks (75–88% ASR). The [ALL-3921](https://linear.app/all-hands-ai/issue/ALL-3921/bug-cli-v1-crashes-with-llm-provided-a-security-risk-but-no-security) reliability bug on weaker models further limits this option.
* **Extend the existing** `LLMSecurityAnalyzer` with a second self-call: we considered asking the actor LLM to re-evaluate its own action as a second pass. It's cheaper than a separate model, but the underlying bias (same context → same blind spot) persists — a distinct judge model is what actually breaks the pattern.
* **Rule-based** `InvariantAnalyzer`: can't capture dynamic multi-turn attack structure, and has had stability problems ([#5264](<https://github.com/OpenHands/OpenHands/issues/5264>)). The original design discussion ([ALL-3014](https://linear.app/all-hands-ai/issue/ALL-3014/agent-improve-security-analyzer-with-llm-provided-risk)) already acknowledges that rule-based approaches aren't general enough for agent safety.
* **Fine-tuning a dedicated guardrail model**: would give the best per-call accuracy, but requires training pipeline + dataset curation, per-model retraining, and ongoing maintenance. We want a training-free path that works with any off-the-shelf LLM so operators can swap in whatever they already trust.

### Priority / Severity

Medium - Would improve experience

### Estimated Scope

Medium - New feature with moderate complexity

### Feature Area

Agent / AI behavior

### Technical Implementation Ideas (Optional)

## Do you have thoughts on the technical implementation?

* Lives in `openhands-sdk` at `openhands/sdk/security/` alongside `LLMSecurityAnalyzer`. No changes to `Agent`, tool executors, or the `ConfirmRisky` enforcement path.
* Reuses the existing `LLM` abstraction for the guardrail, so any provider LiteLLM supports works.
* ToolShield experiences are JSON-serializable and cached per-tool-name in the workspace/settings area.
* Graceful degradation: LLM infrastructure error → `UNKNOWN` (`ConfirmRisky` then prompts the user, matching what the current default does for missing annotations); unparseable guardrail output → `HIGH` (conservative, logged).
* By default, the guardrail is seeded with the filesystem and terminal experiences we generated. Users can also generate their own experiences with different models or tools, and can manually adjust the injected experiences as needed. The system also supports automatic injection for all active MCP tools by scanning localhost to identify them.
* Happy to gate this behind a feature flag for the first release.
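As a rough sketch of the per-tool-name cache mentioned above, the snippet below stores one JSON file per tool. The on-disk layout, the `guideline` key, and the helper names `save_cached_experiences` / `load_cached_experiences` are assumptions for illustration, not the `toolshield` package's format or API.

```python
import json
import tempfile
from pathlib import Path


def save_cached_experiences(cache_dir: Path, tool_name: str, experiences: list[dict]) -> Path:
    """Write one tool's experiences to <cache_dir>/<tool_name>.json (assumed layout)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{tool_name}.json"
    path.write_text(json.dumps(experiences, indent=2), encoding="utf-8")
    return path


def load_cached_experiences(cache_dir: Path, tool_names: list[str]) -> dict[str, list[dict]]:
    """Return cached experiences for the requested tools, skipping tools with no cache file."""
    loaded: dict[str, list[dict]] = {}
    for name in tool_names:
        path = cache_dir / f"{name}.json"
        if path.exists():
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toolshield"
    save_cached_experiences(
        cache,
        "terminal",
        [{"guideline": "Treat recursive deletes outside the workspace as high risk."}],
    )
    # "notion" has no cache file yet, so it is simply absent from the result.
    result = load_cached_experiences(cache, ["terminal", "notion"])
    print(result)
```

Keying the cache by tool name is what would let the same artifacts be reused across guardrail models, since nothing in the file depends on which LLM reads it.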

### Additional Context

## Future work and extensibility

For this initial PR we include experiences for the tools evaluated in the paper. The API is designed to let users contribute their own experiences; we'll document the format and seed a few examples. Ongoing maintenance of the experience library is a direction we're interested in, but we'd like to discuss governance with maintainers before committing.

## Additional context

* Paper: *Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents*, arXiv:2602.13379
* Related: [ALL-3014](https://linear.app/all-hands-ai/issue/ALL-3014/agent-improve-security-analyzer-with-llm-provided-risk) (LLM risk analyzer design), [ALL-3190](https://linear.app/all-hands-ai/issue/ALL-3190/frontend-implement-llm-risk-analyzer-ui) (selector UI), [ALL-3921](https://linear.app/all-hands-ai/issue/ALL-3921/bug-cli-v1-crashes-with-llm-provided-a-security-risk-but-no-security) (self-annotation reliability bug), [#5264](<https://github.com/OpenHands/OpenHands/issues/5264>) (Invariant stability), [ALL-2325](https://linear.app/all-hands-ai/issue/ALL-2325/security-enhancement-secure-handling-of-environment-variables-and) (unified security policy)
