From 790b0844d740632a7d922b1dbdc10931f482ee6c Mon Sep 17 00:00:00 2001 From: bishnubista Date: Sat, 2 May 2026 10:10:39 -0700 Subject: [PATCH 1/2] feat(SAFE-M-1): additive expansion to template parity + curated Mitigates MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SAFE-M-1 already had substantial content (115 lines, ASCII architecture diagram, CaMeL-style code example, the canonical CaMeL paper reference) but was missing 2 template sections (Deployment Considerations, Current Status) and its Mitigates list named only 3 of 15 actual citing techniques. This PR completes the template parity additively without rewriting existing content. Section additions: - Technical Implementation > Prerequisites (new sub-section, placed before Implementation Steps): trusted query source, capability model, control-flow parser, sandboxed data handler, dispatch mediator. - Deployment Considerations (new H2): resources (neutral wording — no unsourced magnitude claims), performance per the CaMeL paper (77% secure task completion under attack on AgentDojo vs 84% undefended baseline — a 7-percentage-point security/utility trade), monitoring (capability-escalation attempts, parser rejection, control-plane invocation from data plane, attempts to pass capability material through the model context). Operational note on trust-boundary manifest as a high-value config. - Current Status (2026) (new H2): source-backed only — CaMeL paper anchor with the repository's own research-artifact caveat (not a turn-key production system); Saltzer & Schroeder 1975 capability- security foundation. No unsourced adoption or "primary reference" claims. Curated Mitigates list: replaced 3-entry list with 11 directly-mapped citing techniques with technique-specific rationale per entry. Excluded 4 mislabels: T1202 (asks for token binding); T1704 (asks for mTLS + signed server responses); T1911 (asks for schema enforcement / parameter validation); T1915 (asks for blockchain transaction allowlists). Mislabels tracked separately as safe-m-1-mislabel-cluster follow-up. Redirect targets are not speculated in this PR — the follow-up will read each technique's actual mitigation-section context. Related Mitigations: replaced M-3 (AI-Powered Content Analysis — adjacent only) with M-69 (Out-of-Band Authorization — privileged tool gating consumes M-1's separation guarantees) and M-22 (Semantic Output Validation — runs in M-1's data plane). M-2 (Cryptographic Integrity) preserved. References: kept CaMeL paper; added Saltzer & Schroeder 1975 (capability-security foundational); NIST SP 800-160 Vol 2 Appendix F (Design Principles for Cyber Resilient Systems); OWASP LLM01 prompt- injection guidance; MCP Specification. Implementation Examples: added Example 2 — a host-bound capability grant with binding to (tool, args_digest, subject, session_id, expiry), fail-closed verification, atomic consume-once replay protection via a durable store contract (Redis SET NX EX, Postgres INSERT ON CONFLICT DO NOTHING, etc.), explicit canonicalization contract for args_digest, and the operational rule that no token material is routed back through the LLM context. Pedagogically labeled as one implementation pattern, not the canonical CaMeL implementation. Preserves the v1.0 Frederick Kautz Version History row per the corpus convention. Signed-off-by: bishnubista --- mitigations/SAFE-M-1/README.md | 212 ++++++++++++++++++++++++++++----- 1 file changed, 180 insertions(+), 32 deletions(-) diff --git a/mitigations/SAFE-M-1/README.md b/mitigations/SAFE-M-1/README.md index 243c9495..7dbf7c51 100644 --- a/mitigations/SAFE-M-1/README.md +++ b/mitigations/SAFE-M-1/README.md @@ -10,19 +10,32 @@ ## Description Control/Data Flow Separation is an architectural defense that creates a protective system layer around LLMs by explicitly separating control flow (from trusted queries) from data flow (including untrusted tool descriptions). This approach ensures that malicious instructions embedded in data cannot influence program execution. -The most notable implementation is CaMeL (Control and Memory Language), developed by researchers from Google and other institutions, which demonstrates 77% task completion with provable security guarantees. +A notable implementation is CaMeL (Control and Memory Language), developed by researchers from Google and other institutions, which on the AgentDojo benchmark achieves **77% secure task completion under attack vs 84% for the undefended baseline** — a 7-percentage-point security/utility tradeoff with provable security guarantees over the threat model evaluated. ## Mitigates -- [SAFE-T1001](../../techniques/SAFE-T1001/README.md): Tool Poisoning Attack (TPA) -- [SAFE-T1102](../../techniques/SAFE-T1102/README.md): Prompt Injection (Multiple Vectors) -- [SAFE-T1401](../../techniques/SAFE-T1401/README.md): Line Jumping + +The mitigation directly addresses the following techniques (curated against the actual citation graph; 4 mislabeled citers excluded — see Out of scope): + +- [SAFE-T1001](../../techniques/SAFE-T1001/README.md): Tool Poisoning Attack (TPA) — separation prevents poisoned tool descriptions from being interpreted as control instructions; the malicious payload is confined to the data plane. +- [SAFE-T1008](../../techniques/SAFE-T1008/README.md): Stored Prompt Injection (Persistent Tool/Repo Storage) — persistent untrusted content cannot influence the control plane that selects subsequent operations. +- [SAFE-T1102](../../techniques/SAFE-T1102/README.md): Prompt Injection (Multiple Vectors) — direct, indirect, and reflected injection attempts land in the data plane; the control plane is sourced only from trusted queries. +- [SAFE-T1204](../../techniques/SAFE-T1204/README.md): Persistent Tool Redefinition — separation ensures that tool descriptions modified post-load cannot redefine control flow at invocation time. +- [SAFE-T1304](../../techniques/SAFE-T1304/README.md): High-Privilege Tool Misuse via Indirect Prompt Injection — CaMeL-style capability gating in the dispatch layer prevents data-plane content from minting privileged calls. +- [SAFE-T1309](../../techniques/SAFE-T1309/README.md): Privileged Tool Invocation via Prompt Manipulation — the control plane that selects which privileged tool to call is isolated from the prompt content delivered to the model. +- [SAFE-T1401](../../techniques/SAFE-T1401/README.md): Line Jumping — context-window items cannot escalate from data items into control directives because the parser only extracts control flow from trusted sources. +- [SAFE-T1502](../../techniques/SAFE-T1502/README.md): Persistence via Repository Implants — repository content lives in the data plane; persisted implants cannot mint control-plane state. +- [SAFE-T1705](../../techniques/SAFE-T1705/README.md): Cross-Server Tool Shadowing — separation enforces that tool-name resolution and dispatch happens in the control plane against an explicit trusted registry. +- [SAFE-T1801](../../techniques/SAFE-T1801/README.md): Tool/Resource Exfiltration via Indirect Prompt Injection — exfiltration attempts that ride on data-plane content cannot mint dispatch decisions for new exfil tool calls. +- [SAFE-T2105](../../techniques/SAFE-T2105/README.md): Disinformation Output — even when the LLM produces compromised output, downstream consumer actions are gated by control-plane decisions, not by the model output itself. + +Four citers reference SAFE-M-1 with labels that don't match its actual concept (Control/Data Flow Separation) — they ask for different mitigations and are excluded from this list pending the `safe-m-1-mislabel-cluster` follow-up. See Out of scope for detail. ## Technical Implementation ### Core Principles -1. **Explicit Control Flow Extraction**: Parse and extract control flow from trusted sources only -2. **Data Isolation**: Treat all external inputs (tool descriptions, API responses, etc.) as pure data -3. **Capability-Based Security**: Implement fine-grained permissions for data access and flow +1. **Explicit Control Flow Extraction**: Parse and extract control flow from trusted sources only. +2. **Data Isolation**: Treat all external inputs (tool descriptions, API responses, retrieved memory, file content) as pure data with no executable semantics. +3. **Capability-Based Security**: Implement fine-grained, explicitly-issued capabilities for any privileged data access or flow operation. Capabilities are minted by control-plane authorities the model cannot impersonate; they bind to a specific tool, argument digest, subject, session, and expiry. ### Architecture Components ``` @@ -44,31 +57,38 @@ The most notable implementation is CaMeL (Control and Memory Language), develope └──────────┘ ``` +### Prerequisites +- A trusted query source / control-plane definition that explicitly declares the trusted set (e.g., system prompt, allowlisted user roles). +- A capability model that enumerates fine-grained permissions per data class. +- A parser/extractor that can reliably distinguish control-flow-relevant tokens from data-flow tokens in trusted inputs. +- A sandboxed data handler (process, container, or in-process capability boundary) where untrusted data lives. +- A dispatch layer that mediates between control engine and data handler with no leak of data into control flow. + ### Implementation Steps 1. **Design Phase**: - - Define trust boundaries - - Identify all data sources - - Design capability model + - Define trust boundaries — enumerate every input source and label it `trusted` (control-plane allowed) or `untrusted` (data-plane only). + - Identify all data sources — tool descriptions, API responses, retrieved memory, file content, web fetches. + - Design the capability model — per-data-class permissions, issuance authorities, binding fields, expiry policy. 2. **Development Phase**: - - Implement control flow parser - - Create data sandboxing layer - - Build capability enforcement + - Implement the control flow parser against the trusted-source schema. + - Create the data sandboxing layer (process / container / capability-boundary). + - Build capability issuance and verification with replay protection. 3. **Deployment Phase**: - - Configure security policies - - Set up monitoring - - Train operations team + - Configure the trust-boundary manifest and version it as code. + - Set up monitoring for trust-boundary violations and capability-forgery attempts. + - Train the operations team on the dispatch model and incident-response procedures. ## Benefits -- **Provable Security**: Mathematical guarantees against certain attack classes -- **Defense in Depth**: Works even if underlying LLM is vulnerable -- **No Model Retraining**: Can be applied to existing LLMs +- **Provable Security**: Mathematical guarantees against the prompt-injection class of attacks under the modeled threat surface — see the CaMeL paper for the specific guarantee statements. +- **Defense in Depth**: Works even if the underlying LLM is vulnerable to injection at the model level — the protection is architectural, not behavioral. +- **No Model Retraining**: Can be applied to existing LLMs without modifying model weights or fine-tuning data. ## Limitations -- **Performance Impact**: ~7% reduction in task completion rate -- **Complexity**: Requires significant architectural changes -- **Not Universal**: Some tasks may be incompatible with strict separation +- **Performance Impact**: 7-percentage-point trade between secure task completion (77%) and undefended baseline (84%) on the AgentDojo benchmark per the CaMeL paper. The exact runtime overhead per tool call depends on capability-check granularity and parser implementation — published latency benchmarks are limited. +- **Complexity**: Requires significant architectural changes — a dispatch layer, capability issuance and verification, and a trust-boundary manifest. Not a configuration-only mitigation. +- **Not Universal**: Some agent task patterns (e.g., open-ended exploration over untrusted content) may be incompatible with strict separation; operators may need to opt-out specific workflows or design data-driven extension paths that respect the control/data split. ## Implementation Examples @@ -92,25 +112,153 @@ def execute_tool_protected(tool_id, parameters): ) ``` +### Example 2: Host-bound capability grant for a privileged tool call + +This example demonstrates one way to satisfy Core Principle 3 (Capability-Based Security) — a host-issued, replay-safe capability grant that the dispatch layer verifies before allowing a privileged tool call. CaMeL supports a capability concept; this code is *one* implementation pattern, not the canonical CaMeL implementation. + +```python +# Host issues a capability grant; the model's data plane never holds or +# routes the granted token. Verification fails closed. + +import hmac, hashlib, json, secrets, time +from typing import Mapping + +# Capability granting authority — runs in the control plane, NOT model-callable. +class HostCapabilityIssuer: + def __init__(self, signing_key: bytes): + self._key = signing_key + + def issue(self, *, tool: str, args_digest: str, subject: str, + session_id: str, ttl_seconds: int = 30) -> dict: + """Issue a capability grant bound to a specific tool call.""" + claims = { + "ver": 1, + "tool": tool, + "args_digest": args_digest, # SHA-256 of the canonicalized arguments + "subject": subject, # service account / human approver + "session_id": session_id, + "exp": int(time.time()) + ttl_seconds, + "jti": secrets.token_urlsafe(16), # unique per grant; replay-protection key + } + payload = json.dumps(claims, sort_keys=True).encode() + sig = hmac.new(self._key, payload, hashlib.sha256).hexdigest() + # Return claims + signature; opaque to the data plane. + return {"payload": payload.decode(), "signature": sig} + +# Verifier — runs in the dispatch layer, between data plane and tool execution. +class HostCapabilityVerifier: + def __init__(self, signing_key: bytes, jti_store): + """jti_store must implement consume_once_or_raise(jti, exp) atomically. + See note below — the in-memory `set` shown in tests is NOT + concurrency-safe and must not be used in production.""" + self._key = signing_key + self._jti_store = jti_store + + def verify(self, grant: Mapping, *, expected_tool: str, + expected_args: dict, expected_subject: str, + expected_session: str) -> None: + """Verify a capability grant. Raises on any mismatch — fail closed.""" + # 1. Verify signature first; any field read before this is attacker-controllable. + expected_sig = hmac.new( + self._key, grant["payload"].encode(), hashlib.sha256 + ).hexdigest() + if not hmac.compare_digest(expected_sig, grant["signature"]): + raise PermissionError("capability signature invalid") + + # 2. Parse claims from the verified payload only. + claims = json.loads(grant["payload"]) + if claims.get("ver") != 1: + raise PermissionError("unsupported capability version") + + # 3. Expiry check before consume. + if claims["exp"] < int(time.time()): + raise PermissionError("capability expired") + + # 4. Bind to the actual call. + if claims["tool"] != expected_tool: + raise PermissionError("tool mismatch") + if claims["session_id"] != expected_session: + raise PermissionError("session mismatch") + if claims["subject"] != expected_subject: + raise PermissionError("subject mismatch") + digest = hashlib.sha256( + json.dumps(expected_args, sort_keys=True).encode() + ).hexdigest() + if not hmac.compare_digest(claims["args_digest"], digest): + raise PermissionError("arguments do not match signed grant") + + # 5. Atomic consume-once. Implementation MUST be atomic (e.g. Redis + # `SET NX EX`, a database unique-constraint INSERT, or a + # transactional check-and-add). A non-atomic check-then-add + # check on a Python `set` is NOT replay-safe under concurrency + # — two concurrent verifies of the same grant can both pass. + self._jti_store.consume_once_or_raise(claims["jti"], claims["exp"]) +``` + +**Replay-store contract.** `jti_store.consume_once_or_raise(jti, exp)` MUST atomically (a) reject if `jti` is already recorded as consumed, (b) record `jti` as consumed with TTL = `exp`, and (c) raise `PermissionError("capability replay detected")` on rejection. Reference implementations: Redis `SET 1 NX EX ` returning `OK` on success or `nil` on duplicate; PostgreSQL `INSERT INTO consumed_jtis (jti, exp) VALUES (?, ?) ON CONFLICT DO NOTHING RETURNING jti` returning a row only on first insert. **Do not substitute a Python `set` outside unit tests** — the check-then-add pattern that sets allow is not atomic under concurrent verification and will let a captured grant be replayed. + +**Argument-canonicalization contract.** `args_digest` is computed by both the issuer and the verifier from the *same* canonicalization routine (in this example, `json.dumps(args, sort_keys=True)`). Any divergence — different key ordering, different whitespace handling, different float-precision rules, different Unicode normalization — silently breaks binding. Issuer and verifier must share the canonicalization implementation (same library, same version) and add a contract test that round-trips representative argument shapes. + +**Critical design constraints** that distinguish this from a vulnerable bearer-token pattern: + +- The grant is **issued by a host authority the model cannot impersonate** — `HostCapabilityIssuer` runs in the control plane and is not exposed to the data plane (not callable by the LLM, not present in tool-call output paths). +- The grant **binds to (tool, args_digest, subject, session_id, expiry)** — a model that captures a grant for one call cannot replay it for a different tool, different arguments, a different session, or after expiry. The unique `jti` per grant is the replay-protection primitive; the verifier must consume it via an atomic store (see Replay-store contract above). +- Verification **fails closed** on every mismatch and validates the signature *before* reading any other claim. +- **Replay protection** via `jti` consume-once. +- **No token material is routed back through the LLM context** — the grant flows through dispatch infrastructure; the model never sees, returns, or echoes the grant. Operational policy: alarm via SAFE-M-12 *Audit Logging* if any LLM output ever contains a string matching the grant signature shape. + ## Testing and Validation 1. **Security Testing**: - - Attempt prompt injection with known payloads - - Verify data cannot influence control flow - - Test capability enforcement + - Replay known prompt-injection corpus (e.g., the AgentDojo evaluation set used by the CaMeL paper) against a protected configuration; verify the data plane cannot mint control-plane decisions. + - Test capability-grant binding by altering each bound field individually (tool, args, subject, session) and verifying fail-closed rejection. + - Test replay detection by replaying a previously-consumed grant and verifying rejection. + - Test trust-boundary enforcement by feeding crafted "trusted-source" content with embedded control directives; verify the parser rejects malformed trusted input. 2. **Functional Testing**: - - Ensure legitimate operations still work - - Measure performance impact - - Validate error handling + - Ensure legitimate operations still work — run the full agent test suite under the protected configuration; measure task-completion delta vs unprotected. + - Measure dispatch overhead per tool call. + - Validate error handling: capability-rejection paths, parser rejection paths, dispatch-layer fault paths. + +## Deployment Considerations + +### Resource Requirements +- Control-plane parser and data-sandbox introduce additional memory/process overhead; the exact profile depends on parser implementation, sandbox type (in-process capability boundary vs separate process vs container), and capability-check granularity. Measure in staging before production rollout. +- Capability issuance and verification require a durable replay-protection store (e.g., Redis or a database) for the consumed `jti` set; size this for peak grant velocity × TTL window. + +### Performance Impact +- Per the CaMeL paper (arXiv 2503.18813 v2, June 24, 2025), the protected configuration achieves **77% secure task completion under attack on the AgentDojo benchmark vs 84% for the undefended baseline** — a 7-percentage-point trade between utility and security guarantees, not a 7% relative reduction. +- Latency overhead per tool call depends on capability-check granularity and parser depth; published per-call latency benchmarks for CaMeL-style separation are limited as of 2026-04. Measure against your specific workload. + +### Monitoring and Alerting +- Alarm on attempted capability escalation that's denied — repeated denials for the same actor + tool indicate either misconfiguration or attack reconnaissance. +- Alarm on parser rejection of trusted-source input — potential trust-boundary violation upstream. +- Alarm on data-plane attempts to invoke control-plane primitives — direct evidence of an injection attempt that the architecture caught. +- Alarm on capability-grant forgery attempts (signature verification failures). +- **Hard-floor alert**: detection of capability material (signature shape, jti, payload fragment) appearing in any LLM output or tool-call output. The architecture's correctness depends on capability material never traversing the model context; a violation is a direct indicator of either implementation bug or active exfiltration attempt. + +### Operational Note +The trust-boundary manifest and capability-issuance signing keys are themselves high-value config. Version the manifest as code (review-gated changes), audit every modification via SAFE-M-12 *Audit Logging*, and gate manifest updates with SAFE-M-69 *Out-of-Band Authorization for Privileged Tool Invocations*. + +## Current Status (2026) + +Reference implementation [CaMeL (Defeating Prompt Injections by Design, Google Research / arXiv 2503.18813)](https://arxiv.org/abs/2503.18813) demonstrates 77% secure task completion under attack on the AgentDojo benchmark vs an 84% undefended baseline. The CaMeL repository describes the public release as a research artifact, notes that it may contain bugs, and explicitly states it might not be fully secure — production deployments should treat it as a starting point for hardening rather than a turn-key system. + +General capability-based security frameworks are well-established ([Saltzer & Schroeder, 1975](https://web.mit.edu/Saltzer/www/publications/protection/)); the application of these principles to LLM-driven agentic systems is an active research area where CaMeL is one notable implementation as of 2026-04. ## References -- [Defeating Prompt Injections by Design - Google Research (2025)](https://arxiv.org/abs/2503.18813) +- [Defeating Prompt Injections by Design — Google Research, arXiv 2503.18813 (2025)](https://arxiv.org/abs/2503.18813) +- [Saltzer & Schroeder — The Protection of Information in Computer Systems (1975)](https://web.mit.edu/Saltzer/www/publications/protection/) — foundational treatment of capability-based security and the principles of least privilege, fail-safe defaults, and complete mediation that this mitigation builds on. +- [NIST SP 800-160 Vol 2 — Developing Cyber-Resilient Systems, Appendix F (Design Principles for Cyber Resilient Systems)](https://csrc.nist.gov/publications/detail/sp/800-160/vol-2-rev-1/final) — Segregation, Non-Persistence, and related design principles that map to control/data flow separation. +- [OWASP Top 10 for Large Language Model Applications — LLM01: Prompt Injection](https://owasp.org/www-project-top-10-for-large-language-model-applications/) — the threat class this mitigation primarily addresses. +- [Model Context Protocol Specification](https://modelcontextprotocol.io/specification) — the protocol layer at which the dispatch boundary is implemented in MCP deployments. ## Related Mitigations -- [SAFE-M-2](../SAFE-M-2/README.md): Cryptographic Integrity -- [SAFE-M-3](../SAFE-M-3/README.md): AI-Powered Content Analysis +- [SAFE-M-2](../SAFE-M-2/README.md): Cryptographic Integrity for Tool Descriptions — protects the trusted-source content that the control-plane parser reads, ensuring the parser is operating on authentic input. +- [SAFE-M-22](../SAFE-M-22/README.md): Semantic Output Validation — runs in M-1's data plane on LLM outputs; M-22 validates content, M-1 ensures that data-plane content cannot mint control-plane decisions even when validation fails. +- [SAFE-M-69](../SAFE-M-69/README.md): Out-of-Band Authorization for Privileged Tool Invocations — privileged-tool gating that consumes M-1's separation guarantees. M-1 ensures the control plane decides *which* privileged tool to invoke; M-69 ensures *whether* that invocation is authorized via an out-of-band channel. ## Version History | Version | Date | Changes | Author | |---------|------|---------|--------| -| 1.0 | 2025-01-03 | Initial documentation | Frederick Kautz | \ No newline at end of file +| 1.0 | 2025-01-03 | Initial documentation | Frederick Kautz | +| 2.0 | 2026-05-02 | Additive expansion to template parity per corpus mitigation quality audit; authored Deployment Considerations and Current Status (2026) sections, added Prerequisites sub-section under Technical Implementation, expanded Mitigates list to 11 directly-mapped citing techniques (curated from 15 raw citers; 4 mislabels excluded with reason notes — tracked as safe-m-1-mislabel-cluster follow-up), added Implementation Example 2 (host-bound capability grant with replay protection and explicit no-token-through-LLM constraint), expanded References (Saltzer & Schroeder 1975, NIST SP 800-160 Vol 2 Appendix F, OWASP LLM01, MCP Specification), updated Related Mitigations (replaced M-3 with M-69 and M-22; preserved M-2). Corrected CaMeL performance framing to "7 percentage-point drop (77% vs 84% on AgentDojo)" with explicit research-artifact caveat | bishnu bista | From 88b41052492a79979a3027dc1143c8051bd2f7a3 Mon Sep 17 00:00:00 2001 From: bishnubista Date: Mon, 4 May 2026 17:34:43 -0700 Subject: [PATCH 2/2] fix(SAFE-M-1): add Out of scope section + Integration Testing template item Cross-PR review of #201-205 surfaced two real defects in PR #203: 1. The v2.0 Mitigates section had two pointers to "see Out of scope" (intro line and end-of-Mitigates note) but no ## Out of scope section existed in the file. Adds the section between Limitations and Implementation Examples (matching the convention used by SAFE-M-5 and SAFE-M-14), enumerating the four mislabeled citers per the safe-m-1-mislabel-cluster audit follow-up: - T1202 cites M-1 as "Token Binding" -> M-31 / M-37 redirect - T1704 cites M-1 as "Strong Authentication" -> transport-auth canonical - T1911 cites M-1 as "Input Validation" -> new "Parameter Validation" canonical (same corpus-side gap as SAFE-M-5's partial-fit cluster) - T1915 cites M-1 as "Input Validation" -> chain-specific canonical 2. mitigations/TEMPLATE.md lists "3. Integration Testing" as the third numbered sub-item under ## Testing and Validation, alongside Security Testing and Functional Testing. The v2.0 file had only items 1 and 2. Adds item 3 with three substantive integration tests: - End-to-end trust-boundary propagation through parser, dispatch layer, and capability issuance authority. - Audit-substrate integration with SAFE-M-12: every capability lifecycle event emits a structured audit event with correlation_id joining issuance to consumption; replay a capability-forgery sequence and confirm the audit chain reconstructs the timeline. - Approval-gate integration with SAFE-M-69: capability issuance for high-risk tools requires out-of-band approval before signing. Signed-off-by: bishnubista --- mitigations/SAFE-M-1/README.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/mitigations/SAFE-M-1/README.md b/mitigations/SAFE-M-1/README.md index 7dbf7c51..69afe2ef 100644 --- a/mitigations/SAFE-M-1/README.md +++ b/mitigations/SAFE-M-1/README.md @@ -90,6 +90,15 @@ Four citers reference SAFE-M-1 with labels that don't match its actual concept ( - **Complexity**: Requires significant architectural changes — a dispatch layer, capability issuance and verification, and a trust-boundary manifest. Not a configuration-only mitigation. - **Not Universal**: Some agent task patterns (e.g., open-ended exploration over untrusted content) may be incompatible with strict separation; operators may need to opt-out specific workflows or design data-driven extension paths that respect the control/data split. +## Out of scope + +Four citers reference SAFE-M-1 with labels that do not correspond to M-1's actual control concept (Control/Data Flow Separation). Each technique's primary defensive ask is heterogeneous and points to a different canonical mitigation; redirect targets are chosen per-case after reading each technique's mitigation-section context. They are excluded from the curated `## Mitigates` list above and tracked as a follow-up cleanup: + +- `techniques/SAFE-T1202/README.md` cites M-1 as **"Architectural Defense - Token Binding"** — wants token-binding / Proof-of-Possession controls. Different control concept (token-binding is about binding a credential to a transport-layer key, not about separating control flow from data flow). Redirect candidate: **SAFE-M-31 *Proof of Possession Tokens*** or **SAFE-M-37 *Token Rotation and Invalidation*** depending on the technique's specific section context. +- `techniques/SAFE-T1704/README.md` cites M-1 as **"Strong Authentication"** — wants mutual TLS, signed server responses, and similar transport-authentication controls. Different control concept (transport authentication, not architectural flow separation). Redirect candidate: a transport-authentication canonical mitigation; verify against the canonical set before assigning. +- `techniques/SAFE-T1911/README.md` cites M-1 as **"Input Validation"** — wants schema enforcement and parameter validation on tool arguments. Different control concept (input validation operates per-call on arguments; M-1 is architectural and operates over the whole control plane). Redirect candidate: **likely a new "Parameter Validation" canonical mitigation** — the corpus does not currently have a parameter-validation mitigation (same corpus-side gap surfaced by SAFE-M-5's partial-fit cluster). +- `techniques/SAFE-T1915/README.md` cites M-1 as **"Input Validation"** — wants blockchain-transaction allowlists / bridge-route controls. Different control concept (chain-specific resource policy, not control/data separation). Redirect candidate: a chain-specific canonical mitigation; verify against the canonical set before assigning. + ## Implementation Examples ### Example 1: Tool Call Protection @@ -219,6 +228,11 @@ class HostCapabilityVerifier: - Measure dispatch overhead per tool call. - Validate error handling: capability-rejection paths, parser rejection paths, dispatch-layer fault paths. +3. **Integration Testing**: + - End-to-end trust-boundary propagation: change a trust-boundary manifest entry and verify the parser, dispatch layer, and capability issuance authority all observe the new boundary within the deployed configuration's update window. + - Audit-substrate integration with [SAFE-M-12](../SAFE-M-12/README.md): every capability lifecycle event (issued / verified / consumed / rejected) emits a structured audit event with `correlation_id` joining issuance to consumption; replay an attempted capability-forgery sequence and confirm the audit chain reconstructs the attack timeline. + - Approval-gate integration with [SAFE-M-69](../SAFE-M-69/README.md): for capability grants whose `tool` falls within an operator-defined high-risk set, capability issuance requires an out-of-band approval before the grant is signed; verify the dispatch layer refuses the call when the approval ticket is absent or expired. + ## Deployment Considerations ### Resource Requirements @@ -262,3 +276,4 @@ General capability-based security frameworks are well-established ([Saltzer & Sc |---------|------|---------|--------| | 1.0 | 2025-01-03 | Initial documentation | Frederick Kautz | | 2.0 | 2026-05-02 | Additive expansion to template parity per corpus mitigation quality audit; authored Deployment Considerations and Current Status (2026) sections, added Prerequisites sub-section under Technical Implementation, expanded Mitigates list to 11 directly-mapped citing techniques (curated from 15 raw citers; 4 mislabels excluded with reason notes — tracked as safe-m-1-mislabel-cluster follow-up), added Implementation Example 2 (host-bound capability grant with replay protection and explicit no-token-through-LLM constraint), expanded References (Saltzer & Schroeder 1975, NIST SP 800-160 Vol 2 Appendix F, OWASP LLM01, MCP Specification), updated Related Mitigations (replaced M-3 with M-69 and M-22; preserved M-2). Corrected CaMeL performance framing to "7 percentage-point drop (77% vs 84% on AgentDojo)" with explicit research-artifact caveat | bishnu bista | +| 2.1 | 2026-05-04 | Added `## Out of scope` section to back the existing "see Out of scope" reference in the Mitigates intro and at the end of Mitigates (v2.0 had two dangling pointers and no section). The section enumerates the four mislabeled citers (T1202 token-binding, T1704 strong-authentication, T1911 input-validation, T1915 input-validation) with per-citer rationale and per-citer redirect-candidate guidance — heterogeneous targets (M-31 / M-37 for T1202; transport-authentication canonical for T1704; new "Parameter Validation" canonical or similar for T1911; chain-specific canonical for T1915), to be confirmed per-case during the `safe-m-1-mislabel-cluster` follow-up. T1911's redirect is the same corpus-side gap surfaced by SAFE-M-5's partial-fit cluster (parameter validation). Also added the missing **Integration Testing** numbered sub-item under `## Testing and Validation` (item 3) — `mitigations/TEMPLATE.md` lists Integration Testing as the third numbered sub-item under that section, alongside Security Testing (item 1) and Functional Testing (item 2); v2.0 had only items 1 and 2. New Integration Testing items cover end-to-end trust-boundary propagation, M-12 audit-substrate integration with capability lifecycle events, and M-69 approval-gate integration for high-risk capability grants. | bishnu bista |