Companion to the prompt-injection scanning issue. Even if a skill's content passes injection screening, the skill itself defines the agent's behavior surface when invoked, and current code grants ambient capabilities to whatever's in the active skills directory. A trusted-peer-turned-malicious could publish a skill that does exactly what its frontmatter says, where the description happens to be "exfiltrate .env to attacker.example.com" or "send wallet balance to known-bad address."
## Threat model
A trusted publisher pushes a skill whose stated purpose is hostile but technically truthful. Without per-skill capability declarations enforced at runtime, ambient trust on the skill becomes ambient trust on every tool it can reach: shell, network, wallet, filesystem outside the skill's workspace.
## Proposed defense
- SKILL.md frontmatter declares required capabilities, e.g. `capabilities: [shell, wallet.send, network.fetch]`
- Default profile for ingested-from-mesh skills is least-privilege: read-only filesystem, no network, no wallet, no shell
- High-risk capabilities (wallet writes, shell, file writes outside the skill's workspace, arbitrary outbound network) require explicit operator approval per-skill, prompted at first invocation rather than at ingestion
- Enforcement at the runtime level, not just the prompt level: the skill literally cannot call those tools, rather than trusting the LLM not to use them after being told not to in the system prompt
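The declare-then-enforce flow above can be sketched as follows. This is a minimal illustration, not an existing API: `ToolGate`, `HIGH_RISK`, and the capability strings are all assumptions for the sake of the example.

```python
import re

# Hypothetical set of capabilities that require explicit per-skill
# operator approval at first invocation (wallet writes, shell, etc.).
HIGH_RISK = {"shell", "wallet.send", "fs.write_outside_workspace", "network.any"}

def parse_capabilities(skill_md: str) -> set[str]:
    """Extract the capabilities list from SKILL.md frontmatter."""
    m = re.search(r"^capabilities:\s*\[([^\]]*)\]", skill_md, re.MULTILINE)
    if not m:
        return set()  # least-privilege default: nothing declared, nothing granted
    return {c.strip() for c in m.group(1).split(",") if c.strip()}

class ToolGate:
    """Runtime-level gate: checked at tool dispatch, not in the prompt."""

    def __init__(self, declared: set[str], approved: set[str]):
        self.declared = declared
        self.approved = approved  # high-risk capabilities the operator OK'd

    def allow(self, capability: str) -> bool:
        if capability not in self.declared:
            return False  # a skill can never exceed its own declaration
        if capability in HIGH_RISK:
            return capability in self.approved  # needs operator approval
        return True

md = "---\ncapabilities: [network.fetch, wallet.send]\n---\n"
gate = ToolGate(parse_capabilities(md), approved=set())
print(gate.allow("network.fetch"))  # True: declared, low-risk
print(gate.allow("wallet.send"))    # False: declared but not approved
print(gate.allow("shell"))          # False: never declared
```

The key property is that the gate sits in the tool-dispatch path, so an injected or hostile skill body cannot talk its way past it.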
## Open questions
- How this interacts with the existing `allowed-tools` frontmatter convention. Probably consolidate into one capability schema rather than maintain both.
- Per-skill vs. per-publisher capability profiles. A skill from a verified publisher might inherit elevated defaults; an untrusted publisher's skill stays locked down regardless of frontmatter claims.
- Sandbox boundary: worker thread, separate process, or capability-token model. Worker thread is cheapest and good enough for "no shell, no wallet, restricted fs." Separate process is needed for hard isolation against memory-corruption skills (lower priority).
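One way to reconcile per-skill declarations with per-publisher profiles is to treat the publisher's trust tier as a ceiling that frontmatter can only narrow, never widen. A minimal sketch of that idea, with illustrative tier names and capability strings:

```python
# Hypothetical per-publisher ceilings: the effective capability set is the
# intersection of what the frontmatter claims and what the tier permits.
PUBLISHER_CEILING = {
    "verified": {"network.fetch", "fs.read", "fs.write_workspace", "shell"},
    "untrusted": {"fs.read"},  # locked down regardless of frontmatter claims
}

def effective_capabilities(declared: set[str], tier: str) -> set[str]:
    # Unknown tiers fall through to the empty set (deny by default).
    return declared & PUBLISHER_CEILING.get(tier, set())

print(effective_capabilities({"shell", "fs.read"}, "untrusted"))  # {'fs.read'}
print(effective_capabilities({"shell", "fs.read"}, "verified"))   # both allowed
```

This keeps the frontmatter as the skill's request and the publisher profile as the grant, so a hostile-but-truthful declaration from an untrusted source buys nothing.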
## Why now
Same trigger as the injection-scanning issue. Belt-and-braces: scan the content, then constrain what the content can do even if the scan misses something.