Security: prompt-injection scanning on inbound P2P skill content before LLM exposure

The current P2P skill ingestion pipeline at `src/agents/skills/ingest.ts` defends against unauthorized injection (Ed25519 signature on the bytes, SHA-256 content hash, SKILL.md schema validation, quarantine-by-default, per-peer rate limiting). What it doesn't defend against: a signed-but-malicious skill from a compromised trusted peer, where the SKILL.md bytes themselves contain adversarial instructions targeting the LLM that later reads the skill.

## Threat model

An attacker with control of a previously-trusted pubkey publishes a skill whose content embeds prompt-injection text: "ignore prior instructions," planted role markers, encoded exfil instructions, etc. Under `auto` policy with that pubkey on the trust list, current code accepts it directly into the skills directory (`ingest.ts:139-176`). Next time the agent loads the skills snapshot for inference, the payload reaches the LLM with no further gating.

Signing and hashing don't help here. The transport layer can't see what the content layer says.

## Proposed defense

Pre-LLM injection scan, run before `bumpSkillsSnapshotVersion` makes the skill visible to the agent:

- Rule-based heuristics for known patterns (jailbreak strings, unusual role markers, base64/hex-encoded payloads inside descriptions, suspicious YAML frontmatter)
- Optional cheap classifier pass for borderline cases
- Suspicious content gets force-quarantined regardless of pubkey trust level; operator must explicitly approve out of `skills-incoming/`
- Existing `validateSkillContent` at `ingest.ts:303` is the natural place to layer this in as a tiered validator

## Out of scope here

Capability sandboxing on what skills can do once active is a separate problem, filed as its own issue. This one is purely about catching adversarial **content** before the LLM sees it.

## Why now

Surfaced from public scrutiny on the architecture (a r/LocalLLaMA reader called this exact gap). The transport-layer defenses are real but the content-layer defense is the right next thing to ship.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security: prompt-injection scanning on inbound P2P skill content before LLM exposure #20

Threat model

Proposed defense

Out of scope here

Why now

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Security: prompt-injection scanning on inbound P2P skill content before LLM exposure #20

Description

Threat model

Proposed defense

Out of scope here

Why now

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions