The current P2P skill ingestion pipeline at src/agents/skills/ingest.ts defends against unauthorized injection (Ed25519 signature on the bytes, SHA-256 content hash, SKILL.md schema validation, quarantine-by-default, per-peer rate limiting). What it doesn't defend against: a signed-but-malicious skill from a compromised trusted peer, where the SKILL.md bytes themselves contain adversarial instructions targeting the LLM that later reads the skill.
Threat model
An attacker with control of a previously-trusted pubkey publishes a skill whose content embeds prompt-injection text: "ignore prior instructions," planted role markers, encoded exfil instructions, etc. Under auto policy with that pubkey on the trust list, current code accepts it directly into the skills directory (ingest.ts:139-176). Next time the agent loads the skills snapshot for inference, the payload reaches the LLM with no further gating.
Signing and hashing don't help here. The transport layer can't see what the content layer says.
Proposed defense
Pre-LLM injection scan, run before bumpSkillsSnapshotVersion makes the skill visible to the agent:
- Rule-based heuristics for known patterns (jailbreak strings, unusual role markers, base64/hex-encoded payloads inside descriptions, suspicious YAML frontmatter)
- Optional cheap classifier pass for borderline cases
- Suspicious content gets force-quarantined regardless of pubkey trust level; operator must explicitly approve out of
skills-incoming/
- Existing
validateSkillContent at ingest.ts:303 is the natural place to layer this in as a tiered validator
Out of scope here
Capability sandboxing on what skills can do once active is a separate problem, filed as its own issue. This one is purely about catching adversarial content before the LLM sees it.
Why now
Surfaced from public scrutiny on the architecture (a r/LocalLLaMA reader called this exact gap). The transport-layer defenses are real but the content-layer defense is the right next thing to ship.
The current P2P skill ingestion pipeline at
src/agents/skills/ingest.tsdefends against unauthorized injection (Ed25519 signature on the bytes, SHA-256 content hash, SKILL.md schema validation, quarantine-by-default, per-peer rate limiting). What it doesn't defend against: a signed-but-malicious skill from a compromised trusted peer, where the SKILL.md bytes themselves contain adversarial instructions targeting the LLM that later reads the skill.Threat model
An attacker with control of a previously-trusted pubkey publishes a skill whose content embeds prompt-injection text: "ignore prior instructions," planted role markers, encoded exfil instructions, etc. Under
autopolicy with that pubkey on the trust list, current code accepts it directly into the skills directory (ingest.ts:139-176). Next time the agent loads the skills snapshot for inference, the payload reaches the LLM with no further gating.Signing and hashing don't help here. The transport layer can't see what the content layer says.
Proposed defense
Pre-LLM injection scan, run before
bumpSkillsSnapshotVersionmakes the skill visible to the agent:skills-incoming/validateSkillContentatingest.ts:303is the natural place to layer this in as a tiered validatorOut of scope here
Capability sandboxing on what skills can do once active is a separate problem, filed as its own issue. This one is purely about catching adversarial content before the LLM sees it.
Why now
Surfaced from public scrutiny on the architecture (a r/LocalLLaMA reader called this exact gap). The transport-layer defenses are real but the content-layer defense is the right next thing to ship.