From 039d288b8bb208e36d17f28968a1d9ad77da5fa5 Mon Sep 17 00:00:00 2001
From: Algis Dumbris <a.dumbris@gmail.com>
Date: Sun, 28 Jun 2026 10:03:10 +0300
Subject: [PATCH 1/2] docs(security): document the deterministic tool-scanner
 detect engine (Spec 076 T022)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds docs/features/tool-scanner.md covering the offline detect engine behind
the built-in tpa-descriptions scanner:

- the six checks (unicode.hidden / shadowing.cross_server / payload.decoded —
  hard tier; directive.imperative / capability.mismatch / secret.embedded —
  soft tier)
- the two-tier model (hard auto-quarantines; soft severity = distinct soft-check
  count 1->low/2->medium/3+->high; consensus adds to confidence/risk score)
- the eval gate (scan-eval --gate --min-recall 0.90 --max-fp 0.05, exit 6 on
  breach) and its blocking CI wiring in .github/workflows/eval.yml
- the offline / no-egress guarantee (no I/O, deterministic, recover-isolated)
- normalization rules (raw-text hidden-Unicode + secrets, normalized phrases)

Also expands the tpa-descriptions row in security-scanner-plugins.md to point
at the new page, links it from Related reading, registers it in the docs
sidebar, and checks off T013-T019 + T022 in the Spec 076 tasks checklist.

Docs-only change (exempt from TDD per CLAUDE.md). No code touched.

Related: Spec 076 (specs/076-deterministic-tool-scanner)
---
 docs/features/security-scanner-plugins.md     |   3 +-
 docs/features/tool-scanner.md                 | 259 ++++++++++++++++++
 specs/076-deterministic-tool-scanner/tasks.md |  16 +-
 website/sidebars.js                           |   1 +
 4 files changed, 270 insertions(+), 9 deletions(-)
 create mode 100644 docs/features/tool-scanner.md

diff --git a/docs/features/security-scanner-plugins.md b/docs/features/security-scanner-plugins.md
index 379781dfc..e38618009 100644
--- a/docs/features/security-scanner-plugins.md
+++ b/docs/features/security-scanner-plugins.md
@@ -118,7 +118,7 @@ MCPProxy ships with a bundled registry of 8 scanners. The bundled list lives in
 | `nova-proximity` | MCPProxy (NOVA-inspired rules) | source | — | Keyword-based, fully offline. Very fast. |
 | `ramparts` | Javelin | source | — | Rust-based YARA scanner. Runs fully offline: v0.8.x scans a live MCP endpoint, so MCPProxy replays the captured tool definitions to it over stdio (the upstream is never re-executed). *(`amd64`-only image; runs under emulation on arm64 — see [Scanner Images](/features/scanner-images).)* |
 | `semgrep-mcp` | Semgrep | source | — | Static analysis with MCP-specific rules. Uses the upstream `returntocorp/semgrep:latest` image. |
-| `tpa-descriptions` | MCPProxy | source | — | **Built-in, Docker-less, always on.** In-process analysis of tool descriptions/schemas for Tool-Poisoning-Attack indicators (hidden instructions, prompt-injection phrasing, data-exfiltration hints) and embedded secrets. Also runs the deterministic offline detection engine (Spec 076): hidden-Unicode smuggling (zero-width/bidi/tag-block/PUA), cross-server tool shadowing, and base64/hex payloads that decode to shell/exfil commands — each finding carries a `confidence` score and the contributing check `signals`. Runs for any connected server — including remote `http`/`sse` servers with no source or Docker. |
+| `tpa-descriptions` | MCPProxy | source | — | **Built-in, Docker-less, always on.** In-process analysis of tool descriptions/schemas via the deterministic offline [detect engine (Spec 076)](/features/tool-scanner): six checks across two tiers — **hard** (hidden-Unicode smuggling, cross-server shadowing, decode-to-shell payloads) auto-quarantine; **soft** (prompt-injection directives, capability-mismatch, embedded secrets) raise a review item. Each finding carries a `confidence` score and the contributing check `signals`. Fully offline (no network/filesystem/Docker), deterministic, and runs for any connected server — including remote `http`/`sse` servers with no source or Docker. See [Tool Scanner](/features/tool-scanner) for the full rule reference and the CI eval gate. |
 | `trivy-mcp` | Aqua Security | source, container_image | — | Filesystem + CVE scan. Uses the upstream `ghcr.io/aquasecurity/trivy:latest` image. |
 
 See [Scanner Images](/features/scanner-images) for the image sources and why vendor images are preferred over custom wrappers.
@@ -343,6 +343,7 @@ The Security page at `/security` in the Web UI mirrors the CLI and provides:
 
 ## Related reading
 
+- [Tool Scanner (Spec 076)](/features/tool-scanner) — the built-in offline detect engine behind `tpa-descriptions`: the six checks, two-tier model, and CI eval gate
 - [Security Commands](/cli/security-commands) — exhaustive CLI reference
 - [Scanner Images](/features/scanner-images) — where each Docker image comes from
 - [Security Quarantine](/features/security-quarantine) — the underlying quarantine mechanism that scanners gate
diff --git a/docs/features/tool-scanner.md b/docs/features/tool-scanner.md
new file mode 100644
index 000000000..499ce6006
--- /dev/null
+++ b/docs/features/tool-scanner.md
@@ -0,0 +1,259 @@
+---
+id: tool-scanner
+title: Deterministic Tool Scanner (Spec 076)
+sidebar_label: Tool Scanner (detect engine)
+description: The offline, deterministic in-process detection engine that scans MCP tool definitions for hidden-Unicode smuggling, cross-server shadowing, decoded shell payloads, prompt-injection directives, capability mismatch, and embedded secrets.
+keywords: [security, tool-poisoning, prompt-injection, unicode-smuggling, shadowing, detection, offline, deterministic, quarantine, mcp]
+---
+
+# Deterministic Tool Scanner (Spec 076)
+
+The **detect engine** (`internal/security/detect/`) is the deterministic, fully-offline
+in-process detector that analyzes every upstream tool's definition — name,
+description, input schema, and output schema — for tool-poisoning and
+prompt-injection attacks. It is what powers the built-in, Docker-less
+[`tpa-descriptions` scanner](/features/security-scanner-plugins#scanner-registry),
+so it runs for **every connected server**, including remote `http`/`sse`
+servers that have no source code or Docker container to scan.
+
+> This page documents the detection rules themselves. For the scanner plugin
+> framework that hosts them (SARIF orchestration, the Docker-based scanners, the
+> approval workflow), see [Security Scanner Plugins](/features/security-scanner-plugins).
+> For the per-tool hash-based approval that quarantine decisions feed into, see
+> [Tool Quarantine (Spec 032)](/features/tool-quarantine).
+
+## Offline / no-egress guarantee
+
+The detect engine performs **no I/O of any kind**. It imports no networking
+(`net`, `net/http`), no process execution (`os/exec`), no filesystem access
+(`os`), and no HTTP or Docker client. Detection runs purely over the in-memory
+tool definitions the caller supplies. This is not a convention — it is enforced
+by a standing import-guard test (`internal/security/detect/imports_test.go`)
+that fails the build if any forbidden import is added (FR-001).
+
+Three properties hold by construction:
+
+- **Offline** — no network, filesystem, Docker, external API, or LLM is ever
+  consulted. Safe to run in air-gapped deployments.
+- **Deterministic** — identical input yields byte-identical output, including
+  the ordering of findings and signals. No maps are iterated for output
+  ordering; no clocks or randomness are consulted.
+- **Total** — every check runs under `recover()`. A check that panics or errors
+  is isolated, counted as degraded coverage, and never aborts the scan. A
+  degraded scan still returns the findings from every other check (the same way
+  the external scanner pipeline surfaces `scanners_failed`).
+
+## The two-tier model
+
+Each check emits zero or more **signals**, and every signal carries a **tier**:
+
+| Tier | What it means | Effect on the tool |
+|------|---------------|--------------------|
+| **Hard** | A structural attack that essentially never appears in a legitimate tool definition (near-zero false positive). | **Auto-quarantines** the affected tool/server. |
+| **Soft** | A phrased or heuristic indicator that *can* appear in benign tooling (e.g. a security tool that legitimately mentions attack strings). | **Raises the tool for human review only** — never auto-quarantines on its own. |
+
+The per-tool aggregation combines all of a tool's signals into a single
+finding (`internal/security/detect/aggregate.go`):
+
+- **Any hard signal → dangerous.** The tool is quarantined regardless of what
+  else fired (FR-004).
+- **Soft-only severity is driven by the count of _distinct_ checks that fired**
+  (FR-005): `1 → low`, `2 → medium`, `3+ → high`. A single soft signal is a
+  low-severity review item; three independent soft checks agreeing on the same
+  tool is high severity.
+- **Independent signals add to confidence and risk score** rather than being
+  deduplicated away (FR-006). When multiple independent checks agree on a tool,
+  that agreement is visible in the finding's `confidence` and raises the
+  aggregated risk score, instead of collapsing to one entry keyed on
+  `(rule_id + location)`.
+- **Every finding exposes its `confidence` value and the list of contributing
+  check IDs** (`signals`), so an operator can see *why* a tool was flagged and
+  how strongly (FR-010). These surface in the CLI report (`Confidence:` /
+  `Signals:` lines) and in the REST scan report JSON.
+
+### Normalization (FR-007)
+
+Phrase-matching checks (directive, capability, embedded-secret position logic)
+run over a **normalized** form of the text: Unicode-normalized (NFKC),
+zero-width / format-rune stripped, lowercased, whitespace-collapsed, and lightly
+stemmed. Normalization defeats trivial wording variants — `don't disclose` and
+`do not tell the user` collapse to the same matchable form (SC-004).
+
+Crucially, the **hidden-Unicode check runs on the RAW text _before_
+normalization** — normalization strips exactly the invisible characters that
+check exists to detect, so running it on normalized text would hide the attack.
+The embedded-secret check likewise scans **raw** text, because secrets are
+case-sensitive and exact (lowercasing would fold the very bytes the matchers
+key on, e.g. `AKIA…` prefixes).
+
+## The six checks
+
+Three **hard** structural checks and three **soft** heuristic checks.
+
+### Hard tier
+
+#### `unicode.hidden` — hidden-Unicode smuggling
+
+Flags invisible / format-control runes smuggled into a tool's **raw**
+description or schema text: zero-width joiners/spaces, bidirectional controls,
+Unicode TAG-block characters, and Private-Use-Area code points. These never
+appear in a legitimate human-readable tool description, so a hit is near-zero
+false-positive.
+
+**Escalation:** a description carrying **≥3 distinct hidden classes**, or
+TAG-block characters that **decode to a printable ASCII message**, is rated
+near-certain (critical); a single class is still hard but high.
+
+#### `shadowing.cross_server` — cross-server tool impersonation
+
+Flags two cross-server attack shapes, using the read-only registry snapshot of
+all servers' tools:
+
+1. **Name collision** — a *distinctive* tool name exposed by two different
+   servers (one impersonating the other so an agent calls the wrong one).
+2. **Cross-server reference** — a tool whose description names a *distinctive*
+   tool that lives on a different server (steering the agent's tool selection).
+
+To hold near-zero FP, both shapes require the name to be **distinctive**:
+generic verbs (`search`, `get`, `list`) collide across servers all the time and
+are never flagged. A tool referencing its **own** name is also ignored.
+
+#### `payload.decoded` — decode-then-confirm shell payload
+
+Decodes base64/hex blobs embedded in a description or schema and flags **only
+when the decoded bytes are a shell/exfiltration command** — `curl … | sh`,
+`wget … | sh`, `chmod`, `rm -rf`, a pipe-to-shell, or a raw `IP:port`
+reverse-shell target (FR-008). Benign encoded data (an icon, a JSON config)
+decodes to non-matching/non-printable bytes and is never flagged. The
+**evidence presents the decoded content**, so an operator sees exactly what was
+hidden — not the encoded string.
+
+### Soft tier
+
+#### `directive.imperative` — prompt-injection directives
+
+Flags prompt-injection directives smuggled into a description: hidden-instruction
+tags (`<IMPORTANT>…`), secrecy imperatives ("do not tell the user"), instruction
+overrides ("ignore previous instructions"), and tool-preamble injections
+("before using this tool, first …"). Runs over **normalized** text.
+
+Each hit is **position-classified** (FR-009): a phrase that is quoted or
+illustrated — *"detects prompts such as 'ignore previous instructions'"* — is
+example-position and discounted below the emit threshold, so legitimate security
+tooling that merely *describes* these phrases is not flagged. The same phrase in
+imperative position ("before using this tool, read ~/.ssh/id_rsa") retains full
+confidence. This is the core false-positive control for legitimate security
+documentation.
+
+#### `capability.mismatch` — declared-vs-implied capability gap
+
+Flags a gap between what a tool *declares* it does and what it *implies* it
+touches:
+
+- **Declared-vs-implied** — a tool whose declared purpose is pure computation or
+  string manipulation (name/lead sentence like `add`, `to_uppercase`) that
+  nevertheless references a sensitive resource it has no business touching
+  (`~/.ssh`, `/etc/passwd`, an external URL, a shell). A calculator reading
+  `id_rsa` is a classic exfiltration tell.
+- **Unexplained data-sink param** — a free-form input named like an
+  exfiltration channel (`sidenote`, `scratchpad`) that the description never
+  explains — the model is steered to stuff stolen data into it.
+
+The declared category is taken from the tool **name and its leading sentence**,
+not the full description, so an attacker's benign cover sentence still anchors
+the declaration while the smuggled access in the rest of the text is treated as
+implied. Tools that legitimately declare file/network/system access are
+therefore **not** flagged for touching those resources.
+
+#### `secret.embedded` — hardcoded live credential
+
+Flags a live credential hardcoded into a description or schema — an AWS key, a
+private key, a database password, a Luhn-valid card, etc. It wraps the shared
+`internal/security/patterns/` matchers (the same set used by
+[sensitive-data detection](/features/sensitive-data-detection)) and carries each
+match's **per-match confidence**: a validated card / live cloud key is high; a
+documented placeholder (`AKIA…EXAMPLE`) collapses to near-zero and is dropped.
+Scans **raw** text (secrets are case-sensitive). Being soft, a hit raises a
+review item rather than auto-quarantining — an embedded secret may be a careless
+example as easily as a planted one.
+
+### At a glance
+
+| Check ID | Tier | Catches |
+|----------|------|---------|
+| `unicode.hidden` | hard | Zero-width / bidi / TAG-block / PUA character smuggling (raw text) |
+| `shadowing.cross_server` | hard | Distinctive tool name collision or cross-server reference |
+| `payload.decoded` | hard | base64/hex blob that decodes to a shell/exfil command |
+| `directive.imperative` | soft | Injection directives, secrecy imperatives, instruction overrides (normalized, position-discounted) |
+| `capability.mismatch` | soft | Compute/string tool touching `~/.ssh` etc.; unexplained data-sink param |
+| `secret.embedded` | soft | Hardcoded live credential (confidence-scored, placeholders dropped) |
+
+## The eval gate (CI-enforced reliability)
+
+Reliability is enforced as a number the build checks, so the detector cannot
+silently regress (the original keyword detector drifted to ~10% recall
+unnoticed). A labeled corpus runs as a **blocking CI gate**:
+
+```bash
+go run ./cmd/scan-eval \
+  --corpus specs/065-evaluation-foundation/datasets/detect_corpus_v1.json \
+  --gate --min-recall 0.90 --max-fp 0.05
+```
+
+- **Recall ≥ 0.90** on malicious entries and **false-positive rate ≤ 0.05** on
+  the **hard-negative** set (benign tools that deliberately resemble attacks).
+  Clean-benign entries are reported for transparency but do **not** dilute the
+  gated FP rate — only the hard-negative FP rate feeds the gate decision
+  (SC-002).
+- On a breach the command prints a `GATE FAILED: …` reason and exits with code
+  **6** (distinct from config/write errors so CI can tell a real regression
+  from a tooling fault). On success it prints `GATE PASSED: …` and exits `0`.
+- It always prints a per-category recall/precision/FP/F1 JSON scorecard to
+  stdout for the CI log.
+
+**CI wiring:** the gate runs as a blocking step in the `security-d2` job of
+[`.github/workflows/eval.yml`](https://github.com/smart-mcp-proxy/mcpproxy-go/blob/main/.github/workflows/eval.yml).
+The job is pure Go + Python with no live upstreams, so it is fast and
+hermetic (FR-013, SC-006).
+
+### Corpus and category gating
+
+The labeled corpus lives at
+`specs/065-evaluation-foundation/datasets/detect_corpus_v1.json` (separate from
+the immutable `security_corpus_v1.json`; it carries the server/tool/schema/peers
+context the detect engine needs). Each entry is labeled `malicious` or
+`benign`, tagged with a category (e.g. `unicode_smuggling`, `decoded_payload`,
+`shadowing`, `capability_mismatch`), and hard-negatives record which attack
+class they `resemble` so a false positive is attributed to that category.
+
+A category is only **enforced** by the gate when its corresponding check is
+registered in the gate's check list (`gateChecks()` in `cmd/scan-eval/gate.go`).
+This is a forward-compatibility mechanism: a category whose check is not yet in
+the gate list is **measured and reported but never fails the build
+prematurely**. When a new check is wired into the gate list, the gate begins
+enforcing its category.
+
+## How it plugs in (unchanged entry points)
+
+The detect engine is invoked from `internal/security/scanner/inprocess.go`,
+which projects the connected servers' parsed tool definitions into a
+`RegistryView` and renders each `detect.Finding` 1:1 into the existing
+`ScanFinding` type (additively carrying `Confidence` and `Signals`). Because the
+finding shape is preserved, all existing entry points keep working unchanged
+(FR-015):
+
+- CLI `mcpproxy security scan <server>`
+- REST `POST /api/v1/servers/{name}/scan`
+- the `quarantine_security` MCP tool
+
+It reuses — rather than rebuilds — the Spec-032 quarantine hashing, the
+quarantine state machine, the aggregated-report types, and the
+`internal/security/patterns/` secret matchers (FR-012).
+
+## Related reading
+
+- [Security Scanner Plugins](/features/security-scanner-plugins) — the plugin framework hosting the `tpa-descriptions` scanner
+- [Security Quarantine](/features/security-quarantine) — the quarantine mechanism hard-tier findings drive
+- [Tool Quarantine (Spec 032)](/features/tool-quarantine) — per-tool hash-based approval
+- [Sensitive-Data Detection](/features/sensitive-data-detection) — the shared secret matchers the embedded-secret check reuses
+- Spec: `specs/076-deterministic-tool-scanner/spec.md` · engine contract: `internal/security/detect/doc.go`
diff --git a/specs/076-deterministic-tool-scanner/tasks.md b/specs/076-deterministic-tool-scanner/tasks.md
index 3332be446..dca534a96 100644
--- a/specs/076-deterministic-tool-scanner/tasks.md
+++ b/specs/076-deterministic-tool-scanner/tasks.md
@@ -56,10 +56,10 @@ Single Go module. New package `internal/security/detect/` (engine + `checks/`);
 
 **Independent test**: Hard-negative corpus entries stay unflagged-as-dangerous; matching malicious entries are caught.
 
-- [ ] T013 [P] [US2] Write `internal/security/detect/checks/directive_imperative_test.go` (MUST-flag `<IMPORTANT>`/"before using this tool"/"do not tell the user"/"ignore previous instructions" and variants over NORMALIZED text; MUST-NOT-flag example-position usage) per FR-009; then implement `directive_imperative.go` using regex families + the position classifier.
-- [ ] T014 [P] [US2] Write `internal/security/detect/checks/capability_mismatch_test.go` (MUST-flag a math/string tool that reads `~/.ssh` or has an unexplained data-sink param like "sidenote"; MUST-NOT-flag a file tool that legitimately reads files); then implement `capability_mismatch.go` (declared-vs-implied + unused-param heuristic).
-- [ ] T015 [P] [US2] Add a per-match confidence to `internal/security/patterns/` matchers (validated card/Luhn → high; entropy-only → low) without changing existing call sites' behavior; update the patterns tests.
-- [ ] T016 [US2] Write `internal/security/detect/checks/embedded_secret_test.go`; then implement `embedded_secret.go` wrapping `patterns/` with confidence, register all three soft checks in the engine.
+- [x] T013 [P] [US2] Write `internal/security/detect/checks/directive_imperative_test.go` (MUST-flag `<IMPORTANT>`/"before using this tool"/"do not tell the user"/"ignore previous instructions" and variants over NORMALIZED text; MUST-NOT-flag example-position usage) per FR-009; then implement `directive_imperative.go` using regex families + the position classifier.
+- [x] T014 [P] [US2] Write `internal/security/detect/checks/capability_mismatch_test.go` (MUST-flag a math/string tool that reads `~/.ssh` or has an unexplained data-sink param like "sidenote"; MUST-NOT-flag a file tool that legitimately reads files); then implement `capability_mismatch.go` (declared-vs-implied + unused-param heuristic).
+- [x] T015 [P] [US2] Add a per-match confidence to `internal/security/patterns/` matchers (validated card/Luhn → high; entropy-only → low) without changing existing call sites' behavior; update the patterns tests.
+- [x] T016 [US2] Write `internal/security/detect/checks/embedded_secret_test.go`; then implement `embedded_secret.go` wrapping `patterns/` with confidence, register all three soft checks in the engine.
 
 **Checkpoint**: US1 + US2 — full six-check detector with FP discrimination.
 
@@ -71,9 +71,9 @@ Single Go module. New package `internal/security/detect/` (engine + `checks/`);
 
 **Independent test**: `scan-eval --gate` exits non-zero when recall < 0.90 or hard-negative FP > 5%.
 
-- [ ] T017 [P] [US3] Expand the labeled corpus in `specs/065-evaluation-foundation/datasets/` with new categories (unicode_smuggling, decoded_payload, capability_mismatch, shadowing) and additional hard-negatives; author original equivalents where external licensing is unclear (FR-014). Update the dataset README + counts.
-- [ ] T018 [US3] Add `--gate --min-recall --max-fp` mode to `cmd/scan-eval/` that runs the new `detect.Engine` over the corpus, prints per-category recall/precision/FP/F1 JSON, and exits non-zero on breach; write `cmd/scan-eval` test for the gate exit logic.
-- [ ] T019 [US3] Wire the gate into the existing CI test workflow (`.github/workflows/…`) as a blocking step `scan-eval --gate --min-recall 0.90 --max-fp 0.05` (FR-013, SC-006).
+- [x] T017 [P] [US3] Expand the labeled corpus in `specs/065-evaluation-foundation/datasets/` with new categories (unicode_smuggling, decoded_payload, capability_mismatch, shadowing) and additional hard-negatives; author original equivalents where external licensing is unclear (FR-014). Update the dataset README + counts.
+- [x] T018 [US3] Add `--gate --min-recall --max-fp` mode to `cmd/scan-eval/` that runs the new `detect.Engine` over the corpus, prints per-category recall/precision/FP/F1 JSON, and exits non-zero on breach; write `cmd/scan-eval` test for the gate exit logic.
+- [x] T019 [US3] Wire the gate into the existing CI test workflow (`.github/workflows/…`) as a blocking step `scan-eval --gate --min-recall 0.90 --max-fp 0.05` (FR-013, SC-006).
 
 **Checkpoint**: reliability is enforced; recall ≥ 0.90 / FP ≤ 5% proven by the gate.
 
@@ -94,7 +94,7 @@ Single Go module. New package `internal/security/detect/` (engine + `checks/`);
 
 ## Phase 7: Polish & Cross-Cutting Concerns
 
-- [ ] T022 [P] Document the six checks, the two-tier model, and the eval gate in `docs/features/` (extend security-quarantine.md / sensitive-data-detection.md or add tool-scanner.md); note offline/no-egress guarantee.
+- [x] T022 [P] Document the six checks, the two-tier model, and the eval gate in `docs/features/` (extend security-quarantine.md / sensitive-data-detection.md or add tool-scanner.md); note offline/no-egress guarantee.
 - [ ] T023 [P] Run `gofmt`/`goimports` and `golangci-lint run --config .github/.golangci.yml ./internal/security/... ./cmd/scan-eval/...`; fix findings.
 - [ ] T024 Full verification: `go test -race ./internal/security/... ./cmd/scan-eval/...`, `./scripts/test-api-e2e.sh`, and the corpus gate; confirm SC-001…SC-007 and update the spec checklist.
 
diff --git a/website/sidebars.js b/website/sidebars.js
index 4c421451b..15342d1ad 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -71,6 +71,7 @@ const sidebars = {
         'features/oauth-authentication',
         'features/code-execution',
         'features/security-quarantine',
+        'features/tool-scanner',
         'features/search-discovery',
         'features/version-updates',
       ],

From a59b4f1b6738aacd988f4ae1744bcd70e94d36db Mon Sep 17 00:00:00 2001
From: Algis Dumbris <a.dumbris@gmail.com>
Date: Sun, 28 Jun 2026 11:17:13 +0300
Subject: [PATCH 2/2] docs(security): clarify legacy TPA rules coexist with the
 detect engine
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CodexReviewer review of #780: the docs overstated that tpa-descriptions is
purely the new two-tier detect engine. The live scanner
(internal/security/scanner/inprocess.go) still appends the legacy TPA keyword
rules (tpa_hidden_instructions / prompt_injection_in_description /
data_exfiltration_in_description) after the detect-engine findings, and those
are ThreatLevelDangerous — they block security approve and drive the summary
to dangerous (confirmed by e2e_tpa_smoke_test.go).

Documents the current coexistence accurately:
- tool-scanner.md: scope note on the two-tier table + a new "Coexistence with
  the legacy TPA rules" subsection + a plug-in-section pointer; the
  "soft never auto-quarantines" rule is the detect-engine's, not the legacy
  rules'.
- security-scanner-plugins.md: tpa-descriptions row notes the still-active
  dangerous legacy rules.

Folding the legacy rules into the detect engine remains a separate
implementation change (out of scope for this docs PR).

Related: Spec 076 (specs/076-deterministic-tool-scanner)

Co-Authored-By: Paperclip <noreply@paperclip.ing>
---
 docs/features/security-scanner-plugins.md |  2 +-
 docs/features/tool-scanner.md             | 47 ++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/docs/features/security-scanner-plugins.md b/docs/features/security-scanner-plugins.md
index e38618009..f071ce0d6 100644
--- a/docs/features/security-scanner-plugins.md
+++ b/docs/features/security-scanner-plugins.md
@@ -118,7 +118,7 @@ MCPProxy ships with a bundled registry of 8 scanners. The bundled list lives in
 | `nova-proximity` | MCPProxy (NOVA-inspired rules) | source | — | Keyword-based, fully offline. Very fast. |
 | `ramparts` | Javelin | source | — | Rust-based YARA scanner. Runs fully offline: v0.8.x scans a live MCP endpoint, so MCPProxy replays the captured tool definitions to it over stdio (the upstream is never re-executed). *(`amd64`-only image; runs under emulation on arm64 — see [Scanner Images](/features/scanner-images).)* |
 | `semgrep-mcp` | Semgrep | source | — | Static analysis with MCP-specific rules. Uses the upstream `returntocorp/semgrep:latest` image. |
-| `tpa-descriptions` | MCPProxy | source | — | **Built-in, Docker-less, always on.** In-process analysis of tool descriptions/schemas via the deterministic offline [detect engine (Spec 076)](/features/tool-scanner): six checks across two tiers — **hard** (hidden-Unicode smuggling, cross-server shadowing, decode-to-shell payloads) auto-quarantine; **soft** (prompt-injection directives, capability-mismatch, embedded secrets) raise a review item. Each finding carries a `confidence` score and the contributing check `signals`. Fully offline (no network/filesystem/Docker), deterministic, and runs for any connected server — including remote `http`/`sse` servers with no source or Docker. See [Tool Scanner](/features/tool-scanner) for the full rule reference and the CI eval gate. |
+| `tpa-descriptions` | MCPProxy | source | — | **Built-in, Docker-less, always on.** In-process analysis of tool descriptions/schemas via the deterministic offline [detect engine (Spec 076)](/features/tool-scanner): six checks across two tiers — **hard** (hidden-Unicode smuggling, cross-server shadowing, decode-to-shell payloads) auto-quarantine; **soft** (prompt-injection directives, capability-mismatch, embedded secrets) raise a review item. Each finding carries a `confidence` score and the contributing check `signals`. **It currently also runs a set of still-active legacy TPA keyword rules** (`tpa_hidden_instructions`, `prompt_injection_in_description`, `data_exfiltration_in_description`) that produce their own **dangerous, approval-blocking** findings — so the detect engine's "soft never auto-quarantines" rule applies to its own signals, not to those legacy rules (which can still block on the same phrases). Fully offline (no network/filesystem/Docker), deterministic, and runs for any connected server — including remote `http`/`sse` servers with no source or Docker. See [Tool Scanner](/features/tool-scanner) for the full rule reference, the legacy-rule coexistence, and the CI eval gate. |
 | `trivy-mcp` | Aqua Security | source, container_image | — | Filesystem + CVE scan. Uses the upstream `ghcr.io/aquasecurity/trivy:latest` image. |
 
 See [Scanner Images](/features/scanner-images) for the image sources and why vendor images are preferred over custom wrappers.
diff --git a/docs/features/tool-scanner.md b/docs/features/tool-scanner.md
index 499ce6006..081e51800 100644
--- a/docs/features/tool-scanner.md
+++ b/docs/features/tool-scanner.md
@@ -45,7 +45,17 @@ Three properties hold by construction:
 
 ## The two-tier model
 
-Each check emits zero or more **signals**, and every signal carries a **tier**:
+> **Scope of "soft never auto-quarantines":** the two-tier semantics below
+> describe the **detect-engine signals** specifically. The live `tpa-descriptions`
+> scanner currently runs the detect engine *alongside* a set of still-active
+> legacy TPA keyword rules that produce their own dangerous, approval-blocking
+> findings — see [Coexistence with the legacy TPA rules](#coexistence-with-the-legacy-tpa-rules)
+> below. So a phrase like "ignore previous instructions" can still yield a
+> blocking finding today even though the detect engine classifies it as a soft
+> signal.
+
+Each detect-engine check emits zero or more **signals**, and every signal
+carries a **tier**:
 
 | Tier | What it means | Effect on the tool |
 |------|---------------|--------------------|
@@ -71,6 +81,35 @@ finding (`internal/security/detect/aggregate.go`):
   how strongly (FR-010). These surface in the CLI report (`Confidence:` /
   `Signals:` lines) and in the REST scan report JSON.
 
+### Coexistence with the legacy TPA rules
+
+The two-tier model above governs the **detect engine**. The current
+`tpa-descriptions` scanner does not run the detect engine *exclusively* — it
+runs it **alongside a legacy set of TPA keyword rules** that predate Spec 076
+(`internal/security/scanner/inprocess.go`). The detect-engine findings are
+emitted first, then the legacy rules are appended:
+
+- **`tpa_hidden_instructions`** (critical) — phrases like "ignore previous
+  instructions", "do not tell the user", `<IMPORTANT>`.
+- **`prompt_injection_in_description`** (high) — "system prompt", "you must
+  always", "always call this tool first", "jailbreak", etc.
+- **`data_exfiltration_in_description`** (high) — `~/.ssh`, `id_rsa`,
+  `/etc/passwd`, ".env file", "send the credentials", etc.
+
+All three legacy rules are **`dangerous`-level**, so — unlike the detect
+engine's *soft* `directive.imperative` / `capability.mismatch` checks, which
+only raise a review item — a legacy-rule match **blocks `security approve`** and
+drives the scan summary to `dangerous`. There is therefore some deliberate
+overlap: a description containing "ignore previous instructions" is a *soft*
+detect-engine `directive.imperative` signal **and** a *dangerous* legacy
+`tpa_hidden_instructions` finding at the same time, and today the dangerous
+legacy finding is what gates approval.
+
+This coexistence is intentional for the migration — it keeps the MVP from
+regressing any pre-076 keyword coverage. Folding the legacy rules into the
+detect engine (so the two-tier model applies uniformly) is a **separate
+implementation change tracked outside this docs page**, not yet shipped.
+
 ### Normalization (FR-007)
 
 Phrase-matching checks (directive, capability, embedded-secret position logic)
@@ -250,6 +289,12 @@ It reuses — rather than rebuilds — the Spec-032 quarantine hashing, the
 quarantine state machine, the aggregated-report types, and the
 `internal/security/patterns/` secret matchers (FR-012).
 
+`inprocess.go` does **not** delegate to the detect engine exclusively today: it
+also appends the legacy dangerous TPA keyword rules to the same findings list
+(see [Coexistence with the legacy TPA rules](#coexistence-with-the-legacy-tpa-rules)).
+The detect engine's two-tier semantics therefore describe its own signals, not
+the legacy rules' findings.
+
 ## Related reading
 
 - [Security Scanner Plugins](/features/security-scanner-plugins) — the plugin framework hosting the `tpa-descriptions` scanner