
feat(providers): add Phala confidential TDX CVM SSH-lease provider#367

Open
anagnorisis2peripeteia wants to merge 12 commits into
openclaw:mainfrom
anagnorisis2peripeteia:feat/phala-provider

Conversation

@anagnorisis2peripeteia

@anagnorisis2peripeteia anagnorisis2peripeteia commented Jun 15, 2026

Contributor

Summary

Adds phala as a Linux-only ProviderKindSSHLease provider that leases a Phala Cloud /
dstack confidential Intel TDX CVM and, before trusting it, verifies a genuine hardware TDX
attestation that binds the box to the exact code crabbox deployed. Every other instance
provider leases a plain VM; this one leases a hardware enclave and proves it. Adapted
structurally from internal/providers/namespaceinstance, then corrected against the real
phala/dstack contract discovered by running it end-to-end on live TDX hardware.

Attestation is the gate (not deferred)

After the CVM is reachable, Acquire fetches the dstack guest-agent attestation over SSH
(/var/run/tappd.sock Tappd.Info → app_cert + tcb_info) and verifies, against the
app id of the CVM it just deployed:

  1. RTMR replay — recompute RTMR0..3 as the SHA-384 fold of the tcb_info event log and
    require they equal the hardware registers. The measurement is genuine and unforgeable; a
    single tampered event breaks it.
  2. Quote ↔ measurement — extract the Intel TDX quote embedded in app_cert (X.509 ext
    OID 1.3.6.1.4.1.62397.1.1) and require its TD-report MRTD/RTMRs equal tcb_info.
  3. DCAP signature: verify.TdxQuote (go-tdx-guest) chains the quote to the Intel
    SGX/TDX Root CA (genuine Intel silicon, not an emulator).
  4. Identity binding — the RTMR3 event log's app-id must equal the deployed CVM's app
    id, so the attested enclave is our deployment.

On failure the just-created CVM is destroyed (never leaked) and the lease is refused. On
success the verified app-id/compose-hash/rtmr3 are surfaced in lease labels
(attested=true). Gate defaults on (--phala-skip-attestation to opt out); untrusted
repo config may only tighten it. Verification is pure-Go (go-tdx-guest); the DCAP step
reaches Intel PCS at runtime (documented host dependency).

Live proof — real Intel TDX hardware (funded account)

End-to-end crabbox run --provider phala (no --keep → auto-release):

provisioning provider=phala lease=cbx_bbfab33501f7 slug=attest-… instance_type=tdx.small
waiting for …:22 phala cvm ssh ssh-auth… (TLS gateway)
WARN: Using embedded Intel certificate for TDX attestation root of trust   (DCAP verify)
attested phala_cvm=ea6874fd… app_id=ea6874fd… compose_hash=fe3df84a… rtmr3=72da6a69…
provisioned … state=ready            ← only AFTER attestation passed
sync complete in 14.8s ; ATTESTED_WORKLOAD_RAN ; exit=0
releasing … ✓ CVM crabbox-cbx-bbfab33501f7 deleted successfully   (final CVM count = 0)

The same path is proven on a larger class too — a tdx.medium run attested and ran
end-to-end (instance_type=tdx.medium … attested … TDX_MEDIUM_ATTESTED_OK … exit 0 … deleted). The measurement is bound dynamically: the verified app_id/compose_hash are
this deployment's, not a constant, and rtmr3 differs across runs because it folds the
per-CVM instance id — the gate matched each correctly.

The committed unit tests run against a real attestation captured off live TDX silicon
(testdata/real_attestation_info.json, real_tdx_quote.bin): RTMR replay match +
mutation-breaks, quote↔tcb_info MRTD/RTMR equality, app-id binding (+ wrong-id rejection),
and an opt-in network DCAP test (CRABBOX_TDX_DCAP_NETWORK_TEST=1) that chains the real
quote to Intel.

Test quality (mutation tested): a gremlins mutation run over the provider package
killed 267/267 runnable mutants — 100% efficacy, zero survivors (the 33 not-covered
lines are the SSH-gated guest-agent fetch / acquire orchestration, exercised live above).

Real-contract corrections (found by running it live, not by inspection)

The first live runs failed in five distinct ways the adapted-from-namespace code never
anticipated; each is fixed and pinned by a test using the real CLI shapes:

  • deploy output prints a leading Provisioning CVM … line before the JSON; ids span
    snake_case (app_id, deploy/get) and camelCase (appId, list).
  • cvms list omits labels and puts the name under cvmName — ownership is anchored on
    the local lease claim by cloud-id, with cvms get (which has the name) used to
    corroborate the name/lease before any destructive op.
  • dev-os guest is an immutable appliance (read-only squashfs root, no package manager,
    no egress) that already ships rsync/tar/python3 but not git → bootstrap requires only the
    rsync-sync essentials; work root defaults to the writable /var/volatile/crabbox (the old
    /work sat on the read-only root and failed at sync).
  • SSH is via the dstack TLS gateway (<appId>-22.<gateway-domain>:443, tunneled with
    openssl s_client), and the gateway host is cached in the claim so connections skip a
    per-connection cvms get.
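
For the first bullet, one way to tolerate a progress line before the JSON is to try decoding from each `{` until one succeeds. The sketch below is an assumption about the approach, not the provider's jsonObjectPrefix; `firstJSONObject` is a hypothetical name and the sample stdout is illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// firstJSONObject returns the first decodable JSON object in out, skipping
// any progress text before it and ignoring anything after it.
func firstJSONObject(out []byte) (json.RawMessage, error) {
	for start := 0; start < len(out); start++ {
		i := bytes.IndexByte(out[start:], '{')
		if i < 0 {
			break
		}
		start += i
		var obj json.RawMessage
		// Decoder.Decode reads exactly one value, so trailing noise is ignored.
		dec := json.NewDecoder(bytes.NewReader(out[start:]))
		if err := dec.Decode(&obj); err == nil {
			return obj, nil
		}
	}
	return nil, errors.New("no JSON object in output")
}

func main() {
	stdout := []byte("Provisioning CVM demo...\n{\"app_id\":\"abc\"}\ndone\n")
	obj, err := firstJSONObject(stdout)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(obj))
}
```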

Security hardening

  • TLS server auth (MITM fix): the gateway tunnel now passes -verify_return_error +
    -verify_hostname (SSH host-key checking is necessarily off for a fresh per-lease CVM, so
    TLS is the only server authentication; previously it enforced nothing).
  • Destructive-op safety: both ReleaseLease and Cleanup require a matching local claim
    and cvms get name corroboration, so a foreign crabbox-*-named CVM (or an app-id
    reused after teardown) is never deleted; a transient/false "not found" no longer drops the
    claim+key and orphans a live billing CVM.
  • No API key is ever passed as a flag (stored phala auth only).

Review

Ran a multi-dimension AI adversarial review (ownership-safety, dstack-integration,
go-quality, security, test-quality, integration-consistency), each finding independently
verify-checked. 13 confirmed findings, all addressed — the two P1s (Cleanup name
corroboration; the TLS MITM above), the orphan-on-false-not-found P2s, the docs/marker P3s,
and the missing-coverage gaps (ReleaseLease positive-destroy is now mutation-checked; Acquire
rollback-destroy tested).

Known limitation / honest gaps

  • status --wait: the generic 4-second SSH readiness probe is too tight for a cold
    TLS-gateway SSH connect, so crabbox status --wait can time out even though the box is up
    (the lease→run→release path is unaffected and proven above). The right fix is a provider
    Status method that reports readiness from CVM state rather than a live probe — a small,
    bounded follow-up I'd rather land separately than rush here.
  • Adds go-tdx-guest (+ protobuf/logger transitive deps) and a runtime Intel-PCS dependency
    for the DCAP step. Justified for a confidential provider; the full network DCAP test is
    opt-in to keep CI deterministic.
  • Direct-only (no coordinator/broker); Linux-only. tdx.small and tdx.medium exercised
    live (attested); the larger tdx.large/tdx.xlarge class mappings are not yet shown live.

AI assistance

Built and reviewed with AI assistance (Claude). The verification recipe (RTMR replay, quote
extraction, DCAP) was derived by pulling and verifying a real quote off live hardware before
writing the Go, and the live proofs above were captured on a funded Phala account.

Add phala as a Linux-only ProviderKindSSHLease instance provider, adapted
from the namespaceinstance template. It leases a Phala Cloud / dstack
confidential Intel TDX CVM via the phala CLI (deploy --dev-os --ssh-pubkey
--wait), SSHes in over crabbox's normal sync/run path, and releases with
cvms delete. First confidential/attested execution substrate in crabbox.

Scope: lease/run/release only. TDX attestation (cvms attestation) is
deferred and intentionally not advertised -- crabbox has no confidential
Feature constant yet.

- internal/providers/phala: provider/flags/backend (+ tests)
- internal/cli/phala_proxy.go: SSH gateway proxy shim (__phala-proxy)
- internal/cli/config.go: PhalaConfig + file/env wiring
- docs/providers/phala.md + regenerated provider matrix
- scripts/live-phala-smoke.sh: guarded live smoke

Ownership is anchored on the local claim (Phala exposes no server-side
labels), so List/Cleanup/ReleaseLease never touch a name-prefixed CVM that
lacks a matching local crabbox claim.
@clawsweeper

clawsweeper Bot commented Jun 15, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed June 15, 2026, 8:51 AM ET / 12:51 UTC.

Summary
This PR adds a Linux-only Phala Cloud/dstack Intel TDX CVM SSH-lease provider with config, flags, docs, tests, live-smoke scripts, TLS-gateway SSH, cleanup, and default-on TDX attestation.

Reproducibility: not applicable. This is a new provider PR rather than a current-main bug report. The relevant verification is contributor live output plus source and fixture review, not a failing reproduction path.

Review metrics: 3 noteworthy metrics.

  • Diff surface: 26 files changed; 6,087 additions, 1 deletion. This is a broad provider addition touching runtime code, docs, tests, scripts, and dependency metadata.
  • New runtime dependency: 1 direct module added: github.com/google/go-tdx-guest v0.3.1. The attestation gate depends on new cryptographic verification code and its transitive supply chain.
  • Live proof scope: 2 instance classes claimed: tdx.small and tdx.medium. The provided live proof covers two classes while larger class mappings remain maintainer-acceptance scope.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] This creates a new confidential-compute trust boundary: users will rely on TDX quote parsing, DCAP verification, RTMR replay, app-id binding, and TLS gateway authentication before trusting a leased box.
  • [P1] The provider owns live billing CVM lifecycle and destructive cleanup through local claims plus phala cvms get/delete; maintainers should accept the ownership model before merge.
  • [P1] The default attestation path adds a runtime Intel PCS network dependency and a new go-tdx-guest dependency, which unit tests and green CI do not fully settle.
  • [P1] The PR body still documents a cold TLS-gateway status --wait readiness caveat and larger tdx.large/tdx.xlarge mappings without live proof.

Maintainer options:

  1. Accept the confidential provider boundary after review (recommended)
    Maintainers can merge after provider/security review accepts the default-on DCAP attestation path, TLS gateway server auth, Intel PCS runtime dependency, and cleanup model.
  2. Require broader live class proof
    Maintainers can ask for tdx.large and tdx.xlarge live proof, or remove those class mappings until they are demonstrated on real Phala hardware.
  3. Track readiness as a follow-up
    Maintainers can accept the proven lease-run-release path now and track provider-specific status readiness separately if that caveat is acceptable for first merge.

Next step before merge

  • No automated repair is indicated; the remaining action is human maintainer review of the new provider scope, trust boundary, dependencies, and final merge readiness.

Security
Cleared: No concrete patch-introduced security or supply-chain defect was found, but the new confidential-compute trust boundary remains a maintainer merge-risk decision.

Review details

Best possible solution:

Merge only after maintainers explicitly accept the confidential-compute trust model and availability caveats, while keeping provider-specific lifecycle and attestation logic inside the Phala adapter.

Do we have a high-confidence way to reproduce the issue?

Not applicable; this is a new provider PR rather than a current-main bug report. The relevant verification is contributor live output plus source and fixture review, not a failing reproduction path.

Is this the best way to solve the issue?

Yes, the implementation shape fits the repository boundary by keeping Phala-specific lifecycle, ownership, gateway, and attestation behavior in the provider adapter with minimal generic hooks. The remaining question is maintainer acceptance of the trust and availability model, not a narrow mechanical repair.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against 57440dad819d.

Label changes

Label justifications:

  • P2: This is a normal-priority provider feature with meaningful review surface but no emergency regression signal.
  • merge-risk: 🚨 security-boundary: Merging makes Crabbox responsible for a new confidential-compute trust boundary involving attestation, DCAP, and TLS gateway authentication.
  • merge-risk: 🚨 availability: Merging adds a live billing CVM lifecycle path with cleanup responsibilities and documented readiness/class-coverage caveats.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • feature: ✨ showcase: ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. A default-attested confidential Intel TDX CVM provider is an unusually substantive provider capability for hardware-backed isolated execution.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body and follow-up comments include after-change live lease, attestation, run, and release output on funded Phala Intel TDX hardware, including tdx.small and tdx.medium.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body and follow-up comments include after-change live lease, attestation, run, and release output on funded Phala Intel TDX hardware, including tdx.small and tdx.medium.
Evidence reviewed

What I checked:

  • Repository policy read and applied: AGENTS.md was read fully; its provider-neutral boundary is relevant because this PR keeps Phala-specific reconciliation, gateway, attestation, lifecycle, and cleanup behavior behind a provider adapter with only small generic CLI/config hooks. (AGENTS.md:13, 57440dad819d)
  • Current main does not already implement Phala: A current-main search found no phala, dstack, TDX, or go-tdx-guest provider implementation, so the central provider addition is not obsolete on main. (57440dad819d)
  • PR diff surface: The PR diff against its base changes 26 files with 6,087 additions and 1 deletion across provider code, CLI config/proxy hooks, docs, tests, dependency metadata, and live-smoke scripts. (1a06ea0b7185)
  • Provider registration is adapter-scoped: The new provider registers as a Linux SSH-lease provider with SSH, crabbox-sync, cleanup, and coordinator disabled, matching the repository's provider-adapter model. (internal/providers/phala/provider.go:20, 1a06ea0b7185)
  • Attestation gates acquire: Acquire waits for SSH, fetches dstack attestation, verifies it, records attestation labels only after success, and rolls back the created CVM on fetch or verification failure. (internal/providers/phala/backend.go:380, 1a06ea0b7185)
  • Attestation verification checks measurement and identity: The attestation verifier replays RTMRs, checks quote measurements, performs optional DCAP verification, and binds the RTMR3 app-id event to the created CVM app id. (internal/providers/phala/attestation.go:255, 1a06ea0b7185)

Likely related people:

  • steipete: Current-main blame points to shared config/claim surfaces, and GitHub commit metadata shows recent Namespace Compute provider work adjacent to this PR's stated structural starting point. (role: recent core/provider infrastructure contributor; confidence: medium; commits: f6b4a9765285, 5627fce63188; files: internal/cli/config.go, internal/cli/claim.go, internal/providers/namespaceinstance)
  • coygeek: Merged DigitalOcean and other direct lease provider work touches the same registration, config, claim, docs, and live-smoke patterns used by this provider. (role: adjacent direct SSH provider contributor; confidence: medium; commits: aff04c9f19a3, d6be66ec65cf, baa8562f139b; files: internal/providers/digitalocean, internal/cli/config.go, internal/cli/claim.go)
  • anagnorisis2peripeteia: Beyond this PR, current-main history shows prior merged provider work on Hyper-V and Tart lifecycle/cleanup surfaces, so they have relevant domain context for this provider branch. (role: adjacent provider contributor; confidence: medium; commits: f09c059f011f, b3c0c4c3cb6f; files: internal/providers/hyperv, internal/providers/tart)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 15, 2026
… review)

- deploy always supplies --compose (bundled default compose; the phala
  CLI requires a compose file in non-interactive mode)
- SSH proxy uses the dstack TLS gateway: derive <appId>-22.<gateway-domain>
  from the cvms get gateway object and tunnel via openssl s_client (was
  raw TCP on flat host/port fields); openssl is now a host dependency
- tests for compose-always-present + gateway-host derivation + openssl tunnel
@anagnorisis2peripeteia

Contributor Author

Thanks for the precise review — both P1s are addressed in f641dc5e, using the Phala CLI contract you cited:

P1 — compose required for non-interactive deploy (backend.go): the provider now always supplies --compose. When none is configured it writes a bundled minimal default (default-compose.yml — a long-lived debian:stable-slim box) into the per-lease dir; an explicit --phala-compose still overrides. TestCreateAlwaysSuppliesComposeFlag asserts --compose is present on both the default and configured paths.

P1 — TLS gateway SSH contract (phala_proxy.go): rewritten off raw TCP / flat fields. It now parses the cvms get --json gateway object (gateway_domain / nested base_domain) plus the app id, derives <appId>-22.<gateway-domain>, and tunnels via openssl s_client -connect <host>:443 -servername <host> (openssl documented as a host dependency). Tests cover the host derivation and the openssl tunnel argv. The earlier TODO(phala) is removed now that the contract is known.

Local gates green: go build ./..., go vet ./internal/providers/phala/... ./internal/cli/..., and go test ./internal/providers/phala/... ./internal/cli/... -run Phala.

The remaining status: needs proof item is the funded live smoke. The account's $20 trial credit is gated behind a first card-bind (Phala's "$20 with your first card bind or $1+ top-up"), which the contributor needs to complete; once funded I'll attach a redacted lease→run→release transcript and lift the draft status.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 15, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the feature: ✨ showcase ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. label Jun 15, 2026
The live lease->run->release run against real Phala TDX hardware revealed
the actual `phala` CLI output contract, which the provider mis-parsed:

- deploy --json prints a leading "Provisioning CVM <name>..." progress
  line before the JSON object; jsonObjectPrefix only trimmed trailing
  noise, so deploy failed with "produced no JSON output". Now scans to
  the first top-level { and tolerates leading AND trailing noise.
- cvms list items use camelCase keys (appId/vmUuid/instanceId); cvms get
  uses snake_case (app_id/vm_uuid). instance now decodes BOTH spellings
  via a custom UnmarshalJSON; cloudID() prefers app_id (the confirmed
  --cvm-id handle for get/delete).
- listInstances skips items with no usable name/handle instead of failing.

Tests parse the exact observed deploy stdout, camelCase list payload, and
snake_case cvms-get/gateway payload.
@clawsweeper clawsweeper Bot added the merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. label Jun 15, 2026
Test added 7 commits June 15, 2026 09:38
…uirement)

Live SSH into a real --dev-os TDX CVM showed the dstack guest is an
immutable confidential-compute appliance: read-only squashfs root, no
package manager (apt/dnf/yum/apk all absent), no network egress. It already
ships rsync, tar and python3 -- everything crabbox's rsync-based sync and
exec needs -- but NOT git. The earlier bootstrap required git, so it could
never succeed on the supported guest and failed live with
"Phala CVM tool bootstrap failed: exit status 1" after SSH auth.

crabbox does not need git on the box (the file manifest is computed locally
and the working tree is rsync'd, not git-cloned, over the SSH gateway
tunnel). So the required tool set is now rsync+tar+python3; git is installed
only opportunistically when a package manager exists (non-dev-os images) and
is never required. The lease ReadyCheck drops git for the same reason.

Tests pin the dev-os contract: rsync+tar+python3 required, git opportunistic.
…ad-only)

Second live finding: after bootstrap succeeded the run reached the sync
step and failed at "write sync manifests: exit status 1". The dstack
--dev-os guest roots its filesystem on a read-only squashfs (confirmed via
mount: /dev/mapper/rootfs on / type squashfs ro), so the /work/crabbox work
root could not be mkdir'd. /var/volatile is a writable tmpfs present on
every dstack guest.

Default Phala work root is now /var/volatile/crabbox (BaseConfig and the
applyDefaults fallback). ValidateConfig additionally rejects the bare
/var/volatile mount as too broad, matching the existing /tmp, /var, /work
guards. Docs and the deploy-args test updated; tests pin the writable
default and that it survives its own validator. Users needing encrypted
at-rest persistence can point --phala-work-root at
/var/volatile/dstack/persistent/...

(The earlier git-requirement and JSON-shape fixes already landed; this is
the remaining writable-path gap before the live smoke completes.)
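
The work-root guard described above can be illustrated with a small validator. This is a sketch under assumptions: `validateWorkRoot` is a hypothetical name, and the rejected set contains only the paths this commit message names (/tmp, /var, /work, /var/volatile); the real ValidateConfig may check more.

```go
package main

import (
	"fmt"
	"path"
)

// tooBroad lists mount-level paths that must not be used as a work root.
var tooBroad = map[string]bool{
	"/tmp":          true,
	"/var":          true,
	"/work":         true,
	"/var/volatile": true,
}

// validateWorkRoot requires an absolute path that is not one of the bare
// mount points above. Trailing slashes are normalized before the check.
func validateWorkRoot(root string) error {
	if !path.IsAbs(root) {
		return fmt.Errorf("work root %q must be an absolute path", root)
	}
	if clean := path.Clean(root); tooBroad[clean] {
		return fmt.Errorf("work root %q is too broad; use a subdirectory", clean)
	}
	return nil
}

func main() {
	for _, root := range []string{"/var/volatile/crabbox", "/var/volatile/", "/work"} {
		fmt.Printf("%s -> %v\n", root, validateWorkRoot(root))
	}
}
```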
Live release/resolve revealed the ownership model relied on server-side
labels that Phala never provides: cvms list returns no crabbox labels and
omits even the CVM name (only cvms get/deploy/delete echo the name). The
run failed at resolve with "ownership labels do not match lease".

Rework, consistent with Phala's reality that the local lease claim (which
records the CVM cloud id at acquire time) is the ownership authority:

- owned(): dual-proof -- a local claim maps to the CVM cloud id (post-acquire
  authority, the reliable path for List/Resolve where cvms list has no name),
  OR the CVM name carries the crabbox-<lease> prefix (pre-claim recovery of
  our own just-created CVM, before the claim is written).
- phalaLabels(): synthesize lease/slug/owner from the claim keyed on the CVM
  cloud id, falling back to the crabbox- name prefix for objects that do carry
  a name (cvms get).
- Resolve claim branch: trust the claim (owned re-resolves it by cloud id)
  instead of cross-checking absent server labels.
- Destructive ops (validateDestroyTarget) source the CVM -- and its name --
  from cvms get (which carries it on real hardware) rather than cvms list,
  then corroborate: require a local claim AND a crabbox-<lease> name matching
  the lease. This preserves the four release-safety properties (reject
  foreign-named, mismatched-lease, unclaimed, and skip-when-gone) against the
  app-id-reuse threat, now working on real Phala instead of only on mocked
  list payloads with names.

Adds getInstance (cvms get) and tests pinning the real get/list shapes.
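
The corroboration rule for destructive ops might be sketched as follows. Everything here is illustrative: `claim`, `cvm`, and `checkDestroyTarget` are hypothetical stand-ins for the real types and validateDestroyTarget, and the name format (underscores in the lease id replaced by hyphens) is inferred from the live transcript rather than confirmed.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// claim is a hypothetical local lease claim recorded at acquire time.
type claim struct {
	LeaseID string
	CloudID string
}

// cvm is a hypothetical view of what `cvms get` returns.
type cvm struct {
	AppID string
	Name  string
}

// expectedName derives the CVM name for a lease (format assumed).
func expectedName(leaseID string) string {
	return "crabbox-" + strings.ReplaceAll(leaseID, "_", "-")
}

// checkDestroyTarget allows a delete only when a local claim exists for the
// lease, the claim points at this CVM, and the CVM name matches the lease.
func checkDestroyTarget(leaseID string, c *claim, target cvm) error {
	if c == nil || c.LeaseID != leaseID {
		return errors.New("refusing destroy: no matching local claim")
	}
	if c.CloudID != target.AppID {
		return fmt.Errorf("refusing destroy: claim points at %q, not %q", c.CloudID, target.AppID)
	}
	if want := expectedName(leaseID); target.Name != want {
		return fmt.Errorf("refusing destroy: CVM name %q does not match %q", target.Name, want)
	}
	return nil
}

func main() {
	c := &claim{LeaseID: "cbx_bbfab33501f7", CloudID: "app123"}
	own := cvm{AppID: "app123", Name: "crabbox-cbx-bbfab33501f7"}
	foreign := cvm{AppID: "app123", Name: "crabbox-someone-else"}
	fmt.Println(checkDestroyTarget("cbx_bbfab33501f7", c, own))
	fmt.Println(checkDestroyTarget("cbx_bbfab33501f7", c, foreign))
}
```

Requiring both proofs is what covers the app-id-reuse case: a stale claim alone matches the cloud id, but the reused CVM carries someone else's name.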
The __phala-proxy ProxyCommand resolved the dstack gateway host by calling
`phala cvms get` on EVERY SSH connection. crabbox status --wait uses a 4s
SSH readiness probe (probeSSHReady) that cannot complete cvms-get + openssl
TLS + SSH in 4s, so it timed out ("timed out waiting for <lease> to become
ready") even though the leased CVM was up -- the live run path tolerates it
via a longer wait, but every connection paid an extra API round-trip.

The gateway host (<app_id>-22.<gateway_domain>) is stable for a CVM, so it is
now resolved ONCE at acquire (resolveGatewayHost via cvms get) and cached in
the lease claim under gateway_host. proxyCommand bakes --gateway-host into the
__phala-proxy invocation; the proxy tunnels straight to it and skips cvms get.
lease() reads the cached host from the claim-backed labels. When the cache is
absent (older claims, resolution failure) the proxy falls back to cvms get, so
behavior degrades gracefully.

This makes status --wait succeed, cuts an API call + latency off every SSH
connection, and reduces Phala API load. Tests cover the cached-host fast path
(no cvms get), the fallback, real cvms-get gateway parsing, and the
acquire->claim->proxy round-trip.
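
The cached-host fast path with its fallback could be sketched like this. The `lookupFunc` type stands in for a `phala cvms get` call and is hypothetical, as is the `gatewayHost` helper; only the `<app_id>-22.<gateway_domain>` shape comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// lookupFunc stands in for a `phala cvms get` call (hypothetical); it
// returns the CVM's app id and gateway domain.
type lookupFunc func() (appID, gatewayDomain string, err error)

// gatewayHost prefers the host cached in the lease claim and only falls back
// to a lookup when the cache is empty.
func gatewayHost(cached string, lookup lookupFunc) (string, error) {
	if cached != "" {
		return cached, nil
	}
	appID, domain, err := lookup()
	if err != nil {
		return "", fmt.Errorf("resolve gateway host: %w", err)
	}
	if appID == "" || domain == "" {
		return "", errors.New("resolve gateway host: missing app id or gateway domain")
	}
	// SSH (port 22) is exposed as <app_id>-22.<gateway_domain>.
	return appID + "-22." + domain, nil
}

func main() {
	calls := 0
	lookup := func() (string, string, error) {
		calls++
		return "app123", "gateway.example.test", nil
	}
	fast, _ := gatewayHost("app123-22.gateway.example.test", lookup)
	slow, _ := gatewayHost("", lookup)
	fmt.Println(fast, slow, "lookups:", calls)
}
```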
The SSH gateway tunnel runs `openssl s_client` with only -verify_quiet,
which merely restricts verify OUTPUT -- a certificate-chain failure did NOT
abort the connection. Combined with SSH host-key checking being disabled
(the per-lease CVM host key is unknown), the TLS handshake was the only
server authentication AND it enforced nothing: a network MITM between the
crabbox host and <appId>-22.<gateway-domain>:443 could terminate TLS with any
cert, read the SSH session, and inject commands -- defeating the
confidential-compute trust boundary the provider exists to uphold.

Add -verify_return_error (abort on any chain-verification failure) and
-verify_hostname <host> (pin the leaf cert to the gateway host so a
valid-chain cert for a different name cannot be substituted). The dstack gateway
presents a publicly-trusted cert for the gateway host (verified live against
the ambient trust store), so genuine connections still succeed. Proxy test
now asserts both flags are present.
Parse the real cvms-list name key and fix ownership/lifecycle gaps surfaced
by an independent adversarial review and live TDX testing:

- cvms list items carry the name under `cvmName` (camelCase), not name/appName
  -- instance.UnmarshalJSON now reads it, restoring ambiguous-create recovery,
  owned()'s name branch, and name surfacing; all list-shaped test fixtures
  corrected to the real keys (appId/cvmName/status).
- Cleanup now routes every live delete through validateDestroyTarget (cvms get
  name corroboration), matching ReleaseLease, so a stale claim onto a reused
  app_id can no longer delete a foreign CVM (P1).
- destroy()/getInstance() no longer treat any 'not found' substring (including
  one inside err.Error()) as success; an anchored missingCVMResponse helper distinguishes
  a definitively-absent CVM from a transient lookup failure, and ReleaseLease
  retains the claim+key on an ambiguous failure instead of orphaning a live
  billing CVM (P2).
- status --wait no longer re-runs the full prepareSSH bootstrap (Resolve gates
  prepareSSH on !StatusOnly), so it relies on the lightweight readiness probe.
- slug now round-trips through the claim (acquire injects it; Resolve/Touch
  re-claims preserve it) so resolve/status/stop by slug work and list shows it.
- configuration.md workRoot corrected to /var/volatile/crabbox; gateway app-id
  resolver aligned with the proxy; ambiguous-create markers anchored.

Adds tests pinning the real cvms-list cvmName shape, the Cleanup foreign-reuse
refusal, transient-not-found claim retention, ReleaseLease positive destroy
(mutation-checked), Acquire rollback-destroy, slug round-trip, status-only
no-bootstrap, and the gateway-domain preference table.
A confidential-compute provider must prove the box is a genuine attested
enclave running the authorized code -- not just an SSH-reachable VM. After
the CVM is reachable, Acquire now fetches the dstack guest-agent attestation
over SSH (/var/run/tappd.sock Tappd.Info -> app_cert + tcb_info) and verifies,
against the app id of the CVM it just deployed:

- RTMR replay: recompute RTMR0..3 as the SHA-384 fold of the tcb_info event
  log and require they equal the hardware registers (the measurement is
  genuine and unforgeable -- a tampered event breaks it).
- quote consistency: extract the Intel TDX quote embedded in the app_cert
  (X.509 ext OID 1.3.6.1.4.1.62397.1.1) and require its TD-report MRTD/RTMRs
  equal tcb_info.
- DCAP signature: verify.TdxQuote (go-tdx-guest) chains the quote to the Intel
  SGX/TDX Root CA -- genuine Intel silicon, not an emulator.
- identity binding: the RTMR3 event log's app-id must equal the deployed CVM's
  app id, so the attested enclave is OUR deployment.

On failure the just-created CVM is destroyed (never leaked) and the lease is
refused; on success the verified app-id/compose-hash/rtmr3 are surfaced in the
lease labels (attested=true). Gate defaults ON (--phala-skip-attestation to
opt out); untrusted repo config may only tighten it, never disable.

Tests run against a REAL attestation captured off live TDX hardware
(testdata/): RTMR replay match + mutation-breaks, quote<->tcb_info MRTD/RTMR
equality, app-id binding (+wrong-id rejection), and an opt-in network DCAP
test (CRABBOX_TDX_DCAP_NETWORK_TEST=1) that chains the real quote to Intel.
@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 15, 2026
@anagnorisis2peripeteia

Contributor Author

Major update since the last review — the provider is now live-proven on real Intel TDX hardware and gained an attestation gate:

  • Attestation (new, the point of a confidential provider): Acquire now fetches the dstack guest-agent quote over SSH and verifies RTMR replay + quote↔tcb_info consistency + the Intel DCAP signature chain (go-tdx-guest → Intel SGX/TDX Root CA) + app-id binding to the deployed CVM, before trusting the box; it destroys + refuses a non-attesting CVM. Default-on. Tested against a real quote captured off live silicon (testdata/).
  • Live proof: full lease → attest → run → release exercised end-to-end on a funded account (transcript in the PR body); the earlier environment_blocked smoke is now real.
  • 5 real-contract fixes found by running it live (deploy prefix, cvms list name under cvmName + claim-anchored ownership, immutable dev-os guest/work-root, TLS-gateway SSH + cached host).
  • Security: TLS server-auth MITM fix (-verify_return_error -verify_hostname); Cleanup now name-corroborates like ReleaseLease; no orphaning a live CVM on a transient not-found.
  • A 6-dimension AI adversarial review found 13 findings, all addressed (incl. the 2 P1s above).

Known gap, stated honestly: crabbox status --wait's generic 4s SSH readiness probe is too tight for a cold TLS-gateway connect. Fixing it is a bounded follow-up that needs a core change; the lease→run→release path is unaffected.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 15, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

…wSweeper)

Address the ClawSweeper re-review of the attestation commit:

- [P1] Acquire's success path called ClaimLeaseTargetForRepoConfig
  unconditionally, which no-ops when req.Repo.Root is empty. Phala ownership is
  anchored on the local claim, so a non-repo acquire (warmup, or run outside a
  repo) returned a live, billing CVM that List/stop/Cleanup could never find or
  destroy. Extracted claimAcquiredLease() which falls back to
  ClaimLeaseTargetForConfig when there is no repo root; test pins that a
  non-repo (and whitespace) repo root still writes a resolvable claim.
- [P3] docs/providers/phala.md still described attestation as deferred/not
  verified; rewrote the section to document the default-on Acquire gate
  (RTMR replay + quote consistency + DCAP-to-Intel-root + app-id binding) and
  the --phala-skip-attestation opt-out, added attest to the config example,
  corrected the provider-metadata caveat, and regenerated the provider matrix.
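
The [P1] fallback can be sketched as below. This is an illustration under assumptions, not the committed claimAcquiredLease(): the `claimWriter` type and the `claimAcquired` helper are hypothetical stand-ins, and the two writers are passed in as functions so the sketch stays self-contained. It shows only the decision described above: an empty or whitespace-only repo root routes to the non-repo writer instead of silently writing nothing.

```go
package main

import (
	"fmt"
	"strings"
)

// claimWriter is a hypothetical stand-in for a function that persists the
// local ownership claim for a lease.
type claimWriter func(leaseID string) error

// claimAcquired picks the claim writer for a freshly acquired lease.
// forRepo models the repo-scoped writer, which does nothing useful without
// a repo root; forConfig models the fallback used outside a repo.
func claimAcquired(repoRoot, leaseID string, forRepo, forConfig claimWriter) error {
	if strings.TrimSpace(repoRoot) == "" {
		return forConfig(leaseID)
	}
	return forRepo(leaseID)
}

func main() {
	forRepo := func(id string) error { fmt.Println("repo claim written for", id); return nil }
	forConfig := func(id string) error { fmt.Println("config claim written for", id); return nil }

	_ = claimAcquired("/work/project", "lease-a", forRepo, forConfig)
	_ = claimAcquired("", "lease-b", forRepo, forConfig)
	_ = claimAcquired("   ", "lease-c", forRepo, forConfig)
}
```
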
@anagnorisis2peripeteia

Contributor Author

Both findings from the last review are addressed in eb47bdcd:

  • [P1] Persist non-repo acquire claims — Acquire's success path no longer calls the repo-only claim writer unconditionally. Extracted claimAcquiredLease() which falls back to ClaimLeaseTargetForConfig when req.Repo.Root is empty, so a non-repo acquire (warmup, or run outside a repo) still writes the local ownership claim that List/stop/Cleanup depend on. TestClaimAcquiredLeasePersistsNonRepoClaim pins that an empty (and whitespace) repo root writes a resolvable claim with the right cloud-id and slug.
  • [P3] Attestation docs: docs/providers/phala.md now documents the default-on Acquire gate (RTMR replay + quote↔tcb_info consistency + DCAP-to-Intel-root + app-id binding) and the --phala-skip-attestation opt-out; added attest to the config example, corrected the provider-metadata.json caveat, and regenerated the provider matrix (check-provider-matrix passes).

@clawsweeper re-review

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 15, 2026
@clawsweeper clawsweeper Bot added the status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. label Jun 15, 2026
…verage

Mutation testing (gremlins, 100% efficacy / 0 survivors over the package)
flagged the dstack Info parsing inside fetchAttestation as uncovered because
it was welded to the SSH fetch. Split the deterministic parse (trim, skip a
leading shell banner to the first '{', JSON-decode, reject empty) into
parseDstackInfo; fetchAttestation keeps only the SSH call (the genuine seam
gap). TestParseDstackInfo covers clean/bannered/whitespace inputs and the
empty + non-JSON rejections.
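
A deterministic parse of that shape might look like the sketch below. It is not the committed parseDstackInfo: the `dstackInfo` struct and `parseInfo` name are hypothetical, only the app_cert and tcb_info field names come from the description above, and treating a response that lacks either field as "empty" is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// dstackInfo is a hypothetical subset of the guest-agent Info response.
// tcb_info is kept raw so the sketch accepts either a JSON object or a
// JSON-encoded string.
type dstackInfo struct {
	AppCert string          `json:"app_cert"`
	TCBInfo json.RawMessage `json:"tcb_info"`
}

// parseInfo trims the raw SSH output, skips any leading shell banner up to
// the first '{', JSON-decodes the rest, and rejects empty results.
func parseInfo(raw []byte) (dstackInfo, error) {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == 0 {
		return dstackInfo{}, errors.New("empty attestation response")
	}
	start := bytes.IndexByte(trimmed, '{')
	if start < 0 {
		return dstackInfo{}, errors.New("no JSON object in attestation response")
	}
	var info dstackInfo
	if err := json.Unmarshal(trimmed[start:], &info); err != nil {
		return dstackInfo{}, fmt.Errorf("decode attestation response: %w", err)
	}
	// Assumption: a response without both fields counts as empty.
	if info.AppCert == "" || len(info.TCBInfo) == 0 || string(info.TCBInfo) == "null" {
		return dstackInfo{}, errors.New("attestation response lacks app_cert or tcb_info")
	}
	return info, nil
}

func main() {
	out := []byte("Welcome to the guest shell\n{\"app_cert\":\"PEM\",\"tcb_info\":\"{}\"}\n")
	info, err := parseInfo(out)
	fmt.Println(info.AppCert, err)
}
```

Keeping the parse free of I/O is what lets table-driven tests cover the clean, bannered, whitespace, empty, and non-JSON cases without an SSH seam. A banner that itself contains '{' would defeat this simple skip, so a real implementation may need something stricter.
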
@anagnorisis2peripeteia

Contributor Author

Strengthened since the last pass — addressing the remaining maintainer-acceptance items:

  • Larger-class live proof: tdx.medium now attested + run + released end-to-end (instance_type=tdx.medium … attested … TDX_MEDIUM_ATTESTED_OK … exit 0 … deleted), so the live matrix is no longer tdx.small-only (added to the PR body).
  • Mutation tested: a gremlins run over the provider package kills 267/267 runnable mutants — 100% efficacy, zero survivors. The only uncovered lines are the SSH-gated guest-agent fetch / acquire orchestration (exercised live, not unit-reachable without an SSH seam); parseDstackInfo was split out of fetchAttestation (commit 1a06ea0b) to make the attestation-response parsing deterministically covered.

Known follow-ups, stated honestly: status --wait cold-TLS-gateway readiness (needs a small core cli change, kept out to stay focused) and the tdx.large/xlarge class mappings (not yet shown live).

@clawsweeper re-review

@anagnorisis2peripeteia anagnorisis2peripeteia marked this pull request as ready for review June 15, 2026 12:45

Labels

feature: ✨ showcase ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.
