feat(providers): add Phala confidential TDX CVM SSH-lease provider#367
feat(providers): add Phala confidential TDX CVM SSH-lease provider#367anagnorisis2peripeteia wants to merge 12 commits into
Conversation
Add phala as a Linux-only ProviderKindSSHLease instance provider, adapted from the namespaceinstance template. It leases a Phala Cloud / dstack confidential Intel TDX CVM via the phala CLI (deploy --dev-os --ssh-pubkey --wait), SSHes in over crabbox's normal sync/run path, and releases with cvms delete. First confidential/attested execution substrate in crabbox. Scope: lease/run/release only. TDX attestation (cvms attestation) is deferred and intentionally not advertised -- crabbox has no confidential Feature constant yet. - internal/providers/phala: provider/flags/backend (+ tests) - internal/cli/phala_proxy.go: SSH gateway proxy shim (__phala-proxy) - internal/cli/config.go: PhalaConfig + file/env wiring - docs/providers/phala.md + regenerated provider matrix - scripts/live-phala-smoke.sh: guarded live smoke Ownership is anchored on the local claim (Phala exposes no server-side labels), so List/Cleanup/ReleaseLease never touch a name-prefixed CVM that lacks a matching local crabbox claim.
|
Codex review: needs maintainer review before merge. Reviewed June 15, 2026, 8:51 AM ET / 12:51 UTC. Summary Reproducibility: not applicable. this is a new provider PR rather than a current-main bug report. The relevant verification is contributor live output plus source and fixture review, not a failing reproduction path. Review metrics: 3 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Merge only after maintainers explicitly accept the confidential-compute trust model and availability caveats, while keeping provider-specific lifecycle and attestation logic inside the Phala adapter. Do we have a high-confidence way to reproduce the issue? Not applicable; this is a new provider PR rather than a current-main bug report. The relevant verification is contributor live output plus source and fixture review, not a failing reproduction path. Is this the best way to solve the issue? Yes, the implementation shape fits the repository boundary by keeping Phala-specific lifecycle, ownership, gateway, and attestation behavior in the provider adapter with minimal generic hooks. The remaining question is maintainer acceptance of the trust and availability model, not a narrow mechanical repair. AGENTS.md: found and applied where relevant. Codex review notes: model internal, reasoning high; reviewed against 57440dad819d. Label changesLabel justifications:
Evidence reviewedWhat I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
… review) - deploy always supplies --compose (bundled default compose; the phala CLI requires a compose file in non-interactive mode) - SSH proxy uses the dstack TLS gateway: derive <appId>-22.<gateway-domain> from the cvms get gateway object and tunnel via openssl s_client (was raw TCP on flat host/port fields); openssl is now a host dependency - tests for compose-always-present + gateway-host derivation + openssl tunnel
|
Thanks for the precise review — both P1s are addressed in P1 — compose required for non-interactive deploy ( P1 — TLS gateway SSH contract ( Local gates green: The remaining @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
The live lease->run->release run against real Phala TDX hardware revealed
the actual `phala` CLI output contract, which the provider mis-parsed:
- deploy --json prints a leading "Provisioning CVM <name>..." progress
line before the JSON object; jsonObjectPrefix only trimmed trailing
noise, so deploy failed with "produced no JSON output". Now scans to
the first top-level { and tolerates leading AND trailing noise.
- cvms list items use camelCase keys (appId/vmUuid/instanceId); cvms get
uses snake_case (app_id/vm_uuid). instance now decodes BOTH spellings
via a custom UnmarshalJSON; cloudID() prefers app_id (the confirmed
--cvm-id handle for get/delete).
- listInstances skips items with no usable name/handle instead of failing.
Tests parse the exact observed deploy stdout, camelCase list payload, and
snake_case cvms-get/gateway payload.
…uirement) Live SSH into a real --dev-os TDX CVM showed the dstack guest is an immutable confidential-compute appliance: read-only squashfs root, no package manager (apt/dnf/yum/apk all absent), no network egress. It already ships rsync, tar and python3 -- everything crabbox's rsync-based sync and exec needs -- but NOT git. The earlier bootstrap required git, so it could never succeed on the supported guest and failed live with "Phala CVM tool bootstrap failed: exit status 1" after SSH auth. crabbox does not need git on the box (the file manifest is computed locally and the working tree is rsync'd, not git-cloned, over the SSH gateway tunnel). So the required tool set is now rsync+tar+python3; git is installed only opportunistically when a package manager exists (non-dev-os images) and is never required. The lease ReadyCheck drops git for the same reason. Tests pin the dev-os contract: rsync+tar+python3 required, git opportunistic.
…ad-only) Second live finding: after bootstrap succeeded the run reached the sync step and failed at "write sync manifests: exit status 1". The dstack --dev-os guest roots its filesystem on a read-only squashfs (confirmed via mount: /dev/mapper/rootfs on / type squashfs ro), so the /work/crabbox work root could not be mkdir'd. /var/volatile is a writable tmpfs present on every dstack guest. Default Phala work root is now /var/volatile/crabbox (BaseConfig and the applyDefaults fallback). ValidateConfig additionally rejects the bare /var/volatile mount as too broad, matching the existing /tmp, /var, /work guards. Docs and the deploy-args test updated; tests pin the writable default and that it survives its own validator. Users needing encrypted at-rest persistence can point --phala-work-root at /var/volatile/dstack/persistent/... (The earlier git-requirement and JSON-shape fixes already landed; this is the remaining writable-path gap before the live smoke completes.)
Live release/resolve revealed the ownership model relied on server-side labels that Phala never provides: cvms list returns no crabbox labels and omits even the CVM name (only cvms get/deploy/delete echo the name). The run failed at resolve with "ownership labels do not match lease". Rework, consistent with Phala's reality that the local lease claim (which records the CVM cloud id at acquire time) is the ownership authority: - owned(): dual-proof -- a local claim maps to the CVM cloud id (post-acquire authority, the reliable path for List/Resolve where cvms list has no name), OR the CVM name carries the crabbox-<lease> prefix (pre-claim recovery of our own just-created CVM, before the claim is written). - phalaLabels(): synthesize lease/slug/owner from the claim keyed on the CVM cloud id, falling back to the crabbox- name prefix for objects that do carry a name (cvms get). - Resolve claim branch: trust the claim (owned re-resolves it by cloud id) instead of cross-checking absent server labels. - Destructive ops (validateDestroyTarget) source the CVM -- and its name -- from cvms get (which carries it on real hardware) rather than cvms list, then corroborate: require a local claim AND a crabbox-<lease> name matching the lease. This preserves the four release-safety properties (reject foreign-named, mismatched-lease, unclaimed, and skip-when-gone) against the app-id-reuse threat, now working on real Phala instead of only on mocked list payloads with names. Adds getInstance (cvms get) and tests pinning the real get/list shapes.
The __phala-proxy ProxyCommand resolved the dstack gateway host by calling
`phala cvms get` on EVERY SSH connection. crabbox status --wait uses a 4s
SSH readiness probe (probeSSHReady) that cannot complete cvms-get + openssl
TLS + SSH in 4s, so it timed out ("timed out waiting for <lease> to become
ready") even though the leased CVM was up -- the live run path tolerates it
via a longer wait, but every connection paid an extra API round-trip.
The gateway host (<app_id>-22.<gateway_domain>) is stable for a CVM, so it is
now resolved ONCE at acquire (resolveGatewayHost via cvms get) and cached in
the lease claim under gateway_host. proxyCommand bakes --gateway-host into the
__phala-proxy invocation; the proxy tunnels straight to it and skips cvms get.
lease() reads the cached host from the claim-backed labels. When the cache is
absent (older claims, resolution failure) the proxy falls back to cvms get, so
behavior degrades gracefully.
This makes status --wait succeed, cuts an API call + latency off every SSH
connection, and reduces Phala API load. Tests cover the cached-host fast path
(no cvms get), the fallback, real cvms-get gateway parsing, and the
acquire->claim->proxy round-trip.
The SSH gateway tunnel runs `openssl s_client` with only -verify_quiet, which merely restricts verify OUTPUT -- a certificate-chain failure did NOT abort the connection. Combined with SSH host-key checking being disabled (the per-lease CVM host key is unknown), the TLS handshake was the only server authentication AND it enforced nothing: a network MITM between the crabbox host and <appId>-22.<gateway-domain>:443 could terminate TLS with any cert, read the SSH session, and inject commands -- defeating the confidential -compute trust boundary the provider exists to uphold. Add -verify_return_error (abort on any chain-verification failure) and -verify_hostname <host> (pin the leaf cert to the gateway host so a valid- chain cert for a different name cannot be substituted). The dstack gateway presents a publicly-trusted cert for the gateway host (verified live against the ambient trust store), so genuine connections still succeed. Proxy test now asserts both flags are present.
Parse the real cvms-list name key and fix ownership/lifecycle gaps surfaced by an independent adversarial review and live TDX testing: - cvms list items carry the name under `cvmName` (camelCase), not name/appName -- instance.UnmarshalJSON now reads it, restoring ambiguous-create recovery, owned()'s name branch, and name surfacing; all list-shaped test fixtures corrected to the real keys (appId/cvmName/status). - Cleanup now routes every live delete through validateDestroyTarget (cvms get name corroboration), matching ReleaseLease, so a stale claim onto a reused app_id can no longer delete a foreign CVM (P1). - destroy()/getInstance() no longer treat any 'not found' substring (incl over err.Error()) as success; an anchored missingCVMResponse helper distinguishes a definitively-absent CVM from a transient lookup failure, and ReleaseLease retains the claim+key on an ambiguous failure instead of orphaning a live billing CVM (P2). - status --wait no longer re-runs the full prepareSSH bootstrap (Resolve gates prepareSSH on !StatusOnly), so it relies on the lightweight readiness probe. - slug now round-trips through the claim (acquire injects it; Resolve/Touch re-claims preserve it) so resolve/status/stop by slug work and list shows it. - configuration.md workRoot corrected to /var/volatile/crabbox; gateway app-id resolver aligned with the proxy; ambiguous-create markers anchored. Adds tests pinning the real cvms-list cvmName shape, the Cleanup foreign-reuse refusal, transient-not-found claim retention, ReleaseLease positive destroy (mutation-checked), Acquire rollback-destroy, slug round-trip, status-only no-bootstrap, and the gateway-domain preference table.
A confidential-compute provider must prove the box is a genuine attested enclave running the authorized code -- not just an SSH-reachable VM. After the CVM is reachable, Acquire now fetches the dstack guest-agent attestation over SSH (/var/run/tappd.sock Tappd.Info -> app_cert + tcb_info) and verifies, against the app id of the CVM it just deployed: - RTMR replay: recompute RTMR0..3 as the SHA-384 fold of the tcb_info event log and require they equal the hardware registers (the measurement is genuine and unforgeable -- a tampered event breaks it). - quote consistency: extract the Intel TDX quote embedded in the app_cert (X.509 ext OID 1.3.6.1.4.1.62397.1.1) and require its TD-report MRTD/RTMRs equal tcb_info. - DCAP signature: verify.TdxQuote (go-tdx-guest) chains the quote to the Intel SGX/TDX Root CA -- genuine Intel silicon, not an emulator. - identity binding: the RTMR3 event log's app-id must equal the deployed CVM's app id, so the attested enclave is OUR deployment. On failure the just-created CVM is destroyed (never leaked) and the lease is refused; on success the verified app-id/compose-hash/rtmr3 are surfaced in the lease labels (attested=true). Gate defaults ON (--phala-skip-attestation to opt out); untrusted repo config may only tighten it, never disable. Tests run against a REAL attestation captured off live TDX hardware (testdata/): RTMR replay match + mutation-breaks, quote<->tcb_info MRTD/RTMR equality, app-id binding (+wrong-id rejection), and an opt-in network DCAP test (CRABBOX_TDX_DCAP_NETWORK_TEST=1) that chains the real quote to Intel.
|
Major update since the last review — the provider is now live-proven on real Intel TDX hardware and gained an attestation gate:
Known gap, stated honestly: @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…wSweeper) Address the ClawSweeper re-review of the attestation commit: - [P1] Acquire's success path called ClaimLeaseTargetForRepoConfig unconditionally, which no-ops when req.Repo.Root is empty. Phala ownership is anchored on the local claim, so a non-repo acquire (warmup, or run outside a repo) returned a live, billing CVM that List/stop/Cleanup could never find or destroy. Extracted claimAcquiredLease() which falls back to ClaimLeaseTargetForConfig when there is no repo root; test pins that a non-repo (and whitespace) repo root still writes a resolvable claim. - [P3] docs/providers/phala.md still described attestation as deferred/not verified; rewrote the section to document the default-on Acquire gate (RTMR replay + quote consistency + DCAP-to-Intel-root + app-id binding) and the --phala-skip-attestation opt-out, added attest to the config example, corrected the provider-metadata caveat, and regenerated the provider matrix.
|
Both findings from the last review are addressed in
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…verage
Mutation testing (gremlins, 100% efficacy / 0 survivors over the package)
flagged the dstack Info parsing inside fetchAttestation as uncovered because
it was welded to the SSH fetch. Split the deterministic parse (trim, skip a
leading shell banner to the first '{', JSON-decode, reject empty) into
parseDstackInfo; fetchAttestation keeps only the SSH call (the genuine seam
gap). TestParseDstackInfo covers clean/bannered/whitespace inputs and the
empty + non-JSON rejections.
|
Strengthened since the last pass — addressing the remaining maintainer-acceptance items:
Known follow-ups, stated honestly: @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
Summary
Adds
phalaas a Linux-onlyProviderKindSSHLeaseprovider that leases a Phala Cloud /dstack confidential Intel TDX CVM, and — before trusting it — verifies a genuine
hardware TDX attestation that binds the box to the exact code crabbox deployed. Every
other instance provider leases a plain VM; this one leases a hardware enclave and proves
it. Adapted structurally from
internal/providers/namespaceinstance, then correctedagainst the real
phala/dstack contract discovered by running it end-to-end on live TDXhardware.
Attestation is the gate (not deferred)
After the CVM is reachable,
Acquirefetches the dstack guest-agent attestation over SSH(
/var/run/tappd.sock→Tappd.Info→app_cert+tcb_info) and verifies, against theapp id of the CVM it just deployed:
tcb_infoevent log andrequire they equal the hardware registers. The measurement is genuine and unforgeable; a
single tampered event breaks it.
app_cert(X.509 extOID
1.3.6.1.4.1.62397.1.1) and require its TD-report MRTD/RTMRs equaltcb_info.verify.TdxQuote(go-tdx-guest) chains the quote to the IntelSGX/TDX Root CA: genuine Intel silicon, not an emulator.
app-idmust equal the deployed CVM's appid, so the attested enclave is our deployment.
On failure the just-created CVM is destroyed (never leaked) and the lease is refused. On
success the verified
app-id/compose-hash/rtmr3are surfaced in lease labels(
attested=true). Gate defaults on (--phala-skip-attestationto opt out); untrustedrepo config may only tighten it. Verification is pure-Go (
go-tdx-guest); the DCAP stepreaches Intel PCS at runtime (documented host dependency).
Live proof — real Intel TDX hardware (funded account)
End-to-end
crabbox run --provider phala(no--keep→ auto-release):The same path is proven on a larger class too — a
tdx.mediumrun attested and ranend-to-end (
instance_type=tdx.medium … attested … TDX_MEDIUM_ATTESTED_OK … exit 0 … deleted). The measurement is bound dynamically: the verifiedapp_id/compose_hasharethis deployment's, not a constant, and
rtmr3differs across runs because it folds theper-CVM instance id — the gate matched each correctly.
The committed unit tests run against a real attestation captured off live TDX silicon
(
testdata/real_attestation_info.json,real_tdx_quote.bin): RTMR replay match +mutation-breaks, quote↔tcb_info MRTD/RTMR equality, app-id binding (+ wrong-id rejection),
and an opt-in network DCAP test (
CRABBOX_TDX_DCAP_NETWORK_TEST=1) that chains the realquote to Intel.
Test quality (mutation tested): a
gremlinsmutation run over the provider packagekilled 267/267 runnable mutants — 100% efficacy, zero survivors (the 33 not-covered
lines are the SSH-gated guest-agent fetch / acquire orchestration, exercised live above).
Real-contract corrections (found by running it live, not by inspection)
The first live runs failed in five distinct ways the adapted-from-namespace code never
anticipated; each is fixed and pinned by a test using the real CLI shapes:
Provisioning CVM …line before the JSON; ids spansnake_case (
app_id, deploy/get) and camelCase (appId, list).cvms listomits labels and puts the name undercvmName— ownership is anchored onthe local lease claim by cloud-id, with
cvms get(which has the name) used tocorroborate the name/lease before any destructive op.
no egress) that already ships rsync/tar/python3 but not git → bootstrap requires only the
rsync-sync essentials; work root defaults to the writable
/var/volatile/crabbox(the old/worksat on the read-only root and failed at sync).<appId>-22.<gateway-domain>:443, tunneled withopenssl s_client), and the gateway host is cached in the claim so connections skip aper-connection
cvms get.Security hardening
-verify_return_error+-verify_hostname(SSH host-key checking is necessarily off for a fresh per-lease CVM, soTLS is the only server authentication; previously it enforced nothing).
ReleaseLeaseandCleanuprequire a matching local claimand
cvms getname corroboration, so a foreigncrabbox-*-named CVM (or an app-idreused after teardown) is never deleted; a transient/false "not found" no longer drops the
claim+key and orphans a live billing CVM.
phalaauth only).Review
Ran a multi-dimension AI adversarial review (ownership-safety, dstack-integration,
go-quality, security, test-quality, integration-consistency), each finding independently
verify-checked. 13 confirmed findings, all addressed — the two P1s (Cleanup name
corroboration; the TLS MITM above), the orphan-on-false-not-found P2s, the docs/marker P3s,
and the missing-coverage gaps (ReleaseLease positive-destroy is now mutation-checked; Acquire
rollback-destroy tested).
Known limitation / honest gaps
status --wait: the generic 4-second SSH readiness probe is too tight for a coldTLS-gateway SSH connect, so
crabbox status --waitcan time out even though the box is up(the lease→run→release path is unaffected and proven above). The right fix is a provider
Statusmethod that reports readiness from CVM state rather than a live probe — a small,bounded follow-up I'd rather land separately than rush here.
go-tdx-guest(+ protobuf/logger transitive deps) and a runtime Intel-PCS dependencyfor the DCAP step. Justified for a confidential provider; the full network DCAP test is
opt-in to keep CI deterministic.
tdx.smallandtdx.mediumexercisedlive (attested); the larger
tdx.large/tdx.xlargeclass mappings are not yet shown live.AI assistance
Built and reviewed with AI assistance (Claude). The verification recipe (RTMR replay, quote
extraction, DCAP) was derived by pulling and verifying a real quote off live hardware before
writing the Go, and the live proofs above were captured on a funded Phala account.