[ACP on Cloud] Validate + roll out native session/load resume (staging -> prod) #1250

@simonrosenberg

Priority: P1. Part of #988. Pairs with the runtime-image pin bump (#1249) and the backend integration (#1248).

The native-resume stack (software-agent-sdk#3562 + OpenHands#14709 + OpenHands#14671) is e2e-validated against a local SaaS-equivalent rig: real object-store clients (MinIO/S3 and fake-gcs / GoogleCloudFileStore) for both the session blob and the event store, docker rm -f hard recycle, credential-leak gate, and verified safe degradation to bootstrap when the runtime lacks the blob routes — all green for Claude + Codex.

Still unproven in real SaaS (the gaps this issue closes): GCS specifically, Postgres + the enterprise migration 119 (runs in enterprise-server, not the OSS app_server), and the K8s pod/PVC lifecycle (idle-STOP, 14-day TTL, reclaimPolicy=Delete).

Prerequisite (gating)

Before any of the tasks below: merge software-agent-sdk#3562 → build an agent-server image with the blob routes → bump AGENT_SERVER_IMAGE (openhands/app_server/sandbox/sandbox_spec_service.py, currently 1.26.0-python) to that tag on the #14709 commit (#1249) → set OPENHANDS_SHA in deploy/.github/workflows/deploy.yaml.

⚠️ The trap: if you deploy #14709 without the pin bump, the runtime stays on 1.26.0-python (no blob routes) → the restore 404s → resume silently degrades to bootstrap. Staging then goes green on bootstrap resume and you never actually test native — a false pass. The pin bump (#1249) is the gating prerequisite, not optional.
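
One way to make the trap above fail loudly is a preflight check in the validation harness. This is a minimal sketch, not part of the existing rig: the function name `preflight_native_check` and the idea of passing the deployed tag in as a string are assumptions; only the `1.26.0-python` tag and the native/bootstrap distinction come from this issue.

```python
# Tag named in this issue as the current pin, which lacks the blob routes.
PRE_BLOB_ROUTE_TAG = "1.26.0-python"


def preflight_native_check(image_tag: str, expect: str) -> None:
    """Raise if a 'native' validation run targets the image without blob routes.

    Hypothetical helper: `image_tag` is whatever tag the harness reads from the
    deployed runtime; `expect` is "native" or "bootstrap".
    """
    if expect not in ("native", "bootstrap"):
        raise ValueError(f"unknown expectation: {expect!r}")
    if expect == "native" and image_tag == PRE_BLOB_ROUTE_TAG:
        raise RuntimeError(
            f"runtime is still pinned to {image_tag}; the restore would 404 "
            "and resume would degrade to bootstrap, giving a false pass"
        )
```

Running the check before the first turn means a missing pin bump stops the run instead of producing a green bootstrap result.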

Tasks

  • Deploy OpenHands#14709 (+ #14671) to staging via the manual Deploy → staging workflow (or feature for a lower-stakes preview), with the runtime pin (#1249) pointed at an SDK#3562 build.
  • Apply migrations (app_server 012 / enterprise 119) on staging Postgres.
  • Per provider (Claude, Codex): create → turn-1 codeword → real recycle (idle-STOP / force pod delete, not pause) → resume → assert same acp_session_id, no <<RESUMED CONVERSATION>> marker, codeword recalled.
  • Verify the blob lands in the staging GCS sessions bucket and contains only sessions//projects/ (never auth.json/.credentials.json/history.jsonl).
  • Confirm a deploy WITHOUT the blob-route image degrades cleanly to bootstrap.
  • Promote to production (flip the prod pin in #1249).
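
The per-provider assertions and the credential-leak check in the tasks above could be expressed roughly as follows. This is a sketch under assumptions: `ResumeObservation`, its fields, and both function names are hypothetical, and how the harness captures the session id, the post-resume text, and the bucket listing is left out. The marker string and the three forbidden file names are taken from this issue.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

RESUME_MARKER = "<<RESUMED CONVERSATION>>"  # must be absent on a native resume
FORBIDDEN_NAMES = {"auth.json", ".credentials.json", "history.jsonl"}


@dataclass
class ResumeObservation:
    """Hypothetical record of one create -> recycle -> resume run."""
    session_id_before: str   # acp_session_id seen on turn 1
    session_id_after: str    # acp_session_id seen after the recycle
    text_after_resume: str   # whatever conversation text the harness captured
    codeword: str            # codeword planted on turn 1


def native_resume_failures(obs: ResumeObservation) -> list[str]:
    """Return the list of failed native-resume assertions (empty means pass)."""
    failures = []
    if obs.session_id_before != obs.session_id_after:
        failures.append("acp_session_id changed across the recycle")
    if RESUME_MARKER in obs.text_after_resume:
        failures.append("bootstrap marker present; resume was not native")
    if obs.codeword not in obs.text_after_resume:
        failures.append("codeword not recalled")
    return failures


def leaked_credential_keys(blob_keys: list[str]) -> list[str]:
    """Return object keys whose file name is one of the forbidden names."""
    return [k for k in blob_keys if PurePosixPath(k).name in FORBIDDEN_NAMES]
```

Returning a list of failures rather than asserting on the first one keeps a single run informative when several checks fail together, which is what a silent degradation to bootstrap would look like.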

Refs

Deploy mechanism + the SaaS-faithful local-rig write-up: the validation-harness comments on this epic. The local rig is parametrized: ACP_SIM_FILE_STORE=s3|google_cloud, SHARED_EVENT_STORAGE_PROVIDER=s3|gcp, ACP_BLOB_STORE=s3|gcs, ACP_EXPECT=native|bootstrap, AGENT_SERVER_IMAGE_TAG=<image>.
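
Since the rig is driven entirely by environment variables, a small validator can catch a mistyped value before a run starts. The sketch below is an assumption about how one might do that, not the rig's actual code: `rig_config_errors` is a hypothetical name, and the allowed values are copied from the parameter list above.

```python
from typing import Mapping

# Allowed values as listed in this issue.
ALLOWED = {
    "ACP_SIM_FILE_STORE": {"s3", "google_cloud"},
    "SHARED_EVENT_STORAGE_PROVIDER": {"s3", "gcp"},
    "ACP_BLOB_STORE": {"s3", "gcs"},
    "ACP_EXPECT": {"native", "bootstrap"},
}


def rig_config_errors(env: Mapping[str, str]) -> list[str]:
    """Return one message per missing or invalid rig variable."""
    errors = []
    for name, allowed in ALLOWED.items():
        value = env.get(name)
        if value not in allowed:
            errors.append(f"{name}={value!r} not in {sorted(allowed)}")
    # The image tag is free-form, so only check that it is set.
    if not env.get("AGENT_SERVER_IMAGE_TAG"):
        errors.append("AGENT_SERVER_IMAGE_TAG is unset or empty")
    return errors
```

A harness could call `rig_config_errors(os.environ)` at startup and abort if the returned list is non-empty.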
