Skip to content

feat(server,k8s): implement pause/resume with rootfs snapshot support#668

Open
fengcone wants to merge 1 commit intoalibaba:mainfrom
fengcone:feature/public-k8s-pause-resume
Open

feat(server,k8s): implement pause/resume with rootfs snapshot support#668
fengcone wants to merge 1 commit intoalibaba:mainfrom
fengcone:feature/public-k8s-pause-resume

Conversation

@fengcone
Copy link
Copy Markdown
Collaborator

@fengcone fengcone commented Apr 8, 2026

  • Add PausePolicy schema and API endpoints (/pause, /resume)
  • Add image_pull_policy configuration for sandbox containers
  • Support pausePolicy in pool-based BatchSandbox creation
  • Add SandboxSnapshot CRD and controller for snapshot lifecycle
  • Add image-committer Job for container rootfs snapshot/commit
  • Update AGENTS.md and Makefile for new components

Implements OSEP-0008 pause/resume functionality using SandboxSnapshot
CRD and nerdctl-based image committer for Kubernetes runtime.

Summary

  • What is changing and why?

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

  - Add PausePolicy schema and API endpoints (/pause, /resume)
  - Add image_pull_policy configuration for sandbox containers
  - Support pausePolicy in pool-based BatchSandbox creation
  - Add SandboxSnapshot CRD and controller for snapshot lifecycle
  - Add image-committer Job for container rootfs snapshot/commit
  - Update AGENTS.md and Makefile for new components

  Implements OSEP-0008 pause/resume functionality using SandboxSnapshot
  CRD and nerdctl-based image committer for Kubernetes runtime.
@fengcone
Copy link
Copy Markdown
Collaborator Author

fengcone commented Apr 8, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f6a1f854c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

# 2. Delete SandboxSnapshot
try:
self.snapshot_provider.delete_snapshot(sandbox_id, self.namespace)
deleted = True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not treat snapshot 404 as a successful sandbox deletion

delete_sandbox sets deleted = True whenever delete_snapshot returns, but SandboxSnapshotProvider.delete_snapshot intentionally swallows 404s and returns None. That means missing snapshots are counted as successful deletions, so non-existent sandboxes (or workloads that failed to delete) can still return success instead of surfacing 404/500, leaving resources behind while the API reports deletion succeeded.

Useful? React with 👍 / 👎.

// SnapshotPushSecret is the Secret name for pushing snapshots.
// +optional
// +kubebuilder:validation:Optional
SnapshotPushSecret string `json:"snapshotPushSecret,omitempty"`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Align PausePolicy secret JSON keys with the public API

The PausePolicy struct uses snapshotPushSecret/resumeImagePullSecret JSON keys, but the server request schema and generated CRD use snapshotPushSecretName/resumeImagePullSecretName. This mismatch causes secret fields to be dropped when decoding BatchSandbox into typed Go structs, so pause/resume flows lose registry credentials (commit jobs cannot push to private registries and resumed sandboxes can miss image pull secrets).

Useful? React with 👍 / 👎.

Comment on lines +1082 to +1086
if snapshot and snapshot.get("status", {}).get("phase") == "Failed":
return (
"Failed",
"SNAPSHOT_FAILED",
snapshot.get("status", {}).get("message", "Snapshot failed"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve running state when a stale snapshot is failed

_derive_sandbox_state returns Failed as soon as it sees a snapshot in Failed phase, before checking whether a BatchSandbox still exists and is running. After a pause attempt fails, the workload is intentionally kept alive, so this ordering misreports a usable running sandbox as failed in get_sandbox/list_sandboxes, which can drive incorrect client behavior.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant