feat(server,k8s): implement pause/resume with rootfs snapshot support#668
feat(server,k8s): implement pause/resume with rootfs snapshot support#668fengcone wants to merge 1 commit intoalibaba:mainfrom
Conversation
- Add PausePolicy schema and API endpoints (/pause, /resume) - Add image_pull_policy configuration for sandbox containers - Support pausePolicy in pool-based BatchSandbox creation - Add SandboxSnapshot CRD and controller for snapshot lifecycle - Add image-committer Job for container rootfs snapshot/commit - Update AGENTS.md and Makefile for new components Implements OSEP-0008 pause/resume functionality using SandboxSnapshot CRD and nerdctl-based image committer for Kubernetes runtime.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f6a1f854c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # 2. Delete SandboxSnapshot | ||
| try: | ||
| self.snapshot_provider.delete_snapshot(sandbox_id, self.namespace) | ||
| deleted = True |
There was a problem hiding this comment.
Do not treat snapshot 404 as a successful sandbox deletion
delete_sandbox sets deleted = True whenever delete_snapshot returns, but SandboxSnapshotProvider.delete_snapshot intentionally swallows 404s and returns None. That means missing snapshots are counted as successful deletions, so non-existent sandboxes (or workloads that failed to delete) can still return success instead of surfacing 404/500, leaving resources behind while the API reports deletion succeeded.
Useful? React with 👍 / 👎.
| // SnapshotPushSecret is the Secret name for pushing snapshots. | ||
| // +optional | ||
| // +kubebuilder:validation:Optional | ||
| SnapshotPushSecret string `json:"snapshotPushSecret,omitempty"` |
There was a problem hiding this comment.
Align PausePolicy secret JSON keys with the public API
The PausePolicy struct uses snapshotPushSecret/resumeImagePullSecret JSON keys, but the server request schema and generated CRD use snapshotPushSecretName/resumeImagePullSecretName. This mismatch causes secret fields to be dropped when decoding BatchSandbox into typed Go structs, so pause/resume flows lose registry credentials (commit jobs cannot push to private registries and resumed sandboxes can miss image pull secrets).
Useful? React with 👍 / 👎.
| if snapshot and snapshot.get("status", {}).get("phase") == "Failed": | ||
| return ( | ||
| "Failed", | ||
| "SNAPSHOT_FAILED", | ||
| snapshot.get("status", {}).get("message", "Snapshot failed"), |
There was a problem hiding this comment.
Preserve running state when a stale snapshot is failed
_derive_sandbox_state returns Failed as soon as it sees a snapshot in Failed phase, before checking whether a BatchSandbox still exists and is running. After a pause attempt fails, the workload is intentionally kept alive, so this ordering misreports a usable running sandbox as failed in get_sandbox/list_sandboxes, which can drive incorrect client behavior.
Useful? React with 👍 / 👎.
Implements OSEP-0008 pause/resume functionality using SandboxSnapshot
CRD and nerdctl-based image committer for Kubernetes runtime.
Summary
Testing
Breaking Changes
Checklist