fix: Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod#1
Closed
longsuizhi wants to merge 89 commits into
Closed
Conversation
Set User-Agent: OpenSandbox-Go-SDK/1.0.1 on all outgoing requests (doRequestOnce, doStreamRequest, GetCommandLogs, UploadFiles, DownloadFile). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drain HTTP response bodies after read to preserve connection reuse - Change Sandbox.Close / SandboxManager.Close to return error (io.Closer) - Add tests for Sandbox lifecycle, health, and Close methods - Replace manual url.QueryEscape with url.Values for consistency - Remove deprecated crypto/dsa import and DSA public-key branch - Remove unused encoding/json import workaround in retry_test.go - Clean up unused go.mod dependencies Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(k8s): add dockur windows pool exmaple * feat(server): fix windows reboot
…undle fix(execd): use merged CA bundle for `REQUESTS_CA_BUNDLE` and `SSL_CERT_FILE`
…-caaaa fix(execd): use merged CA bundle for REQUESTS_CA_BUNDLE and SSL_CERT_FILE
chore: bump execd to v1.0.16
chore(README): osb-cli apikey ops for quick start
…t-version-0.2.0 chore(chart): bump opensandbox-controller chart version to 0.2.0
…ocket-path feat(k8s): add containerd-socket-path to controller
…ef is set (opensandbox-group#883) * fix(server): make resourceLimits/image/entrypoint optional when poolRef is set When creating a sandbox from a pre-configured Pool (via extensions.poolRef), the image, entrypoint, and resourceLimits are all defined in the Pool CRD template. Requiring callers to provide dummy values for these fields is unnecessary and error-prone. Changes: - Make resource_limits Optional with None default in CreateSandboxRequest - Skip image/snapshotId/entrypoint validation when poolRef is present - Add explicit resourceLimits required check for non-pool requests - Guard against None resource_limits in Docker provider code paths * fix(server): address review feedback for pool mode optional fields - Skip image/entrypoint resolution in K8s service layer when poolRef is set - Reject poolRef on Docker provider (unsupported) - Reject snapshotId when poolRef is set (conflicting fields) - Update specs/sandbox-lifecycle.yml: remove required constraint on resourceLimits, document pool mode behavior - All 1038 tests pass * fix: guard _ensure_image_auth_support against None image, align spec docs - Fix AttributeError when image is None in pool mode (P1) - Clarify in spec that snapshotId is rejected (not optional) with poolRef * test: add pool mode validation tests for poolRef, snapshotId rejection, Docker guard, and image auth - Schema: poolRef-only happy path, poolRef+snapshotId rejection, resourceLimits still required without poolRef, blank poolRef ignored - Docker: rejects poolRef with SANDBOX::UNSUPPORTED_POOL_REF - K8s: pool mode skips image/entrypoint validation, image auth guard handles None image without AttributeError All 1046 tests pass (8 new). * fix: normalize blank snapshotId to None in pool mode When poolRef is set and snapshotId is whitespace-only (e.g. ' '), the validator now clears it to None before returning. This prevents downstream code from treating a truthy whitespace string as a real snapshot ID (e.g. writing an invalid Kubernetes label). Adds test_pool_mode_normalizes_blank_snapshot_id to cover this edge case. --------- Co-authored-by: longsuizhi <longsuizhi@xiaomi.com>
Mirror the Python SDK behavior where every header returned by the lifecycle GetEndpoint call (auth tokens, routing hints, sticky-session keys, etc.) is forwarded as-is on every subsequent execd/egress request. The previous code extracted only the X-EXECD-ACCESS-TOKEN / OPENSANDBOX-EGRESS-AUTH header and dropped the rest, which broke routing when the server added new headers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sandbox-group#895) * fix(execd): ensure uploaded files are visible before responding On weakly-coherent filesystems (virtio-fs, 9pfs), a freshly-written file can be invisible to subsequent processes for a brief window after the upload handler returns 200, causing intermittent "file not found" errors when callers immediately invoke /command after /files/upload. - fsync the parent directory after closing the new file so the new dirent is durable and observable. Best-effort: ignore ENOTSUP. - Wrap ChmodFile in a one-shot retry with a short sleep. ChmodFile always invokes chown under the hood, so a freshly-created dirent that has not yet propagated would otherwise surface as ENOENT and turn a recoverable visibility delay into a 500 response. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(execd): split UploadFile to reduce cognitive complexity Extract form parsing, metadata parsing, target resolution, file write, and permission application into helpers so UploadFile passes the gocognit threshold (>30). Behavior unchanged; errors flow via *uploadError and funnel through a single RespondError call site. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Add ServerConfig fields to make uvicorn process count, concurrency limits, socket backlog, and event-loop/HTTP parser implementation configurable. Defaults preserve current behavior (workers=1) while enabling operators to scale a single pod across multiple Python processes when apiserver capacity allows. - pyproject.toml: switch to uvicorn[standard] for uvloop/httptools/watchfiles - config.py: ServerConfig.workers, limit_concurrency, backlog, loop, http - cli.py: thread new fields into uvicorn.run; force workers=1 under --reload - main.py: pass loop/http to dev __main__ entry - examples + configuration.md: document tunables and apiserver tradeoff Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pool Sandbox/snapshot/pool route handlers were async def but called synchronous service methods that issue blocking Kubernetes/Docker API requests (50-200 ms each). Each in-flight call stalled the entire event loop, serializing every concurrent request. Convert blocking-only handlers to sync def so FastAPI offloads them to the anyio threadpool, letting concurrent requests run in parallel. create_sandbox stays async (its service is async with cooperative polling). - api/lifecycle.py: 12 handlers async -> sync; drop manual to_thread in create_snapshot now that the route itself runs in the threadpool; drop unused asyncio import - api/pool.py: 5 pool handlers async -> sync - tests/test_routes_list_sandboxes.py: regression locks in threadpool parallelism (8 x 200 ms calls finish in ~250 ms, not ~1.6 s) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
list_custom_objects always issued a direct apiserver call, even though the informer is already watching the same namespace and serves get_custom_object from cache. Under multi-worker deployments the list QPS scales with workers x replicas and pressures the apiserver unnecessarily. Prefer the informer cache when synced and the label selector falls within the supported in-memory grammar (empty, bare key existence, key=value, comma-joined AND). Anything else falls back to the existing direct API path, preserving today's behavior. - services/k8s/label_selector.py: minimal parser/matcher for the subset of selectors callers in this repo actually emit - services/k8s/informer.py: WorkloadInformer.list() snapshot helper - services/k8s/client.py: list_custom_objects consults the cache first - tests/k8s: cover label_selector grammar + cache-hit/miss/fallback behavior on the client Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous round moved blocking list/get/delete handlers onto sync def routes so FastAPI offloads them to anyio's default threadpool. Two follow-up bottlenecks remain: 1. anyio's default threadpool is 40 tokens; bursts of concurrent sandbox CRUD requests start queueing once that ceiling is hit. 2. lifecycle.create_sandbox is async and the Kubernetes service body still issues sync K8s calls (_ensure_pvc_volumes, workload_provider create/get/delete) directly on the event loop. Each 50-200 ms round-trip stalls every other in-flight request, and the rate limiter's time.sleep makes it worse when read/write QPS is set. Add a configurable thread_pool_size (default 200) applied at lifespan startup, and wrap the blocking K8s calls inside the create path with asyncio.to_thread so the event loop stays responsive. - config.py: ServerConfig.thread_pool_size - main.py: lifespan sets anyio current_default_thread_limiter total_tokens - services/k8s/kubernetes_service.py: to_thread wraps the four sync K8s calls in create_sandbox / _wait_for_sandbox_ready - configuration.md, tests/test_config.py: doc and field tests Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…turns post-patch state (opensandbox-group#899) * fix(server): k8s patch_sandbox_metadata correctly deletes keys and returns post-patch state Two bugs in KubernetesSandboxService.patch_sandbox_metadata caused the nightly k8s mini E2E test_02_metadata_filter_and_logic to fail: 1. JSON merge patch (RFC 7396) on metadata.labels merges keys recursively — keys absent from the patch body are kept. The previous code computed the desired final labels dict (with deleted keys removed) and sent it, so deleted keys were never actually removed on the API server. 2. After the PATCH, the code re-fetched the workload via _get_workload_or_404, which goes through K8sClient.get_custom_object that prefers the informer cache. The informer is eventually consistent, so the read could land before the watch event arrived and return the pre-patch labels. Fix both by: - Building the merge-patch body with explicit None for deleted keys. - Using the API server's PATCH response (returned from patch_labels) directly, instead of re-reading via the cache. WorkloadProvider.patch_labels now accepts Dict[str, Optional[str]] and returns the patched workload dict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(ci): touch trigger comment to run k8s mini E2E on this PR The k8s-nightly-build workflow only runs on PRs that touch one of its trigger paths. This PR fixes server-side k8s logic but does not modify those paths, so add a date stamp to the existing trigger comment to include this PR in the matrix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…ver-config-links docs: fix server example config links
…dbox-group#898) * fix(egress): unblock SSE/chunked streaming through mitmproxy The mitmdump --set stream_large_bodies=1m option (added to prevent OOM on large downloads) buffers any response under 1 MiB before forwarding, which stalls LLM SSE / chunked streams of small chunks until the stream ends or the threshold is hit, producing perceptible stutter downstream. Introduce a bundled system addon that is always loaded by the egress launcher and forces flow.response.stream = True for responses with content-type: text/event-stream or transfer-encoding: chunked. Large- body OOM protection from stream_large_bodies stays intact for everything else. The previous example script add_header.py is renamed to system.py and repurposed as the always-on system addon (wire-transparent: no headers added or altered). User-supplied addons via OPENSANDBOX_EGRESS_MITMPROXY_SCRIPT are still loaded after the system addon and may observe or override its hooks. Refs: https://project.aone.alibaba-inc.com/v2/project/2135082/req/82131871 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(egress): case-insensitive header match and correct system addon docs Normalize content-type and transfer-encoding to lowercase before substring matching in the system addon, so legal mixed-case values like Transfer-Encoding: Chunked or Content-Type: Text/Event-Stream still trigger streaming instead of getting buffered by stream_large_bodies=1m. Also fix the transparent-mode doc, which referenced a non-existent OPENSANDBOX_EGRESS_MITMPROXY_SYSTEM_SCRIPT env var as a way to disable or swap the system addon. The launcher always appends the bundled system addon unconditionally; document that users override behavior via OPENSANDBOX_EGRESS_MITMPROXY_SCRIPT (loaded after the system addon). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
FilesystemAdapter read operations (readFile/readByteArray/readStream) caught every failure and logged it at ERROR with a full stack trace before rethrowing. A missing file (server returns HTTP 404 with code FILE_NOT_FOUND) is an expected, business-level outcome rather than a fault, so this floods callers' error logs and monitoring with noise for a normal control-flow case (e.g. polling for a not-yet- created stdout file). Distinguish "file not found" from genuine failures and log it at DEBUG instead of ERROR. The exception is still propagated unchanged. - Add SandboxError.FILE_NOT_FOUND constant. - Add Throwable.isFileNotFound() extension (statusCode 404 / code FILE_NOT_FOUND) so callers can also branch on it cleanly. - Route read-failure logging through logReadFailure() which downgrades not-found to DEBUG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse the expression-body function to a single line as required by the project's spotlessCheck (root ./gradlew spotlessCheck). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback: isFileNotFound() previously treated any HTTP 404 as not-found. A 404 whose body cannot be parsed is mapped to UNEXPECTED_RESPONSE and may signal a real endpoint/routing regression; downgrading those to DEBUG would hide genuine failures. Restrict detection to the explicit SandboxError.FILE_NOT_FOUND code (which the execd server returns for missing files) and add a regression test covering a bare 404 + UNEXPECTED_RESPONSE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ot-found-log-noise fix(sdk): avoid ERROR-level logs for expected file-not-found on read
Add timestamp, hostname, and user information to license verification output for CI debugging purposes.
…heck-poc fix(ci): add diagnostic output to license verification script
…-image-committer feat(ci): publish image-committer image via release workflow
…ty-ping-response fix(js-sdk): Accept empty ping responses
Decouples the kubernetes/ e2e suite from the assumption that it always runs against a freshly-built Kind cluster, so it can also be pointed at an externally-provided Kubernetes cluster (e.g. minikube, a shared dev cluster, or a CI-provisioned one) via KUBECONFIG. - New test/utils/cluster_mode.go centralises mode + env-driven defaults (E2E_MODE, E2E_POD_SECURITY_ENFORCE, registry / namespace / credential knobs for the pause/resume sub-suite). Default mode stays "kind", so existing `make test-e2e-main` behaviour is unchanged. - LoadImageToKindClusterWithName becomes a no-op when E2E_MODE!=kind, since images are expected to live in a registry the target cluster can pull from. - BeforeSuite skips docker-build / kind-load entirely outside Kind mode. - The pod-security label step is now controlled by E2E_POD_SECURITY_ENFORCE; setting it to an empty value skips the label, which is useful on clusters that enforce their own admission policy. Default value preserves the current "restricted" behaviour. - pause_resume_test.go: registry namespace / address / credentials are now overridable via env; deploy/undeploy switched to `make install/deploy/undeploy/uninstall` for consistency with e2e_test.go; registry-deployment.yaml is rendered as a template so the source image can be pointed at a mirror. - registry-deployment.yaml: parameterised image and switched the Service to ClusterIP (the suite only consumes it via cluster DNS, so the previous NodePort 30500 was unnecessary and can collide on shared clusters). - Added Ginkgo Labels (Core/Manager/Pool/Batch/Task/PauseResume) so consumers can use -ginkgo.label-filter to subset the suite.
…-e2e-test-refactor test(k8s): make e2e suite portable across cluster modes
…itter-v0.1.0 chore: bump image-committer to v0.1.0
Sync helm values with image-committer v0.1.0 release. Use sandbox-registry.cn-zhangjiakou.cr.aliyuncs.com mirror, note DockerHub location opensandbox/image-committer:v0.1.0 in comments. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…update_20260521 fix(sdk): use java duration for command request timeouts
…mage-committer-v0.1.0 chore(helm): bump image-committer to v0.1.0
…ndbox-group#943) The bootstrap script waited at most 30s for /opt/opensandbox/mitmproxy-ca-cert.pem before skipping system CA trust setup. When the egress sidecar is recovering from a transient failure (e.g. mitmproxy OOM-killed and being restarted with backoff), 30s is not enough and the sandbox starts without TLS interception support, silently breaking HTTPS for system libraries. Extend the wait to 300s and log the actual wait duration on success so the boot timeline is visible in execd logs.
…group#924) * fix(execd): kill entire process group on command cancel When a foreground command was cancelled (client disconnect, timeout, or DELETE /command), only the bash group leader received SIGKILL — child processes spawned via `&` or pipelines kept running as orphans because exec.CommandContext's internal kill targets a single pid, and killPid sent signals to the leader only. Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole group, since the leader is launched with Setpgid: true), mirroring runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL and to use kill(-pid, 0) for liveness probing, so Interrupt() also terminates descendants. Adds regression tests covering both cancel and Interrupt paths. Fixes opensandbox-group#922 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(execd): guard Interrupt against stale PID after command exit killPid now signals the whole process group (-pid). Combined with the fact that commandClientMap retains finished sessions, a late or retried Interrupt could otherwise terminate every process in an unrelated process group whose PGID has reused the recorded PID. - markCommandFinished clears kernel.pid alongside kernel.running so the stale PID is no longer accessible. - commandSnapshot now reads under c.mu.RLock for a consistent view of running/pid relative to markCommandFinished's write under c.mu.Lock. - Interrupt() (unix and windows) snapshots the kernel and refuses to signal when the command has already finished. Adds a regression test ensuring Interrupt on a completed session returns an error and that pid is cleared. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(execd): don't surface slow group teardown as Interrupt failure kill(2) on a process group only guarantees delivery to at least one member, and kill(-pid, 0) keeps reporting the group as observable while any unreaped zombie lingers. The previous post-SIGKILL probe ran for only 150ms and then returned a hard error, so Interrupt could surface a 500 even though the kill signal had already been delivered. Likewise on macOS, SIGKILL on a group that has been reduced to zombies returns EPERM, which the previous code reported as a kill failure even though SIGTERM had already taken effect. - After a successful SIGKILL, log a warning when the probe loop still observes the group instead of returning an error. - When SIGTERM was delivered but the SIGKILL syscall fails (commonly EPERM on a zombie-only group), log and return nil — the kill is in flight and the kernel will reap the group once Wait() runs. Adds a regression test that runs killPid against a Setpgid group with no concurrent reaper, exercising the zombie-lingering path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(execd): only group-kill on real cancellation, not after success Execute() defers cancel() for every foreground command, including successful ones, so the signal-forwarding goroutine's ctx.Done() branch also fired on the normal-success path. With the new group-wide SIGKILL on -cmd.Process.Pid, that post-completion signal could hit a recycled pid/pgid and kill an unrelated process group inside the sandbox. Gate the goroutine on the existing `done` channel (closed after cmd.Wait() returns or on start failure): exit cleanly when the command has finished, so only genuine cancellations — timeout, client abort, Interrupt — trigger the group kill. A double-check inside the ctx.Done() branch handles the race where ctx is cancelled at the same instant cmd.Wait() returns. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
chore: bump execd to v1.0.18
…y target (opensandbox-group#864) * feat(egress): add DELETE /policy endpoint for removing egress rules by target Add DELETE handler that accepts a JSON array of target strings, removes matching rules case-insensitively, and commits the updated policy. Targets not found are silently ignored (idempotent). Spec and README docs updated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(sdks): expose egress DELETE /policy across all sandbox SDKs Wires the new DELETE /policy endpoint through Go, JavaScript, Python (async + sync), Kotlin, and C# sandbox SDKs so users can remove egress rules by target through the supported facades. Regenerates the JS and Python OpenAPI clients (TypeScript and Kotlin generators now emit the delete operation; Python generator produces a new delete_policy module), then adds matching handwritten adapter/service/sandbox methods and unit tests. Extends the C# HttpClientWrapper with a DeleteAsync(path, body, ct) overload since DELETE-with-body was not previously supported. Adds an async Python e2e test (test_01ac_network_policy_delete) that provisions a sandbox with two allow rules, deletes one (plus a nonexistent target to verify idempotency), and confirms the policy mutation, defaultAction preservation, and resulting traffic behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(egress): DELETE response shape + status codes - Drop Policy field from no-match path (consistent with POST/PATCH success responses and the spec's DELETE example). - Return 500 instead of 400 for marshal/parse failures on self-synthesized policy JSON (internal inconsistency, not client error); upgrade matching log level to Error. - Revert unrelated trailing-whitespace cleanup in smoke-nft.sh copyright header. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
fix(openclaw): Support OPEN_SANDBOX_API_KEY&Repair entrypoint
Bumps [idna](https://github.com/kjd/idna) from 3.11 to 3.15. - [Release notes](https://github.com/kjd/idna/releases) - [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.md) - [Commits](kjd/idna@v3.11...v3.15) --- updated-dependencies: - dependency-name: idna dependency-version: '3.15' dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
…li/idna-3.15 chore(deps): bump idna from 3.11 to 3.15 in /cli
当 Pool 中已分配给 BatchSandbox 的 Pod 被外部删除时,alloc-status 注解中仍保留 已删除 Pod 的名称,导致 supplement 计算为 0,无法触发重新分配。 本次修复在 getSandboxRequest 中增加了存活检测:将已删除的 Pod 从有效分配中 排除并加入 ToRelease 队列,使 supplement > 0 从而触发 Pool 重新分配新 Pod。
|
Changed directories: .github、README.md、cli、components、examples. 📋 Recommended labels (based on changed files):
Other available labels:
💡 Tip: Use cc @longsuizhi |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
问题描述
在 Pool 模式下,当已分配给 BatchSandbox 的 Pod 被外部删除(手动 delete、节点驱逐、OOM Kill 等)后,BatchSandbox 不会从 Pool 中重新获取新 Pod,导致 sandbox 永久不可用。
根因
allocator.go的getSandboxRequest通过alloc-status注解获取已分配 Pod 列表计算:Pod 被删除后注解未清理,
len(allocated)不变,supplement = 0,永远不会触发重新分配。修复方案
在
getSandboxRequest中增加存活检测:getSandboxRequestalloc-status中的 Pod 与实际存活 Pod,过滤出已删除的 PodToRelease队列清理分配记录测试
poolallocation_types.go编译错误(缺少 DeepCopyObject),与本次修改无关