Skip to content

feat(server): implement multi-tenancy provider with file and HTTP backends#979

Open
Pangjiping wants to merge 9 commits into
opensandbox-group:mainfrom
Pangjiping:feature/multi-tenancy-provider
Open

feat(server): implement multi-tenancy provider with file and HTTP backends#979
Pangjiping wants to merge 9 commits into
opensandbox-group:mainfrom
Pangjiping:feature/multi-tenancy-provider

Conversation

@Pangjiping

@Pangjiping Pangjiping commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Summary

Wire TenantProvider interface through full request chain:

  • TenantProvider Protocol + FileTenantProvider + HTTPTenantProvider
  • Auth middleware multi-tenant/single-tenant mode branching
  • ContextVar-based tenant propagation (tenants/context.py)
  • Dynamic namespace resolution in KubernetesSandboxService
  • [tenants] config section in server.toml (provider=file|http)
  • Startup guards: docker+tenants fatal, api_key+tenants mutual exclusion
  • HTTP provider: per-key TTL cache, sync fetch, max_stale, 401 eviction
  • 27 e2e tests covering both providers end-to-end
  • Update OSEP-0012 with formal interface spec and HTTP contract
  • closes [Feature] 拆分 chatrs 让一个集群能承载多个命名空间沙箱 #972 沙箱能否支持多租户 #497

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

@Pangjiping Pangjiping added documentation Improvements or additions to documentation component/server feature New feature or request labels Jun 5, 2026

@Pangjiping Pangjiping left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid architecture — TenantProvider protocol, ContextVar propagation, namespace-scoped isolation all look correct. One critical issue (sync HTTP blocking the event loop) should be fixed before merge; the rest are improvements.

Comment thread server/opensandbox_server/tenants/http_provider.py
Comment thread server/opensandbox_server/tenants/http_provider.py
Comment thread server/opensandbox_server/tenants/__init__.py Outdated
Comment thread server/tests/test_multi_tenancy.py Outdated
Comment thread server/opensandbox_server/tenants/models.py Outdated
@Pangjiping Pangjiping marked this pull request as ready for review June 6, 2026 07:24

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d02b0be6c3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/middleware/auth.py
Comment thread kubernetes/charts/opensandbox-server/templates/server.yaml
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/tenants/http_provider.py Outdated
Comment thread server/opensandbox_server/main.py
Comment thread server/opensandbox_server/tenants/file_provider.py

@Pangjiping Pangjiping left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough review. All feedback addressed:

Critical (sync HTTP): Already using asyncio.to_thread() in auth middleware to offload lookup() from the event loop.

Thundering herd: Already implemented per-key singleflight with threading.Event in _fetch_and_cache.

HTTPS enforcement: Warning logged at startup for non-HTTPS endpoints.

Naming / tests / frozen dataclass: resolve_tenants_path renamed, validate_tenant_config extracted and tested, api_keys changed to Tuple[str, ...].

Additional fixes in follow-up commit:

  • Helm chart: removed subPath from tenants ConfigMap mount → directory mount enables hot-reload
  • HTTP provider: leader errors propagated to singleflight waiters (no more 401 for provider outages)
  • K8s service: cross-namespace sandbox lookup fallback for background renew workers and proxy path
  • Spec: namespace validation deferred to sandbox creation time

@Pangjiping Pangjiping force-pushed the feature/multi-tenancy-provider branch from e49fab8 to 835d150 Compare June 10, 2026 16:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 835d150352

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/tenants/http_provider.py
Comment thread server/opensandbox_server/api/lifecycle.py
Comment thread server/opensandbox_server/main.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65ed62764b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/services/snapshot_service.py Outdated
Comment thread kubernetes/charts/opensandbox-server/templates/server.yaml
Comment thread server/opensandbox_server/tenants/http_provider.py
Comment thread server/opensandbox_server/services/k8s/snapshot_runtime.py
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py
Pangjiping and others added 8 commits June 11, 2026 17:37
…kends

Wire TenantProvider interface through full request chain:
- TenantProvider Protocol + FileTenantProvider + HTTPTenantProvider
- Auth middleware multi-tenant/single-tenant mode branching
- ContextVar-based tenant propagation (tenants/context.py)
- Dynamic namespace resolution in KubernetesSandboxService
- [tenants] config section in server.toml (provider=file|http)
- Startup guards: docker+tenants fatal, api_key+tenants mutual exclusion
- HTTP provider: per-key TTL cache, sync fetch, max_stale, 401 eviction
- 27 e2e tests covering both providers end-to-end
- Update OSEP-0012 with formal interface spec and HTTP contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New script scripts/python-k8s-e2e-multi-tenant.sh with file/http variants
- Add multi-tenant-file and multi-tenant-http matrix entries to nightly workflow
- Helm chart: support tenantsToml value for mounting tenants.toml ConfigMap
- HTTP variant: mock Python provider on host, accessible via Kind bridge IP
- Both variants verify auth (valid key→200, invalid→401, missing→401)
  then run full SDK kubernetes-mini test suite with tenant key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use Kind control-plane container's gateway IP as primary detection,
with fallbacks. Add pre-deploy reachability check for the mock endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix sync HTTP blocking event loop: use asyncio.to_thread() in auth middleware
- Add per-key singleflight pattern to prevent thundering herd on TTL expiry
- Log warning at startup when HTTP tenant endpoint is not HTTPS
- Rename _resolve_tenants_path to resolve_tenants_path (public API)
- Use tuple for TenantEntry.api_keys for true frozen immutability
- Extract validate_tenant_config() so startup guards are testable
- Add [tenants] section to configuration.md, README, and example configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Helm chart: remove subPath from tenants ConfigMap mount, enabling
  hot-reload via directory mount with kubelet-driven updates
- HTTP provider: propagate leader fetch errors to singleflight waiters
  so provider outages return 503 instead of 401 INVALID_API_KEY
- K8s service: add cross-namespace sandbox lookup fallback for
  background workers (renew, proxy) that lack tenant context
- Spec: defer namespace validation from startup to sandbox creation
- Wire tenant_provider into sandbox_service after import

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two tenant-isolation fixes:
1. Store namespace on SnapshotRecord, filter listings and verify ownership
   on get/delete/restore so cross-tenant snapshot access is blocked.
2. Pass tenant-resolved namespace into SnapshotRuntime create_snapshot
   and delete_snapshot so SandboxSnapshot CRs are created in the
   correct tenant namespace instead of the static config default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Pangjiping Pangjiping force-pushed the feature/multi-tenancy-provider branch from 65ed627 to ca2802f Compare June 11, 2026 09:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca2802ffe8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

record = SnapshotRecord(
id=str(uuid4()),
source_sandbox_id=sandbox_id,
namespace=namespace,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pass snapshot namespaces during recovery

Because new snapshot records now store the tenant namespace, restart recovery also needs to use that stored namespace. The recovery path still calls inspect_snapshot(record.id, ...) and delete_snapshot(record.id, ...) without namespace=record.namespace, so a server restart while a tenant snapshot is CREATING/DELETING makes the Kubernetes runtime inspect/delete the CR in its configured default namespace instead of the tenant namespace, causing ready snapshots to be marked failed or deleting snapshots to leave their CR behind.

Useful? React with 👍 / 👎.

Comment on lines 118 to 119
egress_mode: str = EGRESS_MODE_DNS,
credential_proxy_enabled: bool = False,
) -> Dict[str, Any]:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore BatchSandbox credential proxy parameter

In Kubernetes deployments using the default batchsandbox workload provider, sandbox creation now fails before reaching Kubernetes because KubernetesSandboxService.create_sandbox() still always passes the credential_proxy_enabled keyword to create_workload, but this signature no longer accepts it. That TypeError is converted into a 500 for every BatchSandbox create request; restoring this parameter and forwarding it to the egress/main-container setup also preserves credential-proxy behavior.

Useful? React with 👍 / 👎.

- Use sandbox_service.namespace as fallback for single-tenant snapshot
  creation instead of hardcoded "default"
- Track snapshot_id -> namespace in KubernetesSnapshotRuntime so
  inspect/wait/delete use the creation namespace, not static config
- Use .get() instead of .pop() for inflight_error so all waiters
  see the leader's error, not just the first one
- Convert K8s namespace-not-found errors to 400 instead of 500
  in sandbox creation path

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/server documentation Improvements or additions to documentation feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] 拆分 chatrs 让一个集群能承载多个命名空间沙箱

4 participants