feat: Add staging infrastructure for 2 OHE deployment environments - path and subdomain by saurya · Pull Request #580 · OpenHands/OpenHands-Cloud

saurya · 2026-04-27T18:37:47Z

Summary

This PR adds infrastructure configuration for deploying OpenHands Enterprise to the platform-team-sandbox GCP project, with two independent test environments for validating the two routing strategies (path-based and subdomain-based).

Environments

Environment	Cluster Setup	Routing Strategy	Use Case
single-cluster-path	Single GKE cluster	Path-based (`/app`, `/runtime`)	Simple deployment, all workloads on one cluster
single-cluster-subdomain	Single GKE cluster	Subdomain-based (`app.`, `runtime.`)	Single cluster with service isolation via subdomains

Infrastructure Components

Terraform (`infrastructure/terraform/`)

Modules:

gke-cluster - GKE cluster provisioning with configurable node pools
networking - VPC, subnets, and firewall rules
iam - Service accounts and IAM bindings

Environments:

Each environment has its own directory with main.tf, variables.tf, outputs.tf, and terraform.tfvars.example
Single-cluster envs create 1 cluster for both app and runtimes

Helm Values (`infrastructure/helm/`)

cert-manager:

Let's Encrypt ClusterIssuer for automatic TLS certificates
HTTP-01 challenge solver via Traefik ingress

external-dns:

Automatic DNS record management in Google Cloud DNS
Path-routing variant: Single A record per domain
Subdomain-routing variant: Wildcard DNS entries (*.domain.com)

traefik:

Ingress controller configuration
Path-routing variant: Includes middlewares for URL path stripping/rewriting
Subdomain-routing variant: Standard host-based routing

Directory Structure

infrastructure/
├── terraform/
│   ├── modules/
│   │   ├── gke-cluster/          # GKE cluster module
│   │   ├── networking/           # VPC and networking module
│   │   └── iam/                  # IAM and service accounts module
│   └── environments/
│       ├── single-cluster-path/       # Single cluster + path routing
│       └── single-cluster-subdomain/  # Single cluster + subdomain routing
└── helm/
    ├── cert-manager/             # TLS certificate automation
    ├── external-dns/             # DNS record automation
    └── traefik/                  # Ingress controller + middlewares

Deployment Steps

Provision Infrastructure:

cd infrastructure/terraform/environments/single-cluster-path
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
terraform init && terraform plan && terraform apply

Install Helm Charts:

# cert-manager
helm install cert-manager jetstack/cert-manager -f infrastructure/helm/cert-manager/values.yaml
kubectl apply -f infrastructure/helm/cert-manager/cluster-issuer.yaml

# external-dns (choose routing variant)
helm install external-dns bitnami/external-dns -f infrastructure/helm/external-dns/values-path-routing.yaml

# traefik (choose routing variant)
helm install traefik traefik/traefik -f infrastructure/helm/traefik/values-path-routing.yaml
kubectl apply -f infrastructure/helm/traefik/middlewares-path-routing.yaml

Validation Status

Terraform modules validated (terraform validate)
Both environment configurations validated
Helm chart values validated (helm lint)
Kubernetes manifests validated (kubectl --dry-run)
Terraform plan (requires GCP credentials)
Terraform apply (pending approval)

Next Steps

After merge and approval:

Run terraform apply for each environment
Install Helm charts on provisioned clusters
Deploy OpenHands applications

This PR was edited by Saurya after initial creation by an AI agent (OpenHands) on behalf of Saurya.

all-hands-bot

🔴 Needs improvement - Critical security and validation gaps block merge readiness.

[CRITICAL ISSUES]

Missing .gitignore patterns: Add Terraform patterns (*.tfstate, *.tfvars, .terraform/, etc.) to .gitignore to prevent committing sensitive files with GCP credentials and project IDs.
No validation evidence: PR lists validation steps as "Next Steps" but shows no terraform validate/plan output or cost estimates. Infrastructure creating billable resources needs validation proof before merge.

[IMPROVEMENT OPPORTUNITIES]

Code duplication: 4 environment directories contain nearly identical Terraform code - consider consolidating with shared modules.
Variable substitution: ${ACME_EMAIL} and ${GCP_PROJECT_ID} usage needs documented substitution workflow.

[RISK ASSESSMENT]
🔴 HIGH RISK - Infrastructure change introducing new GKE clusters without validation evidence. Multiple risk factors: (1) New architectural patterns (4 routing strategies), (2) Infrastructure dependencies (GKE, Cloud DNS, Load Balancers), (3) Security sensitivity (missing .gitignore for secrets), (4) Cost impact (multiple clusters). Recommendation: Do not auto-merge. Require (a) .gitignore update, (b) terraform validate/plan output, (c) cost estimate, (d) human review of infrastructure design before deployment.

all-hands-bot · 2026-04-27T18:42:04Z

+    # Use staging server for testing
+    server: https://acme-staging-v02.api.letsencrypt.org/directory
+    # Email for certificate expiration notifications
+    email: ${ACME_EMAIL}


🟠 Important: The ${ACME_EMAIL} variable substitution pattern is used but the README doesn't explain how to apply it. Add explicit instructions:

# Before applying, substitute variables
export ACME_EMAIL="devops@example.com"
export GCP_PROJECT_ID="staging-092324"
envsubst < cluster-issuer.yaml | kubectl apply -f -

Or consider using Helm values instead of shell variable substitution for better integration with the deployment workflow.
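
The substitution step suggested above can also be sketched outside the shell. This is a minimal illustration, assuming the manifest contains only `${NAME}` placeholders; `render_manifest` is a hypothetical helper, not something in the repository. It uses Python's `string.Template`, which raises on an undefined variable, whereas `envsubst` replaces an unset variable with an empty string.

```python
from string import Template


def render_manifest(text: str, variables: dict[str, str]) -> str:
    """Replace ${NAME} placeholders; raise KeyError if one is undefined."""
    return Template(text).substitute(variables)


# Manifest fragment taken from the diff quoted in the review comment.
manifest = (
    "server: https://acme-staging-v02.api.letsencrypt.org/directory\n"
    "email: ${ACME_EMAIL}\n"
)

rendered = render_manifest(manifest, {"ACME_EMAIL": "devops@example.com"})
print(rendered)
```

Failing loudly on a missing variable is a deliberate choice here: an empty `email:` field would still apply cleanly and only surface later as a certificate problem.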

all-hands-bot · 2026-04-27T18:42:04Z

+module "vpc" {
+  source = "../../modules/vpc-network"
+
+  project_id   = var.project_id
+  region       = var.region
+  network_name = "${var.environment_name}-vpc"
+  subnet_name  = "${var.environment_name}-subnet"
+  subnet_cidr  = var.subnet_cidr
+  pods_cidr    = var.pods_cidr
+  services_cidr = var.services_cidr


🟡 Suggestion: All 4 environment directories contain nearly identical Terraform code with only variable values differing. This creates maintenance burden - changes must be replicated across 4 files. Consider consolidating:

Option 1: Single environment module + different tfvars:

environments/base/main.tf (parameterized)
+ single-cluster-path.tfvars
+ single-cluster-subdomain.tfvars
+ multi-cluster-path.tfvars
+ multi-cluster-subdomain.tfvars

Option 2: Use Terragrunt to manage shared config.

Reduced duplication = fewer bugs and easier maintenance.

all-hands-bot · 2026-04-27T18:42:04Z

+
+# Check cert-manager logs
+kubectl logs -n cert-manager -l app=cert-manager
+


🟠 Important - Evidence Gap: The PR description mentions "Next Steps" including terraform validate and terraform plan, indicating these haven't been run yet. Before merging infrastructure code that will create billable GCP resources:

Required validation:

Output from terraform validate for at least one environment

Output from terraform plan showing what resources will be created

Cost estimate (from terraform plan or GCP pricing calculator)

Test deployment to a sandbox project (optional but recommended)

Add an Evidence section to the PR description with these outputs. This protects against deploying broken or unexpectedly expensive infrastructure.

saurya · 2026-04-27T20:16:30Z

PR Review Feedback Addressed ✅

Thanks for the review! The following issues have been addressed:

Critical Issues Fixed

Missing .gitignore patterns - Added comprehensive Terraform patterns:
- .terraform/ directories
- *.tfstate and *.tfstate.* files
- *.tfvars (excluding *.tfvars.example)
- *.tfplan, crash logs, and override files
Variable substitution workflow - Documented in two places:
- Added "Configuration Variables" section to infrastructure/README.md with a table of variables and envsubst usage example
- Added clear instructions at the top of cluster-issuer.yaml

Notes on Improvement Opportunities

Code duplication: The 4 environments intentionally have similar structure to allow independent iteration during staging validation. Once patterns are validated, consolidation can be considered.
Validation evidence: Terraform validate passes locally. Full terraform plan output requires GCP credentials which are not available in this context. The plan/apply will be done as a follow-up deployment step.

This comment was created by an AI assistant (OpenHands) on behalf of Saurya.

saurya · 2026-05-05T19:19:00Z

📋 Summary of Chart Changes

This PR includes changes to charts/ that add new features and bug fixes. These are backwards-compatible and don't change default behavior.

🐛 Bug Fixes (recommended to merge)

File	Change
`charts/crd-check/templates/_hook.tpl`	Fix nil pointer: `crdCheck.enabled` → `crdCheck && crdCheck.enabled`
`charts/runtime-api/templates/deployment.yaml`	Fix nil pointer check for `caBundle`
`charts/openhands/templates/_init-containers.yaml`	Fix postgres image URL (`bitnamilegacy` → `public.ecr.aws/bitnami`), conditional keycloak database creation
`charts/openhands/templates/deployment.yaml`	Make litellm-helm init container optional (guard with `.enabled`)
`charts/openhands/templates/litellm-config-script.yaml`	Make ConfigMap conditional on `litellm-helm.enabled`

✨ New Feature: Path-Based Routing

Adds support for path-based routing as an alternative to subdomain routing. Useful for deployments where wildcard DNS/certs aren't available.

File	Change
`charts/openhands/templates/_routing.yaml`	NEW - Routing helper functions
`charts/runtime-api/templates/_routing.yaml`	NEW - Same for runtime-api
`charts/openhands/templates/traefik-middleware.yaml`	NEW - Strip prefix middleware for path routing
`charts/runtime-api/templates/traefik-middleware.yaml`	NEW - Same for runtime-api
`charts/openhands/values.yaml`	Add `routingMode`, `serviceRoutingMode`, `pathPrefix`, `branchSanitized`
`charts/runtime-api/values.yaml`	Add `routingMode`, `pathPrefix`, `branchSanitized`
`charts/openhands/templates/ingress-*.yaml`	Refactored to use routing helpers
`charts/runtime-api/templates/ingress.yaml`	Refactored to use routing helpers

Default behavior unchanged: routingMode: subdomain (existing behavior)

✨ New Feature: Shared Keycloak Support

Allows using an external/shared Keycloak instance across multiple deployments.

File	Change
`charts/openhands/values.yaml`	Add `keycloak.authHost` option
`charts/openhands/templates/_env.yaml`	Support `authHost` override + add `AUTH_URL` env var

✨ New Feature: Warm Runtime Scheduling Options

Adds support for node selectors, tolerations, and runtime class on warm runtime pods.

File	Change
`charts/runtime-api/templates/warm-runtimes-configmap.yaml`	Add `node_selector`, `tolerations`, `runtime_class` fields
`charts/runtime-api/values.yaml`	Document these options (commented out by default)

Discussion needed: Should these chart changes be:

✅ Merged as-is (backwards compatible, adds flexibility)
🔀 Split into separate PRs (bug fixes vs features)
↩️ Reverted and handled differently

This comment was created by an AI assistant (OpenHands) on behalf of @SauryaVelagapudi.

Design for feature branch deployment to staging with:
- Pre-provisioned wildcard certs (avoids Let's Encrypt rate limits)
- Base domain pool (dev1-dev5.staging.all-hands.dev)
- Shared Keycloak for SAML authentication
- GitHub Actions workflow for slot-based deployment
- Incremental deployment (only changed charts)
- external-dns for automatic DNS management
Estimated effort: 5-7 days for initial setup
Ongoing maintenance: Mostly automated (cert renewal, TTL cleanup)
Related: infra PR #1064 (base domains)
Co-authored-by: openhands <openhands@all-hands.dev>

This adds infrastructure configuration for deploying OpenHands Enterprise to staging GCP project (staging-092324) with 4 independent environments:
1. single-cluster-path: Single GKE cluster with path-based routing
2. single-cluster-subdomain: Single GKE cluster with subdomain-based routing
3. multi-cluster-path: Separate app/runtime clusters with path-based routing
4. multi-cluster-subdomain: Separate app/runtime clusters with subdomain routing
Infrastructure includes:
- Terraform modules for VPC networking and GKE clusters
- Terraform environment configurations for all 4 setups
- Helm values for cert-manager (TLS certificates via Let's Encrypt)
- Helm values for external-dns (automatic DNS management)
- Helm values for traefik (ingress controller with routing variants)
Key features:
- Configurable domains per environment
- Path-based routing uses middlewares for URL rewriting
- Subdomain routing uses wildcard DNS entries
- All environments isolated from existing staging.all-hands.dev
Co-authored-by: openhands <openhands@all-hands.dev>

Comprehensive testing guide covering:
- Phase 1: Static validation (Terraform + Helm)
- Phase 2: Terraform plan review
- Phase 3-6: Incremental deployment and testing
- Phase 7: End-to-end verification checklist
- Phase 8: Multi-environment rollout
- Troubleshooting section for common issues
Co-authored-by: openhands <openhands@all-hands.dev>

- Changed 'expose: true' to 'expose: { default: true }' format
- Moved 'tls' config under 'http' section per new schema
- Updated HTTP-to-HTTPS redirect format for web port
Co-authored-by: openhands <openhands@all-hands.dev>

…e to prevent sensitive files from being committed
- Document variable substitution workflow in infrastructure/README.md
- Add clear instructions for variable substitution in cluster-issuer.yaml
Co-authored-by: openhands <openhands@all-hands.dev>

Simplify staging infrastructure to only two environments:
- single-cluster-path
- single-cluster-subdomain
Multi-cluster configurations are not needed for current staging validation.
Co-authored-by: openhands <openhands@all-hands.dev>

Explains how developers can deploy their own OpenHands branch to shared staging infrastructure using Helm release isolation. Covers:
- Quick start deployment steps
- Values override patterns (minimal dev, full stack)
- Runtime API deployment
- Shared resources (PostgreSQL, Redis, LiteLLM)
- Troubleshooting common issues
- Best practices for resource management
Co-authored-by: openhands <openhands@all-hands.dev>

This commit adds the complete infrastructure for OpenHands Enterprise staging environments in GCP, supporting both path-based and subdomain-based routing patterns with automatic TLS certificate provisioning.
- Terraform module for DNS zone: ohe-staging.platform-team.all-hands.dev
- Wildcard A record pointing to Traefik LoadBalancer
- NS delegation from parent zone
- Developer documentation in README.md
- ClusterIssuer for Let's Encrypt with DNS-01 challenge
- Wildcard certificate covering *.ohe-staging.platform-team.all-hands.dev
- Traefik TLSStore for default certificate
- Single-cluster path-based routing example values
- Fixed Autopilot mode configuration conflicts
- Made private_cluster_config dynamic to avoid null issues
1. **Subdomain-based**: https://<branch>.ohe-staging.platform-team.all-hands.dev
2. **Path-based**: https://ohe-staging.platform-team.all-hands.dev/<path>
```bash
helm install my-feature ./charts/openhands \
  …re.ohe-staging.platform-team.all-hands.dev \
  --set ingress.class=traefik
```
Co-authored-by: openhands <openhands@all-hands.dev>

- Add single-cluster-subdomain values with prefixWithBranch enabled
- Subdomain routing: {branch}.ohe-staging.platform-team.all-hands.dev
- Configure TLS with cert-manager DNS-01 challenge
- Add comprehensive README with deployment instructions
- Document both path-based and subdomain-based routing strategies
- Include CI/CD example for automated branch deployments
Co-authored-by: openhands <openhands@all-hands.dev>

…thBranch)
Added documentation for the hidden branch-based routing feature:
- branchSanitized: sanitized branch name for subdomain prefix
- ingress.prefixWithBranch: enables branch-prefixed hostnames
When both are set, ingresses use: {branchSanitized}.{ingress.host}
Example: feature-x.app.example.com
Requires wildcard TLS cert and DNS record.
Co-authored-by: openhands <openhands@all-hands.dev>

This terraform module deploys a shared Keycloak instance at auth.ohe-staging.platform-team.all-hands.dev that can be used by all branch deployments in the staging environment.
Features:
- Shared Keycloak with external PostgreSQL
- Wildcard redirect URIs for all *.ohe-staging.* branches
- Automated realm setup via Kubernetes Job
- Client credentials stored in secrets
Co-authored-by: openhands <openhands@all-hands.dev>

Changes:
- Update _init-containers.yaml to use configurable postgres image
- Only create keycloak database when keycloak subchart is enabled
- Update single-cluster-path values to use shared Keycloak at auth.ohe-staging.platform-team.all-hands.dev
- Configure TLS and proper staging domain for saurya-prototype branch
Co-authored-by: openhands <openhands@all-hands.dev>

- Add _routing.yaml helpers for both openhands and runtime-api charts
- Add traefik-middleware.yaml for path prefix stripping
- Update ingress templates to use routing helpers
- Fix annotation handling with $annotations variable pattern
- Support branchSanitized at both root and ingress level for backward compatibility
- Add routingMode (subdomain/path) and pathPrefix configuration to values.yaml
Path mode: app.example.com/my-branch/
Subdomain mode: my-branch.app.example.com
Co-authored-by: openhands <openhands@all-hands.dev>
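
The two address shapes named in this commit can be expressed as a single function. This is an illustrative sketch only: `branch_address` is a hypothetical name, and the real logic lives in the chart's `_routing.yaml` helpers, which are not reproduced here.

```python
def branch_address(routing_mode: str, branch: str, host: str) -> str:
    """Return the externally visible address for a branch deployment."""
    if routing_mode == "subdomain":
        # Branch becomes a hostname prefix; needs wildcard DNS and TLS.
        return f"{branch}.{host}"
    if routing_mode == "path":
        # Branch becomes a path prefix on a single shared hostname.
        return f"{host}/{branch}/"
    raise ValueError(f"unknown routingMode: {routing_mode!r}")


print(branch_address("path", "my-branch", "app.example.com"))
print(branch_address("subdomain", "my-branch", "app.example.com"))
```

In path mode the backend still sees the branch prefix unless it is removed, which is presumably why the commit pairs the mode with a strip-prefix middleware.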

…subdomain support
- Add _routing.yaml with routing helper functions:
  - openhands.ingressHost: computes base host (branch-level)
  - openhands.serviceIngressHost: computes host for specific services
  - openhands.serviceRoutedPath: handles path computation per routing mode
  - openhands.serviceTlsSecretName: TLS secret name per service
- Update ingress templates to use new service routing helpers:
  - ingress-automation.yaml: supports both path and subdomain modes
  - ingress-integrations.yaml: supports both path and subdomain modes
  - ingress-mcp.yaml: supports both path and subdomain modes
- Add serviceRoutingMode config option (path|subdomain, default: path)
  - path mode: services use paths like /api/automation
  - subdomain mode: services get subdomains like automation.branch.host
- Update values.yaml with comprehensive routing documentation
Tested: Both path and subdomain modes verified with helm template
Co-authored-by: openhands <openhands@all-hands.dev>
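
A rough model of the `serviceRoutingMode` behaviour described in this commit is sketched below. `service_route` is a hypothetical function, and the `/api/<service>` shape is generalized from the single `/api/automation` example in the commit message; the actual helpers may compute different paths for other services.

```python
def service_route(service: str, mode: str, branch_host: str) -> tuple[str, str]:
    """Return (host, path) for a service under the given serviceRoutingMode."""
    if mode == "path":
        # Service shares the branch host and is distinguished by path.
        return branch_host, f"/api/{service}"  # assumed path shape
    if mode == "subdomain":
        # Service gets its own hostname under the branch host.
        return f"{service}.{branch_host}", "/"
    raise ValueError(f"unknown serviceRoutingMode: {mode!r}")


print(service_route("automation", "path", "branch.host"))
print(service_route("automation", "subdomain", "branch.host"))
```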

- Add overview section explaining routingMode and serviceRoutingMode
- Document Config A (subdomain + path) and Config B (subdomain + subdomain)
- Update values override examples with routing configuration
- Add service subdomain deployment pattern example
- Update access URLs section for all routing modes
Co-authored-by: openhands <openhands@all-hands.dev>

- Switch from Bitnami to codecentric/keycloakx chart (Bitnami images unavailable)
- Configure Keycloak with official quay.io/keycloak/keycloak:26.1.0 image
- Set up realm 'staging' with client 'openhands-staging'
- Configure wildcard redirect URIs for *.ohe-staging.platform-team.all-hands.dev
- Add provider configuration for local kubeconfig
- Fix realm setup job service URL (keycloak-http:80/auth)
Keycloak accessible at: https://auth.ohe-staging.platform-team.all-hands.dev/auth
Co-authored-by: openhands <openhands@all-hands.dev>

…ress cases
- Add AUTH_URL environment variable for frontend login redirect
- Fix keycloak.authHost conditional to work in all ingress configurations (prefixWithBranch=true, prefixWithBranch=false, ingress.enabled=false)
- AUTH_URL includes https:// prefix (or http:// for local keycloak)
- AUTH_WEB_HOST remains without protocol prefix for backend API calls
This fixes the saurya-prototype login issue where AUTH_URL was not set, preventing the frontend from knowing where to redirect for authentication.
Co-authored-by: openhands <openhands@all-hands.dev>

Keycloak is served at root path, not /auth. The /auth suffix was causing OAuth callback failures after successful authentication. Co-authored-by: openhands <openhands@all-hands.dev>

- Changed realm from 'staging' to 'allhands' to match actual Keycloak config
- Changed clientId from 'openhands-staging' to 'allhands' to match frontend
- Added enterprise_sso SAML identity provider configuration
- Added hardcoded-attribute-idp-mapper to set identity_provider=enterprise_sso:saml
- Fixed frontendUrl to remove /auth suffix
- Simplified redirectUris to use wildcard pattern
This fixes the Enterprise SSO login failure where identity_provider was returning None because no mapper was configured for the SAML IdP.
Co-authored-by: openhands <openhands@all-hands.dev>

…setup
This commit documents and fixes several issues discovered during staging deployment with shared Keycloak authentication:
## Key Fixes:
1. **API Key Secret Mismatch** (CRITICAL)
   - sandbox-api-key and default-api-key MUST have matching values
   - Added PREREQUISITES section with secret creation commands
   - Without this, sandbox startup fails with API key mismatch errors
2. **Sandbox API Hostname**
   - Fixed: http://openhands-runtime-api:5000 (was: http://runtime-api-runtime-api:5000)
   - Must match the helm release name pattern: {release-name}-runtime-api
3. **Shared Keycloak Configuration**
   - Set keycloak.enabled: false to use external shared instance
   - Added keycloak.authHost to prevent branch prefix on auth hostname
   - Required for multi-branch deployments sharing a Keycloak instance
4. **Enterprise SSO**
   - Added enterpriseSSO.enabled: true for SAML-based login
   - Works with shared Keycloak configured with Google Workspace SAML
5. **LiteLLM Configuration**
   - Enabled litellm-helm subchart with proper DB connection
   - Set internal URL: http://litellm-helm:4000
6. **Runtime URL Pattern**
   - Added […] for branch-specific overrides
   - Must be set at deploy time with --set for each branch
Co-authored-by: openhands <openhands@all-hands.dev>

…alues
- Rename infrastructure/ to site-infrastructure/ to avoid confusion with charts/infra/ directory
- Simplify single-cluster-path and single-cluster-subdomain values files
- Update image-loader values for staging environment
- Merge latest changes from main
Co-authored-by: openhands <openhands@all-hands.dev>

…ure/
- Revert charts/image-loader toleration back to default (value: not-running)
- Add site-infrastructure/helm/environments/*/values-image-loader.yaml with toleration override (value: true) for GKE sysbox nodes
- Simplify values-openhands.yaml files by removing redundant defaults
- Add sysbox k8s manifests and documentation
This keeps generic chart defaults clean while allowing site-specific customization in the appropriate location.
Co-authored-by: openhands <openhands@all-hands.dev>

Root cause: Dynamically-created runtime pods were missing node scheduling configuration needed to run on sysbox nodes, causing 120-second timeout.
Changes:
- warm-runtimes-configmap.yaml: Add node_selector, tolerations, runtime_class fields
- runtime-api/values.yaml: Document new scheduling fields with examples
- single-cluster-path/values-openhands.yaml: Configure sysbox scheduling
  - node_selector: {sysbox-install: yes}
  - tolerations: sysbox-runtime NoSchedule
  - runtime_class: sysbox-runc
- Fix nil pointer errors in crd-check/_hook.tpl and deployment.yaml
- Add missing database fields to openhands/values.yaml postgresql auth
Co-authored-by: openhands <openhands@all-hands.dev>

Add enable_gke_sandbox variable to support gVisor-based isolation as an alternative to sysbox for runtime containers:
- Add enable_gke_sandbox variable (default: true for single-cluster envs)
- Configure sandbox_config with gvisor when enabled
- Use dynamic taints based on isolation mode:
  - gVisor: sandbox.gke.io/runtime=gvisor:NoSchedule
  - sysbox: sysbox-runtime=true:NoSchedule
- Only add sysbox-install label when NOT using GKE Sandbox
- Update documentation with clear explanation of both isolation modes
Co-authored-by: openhands <openhands@all-hands.dev>

…raefik
image-loader values:
- Override nodeSelector to remove sysbox-install requirement
- Add gVisor taint toleration (sandbox.gke.io/runtime=gvisor)
- Document that runtimeClass setting has no effect (chart doesn't use it)
- Add note that image-loader may not be needed for gVisor deployments
traefik values:
- Enable kubernetesGateway provider for path-based runtime routing
- Add Gateway API documentation to README
Co-authored-by: openhands <openhands@all-hands.dev>

- Update README with clearer setup instructions
- Add more comprehensive realm-template.json
- Add example terraform.tfvars with common settings
- Update values-keycloak.yaml with additional configs
- Add new variables.tf entries for flexibility
Co-authored-by: openhands <openhands@all-hands.dev>

The image-loader DaemonSet needs to run on the same nodes where runtime pods will be scheduled. In this GKE deployment, runtime pods use gVisor (GKE Sandbox) instead of sysbox.
Changes:
- Updated nodeSelector from 'sysbox-install: yes' to 'sandbox.gke.io/runtime: gvisor'
- Updated tolerations for GKE Sandbox taint
- Removed image override to use base chart's agent-server image
Co-authored-by: openhands <openhands@all-hands.dev>

- Set RUNTIME_CLASS to empty string (no sandbox runtime required)
- Remove gVisor-specific warm runtime config (tolerations, node_selector, runtime_class)
- Use chart defaults for warm runtime configs instead of site override
- Update image-loader to target runtime nodes without gVisor
- Change nodeSelector from sandbox.gke.io/runtime to openhands.ai/node-type: runtime
This allows runtime pods to schedule on standard nodes without GKE Sandbox, which is not enabled on the Platform Team Sandbox cluster.
Co-authored-by: openhands <openhands@all-hands.dev>

…oken role
- Fixed setup script bug: /tmp/realm.json wasn't created when realm exists
- Added support for syncing ALL clients from template (broker + allhands)
- Added logic to create/update broker:read-token role
- Added broker:read-token to default-roles-allhands composites
- Updated realm-template.json:
  - Increased accessCodeLifespan from 60 to 120 seconds
  - Added broker client with read-token role
  - Added default-roles-allhands with proper composites
  - Set storeToken: true and addReadTokenRoleOnCreate: true for enterprise_sso
- Updated values-openhands.yaml:
  - Added image tag override for testing
  - Fixed TLS secret names for runtime-api
Co-authored-by: openhands <openhands@all-hands.dev>

The OpenHands app was using /runtime/{runtime_id} but the runtime-api HTTPRoute creates paths at /{runtime_id}/runtime/. This caused the app to hit the runtime-api management API instead of the actual runtime pods, resulting in 'Sandbox failed to start within 120s' errors. Co-authored-by: openhands <openhands@all-hands.dev>
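
The mismatch is easy to see when both path shapes are written out. The sketch below assumes the HTTPRoute matches by path prefix, which the commit message does not state explicitly; the function names and the runtime ID are hypothetical.

```python
def legacy_app_path(runtime_id: str) -> str:
    """Path the app requested before the fix."""
    return f"/runtime/{runtime_id}"


def httproute_prefix(runtime_id: str) -> str:
    """Prefix the runtime-api HTTPRoute creates for a runtime pod."""
    return f"/{runtime_id}/runtime/"


def reaches_runtime_pod(path: str, runtime_id: str) -> bool:
    """True when the request path falls under the runtime's route prefix."""
    return path.startswith(httproute_prefix(runtime_id))


runtime_id = "abc123"  # hypothetical runtime ID
print(reaches_runtime_pod(legacy_app_path(runtime_id), runtime_id))   # False
print(reaches_runtime_pod(httproute_prefix(runtime_id), runtime_id))  # True
```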

Updates enterprise-server image to include:
- Fix for store_idp_tokens skipping SAML IdPs (PR #14243)
- Fix for legacy agent_kind 'llm' validation in org settings
Co-authored-by: openhands <openhands@all-hands.dev>

… nodes
- Add listenerPort: 8443 for Gateway API (matches Traefik internal port)
- Set RUNTIME_DISABLE_SSL: false (behind TLS-terminating ingress)
- Configure warm runtimes with agent-server image and full environment
- Update node selectors to openhands.ai/node-type: runtime
- Update tolerations to openhands.ai/runtime=true:NoSchedule
- Fix image tag format (sha-115237f instead of sha-115237fd0)
- Add keycloak.authHost for shared auth across branches
Co-authored-by: openhands <openhands@all-hands.dev>

Add FULL_DEPLOYMENT_GUIDE.md with step-by-step instructions for:
- GCP project setup and API enablement
- Terraform infrastructure deployment (DNS + GKE)
- Kubernetes base components (cert-manager, Traefik, external-dns)
- Shared services setup (Keycloak)
- OpenHands deployment
- Branch deployment workflow
- Teardown procedures
- Troubleshooting guide
Co-authored-by: openhands <openhands@all-hands.dev>

…chedule
ROOT CAUSE: Configuration mismatch between infrastructure and Helm values
- Infrastructure: enable_gke_sandbox = false (no gVisor nodes)
- Helm: RUNTIME_CLASS = 'gvisor' (requires gVisor nodes)
The gVisor RuntimeClass on GKE has built-in scheduling constraints:
- Node Selector: sandbox.gke.io/runtime=gvisor
- Tolerations: sandbox.gke.io/runtime=gvisor:NoSchedule
Since no nodes have these labels (gVisor not enabled at infra level), runtime pods were stuck in Pending state indefinitely.
FIX: Set RUNTIME_CLASS to empty string so pods use standard containerd.
TO ENABLE GVISOR LATER:
1. In terraform: set enable_gke_sandbox = true for runtime node pool
2. In this file: uncomment RUNTIME_CLASS: 'gvisor'
Co-authored-by: openhands <openhands@all-hands.dev>
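
The root cause can be modelled as plain label matching. The following is a simplified sketch, not the Kubernetes scheduler: it considers node selectors only and ignores taints and tolerations. The node name is hypothetical; the label values come from the commit messages in this PR.

```python
GVISOR_SELECTOR = {"sandbox.gke.io/runtime": "gvisor"}


def effective_node_selector(runtime_class: str) -> dict[str, str]:
    """Selector implied by the RuntimeClass; empty string means plain containerd."""
    return dict(GVISOR_SELECTOR) if runtime_class == "gvisor" else {}


def schedulable_nodes(selector: dict[str, str], nodes: dict[str, dict[str, str]]) -> list[str]:
    """Names of nodes whose labels satisfy every selector entry."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Cluster provisioned with enable_gke_sandbox = false: no gVisor label anywhere.
nodes = {"runtime-node-1": {"openhands.ai/node-type": "runtime"}}

print(schedulable_nodes(effective_node_selector("gvisor"), nodes))  # []
print(schedulable_nodes(effective_node_selector(""), nodes))        # ['runtime-node-1']
```

An empty candidate list corresponds to the pod staying in Pending, which matches the behaviour the commit describes.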

The warm-runtimes-configmap.yaml template expects snake_case field names:
- nodeSelector → node_selector
- tolerations (already correct)
- runtime_class (added)
This fixes the issue where runtime pods weren't being scheduled on sysbox-enabled nodes because the node_selector wasn't being rendered into the warm-runtimes.json ConfigMap.
The mismatch caused runtime pods to:
1. Not get the sysbox-runc RuntimeClass
2. Not have the correct nodeSelector
3. Not have tolerations for the sysbox-runtime taint
Without these, warm runtimes would either fail to schedule or schedule on the wrong nodes, causing the 120-second timeout.
Co-authored-by: openhands <openhands@all-hands.dev>
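
Why a camelCase key disappears without any error can be shown with a toy renderer. This is a stand-in for the Helm template, not its actual code: `render_warm_runtime` is hypothetical and simply copies the snake_case keys it knows about, as a template that reads fixed field names would.

```python
import json

EXPECTED_KEYS = ("node_selector", "tolerations", "runtime_class")


def render_warm_runtime(values: dict) -> str:
    """Copy only the known snake_case keys; anything else is silently dropped."""
    rendered = {key: values[key] for key in EXPECTED_KEYS if key in values}
    return json.dumps(rendered, sort_keys=True)


camel = {"nodeSelector": {"sysbox-install": "yes"}, "tolerations": []}
snake = {"node_selector": {"sysbox-install": "yes"}, "tolerations": []}

print(render_warm_runtime(camel))  # node selector is missing from the output
print(render_warm_runtime(snake))  # node selector is present
```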

- sysbox v0.6.4 does NOT support Ubuntu 24.04 (Noble Numbat)
- v0.6.7-0 adds Ubuntu 24.04 and K8s v1.32 support
- Add runtime_node_image_type variable to ensure UBUNTU_CONTAINERD
- Document version requirements in README
Co-authored-by: openhands <openhands@all-hands.dev>

…ng deployment
- Add platform-team-sandbox terraform environment
- Add image-loader helm values for runtime node configuration
- Add keycloak redirect URI hook for SSO configuration
- Update openhands values.yaml
Co-authored-by: openhands <openhands@all-hands.dev>

- Remove PRD-developer-staging-deployment.md (beyond PR scope)
- Revert all replicated/ changes (beyond PR scope)
- Move site-infrastructure/terraform to terraform/gcp/platform-team-sandbox
- Move shared-auth and staging-dns into terraform/gcp/platform-team-sandbox
- Rename site-infrastructure to testenv-charts
- Update all path references in documentation and configuration files
Co-authored-by: openhands <openhands@all-hands.dev>

- Create testenv-charts/helm/environments/staging/base-values.yaml with shared configuration for branch deployments on staging.all-hands.dev
- Add/update .agents/skills/deploy-branch.md with:
  - Instructions for looking up OpenHands PR image tags
  - Correct cluster name (staging-main) and paths
  - Updated secrets list from all-hands-system namespace
  - Detailed troubleshooting and quick reference sections
Co-authored-by: openhands <openhands@all-hands.dev>

…sandbox
- Add Claude skill for quick branch deployments
- Add shared base-values.yaml for staging environment
  - Configured for ohe-staging.platform-team.all-hands.dev domain
  - Uses ohe-staging-cluster in platform-team-sandbox project
Co-authored-by: openhands <openhands@all-hands.dev>

Clarify that litellm-env-secrets is shared infrastructure:
- API keys are set once per cluster, not per deployment
- All branch deployments use the shared LiteLLM instance
- Individual branches do not need their own API keys
Co-authored-by: openhands <openhands@all-hands.dev>

This PR now uses the infrastructure created in PR #580 (SV-OHE-staging-Deploy-Infra):
- GCP Project: platform-team-sandbox
- GKE Cluster: ohe-staging-cluster
- Domain: ohe-staging.platform-team.all-hands.dev
Changes:
- Update workflow to target platform-team-sandbox cluster
- Use testenv-charts/helm/environments/staging/base-values.yaml as base config
- Copy secrets from all-hands-system namespace (not SOPS-encrypted)
- Update environment values to use new domain structure:
  - pathroute.ohe-staging.platform-team.all-hands.dev
  - subdomain.ohe-staging.platform-team.all-hands.dev
- Remove obsolete envs/common/values.yaml (now using testenv-charts base)
- Remove obsolete scripts/testbed/ (superseded by PR #580)
- Update documentation to reflect new infrastructure
Deployed URLs:
- https://pathroute.ohe-staging.platform-team.all-hands.dev (path-based routing)
- https://subdomain.ohe-staging.platform-team.all-hands.dev (subdomain routing)

The OpenHands chart uses ingress.host for automation/integrations/mcp ingresses. Without overriding this value, branch deployments inherit the main hostname from base-values.yaml, causing routing conflicts. Co-authored-by: openhands <openhands@all-hands.dev>

…I hook

- Create deploy-branch.sh script for branch deployments
- Enable Keycloak redirectUriRegistration in platform-team-sandbox values
- Hook automatically registers branch redirect URIs with shared Keycloak
- Fixes login issues when branch URLs aren't registered as valid redirect URIs

Usage: ./testenv-charts/scripts/deploy-branch.sh <branch-name> [--image-tag <tag>]

Co-authored-by: openhands <openhands@all-hands.dev>

The keycloak-redirect-uri-hook.yaml template was referencing a non-existent openhands.labels helper function. Replaced with inline labels matching the pattern used in other chart templates. Co-authored-by: openhands <openhands@all-hands.dev>

Add option to skip ClusterRole creation for branch deployments that share cluster-scoped RBAC with the main deployment. This resolves conflicts when multiple releases try to create the same ClusterRole.

Changes:

- runtime-api: Add skipClusterRBAC and existingClusterRole options
- runtime-api: Version bump to 0.3.3
- openhands: Update runtime-api dependency to 0.3.3
- deploy-branch.sh: Enable skipClusterRBAC for branch deployments

Co-authored-by: openhands <openhands@all-hands.dev>

The env block in YAML replaces (not merges) the base-values.yaml env block. This was causing the branch deployment to be missing critical env vars like:

- OH_APP_MODE: saas (controls SaaS vs Enterprise behavior)
- CONVERSATION_MANAGER_CLASS (required for SaaS backend)
- OH_WEB_CLIENT_FEATURE_FLAGS_* (controls UI features)
- Runtime and LLM configuration

Co-authored-by: openhands <openhands@all-hands.dev>
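One way to picture the replace-versus-merge behavior described above is a small Python model of how Helm combines values files: nested mappings are merged key by key, while lists and scalars from the later file replace the earlier ones wholesale. This is only a sketch. It assumes the `env` block is a list of name/value entries, and the hostnames and values below are hypothetical; the real base-values.yaml is not shown on this page.

```python
from copy import deepcopy


def coalesce(base, override):
    """Combine two values trees: mappings merge recursively, everything else is replaced."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge key by key.
            result[key] = coalesce(result[key], value)
        else:
            # Lists and scalars from the override replace the base value entirely.
            result[key] = deepcopy(value)
    return result


# Hypothetical stand-in for base-values.yaml (assumption: env is a list).
base_values = {
    "ingress": {"host": "main.example.test", "enabled": True},
    "env": [
        {"name": "OH_APP_MODE", "value": "saas"},
        {"name": "CONVERSATION_MANAGER_CLASS", "value": "placeholder"},
    ],
}

# Hypothetical stand-in for a branch values file that sets one extra variable.
branch_values = {
    "ingress": {"host": "my-branch.example.test"},
    "env": [{"name": "EXTRA_FLAG", "value": "1"}],
}

merged = coalesce(base_values, branch_values)
env_names = [item["name"] for item in merged["env"]]

print(merged["ingress"])  # the mapping was merged: host overridden, enabled kept
print(env_names)          # the list was replaced: only EXTRA_FLAG remains
```

Under that assumption, a branch values file has to repeat every base entry it still needs, which matches the missing-variable symptom in the commit message.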

- Remove hardcoded OH_WEB_CLIENT_PROVIDERS_CONFIGURED from base-values.yaml that was overriding the computed template value
- Add keycloak.authHost to branch-console-message.yaml to point to shared Keycloak (prevents chart from computing auth.console-message.ohe-staging...)
- Enable keycloak.redirectUriRegistration to register branch's redirect URI with shared Keycloak (required because Keycloak doesn't support wildcards)
- Move enterpriseSSO.enabled before env block for clarity
- Clean up env block - let chart compute AUTH_TYPE, OIDC_PROVIDER_URL, OH_APP_MODE, etc. from enabled flags

The chart template at charts/openhands/templates/_env.yaml computes OH_WEB_CLIENT_PROVIDERS_CONFIGURED from github.enabled, gitlab.enabled, bitbucket.enabled, and enterpriseSSO.enabled flags. Values in the env block override the computed defaults, so removing the hardcoded value lets the template logic work correctly.

Co-authored-by: openhands <openhands@all-hands.dev>
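The override order described above can be sketched in Python. This is not the chart's template code: the provider labels, the comma-separated output format, and the list-shaped `env` block are all assumptions made for illustration. It only shows why a hardcoded entry hides the value computed from the `enabled` flags.

```python
# Assumed mapping from values keys to provider labels (hypothetical labels).
PROVIDER_FLAGS = {
    "github": "github",
    "gitlab": "gitlab",
    "bitbucket": "bitbucket",
    "enterpriseSSO": "enterprise_sso",
}


def computed_env(values):
    """Derive the default variable from the per-provider enabled flags."""
    providers = [
        label
        for key, label in PROVIDER_FLAGS.items()
        if values.get(key, {}).get("enabled")
    ]
    # Assumed output format: a comma-separated list of enabled providers.
    return {"OH_WEB_CLIENT_PROVIDERS_CONFIGURED": ",".join(providers)}


def effective_env(values):
    """Apply explicit env entries on top of the computed defaults."""
    env = computed_env(values)
    for item in values.get("env", []):
        env[item["name"]] = item["value"]  # explicit entries win
    return env


flags = {
    "github": {"enabled": True},
    "gitlab": {"enabled": False},
    "enterpriseSSO": {"enabled": True},
}

# With a hardcoded entry, the enterpriseSSO flag has no visible effect.
with_hardcoded = effective_env(
    {**flags, "env": [{"name": "OH_WEB_CLIENT_PROVIDERS_CONFIGURED", "value": "github"}]}
)
# Without it, the value follows the flags.
without_hardcoded = effective_env({**flags, "env": []})

print(with_hardcoded["OH_WEB_CLIENT_PROVIDERS_CONFIGURED"])
print(without_hardcoded["OH_WEB_CLIENT_PROVIDERS_CONFIGURED"])
```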

Add generate-branch-values.sh script that creates values.yaml files for branch deployments with proper Keycloak SSO configuration:

- Handles shared Keycloak setup (authHost, redirect URI registration)
- Computes AUTH_URL correctly for branch subdomains
- Configures enterprise SSO provider
- Sets up connections to shared PostgreSQL, Redis, and LiteLLM
- Two modes: minimal (shares most resources) and full (own postgres/redis)
- Includes comprehensive documentation and next-steps instructions

This makes it easy for developers to create branch deployments without struggling with the SSO configuration gotchas.

Co-authored-by: openhands <openhands@all-hands.dev>

Adds a script to help developers easily create branch deployment values files for the OHE staging environment. The script:

- Sanitizes branch names for Kubernetes DNS compliance
- Generates a complete Helm values file with all required overrides
- Outputs clear deployment instructions with kubectl/helm commands
- Supports custom image tags for PR-specific builds

Usage: ./create-branch-deployment.sh <branch-name> [--image-tag <tag>]

Co-authored-by: openhands <openhands@all-hands.dev>
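The script's own sanitization logic is not shown on this page. As a rough illustration of what sanitizing a branch name for Kubernetes DNS compliance can involve, here is a Python sketch that produces an RFC 1123 label: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters. The function name and the exact replacement rules are assumptions, not the script's behavior.

```python
import re

MAX_LABEL_LENGTH = 63  # RFC 1123 DNS label limit that Kubernetes applies to many names


def sanitize_branch_name(branch: str, max_length: int = MAX_LABEL_LENGTH) -> str:
    """Turn a git branch name into a DNS-label-safe string (hypothetical helper)."""
    name = branch.lower()
    # Replace every run of disallowed characters (slashes, underscores, dots, ...) with a hyphen.
    name = re.sub(r"[^a-z0-9-]+", "-", name)
    # Collapse repeated hyphens so the result stays readable.
    name = re.sub(r"-{2,}", "-", name)
    # Truncate first, then trim hyphens so the label starts and ends with an alphanumeric.
    name = name[:max_length].strip("-")
    if not name:
        raise ValueError(f"branch name {branch!r} has no DNS-safe characters")
    return name


print(sanitize_branch_name("feature/Add_SSO.v2"))  # feature-add-sso-v2
```

Truncating before trimming matters: cutting a long name at 63 characters can leave a trailing hyphen, which a DNS label does not allow.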

The backend requires ENABLE_ENTERPRISE_SSO=true in addition to the chart's enterpriseSSO.enabled flag (which only sets OH_WEB_CLIENT_PROVIDERS_CONFIGURED for the frontend). Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot reviewed Apr 27, 2026


saurya changed the title ~~feat: Add staging infrastructure for 4 OHE deployment environments~~ feat: Add staging infrastructure for 2 OHE deployment environments - path and subdomain May 5, 2026

saurya force-pushed the SV-OHE-staging-Deploy-Infra branch 3 times, most recently from 6dcd6c1 to c095b83 Compare May 8, 2026 01:42

saurya and others added 23 commits May 7, 2026 18:47

fix: Update traefik values to match new Helm chart schema

1566ed8

- Changed 'expose: true' to 'expose: { default: true }' format
- Moved 'tls' config under 'http' section per new schema
- Updated HTTP-to-HTTPS redirect format for web port

Co-authored-by: openhands <openhands@all-hands.dev>

refactor: Remove multi-cluster environments

bfec296

Simplify staging infrastructure to only two environments:

- single-cluster-path
- single-cluster-subdomain

Multi-cluster configurations are not needed for current staging validation.

Co-authored-by: openhands <openhands@all-hands.dev>

fix: remove /auth suffix from keycloak.url

06be0b2

Keycloak is served at root path, not /auth. The /auth suffix was causing OAuth callback failures after successful authentication. Co-authored-by: openhands <openhands@all-hands.dev>

saurya and others added 15 commits May 7, 2026 18:47

chore: update image to sha-115237fd0 for SAML IdP fix

eaf9c2a

Updates enterprise-server image to include:

- Fix for store_idp_tokens skipping SAML IdPs (PR #14243)
- Fix for legacy agent_kind 'llm' validation in org settings

Co-authored-by: openhands <openhands@all-hands.dev>

saurya force-pushed the SV-OHE-staging-Deploy-Infra branch from c095b83 to 9ce01c9 Compare May 8, 2026 01:47

saurya and others added 3 commits May 8, 2026 10:23

saurya mentioned this pull request May 10, 2026

Add dual staging environment infrastructure (pathroute & subdomain) #542

Open

saurya and others added 9 commits May 10, 2026 15:54

saurya wants to merge 57 commits into main from SV-OHE-staging-Deploy-Infra


3 participants


		# Check cert-manager logs
		kubectl logs -n cert-manager -l app=cert-manager


saurya commented Apr 27, 2026

PR Review Feedback Addressed ✅

Critical Issues Fixed

Notes on Improvement Opportunities


saurya commented May 5, 2026

📋 Summary of Chart Changes

🐛 Bug Fixes (recommended to merge)

✨ New Feature: Path-Based Routing

✨ New Feature: Shared Keycloak Support

✨ New Feature: Warm Runtime Scheduling Options
