feat: Add staging infrastructure for 2 OHE deployment environments - path and subdomain #580
saurya wants to merge 57 commits into
Conversation
all-hands-bot
left a comment
🔴 Needs improvement - Critical security and validation gaps block merge readiness.
[CRITICAL ISSUES]
- Missing .gitignore patterns: Add Terraform patterns (`*.tfstate`, `*.tfvars`, `.terraform/`, etc.) to `.gitignore` to prevent committing sensitive files with GCP credentials and project IDs.
- No validation evidence: PR lists validation steps as "Next Steps" but shows no `terraform validate`/`plan` output or cost estimates. Infrastructure creating billable resources needs validation proof before merge.
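A minimal sketch of the .gitignore fix, assuming the patterns named above plus two common Terraform additions (`*.tfstate.*` and `crash.log`) that the review does not list. The function name `add_terraform_ignores` is hypothetical.

```bash
# Append Terraform ignore patterns to a .gitignore file, skipping any
# pattern that is already present so the function can be re-run safely.
add_terraform_ignores() {
  gitignore="${1:-.gitignore}"
  touch "$gitignore"
  # Make sure the file ends with a newline before appending to it.
  if [ -n "$(tail -c1 "$gitignore")" ]; then
    echo >> "$gitignore"
  fi
  for pattern in '*.tfstate' '*.tfstate.*' '*.tfvars' '.terraform/' 'crash.log'; do
    # -x: match the whole line, -F: treat the pattern as a literal string
    grep -qxF -- "$pattern" "$gitignore" || printf '%s\n' "$pattern" >> "$gitignore"
  done
}
```

Note that `*.tfvars` does not match `terraform.tfvars.example`, so example files can still be committed.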
[IMPROVEMENT OPPORTUNITIES]
- Code duplication: 4 environment directories contain nearly identical Terraform code - consider consolidating with shared modules.
- Variable substitution: `${ACME_EMAIL}` and `${GCP_PROJECT_ID}` usage needs a documented substitution workflow.
[RISK ASSESSMENT]
🔴 HIGH RISK - Infrastructure change introducing new GKE clusters without validation evidence. Multiple risk factors: (1) New architectural patterns (4 routing strategies), (2) Infrastructure dependencies (GKE, Cloud DNS, Load Balancers), (3) Security sensitivity (missing .gitignore for secrets), (4) Cost impact (multiple clusters). Recommendation: Do not auto-merge. Require (a) .gitignore update, (b) terraform validate/plan output, (c) cost estimate, (d) human review of infrastructure design before deployment.
# Use staging server for testing
server: https://acme-staging-v02.api.letsencrypt.org/directory
# Email for certificate expiration notifications
email: ${ACME_EMAIL}
🟠 Important: The ${ACME_EMAIL} variable substitution pattern is used but the README doesn't explain how to apply it. Add explicit instructions:
# Before applying, substitute variables
export ACME_EMAIL="devops@example.com"
export GCP_PROJECT_ID="staging-092324"
envsubst < cluster-issuer.yaml | kubectl apply -f -
Or consider using Helm values instead of shell variable substitution for better integration with the deployment workflow.
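One risk with plain `envsubst` is that an unset variable is silently replaced by an empty string. A hedged sketch of a guard follows; `require_vars` and `apply_cluster_issuer` are hypothetical names, and the sketch assumes `cluster-issuer.yaml` sits in the current directory.

```bash
# Fail early if any named variable is unset or empty.
# The variables must also be exported for envsubst to see them.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Substitute only the two listed variables, then apply the manifest.
apply_cluster_issuer() {
  require_vars ACME_EMAIL GCP_PROJECT_ID || return 1
  envsubst '${ACME_EMAIL} ${GCP_PROJECT_ID}' < cluster-issuer.yaml | kubectl apply -f -
}
```

Passing the variable list to `envsubst` keeps any other `$` sequences in the manifest untouched.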
module "vpc" {
  source = "../../modules/vpc-network"

  project_id    = var.project_id
  region        = var.region
  network_name  = "${var.environment_name}-vpc"
  subnet_name   = "${var.environment_name}-subnet"
  subnet_cidr   = var.subnet_cidr
  pods_cidr     = var.pods_cidr
  services_cidr = var.services_cidr
🟡 Suggestion: All 4 environment directories contain nearly identical Terraform code with only variable values differing. This creates maintenance burden - changes must be replicated across 4 files. Consider consolidating:
Option 1: Single environment module + different tfvars:
environments/base/main.tf (parameterized)
+ single-cluster-path.tfvars
+ single-cluster-subdomain.tfvars
+ multi-cluster-path.tfvars
+ multi-cluster-subdomain.tfvars
Option 2: Use Terragrunt to manage shared config.
Reduced duplication = fewer bugs and easier maintenance.
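Option 1 could be driven by a small loop like the sketch below. It only prints the commands rather than running them, and it assumes the `.tfvars` files live next to `environments/base/main.tf`; `plan_command` is a hypothetical helper name.

```bash
# Print the plan command for one environment under the Option 1 layout.
plan_command() {
  env_name="$1"
  # With -chdir, the -var-file path is resolved relative to that directory.
  printf 'terraform -chdir=environments/base plan -var-file=%s.tfvars -out=%s.tfplan\n' \
    "$env_name" "$env_name"
}

for env_name in single-cluster-path single-cluster-subdomain \
                multi-cluster-path multi-cluster-subdomain; do
  plan_command "$env_name"
done
```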
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager
🟠 Important - Evidence Gap: The PR description mentions "Next Steps" including terraform validate and terraform plan, indicating these haven't been run yet. Before merging infrastructure code that will create billable GCP resources:
Required validation:
- Output from `terraform validate` for at least one environment
- Output from `terraform plan` showing what resources will be created
- Cost estimate (from terraform plan or GCP pricing calculator)
- Test deployment to a sandbox project (optional but recommended)
Add an Evidence section to the PR description with these outputs. This protects against deploying broken or unexpectedly expensive infrastructure.
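The validate and plan outputs could be captured with something like the following sketch. `collect_evidence` is a hypothetical name, the environment path in the usage line is an assumption, and `terraform plan` needs working GCP credentials. A cost estimate is not produced here and would still need to be derived from the plan.

```bash
# Run init, validate, and plan for one environment directory and save
# the combined output so it can be pasted into the PR description.
collect_evidence() {
  env_dir="$1"
  out_file="$2"
  {
    echo "== terraform validate ($env_dir) ==" &&
    terraform -chdir="$env_dir" init -input=false -no-color &&
    terraform -chdir="$env_dir" validate -no-color &&
    echo "== terraform plan ($env_dir) ==" &&
    terraform -chdir="$env_dir" plan -input=false -no-color
  } > "$out_file" 2>&1
}

# Example: collect_evidence environments/single-cluster-path evidence.txt
```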
PR Review Feedback Addressed ✅
Thanks for the review! The following issues have been addressed:
Critical Issues Fixed
Notes on Improvement Opportunities
This comment was created by an AI assistant (OpenHands) on behalf of Saurya.
📋 Summary of Chart Changes
This PR includes changes to
🐛 Bug Fixes (recommended to merge)
✨ New Feature: Path-Based Routing
Adds support for path-based routing as an alternative to subdomain routing. Useful for deployments where wildcard DNS/certs aren't available.
Default behavior unchanged:
✨ New Feature: Shared Keycloak Support
Allows using an external/shared Keycloak instance across multiple deployments.
✨ New Feature: Warm Runtime Scheduling Options
Adds support for node selectors, tolerations, and runtime class on warm runtime pods.
Discussion needed: Should these chart changes be:
This comment was created by an AI assistant (OpenHands) on behalf of @SauryaVelagapudi.
6dcd6c1 to c095b83
Design for feature branch deployment to staging with:
- Pre-provisioned wildcard certs (avoids Let's Encrypt rate limits)
- Base domain pool (dev1-dev5.staging.all-hands.dev)
- Shared Keycloak for SAML authentication
- GitHub Actions workflow for slot-based deployment
- Incremental deployment (only changed charts)
- external-dns for automatic DNS management
Estimated effort: 5-7 days for initial setup
Ongoing maintenance: Mostly automated (cert renewal, TTL cleanup)
Related: infra PR #1064 (base domains)
Co-authored-by: openhands <openhands@all-hands.dev>
This adds infrastructure configuration for deploying OpenHands Enterprise to staging GCP project (staging-092324) with 4 independent environments:
1. single-cluster-path: Single GKE cluster with path-based routing
2. single-cluster-subdomain: Single GKE cluster with subdomain-based routing
3. multi-cluster-path: Separate app/runtime clusters with path-based routing
4. multi-cluster-subdomain: Separate app/runtime clusters with subdomain routing
Infrastructure includes:
- Terraform modules for VPC networking and GKE clusters
- Terraform environment configurations for all 4 setups
- Helm values for cert-manager (TLS certificates via Let's Encrypt)
- Helm values for external-dns (automatic DNS management)
- Helm values for traefik (ingress controller with routing variants)
Key features:
- Configurable domains per environment
- Path-based routing uses middlewares for URL rewriting
- Subdomain routing uses wildcard DNS entries
- All environments isolated from existing staging.all-hands.dev
Co-authored-by: openhands <openhands@all-hands.dev>
Comprehensive testing guide covering:
- Phase 1: Static validation (Terraform + Helm)
- Phase 2: Terraform plan review
- Phase 3-6: Incremental deployment and testing
- Phase 7: End-to-end verification checklist
- Phase 8: Multi-environment rollout
- Troubleshooting section for common issues
Co-authored-by: openhands <openhands@all-hands.dev>
- Changed 'expose: true' to 'expose: { default: true }' format
- Moved 'tls' config under 'http' section per new schema
- Updated HTTP-to-HTTPS redirect format for web port
Co-authored-by: openhands <openhands@all-hands.dev>
…e to prevent sensitive files from being committed
- Document variable substitution workflow in infrastructure/README.md
- Add clear instructions for variable substitution in cluster-issuer.yaml
Co-authored-by: openhands <openhands@all-hands.dev>
Simplify staging infrastructure to only two environments:
- single-cluster-path
- single-cluster-subdomain
Multi-cluster configurations are not needed for current staging validation.
Co-authored-by: openhands <openhands@all-hands.dev>
Explains how developers can deploy their own OpenHands branch to shared staging infrastructure using Helm release isolation. Covers:
- Quick start deployment steps
- Values override patterns (minimal dev, full stack)
- Runtime API deployment
- Shared resources (PostgreSQL, Redis, LiteLLM)
- Troubleshooting common issues
- Best practices for resource management
Co-authored-by: openhands <openhands@all-hands.dev>
This commit adds the complete infrastructure for OpenHands Enterprise staging environments in GCP, supporting both path-based and subdomain-based routing patterns with automatic TLS certificate provisioning.
- Terraform module for DNS zone: ohe-staging.platform-team.all-hands.dev
- Wildcard A record pointing to Traefik LoadBalancer
- NS delegation from parent zone
- Developer documentation in README.md
- ClusterIssuer for Let's Encrypt with DNS-01 challenge
- Wildcard certificate covering *.ohe-staging.platform-team.all-hands.dev
- Traefik TLSStore for default certificate
- Single-cluster path-based routing example values
- Fixed Autopilot mode configuration conflicts
- Made private_cluster_config dynamic to avoid null issues
1. **Subdomain-based**: https://<branch>.ohe-staging.platform-team.all-hands.dev
2. **Path-based**: https://ohe-staging.platform-team.all-hands.dev/<path>
```bash
helm install my-feature ./charts/openhands \
  --set …=…re.ohe-staging.platform-team.all-hands.dev \
  --set ingress.class=traefik
```
Co-authored-by: openhands <openhands@all-hands.dev>
- Add single-cluster-subdomain values with prefixWithBranch enabled
- Subdomain routing: {branch}.ohe-staging.platform-team.all-hands.dev
- Configure TLS with cert-manager DNS-01 challenge
- Add comprehensive README with deployment instructions
- Document both path-based and subdomain-based routing strategies
- Include CI/CD example for automated branch deployments
Co-authored-by: openhands <openhands@all-hands.dev>
…thBranch)
Added documentation for the hidden branch-based routing feature:
- branchSanitized: sanitized branch name for subdomain prefix
- ingress.prefixWithBranch: enables branch-prefixed hostnames
When both are set, ingresses use: {branchSanitized}.{ingress.host}
Example: feature-x.app.example.com
Requires wildcard TLS cert and DNS record.
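As an illustration of the hostname rule above, the sketch below derives a DNS-safe prefix from a git branch name and joins it with the host. The sanitization rules here (lowercase, collapse other characters to `-`, cap at 63 characters) are an assumption and may differ from what the chart or CI actually does; `sanitize_branch` and `branch_host` are hypothetical names.

```bash
# Turn a git branch name into a DNS-label-safe string: lowercase,
# runs of other characters collapsed to "-", at most 63 characters,
# and no leading or trailing "-".
sanitize_branch() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' \
    | cut -c1-63 \
    | sed -e 's/-$//'
}

# Build {branchSanitized}.{ingress.host}
branch_host() {
  printf '%s.%s\n' "$(sanitize_branch "$1")" "$2"
}

branch_host 'feature/x' 'app.example.com'   # prints feature-x.app.example.com
```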
Co-authored-by: openhands <openhands@all-hands.dev>
This terraform module deploys a shared Keycloak instance at auth.ohe-staging.platform-team.all-hands.dev that can be used by all branch deployments in the staging environment.
Features:
- Shared Keycloak with external PostgreSQL
- Wildcard redirect URIs for all *.ohe-staging.* branches
- Automated realm setup via Kubernetes Job
- Client credentials stored in secrets
Co-authored-by: openhands <openhands@all-hands.dev>
Changes:
- Update _init-containers.yaml to use configurable postgres image
- Only create keycloak database when keycloak subchart is enabled
- Update single-cluster-path values to use shared Keycloak at auth.ohe-staging.platform-team.all-hands.dev
- Configure TLS and proper staging domain for saurya-prototype branch
Co-authored-by: openhands <openhands@all-hands.dev>
- Add _routing.yaml helpers for both openhands and runtime-api charts
- Add traefik-middleware.yaml for path prefix stripping
- Update ingress templates to use routing helpers
- Fix annotation handling with $annotations variable pattern
- Support branchSanitized at both root and ingress level for backward compatibility
- Add routingMode (subdomain/path) and pathPrefix configuration to values.yaml
Path mode: app.example.com/my-branch/
Subdomain mode: my-branch.app.example.com
Co-authored-by: openhands <openhands@all-hands.dev>
…subdomain support
- Add _routing.yaml with routing helper functions:
  - openhands.ingressHost: computes base host (branch-level)
  - openhands.serviceIngressHost: computes host for specific services
  - openhands.serviceRoutedPath: handles path computation per routing mode
  - openhands.serviceTlsSecretName: TLS secret name per service
- Update ingress templates to use new service routing helpers:
  - ingress-automation.yaml: supports both path and subdomain modes
  - ingress-integrations.yaml: supports both path and subdomain modes
  - ingress-mcp.yaml: supports both path and subdomain modes
- Add serviceRoutingMode config option (path|subdomain, default: path)
  - path mode: services use paths like /api/automation
  - subdomain mode: services get subdomains like automation.branch.host
- Update values.yaml with comprehensive routing documentation
Tested: Both path and subdomain modes verified with helm template
Co-authored-by: openhands <openhands@all-hands.dev>
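The two serviceRoutingMode outcomes can be sketched as a small shell function. This only mirrors the examples in the commit message; it is not the chart's helper logic, `service_url` is a hypothetical name, and treating every service as `/api/<service>` in path mode is an assumption generalized from the `/api/automation` example.

```bash
# Compute a service URL for a branch-level host under each mode.
service_url() {
  mode="$1"
  service="$2"
  host="$3"
  case "$mode" in
    path)      printf 'https://%s/api/%s\n' "$host" "$service" ;;
    subdomain) printf 'https://%s.%s\n' "$service" "$host" ;;
    *)         echo "unknown serviceRoutingMode: $mode" >&2; return 1 ;;
  esac
}

service_url path automation my-branch.app.example.com
service_url subdomain automation my-branch.app.example.com
```

Subdomain mode adds one more DNS level per service, so the wildcard certificate and DNS record must cover that extra level.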
- Add overview section explaining routingMode and serviceRoutingMode
- Document Config A (subdomain + path) and Config B (subdomain + subdomain)
- Update values override examples with routing configuration
- Add service subdomain deployment pattern example
- Update access URLs section for all routing modes
Co-authored-by: openhands <openhands@all-hands.dev>
- Switch from Bitnami to codecentric/keycloakx chart (Bitnami images unavailable)
- Configure Keycloak with official quay.io/keycloak/keycloak:26.1.0 image
- Set up realm 'staging' with client 'openhands-staging'
- Configure wildcard redirect URIs for *.ohe-staging.platform-team.all-hands.dev
- Add provider configuration for local kubeconfig
- Fix realm setup job service URL (keycloak-http:80/auth)
Keycloak accessible at: https://auth.ohe-staging.platform-team.all-hands.dev/auth
Co-authored-by: openhands <openhands@all-hands.dev>
…ress cases
- Add AUTH_URL environment variable for frontend login redirect
- Fix keycloak.authHost conditional to work in all ingress configurations (prefixWithBranch=true, prefixWithBranch=false, ingress.enabled=false)
- AUTH_URL includes https:// prefix (or http:// for local keycloak)
- AUTH_WEB_HOST remains without protocol prefix for backend API calls
This fixes the saurya-prototype login issue where AUTH_URL was not set, preventing the frontend from knowing where to redirect for authentication.
Co-authored-by: openhands <openhands@all-hands.dev>
Keycloak is served at root path, not /auth. The /auth suffix was causing OAuth callback failures after successful authentication. Co-authored-by: openhands <openhands@all-hands.dev>
- Changed realm from 'staging' to 'allhands' to match actual Keycloak config
- Changed clientId from 'openhands-staging' to 'allhands' to match frontend
- Added enterprise_sso SAML identity provider configuration
- Added hardcoded-attribute-idp-mapper to set identity_provider=enterprise_sso:saml
- Fixed frontendUrl to remove /auth suffix
- Simplified redirectUris to use wildcard pattern
This fixes the Enterprise SSO login failure where identity_provider was returning None because no mapper was configured for the SAML IdP.
Co-authored-by: openhands <openhands@all-hands.dev>
…setup
This commit documents and fixes several issues discovered during staging
deployment with shared Keycloak authentication:
## Key Fixes:
1. **API Key Secret Mismatch** (CRITICAL)
- sandbox-api-key and default-api-key MUST have matching values
- Added PREREQUISITES section with secret creation commands
- Without this, sandbox startup fails with API key mismatch errors
2. **Sandbox API Hostname**
- Fixed: http://openhands-runtime-api:5000 (was: http://runtime-api-runtime-api:5000)
- Must match the helm release name pattern: {release-name}-runtime-api
3. **Shared Keycloak Configuration**
- Set keycloak.enabled: false to use external shared instance
- Added keycloak.authHost to prevent branch prefix on auth hostname
- Required for multi-branch deployments sharing a Keycloak instance
4. **Enterprise SSO**
- Added enterpriseSSO.enabled: true for SAML-based login
- Works with shared Keycloak configured with Google Workspace SAML
5. **LiteLLM Configuration**
- Enabled litellm-helm subchart with proper DB connection
- Set internal URL: http://litellm-helm:4000
6. **Runtime URL Pattern**
- Added runtime URL pattern configuration for branch-specific overrides
- Must be set at deploy time with --set for each branch
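For the API key prerequisite in item 1, one way to guarantee matching values is to generate the key once and write it into both secrets. This is a sketch only: the data key `api-key` and the function name `create_matching_api_keys` are assumptions, so use whatever key name the chart actually reads.

```bash
# Create sandbox-api-key and default-api-key from one generated value
# so the two secrets cannot drift apart.
create_matching_api_keys() {
  namespace="$1"
  # 32 random bytes, hex-encoded
  api_key="$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
  for secret in sandbox-api-key default-api-key; do
    # "api-key" as the data key is an assumption, not taken from the chart
    kubectl create secret generic "$secret" \
      --namespace "$namespace" \
      --from-literal="api-key=$api_key" \
      --dry-run=client -o yaml | kubectl apply -f -
  done
}
```

Piping the `--dry-run=client -o yaml` output into `kubectl apply` makes the function safe to re-run, although each run rotates the key in both secrets.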
Co-authored-by: openhands <openhands@all-hands.dev>
…alues
- Rename infrastructure/ to site-infrastructure/ to avoid confusion with charts/infra/ directory
- Simplify single-cluster-path and single-cluster-subdomain values files
- Update image-loader values for staging environment
- Merge latest changes from main
Co-authored-by: openhands <openhands@all-hands.dev>
…ure/
- Revert charts/image-loader toleration back to default (value: not-running)
- Add site-infrastructure/helm/environments/*/values-image-loader.yaml with toleration override (value: true) for GKE sysbox nodes
- Simplify values-openhands.yaml files by removing redundant defaults
- Add sysbox k8s manifests and documentation
This keeps generic chart defaults clean while allowing site-specific customization in the appropriate location.
Co-authored-by: openhands <openhands@all-hands.dev>
Root cause: Dynamically-created runtime pods were missing node scheduling
configuration needed to run on sysbox nodes, causing 120-second timeout.
Changes:
- warm-runtimes-configmap.yaml: Add node_selector, tolerations, runtime_class fields
- runtime-api/values.yaml: Document new scheduling fields with examples
- single-cluster-path/values-openhands.yaml: Configure sysbox scheduling
- node_selector: {sysbox-install: yes}
- tolerations: sysbox-runtime NoSchedule
- runtime_class: sysbox-runc
- Fix nil pointer errors in crd-check/_hook.tpl and deployment.yaml
- Add missing database fields to openhands/values.yaml postgresql auth
Co-authored-by: openhands <openhands@all-hands.dev>
Add enable_gke_sandbox variable to support gVisor-based isolation as an alternative to sysbox for runtime containers:
- Add enable_gke_sandbox variable (default: true for single-cluster envs)
- Configure sandbox_config with gvisor when enabled
- Use dynamic taints based on isolation mode:
  - gVisor: sandbox.gke.io/runtime=gvisor:NoSchedule
  - sysbox: sysbox-runtime=true:NoSchedule
- Only add sysbox-install label when NOT using GKE Sandbox
- Update documentation with clear explanation of both isolation modes
Co-authored-by: openhands <openhands@all-hands.dev>
…raefik
image-loader values:
- Override nodeSelector to remove sysbox-install requirement
- Add gVisor taint toleration (sandbox.gke.io/runtime=gvisor)
- Document that runtimeClass setting has no effect (chart doesn't use it)
- Add note that image-loader may not be needed for gVisor deployments
traefik values:
- Enable kubernetesGateway provider for path-based runtime routing
- Add Gateway API documentation to README
Co-authored-by: openhands <openhands@all-hands.dev>
- Update README with clearer setup instructions
- Add more comprehensive realm-template.json
- Add example terraform.tfvars with common settings
- Update values-keycloak.yaml with additional configs
- Add new variables.tf entries for flexibility
Co-authored-by: openhands <openhands@all-hands.dev>
The image-loader DaemonSet needs to run on the same nodes where runtime pods will be scheduled. In this GKE deployment, runtime pods use gVisor (GKE Sandbox) instead of sysbox.
Changes:
- Updated nodeSelector from 'sysbox-install: yes' to 'sandbox.gke.io/runtime: gvisor'
- Updated tolerations for GKE Sandbox taint
- Removed image override to use base chart's agent-server image
Co-authored-by: openhands <openhands@all-hands.dev>
- Set RUNTIME_CLASS to empty string (no sandbox runtime required)
- Remove gVisor-specific warm runtime config (tolerations, node_selector, runtime_class)
- Use chart defaults for warm runtime configs instead of site override
- Update image-loader to target runtime nodes without gVisor
- Change nodeSelector from sandbox.gke.io/runtime to openhands.ai/node-type: runtime
This allows runtime pods to schedule on standard nodes without GKE Sandbox, which is not enabled on the Platform Team Sandbox cluster.
Co-authored-by: openhands <openhands@all-hands.dev>
…oken role
- Fixed setup script bug: /tmp/realm.json wasn't created when realm exists
- Added support for syncing ALL clients from template (broker + allhands)
- Added logic to create/update broker:read-token role
- Added broker:read-token to default-roles-allhands composites
- Updated realm-template.json:
  - Increased accessCodeLifespan from 60 to 120 seconds
  - Added broker client with read-token role
  - Added default-roles-allhands with proper composites
  - Set storeToken: true and addReadTokenRoleOnCreate: true for enterprise_sso
- Updated values-openhands.yaml:
  - Added image tag override for testing
  - Fixed TLS secret names for runtime-api
Co-authored-by: openhands <openhands@all-hands.dev>
The OpenHands app was using /runtime/{runtime_id} but the runtime-api
HTTPRoute creates paths at /{runtime_id}/runtime/. This caused the app
to hit the runtime-api management API instead of the actual runtime pods,
resulting in 'Sandbox failed to start within 120s' errors.
Co-authored-by: openhands <openhands@all-hands.dev>
Updates enterprise-server image to include:
- Fix for store_idp_tokens skipping SAML IdPs (PR #14243)
- Fix for legacy agent_kind 'llm' validation in org settings
Co-authored-by: openhands <openhands@all-hands.dev>
… nodes
- Add listenerPort: 8443 for Gateway API (matches Traefik internal port)
- Set RUNTIME_DISABLE_SSL: false (behind TLS-terminating ingress)
- Configure warm runtimes with agent-server image and full environment
- Update node selectors to openhands.ai/node-type: runtime
- Update tolerations to openhands.ai/runtime=true:NoSchedule
- Fix image tag format (sha-115237f instead of sha-115237fd0)
- Add keycloak.authHost for shared auth across branches
Co-authored-by: openhands <openhands@all-hands.dev>
Add FULL_DEPLOYMENT_GUIDE.md with step-by-step instructions for:
- GCP project setup and API enablement
- Terraform infrastructure deployment (DNS + GKE)
- Kubernetes base components (cert-manager, Traefik, external-dns)
- Shared services setup (Keycloak)
- OpenHands deployment
- Branch deployment workflow
- Teardown procedures
- Troubleshooting guide
Co-authored-by: openhands <openhands@all-hands.dev>
…chedule
ROOT CAUSE: Configuration mismatch between infrastructure and Helm values
- Infrastructure: enable_gke_sandbox = false (no gVisor nodes)
- Helm: RUNTIME_CLASS = 'gvisor' (requires gVisor nodes)
The gVisor RuntimeClass on GKE has built-in scheduling constraints:
- Node Selector: sandbox.gke.io/runtime=gvisor
- Tolerations: sandbox.gke.io/runtime=gvisor:NoSchedule
Since no nodes have these labels (gVisor not enabled at infra level), runtime pods were stuck in Pending state indefinitely.
FIX: Set RUNTIME_CLASS to empty string so pods use standard containerd.
TO ENABLE GVISOR LATER:
1. In terraform: set enable_gke_sandbox = true for runtime node pool
2. In this file: uncomment RUNTIME_CLASS: 'gvisor'
Co-authored-by: openhands <openhands@all-hands.dev>
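A mismatch like this could be caught before deploying by checking whether any node actually carries the GKE Sandbox label. The sketch below is one possible pre-flight check, not part of the PR; `detect_runtime_class` is a hypothetical name.

```bash
# Print "gvisor" only when at least one node carries the GKE Sandbox
# label; otherwise print an empty line so pods use the default runtime.
detect_runtime_class() {
  nodes="$(kubectl get nodes -l sandbox.gke.io/runtime=gvisor -o name)" || return 1
  if [ -n "$nodes" ]; then
    echo "gvisor"
  else
    echo ""
  fi
}
```

The result can then be compared against the RUNTIME_CLASS value in the Helm values before running a deployment.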
The warm-runtimes-configmap.yaml template expects snake_case field names:
- nodeSelector → node_selector
- tolerations (already correct)
- runtime_class (added)
This fixes the issue where runtime pods weren't being scheduled on sysbox-enabled nodes because the node_selector wasn't being rendered into the warm-runtimes.json ConfigMap.
The mismatch caused runtime pods to:
1. Not get the sysbox-runc RuntimeClass
2. Not have the correct nodeSelector
3. Not have tolerations for the sysbox-runtime taint
Without these, warm runtimes would either fail to schedule or schedule on the wrong nodes, causing the 120-second timeout.
Co-authored-by: openhands <openhands@all-hands.dev>
- sysbox v0.6.4 does NOT support Ubuntu 24.04 (Noble Numbat)
- v0.6.7-0 adds Ubuntu 24.04 and K8s v1.32 support
- Add runtime_node_image_type variable to ensure UBUNTU_CONTAINERD
- Document version requirements in README
Co-authored-by: openhands <openhands@all-hands.dev>
…ng deployment
- Add platform-team-sandbox terraform environment
- Add image-loader helm values for runtime node configuration
- Add keycloak redirect URI hook for SSO configuration
- Update openhands values.yaml
Co-authored-by: openhands <openhands@all-hands.dev>
- Remove PRD-developer-staging-deployment.md (beyond PR scope)
- Revert all replicated/ changes (beyond PR scope)
- Move site-infrastructure/terraform to terraform/gcp/platform-team-sandbox
- Move shared-auth and staging-dns into terraform/gcp/platform-team-sandbox
- Rename site-infrastructure to testenv-charts
- Update all path references in documentation and configuration files
Co-authored-by: openhands <openhands@all-hands.dev>
c095b83 to 9ce01c9
- Create testenv-charts/helm/environments/staging/base-values.yaml with shared configuration for branch deployments on staging.all-hands.dev
- Add/update .agents/skills/deploy-branch.md with:
  - Instructions for looking up OpenHands PR image tags
  - Correct cluster name (staging-main) and paths
  - Updated secrets list from all-hands-system namespace
  - Detailed troubleshooting and quick reference sections
Co-authored-by: openhands <openhands@all-hands.dev>
…sandbox
- Add Claude skill for quick branch deployments
- Add shared base-values.yaml for staging environment
  - Configured for ohe-staging.platform-team.all-hands.dev domain
  - Uses ohe-staging-cluster in platform-team-sandbox project
Co-authored-by: openhands <openhands@all-hands.dev>
Clarify that litellm-env-secrets is shared infrastructure:
- API keys are set once per cluster, not per deployment
- All branch deployments use the shared LiteLLM instance
- Individual branches do not need their own API keys
Co-authored-by: openhands <openhands@all-hands.dev>
This PR now uses the infrastructure created in PR #580 (SV-OHE-staging-Deploy-Infra):
- GCP Project: platform-team-sandbox
- GKE Cluster: ohe-staging-cluster
- Domain: ohe-staging.platform-team.all-hands.dev
Changes:
- Update workflow to target platform-team-sandbox cluster
- Use testenv-charts/helm/environments/staging/base-values.yaml as base config
- Copy secrets from all-hands-system namespace (not SOPS-encrypted)
- Update environment values to use new domain structure:
  - pathroute.ohe-staging.platform-team.all-hands.dev
  - subdomain.ohe-staging.platform-team.all-hands.dev
- Remove obsolete envs/common/values.yaml (now using testenv-charts base)
- Remove obsolete scripts/testbed/ (superseded by PR #580)
- Update documentation to reflect new infrastructure
Deployed URLs:
- https://pathroute.ohe-staging.platform-team.all-hands.dev (path-based routing)
- https://subdomain.ohe-staging.platform-team.all-hands.dev (subdomain routing)
The OpenHands chart uses ingress.host for automation/integrations/mcp ingresses. Without overriding this value, branch deployments inherit the main hostname from base-values.yaml, causing routing conflicts. Co-authored-by: openhands <openhands@all-hands.dev>
…I hook
- Create deploy-branch.sh script for branch deployments
- Enable Keycloak redirectUriRegistration in platform-team-sandbox values
- Hook automatically registers branch redirect URIs with shared Keycloak
- Fixes login issues when branch URLs aren't registered as valid redirect URIs
Usage: ./testenv-charts/scripts/deploy-branch.sh <branch-name> [--image-tag <tag>]
Co-authored-by: openhands <openhands@all-hands.dev>
The keycloak-redirect-uri-hook.yaml template was referencing a non-existent openhands.labels helper function. Replaced with inline labels matching the pattern used in other chart templates. Co-authored-by: openhands <openhands@all-hands.dev>
Add option to skip ClusterRole creation for branch deployments that share cluster-scoped RBAC with the main deployment. This resolves conflicts when multiple releases try to create the same ClusterRole.
Changes:
- runtime-api: Add skipClusterRBAC and existingClusterRole options
- runtime-api: Version bump to 0.3.3
- openhands: Update runtime-api dependency to 0.3.3
- deploy-branch.sh: Enable skipClusterRBAC for branch deployments
Co-authored-by: openhands <openhands@all-hands.dev>
The env block in YAML replaces (not merges) the base-values.yaml env block. This was causing the branch deployment to be missing critical env vars like:
- OH_APP_MODE: saas (controls SaaS vs Enterprise behavior)
- CONVERSATION_MANAGER_CLASS (required for SaaS backend)
- OH_WEB_CLIENT_FEATURE_FLAGS_* (controls UI features)
- Runtime and LLM configuration
Co-authored-by: openhands <openhands@all-hands.dev>
- Remove hardcoded OH_WEB_CLIENT_PROVIDERS_CONFIGURED from base-values.yaml that was overriding the computed template value
- Add keycloak.authHost to branch-console-message.yaml to point to shared Keycloak (prevents chart from computing auth.console-message.ohe-staging...)
- Enable keycloak.redirectUriRegistration to register branch's redirect URI with shared Keycloak (required because Keycloak doesn't support wildcards)
- Move enterpriseSSO.enabled before env block for clarity
- Clean up env block - let chart compute AUTH_TYPE, OIDC_PROVIDER_URL, OH_APP_MODE, etc. from enabled flags
The chart template at charts/openhands/templates/_env.yaml computes OH_WEB_CLIENT_PROVIDERS_CONFIGURED from github.enabled, gitlab.enabled, bitbucket.enabled, and enterpriseSSO.enabled flags. Values in the env block override the computed defaults, so removing the hardcoded value lets the template logic work correctly.
Co-authored-by: openhands <openhands@all-hands.dev>
Add generate-branch-values.sh script that creates values.yaml files for branch deployments with proper Keycloak SSO configuration:
- Handles shared Keycloak setup (authHost, redirect URI registration)
- Computes AUTH_URL correctly for branch subdomains
- Configures enterprise SSO provider
- Sets up connections to shared PostgreSQL, Redis, and LiteLLM
- Two modes: minimal (shares most resources) and full (own postgres/redis)
- Includes comprehensive documentation and next-steps instructions
This makes it easy for developers to create branch deployments without struggling with the SSO configuration gotchas.
Co-authored-by: openhands <openhands@all-hands.dev>
Adds a script to help developers easily create branch deployment values files for the OHE staging environment. The script:
- Sanitizes branch names for Kubernetes DNS compliance
- Generates a complete Helm values file with all required overrides
- Outputs clear deployment instructions with kubectl/helm commands
- Supports custom image tags for PR-specific builds
Usage: ./create-branch-deployment.sh <branch-name> [--image-tag <tag>]
Co-authored-by: openhands <openhands@all-hands.dev>
The backend requires ENABLE_ENTERPRISE_SSO=true in addition to the chart's enterpriseSSO.enabled flag (which only sets OH_WEB_CLIENT_PROVIDERS_CONFIGURED for the frontend). Co-authored-by: openhands <openhands@all-hands.dev>
Summary
This PR adds infrastructure configuration for deploying OpenHands Enterprise to platform-team-sandbox GCP project with 2 independent test environments to validate the different routing strategies.
Environments
- Path-based routing (/app, /runtime)
- Subdomain-based routing (app., runtime.)
Infrastructure Components
Terraform (infrastructure/terraform/)
Modules:
- gke-cluster - GKE cluster provisioning with configurable node pools
- networking - VPC, subnets, and firewall rules
- iam - Service accounts and IAM bindings
Environments:
- main.tf, variables.tf, outputs.tf, and terraform.tfvars.example
Helm Values (infrastructure/helm/)
cert-manager:
external-dns:
- … (*.domain.com)
traefik:
Directory Structure
Deployment Steps
Provision Infrastructure:
Install Helm Charts:
Validation Status
- … (terraform validate)
- … (helm lint)
- … (kubectl --dry-run)
Next Steps
After merge and approval:
- terraform apply for each environment
This PR was edited by Saurya after initial creation by an AI agent (OpenHands) on behalf of Saurya.