feat: Add staging infrastructure for 2 OHE deployment environments - path and subdomain #580

Open
saurya wants to merge 57 commits into main from SV-OHE-staging-Deploy-Infra

saurya commented Apr 27, 2026

Summary

This PR adds infrastructure configuration for deploying OpenHands Enterprise to the platform-team-sandbox GCP project, with two independent test environments that validate the two routing strategies (path-based and subdomain-based).

Environments

| Environment | Cluster Setup | Routing Strategy | Use Case |
|---|---|---|---|
| single-cluster-path | Single GKE cluster | Path-based (/app, /runtime) | Simple deployment, all workloads on one cluster |
| single-cluster-subdomain | Single GKE cluster | Subdomain-based (app., runtime.) | Single cluster with service isolation via subdomains |

Infrastructure Components

Terraform (infrastructure/terraform/)

Modules:

  • gke-cluster - GKE cluster provisioning with configurable node pools
  • networking - VPC, subnets, and firewall rules
  • iam - Service accounts and IAM bindings

Environments:

  • Each environment has its own directory with main.tf, variables.tf, outputs.tf, and terraform.tfvars.example
  • Single-cluster envs create 1 cluster for both app and runtimes

Helm Values (infrastructure/helm/)

cert-manager:

  • Let's Encrypt ClusterIssuer for automatic TLS certificates
  • HTTP-01 challenge solver via Traefik ingress

external-dns:

  • Automatic DNS record management in Google Cloud DNS
  • Path-routing variant: Single A record per domain
  • Subdomain-routing variant: Wildcard DNS entries (*.domain.com)

traefik:

  • Ingress controller configuration
  • Path-routing variant: Includes middlewares for URL path stripping/rewriting
  • Subdomain-routing variant: Standard host-based routing

Directory Structure

infrastructure/
├── terraform/
│   ├── modules/
│   │   ├── gke-cluster/          # GKE cluster module
│   │   ├── networking/           # VPC and networking module
│   │   └── iam/                  # IAM and service accounts module
│   └── environments/
│       ├── single-cluster-path/       # Single cluster + path routing
│       └── single-cluster-subdomain/  # Single cluster + subdomain routing
└── helm/
    ├── cert-manager/             # TLS certificate automation
    ├── external-dns/             # DNS record automation
    └── traefik/                  # Ingress controller + middlewares

Deployment Steps

  1. Provision Infrastructure:

    cd infrastructure/terraform/environments/single-cluster-path
    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your values
    terraform init && terraform plan && terraform apply
  2. Install Helm Charts:

    # cert-manager
    helm install cert-manager jetstack/cert-manager -f infrastructure/helm/cert-manager/values.yaml
    kubectl apply -f infrastructure/helm/cert-manager/cluster-issuer.yaml
    
    # external-dns (choose routing variant)
    helm install external-dns bitnami/external-dns -f infrastructure/helm/external-dns/values-path-routing.yaml
    
    # traefik (choose routing variant)
    helm install traefik traefik/traefik -f infrastructure/helm/traefik/values-path-routing.yaml
    kubectl apply -f infrastructure/helm/traefik/middlewares-path-routing.yaml
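
The "choose routing variant" step above can be sketched as a small plan builder. This is only an illustration: `helm_install_plan` is a hypothetical helper, and the `values-subdomain-routing.yaml` file names are an assumption, since the steps above show only the path-routing files.

```python
from typing import List

# Chart references and file paths are copied from the steps above.
# ASSUMPTION: the "subdomain" values file names; the PR text only
# shows the path-routing variants.
VALUES_SUFFIX = {"path": "path-routing", "subdomain": "subdomain-routing"}


def helm_install_plan(variant: str, root: str = "infrastructure/helm") -> List[List[str]]:
    """Return the install commands for one routing variant, in order."""
    if variant not in VALUES_SUFFIX:
        raise ValueError(f"unknown routing variant: {variant!r}")
    suffix = VALUES_SUFFIX[variant]
    plan = [
        ["helm", "install", "cert-manager", "jetstack/cert-manager",
         "-f", f"{root}/cert-manager/values.yaml"],
        ["kubectl", "apply", "-f", f"{root}/cert-manager/cluster-issuer.yaml"],
        ["helm", "install", "external-dns", "bitnami/external-dns",
         "-f", f"{root}/external-dns/values-{suffix}.yaml"],
        ["helm", "install", "traefik", "traefik/traefik",
         "-f", f"{root}/traefik/values-{suffix}.yaml"],
    ]
    if variant == "path":
        # Per the component list, only the path variant adds
        # prefix-stripping middlewares.
        plan.append(["kubectl", "apply", "-f",
                     f"{root}/traefik/middlewares-path-routing.yaml"])
    return plan


for cmd in helm_install_plan("path"):
    print(" ".join(cmd))
```

Keeping the variant as a single parameter means external-dns and Traefik cannot end up on different variants by accident.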

Validation Status

  • Terraform modules validated (terraform validate)
  • Both environment configurations validated
  • Helm chart values validated (helm lint)
  • Kubernetes manifests validated (kubectl --dry-run)
  • Terraform plan (requires GCP credentials)
  • Terraform apply (pending approval)

Next Steps

After merge and approval:

  1. Run terraform apply for each environment
  2. Install Helm charts on provisioned clusters
  3. Deploy OpenHands applications

This PR was edited by Saurya after initial creation by an AI agent (OpenHands) on behalf of Saurya.

all-hands-bot left a comment

🔴 Needs improvement - Critical security and validation gaps block merge readiness.

[CRITICAL ISSUES]

  • Missing .gitignore patterns: Add Terraform patterns (*.tfstate, *.tfvars, .terraform/, etc.) to .gitignore to prevent committing sensitive files with GCP credentials and project IDs.
  • No validation evidence: PR lists validation steps as "Next Steps" but shows no terraform validate/plan output or cost estimates. Infrastructure creating billable resources needs validation proof before merge.

[IMPROVEMENT OPPORTUNITIES]

  • Code duplication: 4 environment directories contain nearly identical Terraform code - consider consolidating with shared modules.
  • Variable substitution: ${ACME_EMAIL} and ${GCP_PROJECT_ID} usage needs documented substitution workflow.

[RISK ASSESSMENT]
🔴 HIGH RISK - Infrastructure change introducing new GKE clusters without validation evidence. Multiple risk factors: (1) New architectural patterns (4 routing strategies), (2) Infrastructure dependencies (GKE, Cloud DNS, Load Balancers), (3) Security sensitivity (missing .gitignore for secrets), (4) Cost impact (multiple clusters). Recommendation: Do not auto-merge. Require (a) .gitignore update, (b) terraform validate/plan output, (c) cost estimate, (d) human review of infrastructure design before deployment.

# Use staging server for testing
server: https://acme-staging-v02.api.letsencrypt.org/directory
# Email for certificate expiration notifications
email: ${ACME_EMAIL}

🟠 Important: The ${ACME_EMAIL} variable substitution pattern is used but the README doesn't explain how to apply it. Add explicit instructions:

# Before applying, substitute variables
export ACME_EMAIL="devops@example.com"
export GCP_PROJECT_ID="staging-092324"
envsubst < cluster-issuer.yaml | kubectl apply -f -

Or consider using Helm values instead of shell variable substitution for better integration with the deployment workflow.
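
For illustration, the same substitution can be sketched in Python with the standard library's `string.Template`, which uses the same `${VAR}` syntax. `render_manifest` is a hypothetical helper, not part of this PR. Unlike `envsubst`, which silently replaces an unset variable with an empty string, this sketch fails when a variable is missing.

```python
import os
from string import Template


def render_manifest(text: str, env=None) -> str:
    """Substitute ${VAR} placeholders, failing loudly on any unset variable."""
    values = dict(os.environ if env is None else env)
    try:
        return Template(text).substitute(values)
    except KeyError as missing:
        raise ValueError(f"unset variable in manifest: {missing}") from None


# The two manifest lines quoted in the review comment above.
manifest = (
    "server: https://acme-staging-v02.api.letsencrypt.org/directory\n"
    "email: ${ACME_EMAIL}\n"
)
print(render_manifest(manifest, {"ACME_EMAIL": "devops@example.com"}))
```

Failing on a missing variable matters here because an empty `email:` field would still produce a syntactically valid manifest.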

Comment on lines +35 to +44
module "vpc" {
  source = "../../modules/vpc-network"

  project_id    = var.project_id
  region        = var.region
  network_name  = "${var.environment_name}-vpc"
  subnet_name   = "${var.environment_name}-subnet"
  subnet_cidr   = var.subnet_cidr
  pods_cidr     = var.pods_cidr
  services_cidr = var.services_cidr

🟡 Suggestion: All 4 environment directories contain nearly identical Terraform code with only variable values differing. This creates maintenance burden - changes must be replicated across 4 files. Consider consolidating:

Option 1: Single environment module + different tfvars:

environments/base/main.tf (parameterized)
  + single-cluster-path.tfvars
  + single-cluster-subdomain.tfvars
  + multi-cluster-path.tfvars
  + multi-cluster-subdomain.tfvars

Option 2: Use Terragrunt to manage shared config.

Reduced duplication = fewer bugs and easier maintenance.


# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager

🟠 Important - Evidence Gap: The PR description mentions "Next Steps" including terraform validate and terraform plan, indicating these haven't been run yet. Before merging infrastructure code that will create billable GCP resources:

Required validation:

  1. Output from terraform validate for at least one environment
  2. Output from terraform plan showing what resources will be created
  3. Cost estimate (from terraform plan or GCP pricing calculator)
  4. Test deployment to a sandbox project (optional but recommended)

Add an Evidence section to the PR description with these outputs. This protects against deploying broken or unexpectedly expensive infrastructure.

saurya commented Apr 27, 2026

PR Review Feedback Addressed ✅

Thanks for the review! The following issues have been addressed:

Critical Issues Fixed

  1. Missing .gitignore patterns - Added comprehensive Terraform patterns:

    • .terraform/ directories
    • *.tfstate and *.tfstate.* files
    • *.tfvars (excluding *.tfvars.example)
    • *.tfplan, crash logs, and override files
  2. Variable substitution workflow - Documented in two places:

    • Added "Configuration Variables" section to infrastructure/README.md with a table of variables and envsubst usage example
    • Added clear instructions at the top of cluster-issuer.yaml
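
As a rough check of the .gitignore patterns listed under item 1, the sketch below classifies file paths with `fnmatch`. It is a simplified approximation of gitignore semantics that matches on the file name only. `is_ignored` is a hypothetical helper, and the crash-log and override patterns are left out because the comment does not spell them out.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns taken from the list above.
IGNORED_NAMES = ["*.tfstate", "*.tfstate.*", "*.tfvars", "*.tfplan"]
ALLOWED_NAMES = ["*.tfvars.example"]


def is_ignored(path: str) -> bool:
    """Return True if the path would be kept out of the repository."""
    p = PurePosixPath(path)
    if ".terraform" in p.parts[:-1]:
        return True  # anything under a .terraform/ directory
    if any(fnmatch(p.name, pat) for pat in ALLOWED_NAMES):
        return False  # example files stay committed
    return any(fnmatch(p.name, pat) for pat in IGNORED_NAMES)


for candidate in ["terraform.tfvars", "terraform.tfvars.example", "main.tf"]:
    print(candidate, is_ignored(candidate))
```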

Notes on Improvement Opportunities

  • Code duplication: The 4 environments intentionally have similar structure to allow independent iteration during staging validation. Once patterns are validated, consolidation can be considered.

  • Validation evidence: Terraform validate passes locally. Full terraform plan output requires GCP credentials which are not available in this context. The plan/apply will be done as a follow-up deployment step.


This comment was created by an AI assistant (OpenHands) on behalf of Saurya.

saurya changed the title from "feat: Add staging infrastructure for 4 OHE deployment environments" to "feat: Add staging infrastructure for 2 OHE deployment environments - path and subdomain" on May 5, 2026

saurya commented May 5, 2026

📋 Summary of Chart Changes

This PR includes changes to charts/ that add new features and bug fixes. These are backwards-compatible and don't change default behavior.

🐛 Bug Fixes (recommended to merge)

| File | Change |
|---|---|
| `charts/crd-check/templates/_hook.tpl` | Fix nil pointer: `crdCheck.enabled` → `crdCheck && crdCheck.enabled` |
| `charts/runtime-api/templates/deployment.yaml` | Fix nil pointer check for `caBundle` |
| `charts/openhands/templates/_init-containers.yaml` | Fix postgres image URL (`bitnamilegacy` → `public.ecr.aws/bitnami`), conditional keycloak database creation |
| `charts/openhands/templates/deployment.yaml` | Make litellm-helm init container optional (guard with `.enabled`) |
| `charts/openhands/templates/litellm-config-script.yaml` | Make ConfigMap conditional on `litellm-helm.enabled` |

✨ New Feature: Path-Based Routing

Adds support for path-based routing as an alternative to subdomain routing. Useful for deployments where wildcard DNS/certs aren't available.

| File | Change |
|---|---|
| `charts/openhands/templates/_routing.yaml` | NEW - Routing helper functions |
| `charts/runtime-api/templates/_routing.yaml` | NEW - Same for runtime-api |
| `charts/openhands/templates/traefik-middleware.yaml` | NEW - Strip prefix middleware for path routing |
| `charts/runtime-api/templates/traefik-middleware.yaml` | NEW - Same for runtime-api |
| `charts/openhands/values.yaml` | Add `routingMode`, `serviceRoutingMode`, `pathPrefix`, `branchSanitized` |
| `charts/runtime-api/values.yaml` | Add `routingMode`, `pathPrefix`, `branchSanitized` |
| `charts/openhands/templates/ingress-*.yaml` | Refactored to use routing helpers |
| `charts/runtime-api/templates/ingress.yaml` | Refactored to use routing helpers |

Default behavior unchanged: routingMode: subdomain (existing behavior)

✨ New Feature: Shared Keycloak Support

Allows using an external/shared Keycloak instance across multiple deployments.

| File | Change |
|---|---|
| `charts/openhands/values.yaml` | Add `keycloak.authHost` option |
| `charts/openhands/templates/_env.yaml` | Support `authHost` override + add `AUTH_URL` env var |

✨ New Feature: Warm Runtime Scheduling Options

Adds support for node selectors, tolerations, and runtime class on warm runtime pods.

| File | Change |
|---|---|
| `charts/runtime-api/templates/warm-runtimes-configmap.yaml` | Add `node_selector`, `tolerations`, `runtime_class` fields |
| `charts/runtime-api/values.yaml` | Document these options (commented out by default) |

Discussion needed: Should these chart changes be:

  1. ✅ Merged as-is (backwards compatible, adds flexibility)
  2. 🔀 Split into separate PRs (bug fixes vs features)
  3. ↩️ Reverted and handled differently

This comment was created by an AI assistant (OpenHands) on behalf of @SauryaVelagapudi.

saurya force-pushed the SV-OHE-staging-Deploy-Infra branch 3 times, most recently from 6dcd6c1 to c095b83, on May 8, 2026 01:42
saurya and others added 23 commits May 7, 2026 18:47
Design for feature branch deployment to staging with:
- Pre-provisioned wildcard certs (avoids Let's Encrypt rate limits)
- Base domain pool (dev1-dev5.staging.all-hands.dev)
- Shared Keycloak for SAML authentication
- GitHub Actions workflow for slot-based deployment
- Incremental deployment (only changed charts)
- external-dns for automatic DNS management

Estimated effort: 5-7 days for initial setup
Ongoing maintenance: Mostly automated (cert renewal, TTL cleanup)

Related: infra PR #1064 (base domains)

Co-authored-by: openhands <openhands@all-hands.dev>
This adds infrastructure configuration for deploying OpenHands Enterprise
to staging GCP project (staging-092324) with 4 independent environments:

1. single-cluster-path: Single GKE cluster with path-based routing
2. single-cluster-subdomain: Single GKE cluster with subdomain-based routing
3. multi-cluster-path: Separate app/runtime clusters with path-based routing
4. multi-cluster-subdomain: Separate app/runtime clusters with subdomain routing

Infrastructure includes:
- Terraform modules for VPC networking and GKE clusters
- Terraform environment configurations for all 4 setups
- Helm values for cert-manager (TLS certificates via Let's Encrypt)
- Helm values for external-dns (automatic DNS management)
- Helm values for traefik (ingress controller with routing variants)

Key features:
- Configurable domains per environment
- Path-based routing uses middlewares for URL rewriting
- Subdomain routing uses wildcard DNS entries
- All environments isolated from existing staging.all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
Comprehensive testing guide covering:
- Phase 1: Static validation (Terraform + Helm)
- Phase 2: Terraform plan review
- Phase 3-6: Incremental deployment and testing
- Phase 7: End-to-end verification checklist
- Phase 8: Multi-environment rollout
- Troubleshooting section for common issues

Co-authored-by: openhands <openhands@all-hands.dev>
- Changed 'expose: true' to 'expose: { default: true }' format
- Moved 'tls' config under 'http' section per new schema
- Updated HTTP-to-HTTPS redirect format for web port

Co-authored-by: openhands <openhands@all-hands.dev>
…e to prevent sensitive files from being committed
- Document variable substitution workflow in infrastructure/README.md
- Add clear instructions for variable substitution in cluster-issuer.yaml

Co-authored-by: openhands <openhands@all-hands.dev>
Simplify staging infrastructure to only two environments:
- single-cluster-path
- single-cluster-subdomain

Multi-cluster configurations are not needed for current staging validation.

Co-authored-by: openhands <openhands@all-hands.dev>
Explains how developers can deploy their own OpenHands branch to
shared staging infrastructure using Helm release isolation.

Covers:
- Quick start deployment steps
- Values override patterns (minimal dev, full stack)
- Runtime API deployment
- Shared resources (PostgreSQL, Redis, LiteLLM)
- Troubleshooting common issues
- Best practices for resource management

Co-authored-by: openhands <openhands@all-hands.dev>
This commit adds the complete infrastructure for OpenHands Enterprise staging
environments in GCP, supporting both path-based and subdomain-based routing
patterns with automatic TLS certificate provisioning.

- Terraform module for DNS zone: ohe-staging.platform-team.all-hands.dev
- Wildcard A record pointing to Traefik LoadBalancer
- NS delegation from parent zone
- Developer documentation in README.md

- ClusterIssuer for Let's Encrypt with DNS-01 challenge
- Wildcard certificate covering *.ohe-staging.platform-team.all-hands.dev
- Traefik TLSStore for default certificate

- Single-cluster path-based routing example values

- Fixed Autopilot mode configuration conflicts
- Made private_cluster_config dynamic to avoid null issues

1. **Subdomain-based**: https://<branch>.ohe-staging.platform-team.all-hands.dev
2. **Path-based**: https://ohe-staging.platform-team.all-hands.dev/<path>

```bash
helm install my-feature ./charts/openhands \
  --set …=…re.ohe-staging.platform-team.all-hands.dev \
  --set ingress.class=traefik
```

Co-authored-by: openhands <openhands@all-hands.dev>
- Add single-cluster-subdomain values with prefixWithBranch enabled
- Subdomain routing: {branch}.ohe-staging.platform-team.all-hands.dev
- Configure TLS with cert-manager DNS-01 challenge
- Add comprehensive README with deployment instructions
- Document both path-based and subdomain-based routing strategies
- Include CI/CD example for automated branch deployments

Co-authored-by: openhands <openhands@all-hands.dev>
…thBranch)

Added documentation for the hidden branch-based routing feature:
- branchSanitized: sanitized branch name for subdomain prefix
- ingress.prefixWithBranch: enables branch-prefixed hostnames

When both are set, ingresses use: {branchSanitized}.{ingress.host}
Example: feature-x.app.example.com

Requires wildcard TLS cert and DNS record.

Co-authored-by: openhands <openhands@all-hands.dev>
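
The hostname rule in the commit message above can be sketched as follows. The sanitization rules in `sanitize_branch` (lowercase, non-alphanumeric runs replaced by `-`, 63-character limit) are an assumption for illustration, since the commit does not say how `branchSanitized` is derived. Both function names are hypothetical.

```python
import re


def sanitize_branch(branch: str) -> str:
    """Turn a Git branch name into a DNS label (ASSUMED rules, see lead-in)."""
    label = re.sub(r"[^a-z0-9-]+", "-", branch.lower()).strip("-")
    return label[:63].rstrip("-")  # DNS labels are at most 63 characters


def ingress_host(host: str, branch_sanitized: str = "", prefix_with_branch: bool = False) -> str:
    """Prefix the host only when both settings are present."""
    if prefix_with_branch and branch_sanitized:
        return f"{branch_sanitized}.{host}"
    return host


print(ingress_host("app.example.com", sanitize_branch("Feature/X"), True))
```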
This terraform module deploys a shared Keycloak instance at
auth.ohe-staging.platform-team.all-hands.dev that can be used by
all branch deployments in the staging environment.

Features:
- Shared Keycloak with external PostgreSQL
- Wildcard redirect URIs for all *.ohe-staging.* branches
- Automated realm setup via Kubernetes Job
- Client credentials stored in secrets

Co-authored-by: openhands <openhands@all-hands.dev>
Changes:
- Update _init-containers.yaml to use configurable postgres image
- Only create keycloak database when keycloak subchart is enabled
- Update single-cluster-path values to use shared Keycloak at
  auth.ohe-staging.platform-team.all-hands.dev
- Configure TLS and proper staging domain for saurya-prototype branch

Co-authored-by: openhands <openhands@all-hands.dev>
- Add _routing.yaml helpers for both openhands and runtime-api charts
- Add traefik-middleware.yaml for path prefix stripping
- Update ingress templates to use routing helpers
- Fix annotation handling with $annotations variable pattern
- Support branchSanitized at both root and ingress level for backward compatibility
- Add routingMode (subdomain/path) and pathPrefix configuration to values.yaml

Path mode: app.example.com/my-branch/
Subdomain mode: my-branch.app.example.com

Co-authored-by: openhands <openhands@all-hands.dev>
…subdomain support

- Add _routing.yaml with routing helper functions:
  - openhands.ingressHost: computes base host (branch-level)
  - openhands.serviceIngressHost: computes host for specific services
  - openhands.serviceRoutedPath: handles path computation per routing mode
  - openhands.serviceTlsSecretName: TLS secret name per service

- Update ingress templates to use new service routing helpers:
  - ingress-automation.yaml: supports both path and subdomain modes
  - ingress-integrations.yaml: supports both path and subdomain modes
  - ingress-mcp.yaml: supports both path and subdomain modes

- Add serviceRoutingMode config option (path|subdomain, default: path)
  - path mode: services use paths like /api/automation
  - subdomain mode: services get subdomains like automation.branch.host

- Update values.yaml with comprehensive routing documentation

Tested: Both path and subdomain modes verified with helm template

Co-authored-by: openhands <openhands@all-hands.dev>
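
A minimal sketch of the two `serviceRoutingMode` behaviors described in the commit message above. The `/api/<service>` path shape is generalized from the single `/api/automation` example and is an assumption; the real helpers in `_routing.yaml` may compute other paths for other services. `service_endpoint` is a hypothetical name.

```python
def service_endpoint(service: str, branch_host: str, service_routing_mode: str = "path"):
    """Return (host, path) for a service under the given serviceRoutingMode."""
    if service_routing_mode == "path":
        # Services share the branch host and are told apart by path.
        return branch_host, f"/api/{service}"
    if service_routing_mode == "subdomain":
        # Each service gets its own subdomain under the branch host.
        return f"{service}.{branch_host}", "/"
    raise ValueError(f"unknown serviceRoutingMode: {service_routing_mode!r}")


print(service_endpoint("automation", "my-branch.app.example.com"))
print(service_endpoint("automation", "my-branch.app.example.com", "subdomain"))
```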
- Add overview section explaining routingMode and serviceRoutingMode
- Document Config A (subdomain + path) and Config B (subdomain + subdomain)
- Update values override examples with routing configuration
- Add service subdomain deployment pattern example
- Update access URLs section for all routing modes

Co-authored-by: openhands <openhands@all-hands.dev>
- Switch from Bitnami to codecentric/keycloakx chart (Bitnami images unavailable)
- Configure Keycloak with official quay.io/keycloak/keycloak:26.1.0 image
- Set up realm 'staging' with client 'openhands-staging'
- Configure wildcard redirect URIs for *.ohe-staging.platform-team.all-hands.dev
- Add provider configuration for local kubeconfig
- Fix realm setup job service URL (keycloak-http:80/auth)

Keycloak accessible at: https://auth.ohe-staging.platform-team.all-hands.dev/auth

Co-authored-by: openhands <openhands@all-hands.dev>
…ress cases

- Add AUTH_URL environment variable for frontend login redirect
- Fix keycloak.authHost conditional to work in all ingress configurations
  (prefixWithBranch=true, prefixWithBranch=false, ingress.enabled=false)
- AUTH_URL includes https:// prefix (or http:// for local keycloak)
- AUTH_WEB_HOST remains without protocol prefix for backend API calls

This fixes the saurya-prototype login issue where AUTH_URL was not set,
preventing the frontend from knowing where to redirect for authentication.

Co-authored-by: openhands <openhands@all-hands.dev>
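
The relationship between the two variables in the commit message above can be sketched like this. `auth_env` and its `local_keycloak` flag are hypothetical names; the chart derives these values in `_env.yaml`, which is not shown here.

```python
def auth_env(auth_host: str, local_keycloak: bool = False) -> dict:
    """Build the two auth variables described above from one hostname."""
    scheme = "http" if local_keycloak else "https"
    return {
        "AUTH_URL": f"{scheme}://{auth_host}",  # frontend login redirect target
        "AUTH_WEB_HOST": auth_host,             # no protocol prefix, backend API calls
    }


print(auth_env("auth.ohe-staging.platform-team.all-hands.dev"))
```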
Keycloak is served at root path, not /auth. The /auth suffix was causing
OAuth callback failures after successful authentication.

Co-authored-by: openhands <openhands@all-hands.dev>
- Changed realm from 'staging' to 'allhands' to match actual Keycloak config
- Changed clientId from 'openhands-staging' to 'allhands' to match frontend
- Added enterprise_sso SAML identity provider configuration
- Added hardcoded-attribute-idp-mapper to set identity_provider=enterprise_sso:saml
- Fixed frontendUrl to remove /auth suffix
- Simplified redirectUris to use wildcard pattern

This fixes the Enterprise SSO login failure where identity_provider was
returning None because no mapper was configured for the SAML IdP.

Co-authored-by: openhands <openhands@all-hands.dev>
…setup

This commit documents and fixes several issues discovered during staging
deployment with shared Keycloak authentication:

## Key Fixes:

1. **API Key Secret Mismatch** (CRITICAL)
   - sandbox-api-key and default-api-key MUST have matching values
   - Added PREREQUISITES section with secret creation commands
   - Without this, sandbox startup fails with API key mismatch errors

2. **Sandbox API Hostname**
   - Fixed: http://openhands-runtime-api:5000 (was: http://runtime-api-runtime-api:5000)
   - Must match the helm release name pattern: {release-name}-runtime-api

3. **Shared Keycloak Configuration**
   - Set keycloak.enabled: false to use external shared instance
   - Added keycloak.authHost to prevent branch prefix on auth hostname
   - Required for multi-branch deployments sharing a Keycloak instance

4. **Enterprise SSO**
   - Added enterpriseSSO.enabled: true for SAML-based login
   - Works with shared Keycloak configured with Google Workspace SAML

5. **LiteLLM Configuration**
   - Enabled litellm-helm subchart with proper DB connection
   - Set internal URL: http://litellm-helm:4000

6. **Runtime URL Pattern**
   - Added …on for branch-specific overrides
   - Must be set at deploy time with --set for each branch

Co-authored-by: openhands <openhands@all-hands.dev>
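
Fixes 1 and 2 in the commit message above lend themselves to a pre-deploy check. The sketch below assumes the secret values have already been read into a plain dict; `runtime_api_url` and `check_api_keys` are hypothetical helpers, not code from this PR.

```python
def runtime_api_url(release_name: str, port: int = 5000) -> str:
    """In-cluster URL following the {release-name}-runtime-api pattern."""
    return f"http://{release_name}-runtime-api:{port}"


def check_api_keys(secrets: dict) -> None:
    """Raise if sandbox-api-key and default-api-key are missing or differ."""
    sandbox = secrets.get("sandbox-api-key")
    default = secrets.get("default-api-key")
    if not sandbox or not default:
        raise ValueError("both sandbox-api-key and default-api-key must be set")
    if sandbox != default:
        raise ValueError("sandbox-api-key and default-api-key must have matching values")


print(runtime_api_url("openhands"))
```

With a release named `runtime-api`, the same pattern yields the old `runtime-api-runtime-api` hostname, which is why the URL has to follow the release name actually used.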
…alues

- Rename infrastructure/ to site-infrastructure/ to avoid confusion with
  charts/infra/ directory
- Simplify single-cluster-path and single-cluster-subdomain values files
- Update image-loader values for staging environment
- Merge latest changes from main

Co-authored-by: openhands <openhands@all-hands.dev>
…ure/

- Revert charts/image-loader toleration back to default (value: not-running)
- Add site-infrastructure/helm/environments/*/values-image-loader.yaml with
  toleration override (value: true) for GKE sysbox nodes
- Simplify values-openhands.yaml files by removing redundant defaults
- Add sysbox k8s manifests and documentation

This keeps generic chart defaults clean while allowing site-specific
customization in the appropriate location.

Co-authored-by: openhands <openhands@all-hands.dev>
Root cause: Dynamically-created runtime pods were missing node scheduling
configuration needed to run on sysbox nodes, causing 120-second timeout.

Changes:
- warm-runtimes-configmap.yaml: Add node_selector, tolerations, runtime_class fields
- runtime-api/values.yaml: Document new scheduling fields with examples
- single-cluster-path/values-openhands.yaml: Configure sysbox scheduling
  - node_selector: {sysbox-install: yes}
  - tolerations: sysbox-runtime NoSchedule
  - runtime_class: sysbox-runc
- Fix nil pointer errors in crd-check/_hook.tpl and deployment.yaml
- Add missing database fields to openhands/values.yaml postgresql auth

Co-authored-by: openhands <openhands@all-hands.dev>
saurya and others added 15 commits May 7, 2026 18:47
Add enable_gke_sandbox variable to support gVisor-based isolation as an
alternative to sysbox for runtime containers:

- Add enable_gke_sandbox variable (default: true for single-cluster envs)
- Configure sandbox_config with gvisor when enabled
- Use dynamic taints based on isolation mode:
  - gVisor: sandbox.gke.io/runtime=gvisor:NoSchedule
  - sysbox: sysbox-runtime=true:NoSchedule
- Only add sysbox-install label when NOT using GKE Sandbox
- Update documentation with clear explanation of both isolation modes

Co-authored-by: openhands <openhands@all-hands.dev>
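
The taint and label selection in the commit message above amounts to a two-way switch on `enable_gke_sandbox`. A minimal sketch, with `runtime_node_pool` as a hypothetical name; the label value `yes` is taken from the `sysbox-install: yes` node selector in an earlier commit.

```python
def runtime_node_pool(enable_gke_sandbox: bool) -> dict:
    """Taints and labels for the runtime node pool under each isolation mode."""
    if enable_gke_sandbox:
        return {
            "taints": ["sandbox.gke.io/runtime=gvisor:NoSchedule"],
            "labels": {},  # no sysbox-install label with GKE Sandbox
        }
    return {
        "taints": ["sysbox-runtime=true:NoSchedule"],
        "labels": {"sysbox-install": "yes"},
    }


print(runtime_node_pool(True))
print(runtime_node_pool(False))
```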
…raefik

image-loader values:
- Override nodeSelector to remove sysbox-install requirement
- Add gVisor taint toleration (sandbox.gke.io/runtime=gvisor)
- Document that runtimeClass setting has no effect (chart doesn't use it)
- Add note that image-loader may not be needed for gVisor deployments

traefik values:
- Enable kubernetesGateway provider for path-based runtime routing
- Add Gateway API documentation to README

Co-authored-by: openhands <openhands@all-hands.dev>
- Update README with clearer setup instructions
- Add more comprehensive realm-template.json
- Add example terraform.tfvars with common settings
- Update values-keycloak.yaml with additional configs
- Add new variables.tf entries for flexibility

Co-authored-by: openhands <openhands@all-hands.dev>
The image-loader DaemonSet needs to run on the same nodes where runtime
pods will be scheduled. In this GKE deployment, runtime pods use gVisor
(GKE Sandbox) instead of sysbox.

Changes:
- Updated nodeSelector from 'sysbox-install: yes' to 'sandbox.gke.io/runtime: gvisor'
- Updated tolerations for GKE Sandbox taint
- Removed image override to use base chart's agent-server image

Co-authored-by: openhands <openhands@all-hands.dev>
- Set RUNTIME_CLASS to empty string (no sandbox runtime required)
- Remove gVisor-specific warm runtime config (tolerations, node_selector, runtime_class)
- Use chart defaults for warm runtime configs instead of site override
- Update image-loader to target runtime nodes without gVisor
- Change nodeSelector from sandbox.gke.io/runtime to openhands.ai/node-type: runtime

This allows runtime pods to schedule on standard nodes without GKE Sandbox,
which is not enabled on the Platform Team Sandbox cluster.

Co-authored-by: openhands <openhands@all-hands.dev>
…oken role

- Fixed setup script bug: /tmp/realm.json wasn't created when realm exists
- Added support for syncing ALL clients from template (broker + allhands)
- Added logic to create/update broker:read-token role
- Added broker:read-token to default-roles-allhands composites
- Updated realm-template.json:
  - Increased accessCodeLifespan from 60 to 120 seconds
  - Added broker client with read-token role
  - Added default-roles-allhands with proper composites
  - Set storeToken: true and addReadTokenRoleOnCreate: true for enterprise_sso
- Updated values-openhands.yaml:
  - Added image tag override for testing
  - Fixed TLS secret names for runtime-api

Co-authored-by: openhands <openhands@all-hands.dev>
The OpenHands app was using /runtime/{runtime_id} but the runtime-api
HTTPRoute creates paths at /{runtime_id}/runtime/. This caused the app
to hit the runtime-api management API instead of the actual runtime pods,
resulting in 'Sandbox failed to start within 120s' errors.

Co-authored-by: openhands <openhands@all-hands.dev>
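
The path mismatch described in the commit message above, written out as two helpers. The function names and the runtime id `abc123` are made up for illustration.

```python
def runtime_base_path(runtime_id: str) -> str:
    """Path created by the runtime-api HTTPRoute for one runtime pod."""
    return f"/{runtime_id}/runtime/"


def legacy_runtime_path(runtime_id: str) -> str:
    """Path the app used before the fix; it reached the management API instead."""
    return f"/runtime/{runtime_id}"


print(legacy_runtime_path("abc123"), "->", runtime_base_path("abc123"))
```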
Updates enterprise-server image to include:
- Fix for store_idp_tokens skipping SAML IdPs (PR #14243)
- Fix for legacy agent_kind 'llm' validation in org settings

Co-authored-by: openhands <openhands@all-hands.dev>
… nodes

- Add listenerPort: 8443 for Gateway API (matches Traefik internal port)
- Set RUNTIME_DISABLE_SSL: false (behind TLS-terminating ingress)
- Configure warm runtimes with agent-server image and full environment
- Update node selectors to openhands.ai/node-type: runtime
- Update tolerations to openhands.ai/runtime=true:NoSchedule
- Fix image tag format (sha-115237f instead of sha-115237fd0)
- Add keycloak.authHost for shared auth across branches

Co-authored-by: openhands <openhands@all-hands.dev>
Add FULL_DEPLOYMENT_GUIDE.md with step-by-step instructions for:
- GCP project setup and API enablement
- Terraform infrastructure deployment (DNS + GKE)
- Kubernetes base components (cert-manager, Traefik, external-dns)
- Shared services setup (Keycloak)
- OpenHands deployment
- Branch deployment workflow
- Teardown procedures
- Troubleshooting guide

Co-authored-by: openhands <openhands@all-hands.dev>
…chedule

ROOT CAUSE: Configuration mismatch between infrastructure and Helm values
- Infrastructure: enable_gke_sandbox = false (no gVisor nodes)
- Helm: RUNTIME_CLASS = 'gvisor' (requires gVisor nodes)

The gVisor RuntimeClass on GKE has built-in scheduling constraints:
- Node Selector: sandbox.gke.io/runtime=gvisor
- Tolerations: sandbox.gke.io/runtime=gvisor:NoSchedule

Since no nodes have these labels (gVisor not enabled at infra level),
runtime pods were stuck in Pending state indefinitely.

FIX: Set RUNTIME_CLASS to empty string so pods use standard containerd.

TO ENABLE GVISOR LATER:
1. In terraform: set enable_gke_sandbox = true for runtime node pool
2. In this file: uncomment RUNTIME_CLASS: 'gvisor'

Co-authored-by: openhands <openhands@all-hands.dev>
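
The root cause in the commit message above can be reproduced with a much simplified model of Kubernetes scheduling. It checks only exact nodeSelector matches and exact key/value/effect tolerations; real scheduling also handles operators, affinities, and resources. All names below are hypothetical.

```python
def tolerates(tolerations, taint):
    """True if any toleration matches the taint's key, value, and effect."""
    return any(
        t.get("key") == taint["key"]
        and t.get("value") == taint["value"]
        and t.get("effect") == taint["effect"]
        for t in tolerations
    )


def can_schedule(pod, node):
    """Simplified check: nodeSelector must match labels, taints must be tolerated."""
    labels = node.get("labels", {})
    selector_ok = all(labels.get(k) == v for k, v in pod.get("nodeSelector", {}).items())
    taints_ok = all(tolerates(pod.get("tolerations", []), t) for t in node.get("taints", []))
    return selector_ok and taints_ok


GVISOR = {"key": "sandbox.gke.io/runtime", "value": "gvisor", "effect": "NoSchedule"}

# RUNTIME_CLASS = 'gvisor': the RuntimeClass adds this selector and toleration.
gvisor_pod = {"nodeSelector": {"sandbox.gke.io/runtime": "gvisor"}, "tolerations": [GVISOR]}
# RUNTIME_CLASS = '': no extra scheduling constraints.
plain_pod = {}

# enable_gke_sandbox = false: no gVisor label or taint on the node.
standard_node = {"labels": {}, "taints": []}
gvisor_node = {"labels": {"sandbox.gke.io/runtime": "gvisor"}, "taints": [GVISOR]}

print(can_schedule(gvisor_pod, standard_node))  # the Pending case from the commit
print(can_schedule(plain_pod, standard_node))   # the case after the fix
```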
The warm-runtimes-configmap.yaml template expects snake_case field names:
- nodeSelector → node_selector
- tolerations (already correct)
- runtime_class (added)

This fixes the issue where runtime pods weren't being scheduled on
sysbox-enabled nodes because the node_selector wasn't being rendered
into the warm-runtimes.json ConfigMap.

The mismatch caused runtime pods to:
1. Not get the sysbox-runc RuntimeClass
2. Not have the correct nodeSelector
3. Not have tolerations for the sysbox-runtime taint

Without these, warm runtimes would either fail to schedule or
schedule on the wrong nodes, causing the 120-second timeout.

Co-authored-by: openhands <openhands@all-hands.dev>
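
The commit above fixes the mismatch by renaming keys in the values file. As a separate illustration, not what the PR does, a pre-render step could normalize camelCase keys so that the template's snake_case lookups find them. `camel_to_snake` and `normalize_keys` are hypothetical helpers.

```python
import json
import re


def camel_to_snake(name: str) -> str:
    """nodeSelector -> node_selector; names already in snake_case pass through."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_keys(config: dict) -> dict:
    """Rename only top-level keys; nested Kubernetes objects keep their own casing."""
    return {camel_to_snake(k): v for k, v in config.items()}


# Values taken from the sysbox scheduling settings in the earlier commits.
warm_runtime = {
    "nodeSelector": {"sysbox-install": "yes"},
    "tolerations": [{"key": "sysbox-runtime", "value": "true", "effect": "NoSchedule"}],
    "runtimeClass": "sysbox-runc",
}
print(json.dumps(normalize_keys(warm_runtime), indent=2))
```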
- sysbox v0.6.4 does NOT support Ubuntu 24.04 (Noble Numbat)
- v0.6.7-0 adds Ubuntu 24.04 and K8s v1.32 support
- Add runtime_node_image_type variable to ensure UBUNTU_CONTAINERD
- Document version requirements in README

Co-authored-by: openhands <openhands@all-hands.dev>
…ng deployment

- Add platform-team-sandbox terraform environment
- Add image-loader helm values for runtime node configuration
- Add keycloak redirect URI hook for SSO configuration
- Update openhands values.yaml

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove PRD-developer-staging-deployment.md (beyond PR scope)
- Revert all replicated/ changes (beyond PR scope)
- Move site-infrastructure/terraform to terraform/gcp/platform-team-sandbox
- Move shared-auth and staging-dns into terraform/gcp/platform-team-sandbox
- Rename site-infrastructure to testenv-charts
- Update all path references in documentation and configuration files

Co-authored-by: openhands <openhands@all-hands.dev>
@saurya force-pushed the SV-OHE-staging-Deploy-Infra branch from c095b83 to 9ce01c9 on May 8, 2026 01:47
saurya and others added 3 commits May 8, 2026 10:23
- Create testenv-charts/helm/environments/staging/base-values.yaml with
  shared configuration for branch deployments on staging.all-hands.dev
- Add/update .agents/skills/deploy-branch.md with:
  - Instructions for looking up OpenHands PR image tags
  - Correct cluster name (staging-main) and paths
  - Updated secrets list from all-hands-system namespace
  - Detailed troubleshooting and quick reference sections

Co-authored-by: openhands <openhands@all-hands.dev>
…sandbox

- Add Claude skill for quick branch deployments
- Add shared base-values.yaml for staging environment
- Configured for ohe-staging.platform-team.all-hands.dev domain
- Uses ohe-staging-cluster in platform-team-sandbox project

Co-authored-by: openhands <openhands@all-hands.dev>
Clarify that litellm-env-secrets is shared infrastructure:
- API keys are set once per cluster, not per deployment
- All branch deployments use the shared LiteLLM instance
- Individual branches do not need their own API keys

Co-authored-by: openhands <openhands@all-hands.dev>
saurya pushed a commit that referenced this pull request May 10, 2026
This PR now uses the infrastructure created in PR #580 (SV-OHE-staging-Deploy-Infra):
- GCP Project: platform-team-sandbox
- GKE Cluster: ohe-staging-cluster
- Domain: ohe-staging.platform-team.all-hands.dev

Changes:
- Update workflow to target platform-team-sandbox cluster
- Use testenv-charts/helm/environments/staging/base-values.yaml as base config
- Copy secrets from all-hands-system namespace (not SOPS-encrypted)
- Update environment values to use new domain structure:
  - pathroute.ohe-staging.platform-team.all-hands.dev
  - subdomain.ohe-staging.platform-team.all-hands.dev
- Remove obsolete envs/common/values.yaml (now using testenv-charts base)
- Remove obsolete scripts/testbed/ (superseded by PR #580)
- Update documentation to reflect new infrastructure

Deployed URLs:
- https://pathroute.ohe-staging.platform-team.all-hands.dev (path-based routing)
- https://subdomain.ohe-staging.platform-team.all-hands.dev (subdomain routing)
saurya and others added 9 commits May 10, 2026 15:54
The OpenHands chart uses ingress.host for automation/integrations/mcp
ingresses. Without overriding this value, branch deployments inherit
the main hostname from base-values.yaml, causing routing conflicts.

Co-authored-by: openhands <openhands@all-hands.dev>
…I hook

- Create deploy-branch.sh script for branch deployments
- Enable Keycloak redirectUriRegistration in platform-team-sandbox values
- Hook automatically registers branch redirect URIs with shared Keycloak
- Fixes login issues when branch URLs aren't registered as valid redirect URIs

Usage:
  ./testenv-charts/scripts/deploy-branch.sh <branch-name> [--image-tag <tag>]

Co-authored-by: openhands <openhands@all-hands.dev>
The keycloak-redirect-uri-hook.yaml template was referencing a
non-existent openhands.labels helper function. Replaced with inline
labels matching the pattern used in other chart templates.

Co-authored-by: openhands <openhands@all-hands.dev>
Add option to skip ClusterRole creation for branch deployments that share
cluster-scoped RBAC with the main deployment. This resolves conflicts when
multiple releases try to create the same ClusterRole.

Changes:
- runtime-api: Add skipClusterRBAC and existingClusterRole options
- runtime-api: Version bump to 0.3.3
- openhands: Update runtime-api dependency to 0.3.3
- deploy-branch.sh: Enable skipClusterRBAC for branch deployments

Co-authored-by: openhands <openhands@all-hands.dev>
The env block in a branch values file replaces (does not merge with) the env block from base-values.yaml.
As a result, the branch deployment was missing critical env vars such as:
- OH_APP_MODE: saas (controls SaaS vs Enterprise behavior)
- CONVERSATION_MANAGER_CLASS (required for SaaS backend)
- OH_WEB_CLIENT_FEATURE_FLAGS_* (controls UI features)
- Runtime and LLM configuration
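
One plausible explanation, assuming env is defined as a list of name/value entries: when Helm layers several values files, maps are merged key by key but lists are replaced wholesale. The sketch below imitates that behaviour in plain Python. It is not Helm's own code, and BRANCH_NAME and the image keys are hypothetical.

```python
def merge_values(base, override):
    """Merge two values trees: maps merge recursively, anything else is replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Lists and scalars from the override replace the base value entirely.
            merged[key] = value
    return merged


base_values = {
    "image": {"repository": "example/openhands", "tag": "main"},  # hypothetical
    "env": [
        {"name": "OH_APP_MODE", "value": "saas"},
        {"name": "CONVERSATION_MANAGER_CLASS", "value": "..."},
    ],
}

branch_values = {
    "image": {"tag": "pr-123"},  # hypothetical tag
    "env": [{"name": "BRANCH_NAME", "value": "my-branch"}],  # hypothetical variable
}

merged = merge_values(base_values, branch_values)
print(merged["image"])                    # map: repository kept, tag overridden
print([e["name"] for e in merged["env"]])  # list: only the branch entries survive
```

Under this assumption, a branch file that sets env has to repeat every entry it still needs from the base file.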

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove hardcoded OH_WEB_CLIENT_PROVIDERS_CONFIGURED from base-values.yaml
  that was overriding the computed template value
- Add keycloak.authHost to branch-console-message.yaml to point to shared
  Keycloak (prevents chart from computing auth.console-message.ohe-staging...)
- Enable keycloak.redirectUriRegistration to register branch's redirect URI
  with shared Keycloak (required because Keycloak doesn't support wildcards)
- Move enterpriseSSO.enabled before env block for clarity
- Clean up env block - let chart compute AUTH_TYPE, OIDC_PROVIDER_URL,
  OH_APP_MODE, etc. from enabled flags

The chart template at charts/openhands/templates/_env.yaml computes
OH_WEB_CLIENT_PROVIDERS_CONFIGURED from github.enabled, gitlab.enabled,
bitbucket.enabled, and enterpriseSSO.enabled flags. Values in the env
block override the computed defaults, so removing the hardcoded value
lets the template logic work correctly.
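
The override order described above can be sketched as follows. This is a rough model, not the chart's template: the provider labels, the comma-separated format, and the function names are assumptions, and env is assumed to be a list of name/value entries.

```python
# Hypothetical mapping from values keys to provider labels.
PROVIDER_FLAGS = [
    ("github", "github"),
    ("gitlab", "gitlab"),
    ("bitbucket", "bitbucket"),
    ("enterpriseSSO", "enterprise_sso"),
]


def computed_providers(values):
    """Derive the provider list from the *.enabled flags."""
    enabled = [
        label for key, label in PROVIDER_FLAGS
        if values.get(key, {}).get("enabled")
    ]
    return ",".join(enabled)


def effective_env(values):
    """Start from computed defaults, then let explicit env entries win."""
    env = {"OH_WEB_CLIENT_PROVIDERS_CONFIGURED": computed_providers(values)}
    for entry in values.get("env", []):
        env[entry["name"]] = entry["value"]
    return env


flags = {"github": {"enabled": True}, "enterpriseSSO": {"enabled": True}}

# A hardcoded entry masks the computed value...
hardcoded = dict(flags, env=[{"name": "OH_WEB_CLIENT_PROVIDERS_CONFIGURED", "value": "github"}])
print(effective_env(hardcoded)["OH_WEB_CLIENT_PROVIDERS_CONFIGURED"])

# ...and removing it lets the flags decide.
print(effective_env(flags)["OH_WEB_CLIENT_PROVIDERS_CONFIGURED"])
```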

Co-authored-by: openhands <openhands@all-hands.dev>
Add generate-branch-values.sh script that creates values.yaml files for
branch deployments with proper Keycloak SSO configuration:

- Handles shared Keycloak setup (authHost, redirect URI registration)
- Computes AUTH_URL correctly for branch subdomains
- Configures enterprise SSO provider
- Sets up connections to shared PostgreSQL, Redis, and LiteLLM
- Two modes: minimal (shares most resources) and full (own postgres/redis)
- Includes comprehensive documentation and next-steps instructions

This makes it easy for developers to create branch deployments without
struggling with the SSO configuration gotchas.

Co-authored-by: openhands <openhands@all-hands.dev>
Adds a script to help developers easily create branch deployment values files
for the OHE staging environment. The script:

- Sanitizes branch names for Kubernetes DNS compliance
- Generates a complete Helm values file with all required overrides
- Outputs clear deployment instructions with kubectl/helm commands
- Supports custom image tags for PR-specific builds

Usage: ./create-branch-deployment.sh <branch-name> [--image-tag <tag>]
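
The branch-name sanitization step might look roughly like this. It is a sketch of the general technique, not the script's actual logic (which is a shell script not shown here); the function name is hypothetical. It targets the Kubernetes DNS label rules: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters.

```python
import re


def sanitize_branch_name(branch, max_length=63):
    """Turn a git branch name into a DNS-1123 label."""
    name = branch.lower()
    name = re.sub(r"[^a-z0-9-]+", "-", name)  # replace slashes, dots, underscores, etc.
    name = re.sub(r"-{2,}", "-", name)         # collapse runs of hyphens
    name = name[:max_length].strip("-")        # enforce length, then trim edge hyphens
    if not name:
        raise ValueError(f"branch name {branch!r} has no DNS-safe characters")
    return name


print(sanitize_branch_name("feature/Add_SSO.login"))
print(sanitize_branch_name("SV-OHE-staging-Deploy-Infra"))
```

Truncating before stripping ensures that a cut landing on a hyphen does not leave the label ending in one.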

Co-authored-by: openhands <openhands@all-hands.dev>
The backend requires ENABLE_ENTERPRISE_SSO=true in addition to the chart's
enterpriseSSO.enabled flag (which only sets OH_WEB_CLIENT_PROVIDERS_CONFIGURED
for the frontend).

Co-authored-by: openhands <openhands@all-hands.dev>