ClusterBloom is a tool for deploying and configuring Kubernetes clusters using RKE2, with specialized support for AMD GPU environments. It automates the process of setting up multi-node clusters, configuring storage with Longhorn, and integrating with various tools and services.
- Automated RKE2 Kubernetes cluster deployment
- ROCm setup and configuration for AMD GPU nodes
- Disk management and Longhorn storage integration
- Multi-node cluster support with easy node joining
- ClusterForge integration
- Ubuntu (supported versions checked at runtime)
- Sufficient disk space (500GB+ recommended for root partition, 2TB+ for workloads)
- NVMe drives for optimal storage configuration
- ROCm-compatible AMD GPUs (for GPU nodes) - ROCm 7.1.1 required
- Root/sudo access
- Download the latest bloom binary:

```bash
wget https://github.com/silogen/cluster-bloom/releases/download/<version>/bloom
```

- Make the binary executable:

```bash
chmod +x bloom
```

Launch the web UI to generate your bloom.yaml configuration:

```bash
./bloom
```

Access the configuration wizard at http://127.0.0.1:62078
After setting up the first node, it will generate a command in additional_node_command.txt that you can run on other nodes to join them to the cluster:

```bash
# Example (the actual command will differ)
echo -e 'FIRST_NODE: false\nJOIN_TOKEN: your-token-here\nSERVER_IP: your-server-ip' > bloom.yaml && sudo ./bloom cli bloom.yaml
```

Check the installed version:

```bash
./bloom version    # subcommand
./bloom --version  # flag (short: -v)
./bloom -v
```

Show all available commands and the complete configuration reference:

```bash
./bloom help
```

Note: Help includes auto-generated documentation for all configuration fields.
Get help for specific commands:

```bash
./bloom cleanup --help   # Remove existing cluster installation
./bloom cli --help       # Deploy cluster using configuration file
./bloom run --help       # Run exported Ansible playbook
```

Export generated Ansible playbooks for inspection without execution:
```bash
# Export playbook to stdout
./bloom cli bloom.yaml --export

# Export playbook with cleanup tasks included (for existing installations)
./bloom cli bloom.yaml --export --destroy-data > myPlaybook.yaml

# Save exported playbook to file
./bloom cli bloom.yaml --export > myPlaybook.yaml

# Execute exported playbook manually
sudo ./bloom run myPlaybook.yaml
```

Use Cases:
- Debugging: Inspect the complete playbook before execution
- Understanding: See exactly what actions will be performed
- Restricted Environments: Export in one environment, run in another
- Manual Control: Review and modify playbooks before execution
Important Notes:
- Exported playbooks are fully self-contained (all task files are automatically inlined)
- Configuration values from your bloom.yaml are properly applied
- Exported playbooks work with `sudo ./bloom run` for manual execution
- No external dependencies or task files are required for exported playbooks
- Cleanup Integration: Use `--export --destroy-data` to include cleanup tasks in exported playbooks
- Existing Installations: For existing cluster installations, use `--destroy-data` (or the standalone `bloom cleanup bloom.yaml`) before redeployment
- Optimized Cleanup: Best-effort node drain (~30s timeout) that internally uses kubectl's `--force` and `--disable-eviction` to bypass stuck pods; skips the volume-detach wait when no Longhorn volumes are detected
- Disk Wipe Preview: Both `bloom cleanup` and `--destroy-data` show a preview with:
  - User files listed (up to 5), or a count shown if more than 5
  - `lost+found` folders automatically excluded (ext4 system folder)
  - Clear visual warnings for user data at risk
- Premounted Disk Safety: `CLUSTER_PREMOUNTED_DISKS` disks have bloom artifacts cleaned, but their filesystem and user files are preserved
- Combined Disk Config: `CLUSTER_DISKS` and `CLUSTER_PREMOUNTED_DISKS` can be used simultaneously; mount indexes are allocated automatically to avoid conflicts
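The disk-wipe preview behavior described above can be sketched in plain shell. This is illustrative only — the directory and filenames are hypothetical stand-ins for a mounted cluster disk, and bloom's internal implementation may differ:

```bash
# Sketch: list user files on a mount, excluding ext4's lost+found system folder,
# showing a count instead of names when more than 5 files are at risk.
MOUNT=$(mktemp -d)                      # stand-in for a mounted cluster disk
mkdir -p "$MOUNT/lost+found"            # system folder: always excluded
touch "$MOUNT/models" "$MOUNT/datasets" # hypothetical user files

files=$(find "$MOUNT" -mindepth 1 -maxdepth 1 ! -name 'lost+found' | sort)
count=$(printf '%s\n' "$files" | grep -c .)
if [ "$count" -gt 5 ]; then
  echo "WARNING: $count user files would be wiped"
else
  printf 'WARNING: files at risk:\n%s\n' "$files"
fi
rm -rf "$MOUNT"
```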
Cluster-Bloom can be configured through environment variables, command-line flags, or a configuration file.
| Variable | Description | Default |
|---|---|---|
| ADDITIONAL_OIDC_PROVIDERS | List of additional OIDC providers for authentication (see examples below) | [] |
| ADDITIONAL_TLS_SAN_URLS | Additional TLS Subject Alternative Name URLs for Kubernetes API server certificate | [] |
| CERT_OPTION | Certificate option when USE_CERT_MANAGER is false. Choose 'existing' or 'generate' | "" |
| CF_VALUES | Path to ClusterForge values file (optional). Example: "values_cf.yaml" | "" |
| CLUSTER_DISKS | Comma-separated list of disk devices. Example "/dev/sdb,/dev/sdc". Also skips NVME drive checks. | "" |
| CLUSTER_LISTEN_IP | Network IP specification for cluster binding. Supports exact IP ("192.168.1.100") or subnet CIDR ("192.168.1.0/24"). Overrides auto-detection for multi-homed systems. | "" |
| CLUSTER_SIZE | Size category for cluster deployment planning. Options: small, medium, large | medium |
| CLUSTER_PREMOUNTED_DISKS | Comma-separated list of absolute disk paths to use for Longhorn | "" |
| CLUSTERFORGE_RELEASE | ClusterForge version to deploy. Accepts version tags (e.g. v2.0.2), full release URLs, latest (fetches newest GitHub release via API), none, or "" to skip | latest |
| CONTROL_PLANE | Set to true if this node should be a control plane node (only applies when FIRST_NODE is false) | false |
| DOMAIN | The domain name for the cluster (e.g., "cluster.example.com") (required). | "" |
| DNS_SERVERS | Custom DNS servers for RKE2 cluster. If set, these nameservers will be written to /etc/rancher/rke2/resolv.conf instead of copying host DNS. Format as YAML list (e.g., ["8.8.8.8", "1.1.1.1"]) | [] |
| FIX_DNS | Opt-in to allow automatic DNS fixes. Only modifies DNS if broken and external DNS works. Creates backups and auto-rolls back on failure. | false |
| FIRST_NODE | Set to true if this is the first node in the cluster | true |
| GPU_NODE | Set to true if this node has GPUs | true |
| JOIN_TOKEN | The token used to join additional nodes to the cluster | |
| NO_DISKS_FOR_CLUSTER | Set to true to skip disk-related operations | false |
| RKE2_VERSION | Specific RKE2 version to install (e.g., "v1.34.1+rke2r1") | "" |
| SERVER_IP | The IP address of the RKE2 server (required for additional nodes) | |
| SKIP_RANCHER_PARTITION_CHECK | Set to true to skip /var/lib/rancher partition size check | false |
| TLS_CERT | Path to TLS certificate file for ingress (required if CERT_OPTION is 'existing') | "" |
| TLS_KEY | Path to TLS private key file for ingress (required if CERT_OPTION is 'existing') | "" |
| USE_CERT_MANAGER | Use cert-manager with Let's Encrypt for automatic TLS certificates | false |
| ARGOCD_VERSION | ArgoCD version to install | v2.14.11 |
| CLUSTERFORGE_REPO | ClusterForge git repository URL for ArgoCD-based deployment | https://github.com/silogen/cluster-forge.git |
| INSTALL_ARGOCD | Install ArgoCD core for GitOps (small clusters only) | true |
| PRELOAD_IMAGES | Comma-separated list of container images to preload | docker.io/rocm/pytorch:rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.6.0,docker.io/rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605 |
| RKE2_EXTRA_CONFIG | Additional RKE2 configuration in YAML format | "" |
| RKE2_INSTALLATION_URL | RKE2 installation script URL | https://get.rke2.io |
| ROCM_BASE_URL | ROCm base repository URL | https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/ |
| ROCM_DEB_PACKAGE | ROCm DEB package name | amdgpu-install_7.1.1.70101-1_all.deb |
Basic OIDC Provider:
```yaml
ADDITIONAL_OIDC_PROVIDERS:
  - url: "https://keycloak.example.com/realms/main"
    audiences: ["k8s"]
```

Notes:
- ClaimMappings use `username` and `groups` with prefix `"oidc:"`
- `url`: HTTPS URL of your OIDC provider (Keycloak, Auth0, etc.)
- `audiences`: List of client IDs from your OIDC provider
- Default behavior: If `ADDITIONAL_OIDC_PROVIDERS` is skipped, a default OIDC provider will be configured pointing to the internal Keycloak `airm` realm at `https://kc.{DOMAIN}/realms/airm`
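Since `ADDITIONAL_OIDC_PROVIDERS` is a list, more than one provider can be supplied. A hypothetical two-provider sketch (the URLs and audiences below are placeholders, not defaults):

```yaml
ADDITIONAL_OIDC_PROVIDERS:
  - url: "https://keycloak.example.com/realms/main"
    audiences: ["k8s"]
  - url: "https://auth.example.com"       # hypothetical second provider
    audiences: ["kubernetes-cli"]
```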
For advanced configuration, multiple providers, and troubleshooting, see docs/oidc-authentication.md.
TLS Subject Alternative Names (SANs) allow your Kubernetes API server to be accessed via multiple domain names. Cluster-Bloom automatically configures TLS-SANs for secure remote access to your cluster.
Note: Wildcard domains (*.example.com) are not supported by RKE2.
Basic Configuration:
```yaml
DOMAIN: "example.com"
ADDITIONAL_TLS_SAN_URLS:
  - "api.example.com"
  - "kubernetes.example.com"
```

Key Points:
- Cluster-Bloom automatically generates `k8s.{DOMAIN}` as a default TLS-SAN
- Do not duplicate the auto-generated SAN in `ADDITIONAL_TLS_SAN_URLS`
- Valid domain names only (no wildcards)
- The configuration wizard provides real-time validation
For detailed examples, testing instructions, and common use cases, see docs/tls-san-configuration.md.
CLUSTER_LISTEN_IP provides precise control over which network interface the Kubernetes cluster uses for communication. This is essential for systems with multiple network interfaces where automatic detection might select the wrong IP.
Basic Configuration:
```yaml
# Explicit IP address
CLUSTER_LISTEN_IP: "192.168.1.100"

# Or CIDR subnet (auto-selects the first matching IP)
CLUSTER_LISTEN_IP: "192.168.1.0/24"
```

When to use:
- Multi-homed systems: Servers with multiple network interfaces
- Complex networking: VPN, Docker networks, or overlay networks present
- Specific requirements: When you need cluster traffic on a particular interface
How it works:
- Priority 1: If explicit IP specified, validates it exists on system interfaces
- Priority 2: If CIDR subnet specified, finds first matching IP on system
- Priority 3: Falls back to default route interface (auto-detection)
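The Priority-2 subnet match can be sketched in plain shell: given a CIDR, pick the first local IP that falls inside it. The candidate IPs below are hypothetical, and bloom's actual matching logic may differ:

```bash
# Convert a dotted quad to a 32-bit integer.
to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_subnet IP CIDR: succeed if IP falls inside CIDR.
in_subnet() {
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(to_int "$1") & mask )) -eq $(( $(to_int "${2%/*}") & mask )) ]
}

# Hypothetical list of IPs found on local interfaces.
for cand in 10.0.0.5 192.168.1.100; do
  in_subnet "$cand" "192.168.1.0/24" && { echo "selected: $cand"; break; }
done
# prints "selected: 192.168.1.100"
```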
Environment variable support:
```bash
export CLUSTER_LISTEN_IP="192.168.1.100"
sudo ./bloom cli bloom.yaml
```

CLI flag support:

```bash
sudo ./bloom cli bloom.yaml --cluster-listen-ip "192.168.1.100"
```

Validation: The system validates that specified IPs/subnets exist on target system interfaces before deployment, providing helpful error messages if not found.
Create a YAML configuration file (e.g., bloom.yaml):
```yaml
DOMAIN: "your-domain.example.com"     # Required: Your cluster domain
FIRST_NODE: true
GPU_NODE: true                        # Set to false if no GPUs
CLUSTER_DISKS: "/dev/nvme1n1"         # Disk device path for storage
CLUSTER_LISTEN_IP: "192.168.1.100"    # Optional: specific IP for cluster binding
CERT_OPTION: "generate"               # Options: "generate" or "existing"
CLUSTERFORGE_RELEASE: "v2.0.0"        # Version tag, full URL, "latest", "none", or "" to skip
PRELOAD_IMAGES: ""                    # Optional: comma-separated container images
```

Then run with:

```bash
sudo ./bloom cli bloom.yaml
```

The cli command supports several options for different deployment scenarios:
```bash
# Standard deployment
sudo ./bloom cli bloom.yaml

# Export playbook without execution (for debugging/inspection)
./bloom cli bloom.yaml --export

# Dry run (check mode without making changes)
sudo ./bloom cli bloom.yaml --dry-run

# Run specific playbook tags only
sudo ./bloom cli bloom.yaml --tags "validate_node,prep_node"

# Export with cleanup tasks for existing installations
./bloom cli bloom.yaml --export --destroy-data > cleanupPlaybook.yaml

# Dangerous: Destroy existing data and start fresh
sudo ./bloom cli bloom.yaml --destroy-data
```

Run exported or custom Ansible playbooks using the containerized runtime:
```bash
# Run exported playbook
sudo ./bloom run myPlaybook.yaml

# Run with additional variables
sudo ./bloom run myPlaybook.yaml -e "CUSTOM_VAR=value"

# Run with a configuration file for additional variables
sudo ./bloom run myPlaybook.yaml --config additional-config.yaml

# Run with verbose output
sudo ./bloom run myPlaybook.yaml --verbose
```

Cluster-Bloom performs the following steps during installation:
- Checks for supported Ubuntu version
- Installs required packages (jq, nfs-common, open-iscsi)
- Configures firewall and networking
- Sets up ROCm for GPU nodes
- Prepares and installs RKE2
- Configures storage (local-path for small/medium clusters, Longhorn for large clusters)
- Sets up Kubernetes tools and configuration
- Installs ClusterForge
- go (1.24.0)
- cobra-cli
- jq, nfs-common, open-iscsi (installed during setup)
- kubectl and k9s (installed during setup)
Apache License 2.0