ClusterBloom

ClusterBloom is a tool for deploying and configuring Kubernetes clusters using RKE2, with specialized support for AMD GPU environments. It automates the process of setting up multi-node clusters, configuring storage with Longhorn, and integrating with various tools and services.

Features

  • Automated RKE2 Kubernetes cluster deployment
  • ROCm setup and configuration for AMD GPU nodes
  • Disk management and Longhorn storage integration
  • Multi-node cluster support with easy node joining
  • ClusterForge integration

Prerequisites

  • Ubuntu (supported versions checked at runtime)
  • Sufficient disk space (500GB+ recommended for root partition, 2TB+ for workloads)
  • NVMe drives for optimal storage configuration
  • ROCm-compatible AMD GPUs (for GPU nodes) - ROCm 7.1.1 required
  • Root/sudo access

Getting Started

Download and Setup

  1. Download the latest bloom binary:
wget https://github.com/silogen/cluster-bloom/releases/download/<version>/bloom
  2. Make the binary executable:
chmod +x bloom

Usage

Configuration Generation

Launch the web UI to generate your bloom.yaml configuration:

./bloom

Access the configuration wizard at http://127.0.0.1:62078

Additional Node Setup

After the first node is set up, Bloom writes a join command to additional_node_command.txt that you can run on other nodes to add them to the cluster:

# Example (actual command will be different)
echo -e 'FIRST_NODE: false\nJOIN_TOKEN: your-token-here\nSERVER_IP: your-server-ip' > bloom.yaml && sudo ./bloom cli bloom.yaml
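
If the joining node should also serve as a control plane, the same configuration can set CONTROL_PLANE, which only applies when FIRST_NODE is false (see the configuration table below). A minimal sketch with placeholder values:

# Placeholder values; replace the token and IP with those generated on your first node
FIRST_NODE: false
CONTROL_PLANE: true
JOIN_TOKEN: your-token-here
SERVER_IP: your-server-ip

Deploy it the same way, with sudo ./bloom cli bloom.yaml.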

Version Information

./bloom version   # subcommand
./bloom --version # flag (short: -v)
./bloom -v

Command Help

Show all available commands and complete configuration reference:

./bloom help

Note: Help includes auto-generated documentation for all configuration fields.

Get help for specific commands:

./bloom cleanup --help  # Remove existing cluster installation
./bloom cli --help      # Deploy cluster using configuration file
./bloom run --help      # Run exported Ansible playbook

Playbook Export and Debugging

Export generated Ansible playbooks for inspection without execution:

# Export playbook to stdout
./bloom cli bloom.yaml --export

# Export playbook with cleanup tasks included (for existing installations)
./bloom cli bloom.yaml --export --destroy-data > myPlaybook.yaml

# Save exported playbook to file
./bloom cli bloom.yaml --export > myPlaybook.yaml

# Execute exported playbook manually
sudo ./bloom run myPlaybook.yaml

Use Cases:

  • Debugging: Inspect the complete playbook before execution
  • Understanding: See exactly what actions will be performed
  • Restricted Environments: Export in one environment, run in another
  • Manual Control: Review and modify playbooks before execution

Important Notes:

  • Exported playbooks are fully self-contained (all task files are automatically inlined)
  • Configuration values from your bloom.yaml are properly applied
  • Exported playbooks can be executed manually with sudo ./bloom run
  • No external dependencies or task files are required for exported playbooks
  • Cleanup Integration: Use --export --destroy-data to include cleanup tasks in exported playbooks
  • Existing Installations: Before redeploying over an existing cluster installation, use --destroy-data (or the standalone bloom cleanup bloom.yaml)
  • Optimized Cleanup: Node drain is best-effort (~30s timeout) and internally uses kubectl's --force and --disable-eviction to bypass stuck pods; the volume-detach wait is skipped when no Longhorn volumes are detected
  • Disk Wipe Preview: Both bloom cleanup and --destroy-data show a preview with:
    • User files listed (up to 5), or count shown if more than 5
    • lost+found folders automatically excluded (ext4 system folder)
    • Clear visual warnings for user data at risk
  • Premounted Disk Safety: Disks listed in CLUSTER_PREMOUNTED_DISKS have Bloom artifacts removed, but their filesystems and user files are preserved
  • Combined Disk Config: CLUSTER_DISKS and CLUSTER_PREMOUNTED_DISKS can be used simultaneously; mount indexes are allocated automatically to avoid conflicts (see the example after this list)
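
For example, a bloom.yaml that combines both disk options might look like the following sketch; the device and mount paths are illustrative, not defaults:

CLUSTER_DISKS: "/dev/nvme1n1,/dev/nvme2n1"           # raw block devices to use for cluster storage
CLUSTER_PREMOUNTED_DISKS: "/mnt/data1,/mnt/data2"    # already-mounted paths used for Longhorn; filesystems and user files are preserved during cleanup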

Configuration

Cluster-Bloom can be configured through environment variables, command-line flags, or a configuration file.

Configuration Variables

Variable | Description | Default
ADDITIONAL_OIDC_PROVIDERS | List of additional OIDC providers for authentication (see examples below) | []
ADDITIONAL_TLS_SAN_URLS | Additional TLS Subject Alternative Name URLs for the Kubernetes API server certificate | []
CERT_OPTION | Certificate option when USE_CERT_MANAGER is false. Choose 'existing' or 'generate' | ""
CF_VALUES | Path to ClusterForge values file (optional). Example: "values_cf.yaml" | ""
CLUSTER_DISKS | Comma-separated list of disk devices, e.g. "/dev/sdb,/dev/sdc". Also skips NVMe drive checks. | ""
CLUSTER_LISTEN_IP | Network IP specification for cluster binding. Supports an exact IP ("192.168.1.100") or a subnet CIDR ("192.168.1.0/24"). Overrides auto-detection for multi-homed systems. | ""
CLUSTER_SIZE | Size category for cluster deployment planning. Options: small, medium, large | medium
CLUSTER_PREMOUNTED_DISKS | Comma-separated list of absolute disk paths to use for Longhorn | ""
CLUSTERFORGE_RELEASE | ClusterForge version to deploy. Accepts version tags (e.g. v2.0.2), full release URLs, latest (fetches the newest GitHub release via the API), none, or "" to skip | latest
CONTROL_PLANE | Set to true if this node should be a control plane node (only applies when FIRST_NODE is false) | false
DOMAIN | The domain name for the cluster (e.g., "cluster.example.com") (required) | ""
DNS_SERVERS | Custom DNS servers for the RKE2 cluster. If set, these nameservers are written to /etc/rancher/rke2/resolv.conf instead of copying host DNS. Format as a YAML list (e.g., ["8.8.8.8", "1.1.1.1"]) | []
FIX_DNS | Opt-in to allow automatic DNS fixes. Only modifies DNS if it is broken and external DNS works. Creates backups and auto-rolls back on failure. | false
FIRST_NODE | Set to true if this is the first node in the cluster | true
GPU_NODE | Set to true if this node has GPUs | true
JOIN_TOKEN | The token used to join additional nodes to the cluster |
NO_DISKS_FOR_CLUSTER | Set to true to skip disk-related operations | false
RKE2_VERSION | Specific RKE2 version to install (e.g., "v1.34.1+rke2r1") | ""
SERVER_IP | The IP address of the RKE2 server (required for additional nodes) |
SKIP_RANCHER_PARTITION_CHECK | Set to true to skip the /var/lib/rancher partition size check | false
TLS_CERT | Path to TLS certificate file for ingress (required if CERT_OPTION is 'existing') | ""
TLS_KEY | Path to TLS private key file for ingress (required if CERT_OPTION is 'existing') | ""
USE_CERT_MANAGER | Use cert-manager with Let's Encrypt for automatic TLS certificates | false
ARGOCD_VERSION | ArgoCD version to install | v2.14.11
CLUSTERFORGE_REPO | ClusterForge git repository URL for ArgoCD-based deployment | https://github.com/silogen/cluster-forge.git
INSTALL_ARGOCD | Install ArgoCD core for GitOps (small clusters only) | true
PRELOAD_IMAGES | Comma-separated list of container images to preload | docker.io/rocm/pytorch:rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.6.0,docker.io/rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605
RKE2_EXTRA_CONFIG | Additional RKE2 configuration in YAML format | ""
RKE2_INSTALLATION_URL | RKE2 installation script URL | https://get.rke2.io
ROCM_BASE_URL | ROCm base repository URL | https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/
ROCM_DEB_PACKAGE | ROCm DEB package name | amdgpu-install_7.1.1.70101-1_all.deb
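
To illustrate how several of these variables fit together, the following bloom.yaml sketch uses placeholder values (the certificate paths, DNS servers, and RKE2 version are examples, not defaults):

DOMAIN: "cluster.example.com"
CERT_OPTION: "existing"
TLS_CERT: "/etc/ssl/certs/cluster.crt"    # placeholder path
TLS_KEY: "/etc/ssl/private/cluster.key"   # placeholder path
DNS_SERVERS: ["8.8.8.8", "1.1.1.1"]
RKE2_VERSION: "v1.34.1+rke2r1"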

OIDC Configuration Examples

Basic OIDC Provider:

ADDITIONAL_OIDC_PROVIDERS:
  - url: "https://keycloak.example.com/realms/main"
    audiences: ["k8s"]

Notes:

  • ClaimMappings use username and groups with prefix "oidc:"
  • url: HTTPS URL of your OIDC provider (Keycloak, Auth0, etc.)
  • audiences: List of client IDs from your OIDC provider
  • Default behavior: If ADDITIONAL_OIDC_PROVIDERS is not set, a default OIDC provider is configured pointing to the internal Keycloak airm realm at https://kc.{DOMAIN}/realms/airm
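
Multiple providers can be listed as additional entries under ADDITIONAL_OIDC_PROVIDERS; a minimal sketch with illustrative URLs and audiences:

ADDITIONAL_OIDC_PROVIDERS:
  - url: "https://keycloak.example.com/realms/main"
    audiences: ["k8s"]
  - url: "https://auth.example.com/oidc"      # illustrative second provider
    audiences: ["kubernetes-api"]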

For advanced configuration, multiple providers, and troubleshooting, see docs/oidc-authentication.md.

TLS-SAN Configuration

TLS Subject Alternative Names (SANs) allow your Kubernetes API server to be accessed via multiple domain names. Cluster-Bloom automatically configures TLS-SANs for secure remote access to your cluster.

Note: Wildcard domains (*.example.com) are not supported by RKE2.

Basic Configuration:

DOMAIN: "example.com"
ADDITIONAL_TLS_SAN_URLS:
  - "api.example.com"
  - "kubernetes.example.com"

Key Points:

  • Cluster-Bloom automatically generates k8s.{DOMAIN} as a default TLS-SAN
  • Do not duplicate the auto-generated SAN in ADDITIONAL_TLS_SAN_URLS
  • Valid domain names only (no wildcards)
  • The configuration wizard provides real-time validation

For detailed examples, testing instructions, and common use cases, see docs/tls-san-configuration.md.

Network Configuration

CLUSTER_LISTEN_IP provides precise control over which network interface the Kubernetes cluster uses for communication. This is essential for systems with multiple network interfaces where automatic detection might select the wrong IP.

Basic Configuration:

# Explicit IP address
CLUSTER_LISTEN_IP: "192.168.1.100"

# Or CIDR subnet (auto-selects first matching IP)
CLUSTER_LISTEN_IP: "192.168.1.0/24"

When to use:

  • Multi-homed systems: Servers with multiple network interfaces
  • Complex networking: VPN, Docker networks, or overlay networks present
  • Specific requirements: When you need cluster traffic on a particular interface

How it works:

  1. Priority 1: If explicit IP specified, validates it exists on system interfaces
  2. Priority 2: If CIDR subnet specified, finds first matching IP on system
  3. Priority 3: Falls back to default route interface (auto-detection)

Environment variable support:

export CLUSTER_LISTEN_IP="192.168.1.100"
sudo ./bloom cli bloom.yaml

CLI flag support:

sudo ./bloom cli bloom.yaml --cluster-listen-ip "192.168.1.100"

Validation: The system validates that specified IPs/subnets exist on target system interfaces before deployment, providing helpful error messages if not found.

Using a Configuration File

Create a YAML configuration file (e.g., bloom.yaml):

DOMAIN: "your-domain.example.com"  # Required: Your cluster domain
FIRST_NODE: true
GPU_NODE: true                     # Set to false if no GPUs
CLUSTER_DISKS: "/dev/nvme1n1"     # Disk device path for storage
CLUSTER_LISTEN_IP: "192.168.1.100" # Optional: specific IP for cluster binding
CERT_OPTION: "generate"           # Options: "generate" or "existing"
CLUSTERFORGE_RELEASE: "v2.0.0"    # Version tag, full URL, "latest", "none", or "" to skip
PRELOAD_IMAGES: ""                # Optional: comma-separated container images

Then run with:

sudo ./bloom cli bloom.yaml

CLI Command Options

The cli command supports several options for different deployment scenarios:

# Standard deployment
sudo ./bloom cli bloom.yaml

# Export playbook without execution (for debugging/inspection)
./bloom cli bloom.yaml --export

# Dry run (check mode without making changes)
sudo ./bloom cli bloom.yaml --dry-run

# Run specific playbook tags only
sudo ./bloom cli bloom.yaml --tags "validate_node,prep_node"

# Export with cleanup tasks for existing installations
./bloom cli bloom.yaml --export --destroy-data > cleanupPlaybook.yaml

# Dangerous: Destroy existing data and start fresh
sudo ./bloom cli bloom.yaml --destroy-data

Separate Playbook Execution

Run exported or custom Ansible playbooks using the containerized runtime:

# Run exported playbook
sudo ./bloom run myPlaybook.yaml

# Run with additional variables
sudo ./bloom run myPlaybook.yaml -e "CUSTOM_VAR=value"

# Run with configuration file for additional variables
sudo ./bloom run myPlaybook.yaml --config additional-config.yaml

# Run with verbose output
sudo ./bloom run myPlaybook.yaml --verbose

Installation Process

Cluster-Bloom performs the following steps during installation:

  1. Checks for supported Ubuntu version
  2. Installs required packages (jq, nfs-common, open-iscsi)
  3. Configures firewall and networking
  4. Sets up ROCm for GPU nodes
  5. Prepares and installs RKE2
  6. Configures storage (local-path for small/medium clusters, Longhorn for large clusters)
  7. Sets up Kubernetes tools and configuration
  8. Installs ClusterForge

Dependencies

  • go (1.24.0)
  • cobra-cli
  • jq, nfs-common, open-iscsi (installed during setup)
  • kubectl and k9s (installed during setup)

License

Apache License 2.0
