Description
CCWS compute (execute) nodes are given the hostname gpu-test-1, but the Slurm autoscaler writes ccw-gpu-test-1 (with the cluster prefix) to azure.conf. slurmd crash-loops because the two names do not match. Login nodes are not affected.
Root cause
In /opt/cycle/jetpack/system/bootstrap/azure-slurm-install/install.py, line ~1028:
def set_hostname(s: InstallSettings) -> None:
    new_hostname = s.node_name.lower()
    if s.mode != "execute" and not new_hostname.startswith(s.node_name_prefix):
        new_hostname = f"{s.node_name_prefix}{new_hostname}"
Because of the s.mode != "execute" condition, execute nodes never get the prefix prepended. The autoscaler, however, always generates names with the prefix (ccw-gpu-test-1).
Similarly, the CCWS 01-rename_host.sh script prepends the cluster prefix only for login nodes (if is_login), not for compute nodes.
The result: login → ccw-login-1 (correct), compute → gpu-test-1 (wrong).
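The mismatch can be reproduced in isolation. The sketch below mirrors the quoted condition in a standalone function and compares its result with the name the autoscaler is reported to generate. The function names, the "login" mode string, and the "ccw-" prefix value are assumptions for illustration (the prefix is inferred from the example hostnames); this is not the actual install.py or autoscaler code.

```python
def installer_hostname(node_name: str, prefix: str, mode: str) -> str:
    """Mirror the quoted set_hostname condition (hypothetical standalone form)."""
    new_hostname = node_name.lower()
    # Only non-execute nodes get the prefix, as in the quoted snippet.
    if mode != "execute" and not new_hostname.startswith(prefix):
        new_hostname = f"{prefix}{new_hostname}"
    return new_hostname


def autoscaler_name(node_name: str, prefix: str) -> str:
    """Name the autoscaler is reported to write: it always carries the prefix."""
    name = node_name.lower()
    return name if name.startswith(prefix) else f"{prefix}{name}"


if __name__ == "__main__":
    prefix = "ccw-"  # assumed value, inferred from the example hostnames
    for mode, node in (("login", "login-1"), ("execute", "gpu-test-1")):
        actual = installer_hostname(node, prefix, mode)
        expected = autoscaler_name(node, prefix)
        status = "match" if actual == expected else "MISMATCH"
        print(f"{mode}: hostname={actual} slurm={expected} -> {status}")
```

With these assumed inputs, the login node's names agree and the execute node's names differ, which is the behavior described in this report.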
Steps to reproduce
- Deploy CCWS with NodeNameIsHostname=true and NodeNamePrefix="Cluster Prefix"
- Add a compute node via the autoscaler (submit a job)
- Check the hostname on the compute node:
  hostname → gpu-test-1
- Check the name Slurm expects:
  grep gpu /etc/slurm/azure.conf → ccw-gpu-test-1

Result: slurmd crash-loops with "lookup failure for node".
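The two manual checks can be combined into one test: does the node's own hostname appear in azure.conf as a complete name? The sketch below is a minimal, hypothetical helper (hostname_in_conf is not part of CycleCloud or Slurm). It matches literal names only and does not expand Slurm hostlist ranges such as name-[1-4].

```python
import re


def hostname_in_conf(hostname: str, conf_text: str) -> bool:
    """Return True if hostname appears in conf_text as a complete name."""
    # Reject matches embedded in a longer name, e.g. "gpu-test-1"
    # inside "ccw-gpu-test-1" or "gpu-test-10".
    pattern = rf"(?<![\w-]){re.escape(hostname)}(?![\w-])"
    return re.search(pattern, conf_text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    # Illustrative line only; the real azure.conf contents may differ.
    sample = "Nodename=ccw-gpu-test-1 Feature=cloud STATE=CLOUD"
    for name in ("gpu-test-1", "ccw-gpu-test-1"):
        print(name, hostname_in_conf(name, sample))
```

On an affected node, one would pass socket.gethostname() and the contents of /etc/slurm/azure.conf. A False result corresponds to the mismatch reported here.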
Environment
- CycleCloud 8.8.3-3667
- Slurm 23.11.7
- Ubuntu 22.04
Workaround
Use a cluster-init script that extends the 01-rename_host.sh pattern to compute nodes: it checks whether cyclecloud.node.name already starts with the cluster prefix (autoscaler-provisioned nodes) and prepends the prefix if it does not (manually added nodes).
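The core of the workaround is an idempotent "prepend the prefix if it is missing" step. The following is a minimal sketch of that logic, not the actual cluster-init script: prefixed_hostname and apply_hostname are hypothetical names, and reading cyclecloud.node.name and the prefix from the node configuration is deliberately left out. The only external command used is hostnamectl set-hostname, and it runs only when dry_run is False.

```python
import subprocess
from typing import List


def prefixed_hostname(node_name: str, prefix: str) -> str:
    """Return node_name with prefix prepended unless it is already present."""
    name = node_name.lower()
    prefix = prefix.lower()
    # Autoscaler-provisioned nodes already carry the prefix; leave them alone.
    return name if name.startswith(prefix) else f"{prefix}{name}"


def apply_hostname(new_hostname: str, dry_run: bool = True) -> List[str]:
    """Build the hostnamectl command and run it unless dry_run is set."""
    cmd = ["hostnamectl", "set-hostname", new_hostname]
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root on the node
    return cmd


if __name__ == "__main__":
    # Example values taken from this report; "ccw-" is an assumed prefix.
    target = prefixed_hostname("gpu-test-1", "ccw-")
    print(" ".join(apply_hostname(target, dry_run=True)))
```

Because the function returns the name unchanged when the prefix is already present, it can run on both manually added and autoscaler-provisioned nodes.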
Related