Compute node hostname missing cluster prefix (install.py skips prefix for execute mode) #490

@SubaruArai

Description

CCWS compute (execute) nodes get hostname gpu-test-1 but the Slurm autoscaler generates ccw-gpu-test-1 (with cluster prefix) in azure.conf. slurmd crash-loops because the names don't match. Login nodes are not affected.

Root cause

In /opt/cycle/jetpack/system/bootstrap/azure-slurm-install/install.py, line ~1028:

def set_hostname(s: InstallSettings) -> None:
    new_hostname = s.node_name.lower()
    if s.mode != "execute" and not new_hostname.startswith(s.node_name_prefix):
        new_hostname = f"{s.node_name_prefix}{new_hostname}"

Execute nodes skip the prefix prepend (if s.mode != "execute"). But the autoscaler always generates names WITH the prefix (ccw-gpu-test-1).

Similarly, CCWS 01-rename_host.sh only prepends the cluster prefix for login nodes (if is_login), not compute nodes.

The result: login → ccw-login-1 (correct), compute → gpu-test-1 (wrong).
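The mismatch can be reproduced in isolation. This is a minimal sketch, not the actual install.py code: `Settings` is a hypothetical stand-in for `InstallSettings` with only the three fields the snippet reads, the prefix value `"ccw-"` and the mode name `"login"` are assumptions inferred from the hostnames above, and `prefixed_hostname` shows one possible change (dropping the mode check), not a confirmed fix.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for InstallSettings; only the fields read above."""
    node_name: str
    node_name_prefix: str
    mode: str


def current_hostname(s: Settings) -> str:
    """Mirrors the quoted condition: execute nodes never get the prefix."""
    new_hostname = s.node_name.lower()
    if s.mode != "execute" and not new_hostname.startswith(s.node_name_prefix):
        new_hostname = f"{s.node_name_prefix}{new_hostname}"
    return new_hostname


def prefixed_hostname(s: Settings) -> str:
    """Possible change (assumption): prepend the prefix regardless of mode."""
    new_hostname = s.node_name.lower()
    if not new_hostname.startswith(s.node_name_prefix):
        new_hostname = f"{s.node_name_prefix}{new_hostname}"
    return new_hostname


# Assumed values: prefix "ccw-" and mode "login" are inferred from the report.
login = Settings(node_name="login-1", node_name_prefix="ccw-", mode="login")
compute = Settings(node_name="gpu-test-1", node_name_prefix="ccw-", mode="execute")

print(current_hostname(login))     # ccw-login-1
print(current_hostname(compute))   # gpu-test-1 (does not match azure.conf)
print(prefixed_hostname(compute))  # ccw-gpu-test-1
```

The `startswith` guard keeps the second variant idempotent, so a node whose name already carries the prefix is left unchanged.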

Steps to reproduce

  1. Deploy CCWS with NodeNameIsHostname=true and NodeNamePrefix="Cluster Prefix"
  2. Add a compute node via the autoscaler (submit a job)
  3. Check hostname on the compute node: hostname → gpu-test-1
  4. Check what Slurm expects: grep gpu /etc/slurm/azure.conf → ccw-gpu-test-1
  5. slurmd crash-loops with "lookup failure for node"

Environment

  • CycleCloud 8.8.3-3667
  • Slurm 23.11.7
  • Ubuntu 22.04

Workaround

Use a cluster-init script that extends the 01-rename_host.sh pattern to compute nodes: it checks whether cyclecloud.node.name already has the prefix (autoscaler-provisioned nodes) and prepends it if not (manually added nodes).
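The core of that workaround is a small idempotent check. The sketch below is written in Python for illustration rather than as the actual cluster-init script. `ensure_prefix` and `apply_hostname` are hypothetical helper names, and the node name and prefix are passed in as plain arguments (in practice they would come from cyclecloud.node.name and the cluster's prefix setting). Lowercasing follows what install.py does with the node name.

```python
import subprocess


def ensure_prefix(node_name: str, prefix: str) -> str:
    """Return the lowercased node name, prepending the prefix only if it is missing."""
    name = node_name.lower()
    if name.startswith(prefix):
        # Autoscaler-provisioned node: the name already carries the prefix.
        return name
    # Manually added node: prepend the prefix.
    return f"{prefix}{name}"


def apply_hostname(hostname: str) -> None:
    """Set the system hostname; requires root and a systemd-based host."""
    subprocess.run(["hostnamectl", "set-hostname", hostname], check=True)


# Example values taken from the report; the prefix "ccw-" is an assumption.
print(ensure_prefix("gpu-test-1", "ccw-"))
print(ensure_prefix("ccw-gpu-test-1", "ccw-"))
```

Both calls yield the same hostname, so the script can run on every compute node regardless of how the node was provisioned.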
