
Design Doc for Azslurm Exporter HA Mode #497 (Draft)

azreenz wants to merge 4 commits into master from azreenzaman/exporter-ha


Conversation

@azreenz (Collaborator) commented Apr 14, 2026

No description provided.

@azreenz azreenz added the Do-not-merge Do not merge yet label Apr 14, 2026
- Pros:
  - Simple infrastructure: continue to collect on both nodes, but only export from the active controller
  - No need to stop collection tasks when a controller goes down
- Cons:
  - 2x load on slurmctld
Collaborator:

Another big con is twice the metrics reporting rate.

Collaborator Author:

Collecting and exporting would be decoupled here. The assumption is that we only ever export from one node, but the consideration was that we continue to collect on both nodes, stop exporting when a node is no longer the active controller, and then the node that becomes the active controller starts exporting.

- Cons:
  - Won't have full job history, since the backup wasn't keeping track
  - If the controller goes down, Slurm hooks won't run
  - Would still have to stop the original controller from exporting somehow
Collaborator:

Yep, this is a hard problem in general.
We should try to be resilient to both exporters running (i.e., we should verify that metrics are reasonable if BOTH exporters are running, as in your next option).

- Pros:
  - Less load on slurmctld
- Cons:
  - More complex
  - Need to figure out how to preserve job history
Collaborator:

This becomes easier if metrics are consistent when both exporters are exporting.
Then we could simply query further back in time in the accounting db at start of reporting (and potentially re-report some older data)

→ Write state JSON to StateSaveLocation
→ Include: hostname, timestamp, time_window, counters
→ Use atomic write (temp file + rename)
→ Use file locking (fcntl) to prevent corruption
Collaborator:

I'd remove this requirement - file locking over NFS is flaky at best, and I think line 129 (atomic file modification) is sufficient here, since we are regenerating the whole file (not modifying sections of a large file concurrently).

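The atomic-write step under discussion (temp file + rename, no fcntl locking) could be sketched roughly as follows. This is a minimal illustration, not the azslurm implementation; the `write_state` name and the state fields simply mirror the bullet list above.

```python
import json
import os
import socket
import tempfile
import time

def write_state(path, counters, time_window):
    """Atomically write exporter state JSON to `path` (temp file + rename)."""
    state = {
        "hostname": socket.gethostname(),
        "timestamp": time.time(),
        "time_window": list(time_window),
        "counters": counters,
    }
    # Create the temp file in the same directory so the final rename stays on
    # one filesystem; POSIX rename is atomic, so readers never observe a
    # partially written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic overwrite of the previous state file
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the whole file is regenerated on every write, concurrent readers see either the old or the new state, never a mix, which is the point of the comment above about locking being unnecessary.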
Proposed Flow:
Collaborator:

Why do we care about "Primary" vs "Secondary" for the exporter?
It makes some sense in Slurm to have fail-back because users want the primary's DNS name.
But for the exporter, no one cares what node it's running on (unless there's a significant performance optimization to always running the exporter on the active scheduler)

Collaborator Author:

So if the primary goes down, the backup assumes control, but if the primary comes back up, the primary assumes control again. In the proposed solution the exporter runs on both nodes but only collects and exports if its node is the active controller. Fail-back is for scenarios where the primary controller assumes control again.



## Considerations:
**Exporter only runs on primary**: start the exporter on the backup when it becomes primary
Collaborator:

Is this a hard requirement (that the exporter run on the Primary scheduler)?
I think there are 2 ways to think about this:

  1. ONLY export from the primary scheduler and use a change in active Slurm Scheduler to trigger failover
    a. this has the VERY nice feature that Slurm is responsible for determining when failover is required
  2. OR track failover locally to the exporters (maybe through a shared file that indicates the last time it was updated)
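Option 2 above could be as simple as a staleness check on the shared state file's modification time. This is a hypothetical sketch: the function name and the 60-second threshold are assumptions, not part of the design doc.

```python
import os
import time

STALE_AFTER_SECONDS = 60  # assumed threshold; tune to the export interval

def active_exporter_is_alive(state_path):
    """True if some exporter refreshed the shared state file recently."""
    try:
        age = time.time() - os.path.getmtime(state_path)
    except FileNotFoundError:
        return False  # no state file yet: nobody is actively exporting
    return age < STALE_AFTER_SECONDS
```

A standby exporter polling this and taking over when it returns False would implement local failover tracking, at the cost of owning the "is it really dead?" decision that option 1 delegates to Slurm.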

@bwatrous (Collaborator) commented Apr 14, 2026:

In general, I like leaving it up to Slurm to detect failover, since it's not a trivial problem :)

Collaborator Author:

I meant more so only running the exporter on the active controller. The problem with this is stopping the exporter on the controller that was active and isn't anymore. Say slurmctld is down on a node but the node itself is still running, and the backup has assumed control: all Slurm commands would still work on that primary node, so azslurm-exporter would continue to export. We need a way to stop the exporter on the node that goes down.

**Failover Detection**:
- SlurmctldPrimaryOnProg, SlurmctldPrimaryOffProg
- Pros:
- Slurm handles failover detection
Collaborator Author:

@bwatrous so the thing with the Slurm hooks is that when slurmctld goes down, SlurmctldPrimaryOffProg doesn't run; it only runs on a smooth scontrol takeover. SlurmctldPrimaryOnProg is pretty good because it will always run on the UP controller that is assuming control. We could use that to persist the state, i.e., when a controller takes control, we write its hostname to the state save file, and the exporters constantly read that file and only collect and export when they are on the active controller.
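The exporter-side half of that hook scheme could look something like this. Assumptions: SlurmctldPrimaryOnProg writes the takeover host's name into a file under StateSaveLocation (the file name here is invented), and each exporter polls it between collection cycles.

```python
import socket

def is_active_controller(active_host_file):
    """True if this host matches the hostname SlurmctldPrimaryOnProg recorded."""
    try:
        with open(active_host_file) as f:
            return f.read().strip() == socket.gethostname()
    except FileNotFoundError:
        return False  # no takeover recorded yet; stay idle
```

An exporter would call this each cycle and run in active mode (collect + export) only while it returns True, which matches the idle/active split described below.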

- When exporter detects it is not the active controller it will run in idle mode -- no collecting or exporting
- When exporter detects it is the active controller it will run in active mode -- collecting and exporting

The exporter in active mode will save sacct job-counter state in the state save location for the backup to load when failover occurs; and if azslurm-exporter crashes on the primary, it can still load the last state if available.
Collaborator Author:

Since we are initializing the exporters on both primary and backup, the sacct collector will already have the start time stored in the object. So when the backup starts, its first query will pull in from that initial start time. We can save the last end time when we stop the sacct collector, so the next time it starts, the window begins from that last end time and there are no gaps.
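The gap-free window hand-off described above can be sketched as follows. This is an illustration only; the `last_end_time` key mirrors the saved state discussed here and is an assumed name.

```python
import datetime as dt

def next_window(saved_state, initial_start, now):
    """Resume the sacct query window from the saved end time, or fall back
    to the collector's initial start time when no state was saved."""
    if saved_state and "last_end_time" in saved_state:
        start = dt.datetime.fromisoformat(saved_state["last_end_time"])
    else:
        start = initial_start
    return start, now
```

On failover the new active exporter queries `[last_end_time, now]`, re-covering any interval the previous exporter had not yet reported, so no jobs fall between windows.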

Collaborator Author:

We can still save the original start time in the state save file so that if azslurm crashes, we can load in that start time.

