
Design Doc for Azslurm Exporter HA Mode #497 (Draft)

azreenz wants to merge 4 commits into master from azreenzaman/exporter-ha


Conversation

@azreenz (Collaborator) commented Apr 14, 2026

No description provided.

@azreenz azreenz added the Do-not-merge Do not merge yet label Apr 14, 2026
- Pros:
  - Simple infrastructure: continue to collect on both nodes, but only export from the active controller
  - No need to stop collection tasks when a controller goes down
- Cons:
  - 2x load on slurmctld
Collaborator:

Another big con is twice the metrics reporting rate.

Collaborator Author:

Collecting and exporting would be decoupled here. The assumption is that we only ever export from one node, but the consideration was that we continue to collect on both nodes, stop exporting when a node is no longer the active controller, and then the node that becomes the active controller starts exporting.

- Cons:
  - Won't have full job history, since the backup wasn't keeping track
  - If the controller goes down, Slurm hooks won't run
  - Would still have to stop the original controller from exporting somehow
Collaborator:

Yep, this is a hard problem in general.
We should try to be resilient to both exporters running (i.e., we should verify that metrics are reasonable if BOTH exporters are running, as in your next option).

- Pros:
  - Less load on slurmctld
- Cons:
  - More complex
  - Need to figure out how to preserve job history
Collaborator:

This becomes easier if metrics are consistent when both exporters are exporting.
Then we could simply query further back in time in the accounting db at start of reporting (and potentially re-report some older data)

→ Write state JSON to StateSaveLocation
→ Include: hostname, timestamp, time_window, counters
→ Use atomic write (temp file + rename)
→ Use file locking (fcntl) to prevent corruption
Collaborator:

I'd remove this requirement - file locking over NFS is flaky at best, and I think line 129 (atomic file modification) is sufficient here, since we are regenerating the whole file (not modifying sections of a large file concurrently).

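The atomic-write step under discussion (temp file + rename, no fcntl locking) could be sketched roughly as follows. This is a minimal illustration, not the azslurm implementation; the `write_state` name and the state fields simply mirror the bullet list above.

```python
import json
import os
import socket
import tempfile
import time

def write_state(path, counters, time_window):
    """Atomically write exporter state JSON to `path` (temp file + rename)."""
    state = {
        "hostname": socket.gethostname(),
        "timestamp": time.time(),
        "time_window": list(time_window),
        "counters": counters,
    }
    # Create the temp file in the same directory so the final rename stays on
    # one filesystem; POSIX rename is atomic, so readers never observe a
    # partially written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic overwrite of the previous state file
    except BaseException:
        os.unlink(tmp)
        raise
```

Because the whole file is regenerated on every write, concurrent readers see either the old or the new state, never a mix, which is the point of the comment above about locking being unnecessary.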
Proposed Flow:
Collaborator:

Why do we care about "Primary" vs "Secondary" for the exporter?
It makes some sense in Slurm to have fail-back because users want the primary's DNS name.
But for the exporter, no one cares what node it's running on (unless there's a significant performance optimization to always running the exporter on the active scheduler)

Collaborator Author:

So if the primary goes down, the backup assumes control, but if the primary comes back up, the primary assumes control again. In the proposed solution the exporter runs on both nodes but only collects and exports if its node is the active controller. Fail-back is for scenarios where the primary controller assumes control again.



## Considerations:
**Exporter only runs on primary**: start the exporter on the backup when it becomes primary
Collaborator:

Is this a hard requirement (that the exporter run on the Primary scheduler)?
I think there are 2 ways to think about this:

  1. ONLY export from the primary scheduler and use a change in active Slurm Scheduler to trigger failover
    a. this has the VERY nice feature that Slurm is responsible for determining when failover is required
  2. OR track failover locally to the exporters (maybe through a shared file that indicates the last time it was updated)
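Option 2 above could be as simple as a staleness check on the shared state file's modification time. This is a hypothetical sketch: the function name and the 60-second threshold are assumptions, not part of the design doc.

```python
import os
import time

STALE_AFTER_SECONDS = 60  # assumed threshold; tune to the export interval

def active_exporter_is_alive(state_path):
    """True if some exporter refreshed the shared state file recently."""
    try:
        age = time.time() - os.path.getmtime(state_path)
    except FileNotFoundError:
        return False  # no state file yet: nobody is actively exporting
    return age < STALE_AFTER_SECONDS
```

A standby exporter polling this and taking over when it returns False would implement local failover tracking, at the cost of owning the "is it really dead?" decision that option 1 delegates to Slurm.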

@bwatrous (Collaborator) commented Apr 14, 2026:

In general, I like leaving it up to Slurm to detect failover, since it's not a trivial problem :)

Collaborator Author:

I meant more so only running the exporter on the active controller. The problem with this is stopping the exporter on the controller that was active and isn't anymore. Say slurmctld is down on a node but the node itself is still running, and the backup has assumed control: all Slurm commands would still work on that primary node, so azslurm-exporter would continue to export. We need a way to stop the exporter on the node that goes down.

**Failover Detection**:
- SlurmctldPrimaryOnProg, SlurmctldPrimaryOffProg
- Pros:
- Slurm handles failover detection
Collaborator Author:

@bwatrous so the thing with the Slurm hooks is that when slurmctld goes down, SlurmctldPrimaryOffProg doesn't run; it only runs on a smooth scontrol takeover. SlurmctldPrimaryOnProg is pretty good because it will always run on the UP controller that is assuming control. We could use that to persist the state, i.e., when a controller takes control, we write its hostname to the state save file, and the exporters constantly read that file and only collect and export when they are on the active controller.
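The exporter-side half of that hook scheme could look something like this. Assumptions: SlurmctldPrimaryOnProg writes the takeover host's name into a file under StateSaveLocation (the file name here is invented), and each exporter polls it between collection cycles.

```python
import socket

def is_active_controller(active_host_file):
    """True if this host matches the hostname SlurmctldPrimaryOnProg recorded."""
    try:
        with open(active_host_file) as f:
            return f.read().strip() == socket.gethostname()
    except FileNotFoundError:
        return False  # no takeover recorded yet; stay idle
```

An exporter would call this each cycle and run in active mode (collect + export) only while it returns True, which matches the idle/active split described below.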

- When exporter detects it is not the active controller it will run in idle mode -- no collecting or exporting
- When exporter detects it is the active controller it will run in active mode -- collecting and exporting

The exporter in active mode will save sacct job-counter state in the state save location for the backup to load when failover occurs; and if azslurm-exporter crashes on the primary, it can still load the last state if available.
Collaborator Author:

Since we are initializing the exporters on both primary and backup, the sacct collector will already have the start time stored in the object. So when the backup starts, its first query will pull in from that initial start time. We can save the last end time when we stop the sacct collector, so the next time it starts, the window begins from that last end time and there are no gaps.
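The gap-free window hand-off described above can be sketched as follows. This is an illustration only; the `last_end_time` key mirrors the saved state discussed here and is an assumed name.

```python
import datetime as dt

def next_window(saved_state, initial_start, now):
    """Resume the sacct query window from the saved end time, or fall back
    to the collector's initial start time when no state was saved."""
    if saved_state and "last_end_time" in saved_state:
        start = dt.datetime.fromisoformat(saved_state["last_end_time"])
    else:
        start = initial_start
    return start, now
```

On failover the new active exporter queries `[last_end_time, now]`, re-covering any interval the previous exporter had not yet reported, so no jobs fall between windows.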

Collaborator Author:

We can still save the original start time in the state save file so that if azslurm crashes, we can load in that start time.

