# Design Doc for Azslurm Exporter HA Mode (#497)
## Conversation
> - Simple infrastructure; continue to collect, only export when active controller
> - Don't need to consider stopping collect tasks when the controller is down
> - Cons:
>   - 2x load on slurmctld

another big con is twice the metrics reporting rate

Collecting and exporting would be decoupled here: the assumption is we only ever export on one node, but the consideration was that we continue to collect on both nodes, stop exporting when a node is no longer the active controller, and have the node that becomes the active controller start exporting.
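
A minimal sketch of that decoupling (all names here are hypothetical, not from the design doc): collection runs on both controllers so no samples are missed, and only the export path checks whether this node is active:

```python
# Hypothetical sketch: collection always runs, export is gated on
# whether this node is currently the active controller.
import socket
import threading
import time

class GatedCollector:
    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}  # most recent collected snapshot

    def collect_loop(self, interval=30):
        """Runs on BOTH controllers, active or not, so no samples are missed."""
        while True:
            snapshot = self._scrape_slurm()
            with self._lock:
                self._latest = snapshot
            time.sleep(interval)

    def export(self, active_hostname):
        """Only the active controller actually exposes metrics."""
        if socket.gethostname() != active_hostname:
            return {}  # idle: keep collecting, export nothing
        with self._lock:
            return dict(self._latest)

    def _scrape_slurm(self):
        return {}  # placeholder for the real sdiag/sacct collection
```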
> - Cons:
>   - won't have full job history since the backup wasn't keeping track
>   - if the controller goes down, Slurm hooks won't run
>   - would still have to stop the original controller from exporting somehow

Yep, this is a hard problem in general.
We should try to be resilient to both exporters running (i.e., we should verify that metrics are reasonable if BOTH exporters are running, as in your next option).
> - Less load on slurmctld
> - Cons:
>   - More complex
>   - need to figure out how to preserve job history

This becomes easier if metrics are consistent when both exporters are exporting.
Then we could simply query further back in time in the accounting DB at the start of reporting (and potentially re-report some older data).
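
For example, a hedged sketch of that overlap-and-dedupe approach (the `OVERLAP` value and the row fields are assumptions):

```python
# Sketch: widen the first accounting query window backwards and
# deduplicate, so re-reported older jobs are harmless.
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=10)  # assumed re-report window

def first_query_window(saved_end: datetime, now: datetime):
    """Start the first sacct query further back than the saved end time."""
    return saved_end - OVERLAP, now

def dedupe_jobs(rows, seen_job_ids):
    """Drop already-counted jobs so re-reported rows don't double-count."""
    fresh = [r for r in rows if r["job_id"] not in seen_job_ids]
    seen_job_ids.update(r["job_id"] for r in fresh)
    return fresh
```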
> → Write state JSON to StateSaveLocation
> → Include: hostname, timestamp, time_window, counters
> → Use atomic write (temp file + rename)
> → Use file locking (fcntl) to prevent corruption

I'd remove this requirement - file locking over NFS is flaky at best, and I think line 129 (atomic file modification) is sufficient here since we are regenerating the whole file (not modifying sections of a large file concurrently).
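
For reference, a minimal sketch of the temp-file + rename pattern (function name and temp-file prefix are assumptions, not from the doc):

```python
# Sketch: atomic state write without fcntl. Write to a temp file in the
# same directory, fsync, then rename over the target; on POSIX the rename
# atomically replaces the whole file, which is all we need since the
# state file is regenerated wholesale each time.
import json
import os
import tempfile

def write_state_atomic(state: dict, path: str) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path),
                               prefix=".azslurm-state.")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are on disk before rename
        os.rename(tmp, path)      # atomic whole-file replacement
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on any failure
        raise
```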
> Proposed Flow:

Why do we care about "Primary" vs "Secondary" for the exporter?
It makes some sense in Slurm to have fail-back because users want the primary's DNS name.
But for the exporter, no one cares what node it's running on (unless there's a significant performance optimization in always running the exporter on the active scheduler).

So if the primary goes down, the backup assumes control; but if the primary comes back up, the primary assumes control again. In the proposed solution the exporter runs on both nodes but only does the collection and exporting if it is the active controller. Fail-back is for scenarios where the primary controller assumes control again.
> *(flow diagram elided)*
> ## Considerations:
> **Exporter only runs on primary**: start the exporter on the backup when it becomes primary

Is this a hard requirement (that the exporter run on the Primary scheduler)?
I think there are 2 ways to think about this:
- ONLY export from the primary scheduler and use a change in the active Slurm scheduler to trigger failover
  - this has the VERY nice feature that Slurm is responsible for determining when failover is required
- OR track failover locally in the exporters (maybe through a shared file that indicates the last time it was updated; see the sketch after this list)
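
A minimal sketch of that shared-file idea, assuming a JSON lease under StateSaveLocation (`LEASE_TTL` and all names are hypothetical); note that NFS attribute caching and clock skew both argue for a generous TTL:

```python
# Sketch of option (b): the exporting node keeps touching a shared lease
# file; a peer takes over only once the lease looks stale.
import json
import os
import socket
import tempfile
import time

LEASE_TTL = 60  # assumed: seconds without an update before the lease is stale

def touch_lease(path: str) -> None:
    """Refresh the lease (same temp-file + rename trick as above)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump({"holder": socket.gethostname(),
                   "updated": time.time()}, f)
    os.rename(tmp, path)

def lease_holder(path: str):
    """Current holder's hostname, or None if the lease is missing/stale."""
    try:
        with open(path) as f:
            lease = json.load(f)
    except (OSError, ValueError):
        return None               # missing or corrupt -> no holder
    if time.time() - lease["updated"] > LEASE_TTL:
        return None               # stale -> up for grabs
    return lease["holder"]
```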

In general, I like leaving it up to Slurm to detect failover since it's not a trivial problem :)

I meant more so only running the exporter on the active controller. The problem with this is stopping the exporter on the controller that was active and isn't anymore. Say slurmctld is down on a node but the node itself is still running, and the backup has assumed control: all Slurm commands would still work on that primary node, so azslurm-exporter would continue to export. We need a way to stop the exporter on the node that goes down.
> **Failover Detection**:
> - SlurmctldPrimaryOnProg, SlurmctldPrimaryOffProg
> - Pros:
>   - Slurm handles failover detection

@bwatrous so the thing with the Slurm hooks is that when slurmctld goes down, SlurmctldPrimaryOffProg doesn't run; it only runs when there's a smooth `scontrol takeover`. SlurmctldPrimaryOnProg is pretty good because it will always run on the UP controller that is assuming control. We could use that to persist the state: when a controller takes control, we write the state save file with its hostname, and then the exporters constantly read that file and only collect and export when they are the active controller.
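
A minimal sketch of that hook, assuming it is wired up via `SlurmctldPrimaryOnProg=` in slurm.conf (the path, filename, and JSON layout here are assumptions, not from the design doc):

```python
#!/usr/bin/env python3
# Hypothetical SlurmctldPrimaryOnProg script: slurmctld runs this on
# whichever controller is taking over, so we persist that hostname for
# the exporters to read.
import json
import os
import socket
import tempfile

STATE_FILE = "/sched/azslurm-exporter/active_controller.json"  # assumed path

def record_active_controller() -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(STATE_FILE))
    with os.fdopen(fd, "w") as f:
        json.dump({"active": socket.gethostname()}, f)
    os.rename(tmp, STATE_FILE)      # atomic whole-file replace, as above

def i_am_active() -> bool:
    """Exporter side: poll this to decide active vs idle mode."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["active"] == socket.gethostname()
    except (OSError, KeyError, ValueError):
        return False                # no/unreadable state -> stay idle

if __name__ == "__main__":
    record_active_controller()
```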
> - When the exporter detects it is not the active controller, it will run in idle mode -- no collecting or exporting
> - When the exporter detects it is the active controller, it will run in active mode -- collecting and exporting
> The exporter in active mode will save the sacct job counter state in the state save location for the backup to load when failover occurs; if azslurm-exporter crashes on the primary, it can still load the last state if available

Since we are initializing the exporters on both primary and backup, the sacct collector will already have the start time stored in the object. So when the backup starts, we will pull from that initial start time in the first query. We can save the last end time when we stop the sacct collector so that the next time it starts, the window starts from the last end time and we don't have any gaps.

We can still save the original start time in the state save file so that if azslurm crashes we can load in that start time.
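
A minimal sketch of that state file (function names and JSON layout are hypothetical, assuming ISO-8601 timestamps): on a clean stop the next window resumes at `last_end`, and after a crash the collector can fall back to `original_start`:

```python
# Sketch: persist both the original start time (crash recovery) and the
# last end time (gap-free resume) for the sacct query window.
import json
import os
import tempfile
from datetime import datetime

def save_window(path: str, original_start: datetime, last_end: datetime) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump({"original_start": original_start.isoformat(),
                   "last_end": last_end.isoformat()}, f)
    os.rename(tmp, path)  # atomic replace, as discussed above

def load_window(path: str):
    """Returns (original_start, last_end), or None on first run/corruption."""
    try:
        with open(path) as f:
            s = json.load(f)
        return (datetime.fromisoformat(s["original_start"]),
                datetime.fromisoformat(s["last_end"]))
    except (OSError, KeyError, ValueError):
        return None
```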