skalyan left a comment:
What is the motivation for this change? Did we have any user requests?
Is the idea to simply hide Slurm env variables and add a wrapper?
@skalyan yeah, this is to enable facebookresearch/matrix#105 (comment).
It returns the CPU/GPU info for either Slurm jobs or local nodes.
            return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
        return int(max(os.cpu_count() or 0, 1))

    @lru_cache(maxsize=1)
    def get_gpus(self) -> int:
        if self.is_slurm_job():
            return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
        return sum(
What is the intent of the ask?
Is it to know how many GPUs/CPUs are "present" on the node or "allocated" to this job?
Allocated to the job if it's a Slurm job; otherwise, whatever is present on the node.
If I may, I think we should stick to a clear and limited contract for an API. If we want a given API to return the GPUs allocated to a job, let's stick to that. I don't see a benefit in having the same API return either the allocated or the provisioned GPU count depending on context.
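To make the suggested contract concrete, here is a minimal sketch that keeps "allocated" and "present" counts in separate functions. The function names are hypothetical and not part of the PR; the sketch only assumes that Slurm exports `SLURM_CPUS_ON_NODE` and `SLURM_GPUS_ON_NODE` inside a job, and it returns `None` when the variable is not set instead of falling back to the host.

```python
import os
from typing import Mapping, Optional


def _read_int(env: Mapping[str, str], name: str) -> Optional[int]:
    # Return the variable as an int, or None when it is not set.
    value = env.get(name)
    return int(value) if value is not None else None


def allocated_cpus(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """CPUs Slurm allocated on this node, or None if the variable is unset."""
    env = os.environ if env is None else env
    return _read_int(env, "SLURM_CPUS_ON_NODE")


def allocated_gpus(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """GPUs Slurm allocated on this node, or None if the variable is unset."""
    env = os.environ if env is None else env
    return _read_int(env, "SLURM_GPUS_ON_NODE")


def present_cpus() -> int:
    """CPUs visible on the host, regardless of any job allocation."""
    return max(os.cpu_count() or 0, 1)
```

With this split, a caller that wants the PR's fallback behaviour can still write `allocated_cpus() or present_cpus()`, but the choice is explicit at the call site.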
@skalyan I think this is fine for now; it matches how the other methods in this class behave. I'm open to changing my position here once we see how it gets used.
Hi @luccabb! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly.
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Summary
Exposes a way to get the number of CPUs and GPUs from job information. Defaults to the local host's resources when not running on a Slurm cluster.
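The behaviour described in the summary can be sketched as standalone functions. This is an illustration, not the PR's implementation. It assumes a Slurm job can be detected through `SLURM_JOB_ID` (the PR's `is_slurm_job()` may use a different check). The local GPU count is left as a caller-supplied hypothetical callable, because the diff shown in this conversation does not include that part.

```python
import os
from typing import Callable, Mapping, Optional


def in_slurm_job(env: Mapping[str, str]) -> bool:
    # Assumption: SLURM_JOB_ID is set inside a Slurm job.
    return "SLURM_JOB_ID" in env


def cpus_for_job(env: Optional[Mapping[str, str]] = None) -> int:
    """CPUs allocated by Slurm on this node, else CPUs present on the host."""
    env = os.environ if env is None else env
    if in_slurm_job(env):
        return int(env.get("SLURM_CPUS_ON_NODE", 1))
    return max(os.cpu_count() or 0, 1)


def gpus_for_job(
    env: Optional[Mapping[str, str]] = None,
    count_local_gpus: Callable[[], int] = lambda: 0,
) -> int:
    """GPUs allocated by Slurm on this node, else the local count.

    count_local_gpus is a hypothetical hook; plug in whatever GPU
    discovery the host supports.
    """
    env = os.environ if env is None else env
    if in_slurm_job(env):
        return int(env.get("SLURM_GPUS_ON_NODE", 1))
    return count_local_gpus()
```

Passing `env` explicitly keeps the functions easy to test without modifying the real process environment.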
Test Plan
works on slurm and locally:
local node with gpus: