
exposing job cpus and gpus #119

Open
luccabb wants to merge 2 commits into main from cpus_gpus


@luccabb luccabb commented Oct 8, 2025

Contributor

Summary

Exposes a way to get the number of CPUs and GPUs from the job information. Falls back to the local host's counts when not running on a Slurm cluster.

Test Plan

Works on Slurm and locally:

$ srun python -c "import clusterscope; print(clusterscope.get_job().get_gpus()); print(clusterscope.get_job().get_cpus())"
...
srun: job 1476042 has been allocated resources
1
2
$ python
Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import clusterscope
>>> clusterscope.get_job().get_gpus()
WARNING:root:No GPUs found or unable to retrieve GPU information
0
>>> clusterscope.get_job().get_cpus()
96

Local node with GPUs:

Type "help", "copyright", "credits" or "license" for more information.
>>> import clusterscope
>>> clusterscope.get_job().get_gpus()
2
>>> clusterscope.get_job().get_cpus()
80
>>> exit()

@meta-cla meta-cla Bot added the CLA Signed label (managed by the Meta Open Source bot) Oct 8, 2025
@luccabb luccabb marked this pull request as ready for review October 8, 2025 22:54
@luccabb luccabb requested review from gunchu and skalyan as code owners October 8, 2025 22:54

@skalyan skalyan left a comment

Contributor

What is the motivation for this change? Did we have any user requests?

Is the idea to simply hide Slurm env variables and add a wrapper?


luccabb commented Oct 10, 2025

Contributor Author

@skalyan yeah, this is to enable: facebookresearch/matrix#105 (comment)

> Is the idea to simply hide Slurm env variables and add a wrapper?

It gives the info for both Slurm jobs and local nodes.

Comment thread clusterscope/job_info.py
Comment on lines +48 to +55
            return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
        return int(max(os.cpu_count() or 0, 1))

    @lru_cache(maxsize=1)
    def get_gpus(self) -> int:
        if self.is_slurm_job():
            return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
        return sum(

Contributor

What is the intent of the ask?

Is it to know how many GPUs/CPUs are "present" on the node or "allocated" to this job?

Contributor Author

Allocated to the job if it is a Slurm job; otherwise, what's present on the node.
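The fallback described here can be sketched for the CPU case as follows. This is a minimal illustration, not the code in the PR: the function names `is_slurm_job` and `cpus_for_job` are hypothetical, detecting a Slurm job through `SLURM_JOB_ID` is an assumption, and the environment is passed in as a mapping only to make the behavior easy to check. The default of 1 and the `os.cpu_count()` fallback mirror the excerpt under review.

```python
import os
from typing import Mapping, Optional


def is_slurm_job(env: Mapping[str, str]) -> bool:
    # Assumption: a Slurm job is recognized by SLURM_JOB_ID being set.
    return "SLURM_JOB_ID" in env


def cpus_for_job(env: Optional[Mapping[str, str]] = None) -> int:
    """Return CPUs allocated to the job under Slurm, else CPUs present on the host."""
    env = os.environ if env is None else env
    if is_slurm_job(env):
        # Allocated count as reported by Slurm; defaults to 1 as in the excerpt.
        return int(env.get("SLURM_CPUS_ON_NODE", 1))
    # Not a Slurm job: report what the local host has, never less than 1.
    return max(os.cpu_count() or 0, 1)
```

Calling `cpus_for_job()` with no argument reads the real process environment, so the same call returns the allocation inside `srun` and the host's CPU count elsewhere.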

Contributor

If I may, I think we should stick to a clear and limited contract for an API. If we want a given API to return the GPUs allocated to a job, let's stick to that. I don't see a benefit in returning either the allocated or the provisioned GPU count through the same API.
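One way to read this suggestion is to split the two meanings into separate functions, so each has a single contract. The sketch below is hypothetical and not part of the PR: the names `allocated_gpus` and `present_cpus` are invented for illustration, `SLURM_JOB_ID` is assumed to mark a Slurm job, and treating a missing `SLURM_GPUS_ON_NODE` as 0 is a choice made here that differs from the default of 1 in the excerpt.

```python
import os
from typing import Mapping, Optional


def allocated_gpus(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """GPUs allocated to the current Slurm job on this node.

    Returns None outside a Slurm job instead of falling back to the
    host's hardware, so callers can tell the two cases apart.
    """
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" not in env:  # assumption: marks a Slurm job
        return None
    # Assumption: an unset variable means no GPUs were allocated.
    return int(env.get("SLURM_GPUS_ON_NODE", 0))


def present_cpus() -> int:
    """CPUs present on the local host, regardless of any scheduler."""
    return max(os.cpu_count() or 0, 1)
```

With this split, a caller that wants the current PR's combined behavior has to write the fallback explicitly, which makes the "allocated versus present" decision visible at the call site.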

@luccabb luccabb Oct 25, 2025

Contributor Author

@skalyan I think this is fine for now; it matches how the other methods in this class behave. I'm open to changing my position here as we see how it gets used.

@luccabb luccabb requested a review from skalyan October 27, 2025 23:04

meta-cla Bot commented May 27, 2026


Hi @luccabb!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
