
working with slurm cluster #8

@ntoia

Description


Hello,

We have installed DeepETPicker on our SLURM cluster. I attempted to test the program and ran into a problem when running training: it seems the program is unable to detect the GPUs on the cluster. Is cluster use not yet supported, or am I missing a step in the procedure?

```
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/DeepETPicker/train.py", line 352, in train_func
    runner = Trainer(min_epochs=min(50, args.max_epoch),
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 345, in __init__
    self.accelerator_connector.on_trainer_init(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
    gpus = _sanitize_gpu_ids(gpus)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/device_parser.py", line 139, in _sanitize_gpu_ids
    raise MisconfigurationException(f"""
pytorch_lightning.utilities.exceptions.MisconfigurationException:
            You requested GPUs: [1]
            But your machine only has: []
```
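Not part of the original report, but a stdlib-only sketch of what "your machine only has: []" typically means on a SLURM node: the job was not granted any GPUs, so `CUDA_VISIBLE_DEVICES` exposes no devices to the process and PyTorch Lightning's device parser sees an empty list. The helper name `visible_gpu_ids` is hypothetical, for illustration only:

```python
import os


def visible_gpu_ids():
    """Return the GPU indices exposed to this process via CUDA_VISIBLE_DEVICES.

    On SLURM, this variable is normally populated only when the job actually
    requested GPUs (e.g. via --gres=gpu:1). An unset or empty value means no
    devices are visible, matching the "But your machine only has: []" error.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(tok) for tok in raw.split(",") if tok.strip().isdigit()]


# Simulate a job that was granted one GPU vs. one that was granted none.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(visible_gpu_ids())  # [0]

os.environ["CUDA_VISIBLE_DEVICES"] = ""
print(visible_gpu_ids())  # []
```

If this check comes back empty inside the job, the usual fix is to request a GPU in the batch script (e.g. `#SBATCH --gres=gpu:1`) or in the `srun` invocation, rather than anything in DeepETPicker itself.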

Thanks in advance!
