Update SLEAP module guide for the new pytorch version#81

Open
niksirbi wants to merge 6 commits into main from update-sleap-pytorch

Conversation


@niksirbi niksirbi commented May 12, 2026

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other: updating an existing guide

Why is this PR needed?

The SLEAP HPC guide was written for the legacy TensorFlow-based SLEAP (≤ 1.4.1).

I've installed a new SLEAP module (SLEAP/2026-05-08, v1.6.3) on the cluster (using uv instead of conda), which uses a PyTorch backend and a new CLI (sleap-nn) for training and inference. The guide needed to be updated throughout to reflect the new workflow, commands, and known pitfalls discovered while testing the new module end-to-end.
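For context, a first session with the new module might look like the sketch below. The module name and the `sleap` / `sleap-nn` commands are from this PR; everything is a sketch of the updated workflow, not a verbatim excerpt from the guide:

```shell
# Load the new PyTorch-based SLEAP module (v1.6.3)
module load SLEAP/2026-05-08

# `sleap train` and `sleap track` are short aliases for
# `sleap-nn train` and `sleap-nn track`
sleap train --help
sleap track --help
```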

What does this PR do?

  • Rewrites the module availability section around module avail SLEAP, documents SLEAP/2026-05-08 (v1.6.3, PyTorch) as the new default, and clarifies which modules are legacy (TensorFlow) vs. current (PyTorch)
  • Switches the documented training workflow from invoking the SLEAP-generated train-script.sh to calling sleap train directly in the SLURM batch script, with the new YAML-based config files
  • Switches the documented inference workflow from sleap-track (TensorFlow CLI) to sleap track (PyTorch CLI), and replaces the per-argument dropdown with a pointer to sleap track --help and the SLEAP-NN docs
  • Notes that sleap train / sleap track are aliases for sleap-nn train / sleap-nn track
  • Adds dropdowns with the equivalent legacy (TensorFlow) commands for training, inference, and verification, so users still on legacy modules can follow along
  • Fixes both SLURM scripts to use --ntasks-per-node=1 + --cpus-per-task instead of -n, which PyTorch Lightning rejects at startup
  • Expands the SLURM batch script explainer with guidance on selecting a specific GPU type (--gres gpu:<type>:1), including how to list available GPUs via sinfo and how to avoid cards with CUDA compute capability < 7.5
  • Rewrites the verification / troubleshooting section to use sleap doctor and a short PyTorch GPU check, replacing the old TensorFlow-based checks
  • Updates model directory names, ls outputs, file listings, and the local SLEAP install command to match the PyTorch-era conventions (v1.6.3, uv-based)
  • Cleans up MyST directives that were rendering poorly (stray :caption: links, unused :name: anchors) and refreshes outdated links throughout
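Taken together, the SLURM-related changes above imply a training batch script roughly like this sketch. The resource values, config names, and output directory are illustrative placeholders (only the flags `--ntasks-per-node`, `--cpus-per-task`, `--gres`, `--config-dir`, `--config-name`, and `trainer_config.ckpt_dir` come from this PR):

```shell
#!/bin/bash
#SBATCH --job-name=sleap-train
#SBATCH --ntasks-per-node=1   # PyTorch Lightning rejects --ntasks / -n at startup
#SBATCH --cpus-per-task=8     # CPU cores for data loading (placeholder value)
#SBATCH --gres=gpu:1          # or gpu:<type>:1 to request a specific GPU type

module load SLEAP/2026-05-08

# Quick GPU sanity check (`sleap doctor` gives a fuller report)
python -c "import torch; print(torch.cuda.is_available())"

# Call sleap train directly instead of the exported train-script.sh;
# config name and checkpoint directory are placeholders
sleap train --config-dir . --config-name centroid_config \
    trainer_config.ckpt_dir=models
```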

References

Closes #76.

How has this PR been tested?

The updated workflow (training and inference) was tested end-to-end on the SWC HPC cluster using the SLEAP/2026-05-08 module, following the guide step by step. I used data from the new SLEAP tutorial, which I labelled myself locally using SLEAP v1.6.3 and then copied to the cluster for training and inference.

How to review this PR

Instead of inspecting the diff, I recommend building the website locally and reading the updated SLEAP HPC guide end-to-end, from a user's perspective.

Optional: rebuild after changes

If we push more commits, you can rebuild with:

rm -rf docs/build
sphinx-build -b html docs/source docs/build/html

Is this a breaking change?

Not really. The new guide includes legacy commands and instructions for the old TensorFlow-based SLEAP module, so users of that module should still be able to follow along. However, the new PyTorch-based SLEAP module is now the default and recommended version, so users should switch to that for the best experience.

Does this PR require an update to the documentation?

This PR is the documentation update.

Checklist:

  • [x] The code has been tested locally
  • [ ] Tests have been added to cover all new functionality
  • [x] The documentation has been updated to reflect any changes
  • [x] The code has been formatted with pre-commit

niksirbi added 5 commits May 12, 2026 14:12
- Use `module avail SLEAP` instead of `module avail` and show realistic output
- Remove outdated legacy module entries (2023, 2024); fold legacy guidance
  into the main note rather than a separate dropdown
- Clarify that older modules are not recommended due to Ubuntu incompatibility
- Update `module list` example to reflect realistic output
- Add local uv install command for SLEAP 1.6.3 to match the cluster module
…ch CLI

- Replace sleap-nn track with sleap track alias throughout; same for sleap train;
  add a note explaining that sleap-nn train/track are the equivalent long-form aliases
- Add batch_size (-b) argument to the inference script
- Replace the sleap-nn track arguments dropdown with a pointer to
  sleap track --help and the SLEAP tracking docs
- Remove :caption: from all batch script code blocks (was rendering as
  broken hyperlinks); remove :name: anchors and inline filename comments
- Update model paths and directory names to match the PyTorch-era naming
  convention (dated run names e.g. 260512_144511.centroid.n=10)
- Fix cd command in model evaluation to use the actual dated directory name
- Fix labels version references (v001 -> v002) in inference output prose
- Fix 'some the predictions' typo and other minor wording issues
…ript.sh

train-script.sh reflects paths from the machine that exported the training job
package and may not work on the cluster. Calling sleap train directly in the
SLURM script is cleaner and avoids a Hydra parse error caused by '=' in the
auto-generated trainer_config.run_name value.

- Replace ./train-script.sh in train-slurm.sh with explicit sleap train calls
  using --config-name, --config-dir, and trainer_config.ckpt_dir
- Simplify train-script.sh prose to describe it as a reference only
- Move the sleap train/sleap-nn train aliases note next to the train-script.sh
  mention, consolidating it with the --help reference
- Update batch script explanation dropdown to describe the sleap train arguments
- Simplify chmod warning to train-slurm.sh only
- Remove Hydra parse error troubleshooting entry (no longer a failure path)
PyTorch Lightning (used internally by SLEAP) raises a RuntimeError if --ntasks
(i.e. -n) is set in the SLURM script, requiring --ntasks-per-node instead.
The original -n comment ('number of cores') was also misleading, as -n sets
the number of processes (tasks), not CPU cores.

- Replace -n in both training and inference SLURM scripts with
  --ntasks-per-node=1 and --cpus-per-task
- Add rationale in the batch script explanation dropdown
- Update inference script diff explanation to mention --cpus-per-task
- Add troubleshooting entry for the RuntimeError: --ntasks is not supported
@niksirbi niksirbi changed the title Fix SLURM scripts: replace -n with --ntasks-per-node and --cpus-per-task Update SLEAP module guide for the new pytorch version May 12, 2026
@niksirbi niksirbi marked this pull request as ready for review May 12, 2026 18:51
Development

Successfully merging this pull request may close these issues.

Update SLEAP guide for >=1.5 versions of SLEAP