Update SLEAP module guide for the new PyTorch version #81
Open
niksirbi wants to merge 6 commits into
Conversation
- Use `module avail SLEAP` instead of `module avail` and show realistic output
- Remove outdated legacy module entries (2023, 2024); fold legacy guidance into the main note rather than a separate dropdown
- Clarify that older modules are not recommended due to Ubuntu incompatibility
- Update `module list` example to reflect realistic output
- Add local uv install command for SLEAP 1.6.3 to match the cluster module
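For context, the module-checking workflow described above uses the standard environment-modules commands; a minimal sketch (module name taken from this PR, cluster output will vary):

```shell
module avail SLEAP             # list SLEAP modules installed on the cluster
module load SLEAP/2026-05-08   # load the new PyTorch-based module
module list                    # confirm which modules are currently loaded
```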
…ch CLI
- Replace `sleap-nn track` with the `sleap track` alias throughout; same for `sleap train`; add a note explaining that `sleap-nn train`/`track` are the equivalent long-form aliases
- Add `batch_size` (`-b`) argument to the inference script
- Replace the `sleap-nn track` arguments dropdown with a pointer to `sleap track --help` and the SLEAP tracking docs
- Remove `:caption:` from all batch script code blocks (was rendering as broken hyperlinks); remove `:name:` anchors and inline filename comments
- Update model paths and directory names to match the PyTorch-era naming convention (dated run names, e.g. `260512_144511.centroid.n=10`)
- Fix `cd` command in model evaluation to use the actual dated directory name
- Fix labels version references (v001 -> v002) in inference output prose
- Fix "some the predictions" typo and other minor wording issues
…ript.sh
`train-script.sh` reflects paths from the machine that exported the training job package and may not work on the cluster. Calling `sleap train` directly in the SLURM script is cleaner and avoids a Hydra parse error caused by `=` in the auto-generated `trainer_config.run_name` value.
- Replace `./train-script.sh` in `train-slurm.sh` with explicit `sleap train` calls using `--config-name`, `--config-dir`, and `trainer_config.ckpt_dir`
- Simplify the `train-script.sh` prose to describe it as a reference only
- Move the `sleap train`/`sleap-nn train` aliases note next to the `train-script.sh` mention, consolidating it with the `--help` reference
- Update the batch script explanation dropdown to describe the `sleap train` arguments
- Simplify the `chmod` warning to `train-slurm.sh` only
- Remove the Hydra parse error troubleshooting entry (no longer a failure path)
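As an illustration, an explicit `sleap train` call using the flags named above might look like the following inside the SLURM script. The config file name and directory paths here are hypothetical placeholders, not the ones used in the guide; see `sleap train --help` for the actual options.

```shell
# Hypothetical direct call replacing ./train-script.sh; the config name,
# config directory, and checkpoint directory below are illustrative only.
sleap train \
    --config-name training_config.yaml \
    --config-dir labels.v002.slp.training_job/ \
    trainer_config.ckpt_dir="models/"
```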
PyTorch Lightning (used internally by SLEAP) raises a `RuntimeError` if `--ntasks`
(i.e. `-n`) is set in the SLURM script, requiring `--ntasks-per-node` instead.
The original `-n` comment ('number of cores') was also misleading, as `-n` sets
the number of processes (tasks), not CPU cores.
- Replace `-n` in both training and inference SLURM scripts with `--ntasks-per-node=1` and `--cpus-per-task`
- Add rationale in the batch script explanation dropdown
- Update inference script diff explanation to mention `--cpus-per-task`
- Add troubleshooting entry for the `RuntimeError: --ntasks is not supported`
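A minimal sketch of the affected SBATCH header lines after this change (the resource values and GPU request are illustrative, not the guide's actual numbers):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1   # one task per node; PyTorch Lightning rejects -n/--ntasks
#SBATCH --cpus-per-task=8     # CPU cores for that task (illustrative value)
#SBATCH --gres=gpu:1          # one GPU (illustrative)
```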
Description
What is this PR
Why is this PR needed?
The SLEAP HPC guide was written for the legacy TensorFlow-based SLEAP (≤ 1.4.1).
I've installed a new SLEAP module (`SLEAP/2026-05-08`, v1.6.3) on the cluster (using `uv` instead of `conda`), which uses a PyTorch backend and a new CLI (`sleap-nn`) for training and inference. The guide needed to be updated throughout to reflect the new workflow, commands, and known pitfalls discovered while testing the new module end-to-end.

What does this PR do?
- Uses `module avail SLEAP`, documents `SLEAP/2026-05-08` (v1.6.3, PyTorch) as the new default, and clarifies which modules are legacy (TensorFlow) vs. current (PyTorch)
- Switches from `train-script.sh` to calling `sleap train` directly in the SLURM batch script, with the new YAML-based config files
- Switches from `sleap-track` (TensorFlow CLI) to `sleap track` (PyTorch CLI), and replaces the per-argument dropdown with a pointer to `sleap track --help` and the SLEAP-NN docs
- Notes that `sleap train`/`sleap track` are aliases for `sleap-nn train`/`sleap-nn track`
- Uses `--ntasks-per-node=1` + `--cpus-per-task` instead of `-n`, which PyTorch Lightning rejects at startup
- Explains how to request a specific GPU type (`--gres gpu:<type>:1`), including how to list available GPUs via `sinfo` and how to avoid cards with CUDA compute capability < 7.5
- Adds `sleap doctor` and a short PyTorch GPU check, replacing the old TensorFlow-based checks
- Updates `ls` outputs, file listings, and the local SLEAP install command to match the PyTorch-era conventions (v1.6.3, `uv`-based)
- Cleans up broken `:caption:` links and unused `:name:` anchors, and refreshes outdated links throughout

References
Closes #76.
How has this PR been tested?
The updated workflow (training and inference) was tested end-to-end on the SWC HPC cluster using the `SLEAP/2026-05-08` module, following the guide step by step. I used data from the new SLEAP tutorial, which I labelled myself locally using SLEAP v1.6.3 and then copied to the cluster for training and inference.

How to review this PR
Instead of inspecting the diff, I recommend building the website locally and reading the updated SLEAP HPC guide end-to-end, from a user's perspective.
Optional: rebuild after changes
If we push more commits, you can rebuild with:
Is this a breaking change?
Not really. The new guide includes legacy commands and instructions for the old TensorFlow-based SLEAP module, so users of that module should still be able to follow along. However, the new PyTorch-based SLEAP module is now the default and recommended version, so users should switch to that for the best experience.
Does this PR require an update to the documentation?
This PR is the documentation update.
Checklist:
[ ] Tests have been added to cover all new functionality