Update SLEAP module guide for the new pytorch version#81

Open
niksirbi wants to merge 6 commits into main from update-sleap-pytorch

Conversation


@niksirbi niksirbi commented May 12, 2026

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other: updating an existing guide

Why is this PR needed?

The SLEAP HPC guide was written for the legacy TensorFlow-based SLEAP (≤ 1.4.1).

I've installed a new SLEAP module (SLEAP/2026-05-08, v1.6.3) on the cluster (using uv instead of conda), which uses a PyTorch backend and a new CLI (sleap-nn) for training and inference. The guide needed to be updated throughout to reflect the new workflow, commands, and known pitfalls discovered while testing the new module end-to-end.
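For context, a first session with the new module might look like the sketch below. The module name and the `sleap` / `sleap-nn` commands are from this PR; everything is a sketch of the updated workflow, not a verbatim excerpt from the guide:

```shell
# Load the new PyTorch-based SLEAP module (v1.6.3)
module load SLEAP/2026-05-08

# `sleap train` and `sleap track` are short aliases for
# `sleap-nn train` and `sleap-nn track`
sleap train --help
sleap track --help
```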

What does this PR do?

  • Rewrites the module availability section around module avail SLEAP, documents SLEAP/2026-05-08 (v1.6.3, PyTorch) as the new default, and clarifies which modules are legacy (TensorFlow) vs. current (PyTorch)
  • Switches the documented training workflow from invoking the SLEAP-generated train-script.sh to calling sleap train directly in the SLURM batch script, with the new YAML-based config files
  • Switches the documented inference workflow from sleap-track (TensorFlow CLI) to sleap track (PyTorch CLI), and replaces the per-argument dropdown with a pointer to sleap track --help and the SLEAP-NN docs
  • Notes that sleap train / sleap track are aliases for sleap-nn train / sleap-nn track
  • Adds dropdowns with the equivalent legacy (TensorFlow) commands for training, inference, and verification, so users still on legacy modules can follow along
  • Fixes both SLURM scripts to use --ntasks-per-node=1 + --cpus-per-task instead of -n, which PyTorch Lightning rejects at startup
  • Expands the SLURM batch script explainer with guidance on selecting a specific GPU type (--gres gpu:<type>:1), including how to list available GPUs via sinfo and how to avoid cards with CUDA compute capability < 7.5
  • Rewrites the verification / troubleshooting section to use sleap doctor and a short PyTorch GPU check, replacing the old TensorFlow-based checks
  • Updates model directory names, ls outputs, file listings, and the local SLEAP install command to match the PyTorch-era conventions (v1.6.3, uv-based)
  • Cleans up MyST directives that were rendering poorly (stray :caption: links, unused :name: anchors) and refreshes outdated links throughout
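Taken together, the SLURM-related changes above imply a training batch script roughly like this sketch. The resource values, config names, and output directory are illustrative placeholders (only the flags `--ntasks-per-node`, `--cpus-per-task`, `--gres`, `--config-dir`, `--config-name`, and `trainer_config.ckpt_dir` come from this PR):

```shell
#!/bin/bash
#SBATCH --job-name=sleap-train
#SBATCH --ntasks-per-node=1   # PyTorch Lightning rejects --ntasks / -n at startup
#SBATCH --cpus-per-task=8     # CPU cores for data loading (placeholder value)
#SBATCH --gres=gpu:1          # or gpu:<type>:1 to request a specific GPU type

module load SLEAP/2026-05-08

# Quick GPU sanity check (`sleap doctor` gives a fuller report)
python -c "import torch; print(torch.cuda.is_available())"

# Call sleap train directly instead of the exported train-script.sh;
# config name and checkpoint directory are placeholders
sleap train --config-dir . --config-name centroid_config \
    trainer_config.ckpt_dir=models
```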

References

Closes #76.

How has this PR been tested?

The updated workflow (training and inference) was tested end-to-end on the SWC HPC cluster using the SLEAP/2026-05-08 module, following the guide step by step. I used data from the new SLEAP tutorial, which I labelled myself locally using SLEAP v1.6.3 and then copied to the cluster for training and inference.

How to review this PR

Instead of inspecting the diff, I recommend building the website locally and reading the updated SLEAP HPC guide end-to-end, from a user's perspective.

Optional: rebuild after changes

If we push more commits, you can rebuild with:

rm -rf docs/build
sphinx-build -b html docs/source docs/build/html

Is this a breaking change?

Not really. The new guide includes legacy commands and instructions for the old TensorFlow-based SLEAP module, so users of that module should still be able to follow along. However, the new PyTorch-based SLEAP module is now the default and recommended version, so users should switch to that for the best experience.

Does this PR require an update to the documentation?

This PR is the documentation update.

Checklist:

  • [x] The code has been tested locally
  • [ ] Tests have been added to cover all new functionality
  • [x] The documentation has been updated to reflect any changes
  • [x] The code has been formatted with pre-commit

niksirbi added 5 commits May 12, 2026 14:12
- Use `module avail SLEAP` instead of `module avail` and show realistic output
- Remove outdated legacy module entries (2023, 2024); fold legacy guidance
  into the main note rather than a separate dropdown
- Clarify that older modules are not recommended due to Ubuntu incompatibility
- Update `module list` example to reflect realistic output
- Add local uv install command for SLEAP 1.6.3 to match the cluster module
…ch CLI

- Replace sleap-nn track with sleap track alias throughout; same for sleap train;
  add a note explaining that sleap-nn train/track are the equivalent long-form aliases
- Add batch_size (-b) argument to the inference script
- Replace the sleap-nn track arguments dropdown with a pointer to
  sleap track --help and the SLEAP tracking docs
- Remove :caption: from all batch script code blocks (was rendering as
  broken hyperlinks); remove :name: anchors and inline filename comments
- Update model paths and directory names to match the PyTorch-era naming
  convention (dated run names e.g. 260512_144511.centroid.n=10)
- Fix cd command in model evaluation to use the actual dated directory name
- Fix labels version references (v001 -> v002) in inference output prose
- Fix 'some the predictions' typo and other minor wording issues
…ript.sh

train-script.sh reflects paths from the machine that exported the training job
package and may not work on the cluster. Calling sleap train directly in the
SLURM script is cleaner and avoids a Hydra parse error caused by '=' in the
auto-generated trainer_config.run_name value.

- Replace ./train-script.sh in train-slurm.sh with explicit sleap train calls
  using --config-name, --config-dir, and trainer_config.ckpt_dir
- Simplify train-script.sh prose to describe it as a reference only
- Move the sleap train/sleap-nn train aliases note next to the train-script.sh
  mention, consolidating it with the --help reference
- Update batch script explanation dropdown to describe the sleap train arguments
- Simplify chmod warning to train-slurm.sh only
- Remove Hydra parse error troubleshooting entry (no longer a failure path)
PyTorch Lightning (used internally by SLEAP) raises a RuntimeError if --ntasks
(i.e. -n) is set in the SLURM script, requiring --ntasks-per-node instead.
The original -n comment ('number of cores') was also misleading, as -n sets
the number of processes (tasks), not CPU cores.

- Replace -n in both training and inference SLURM scripts with
  --ntasks-per-node=1 and --cpus-per-task
- Add rationale in the batch script explanation dropdown
- Update inference script diff explanation to mention --cpus-per-task
- Add troubleshooting entry for the RuntimeError: --ntasks is not supported
@niksirbi niksirbi changed the title Fix SLURM scripts: replace -n with --ntasks-per-node and --cpus-per-task Update SLEAP module guide for the new pytorch version May 12, 2026
@niksirbi niksirbi marked this pull request as ready for review May 12, 2026 18:51
Development

Successfully merging this pull request may close these issues.

Update SLEAP guide for >=1.5 versions of SLEAP