
Fix conda builds for packages requiring __cuda virtual package #1027

Merged

pditommaso merged 4 commits into master from fix/conda-cuda-override-1026 on Apr 15, 2026
Conversation

@pditommaso
Collaborator

Summary

  • Fixes conda-file and conda-packages builds that fail when a dependency requires the __cuda virtual package (e.g., pytorch-gpu >= 1.12.1, JAX, TensorFlow)
  • Adds retry-on-failure logic to all 4 micromamba v2 templates (Docker/Singularity × file/packages): the output of the first install attempt is captured, and if it fails with __cuda in the error, the install is retried with CONDA_OVERRIDE_CUDA="12" set (see the sketch after this list)
  • Zero overhead for non-GPU builds; GPU builds incur only a fast solver failure (~4s) before the successful retry
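
A minimal sketch of the two-pass logic, assuming the template renders the environment to /tmp/env.yml and installs into the base environment (/tmp/mamba.log is the log file named in this PR; the real template wording may differ):

  if ! micromamba install -y -n base -f /tmp/env.yml > /tmp/mamba.log 2>&1; then
      cat /tmp/mamba.log
      # Retry only when the solver failed on the missing __cuda virtual
      # package, i.e. the build host has no NVIDIA driver to detect
      grep -q '__cuda' /tmp/mamba.log || exit 1
      CONDA_OVERRIDE_CUDA="12" micromamba install -y -n base -f /tmp/env.yml
  fi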

Closes #1026

Test plan

  • All 34 TemplateUtilsTest tests pass
  • Verify with a real conda-file build containing pytorch-gpu>=2.0
  • Verify non-GPU builds (e.g., bwa, salmon) are unaffected

Add retry-on-failure logic to micromamba v2 templates that detects
__cuda resolution errors and retries with CONDA_OVERRIDE_CUDA="12".
Zero overhead for non-GPU builds; GPU builds retry after a fast solver
failure (~4s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pinin4fjords
Member

Minor suggestion: consider using a higher sentinel value than "12" for CONDA_OVERRIDE_CUDA. The value just needs to satisfy __cuda >=N constraints - the solver picks the best available CUDA build from the channels regardless. "12" works today but would fail if a future package requires __cuda >=13. A value like "99" is equally correct and future-proof:

$ CONDA_OVERRIDE_CUDA=99 micromamba create --dry-run -f env_gpu.yml
  + cuda-version  13.2    # solver still picks the best available
  + pytorch-gpu   2.10.0  cuda129_mkl_...

The value is passed through as a raw string with no validation (virtual_packages.cpp).

@munishchouhan
Member

munishchouhan commented Apr 15, 2026

Shouldn't we add it for micromamba-v1 also?

@adamrtalbot
Contributor

Does this hard code CUDA 12? CUDA has an annoying dependency matrix so hard coding feels risky (I need 11! I need 13!).

We can also introspect from the Conda packages but this is more involved: https://anaconda.org/nvidia/cuda

You can prove it works by building the Alphafold container: https://github.com/google-deepmind/alphafold/blob/main/docker/Dockerfile 😉

@pditommaso
Collaborator Author

Uff, you are never happy folks :D

@pinin4fjords
Member

Does this hard code CUDA 12? CUDA has an annoying dependency matrix so hard coding feels risky (I need 11! I need 13!).

@adamrtalbot The "12" doesn't hardcode CUDA 12 in the container - it just tells the solver "assume CUDA >= 12 is available." The solver then picks the best CUDA build from the channels independently (we got cuda-version 12.9 with pytorch-gpu 2.10.0). If someone's environment needs CUDA 11.8, the solver will select it as long as the packages support it.

That said, using "99" instead of "12" avoids any ambiguity and future-proofs against __cuda >=13 constraints.
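
For reference, __cuda is a virtual package that conda/mamba synthesize from the NVIDIA driver detected on the host, which is why it is absent on GPU-less build machines. Roughly (output format approximate):

  $ micromamba info | grep cuda              # GPU-less build host: no output
  $ CONDA_OVERRIDE_CUDA=99 micromamba info | grep cuda
      __cuda=99=0                            # solver now treats CUDA as present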

@pditommaso
Collaborator Author

Could be a fair starting point? We could add a CUDA-specific option later to specify it.

@munishchouhan
Member

Shouldn't we add it for micromamba-v1 also?

The issue is reported on the default template, which is micromamba-v1:
https://wave.seqera.io/view/builds/bd-bbf66b1b68ac0df5_1

So we also need to make changes to the default template.

@pditommaso
Collaborator Author

We are going to switch to v2 as the default soon (#1024)

@pinin4fjords
Member

Could be a fair starting point? We could add a CUDA-specific option later to specify it.

Agreed, this is a great starting point. The two-pass logic is the important part - it unblocks all current GPU builds with zero impact on non-GPU builds. I don't think a user-facing --cuda flag would add much value in practice - the override only gates which builds the solver considers eligible, it doesn't pin anything in the container. If someone needs a specific CUDA version, they'd pin cuda-version=X in their environment YAML directly.

One thing worth noting: the sentinel value does affect the solve. With CONDA_OVERRIDE_CUDA=11.8, the solver excludes CUDA 12+ builds entirely and picks pytorch-gpu 2.5.1 instead of 2.10.0. Using a high value like 99 means "let the solver pick the best available," which is the right default for container builds where you don't know the runtime host's driver version.
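
A quick illustration of the difference, in the same dry-run style as above (package versions as reported in this thread; actual output depends on channel state):

  $ CONDA_OVERRIDE_CUDA=11.8 micromamba create --dry-run -f env_gpu.yml
    + pytorch-gpu  2.5.1   cuda118_...    # CUDA 12+ builds excluded
  $ CONDA_OVERRIDE_CUDA=99 micromamba create --dry-run -f env_gpu.yml
    + pytorch-gpu  2.10.0  cuda129_mkl_...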

@pinin4fjords
Member

We are going to switch to v2 as the default soon (#1024)

I'd love to be able to get good builds ASAP though :-)

Dump /tmp/mamba.log to stdout when the first micromamba install succeeds,
so build logs retain solver and download output for all non-CUDA builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
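
Under the same assumptions as the earlier sketch, this change amounts to echoing the captured log on the success path as well:

  if micromamba install -y -n base -f /tmp/env.yml > /tmp/mamba.log 2>&1; then
      cat /tmp/mamba.log    # retain solver and download output in build logs
  else
      cat /tmp/mamba.log
      grep -q '__cuda' /tmp/mamba.log || exit 1
      CONDA_OVERRIDE_CUDA="12" micromamba install -y -n base -f /tmp/env.yml
  fi
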
@munishchouhan
Member

We are going to switch to v2 as the default soon (#1024)

But we are still not removing micromamba-v1, so it's worth adding the fix there too.

@pditommaso
Collaborator Author

@pinin4fjords so are you ok to merge, or do you want CONDA_OVERRIDE_CUDA=11.8?

@pinin4fjords
Member

@pinin4fjords so are you ok to merge, or do you want CONDA_OVERRIDE_CUDA=11.8?

I'd suggest CONDA_OVERRIDE_CUDA="99" rather than 11.8 or 12. It's clearly not a real CUDA version, so nobody will mistake it for a pin, and it's high enough that it'll satisfy any future __cuda >=N constraint without needing to be updated. The solver picks the best available build from the channels regardless of this value.

Also +1 to @munishchouhan's point about adding the fix to the v1 templates too - looks like they're not included in the PR yet.

@pditommaso
Collaborator Author

Done, let's test in v2 before applying it to the v1 template.

@pditommaso
Collaborator Author

Can you please review and approve?

@pinin4fjords left a comment
Member

Looks good. The two-pass logic is correct, 99 is the right sentinel value, and testing in v2 first before applying to v1 makes sense. Thanks for the quick turnaround on this.

@pditommaso pditommaso merged commit 67542ce into master Apr 15, 2026
5 checks passed
@pditommaso pditommaso deleted the fix/conda-cuda-override-1026 branch April 15, 2026 14:44
pinin4fjords added a commit to nf-core/rnaseq that referenced this pull request Apr 15, 2026
The previous container used PyTorch 1.11.0 (CUDA 11.1) which deadlocked
under Singularity on CUDA 12.x hosts. Updated to PyTorch 2.10.0 with
CUDA 13.0 runtime, built via Wave's new v2 template with __cuda retry
support (seqeralabs/wave#1027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to nf-core/modules that referenced this pull request Apr 16, 2026
- Update GPU container from PyTorch 1.11.0 (CUDA 11.1) to PyTorch
  2.10.0 (CUDA 12.9). Pin cuda-version>=12,<13 in environment.gpu.yml.
- Add containers section to meta.yml with CUDA 12.x (default) and
  CUDA 11.8 alternatives following the existing platform key convention.
- Built via Wave v2 template (seqeralabs/wave#1027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sofstam pushed a commit to sofstam/modules that referenced this pull request Apr 22, 2026
…d no longer needed) (nf-core#11258)

* fix: update ribodetector GPU container to CUDA 12.x

Update GPU container from PyTorch 1.11.0 (CUDA 11.1, March 2022) to
PyTorch 2.10.0 (CUDA 12.9) and pin cuda-version>=12,<13 in
environment.gpu.yml to keep the solver within supported CUDA versions.

The old GPU container used PyTorch 1.11.0 because it was the last
version whose conda dependencies did not require the __cuda virtual
package, which is absent on Wave's GPU-less build servers. Wave now
handles this automatically via a two-pass solve (seqeralabs/wave#1027),
so we can build containers with current PyTorch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ribodetector): pin pytorch-gpu=2.1.0 + cuda-version=11.2 for widest host compat

Reframes the GPU-container refresh. The motivation here is not to chase
a newer PyTorch version; it's to unpin from 1.11.0, which was the last
release whose conda dependencies avoided the __cuda virtual package.
Wave nf-core#1027 (merged) removes that constraint, so any post-1.11 pytorch-gpu
can now be built.

Given ribodetector is an inference-only CNN from 2022 and has no use for
newer PyTorch features, the lowest post-__cuda pytorch-gpu on conda-forge
that has a py<=3.10 + low-CUDA build is pytorch-gpu=2.1.0 with
cuda-version=11.2. This maps to an NVIDIA driver floor of ~450 (2020),
covering essentially every current HPC GPU host - far wider than the
bleeding-edge 2.10.0 + cuda-version=12.9 combination (driver floor 575,
early 2025).

Address mashehu's review:

- Exact pins, no ranges (nf-core policy).
- cuda runtime version captured as a versions topic emit so the
  container's CUDA minor is visible in downstream provenance reports.
  Reports `cpu` on the non-GPU path.

Container hashes regenerated from the new environment.gpu.yml; the CPU
container is unchanged (environment.yml not touched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribodetector): add versions_cuda entries to existing CPU snapshot

The new versions_cuda topic emit adds an output channel to the process,
which breaks snapshot equality for both the real and stub CPU tests.
Patch the snap file to include the new entry (`cpu` on the non-GPU path,
populated by eval at runtime).

GPU snapshot will be regenerated via the nf-core-bot workflow after this
lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribodetector): fix versions_cuda key order in CPU snapshot

nf-test serialises object keys alphabetically; `versions_cuda` comes
before `versions_ribodetector` (c < r) in the actual snapshot output.
My previous edit had the reverse order which didn't match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [automated] Update gpu snapshot

* fix(ribodetector): use descriptive cuda fallback string on non-GPU path

Per mashehu: 'cpu' is not a version string, making it misleading inside
versions topic channels. Switch the eval fallback to 'no CUDA available',
which is unambiguous about what the task's pytorch build actually
supports. GPU path is unaffected (the eval's `or` only fires when
torch.version.cuda is None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
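
A plausible shape for that eval fallback (hypothetical; the module's actual script may differ): torch.version.cuda is a string such as "11.2" on CUDA builds and None on CPU-only builds, so the `or` supplies the fallback string.

  python -c "import torch; print(torch.version.cuda or 'no CUDA available')"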

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: nf-core-bot <core@nf-co.re>


Successfully merging this pull request may close these issues.

conda-file builds cannot install packages requiring __cuda virtual package
