
Fix conda builds for packages requiring __cuda virtual package #1027

Merged

pditommaso merged 4 commits into master from fix/conda-cuda-override-1026 on Apr 15, 2026
Conversation

@pditommaso
Collaborator

Summary

  • Fixes conda-file and conda-packages builds that fail when a dependency requires the __cuda virtual package (e.g., pytorch-gpu >= 1.12.1, JAX, TensorFlow)
  • Adds retry-on-failure logic to all 4 micromamba v2 templates (Docker/Singularity × file/packages): the output of the first install attempt is captured, and if it fails with __cuda in the error, the install is retried with CONDA_OVERRIDE_CUDA="12" set (see the sketch after this list)
  • Zero overhead for non-GPU builds; GPU builds incur only a fast solver failure (~4s) before the successful retry
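
A minimal sketch of the two-pass logic, assuming the template renders the environment to /tmp/env.yml and installs into the base environment (/tmp/mamba.log is the log file named in this PR; the real template wording may differ):

  if ! micromamba install -y -n base -f /tmp/env.yml > /tmp/mamba.log 2>&1; then
      cat /tmp/mamba.log
      # Retry only when the solver failed on the missing __cuda virtual
      # package, i.e. the build host has no NVIDIA driver to detect
      grep -q '__cuda' /tmp/mamba.log || exit 1
      CONDA_OVERRIDE_CUDA="12" micromamba install -y -n base -f /tmp/env.yml
  fi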

Closes #1026

Test plan

  • All 34 TemplateUtilsTest tests pass
  • Verify with a real conda-file build containing pytorch-gpu>=2.0
  • Verify non-GPU builds (e.g., bwa, salmon) are unaffected

Add retry-on-failure logic to micromamba v2 templates that detects
__cuda resolution errors and retries with CONDA_OVERRIDE_CUDA="12".
Zero overhead for non-GPU builds; GPU builds retry after a fast solver
failure (~4s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pinin4fjords
Member

Minor suggestion: consider using a higher sentinel value than "12" for CONDA_OVERRIDE_CUDA. The value just needs to satisfy __cuda >=N constraints - the solver picks the best available CUDA build from the channels regardless. "12" works today but would fail if a future package requires __cuda >=13. A value like "99" is equally correct and future-proof:

$ CONDA_OVERRIDE_CUDA=99 micromamba create --dry-run -f env_gpu.yml
  + cuda-version  13.2    # solver still picks the best available
  + pytorch-gpu   2.10.0  cuda129_mkl_...

The value is passed through as a raw string with no validation (virtual_packages.cpp).

@munishchouhan
Member

munishchouhan commented Apr 15, 2026

Shouldn't we add it for micromamba-v1 also?

@adamrtalbot
Contributor

Does this hard code CUDA 12? CUDA has an annoying dependency matrix so hard coding feels risky (I need 11! I need 13!).

We can also introspect from the Conda packages but this is more involved: https://anaconda.org/nvidia/cuda

You can prove it works by building the Alphafold container: https://github.com/google-deepmind/alphafold/blob/main/docker/Dockerfile 😉

@pditommaso
Collaborator Author

Uff, you are never happy folks :D

@pinin4fjords
Member

Does this hard code CUDA 12? CUDA has an annoying dependency matrix so hard coding feels risky (I need 11! I need 13!).

@adamrtalbot The "12" doesn't hardcode CUDA 12 in the container - it just tells the solver "assume CUDA >= 12 is available." The solver then picks the best CUDA build from the channels independently (we got cuda-version 12.9 with pytorch-gpu 2.10.0). If someone's environment needs CUDA 11.8, the solver will select it as long as the packages support it.

That said, using "99" instead of "12" avoids any ambiguity and future-proofs against __cuda >=13 constraints.
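
For reference, __cuda is a virtual package that conda/mamba synthesize from the NVIDIA driver detected on the host, which is why it is absent on GPU-less build machines. Roughly (output format approximate):

  $ micromamba info | grep cuda              # GPU-less build host: no output
  $ CONDA_OVERRIDE_CUDA=99 micromamba info | grep cuda
      __cuda=99=0                            # solver now treats CUDA as present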

@pditommaso
Collaborator Author

Could be a fair starting point? We could add a CUDA-specific option later to specify it.

@munishchouhan
Member

Shouldn't we add it for micromamba-v1 also?

The issue is reported on the default template, which is micromamba-v1:
https://wave.seqera.io/view/builds/bd-bbf66b1b68ac0df5_1

So we also need to make changes to the default template.

@pditommaso
Collaborator Author

We are going to switch to v2 as the default soon (#1024)

@pinin4fjords
Member

Could be a fair starting point? We could add a CUDA-specific option later to specify it.

Agreed, this is a great starting point. The two-pass logic is the important part - it unblocks all current GPU builds with zero impact on non-GPU builds. I don't think a user-facing --cuda flag would add much value in practice - the override only gates which builds the solver considers eligible, it doesn't pin anything in the container. If someone needs a specific CUDA version, they'd pin cuda-version=X in their environment YAML directly.

One thing worth noting: the sentinel value does affect the solve. With CONDA_OVERRIDE_CUDA=11.8, the solver excludes CUDA 12+ builds entirely and picks pytorch-gpu 2.5.1 instead of 2.10.0. Using a high value like 99 means "let the solver pick the best available," which is the right default for container builds where you don't know the runtime host's driver version.
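
A quick illustration of the difference, in the same dry-run style as above (package versions as reported in this thread; actual output depends on channel state):

  $ CONDA_OVERRIDE_CUDA=11.8 micromamba create --dry-run -f env_gpu.yml
    + pytorch-gpu  2.5.1   cuda118_...    # CUDA 12+ builds excluded
  $ CONDA_OVERRIDE_CUDA=99 micromamba create --dry-run -f env_gpu.yml
    + pytorch-gpu  2.10.0  cuda129_mkl_...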

@pinin4fjords
Member

We are going to switch to v2 as the default soon (#1024)

I'd love to be able to get good builds ASAP though :-)

Dump /tmp/mamba.log to stdout when the first micromamba install succeeds,
so build logs retain solver and download output for all non-CUDA builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
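
Under the same assumptions as the earlier sketch, this change amounts to echoing the captured log on the success path as well:

  if micromamba install -y -n base -f /tmp/env.yml > /tmp/mamba.log 2>&1; then
      cat /tmp/mamba.log    # retain solver and download output in build logs
  else
      cat /tmp/mamba.log
      grep -q '__cuda' /tmp/mamba.log || exit 1
      CONDA_OVERRIDE_CUDA="12" micromamba install -y -n base -f /tmp/env.yml
  fi
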
@munishchouhan
Member

We are going to switch to v2 as the default soon (#1024)

But we are still not removing micromamba-v1, so it's worth adding the fix there too.

@pditommaso
Collaborator Author

@pinin4fjords so are you ok to merge, or do you want CONDA_OVERRIDE_CUDA=11.8?

@pinin4fjords
Member

@pinin4fjords so are you ok to merge, or do you want CONDA_OVERRIDE_CUDA=11.8?

I'd suggest CONDA_OVERRIDE_CUDA="99" rather than 11.8 or 12. It's clearly not a real CUDA version, so nobody will mistake it for a pin, and it's high enough that it'll satisfy any future __cuda >=N constraint without needing to be updated. The solver picks the best available build from the channels regardless of this value.

Also +1 to @munishchouhan's point about adding the fix to the v1 templates too - looks like they're not included in the PR yet.

@pditommaso
Collaborator Author

Done, let's test in v2 before applying it to the v1 template.

@pditommaso
Collaborator Author

Can you please review and approve?

@pinin4fjords left a comment
Member

Looks good. The two-pass logic is correct, 99 is the right sentinel value, and testing in v2 first before applying to v1 makes sense. Thanks for the quick turnaround on this.

@pditommaso pditommaso merged commit 67542ce into master Apr 15, 2026
5 checks passed
@pditommaso pditommaso deleted the fix/conda-cuda-override-1026 branch April 15, 2026 14:44
pinin4fjords added a commit to nf-core/rnaseq that referenced this pull request Apr 15, 2026
The previous container used PyTorch 1.11.0 (CUDA 11.1) which deadlocked
under Singularity on CUDA 12.x hosts. Updated to PyTorch 2.10.0 with
CUDA 13.0 runtime, built via Wave's new v2 template with __cuda retry
support (seqeralabs/wave#1027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to nf-core/modules that referenced this pull request Apr 16, 2026
- Update GPU container from PyTorch 1.11.0 (CUDA 11.1) to PyTorch
  2.10.0 (CUDA 12.9). Pin cuda-version>=12,<13 in environment.gpu.yml.
- Add containers section to meta.yml with CUDA 12.x (default) and
  CUDA 11.8 alternatives following the existing platform key convention.
- Built via Wave v2 template (seqeralabs/wave#1027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sofstam pushed a commit to sofstam/modules that referenced this pull request Apr 22, 2026
…d no longer needed) (nf-core#11258)

* fix: update ribodetector GPU container to CUDA 12.x

Update GPU container from PyTorch 1.11.0 (CUDA 11.1, March 2022) to
PyTorch 2.10.0 (CUDA 12.9) and pin cuda-version>=12,<13 in
environment.gpu.yml to keep the solver within supported CUDA versions.

The old GPU container used PyTorch 1.11.0 because it was the last
version whose conda dependencies did not require the __cuda virtual
package, which is absent on Wave's GPU-less build servers. Wave now
handles this automatically via a two-pass solve (seqeralabs/wave#1027),
so we can build containers with current PyTorch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ribodetector): pin pytorch-gpu=2.1.0 + cuda-version=11.2 for widest host compat

Reframes the GPU-container refresh. The motivation here is not to chase
a newer PyTorch version; it's to unpin from 1.11.0, which was the last
release whose conda dependencies avoided the __cuda virtual package.
Wave nf-core#1027 (merged) removes that constraint, so any post-1.11 pytorch-gpu
can now be built.

Given ribodetector is an inference-only CNN from 2022 and has no use for
newer PyTorch features, the lowest post-__cuda pytorch-gpu on conda-forge
that has a py<=3.10 + low-CUDA build is pytorch-gpu=2.1.0 with
cuda-version=11.2. This maps to an NVIDIA driver floor of ~450 (2020),
covering essentially every current HPC GPU host - far wider than the
bleeding-edge 2.10.0 + cuda-version=12.9 combination (driver floor 575,
early 2025).

Address mashehu's review:

- Exact pins, no ranges (nf-core policy).
- cuda runtime version captured as a versions topic emit so the
  container's CUDA minor is visible in downstream provenance reports.
  Reports `cpu` on the non-GPU path.

Container hashes regenerated from the new environment.gpu.yml; the CPU
container is unchanged (environment.yml not touched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribodetector): add versions_cuda entries to existing CPU snapshot

The new versions_cuda topic emit adds an output channel to the process,
which breaks snapshot equality for both the real and stub CPU tests.
Patch the snap file to include the new entry (`cpu` on the non-GPU path,
populated by eval at runtime).

GPU snapshot will be regenerated via the nf-core-bot workflow after this
lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribodetector): fix versions_cuda key order in CPU snapshot

nf-test serialises object keys alphabetically; `versions_cuda` comes
before `versions_ribodetector` (c < r) in the actual snapshot output.
My previous edit had the reverse order which didn't match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [automated] Update gpu snapshot

* fix(ribodetector): use descriptive cuda fallback string on non-GPU path

Per mashehu: 'cpu' is not a version string, making it misleading inside
versions topic channels. Switch the eval fallback to 'no CUDA available',
which is unambiguous about what the task's pytorch build actually
supports. GPU path is unaffected (the eval's `or` only fires when
torch.version.cuda is None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
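
A plausible shape for that eval fallback (hypothetical; the module's actual script may differ): torch.version.cuda is a string such as "11.2" on CUDA builds and None on CPU-only builds, so the `or` supplies the fallback string.

  python -c "import torch; print(torch.version.cuda or 'no CUDA available')"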

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: nf-core-bot <core@nf-co.re>


Successfully merging this pull request may close these issues.

conda-file builds cannot install packages requiring __cuda virtual package
