Fix conda builds for packages requiring __cuda virtual package #1027
pditommaso merged 4 commits into master
Conversation
Add retry-on-failure logic to micromamba v2 templates that detects __cuda resolution errors and retries with CONDA_OVERRIDE_CUDA="12". Zero overhead for non-GPU builds; GPU builds retry after a fast solver failure (~4s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
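A minimal sketch of the retry logic described above, assuming the template installs from a `conda.yml` spec (the file name and log path here are illustrative, not the exact template code):

```bash
# First pass: plain solve; capture output so a failure can be inspected.
if ! micromamba install -y -n base -f conda.yml > /tmp/mamba.log 2>&1; then
    if grep -q '__cuda' /tmp/mamba.log; then
        # The solver failed because the __cuda virtual package is undefined
        # on the GPU-less build host: override it and retry once.
        CONDA_OVERRIDE_CUDA="12" micromamba install -y -n base -f conda.yml
    else
        # Unrelated failure: surface the solver output and abort.
        cat /tmp/mamba.log >&2
        exit 1
    fi
fi
```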
Minor suggestion: consider using a higher sentinel value than `12`, since the value is passed through as a raw string with no validation.
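For what it's worth, a hypothetical guard along these lines would address the validation concern (the variable name is illustrative, not from the PR):

```bash
# Accept only a bare major version before exporting the override.
if [[ "$cuda_override" =~ ^[0-9]+$ ]]; then
    export CONDA_OVERRIDE_CUDA="$cuda_override"
else
    echo "invalid CONDA_OVERRIDE_CUDA value: $cuda_override" >&2
    exit 1
fi
```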
Shouldn't we add it for micromamba-v1 also?
Does this hard code CUDA 12? CUDA has an annoying dependency matrix, so hard coding feels risky (I need 11! I need 13!). We can also introspect from the Conda packages, but this is more involved: https://anaconda.org/nvidia/cuda

You can prove it works by building the Alphafold container: https://github.com/google-deepmind/alphafold/blob/main/docker/Dockerfile 😉
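One hedged sketch of the introspection idea (not part of this PR; it assumes the spec pins `cuda-version` somewhere in `environment.yml`):

```bash
# Pull the CUDA major version out of the conda spec, falling back to 12.
cuda_pin=$(grep -oE 'cuda-version[ =><]*[0-9]+' environment.yml | grep -oE '[0-9]+' | head -n1)
export CONDA_OVERRIDE_CUDA="${cuda_pin:-12}"
```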
Uff, you are never happy folks :D
@adamrtalbot The override value is a sentinel for the solver's `__cuda` virtual package rather than a hard pin on the CUDA runtime that ends up in the container. That said, using a user-configurable value is something we could add later.
Could be a fair starting point? We could add a CUDA-specific option later to specify it.
The issue is reported on the default template, which is micromamba-v1, so we also need to make changes to the default template.
We are going to switch to v2 as the default soon: #1024
Agreed, this is a great starting point. The two-pass logic is the important part - it unblocks all current GPU builds with zero impact on non-GPU builds. I don't think a user-facing option is needed yet.

One thing worth noting: the sentinel value does affect the solve. With `CONDA_OVERRIDE_CUDA="12"` the solver will exclude any build that constrains `__cuda` to a newer version, so a higher sentinel keeps the resolution unconstrained.
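To illustrate the point (hypothetical invocations; the package choice is just an example):

```bash
# The override sets the __cuda virtual package the solver assumes, so builds
# whose __cuda constraint exceeds the sentinel are dropped from resolution.
CONDA_OVERRIDE_CUDA="12" micromamba create -y -n demo -c conda-forge pytorch-gpu  # excludes builds needing __cuda >12
CONDA_OVERRIDE_CUDA="99" micromamba create -y -n demo -c conda-forge pytorch-gpu  # any __cuda constraint is satisfiable
```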
I'd love to be able to get good builds ASAP though :-) |
Dump /tmp/mamba.log to stdout when the first micromamba install succeeds, so build logs retain solver and download output for all non-CUDA builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
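Roughly, on the success path (a sketch assuming the same illustrative paths as above):

```bash
# Keep solver and download output in the build log even when no retry happens.
if micromamba install -y -n base -f conda.yml > /tmp/mamba.log 2>&1; then
    cat /tmp/mamba.log
fi
```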
But we are still not removing micromamba-v1, so it's worth adding the fix there too.
@pinin4fjords so are you ok to merge, or do you want changes first?
I'd suggest `99` as the sentinel value. Also +1 to @munishchouhan's point about adding the fix to the v1 templates too - looks like they're not included in the PR yet.
Done, let's test in v2 before applying it in the v1 template.
Can you please review and approve?
pinin4fjords
left a comment
Looks good. The two-pass logic is correct, 99 is the right sentinel value, and testing in v2 first before applying to v1 makes sense. Thanks for the quick turnaround on this.
The previous container used PyTorch 1.11.0 (CUDA 11.1) which deadlocked under Singularity on CUDA 12.x hosts. Updated to PyTorch 2.10.0 with CUDA 13.0 runtime, built via Wave's new v2 template with __cuda retry support (seqeralabs/wave#1027). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update GPU container from PyTorch 1.11.0 (CUDA 11.1) to PyTorch 2.10.0 (CUDA 12.9). Pin cuda-version>=12,<13 in environment.gpu.yml.
- Add containers section to meta.yml with CUDA 12.x (default) and CUDA 11.8 alternatives following the existing platform key convention.
- Built via Wave v2 template (seqeralabs/wave#1027).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d no longer needed) (nf-core#11258)

* fix: update ribodetector GPU container to CUDA 12.x

  Update GPU container from PyTorch 1.11.0 (CUDA 11.1, March 2022) to PyTorch 2.10.0 (CUDA 12.9) and pin cuda-version>=12,<13 in environment.gpu.yml to keep the solver within supported CUDA versions.

  The old GPU container used PyTorch 1.11.0 because it was the last version whose conda dependencies did not require the __cuda virtual package, which is absent on Wave's GPU-less build servers. Wave now handles this automatically via a two-pass solve (seqeralabs/wave#1027), so we can build containers with current PyTorch.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ribodetector): pin pytorch-gpu=2.1.0 + cuda-version=11.2 for widest host compat

  Reframes the GPU-container refresh. The motivation here is not to chase a newer PyTorch version; it's to unpin from 1.11.0, which was the last release whose conda dependencies avoided the __cuda virtual package. seqeralabs/wave#1027 (merged) removes that constraint, so any post-1.11 pytorch-gpu can now be built.

  Given ribodetector is an inference-only CNN from 2022 and has no use for newer PyTorch features, the lowest post-__cuda pytorch-gpu on conda-forge that has a py<=3.10 + low-CUDA build is pytorch-gpu=2.1.0 with cuda-version=11.2. This maps to an NVIDIA driver floor of ~450 (2020), covering essentially every current HPC GPU host - far wider than the bleeding-edge 2.10.0 + cuda-version=12.9 combination (driver floor 575, early 2025).

  Addresses mashehu's review:
  - Exact pins, no ranges (nf-core policy).
  - CUDA runtime version captured as a versions topic emit so the container's CUDA minor is visible in downstream provenance reports. Reports `cpu` on the non-GPU path.

  Container hashes regenerated from the new environment.gpu.yml; the CPU container is unchanged (environment.yml not touched).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribodetector): add versions_cuda entries to existing CPU snapshot

  The new versions_cuda topic emit adds an output channel to the process, which breaks snapshot equality for both the real and stub CPU tests. Patch the snap file to include the new entry (`cpu` on the non-GPU path, populated by eval at runtime). The GPU snapshot will be regenerated via the nf-core-bot workflow after this lands.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ribodetector): fix versions_cuda key order in CPU snapshot

  nf-test serialises object keys alphabetically; `versions_cuda` comes before `versions_ribodetector` (c < r) in the actual snapshot output. My previous edit had the reverse order, which didn't match.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [automated] Update gpu snapshot

* fix(ribodetector): use descriptive cuda fallback string on non-GPU path

  Per mashehu: 'cpu' is not a version string, making it misleading inside versions topic channels. Switch the eval fallback to 'no CUDA available', which is unambiguous about what the task's pytorch build actually supports. The GPU path is unaffected (the eval's `or` only fires when torch.version.cuda is None).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: nf-core-bot <core@nf-co.re>
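The eval fallback described in the last commit boils down to this (a sketch; the actual module wraps it in an nf-core topic emit):

```bash
# torch.version.cuda is a string like "11.2" on CUDA builds and None on
# CPU-only builds, so the `or` yields the descriptive fallback string.
python -c "import torch; print(torch.version.cuda or 'no CUDA available')"
```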
Summary
- Conda builds fail for packages that require the `__cuda` virtual package (e.g., `pytorch-gpu >= 1.12.1`, JAX, TensorFlow), because Wave's build hosts have no GPU and `__cuda` is therefore undefined.
- The micromamba v2 templates now detect a first solve that fails with `__cuda` in the error, and retry with `CONDA_OVERRIDE_CUDA="12"` set.
- Closes #1026
Test plan