
Further support for Blackwell and L-class GPUs #136

Merged
scal444 merged 5 commits into NVIDIA-Digital-Bio:main from scal444:blackwell_offical
Apr 27, 2026


Collaborator

@scal444 scal444 commented Apr 24, 2026

CC 89 GPUs (L40S) were left out of the fat build. They were originally covered by the 75 PTX, but we removed that at some point, so these GPUs were not working.

BMMA is now enabled for CC 10 and 12; I checked that it works.
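The CC 89 gap described above follows from the general CUDA compatibility rules: SASS built for one compute capability runs only on devices with the same major version and an equal or higher minor version, while PTX can be JIT-compiled for any device of equal or higher compute capability. Below is a minimal C++ sketch of that rule. The `Target` struct, the `isCovered` helper, and the `major*10 + minor` encoding are hypothetical names introduced for illustration; they are not taken from this repository.

```cpp
#include <vector>

// Hypothetical description of one entry in a fat binary's target list.
struct Target {
    int cc;      // compute capability encoded as major*10 + minor, e.g. 89
    bool isPtx;  // true for embedded PTX, false for SASS (a "-real" target)
};

// Returns true if a device with compute capability deviceCc can run
// at least one of the embedded targets.
bool isCovered(const std::vector<Target>& targets, int deviceCc) {
    for (const Target& t : targets) {
        if (t.isPtx) {
            // PTX is forward compatible: any device with CC >= t.cc can JIT it.
            if (deviceCc >= t.cc) return true;
        } else if (deviceCc / 10 == t.cc / 10 && deviceCc >= t.cc) {
            // SASS needs the same major version and an equal or higher minor.
            return true;
        }
    }
    return false;
}
```

With a hypothetical list holding 75 PTX and 90 SASS, a CC 89 device is covered through the PTX. Replacing the PTX with 75 SASS leaves CC 89 uncovered, and adding an 89 SASS entry restores coverage.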

Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR adds 89-real to the fat-build arch list to restore L-class GPU (L40S/CC 89) support, enables BMMA tensor ops at runtime and compile time for Blackwell (sm_100, sm_120), removes the __CUDA_ARCH__ < 1000 upper bound in the similarity kernel guards, and fixes the getMaxThreadsPerSM value for consumer Blackwell (sm_120: 1536, not 2048). Open concerns flagged in the previous review round — the minor == 0 guard being too narrow for future Blackwell steppings and the PTX forward-compat gap for CUDA 12.8-only builds — remain unaddressed.
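The effect of dropping the `__CUDA_ARCH__ < 1000` upper bound can be sketched in plain C++ by modeling the guard as a predicate over the architecture value (`__CUDA_ARCH__` is 1000 for sm_100 and 1200 for sm_120). The lower bound of 800 used here is an assumption made for illustration; the summary only states that the upper bound was removed. The function names are hypothetical.

```cpp
// Assumed lower bound for the BMMA path. The summary above does not state
// the real lower bound, only that the "< 1000" upper bound was removed.
constexpr int kAssumedMinArch = 800;

// Guard shape before the change: Blackwell (arch >= 1000) is excluded.
constexpr bool guardBefore(int cudaArch) {
    return cudaArch >= kAssumedMinArch && cudaArch < 1000;
}

// Guard shape after the change: no upper bound, so sm_100 and sm_120 pass.
constexpr bool guardAfter(int cudaArch) {
    return cudaArch >= kAssumedMinArch;
}

static_assert(!guardBefore(1000), "sm_100 was excluded by the old guard");
static_assert(guardAfter(1000), "sm_100 passes the new guard");
static_assert(guardAfter(1200), "sm_120 passes the new guard");
```

Under these assumptions, architectures that already passed the old guard (for example 900) still pass the new one, so the change only widens the set of accepted targets.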

Confidence Score: 4/5

Safe to merge for current hardware; two concerns from the prior review round (minor-version guard and CUDA 12.8 PTX gap) remain open but do not affect production sm_100/sm_120 hardware today.

No new P0/P1 issues found in this pass. The logic for all three changed files is correct for the currently shipping Blackwell SKUs. The two previously flagged concerns (minor==0 narrowness and PTX forward-compat gap for CUDA 12.8 builds) are unresolved but are speculative/forward-looking rather than current breakage, keeping the score at the P1-ceiling of 4.

cmake/cuda_targets.cmake and src/similarity_kernels.cu carry the open forward-compat concerns from the prior review round.

Important Files Changed

| Filename | Overview |
| --- | --- |
| cmake/cuda_targets.cmake | Adds 89-real to fix L-class GPU support and conditionally appends 100-real (CUDA ≥ 12.8) and 120 with PTX (CUDA ≥ 12.9); also extends the cc loop for 100/120 preprocessor defines. PTX forward-compat gap for CUDA 12.8-only builds and the minor-version guard remain open from the previous review round. |
| src/similarity_kernels.cu | Extends supportsTensorOps to accept major == 10 and major == 12, adds compile-time guards for CC_100/CC_120, and removes the __CUDA_ARCH__ < 1000 upper bound in the Tanimoto/Cosine kernel preprocessor guards to enable BMMA on Blackwell. |
| src/substruct/substruct_kernels.cu | Correctly adds explicit sm_100 (2048 threads/SM) and sm_120 (1536 threads/SM) cases in getMaxThreadsPerSM before the sm >= 90 catch-all, fixing the previously incorrect 2048 value that would have been returned for consumer Blackwell. |
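The ordering described for getMaxThreadsPerSM can be sketched as follows. This is not the repository's implementation: the function name `maxThreadsPerSmSketch`, the `major*10 + minor` encoding of `sm`, and the `-1` sentinel for older architectures are assumptions for illustration. Only the three branches for sm_100, sm_120, and the sm >= 90 catch-all come from the review text.

```cpp
// Sketch of the branch order described in the review. The sm argument is
// assumed to be encoded as major*10 + minor (e.g. 120 for sm_120).
int maxThreadsPerSmSketch(int sm) {
    if (sm == 100) return 2048;  // data-center Blackwell
    if (sm == 120) return 1536;  // consumer Blackwell
    if (sm >= 90) return 2048;   // catch-all that sm_120 used to fall into
    return -1;                   // older architectures are not modeled here
}
```

The explicit sm_120 case has to come before the catch-all; if the checks were reversed, sm_120 would match `sm >= 90` first and return 2048 again.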

Reviews (4): Last reviewed commit: "Remove the extra brace, I blame greptile"

Comment thread src/similarity_kernels.cu Outdated
Comment thread cmake/cuda_targets.cmake
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread src/similarity_kernels.cu Outdated
@scal444 scal444 requested a review from evasnow1992 April 24, 2026 21:49
Collaborator

@evasnow1992 evasnow1992 left a comment


Changes look good to me. Thanks!

@scal444 scal444 merged commit 3f1e221 into NVIDIA-Digital-Bio:main Apr 27, 2026
10 checks passed