Expose ballot voting intrinsics #744

Open
simsurace wants to merge 2 commits into JuliaGPU:main from simsurace:ballot

github-actions bot (Contributor) commented Feb 24, 2026

Your PR no longer requires formatting changes. Thank you for your contribution!

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.28%. Comparing base (1d2f000) to head (7a3fb2c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #744      +/-   ##
==========================================
+ Coverage   82.01%   82.28%   +0.27%     
==========================================
  Files          62       62              
  Lines        2874     2874              
==========================================
+ Hits         2357     2365       +8     
+ Misses        517      509       -8     

github-actions bot (Contributor) left a comment

Metal Benchmarks

Columns: Benchmark suite, Current (7a3fb2c), Previous (1d2f000), Ratio (Current / Previous)
latency/precompile 25740449000 ns 25549419083 ns 1.01
latency/ttfp 2383024000 ns 2346831687.5 ns 1.02
latency/import 1450508000 ns 1427666042 ns 1.02
integration/metaldevrt 836084 ns 877750 ns 0.95
integration/byval/slices=1 1575688 ns 1568625 ns 1.00
integration/byval/slices=3 21063916.5 ns 8402792 ns 2.51
integration/byval/reference 1570583 ns 1559958 ns 1.01
integration/byval/slices=2 2714000 ns 2629875 ns 1.03
kernel/indexing 465791 ns 627417 ns 0.74
kernel/indexing_checked 480333 ns 608750 ns 0.79
kernel/launch 11583 ns 12667 ns 0.91
kernel/rand 533000 ns 576167 ns 0.93
array/construct 6292 ns 6500 ns 0.97
array/broadcast 522459 ns 606708 ns 0.86
array/random/randn/Float32 1015541 ns 1011104 ns 1.00
array/random/randn!/Float32 708625 ns 753875 ns 0.94
array/random/rand!/Int64 540125 ns 548708 ns 0.98
array/random/rand!/Float32 535750 ns 586208.5 ns 0.91
array/random/rand/Int64 883896 ns 789709 ns 1.12
array/random/rand/Float32 805375 ns 645000 ns 1.25
array/accumulate/Int64/1d 1301354 ns 1260667 ns 1.03
array/accumulate/Int64/dims=1 1843583 ns 1859104.5 ns 0.99
array/accumulate/Int64/dims=2 2228875 ns 2179083 ns 1.02
array/accumulate/Int64/dims=1L 12088292 ns 11673271 ns 1.04
array/accumulate/Int64/dims=2L 10061833 ns 9628146 ns 1.05
array/accumulate/Float32/1d 1066000 ns 1121395.5 ns 0.95
array/accumulate/Float32/dims=1 1577708.5 ns 1571667 ns 1.00
array/accumulate/Float32/dims=2 2003166 ns 1889459 ns 1.06
array/accumulate/Float32/dims=1L 10307833 ns 9834209 ns 1.05
array/accumulate/Float32/dims=2L 7442125 ns 7249666.5 ns 1.03
array/reductions/reduce/Int64/1d 1292750 ns 1386875 ns 0.93
array/reductions/reduce/Int64/dims=1 1116375 ns 1117250 ns 1.00
array/reductions/reduce/Int64/dims=2 1153167 ns 1152958 ns 1.00
array/reductions/reduce/Int64/dims=1L 2039291 ns 2013209 ns 1.01
array/reductions/reduce/Int64/dims=2L 3941000 ns 4244083 ns 0.93
array/reductions/reduce/Float32/1d 751020.5 ns 988750 ns 0.76
array/reductions/reduce/Float32/dims=1 806667 ns 843520.5 ns 0.96
array/reductions/reduce/Float32/dims=2 836000 ns 857917 ns 0.97
array/reductions/reduce/Float32/dims=1L 1331604 ns 1326625 ns 1.00
array/reductions/reduce/Float32/dims=2L 1811333.5 ns 1810667 ns 1.00
array/reductions/mapreduce/Int64/1d 1311125 ns 1356437.5 ns 0.97
array/reductions/mapreduce/Int64/dims=1 1111917 ns 1102166.5 ns 1.01
array/reductions/mapreduce/Int64/dims=2 1156874.5 ns 1149750 ns 1.01
array/reductions/mapreduce/Int64/dims=1L 1924146 ns 1988375 ns 0.97
array/reductions/mapreduce/Int64/dims=2L 3639125 ns 3626916 ns 1.00
array/reductions/mapreduce/Float32/1d 786917 ns 1055917 ns 0.75
array/reductions/mapreduce/Float32/dims=1 799125 ns 847396 ns 0.94
array/reductions/mapreduce/Float32/dims=2 841750 ns 860979.5 ns 0.98
array/reductions/mapreduce/Float32/dims=1L 1326000 ns 1333042 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 1808708 ns 1898125 ns 0.95
array/private/copyto!/gpu_to_gpu 550083 ns 633020.5 ns 0.87
array/private/copyto!/cpu_to_gpu 702291.5 ns 804354.5 ns 0.87
array/private/copyto!/gpu_to_cpu 688459 ns 816000 ns 0.84
array/private/iteration/findall/int 1560208.5 ns 1581312.5 ns 0.99
array/private/iteration/findall/bool 1463542 ns 1404916.5 ns 1.04
array/private/iteration/findfirst/int 2075208 ns 2075167 ns 1.00
array/private/iteration/findfirst/bool 2009584 ns 2048750 ns 0.98
array/private/iteration/scalar 3491687.5 ns 4526479 ns 0.77
array/private/iteration/logical 2644208.5 ns 2693625 ns 0.98
array/private/iteration/findmin/1d 2523459 ns 2518041 ns 1.00
array/private/iteration/findmin/2d 1842229 ns 1820229.5 ns 1.01
array/private/copy 817604.5 ns 568854 ns 1.44
array/shared/copyto!/gpu_to_gpu 84792 ns 84291 ns 1.01
array/shared/copyto!/cpu_to_gpu 82875 ns 82875 ns 1
array/shared/copyto!/gpu_to_cpu 82687.5 ns 83000 ns 1.00
array/shared/iteration/findall/int 1565458 ns 1585854.5 ns 0.99
array/shared/iteration/findall/bool 1471437.5 ns 1421875 ns 1.03
array/shared/iteration/findfirst/int 1701708 ns 1654709 ns 1.03
array/shared/iteration/findfirst/bool 1629083 ns 1643542 ns 0.99
array/shared/iteration/scalar 201542 ns 210375 ns 0.96
array/shared/iteration/logical 2363459 ns 2297959 ns 1.03
array/shared/iteration/findmin/1d 2166125 ns 2134229 ns 1.01
array/shared/iteration/findmin/2d 1833666 ns 1806042 ns 1.02
array/shared/copy 215333 ns 241812 ns 0.89
array/permutedims/4d 2478834 ns 2395583 ns 1.03
array/permutedims/2d 1187187.5 ns 1158833 ns 1.02
array/permutedims/3d 1768084 ns 1686541 ns 1.05
metal/synchronization/stream 19125 ns 19583 ns 0.98
metal/synchronization/context 19708 ns 20291 ns 0.97
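
The Ratio column is consistent with Current divided by Previous, rounded to two decimals, so values above 1 mean the current commit was slower. As a rough sketch, the following recomputes the ratio for three rows copied from the table and flags the ones above a cut-off. The `THRESHOLD` value and the helper names are assumptions made for illustration; they are not part of github-action-benchmark.

```python
# (current_ns, previous_ns) pairs copied from three rows of the table above.
rows = {
    "integration/byval/slices=3": (21063916.5, 8402792),
    "kernel/indexing": (465791, 627417),
    "array/private/copy": (817604.5, 568854),
}

# Hypothetical cut-off for "worth a second look"; not a value used by the bot.
THRESHOLD = 1.2


def ratio(current_ns, previous_ns):
    """Current / Previous, rounded the same way the table displays it."""
    return round(current_ns / previous_ns, 2)


# Keep only the benchmarks whose ratio exceeds the cut-off.
flagged = {
    name: ratio(cur, prev)
    for name, (cur, prev) in rows.items()
    if ratio(cur, prev) > THRESHOLD
}

for name, r in flagged.items():
    print(f"{name}: {r}x slower than the previous commit")
```

Single-run timings like these are noisy, so a flagged row is a prompt to rerun the benchmark rather than evidence of a regression caused by this PR.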

This comment was automatically generated by workflow using github-action-benchmark.

christiangnrd (Member) commented Feb 25, 2026

From Section 6.9.2 of the Metal Shading Language Specification:

Note that simd_all(expr) is different from simd_ballot(expr).all():
simd_all(expr) returns true if all active threads evaluate expr to true.
simd_ballot(expr).all() returns true if all threads were active and evaluated the expr to true. (simd_vote::all() does not look at which threads are active.)
The same logic applies to simd_any, simd_vote::any(), and to the equivalent quad functions listed in section 6.9.3.
On hardware with fewer than 64 threads in a SIMD-group, the value of the top bits in simd_vote is undefined. Because you can initialize these bits, do not assume that the top bits are set to 0.
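
The distinction in the quoted note can be modelled on the CPU. The sketch below is plain Python, not Metal or Metal.jl code: the function names mirror the MSL ones only for readability, the 32-thread width is an assumption, and masking the vote to that width is one defensive way to avoid depending on the undefined top bits.

```python
# CPU-side model of the quoted semantics; not Metal or Metal.jl code.
SIMD_WIDTH = 32  # assumed SIMD-group width for this sketch


def simd_all(active, pred):
    """True if every *active* thread evaluates the predicate to true."""
    return all(p for a, p in zip(active, pred) if a)


def simd_any(active, pred):
    """True if at least one *active* thread evaluates the predicate to true."""
    return any(p for a, p in zip(active, pred) if a)


def simd_ballot(active, pred):
    """Bitmask with bit i set when thread i is active and its predicate is true."""
    mask = 0
    for i, (a, p) in enumerate(zip(active, pred)):
        if a and p:
            mask |= 1 << i
    return mask


def vote_all(mask, width=SIMD_WIDTH):
    """Model of simd_vote::all(): checks the bits only, not which threads are active."""
    full = (1 << width) - 1  # keep the low `width` bits, ignore the undefined top bits
    return (mask & full) == full


def vote_any(mask, width=SIMD_WIDTH):
    """Model of simd_vote::any() over the low `width` bits."""
    full = (1 << width) - 1
    return (mask & full) != 0


# Thread 31 is inactive (for example, it took a divergent branch);
# every thread's predicate is true.
active = [True] * 31 + [False]
pred = [True] * 32

print(simd_all(active, pred))               # all active threads agree
print(vote_all(simd_ballot(active, pred)))  # bit 31 is missing from the ballot
```

With one inactive thread, the model's `simd_all` is true while `vote_all` on the ballot is false, which is the mismatch the note warns about.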

The simd_all and simd_any functions described in Table 6.14 of the spec are lowered to

; Function Attrs: convergent mustprogress nounwind willreturn
declare i1 @air.simd_any(i1) local_unnamed_addr #1

; Function Attrs: convergent mustprogress nounwind willreturn
declare i1 @air.simd_all(i1) local_unnamed_addr #1

rather than to simd_vote_any/simd_vote_all. The former (simd_all and simd_any) seem to behave more like CUDA's __all_sync and __any_sync intrinsics.

Would you mind renaming the device functions to simd_vote_(all|any)? If the current behaviour isn't what you're after, feel free to add the "real" simd_(any|all), but I wouldn't remove the current code.

simsurace (Author) commented

Hmm, I think I got these mixed up, which might be why my tests turned out differently from what I expected. I will revisit this.
