Conversation
Co-authored-by: M. A. Kowalski <mak60@cam.ac.uk>
Co-authored-by: Erik Faulhaber <erik.faulhaber@web.de>
Contributor
I made this table to see what is happening with FP32 and FP64:
Contributor
CUDA.jl Benchmarks
| Benchmark suite | Current: e200935 | Previous: a79b516 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101549 ns | 101309 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76954 ns | 76747 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1585965 ns | 1585609 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143736 ns | 143412 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657735 ns | 657151 ns | 1.00 |
| array/accumulate/Int64/1d | 119049 ns | 118450 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 79971 ns | 79685 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1694663 ns | 1694399 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 156273 ns | 155494.5 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 961567 ns | 961001 ns | 1.00 |
| array/broadcast | 20565 ns | 20538 ns | 1.00 |
| array/construct | 1335.6 ns | 1298.9 ns | 1.03 |
| array/copy | 18948 ns | 18512 ns | 1.02 |
| array/copyto!/cpu_to_gpu | 217063 ns | 213295 ns | 1.02 |
| array/copyto!/gpu_to_cpu | 285135 ns | 284330.5 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11401 ns | 11273 ns | 1.01 |
| array/iteration/findall/bool | 132221.5 ns | 132165 ns | 1.00 |
| array/iteration/findall/int | 149464.5 ns | 148572 ns | 1.01 |
| array/iteration/findfirst/bool | 82472 ns | 81324.5 ns | 1.01 |
| array/iteration/findfirst/int | 83837 ns | 83910 ns | 1.00 |
| array/iteration/findmin/1d | 89677.5 ns | 88268.5 ns | 1.02 |
| array/iteration/findmin/2d | 117275 ns | 116719 ns | 1.00 |
| array/iteration/logical | 202213 ns | 201488.5 ns | 1.00 |
| array/iteration/scalar | 68004.5 ns | 67192 ns | 1.01 |
| array/permutedims/2d | 52647.5 ns | 52378 ns | 1.01 |
| array/permutedims/3d | 53105 ns | 52726 ns | 1.01 |
| array/permutedims/4d | 52039.5 ns | 51596 ns | 1.01 |
| array/random/rand/Float32 | 13422 ns | 13097 ns | 1.02 |
| array/random/rand/Int64 | 30333 ns | 37319 ns | 0.81 |
| array/random/rand!/Float32 | 8567.666666666666 ns | 8581.666666666666 ns | 1.00 |
| array/random/rand!/Int64 | 34297 ns | 34312 ns | 1.00 |
| array/random/randn/Float32 | 40454 ns | 38478.5 ns | 1.05 |
| array/random/randn!/Float32 | 31587 ns | 31422.5 ns | 1.01 |
| array/reductions/mapreduce/Float32/1d | 35297 ns | 34936 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40015 ns | 49501 ns | 0.81 |
| array/reductions/mapreduce/Float32/dims=1L | 52169 ns | 51907 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2 | 56924.5 ns | 56747.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 70043 ns | 69513 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 43391 ns | 43154 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 42572.5 ns | 43838 ns | 0.97 |
| array/reductions/mapreduce/Int64/dims=1L | 88082 ns | 87668 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 60121 ns | 59424 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 85396 ns | 84576 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 35239 ns | 34859 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 45458 ns | 39947.5 ns | 1.14 |
| array/reductions/reduce/Float32/dims=1L | 52256 ns | 51723 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2 | 57350 ns | 56768 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 70540 ns | 69769.5 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 43685.5 ns | 42778 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1 | 52984 ns | 44289 ns | 1.20 |
| array/reductions/reduce/Int64/dims=1L | 87981 ns | 87701 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 60039 ns | 59510 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2L | 85306 ns | 84815 ns | 1.01 |
| array/reverse/1d | 18625 ns | 18338 ns | 1.02 |
| array/reverse/1dL | 69167 ns | 68805 ns | 1.01 |
| array/reverse/1dL_inplace | 65986 ns | 65983 ns | 1.00 |
| array/reverse/1d_inplace | 8633.333333333334 ns | 8621.333333333334 ns | 1.00 |
| array/reverse/2d | 20966 ns | 20615 ns | 1.02 |
| array/reverse/2dL | 72939 ns | 72573 ns | 1.01 |
| array/reverse/2dL_inplace | 66084 ns | 66098 ns | 1.00 |
| array/reverse/2d_inplace | 10244 ns | 10260 ns | 1.00 |
| array/sorting/1d | 2735830 ns | 2735030 ns | 1.00 |
| array/sorting/2d | 1069179 ns | 1071674 ns | 1.00 |
| array/sorting/by | 3304037 ns | 3313782 ns | 1.00 |
| cuda/synchronization/context/auto | 1181 ns | 1186.2 ns | 1.00 |
| cuda/synchronization/context/blocking | 939.6111111111111 ns | 924.0487804878048 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 7178.5 ns | 7835.8 ns | 0.92 |
| cuda/synchronization/stream/auto | 1044.3 ns | 1041.2 ns | 1.00 |
| cuda/synchronization/stream/blocking | 819.8823529411765 ns | 835.7402597402597 ns | 0.98 |
| cuda/synchronization/stream/nonblocking | 7825.7 ns | 7438.2 ns | 1.05 |
| integration/byval/reference | 144029 ns | 144123 ns | 1.00 |
| integration/byval/slices=1 | 145968 ns | 146064 ns | 1.00 |
| integration/byval/slices=2 | 284729 ns | 284754 ns | 1.00 |
| integration/byval/slices=3 | 423269.5 ns | 423302 ns | 1.00 |
| integration/cudadevrt | 102720 ns | 102654 ns | 1.00 |
| integration/volumerhs | 9430033 ns | 9450427 ns | 1.00 |
| kernel/indexing | 13631 ns | 13382 ns | 1.02 |
| kernel/indexing_checked | 14330 ns | 14092 ns | 1.02 |
| kernel/launch | 2166.1111111111113 ns | 2292.8888888888887 ns | 0.94 |
| kernel/occupancy | 662.1013513513514 ns | 675.4013157894736 ns | 0.98 |
| kernel/rand | 14999 ns | 17995 ns | 0.83 |
| latency/import | 3804873493 ns | 3823445090 ns | 1.00 |
| latency/precompile | 4580372785 ns | 4598939035 ns | 1.00 |
| latency/ttfp | 4386641262 ns | 4399692793 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Member (Author)
@efaulhaber that was on an H100, right?
Contributor
This is great. How did you get the benchmark numbers?
Contributor
This is benchmarking the main kernel of TrixiParticles.jl, which is an SPH neighbor loop computing the forces on particles. There are two divisions in the hot loop, for which I then used the different fast division implementations.
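A minimal sketch of that pattern (illustrative only, not the actual TrixiParticles.jl code): the classic SPH pressure-gradient term contributes two divisions per neighbor pair, and those are the `/` operations the different fast-division implementations were substituted into.

```julia
# Illustrative sketch, not the TrixiParticles.jl implementation:
# the SPH pressure-gradient term performs two divisions per neighbor
# pair; these are the hot-loop divisions referred to above.
@inline function pressure_acceleration(m_b, p_a, p_b, rho_a, rho_b, grad_W)
    # The two hot-loop divisions, candidates for a fast_div replacement.
    return -m_b * (p_a / rho_a^2 + p_b / rho_b^2) * grad_W
end
```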
Working with @efaulhaber two weeks ago, I was reminded of the slowness of division on NVIDIA GPUs.
On top of that, `@fastmath a/b` for `Float64` currently just becomes an `fdiv fast`, which then lowers to a normal NVPTX division, which SASS helpfully turns into a function call.
@efaulhaber has some numbers for his hot kernel:
Using the simple implementation `a/b = a * (1/b)` did speed his code up, but that might have more to do with the additional code-motion opportunity this affords.
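A minimal sketch of that rewrite (the name `div_via_inv` is illustrative):

```julia
# Rewrite a/b as a * (1/b). The reciprocal is now a separate value that
# LLVM can hoist out of a loop when `b` is loop-invariant, which is the
# extra code-motion opportunity mentioned above.
@inline div_via_inv(a, b) = a * (one(b) / b)
```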
As an example, NVIDIA Warp uses the `approx.ftz` instruction to obtain a `fast_div` implementation. Using @efaulhaber's measurements:
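For `Float32`, a hedged sketch of what such an approx division can look like from a CUDA.jl kernel, assuming the libdevice intrinsic `__nv_fast_fdividef` (which lowers to PTX `div.approx.ftz.f32`):

```julia
using CUDA

# Hedged sketch: bind the libdevice fast-divide intrinsic. Device-only,
# Float32-only; it flushes denormals to zero and is accurate to a few
# ulps rather than correctly rounded.
@inline fast_div(a::Float32, b::Float32) =
    ccall("extern __nv_fast_fdividef", llvmcall, Cfloat, (Cfloat, Cfloat), a, b)
```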
But what is the loss of accuracy we are incurring here? Pretty bad.
Meanwhile, Oceananigans is facing a similar problem: CliMA/Oceananigans.jl#5140, where @Mikolaj-A-Kowalski is improving the accuracy of `inv_fast` by performing an additional iteration. @efaulhaber tested this as:
So a very small additional cost.
But the gain in accuracy is significant:
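For reference, a minimal sketch of such a refined reciprocal, assuming the NVPTX intrinsic `llvm.nvvm.rcp.approx.ftz.d` (names like `inv_fast` are illustrative, not the actual Oceananigans code):

```julia
# Hedged sketch: fast Float64 reciprocal from the hardware approximation,
# refined with Newton-Raphson. Device-only (NVPTX target).
rcp_approx(x::Float64) = Base.llvmcall(
    ("""
     declare double @llvm.nvvm.rcp.approx.ftz.d(double)
     define double @entry(double %x) alwaysinline {
         %r = call double @llvm.nvvm.rcp.approx.ftz.d(double %x)
         ret double %r
     }
     """, "entry"),
    Float64, Tuple{Float64}, x)

@inline function inv_fast(x::Float64)
    r = rcp_approx(x)
    # The additional iteration: one Newton-Raphson step, r <- r * (2 - x*r),
    # roughly doubles the number of correct bits for two extra multiplies
    # and a subtraction.
    return r * (2.0 - x * r)
end
```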