
Implement fast_div using fast rcp#3077

Open
vchuravy wants to merge 1 commit into master from vc/faster_div

Conversation

vchuravy (Member) commented Apr 2, 2026

Working with @efaulhaber two weeks ago, I was reminded of the slowness of division on Nvidia GPUs.

On top of that, `@fastmath a/b` for `Float64` currently just becomes an `fdiv fast`, which then becomes a normal NVPTX division,
which SASS helpfully turns into a function call.

@efaulhaber has some numbers for his hot kernel:

# fdiv double %143, %144, !dbg !528
# 10.159 ms
# return x / y

# fdiv fast double %143, %144, !dbg !530
# 10.114 ms
# return Base.FastMath.div_fast(x, y)

Using the simple implementation `a / b = a * (1 / b)`:

# fdiv double 1.000000e+00, %145, !dbg !530
# fmul double %144, %146, !dbg !533
# 6.878 ms
# return x * (1 / y)

# fdiv double 1.000000e+00, %145, !dbg !532
# fmul double %144, %146, !dbg !535
# 6.852 ms
# return x * inv(y)

did speed up his code, but that might have more to do with the additional
code-motion opportunities this affords.
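A minimal CPU-side sketch of that code-motion effect (the function and names here are illustrative, not from the PR): once the division is rewritten as a multiplication by a reciprocal, a loop-invariant divisor only has to be inverted once.

```julia
# Hypothetical illustration of the code-motion win from a/b = a * (1/b):
# when b is loop-invariant, the one expensive reciprocal can be hoisted
# out of the loop, leaving only a cheap multiply per iteration.
function scale!(out::AbstractVector, x::AbstractVector, b::Real)
    rb = inv(b)                    # single division, hoisted
    @inbounds for i in eachindex(out, x)
        out[i] = x[i] * rb         # multiply instead of divide
    end
    return out
end

scale!(zeros(3), [2.0, 4.0, 6.0], 2.0)  # → [1.0, 2.0, 3.0]
```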

As an example, NVIDIA Warp uses the `approx.ftz` instruction to obtain a fast_div implementation.

Using @efaulhaber's measurements:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fmul double %117, %119, !dbg !418
# 4.758 ms
# return x * fast_inv_cuda_nofma(y)

But what is the loss of accuracy we are incurring here?

julia> y2 = CUDA.rand(Float64, 100_000);
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
0.007475190221157391

# Without numbers close to zero
julia> y2 = CUDA.rand(Float64, 100_000) .+ 0.1;
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
7.105216770497691e-6

Pretty bad. Meanwhile Oceananigans is facing a similar problem:
CliMA/Oceananigans.jl#5140, where @Mikolaj-A-Kowalski
is improving the accuracy of `inv_fast` by performing an additional iteration.

@efaulhaber tested this as:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fneg double %118, !dbg !418
# call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00), !dbg !420
# call double @llvm.fma.f64(double %121, double %121, double %121), !dbg !422
# call double @llvm.fma.f64(double %122, double %119, double %119), !dbg !424
# fmul double %117, %123, !dbg !426
# 4.844 ms
# return x * fast_inv_cuda(y)

So a very small additional cost.
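Reading off the IR above, the extra FMAs implement a polynomial refinement of the raw reciprocal estimate: with `e = 1 - y*r0` the relative error of the seed `r0`, the result is `r0 * (1 + e + e^2)`, so the error shrinks to roughly `e^3`. A CPU-side sketch (names are illustrative; on the GPU the seed comes from `llvm.nvvm.rcp.approx.ftz.d`):

```julia
# Sketch of the FMA refinement sequence shown above (CPU-side; names are
# illustrative). Given a rough seed r0 ≈ 1/y, one pass computes
# r0 * (1 + e + e^2), where e = 1 - y*r0 is the seed's relative error.
function refine_inv(y::Float64, r0::Float64)
    e  = fma(-y, r0, 1.0)   # e  = 1 - y*r0
    e2 = fma(e, e, e)       # e2 = e + e^2
    return fma(e2, r0, r0)  # r0 * (1 + e + e^2)
end

y  = 3.7
r0 = inv(y) * (1 + 1e-4)        # seed with ~1e-4 relative error
abs(refine_inv(y, r0) - inv(y)) # error drops to roughly 1e-4³ ≈ 1e-12
```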

But the gain in accuracy is significant:

julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
7.105427357601002e-15
julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
8.881784197001252e-16

Co-authored-by: M. A. Kowalski <mak60@cam.ac.uk>
Co-authored-by: Erik Faulhaber <erik.faulhaber@web.de>
vchuravy (Member, Author) commented Apr 2, 2026

@lcw this might interest you, since we were trying things along these lines in the volumerhs benchmark:

let (jlf, f) = (:div_arcp, :div)
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f fast $llvmT %0, %1
            ret $llvmT %x
            """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end

rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn, which is also a function call
# div_fast(x::Float32, y::Float32) = ccall("extern __nv_fast_fdividef", llvmcall, Cfloat, (Cfloat, Cfloat), x, y)
# rcp(x) = div_fast(one(x), x)
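For reference, a self-contained single-type version of the same `llvmcall` trick (the name `div_fast64` is illustrative, not from the PR), which also works CPU-side:

```julia
# Minimal standalone sketch of the llvmcall pattern above, for one type
# only. The `fast` flags license the backend to rewrite the division,
# e.g. as a multiplication by a reciprocal.
@inline function div_fast64(a::Float64, b::Float64)
    Base.llvmcall("""
        %x = fdiv fast double %0, %1
        ret double %x
        """, Float64, Tuple{Float64, Float64}, a, b)
end

div_fast64(1.0, 4.0)  # → 0.25
```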

@efaulhaber (Contributor)

I made this table to see what is happening with FP32 and FP64:

| Variant | LLVM Operation (FP32) | Runtime (FP32) | LLVM Operation (FP64) | Runtime (FP64) |
| --- | --- | --- | --- | --- |
| `x / y` | `fdiv float %143, %144, !dbg !528` | 4.894 ms | `fdiv double %143, %144, !dbg !528` | 10.159 ms |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145, !dbg !530`<br>`fmul float %144, %146, !dbg !533` | 4.227 ms | `fdiv double 1.000000e+00, %145, !dbg !530`<br>`fmul double %144, %146, !dbg !533` | 6.878 ms |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118), !dbg !413`<br>`fmul float %117, %119, !dbg !418` | 3.620 ms | `fdiv double 1.000000e+00, %145, !dbg !532`<br>`fmul double %144, %146, !dbg !535` | 6.852 ms |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118), !dbg !413`<br>`fmul float %117, %119, !dbg !418` | 3.281 ms | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413`<br>`fneg double %118, !dbg !418`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00), !dbg !420`<br>`call double @llvm.fma.f64(double %121, double %121, double %121), !dbg !422`<br>`call double @llvm.fma.f64(double %122, double %119, double %119), !dbg !424`<br>`fmul double %117, %123, !dbg !426` | 4.844 ms |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115), !dbg !413` | 3.114 ms | `fdiv fast double %143, %144, !dbg !530` | 10.114 ms |
| `x * fast_inv_cuda_nofma(y)` | — | — | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413`<br>`fmul double %117, %119, !dbg !418` | 4.758 ms |

@github-actions github-actions bot left a comment

CUDA.jl Benchmarks
Benchmark suite Current: e200935 Previous: a79b516 Ratio
array/accumulate/Float32/1d 101549 ns 101309 ns 1.00
array/accumulate/Float32/dims=1 76954 ns 76747 ns 1.00
array/accumulate/Float32/dims=1L 1585965 ns 1585609 ns 1.00
array/accumulate/Float32/dims=2 143736 ns 143412 ns 1.00
array/accumulate/Float32/dims=2L 657735 ns 657151 ns 1.00
array/accumulate/Int64/1d 119049 ns 118450 ns 1.01
array/accumulate/Int64/dims=1 79971 ns 79685 ns 1.00
array/accumulate/Int64/dims=1L 1694663 ns 1694399 ns 1.00
array/accumulate/Int64/dims=2 156273 ns 155494.5 ns 1.01
array/accumulate/Int64/dims=2L 961567 ns 961001 ns 1.00
array/broadcast 20565 ns 20538 ns 1.00
array/construct 1335.6 ns 1298.9 ns 1.03
array/copy 18948 ns 18512 ns 1.02
array/copyto!/cpu_to_gpu 217063 ns 213295 ns 1.02
array/copyto!/gpu_to_cpu 285135 ns 284330.5 ns 1.00
array/copyto!/gpu_to_gpu 11401 ns 11273 ns 1.01
array/iteration/findall/bool 132221.5 ns 132165 ns 1.00
array/iteration/findall/int 149464.5 ns 148572 ns 1.01
array/iteration/findfirst/bool 82472 ns 81324.5 ns 1.01
array/iteration/findfirst/int 83837 ns 83910 ns 1.00
array/iteration/findmin/1d 89677.5 ns 88268.5 ns 1.02
array/iteration/findmin/2d 117275 ns 116719 ns 1.00
array/iteration/logical 202213 ns 201488.5 ns 1.00
array/iteration/scalar 68004.5 ns 67192 ns 1.01
array/permutedims/2d 52647.5 ns 52378 ns 1.01
array/permutedims/3d 53105 ns 52726 ns 1.01
array/permutedims/4d 52039.5 ns 51596 ns 1.01
array/random/rand/Float32 13422 ns 13097 ns 1.02
array/random/rand/Int64 30333 ns 37319 ns 0.81
array/random/rand!/Float32 8567.666666666666 ns 8581.666666666666 ns 1.00
array/random/rand!/Int64 34297 ns 34312 ns 1.00
array/random/randn/Float32 40454 ns 38478.5 ns 1.05
array/random/randn!/Float32 31587 ns 31422.5 ns 1.01
array/reductions/mapreduce/Float32/1d 35297 ns 34936 ns 1.01
array/reductions/mapreduce/Float32/dims=1 40015 ns 49501 ns 0.81
array/reductions/mapreduce/Float32/dims=1L 52169 ns 51907 ns 1.01
array/reductions/mapreduce/Float32/dims=2 56924.5 ns 56747.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 70043 ns 69513 ns 1.01
array/reductions/mapreduce/Int64/1d 43391 ns 43154 ns 1.01
array/reductions/mapreduce/Int64/dims=1 42572.5 ns 43838 ns 0.97
array/reductions/mapreduce/Int64/dims=1L 88082 ns 87668 ns 1.00
array/reductions/mapreduce/Int64/dims=2 60121 ns 59424 ns 1.01
array/reductions/mapreduce/Int64/dims=2L 85396 ns 84576 ns 1.01
array/reductions/reduce/Float32/1d 35239 ns 34859 ns 1.01
array/reductions/reduce/Float32/dims=1 45458 ns 39947.5 ns 1.14
array/reductions/reduce/Float32/dims=1L 52256 ns 51723 ns 1.01
array/reductions/reduce/Float32/dims=2 57350 ns 56768 ns 1.01
array/reductions/reduce/Float32/dims=2L 70540 ns 69769.5 ns 1.01
array/reductions/reduce/Int64/1d 43685.5 ns 42778 ns 1.02
array/reductions/reduce/Int64/dims=1 52984 ns 44289 ns 1.20
array/reductions/reduce/Int64/dims=1L 87981 ns 87701 ns 1.00
array/reductions/reduce/Int64/dims=2 60039 ns 59510 ns 1.01
array/reductions/reduce/Int64/dims=2L 85306 ns 84815 ns 1.01
array/reverse/1d 18625 ns 18338 ns 1.02
array/reverse/1dL 69167 ns 68805 ns 1.01
array/reverse/1dL_inplace 65986 ns 65983 ns 1.00
array/reverse/1d_inplace 8633.333333333334 ns 8621.333333333334 ns 1.00
array/reverse/2d 20966 ns 20615 ns 1.02
array/reverse/2dL 72939 ns 72573 ns 1.01
array/reverse/2dL_inplace 66084 ns 66098 ns 1.00
array/reverse/2d_inplace 10244 ns 10260 ns 1.00
array/sorting/1d 2735830 ns 2735030 ns 1.00
array/sorting/2d 1069179 ns 1071674 ns 1.00
array/sorting/by 3304037 ns 3313782 ns 1.00
cuda/synchronization/context/auto 1181 ns 1186.2 ns 1.00
cuda/synchronization/context/blocking 939.6111111111111 ns 924.0487804878048 ns 1.02
cuda/synchronization/context/nonblocking 7178.5 ns 7835.8 ns 0.92
cuda/synchronization/stream/auto 1044.3 ns 1041.2 ns 1.00
cuda/synchronization/stream/blocking 819.8823529411765 ns 835.7402597402597 ns 0.98
cuda/synchronization/stream/nonblocking 7825.7 ns 7438.2 ns 1.05
integration/byval/reference 144029 ns 144123 ns 1.00
integration/byval/slices=1 145968 ns 146064 ns 1.00
integration/byval/slices=2 284729 ns 284754 ns 1.00
integration/byval/slices=3 423269.5 ns 423302 ns 1.00
integration/cudadevrt 102720 ns 102654 ns 1.00
integration/volumerhs 9430033 ns 9450427 ns 1.00
kernel/indexing 13631 ns 13382 ns 1.02
kernel/indexing_checked 14330 ns 14092 ns 1.02
kernel/launch 2166.1111111111113 ns 2292.8888888888887 ns 0.94
kernel/occupancy 662.1013513513514 ns 675.4013157894736 ns 0.98
kernel/rand 14999 ns 17995 ns 0.83
latency/import 3804873493 ns 3823445090 ns 1.00
latency/precompile 4580372785 ns 4598939035 ns 1.00
latency/ttfp 4386641262 ns 4399692793 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

vchuravy (Member, Author) commented Apr 2, 2026

@efaulhaber that was on an H100, right?

lcw (Contributor) commented Apr 2, 2026

This is great. How did you get the benchmark numbers?

@efaulhaber (Contributor)

> How did you get the benchmark numbers?

This is benchmarking the main kernel of TrixiParticles.jl, which is an SPH neighbor loop computing the forces on particles. There are two divisions in the hot loop, for which I then used the different fast division implementations.
