
Implement fast_div using fast rcp#3077

Open
vchuravy wants to merge 1 commit into master from vc/faster_div

Conversation

vchuravy (Member) commented Apr 2, 2026

Working with @efaulhaber two weeks ago, I was reminded of the slowness of division on Nvidia GPUs.

On top of that, `@fastmath a/b` for `Float64` currently just becomes an `fdiv fast`, which then becomes a normal NVPTX division,
which SASS helpfully turns into a function call.

@efaulhaber has some numbers for his hot kernel:

# fdiv double %143, %144, !dbg !528
# 10.159 ms
# return x / y

# fdiv fast double %143, %144, !dbg !530
# 10.114 ms
# return Base.FastMath.div_fast(x, y)

Using the simple implementation `a / b = a * (1 / b)`:

# fdiv double 1.000000e+00, %145, !dbg !530
# fmul double %144, %146, !dbg !533
# 6.878 ms
# return x * (1 / y)

# fdiv double 1.000000e+00, %145, !dbg !532
# fmul double %144, %146, !dbg !535
# 6.852 ms
# return x * inv(y)

did speed up his code, but that might have more to do with the additional
code-motion opportunities this affords.
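A minimal CPU-side sketch of that code-motion effect (the function and names here are illustrative, not from the PR): once the division is rewritten as a multiplication by a reciprocal, a loop-invariant divisor only has to be inverted once.

```julia
# Hypothetical illustration of the code-motion win from a/b = a * (1/b):
# when b is loop-invariant, the one expensive reciprocal can be hoisted
# out of the loop, leaving only a cheap multiply per iteration.
function scale!(out::AbstractVector, x::AbstractVector, b::Real)
    rb = inv(b)                    # single division, hoisted
    @inbounds for i in eachindex(out, x)
        out[i] = x[i] * rb         # multiply instead of divide
    end
    return out
end

scale!(zeros(3), [2.0, 4.0, 6.0], 2.0)  # → [1.0, 2.0, 3.0]
```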

As an example, NVIDIA Warp uses the `approx.ftz` instruction to obtain a fast_div implementation.

Using @efaulhaber's measurements:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fmul double %117, %119, !dbg !418
# 4.758 ms
# return x * fast_inv_cuda_nofma(y)

But what is the loss of accuracy we are incurring here?

julia> y2 = CUDA.rand(Float64, 100_000);
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
0.007475190221157391

# Without numbers close to zero
julia> y2 = CUDA.rand(Float64, 100_000) .+ 0.1;
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
7.105216770497691e-6

Pretty bad. Meanwhile Oceananigans is facing a similar problem:
CliMA/Oceananigans.jl#5140, where @Mikolaj-A-Kowalski
is improving the accuracy of `inv_fast` by performing an additional iteration.

@efaulhaber tested this as:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fneg double %118, !dbg !418
# call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00), !dbg !420
# call double @llvm.fma.f64(double %121, double %121, double %121), !dbg !422
# call double @llvm.fma.f64(double %122, double %119, double %119), !dbg !424
# fmul double %117, %123, !dbg !426
# 4.844 ms
# return x * fast_inv_cuda(y)

So a very small additional cost.
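Reading off the IR above, the extra FMAs implement a polynomial refinement of the raw reciprocal estimate: with `e = 1 - y*r0` the relative error of the seed `r0`, the result is `r0 * (1 + e + e^2)`, so the error shrinks to roughly `e^3`. A CPU-side sketch (names are illustrative; on the GPU the seed comes from `llvm.nvvm.rcp.approx.ftz.d`):

```julia
# Sketch of the FMA refinement sequence shown above (CPU-side; names are
# illustrative). Given a rough seed r0 ≈ 1/y, one pass computes
# r0 * (1 + e + e^2), where e = 1 - y*r0 is the seed's relative error.
function refine_inv(y::Float64, r0::Float64)
    e  = fma(-y, r0, 1.0)   # e  = 1 - y*r0
    e2 = fma(e, e, e)       # e2 = e + e^2
    return fma(e2, r0, r0)  # r0 * (1 + e + e^2)
end

y  = 3.7
r0 = inv(y) * (1 + 1e-4)        # seed with ~1e-4 relative error
abs(refine_inv(y, r0) - inv(y)) # error drops to roughly 1e-4³ ≈ 1e-12
```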

But the gain in accuracy is significant:

julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
7.105427357601002e-15
julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
8.881784197001252e-16

Co-authored-by: M. A. Kowalski <mak60@cam.ac.uk>
Co-authored-by: Erik Faulhaber <erik.faulhaber@web.de>
vchuravy (Member, Author) commented Apr 2, 2026

@lcw this might interest you, since we were trying things along these lines in the volumerhs benchmark:

let (jlf, f) = (:div_arcp, :div)
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f fast $llvmT %0, %1
            ret $llvmT %x
            """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end

rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn, which is also a function call
# div_fast(x::Float32, y::Float32) = ccall("extern __nv_fast_fdividef", llvmcall, Cfloat, (Cfloat, Cfloat), x, y)
# rcp(x) = div_fast(one(x), x)
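For reference, a self-contained single-type version of the same `llvmcall` trick (the name `div_fast64` is illustrative, not from the PR), which also works CPU-side:

```julia
# Minimal standalone sketch of the llvmcall pattern above, for one type
# only. The `fast` flags license the backend to rewrite the division,
# e.g. as a multiplication by a reciprocal.
@inline function div_fast64(a::Float64, b::Float64)
    Base.llvmcall("""
        %x = fdiv fast double %0, %1
        ret double %x
        """, Float64, Tuple{Float64, Float64}, a, b)
end

div_fast64(1.0, 4.0)  # → 0.25
```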

@efaulhaber (Contributor)

I made this table to see what is happening with FP32 and FP64:

| Variant | LLVM Operation (FP32) | Runtime (FP32) | LLVM Operation (FP64) | Runtime (FP64) |
| --- | --- | --- | --- | --- |
| `x / y` | `fdiv float %143, %144, !dbg !528` | 4.894 ms | `fdiv double %143, %144, !dbg !528` | 10.159 ms |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145, !dbg !530`<br>`fmul float %144, %146, !dbg !533` | 4.227 ms | `fdiv double 1.000000e+00, %145, !dbg !530`<br>`fmul double %144, %146, !dbg !533` | 6.878 ms |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118), !dbg !413`<br>`fmul float %117, %119, !dbg !418` | 3.620 ms | `fdiv double 1.000000e+00, %145, !dbg !532`<br>`fmul double %144, %146, !dbg !535` | 6.852 ms |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118), !dbg !413`<br>`fmul float %117, %119, !dbg !418` | 3.281 ms | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413`<br>`fneg double %118, !dbg !418`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00), !dbg !420`<br>`call double @llvm.fma.f64(double %121, double %121, double %121), !dbg !422`<br>`call double @llvm.fma.f64(double %122, double %119, double %119), !dbg !424`<br>`fmul double %117, %123, !dbg !426` | 4.844 ms |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115), !dbg !413` | 3.114 ms | `fdiv fast double %143, %144, !dbg !530` | 10.114 ms |
| `x * fast_inv_cuda_nofma(y)` | — | — | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413`<br>`fmul double %117, %119, !dbg !418` | 4.758 ms |

@github-actions github-actions bot left a comment

CUDA.jl Benchmarks
Benchmark suite Current: e200935 Previous: a79b516 Ratio
array/accumulate/Float32/1d 101549 ns 101309 ns 1.00
array/accumulate/Float32/dims=1 76954 ns 76747 ns 1.00
array/accumulate/Float32/dims=1L 1585965 ns 1585609 ns 1.00
array/accumulate/Float32/dims=2 143736 ns 143412 ns 1.00
array/accumulate/Float32/dims=2L 657735 ns 657151 ns 1.00
array/accumulate/Int64/1d 119049 ns 118450 ns 1.01
array/accumulate/Int64/dims=1 79971 ns 79685 ns 1.00
array/accumulate/Int64/dims=1L 1694663 ns 1694399 ns 1.00
array/accumulate/Int64/dims=2 156273 ns 155494.5 ns 1.01
array/accumulate/Int64/dims=2L 961567 ns 961001 ns 1.00
array/broadcast 20565 ns 20538 ns 1.00
array/construct 1335.6 ns 1298.9 ns 1.03
array/copy 18948 ns 18512 ns 1.02
array/copyto!/cpu_to_gpu 217063 ns 213295 ns 1.02
array/copyto!/gpu_to_cpu 285135 ns 284330.5 ns 1.00
array/copyto!/gpu_to_gpu 11401 ns 11273 ns 1.01
array/iteration/findall/bool 132221.5 ns 132165 ns 1.00
array/iteration/findall/int 149464.5 ns 148572 ns 1.01
array/iteration/findfirst/bool 82472 ns 81324.5 ns 1.01
array/iteration/findfirst/int 83837 ns 83910 ns 1.00
array/iteration/findmin/1d 89677.5 ns 88268.5 ns 1.02
array/iteration/findmin/2d 117275 ns 116719 ns 1.00
array/iteration/logical 202213 ns 201488.5 ns 1.00
array/iteration/scalar 68004.5 ns 67192 ns 1.01
array/permutedims/2d 52647.5 ns 52378 ns 1.01
array/permutedims/3d 53105 ns 52726 ns 1.01
array/permutedims/4d 52039.5 ns 51596 ns 1.01
array/random/rand/Float32 13422 ns 13097 ns 1.02
array/random/rand/Int64 30333 ns 37319 ns 0.81
array/random/rand!/Float32 8567.666666666666 ns 8581.666666666666 ns 1.00
array/random/rand!/Int64 34297 ns 34312 ns 1.00
array/random/randn/Float32 40454 ns 38478.5 ns 1.05
array/random/randn!/Float32 31587 ns 31422.5 ns 1.01
array/reductions/mapreduce/Float32/1d 35297 ns 34936 ns 1.01
array/reductions/mapreduce/Float32/dims=1 40015 ns 49501 ns 0.81
array/reductions/mapreduce/Float32/dims=1L 52169 ns 51907 ns 1.01
array/reductions/mapreduce/Float32/dims=2 56924.5 ns 56747.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 70043 ns 69513 ns 1.01
array/reductions/mapreduce/Int64/1d 43391 ns 43154 ns 1.01
array/reductions/mapreduce/Int64/dims=1 42572.5 ns 43838 ns 0.97
array/reductions/mapreduce/Int64/dims=1L 88082 ns 87668 ns 1.00
array/reductions/mapreduce/Int64/dims=2 60121 ns 59424 ns 1.01
array/reductions/mapreduce/Int64/dims=2L 85396 ns 84576 ns 1.01
array/reductions/reduce/Float32/1d 35239 ns 34859 ns 1.01
array/reductions/reduce/Float32/dims=1 45458 ns 39947.5 ns 1.14
array/reductions/reduce/Float32/dims=1L 52256 ns 51723 ns 1.01
array/reductions/reduce/Float32/dims=2 57350 ns 56768 ns 1.01
array/reductions/reduce/Float32/dims=2L 70540 ns 69769.5 ns 1.01
array/reductions/reduce/Int64/1d 43685.5 ns 42778 ns 1.02
array/reductions/reduce/Int64/dims=1 52984 ns 44289 ns 1.20
array/reductions/reduce/Int64/dims=1L 87981 ns 87701 ns 1.00
array/reductions/reduce/Int64/dims=2 60039 ns 59510 ns 1.01
array/reductions/reduce/Int64/dims=2L 85306 ns 84815 ns 1.01
array/reverse/1d 18625 ns 18338 ns 1.02
array/reverse/1dL 69167 ns 68805 ns 1.01
array/reverse/1dL_inplace 65986 ns 65983 ns 1.00
array/reverse/1d_inplace 8633.333333333334 ns 8621.333333333334 ns 1.00
array/reverse/2d 20966 ns 20615 ns 1.02
array/reverse/2dL 72939 ns 72573 ns 1.01
array/reverse/2dL_inplace 66084 ns 66098 ns 1.00
array/reverse/2d_inplace 10244 ns 10260 ns 1.00
array/sorting/1d 2735830 ns 2735030 ns 1.00
array/sorting/2d 1069179 ns 1071674 ns 1.00
array/sorting/by 3304037 ns 3313782 ns 1.00
cuda/synchronization/context/auto 1181 ns 1186.2 ns 1.00
cuda/synchronization/context/blocking 939.6111111111111 ns 924.0487804878048 ns 1.02
cuda/synchronization/context/nonblocking 7178.5 ns 7835.8 ns 0.92
cuda/synchronization/stream/auto 1044.3 ns 1041.2 ns 1.00
cuda/synchronization/stream/blocking 819.8823529411765 ns 835.7402597402597 ns 0.98
cuda/synchronization/stream/nonblocking 7825.7 ns 7438.2 ns 1.05
integration/byval/reference 144029 ns 144123 ns 1.00
integration/byval/slices=1 145968 ns 146064 ns 1.00
integration/byval/slices=2 284729 ns 284754 ns 1.00
integration/byval/slices=3 423269.5 ns 423302 ns 1.00
integration/cudadevrt 102720 ns 102654 ns 1.00
integration/volumerhs 9430033 ns 9450427 ns 1.00
kernel/indexing 13631 ns 13382 ns 1.02
kernel/indexing_checked 14330 ns 14092 ns 1.02
kernel/launch 2166.1111111111113 ns 2292.8888888888887 ns 0.94
kernel/occupancy 662.1013513513514 ns 675.4013157894736 ns 0.98
kernel/rand 14999 ns 17995 ns 0.83
latency/import 3804873493 ns 3823445090 ns 1.00
latency/precompile 4580372785 ns 4598939035 ns 1.00
latency/ttfp 4386641262 ns 4399692793 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

vchuravy (Member, Author) commented Apr 2, 2026

@efaulhaber that was on an H100, right?

lcw (Contributor) commented Apr 2, 2026

This is great. How did you get the benchmark numbers?

@efaulhaber (Contributor)

> How did you get the benchmark numbers?

This is benchmarking the main kernel of TrixiParticles.jl, which is an SPH neighbor loop computing the forces on particles. There are two divisions in the hot loop, for which I then used the different fast division implementations.
