feat: GPU-accelerated downconvert and correlate via KernelAbstractions.jl #99
Open
Conversation
Add a portable GPU implementation of the GNSS downconvert-and-correlate pipeline using KernelAbstractions.jl. The kernel fuses carrier wipe-off, code lookup, and correlation into a single pass with in-kernel workgroup reduction via shared memory.

## The journey

### v1: Naive GPU kernel (baseline)

Started with a straightforward port: per-sample sincos() for the carrier, Float64 accumulators, and per-thread partial sums transferred back to the CPU for reduction. This was slower than the CPU for all configurations due to the massive GPU→CPU transfer of partial arrays and the expensive FP64 sincos on the GPU.

### v2: In-kernel reduction + ComplexF64 results

Replaced the CPU-side reduction with an in-kernel tree reduction using shared memory (@localmem). Only a compact ComplexF64 result array (sats × ants × taps) is transferred back. Combined the separate Float64 re/im arrays into a single ComplexF64 array, halving the number of GPU→CPU copies. Also introduced Val{num_taps} for compile-time kernel specialization, letting @private allocate exact-sized accumulators and enabling loop unrolling.

### v3: Cross-system batching + per-system kernel launches

Added support for batching multiple GNSS systems (e.g., GPSL1 + GalileoE1B) into the same struct. Initially tried a single kernel with a tuple of code tables and a per-satellite system_idx, but the @generated dispatch overhead caused a 25-34% regression. Switched to per-system kernel launches, each specialized at compile time for modulation type (LOC/BOC/CBOC), code length, and num_taps. This recovered the regression and added GalileoE1B (CBOC) support.

### v4: Carrier rotation + FP32 accumulation

Replaced the per-sample sincos() with incremental carrier rotation using FP32 multiply-adds (a Givens rotation). FP32 accumulators are used in the inner loop, promoting to FP64 only at the final reduction step. On RDNA 4 (Radeon 8060S): ~1.7x kernel speedup at 25K samples.
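The incremental carrier rotation in v4 can be sketched as follows. This is a minimal CPU-side illustration, not the PR's kernel code; the function name and signature are made up for the example. The idea is to replace a sincos() call per sample with one complex multiply (two FMAs per component) that advances a running FP32 phasor:

```julia
# Sketch of v4's incremental carrier rotation (illustrative, not the PR's
# actual kernel). A running ComplexF32 phasor is advanced by a fixed
# per-sample rotation step instead of calling sincos() every sample.
function carrier_wipeoff_rotation(signal::AbstractVector{<:Complex},
                                  carrier_freq, sampling_freq, start_phase)
    ω = 2f0 * Float32(π) * Float32(carrier_freq / sampling_freq)
    step = ComplexF32(cos(ω), -sin(ω))       # per-sample rotation (conjugate carrier)
    phasor = ComplexF32(cospi(2f0 * Float32(start_phase)),
                        -sinpi(2f0 * Float32(start_phase)))
    acc = ComplexF32(0, 0)                   # FP32 accumulator in the inner loop
    for s in signal
        acc += ComplexF32(s) * phasor        # carrier wipe-off + accumulate
        phasor *= step                       # advance the phase: one complex multiply
    end
    return ComplexF64(acc)                   # promote to FP64 only at the end
end
```

In a real kernel the phasor accumulates FP32 rounding error, so the sample count per launch must stay short enough (or the phasor be renormalized periodically) for the drift to be negligible; the PR promotes to FP64 at the reduction step.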
### v5: Combined-tap reduction

For ≤8 correlator taps (the common case), replaced the sequential per-tap reduction (num_taps × 8 barriers) with a single combined pass that stores all taps in shared memory simultaneously (8 barriers total). ~22% kernel speedup for EarlyPromptLate (3 taps).

### v6: Subcarrier LUT + tap phase hoisting

Precomputed the subcarrier values (BOC/CBOC) into a lookup table indexed by sub-chip phase, and hoisted the per-tap code phase offsets out of the inner loop. Passing the LUT size as a compile-time Val eliminates dead subcarrier code for LOC signals (GPSL1). GalileoE1B CBOC: 214→127μs (1.69x).

### v7: CPU-generated code replicas (dead end)

Tried pre-generating code replicas on the CPU via gen_code_replica! and uploading them per call. The kernel became trivially fast (pure indexed reads, no floor/mod): GPSL1 72→39μs, GalileoE1B 127→55μs. But the PCIe transfer of the code replicas dominated: 175μs for E1B 16sat (47% of total time), plus 79μs for CPU generation (21%), 68% overhead together. This approach only won at very high satellite counts.

### v8: Fixed-point LUT on GPU (the winner)

Returned to GPU-resident code tables with the subcarrier baked into expanded LUTs (e.g., E1B: 4092 chips × 12 sub-per-chip = 49104 entries/PRN), but replaced v6's expensive float floor+mod code lookup with fixed-point integer arithmetic.

The key insight: encode the code phase as a fixed-point integer (fractional bits after the radix point). The code chip index is phase >> fractional_bits, and the sub-chip index for the expanded LUT comes naturally from the integer phase. No floor(), no float-to-int conversion.

Two modes via a parametric phase type:

- Int32 (18 fractional bits): fast, ~4% sub-chip quantization error. The maximum expanded phase (~1.6B) fits in Int32.
- Int64 (32 fractional bits): the default. Zero quantization errors across 100K samples (verified against a BigFloat reference). Eliminates the 3-in-100K errors that even the CPU accumulator approach produces.
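The fixed-point lookup in v8 can be sketched like this (illustrative, not the PR's kernel code; names and the phase unit are assumptions for the example). Here the phase is tracked in units of expanded-LUT entries (sub-chips) as an Int64 with 32 fractional bits, so the LUT index is a plain right shift:

```julia
# Sketch of v8's fixed-point code-phase lookup. The phase counts
# expanded-LUT entries (sub-chips) with 32 fractional bits; the integer
# part is the LUT index, so there is no floor() and no float→int
# conversion in the hot loop.
const FRAC_BITS = 32
const FRAC_ONE  = Int64(1) << FRAC_BITS

to_fixed(x::Float64) = round(Int64, x * FRAC_ONE)

# 1-based index into an expanded LUT of `lut_len` entries
lut_index(phase::Int64, lut_len::Int) = mod(phase >> FRAC_BITS, lut_len) + 1

lut_len = 4092 * 12                 # E1B: 4092 chips × 12 sub-per-chip
phase   = to_fixed(0.9)             # 0.9 of the first sub-chip: still entry 1
@assert lut_index(phase, lut_len) == 1
phase += to_fixed(Float64(lut_len)) # one full code period later: wraps around
@assert lut_index(phase, lut_len) == 1
```

The Int32 mode works the same way with 18 fractional bits, trading sub-chip resolution for cheaper arithmetic.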
### v9: Accumulate+wrap (final optimization)

Profiling showed the kernel was 86% of total time, and within the kernel, Int64 mod (emulated as ~20 ALU instructions on AMD GPUs) was the bottleneck. Microbenchmarked four alternatives:

- mod(Int64): 28.4μs (baseline)
- conditional subtract: 18.7μs
- mod(Int32) truncated: 22.5μs
- accumulate+wrap: 15.1μs (1.88x faster)

The accumulate+wrap pattern tracks the code phase as a running accumulator: it is initialized once with mod(), then advanced by delta × stride per grid step with a branchless conditional subtract for wrapping. Per-tap offsets use a branchless wrap for negative phases (the early correlator).

## Final benchmark results (AMD Radeon 8060S, RDNA 4)

| Config | GPU (μs) | CPU (μs) | GPU/CPU |
|------------------|----------|----------|---------|
| L1 4sat/5K | 34.9 | 9.0 | 3.88x |
| L1 16sat/5K | 36.9 | 36.7 | 1.01x |
| E1B 4sat/25K | 58.3 | 47.6 | 1.22x |
| E1B 16sat/25K | 61.3 | 187.8 | 0.33x |
| E1B 4sat/100K | 129.7 | 228.2 | 0.57x |
| E1B 16sat/100K | 131.2 | 919.8 | 0.14x |
| 4L1+4E1B/25K | 86.3 | 84.5 | 1.02x |
| 8L1+8E1B/25K | 87.9 | 167.8 | 0.52x |
| 8L1+8E1B/100K | 157.5 | 515.1 | 0.31x |

GPU/CPU < 1.0 means the GPU is faster. The GPU wins decisively for E1B with ≥16 satellites (3-7x faster) and for multi-system configurations (2-3x faster). L1 with few satellites still favors the CPU due to the ~35μs fixed GPU launch overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
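The v9 accumulate+wrap update can be sketched as follows (illustrative serial version, not the PR's kernel; a grid-stride kernel would advance by delta × stride instead of delta). mod() runs once at initialization; each step then wraps with a branchless conditional subtract, which is valid because the per-step advance is smaller than the code period:

```julia
# Sketch of v9's accumulate+wrap phase update. mod() on Int64 costs
# ~20 ALU instructions on AMD GPUs, so it is called exactly once; the
# loop wraps with an ifelse (compare + select, no branch) instead.
const FRAC_BITS = 32

function sum_code(lut::Vector{Int8}, code_len_fixed::Int64,
                  phase0::Int64, delta::Int64, nsamples::Int)
    phase = mod(phase0, code_len_fixed)   # the only mod() in the function
    acc = 0
    for _ in 1:nsamples
        acc += lut[(phase >> FRAC_BITS) + 1]
        phase += delta                    # advance by one sample's phase delta
        # branchless wrap: subtract the period iff we ran past it
        phase -= ifelse(phase >= code_len_fixed, code_len_fixed, Int64(0))
    end
    return acc
end

lut = Int8[1, 2, 3, 4]
one_chip = Int64(1) << FRAC_BITS
# 8 samples at one chip per sample walk the 4-chip code twice: 2 × (1+2+3+4)
@assert sum_code(lut, 4one_chip, Int64(0), one_chip, 8) == 20
```

The same ifelse trick handles the per-tap early-correlator offsets, which can push the phase negative: add the period iff the offset phase went below zero.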
Benchmark Results (Julia v1): Time benchmarks
Memory benchmarks
Compares the old CUDA extension (texture memory), the new KA implementation (Int32 and Int64 modes), and CPU across GPSL1, GalileoE1B, and multi-system configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detects available backends (CUDA, AMDGPU) at runtime and benchmarks KernelAbstractions.jl (Int32/Int64) alongside CPU. Also benchmarks the old CUDA texture memory extension when available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split benchmarks.jl into:

- bench_cpu.jl: CPU downconvert-and-correlate + track suite
- bench_gpu_vs_cpu.jl: GPU (CUDA-ext, KA+CUDA, KA+AMDGPU) vs CPU suite

benchmarks.jl now includes both and merges their suites. The GPU benchmark gracefully skips KA when it is not available (e.g., on master). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The BenchmarkTools UUID was wrong (a copy-paste error), causing Pkg.instantiate() to fail on CI. Also fix BenchmarkGroup composition in benchmarks.jl: merge! is not supported for BenchmarkGroup, so the suites are combined by iteration instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
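Combining suites by iteration can be sketched like this (a minimal example; the keys and benchmark bodies are illustrative, not the repository's actual suites). BenchmarkGroup follows the Dict interface, so iterating it yields key => value pairs:

```julia
# Sketch: compose two BenchmarkGroup suites without merge!, which
# BenchmarkGroup does not define. Keys/benchmarks are illustrative.
using BenchmarkTools

cpu = BenchmarkGroup()
cpu["correlate"] = @benchmarkable sum(abs2, x) setup = (x = rand(ComplexF32, 1000))

gpu = BenchmarkGroup()
gpu["correlate"] = @benchmarkable sum(abs2, x) setup = (x = rand(ComplexF32, 1000))

suite = BenchmarkGroup()
for (prefix, sub) in (("cpu", cpu), ("gpu", gpu))
    for (k, v) in sub                 # BenchmarkGroup iterates like a Dict
        suite["$prefix/$k"] = v
    end
end
@assert Set(keys(suite)) == Set(["cpu/correlate", "gpu/correlate"])
```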
Int64 with accumulate+wrap is both faster and more accurate than Int32: - E1B 8sat/25K: 60μs (Int64) vs 85μs (Int32) - E1B 8sat/100K: 132μs (Int64) vs 236μs (Int32) Removes ~400 lines: Int32 kernels, param packing, phase_type kwarg, and associated tests/benchmarks. The struct drops the P type parameter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge ka_dc_kernel! and ka_dc_multi_ant_kernel! into a single kernel that takes num_ants as a parameter. For single-antenna (num_ants=1), signal[i, 1] works for both vectors and matrices in Julia. Removes ~170 lines of duplication with zero performance regression (benchmarked on AMD Radeon 8060S). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
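The trailing-index trick the merged kernel relies on is standard Julia behavior: indices of 1 beyond an array's dimensionality are allowed, so the same kernel body addresses a Vector (single antenna) and a Matrix column (multi-antenna) alike. A tiny illustration (the arrays here are made up for the example):

```julia
# Julia permits trailing singleton indices, so signal[i, 1] works for
# both a Vector (single antenna) and a Matrix (antennas in columns).
v = ComplexF32[1 + 2im, 3 + 4im]     # single antenna: Vector
m = ComplexF32[1 5; 3 7]             # two antennas: 2×2 Matrix
@assert v[2, 1] === v[2]             # trailing 1 on a Vector is legal
@assert m[2, 1] === ComplexF32(3)    # same expression picks column 1
```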
Verify that KADownconvertAndCorrelator (CPU backend) converges to correct code phase and carrier phase over 2000 tracking iterations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GPUDownconvertAndCorrelator requires homogeneous NTuple type, so multi-system (GPSL1+GalileoE1B) benchmarks can't use the old extension. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
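The constraint comes from NTuple itself: NTuple{N,T} is homogeneous, and a GPSL1 and a GalileoE1B system are values of different types, so they cannot share one NTuple-typed field. A quick illustration (the struct field itself is from the old extension; the tuples below are just examples):

```julia
# NTuple{N,T} requires a single element type T; mixed-type tuples are
# only instances of the heterogeneous Tuple{...} types.
@assert (1, 2, 3) isa NTuple{3, Int}
@assert !((1, 2.0) isa NTuple{2, Int})
@assert (1, 2.0) isa Tuple{Int, Float64}   # heterogeneous, not an NTuple
```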
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #99 +/- ##
===========================================
+ Coverage 80.42% 90.45% +10.03%
===========================================
Files 23 22 -1
Lines 664 807 +143
===========================================
+ Hits 534 730 +196
+ Misses 130 77 -53
cc86654 to 6ab143f
…enchmarks

Replace the texture-memory CUDA extension (TrackingCUDAExt) with KernelAbstractions.jl-based GPU support throughout. Add CUDA-conditional tests for downconvert_and_correlate and tracking. Update Buildkite to run KA+CUDA tests and GPU benchmarks with annotated results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6ab143f to 27a042a
CUDA (NVIDIA A100-PCIE-40GB MIG 1g.5gb)
Test plan
🤖 Generated with Claude Code