feat: GPU-accelerated downconvert and correlate via KernelAbstractions.jl #99
Open
Conversation
Add a portable GPU implementation of the GNSS downconvert-and-correlate pipeline using KernelAbstractions.jl. The kernel fuses carrier wipe-off, code lookup, and correlation into a single pass with in-kernel workgroup reduction via shared memory.

## The journey

### v1: Naive GPU kernel (baseline)

Started with a straightforward port: per-sample sincos() for the carrier, Float64 accumulators, and per-thread partial sums transferred back to the CPU for reduction. This was slower than the CPU for all configurations due to the massive GPU→CPU transfer of partial arrays and the expensive FP64 sincos on the GPU.

### v2: In-kernel reduction + ComplexF64 results

Replaced the CPU-side reduction with an in-kernel tree reduction using shared memory (@localmem). Only a compact ComplexF64 result array (sats × ants × taps) is transferred back. Combined the separate Float64 re/im arrays into a single ComplexF64 array, halving the number of GPU→CPU copies. Also introduced Val{num_taps} for compile-time kernel specialization, letting @private allocate exact-sized accumulators and enabling loop unrolling.

### v3: Cross-system batching + per-system kernel launches

Added support for batching multiple GNSS systems (e.g., GPSL1 + GalileoE1B) into the same struct. Initially tried a single kernel with a tuple of code tables and a per-satellite system_idx, but the @generated dispatch overhead caused a 25-34% regression. Switched to per-system kernel launches, each specialized at compile time for modulation type (LOC/BOC/CBOC), code length, and num_taps. This recovered the regression and added GalileoE1B (CBOC) support.

### v4: Carrier rotation + FP32 accumulation

Replaced the per-sample sincos() with incremental carrier rotation using FP32 multiply-adds (a Givens rotation). FP32 accumulators are used in the inner loop, promoting to FP64 only at the final reduction step. On RDNA 4 (Radeon 8060S): ~1.7x kernel speedup at 25K samples.
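The incremental carrier rotation in v4 can be sketched as follows. This is a minimal CPU-side illustration, not the PR's kernel code; the function name and signature are made up for the example. The idea is to replace a sincos() call per sample with one complex multiply (two FMAs per component) that advances a running FP32 phasor:

```julia
# Sketch of v4's incremental carrier rotation (illustrative, not the PR's
# actual kernel). A running ComplexF32 phasor is advanced by a fixed
# per-sample rotation step instead of calling sincos() every sample.
function carrier_wipeoff_rotation(signal::AbstractVector{<:Complex},
                                  carrier_freq, sampling_freq, start_phase)
    ω = 2f0 * Float32(π) * Float32(carrier_freq / sampling_freq)
    step = ComplexF32(cos(ω), -sin(ω))       # per-sample rotation (conjugate carrier)
    phasor = ComplexF32(cospi(2f0 * Float32(start_phase)),
                        -sinpi(2f0 * Float32(start_phase)))
    acc = ComplexF32(0, 0)                   # FP32 accumulator in the inner loop
    for s in signal
        acc += ComplexF32(s) * phasor        # carrier wipe-off + accumulate
        phasor *= step                       # advance the phase: one complex multiply
    end
    return ComplexF64(acc)                   # promote to FP64 only at the end
end
```

In a real kernel the phasor accumulates FP32 rounding error, so the sample count per launch must stay short enough (or the phasor be renormalized periodically) for the drift to be negligible; the PR promotes to FP64 at the reduction step.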
### v5: Combined-tap reduction

For ≤8 correlator taps (the common case), replaced the sequential per-tap reduction (num_taps × 8 barriers) with a single combined pass that stores all taps in shared memory simultaneously (8 barriers total). ~22% kernel speedup for EarlyPromptLate (3 taps).

### v6: Subcarrier LUT + tap phase hoisting

Precomputed the subcarrier values (BOC/CBOC) into a lookup table indexed by sub-chip phase, and hoisted the per-tap code phase offsets out of the inner loop. Passing the LUT size as a compile-time Val eliminates dead subcarrier code for LOC signals (GPSL1). GalileoE1B CBOC: 214→127μs (1.69x).

### v7: CPU-generated code replicas (dead end)

Tried pre-generating code replicas on the CPU via gen_code_replica! and uploading them per call. The kernel became trivially fast (pure indexed reads, no floor/mod): GPSL1 72→39μs, GalileoE1B 127→55μs. But the PCIe transfer of the code replicas dominated: 175μs for E1B 16sat (47% of total time), plus 79μs for CPU generation (21%), 68% overhead together. This approach only won at very high satellite counts.

### v8: Fixed-point LUT on GPU (the winner)

Returned to GPU-resident code tables with the subcarrier baked into expanded LUTs (e.g., E1B: 4092 chips × 12 sub-per-chip = 49104 entries/PRN), but replaced v6's expensive float floor+mod code lookup with fixed-point integer arithmetic.

The key insight: encode the code phase as a fixed-point integer (fractional bits after the radix point). The code chip index is phase >> fractional_bits, and the sub-chip index for the expanded LUT comes naturally from the integer phase. No floor(), no float-to-int conversion.

Two modes via a parametric phase type:

- Int32 (18 fractional bits): fast, ~4% sub-chip quantization error. The maximum expanded phase (~1.6B) fits in Int32.
- Int64 (32 fractional bits): the default. Zero quantization errors across 100K samples (verified against a BigFloat reference). Eliminates the 3-in-100K errors that even the CPU accumulator approach produces.
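The fixed-point lookup in v8 can be sketched like this (illustrative, not the PR's kernel code; names and the phase unit are assumptions for the example). Here the phase is tracked in units of expanded-LUT entries (sub-chips) as an Int64 with 32 fractional bits, so the LUT index is a plain right shift:

```julia
# Sketch of v8's fixed-point code-phase lookup. The phase counts
# expanded-LUT entries (sub-chips) with 32 fractional bits; the integer
# part is the LUT index, so there is no floor() and no float→int
# conversion in the hot loop.
const FRAC_BITS = 32
const FRAC_ONE  = Int64(1) << FRAC_BITS

to_fixed(x::Float64) = round(Int64, x * FRAC_ONE)

# 1-based index into an expanded LUT of `lut_len` entries
lut_index(phase::Int64, lut_len::Int) = mod(phase >> FRAC_BITS, lut_len) + 1

lut_len = 4092 * 12                 # E1B: 4092 chips × 12 sub-per-chip
phase   = to_fixed(0.9)             # 0.9 of the first sub-chip: still entry 1
@assert lut_index(phase, lut_len) == 1
phase += to_fixed(Float64(lut_len)) # one full code period later: wraps around
@assert lut_index(phase, lut_len) == 1
```

The Int32 mode works the same way with 18 fractional bits, trading sub-chip resolution for cheaper arithmetic.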
### v9: Accumulate+wrap (final optimization)

Profiling showed the kernel was 86% of total time, and within the kernel, Int64 mod (emulated as ~20 ALU instructions on AMD GPUs) was the bottleneck. Microbenchmarked four alternatives:

- mod(Int64): 28.4μs (baseline)
- conditional subtract: 18.7μs
- mod(Int32) truncated: 22.5μs
- accumulate+wrap: 15.1μs (1.88x faster)

The accumulate+wrap pattern tracks the code phase as a running accumulator: it is initialized once with mod(), then advanced by delta × stride per grid step with a branchless conditional subtract for wrapping. Per-tap offsets use a branchless wrap for negative phases (the early correlator).

## Final benchmark results (AMD Radeon 8060S, RDNA 4)

| Config | GPU (μs) | CPU (μs) | GPU/CPU |
|------------------|----------|----------|---------|
| L1 4sat/5K | 34.9 | 9.0 | 3.88x |
| L1 16sat/5K | 36.9 | 36.7 | 1.01x |
| E1B 4sat/25K | 58.3 | 47.6 | 1.22x |
| E1B 16sat/25K | 61.3 | 187.8 | 0.33x |
| E1B 4sat/100K | 129.7 | 228.2 | 0.57x |
| E1B 16sat/100K | 131.2 | 919.8 | 0.14x |
| 4L1+4E1B/25K | 86.3 | 84.5 | 1.02x |
| 8L1+8E1B/25K | 87.9 | 167.8 | 0.52x |
| 8L1+8E1B/100K | 157.5 | 515.1 | 0.31x |

GPU/CPU < 1.0 means the GPU is faster. The GPU wins decisively for E1B with ≥16 satellites (3-7x faster) and for multi-system configurations (2-3x faster). L1 with few satellites still favors the CPU due to the ~35μs fixed GPU launch overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
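The v9 accumulate+wrap update can be sketched as follows (illustrative serial version, not the PR's kernel; a grid-stride kernel would advance by delta × stride instead of delta). mod() runs once at initialization; each step then wraps with a branchless conditional subtract, which is valid because the per-step advance is smaller than the code period:

```julia
# Sketch of v9's accumulate+wrap phase update. mod() on Int64 costs
# ~20 ALU instructions on AMD GPUs, so it is called exactly once; the
# loop wraps with an ifelse (compare + select, no branch) instead.
const FRAC_BITS = 32

function sum_code(lut::Vector{Int8}, code_len_fixed::Int64,
                  phase0::Int64, delta::Int64, nsamples::Int)
    phase = mod(phase0, code_len_fixed)   # the only mod() in the function
    acc = 0
    for _ in 1:nsamples
        acc += lut[(phase >> FRAC_BITS) + 1]
        phase += delta                    # advance by one sample's phase delta
        # branchless wrap: subtract the period iff we ran past it
        phase -= ifelse(phase >= code_len_fixed, code_len_fixed, Int64(0))
    end
    return acc
end

lut = Int8[1, 2, 3, 4]
one_chip = Int64(1) << FRAC_BITS
# 8 samples at one chip per sample walk the 4-chip code twice: 2 × (1+2+3+4)
@assert sum_code(lut, 4one_chip, Int64(0), one_chip, 8) == 20
```

The same ifelse trick handles the per-tap early-correlator offsets, which can push the phase negative: add the period iff the offset phase went below zero.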
Benchmark Results (Julia v1): Time benchmarks
Memory benchmarks
Compares the old CUDA extension (texture memory), the new KA implementation (Int32 and Int64 modes), and CPU across GPSL1, GalileoE1B, and multi-system configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detects available backends (CUDA, AMDGPU) at runtime and benchmarks KernelAbstractions.jl (Int32/Int64) alongside CPU. Also benchmarks the old CUDA texture memory extension when available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split benchmarks.jl into:

- bench_cpu.jl: CPU downconvert-and-correlate + track suite
- bench_gpu_vs_cpu.jl: GPU (CUDA-ext, KA+CUDA, KA+AMDGPU) vs CPU suite

benchmarks.jl now includes both and merges their suites. The GPU benchmark gracefully skips KA when it is not available (e.g., on master). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The BenchmarkTools UUID was wrong (a copy-paste error), causing Pkg.instantiate() to fail on CI. Also fix BenchmarkGroup composition in benchmarks.jl: merge! is not supported for BenchmarkGroup, so the suites are combined by iteration instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
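Combining suites by iteration can be sketched like this (a minimal example; the keys and benchmark bodies are illustrative, not the repository's actual suites). BenchmarkGroup follows the Dict interface, so iterating it yields key => value pairs:

```julia
# Sketch: compose two BenchmarkGroup suites without merge!, which
# BenchmarkGroup does not define. Keys/benchmarks are illustrative.
using BenchmarkTools

cpu = BenchmarkGroup()
cpu["correlate"] = @benchmarkable sum(abs2, x) setup = (x = rand(ComplexF32, 1000))

gpu = BenchmarkGroup()
gpu["correlate"] = @benchmarkable sum(abs2, x) setup = (x = rand(ComplexF32, 1000))

suite = BenchmarkGroup()
for (prefix, sub) in (("cpu", cpu), ("gpu", gpu))
    for (k, v) in sub                 # BenchmarkGroup iterates like a Dict
        suite["$prefix/$k"] = v
    end
end
@assert Set(keys(suite)) == Set(["cpu/correlate", "gpu/correlate"])
```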
Int64 with accumulate+wrap is both faster and more accurate than Int32: - E1B 8sat/25K: 60μs (Int64) vs 85μs (Int32) - E1B 8sat/100K: 132μs (Int64) vs 236μs (Int32) Removes ~400 lines: Int32 kernels, param packing, phase_type kwarg, and associated tests/benchmarks. The struct drops the P type parameter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge ka_dc_kernel! and ka_dc_multi_ant_kernel! into a single kernel that takes num_ants as a parameter. For single-antenna (num_ants=1), signal[i, 1] works for both vectors and matrices in Julia. Removes ~170 lines of duplication with zero performance regression (benchmarked on AMD Radeon 8060S). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
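The trailing-index trick the merged kernel relies on is standard Julia behavior: indices of 1 beyond an array's dimensionality are allowed, so the same kernel body addresses a Vector (single antenna) and a Matrix column (multi-antenna) alike. A tiny illustration (the arrays here are made up for the example):

```julia
# Julia permits trailing singleton indices, so signal[i, 1] works for
# both a Vector (single antenna) and a Matrix (antennas in columns).
v = ComplexF32[1 + 2im, 3 + 4im]     # single antenna: Vector
m = ComplexF32[1 5; 3 7]             # two antennas: 2×2 Matrix
@assert v[2, 1] === v[2]             # trailing 1 on a Vector is legal
@assert m[2, 1] === ComplexF32(3)    # same expression picks column 1
```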
Verify that KADownconvertAndCorrelator (CPU backend) converges to correct code phase and carrier phase over 2000 tracking iterations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GPUDownconvertAndCorrelator requires homogeneous NTuple type, so multi-system (GPSL1+GalileoE1B) benchmarks can't use the old extension. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
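The constraint comes from NTuple itself: NTuple{N,T} is homogeneous, and a GPSL1 and a GalileoE1B system are values of different types, so they cannot share one NTuple-typed field. A quick illustration (the struct field itself is from the old extension; the tuples below are just examples):

```julia
# NTuple{N,T} requires a single element type T; mixed-type tuples are
# only instances of the heterogeneous Tuple{...} types.
@assert (1, 2, 3) isa NTuple{3, Int}
@assert !((1, 2.0) isa NTuple{2, Int})
@assert (1, 2.0) isa Tuple{Int, Float64}   # heterogeneous, not an NTuple
```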
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #99 +/- ##
===========================================
+ Coverage 80.42% 90.45% +10.03%
===========================================
Files 23 22 -1
Lines 664 807 +143
===========================================
+ Hits 534 730 +196
+ Misses 130 77 -53
cc86654 to 6ab143f
…enchmarks

Replace the texture-memory CUDA extension (TrackingCUDAExt) with KernelAbstractions.jl-based GPU support throughout. Add CUDA-conditional tests for downconvert_and_correlate and tracking. Update Buildkite to run KA+CUDA tests and GPU benchmarks with annotated results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6ab143f to 27a042a
CUDA (NVIDIA A100-PCIE-40GB MIG 1g.5gb)
Test plan
🤖 Generated with Claude Code