
feat: GPU-accelerated downconvert and correlate via KernelAbstractions.jl #99

Open
zsoerenm wants to merge 10 commits into master from
ss/ka-downconvert-and-correlate

Conversation


zsoerenm (Member) commented Mar 5, 2026

## Summary

Add a portable GPU implementation of the GNSS downconvert-and-correlate pipeline using KernelAbstractions.jl. The kernel fuses carrier wipe-off, code lookup, and correlation into a single pass with in-kernel workgroup reduction via shared memory.

## The journey

### v1: Naive GPU kernel (baseline)

Started with a straightforward port: per-sample sincos() for carrier, Float64 accumulators, per-thread partial sums transferred back to CPU for reduction. This was slower than CPU for all configurations due to the massive GPU→CPU transfer of partial arrays and expensive FP64 sincos on GPU.

### v2: In-kernel reduction + ComplexF64 results

Replaced CPU-side reduction with an in-kernel tree reduction using shared memory (`@localmem`). Only a compact ComplexF64 result array (sats × ants × taps) is transferred back. Combined the separate Float64 re/im arrays into a single ComplexF64 array, halving the number of GPU→CPU copies. Also introduced Val{num_taps} for compile-time kernel specialization, letting `@private` allocate exact-sized accumulators and enabling loop unrolling.
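The tree-reduction pattern can be illustrated with a plain-Julia sketch (hypothetical function name; the real kernel holds the partials in `@localmem` and places a `@synchronize` barrier between strides):

```julia
# CPU sketch of an in-kernel tree reduction. Each "thread" owns one partial
# sum in shared memory; at every step the active half adds the other half's
# value, halving the stride until lane 1 holds the workgroup total.
function tree_reduce!(shared::Vector{ComplexF64})
    n = length(shared)                 # must be a power of two
    stride = n ÷ 2
    while stride >= 1
        for tid in 1:stride            # parallel on GPU, followed by a barrier
            shared[tid] += shared[tid + stride]
        end
        stride ÷= 2
    end
    return shared[1]                   # lane 1 holds the total
end
```

This keeps the full reduction on-device, so only the final per-(sat, ant, tap) value crosses PCIe.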

### v3: Cross-system batching + per-system kernel launches

Added support for batching multiple GNSS systems (e.g., GPSL1 + GalileoE1B) into the same struct. Initially tried a single kernel with a tuple of code tables and per-satellite system_idx — but the @generated dispatch overhead caused a 25-34% regression. Switched to per-system kernel launches, each specialized at compile time for modulation type (LOC/BOC/CBOC), code length, and num_taps. This recovered the regression and added GalileoE1B (CBOC) support.

### v4: Carrier rotation + FP32 accumulation

Replaced per-sample sincos() with incremental carrier rotation using FP32 multiply-add (Givens rotation). FP32 accumulators during the inner loop, promoting to FP64 only at the final reduction step. On RDNA 4 (Radeon 8060S): ~1.7x kernel speedup at 25K samples.
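The rotation trick can be sketched in plain Julia (illustrative names, not the PR's kernel code): a unit phasor is advanced by one constant complex multiply per sample instead of a sincos call.

```julia
# Incremental carrier wipe-off: advance a unit phasor by a constant rotation
# (a Givens rotation in FP32) rather than evaluating sincos per sample.
function carrier_sum_rotation(n, ω::Float32)
    rot = ComplexF32(cos(ω), -sin(ω))   # per-sample rotation step
    phasor = ComplexF32(1, 0)           # starts at phase 0
    acc = ComplexF32(0, 0)              # FP32 accumulator in the inner loop
    for _ in 1:n
        acc += phasor
        phasor *= rot
    end
    return ComplexF64(acc)              # promote to FP64 only at the end
end
```

One caveat of this approach is that the FP32 phasor slowly drifts off the unit circle; keeping the per-launch sample count bounded (or re-normalizing periodically) keeps the error negligible.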

### v5: Combined-tap reduction

For ≤8 correlator taps (the common case), replaced sequential per-tap reduction (num_taps × 8 barriers) with a single combined pass storing all taps in shared memory simultaneously (8 barriers total). ~22% kernel speedup for EarlyPromptLate (3 taps).
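The layout change can be sketched on the CPU (hypothetical names): shared memory holds all taps per thread, so a single tree pass (log2 of the workgroup size barriers) reduces every tap at once.

```julia
# Combined-tap reduction sketch: shared memory laid out as taps × threads,
# reduced with one tree pass instead of one pass per tap.
function combined_tap_reduce!(shared::Matrix{ComplexF64})
    ntaps, nthreads = size(shared)     # nthreads must be a power of two
    stride = nthreads ÷ 2
    while stride >= 1
        for tid in 1:stride, t in 1:ntaps   # parallel on GPU, one barrier per stride
            shared[t, tid] += shared[t, tid + stride]
        end
        stride ÷= 2
    end
    return shared[:, 1]                # column 1 holds all tap totals
end
```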

### v6: Subcarrier LUT + tap phase hoisting

Precomputed subcarrier values (BOC/CBOC) into a lookup table indexed by sub-chip phase. Hoisted per-tap code phase offsets out of the inner loop. LUT size as compile-time Val eliminates dead subcarrier code for LOC signals (GPSL1). GalileoE1B CBOC: 214→127μs (1.69x).
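As an illustration of the LUT idea, here is a minimal sketch for a plain BOC(1,1)-style square-wave subcarrier (the PR's CBOC table for E1B additionally folds in the weighted BOC(6,1) component):

```julia
# Precompute the subcarrier value for each sub-chip slot so the kernel does a
# table read instead of evaluating the subcarrier function per sample.
# BOC(1,1)-style square wave: +1 for the first half chip, -1 for the second.
subcarrier_lut(n_sub::Int) = Float32[ifelse(s <= n_sub ÷ 2, 1, -1) for s in 1:n_sub]
```

A read like `lut[sub_chip_index + 1]` then replaces the per-sample sign computation in the inner loop.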

### v7: CPU-generated code replicas (dead end)

Tried pre-generating code replicas on CPU via gen_code_replica! and uploading per call. The kernel became trivially fast (pure indexed reads, no floor/mod) — GPSL1 72→39μs, GalE1B 127→55μs — but the PCIe transfer of code replicas dominated: 175μs for E1B 16sat (47% of total time), plus 79μs for CPU generation (21%). Together: 68% overhead. This approach only won at very high satellite counts.

### v8: Fixed-point LUT on GPU (the winner)

Returned to GPU-resident code tables with subcarrier baked into expanded LUTs (e.g., E1B: 4092 chips × 12 sub-per-chip = 49104 entries/PRN), but replaced the expensive float floor+mod code lookup from v6 with fixed-point integer arithmetic.

The key insight: encode code phase as a fixed-point integer (fractional bits after the radix point). Code chip index = phase >> fractional_bits. Sub-chip index for the expanded LUT comes naturally from the integer phase. No floor(), no float-to-int conversion.

Two modes via parametric phase type:

- Int32 (18 fractional bits): fast, ~4% sub-chip quantization error. Max expanded phase ~1.6B fits in Int32.
- Int64 (32 fractional bits): default. Zero quantization errors across 100K samples (verified against a BigFloat reference). Eliminates the 3/100K errors that even the CPU accumulator approach produces.
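A minimal sketch of the fixed-point encoding, with hypothetical helper names (`to_fixed`, `chip_index`, and `expanded_index` are illustrative, not the PR's API):

```julia
const FRAC = 32                                   # fractional bits (Int64 mode)

# Encode a floating-point code phase (in chips) as a fixed-point integer.
to_fixed(chips::Float64) = round(Int64, chips * 2.0^FRAC)

# The integer chip index is a single right shift: no floor(), no float→int.
chip_index(phase::Int64) = phase >> FRAC

# Index into an expanded LUT with `sub` entries per chip. sub * phase still
# fits in Int64 for E1B-sized phases (4092 chips × 2^32 × 12 ≈ 2.1e14).
expanded_index(phase::Int64, sub::Int) = (Int64(sub) * phase) >> FRAC
```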

### v9: Accumulate+wrap (final optimization)

Profiling showed the kernel was 86% of total time, and within the kernel, Int64 mod (emulated as ~20 ALU instructions on AMD GPU) was the bottleneck. Microbenchmarked four alternatives:

- mod(Int64): 28.4μs (baseline)
- conditional subtract: 18.7μs
- mod(Int32) truncated: 22.5μs
- accumulate+wrap: 15.1μs (1.88x faster)

The accumulate+wrap pattern tracks code phase as a running accumulator, initialized once with mod(), then advanced by delta×stride per grid step with a branchless conditional subtract for wrapping. Per-tap offsets use a branchless wrap for negative phases (early correlator).
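The per-step advance can be sketched as follows (hypothetical function name); correctness requires delta < period so that a single conditional subtract replaces the mod:

```julia
# Accumulate+wrap: the phase accumulator is initialized once with mod();
# each grid step then adds a fixed delta and wraps with a branchless
# conditional subtract instead of an Int64 mod.
@inline function advance_wrap(phase::Int64, delta::Int64, period::Int64)
    phase += delta
    return phase - ifelse(phase >= period, period, Int64(0))
end
```

The `ifelse` compiles to a select instruction rather than a branch, which keeps GPU lanes in lockstep.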

## Final benchmark results (AMD Radeon 8060S, RDNA 4)

| Config           | GPU (μs) | CPU (μs) | GPU/CPU |
|------------------|----------|----------|---------|
| L1 4sat/5K       |     34.9 |      9.0 |   3.88x |
| L1 16sat/5K      |     36.9 |     36.7 |   1.01x |
| E1B 4sat/25K     |     58.3 |     47.6 |   1.22x |
| E1B 16sat/25K    |     61.3 |    187.8 |   0.33x |
| E1B 4sat/100K    |    129.7 |    228.2 |   0.57x |
| E1B 16sat/100K   |    131.2 |    919.8 |   0.14x |
| 4L1+4E1B/25K     |     86.3 |     84.5 |   1.02x |
| 8L1+8E1B/25K     |     87.9 |    167.8 |   0.52x |
| 8L1+8E1B/100K    |    157.5 |    515.1 |   0.31x |

GPU/CPU < 1.0 means GPU is faster. The GPU wins decisively for E1B with ≥16 satellites (3-7x faster) and for multi-system configurations (2-3x faster). L1 with few satellites still favors CPU due to the ~35μs fixed GPU launch overhead.

### CUDA (NVIDIA A100-PCIE-40GB MIG 1g.5gb)

| Config        | CPU (median) | KA-CUDA (median) | Speedup |
|---------------|--------------|------------------|---------|
| 8L1+8E1B/25K  | 727.825 μs   | 143.529 μs       | 5.1x    |
| E1B 8sat/100K | 1.575 ms     | 199.019 μs       | 7.9x    |
| E1B 8sat/25K  | 399.538 μs   | 99.840 μs        | 4.0x    |
| L1 1sat/5K    | 8.869 μs     | 65.569 μs        | 0.1x    |
| L1 8sat/5K    | 70.500 μs    | 69.470 μs        | 1.0x    |

## Test plan

- All existing tests pass (CPU backend via KernelAbstractions)
- GPSL1 KA vs CPU: atol=25
- GalileoE1B KA vs CPU: Int64 (default) rtol=0.01, Int32 rtol=0.1
- Multi-system GPSL1+GalileoE1B: KA vs CPU
- Manual GPU benchmark on AMD hardware

🤖 Generated with Claude Code


github-actions bot commented Mar 5, 2026

Benchmark Results (Julia v1)

Time benchmarks

| Benchmark | master | 27a042a... | master / 27a042a... |
|---|---|---|---|
| 8L1+8E1B/25K/CPU | 0.64 ± 0.012 ms | 0.642 ± 0.022 ms | 0.997 ± 0.039 |
| E1B 8sat/100K/CPU | 1.81 ± 0.037 ms | 1.77 ± 0.014 ms | 1.02 ± 0.023 |
| E1B 8sat/25K/CPU | 0.369 ± 0.013 ms | 0.363 ± 0.0084 ms | 1.02 ± 0.044 |
| L1 1sat/5K/CPU | 7.02 ± 0.099 μs | 7.12 ± 0.13 μs | 0.985 ± 0.023 |
| L1 8sat/5K/CPU | 0.0554 ± 0.00072 ms | 0.0559 ± 0.002 ms | 0.99 ± 0.038 |
| downconvert and correlate/CPU/Float32 | 2.88 ± 0.12 μs | 2.87 ± 0.058 μs | 1 ± 0.046 |
| downconvert and correlate/CPU/Float64 | 3.12 ± 0.12 μs | 3.07 ± 0.059 μs | 1.02 ± 0.043 |
| downconvert and correlate/CPU/Int16 | 2.76 ± 0.11 μs | 2.78 ± 0.13 μs | 0.991 ± 0.061 |
| downconvert and correlate/CPU/Int32 | 2.89 ± 0.13 μs | 2.94 ± 0.15 μs | 0.983 ± 0.068 |
| track/Float32 | 3.28 ± 0.12 μs | 3.3 ± 0.21 μs | 0.993 ± 0.072 |
| time_to_load | 0.992 ± 0.011 s | 0.974 ± 0.0072 s | 1.02 ± 0.014 |
Memory benchmarks

| Benchmark | master | 27a042a... | master / 27a042a... |
|---|---|---|---|
| 8L1+8E1B/25K/CPU | 6 allocs: 5.52 kB | 6 allocs: 5.52 kB | 1 |
| E1B 8sat/100K/CPU | 3 allocs: 3.07 kB | 3 allocs: 3.07 kB | 1 |
| E1B 8sat/25K/CPU | 3 allocs: 3.07 kB | 3 allocs: 3.07 kB | 1 |
| L1 1sat/5K/CPU | 2 allocs: 0.359 kB | 2 allocs: 0.359 kB | 1 |
| L1 8sat/5K/CPU | 3 allocs: 2.45 kB | 3 allocs: 2.45 kB | 1 |
| downconvert and correlate/CPU/Float32 | 2 allocs: 0.359 kB | 2 allocs: 0.359 kB | 1 |
| downconvert and correlate/CPU/Float64 | 2 allocs: 0.359 kB | 2 allocs: 0.359 kB | 1 |
| downconvert and correlate/CPU/Int16 | 2 allocs: 0.359 kB | 2 allocs: 0.359 kB | 1 |
| downconvert and correlate/CPU/Int32 | 2 allocs: 0.359 kB | 2 allocs: 0.359 kB | 1 |
| track/Float32 | 12 allocs: 1.67 kB | 12 allocs: 1.67 kB | 1 |
| time_to_load | 0.145 k allocs: 11 kB | 0.145 k allocs: 11 kB | 1 |

zsoerenm and others added 8 commits March 5, 2026 14:08
Compares the old CUDA extension (texture memory), the new KA implementation
(Int32 and Int64 modes), and CPU across GPSL1, GalileoE1B, and multi-system
configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detects available backends (CUDA, AMDGPU) at runtime and benchmarks
KernelAbstractions.jl (Int32/Int64) alongside CPU. Also benchmarks the
old CUDA texture memory extension when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split benchmarks.jl into:
- bench_cpu.jl: CPU downconvert-and-correlate + track suite
- bench_gpu_vs_cpu.jl: GPU (CUDA-ext, KA+CUDA, KA+AMDGPU) vs CPU suite

benchmarks.jl now includes both and merges their suites. The GPU
benchmark gracefully skips KA when not available (e.g., on master).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The BenchmarkTools UUID was wrong (copy-paste error), causing
Pkg.instantiate() to fail on CI. Also fix BenchmarkGroup composition
in benchmarks.jl — merge! is not supported, use iteration instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Int64 with accumulate+wrap is both faster and more accurate than Int32:
- E1B 8sat/25K: 60μs (Int64) vs 85μs (Int32)
- E1B 8sat/100K: 132μs (Int64) vs 236μs (Int32)

Removes ~400 lines: Int32 kernels, param packing, phase_type kwarg,
and associated tests/benchmarks. The struct drops the P type parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge ka_dc_kernel! and ka_dc_multi_ant_kernel! into a single kernel
that takes num_ants as a parameter. For single-antenna (num_ants=1),
signal[i, 1] works for both vectors and matrices in Julia.

Removes ~170 lines of duplication with zero performance regression
(benchmarked on AMD Radeon 8060S).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verify that KADownconvertAndCorrelator (CPU backend) converges to
correct code phase and carrier phase over 2000 tracking iterations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GPUDownconvertAndCorrelator requires homogeneous NTuple type, so
multi-system (GPSL1+GalileoE1B) benchmarks can't use the old extension.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 90.32258% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.45%. Comparing base (a6364ec) to head (27a042a).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/downconvert_and_correlate_ka.jl | 90.18% | 21 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           master      #99       +/-   ##
===========================================
+ Coverage   80.42%   90.45%   +10.03%     
===========================================
  Files          23       22        -1     
  Lines         664      807      +143     
===========================================
+ Hits          534      730      +196     
+ Misses        130       77       -53     

zsoerenm force-pushed the ss/ka-downconvert-and-correlate branch 3 times, most recently from cc86654 to 6ab143f on March 5, 2026 19:47
…enchmarks

Replace the texture-memory CUDA extension (TrackingCUDAExt) with
KernelAbstractions.jl-based GPU support throughout. Add CUDA-conditional
tests for downconvert_and_correlate and tracking. Update Buildkite to
run KA+CUDA tests and GPU benchmarks with annotated results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zsoerenm force-pushed the ss/ka-downconvert-and-correlate branch from 6ab143f to 27a042a on March 5, 2026 19:52