This document provides comprehensive performance analysis techniques and optimization strategies for CUDA kernel development. Understanding performance characteristics is crucial for developing efficient GPU applications.
GFLOPS = (Number of Operations) / (Execution Time in seconds) / 1e9
Example Calculation for Matrix Multiplication (N×N):
Operations = 2 × N³ (N³ multiplications + N³ additions)
If N=2048 and time=100ms:
GFLOPS = (2 × 2048³) / (0.1) / 1e9 = 171.8 GFLOPS
Bandwidth (GB/s) = (Bytes Transferred) / (Execution Time in seconds) / 1e9
Example for Vector Addition:
Bytes = 3 × N × sizeof(float) (read A, read B, write C)
If N=32M and time=10ms:
Bandwidth = (3 × 32M × 4) / (0.01) / 1e9 = 38.4 GB/s
Occupancy = (Active Warps per SM) / (Maximum Warps per SM)
Factors Affecting Occupancy:
- Registers per thread
- Shared memory per block
- Threads per block
- Architecture limits
Memory Efficiency = (Requested Bandwidth) / (Actual Bandwidth) × 100%
Compute Utilization = (Actual FLOPS) / (Peak FLOPS) × 100%
Command Line Usage:
nsys profile --output=profile.nsys-rep --trace=cuda,nvtx ./your_programKey Metrics to Monitor:
- Kernel execution time
- Memory transfer time
- CPU-GPU synchronization
- API call overhead
Command Line Usage:
ncu --set full --target-processes all ./your_programImportant Sections:
- GPU Speed of Light: Overall performance summary
- Memory Workload Analysis: Memory access patterns
- Compute Workload Analysis: Instruction throughput
- Scheduler Statistics: Warp execution efficiency
nvprof --print-gpu-trace --log-file profile.log ./your_program
nvprof --metrics achieved_occupancy,gld_efficiency ./your_programGood Pattern:
// Threads access consecutive memory locations
__global__ void coalesced_access(float* data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = some_computation(data[idx]);
}Bad Pattern:
// Strided access pattern
__global__ void strided_access(float* data, int stride) {
int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
data[idx] = some_computation(data[idx]);
}32-bit Access Pattern (No Conflicts):
__shared__ float shared_data[32][32];
// Each thread in warp accesses different bank
shared_data[threadIdx.y][threadIdx.x] = value;Bank Conflict Example:
__shared__ float shared_data[32][32];
// All threads access bank 0
shared_data[threadIdx.x][0] = value; // 32-way bank conflictL1/Texture Cache: ~100 GB/s, ~10 cycles latency
L2 Cache: ~600 GB/s, ~200 cycles latency
Global Memory: ~900 GB/s, ~400 cycles latency
Shared Memory: ~19 TB/s, ~20 cycles latency
Registers: ~19 TB/s, ~1 cycle latency
Problematic Divergence:
if (threadIdx.x < 16) {
// Only half of warp executes this branch
expensive_computation_a();
} else {
// Other half executes this branch
expensive_computation_b();
}Optimized Version:
// Separate warps handle different branches
if (threadIdx.x / 32 == 0) {
expensive_computation_a();
} else {
expensive_computation_b();
}High Throughput Instructions:
- Fused multiply-add (FMA):
a * b + c - Simple arithmetic:
+,-,* - Bitwise operations:
&,|,^,<<,>>
Lower Throughput Instructions:
- Division:
/(use reciprocal multiplication if possible) - Transcendental functions:
sin,cos,exp,log - Integer modulo:
%
Optimization Progression:
- Naive: O(N³) with poor memory locality
- Tiled: Use shared memory for data reuse
- Blocked: Each thread computes multiple elements
- Tensor Cores: Use mixed-precision on modern hardware
Performance Expectations:
Naive: ~500 GFLOPS
Tiled: ~2000 GFLOPS
Optimized: ~5000 GFLOPS
cuBLAS: ~15000 GFLOPS (on A100)
Optimization Techniques:
- Tree Reduction: O(log n) steps
- Warp Primitives: Use
__shfl_down_sync() - Cooperative Groups: Modern synchronization
- Multiple Elements per Thread: Reduce kernel launch overhead
Key Strategies:
- Im2col: Convert to matrix multiplication
- Winograd: Reduce arithmetic complexity
- FFT: For large kernels
- Separable Filters: 2D → two 1D convolutions
Hardware | Peak Bandwidth | Expected Performance
GTX 1080 Ti | 484 GB/s | ~400 GB/s (80-85%)
RTX 3080 | 760 GB/s | ~650 GB/s (85-90%)
RTX 4090 | 1008 GB/s | ~900 GB/s (85-90%)
A100 | 1935 GB/s | ~1700 GB/s (85-90%)
Hardware | Peak FLOPS | Expected Performance
GTX 1080 Ti | 11.3 TFLOPS | ~8-10 TFLOPS
RTX 3080 | 29.8 TFLOPS | ~20-25 TFLOPS
RTX 4090 | 83 TFLOPS | ~60-70 TFLOPS
A100 | 156 TFLOPS | ~120-140 TFLOPS (Tensor)
# Example benchmark script
for size in 1024 2048 4096; do
echo "Testing matrix size: ${size}x${size}"
./matrix_multiplication --size=$size --iterations=10
done// Performance regression detection
const double EXPECTED_GFLOPS = 2000.0;
const double TOLERANCE = 0.1; // 10% tolerance
if (actual_gflops < EXPECTED_GFLOPS * (1.0 - TOLERANCE)) {
std::cerr << "Performance regression detected!" << std::endl;
std::cerr << "Expected: " << EXPECTED_GFLOPS << " GFLOPS" << std::endl;
std::cerr << "Actual: " << actual_gflops << " GFLOPS" << std::endl;
}- Training: 50-80% of peak theoretical performance
- Inference: 80-95% of peak theoretical performance
- Memory efficiency: >90% for large models
- Dense Linear Algebra: 80-95% of peak FLOPS
- Sparse Operations: 20-60% of peak (data-dependent)
- Memory-bound kernels: 80-90% of peak bandwidth
- Computer Vision: <10ms latency for HD frames
- Graphics: 60+ FPS for real-time rendering
- Financial: <1ms for high-frequency trading
- Profile with Nsight Systems/Compute
- Identify performance bottlenecks
- Establish baseline measurements
- Set realistic performance targets
- Ensure coalesced memory access
- Minimize shared memory bank conflicts
- Use appropriate memory types (texture, constant)
- Optimize memory transfer patterns
- Minimize warp divergence
- Maximize occupancy
- Use high-throughput instructions
- Consider mixed-precision arithmetic
- Choose appropriate parallel algorithm
- Implement multi-level optimizations
- Consider library alternatives (cuBLAS, cuDNN)
- Validate correctness after optimization
- Re-profile and measure improvements
- Document optimization techniques used
- Set up performance regression tests
- Consider maintainability vs performance trade-offs